An Introduction to WebVTT and <track>

By Ian Devlin

Introduction

Web Video Text Tracks, more commonly known as WebVTT, is a file format that allows us to mark up external text tracks. Using it in conjunction with HTML5's <track> element means we can associate information such as subtitles, captions and descriptions for a media resource such as audio or video, and display them synchronised with the media resource.

Being able to add textual information in this way allows us to make our media content more accessible to those who are unable to follow a video's dialogue, whether due to an auditory impairment or simply because the dialogue is in a language they don't understand.

This article introduces the WebVTT file format, the various options available, and how it can be used with the <track> element to add subtitles to a video.

File Format

A WebVTT file is a simple text file, encoded as UTF-8, with a .vtt file extension. It follows a specific format as defined by the specification. It may sound stressful having to learn a new file format, but don't worry — the VTT file format has been kept deliberately very simple.

Note: to use WebVTT files on your server, you may have to make the content type explicitly known, for example with an .htaccess file on an Apache server you could do:

<Files mysubtitle.vtt>
 ForceType text/vtt;charset=utf-8
</Files>

A WebVTT file begins with the following, in this order:

An optional BOM character
The string: WEBVTT
Optionally, a space or tab character followed by any number of characters that are not a line feed or carriage return
Two or more "WEBVTT line terminators" (a carriage return, a line feed, or both a carriage return and a line feed)

An example of this is as follows:

WEBVTT

Cue-1
00:00:15.000 --> 00:00:18.000
At the left we can see...
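
The header rules listed above can be checked with a short script. The following is only a minimal sketch: the function name hasWebVTTSignature is hypothetical, and it looks at the first line of the file only (optional BOM, the string WEBVTT, optional trailing text), not at the blank line or the cues that follow.

```javascript
// Hypothetical helper (not part of any library): checks only the file
// signature described above, not the cues that come after it.
function hasWebVTTSignature(text) {
  // Optional BOM, the string WEBVTT, then optionally a space or tab plus
  // any characters other than line terminators, then a line terminator
  // (or the end of the file).
  return /^\uFEFF?WEBVTT(?:[ \t][^\r\n]*)?(?:\r\n|\r|\n|$)/.test(text);
}

console.log(hasWebVTTSignature("WEBVTT\n\nCue-1\n00:00:15.000 --> 00:00:18.000\nAt the left we can see...")); // true
console.log(hasWebVTTSignature("WEBVTT - some header text\n\n")); // true
console.log(hasWebVTTSignature("WEBVTTX\n\n")); // false
```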

What comes next actually defines the textual content and is the important bit.

WebVTT Cues

The content of a WebVTT file consists of zero or more "WebVTT cues", each of which is separated by two or more WebVTT line terminators.

A WebVTT cue allows you to specify some text for a particular part of a media file (e.g. a subtitle) and the timestamp range of the media file that the text in question applies to. You can also assign a unique identifier to a WebVTT cue, which is a simple string that cannot contain the substring: "-->", nor any of the WebVTT line terminators. Each cue takes the following form:

[idstring]
[hh:]mm:ss.msmsms --> [hh:]mm:ss.msmsms
Text string

Since idstrings are optional, you may not want to include them in your code, to cut down on verbosity. However, they can also be useful for file organization, and manipulating WebVTT with script.

The timestamp follows a standard format, where the hour part [hh:] is optional, and where the milliseconds are separated from the seconds by a dot (.) rather than a colon (:). The end time of the range must be greater than the start time. Timestamps for different cues can overlap if you want, but cue data cannot contain two consecutive line terminators or the string "-->".
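
To make the timestamp format concrete, here is a rough sketch of how a script might convert a timestamp into milliseconds. The function name parseTimestampMs is made up for illustration, and the regular expression is an assumption based on the format described above (optional hours, two-digit minutes and seconds, three-digit milliseconds).

```javascript
// Hypothetical helper: converts "[hh:]mm:ss.ttt" into whole milliseconds,
// or returns null if the string does not match the format.
function parseTimestampMs(stamp) {
  const match = /^(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$/.exec(stamp.trim());
  if (!match) {
    return null;
  }
  const hours = match[1] ? Number(match[1]) : 0; // the hour part is optional
  const minutes = Number(match[2]);
  const seconds = Number(match[3]);
  const millis = Number(match[4]);
  return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis;
}

console.log(parseTimestampMs("00:00:55.167")); // 55167
console.log(parseTimestampMs("00:55.167"));    // 55167
console.log(parseTimestampMs("55.167"));       // null (minutes are required)
```

Comparing the two values of a range with this helper is one way to check that the end time is greater than the start time.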

The actual text string associated with the timestamp can be a single line of text, or multiple lines. Any text following the specified timestamp will be associated with that timestamp until either a new cue is found or the file ends.

Here are some WebVTT cue examples:

Cue-8
00:00:52.000 --> 00:00:54.000
I don't think so. You?

Cue-9
00:00:55.167 --> 00:00:57.042
I'm Ok.
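
As an illustration of how cues like these could be read by script, the sketch below splits a file into cue blocks and pulls out the identifier, timestamps and text. It is a simplified reading of the rules described above, not a complete parser, and the name parseCues and the shape of the returned objects are assumptions made for this example.

```javascript
// Hypothetical, simplified cue reader. It does not validate timestamps
// and ignores anything that does not look like a cue.
function parseCues(fileText) {
  // Treat CR, LF and CRLF alike, then split on two or more terminators.
  const blocks = fileText
    .replace(/\r\n|\r/g, "\n")
    .split(/\n{2,}/)
    .map((block) => block.trim())
    .filter(Boolean);

  const cues = [];
  for (const block of blocks) {
    if (block.startsWith("WEBVTT")) continue; // file header, not a cue
    const lines = block.split("\n");
    // The identifier line is optional, so the timing line is first or second.
    const timingIndex = lines[0].includes("-->") ? 0 : 1;
    const timing = lines[timingIndex];
    if (!timing || !timing.includes("-->")) continue;
    const [startPart, rest] = timing.split("-->");
    const [end, ...settings] = rest.trim().split(/\s+/);
    cues.push({
      id: timingIndex === 1 ? lines[0] : null,
      start: startPart.trim(),
      end: end,
      settings: settings.join(" "),
      text: lines.slice(timingIndex + 1).join("\n"),
    });
  }
  return cues;
}

const sample = [
  "WEBVTT", "",
  "Cue-8", "00:00:52.000 --> 00:00:54.000", "I don't think so. You?", "",
  "Cue-9", "00:00:55.167 --> 00:00:57.042", "I'm Ok.",
].join("\n");

const parsed = parseCues(sample);
console.log(parsed.length);  // 2
console.log(parsed[1].text); // I'm Ok.
```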

It is also possible to specify some settings on a cue by cue basis using WebVTT cue settings — we will look at these next.

WebVTT Cue Settings

There are a number of settings that can be set per cue, and these are specified after the timestamp range value:

[idstring]
[hh:]mm:ss.msmsms --> [hh:]mm:ss.msmsms [cue settings]
Text string

These cue settings allow you to specify the position and alignment of the cue text, and the following options are available:

Setting	Value(s)	Function
vertical	rl || lr	Aligns text vertically to the left (lr) or right (rl) (e.g. for Japanese subtitles)
line	[-][0 or more]	References a particular line number that the cue is to be displayed on. Line numbers are based on the size of the first line of the cue. A negative number counts from the bottom of the frame, positive numbers from the top
	[0-100]%	Percentage value indicating the position relative to the top of the frame
position	[0-100]%	Percentage value indicating the position relative to the edge of the frame where the text begins (e.g. the left edge in English)
size	[0-100]%	Percentage value indicating the size of the cue box. The value is given as a percentage of the width of the frame
align	start || middle || end	Specifies the alignment of the text within the cue. The keywords are relative to the text direction

Note: if no cue settings are set, the positioning defaults to the middle, at the bottom of the frame.

Let's look at a quick example of how some of these might be used:

Cue-8
00:00:52.000 --> 00:00:54.000 align:start size:15%
I don't think so. You?

Cue-9
00:00:55.167 --> 00:00:57.042 align:end line:10%
I'm Ok.

In this short example, the cue for "Cue-8" is aligned to the start of the line, with the cue box size set to 15%, whilst the cue for "Cue-9" is aligned to the end of the line, and positioned vertically 10% from the top of the frame.
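
If you are handling WebVTT with script, the settings string after the timestamp range can be broken into name/value pairs. The following is a sketch only: parseCueSettings is a hypothetical name, and the function simply keeps the five setting names from the table above without validating their values.

```javascript
// Hypothetical helper: turns "align:start size:15%" into an object.
// Values are kept as strings and are not checked for validity.
const KNOWN_SETTINGS = new Set(["vertical", "line", "position", "size", "align"]);

function parseCueSettings(settings) {
  const result = {};
  for (const pair of settings.trim().split(/\s+/)) {
    const colon = pair.indexOf(":");
    if (colon <= 0 || colon === pair.length - 1) continue; // empty or malformed
    const name = pair.slice(0, colon);
    if (KNOWN_SETTINGS.has(name)) {
      result[name] = pair.slice(colon + 1);
    }
  }
  return result;
}

console.log(parseCueSettings("align:start size:15%")); // { align: 'start', size: '15%' }
console.log(parseCueSettings("align:end line:10%"));   // { align: 'end', line: '10%' }
```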

WebVTT Cue Components

In addition to all this, you can use "WebVTT cue components" to add further information to the actual cue text itself. These components are similar to HTML elements, and can be used to add semantics and styling to the actual text strings. A list of the different components available is given below:

Value	Meaning
c	Specifies a (CSS) class, which follows the `c`, e.g. `<c.className>Cue text</c>`
i	Specifies italic text
b	Specifies bold text
u	Specifies underlined text
ruby	Specifies something similar to HTML5's ruby element. Within this component, one or more occurrences of an `rt` element are allowed. (See "The HTML5 `<ruby>` element in words of one syllable or less".)
v	Specifies a voice label (if provided) that the cue text is being "spoken in", e.g. `<v Ian>This is useful for adding subtitles</v>`. Note that the voice label won't be displayed. It's just there as a styling hook.

An example of some of the components in action can be seen below:

Cue-8
00:00:52.000 --> 00:00:54.000 align:start size:15%
<v Emo>I don't think so. <c.question>You?</c></v>

Cue-9
00:00:55.167 --> 00:00:57.042 align:end line:10%
<v Proog>I'm Ok.</v>

This example specifies two different voices for the cue text, Emo and Proog respectively. In addition, a CSS class of question is specified in the first cue text, which can then be used for styling purposes. A class such as this can be styled in the usual way via CSS attached or defined in the calling HTML page.

Note that to style cue text in CSS, you need to use a special pseudo-element selector, for example:

video::cue(v[voice="Emo"]) { color:lime }

It is also possible to add timestamps to cue text, indicating that different parts occur at different times. An example of this is shown below:

Cue-8
00:00:52.000 --> 00:00:54.000
<c>I don't think so.</c> <00:00:53.500><c>You?</c>

This will cause all the text to be displayed at the same time, but note that in supporting browsers you can use the :past and :future pseudo-classes to style text differently depending on whether it is in the past or the future. For example:

video::cue(c:past) { color:yellow }

So as you can see, the WebVTT file provides you with many options, allowing you a lot of control over how any text (especially video subtitles) might appear. But how can you actually make your text tracks appear alongside your media, and what else can you do with them? We'll look at this next.

Note: There is a Live WebVTT validator available, for when you want to check whether your WebVTT files are written correctly.

Using the `<track>` element

HTML5's <track> element allows you to link external track files with a particular resource. The <track> element takes a number of attributes, which are listed below:

Name	Value(s)	Description
kind	subtitles	Indicates that the resource specified by `src` is to be used as subtitles.
	captions	Indicates that the resource specified by `src` is to be used as captions. Captions contain more than just dialogue: they can also contain musical cues, sound effects and other audio information.
	descriptions	Indicates that the resource specified by `src` is to be used as descriptions. These contain textual descriptions of the video content, intended to be read out as audio when the visual component is unavailable.
	chapters	Indicates that the resource specified by `src` is to be used as chapter navigation.
	metadata	Indicates that the resource specified by `src` is to be used as metadata.
src	URL	Specifies the resource to use
srclang	Language code	Specifies the language of the resource contained in the `src` attribute
label	Free text	Specifies a unique label for this element
default	n/a	If present, indicates that this element is enabled by default if the user's settings do not specify anything else

The <track> element is specified as a child of an <audio> or <video> element, and there can of course be more than one <track> element defined: each one may provide subtitles for different languages and/or different kinds of text tracks. An example of a video that has subtitles defined for it in both English and German, along with English chapters, is given below:

<video controls>
  <source src="elephants-dream.mp4" type="video/mp4">
  <source src="elephants-dream.webm" type="video/webm">
  <track label="English subtitles" kind="subtitles" srclang="en"
         src="elephants-dream-subtitles-en.vtt" default>
  <track label="Deutsche Untertitel" kind="subtitles" srclang="de"
         src="elephants-dream-subtitles-de.vtt">
  <track label="English chapters" kind="chapters" srclang="en"
         src="elephants-dream-chapters-en.vtt">
</video>
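
In browsers that expose text tracks to script, the tracks declared with <track> are available through the media element's textTracks list, and each track has a kind, a language and a mode ("disabled", "hidden" or "showing"). The sketch below shows one possible way to switch subtitle language from script; the function name showSubtitles is hypothetical, and this assumes a browser that implements the text track API in this form.

```javascript
// Hypothetical helper: shows the first subtitles track in the requested
// language and disables the other subtitles tracks. `media` is assumed to
// be an <audio> or <video> element (or any object with a compatible
// `textTracks` list).
function showSubtitles(media, language) {
  let shown = null;
  for (let i = 0; i < media.textTracks.length; i++) {
    const track = media.textTracks[i];
    if (track.kind !== "subtitles") continue; // leave chapters etc. alone
    if (shown === null && track.language === language) {
      track.mode = "showing";
      shown = track;
    } else {
      track.mode = "disabled";
    }
  }
  return shown; // null if no matching track was found
}

// In a page containing the markup above, this might be called as:
// showSubtitles(document.querySelector("video"), "de");
```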

Browser Support

Unfortunately at the moment browser support for WebVTT and the <track> element is poor: it is currently only supported by Internet Explorer 10 and Chrome 16+.

You can enable parsing of the <track> element in Chrome (via chrome://flags and "enable <track> element"), which allows your WebVTT subtitles to be rendered. However, you cannot choose between languages when multiple <track> elements (with kind="subtitles") exist: the one that has the default attribute is chosen, although this attribute is not mandatory.

Internet Explorer 10 supports WebVTT and the <track> element, but it is only in beta mode and available on Windows 8 only.

For now the only way to get cross browser support is to use a JavaScript polyfill.
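
Before falling back to a polyfill, a page can test whether the browser appears to support <track> natively. The check below is only one possible heuristic, and the function name supportsTrackElement is hypothetical: it looks for the addTextTrack method on a video element and the track property on a track element, both of which are part of the HTML text track API.

```javascript
// Hypothetical feature test. `doc` is normally the global `document`;
// it is passed in so the function can also be exercised outside a browser.
function supportsTrackElement(doc) {
  const video = doc.createElement("video");
  const track = doc.createElement("track");
  return typeof video.addTextTrack === "function" && "track" in track;
}

// In a page, this might be used as:
// if (!supportsTrackElement(document)) { /* load a polyfill here */ }
```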

Polyfills

There are a number of <track> polyfills available at the moment, but many of them don't support WebVTT — they support the older WebSRT format, the precursor to WebVTT. Listed below are some polyfills that do support WebVTT:

Playr by Julien Villetorte — supports subtitles, captions, and chapters
Captionator by Christopher Giffard — supports subtitles
MediaElementJS by John Dyer — supports subtitles

All of these support HTML5 <video> but not HTML5 <audio>, although I think that they could be easily adapted to do so.

Personally I prefer to use Playr as it supports more than just subtitles, and it's also one of the easier polyfills to use: let's look at an example of how to implement it.

WebVTT/`<track>` Polyfill Example

Playr is written by Julien 'delphiki' Villetorte and is incredibly simple to use, once you have your WebVTT file(s) and video of course.

Using Playr

There are only a few steps required to get Playr up and running:

Download Playr from GitHub

Include the JavaScript and CSS files in your webpage, like so:

<link rel="stylesheet" href="playr/playr.css" />
<script src="playr/playr.js"></script>

Add the class playr_video to your <video> element

And that's it! Playr will take over playing your video and parse any <track> elements that it contains. As mentioned earlier, Playr supports subtitles, chapters and captions (which get treated in the same way as subtitles). My example code adds English and German subtitles to a video, as well as navigational English chapters.

The <video> element in my example looks like this:

<video preload="metadata" controls class="playr_video">
  <source src="elephants-dream.mp4" type="video/mp4">
  <source src="elephants-dream.webm" type="video/webm">
  <track label="English subtitles" kind="subtitles" srclang="en"
         src="elephants-dream-subtitles-en.vtt" default>
  <track label="Deutsch subtitles" kind="subtitles" srclang="de"
         src="elephants-dream-subtitles-de.vtt">
  <track label="English chapters" kind="chapters" srclang="en"
         src="elephants-dream-chapters-en.vtt">
</video>

Displaying a Transcript

In addition to supplying subtitles and chapters, I have also included a small JavaScript file, transcript.js, which defines a function loadTranscriptFile. This takes a WebVTT subtitles (or captions) file as an argument, parses it (using code taken from Playr), and displays the text on screen, in an element with an id of transcript.

If the WebVTT subtitle text contains the "voice" tag, the name entered is also displayed alongside the text.
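
To give a feel for the voice handling, here is a rough sketch of how a single cue's text could be turned into a transcript line. This is not the code from transcript.js or Playr: the function name cueTextToTranscriptLine is made up for this example, and the approach (a regular expression for the first voice label, then stripping all cue component tags) is an assumption that ignores details such as escaped characters.

```javascript
// Hypothetical helper: "<v Emo>Hello</v>" becomes "Emo: Hello".
function cueTextToTranscriptLine(cueText) {
  // Pick up the first voice label, allowing for classes such as <v.loud Emo>.
  const voice = /<v(?:\.[^\s>]*)?\s+([^>]+)>/.exec(cueText);
  // Remove every cue component tag and inline timestamp, keeping the text.
  const plain = cueText.replace(/<[^>]*>/g, "").trim();
  return voice ? voice[1].trim() + ": " + plain : plain;
}

console.log(cueTextToTranscriptLine("<v Emo>I don't think so. <c.question>You?</c></v>"));
// Emo: I don't think so. You?
console.log(cueTextToTranscriptLine("<v Proog>I'm Ok.</v>"));
// Proog: I'm Ok.
```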

Summary

The introduction of WebVTT and the HTML5 <track> element makes it much easier for web authors to make their video and audio more accessible to those who, for whatever reason, are unable to access the content in the way it is usually presented.

Whilst browser support is still nascent, a number of JavaScript polyfills allow us to make our media more accessible now, in a way that will be understood by browsers once support for WebVTT increases.

Accessibility is something that we, as web authors, should be thinking about when serving media content to our users. The more users who can access our content the better, right?

Ian Devlin

Web developer for pixolith in Düsseldorf, Germany, and author of HTML5 Multimedia.

This article is licensed under a Creative Commons Attribution 3.0 Unported license.

Comments

Yaroslav

Wednesday, June 20, 2012

I'm very disappointed! I do not see any reason to reinvent yet another subtitle(or other video overlapping thing) format when there are exists a number of other formats. For instance, there is SSA/ASS subtitle format. It is already implemented in plenty of video renderers and devices. It has enough features to be used in 99.999% case. I know it probably has some issues but it would be better to fix them than reinvent yet another nobody needed "standard".

Chris Mills

Thursday, June 21, 2012

@Karl Dubost - thanks karlcow. I've added it to the article.

Chris Mills

Thursday, June 21, 2012

@Yaroslav there are many reasons why a new format has been created. In my mind, the two main ones are:

1. The other options are more complicated/verbose, and it was felt a simple format was needed, especially considering text track content may well be written by non-developers. For example, see the Matroska page about SSA/ASS - http://matroska.org/technical/specs/subtitles/ssa.html

2. Other formats are specific for subtitles or captions, etc. It was felt that a format was needed that was general enough to contain many different types of text track - subtitles, captions, descriptions, chapters, etc.

I disagree with needlessly reinventing the wheel just as much as you do, but I believe that in this case there were good reasons for it.

Ian Devlin

Thursday, June 21, 2012

Thanks for that Karl, a very useful addition.

Arif hossain

Friday, June 22, 2012

How can i do this work?

Ian Devlin

Monday, June 25, 2012

What do you mean Arif?
