Getting Started With the Track Element

HTML5 Rocks

The track element provides a simple, standardized way to add subtitles, captions, screen reader descriptions and chapters to video and audio.

Tracks can also be used for other kinds of timed metadata. The source data for each track element is a text file made up of a list of timed cues, and cues can include data in formats such as JSON or CSV. This is extremely powerful, enabling deep linking and media navigation via text search, for example, or DOM manipulation and other behaviour synchronised with media playback.

The track element is currently available in Internet Explorer 10 and Google Chrome. Firefox support is not yet implemented.

Below is a simple example of a video with a track element. Play it to see subtitles in English:

Code for a video element with subtitles in English and German might look like this:

<video src="foo.ogv">
  <track kind="subtitles" label="English subtitles" src="subtitles_en.vtt" srclang="en" default>
  <track kind="subtitles" label="Deutsche Untertitel" src="subtitles_de.vtt" srclang="de">
</video>

In this example, the video element would display a selector giving the user a choice of subtitle languages. (At the time of writing, though, this hasn't yet been implemented.)

Note that the track element cannot be used from a file:// URL. To see tracks in action, put files on a web server.

Each track element has an attribute for kind, which can be subtitles, captions, descriptions, chapters or metadata. The track element src attribute points to a text file that holds data for timed track cues, which can potentially be in any format a browser can parse. Chrome supports WebVTT, which looks like this:


WEBVTT

railroad
00:00:10.000 --> 00:00:12.500
Left uninspired by the crust of railroad earth

manuscript
00:00:13.200 --> 00:00:16.900
that touched the lead to the pages of your manuscript.

Each item in a track file is called a cue. Each cue has a start time and end time separated by an arrow, with cue text in the line below. Cues can optionally also have IDs: 'railroad' and 'manuscript' in the examples above. Cues are separated by an empty line.

Cue times are in hours:minutes:seconds.milliseconds format, and parsing is strict. Numbers must be zero padded if necessary: hours, minutes and seconds must have two digits (00 for a zero value) and milliseconds must have three digits (000 for zero). An excellent WebVTT validator is available that checks for errors in time formatting and for problems such as non-sequential times.
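The strict format described above can be checked with a few lines of JavaScript. This is only a minimal sketch: the function name parseVttTimestamp is hypothetical, and it assumes the form with two-digit hours described above (WebVTT also permits timestamps that omit the hours field, which this sketch rejects).

```javascript
// Convert a WebVTT timestamp such as "00:01:15.200" into seconds.
// Returns NaN if the string does not match the strict, zero-padded format.
// parseVttTimestamp is a hypothetical helper, not part of any browser API.
function parseVttTimestamp(timestamp) {
  // Two-digit hours, minutes and seconds, a dot, then three-digit milliseconds.
  var match = /^(\d{2}):([0-5]\d):([0-5]\d)\.(\d{3})$/.exec(timestamp);
  if (!match) {
    return NaN;
  }
  var hours = parseInt(match[1], 10);
  var minutes = parseInt(match[2], 10);
  var seconds = parseInt(match[3], 10);
  var milliseconds = parseInt(match[4], 10);
  return hours * 3600 + minutes * 60 + seconds + milliseconds / 1000;
}

console.log(parseVttTimestamp("00:00:12.500")); // 12.5
console.log(parseVttTimestamp("0:0:12.5"));     // NaN (not zero padded)
```

The result is in seconds, which is the unit used by the startTime and endTime properties of cues in JavaScript.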

The following demo shows how subtitles can be searched in order to navigate within a video.

Using HTML and JSON in cues

The text of a cue in a WebVTT file can span multiple lines, as long as none of the lines are blank. That means cues can include HTML:


00:01:15.200 --> 00:02:18.800
<p>Multi-celled organisms have different types of cells that perform specialised functions.</p>
<p>Most life that can be seen with the naked eye is multi-cellular.</p>
<p>These organisms are thought to have evolved around 1 billion years ago with plants, animals and fungi having independent evolutionary paths.</p>

Why stop there? Cues can also use JSON:


00:01:15.200 --> 00:02:18.800
{
"title": "Multi-celled organisms",
"description": "Multi-celled organisms have different types of cells that perform specialised functions. Most life that can be seen with the naked eye is multi-cellular. These organisms are thought to have evolved around 1 billion years ago with plants, animals and fungi having independent evolutionary paths.",
"src": "multiCell.jpg",
"href": ""
}

00:02:18.800 --> 00:03:01.600
{
"title": "Insects",
"description": "Insects are the most diverse group of animals on the planet with estimates for the total number of current species ranging from two million to 50 million. The first insects appeared around 400 million years ago, identifiable by a hard exoskeleton, three-part body, six legs, compound eyes and antennae.",
"src": "insects.jpg",
"href": ""
}

The ability to use structured data in cues makes the track element extremely powerful and flexible. A web app can listen for cue events, extract the text of each cue as it fires, parse the data and then use the results to make DOM changes (or perform other JavaScript or CSS tasks) synchronised with media playback. One demo uses this technique to keep map marker position synchronised with video playback.
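Because a track can mix plain-text cues with JSON cues, it may be worth guarding the parsing step. The following is a minimal sketch rather than a definitive approach: parseCueData is a hypothetical helper name, and the string stands in for the text property of a real cue.

```javascript
// Parse the text of a cue as JSON, returning null instead of throwing
// when the cue holds plain text or malformed JSON.
// parseCueData is a hypothetical helper, not part of the track API.
function parseCueData(cueText) {
  try {
    return JSON.parse(cueText);
  } catch (e) {
    return null;
  }
}

// Stand-in for cue.text from a metadata cue.
var sampleCueText = '{"title": "Insects", "src": "insects.jpg", "href": ""}';
var cueData = parseCueData(sampleCueText);
if (cueData) {
  console.log(cueData.title); // Insects
}
```

Returning null keeps a single malformed cue from stopping the event handler that processes the rest of the track.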

Tracks can also add value to audio and video by making search easier, more powerful and more precise.

Cues include text that can be indexed, and a start time that signifies the temporal 'location' of content within media. Cues could even include data about the position of items within a video frame. Combined with media fragment URIs, tracks could provide a powerful mechanism for finding and navigating to content within audio and video. For example, imagine a search for 'Etta James' which returns results that link directly to points within videos where her name is mentioned in cue text.

The Tree Of Life demo is a simple example of how a metadata track can be used to enable navigation via subtitle search–and also shows how timed metadata can enable manipulation of the DOM synchronised with media playback.
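The search idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used by the demo: searchCues and cueToFragmentUrl are hypothetical names, the cues are plain objects standing in for real cues, and the link uses the temporal form of a media fragment URI (#t= followed by a time in seconds).

```javascript
// Return the cues whose text contains the query, ignoring case.
// "cues" can be any array-like list of objects with startTime and text
// properties, such as a TextTrackCueList.
function searchCues(cues, query) {
  var needle = query.toLowerCase();
  var results = [];
  for (var i = 0; i < cues.length; i++) {
    if (cues[i].text.toLowerCase().indexOf(needle) !== -1) {
      results.push(cues[i]);
    }
  }
  return results;
}

// Build a link to the moment a cue starts, using the temporal
// media fragment syntax.
function cueToFragmentUrl(mediaUrl, cue) {
  return mediaUrl + "#t=" + cue.startTime;
}

// Plain objects standing in for real cues.
var sampleCues = [
  { startTime: 10, text: "Left uninspired by the crust of railroad earth" },
  { startTime: 13.2, text: "that touched the lead to the pages of your manuscript." }
];
var hits = searchCues(sampleCues, "Manuscript");
console.log(cueToFragmentUrl("foo.ogv", hits[0])); // foo.ogv#t=13.2
```

In a page, the same startTime could instead be assigned to the currentTime property of the media element to jump to the matching point.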

Getting at tracks and cues with JavaScript

Audio and video elements have a textTracks property that returns a TextTrackList, each member of which is a TextTrack corresponding to a <track> element:

var videoElement = document.querySelector("video");
var textTracks = videoElement.textTracks; // one for each track element
var textTrack = textTracks[0]; // corresponds to the first track element
var kind = textTrack.kind; // e.g. "subtitles"
var mode = textTrack.mode; // e.g. "disabled", "hidden" or "showing"

Each TextTrack, in turn, has a cues property that returns a TextTrackCueList, each member of which is an individual cue. Cue data can be accessed with properties such as startTime, endTime and text (used to get the text content of the cue):

var cues = textTrack.cues;
var cue = cues[0]; // corresponds to the first cue in a track src file
var cueId = cue.id; // corresponds to the cue id set in the WebVTT file
var cueText = cue.text; // "The Web is always changing", for example (or some JSON!)

Sometimes it makes sense to get at TextTrack objects via the HTMLTrackElement:

var trackElements = document.querySelectorAll("track");
// for each track element
for (var i = 0; i < trackElements.length; i++) {
  trackElements[i].addEventListener("load", function() {
    var textTrack = this.track; // gotcha: "this" is an HTMLTrackElement, not a TextTrack object
    var isSubtitles = textTrack.kind === "subtitles"; // for example...
    // for each cue
    for (var j = 0; j < textTrack.cues.length; ++j) {
      var cue = textTrack.cues[j];
      // do something
    }
  });
}

As in this example, TextTrack properties are accessed via a track element's track property, not the element itself.

TextTracks are accessible once the load event has fired–and not before.

Track and cue events

The two types of cue event are:

  • enter and exit events fired for cues
  • cuechange events fired for tracks.

In the previous example, cue event listeners could have been added like this:

cue.onenter = function(){
  // do something
};

cue.onexit = function(){
  // do something else
};

Be aware that the enter and exit events are only fired when cues are entered or exited via playback. If the user drags the timeline slider manually, a cuechange event will be fired for the track at the new time, but enter and exit events will not be fired. It's possible to get around this by listening for the cuechange track event, then getting the active cues. (Note that there may be more than one active cue.)

The following example gets the current cue, when the cue changes, and attempts to create an object by parsing the cue text:

textTrack.oncuechange = function (){
  // "this" is a textTrack
  var cue = this.activeCues[0]; // assuming there is only one active cue
  if (!cue) {
    return; // no cue is active, for example just after a cue has ended
  }
  var obj = JSON.parse(cue.text);
  // do something
};
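
Since more than one cue may be active at once, a handler may want to look at the whole activeCues list rather than only its first member. The sketch below assumes a hypothetical helper name, summariseActiveCues, and works on any object with an array-like activeCues property, so plain objects are used here in place of a real TextTrack.

```javascript
// Copy the active cues of a track into a plain array of summaries.
// activeCues is array-like rather than a true Array, so an index loop is used.
// summariseActiveCues is a hypothetical helper, not part of the track API.
function summariseActiveCues(track) {
  var summaries = [];
  var active = track.activeCues || [];
  for (var i = 0; i < active.length; i++) {
    summaries.push({
      id: active[i].id,
      startTime: active[i].startTime,
      endTime: active[i].endTime,
      text: active[i].text
    });
  }
  return summaries;
}

// Plain object standing in for a TextTrack with two overlapping cues.
var sampleTrack = {
  activeCues: [
    { id: "railroad", startTime: 10, endTime: 12.5, text: "first cue" },
    { id: "manuscript", startTime: 11, endTime: 16.9, text: "second cue" }
  ]
};
console.log(summariseActiveCues(sampleTrack).length); // 2
```

Inside a cuechange handler, the same function could be called with "this" as its argument, because the handler runs with the track as its context.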

Not just for video

Don't forget that tracks can be used with audio as well as video–and that you don't need audio, video or track elements in HTML markup to take advantage of their APIs. The TextTrack API documentation has a nice example of this, showing a neat way to implement audio 'sprites':

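var sfx = new Audio('sfx.wav');
var track = sfx.addTextTrack('metadata'); // previously implemented as addTrack()

// Add cues for sounds we care about.
var barkCue = new TextTrackCue(12.783, 13.612, 'dog bark'); // startTime, endTime, text
barkCue.id = 'dog bark'; // getCueById() matches on the cue id, not the cue text
track.addCue(barkCue);
var mewCue = new TextTrackCue(13.612, 15.091, 'kitten mew');
mewCue.id = 'kitten mew';
track.addCue(mewCue);

function playSound(id) {
  // getCueById() is a method of the track's cue list
  sfx.currentTime = track.cues.getCueById(id).startTime;
  sfx.play();
}

playSound('dog bark');
playSound('kitten mew');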

You can also see a more elaborate example of this technique in action.

The addTextTrack method takes three parameters: kind (for example, 'metadata', as above), label (for example, 'Sous-titres français') and language (for example, 'fr').

The example above also uses addCue, which takes a TextTrackCue object. As used here, the constructor takes a startTime, an endTime and the cue text. Other cue properties can be set on the object afterwards, such as an id (e.g. 'dog bark'), WebVTT cue settings (for positioning, size and alignment) and a boolean pauseOnExit flag (for example, to pause playback after asking a question in an educational video).

Note that startTime and endTime use floating point values in seconds, and not the hours:minutes:seconds.milliseconds format used by WebVTT.
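When cue times need to be written back out in WebVTT form, for instance to display them or to generate a track file, the conversion from seconds is straightforward. This is a minimal sketch that assumes non-negative times; pad and formatVttTimestamp are hypothetical helper names.

```javascript
// Left-pad a number with zeros to the given width.
function pad(value, width) {
  var text = String(value);
  while (text.length < width) {
    text = "0" + text;
  }
  return text;
}

// Convert a time in seconds (as used by startTime and endTime) into a
// WebVTT timestamp. Assumes a non-negative number of seconds.
function formatVttTimestamp(seconds) {
  // Work in whole milliseconds to avoid floating point drift.
  var totalMs = Math.round(seconds * 1000);
  var ms = totalMs % 1000;
  var totalSeconds = Math.floor(totalMs / 1000);
  var s = totalSeconds % 60;
  var m = Math.floor(totalSeconds / 60) % 60;
  var h = Math.floor(totalSeconds / 3600);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "." + pad(ms, 3);
}

console.log(formatVttTimestamp(12.783)); // 00:00:12.783
```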

Cues can also be removed with removeCue(), which takes a cue as its argument, for example:

var videoElement = document.querySelector("video");
var track = videoElement.textTracks[0];
var activeCue = track.activeCues[0];
track.removeCue(activeCue);

If you try this out, you'll notice that a rendered cue is removed as soon as the code is called.

Tracks have a mode attribute, which can be "disabled", "hidden" or "showing" (note that these string values were originally implemented as enums). This can be useful if you want to use track events but turn off default rendering. Play the following video to see an example of this (built by Eric Bidelman):


This example uses the getCueAsHTML() method, which returns an HTML version of each cue, converting from WebVTT format to an HTML DocumentFragment using the WebVTT cue text parsing and DOM construction rules. Use the text property of a cue if you just want to get the raw text value of the cue as it is in the src file.
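
The mode values described above can also be set from script, for example to build a simple subtitle language switcher. The sketch below is one possible approach, not a built-in feature: showOnlyLanguage is a hypothetical helper, and the plain objects stand in for the members of a video element's textTracks list.

```javascript
// Show the first track whose language matches and hide all the others.
// "textTracks" can be any array-like list of objects with language and
// mode properties, such as videoElement.textTracks.
// Returns the track that was shown, or null if no language matched.
function showOnlyLanguage(textTracks, language) {
  var shown = null;
  for (var i = 0; i < textTracks.length; i++) {
    var track = textTracks[i];
    if (shown === null && track.language === language) {
      track.mode = "showing";
      shown = track;
    } else {
      // "hidden" keeps cue events firing without rendering the cues.
      track.mode = "hidden";
    }
  }
  return shown;
}

// Plain objects standing in for TextTrack objects.
var sampleTracks = [
  { language: "en", mode: "showing" },
  { language: "de", mode: "disabled" }
];
showOnlyLanguage(sampleTracks, "de");
console.log(sampleTracks[0].mode, sampleTracks[1].mode); // hidden showing
```

Using "disabled" instead of "hidden" for the other tracks would also stop their cue events, which may be preferable when those events are not needed.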

More on markup

Markup can be added to the timestamp line of a cue to specify text direction, alignment and position. Cue text can be marked up to specify voice (for example, to give the name of speakers) and to add formatting. Subtitles and captions can be manipulated with CSS, like this:

::cue {
  color: #444;
  font: 1em sans-serif;
}
::cue(.warning) {
  color: red;
  font-weight: bold;
}

Silvia Pfeiffer's HTML5 Video Accessibility slides give more examples of working with markup–as well as showing how to build chapter tracks for navigation and description tracks for screen readers.

And finally...

Storing cue data in text files, rather than encoding them in-band in audio or video files, makes subtitling and captioning straightforward–and can improve accessibility, searchability and data portability.

The track element also enables the use of timed metadata and dynamic content linked to media playback, which in turn adds value to the audio and video elements.

Given its power, flexibility and simplicity, the track element is a big step towards making media on the Web more open and more dynamic.

Learn more