ML-based Captions are Crucial for Streaming Media

Automatic speech recognition and other ML technologies are enabling streaming providers to realize tremendous efficiencies in the media captions workflow.

Thanks to the latest advancements in ML, video service providers can ensure that when content is delivered in different geographies and quality levels for OTT platforms, captions maintain outstanding quality. Delivering high-quality captions is becoming even more crucial to comply with strict regulations and to target global audiences.

Why captioning is important

The first captions appeared nearly 50 years ago, enabling a text version of speech and other sounds to be displayed on television, DVDs, and online videos. In 1972, “The French Chef” with Julia Child was the first open-captioned program to air, setting a new standard for television experiences. Viewers who were deaf or hard of hearing, as well as non-native speakers, could finally understand what was happening on the screen and enjoy the television experience.

Since then, captions have evolved. Not only have captions become more accurate and precise, but they are now universally available on any television and streaming service, helping video service providers resolve audio issues, adhere to industry regulations and comply with standards, improve viewer accessibility, optimize brand retention, and promote globalization.

Captions have become mandatory from a regulatory and business standpoint; however, adding captions to streaming content can be challenging.

Challenges with offering high-quality captions

Captions are not the same as audio transcriptions. A substantial amount of time and effort goes into creating and verifying the quality of captions.

One of the most significant challenges with captioning is the long turnaround time. Creating captions can take eight to 10 times the duration of the video itself. The work is tedious, and for video service providers handling a massive volume of content, manual captioning has become impractical. Additionally, captioning is expensive, costing an average of $5 to $10 per minute.
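To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the quoted rate applies per minute of video, and the 100-hour catalog is a hypothetical example, not a figure from any provider.

```python
def manual_captioning_estimate(video_minutes: float) -> dict:
    """Estimate manual captioning effort from the ranges quoted above."""
    # Ranges from the text: 8 to 10 times real time, $5 to $10 per minute.
    # Assumption: "per minute" means per minute of video.
    effort_low, effort_high = 8 * video_minutes, 10 * video_minutes
    cost_low, cost_high = 5 * video_minutes, 10 * video_minutes
    return {
        "effort_hours": (effort_low / 60, effort_high / 60),
        "cost_usd": (cost_low, cost_high),
    }


# Hypothetical catalog: 100 hours of video (6,000 minutes).
estimate = manual_captioning_estimate(100 * 60)
print(estimate)
```

Under these assumptions, a 100-hour catalog would need roughly 800 to 1,000 hours of manual work and cost between $30,000 and $60,000.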

Creating and verifying captions is a complex process. The quality of captions can be diminished during transcoding, editing, and screen placement. Captions should always be properly placed to avoid blocking an object of interest in the scene, as improperly placed captions can lead to viewer misunderstanding and confusion.

Many video service providers are abandoning manual captioning in favor of automatic, ML-based solutions, both for these reasons and because of a shortage of trained professionals. Ensuring high-quality captions requires knowledge of the subject matter, the nuances of each language and the region’s captioning rules, along with fast and accurate typing skills.

How to speed up captions creation and verification

ML-based systems expedite the creation of captions at a global scale. With a high-performance, ML-based captioning solution, video service providers can easily generate caption predictions for review, reducing turnaround time, manual effort, and costs.

ML-based caption generation is a four-step process: transcription, intelligent segmentation that applies all captioning rules, placement, and review. Transcription is a very complex step that involves voice detection and audio-to-text conversion. As additional vocabulary words are added to the speech recognition dictionary during the review process, the accuracy of predictions improves.

The intelligent segmentation phase relies on natural language processing to ensure that the transcript is divided into semantically coherent units and that each unit can be displayed properly on the video frame. Each caption should contain no more than two or three lines to avoid covering too much of the video. Each line should contain no more than 42 characters, or it will run out of space. Segmentation needs to take into account natural pauses, clause boundaries, compound words, and more, to ensure that the meaning of a sentence is not lost. Scene change detection can be employed at this step to ensure that captions do not run over scene change boundaries.
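The line and length limits above can be illustrated with a minimal Python sketch. It is a rule-based simplification, not the NLP-driven segmentation described here: it only splits at sentence-ending punctuation and wraps greedily, and the function name and the choice of two lines per caption are assumptions made for illustration.

```python
import re
import textwrap

MAX_CHARS_PER_LINE = 42    # limit quoted in the text
MAX_LINES_PER_CAPTION = 2  # the text allows two or three; two is assumed here


def segment_transcript(transcript: str) -> list[list[str]]:
    """Split a transcript into captions, each a list of short lines."""
    captions = []
    # Split after sentence-ending punctuation so that no caption
    # straddles two sentences.
    sentences = re.split(r"(?<=[.!?])\s+", transcript.strip())
    for sentence in sentences:
        if not sentence:
            continue
        # Greedy wrap; hyphenated compound words are kept on one line.
        lines = textwrap.wrap(
            sentence, width=MAX_CHARS_PER_LINE, break_on_hyphens=False
        )
        for i in range(0, len(lines), MAX_LINES_PER_CAPTION):
            captions.append(lines[i:i + MAX_LINES_PER_CAPTION])
    return captions


sample = (
    "Captions are generally placed at the bottom of the screen. "
    "However, sometimes a well-known scoreboard is already there."
)
for caption in segment_transcript(sample):
    print("\n".join(caption), end="\n\n")
```

A real system would choose break points from clause boundaries, pauses in the audio, and scene changes rather than from character counts alone.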

Captions are generally placed at the bottom of the screen. However, sometimes there is already an object of interest at the bottom of the screen, for example, a sports scoreboard. An ML-based caption generation system will be able to naturally detect objects on the screen and place captions in the most advantageous spot.
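The placement decision can be sketched as a simple overlap test. The Python example below assumes an object detector has already produced bounding boxes in normalized frame coordinates; the candidate regions, their coordinates, and all names are hypothetical values chosen for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Rectangle in normalized frame coordinates (0.0 to 1.0)."""
    left: float
    top: float
    right: float
    bottom: float


def overlap_area(a: Box, b: Box) -> float:
    """Area shared by two boxes, or 0.0 if they do not intersect."""
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.bottom, b.bottom) - max(a.top, b.top)
    return max(width, 0.0) * max(height, 0.0)


# Hypothetical candidate regions; bottom is listed first so it wins ties.
CANDIDATE_REGIONS = {
    "bottom": Box(0.1, 0.80, 0.9, 0.95),
    "top": Box(0.1, 0.05, 0.9, 0.20),
}


def choose_caption_region(objects: list[Box]) -> str:
    """Pick the candidate region that covers the least object area."""
    def covered(name: str) -> float:
        region = CANDIDATE_REGIONS[name]
        return sum(overlap_area(region, obj) for obj in objects)

    return min(CANDIDATE_REGIONS, key=covered)


scoreboard = Box(0.3, 0.85, 0.7, 0.95)  # object of interest near the bottom
print(choose_caption_region([]))            # no objects: default position
print(choose_caption_region([scoreboard]))  # scoreboard blocks the bottom
```

With no detected objects the caption stays at the bottom; when the scoreboard occupies that area, the sketch moves the caption to the top region instead.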

Once all predictions, along with their confidence scores, are in place, they need a final review. An intelligent player simplifies the review process by giving streaming providers useful insights, such as areas with low confidence scores, spelling mistakes or captioning violations. A player with instant access to the video frames or audio waveforms corresponding to any caption text speeds up editing and review. Once the review is done, the player should also be able to export the caption text and timing, along with all the formatting and placement attributes, to a sidecar format ready to be consumed.
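As a rough illustration of the review and export steps, the Python sketch below flags low-confidence captions and writes the result as SubRip (SRT) text. The Caption class, the 0.8 threshold, and the sample data are assumptions. SRT is used only because it is simple; it carries text and timing but not placement, so a format such as WebVTT or TTML would be needed to preserve positioning attributes.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start: float       # seconds from the start of the video
    end: float
    text: str
    confidence: float  # 0.0 to 1.0, hypothetical score from transcription


def needs_review(captions: list[Caption], threshold: float = 0.8) -> list[Caption]:
    """Return captions whose confidence falls below the threshold."""
    return [c for c in captions if c.confidence < threshold]


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(captions: list[Caption]) -> str:
    """Serialize captions as SRT: index, time range, text, blank line."""
    blocks = []
    for index, c in enumerate(captions, start=1):
        time_range = f"{format_timestamp(c.start)} --> {format_timestamp(c.end)}"
        blocks.append(f"{index}\n{time_range}\n{c.text}")
    return "\n\n".join(blocks) + "\n"


captions = [
    Caption(0.0, 2.5, "Hello, world.", 0.95),
    Caption(2.5, 5.0, "Bon appetit!", 0.55),
]
print([c.text for c in needs_review(captions)])
print(to_srt(captions))
```

In this sample, only the second caption falls below the threshold, so a reviewer would be directed to it first.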

Once audio is captioned, machine translation can be employed to generate subtitles in any given language. A good subtitling tool needs to understand the nuances and segmentation rules of the target language rather than only translating the text.


Since the advent of video streaming services, content has become globalized, and foreign-language content can only be consumed with the help of captions and subtitles. As broadcasters and service providers extend the reach of their content to global audiences and look to capture additional views, having an ML-based, automatic captions solution is essential.

Utilizing ML technologies, broadcasters and video streaming providers can create quality captions at scale with consistency and 24/7 availability, without heavily investing in manual labor. ML-based captioning solutions reduce human error and allow time for more creativity, freeing up individuals from transcription, timestamping and complex segmentation. This enables video service providers to focus their efforts on more difficult audio segments and on adding audio descriptions. With an ML-based captioning solution, video service providers have peace of mind knowing that captions are of exceptional production quality.

Sana Afsa is principal engineer, Interra Systems
