Pycaption Documentation Release 0.5.0

PBS, Jun 16, 2021

Contents

1 Table of contents
  1.1 Introduction
    1.1.1 Python Usage
  1.2 Supported formats
    1.2.1 SAMI Reader / Writer :: spec
    1.2.2 DFXP/TTML Reader / Writer :: spec
    1.2.3 SRT Reader / Writer :: spec
    1.2.4 WebVTT Reader / Writer :: spec
    1.2.5 SCC Reader :: spec
    1.2.6 Transcript Writer
  1.3 Extensibility

pycaption is a Python library for converting caption formats.

1 Table of contents

1.1 Introduction

pycaption is a caption reading/writing module. Use one of the given Readers to read content into a CaptionSet object, and then use one of the Writers to output the CaptionSet into captions of your desired format.

Requires Python 2.7.

Turn a caption into multiple caption outputs:

    import pycaption.transcript
    from pycaption import CaptionConverter, SRTReader, SAMIWriter, DFXPWriter

    srt_caps = u'''1
    00:00:09,209 --> 00:00:12,312
    This is an example SRT file,
    which, while extremely short,
    is still a valid SRT file.
    '''

    # Read once, then write the same captions in several formats.
    converter = CaptionConverter()
    converter.read(srt_caps, SRTReader())
    print converter.write(SAMIWriter())
    print converter.write(DFXPWriter())
    print converter.write(pycaption.transcript.TranscriptWriter())

Not sure what format the caption is in?
Detect it:

    from pycaption import detect_format, SAMIWriter

    caps = u'''1
    00:00:01,500 --> 00:00:12,345
    Small caption'''

    # detect_format returns a reader class, or a falsy value if no format matches.
    reader = detect_format(caps)
    if reader:
        print SAMIWriter().write(reader().read(caps))

Or if you expect to have only a subset of the supported input formats:

    from pycaption import SRTReader, DFXPReader, SCCReader, SAMIWriter

    caps = u'''1
    00:00:01,500 --> 00:00:12,345
    Small caption'''

    if SRTReader().detect(caps):
        print SAMIWriter().write(SRTReader().read(caps))
    elif DFXPReader().detect(caps):
        print SAMIWriter().write(DFXPReader().read(caps))
    elif SCCReader().detect(caps):
        print SAMIWriter().write(SCCReader().read(caps))

1.1.1 Python Usage

Example: Convert from SAMI to DFXP

    from pycaption import SAMIReader, DFXPWriter

    sami = u'''<SAMI><HEAD><TITLE>NOVA3213</TITLE><STYLE TYPE="text/css">
    </STYLE></HEAD><BODY>
    <SYNC start="9209">
        ( clock ticking )
        FRENCH LINE 1!
    </SYNC>
    <SYNC start="12312"> </SYNC>
    <SYNC start="14848">
        MAN: When we think of E equals m c-squared,
        FRENCH LINE 2?
    </SYNC>'''

    print DFXPWriter().write(SAMIReader().read(sami))

Which will output the following:

    <?xml version="1.0" encoding="utf-8"?>
    <tt xml:lang="en" xmlns="http://www.w3.org/ns/ttml" xmlns:tts="http://www.w3.org/ns/ttml#styling">
     <head>
      <styling>
       <style id="p" tts:color="#fff" tts:fontfamily="Arial" tts:fontsize="10pt" tts:textAlign="center"/>
      </styling>
     </head>
     <body>
      <div xml:lang="fr-cc">
       FRENCH LINE 1!
       FRENCH LINE 2?
      </div>
      <div xml:lang="en-US">
       ( clock ticking )
       MAN: When we think of E equals m c-squared,
      </div>
     </body>
    </tt>

1.2 Supported formats

Read:
- DFXP/TTML
- SAMI
- SCC
- SRT
- WebVTT

Write:
- DFXP/TTML
- SAMI
- SRT
- Transcript
- WebVTT

See the examples folder for sample captions that can currently be read correctly.

1.2.1 SAMI Reader / Writer :: spec

Microsoft Synchronized Accessible Media Interchange. Supports multiple languages.
Supported Styling:
- text-align
- italics
- font-size
- font-family
- color

If the SAMI file is not valid XML (e.g. it has unclosed tags), the reader will still attempt to read it.

1.2.2 DFXP/TTML Reader / Writer :: spec

The W3 standard. Supports multiple languages.

Supported Styling:
- text-align
- italics
- font-size
- font-family
- color

1.2.3 SRT Reader / Writer :: spec

SubRip captions. If given multiple languages to write, the writer outputs all of them, joined together by a ‘MULTI-LANGUAGE SRT’ line.

Supported Styling:
- None

Assumes the input language is English. To change it:

    pycaps = SRTReader().read(srt_content, lang='fr')

1.2.4 WebVTT Reader / Writer :: spec

WebVTT is a W3C standard for displaying timed text in HTML5. Its specification is currently (as of February 2015) at the draft stage, so not all of its features are implemented by major players; the same is true of pycaption.

Styling

Styling in WebVTT can be done via inline tags (e.g. <b>, <i>, <u>) or via external CSS rules applied to text wrapped in class (<c>) or voice (<v>) tags. pycaption currently keeps only voice tags on conversion. Example:

    <v Fred>Hi, my name is Fred

is converted to

    Fred: Hi, my name is Fred

The following supported WebVTT tags are stripped from the cue text:
- <c>, <i>, <b>, <u>, <ruby>, <rt>, <lang> and timestamp tags (<h:mm:ss.sss>)

Unsupported tags are left unchanged as a natural part of the cue text, with no special meaning.

Positioning

The WebVTT spec allows the position of cues to be customized through a number of cue settings. pycaption currently maintains positioning information only on writing, in which case it supports the following settings:
- A WebVTT line position cue setting.
- A WebVTT text position cue setting.
- A WebVTT size cue setting.
- A WebVTT alignment cue setting.

pycaption does not support:
- A WebVTT vertical text cue setting.
- A WebVTT region cue setting.

Refer to the official WebVTT specification for details about the cue settings.
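The WebVTT styling rules described above (voice tags become a "Name: " prefix, the listed tags are stripped, anything else is left alone) can be illustrated with a small standalone sketch. This is not pycaption's implementation: the function name simplify_cue_text and the regular expressions are assumptions made for illustration, and the sketch uses only the standard library.

```python
import re

# Hypothetical patterns, written only to illustrate the documented behaviour.
# Tags the documentation lists as stripped from the cue text.
STRIPPED_TAGS = re.compile(r"</?(?:c|i|b|u|ruby|rt|lang)(?:[.\s][^>]*)?>")
# Timestamp tags of the form <h:mm:ss.sss>.
TIMESTAMP_TAG = re.compile(r"<\d+:\d{2}:\d{2}\.\d{3}>")
# Opening voice tag, e.g. <v Fred> (optionally with classes, e.g. <v.loud Fred>).
VOICE_OPEN = re.compile(r"<v(?:\.[^\s>]+)*\s+([^>]+)>")
VOICE_CLOSE = re.compile(r"</v>")


def simplify_cue_text(text):
    """Approximate the documented conversion of WebVTT cue text."""
    # Keep the speaker name from a voice tag as a "Name: " prefix.
    text = VOICE_OPEN.sub(lambda m: m.group(1).strip() + ": ", text)
    text = VOICE_CLOSE.sub("", text)
    # Drop timestamp tags and the listed styling tags; keep their inner text.
    text = TIMESTAMP_TAG.sub("", text)
    return STRIPPED_TAGS.sub("", text)


voice_example = simplify_cue_text("<v Fred>Hi, my name is Fred")
mixed_example = simplify_cue_text(
    "<c.loud>Look</c> <i>out</i> <00:00:05.000>now <blink>x</blink>"
)
```

Here voice_example reproduces the conversion shown in the Styling section, and the <blink> tag in mixed_example survives untouched because it is not one of the supported tags.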
1.2.5 SCC Reader :: spec

Scenarist Closed Caption format. Assumes Channel 1 input.

Supported Styling:
- italics

By default, the SCC Reader does not simulate roll-up captions. To enable roll-ups:

    pycaps = SCCReader().read(scc_content, simulate_roll_up=True)

The reader also assumes the input language is English. To change it:

    pycaps = SCCReader().read(scc_content, lang='fr')

The reader also accepts an offset (measured in seconds) for the timestamps. For example, if the SCC file is 45 seconds ahead of the video:

    pycaps = SCCReader().read(scc_content, offset=45)

The SCC Reader handles both dropframe and non-dropframe captions, and will auto-detect which format the captions are in.

1.2.6 Transcript Writer

Text stripped of styling, arranged in sentences.

Supported Styling:
- None

The transcript writer uses natural sentence boundary detection algorithms to create the transcript.

1.3 Extensibility

Different readers and writers are easy to add if you would like to:
- Read/Write a previously unsupported format
- Read/Write a supported format in a different way (more styling?)

Simply follow the format of a current Reader or Writer, and edit to your heart’s desire.
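As a rough illustration of the reader/writer pattern that a new format would follow, the sketch below pairs a toy reader (with detect and read methods) with a toy writer (with a write method). Everything in it is hypothetical: the tab-separated input format, the Cue tuple, and the ToyTSVReader and ToySRTWriter classes are invented for this example and do not use pycaption's classes or its CaptionSet. A real extension would follow an existing pycaption Reader or Writer instead.

```python
from collections import namedtuple

# Hypothetical stand-in for a caption; pycaption itself uses a CaptionSet.
Cue = namedtuple("Cue", ["start_ms", "end_ms", "text"])


class ToyTSVReader(object):
    """Reads an invented format: start_ms<TAB>end_ms<TAB>text, one cue per line."""

    def detect(self, content):
        lines = [line for line in content.splitlines() if line.strip()]
        if not lines:
            return False
        for line in lines:
            parts = line.split("\t")
            if len(parts) != 3 or not (parts[0].isdigit() and parts[1].isdigit()):
                return False
        return True

    def read(self, content):
        if not self.detect(content):
            raise ValueError("content is not in the toy TSV format")
        cues = []
        for line in content.splitlines():
            if line.strip():
                start, end, text = line.split("\t")
                cues.append(Cue(int(start), int(end), text))
        return cues


def srt_timestamp(ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rest = divmod(ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    seconds, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


class ToySRTWriter(object):
    """Writes a list of Cue tuples as SRT text."""

    def write(self, cues):
        blocks = []
        for index, cue in enumerate(cues, 1):
            blocks.append("%d\n%s --> %s\n%s\n" % (
                index,
                srt_timestamp(cue.start_ms),
                srt_timestamp(cue.end_ms),
                cue.text,
            ))
        return "\n".join(blocks)


tsv_caps = "9209\t12312\tThis is an example.\n14848\t18848\tSecond cue.\n"
srt_output = ToySRTWriter().write(ToyTSVReader().read(tsv_caps))
```

Keeping detection, reading and writing in separate methods mirrors the usage shown in the Introduction, where any reader can be combined with any writer.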
