Are Subtitling Corpora really Subtitle-like?

Alina Karakanta 1,2, Matteo Negri 1, Marco Turchi 1
1 Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento, Italy
2 University of Trento, Italy
{akarakanta,negri,turchi}@fbk.eu

Abstract

Growing needs in translating multimedia content have resulted in Neural Machine Translation (NMT) gradually becoming an established practice in the field of subtitling. Contrary to text translation, subtitling is subject to spatial and temporal constraints, which greatly increase the post-processing effort required to restore the NMT output to a proper subtitle format. In this work, we explore whether existing subtitling corpora conform to the constraints of: 1) length and reading speed; and 2) proper line breaks. We show that the process of creating parallel sentence alignments removes important time and line break information, and we propose practices for creating resources for subtitling-oriented NMT faithful to the subtitle format.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Machine translation (MT) of subtitles is a growing need for various applications, given the amounts of online multimedia content becoming available daily. Subtitle translation is a complex process consisting of several stages (transcription, translation, timing), and manual approaches to the task are laborious and costly. Subtitling has to conform to spatial constraints, such as length, and temporal constraints, such as reading speed. While length and reading speed can be modelled as a post-processing step in an MT workflow using simple rules, subtitle segmentation, i.e. where and whether to insert a line break, depends on semantic and syntactic properties. Subtitle segmentation is particularly important, since it has been shown that proper segmentation by phrase or sentence significantly reduces reading time and improves comprehension (Perego, 2008; Rajendran et al., 2013). Hence, there is ample room for developing fully or at least partially automated solutions for subtitle-oriented NMT, which would contribute to reducing post-processing effort and speeding up turn-around times. Automated approaches, though, especially NMT, are data-hungry: performance greatly depends on the availability of large amounts of high-quality data (up to tens of millions of parallel sentences), specifically tailored for the task. In the case of subtitle-oriented NMT, this implies having access to large subtitle training corpora. This leads to the following question: what should data specifically tailored for subtitling-oriented NMT look like?

There are large amounts of parallel data extracted from subtitles (Lison and Tiedemann, 2016; Pryzant et al., 2018; Di Gangi et al., 2019). These corpora are usually obtained by collecting files in a subtitle-specific format (.srt) in several languages and then parsing and aligning them at sentence level. MT training at sentence level generally increases performance, as the system receives longer context (useful, for instance, to disambiguate words). As shown in Table 1, however, this process compromises the subtitle format by converting the subtitle blocks into full sentences. With this "merging", information about subtitle segmentation (line breaks) is often lost. Therefore, recovery of the MT output to a proper subtitle format has to be performed subsequently, either as a post-editing process or by using hand-crafted rules and boundary predictions. Integrating the subtitle constraints in the model can help reduce the post-processing effort, especially in cases where the input is a stream of data, such as in end-to-end Speech Neural Machine Translation. To date, there has been no study examining the consequences of obtaining parallel sentences from subtitles on preserving the subtitling constraints.

1
00:00:14,820 --> 00:00:18,820
Grazie mille, Chris.
È un grande onore venire

2
00:00:18,820 --> 00:00:22,820
su questo palco due volte.
Vi sono estremamente grato.

Grazie mille, Chris. È un grande onore venire su questo palco due volte.
Vi sono estremamente grato.

Table 1: Subtitle blocks (top, 1-2) as they appear in an .srt file and the processed output for obtaining aligned sentences (bottom).

In this work, we explore whether the large, publicly available parallel data compiled from subtitles conform to the temporal and spatial constraints necessary for achieving quality subtitles. We compare the existing resources to an adaptation of MuST-C (Di Gangi et al., 2019) in which the data is kept as subtitles. For evaluating length and reading speed, we employ character counts, while for proper line breaks we use the Chink-Chunk algorithm (Liberman and Church, 1992). Based on the analysis, we discuss limitations of the existing data and present a preliminary road-map towards creating resources for training subtitling-oriented NMT faithful to the subtitling format.
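To make the merging shown in Table 1 concrete, the following is a minimal Python sketch (not part of the paper's tooling; the file name and the simplified .srt reader are hypothetical) of how joining subtitle blocks into full sentences discards exactly the time codes and line breaks the paper is concerned with.

```python
import re

def read_srt_blocks(path):
    """Parse an .srt file into (start, end, lines) tuples."""
    blocks = []
    with open(path, encoding="utf-8") as f:
        for raw in f.read().strip().split("\n\n"):
            lines = raw.splitlines()
            # lines[0] is the block index, lines[1] the "start --> end" time codes,
            # the remaining lines are the subtitle lines as displayed on screen.
            start, end = (t.strip() for t in lines[1].split("-->"))
            blocks.append((start, end, lines[2:]))
    return blocks

def merge_into_sentences(blocks):
    """Join blocks into running text and re-split on sentence-final punctuation.
    Time codes and line breaks are dropped, which is the information loss
    discussed above."""
    text = " ".join(" ".join(lines) for _, _, lines in blocks)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# Hypothetical usage on the Italian side of Table 1:
# blocks = read_srt_blocks("talk.it.srt")
# merge_into_sentences(blocks)
# -> ['Grazie mille, Chris.',
#     'È un grande onore venire su questo palco due volte.',
#     'Vi sono estremamente grato.']
```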
2 Related work

2.1 Subtitling corpora

Building an end-to-end subtitle-oriented translation system poses several challenges, mainly related to the fact that NMT training needs large amounts of high-quality data representative of the target application scenario (subtitling in our case). Human subtitlers translate either directly from the audio/video, or they are provided with a template with the source text already in the format of subtitles, containing time codes and line breaks, which they have to adhere to when translating.

Several projects have attempted to collect parallel subtitling corpora. The most well-known one is the OpenSubtitles corpus (Lison and Tiedemann, 2016; http://www.opensubtitles.org/), extracted from 3.7 million subtitles across 60 languages. Since subtitle blocks do not always correspond to sentences (see Table 1), the blocks are merged and then segmented into sentences using heuristics based on time codes and punctuation. Then, the extracted sentences are aligned to create parallel corpora with the time-overlap algorithm (Tiedemann, 2008) and bilingual dictionaries. The 2018 version of OpenSubtitles has high-quality sentence alignments; however, it does not resemble the realistic subtitling scenario described above, since time and line break information are lost in the merging process. The same methodology was used for compiling MontenegrinSubs (Božović et al., 2018), an English–Montenegrin parallel corpus of subtitles, which contains only 68k sentences.

The Japanese-English Subtitle Corpus JESC (Pryzant et al., 2018) is a large parallel subtitling corpus consisting of 2.8 million sentences. It was created by crawling the internet for film and TV subtitles and aligning their captions with improved document and caption alignment algorithms. This corpus is aligned at caption level, therefore its format is closer to our scenario. On the other hand, non-matching alignments are discarded, which might hurt the integrity of the subtitling documents. As we will show, this is particularly important for learning proper line breaks between subtitle blocks.

A corpus preserving both subtitle segmentation and order of lines is SubCo (Martínez and Vela, 2016), a corpus of machine- and human-translated subtitles for English–German. However, it consists of only 2 source texts (∼150 captions each) with multiple student and machine translations. Therefore, it is not sufficient for training MT systems, although it could be useful for evaluation because of the multiple reference translations.

Slightly deviating from the domain of films and TV series, corpora for Spoken Language Translation (SLT) have been created based on TED talks. The Web Inventory of Transcribed and Translated Talks, WIT3 (Cettolo et al., 2012), is a multilingual collection of transcriptions and translations of TED talks. The talks are aligned at sentence level without audio information. Based on WIT3, the IWSLT campaigns (Niehues et al., 2018) annually release parallel data and the corresponding audio for the task of SLT, which are extracted based on time codes but again with merging operations to create segments. MuST-C (Di Gangi et al., 2019) is to date the largest multilingual corpus for end-to-end speech translation. It contains (audio, source language transcription, target language translation) triplets, aligned at segment level. The process of creation is the opposite of IWSLT: the authors first align the written parts and then match the audio. This is a promising corpus for an end-to-end system which translates from audio directly into subtitles. However, the translations are merged to create sentences, therefore they are far from the suitable subtitle format. Given the challenges discussed above, there exists no systematic study of the suitability of the existing corpora for subtitling-oriented NMT.

2.2 Subtitle segmentation

Subtitle segmentation techniques have so far focused on monolingual subtitle data. Álvarez et al. (2014) trained Support Vector Machine and Logistic Regression classifiers on correctly/incorrectly segmented subtitles to predict line breaks. Extending this work, Álvarez et al. (2017) used a Conditional Random Field (CRF) classifier for the same task, also differentiating between line breaks (next subtitle line) and subtitle breaks (next subtitle block). Recently, Song et al. (2019) employed a Long Short-Term Memory network (LSTM) to predict the position of the period in order to improve the readability of automatically generated YouTube captions. To our knowledge, to date there is no approach attempting to learn bilingual subtitle segmentation or to incorporate subtitle segmentation in an end-to-end NMT system.

3 Criteria for assessing subtitle quality

3.1 Background

The quality of translated subtitles is not evaluated only in terms of fluency and adequacy, but also based on their format. We assess whether the available subtitle corpora conform to the constraints of length, reading speed (for the corpora where time information is available) and proper line breaks on the basis of the criteria for subtitle segmentation mentioned in the literature of Audiovisual Translation (AVT) (Cintas and Remael, 2007) and the TED talk subtitling guidelines (https://www.ted.com/participate/translate/guidelines):

1. Characters per line. The space available for a subtitle is limited. The length of a subtitle depends on different factors, such as size of screen, font, age of the audience and country. For our analysis, we consider max. 42 characters for Latin alphabets and 14 for Japanese (including spaces).

2. Lines per subtitle. Subtitles should not take up too much space on screen. The space allowed for a subtitle is about 20% of screen space. Therefore, a subtitle block should not exceed 2 lines.

3. Reading speed. The on-air time of a subtitle should be sufficient for the audience to read and process its content. The subtitle should match as much as possible the start and the end of an utterance. The duration of the utterance (measured either in seconds or in feet/frames) is directly equivalent to the space a subtitle should occupy. As a general rule, we consider max. 21 characters/second.

4. Preserve 'linguistic wholes'. This criterion is related to subtitle segmentation. Subtitle segmentation does not rely only on the allowed length, but should respect linguistic norms. To facilitate readability, subtitle splits should not "break" semantic and syntactic units. In an ideal case, every subtitle line (or at least subtitle block) should represent a coherent linguistic chunk (i.e. a sentence or a phrase). For example, a noun should not be separated from its article. Lastly, subtitles should respect natural pauses.

5. Equal length of lines. Another criterion for splitting subtitles relates to aesthetics. There is no consensus about whether the top line should be longer or shorter; however, it has been shown that subtitle lines of equal length are easier to read, because the viewer's eyes return to the same point on the screen when reading the second line.

While subtitle length and reading speed are factors that can be controlled directly by the subtitling software used by translators, subtitle segmentation is left to the decision of the translator. Translators often have to either compromise the aesthetics in favour of the linguistic wholes or resort to omissions and substitutions. Therefore, modelling the segmentation decisions based on the large available corpora is of great importance for a high-quality subtitle-oriented NMT system.

3.2 Quality criteria filters

In order to assess the conformity of the existing subtitle corpora to the constraints mentioned above, we implement the following filters.

Characters per line (CPL): As mentioned above, the information about line breaks inside subtitle blocks is discarded in the process of creating parallel data. Therefore, we can only assume that a subtitle fulfils criteria 1 and 2 above by calculating the maximum possible length for a subtitle block: 2 * 42 = 84 characters for Latin scripts and 2 * 14 = 28 for Japanese. If CPL > max_length, then the subtitle doesn't conform to the length constraints.

Characters per second (CPS): This metric relates to reading speed. For the corpora where time codes and duration are preserved, we calculate CPS as follows: CPS = #chars / duration.

Chink-Chunk: Chink-Chunk is a low-level parsing algorithm which can be used as a rule-based method to insert line breaks between subtitles. It is a simple but efficient way to detect syntactic boundaries. It relates to preserving linguistic wholes, since it uses POS information to split units only at punctuation marks (logical completion) or when an open-class or content word (chunk) is followed by a closed-class or function word (chink). Here, we use this algorithm to compute statistics about the type of subtitle block breaks in the data (punctuation break, content-function break or other). The algorithm is described in Algorithm 1.

Algorithm 1: Chink-Chunk algorithm
if POS_last in ['PUNCT', 'SYM', 'X'] then
    punc_break += 1
else
    if POS_last in content_words and POS_next in function_words then
        cf_break += 1
    else
        other_split += 1
    end
end
return punc_break, cf_break, other_split
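As an illustration of how these filters can be implemented, below is a minimal Python sketch (not the paper's released code; the thresholds follow the values stated above, while the split of Universal Dependencies POS tags into content and function words is a hypothetical simplification). It computes the CPL and CPS checks for a subtitle and classifies the break before the next subtitle with the Chink-Chunk rule of Algorithm 1.

```python
# Thresholds from Section 3.2 (Latin scripts; use 2 * 14 = 28 for Japanese).
MAX_CHARS_PER_BLOCK = 2 * 42    # criteria 1 and 2 combined
MAX_CHARS_PER_SECOND = 21       # criterion 3

# Hypothetical split of Universal Dependencies POS tags into
# content (chunk) and function (chink) words.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "NUM"}
FUNCTION_POS = {"ADP", "AUX", "DET", "PRON", "SCONJ", "CCONJ", "PART"}

def cpl_ok(text: str) -> bool:
    """Characters per line (CPL) filter on a whole subtitle block."""
    return len(text) <= MAX_CHARS_PER_BLOCK

def cps_ok(text: str, duration_sec: float) -> bool:
    """Characters per second (CPS) filter: CPS = #chars / duration."""
    return len(text) / duration_sec <= MAX_CHARS_PER_SECOND

def break_type(pos_last: str, pos_next: str) -> str:
    """Chink-Chunk classification of the break between two consecutive
    subtitles (Algorithm 1): punctuation, content-function, or other."""
    if pos_last in {"PUNCT", "SYM", "X"}:
        return "punc_break"
    if pos_last in CONTENT_POS and pos_next in FUNCTION_POS:
        return "cf_break"
    return "other_split"

# Toy example: a 20-character subtitle shown for 4 seconds,
# ending in punctuation and followed by a block starting with a noun.
print(cpl_ok("Grazie mille, Chris."), cps_ok("Grazie mille, Chris.", 4.0))  # True True
print(break_type("PUNCT", "NOUN"))  # punc_break
```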
4 Experiments

For our experiments we consider the corpora which are large enough to train NMT systems: OpenSubtitles, JESC and MuST-C. We focus on 3 language pairs, Japanese, Italian and German, paired with English, as languages coming from different families and having a large portion of sentences in all corpora. We tokenise and then tag the data with Universal Dependencies (https://universaldependencies.org/) to obtain POS tags for the Chink-Chunk algorithm.

To observe the effect of the merging processes on preserving the subtitling constraints, we create a version of MuST-C at subtitle level (MuST-C subs). We obtain the same .srt files used to create MuST-C and extract only the subtitles with matching timestamps from the common talks in each language pair, without any merging operations. Table 2 shows the statistics of the extracted corpus. We randomly sample 1,000 sentence pairs and manually inspect their alignments: 94% were correctly aligned, 3% partially aligned and 3% misaligned.

LP      Total   Extracted     MuST-C
EN-IT   671K    452K / 3.4M   253K / 4.8M
EN-DE   575K    361K / 2.7M   229K / 4.2M
EN-JA   669K    399K / 3M     -

Table 2: Total number of subtitles vs. number of extracted subtitles (in lines) from the TED talk .srt files vs. the original MuST-C corpus. The first number shows lines (or sentences, respectively), while the second shows words on the English side.

We apply each of the criteria filters in Section 3.2 to the corpora, both on the source and the target side independently. Then, we take the intersection of the outputs of all the filters to obtain the lines/sentences which conform to all the criteria.
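The subtitle-level extraction described above can be sketched as follows. This is a hypothetical simplification rather than the actual MuST-C subs pipeline: it pairs source and target subtitles of the same talk only when their time codes match exactly, performs no merging, and reuses the read_srt_blocks helper introduced earlier.

```python
def extract_matching_subtitles(src_srt, tgt_srt):
    """Pair subtitles from two .srt files of the same talk by exact
    time-code match, keeping the original order and line breaks."""
    src_blocks = read_srt_blocks(src_srt)
    tgt_by_time = {(s, e): lines for s, e, lines in read_srt_blocks(tgt_srt)}
    pairs = []
    for start, end, src_lines in src_blocks:
        tgt_lines = tgt_by_time.get((start, end))
        if tgt_lines is not None:       # subtitles with non-matching time codes are skipped
            pairs.append((src_lines, tgt_lines))
    return pairs

# Hypothetical usage on one EN-IT talk:
# pairs = extract_matching_subtitles("talk.en.srt", "talk.it.srt")
```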

5 Analysis

Table 3 shows the percentage of preserved lines/sentences after applying each criterion.

LP      Corpus          Format     Time   CPL (s/t) %   CPS (s/t) %   Chink-Chunk (s/t) %   Total %
EN-IT   MuST-C          segment    X      49 / 48       78 / 72       99 / 99               45
EN-IT   OpenSubtitles   segment    -      95 / 94       -             99 / 99               91
EN-IT   MuST-C subs     subtitle   X      99 / 98       86 / 81       87 / 83               79
EN-DE   MuST-C          segment    X      51 / 47       77 / 66       99 / 99               42
EN-DE   OpenSubtitles   segment    -      95 / 95       -             99 / 99               92
EN-DE   MuST-C subs     subtitle   X      99 / 98       84 / 75       87 / 87               74
EN-JA   OpenSubtitles   segment    -      96 / 93       -             99 / 98               91
EN-JA   JESC            subtitle   -      97 / 94       -             88 / 87               85
EN-JA   MuST-C subs     subtitle   X      99 / 94       85 / 99       92 / 91               83

Table 3: Percentage of data preserved after applying each of the quality criteria filters on the subtitling corpora independently. Percentages are given on the source and target side (s/t), except for the Total, where source and target are combined.

Length: The analysis with the Characters per line filter shows that both OpenSubtitles and JESC conform to the quality criterion of length in at least 94% of the cases. Despite the merging operations to obtain sentence alignments, OpenSubtitles still preserves a short length of lines, possibly because of the nature of the text of subtitles. A manual inspection shows that the text is mainly short dialogues, and the long sentences are parts of descriptions or monologues, which are rarer. On the other hand, the merging operations in MuST-C create long sentences that do not resemble the subtitling format. This could be attributed to the format of TED talks: they mostly contain text written to be spoken, prepared talks usually delivered by one speaker with few dialogue turns. Among all corpora, MuST-C subs shows the highest conformity to the criterion of length, since indeed no merging operations were performed.

Reading speed: Conformity to the criterion of reading speed is achieved to a lesser degree, as shown by the Characters per second filter. Except for Japanese, where the allowed number of characters per line is lower, all other languages range between 66%-86%. In general, MuST-C subs, being in subtitling format, seems to conform better to reading speed. Unfortunately, time information is not present in corpora other than the two versions of MuST-C, therefore a full comparison is not possible.

Linguistic wholes: The Chink-Chunk algorithm shows interesting properties of the subtitle breaks for all the corpora. MuST-C and OpenSubtitles conform to the criterion of preserving linguistic wholes in 99% of the sentences, which does not occur in the corpora in subtitle format, JESC and MuST-C subs. Since these two corpora are compiled by removing captions based on unmatched time codes, the integrity of the documents is possibly broken. Subtitles are removed arbitrarily, so consecutive subtitles are often not kept in order. This shows the importance of preserving the order of subtitles when creating subtitling corpora.

This observation might lead to the assumption that JESC and MuST-C subs are less subtitle-like. However, a close inspection of the breaks shows that OpenSubtitles and MuST-C end in a punctuation mark in 99.9% of the cases. Even though they preserve logical completion, these corpora do not contain sufficient examples of line breaks preserving linguistic wholes. On the other hand, the subtitle-level corpora contain between 5%-11% subtitle breaks at a content-function word boundary. In a realistic subtitling scenario, an NMT system at inference time will often receive unfinished sentences, either from an audio stream or a subtitling template. Therefore, line break information might be valuable for training NMT systems that learn to translate and segment.

The total retained material shows that OpenSubtitles is the most suitable corpus for producing quality subtitles in all investigated languages, as more than 90% of the sentences passed the filters. However, this is not a fair comparison, given that the data was filtered with only 2 out of the 3 filters. One serious limitation of OpenSubtitles is the lack of time information, which does not allow for modelling reading speed. We showed that corpora in subtitling format (JESC, MuST-C subs) contain useful information about line breaks not ending in punctuation marks, which are mostly absent from OpenSubtitles. Since no information about subtitle line breaks (inside a subtitle block) is preserved in any of the corpora, the criterion of equal length of lines cannot be explored in this study.
6 Conclusions and discussion

We explored whether the existing parallel subtitling resources conform to the subtitling constraints. We found that subtitling corpora generally conform to length and proper line breaks, despite the merging operations for aligning parallel sentences. We also isolated some missing elements: the lack of time information (duration of utterance) and the insufficient representation of line breaks other than at punctuation marks.

This raises several open issues for creating corpora for subtitling-oriented NMT: i) subtitling constraints: a subtitling corpus, in order to be representative of the task, should respect the subtitling constraints; ii) duration of utterance: since the translation of a subtitle depends on the duration of the utterance, time information is highly relevant; iii) integrity of documents: a subtitle often occupies several lines, therefore the order of subtitles should be preserved whenever possible; iv) line break information: while parallel sentence alignments are indispensable, they should not compromise line break and subtitle block information. Break information could be preserved by inserting special symbols.

We intend to use these observations for an adaptation of MuST-C containing (audio, source language subtitle, target language subtitle) triplets, preserving line break information and taking advantage of natural pauses in the audio. In the long run, we would like to train NMT systems which predict line breaks while translating, possibly extending the input context using methods from document-level translation.
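To illustrate point iv), here is a minimal sketch of how break information could be preserved with special symbols when flattening subtitles into training sentences; the <eol> and <eob> markers (line break and block break) are hypothetical tokens used for illustration, not an annotation scheme defined in this paper.

```python
def flatten_with_breaks(blocks):
    """Flatten subtitle blocks into a single training string, marking
    line breaks with <eol> and block breaks with <eob>."""
    marked = []
    for _, _, lines in blocks:          # blocks as returned by read_srt_blocks
        marked.append(" <eol> ".join(lines))
    return " <eob> ".join(marked)

# For the Italian blocks of Table 1 this would yield:
# "Grazie mille, Chris. <eol> È un grande onore venire <eob>
#  su questo palco due volte. <eol> Vi sono estremamente grato."
```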
Acknowledgements

This work is part of a project financially supported by an Amazon AWS ML Grant.

References

Aitor Álvarez, Haritz Arzelus, and Thierry Etchegoyhen. 2014. Towards customized automatic segmentation of subtitles. In Advances in Speech and Language Technologies for Iberian Languages, pages 229–238, Cham. Springer International Publishing.

Aitor Álvarez, Carlos-D. Martínez-Hinarejos, Haritz Arzelus, Marina Balenciaga, and Arantza del Pozo. 2017. Improving the automatic segmentation of subtitles through conditional random field. Speech Communication, 88:83–95. Elsevier BV.

Petar Božović, Tomaž Erjavec, Jörg Tiedemann, Nikola Ljubešić, and Vojko Gorjanc. 2018. Opus-MontenegrinSubs 1.0: First electronic corpus of the Montenegrin language. In Conference on Language Technologies & Digital Humanities.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy, May.

Jorge Díaz Cintas and Aline Remael. 2007. Audiovisual Translation: Subtitling. Translation Practices Explained. Routledge.

Mattia Antonino Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: a multilingual speech translation corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Minneapolis, MN, USA, June.

Mark Liberman and Kenneth Church. 1992. Text analysis and word pronunciation in text-to-speech synthesis. Advances in Speech Signal Processing, pages 791–831.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

José Manuel Martínez Martínez and Mihaela Vela. 2016. SubCo: A learner translation corpus of human and machine subtitles. In Proceedings of the Language Resources and Evaluation Conference (LREC).

Jan Niehues, Roldano Cattoni, Mauro Cettolo, Sebastian Stüker, Marco Turchi, and Marcello Federico. 2018. The IWSLT 2018 evaluation campaign. In Proceedings of IWSLT 2018.

Elisa Perego. 2008. Subtitles and line-breaks: Towards improved readability. Between Text and Image: Updating Research in Screen Translation, 78(1):211–223.

Reid Pryzant, Yongjoo Chung, Dan Jurafsky, and Denny Britz. 2018. JESC: Japanese-English Subtitle Corpus. In Proceedings of the Language Resources and Evaluation Conference (LREC).

Dhevi J. Rajendran, Andrew T. Duchowski, Pilar Orero, Juan Martínez, and Pablo Romero-Fresco. 2013. Effects of text chunking on subtitling: A quantitative and qualitative examination. Perspectives, 21(1):5–21.

Hye-Jeong Song, Hong-Ki Kim, Jong-Dae Kim, Chan-Young Park, and Yu-Seop Kim. 2019. Inter-sentence segmentation of YouTube subtitles using long-short term memory (LSTM). 9:1504.

Jörg Tiedemann. 2008. Synchronizing translated movie subtitles. In Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008.