arXiv:2107.06246v1 [cs.CL] 13 Jul 2021

Between Flexibility and Consistency: Joint Generation of Captions and Subtitles

Alina Karakanta 1,2, Marco Gaido 1,2, Matteo Negri 1, Marco Turchi 1
1 Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento - Italy
2 University of Trento, Italy
{akarakanta,mgaido,negri,turchi}@fbk.eu

Abstract

Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing (i.e. captions). However, the joint generation of source captions and target subtitles does not only bring potential output quality advantages when the two decoding processes inform each other, but it is also often required in multilingual scenarios. In this work, we focus on ST models which generate consistent captions-subtitles in terms of structure and lexical content. We further introduce new metrics for evaluating subtitling consistency. Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing sufficient flexibility to produce subtitles conforming to language-specific needs and norms.

1 Introduction

New trends in media localisation call for the rapid generation of subtitles for vast amounts of audiovisual content. Speech translation, and especially direct approaches (Bérard et al., 2016; Bahar et al., 2019), have recently shown promising results with high efficiency because they do not require a (manual or automatic) transcription of the source speech but generate the target language subtitles directly from the audio. However, obtaining the intralingual subtitles (hereafter "captions") is necessary for a range of applications and settings where captions need to be displayed along with the target language subtitles. Such "bilingual subtitles" are useful in multilingual online meetings, in countries with multiple official languages, for language learners and for audiences with different accessibility needs. In those cases, captions and subtitles should not only be consistent with the visual and acoustic dimension of the audiovisual material, but also between each other, for example in the number of blocks (pieces of time-aligned text) they occupy, their length and segmentation. Consistency is vital for user experience, for example in order to elicit the same reaction among multilingual audiences, or to facilitate the quality assurance process in the localisation industry.

Previous work in ST for subtitling has focused on generating interlingual subtitles (Matusov et al., 2019; Karakanta et al., 2020a), a) without considering the necessity of obtaining captions consistent with the target subtitles, and b) without examining whether the joint generation leads to improvements in quality. We hypothesise that knowledge sharing between the tasks of transcription and translation could lead to such improvements. Moreover, joint generation with a single system can avoid the maintenance of two different models, increase efficiency, and in turn speed up the localisation process. Lastly, if joint generation improves consistency, joint models could increase automation in subtitling applications where consistency is a desideratum.

In this work, we address these issues for the first time, by jointly generating both captions and subtitles for the same audio source. We experiment with the following models: 1) Shared Direct (Weiss et al., 2017), where the speech encoder is shared between the transcription and the translation decoder, 2) Two-Stage (Kano et al., 2017), where the transcript decoder states are passed to the translation decoder, and 3) Triangle (Anastasopoulos and Chiang, 2018), which extends the two-stage model by adding a second attention mechanism with which the translation decoder attends to the encoded speech inputs. We compare these established approaches to the models commonly used in ST for subtitling: an independent direct ST model and a cascade (ASR+MT) model. Moreover, we extend the evaluation beyond the usual metrics used to assess transcription and translation quality (respectively WER and BLEU), by also evaluating the form and consistency of the generated subtitles.

Sperber et al. (2020) introduced several lexical and surface metrics to measure consistency of ST outputs, but they were only applied to standard, non-subtitle, texts. Subtitles, however, are a particular type of text structured in blocks which accompany the action on screen. Therefore, we propose to measure their consistency by taking advantage of this structure and introduce metrics able to reward subtitles that share similar structure and content.

Our contributions can be summarised as follows:

• We employ ST to directly generate both captions and subtitles without the need for human pre-processing (transcription, segmentation).

• We propose new, task-specific metrics to evaluate subtitling consistency, a challenging and understudied problem in subtitling evaluation.

• We show increased performance and consistency between the generated captions and subtitles compared to independent decoding, while preserving adequate conformity to subtitling constraints.
2 Background

2.1 Bilingual subtitles

New life conditions maximised the time spent in front of screens, transforming today's mediascape into a complex puzzle of new actors, voices and workflows. Face-to-face meetings and conferences moved online, opening up participation for global audiences. In these new settings multilinguality and versatility are dominant, manifested in business meetings with international partners, conferences with world-wide coverage, multilingual classrooms and audiences with mixed accessibility needs. Given these growing needs for providing inclusive and equal access to audiovisual material for a multifaceted audience spectrum, efficiently obtaining high-quality captions and subtitles is becoming increasingly relevant.

Traditionally, displaying subtitles in two languages in parallel (bilingual or dual subtitles) has been common in countries with more than one official language, such as Belgium and Finland (Gottlieb, 2004). Recently, however, captions along with subtitles have been employed in other countries to attract wider audiences, e.g. in Mainland China English captions are displayed along with Mandarin subtitles. Interestingly, despite doubling the amount of text that appears on the screen and the high redundancy, it has been shown that bilingual subtitles do not significantly increase users' cognitive load (Liao et al., 2020).

One group which undoubtedly benefits from the parallel presence of captions and subtitles are language learners. Captions have been found to increase learners' L2 vocabulary (Sydorenko, 2010) and improve listening comprehension (Guichon and McLornan, 2008). Subtitles in the learners' native language (L1) are an indispensable tool for comprehension and access to information, especially for beginners. In bilingual subtitles, the captions support learners in understanding the speech and acquiring terminology, while subtitles serve as a dictionary, facilitating bilingual mapping (García, 2017). Consistency is particularly important for bilingual subtitles. Terms should fall in the same block and in similar positions. Moreover, similar length and an equal number of lines can prevent long-distance saccades, assisting in spotting the necessary information in the two language versions. Several subtitling tools have recently allowed for combining captions and subtitles on the same video (e.g. Dualsub [1]) and bilingual subtitles can be obtained for Youtube videos [2] and TED Talks [3].

[1] https://www.dualsub.xyz/
[2] https://www.watch-listen-read.com/
[3] https://amara.org/en/teams/ted/

Another aspect where consistency between captions and subtitles is present is in subtitling templates. A subtitling template is a source language/English version of a subtitle file, already segmented and containing timestamps, which is used to directly translate the text into target languages while preserving the same structure (Cintas and Remael, 2007; Georgakopoulou, 2019; Netflix, 2021). This process reduces the cost, turnaround times and effort needed to create a separate timed version for each language, and facilitates quality assurance since errors can be spotted across the same blocks (Nikolić, 2015). These benefits motivated our work towards simultaneously generating two language versions with the maximum consistency, where the caption file can further serve as a template for multilingual localisation. This paper is a first step towards maximising automation for the generation of high-quality multiple language/accessibility subtitle versions.

2.2 MT and ST for subtitling

Subtitling has long sparked the interest of the Machine Translation (MT) community as a challenging type of translation. Most works employing MT for subtitling stem from the statistical era (Volk et al., 2010; Etchegoyhen et al., 2014) or even before, with example-based approaches (Melero et al., 2006; Armstrong et al., 2006; Nyberg and Mitamura, 1997; Popowich et al., 2000; Piperidis et al., 2005). With the neural era, the interest in automatic approaches to subtitling revived. Neural Machine Translation (NMT) led to higher performance and efficiency and opened new paths and opportunities. Matusov et al. (2019) customised a cascade of ASR and NMT for subtitling, using domain adaptation with fine-tuning and improving subtitle segmentation with a specific segmentation module. Similarly, using cascades, Koponen et al. (2020) explored sentence- and document-level NMT for subtitling and showed productivity gains for some language pairs. However, bypassing the need to train and maintain separate components for transcription, translation and segmentation, direct end-to-end ST systems are now being considered as a valid and potentially more promising alternative (Karakanta et al., 2020a). Indeed, besides the architectural advantages, they come with the promise to avoid error propagation (a well-known issue of cascade solutions), reduce latency, and better exploit speech information (e.g. prosody), without loss of information, thanks to a less mediated access to the source utterance. To our knowledge, no previous work has yet explored the effectiveness of joint automatic generation of captions and subtitles.

2.3 Joint generation of transcription and translation

The idea of generating transcript and translation has been previously addressed in (Weiss et al., 2017; Anastasopoulos and Chiang, 2018). These papers presented different solutions (e.g. shared decoder and triangle) with the goal of improving translation performance by leveraging both ASR and ST data in direct ST. Later, Sperber et al. (2020) evaluated these methods with the focus of jointly producing consistent source and target language texts. Their underlying intuition is that, since in cascade solutions the translation is derived from the transcript, cascades should achieve higher consistency than direct solutions. Their results, however, showed that triangle models achieve the highest consistency among the architectures tested (considerably better than that of cascade systems) and have competitive performance in terms of translation quality. Direct independent and shared models, instead, do not achieve the translation quality and consistency of cascades. However, all these previous efforts fall outside the domain of automatic subtitling and ignore the inner structure of the subtitles and its relevance when considering consistency.
3 Methodology

3.1 Models

To study the effectiveness of the different existing ST approaches in the subtitling scenario, we experiment with the following models:

The Multitask Direct (DirMu) model consists of a single audio encoder and two separate decoders (Weiss et al., 2017): one for generating the source language captions, and the other for the target language subtitles. The weights of the encoder are shared. The model can exploit knowledge sharing between the two tasks, but allows for some degree of flexibility since inference for one task is not directly influenced by the other task.

The Two-Stage (2ST) model (Kano et al., 2017) also has two decoders, but the transcription decoder states are passed to the translation decoder. This is the only source of information for the translation decoder, as it does not attend to the encoder output.

The Triangle (Tri) model (Anastasopoulos and Chiang, 2018) is similar to the two-stage model, but with the addition of an attention mechanism to the translation decoder, which attends to the output embeddings of the encoder. Both 2ST and Tri support coupled inference and joint training.

We compare these models with common solutions for ST. The Cascade (Cas) model is a combination of ASR and NMT components; the ASR transcribes the audio into text in the source language, which is then passed to an NMT system for translation into the target language. The two components are trained separately and can therefore take advantage of richer data for the two tasks. The cascade features full dependence between transcription and translation, which will potentially lead to high consistency.

The Direct Independent (DirInd) system consists of two independent direct ST models, one for the transcription (as in the ASR component of the cascade) and one for the translation. It hence lies on the flexibility edge of the flexibility-consistency spectrum compared to the models above.
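To make the structure of the multitask direct model more concrete, the following minimal PyTorch sketch shows a shared speech encoder feeding two task-specific decoders, one for captions and one for subtitles. It is an illustrative simplification under assumed sizes (a linear acoustic front-end, generic Transformer layers, no masking or length handling); it does not reproduce the S-Transformer configuration of Section 4.2 nor our fairseq implementation.

import torch
import torch.nn as nn

class MultitaskDirect(nn.Module):
    # Sketch of DirMu: one shared encoder over filterbank frames,
    # two separate Transformer decoders (captions / subtitles).
    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=4):
        super().__init__()
        self.frontend = nn.Linear(40, d_model)  # 40-dim log Mel features
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead), num_layers=6)
        self.cap_emb = nn.Embedding(src_vocab, d_model)
        self.sub_emb = nn.Embedding(tgt_vocab, d_model)
        self.cap_dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead), num_layers=4)
        self.sub_dec = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead), num_layers=4)
        self.cap_out = nn.Linear(d_model, src_vocab)
        self.sub_out = nn.Linear(d_model, tgt_vocab)

    def forward(self, audio, cap_prev, sub_prev):
        # audio: (frames, batch, 40); cap_prev, sub_prev: (length, batch) token ids
        memory = self.encoder(self.frontend(audio))  # shared speech representation
        captions = self.cap_out(self.cap_dec(self.cap_emb(cap_prev), memory))
        subtitles = self.sub_out(self.sub_dec(self.sub_emb(sub_prev), memory))
        return captions, subtitles  # the two cross-entropy losses are summed in training

In the Two-Stage and Triangle variants, the subtitle decoder would additionally (or exclusively) attend to the states of the caption decoder rather than only to the shared encoder output.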
3.2 Evaluation of subtitling consistency

While some metrics for evaluating transcription-translation consistency have been proposed in (Sperber et al., 2020), these do not capture the peculiarities of subtitling. The goal for consistent captions/subtitles is having the same structure (same number of subtitle blocks) and the same content (each pair of caption-subtitle blocks has the same lexical content). Consider the following example:

00:00:50,820, 00:00:53,820
To put the assumptions very clearly:
Énonçons clairement nos hypothèses : le capitalisme,

00:00:53,820, 00:00:57,820
capitalism, after 150 years, has become acceptable,
après 150 ans, est devenu acceptable, au même titre

00:00:58,820, 00:01:00,820
and so has democracy.
que la démocratie.

In the example above, three blocks appear sequentially on the screen based on timestamp information, and each of them contains one line of text in English (caption) and French (subtitle). Since the source utterance is split across the same number of blocks (3), the captions and subtitles have the same structure. However, the captions and subtitles do not have the same lexical content. The first block contains the French words le capitalisme, which appear in the second block of the English captions. Similarly, au même titre corresponds to the third block in relation to the captions. This is problematic because terms do not appear in the same blocks (e.g. capitalism), and it also leads to sub-optimal segmentation, since the French subtitles are not complete semantic units (logical completion occurs after hypothèses and acceptable).

We hence define the consistency between captions and subtitles based on two aspects: structural and lexical consistency. Structural consistency refers to the way subtitles are distributed on a video. In order to be structurally consistent, captions and subtitles for each source utterance should be split across the same number of blocks. This is a prerequisite for bilingual subtitles, since each caption-subtitle pair has the same timestamps. In other words, the captions and subtitles should appear and disappear simultaneously. Therefore, we define structural consistency as the percentage of utterances having the same number of blocks between captions and subtitles.

The second aspect of subtitling consistency is lexical consistency. Lexical consistency means that each caption-subtitle pair has the same lexical content. It is particularly important for ensuring synchrony between the content displayed in the captions and subtitles. This facilitates language learning, when terms appear in similar positions, and quality assurance, as it is easier to spot errors in parallel text. We define lexical consistency as the percentage of words in each caption-subtitle pair that are aligned to words belonging to the same block. In our example, there are six tokens of the subtitles which are not aligned to captions of the same block: le capitalisme, au même titre. For obtaining this score, we compute the number of words in each caption aligned to the corresponding subtitle and vice versa. For each caption-subtitle pair, this process results in two lexical consistency scores: Lex_caption→subtitle and Lex_subtitle→caption, where, in the former, the number of aligned words is normalised by the number of words in the caption, while, in the latter, by the number of words in the subtitle. These two quantities are then averaged into a single value (Lex_pair). The corpus-level lexical consistency is obtained by averaging the Lex_pair of all caption-subtitle pairs in the test set.
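A minimal sketch of how the two scores can be computed is given below, assuming captions and subtitles are represented as lists of blocks (each block a list of tokens) and that word-alignment links over the flattened token sequences are provided by an external aligner such as fast_align (Section 4.3). The input format is an assumption made for illustration, not our exact evaluation script.

def structural_consistency(cap_blocks_per_utt, sub_blocks_per_utt):
    # Percentage of utterances whose caption and subtitle are split
    # into the same number of blocks.
    same = sum(len(c) == len(s)
               for c, s in zip(cap_blocks_per_utt, sub_blocks_per_utt))
    return 100 * same / len(cap_blocks_per_utt)

def lex_pair(cap_blocks, sub_blocks, links):
    # Lex_pair for one utterance. `links` holds (caption_word_idx,
    # subtitle_word_idx) alignment pairs over the flattened token sequences.
    def block_ids(blocks):
        return [b for b, block in enumerate(blocks) for _ in block]
    cap_ids, sub_ids = block_ids(cap_blocks), block_ids(sub_blocks)
    same_block = [(i, j) for i, j in links if cap_ids[i] == sub_ids[j]]
    lex_c2s = len({i for i, _ in same_block}) / max(len(cap_ids), 1)
    lex_s2c = len({j for _, j in same_block}) / max(len(sub_ids), 1)
    return (lex_c2s + lex_s2c) / 2  # averaged into a single value

Averaging lex_pair over all caption-subtitle pairs of the test set gives the corpus-level score.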
4 Experimental setting

4.1 Data

For our experiments we use MuST-Cinema (Karakanta et al., 2020b), an ST corpus compiled from subtitles of TED talks. For a sound comparison with Karakanta et al. (2020a), we conduct the experiments on two language pairs, English→French and English→German. The breaks between subtitles are marked with special symbols: <eob> for breaks between blocks of subtitles and <eol> for new lines inside the same block. The training data contain 408 and 492 hours of pre-segmented audio (229K and 275K sentences) for German and French respectively. For tuning and evaluation we use the official development and test sets. We expect the captions and subtitles of TED Talks to have high consistency, since the captions serve as the basis for translating the speech into target subtitles.

The text data is segmented into sub-words with SentencePiece (Kudo and Richardson, 2018) with the unigram setting. In line with recent works in ST, we found that a small vocabulary size is beneficial for the performance of ST models. Therefore, we set a shared vocabulary of 1024 for all models except the MT component of the cascade, where the vocabulary size is set to 24k. The special break symbols are kept as single tokens. For the audio input, we use 40-dimensional log Mel filterbank speech features. The ASR encoder was pretrained on the IWSLT 2020 data, i.e. Europarl-ST (Iranzo-Sánchez et al., 2020), Librispeech (Panayotov et al., 2015), How2 (Sanabria et al., 2018), Mozilla Common Voice [4], MuST-C (Cattoni et al., 2020), and the ST TED corpus [5].

[4] https://voice.mozilla.org/
[5] http://iwslt.org/doku.php?id=offline_speech_translation
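As an illustration of this vocabulary setup, the snippet below trains a 1024-symbol unigram SentencePiece model while keeping the break markers as indivisible tokens; the file names are placeholders and the exact options used in our pipeline may differ.

import sentencepiece as spm

# Train a small unigram vocabulary on the MuST-Cinema source text,
# keeping <eob> and <eol> as single, user-defined symbols.
spm.SentencePieceTrainer.train(
    input="must_cinema_train.en",            # placeholder path
    model_prefix="mustcinema_en",
    vocab_size=1024,
    model_type="unigram",
    user_defined_symbols=["<eob>", "<eol>"],
)

sp = spm.SentencePieceProcessor(model_file="mustcinema_en.model")
print(sp.encode("To put the assumptions very clearly: <eob>", out_type=str))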
4.2 Model training

The ASR and ST models are trained using the same settings. The architecture used is S-Transformer (Di Gangi et al., 2019), an ST adaptation of Transformer, which has been shown to achieve high performance on different speech translation benchmarks. Following state-of-the-art systems (Potapczyk and Przybysz, 2020; Gaido et al., 2020), we do not add 2D self-attention. The size of the encoder is set to 11 layers, and to 4 layers for the decoder. The ASR model used to pretrain the encoder, instead, has 8 encoder and 6 decoder layers. The additional 3 encoder layers are initialised randomly, similarly to the adaptation layer proposed by Bahar et al. (2019). As distance penalty, we choose the logarithmic distance penalty. We optimise using Adam (Kingma and Ba, 2015) (betas 0.9, 0.98), 4000 warm-up steps with an initial learning rate of 0.0003, and learning rate decay with the inverse square root of the iteration. We apply label smoothing of 0.1, and dropout (Srivastava et al., 2014) is set to 0.2. We further use SpecAugment (Park et al., 2019), a technique for online data augmentation, with an augment rate of 0.5. Training is completed when the validation perplexity does not improve for 3 consecutive epochs.

The MT component is based on the Transformer architecture (big) (Vaswani et al., 2017) with similar settings to the original paper. Since the ASR component outputs punctuation, no other pre-processing (except for BPE) is applied to the training data. In order to ensure a fair comparison with the direct and joint models, the MT component is trained only on MuST-Cinema data.

All experiments are run with the fairseq toolkit (Ott et al., 2019). Training is performed on two K80 GPUs with 11 GB memory and models converged in about five days. Our implementation of the DirMu, Tri and 2ST models is publicly available at: https://github.com/mgaido91/FBK-fairseq-ST/tree/acl_2021

4.3 Evaluation

We evaluate three aspects of the automatically generated captions and subtitles: 1) quality, 2) form, and 3) consistency. For quality of transcription we compute WER on unpunctuated, lowercased output, while for quality of translation we use SacreBLEU (Post, 2018) [6]. We report scores computed at the level of utterances, where the output sentences contain subtitle breaks. A break symbol is considered as another token contributing to the score.

For evaluating the form of the subtitles, we focus on the conformity to the subtitling constraints of length and reading speed, as well as proper segmentation, as proposed in (Karakanta et al., 2019). We compute the percentage of subtitles conforming to a maximum length of 42 characters/line and a maximum reading speed of 21 characters/second [7]. The plausibility of segmentation is evaluated based on syntactic properties. Subtitle breaks should be placed in such a way that keeps syntactic and semantic units together. For example, an adjective should not be separated from the noun it describes. We consider as plausible only those breaks following punctuation marks or those between a content word (chunk) and a function word (chink). We obtain Universal Dependencies [8] PoS-tags using the Stanza toolkit (Qi et al., 2020) and calculate the percentage of break symbols falling either in the punctuation or the content-function groups as plausible segmentation.

Lastly, we evaluate structural and lexical consistency between the generated captions and corresponding subtitles, as described in Section 3.2. Word alignments are obtained using fast_align (Dyer et al., 2013) on the concatenation of the MuST-Cinema training data and the system outputs. Text is tokenised using the Moses tokeniser and the consistency percentage is computed on tokenised text.

[6] BLEU+c.mixed+#.1+s.exp+tok.13a+v.1.4.3
[7] In line with the TED Talk guidelines: https://www.ted.com/participate/translate/guidelines
[8] https://universaldependencies.org/
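The length and reading-speed checks described above can be computed as in the sketch below, where each subtitle block is given as its lines plus start/end times in seconds; this input format is an assumption, and the segmentation check based on Stanza PoS-tags is omitted for brevity.

MAX_CPL = 42  # maximum characters per line (TED guidelines)
MAX_CPS = 21  # maximum characters per second

def length_ok(lines):
    # True if every line in the block respects the 42-character limit.
    return all(len(line) <= MAX_CPL for line in lines)

def reading_speed_ok(lines, start, end):
    # True if the block can be read at no more than 21 characters per second.
    n_chars = sum(len(line) for line in lines)
    return n_chars / max(end - start, 1e-6) <= MAX_CPS

def conformity(blocks):
    # `blocks` is a list of (lines, start_sec, end_sec) tuples (assumed format).
    # Returns the percentage of blocks satisfying each constraint.
    length = sum(length_ok(lines) for lines, _, _ in blocks)
    speed = sum(reading_speed_ok(lines, s, e) for lines, s, e in blocks)
    return 100 * length / len(blocks), 100 * speed / len(blocks)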
5 Results

5.1 Transcription/Translation quality

We first examine the quality of the systems' outputs. The first two columns of Table 1 show the WER and SacreBLEU scores for the examined models.

In terms of transcription quality, DirMu (Multitask Direct – see Section 3.1) obtains the lowest WER for both languages (17.73 for French and 16.95 for German). As far as the rest of the models are concerned, there is a different tendency for French and German. Tri (Triangle) and 2ST (Two-Stage) perform on par with each other and better than Cas/DirInd for French, while Cas/DirInd have higher transcription quality than Tri and 2ST for German. An explanation for this incongruity is that these two models perform coupled inference, therefore the benefit of joint decoding for the transcription can be related to similarities in terms of vocabulary between the two languages. Since French has a higher vocabulary similarity to English, with many words in TED Talks being cognates (e.g. specialised terminology), it is possible that joint decoding favours the transcription for French but not for German.

When it comes to translation quality, Cas outperforms all other models for French with 26.9 BLEU points, while the differences are not statistically significant among DirMu, 2ST and Tri. For German, however, Cas, 2ST and Tri perform on par. The model obtaining the lowest scores is DirInd. This finding confirms our hypothesis that joint decoding, despite being more complex, improves translation quality thanks to the knowledge shared between the two tasks at decoding time.

In comparison to previous works, our transcription results are contrary to Sperber et al. (2020), who obtained the lowest WER with the cascade and direct independent models. However, for translation quality our best models are Cas, 2ST and Tri, as in previous work. Moreover, in line with Anastasopoulos and Chiang (2018), the gains for Tri are higher for translation than for transcription. Comparing the BLEU score of our DirInd models to the models in (Karakanta et al., 2020a), we found that our models achieve higher performance, with 20.07 BLEU compared to 18.76 for French and 13.55 compared to 11.92 for German.

All in all, we found that coupled inference, supported by Cas, 2ST and Tri, improves translation but not transcription quality. On the contrary, multi-tasking as in DirMu is beneficial for transcription, possibly because of a reinforcement of the speech encoder. However, could this improvement come at the expense of conformity to the subtitling constraints?

5.2 Subtitling conformity

Columns 3-5 of Table 1 show the percentage of captions/subtitles conforming to the length, reading speed and segmentation constraints discussed in Section 4.3. We observe that joint decoding does not lead to significant losses in conformity. Specifically, the captions generated by DirMu have the highest conformity in terms of length (95%), reading speed (85% and 62%) and segmentation quality (87%). Moreover, the high conformity score for DirMu correlates with its low WER, showing that quality goes hand-in-hand with conformity.

For the conformity of the target language subtitles, instead, the picture is different. Even though the differences are not large, Cas has lower conformity to length (93% and 90%) and reading speed (70% and 58%). The segmentation scores show that, despite their high translation quality, the systems featuring coupled inference (Cas, Tri and 2ST) are constrained by the structure of the captions and segment subtitles in positions which are not optimal for the target language norms (82% and 76%). DirInd, on the contrary, has higher conformity compared to the other models (94% and 92% for length, 73% and 59% for reading speed), as well as segmentation quality (84% and 78%). DirInd is left to determine the most plausible segmentation for the target language without being bound by consistency constraints from the source. The lowest segmentation quality of subtitles is achieved by DirMu (80% and 73%).

We can conclude that the quality improvements of coupled inference and multi-tasking come with a slight compromise of subtitling conformity, as a result of the loss of flexibility in decoding.
        WER    SacreBLEU  Length     Read speed  Segment.   Struc.  Lex.
En→Fr
Cas     19.69  26.9       .94 / .93  .85 / .70   .86 / .82  .98     .99
DirInd  19.69  24.0       .94 / .94  .85 / .73   .86 / .84  .75     .86
DirMu   17.73  25.2       .95 / .93  .85 / .73   .87 / .80  .77     .87
2ST     19.05  25.6       .95 / .94  .85 / .71   .87 / .82  .83     .84
Tri     18.93  25.3       .93 / .91  .85 / .72   .87 / .82  .82     .92
En→De
Cas     18.52  19.9       .94 / .90  .62 / .58   .86 / .76  .95     .96
DirInd  18.52  18.1       .94 / .92  .62 / .59   .86 / .78  .73     .86
DirMu   16.95  18.7       .95 / .92  .62 / .59   .87 / .73  .75     .82
2ST     18.93  19.6       .94 / .92  .62 / .60   .86 / .76  .82     .81
Tri     19.10  19.8       .93 / .92  .62 / .60   .87 / .76  .78     .91

Table 1: Results for quality (WER and BLEU), subtitling conformity (Length, Reading speed and Segmentation), and subtitling consistency (Structural and Lexical) for model outputs for French and German. Conformity scores are reported for captions / subtitles. Bold denotes the best score. Results whose difference from the best score is not statistically significant – according to pairwise bootstrap resampling (Koehn, 2004), p<0.05 – are reported in italics.

5.3 Subtitling consistency

The last two columns of Table 1 present the results for subtitling consistency.

In terms of Structural consistency (Struc.), the model achieving the highest scores is Cas, with 98% and 95% of the utterances being distributed along the same number of blocks. As expected, the lowest structural consistency is achieved by DirInd (75% and 73%), which determines the positions of the block symbols independently. Among the joint models, Tri outputs captions and subtitles with higher consistency than DirMu, but both are outperformed by 2ST (83% and 82%). Our hypothesis is that, by attending only to the caption decoder, 2ST behaves similarly to the cascade, and the translation decoder better replicates the block structure. We noted that the reference captions and subtitles have lower consistency (92% both for French and German) than the cascade. This shows that the cascade copies the same <eob> tokens and achieves extreme structural consistency, which is a desideratum for our case study but may be harmful in other scenarios, since it leads to lower conformity (see Section 5.2). Indeed, in scenarios where consistency is not key, subtitlers should have the flexibility to adjust subtitle segmentation to suit the needs of their target languages (Oziemblewska and Szarkowska, 2020).

The Lexical consistency (Lex.) results show that Cas is the model with the highest content overlap in parallel caption-subtitle blocks, with 99% and 96% of the words being aligned to the same block. As with structural consistency, the lexical consistency of the cascade is higher than that of the references (95% for French, 86% for German). The direct model with the highest lexical consistency is Tri (92% and 91%). Interestingly, despite its high structural consistency, 2ST does not distribute the content consistently in the parallel blocks, achieving the lowest consistency (81%). DirMu also achieves lower consistency than DirInd for German (82% compared to 86%) but not for French (87% compared to 86%). It is worth noting that lexical consistency is generally lower for German than for French. Indeed, a 100% lexical consistency between subtitles in languages with different word order is not always feasible or even appropriate. For example, the main verb in an English subordinate clause appears in the second position, while in German it appears at the end of the sentence. In order to adhere to grammatical rules, words in subtitles of different languages often undergo inter-block reordering. Therefore, the balance between flexibility and consistency is manifested here as a compromise between grammaticality and preservation of the same lexical content in each pair of subtitles.

To sum up, the results of structural consistency show that the models are able to preserve the block structure between captions and subtitles in more than 75% of the utterances. In addition, the high lexical consistency shows that the block symbols are not inserted randomly, but placed in a way that preserves the same lexical content in the parallel blocks. All in all, our results show that the evaluation of captions and subtitles is a multifaceted process that needs to be addressed from multiple aspects: quality, conformity and consistency. Missing one of the three can lead to wrong conclusions. For instance, only considering quality and consistency could lead to disregarding the importance of conformity and to considering independent solutions an obsolete technology. Secondly, among the direct architectures, the use of techniques that allow linking the generation process of captions and subtitles helps to achieve overall better quality and consistency than independent decoding, with a slight discount in conformity, especially for the target subtitles. Among DirMu, 2ST and Tri, there is no single model that outperforms all the others in all the metrics, so the choice mainly depends on the application scenario. Lastly, comparing the cascade and the direct models, the cascade seems to be the best choice, but recent advancements in direct approaches result in competitive solutions, with the increased efficiency of maintaining one model for both tasks.
6 Analysis

6.1 Evaluation of Lexical Consistency

In this section, we test the reliability of the lexical consistency metric. The metric depends on successful word alignment, which, especially for low-quality text, might be sub-optimal. We therefore manually count the number of words in the subtitles which do not appear in the corresponding captions. The task is performed on the first 347 sentences of the output of DirMu for French and German. We then estimate the mean absolute error between the consistency metric computed using the manual and the automatic alignments. As an additional step, we compute how often the automatic and the manual annotations agree in their judgement of consistent/non-consistent content in each block.

The mean absolute error between the manually and the automatically computed scores is .08 for French and .11 for German. The metric may not be able to account for very small score differences between systems; however, when inspecting the differences between manual and automatic annotation, we noticed that most errors appear in very low quality outputs or where lexical content was missing, and lead to a misalignment of only a few words. These cases were in fact challenging even for the human annotator. Instead, the agreement on the consistent/not-consistent judgement is high, with .85 for French and .75 for German. Considering the difficulty of aligning sentences belonging to languages with different word ordering, and the lower quality of the German outputs, it is not surprising that the English-German word aligner affects our metric more. However, these results show that the real impact is moderate and the metric is consistent with the human judgements in the majority of cases.
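The two reliability figures used in this analysis can be obtained from per-block scores as sketched below; the paired-list input and the threshold used for the binary consistent/not-consistent judgement are assumptions made for illustration.

def metric_reliability(manual_scores, auto_scores, threshold=1.0):
    # Mean absolute error between manually and automatically computed
    # lexical consistency scores, plus agreement on a binary judgement
    # (a block counts as consistent when its score reaches `threshold`).
    n = len(manual_scores)
    mae = sum(abs(m - a) for m, a in zip(manual_scores, auto_scores)) / n
    agreement = sum((m >= threshold) == (a >= threshold)
                    for m, a in zip(manual_scores, auto_scores)) / n
    return mae, agreement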
6.2 Does structural consistency extend to line breaks?

But what happens with the line breaks? Does a one-line caption correspond to a one-line subtitle in the output of our models? Having the same number of lines between caption and subtitle blocks is a more challenging scenario, since the subtitles tend to expand because of different length ratios between languages and translation strategies such as explicitation. For instance, for the target languages considered in this work (French and German), the length of the target subtitles when subtitling from English has been reported to be 5%-35% higher [9]. If structural consistency is enforced on line breaks, it may compromise either the quality of the translation or the conformity to the subtitling constraints. In the case of a one-line caption, important information may not be rendered in the corresponding subtitle in order to match the shorter length of the caption, or the length constraint will be violated since the longer subtitle will not be adequately segmented into two lines. In order to ensure that our models do not push structural consistency to an extreme, we compute the percentage of caption-subtitle blocks having the same number of lines.

[9] https://www.andiamo.co.uk/resources/expansion-and-contraction-factors/
http://www.aranchodoc.com/wp-content/uploads/2017/07/Text-Expansion-Contraction-Chart3.png
https://www.kwintessential.co.uk/resources/expansion-retraction

     Cas  DirInd  DirMu  2ST  Tri  Ref
Fr   67%  49%     54%    59%  57%  67%
De   66%  47%     55%    53%  51%  66%

Table 2: Percentage of subtitle blocks containing the same number of lines for French and German outputs.

Table 2 confirms that caption and subtitle blocks do not always have the same number of lines, since only 67% and 66% of blocks in the caption/subtitle references have the same number of lines. When it comes to the models, the cascade exactly matches the percentages of the references, while the direct models have an even lower percentage of blocks with an equal number of lines. Among the direct models, again DirInd shows the lowest similarity. We observed that more line breaks were present in the target subtitles, which ensures length conformity, since the target subtitles expand (source-target character ratio of 0.91 for French and 0.93 for German). Therefore, the fact that structural consistency allows for flexibility in relation to the number and position of line breaks is key to achieving high quality and conformity.
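The percentage reported in Table 2 can be computed with a check like the one sketched below, where blocks are delimited by <eob> and lines within a block by <eol>, mirroring the MuST-Cinema markup; this is an illustrative sketch, not our exact evaluation script, and it assumes the caption and subtitle already contain the same number of blocks.

def same_line_count(caption, subtitle):
    # Percentage of parallel caption-subtitle blocks whose number of
    # <eol>-separated lines is identical.
    cap_blocks = caption.split("<eob>")
    sub_blocks = subtitle.split("<eob>")
    pairs = list(zip(cap_blocks, sub_blocks))
    same = sum(len(c.split("<eol>")) == len(s.split("<eol>")) for c, s in pairs)
    return 100 * same / len(pairs)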
7 Conclusions

In this work we explored the joint generation of captions and subtitles as a way to increase efficiency and consistency in scenarios where this property is a desideratum. To this aim, we proposed metrics for evaluating subtitling consistency, tailored to the structural peculiarities of this type of translation. We found that coupled inference, either by models supporting end-to-end training (2ST, Tri) or not (Cas), leads to quality and consistency improvements, but with a slight degradation of the conformity to target subtitle constraints. The final architectural choice depends on the flexibility versus conformity requirements of the application scenario.

The findings of this work have provided initial insights related to the joint generation of captions and subtitles. One future research direction is towards improving the quality of generation by using more recent, higher-performing ST architectures. For example, Liu et al. (2020) extended the notion of the dual decoder by adding an interactive attention mechanism which allows the two decoders to exchange information and learn from each other, while synchronously generating transcription and translation. Le et al. (2020) proposed two variants of the dual decoder of Liu et al. (2020), the cross and parallel dual decoder, and experimented with multilingual ST. While neither of these works reported results on consistency, we expect that they are relevant to our scenario and have the potential of jointly generating multiple language/accessibility versions with high consistency. Moving beyond generic architectures, in the future we are planning to experiment with tailored architectures for improving consistency between automatically generated captions and subtitles. One important insight emerging from this work is that different degrees of conformity are required, or even appropriate, depending on the application scenario and the languages involved. Given these challenges, we are aiming at developing approaches which allow for tuning the output to the desired degree of conformity, whether lexical, structural or both. We hope that this work will contribute to the line of research efforts towards improving the efficiency and quality of automatically generated captions and subtitles.

References

Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 82–91, New Orleans, Louisiana. Association for Computational Linguistics.

Stephen Armstrong, Colm Caffrey, and Marian Flanagan. 2006. Translating DVD Subtitles from English-German and English-Japanese Using Example-Based Machine Translation. In MuTra 2006 – Audiovisual Translation Scenarios: Conference Proceedings.

Parnia Bahar, Tobias Bieschke, and Hermann Ney. 2019. A Comparative Study on End-to-end Speech to Text Translation. In Proceedings of the International Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 792–799, Sentosa, Singapore.

Alexandre Bérard, Olivier Pietquin, Christophe Servan, and Laurent Besacier. 2016. Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. In NIPS Workshop on End-to-end Learning for Speech and Audio Processing, Barcelona, Spain.

Roldano Cattoni, Mattia A. Di Gangi, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2020. MuST-C: A Multilingual Corpus for End-to-end Speech Translation. Computer Speech & Language Journal. Doi: https://doi.org/10.1016/j.csl.2020.101155.

Jorge Diaz Cintas and Aline Remael. 2007. Audiovisual Translation: Subtitling. Translation Practices Explained. Routledge.

Mattia Antonino Di Gangi, Matteo Negri, and Marco Turchi. 2019. Adapting Transformer to End-to-end Spoken Language Translation. In INTERSPEECH, pages 1133–1137.
Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 644–648, Atlanta, Georgia. Association for Computational Linguistics.

Thierry Etchegoyhen, Lindsay Bywood, Mark Fishel, Panayota Georgakopoulou, Jie Jiang, Gerard van Loenhout, Arantza del Pozo, Mirjam Sepesy Maučec, Anja Turner, and Martin Volk. 2014. Machine Translation for Subtitling: A Large-Scale Evaluation. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), pages 46–53.

Marco Gaido, Mattia A. Di Gangi, Matteo Negri, and Marco Turchi. 2020. End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 80–88, Online. Association for Computational Linguistics.

Boni García. 2017. Bilingual subtitles for second-language acquisition and application to engineering education as learning pills. Computer Applications in Engineering Education, 25(3):468–479.

Panayota Georgakopoulou. 2019. Template files: The Holy Grail of subtitling. Journal of Audiovisual Translation, 2(2):137–160.

Henrik Gottlieb. 2004. Subtitles and international anglification. Nordic Journal of English Studies, 3:219–230.

Nicolas Guichon and Sinead McLornan. 2008. The effects of multimodality on L2 learners: Implications for CALL resource design. System, pages 85–93.

Javier Iranzo-Sánchez, Joan Albert Silvestre-Cerdà, Javier Jorge, Nahuel Roselló, Adrià Giménez, Albert Sanchis, Jorge Civera, and Alfons Juan. 2020. Europarl-ST: A Multilingual Corpus for Speech Translation of Parliamentary Debates. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8229–8233, Barcelona, Spain.

Takatomo Kano, Sakriani Sakti, and Satoshi Nakamura. 2017. Structured-based Curriculum Learning for End-to-end English-Japanese Speech Translation. In Annual Conference of the International Speech Communication Association (InterSpeech), pages 2630–2634.

Alina Karakanta, Matteo Negri, and Marco Turchi. 2019. Are Subtitling Corpora really Subtitle-like? In Sixth Italian Conference on Computational Linguistics, CLiC-It.

Alina Karakanta, Matteo Negri, and Marco Turchi. 2020a. Is 42 the answer to everything in subtitling-oriented speech translation? In Proceedings of the 17th International Conference on Spoken Language Translation, pages 209–219, Online. Association for Computational Linguistics.

Alina Karakanta, Matteo Negri, and Marco Turchi. 2020b. MuST-Cinema: a speech-to-subtitles corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 3727–3734, Marseille, France. European Language Resources Association.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain. Association for Computational Linguistics.

Maarit Koponen, Umut Sulubacak, Kaisa Vitikainen, and Jörg Tiedemann. 2020. MT for subtitling: User evaluation of post-editing productivity. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT 2020), pages 115–124.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium. Association for Computational Linguistics.
Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, and Laurent Besacier. 2020. Dual-decoder transformer for joint automatic speech recognition and multilingual speech translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3520–3533, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Sixin Liao, Jan-Louis Kruger, and Stephen Doherty. 2020. The impact of monolingual and bilingual subtitles on visual attention, cognitive load, and comprehension. The Journal of Specialised Translation.

Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, and Chengqing Zong. 2020. Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). Association for the Advancement of Artificial Intelligence.

Evgeny Matusov, Patrick Wilken, and Yota Georgakopoulou. 2019. Customizing neural machine translation for subtitling. In Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pages 82–93, Florence, Italy. Association for Computational Linguistics.

Maite Melero, Antoni Oliver, and Toni Badia. 2006. Automatic Multilingual Subtitling in the eTITLE Project. In Proceedings of ASLIB Translating and the Computer 28.

Netflix. 2021. Subtitle template timed text style guide. https://partnerhelp.netflixstudios.com/hc/en-us/articles/219375728-English-Template-Timed-Text-Style-Guide. Last accessed: 02/05/2021.

Kristijan Nikolić. 2015. The pros and cons of using templates in subtitling. In Rocío Baños Piñero and Jorge Díaz Cintas, editors, Audiovisual Translation in a Global Context: Mapping an Ever-changing Landscape, pages 192–202. Palgrave Macmillan UK, London.

Eric Nyberg and Teruko Mitamura. 1997. A Real-Time MT System for Translating Broadcast Captions. In Proceedings of the Sixth Machine Translation Summit, pages 51–57.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.

Magdalena Oziemblewska and Agnieszka Szarkowska. 2020. The quality of templates in subtitling. A survey on current market practices and changing subtitler competences. Perspectives, 0(0):1–22.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, South Brisbane, Queensland, Australia.

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, and Quoc V. Le. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. Interspeech 2019.

Stelios Piperidis, Iason Demiros, and Prokopis Prokopidis. 2005. Infrastructure for a Multilingual Subtitle Generation System. In 9th International Symposium on Social Communication, pages 24–28.

Fred Popowich, Paul McFetridge, Davide Turcato, and Janine Toole. 2000. Machine Translation of Closed Captions. Machine Translation, pages 311–341.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.

Tomasz Potapczyk and Pawel Przybysz. 2020. SRPOL's system for the IWSLT 2020 end-to-end speech translation task. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 89–94, Online. Association for Computational Linguistics.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.

Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A Large-scale Dataset for Multimodal Language Understanding. In Proceedings of Visually Grounded Interaction and Language (ViGIL), Montréal, Canada. Neural Information Processing Society (NeurIPS).

Matthias Sperber, Hendra Setiawan, Christian Gollan, Udhyakumar Nallasamy, and Matthias Paulik. 2020. Consistent transcription and translation of speech. Transactions of the Association for Computational Linguistics, 8:695–709.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Tetyana Sydorenko. 2010. Modality of input and vocabulary acquisition. Language Learning & Technology, pages 50–73.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Martin Volk, Rico Sennrich, Christian Hardmeier, and Frida Tidström. 2010. Machine Translation of TV Subtitles for Large Scale Production. In Proceedings of the Second Joint EM+/CNGL Workshop "Bringing MT to the User: Research on Integrating MT in the Translation Industry" (JEC'10), pages 53–62, Denver, CO.

Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen. 2017. Sequence-to-Sequence Models Can Directly Translate Foreign Speech. In Proceedings of Interspeech 2017, pages 2625–2629, Stockholm, Sweden.