Between Flexibility and Consistency: Joint Generation of Captions and Subtitles

Alina Karakanta1,2, Marco Gaido1,2, Matteo Negri1, Marco Turchi1
1 Fondazione Bruno Kessler, Via Sommarive 18, Povo, Trento - Italy
2 University of Trento, Italy
{akarakanta,mgaido,negri,turchi}@fbk.eu

arXiv:2107.06246v1 [cs.CL] 13 Jul 2021

Abstract

Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing (i.e. captions). However, the joint generation of source captions and target subtitles not only brings potential output quality advantages when the two decoding processes inform each other, but is also often required in multilingual scenarios. In this work, we focus on ST models which generate captions and subtitles that are consistent in terms of structure and lexical content. We further introduce new metrics for evaluating subtitling consistency. Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles, while still allowing sufficient flexibility to produce subtitles conforming to language-specific needs and norms.

1 Introduction

New trends in media localisation call for the rapid generation of subtitles for vast amounts of audiovisual content. Speech translation, and especially direct approaches (Bérard et al., 2016; Bahar et al., 2019), have recently shown promising results with high efficiency because they do not require a transcription (manual or automatic) of the source speech, but generate the target language subtitles directly from the audio. However, obtaining the intralingual subtitles (hereafter "captions") is necessary for a range of applications, while in some settings captions need to be displayed along with the target language subtitles. Such "bilingual subtitles" are useful in multilingual online meetings, in countries with multiple official languages, or for language learners and audiences with different accessibility needs. In those cases, captions and subtitles should not only be consistent with the visual and acoustic dimension of the audiovisual material but also with each other, for example in the number of blocks (pieces of time-aligned text) they occupy, their length and segmentation. Consistency is vital for user experience, for example in order to elicit the same reaction among multilingual audiences, or to facilitate the quality assurance process in the localisation industry.

Previous work in ST for subtitling has focused on generating interlingual subtitles (Matusov et al., 2019; Karakanta et al., 2020a), a) without considering the necessity of obtaining captions consistent with the target subtitles, and b) without examining whether joint generation leads to improvements in quality. We hypothesise that knowledge sharing between the tasks of transcription and translation could lead to such improvements. Moreover, joint generation with a single system can avoid the maintenance of two different models, increase efficiency, and in turn speed up the localisation process. Lastly, if joint generation improves consistency, joint models could increase automation in subtitling applications where consistency is a desideratum.

In this work, we address these issues for the first time by jointly generating both captions and subtitles for the same audio source. We experiment with the following models: 1) Shared Direct (Weiss et al., 2017), where the speech encoder is shared between the transcription and the translation decoder; 2) Two-Stage (Kano et al., 2017), where the transcript decoder states are passed to the translation decoder; and 3) Triangle (Anastasopoulos and Chiang, 2018), which extends the two-stage model by adding a second attention mechanism to the translation decoder, attending to the encoded speech inputs. We compare these models with the established approaches in ST for subtitling: an independent direct ST model and a cascade (ASR+MT) model. Moreover, we extend the evaluation beyond the usual metrics used to assess transcription and translation quality (WER and BLEU, respectively) by also evaluating the form and consistency of the generated subtitles.

Sperber et al. (2020) introduced several lexical and surface metrics to measure the consistency of ST outputs, but they were only applied to standard, non-subtitle texts. Subtitles, however, are a particular type of text structured in blocks which accompany the action on screen. Therefore, we propose to measure their consistency by taking advantage of this structure, and introduce metrics able to reward subtitles that share similar structure and content.

Our contributions can be summarised as follows:

• We employ ST to directly generate both captions and subtitles without the need for human pre-processing (transcription, segmentation).

• We propose new, task-specific metrics to evaluate subtitling consistency, a challenging and understudied problem in subtitling evaluation.

• We show increased performance and consistency between the generated captions and subtitles compared to independent decoding, while preserving adequate conformity to subtitling constraints.

2 Background

2.1 Bilingual subtitles

New living conditions have maximised the time spent in front of screens, transforming today's mediascape into a complex puzzle of new actors, voices and workflows. Face-to-face meetings and conferences have moved online, opening up participation to global audiences. In these new settings, multilinguality and versatility are dominant, manifested in business meetings with international partners, conferences with world-wide coverage, multilingual classrooms and audiences with mixed accessibility needs. Given these growing needs for providing inclusive and equal access to audiovisual material for a multifaceted audience spectrum, efficiently obtaining high-quality captions and subtitles is becoming increasingly relevant.

Traditionally, displaying subtitles in two languages in parallel (bilingual or dual subtitles) has been common in countries with more than one official language, such as Belgium and Finland (Gottlieb, 2004). Recently, however, captions along with subtitles have been employed in other countries to attract wider audiences; for example, in Mainland China English captions are displayed along with Mandarin subtitles. Interestingly, despite doubling the amount of text that appears on the screen and the high redundancy, it has been shown that bilingual subtitles do not significantly increase users' cognitive load (Liao et al., 2020).

One group which undoubtedly benefits from the parallel presence of captions and subtitles is language learners. Captions have been found to increase learners' L2 vocabulary (Sydorenko, 2010) and improve listening comprehension (Guichon and McLornan, 2008). Subtitles in the learners' native language (L1) are an indispensable tool for comprehension and access to information, especially for beginners. In bilingual subtitles, the captions support learners in understanding the speech and acquiring terminology, while the subtitles serve as a dictionary, facilitating bilingual mapping (García, 2017). Consistency is particularly important for bilingual subtitles: terms should fall in the same block and in similar positions, and similar lengths and an equal number of lines can prevent long-distance saccades, assisting in spotting the necessary information in the two language versions. Several subtitling tools have recently allowed combining captions and subtitles on the same video (e.g. Dualsub¹), and bilingual subtitles can be obtained for YouTube videos² and TED Talks³.

Another aspect where consistency between captions and subtitles is present is subtitling templates. A subtitling template is a source-language/English version of a subtitle file, already segmented and containing timestamps, which is used to directly translate the text into target languages while preserving the same structure (Cintas and Remael, 2007; Georgakopoulou, 2019; Netflix, 2021). This process reduces the cost, turnaround times and effort needed to create a separate timed version for each language, and facilitates quality assurance since errors can be spotted across the same blocks (Nikolić, 2015). These benefits motivated our work towards simultaneously generating two language versions with the maximum consistency, where the caption file can further serve as a template for multilingual localisation. This paper is a first step towards maximising automation for the generation of high-quality multiple language/accessibility subtitle versions.

¹ https://www.dualsub.xyz/
² https://www.watch-listen-read.com/
³ https://amara.org/en/teams/ted/

2.2 MT and ST for subtitling

Subtitling has long sparked the interest of the Machine Translation (MT) community as a challenging type of translation. Most works employing MT for subtitling stem from the statistical

Sperber et al. (2020) evaluated these methods with the focus of jointly producing consistent source and target language texts. Their underlying intuition is that, since in cascade solutions the translation is derived from the transcript, cascades should achieve higher consistency than direct solutions. Their results, however, showed that triangle models achieve the highest consistency among the architectures tested (considerably better than that of cascade systems) and have competitive performance in terms of translation quality. Direct
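As a rough illustration of the first of the joint architectures discussed above, the wiring of a shared-encoder model can be sketched in PyTorch. This is not the authors' implementation: the class name `SharedDirectST`, all dimensions, and the use of vanilla Transformer layers are invented for the sketch; the only point it demonstrates is one speech encoder serving two decoders.

```python
import torch
import torch.nn as nn

class SharedDirectST(nn.Module):
    """Sketch of a shared-direct model: a single speech encoder feeds two
    decoders, one producing captions (transcription) and one producing
    subtitles (translation). All sizes are illustrative."""

    def __init__(self, vocab_size=1000, d_model=256, nhead=4, layers=2):
        super().__init__()
        self.speech_proj = nn.Linear(80, d_model)  # e.g. 80-dim filterbank frames
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.caption_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.subtitle_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), layers)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech, caption_prev, subtitle_prev):
        # Both decoders attend to the same encoded speech: the shared encoder.
        memory = self.encoder(self.speech_proj(speech))
        cap = self.caption_decoder(self.embed(caption_prev), memory)
        sub = self.subtitle_decoder(self.embed(subtitle_prev), memory)
        return self.out(cap), self.out(sub)
```

In the two-stage and triangle variants, the subtitle decoder would additionally consume the caption decoder's states (and, in the triangle, attend to the encoded speech as well); the sketch only covers the shared-encoder case.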
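The idea of exploiting block structure for consistency evaluation can likewise be illustrated with a toy score. This is a hypothetical example, not one of the metrics proposed in the paper: given caption and subtitle texts as blank-line-separated blocks, it simply measures the fraction of aligned blocks whose number of lines matches.

```python
def parse_blocks(subtitle_text):
    """Split a subtitle text into blocks (blank-line-separated groups of lines)."""
    blocks = []
    for chunk in subtitle_text.strip().split("\n\n"):
        lines = [l for l in chunk.splitlines() if l.strip()]
        if lines:
            blocks.append(lines)
    return blocks

def structural_consistency(caption_text, subtitle_text):
    """Toy structural score: fraction of position-aligned caption/subtitle
    blocks with the same number of lines, normalised by the larger block
    count so that missing or extra blocks are penalised."""
    cap, sub = parse_blocks(caption_text), parse_blocks(subtitle_text)
    if not cap or not sub:
        return 0.0
    matches = sum(len(c) == len(s) for c, s in zip(cap, sub))
    return matches / max(len(cap), len(sub))

# Identical block/line structure scores 1.0:
# structural_consistency("Hello\nworld\n\nBye", "Hallo\nWelt\n\nTschüss") -> 1.0
```

A content-aware metric would additionally compare what falls inside each aligned block (e.g. shared terms), rewarding subtitles that place the same information in the same block.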