Text and Alignment Methods for Corpora Creation

Augmenting English LibriVox Recordings with Italian Textual Translations

Giuseppe Della Corte

Uppsala University, Department of Linguistics and Philology
Master Programme in Language Technology
Master's Thesis in Language Technology, 30 ECTS credits
June 12, 2020

Supervisors: Sara Stymne, Uppsala University

Abstract

The recent uprise of end-to-end speech translation models requires a new generation of parallel corpora, composed of a large amount of source language speech utterances aligned with their target language textual translations. We hereby show a pipeline and a set of methods to collect hundreds of hours of English audio-book recordings and align them with their Italian textual translations, using exclusively public domain resources gathered semi-automatically from the web. The pipeline consists of three main areas: text collection, bilingual text alignment, and forced alignment. For the text collection task, we show how to automatically find e-book titles in a target language by using knowledge bases, web information retrieval, and named entity recognition and translation techniques. For the bilingual text alignment task, we investigated three methods: the Gale–Church algorithm in conjunction with a small-size hand-crafted bilingual dictionary, the Gale–Church algorithm in conjunction with a bigger bilingual dictionary automatically inferred through statistical machine translation, and bilingual text alignment by computing the vector similarity of multilingual embeddings of concatenations of consecutive sentences. Our findings seem to indicate that the consecutive-sentence-embeddings similarity computation approach manages to improve the alignment of difficult sentences by indirectly performing sentence re-segmentation. For the forced alignment task, we give a theoretical overview of the preferred method depending on the properties of the text to be aligned with the audio, suggesting and using a TTS-DTW (text-to-speech and dynamic time warping) based approach in our pipeline. The result of our experiments is a publicly available multi-modal corpus composed of about 130 hours of English speech aligned with its Italian textual translation and split into 60561 triplets of English audio, English transcript, and Italian textual translation. We also post-processed the corpus so as to extract 40-MFCC features from the audio segments and released them as a data-set.

Contents

1 Introduction
  1.1 Purpose
  1.2 Outline

2 Background and Related Work
  2.1 An Overview on Speech Translation
  2.2 Augmenting LibriSpeech
  2.3 Web Scraping
  2.4 The Semantic Web and Knowledge-Bases
  2.5 Bilingual Text Alignment
  2.6 Sentence Embeddings
  2.7 ASR and Forced Alignment

3 Text Collection
  3.1 Book Titles Translation
    3.1.1 Methods and Experiments
    3.1.2 Results and Discussion
  3.2 Chapter Extraction and Sentence Segmentation

4 Bilingual Sentence Alignment
  4.1 Introduction
  4.2 Hunalign
  4.3 Vecalign and LASER
  4.4 Evaluation and Discussion

5 Forced Alignment
  5.1 Word Based and Sentence Based Alignment
  5.2 Audio Editing and Aeneas
  5.3 Evaluation and Discussion

6 Signal Processing and Corpus Statistics
  6.1 Signal Processing
  6.2 Corpus Statistics

7 Conclusion

1 Introduction

High quality speech corpora are indispensable for a great variety of natural language processing tasks ranging from automatic speech recognition (ASR) and text-to-speech (TTS) production to direct end-to-end speech translation. Traditional speech corpora are usually monolingual and consist of audio utterances aligned with their textual transcription. The recent uprise of end-to-end models in the domain of speech translation demands a new kind of speech corpora, enriched with the textual translation of the audio utterances. This is simply due to the fact that end-to-end direct speech translation models require a high amount of multilingual (source and target language) audio-textual paired data aligned sentence by sentence to be used as the training dataset, as remarked by Bérard et al. (2018), Chung et al. (2019), and Jia et al. (2019).

Recent research in the domain of speech translation has therefore been focusing on cost-effective ways of gathering speech signals paired to a textual golden translation. A remarkably successful and cost-effective strategy to create a corpus for English-to-French direct speech translation was outlined in Kocabiyikoglu et al. (2018). Kocabiyikoglu et al. (2018) augmented LibriSpeech, an English speech corpus specifically designed for ASR (automatic speech recognition), with French textual translations, releasing the Augmented LibriSpeech corpus as a freely available resource for English-French speech translation. The LibriSpeech corpus was created by aligning some audio utterances from a subset of the English audio-books available through the LibriVox project to their transcription in the Gutenberg Project. The LibriVox project consists of public domain audio-book recordings read from the public domain e-books available through the Gutenberg Project itself. As it is possible to notice, both the LibriSpeech corpus and the Augmented LibriSpeech corpus have at their core the data available thanks to both the LibriVox and the Gutenberg Project.

To the best of our knowledge, there is a scarcity of publicly available English-Italian corpora for speech translation and there are limitations to their availability for commercial use. In fact, the biggest freely available corpus for English-Italian end-to-end speech translation, MuST-C (Di Gangi, Cattoni, et al., 2019), a multilingual corpus whose English-Italian sub-corpus contains 465 hours of English audio utterances extracted from TED Talks and aligned with their Italian textual translation, has been released under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 License.

To attempt to tackle this scarcity, we describe a solid pipeline to semi-automatically collect public domain audio-textual data and automatically align them. The first task is to collect the data to align, which includes the employment of named entity recognition, named entity translation, and web scraping techniques. Then we proceed to bilingual text alignment and evaluate three different approaches. The result of the bilingual text alignment task is a textual parallel corpus where each sentence in the source language corresponds to a sentence in the target language. We then force the alignment between the English output of the bilingual text alignment task and the corresponding English audio. The result is a publicly available corpus of around 130 hours of English speech aligned with its Italian textual translation, released under the CC-BY 4.0 (Creative Commons Attribution) licence.

1.1 Purpose

This master's thesis project has several purposes:

• Firstly, we want to contribute to the NLP community with the release of a freely and publicly available English speech corpus augmented with the corresponding Italian textual translation, under a permissive licence, which was chosen to be the CC-BY (Creative Commons Attribution) licence, which allows anyone to use, share, remix, and make derivative works for both research and commercial use. This is possible since the material used to create the corpus is all in the public domain.

• Secondly, we want to describe a replicable and methodological approach to cost-effective speech translation corpora creation that is based on empirical experiments and evaluation of different experiments related to the gathering and the alignment of the data:
  1. First we propose, implement and evaluate a variety of methods for automatically translating/retrieving book titles in a target language. These methods include named entity recognition through knowledge-bases and open linked data, machine translation, and web scraping information retrieval techniques.
  2. Then we perform bilingual text alignment experiments, comparing, evaluating and discussing the results of Gale-Church based approaches and a recent method that computes the vector similarity of embeddings of consecutive sentences (Thompson and Koehn, 2019). We mainly aim to find out which method manages to improve the accuracy of objectively difficult alignments, due to a different number of sentences and paragraphs between the two source texts.
  3. Then we explain the different forced alignment methods to provide the theoretical insights to choose the best approach depending on the properties of the text to be aligned with the audio. In our case, we suggest a TTS-DTW based approach and manually evaluate the results.
  4. Eventually we show how to process the audio signals so as to extract multi-dimensional mel-frequency cepstrum coefficients (MFCCs) as 2D arrays, since MFCCs are meaningful speech features to train ASR and end-to-end speech translation models (a minimal extraction sketch is given after this list).
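As an illustration of the MFCC extraction mentioned in point 4, the following is a minimal sketch using the librosa library; the 40 coefficients follow the corpus description, while the file name, sampling rate and variable names are illustrative assumptions.

```python
import librosa

# Load an audio segment (file name is a placeholder); sr=16000 resamples the signal to 16 kHz.
waveform, sample_rate = librosa.load("segment_0001.flac", sr=16000)

# Extract 40 mel-frequency cepstrum coefficients as a 2D array
# of shape (40, number_of_frames).
mfccs = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=40)

print(mfccs.shape)
```

The resulting coefficients-by-frames matrix is the kind of feature array released alongside the corpus.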

1.2 Outline

In Chapter 2, we provide an overview of the current trends in the research area of speech translation, describing the difference between traditional cascaded speech translation models and the more recent end-to-end models, and the need for a new kind of parallel data for speech translation. Then we describe in more detail the work of Kocabiyikoglu et al. (2018) and their approach to augmenting the LibriSpeech corpus to make it suitable for training speech translation models. Thereafter we give an overview of all related areas of research needed in order to collect and retrieve data from the web, align them and generate a multi-modal parallel corpus.

In Chapter 3, we describe in detail the methods and the approaches used to collect and pre-process the textual side of the data needed for the corpus creation, ranging from named entity recognition and translation to chapter extraction and sentence segmentation.

In Chapter 4, we describe the experiments and the methods used in order to perform bilingual text alignment, as well as the outcome of the different approaches, their evaluation and a discussion of the results.

In Chapter 5, we explain the difference between word based and sentence based forced alignment and we explain how we performed speech-text alignment using a TTS-DTW based approach.

In Chapter 6, we give statistics of our corpus and we explain how we processed the audio signals to extract acoustic features from them.

Finally, in Chapter 7, we discuss our methodological approach to speech corpora creation and our results, and we focus on possible directions for future work.

2 Background and Related Work

In this chapter, we start with an overview on speech translation, explaining the difference between traditional cascaded and end-to-end models, and why parallel speech-text data are needed. Then, we analyse in more detail the most related work (Kocabiyikoglu et al., 2018), and we proceed to give background knowledge on the main research areas related to the methods and the experiments needed for our approach to cost-effective speech translation corpora creation.

2.1 An Overview on Speech Translation

Since our main goal is to release a corpus suitable for training direct English to Italian speech translation models, we briefly give an overview of the current trends in the domain of speech translation, which are related to the gathering of paired data for parallel speech corpora creation as training data for end-to-end models (Di Gangi, Cattoni, et al., 2019; Jia et al., 2019; Kocabiyikoglu et al., 2018).

Traditionally, a speech translation model is composed of a cascade of ASR (Automatic Speech Recognition), MT (Machine Translation) and TTS (Text-to-Speech) components. The source language speech utterance is first passed to the ASR model, which returns a transcript of the input audio. Then, the source language transcript is given to the MT model, which performs a textual translation. Finally, the output textual translation is read by a TTS software.

Instead, a direct speech translation system, also called an end-to-end speech translation model, directly translates the source speech utterances into a text in the target language, resulting in lower latency and the elimination of the error propagation due to mismatches between the original input audio and the transcript output of the ASR model needed in traditional cascaded ST models (Di Gangi, Negri, et al., 2019; Jia et al., 2019). End-to-end speech translation models receive audio signals in a source language as input and provide a textual translation in a target language as output. They do not directly translate audio to audio, but they translate audio to text. Similarly to the cascaded model, a TTS module is still needed for the generation of the target translation audio.

Recently, a variety of end-to-end approaches to speech translation has been proposed (Anastasopoulos and Chiang, 2018; Bérard et al., 2018; Di Gangi, Negri, et al., 2019; Liu et al., 2018; Weiss et al., 2017). Particularly interesting is the latest work of which we are aware (Di Gangi, Negri, et al., 2019), which enhances a transformer architecture for the purpose of end-to-end speech translation, outperforming state-of-the-art results. In addition, it comes with a GitHub distribution of their code: FBK-Fairseq-ST1, an implementation of FAIR's fairseq (Ott et al., 2019) for speech translation that makes it possible to replicate the experiments in the paper and train end-to-end speech translation models.

What all end-to-end ST models have in common is that they require audio signals in a source language aligned with the corresponding text in the target language. This is the reason why recent research in the field of end-to-end speech translation has also focused on different ways to gather paired audio-textual data, collected from the web, as done by Kocabiyikoglu et al. (2018) and Di Gangi, Cattoni, et al. (2019), or by crafting a huge amount of synthetic training data, as in Jia et al. (2019).

1http://github.com/mattiadg/FBK-Fairseq-ST

Among the various methods to collect paired data for speech translation, the work most closely related to our thesis is found in Kocabiyikoglu et al. (2018), where the authors provide a detailed and replicable strategy to use existing web resources in order to create a speech corpus suitable for the training of end-to-end models.

2.2 Augmenting LibriSpeech

Kocabiyikoglu et al. (2018) used the book metadata released as part of the LibriSpeech (Panayotov et al., 2015) corpus as a starting point. LibriSpeech is a corpus for English ASR that consists of around 1000 hours of English audio utterances aligned with their corresponding source texts at a sentence level. It was made by aligning the English audio-book recordings from the LibriVox project to their corresponding Gutenberg Project e-book source texts. Among the available metadata, LibriSpeech provides the book titles. Kocabiyikoglu et al. (2018) performed named entity recognition and named entity translation of the book titles using DBpedia, a semantic knowledge base. Once the French titles of the English books were found, they queried the titles against an index of French e-books, and web scraped several websites so as to retrieve the French e-books.

Due to the fact that LibriSpeech does not contain the whole books, but only a small segment of a few chapters for each book, Kocabiyikoglu et al. (2018) first performed chapter extraction, then the textual alignment of the English transcription with the French version of the chapter. The authors performed the textual alignment using hunalign in conjunction with a proprietary lexicon. Finally, they realigned the LibriSpeech English audio utterances with the new English transcription, providing as a final result a speech corpus composed of English audio utterances, English transcriptions and French textual translations aligned at a sentence level. In addition, they provided a synthetic textual French translation of the LibriSpeech English audio utterances using Google Translate. The work of Kocabiyikoglu et al. (2018) was then used by Bérard et al. (2018) to train and test different end-to-end speech translation models.

The task of leveraging existing resources to create a speech corpus involves dealing with web scraping, the semantic web, named entity recognition and translation, text and speech alignment, signal processing, and speech translation. In the following sections, we briefly describe the main domains of research related to the great variety of sub-tasks we dealt with in order to assemble and test our corpus for English to Italian end-to-end speech translation.

2.3 Web Scraping

In order to find the textual translations needed for augmenting LibriSpeech, Kocabiyikoglu et al. (2018) used an already available index containing French e-books in conjunction with web scraping techniques.

Web scraping can be defined as the process of extracting data from a fetched (downloaded) web page. It mainly consists of two steps: fetching and data extraction. Fetching is the process of downloading a web page's content from a web server through a standard HTTP request. Once the page is fetched, data extraction is usually performed by parsing the web page and searching for the desired data.

The concept of web scraping is related to the need of retrieving useful information from the world wide web. A variety of different techniques and methods to search and query the world wide web is explained in detail in Lawrence and Giles (1998). Among the various techniques, web scraping is still one of the most important ones to

retrieve information from unstructured websites, and a great variety of companies are improving and customising web scraping to suit their business needs. Google has released patents to improve web scraping for information retrieval, as in Salerno and Boulware (2006). Other researchers have been using web scraping for literature analysis, as in Haddaway (2015), web advertising, as in Vargiu and Urru (2013), job search engines, as in Slamet et al. (2018), or business insight for the housing market, as in Boeing and Waddell (2017).

Our main interest is to use the potential of web scraping for text collection and corpus creation. The very idea of using web scraping consistently for building linguistic corpora was discussed by Adhikary and Ahmed (2016), who introduced CorpoMate, a framework for collecting text documents from the web, pre-processing them and exporting them as data stored in several possible data formats. The important takeaway from Adhikary and Ahmed (2016) is to automate linguistic corpora creation by scraping interesting websites, in order to avoid the cost of producing the linguistic material needed for the corpus creation.

In the field of speech corpora creation, web scraping was used by Kocabiyikoglu et al. (2018), Di Gangi, Cattoni, et al. (2019) and Panayotov et al. (2015) alike as the main technique to obtain the starting data. Di Gangi, Cattoni, et al. (2019) web scraped the TED talks website in order to have the starting audio-textual material to create MuST-C, while Panayotov et al. (2015) web scraped the LibriVox and the Gutenberg Project websites. The augmentation of LibriSpeech for speech translation performed by Kocabiyikoglu et al. (2018) also relied on web scraping to retrieve the French e-books. Our approach follows the work of Panayotov et al. (2015) to retrieve the English side of the starting data, and the work of Kocabiyikoglu et al. (2018) to retrieve the Italian side of the data. Panayotov et al. (2015) gathered the data for the creation of the LibriSpeech corpus by web scraping the LibriVox and the Gutenberg Project websites, while Kocabiyikoglu et al. (2018) used web scraping techniques to collect the public domain French e-books needed to augment the LibriSpeech corpus for the purpose of English-to-French end-to-end speech translation corpora creation.
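To make the fetching and data extraction steps described above concrete, here is a minimal sketch using the requests and BeautifulSoup libraries; the URL and the extracted elements are placeholders, not the actual e-book sites scraped in this work.

```python
import requests
from bs4 import BeautifulSoup

# Step 1: fetching -- download the page content with a standard HTTP request.
# The URL is a placeholder, not one of the websites used in this thesis.
response = requests.get("https://example.org/catalogue.html", timeout=30)
response.raise_for_status()

# Step 2: data extraction -- parse the fetched HTML and search for the desired data,
# here the text and target of every hyperlink on the page.
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.find_all("a", href=True):
    print(link.get_text(strip=True), "->", link["href"])
```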

2.4 The Semantic Web and Knowledge-Bases

We are especially interested in Linked Data and Knowledge Bases to perform named entity recognition and translation of book titles, following the approach of Kocabiyikoglu et al. (2018). In order to retrieve e-books in a target language corresponding to the English LibriVox audio-books, the first step consists in the translation of the audio-book title into the target language. Since books are often published in other languages under titles that avoid literal translation, named entity recognition and the use of the semantic web are the technologies that have the potential to allow automatic translation of book titles. Therefore we introduce here the key concepts in the area of semantic web information retrieval.

One of the main problems of the world wide web is that, by nature, it consists of a huge amount of unstructured data stored in websites. Therefore the father of the world wide web himself embraced the idea of the "semantic web" in Berners-Lee et al. (2001). Although in the early 2000s different approaches, methods and techniques to extract and annotate structured data were proposed, as in Kiryakov et al. (2004) and Arasu and Garcia-Molina (2003), it was only with Auer et al. (2007) that the foundation of a cooperative project for the semantic web was introduced. In fact, in 2007, Auer et al. (2007) released the first version of DBpedia and applied the concept of "open data" to a large-scale and web-based semantic ontology. An

ontology can be defined as a set of objects with semantic properties, belonging to classes in a well-defined taxonomy. Ontologies like DBpedia are also known as knowledge-bases. One of the most recent, open and user-friendly knowledge bases is WikiData, funded, updated and maintained thanks to the help of the WikiMedia foundation and introduced in Vrandečić and Krötzsch (2014). The concept of linked open data and ontology alignment, discussed for instance by Jain et al. (2010) and applied to WikiData in Erxleben et al. (2014), and the release of markup languages and schemas to easily structure data - such as the JSON schema introduced by Pezoa et al. (2016) - were the steps towards the still in progress realization of the semantic web.

Soon enough, knowledge bases and linked data were also used in the field of automatic machine translation. For instance, both Srivastava et al. (2017) and Hkiri et al. (2017) improved the accuracy of machine translation systems by performing named entity recognition thanks to the use of semantic technologies and structured data.

Imagine feeding a statistical or neural English to Italian machine translation system with the input string "Dead Poets Society". Both systems would attempt to translate the sentence using the parallel data on which the models were trained. Good machine translation systems would probably translate "Dead Poets Society" into "La società dei poeti morti" (gloss: Dead Poets Society) or a similar string. However, "Dead Poets Society" was actually translated into Italian as "L'attimo fuggente" (gloss: "the fleeting moment").

The way in which named entity recognition improves machine translation performance is straightforward: if a string in the source language is recognised as an entity in a knowledge base, its corresponding string in the target language is retrieved. We follow this approach to perform book title translation, as also done by Kocabiyikoglu et al. (2018).

2.5 Bilingual Text Alignment

Bilingual text alignment can be defined as the process or technique of placing a text in a source language alongside its translation in a target language. It is crucial both for statistical machine translation systems and for the creation of parallel corpora, and there are several algorithms and strategies to perform it.

One of the first algorithms and approaches to perform bilingual text alignment at a sentence level was proposed in Kay and Röscheisen (1993). Their approach to sentence alignment was based on several iterations finding the maximum likelihood alignment between the sentences, using mainly the partial word alignment inferred through the distribution of the terms in the original text. In the same year, Gale and Church (1993) introduced one of the most important algorithms for bilingual sentence alignment, the Gale-Church alignment algorithm. It uses sentence length information with the assumption that sentences with a similar length in a language X correspond to sentences of a similar length in a language Y. The Gale-Church alignment algorithm does not take into account the structural differences between non-similar languages, such as English and Japanese. A study on the sentence alignment between Japanese and English using both a bilingual dictionary and a relaxation process iterating several times to find the maximum likelihood alignment is found in Haruno and Yamazaki (1997).

The current state-of-the-art and out-of-the-box tool for text alignment is hunalign, first introduced by Varga et al. (2007) and then used for the creation of the OPUS parallel corpora, as explained in Tiedemann (2012). Hunalign makes use of both

word alignment and maximum likelihood computation alongside the Gale-Church algorithm. In that way, both word distribution and sentence length information are used to compute the maximum likelihood sentence alignment.

A completely different approach to sentence alignment is proposed by Thompson and Koehn (2019) and their out-of-the-box tool Vecalign. Instead of focusing on computing the similarity between sentences in two languages using sentence length information (as in the Gale-Church algorithm) with the aid of a bilingual dictionary and several iterations, Thompson and Koehn (2019) suggest computing the similarity of the sentences by using their embeddings. Vecalign relies on the assumption that, by mapping two sentences in two different languages to a vector space, the two sentences would have an almost identical vector representation. Therefore, instead of processing plain text sentences, Vecalign takes multilingual sentence embeddings and their concatenations as input. Then, the sentence embeddings similarity is computed using an implementation of the FastDTW algorithm from Salvador and Chan (2007).

The FastDTW algorithm is an approximation of the Dynamic Time Warping (DTW) algorithm described in Sakoe and Chiba (1978). The algorithm proposed by Sakoe and Chiba (1978) can be summarized as a method to compute an optimal match between two time series, or linear sequences, given a set of restrictions. The best match is then considered to be the one that satisfies all restrictions with the minimal computational cost. The FastDTW algorithm is an approximation of the DTW with linear space and time complexity, whereas the DTW algorithm has quadratic time and space complexity.

It is important to remark that Vecalign does not compute multilingual sentence embeddings itself. Thompson and Koehn (2019) focused their effort on the multilingual sentence embeddings similarity computation task only. In our experiments we compare the classical Gale-Church based bilingual text alignment with the new sentence-embeddings and FastDTW based approach proposed by Thompson and Koehn (2019).
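To make the role of dynamic time warping in this comparison more tangible, the following is a minimal textbook-style DTW sketch between two sequences of vectors, exhibiting the quadratic time and space behaviour that FastDTW approximates away; it is a generic illustration, not the implementation used inside Vecalign.

```python
import numpy as np

def dtw_cost(seq_a: np.ndarray, seq_b: np.ndarray) -> float:
    """Classic dynamic time warping between two sequences of vectors.

    Builds the full (len(seq_a) x len(seq_b)) cost matrix, which is why the
    standard algorithm has quadratic time and space complexity.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # Each cell extends the cheapest of the three allowed predecessor paths.
            cost[i, j] = distance + min(cost[i - 1, j],      # insertion
                                        cost[i, j - 1],      # deletion
                                        cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# Two toy sequences of 2-dimensional vectors with different lengths.
a = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
b = np.array([[0.0, 0.1], [0.9, 1.0], [1.5, 1.4], [2.1, 2.0]])
print(dtw_cost(a, b))
```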

2.6 Sentence Embeddings

Since sentence embeddings are a prerequisite for performing bilingual text alignment with vecalign, we have to map the textual side of our starting data to vectors of real numbers (sentence embeddings). Sentence embeddings can be defined as the vector representation of a set of sentences in a vector space. There are different methods to embed sentences into their vector representation, and usually word embeddings are included in the process.

Word embeddings can be defined as the vector representation of single words. While originally implemented in order to make it possible for computers to process words as mathematical units, several works showed that word embeddings are capable of capturing semantic information (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014). The fact that word embeddings seem to capture semantic information led Fu et al. (2014) to develop a method for learning semantic hierarchies via word embeddings. Furthermore, word embeddings have been used for studies concerning the semantic change of words from a diachronic perspective (Hamilton et al., 2016) and the detection of semantic relations (Gladkova et al., 2016). This is due to the fact that linguistic regularities are found in continuous space word representations (Mikolov et al., 2013). This linguistic information is extremely useful while working with multiple languages, since the vector representations of the words in different languages share a similar position in the vector space. For that reason, word

embeddings have been used in the field of machine translation to improve phrase-based models (Zou et al., 2013) or even to infer bilingual lexicons without the need of parallel data (Conneau et al., 2017).

In the same way in which linguistic regularities are found in continuous space word representations, it seems that continuous space sentence representations also capture linguistic regularities (Arora et al., 2016). Therefore, in the same way in which words with similar meanings have their vector representations mapped in the same area of the latent space, sentences with a similar meaning are found as neighbor vectors in the vector space.

A great variety of methods for mapping sentences to vector representations has been proposed in recent years (Arora et al., 2016; Artetxe and Schwenk, 2019; Le and Mikolov, 2014; Tai et al., 2015; Wieting et al., 2015; Yang et al., 2019). Krasnowska-Kieraś and Wróblewska (2019) carried out an empirical study on sentence embeddings and the retention of linguistic information, investigating several systems. The one that was found to retain linguistic information most exactly was LASER (Artetxe and Schwenk, 2019), language agnostic sentence embeddings representations provided by a BiLSTM model trained on a huge set of multilingual data.2 LASER (Language Agnostic Sentence Embeddings Representations) is publicly available on GitHub and allows the creation of multilingual sentence embeddings in more than 100 languages. It has been successfully used for a variety of tasks including cross-lingual document classification (Artetxe and Schwenk, 2019) and the bilingual mining of parallel sentences (Schwenk et al., 2019). Thompson and Koehn (2019) used LASER to map sentences to their vector representations and we follow their approach for the sentence embeddings generation task.
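The intuition behind using multilingual sentence embeddings for alignment can be sketched as follows; the vectors are random placeholders standing in for LASER embeddings, since the point is only that a translation pair is expected to have a high cosine similarity in the shared space.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two sentence embeddings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder vectors standing in for LASER embeddings (1024 dimensions);
# in the real pipeline they would come from the LASER encoder, not random numbers.
rng = np.random.default_rng(0)
english_sentence = rng.normal(size=1024)
# A translation is expected to land close to its source in the shared space,
# which we simulate here by adding small noise to the same vector.
italian_translation = english_sentence + rng.normal(scale=0.05, size=1024)
unrelated_sentence = rng.normal(size=1024)

print(cosine_similarity(english_sentence, italian_translation))  # close to 1
print(cosine_similarity(english_sentence, unrelated_sentence))   # close to 0
```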

2.7 ASR and Forced Alignment

So far we have described the background work related to the text collection and the bilingual text alignment tasks. However, creating a speech corpus from data retrieved from the web also requires automatically processing natural language speech segments. In particular, we focus on the speech-to-text alignment task, also called forced alignment.

Forced alignment is sometimes considered a kind of ASR. Automatic speech recognition systems are trained to process speech utterances so as to automatically recognize the words spoken. ASR systems receive only speech signals as input. Instead, speech aligners receive as input both a speech utterance and a transcription. Given the two inputs, speech-to-text aligners identify the time segments in the speech that match the transcription.

Juang and Rabiner (2005), O'Shaughnessy (2008) and Furui (2010) all provide a brief yet exhaustive history of the development of Automatic Speech Recognition. According to Furui (2010), the first attempts to develop automatic speech recognition systems are found in the 1950s and 1960s. At that time, the main problem when dealing with automatic speech recognition was heavily related to the time normalization problem. Automatic Speech Recognition in fact requires the ability to process time series, data points indexed in time.

The development of ASR and speech alignment systems is therefore linked to the works of Vintsyuk (1968) and Sakoe and Chiba (1978). Their dynamic time warping (DTW) algorithms made it possible to process speech series while solving the time normalization problem. Speakers talk at different speeds, and without time normalization automatic speech recognition would not work.

2https://github.com/facebookresearch/LASER

Another great improvement of automatic speech recognition and speech alignment systems happened in the 1980s, thanks to the development of Hidden Markov Models (Poritz, 1988) and the rise of statistically based approaches to solving natural language processing tasks. HMM algorithms were soon applied to both speech recognition (Rabiner, 1989) and speech-to-text alignment (Talkin and Wightman, 1994). To the best of our knowledge, most of the currently available state-of-the-art aligners are based on hidden Markov models and more precisely on the HTK toolkit (S. J. Young and S. Young, 1993), which cannot be used freely for commercial purposes. An exhaustive list of available aligners is available as a collection of links in a GitHub folder maintained by Alberto Pettarin3.

Although HMM-based aligners seem to still be the preferred method to perform speech-to-text alignment, recently we have seen the rise of RNN (recurrent neural network) and DNN (deep neural network) based forced aligners, thanks to speech recognition toolkits such as Kaldi (Povey et al., 2011) and CMU Sphinx (Lamere et al., 2003).

It is however important to state that the term "forced alignment" is used to describe alignment at different granularity levels, and different approaches can be used depending on the granularity of the desired alignment. For instance, toolkits such as the Montreal Forced Aligner (McAuliffe et al., 2017) and Gentle4 provide a word level alignment of the text in the transcript, while a forced aligner toolkit such as Aeneas5 was developed to provide quick and efficient sentence-level granularity alignment.

The Gentle toolkit provides word-level granularity alignment of a transcript given an English audio. The way in which Gentle works is straightforward: a Kaldi (Povey et al., 2011) acoustic model trained on the Fisher English corpus (Cieri et al., 2004) is used in combination with a language model built from the words in the input transcript. Then the words are matched with their audio sequences. The Montreal Forced Aligner (McAuliffe et al., 2017) also works at a word-level alignment granularity. It allows the training of custom acoustic models or the use of pre-trained models. The training configuration models range from the default Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) to an i-vector DNN model.

Finally, tools like the Aeneas toolkit and SyncTS (Damm et al., 2011) work at a different alignment granularity. They can be considered as TTS (text-to-speech) and DTW (dynamic time warping) based approaches. They do not require a training phase since they do not need any acoustic model. The transcript is read by a TTS software, then the TTS-generated audio and the input audio are transformed into sequences, and variations of the DTW algorithm are used to map the elements of the two sequences, resulting in a synchronization map.
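The TTS-DTW idea can be sketched as follows, assuming the transcript has already been synthesized to an audio file by some TTS engine; the file names are placeholders and the sketch only illustrates the principle, not the actual Aeneas implementation.

```python
import librosa

# Placeholder file names: the original recording and a TTS rendering of its transcript.
original, sr = librosa.load("chapter_recording.wav", sr=16000)
synthetic, _ = librosa.load("tts_of_transcript.wav", sr=16000)

# Turn both signals into comparable feature sequences (here MFCC frames).
mfcc_original = librosa.feature.mfcc(y=original, sr=sr, n_mfcc=13)
mfcc_synthetic = librosa.feature.mfcc(y=synthetic, sr=sr, n_mfcc=13)

# DTW maps frames of the synthetic audio onto frames of the original recording.
# Since sentence boundaries are known in the synthetic audio (the TTS engine reads
# one sentence at a time), the warping path can transfer them to the real recording.
cost_matrix, warping_path = librosa.sequence.dtw(X=mfcc_synthetic, Y=mfcc_original)
print(warping_path.shape)  # (number_of_path_points, 2) frame index pairs
```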

3https://github.com/pettarin/forced-alignment-tools 4https://github.com/lowerquality/gentle 5https://github.com/readbeyond/aeneas

3 Text Collection

In this chapter, we explain the main obstacles to bilingual text collection and different possible approaches to overcome them. We then proceed to discuss the results of the proposed methods.

3.1 Book Titles Translation

As a starting point, we used part of the metadata of the LibriSpeech corpus (Panayotov et al., 2015). Among the available metadata, LibriSpeech provides a table including the Gutenberg e-book ID and the matching LibriVox project ID. Both the Gutenberg Project e-books and the LibriVox audio-books are public domain resources, therefore Panayotov et al. (2015) aligned the audio and the transcription from segments of selected chapters from around 1500 pairs of English e-books and audio-books. We used the LibriSpeech metadata to identify those Gutenberg Project books that have been recorded for the LibriVox project. This was the first step towards the collection of triplets of English audio, English transcription, and Italian textual translation, as it allows us to quickly retrieve public domain English audio-books with their corresponding transcriptions (textual e-books). We then had to perform book title translation in order to go further in the corpus creation.

The main problem of translating book titles is that a literal translation is often useless. In fact, the same book can have different titles depending on the language of publication, which cannot be guessed by a simple literal translation (e.g. The Old Curiosity Shop, published in Italian as La Bottega dell'Antiquario - gloss: The Antique Dealer Shop). In addition, even when the English and the Italian book titles could be considered literal translations of each other, there is always room for different possible translations, especially in regard to prepositions and particles. For instance, the book In the Days of the Comet has been published in Italian as Nei giorni della cometa, a quite natural and obvious literal translation. However, if we use Google Translate to find the Italian translation of the English book title, the system translates In the Days of the Comet into Italian as Ai giorni della cometa. Both "ai" and "nei" are Italian prepositions that are used to express a position in time or space, and the choice between them is highly ambiguous, relying on subtle nuances involving context and collocations. The intrinsic ambiguity of natural language and the possible variations in the literal translation itself might be problematic, since we need to retrieve the actual title of publication of a given book. Furthermore, another difficulty is that books often have alternative titles for each language of publication, sometimes including subtitles, sometimes excluding them (e.g. Typee: A Peep at Polynesian Life versus Typee).

Therefore, the task of translating book titles is not as easy as one may think. We used three different approaches to perform this task: automatic translation, web scraping of Wikipedia, and the use of the WikiData ontology. Since we do not have a gold standard set of English-Italian published titles, our evaluation for this task is performed indirectly.

Gutenberg Title | Alternative Titles
Tom Swift and His Sky Racer, or, the Quickest Flight on Record | 1. Gutenberg Title; 2. Tom Swift and His Sky Racer; 3. The Quickest Flight on Record
Democracy and Education: an introduction to the philosophy of education | 1. Gutenberg Title; 2. Democracy and Education; 3. an introduction to the philosophy of education
History of the Decline and Fall of the Roman Empire Volume 1 | 1. Gutenberg Title; 2. History of the Decline and Fall of the Roman Empire

Table 3.1: In the first column the original title of publication of the book in the Gutenberg Project as found in the LibriSpeech metadata; in the second column the list of the alternative titles found after preprocessing, where Gutenberg Title stands for the original title

We created an index of around 3400 titles of publicly available public domain Italian e-books and their links to the download page by scraping websites containing public domain Italian e-books, mainly in the odt and txt formats. In most countries in the world a literary work enters the public domain 70 years after the death of the author. The same is valid for translations of literary works, since the translation of a literary work is a literary work in itself.

The indirect evaluation of our methods to perform book title translation is done by comparing each method's list of translated Italian titles with the index we created. The index we created contains actual titles of publication of public domain e-books. Therefore we use it as a gold standard file. For each method there is a list of translated/retrieved titles. The items of each method's list are compared against the items in the index/gold standard list. The more matching titles between a method's list and the index list, the more precise and accurate the method is considered to be.

3.1.1 Methods and Experiments

Before applying the three different methods (automatic translation, web scraping of Wikipedia, and the use of the WikiData ontology) to perform book title translation, we preprocessed the list of English titles contained in the LibriSpeech metadata so as to increase the number of alternative English titles. The LibriSpeech metadata contain only the title of publication of each book as published on the corresponding Gutenberg Project page. Due to the fact that each book can have a variety of titles of actual publication, preprocessing is needed to find a greater variety of possible alternative titles. In order to do so, we manually investigated the LibriSpeech metadata to find patterns that indicate possible alternative titles. Then we wrote a Python script using regular expressions to actually perform title preprocessing. An example of preprocessing is given in Table 3.1, where, for example, the string ", or" is used to infer two possible titles, while the character ":" is used to separate the title from the subtitle. Preprocessing can be seen as a list augmentation technique: the original Gutenberg title is not removed from the list of titles, but patterns that indicate the presence of alternative titles, subtitles, or publication-specific information are used to infer a greater variety of possible English titles.

We then applied four approaches to book title translation. We called the four approaches Automatic Translation, Web Scraping (WikiMedia), Gutenberg Id Project in Wikidata and Literary Work in Wikidata.
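A minimal version of the title preprocessing described above might look as follows; only the ", or" and ":" patterns mentioned in the text are reproduced, not the full set of regular expressions of the original script.

```python
import re

def alternative_titles(gutenberg_title: str) -> list[str]:
    """Augment a Gutenberg title with alternative titles inferred from its punctuation."""
    titles = [gutenberg_title]
    # ", or," (and variants) usually separates two alternative titles.
    parts = re.split(r",\s*or,?\s+", gutenberg_title, flags=re.IGNORECASE)
    if len(parts) == 2:
        titles.extend(part.strip() for part in parts)
    # ":" usually separates the title from its subtitle.
    if ":" in gutenberg_title:
        main_title, subtitle = gutenberg_title.split(":", 1)
        titles.extend([main_title.strip(), subtitle.strip()])
    return titles

print(alternative_titles("Tom Swift and His Sky Racer, or, the Quickest Flight on Record"))
print(alternative_titles("Democracy and Education: an introduction to the philosophy of education"))
```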

Method | Number of Books Found in the Italian Index | Total Number of E-Books Found
Automatic Translation | 17 |
Gutenberg Id Project Ebooks in Wikidata | 18 | 39
Literary Work in Wikidata | 23 |
Web Scraping (WikiMedia) | 21 |

Table 3.2: The first and the second columns show respectively the method used and the number of e-books found in the index containing the 3400 Italian e-books in the public domain. The third column contains the total number of e-books found, since a few methods retrieved a set of titles that was a partial subset of the Italian translations retrieved by other methods

For the automatic translation method (Automatic Translation) we used the automatic translation system Google Translate. We simply gave the list of English titles as input to the machine translation system and compared the Italian automatic translations with the index of Italian books.

The second method we used (Web Scraping (WikiMedia)) relies on the web scraping of Wikipedia through the WikiMedia endpoint. We queried the list of English titles against all English Wikipedia pages and wrote to a file each English title for which a corresponding Italian page was found, together with the title of the Italian page. The problem with this approach is that a good amount of the retrieved possible Italian titles refer to derivative works such as movies and music, and not only to books.

The third and fourth approaches we tried were based on the idea of the semantic web and were carried out by using the Wikidata Query Service and the query language SPARQL (Hernández et al., 2016). WikiData is a queryable knowledge-base containing structured data from the other WikiMedia foundation projects (such as Wikipedia). The Wikidata Query Service allows querying the knowledge-base by writing SPARQL (SPARQL Protocol and RDF Query Language) queries. WikiData is structured as a series of objects belonging to classes and subclasses in a rich semantic taxonomy. Each object also has a set of properties. We used two different approaches to query WikiData in order to perform book title translation.

The first one (Gutenberg Id Project Ebooks in Wikidata) was to query all WikiData objects belonging to the class literary work with the property Project Gutenberg ID, together with their Italian label (the Italian book title) and, if available, the set of Italian alternative labels. Basically, we asked the knowledge-base to give us all possible Italian titles of English literary works published in the Gutenberg Project. Then, if the Project Gutenberg ebook ID matched one of the Project Gutenberg ebook IDs in the LibriSpeech metadata, we saved the set of possible Italian translations for that book (a sketch of this kind of query is given below).

Since only a limited set of WikiData objects have been annotated with the Project Gutenberg ebook ID property, the second approach to query WikiData (Literary Work in Wikidata) for book title translation was to retrieve all labels and, if available, the alternative labels in both English and Italian for all literary works. Then the retrieved list of English titles was compared with the pre-processed list of LibriSpeech/Gutenberg English titles. If a WikiData literary work label (or alternative label) matched one of the titles in the pre-processed LibriSpeech English titles, the corresponding Italian book titles were saved in a list to be compared with the index we created.
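The first WikiData approach can be approximated with a SPARQL query sent to the Wikidata Query Service from Python, as in the following sketch; the property and class identifiers used here (P31 for instance of, Q7725634 for literary work, P2034 for Project Gutenberg ebook ID) are assumptions that should be verified against the live ontology, and the query is a simplified illustration rather than the exact one used in our experiments.

```python
import requests

# Retrieve Italian labels of literary works annotated with a Project Gutenberg ebook ID.
QUERY = """
SELECT ?work ?gutenbergId ?italianLabel WHERE {
  ?work wdt:P31 wd:Q7725634 ;        # assumed: instance of literary work
        wdt:P2034 ?gutenbergId ;     # assumed: Project Gutenberg ebook ID
        rdfs:label ?italianLabel .
  FILTER(LANG(?italianLabel) = "it")
}
LIMIT 100
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "book-title-translation-example/0.1"},
    timeout=60,
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["gutenbergId"]["value"], "->", row["italianLabel"]["value"])
```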

3.1.2 Results and Discussion

Table 3.2 shows the results of the English book title translation task and the number of public domain Italian e-books retrieved from the web. However, a few important considerations have to be made: we are not evaluating which method is the best for a precise and accurate translation of book titles. We already know that the approach based on knowledge-bases (querying WikiData) is the most accurate one, since it is the only method that avoids the retrieval or the generation of wrong translations. Automatic translation can perfectly translate a book title, but that book might have never been published with a literal translation. In a similar way, web scraping of non-structured data necessarily leads to the retrieval of pages and data that have nothing to do with books and literary works. On the opposite, the semantic web based approach by definition gives us only actually published literary works in the target language with actual publication titles.

What we investigated is whether a single method can be used to find the actual title of publication on websites of public domain books. And the answer seems to be negative. There are several reasons for that: first of all, websites may publish a book with a title slightly different from the actual publication title. Secondly, even large crowd-annotated knowledge-bases such as WikiData are still quite new technologies and many literary works might not have been added to the knowledge base yet. In addition, since WikiData is built around the available data from the projects of the WikiMedia foundation, it makes sense that similar results are found. Overall, the more possible translations, the better. Therefore, using a combination of different methods seems to be one of the most pragmatic approaches for retrieving human-made textual translations of e-books from the internet.

Finally, the relatively small number of books found is due to the limited number of Italian e-books in the index (around 3400). Starting from 1500 English books we found a great variety of possible Italian translations and compared them with a relatively small list of Italian books publicly available on the internet. When comparing the lists of titles against the index, all characters were lowercased, punctuation and tabs were removed, and leading and trailing spaces were stripped out of the strings, so as to avoid mismatches due to character differences. Overall, we are satisfied with the number of books found, since we also limited the index of public domain Italian e-books to txt and odt files only.

3.2 Chapter Extraction and Sentence Segmentation

Since both the LibriVox and the LibriSpeech English audio files of the books follow the Gutenberg Project e-books' division into chapters, we need to extract chapters from each book so as to obtain triplets of English utterance, English transcription, and Italian textual translation. However, the textual translations of the e-books can sometimes differ in structure and content from the English e-books.

Furthermore, it can happen that a Gutenberg Project e-book actually contains two books in one file. This is for instance the case of the e-book From the Earth to the Moon; and, Round the Moon. The two books by Jules Verne are published in the same file, while two different files, one per book, were retrieved as Italian textual translations. The same can happen in the opposite direction. For example, the Gutenberg Project e-book Candide contains thirty chapters only, which are actually the chapters corresponding to the first part of the book only. The retrieved Italian e-book contains the second part of the book as well, resulting in eighteen additional chapters compared to the Gutenberg English version.

We performed the chapter extraction task with a semi-automatic approach. The Gutenberg Project txt files were split into chapters using chapterize1, a tool developed by Jonathan Reeve and published on GitHub under the GNU General Public License v3.0. It automatically splits English Gutenberg Project txt files into chapters, using a set of regular expressions to recognize chapter headings, and writes to a new folder a series of txt files corresponding to each chapter, named with increasing integers from 01.txt (first chapter) to the last chapter found. We used the same approach to extract the Italian chapters, using regular expressions to find the chapter headings and writing to a new folder a txt file for each chapter. The Italian e-books in the odt format were first converted to the txt format. In order to maintain the English-Italian chapter alignment, we decided to name each book folder with the English Gutenberg e-book ID, even for Italian e-books, so that the same numerical ID was used to identify the folder containing the extracted chapters.

Then we manually checked the txt chapter files in the folders corresponding to the English and the Italian e-books to see if the chapter file alignment was correct, that is to say that the same number of chapters was found in both the English and the Italian e-book folder. In case of severe differences between the number of files in the two folders (such as in the already mentioned cases of From the Earth to the Moon; and, Round the Moon and Candide), the cause of the discrepancy was manually investigated and the chapter order was always chosen to follow the one in the Gutenberg Project e-book, since the LibriVox audio files follow the Gutenberg Project e-book structure. In case of mild differences between the number of files, more manual investigation of the cause of the discrepancy was needed, since we had to check the first and the last line of each pair of parallel chapters (both the English and the Italian txt files) to find the cause of the discrepancy and fix it.

We focused our efforts on a pool of nine randomly selected books, resulting in 372 pairs of parallel chapters (English and Italian txt files). Using the whole set of 39 books would in fact result in more than 500 hours of audio material that would eventually have to be aligned with the textual side of the collected data. We preferred to prioritize method comparison and evaluation, in order to find a methodological approach to speech corpora creation backed up by meaningful experiments. In order to do so, we limited our experiments to 372 pairs of parallel chapters extracted from 9 books, resulting in around 130 hours of audio material to be aligned with the textual side of the data.

We processed all selected chapter text files so as to eliminate formatting and have them contain one sentence per line. Sentence segmentation was performed using spaCy (Honnibal and Montani, 2017): as a first step, we stripped leading and trailing spaces and removed newlines and tabs. Then, two pre-trained spaCy models (en_core_web_sm for English, it_core_news_sm for Italian) were used for the sentence segmentation task. The process was performed for each pair of parallel chapters (English and Italian).
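A minimal version of the spaCy-based sentence segmentation step could look as follows; the model names are the ones given above, while the helper function and the example text are illustrative.

```python
import spacy

# The two pre-trained models named in the text; they must be downloaded first,
# e.g. with "python -m spacy download en_core_web_sm".
nlp_en = spacy.load("en_core_web_sm")
nlp_it = spacy.load("it_core_news_sm")

def one_sentence_per_line(raw_text: str, nlp) -> str:
    """Strip formatting and return the text with one sentence per line."""
    cleaned = " ".join(raw_text.split())  # remove newlines, tabs and repeated spaces
    doc = nlp(cleaned)
    return "\n".join(sentence.text.strip() for sentence in doc.sents)

print(one_sentence_per_line("Call me Ishmael. Some years ago - never mind how long.", nlp_en))
```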

1https://github.com/JonathanReeve/chapterize

4 Bilingual Sentence Alignment

In this chapter, we carry out bilingual text alignment on the 372 pairs of parallel chapters using three methods. The methods include automatically inferring a bilingual dictionary using statistical machine translation and using embeddings of concatenations of consecutive sentences. We compare the results of the three methods on a subset of objectively difficult cases, manually evaluating the quality of the alignments, and discuss the reasons behind the outcome of each method.

4.1 Introduction

The core and most crucial part of our experiments for successfully creating a bilingual parallel corpus relies on the concept of bilingual text alignment. In fact, we do not directly align an English audio utterance with a textual Italian translation. Instead, we follow the approach used by both Kocabiyikoglu et al. (2018) and Di Gangi, Cattoni, et al. (2019): we align the English transcription of the audio with the textual target language translation. The result of the bilingual text alignment is a new sentence segmentation, which is then used to re-segment the audio side of the data in a way that matches the textual bilingual parallel corpus created.

The fact that we split each English and Italian e-book into comparable chapters and performed automatic sentence segmentation does not mean that the sentences are aligned. This is simply due to the fact that translations of literary works are often not a one-to-one literal translation of sentences. In most cases, the number of sentences is not the same in the English chapter transcription and in its Italian textual translation. In addition, the Italian and the English texts sometimes differ by several paragraphs. The translator might also have chosen to add, remove or summarise a few paragraphs.

In order to reach high-quality triplets of English audio, English transcription and Italian textual translation aligned at a sentence level, we decided to follow the approach of Kocabiyikoglu et al. (2018). They performed bilingual text alignment of the chapters using hunalign in conjunction with a custom-made bilingual English-French dictionary. Only after performing bilingual text alignment did they carry out forced speech-to-text alignment between the English audio files and the English transcriptions of the sentences. Basically, they first found the pairs of English transcriptions and French textual translations, and then used speech-to-text alignment algorithms to finally reach triplets of English utterance, English transcription and French textual translation, aligned sentence by sentence.

However, since we do not possess a high-quality hand-crafted English-Italian bilingual dictionary containing more than 100,000 terms, we performed the bilingual text alignment task using two methods and performing three experiments. We wanted to compare the current state-of-the-art algorithm for text alignment (Gale-Church with the aid of a bilingual dictionary, as in hunalign) and the new sentence embeddings similarity computation with the FastDTW algorithm, as in vecalign (Thompson and Koehn, 2019).

Corpus | Sentences | English tokens | Italian tokens
EuroParl | 2.0M | 59.9M | 58.9M
GlobalVoices | 98.1k | 2.5M | 2.5M
Books | 33.1k | 0.9M | 0.8M
Wikipedia | 1.0M | 26.5M | 22.2M

Table 4.1: Number of sentences, English and Italian tokens for each parallel corpus used to infer the bilingual dictionaries through Moses and Giza++

4.2 Hunalign

Hunalign is probably one of the most widely used tools in the NLP community to perform sentence alignment and create parallel corpora. Many of the corpora in the OPUS open source parallel corpora collection were created using hunalign (Tiedemann, 2012).

We made experiments on bilingual text alignment using hunalign with two configurations. Both configurations were set so that hunalign made use of Gale-Church sentence-length information in combination with a bilingual dictionary. Furthermore, in order to achieve the best possible alignment quality, we set both configurations to perform two alignments. The first alignment was used to heuristically increase the size of the bilingual dictionary. Then, a second alignment was performed using the Gale-Church sentence-length information and the improved bilingual dictionary. The data for increasing the number of terms in the bilingual dictionary were given by the first alignment itself: the co-occurrences found in the bi-sentences (the output of the first alignment) were used to add terms to the bilingual dictionary.

The only difference between the two configurations was the starting dictionary. In the baseline hunalign configuration we used the bilingual English-Italian dictionary provided as part of the LFAligner tool by Andras Farkas1. Since the size of the LFAligner Italian-English dictionary was rather small (around 14500 terms), we inferred a much larger bilingual dictionary by training an English-Italian statistical machine translation system with Moses SMT2 and Giza++ on a large amount of parallel data from the OPUS3 parallel corpora collection.

More precisely, we used the EuroParl, Wikipedia (Wołk and Marasek, 2014), GlobalVoices and Books corpora from OPUS. The number of sentences and tokens of each of the parallel corpora listed above is given in Table 4.1. The parallel corpora were used to train a baseline Moses translation model. The resulting bidirectional lexical tables were then used to infer a dictionary. We used both the table with direction English to Italian and the table with direction Italian to English. Since the lexical tables generated by Moses and Giza++ also contain a great amount of extremely low-probability translation hypotheses, we filtered out the terms in the tables scoring less than 0.10. Eventually, we inferred a dictionary containing 692437 bilingual terms, which was converted to the hunalign format (a sketch of this conversion is given below).

We named the two hunalign configurations Base and Inf, where Base is the hunalign configuration using as input dictionary the LFAligner manually annotated bilingual dictionary (around 14500 terms), while Inf is the hunalign configuration using as input dictionary the inferred one (around 690000 terms).
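The conversion of the filtered lexical tables into a hunalign dictionary can be sketched as follows; the three-column input layout, the " @ " output separator and the file names are assumptions about the formats involved and should be checked against the actual Moses/GIZA++ output and the hunalign documentation.

```python
# Filter a GIZA++/Moses lexical translation table and convert it into a
# hunalign-style dictionary. We assume the common three-column layout
# "source_word target_word probability" for the input file and " @ "-separated
# pairs for the output; both file names are placeholders.
MIN_PROBABILITY = 0.10

with open("lex.e2i", encoding="utf-8") as table, \
        open("inferred.dic", "w", encoding="utf-8") as dictionary:
    for line in table:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed lines
        source, target, probability = fields
        # Drop extremely low-probability translation hypotheses.
        if float(probability) >= MIN_PROBABILITY:
            dictionary.write(f"{target} @ {source}\n")
```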

1 https://sourceforge.net/projects/aligner/
2 http://www.statmt.org/moses/
3 http://opus.nlpl.eu/

4.3 Vecalign and LASER

Since vecalign performs bilingual text alignment by computing the similarity of vector representations of sentences, there is no need for a supporting bilingual dictionary. However, vecalign requires sentence embeddings, as well as embeddings of concatenations of consecutive sentences, in order to perform bilingual text alignment. By concatenations of consecutive sentences we mean all possible concatenations of up to N consecutive sentences, for each sentence. For instance, if we want the concatenations of consecutive sentences with N equal to 3 for the sentences ['Hi!', 'I am Jack.', 'Nice to meet you.'], we would have to embed ['Hi!', 'I am Jack.', 'Nice to meet you.', 'Hi! I am Jack.', 'Hi! I am Jack. Nice to meet you.', 'I am Jack. Nice to meet you.']. This allows one to many, many to one, or many to many sentence alignments. Several sentences in the source language document might in fact correspond to a single sentence in the target language document, or the opposite. Imagine that the Italian text contained the sentence ["Ciao, sono Giacomo, piacere di conoscerti!"] (gloss: "Hi, I am Jack, nice to meet you"). By concatenating consecutive sentences, embedding them, and giving the vectors to vecalign to compute similarity, it is possible to find out that the best alignment in this scenario is the concatenation of the three English sentences ['Hi!', 'I am Jack.', 'Nice to meet you.'] aligned to the single Italian sentence ["Ciao, sono Giacomo, piacere di conoscerti!"]. Basically, if there is a possible alignment between several sentences in the source language text and one sentence in the target language text, or vice versa, mapping concatenations of consecutive sentences into vectors allows the similarity computation over a greater number of elements, potentially decreasing the ratio of missed alignments and improving generalization.

We therefore used Facebook's Language-Agnostic Sentence Representations toolkit (Facebook LASER) to map into vectors not only each sentence in the chapters, but also the concatenations of consecutive sentences for each sentence, with a window size equal to 10. Then we used vecalign to perform the bilingual text alignment.
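The sketch below illustrates the idea of generating the concatenations of consecutive sentences; vecalign ships its own overlap-generation script, so this is only a simplified illustration of the principle, with the window size as a parameter.

# Simplified illustration: build all concatenations of 1..window consecutive
# sentences. Each resulting string would then be embedded with LASER and the
# vectors handed to vecalign for the similarity-based alignment.

def consecutive_concatenations(sentences, window=10):
    overlaps = []
    for start in range(len(sentences)):
        for size in range(1, window + 1):
            if start + size > len(sentences):
                break
            overlaps.append(" ".join(sentences[start:start + size]))
    return overlaps

if __name__ == "__main__":
    english = ["Hi!", "I am Jack.", "Nice to meet you."]
    for chunk in consecutive_concatenations(english, window=3):
        print(chunk)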

4.4 Evaluation and Discussion

The output of both hunalign and vecalign is a ladder containing the indexes of the aligned sentences. In most cases the alignments produced by the three methods are the same, since bilingual text alignment overall provides good results. In order to evaluate the three methods, we had essentially two possibilities. The first would have been to manually annotate a gold standard and compute precision, recall and F1 score for each method. However, the results of this evaluation method would not differ much (Thompson and Koehn, 2019), and the gold standard to annotate would contain many non-difficult alignment scenarios.

Instead, we wanted to focus our evaluation effort on the objectively difficult scenarios. We therefore wrote a script that returned only the cases in which the results of the alignment experiments differed, resulting in a subset of thousands of likely difficult alignments from all books and chapters. We then sampled 200 cases from this list of ambiguous alignments to obtain a statistically meaningful pool of sentences whose alignment was complex. For each sentence we compared the results of the three methods by assigning one of three possible values: correct alignment, wrong alignment, or partial alignment.
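The disagreement-mining step can be sketched as follows, assuming the three output ladders have already been parsed into dictionaries mapping each source sentence index to the tuple of target indices it was aligned to (the parsing itself depends on the hunalign and vecalign output formats).

# Hedged sketch: keep only the source indices where the three methods produce
# different alignments, then sample 200 of them for manual annotation.

import random

def disagreements(base, inf, vec):
    """base, inf, vec: dicts {source_index: tuple_of_target_indices}."""
    differing = []
    for idx in sorted(set(base) | set(inf) | set(vec)):
        if len({base.get(idx), inf.get(idx), vec.get(idx)}) > 1:
            differing.append(idx)
    return differing

def sample_for_annotation(differing, k=200, seed=42):
    random.seed(seed)
    return random.sample(differing, min(k, len(differing)))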

             Base   Inf   Vecalign
Correct        42    43        169
Wrong         108    89          5
Partial        50    68         26
Total sentences: 200

Table 4.2: Evaluation of the bilingual text alignment task, where Base stands for hunalign in conjunction with the LFAligner dictionary, Inf stands for hunalign in conjunction with the inferred dictionary, and Vecalign stands for the corresponding toolkit. For each method the number of correct, wrong, and partial alignments is given out of a selected pool of 200 sentences.

By correct alignment we mean a perfect one to one, one to many, or many to one alignment. Exempli gratia: the English sentence "It was delightful once upon a time" aligned with the Italian sentence "Era un piacere allora!". By wrong alignment we mean the cases in which the alignment was totally wrong. Exempli gratia: the English sentence 'Yours affectionately' aligned with the Italian sentence 'Barkis ha intenzione di andare' (gloss: 'Barkis plans to go'). Finally, by partial alignment we mean the cases in which the alignment was neither totally wrong nor totally correct, but only partially correct. This usually happens when a different number of sentences is used in the two texts to express the same meaning. Exempli gratia: the English sentence 'I see it now' aligned with the Italian sentence 'Mi sembra di rivederla: una lunga sala, con tre lunghe file di piccoli scrittoi' (gloss: 'I see it now. A long room with three long rows of desks').

Table 4.2 provides the results of the manual evaluation. Vecalign outperformed both hunalign methods. Despite the different sizes of the dictionaries, there is little difference between the Base and the Inf hunalign methods. This might be due to the fact that, even if the inferred dictionary is roughly six times the size of the one used in the Base model, a manually corrected bilingual dictionary (as in Base) is far more accurate than an inferred one. The Inf model still outperformed the Base model, reducing the number of wrong alignments by 19 cases. However, this reduction of wrong alignments did not translate into a rise in correct alignments, since the number of correct alignments in Base and Inf differs by only one.

We explain the better performance of vecalign over the other two methods by the generalization and indirect sentence re-segmentation property that follows from concatenating consecutive sentence embeddings and computing vector similarity. The following example might clarify what we mean: imagine two chapters to be aligned at the sentence level. In the English chapter file we find two consecutive sentences, 'I see it now' and 'A long room with three long rows of desks'. In the Italian file, instead, there are fewer sentences, and a single sentence is used to express the same meaning as the two English sentences: 'Mi sembra di rivederla: una lunga sala, con tre lunghe file di piccoli scrittoi' (gloss: 'I see it now. A long room with three long rows of desks'). Since we used vecalign to compute the cosine similarity of all source sentence vector representations plus a window of embeddings of at most ten consecutive sentences, in this case vecalign aligns the two consecutive English sentences with the single Italian sentence. By concatenating consecutive sentences and mapping them into vectors with Facebook LASER, vecalign is capable of finding the most similar vectors among the set of all concatenations of consecutive sentences.

In doing so, it basically re-segments the sentences of both texts and deals successfully with one to many and many to one alignments. Interestingly, vecalign also produces zero to one and zero to many alignments, which means that it managed to filter out the cases where the two e-books differed drastically in terms of paragraphs and sentences. We therefore decided to use the vecalign output ladder to retrieve bi-sentences and create the textual English-Italian parallel corpus.
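As an illustration of this last step, the sketch below turns an output ladder into bi-sentences. It assumes each ladder line has the form "[source indices]:[target indices]", optionally followed by a score, which should be verified against the actual vecalign output; zero to one and one to zero alignments are simply skipped.

# Hedged sketch: convert an alignment ladder into English-Italian bi-sentences.
# Assumed line format: "[0, 1]:[0]" or "[0, 1]:[0]:score"; empty index lists
# mark segments present in only one of the two e-books and are dropped.

import ast

def ladder_to_bisentences(ladder_path, en_sentences, it_sentences):
    pairs = []
    with open(ladder_path, encoding="utf-8") as ladder:
        for line in ladder:
            src_part, tgt_part = line.strip().split(":")[:2]
            src_idx = ast.literal_eval(src_part)
            tgt_idx = ast.literal_eval(tgt_part)
            if not src_idx or not tgt_idx:
                continue  # zero-to-one / one-to-zero alignment: filter out
            english = " ".join(en_sentences[i] for i in src_idx)
            italian = " ".join(it_sentences[j] for j in tgt_idx)
            pairs.append((english, italian))
    return pairs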

5 Forced Alignment

In this chapter, we first provide the reader with an overview of different forced alignment methods and explain why we chose to perform forced alignment using a TTS-DTW based approach. Then, we describe the required pre-processing phase and a manual evaluation phase carried out to ensure the absence of severe error propagation.

5.1 Word Based and Sentence Based Alignment

As mentioned in section 2.7, forced alignment (or speech-to-text alignment) can be performed at different granularity levels, depending on the desired output. Aligning a transcript to the corresponding audio can mean determining the time interval of each phoneme, each word, each sentence, and so on. In our case, we mainly want to align the LibriVox chapter recordings to the English sentences that we have aligned to their corresponding Italian textual translations.

Training a new acoustic model using an HMM-GMM (Hidden Markov Model - Gaussian Mixture Model) or a DNN (Deep Neural Network) architecture to align each word and phoneme of our transcript to its timespan in the audio would be irrelevant to our needs and a time-consuming task. Using out-of-the-box aligners with pre-trained acoustic models based on ASR (automatic speech recognition) toolkits would be faster, but we would still not obtain synchronization maps with sentence granularity. The output of these methods would be a synchronization map containing timespans for each word, and we would have to post-process it in order to retrieve the sentence timespans. Instead, the TTS-DTW (text-to-speech and dynamic time warping) approach gives us exactly what we need: a synchronization map with sentence timespans. In that way, we can align the audio of the LibriVox recordings with the English sentences that are paired with their Italian translations.

5.2 Audio Editing and Aeneas

The main drawback of TTS-DTW methods is that they require a perfect match between the audio input and the transcript. The transcript is read by a TTS system and then the DTW algorithm takes care of the sequence alignment. If the transcript does not match the audio input, the alignment might be off sync. Since we have aligned bi-sentences for entire chapters and the vecalign bi-sentences also contain zero to many and many to zero alignments, we do not have the problem of non-matching audio and transcript. However, all LibriVox recordings start with the LibriVox disclaimer, chapter and book information in different formats, the reader's credits, and/or the date and place of recording, with no fixed structure. We therefore had to manually edit the LibriVox chapter audio files, otherwise the output synchronization maps would contain severe alignment errors.

Correct   Mild Errors   Severe Errors
  204          2              4

Table 5.1: Evaluation table for the forced alignment task with Aeneas. A detailed explanation of Correct, Mild Errors and Severe Errors is given in section 5.3

To the best of our knowledge, Aeneas is the only freely available TTS-DTW based forced aligner. Furthermore, it is extremely fast, thanks to the fact that it allows the use of Python extension modules written in C for faster computation. We decided to use the Aeneas forced aligner in conjunction with the open-source TTS engine eSpeak1 in order to perform the forced alignment (speech-to-transcript) task with sentence granularity.
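A task of this kind can be run through the aeneas Python API roughly as follows (paths are placeholders; with is_text_type=plain, each line of the input text file, that is, each aligned English sentence, becomes one fragment of the synchronization map):

# Sketch of an aeneas task producing a JSON sync map with sentence granularity,
# following the usage documented by the aeneas project; eSpeak is the default
# TTS engine. Paths are placeholders.

from aeneas.executetask import ExecuteTask
from aeneas.task import Task

config_string = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = u"/path/to/chapter.wav"
task.text_file_path_absolute = u"/path/to/chapter_sentences.txt"  # one sentence per line
task.sync_map_file_path_absolute = u"/path/to/syncmap.json"

ExecuteTask(task).execute()   # run TTS + DTW alignment
task.output_sync_map_file()   # write the JSON synchronization map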

5.3 Evaluation and Discussion

The Aeneas output consists of a JSON synchronisation map for each processed pair of audio and transcript. Each output JSON file contains the onset and offset of each audio segment, as well as the corresponding input English segment transcript. We used the data in the JSON maps to split each chapter audio recording into several wav files, one for each segment, and generated a text file corresponding to each audio segment and containing the parallel English text.

Then, we performed a manual evaluation of the resulting speech-to-text alignments by sampling 210 segments randomly selected from different books and chapters, listening to each audio segment wav file while reading its corresponding text file. For each sentence we assigned a value expressing a qualitative judgement of the segment alignment, out of three possibilities: correct, mild errors, and severe errors. We defined correct to mean that the audio wav file contains exactly and exclusively all words in its corresponding textual segment, and that all words are pronounced entirely and clearly, without abrupt cuts. We defined mild errors to mean those cases where the alignment is mainly correct, but one or two letters in the first or the last word are not pronounced; these represent the cases where the synchronisation is a few milliseconds off. Finally, we defined severe errors to mean the cases in which the alignment between the audio and its corresponding text was severely off sync or completely wrong. Since these scenarios can cause error propagation, every time we ran across a severe error, we also checked the preceding and following alignments.

Table 5.1 shows the results of our manual evaluation. Out of 210 manually checked alignments, we encountered only 4 severe errors, and none of them was the cause of error propagation. In all cases, the severe errors were due to actual mismatches between the audio and the transcript; for instance, this happened when the speaker decided to read a footnote that was not included in the transcript. Overall, the pool of randomly sampled and evaluated data seems to suggest a quite solid performance of the Aeneas aligner and of the underlying TTS-DTW based approach.
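The splitting step described above can be sketched as follows; the fragment keys ("begin", "end", "lines") follow the aeneas JSON output, and pydub is only one of many possible ways to cut a wav file, so the snippet is illustrative rather than the exact script used for the corpus.

# Hedged sketch: cut a chapter wav into one wav per aligned sentence using the
# aeneas JSON sync map, writing the matching English text next to each segment.

import json
from pydub import AudioSegment  # any wav slicing library would do

def split_chapter(sync_map_path, chapter_wav_path, out_prefix):
    audio = AudioSegment.from_wav(chapter_wav_path)
    with open(sync_map_path, encoding="utf-8") as f:
        fragments = json.load(f)["fragments"]
    for i, fragment in enumerate(fragments):
        start_ms = int(float(fragment["begin"]) * 1000)
        end_ms = int(float(fragment["end"]) * 1000)
        audio[start_ms:end_ms].export(f"{out_prefix}_{i:02d}.wav", format="wav")
        with open(f"{out_prefix}_{i:02d}.en", "w", encoding="utf-8") as txt:
            txt.write(" ".join(fragment["lines"]))

# Example call following the naming scheme described in Chapter 6:
# split_chapter("syncmap.json", "chapter.wav", "83_07")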

1http://espeak.sourceforge.net/

6 Signal Processing and Corpus Statistics

In this chapter, we explain how we extracted 40-dimensional MFCC features from the audio segments and we provide the reader with some corpus statistics.

6.1 Signal Processing

End-to-end speech translation models need to be given as input audio utterances aligned with their textual translation. However, the audio utterances should be pre-processed so as to allow the machine to learn the core linguistic information without unnecessary noise. We therefore pre-processed the audio signals so as to extract acoustic feature vectors meaningful for training an ST model. Following the work of both Bérard et al. (2018) and Di Gangi, Negri, et al. (2019), we extracted 40-dimensional MFCC (Mel-Frequency Cepstrum Coefficients) features with a window size of 25ms and a step size of 10ms from each audio signal.

Mel-Frequency Cepstrum Coefficients were first introduced by Davis and Mermelstein (1980) and have been widely used in speech processing ever since. Due to the fact that end-to-end ST combines the challenges of MT (word reordering, natural language ambiguity) with the struggles of ASR, among which there are the variety and the noise of the audio signal (Di Gangi, Negri, et al., 2019), MFCC feature extraction has quickly become crucial in the domain of end-to-end ST systems. MFCC features are the actual input data for direct translation models, since they are multidimensional arrays representing the linguistic and phonological units in the speech utterance.

In order to extract the MFCC features, we first had to perform pre-emphasis on each audio signal. In the field of ASR and ST, pre-emphasis is needed in order to improve the detection of the information carried by the higher frequencies. Audio signals contain frequencies in a continuum. Pre-emphasis increases the energy in the high frequencies, counteracting spectral tilt, that is, the tendency of high frequencies to have less power than low frequencies in a speech signal. After performing pre-emphasis, we performed windowing, which means that we sliced the audio signal into several frames. Windowing requires two parameters: the window size and the step size. The window size corresponds to the length of a frame, while the step size corresponds to the distance between successive frames. We chose a window size of 25ms and a step size of 10ms, as in Bérard et al. (2018) and Di Gangi, Negri, et al. (2019). Finally, we computed the DFT (discrete Fourier transform) and the mel-frequency filter banks so as to extract 40-dimensional MFCC features from each audio signal.

In order to carry out all these signal processing and speech feature extraction operations we used the speechpy1 Python library (Torfi, 2018). Speechpy outputs MFCC features as 2D arrays, and we used Numpy2 to save all extracted acoustic features (arrays) into files in the .npy format.
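The extraction step can be sketched with speechpy as follows; parameter names follow the speechpy API, while the window size, step size and number of coefficients are the values reported above. This is a minimal illustration rather than the exact script used to build the released features.

# Hedged sketch: pre-emphasis followed by 40-dimensional MFCC extraction
# (25 ms window, 10 ms step) with speechpy, saved as a NumPy .npy file.

import numpy as np
import speechpy
from scipy.io import wavfile

def extract_mfcc(wav_path, npy_path):
    sample_rate, signal = wavfile.read(wav_path)
    emphasized = speechpy.processing.preemphasis(signal, cof=0.98)
    mfcc = speechpy.feature.mfcc(
        emphasized,
        sampling_frequency=sample_rate,
        frame_length=0.025,   # 25 ms window
        frame_stride=0.010,   # 10 ms step
        num_cepstral=40,      # 40-dimensional MFCCs
        num_filters=40,
    )
    np.save(npy_path, mfcc)   # 2D array of shape (num_frames, 40)
    return mfcc

# extract_mfcc("83_07_33.wav", "83_07_33.npy")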

Figure 6.1: Structure of the compressed corpus.tar.xz folder, where parallel.it and parallel.en are parallel raw texts, while, for each book, there is a folder named after the Gutenberg Project ID with several sub-folders, one for each aligned chapter. The chapter folders are named after increasing integers. In each chapter folder, there are triplets of English audio, English text, and Italian text. They differ in the name extension, while the base filename for each element of the triplet is composed of three concatenated numerical identifiers (E-bookID, ChapterID, SegmentID). Exempli gratia: 83_07_33.wav, 83_07_33.it, 83_07_33.en

6.2 Corpus Statistics

Overall, our corpus consists of around 130 hours of English speech utterances aligned with their English transcripts and their Italian textual translations, for a total of 60561 parallel sentences extracted from 372 chapters and 9 books. The average duration of a sentence is 7.52 seconds. The corpus is available as a Kaggle dataset3 in the form of a compressed binary file. Figure 6.1 shows its internal folder structure. This version of the corpus contains two parallel raw text files, one for English and one for Italian, that can be used for MT. Then, there are nine folders, one for each aligned audio-book. The folders are named after the Gutenberg Project e-book identifier, which means that they are named after the source English text of the LibriVox recordings. Each book folder contains several sub-folders, one for each aligned chapter. In each chapter directory there are triplets of English audio, English text, and Italian translation. The filenames follow a precise strategy to maintain the alignment: each file is named after the Gutenberg Project e-book id, the chapter number, and the segment identifier (exempli gratia: 83_1_0.wav, 83_1_0.it, 83_1_0.en).

In addition to the main corpus, it is possible to download the extracted MFCC features and their corresponding English and Italian texts through a Google Drive link4. The compressed folder contains three data-sets: train, dev, and test. Each data-set contains two text files and one .npz file. In this case, sentence-based alignment is preserved through indexing: the .npz file contains several .npy files, whose filenames range from 0 to N, where 0 corresponds to the sentence in the first line of the text files and N to the last one. So, for instance, train/dataN.npz/0.npy contains the 2D array storing the MFCC features of the audio segment whose English transcription is the first line in train/dataN.detok and whose Italian textual translation is the first line in train/dataN.tok.

1 https://github.com/astorfi/speechpy
2 https://numpy.org/
3 www.kaggle.com/dataset/e1f2c96c7290f0df7683b80f10076461a36eb53b8bbc974c0bee37a5c8e11921
4 https://drive.google.com/file/d/19bpKRnIGwZU1bbFURSNCbi95xO8-xdk-/view?usp=sharing

         hours    avg segment duration   #segments
train    124.08   7.82s                  56703
dev        2.75   7.11s                   1341
test       5.35   7.64s                   2517
total    132.10   7.52s                  60561

Table 6.1: Number of segments, average segment duration, and total amount of hours for each data-set.

It is important to notice that the dataN.detok files contain English sentences tokenized and split into characters without punctuation, while the dataN.tok files contain Italian sentences tokenized and split into characters with punctuation. This is due to the fact that the train, dev, and test sets are meant to be used as possible input data for FBK-Fairseq-ST5 or similar frameworks for training end-to-end speech translation models, closely following the data pre-processing steps needed to initialize the current state-of-the-art pipeline for training end-to-end ST models, and especially the methods proposed by Bérard et al. (2018) and Di Gangi, Negri, et al. (2019).

Out of the 372 chapters in the corpus, train contains 345 chapters, dev contains 9 chapters, and test contains 18 chapters. The dev data-set contains only one chapter from each book, more precisely chapter 20, while test contains two chapters from each book, chapters 18 and 19. Since each book has a different number of chapters, this split ensures that the dev and test data-sets are taken roughly from the middle or near-final chapters of each book. There is no particular reason for this choice apart from a balanced randomization of the data-sets.

Both Bérard et al. (2018) and Di Gangi, Negri, et al. (2019) illustrate the benefits of training an ASR model on binarized data (MFCC features and tokenized English texts split into characters, stripped of punctuation) and using its encoder to initialize the training of an end-to-end ST model on binarized data (this time MFCC features and tokenized target-language texts split into characters, without removing punctuation). Table 6.1 provides the reader with more information on the data-set statistics.

Therefore, the main difference between the main corpus6 and the data-sets version7 is that the main corpus contains exclusively the aligned triplets, giving users a high degree of freedom to process the data and use them depending on their needs, while the data-sets version is merely an example of how the main corpus data can be processed and split into ready-to-use data-sets for training speech translation models following the current state-of-the-art pre-processing pipeline. We also created a public GitHub repository (SST-Augmented-Audiobooks8) for readers who want to follow the evolution of the corpus.
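For readers who want to use the released data-sets directly, loading could look roughly like the sketch below. The assumption that the .npz archive keys are the stringified segment indices ("0", "1", ...) described above should be verified against the downloaded files.

# Hedged sketch: iterate over aligned (MFCC matrix, English line, Italian line)
# triples of one data-set split. Key naming inside the .npz is an assumption.

import numpy as np

def iterate_split(npz_path, en_path, it_path):
    features = np.load(npz_path)
    with open(en_path, encoding="utf-8") as f:
        english = [line.rstrip("\n") for line in f]
    with open(it_path, encoding="utf-8") as f:
        italian = [line.rstrip("\n") for line in f]
    for i, (en, it) in enumerate(zip(english, italian)):
        yield features[str(i)], en, it   # 2D MFCC array, English text, Italian text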

5 http://github.com/mattiadg/FBK-Fairseq-ST
6 https://www.kaggle.com/giuseppedellacorte/stt-aligned-audiobooks-en-it/
7 https://drive.google.com/file/d/19bpKRnIGwZU1bbFURSNCbi95xO8-xdk-/view?usp=sharing
8 https://github.com/Giuseppe-Della-Corte/SST-Aligned-Audiobooks

7 Conclusion

End-to-end speech translation models require parallel corpora where speech utterances are aligned with their textual translations. Our work describes a solid pipeline and a set of methods to convert public domain audio-books and e-books into aligned data suitable for creating ASR/ST parallel corpora.

The pipeline starts from the translation and retrieval of e-book titles in a target language using knowledge-bases, named entity recognition, automatic translation, and web scraping techniques. Then, we tested different methods to perform bilingual text alignment, showing the robustness of vecalign and the benefits of using it in conjunction with Facebook LASER multilingual embeddings of concatenations of consecutive sentences to improve the quality of objectively difficult alignments, thanks to the intrinsic sentence re-segmentation property of this approach. We consequently gave an overview of forced alignment and evaluated the results of a TTS-DTW (text-to-speech and dynamic time warping) based approach. We strongly suggest this method in those scenarios where the text/transcript matches the audio, as in the case of recordings of full books that are faithful to the source texts.

All these experiments led to the creation of a multi-modal corpus of 132.10 hours of English speech utterances aligned with their English texts and their Italian textual translations, publicly released under the CC-BY licence, alongside a compressed folder containing 40-dimensional MFCC features for each audio segment as NumPy arrays.

In future work, we first aim to use the remaining 30 public domain Italian e-books retrieved in the text collection task in order to dramatically increase the corpus size, possibly reaching more than 500 hours of English speech utterances aligned with their Italian textual translations. In addition, it would be interesting to investigate whether literary works are really suitable for speech translation, given that translations of literary works are often not close translations. It would be interesting to train end-to-end speech translation systems using the same architecture on different speech translation corpora of similar size, in order to investigate the relationship between closeness of translation and model accuracy.

Furthermore, future work can also focus on speech corpora creation and multilinguality. Since both LibriVox and the Gutenberg Project contain audio-books and e-books in several languages, the methodology and the experiments we performed can be applied to a great variety of language combinations, possibly resulting even in multilingual corpora for speech translation based on public domain audio-books and e-books gathered semi-automatically from the Internet. We strongly hope that our work and the open licence of our corpus will enhance further research in the direction of crowd-sourced and large multilingual speech corpora based on LibriVox and the Gutenberg Project, and that such open data-sets and corpora may benefit the NLP community and be a valid and cost-effective solution for companies and researchers to tackle data scarcity in the domain of end-to-end ST development.

Bibliography

Adhikary, Aniruddha and Silvia Ahmed (2016). “CorpoMate: A framework for building linguistic corpora from the web”. In: 2016 19th International Conference on Computer and Information Technology (ICCIT). IEEE, pp. 367–370.
Anastasopoulos, Antonios and David Chiang (2018). “Tied multitask learning for neural speech translation”. arXiv preprint arXiv:1802.06655.
Arasu, Arvind and Hector Garcia-Molina (2003). “Extracting structured data from web pages”. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pp. 337–348.
Arora, Sanjeev, Yingyu Liang, and Tengyu Ma (2016). “A simple but tough-to-beat baseline for sentence embeddings”.
Artetxe, Mikel and Holger Schwenk (2019). “Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond”. Transactions of the Association for Computational Linguistics 7, pp. 597–610.
Auer, Sören, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives (2007). “DBpedia: A nucleus for a web of open data”. In: The semantic web. Springer, pp. 722–735.
Bengio, Yoshua, Réjean Ducharme, Pascal Vincent, and Christian Jauvin (2003). “A neural probabilistic language model”. Journal of machine learning research 3.Feb, pp. 1137–1155.
Bérard, Alexandre, Laurent Besacier, Ali Can Kocabiyikoglu, and Olivier Pietquin (2018). “End-to-end automatic speech translation of audiobooks”. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 6224–6228.
Berners-Lee, Tim, James Hendler, and Ora Lassila (2001). “The semantic web”. Scientific American 284.5, pp. 34–43.
Boeing, Geoff and Paul Waddell (2017). “New insights into rental housing markets across the United States: Web scraping and analyzing craigslist rental listings”. Journal of Planning Education and Research 37.4, pp. 457–476.
Chung, Yu-An, Wei-Hung Weng, Schrasing Tong, and James Glass (2019). “Towards unsupervised speech-to-text translation”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 7170–7174.
Cieri, Christopher, David Miller, and Kevin Walker (2004). “The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text.” In: LREC. Vol. 4, pp. 69–71.
Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou (2017). “Word translation without parallel data”. arXiv preprint arXiv:1710.04087.
Damm, David, Harald Grohganz, Frank Kurth, Sebastian Ewert, and Michael Clausen (2011). “SyncTS: Automatic synchronization of speech and text documents”. In: Audio Engineering Society Conference: 42nd International Conference: Semantic Audio. Audio Engineering Society.
Davis, Steven and Paul Mermelstein (1980). “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences”. IEEE transactions on acoustics, speech, and signal processing 28.4, pp. 357–366.
Di Gangi, Mattia A, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi (2019). “MuST-C: a multilingual speech translation corpus”. In: 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, pp. 2012–2017.

Di Gangi, Mattia Antonino, Matteo Negri, Roldano Cattoni, Roberto Dessì, and Marco Turchi (2019). “Enhancing transformer for end-to-end speech-to-text translation”. In: Machine Translation Summit XVII. European Association for Machine Translation, pp. 21–31.
Erxleben, Fredo, Michael Günther, Markus Krötzsch, Julian Mendez, and Denny Vrandečić (2014). “Introducing Wikidata to the linked data web”. In: International Semantic Web Conference. Springer, pp. 50–65.
Fu, Ruiji, Jiang Guo, Bing Qin, Wanxiang Che, Haifeng Wang, and Ting Liu (2014). “Learning semantic hierarchies via word embeddings”. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1199–1209.
Furui, Sadaoki (2010). “History and development of speech recognition”. In: Speech Technology. Springer, pp. 1–18.
Gale, William A and Kenneth W Church (1993). “A program for aligning sentences in bilingual corpora”. Computational linguistics 19.1, pp. 75–102.
Gladkova, Anna, Aleksandr Drozd, and Satoshi Matsuoka (2016). “Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t.” In: Proceedings of the NAACL Student Research Workshop, pp. 8–15.
Haddaway, Neal R (2015). “The use of web-scraping software in searching for grey literature”. Grey J 11.3, pp. 186–90.
Hamilton, William L, Jure Leskovec, and Dan Jurafsky (2016). “Diachronic word embeddings reveal statistical laws of semantic change”. arXiv preprint arXiv:1605.09096.
Haruno, Masahiko and Takefumi Yamazaki (1997). “High-performance bilingual text alignment using statistical and dictionary information”. Natural Language Engineering 3.1, pp. 1–14.
Hernández, Daniel, Aidan Hogan, Cristian Riveros, Carlos Rojas, and Enzo Zerega (2016). “Querying wikidata: Comparing sparql, relational and graph databases”. In: International Semantic Web Conference. Springer, pp. 88–103.
Hkiri, Emna, Souheyl Mallat, Mounir Zrigui, and Mourad Mars (2017). “Constructing a lexicon of arabic-english named entity using SMT and semantic linked data.” Int. Arab J. Inf. Technol. 14.6, pp. 820–825.
Honnibal, Matthew and Ines Montani (2017). “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing”. To appear.
Jain, Prateek, Pascal Hitzler, Amit P Sheth, Kunal Verma, and Peter Z Yeh (2010). “Ontology alignment for linked open data”. In: International semantic web conference. Springer, pp. 402–417.
Jia, Ye, Melvin Johnson, Wolfgang Macherey, Ron J Weiss, Yuan Cao, Chung-Cheng Chiu, Naveen Ari, Stella Laurenzo, and Yonghui Wu (2019). “Leveraging weakly supervised data to improve end-to-end speech-to-text translation”. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 7180–7184.
Juang, Biing-Hwang and Lawrence R Rabiner (2005). “Automatic speech recognition–a brief history of the technology development”. Georgia Institute of Technology. Atlanta Rutgers University and the University of California. Santa Barbara 1, p. 67.
Kay, Martin and Martin Röscheisen (1993). “Text-translation alignment”. Computational linguistics 19.1, pp. 121–142.

Kiryakov, Atanas, Borislav Popov, Ivan Terziev, Dimitar Manov, and Damyan Ognyanoff (2004). “Semantic annotation, indexing, and retrieval”. Journal of Web Semantics 2.1, pp. 49–79.
Kocabiyikoglu, Ali Can, Laurent Besacier, and Olivier Kraif (2018). “Augmenting librispeech with french translations: A multimodal corpus for direct speech translation evaluation”. arXiv preprint arXiv:1802.03142.
Krasnowska-Kieraś, Katarzyna and Alina Wróblewska (2019). “Empirical Linguistic Study of Sentence Embeddings”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5729–5739.
Lamere, Paul, Philip Kwok, Evandro Gouvea, Bhiksha Raj, Rita Singh, William Walker, Manfred Warmuth, and Peter Wolf (2003). “The CMU SPHINX-4 speech recognition system”. In: IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2003), Hong Kong. Vol. 1, pp. 2–5.
Lawrence, Steve and C Lee Giles (1998). “Searching the world wide web”. Science 280.5360, pp. 98–100.
Le, Quoc and Tomas Mikolov (2014). “Distributed representations of sentences and documents”. In: International conference on machine learning, pp. 1188–1196.
Liu, Dan, Junhua Liu, Wu Guo, Shifu Xiong, Zhiqiang Ma, Rui Song, Chongliang Wu, and Quan Liu (2018). “The USTC-NEL speech translation system at IWSLT 2018”. arXiv preprint arXiv:1812.02455.
McAuliffe, Michael, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger (2017). “Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi.” In: Interspeech. Vol. 2017, pp. 498–502.
Mikolov, Tomáš, Wen-tau Yih, and Geoffrey Zweig (2013). “Linguistic regularities in continuous space word representations”. In: Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pp. 746–751.
O’Shaughnessy, Douglas (2008). “Automatic speech recognition: History, methods and challenges”. Pattern Recognition 41.10, pp. 2965–2979.
Ott, Myle, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli (2019). “fairseq: A fast, extensible toolkit for sequence modeling”. arXiv preprint arXiv:1904.01038.
Panayotov, Vassil, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur (2015). “Librispeech: an asr corpus based on public domain audio books”. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5206–5210.
Pennington, Jeffrey, Richard Socher, and Christopher D Manning (2014). “Glove: Global vectors for word representation”. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
Pezoa, Felipe, Juan L Reutter, Fernando Suarez, Martín Ugarte, and Domagoj Vrgoč (2016). “Foundations of JSON schema”. In: Proceedings of the 25th International Conference on World Wide Web, pp. 263–273.
Poritz, Alan B (1988). “Hidden Markov models: A guided tour”. In: Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7–13.
Povey, Daniel, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. (2011). “The Kaldi speech recognition toolkit”. In: IEEE 2011 workshop on automatic speech recognition and understanding. CONF. IEEE Signal Processing Society.
Rabiner, Lawrence R (1989). “A tutorial on hidden Markov models and selected applications in speech recognition”. Proceedings of the IEEE 77.2, pp. 257–286.

Sakoe, Hiroaki and Seibi Chiba (1978). “Dynamic programming algorithm optimization for spoken word recognition”. IEEE transactions on acoustics, speech, and signal processing 26.1, pp. 43–49.
Salerno, John J and Douglas M Boulware (2006). Method and apparatus for improved web scraping. US Patent 7,072,890. July 2006.
Salvador, Stan and Philip Chan (2007). “Toward accurate dynamic time warping in linear time and space”. Intelligent Data Analysis 11.5, pp. 561–580.
Schwenk, Holger, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán (2019). “Wikimatrix: Mining 135m parallel sentences in 1620 language pairs from wikipedia”. arXiv preprint arXiv:1907.05791.
Slamet, Cepy, Rian Andrian, Dian Sa’adillah Maylawati, W Darmalaksana, MA Ramdhani, et al. (2018). “Web scraping and Naïve Bayes classification for job search engine”. In: IOP Conference Series: Materials Science and Engineering. Vol. 288. 1. IOP Publishing, p. 012038.
Srivastava, Ankit, Georg Rehm, and Felix Sasaki (2017). “Improving machine translation through linked data”. The Prague Bulletin of Mathematical Linguistics 108.1, pp. 355–366.
Tai, Kai Sheng, Richard Socher, and Christopher D Manning (2015). “Improved semantic representations from tree-structured long short-term memory networks”. arXiv preprint arXiv:1503.00075.
Talkin, David and Colin W Wightman (1994). “The Aligner: Text to speech alignment using Markov models and a dictionary”. In: The Second ESCA/IEEE Workshop on Speech Synthesis.
Thompson, Brian and Philipp Koehn (2019). “Vecalign: Improved Sentence Alignment in Linear Time and Space”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1342–1348.
Tiedemann, Jörg (2012). “Parallel Data, Tools and Interfaces in OPUS.” In: LREC. Vol. 2012, pp. 2214–2218.
Torfi, Amirsina (2018). “SpeechPy - A Library for Speech Processing and Recognition”. arXiv preprint arXiv:1803.01094.
Varga, Dániel, Péter Halácsy, András Kornai, Viktor Nagy, László Németh, and Viktor Trón (2007). “Parallel corpora for medium density languages”. Amsterdam Studies In The Theory And History Of Linguistic Science Series 4 292, p. 247.
Vargiu, Eloisa and Mirko Urru (2013). “Exploiting web scraping in a collaborative filtering-based approach to web advertising.” Artif. Intell. Research 2.1, pp. 44–54.
Vintsyuk, Taras K (1968). “Speech discrimination by dynamic programming”. Cybernetics 4.1, pp. 52–57.
Vrandečić, Denny and Markus Krötzsch (2014). “Wikidata: a free collaborative knowledgebase”. Communications of the ACM 57.10, pp. 78–85.
Weiss, Ron J, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen (2017). “Sequence-to-sequence models can directly translate foreign speech”. arXiv preprint arXiv:1703.08581.
Wieting, John, Mohit Bansal, Kevin Gimpel, and Karen Livescu (2015). “Towards universal paraphrastic sentence embeddings”. arXiv preprint arXiv:1511.08198.
Wołk, Krzysztof and Krzysztof Marasek (2014). “Building subject-aligned comparable corpora and mining it for truly parallel sentence pairs”. Procedia Technology 18, pp. 126–132.
Yang, Yinfei, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, et al. (2019). “Multilingual universal sentence encoder for semantic retrieval”. arXiv preprint arXiv:1907.04307.

Young, Steve J and Sj Young (1993). The HTK hidden Markov model toolkit: Design and philosophy. University of Cambridge, Department of Engineering Cambridge, England.
Zou, Will Y, Richard Socher, Daniel Cer, and Christopher D Manning (2013). “Bilingual word embeddings for phrase-based machine translation”. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1393–1398.
