arXiv:1802.03142v1 [cs.CL] 9 Feb 2018 irSec.Ti ops sdfratmtcsec recogn speech p automatic This for language. used target corpus, a This in text LibriSpeech. to aligned language source a r vial o riigmcietasainsses th systems, translation machine training Howev for available decoding. are or atte learning have during (SLT) transcription translation language language spoken in works Recent n mnlnul opsbsdo edadook called audiobooks exist- read an on enrich based to corpus propose (monolingual) we ing this, described For corpora introduction. existing the in than an bigger is magnitude which of evaluation order translation speech direct for corpus again is contributions. language). corpus Paper per 8h this than But (less small rather 2016). Lewis, and (Federmann Keywords: re for useful is online) experiments. available translation made language is c wit (which evaluation correlated corpus manual reasonably this a are as scores well alignments as automatic processing respe the their of with details level the sentence the at Fren segments gathering speech After align aligned. and segmented carefully been lopoie pehaindt rnltdtx.Sec is Speech text. translated through to recorded corpus aligned speech (MSLT) provides Translation also Language recordings. low-bandwidth Speech contain Microsoft and ar size corpora these medium their However, only 2013). with al., aligned et (Post conversations translations tran- telephonic speech of of hours scriptions 38 provide corpora Spanish-English r ulcyaalbe o example, For For corpora available. parallel language. publicly few target are a a only a translation, in in speech text speech End-to-End to include aligned that ( language corpora large source parallel no source are open (text there and training systems, for as translation available 2017). (such machine Chiang, texts are and parallel Anastasopoulos OpenSubtitles) of Europarl, quantities 2016; al., large (no et while (Blachon recordings However, language language) 2016; audio another al., source et Adda in of the translation in made transcript their corpora with uses aligned documentation language often speech for with- attractive which End-to-End Translation also speech, is transcription). raw translation intermediate Machine 2016; from any al., (B´erard et (translation out Translation 2017) in al., et Speech Weiss results End-to-End promising have successful shown have in and 2014), al., approaches et very (Bahdanau encoder-decoder been Attention-based umnigLbipehwt rnhTasain:AMultim A Translations: French with Librispeech Augmenting [email protected],laurent.besacier@univ [email protected], ietsec rnlto,blnulainet librispe alignment, bilingual translation, speech direct Skype .Introduction 1. u betv st rvd large a provide to is objective Our l a Kocabiyikoglu Can Ali o nls,Gra n French and German English, for ietSec rnlto Evaluation Translation Speech Direct nv rnbeAps -80,SitMri d’Heres Saint-Martin F-38400, Alpes, Grenoble Univ. ⋆ I,UA -N,CR,IRA rnbe France Grenoble, INRIA, CNRS, G-INP, UGA, LIG, Fisher † IIE,UA rnbe France Grenoble, UGA, LIDILEM, and Callhome r r olre( large no are ere > tv rnltosadoti 3ho sbeprle data. parallel usable of 236h obtain and translations ctive 100h) ⋆ h ua uget fteblnulainetqaiy We quality. alignment bilingual the of judgments human the h heboscrepnigt h nls ui-ok rmLi from audio-books English the to corresponding e-books ch r hl ag uniiso aalltxs(uha Europa as (such texts parallel of quantities large while er, arn Besacier Laurent , to,i eie rmra uibosfo h irVxpro LibriVox the from audiobooks read from derived is ition, lcbeeprmnsi ietsec rnlto rmr g more or translation speech direct in experiments plicable prtist l hsgpb umniga xsig(monolin existing an augmenting by gap this fill to tries aper pe obidedt-n peht-ettasainwith translation speech-to-text end-to-end build to mpted nutdo ml usto h ops hseauto sho evaluation This corpus. the of subset small a on onducted Abstract e ) , c corpus ech geol-le.r olivier.kraif@univ-grenoble-alpe -grenoble-alpes.fr, nARadbcuew eiv ti osbet n h text the find to possible is it believe we because and ASR in e osatwt edsec opsfreautn End-2-E evaluating for corpus speech read translation. a speech with start to ter https://www.ted.com https://persyval-platform.univ-grenoble-alpes.fr/DS esatfo hscorpus this from start We recordings. original r ae npbi oanbosaalbeon available books Project domain recordings public speech on The based are LibriVox. based effort: project collaborative a 2015) from on derive al., recordings et book (Panayotov audio read transcriptions The cor- their speech scale of with hours large aligned 1000 a approximatively is contains It which pus (ASR). Recognition Speech matic pehatmtclyaindt rnhtasain tthe at translations level French utterance to aligned ut- automatically English speech the with (French) of language terances foreign a in books LibriSpeech hsppri raie sfloig fe rsnigour presenting after ( following: point as starting organized is paper This Outline. u trigpitis point starting Our gives and work this subset concludes a perspectives. align- some 5. of obtained section evaluation automatically Finally, the our of ments). describes (quality corpus 4. the of Section sec- in corpus 3.. speech the tion to translations foreign aligned we .OrSatn on:LbipehCorpus Librispeech Point: Starting Our 2. > 3 2 1 0h n pnsuc aallcroata nld pehi speech include that corpora parallel source open and 100h) nte aae ol aebe sd E ak see - Talks TED used: been have could dataset Another https://www.gutenberg.org/ u aae saalbeat available is dataset Our 2 n r itiue with distributed are and h prahi tagtowr:w lg e- align we straightforward: is approach The . LibriSpeech ⋆ 1 Librispeech . lve Kraif Olivier , LibriSpeech hsrslsi 3ho English of 236h in results This . 3 u ecniee twsb bet- be was it considered we but - nscin2,w eciehow describe we 2., section in ) eas thsbe ieyused widely been has it because LibriSpeech dlCru for Corpus odal † opsue o Auto- for used corpus hspprpresents paper This l OpenSubtitles) rl, u sn source using out swl sthe as well as rSec,we briSpeech, nrlspoken eneral ul corpus: gual) et n has and ject, Gutenberg eiv that believe sta the that ws s.fr nd n . 91/detaildataset translations for a large subset of the read audiobooks. books) in French to be aligned with Librispeech. Some of the public domain resources that we used are: Gutenberg per-spk female male total Project5, Wikisource6, Gallica7, Google Books8, BEQ9, subset hours minutes spkrs spkrs spkrs UQAC10. dev-clean 5.4 8 20 20 40 Audiobooks available in LibriSpeech are of different liter- test-clean 5.4 8 20 20 40 ary genres: most of them are novels, however there are also dev-other 5.3 10 16 17 33 poems, fables, treaties, plays, religious texts, etc. Belong- test-other 5.1 10 17 16 33 ing to the public domain, most of the texts are old and not train- available publicly in foreign language. Therefore, the nov- 100.6 25 125 126 251 clean-100 els that were collected in foreign language are mostly nov- train- els from world’s classics. As few of them are ancient texts, 363.6 25 439 482 921 clean-360 some translations are in old French. train- 496.7 30 564 602 1166 3.3. Chapters Extraction other-500 LibriSpeech transcriptions are provided for each chapter. Table 1: Details on LibriSpeech corpus As the readers only read a short period of time11, transcrip- tions may correspond to incomplete chapters. For the same reason, books are not read entirely. Therefore, in order to Table 1. gives details on Librispeech as well as data split. obtain an alignment at the sentence level, a first step was to Recordings are segmented and put into different subsets of decompose English and French language books into chap- the corpus according to their quality (better quality speech ters. This step was achieved by a semi-automatic process. segments are put in the clean part). Note that in order to After converting books to text format (both English and obtain a balanced corpus with a large number of speakers, French), regular expressions were used to identify chapter each speaker only read a small portion of a book (8-10 min- transitions. Then, each French chapter was extracted and utes for dev and test, 25-30 minutes for train). Moreover, aligned to its counterpart in English. After manual verifica- in training data, speech segments are obtained by splitting tion of all chapters, we obtained 1423 usable chapters (from long signals according to (> 0.3s) silences in order to ob- 247 books). tain segments that are maximum 35s long. 3.4. Bilingual Text Alignement 3. Aligning Foreign Translations to The 1423 parallel chapters establish the comparable cor- Librispeech pus from which we extracted bilingual sentences. This 3.1. Overview was done using an off-the-shelf bilingual sentence aligner called hunAlign (Varga et al., 2007). HunAlign takes as in- The main steps of our process are the following: put a comparable (not sentence-aligned) corpus and outputs • Collect e-books in foreign language corresponding to a sequence of bilingual sentence pairs. It combines (Gale- English books read in Librispeech (section 3.2.), Church) sentence-length information as well as dictionary- based alignment methods. • Extract chapters from these foreign books, corre- Initial dictionary available for alignment was the de- spondingto read chapters in Librispeech (section 3.3.), fault French-English (40k entries) lexicon created for 12 • Perform bilingual text alignement from comparable LFAligner (wrapper for hunAlign created by Andras chapters (section 3.4.), Farkas). We enriched this dictionary by adding entries from other open source bilingual dictionaries. Different dictio- • Realign speech signal with text translations obtained naries (woaifayu, apertium, freedict, quick) from a lan- (section 3.5.). guage learning resource were gathered in various formats and adapted to hunAlign dictionary format13. We finally These different steps are described in the next subsections. obtained and used a dictionary of 128,000 unique entries. 3.2. Collecting Foreign Novels In order to improve the quality of sentence level align- ments, data had to be pre-processed. For English and LibriSpeech corpus is composed of 5831 chapters (from French, our extracted chapters were cleaned with regular 1568 books) aligned with their transcriptions. We used expressions. Then, we used Python NLTK (Bird, 2006) the given metadata to search e-books in foreign language Lib- (French) corresponding to English books read in 5 rispeech. Firstly, we used DBPedia (Auer et al., 2007) in http://www.gutenberg.org/ 6http://www.wikisource.org/ order to (automatically) obtain title translations. Secondly, 7http://gallica.bnf.fr/ we used a public domain index of French e-books4 to find 8http://books.google.com Web links matching titles we found. Then, we finished 9http://beq.ebooksgratuits.com this process for the entire LibriSpeech corpus by manu- 10http://www.uqac.ca/ ally searching for French novels in different public domain 11One goal of Librispeech was to have as many speakers as resources. Overall, we collected 1818 chapters (from 315 possible 12https://sourceforge.net/projects/aligner/ 4https://www.noslivres.net/ 13https://polyglotte.tuxfamily.org sentence split to detect sentence boundaries in the cor- have 2 French translations in the end (a correct one from pora. Furthermore, the bitexts were stemmed (removing automatic alignement ; a noisy one from MT). suffixes to reduce data sparsity). Finally, parallel sen- tences found were brought back to their initial form with Chapters Books Duration (h) Total Segments reverse . This last step was done using Google’s 1408 247 ~236h 131395 diff − patch − match library (Fraser, 2012). Table 3: Statistics of the final multimodal and bilingual cor- pus obtained (English speech aligned to French text) English Sentence French Sentence ≪Oh! je vous demande Oh, I beg your pardon! bien pardon! 4. Human Evaluation of a Corpus Subset A lane was forthwith Un chemin fut alors opened through the ouvert parmi la foule 4.1. Protocol crowd of spectators. des spectateurs. Now that we have obtained a multimodal alignment be- - Non, dit Catherine, tween (English) speech signals and (French) translations, No, ”said Catherine,” il n’est pas ici. we want to evaluate its quality. At this point, the only score he is not here; Jamais je ne parviens available is the confidence score given by hunalign indi- I cannot see him anywhere. `ale rencontrer. cating confidence for aligned sentences. One goal of this human evaluation, that can only be made on a corpus sub- Table 2: Examples of parallel sentences obtained from set, is to see if hunalign score has a good correlation with comparable corpora made up of aligned book chapters human judgements. 50 sentences from 4 different chapters have been chosen Table 2. shows examples of 3 bilingual sentences obtained for evaluation. These chapters were chosen according to from 3 different chapters. their average alignment scores (from hunalign). We chose two chapters that were near the mean of overall align- 3.5. Realigning Speech Signal with Text ment scores (hypothesized medium quality alignments), Translations one chapter which was above the mean score (hypothesized In order to associate parallel sentences to speech signal good quality alignment) and a final chapter below mean transcriptions, realignment of speech segments of Lib- score (hypothesized bad quality alignment). These sen- riSpeech was necessary. This realignment is a two step tences were evaluated by three annotators. We established a process: first, we forced aligned Librispeech English tran- scale from 1 to 3 to judge matchingquality between English scripts to match English sentences obtained in the previous speech and English transcriptions. This 3-step scale is pre- stage ; secondly, we resegmented the speech signal accord- cise enough because few errors were found in speech align- ing to new sentence splits. ments. We established a scale from 1 to 5 to judge quality For the first step, we used mweralign, a tool for realigning between bilingual text alignments. Overall, 200 sentences texts in a same language but with a different sentence to- were evaluated (on both scales) by 3 annotators. kenization (Matusov et al., 2005). We applied mweralign We give, as example below, sentences for each mark (1-5) to realign our speech transcriptions in English to the En- for human evaluation of bilingual alignments. Two differ- glish sentences of our bilingual corpus obtained in section ent dimensions are evaluated at the same time: the accuracy 3.4.. The outcome of this first step is a new sentence seg- of alignment (an alignment can be wrong, partial or correct) mentation for our English transcriptions that are now cor- and the fact that translational equivalence is compositional rectly aligned to our French translations. and may be isolated from the current context. The second step was to resegment the speech signals to match them to the new sentence segmentation. We did that • 1. Wrong alignment by: • creating a big wav file by concatenating speech seg- English: COMMIT TO ME I SHALL LET PASS ments for each chapter, NO ADVANTAGE • re-aligning the large speech wav signal to the tran- scripts using gentle14 toolkit, an off-the-shelf En- French: Je sais, par exemple, que maintenant il glish forced-aligner based on Kaldi ASR toolkit souffre de la faim dans un vaste d´esert, o`ul’on ne (Povey et al., 2011), saurait trouver de nourriture. • re-segmenting speech according to the desired sen- tence split. • 2. Partial alignment with slightly compositional trans- Table 3. presents and overview of final data (speech with lational equivalence aligned translations) obtained after this final step. For each sentence pair, we also added En-Fr out- put of our English transcripts (Google Translate). So we English: THAN IN SET TERMS AND IN COURTLY LANGUAGE 14https://github.com/lowerquality/gentle Average confidence Average speech alignment Average textual alignment Chapter score (hunalign) score (max 3) score (max 5) Ivanhoe 1.34 2.82 4.64 Chapter XXIII Alice’s Adventures in Wonderland 1.14 2.98 4.28 Chapter V A Tale of Two Cities 0.96 2.86 3.86 Book III, Chapter III Adventures of Huckleberry Finn 0.66 2.9 2.58 Chapter VIII Average 1.02 2.89 3.84

Table 4: Results of human evaluation by 3 annotators. Kappa’s Cohen (weighted) for inter annotator agreement for textual alignment is 0.76

French: Mais il paraˆıt que tu pr´ef`eres ˆetre cour- average score of 3.84/5. Some sentences were found un- tis´ee avec l’arc et la hache, plutˆot qu’avec des correctly aligned but overall, the alignment quality can be phrases polies et avec la langue de la courtoisie. considered as correct. The main reason why the average alignment score varies between chapters is reflected by the translations compositionnality. Also, the dictionary that we • 3. Partial alignment with compositional translation used for bilingual alignments is inadequate for old texts and and additional or missing information results in lower overall confidence scores. We also computed automatic correspondence scores obtained with a cross-language textual similarity de- English: SO AT LAST BEGAN THE EVENING tection between transcriptions and their translations PAPER AT LA FORCE (Ferreroet al.,2016). Our idea was to add another au- tomatic score in addition to hunalign score. We com- French: C’est ainsi qu’enfin d´ebuta le journal du puted the correlation between human evaluation scores and soir `ala Force, le jour o`ula pauvre Lucie avait vu hunalign scores and obtained a correlation of 0.41. The danser la carmagnole. same correlation was obtained between human evaluation scores and those obtained automatically with method of (Ferrero et al., 2016). This shows that automatic alignment • 4. Correct alignment with compositional translation scores are reasonably correlated with human judgments and and few additional or missing information could be used to extract a subset of the best alignments by ranking them according to hunalign score for instance.

English: THE NIGHT WAS DARK AND A 5. Conclusion COLD WIND BLEW We have presented a large corpus (236h) which is an aug- French: La nuit ´etait sombre; le vent ˆapre et froid mentation of Librispeech in order to provide a bilingual chassait devant lui avec rage les nuages rapides. speech- for direct (end-2-end) speech transla- tion experiments. The methodology described here could be used in order to add other languages than French (Ger- • 5. Correct alignment and fully compositional transla- man, Spanish, etc.) to our augmented Librispeech. The tion current corpus contains several ancient texts, so it would also be interesting to extend it to other kinds of corpora: different speaking styles (not only read speech), more con- English: WHAT IS A CAUCUS RACE temporary texts, etc. For direct experiments, preliminary ex- French: Qu’est-ce qu’une course cocasse? periments have been done recently and will be presented at next ICASSP 2018 conference (B´erard et al., 2018). Our 4.2. Results online repository15 provides a data split for speech transla- tion experiments and results show that it is possible to train Table 4. reports our human evaluations for the 4 chapters. compact and efficient end-to-end speech translation models The first thing that we can notice is that the alignment qual- in this setup, but the dataset is challenging (BLEU score ity is higher for chapters with higher confidence scores. around 15 for direct speech translation task - more details The first evaluation (speech alignement ; scale 1-3) shows in (B´erard et al., 2018)). an average score of 2.89/3 which confirms that our re- segmentation of speech signals worked correctly. The sec- ond evaluation (bilingual alignment ; scale 1-5) shows an 15see https://persyval-platform.univ-grenoble-alpes.fr/DS91/detaildataset 6. Bibliographical References Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glem- bek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Adda, G., St¨ucker, S., Adda-Decker, M., Ambouroue, O., Y., Schwarz, P., et al. (2011). The kaldi speech recog- Besacier, L., Blachon, D., Bonneau-Maynard, H., Go- nition toolkit. In IEEE 2011 workshop on automatic dard, P., Hamlaoui, F., Idiatov, D., Kouarata, G.-N., and understanding, number EPFL- Lamel, L., Makasso, E.-M., Rialland, A., Van de Velde, CONF-192584. IEEE Signal Processing Society. M., Yvon, F., and Zerbian, S. (2016). Breaking the un- written language barrier: The Bulb project. In Proceed- Varga, D., Hal´acsy, P., Kornai, A., Nagy, V., N´emeth, L., ings of SLTU (Spoken Language Technologies for Under- and Tr´on, V. (2007). Parallel corpora for medium den- Resourced Languages), Yogyakarta, Indonesia. sity languages. Amsterdam Studies in the Theory and History of Linguistic Science Series 4, 292:247. Anastasopoulos, A. and Chiang, D. (2017). A case study Weiss, R. J., Chorowski, J., Jaitly, N., Wu, Y., and Chen, Z. on using speech-to-translation alignments for language (2017). Sequence-to-sequence models can directly tran- documentation. arXiv preprint arXiv:1702.04372. scribe foreign speech. arXiv preprint arXiv:1703.08581. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. (2007). Dbpedia: A nucleus for a web of open data. The semantic web, pages 722–735. Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural ma- chine translation by jointly learning to align and trans- late. arXiv preprint arXiv:1409.0473. B´erard, A., Pietquin, O., Servan, C., and Besacier, L. (2016). Listen and translate: A proof of concept for end- to-end speech-to-text translation. In NIPS workshop on End-to-end Learning for Speech and Audio Processing. B´erard, A., Besacier, L., Kocabiyikoglu, A. C., and Pietquin, O. (2018). End-to-end automatic speech trans- lation of audiobooks. In Accepted to Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE Interna- tional Conference on Acoustics, Speech and Signal Pro- cessing. IEEE. Bird, S. (2006). Nltk: the . In Pro- ceedings of the COLING/ACL on Interactive presenta- tion sessions, pages 69–72. Association for Computa- tional . Blachon, D., Gauthier, E., Besacier, L., Kouarata, G.-N., Adda-Decker, M., and Rialland, A. (2016). Parallel speech collection for under-resourced language studies using the LIG-Aikuma mobile device app. In Proceed- ings of SLTU (Spoken Language Technologies for Under- Resourced Languages), Yogyakarta, Indonesia, May. Federmann, C. and Lewis, W. D. (2016). Microsoft speech language translation (mslt) corpus: The iwslt 2016 re- lease for english, french and german. Ferrero, J., Agnes, F., Besacier, L., and Schwab, D. (2016). A multilingual, multi-style and multi-granularity dataset for cross-language textual similarity detection. Fraser, N. (2012). google-diff-match-patch-diff,match and patch libraries for plain text. Matusov, E., Leusch, G., Bender, O., and Ney, H. (2005). Evaluating machine translation output with automatic sentence segmentation. In International Workshop on Spoken Language Translation (IWSLT) 2005. Panayotov, V., Chen, G., Povey, D., and Khudanpur, S. (2015). Librispeech: an asr corpus based on public do- main audio books. In Acoustics, Speech and Signal Pro- cessing (ICASSP), 2015 IEEE International Conference on, pages 5206–5210. IEEE. Post, M., Kumar, G., Lopez, A., Karakos, D., Callison- Burch, C., and Khudanpur, S. (2013). Improved speech- to-text translation with the fisher and callhome spanish- english speech translation corpus.