
Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch

Year: 2018

Parallel Corpora, Terminology Extraction and Machine Translation

Volk, Martin

Abstract: In this paper we first give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as Part-of-Speech tagging, lemmatization and dependency parsing, but we also introduce language-specific methods, e.g. for dealing with split verbs or truncated compounds in German. We argue for careful sentence and word alignment for parallel corpora. And we explain how word alignment is the basis for a wide range of applications from translation variant ranking to terminology extraction. We conclude with a discussion of the latest developments in Machine Translation.

Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-150769 Conference or Workshop Item Published Version

Originally published at: Volk, Martin (2018). Parallel Corpora, Terminology Extraction and Machine Translation. In: 16. DTT-Symposion. Terminologie und Text(e), Mannheim, 22 March 2018 - 24 March 2018, 3-14.

Parallel Corpora, Terminology Extraction and Machine Translation

Martin Volk
University of Zurich
Institute of Computational Linguistics
[email protected]

Abstract

In this paper we first give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as Part-of-Speech tagging, lemmatization and dependency parsing, but we also introduce language-specific methods, e.g. for dealing with split verbs or truncated compounds in German. We argue for careful sentence and word alignment for parallel corpora. And we explain how word alignment is the basis for a wide range of applications from translation variant ranking to terminology extraction. We conclude with a discussion of the latest developments in Machine Translation.

1 Introduction

In recent years an increasing number of large parallel corpora have become available for research in natural language processing. The best known is Europarl with the proceedings of the European parliament (Koehn, 2005; Graën et al., 2014) with around 50 million tokens in the languages of the European Union. Other well-known multilingual and multiparallel corpora are JRC Acquis (Steinberger et al., 2006) with the EU law collection, OpenSubtitles (Lison and Tiedemann, 2016), United Nations documents (Ziemski et al., 2016), and collections of patent applications (Junczys-Dowmunt et al., 2016b).

Switzerland is a country with four official languages (French, German, Italian and Rumansh), and because of the many international companies and organizations in Switzerland English is becoming ever more popular. Therefore there is a constant need for translation between these languages, and this is a natural basis for a plethora of parallel corpora. We have taken advantage of this situation and collected and annotated various Swiss parallel corpora.

Our research group specializes in building parallel corpora for special domains which span over time: We have digitized parallel texts from the Swiss Alpine Club in French and German from 1957 until today (Göhring and Volk, 2011), banking texts in English, French, German and Italian from 1895 up to the present (Volk et al., 2016a), and the announcements of the Swiss federal government (DE: Bundesblatt, FR: Feuille fédérale, IT: Foglio federale) since 1849.

In the current paper we will focus on the latest methods in the automatic annotation and alignment of parallel corpora. We will argue that word alignment across languages improves annotation. We focus on parallel corpora for linguistic and translation studies, but we believe that parallel corpus search systems are also interesting for language learners for viewing translation variants in context and for terminologists who want to extract terms or verify domain-specific language usage.

The paper is structured as follows. In section 2 we describe corpus annotation methods. Section 3 is devoted to alignment techniques and their benefits for corpus annotation such as word sense disambiguation with practical applications in lemma disambiguation and named entity recognition. In section 4 we give usage examples of our parallel corpora including translation discovery and translation error detection. Section 5 describes the latest developments in using parallel corpora for machine translation.

Figure 1: A Multilingwis query with hits in six Europarl languages

2 Corpus Annotations

Corpus building starts with corpus collection, cleaning and tokenization. The latter is often language-specific, and therefore multilingual corpora require language identification. Typically identification is done on the sentence level. For each sentence we compute the language in order to be able to use the appropriate processing tools during corpus annotation. Using, for instance, a Part-of-Speech tagger that was trained for one language for the annotation of a sentence in another language will give erratic results. Therefore language identification on the sentence level is of paramount importance for all texts with mixed languages.

2.1 General Corpus Annotation

Part-of-Speech tagging is a standard procedure when the corpora are meant for linguistic research. There are a number of PoS taggers available with parameter files for many large languages of Europe. Most of the time the parameter files are the result of training the taggers on newspaper texts. This means that the taggers work best on newspaper texts and gradually worse the more the corpus material differs from newspapers.

Often PoS tagging also provides lemmas. For example, the TreeTagger outputs lemmas for word form - lemma pairs that it has seen in its training corpus. For all other word forms the corpus builder may provide the tagger with additional word form - lemma pairs. These pairs lack the probability information that tagger training derives from a manually annotated corpus, but for word forms with little or no PoS ambiguity the extension of the tagger lexicon is still useful. So, any additional lemma information from other corpora or from dictionaries or morphological analyzers is valuable.

Recent parallel corpora have been annotated with more annotation layers: Named entity recognition (NER) is a popular method for a first step towards semantic annotation. Typically it involves the recognition of person names, location names and organization names, which are the central classes when processing newspaper texts. Special text types may require other name classes (e.g. event names) or more fine-grained distinctions. For example, in our parallel corpus of Alpine texts we sub-classify toponyms into the name classes of mountains, glaciers, lakes, valleys, cabins, and cities. Toponyms are essential for the mountaineering reports and therefore very frequent.

Shallow NER includes only the recognition and classification of names. A deeper analysis includes co-reference resolution (Ebling et al., 2011) and entity linking (sometimes called grounding). Monolingual co-reference resolution will deal with mention variants like Grand Combin = Combin, while multilingual co-references will catch translation variants as e.g. DE: Matterhorn = FR: Cervin = IT: Cervino. Monolingual co-references might include anaphora resolution and will thus allow for investigating coherence phenomena in texts.

Lately, dependency parsing has become available for many languages (e.g. Maltparser, Spacy, Stanford, ...). These parsers allow for the efficient analysis of large corpora with a labeled attachment score of 80-90% (McDonald and Nivre, 2011) and higher values for unlabeled attachment. Even though parsing is far from perfect, the automatically assigned information opens a whole new chapter for corpus studies. For example, searches for verb-object relations no longer need to speculate on co-occurrence in some arbitrary range, but can be conditioned on parsing evidence. In this way, we find candidates for support verb constructions like to take into consideration or for verb sub-categorizations for particular prepositional objects like to wait for.
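The sentence-level language identification argued for above can be illustrated with a crude function-word heuristic. This is a toy sketch, not a production method: the stopword lists below are invented and far too small, and real pipelines use trained character n-gram classifiers, but the per-sentence decision is the same.

```python
# Toy sentence-level language identifier: score each sentence against
# tiny, hand-picked function-word lists and pick the best match.
# The lists below are illustrative assumptions, not a real resource.

STOPWORDS = {
    "de": {"der", "die", "das", "und", "nicht", "ist", "ein", "aber"},
    "en": {"the", "and", "not", "is", "a", "of", "to", "in"},
    "fr": {"le", "la", "les", "et", "ne", "est", "un", "une"},
}

def identify_language(sentence: str) -> str:
    """Return the language code whose function words match most tokens."""
    tokens = sentence.lower().split()
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(identify_language("Das ist nicht der Fehler"))       # -> de
print(identify_language("The mistake is not in the text")) # -> en
```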
2.2 Language-specific Corpus Annotation

In addition to these general annotation considerations, many languages will have specific requirements. Compounding languages like German or Swedish will profit from compound segmentation in lemmatization. E.g. a German text might mention the compounds Montblanc-Besteigung, Mont-Blanc-Expedition, Montblancgipfel with or without a hyphen, and we will miss the mountain name Montblanc if we do not split these compounds and normalize the spelling variants. Splitting and normalization also facilitate cross-lingual word alignment since they reduce 1-to-many alignments.

Another example of a language-specific annotation is the re-attachment of German verb prefixes that occur separated in the sentence. In example sentence 1 the prefix auf (EN: on) needs to be re-attached to the verb stem fällt (EN: falls) in order to compute the correct verb lemma auffallen (EN: to strike, to notice) (Volk et al., 2016b).

(1) Selber fällt mir der kleine Fehler aber kaum auf.
EN: However I do not notice the little mistake.

Our re-attachment is based on PoS tags and re-attaches the separated prefix to the most recent preceding finite verb form when this results in a valid German prefix verb (from a manually curated list of about 8000 such verbs). It works with 96.8% precision when evaluated against manually re-attached prefixes in the TüBa-D/Z.

2.3 Exploiting Parallel Corpora for Annotation

Traditionally most corpus annotation is done monolingually. This means that PoS tagging, lemmatization and parsing of e.g. a German corpus is done irrespective of a translation in English or any other language. However the parallel text may help to disambiguate and thus to improve the annotation precision on many levels. Most obviously the parallel corpus may help to determine the correct word sense in a given sentence. For example, the word Mönch in our Alpine corpus may refer to a prominent mountain in central Switzerland or to a monk (= male person in a monastery). If the corresponding sentence in the English or French translation also contains the word Mönch, then it is clear that the ambiguous word in the German sentence refers to the mountain name.

We have developed a similar kind of disambiguation for lemmas. For example, the German word form gehört may have the lemma hören (EN: to hear) or gehören (EN: to belong). Depending on the corresponding sentence in English or any other language we can easily compute the correct lemma for the ambiguous word in the German sentence (Volk et al., 2016a). Of course, this kind of knowledge transfer between the languages in a parallel corpus presupposes word alignments across the languages.
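Once word alignment is available, the alignment-based lemma disambiguation described in section 2.3 amounts to a lookup: the English lemma aligned to the ambiguous German form selects the German lemma. A minimal sketch with an invented toy table (not the actual resource of Volk et al., 2016a):

```python
# Cross-lingual lemma disambiguation sketch: the German form "gehört"
# is ambiguous between hören (to hear) and gehören (to belong); the
# word-aligned English lemma decides. The table is assumed toy data.

LEMMA_TABLE = {
    "gehört": {"hear": "hören", "belong": "gehören"},
}

def disambiguate_lemma(german_form: str, aligned_english_lemma: str) -> str:
    options = LEMMA_TABLE.get(german_form.lower())
    if options and aligned_english_lemma in options:
        return options[aligned_english_lemma]
    return german_form  # fall back to the surface form

print(disambiguate_lemma("gehört", "belong"))  # -> gehören
print(disambiguate_lemma("gehört", "hear"))    # -> hören
```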
3 Aligning Parallel Corpora

Document alignment is the starting point of all alignment activities. If a corpus is built on OCRed text or on web-crawled text, then document alignment requires article boundary detection and subsequent document alignment based on properties such as author names, article lengths and publication dates.

The next step is sentence alignment, which is a precondition for any exploitation of parallel corpora. Sentence alignment can be computed efficiently based on length comparisons (based on character counts), co-occurring numbers, names and cognates. Hun-Align is a popular sentence aligner that implements a two-pass algorithm: it does a rough alignment and builds a bilingual dictionary in the first pass and uses this dictionary in the second pass for refined sentence alignment. It works well for parallel texts that have corresponding sentences in the same order (monotonicity condition) and with few omissions.

Finally we compute word alignments through GIZA, the Berkeley Aligner or FastAlign. Word alignment finds corresponding words or phrases in aligned sentence pairs. It can be computed on word forms or lemmas. Word alignment enables e.g. sorting the hits by translation variants (which may correspond to different word senses). It also enables annotation transfer (e.g. transferring name classes across languages) and cross-language disambiguation. Word alignment has opened many new avenues for linguistic research and translation studies over parallel corpora.
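The length-based sentence alignment mentioned above can be sketched as dynamic programming over character counts. This is a bare-bones illustration only (1-1 matches plus skips, with an arbitrarily assumed skip penalty); real aligners such as Hun-Align also handle 1-2/2-1 merges and add the dictionary-based second pass:

```python
# Bare-bones length-based sentence aligner: dynamic programming over
# 1-1 matches and skips, scoring a pair by the difference of its
# character counts. SKIP is an arbitrary assumed penalty.

def align(src: list[str], tgt: list[str]) -> list[tuple[int, int]]:
    SKIP = 50
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 bead
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n and cost[i][j] + SKIP < cost[i + 1][j]:  # skip source
                cost[i + 1][j], back[i + 1][j] = cost[i][j] + SKIP, (i, j)
            if j < m and cost[i][j] + SKIP < cost[i][j + 1]:  # skip target
                cost[i][j + 1], back[i][j + 1] = cost[i][j] + SKIP, (i, j)
    pairs, ij = [], (n, m)
    while ij != (0, 0):  # trace back, keeping only the 1-1 beads
        pi, pj = back[ij[0]][ij[1]]
        if ij == (pi + 1, pj + 1):
            pairs.append((pi, pj))
        ij = (pi, pj)
    return pairs[::-1]

src = ["Guten Morgen.", "Das Wetter ist heute sehr schön.", "Danke."]
tgt = ["Good morning.", "The weather is very nice today."]
print(align(src, tgt))  # -> [(0, 0), (1, 1)]
```

The third source sentence has no counterpart and is left unaligned, which is exactly the "few omissions" case the text mentions.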
4 Retrieval from Parallel Corpora

The Corpus Query Workbench has become a de-facto standard for retrieval from annotated monolingual corpora. It allows simple and complex queries over words and their associated features (like PoS tags, lemmas, name classes etc.). There is no such standard retrieval tool for parallel corpora.

Different commercial web sites offer searches over parallel corpora as a substitute or complement to dictionary searches. Most notably these are Linguee, Glosbe and Tradooit (Volk et al., 2014). But on these sites the texts do not have any linguistic annotation.

SketchEngine is one of the few search systems that allows for query conditions to be specified over both sentences in a parallel sentence-aligned corpus. But it does not exploit automatically computed word alignment.

We are working towards such a flexible and powerful retrieval tool for parallel corpora. Our prototype system, Multilingwis [1], allows for word form or lemma searches, for fixed sequence and bag-of-word searches and for the exclusion of function words from queries. We have also included the option to filter for source language. In this case, the query (in any of the supported languages) results only in hits with utterances that originate in a specific source language. For example, I could search for German Binnenmarkt only in those cases where the original utterance is in French. This allows the researcher to distinguish between searches in original texts versus translations.

[1] pub.cl.uzh.ch/purl/multilingwis

In a second prototype we have experimented with database searches over multi-parallel texts including displays of the hits as parallel dependency trees with PoS tags and word alignment (see figure 2).

Figure 2: Query result for "single market" on Europarl German-English with full annotation display

Word alignment provides the basis for many application scenarios. For example, word-aligned corpora allow for translation discovery. We built a parallel corpus English-German of film and TV subtitles. When we queried for the German word fragen we found the obvious English translation variants to ask, to say, to wonder, to question (in this order of frequency). But next in the ranked list we found to go, which looked like an alignment error at first sight. But on closer inspection we discovered that this is a real translation option for German fragen as in the following example sentences.

(2) DE: ... und sie fragte "Was ist das?".
EN: ... and she goes, "What's that?"

(3) DE: Ich war jung. Ich fragte "Wo ist England?".
EN: I was young. I went, "Where's England?"

A special case of translation discovery is translation error detection. For example, we checked for translation variants of month names. When we queried for English July we found that in about 1% of the translations it is erroneously translated with German Juni, French Juin, and Spanish Junio, all meaning June. We observed this confusion in many of our parallel corpora. Obviously the similarity of the month names June and July is disturbing for the human translators.

Another application of parallel corpora is synonym detection through mirroring. Starting from a query in one language and then querying back from a translation hit in another language will lead to synonyms in the starting language. For example, when we searched for the idiomatic German adverb phrase klipp und klar, we found that quite clear and very clearly are among the top English translation variants in Europarl (cf. figure 1). Now, if we query for very clearly in the opposite direction, we get German sehr deutlich, ganz klar and ganz eindeutig, which are synonymous expressions for the initial query phrase klipp und klar.

Queries over parallel corpora provide translation variant ranking based on corpus frequencies. These frequencies are obviously dependent on the corpus (more precisely on the textual domain). For example, when querying for the German word Leiter (EN: leader or ladder or electrical conductor) we get different rankings for the French translation variants in our Alpine corpus in comparison to our corpus of Swiss laws. In the Alpine corpus the French échelle (EN: ladder) is ranked second after chef, whereas in the Swiss laws in French échelle is only on fifth place after conducteur, directeur, chef, responsable.
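Translation variant ranking and the mirroring trick for synonym detection both reduce to frequency counts over word-aligned phrase pairs. A minimal sketch; the aligned pairs below are invented stand-ins for real Europarl alignments:

```python
# Translation variant ranking over word-aligned phrase pairs, plus the
# mirroring step for synonym detection. Invented toy alignments below.

from collections import Counter

ALIGNED = [  # (German phrase, English phrase) pairs from word alignment
    ("klipp und klar", "quite clear"), ("klipp und klar", "very clearly"),
    ("klipp und klar", "quite clear"), ("ganz klar", "very clearly"),
    ("sehr deutlich", "very clearly"), ("ganz eindeutig", "very clearly"),
]

def variants(query: str, source_side: int) -> list[tuple[str, int]]:
    """Rank the phrases aligned to `query`, most frequent first."""
    other = 1 - source_side
    counts = Counter(pair[other] for pair in ALIGNED
                     if pair[source_side] == query)
    return counts.most_common()

print(variants("klipp und klar", 0))  # German -> English ranking
print(variants("very clearly", 1))    # mirror step: back to German
```

Querying German-to-English ranks quite clear first; mirroring back from very clearly surfaces the synonymous German phrases, as described in the text.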
5 Machine Translation

Machine Translation research dates back to the 1950s. The history has been told in many publications and can therefore be cut short here.

5.1 Old-style Machine Translation

The first approach to MT was based on manually crafted rules for the automatic analysis of the source sentence, for the transfer to the target language and for the generation of the output sentence on the target side. This was a cumbersome and time-consuming process that relied on large bilingual dictionaries and complex rules. Some rule-based MT systems achieved good translation quality for limited topical domains. The most severe limitation of rule-based MT was the lack of a good ranking between possible translation variants and the huge effort needed for adapting an MT system from one domain to another.

These issues were partially remedied through Statistical Machine Translation (SMT), which became prominent in the 1990s and reached its breakthrough a decade later. SMT learns translations from large parallel corpora. Bilingual dictionaries and grammar rules are no longer needed. When a parallel corpus of sufficient size and quality is available and aligned, the SMT system can be built in a matter of days. In such a way Google Translate was able to build up from few languages in 2005 to more than 100 languages to date. Others have built domain-specific systems that outperformed Google Translate in specific topical domains and genres such as TV subtitles. In recent years pure SMT has hit a quality ceiling, in particular when translating into morphologically rich languages and between languages with distinctly differing word order.

5.2 Neural Machine Translation

Neural Machine Translation achieved its breakthrough at the WMT Conference 2016 in Berlin when Edinburgh University's NMT systems ranked first for many language pairs (Sennrich et al., 2016a). In the same year Google published a seminal paper (Wu et al., 2016) demonstrating a jump in MT quality with the introduction of Neural MT.

Machine learning based on neural networks has been discussed for many years. However progress was limited since it is computationally expensive and thus requires powerful hardware. With the advent of modern GPUs the use of neural networks has become wide-spread and is now known as Deep Learning.

Neural Machine Translation is based on a so-called encoder-decoder network. It consists of two recurrent neural networks: one is an encoder which reads the source sentence and produces a sequence of hidden states, and the second is a decoder which predicts each word in the target sentence based on the previously produced target words and a source context, either the last hidden state of the encoder, or a weighted average of the source states in attentional models (Bahdanau et al., 2015).

A central notion for NMT is word embeddings, i.e. a mechanism to convert words into numbers so that the neural network can process them. Through word embeddings semantically similar words get similar numerical values. For example, house and building will be positioned close to each other in the meaning space, whereas house and cloud will be further apart.

NMT operates with a fixed vocabulary. But real-world translation has to deal with new words constantly. To date the most successful solution to this problem is to translate via subword units (Sennrich et al., 2016b). The intuition is that various types of words can be translated via smaller units than words, e.g. names can be translated via transliteration, compounds can be handled via splitting and translating the parts, and cognates and loanwords can be processed via phonological and morphological transformations. The subword method reduces the number of out-of-vocabulary words considerably and is currently used in most NMT systems.

As a result NMT output is much more fluent than SMT output. NMT is essentially a powerful language model (on the target language) which is triggered by the source language. Moreover NMT captures more context in the source sentence. The translation choice of every word in the source sentence is conditioned on all other words in the sentence (whereas in SMT this conditioning was limited to adjacent phrase pairs).

Still, many MT problems persist even in NMT. How can we ensure terminology-consistent translation? Many large companies and organizations have compiled terminology databases to ensure the unambiguous and consistent translation of technical terms. Chatterjee et al. (2017) have suggested an approach to guide NMT decoding based on external terminology lists which works both for words that are known to the NMT system and unknown ones.

Another concern is missing words in the translation. Currently there is no guarantee that the NMT output in the target language has translated all pieces of information from the source. Sometimes there are omissions of modifiers and even negation particles that change the meaning of the output sentence considerably. Koehn and Knowles (2017) discuss more challenges in NMT including larger amounts of training texts needed, difficulties in translating highly inflected words, and worse translation results on very long sentences.
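The subword method of Sennrich et al. (2016b) builds on byte-pair encoding. A toy version of the merge-learning step is sketched below for illustration only; real systems learn tens of thousands of merges from the full training corpus, and the small weighted word list is invented data:

```python
# Toy byte-pair-encoding learner: repeatedly merge the most frequent
# adjacent symbol pair in a frequency-weighted vocabulary. The corpus
# below is a small invented word list, not real training data.

from collections import Counter

def learn_bpe(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    vocab = {tuple(w): f for w, f in words.items()}  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

corpus = {"lower": 5, "low": 7, "newest": 6, "widest": 3}
print(learn_bpe(corpus, 3))  # -> [('l', 'o'), ('lo', 'w'), ('e', 's')]
```

Frequent character sequences grow into larger units, so common words survive whole while rare words fall apart into translatable pieces.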
Prominent NMT applications are Google Translate [2], Systran Pure NMT [3], and DeepL [4]. In particular the latter - though currently only available for 7 languages - has attracted much attention because of its high translation quality.

In the translation industry, the increased fluency of NMT output has been greeted as a welcome improvement (Junczys-Dowmunt et al., 2016a). However, it is also acknowledged that this requires even more attention and a higher cognitive workload for professional translators who are post-editing MT output. The errors of the MT system are now even more difficult to spot.

6 Conclusion

We have outlined a number of issues for the annotation and alignment of parallel corpora. We have argued that there are standard corpus annotation tools that work for many languages, and there are language-specific annotations that, for example, deal with separated verb prefixes in German. In addition, parallel corpora require automatic alignment not only on the sentence level but also for words and phrases. This level of alignment opens up many new applications such as translation discovery and term extraction.

Automatic word alignment is the offspring of work on the development of statistical machine translation systems. SMT has been the dominant paradigm in MT research for the last 25 years. Recently it has been superseded by neural machine translation, which produces better and in particular more fluent translations.

Acknowledgments

We would like to thank Johannes Graën for preparing the multiparallel Europarl corpus and Christof Bless for the visualizations of parallel parsed sentences. This research was supported by the Swiss National Science Foundation under grant 105215 146781 for "SPARCLING: Large Scale PARallel Corpora for LINGuistic Investigation" (2013-2017), a joint project with Marianne Hundt and Elena Callegaro at the English Department of the University of Zurich.

[2] https://translate.google.com; at the time of writing about half of the languages supported by Google Translate are handled with NMT and the rest still with SMT.
[3] https://demo-pnmt.systran.net; see also (Crego et al., 2016)
[4] https://www.deepl.com

References

Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego.

Rajen Chatterjee, Matteo Negri, Marco Turchi, Marcello Federico, Lucia Specia, and Frédéric Blain. 2017. Guiding neural machine translation decoding with external knowledge. In Proceedings of the Conference on Machine Translation (WMT), pages 157-168, Copenhagen.

Josep Maria Crego et al. 2016. Systran's pure neural machine translation systems. arXiv:1610.05540.

Sarah Ebling, Rico Sennrich, David Klaper, and Martin Volk. 2011. Digging for names in the mountains: Combined person name recognition and reference resolution for German alpine texts. In Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznan.

Anne Göhring and Martin Volk. 2011. The Text+Berg corpus: An alpine French-German parallel resource. In Proceedings of Traitement Automatique des Langues Naturelles (TALN 2011), Montpellier.

Johannes Graën, Dolores Batinic, and Martin Volk. 2014. Cleaning the Europarl corpus for linguistic applications. In Proceedings of KONVENS, Hildesheim.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016a. Is neural machine translation ready for deployment? A case study on 30 translation directions. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Tokyo, Japan, December.

Marcin Junczys-Dowmunt, Bruno Pouliquen, and Christophe Mazenc. 2016b. COPPA v2.0: Corpus of parallel patent applications building large parallel corpora with GNU make. In Proceedings of the Workshop on Challenges in the Management of Large Corpora at LREC, Portorož, Slovenia, May.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X, Phuket.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia.

Ryan McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics, 37(1).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, Berlin. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the ACL, pages 1715-1725, Berlin.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Daniel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of LREC, Genoa.

Martin Volk, Johannes Graën, and Elena Callegaro. 2014. Innovations in parallel corpus search tools. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), Reykjavik.

Martin Volk, Chantal Amrhein, Noëmi Aepli, Mathias Müller, and Phillip Ströbel. 2016a. Building a parallel corpus on the world's oldest banking magazine. In Proceedings of KONVENS, Bochum.

Martin Volk, Simon Clematide, Johannes Graën, and Phillip Ströbel. 2016b. Bi-particle adverbs, PoS-tagging and the recognition of German separable prefix verbs. In Proceedings of KONVENS, Bochum.

Yonghui Wu et al. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv e-prints, September.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia.