
Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch

Year: 2018

Parallel Corpora, Terminology Extraction and Machine Translation

Volk, Martin

Abstract: In this paper we first give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as Part-of-Speech tagging, lemmatization and dependency parsing, but we also introduce language-specific methods, e.g. for dealing with split verbs or truncated compounds in German. We argue for careful sentence and word alignment for parallel corpora. And we explain how word alignment is the basis for a wide range of applications from translation variant ranking to terminology extraction. We conclude with a discussion of the latest developments in Machine Translation.

Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-150769 Conference or Workshop Item Published Version

Originally published at: Volk, Martin (2018). Parallel Corpora, Terminology Extraction and Machine Translation. In: 16. DTT-Symposion. Terminologie und Text(e), Mannheim, 22 March 2018 - 24 March 2018, 3-14.

Parallel Corpora, Terminology Extraction and Machine Translation

Martin Volk
University of Zurich
Institute of Computational Linguistics
[email protected]

Abstract

In this paper we first give an overview of parallel corpus annotation, alignment and retrieval. We present standard annotation methods such as Part-of-Speech tagging, lemmatization and dependency parsing, but we also introduce language-specific methods, e.g. for dealing with split verbs or truncated compounds in German. We argue for careful sentence and word alignment for parallel corpora. And we explain how word alignment is the basis for a wide range of applications from translation variant ranking to terminology extraction. We conclude with a discussion of the latest developments in Machine Translation.

1 Introduction

In recent years an increasing number of large parallel corpora have become available for research in natural language processing. The best known is Europarl with the proceedings of the European parliament (Koehn, 2005; Graën et al., 2014) with around 50 million tokens in the languages of the European Union. Other well-known multilingual and multiparallel corpora are JRC Acquis (Steinberger et al., 2006) with the EU law collection, OpenSubtitles (Lison and Tiedemann, 2016), United Nations documents (Ziemski et al., 2016), and collections of patent applications (Junczys-Dowmunt et al., 2016b).

Switzerland is a country with four official languages (French, German, Italian and Rumansh), and because of the many international companies and organizations in Switzerland English is becoming ever more popular. Therefore there is a constant need for translation between these languages, and this is a natural basis for a plethora of parallel corpora. We have taken advantage of this situation and collected and annotated various Swiss parallel corpora.

Our research group specializes in building parallel corpora for special domains which span over time: We have digitized parallel texts from the Swiss Alpine Club in French and German from 1957 until today (Göhring and Volk, 2011), banking texts in English, French, German and Italian from 1895 up to the present (Volk et al., 2016a), and the announcements of the Swiss federal government (DE: Bundesblatt, FR: Feuille fédérale, IT: Foglio federale) since 1849.

In the current paper we will focus on the latest methods in the automatic annotation and alignment of parallel corpora. We will argue that word alignment across languages improves annotation. We focus on parallel corpora for linguistic and translation studies, but we believe that parallel corpus search systems are also interesting for language learners for viewing translation variants in context and for terminologists who want to extract terms or verify domain-specific language usage.

The paper is structured as follows. In section 2 we describe corpus annotation methods. Section 3 is devoted to alignment techniques and their benefits for corpus annotation such as word sense disambiguation with practical applications in lemma disambiguation and named entity recognition. In section 4 we give usage examples of our parallel corpora including translation discovery and translation error detection. Section 5 describes the latest developments in using parallel corpora for machine translation.

Figure 1: A Multilingwis query with hits in six Europarl languages

2 Corpus Annotations

Corpus building starts with corpus collection, cleaning and tokenization. The latter is often language-specific, and therefore multilingual corpora require language identification. Typically identification is done on the sentence level. For each sentence we compute the language in order to be able to use the appropriate processing tools during corpus annotation. Using, for instance, a Part-of-Speech tagger that was trained for one language for the annotation of a sentence in another language will give erratic results. Therefore language identification on the sentence level is of paramount importance for all texts with mixed languages.

2.1 General Corpus Annotation

Part-of-Speech tagging is a standard procedure when the corpora are meant for linguistic research. There are a number of PoS taggers available with parameter files for many large languages of Europe. Most of the time the parameter files are the result of training the taggers on newspaper texts. This means that the taggers work best on newspaper texts and gradually worse the more the corpus material differs from newspapers.

Often PoS tagging also provides lemmas. For example, the TreeTagger outputs lemmas for word form - lemma pairs that it has seen in its training corpus. For all other word forms the corpus builder may provide the tagger with additional word form - lemma pairs. These pairs lack the probability information that tagger training derives from a manually annotated corpus, but for word forms with little or no PoS ambiguity the extension of the tagger lexicon is still useful. So, any additional lemma information from other corpora or from dictionaries or morphological analyzers is valuable.

Recent parallel corpora have been annotated with more annotation layers: Named entity recognition (NER) is a popular method for a first step towards semantic annotation. Typically it involves the recognition of person names, location names and organization names, which are the central classes when processing newspaper texts. Special text types may require other name classes (e.g. event names) or more fine-grained distinctions. For example, in our parallel corpus of Alpine texts we sub-classify toponyms into the name classes of mountains, glaciers, lakes, valleys, cabins, and cities. Toponyms are essential for the mountaineering reports and therefore very frequent.

Shallow NER includes only the recognition and classification of names. A deeper analysis includes co-reference resolution (Ebling et al., 2011) and entity linking (sometimes called grounding). Monolingual co-reference resolution will deal with mention variants like Grand Combin = Combin, while multilingual co-references will catch translation variants as e.g. DE: Matterhorn = FR: Cervin = IT: Cervino. Monolingual co-references might include anaphora resolution and will thus allow for investigating coherence phenomena in texts.

Lately, dependency parsing has become available for many languages (e.g. Maltparser, Spacy, Stanford, ...). These parsers allow for the efficient analysis of large corpora with a labeled attachment score of 80-90% (McDonald and Nivre, 2011) and higher values for unlabeled attachment. Even though parsing is far from perfect, the automatically assigned information opens a whole new chapter for corpus studies. For example, searches for verb-object relations no longer need to speculate on co-occurrence in some arbitrary range, but can be conditioned on parsing evidence. In this way, we find candidates for support verb constructions like to take into consideration or for verb sub-categorizations for particular prepositional objects like to wait for.
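The sentence-level language identification argued for above can be illustrated with a crude function-word heuristic. This is a toy sketch, not a production method: the stopword lists below are invented and far too small, and real pipelines use trained character n-gram classifiers, but the per-sentence decision is the same.

```python
# Toy sentence-level language identifier: score each sentence against
# tiny, hand-picked function-word lists and pick the best match.
# The lists below are illustrative assumptions, not a real resource.

STOPWORDS = {
    "de": {"der", "die", "das", "und", "nicht", "ist", "ein", "aber"},
    "en": {"the", "and", "not", "is", "a", "of", "to", "in"},
    "fr": {"le", "la", "les", "et", "ne", "est", "un", "une"},
}

def identify_language(sentence: str) -> str:
    """Return the language code whose function words match most tokens."""
    tokens = sentence.lower().split()
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(identify_language("Das ist nicht der Fehler"))       # -> de
print(identify_language("The mistake is not in the text")) # -> en
```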
2.2 Language-specific Corpus Annotation

In addition to these general annotation considerations, many languages will have specific requirements. Compounding languages like German or Swedish will profit from compound segmentation in lemmatization. E.g. a German text might mention the compounds Montblanc-Besteigung, Mont-Blanc-Expedition, Montblancgipfel with or without a hyphen, and we will miss the mountain name Montblanc if we do not split these compounds and normalize the spelling variants. Splitting and normalization also facilitate cross-lingual word alignment since they reduce 1-to-many alignments.

Another example of a language-specific annotation is the re-attachment of German verb prefixes that occur separated in the sentence. In example sentence 1 the prefix auf (EN: on) needs to be re-attached to the verb stem fällt (EN: falls) in order to compute the correct verb lemma auffallen (EN: to strike, to notice) (Volk et al., 2016b).

(1) Selber fällt mir der kleine Fehler aber kaum auf.
EN: However I do not notice the little mistake.

Our re-attachment is based on PoS tags and re-attaches the separated prefix to the most recent preceding finite verb form when this results in a valid German prefix verb (from a manually curated list of about 8000 such verbs). It works with 96.8% precision when evaluated against manually re-attached prefixes in the TüBa-D/Z.

2.3 Exploiting Parallel Corpora for Annotation

Traditionally most corpus annotation is done monolingually. This means that PoS tagging, lemmatization and parsing of e.g. a German corpus is done irrespective of a translation in English or any other language. However the parallel text may help to disambiguate and thus to improve the annotation precision on many levels. Most obviously the parallel corpus may help to determine the correct word sense in a given sentence. For example, the word Mönch in our Alpine corpus may refer to a prominent mountain in central Switzerland or to a monk (= male person in a monastery). If the corresponding sentence in the English or French translation also contains the word Mönch, then it is clear that the ambiguous word in the German sentence refers to the mountain name.

We have developed a similar kind of disambiguation for lemmas. For example, the German word form gehört may have the lemma hören (EN: to hear) or gehören (EN: to belong). Depending on the corresponding sentence in English or any other language we can easily compute the correct lemma for the ambiguous word in the German sentence (Volk et al., 2016a). Of course, this kind of knowledge transfer between the languages in a parallel corpus presupposes word alignments across the languages.
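Once word alignment is available, the alignment-based lemma disambiguation described in section 2.3 amounts to a lookup: the English lemma aligned to the ambiguous German form selects the German lemma. A minimal sketch with an invented toy table (not the actual resource of Volk et al., 2016a):

```python
# Cross-lingual lemma disambiguation sketch: the German form "gehört"
# is ambiguous between hören (to hear) and gehören (to belong); the
# word-aligned English lemma decides. The table is assumed toy data.

LEMMA_TABLE = {
    "gehört": {"hear": "hören", "belong": "gehören"},
}

def disambiguate_lemma(german_form: str, aligned_english_lemma: str) -> str:
    options = LEMMA_TABLE.get(german_form.lower())
    if options and aligned_english_lemma in options:
        return options[aligned_english_lemma]
    return german_form  # fall back to the surface form

print(disambiguate_lemma("gehört", "belong"))  # -> gehören
print(disambiguate_lemma("gehört", "hear"))    # -> hören
```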
3 Aligning Parallel Corpora

Document alignment is the starting point of all alignment activities. If a corpus is built on OCRed text or on web-crawled text, then document alignment requires article boundary detection and subsequent document alignment based on properties such as author names, article lengths and publication dates.

The next step is sentence alignment, which is a precondition for any exploitation of parallel corpora. Sentence alignment can be computed efficiently based on length comparisons (based on character counts), co-occurring numbers, names and cognates. Hun-Align is a popular sentence aligner that implements a two-pass algorithm: it does a rough alignment and builds a bilingual dictionary in the first pass and uses this dictionary in the second pass for refined sentence alignment. It works well for parallel texts that have corresponding sentences in the same order (monotonicity condition) and with few omissions.

Finally we compute word alignments through GIZA, the Berkeley Aligner or FastAlign. Word alignment finds corresponding words or phrases in aligned sentence pairs. It can be computed on word forms or lemmas. Word alignment enables e.g. sorting the hits by translation variants (which may correspond to different word senses). It also enables annotation transfer (e.g. transferring name classes across languages) and cross-language disambiguation. Word alignment has opened many new avenues for linguistic research and translation studies over parallel corpora.
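The length-based sentence alignment mentioned above can be sketched as dynamic programming over character counts. This is a bare-bones illustration only (1-1 matches plus skips, with an arbitrarily assumed skip penalty); real aligners such as Hun-Align also handle 1-2/2-1 merges and add the dictionary-based second pass:

```python
# Bare-bones length-based sentence aligner: dynamic programming over
# 1-1 matches and skips, scoring a pair by the difference of its
# character counts. SKIP is an arbitrary assumed penalty.

def align(src: list[str], tgt: list[str]) -> list[tuple[int, int]]:
    SKIP = 50
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:  # 1-1 bead
                c = cost[i][j] + abs(len(src[i]) - len(tgt[j]))
                if c < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1], back[i + 1][j + 1] = c, (i, j)
            if i < n and cost[i][j] + SKIP < cost[i + 1][j]:  # skip source
                cost[i + 1][j], back[i + 1][j] = cost[i][j] + SKIP, (i, j)
            if j < m and cost[i][j] + SKIP < cost[i][j + 1]:  # skip target
                cost[i][j + 1], back[i][j + 1] = cost[i][j] + SKIP, (i, j)
    pairs, ij = [], (n, m)
    while ij != (0, 0):  # trace back, keeping only the 1-1 beads
        pi, pj = back[ij[0]][ij[1]]
        if ij == (pi + 1, pj + 1):
            pairs.append((pi, pj))
        ij = (pi, pj)
    return pairs[::-1]

src = ["Guten Morgen.", "Das Wetter ist heute sehr schön.", "Danke."]
tgt = ["Good morning.", "The weather is very nice today."]
print(align(src, tgt))  # -> [(0, 0), (1, 1)]
```

The third source sentence has no counterpart and is left unaligned, which is exactly the "few omissions" case the text mentions.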
4 Retrieval from Parallel Corpora

The Corpus Query Workbench has become a de-facto standard for retrieval from annotated monolingual corpora. It allows simple and complex queries over words and their associated features (like PoS tags, lemmas, name classes etc.). There is no such standard retrieval tool for parallel corpora.

Different commercial web sites offer searches over parallel corpora as a substitute or complement to dictionary searches. Most notably these are Linguee, Glosbe and Tradooit (Volk et al., 2014). But on these sites the texts do not have any linguistic annotation.

SketchEngine is one of the few search systems that allows for query conditions to be specified over both sentences in a parallel sentence-aligned corpus. But it does not exploit automatically computed word alignment.

We are working towards such a flexible and powerful retrieval tool for parallel corpora. Our prototype system, Multilingwis [1], allows for word form or lemma searches, for fixed sequence and bag-of-word searches and for the exclusion of function words from queries. We have also included the option to filter for source language. In this case, the query (in any of the supported languages) results only in hits with utterances that originate in a specific source language. For example, I could search for German Binnenmarkt only in those cases where the original utterance is in French. This allows the researcher to distinguish between searches in original texts versus translations.

[1] pub.cl.uzh.ch/purl/multilingwis

In a second prototype we have experimented with database searches over multi-parallel texts including displays of the hits as parallel dependency trees with PoS tags and word alignment (see figure 2).

Figure 2: Query result for "single market" on Europarl German-English with full annotation display

Word alignment provides the basis for many application scenarios. For example, word-aligned corpora allow for translation discovery. We built a parallel corpus English-German of film and TV subtitles. When we queried for the German word fragen we found the obvious English translation variants to ask, to say, to wonder, to question (in this order of frequency). But next in the ranked list we found to go, which looked like an alignment error at first sight. But on closer inspection we discovered that this is a real translation option for German fragen as in the following example sentences.

(2) DE: ... und sie fragte "Was ist das?".
EN: ... and she goes, "What's that?"

(3) DE: Ich war jung. Ich fragte "Wo ist England?".
EN: I was young. I went, "Where's England?"

A special case of translation discovery is translation error detection. For example, we checked for translation variants of month names. When we queried for English July we found that in about 1% of the translations it is erroneously translated with German Juni, French Juin, and Spanish Junio, all meaning June. We observed this confusion in many of our parallel corpora. Obviously the similarity of the month names June and July is disturbing for the human translators.

Another application of parallel corpora is synonym detection through mirroring. Starting from a query in one language and then querying back from a translation hit in another language will lead to synonyms in the starting language. For example, when we searched for the idiomatic German adverb phrase klipp und klar, we found that quite clear and very clearly are among the top English translation variants in Europarl (cf. figure 1). Now, if we query for very clearly in the opposite direction, we get German sehr deutlich, ganz klar and ganz eindeutig, which are synonymous expressions for the initial query phrase klipp und klar.

Queries over parallel corpora provide translation variant ranking based on corpus frequencies. These frequencies are obviously dependent on the corpus (more precisely on the textual domain). For example, when querying for the German word Leiter (EN: leader or ladder or electrical conductor) we get different rankings for the French translation variants in our Alpine corpus in comparison to our corpus of Swiss laws. In the Alpine corpus the French échelle (EN: ladder) is ranked second after chef, whereas in the Swiss laws in French échelle is only on fifth place after conducteur, directeur, chef, responsable.
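Translation variant ranking and the mirroring trick for synonym detection both reduce to frequency counts over word-aligned phrase pairs. A minimal sketch; the aligned pairs below are invented stand-ins for real Europarl alignments:

```python
# Translation variant ranking over word-aligned phrase pairs, plus the
# mirroring step for synonym detection. Invented toy alignments below.

from collections import Counter

ALIGNED = [  # (German phrase, English phrase) pairs from word alignment
    ("klipp und klar", "quite clear"), ("klipp und klar", "very clearly"),
    ("klipp und klar", "quite clear"), ("ganz klar", "very clearly"),
    ("sehr deutlich", "very clearly"), ("ganz eindeutig", "very clearly"),
]

def variants(query: str, source_side: int) -> list[tuple[str, int]]:
    """Rank the phrases aligned to `query`, most frequent first."""
    other = 1 - source_side
    counts = Counter(pair[other] for pair in ALIGNED
                     if pair[source_side] == query)
    return counts.most_common()

print(variants("klipp und klar", 0))  # German -> English ranking
print(variants("very clearly", 1))    # mirror step: back to German
```

Querying German-to-English ranks quite clear first; mirroring back from very clearly surfaces the synonymous German phrases, as described in the text.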
5 Machine Translation

Machine Translation research dates back to the 1950s. The history has been told in many publications and can therefore be cut short here.

5.1 Old-style Machine Translation

The first approach to MT was based on manually crafted rules for the automatic analysis of the source sentence, for the transfer to the target language and for the generation of the output sentence on the target side. This was a cumbersome and time-consuming process that relied on large bilingual dictionaries and complex rules. Some rule-based MT systems achieved good translation quality for limited topical domains. The most severe limitation of rule-based MT was the lack of a good ranking between possible translation variants and the huge effort needed for adapting an MT system from one domain to another.

These issues were partially remedied through Statistical Machine Translation (SMT), which became prominent in the 1990s and reached its breakthrough a decade later. SMT learns translations from large parallel corpora. Bilingual dictionaries and grammar rules are no longer needed. When a parallel corpus of sufficient size and quality is available and aligned, the SMT system can be built in a matter of days. In such a way Google Translate was able to build up from few languages in 2005 to more than 100 languages to date. Others have built domain-specific systems that outperformed Google Translate in specific topical domains and genres such as TV subtitles. In recent years pure SMT has hit a quality ceiling, in particular when translating into morphologically rich languages and between languages with distinctly differing word order.

5.2 Neural Machine Translation

Neural Machine Translation achieved its breakthrough at the WMT Conference 2016 in Berlin when Edinburgh University's NMT systems ranked first for many language pairs (Sennrich et al., 2016a). In the same year Google published a seminal paper (Wu et al., 2016) demonstrating a jump in MT quality with the introduction of Neural MT.

Machine learning based on neural networks has been discussed for many years. However progress was limited since it is computationally expensive and thus requires powerful hardware. With the advent of modern GPUs the use of neural networks has become wide-spread and is now known as Deep Learning.

Neural Machine Translation is based on a so-called encoder-decoder network. It consists of two recurrent neural networks: one is an encoder which reads the source sentence and produces a sequence of hidden states, and the second is a decoder which predicts each word in the target sentence based on the previously produced target words and a source context, either the last hidden state of the encoder, or a weighted average of the source states in attentional models (Bahdanau et al., 2015).

A central notion for NMT is word embeddings, i.e. a mechanism to convert words into numbers so that the neural network can process them. Through word embeddings semantically similar words get similar numerical values. For example, house and building will be positioned close to each other in the meaning space, whereas house and cloud will be further apart.

NMT operates with a fixed vocabulary. But real-world translation has to deal with new words constantly. To date the most successful solution to this problem is to translate via subword units (Sennrich et al., 2016b). The intuition is that various types of words can be translated via smaller units than words, e.g. names can be translated via transliteration, compounds can be handled via splitting and translating the parts, and cognates and loanwords can be processed via phonological and morphological transformations. The subword method reduces the number of out-of-vocabulary words considerably and is currently used in most NMT systems.

As a result NMT output is much more fluent than SMT output. NMT is essentially a powerful language model (on the target language) which is triggered by the source language. Moreover NMT captures more context in the source sentence. The translation choice of every word in the source sentence is conditioned on all other words in the sentence (whereas in SMT this conditioning was limited to adjacent phrase pairs).

Still, many MT problems persist even in NMT. How can we ensure terminology-consistent translation? Many large companies and organizations have compiled terminology databases to ensure the unambiguous and consistent translation of technical terms. Chatterjee et al. (2017) have suggested an approach to guide NMT decoding based on external terminology lists which works both for words that are known to the NMT system and unknown ones.

Another concern is missing words in the translation. Currently there is no guarantee that the NMT output in the target language has translated all pieces of information from the source. Sometimes there are omissions of modifiers and even negation particles that change the meaning of the output sentence considerably. Koehn and Knowles (2017) discuss more challenges in NMT including larger amounts of training texts needed, difficulties in translating highly inflected words, and worse translation results on very long sentences.
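The subword method of Sennrich et al. (2016b) builds on byte-pair encoding. A toy version of the merge-learning step is sketched below for illustration only; real systems learn tens of thousands of merges from the full training corpus, and the small weighted word list is invented data:

```python
# Toy byte-pair-encoding learner: repeatedly merge the most frequent
# adjacent symbol pair in a frequency-weighted vocabulary. The corpus
# below is a small invented word list, not real training data.

from collections import Counter

def learn_bpe(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    vocab = {tuple(w): f for w, f in words.items()}  # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

corpus = {"lower": 5, "low": 7, "newest": 6, "widest": 3}
print(learn_bpe(corpus, 3))  # -> [('l', 'o'), ('lo', 'w'), ('e', 's')]
```

Frequent character sequences grow into larger units, so common words survive whole while rare words fall apart into translatable pieces.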
Prominent NMT applications are Google Translate [2], Systran Pure NMT [3], and DeepL [4]. In particular the latter - though currently only available for 7 languages - has attracted much attention because of its high translation quality.

In the translation industry, the increased fluency of NMT output has been greeted as a welcome improvement (Junczys-Dowmunt et al., 2016a). However, it is also acknowledged that this requires even more attention and a higher cognitive workload for professional translators who are post-editing MT output. The errors of the MT system are now even more difficult to spot.

6 Conclusion

We have outlined a number of issues for the annotation and alignment of parallel corpora. We have argued that there are standard corpus annotation tools that work for many languages, and there are language-specific annotations that, for example, deal with separated verb prefixes in German. In addition, parallel corpora require automatic alignment not only on the sentence level but also for words and phrases. This level of alignment opens up many new applications such as translation discovery and term extraction.

Automatic word alignment is the offspring of work on the development of statistical machine translation systems. SMT has been the dominant paradigm in MT research for the last 25 years. Recently it has been superseded by neural machine translation, which produces better and in particular more fluent translations.

Acknowledgments

We would like to thank Johannes Graën for preparing the multiparallel Europarl corpus and Christof Bless for the visualizations of parallel parsed sentences. This research was supported by the Swiss National Science Foundation under grant 105215 146781 for "SPARCLING: Large Scale PARallel Corpora for LINGuistic Investigation" (2013-2017), a joint project with Marianne Hundt and Elena Callegaro at the English Department of the University of Zurich.

[2] https://translate.google.com; at the time of writing about half of the languages supported by Google Translate are handled with NMT and the rest still with SMT.
[3] https://demo-pnmt.systran.net; see also (Crego et al., 2016)
[4] https://www.deepl.com

References

Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego.

Rajen Chatterjee, Matteo Negri, Marco Turchi, Marcello Federico, Lucia Specia, and Frédéric Blain. 2017. Guiding neural machine translation decoding with external knowledge. In Proceedings of the Conference on Machine Translation (WMT), pages 157-168, Copenhagen.

Josep Maria Crego et al. 2016. Systran's pure neural machine translation systems. arXiv:1610.05540.

Sarah Ebling, Rico Sennrich, David Klaper, and Martin Volk. 2011. Digging for names in the mountains: Combined person name recognition and reference resolution for German alpine texts. In Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznan.

Anne Göhring and Martin Volk. 2011. The Text+Berg corpus: An alpine French-German parallel resource. In Proceedings of Traitement Automatique des Langues Naturelles (TALN 2011), Montpellier.

Johannes Graën, Dolores Batinic, and Martin Volk. 2014. Cleaning the Europarl corpus for linguistic applications. In Proceedings of KONVENS, Hildesheim.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang. 2016a. Is neural machine translation ready for deployment? A case study on 30 translation directions. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), Tokyo, Japan, December.

Marcin Junczys-Dowmunt, Bruno Pouliquen, and Christophe Mazenc. 2016b. COPPA v2.0: Corpus of parallel patent applications building large parallel corpora with GNU make. In Proceedings of the Workshop on Challenges in the Management of Large Corpora at LREC, Portorož, Slovenia, May.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X, Phuket.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia.

Ryan McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics, 37(1).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, Berlin. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the ACL, pages 1715-1725, Berlin.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, and Daniel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of LREC, Genoa.

Martin Volk, Johannes Graën, and Elena Callegaro. 2014. Innovations in parallel corpus search tools. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC), Reykjavik.

Martin Volk, Chantal Amrhein, Noëmi Aepli, Mathias Müller, and Phillip Ströbel. 2016a. Building a parallel corpus on the world's oldest banking magazine. In Proceedings of KONVENS, Bochum.

Martin Volk, Simon Clematide, Johannes Graën, and Phillip Ströbel. 2016b. Bi-particle adverbs, PoS-tagging and the recognition of German separable prefix verbs. In Proceedings of KONVENS, Bochum.

Yonghui Wu et al. 2016. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv e-prints, September.

Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), Portorož, Slovenia.