Language-Independent Bilingual Terminology Extraction from a Multilingual Parallel Corpus

Els Lefever (1,2), Lieve Macken (1,2) and Veronique Hoste (1,2)
(1) LT3, School of Translation Studies, University College Ghent, Groot-Brittanniëlaan 45, 9000 Gent, Belgium
(2) Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281-S9, 9000 Gent, Belgium
{Els.Lefever, Lieve.Macken, Veronique.Hoste}@hogent.be

Abstract

We present a language-pair independent terminology extraction module that is based on a sub-sentential alignment system that links linguistically motivated phrases in parallel texts. Statistical filters are applied on the bilingual list of candidate terms that is extracted from the alignment output.

We compare the performance of both the alignment and terminology extraction module for three different language pairs (French-English, French-Italian and French-Dutch) and highlight language-pair specific problems (e.g. the different compounding strategy in French and Dutch). Comparisons with standard terminology extraction programs show an improvement of up to 20% for bilingual terminology extraction and competitive results (85% to 90% accuracy) for monolingual terminology extraction, and reveal that the linguistically based alignment module is particularly well suited for the extraction of complex multiword terms.

1 Introduction

Automatic Term Recognition (ATR) systems are usually categorized into two main families. On the one hand, the linguistically-based or rule-based approaches use linguistic information such as PoS tags, chunk information, etc. to filter out stop words and restrict candidate terms to predefined syntactic patterns (Ananiadou, 1994; Dagan and Church, 1994). On the other hand, the statistical corpus-based approaches select n-gram sequences as candidate terms that are filtered by means of statistical measures. More recent ATR systems use hybrid approaches that combine both linguistic and statistical information (Frantzi and Ananiadou, 1999).

Most bilingual terminology extraction systems first identify candidate terms in the source language based on predefined source patterns, and then select translation candidates for these terms in the target language (Kupiec, 1993).

We present an alternative approach that generates candidate terms directly from the aligned words and phrases in our parallel corpus. In a second step, we use frequency information from a general-purpose corpus and the n-gram frequencies of the automotive corpus to determine term specificity. Our approach is more flexible in the sense that we do not first generate candidate terms based on language-dependent predefined PoS patterns (e.g. for French, N N, N Prep N and N Adj are typical patterns), but immediately link linguistically motivated phrases in our parallel corpus based on lexical correspondences and syntactic similarity.

This article reports on the term extraction experiments for three language pairs, i.e. French-Dutch, French-English and French-Italian. The focus was on the extraction of automotive lexicons.

The remainder of this paper is organized as follows: Section 2 describes the corpus. In Section 3 we present our linguistically-based sub-sentential alignment system and in Section 4 we describe how we generate and filter our list of candidate terms. We compare the performance of our system with both bilingual and monolingual state-of-the-art terminology extraction systems. Section 5 concludes this paper.

Proceedings of the 12th Conference of the European Chapter of the ACL, pages 496–504, Athens, Greece, 30 March – 3 April 2009. © 2009 Association for Computational Linguistics
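The pattern-based baseline contrasted in the introduction can be sketched as a toy implementation. This is an illustration of the general technique, not the authors' system; the tag names and the example sentence are invented for demonstration:

```python
# Toy illustration of PoS-pattern-based candidate term extraction:
# candidates are token n-grams whose PoS sequence matches one of the
# predefined French patterns mentioned in the text (N N, N Prep N, N Adj).

PATTERNS = [("N", "N"), ("N", "Prep", "N"), ("N", "Adj")]

def extract_candidates(tagged_sentence):
    """Return all token spans whose PoS tags match one of PATTERNS."""
    tokens = [tok for tok, _ in tagged_sentence]
    tags = [tag for _, tag in tagged_sentence]
    candidates = []
    for pattern in PATTERNS:
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                candidates.append(" ".join(tokens[i:i + n]))
    return candidates

# Invented example: "pédale de embrayage" matches the N Prep N pattern.
sentence = [("pédale", "N"), ("de", "Prep"), ("embrayage", "N")]
print(extract_candidates(sentence))  # ['pédale de embrayage']
```

The point of contrast is that such patterns must be redefined for every language, whereas the alignment-based approach below needs no pattern inventory.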
2 Corpus

The focus of this research project was on the automatic extraction of 20 bilingual automotive lexicons. All work was carried out in the framework of a customer project for a major French automotive company. The final goal of the project is to improve vocabulary consistency in technical texts across the 20 languages in the customer's portfolio. The French database contains about 400,000 entries (i.e. sentences and parts of sentences with an average length of 9 words) and the translation percentage of the database into 19 languages depends on the target market.

For the development of the alignment and terminology extraction module, we created three parallel corpora (Italian, English, Dutch) with French as a central language. Figures about the size of each parallel corpus can be found in Table 1.

Target Lang.      # Sentence pairs   # Words
French-Italian    364,221            6,408,693
French-English    363,651            7,305,151
French-Dutch      364,311            7,100,585

Table 1: Number of sentence pairs and total number of words in the three parallel corpora

2.1 Preprocessing

We PoS-tagged and lemmatized the French, English and Italian corpora with the freely available TreeTagger tool (Schmid, 1994) and we used TadPole (Van den Bosch et al., 2007) to annotate the Dutch corpus.

In a next step, chunk information was added by a rule-based language-independent chunker (Macken et al., 2008) that contains distituency rules, which implies that chunk boundaries are added between two PoS codes that cannot occur in the same constituent.

2.2 Test and development corpus

As we presume that sentence length has an impact on the alignment performance, and thus on term extraction, we created three test sets with varying sentence lengths. We distinguished short sentences (2-7 words), medium-length sentences (8-19 words) and long sentences (> 19 words). Each test corpus contains approximately 9,000 words; the number of sentence pairs per test set can be found in Table 2. We also created a development corpus with sentences of varying length to debug the linguistic processing and the alignment module as well as to define the thresholds for the statistical filtering of the candidate terms (see 4.1).

                          # Words    # Sentence pairs
Short (< 8 words)         ± 9,000    823
Medium (8-19 words)       ± 9,000    386
Long (> 19 words)         ± 9,000    180
Development corpus        ± 5,000    393

Table 2: Number of words and sentence pairs in the test and development corpora

3 Sub-sentential alignment module

As the basis for our terminology extraction system, we used the sub-sentential alignment system of Macken and Daelemans (2009), which links linguistically motivated phrases in parallel texts based on lexical correspondences and syntactic similarity. In the first phase of this system, anchor chunks are linked, i.e. chunks that can be linked with a very high precision. We think these anchor chunks offer a valid and language-independent alternative to identifying candidate terms based on predefined PoS patterns. As the automotive corpus contains rather literal translations, we expect that a high percentage of anchor chunks can be retrieved.

Although the architecture of the sub-sentential alignment system is language-independent, some language-specific resources are used: first, a bilingual lexicon to generate the lexical correspondences, and second, tools to generate additional linguistic information (a PoS tagger, a lemmatizer and a chunker). The sub-sentential alignment system takes as input sentence-aligned texts, together with the additional linguistic annotations for the source and the target texts.

The source and target sentences are divided into chunks based on PoS information, and lexical correspondences are retrieved from a bilingual dictionary. In order to extract bilingual dictionaries from the three parallel corpora, we used the Perl implementation of IBM Model One that is part of the Microsoft Bilingual Sentence Aligner (Moore, 2002).
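The idea behind deriving a bilingual lexicon with IBM Model One can be sketched as a minimal EM loop. This is a simplified illustration (uniform initialization, no NULL word) rather than the Perl implementation the authors used, and the toy sentence pairs are invented:

```python
from collections import defaultdict

def ibm_model1(sentence_pairs, iterations=10):
    """Estimate word-translation probabilities t(target | source) with a
    minimal IBM Model One EM loop (uniform init, no NULL word)."""
    tgt_vocab = {w for _, tgt in sentence_pairs for w in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # keyed by (tgt_word, src_word)
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts (E-step)
        total = defaultdict(float)   # per-source-word normalizers
        for src, tgt in sentence_pairs:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    c = t[(tw, sw)] / norm
                    count[(tw, sw)] += c
                    total[sw] += c
        for (tw, sw), c in count.items():  # M-step: renormalize
            t[(tw, sw)] = c / total[sw]
    return t

# Two invented sentence pairs; co-occurrence lets EM pin "la" to "the".
pairs = [(["la", "pédale"], ["the", "pedal"]),
         (["la", "porte"], ["the", "door"])]
t = ibm_model1(pairs)
# t[("the", "la")] converges towards 1.0 as iterations increase
```

Pairs such as ("pedal", "pédale") end up with high probability because "the" is increasingly explained by "la", which is the kind of one-to-one lexical correspondence the alignment module exploits.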
In order to link chunks based on lexical clues and chunk similarity, the following steps are taken for each sentence pair:

1. Creation of the lexical link matrix
2. Linking chunks based on lexical correspondences and chunk similarity
3. Linking remaining chunks

3.1 Lexical Link Matrix

For each source and target word, all translations for the word form and the lemma are retrieved from the bilingual dictionary. In the process of building the lexical link matrix, function words are neglected. For all content words, a lexical link is created if a source word occurs in the set of possible translations of a target word, or if a target word occurs in the set of possible translations of the source words. Identical strings in the source and target language are also linked.

3.2 Linking Anchor chunks

Candidate anchor chunks are selected based on the information available in the lexical link matrix. The candidate target chunk is built by concatenating all target chunks from a begin index until an end index. The begin index points to the first target chunk with a lexical link to the source chunk under consideration; the end index points to the last target chunk with a lexical link to the source chunk under consideration. This way, 1:1 and 1:n candidate target chunks are built. The process of selecting candidate chunks as described above is performed a second time starting from the target sentence. This way, additional n:1 candidates are constructed. For each selected candidate pair, a similarity test is performed.

In Figure 1, the chunks [Fr: gradient] – [En: gradient] and the final punctuation mark have been retrieved in the first step as anchor chunks. In the last step, the n:m chunk [Fr: de remontée pédale d'embrayage] – [En: of rising of the clutch pedal] is selected as candidate anchor chunk because it is enclosed within anchor chunks.

Figure 1: n:m candidate chunk: 'A' stands for anchor chunks, 'L' for lexical links, 'P' for words linked on the basis of corresponding PoS codes and 'R' for words linked by language-dependent rules.

As the contextual clues (the left and right neighbours of the additional candidate chunks are anchor chunks) provide some extra indication that
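The lexical link matrix and the begin/end-index selection of a candidate target chunk (Sections 3.1-3.2) can be sketched as follows. This simplified illustration operates on token indices rather than chunk indices, and the dictionary, function-word list and example are invented:

```python
# Simplified sketch of the lexical link matrix (3.1) and begin/end-index
# candidate selection (3.2), over tokens instead of chunks for brevity.

def lexical_link_matrix(src_tokens, tgt_tokens, dictionary, function_words):
    """links[i][j] is True if src_tokens[i] and tgt_tokens[j] are dictionary
    translations or identical strings; function words are neglected."""
    links = [[False] * len(tgt_tokens) for _ in src_tokens]
    for i, s in enumerate(src_tokens):
        if s in function_words:
            continue
        for j, t in enumerate(tgt_tokens):
            if t in function_words:
                continue
            if t in dictionary.get(s, set()) or s == t:
                links[i][j] = True
    return links

def candidate_target_span(links, src_chunk):
    """Begin index = first target position linked to the source chunk,
    end index = last one; everything in between is concatenated."""
    linked = [j for i in src_chunk for j, on in enumerate(links[i]) if on]
    return (min(linked), max(linked)) if linked else None

src = ["pédale", "de", "embrayage"]          # Fr: clutch pedal (invented example)
tgt = ["the", "clutch", "pedal"]
dico = {"pédale": {"pedal"}, "embrayage": {"clutch"}}
links = lexical_link_matrix(src, tgt, dico, function_words={"de", "the"})
print(candidate_target_span(links, src_chunk=[0, 1, 2]))  # (1, 2)
```

The unlinked function words inside the span ("de", and "the" on the target side) are swallowed by the begin/end concatenation, which is how 1:n candidates arise.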