Language-independent bilingual terminology extraction from a multilingual parallel corpus Els Lefever1,2, Lieve Macken1,2 and Véronique Hoste1,2

1LT3, School of Translation Studies, University College Ghent, Groot-Brittanniëlaan 45, 9000 Gent, Belgium
2Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281-S9, 9000 Gent, Belgium

{Els.Lefever, Lieve.Macken, Veronique.Hoste}@hogent.be

Abstract

We present a language-pair independent terminology extraction module that is based on a sub-sentential alignment system that links linguistically motivated phrases in parallel texts. Statistical filters are applied on the bilingual list of candidate terms that is extracted from the alignment output.

We compare the performance of both the alignment and terminology extraction module for three different language pairs (French-English, French-Italian and French-Dutch) and highlight language-pair specific problems (e.g. the different compounding strategy in French and Dutch). Comparisons with standard terminology extraction programs show an improvement of up to 20% for bilingual terminology extraction and competitive results (85% to 90% accuracy) for monolingual terminology extraction, and reveal that the linguistically based alignment module is particularly well suited for the extraction of complex multiword terms.

1 Introduction

Automatic Term Recognition (ATR) systems are usually categorized into two main families. On the one hand, the linguistically-based or rule-based approaches use linguistic information such as PoS tags, chunk information, etc. to filter out stop words and restrict candidate terms to predefined syntactic patterns (Ananiadou, 1994; Dagan and Church, 1994). On the other hand, the statistical corpus-based approaches select n-gram sequences as candidate terms that are filtered by means of statistical measures. More recent ATR systems use hybrid approaches that combine both linguistic and statistical information (Frantzi and Ananiadou, 1999).

Most bilingual terminology extraction systems first identify candidate terms in the source language based on predefined source patterns, and then select translation candidates for these terms in the target language (Kupiec, 1993).

We present an alternative approach that generates candidate terms directly from the aligned phrases in our parallel corpus. In a second step, we use frequency information of a general-purpose corpus and the n-gram frequencies of the automotive corpus to determine the term specificity. Our approach is more flexible in the sense that we do not first generate candidate terms based on language-dependent predefined PoS patterns (e.g. for French, N N, N Prep N and N Adj are typical patterns), but immediately link linguistically motivated phrases in our parallel corpus based on lexical correspondences and syntactic similarity.

This article reports on the term extraction experiments for three language pairs, i.e. French-Dutch, French-English and French-Italian. The focus was on the extraction of automotive lexicons.

The remainder of this paper is organized as follows: Section 2 describes the corpus. In Section 3 we present our linguistically-based sub-sentential alignment system and in Section 4 we describe how we generate and filter our list of candidate terms. We compare the performance of our system with both bilingual and monolingual state-of-the-art terminology extraction systems. Section 5 concludes this paper.

Proceedings of the 12th Conference of the European Chapter of the ACL, pages 496–504, Athens, Greece, 30 March – 3 April 2009. © 2009 Association for Computational Linguistics

2 Corpus

The focus of this research project was on the automatic extraction of 20 bilingual automotive lexicons. All work was carried out in the framework of a customer project for a major French automotive company. The final goal of the project is to improve vocabulary consistency in technical texts across the 20 languages in the customer's portfolio. The French database contains about 400,000 entries (i.e. sentences and parts of sentences with an average length of 9 words) and the translation percentage of the database into 19 languages depends on the target market.

For the development of the alignment and terminology extraction module, we created three parallel corpora (Italian, English, Dutch) with French as a central language. Figures about the size of each parallel corpus can be found in Table 1.

Target Lang.       # Sentence pairs   # Words
French-Italian     364,221            6,408,693
French-English     363,651            7,305,151
French-Dutch       364,311            7,100,585

Table 1: Number of sentence pairs and total number of words in the three parallel corpora

2.1 Preprocessing

We PoS-tagged and lemmatized the French, English and Italian corpora with the freely available TreeTagger tool (Schmid, 1994) and we used TadPole (Van den Bosch et al., 2007) to annotate the Dutch corpus.

In a next step, chunk information was added by a rule-based language-independent chunker (Macken et al., 2008) that contains distituency rules, which implies that chunk boundaries are added between two PoS codes that cannot occur in the same constituent.

2.2 Test and development corpus

As we presume that sentence length has an impact on the alignment performance, and thus on term extraction, we created three test sets with varying sentence lengths. We distinguished short sentences (2-7 words), medium-length sentences (8-19 words) and long sentences (> 19 words). Each test corpus contains approximately 9,000 words; the number of sentence pairs per test set can be found in Table 2. We also created a development corpus with sentences of varying length to debug the linguistic processing and the alignment module as well as to define the thresholds for the statistical filtering of the candidate terms (see 4.1).

                       # Words    # Sentence pairs
Short (< 8 words)      ± 9,000    823
Medium (8-19 words)    ± 9,000    386
Long (> 19 words)      ± 9,000    180
Development corpus     ± 5,000    393

Table 2: Number of words and sentence pairs in the test and development corpora

3 Sub-sentential alignment module

As the basis for our terminology extraction system, we used the sub-sentential alignment system of Macken and Daelemans (2009), which links linguistically motivated phrases in parallel texts based on lexical correspondences and syntactic similarity. In the first phase of this system, anchor chunks are linked, i.e. chunks that can be linked with a very high precision. We think these anchor chunks offer a valid and language-independent alternative to identifying candidate terms based on predefined PoS patterns. As the automotive corpus contains rather literal translations, we expect that a high percentage of anchor chunks can be retrieved.

Although the architecture of the sub-sentential alignment system is language-independent, some language-specific resources are used: first, a bilingual lexicon to generate the lexical correspondences and, second, tools to generate additional linguistic information (a PoS tagger, a lemmatizer and a chunker). The sub-sentential alignment system takes as input sentence-aligned texts, together with the additional linguistic annotations for the source and the target texts. The source and target sentences are divided into chunks based on PoS information, and lexical correspondences are retrieved from a bilingual dictionary. In order to extract bilingual dictionaries from the three parallel corpora, we used the Perl implementation of IBM Model One that is part of the Microsoft Bilingual Sentence Aligner (Moore, 2002).

In order to link chunks based on lexical clues and chunk similarity, the following steps are taken for each sentence pair:

1. Creation of the lexical link matrix

2. Linking chunks based on lexical correspondences and chunk similarity

3. Linking remaining chunks

3.1 Lexical Link Matrix

For each source and target word, all translations for the word form and the lemma are retrieved from the bilingual dictionary. In the process of building the lexical link matrix, function words are neglected. For all content words, a lexical link is created if a source word occurs in the set of possible translations of a target word, or if a target word occurs in the set of possible translations of the source words. Identical strings in source and target language are also linked.

3.2 Linking Anchor Chunks

Candidate anchor chunks are selected based on the information available in the lexical link matrix. The candidate target chunk is built by concatenating all target chunks from a begin index until an end index. The begin index points to the first target chunk with a lexical link to the source chunk under consideration; the end index points to the last target chunk with a lexical link to the source chunk under consideration. This way, 1:1 and 1:n candidate target chunks are built. The process of selecting candidate chunks as described above is performed a second time starting from the target sentence. This way, additional n:1 candidates are constructed. For each selected candidate pair, a similarity test is performed. Chunks are considered to be similar if at least a certain percentage of words of the source and target chunk(s) are either linked by means of a lexical link or can be linked on the basis of corresponding part-of-speech codes. The percentage of words that have to be linked was empirically set at 85%.

3.3 Linking Remaining Chunks

In a second step, chunks consisting of one function word – mostly punctuation marks and conjunctions – are linked based on corresponding part-of-speech codes if their left or right neighbour on the diagonal is an anchor chunk. Corresponding final punctuation marks are also linked.

In a final step, additional candidates are constructed by selecting non-anchor chunks in the source and target sentence that have corresponding left and right anchor chunks as neighbours. The anchor chunks of the first step are used as contextual information to link n:m chunks or chunks for which no lexical link was found in the lexical link matrix. As the contextual clues (the left and right neighbours of the additional candidate chunks are anchor chunks) provide some extra indication that the chunks can be linked, the similarity test for the final candidates was somewhat relaxed: the percentage of words that have to be linked was lowered to 0.80 and a more relaxed PoS matching function was used.

In Figure 1, the chunks [Fr: gradient] – [En: gradient] and the final punctuation mark have been retrieved in the first step as anchor chunks. In the last step, the n:m chunk [Fr: de remontée pédale d'embrayage] – [En: of rising of the clutch pedal] is selected as candidate anchor chunk because it is enclosed within anchor chunks.

Figure 1: n:m candidate chunk: 'A' stands for anchor chunks, 'L' for lexical links, 'P' for words linked on the basis of corresponding PoS codes and 'R' for words linked by language-dependent rules.

3.4 Evaluation

To test our alignment module, we manually indicated all translational correspondences in the three test corpora. We used the evaluation methodology of Och and Ney (2003) to evaluate the system's performance. They distinguish sure alignments (S) and possible alignments (P) and introduce the following redefined precision and recall measures (where A refers to the set of alignments):

precision = |A ∩ P| / |A|,   recall = |A ∩ S| / |S|   (1)

and the alignment error rate (AER):

AER(S, P; A) = 1 − (|A ∩ P| + |A ∩ S|) / (|A| + |S|)   (2)
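The three measures above can be computed directly from sets of alignment links; a minimal sketch, where links are (source index, target index) pairs and the toy gold sets are invented for illustration:

```python
def alignment_scores(A, S, P):
    """Precision, recall and AER as redefined by Och and Ney (2003).

    A: set of links produced by the system,
    S: set of sure gold links, P: set of possible gold links (S is a
    subset of P)."""
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    aer = 1 - (len(A & P) + len(A & S)) / (len(A) + len(S))
    return precision, recall, aer

# toy gold standard and system output
S = {(0, 0), (1, 2)}          # sure links
P = S | {(2, 1)}              # possible links include the sure ones
A = {(0, 0), (1, 2), (3, 3)}  # system proposed three links, one wrong
p, r, e = alignment_scores(A, S, P)  # p = 2/3, r = 1.0, e = 0.2
```

Note that, as in Och and Ney's definitions, precision is measured against the possible links while recall is measured against the sure links only.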

Table 3 shows the alignment results for the three language pairs. Macken et al. (2008) showed that the results for French-English were competitive to state-of-the-art alignment systems.

           SHORT              MEDIUM             LONG
           p     r     e      p     r     e      p     r     e
Italian    .99   .93   .04    .95   .89   .08    .95   .89   .07
English    .97   .91   .06    .95   .85   .10    .92   .85   .12
Dutch      .96   .83   .11    .87   .73   .20    .87   .67   .24

Table 3: Precision (p), recall (r) and alignment error rate (e) for our sub-sentential alignment system evaluated on French-Italian, French-English and French-Dutch

As expected, the results show that the alignment quality is closely related to the similarity between languages. As shown in example (1), Italian and French are syntactically almost identical – and hence easier to align; English and French are still close but show some differences (e.g. a different compounding strategy and word order); and French and Dutch present a very different language structure (e.g. in Dutch the different compound parts are not separated by spaces, separable verbs, i.e. verbs with prefixes that are stripped off, occur frequently (losmaken as an infinitive versus maak los in the conjugated forms), and a different word order is adopted).

(1) Fr: déclipper le renvoi de ceinture de sécurité.
(En: unclip the mounting of the belt of safety)
It: sganciare il dispositivo di riavvolgimento della cintura di sicurezza.
(En: unclip the mounting of the belt of safety)
En: unclip the seat belt mounting.
Du: maak de oprolautomaat van de autogordel los.
(En: clip the mounting of the seat-belt un)

We tried to improve the low recall for French-Dutch by adding a decompounding module to our alignment system. In case the target word does not have a lexical correspondence in the source sentence, we decompose the Dutch word into its meaningful parts and look for translations of the compound parts. This implies that, without decompounding, in example (2) only the correspondences doublure – binnenpaneel, arc – dakversteviging and arrière – achter will be found. By decomposing the compound into its meaningful parts (binnenpaneel = binnen + paneel, dakversteviging = dak + versteviging) and retrieving the lexical links for the parts, we were able to link the missing correspondence: pavillon – dakversteviging.

(2) Fr: doublure arc pavillon arrière.
(En: rear roof arch lining)
Du: binnenpaneel dakversteviging achter.

We experimented with the decompounding module of Vandeghinste (2008), which is based on the Celex lexical database (Baayen et al., 1993). The module, however, did not adapt well to the highly technical automotive domain, which is reflected by its low recall and the low confidence values for many technical terms. In order to adapt the module to the automotive domain, we implemented a domain-dependent extension to the decompounding module on the basis of the development corpus. This was done by first running the decompounding module on the Dutch sentences to construct a list with possible compound heads, being valid compound parts in Dutch. This list was updated by inspecting the decompounding results on the development corpus. While decomposing, we go from right to left and strip off the longest valid part that occurs in our preconstructed list with compound parts, and we repeat this process on the remaining part of the word until we reach the beginning of the word.

Table 4 shows the impact of the decompounding module, which is more prominent for short and medium sentences than for long sentences. A superficial error analysis revealed that long sentences combine a lot of other French-Dutch alignment difficulties next to the decompounding problem (e.g. different word order and separable verbs).

           SHORT              MEDIUM             LONG
Dutch      p     r     e      p     r     e      p     r     e
no dec     .95   .76   .16    .88   .67   .24    .88   .64   .26
dec        .96   .83   .11    .87   .73   .20    .87   .67   .24

Table 4: Precision (p), recall (r) and alignment error rate (e) for French-Dutch without and with decompounding information

4 Term extraction module

As described in Section 1, we generate candidate terms from the aligned phrases. We believe these anchor chunks offer a more flexible approach.
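The right-to-left, longest-match decompounding heuristic described in the previous section can be sketched as follows; the part list here is a small illustrative stand-in for the list harvested from the development corpus, not the actual resource, and the procedure is greedy (no backtracking):

```python
def decompose(word, parts, min_len=3):
    """Right-to-left, longest-match decompounding: repeatedly strip the
    longest suffix that occurs in the part list until the whole word is
    consumed.  Returns the parts in left-to-right order, or None if no
    full decomposition is found."""
    pieces = []
    remainder = word
    while remainder:
        # try suffixes of the remainder from longest to shortest
        for length in range(len(remainder), min_len - 1, -1):
            suffix = remainder[-length:]
            if suffix in parts:
                pieces.append(suffix)
                remainder = remainder[:-length]
                break
        else:
            return None  # no valid part found: give up
    return pieces[::-1]

# illustrative part list; the example compounds are taken from example (2)
parts = {"binnen", "paneel", "dak", "versteviging", "brandstof", "druk"}
print(decompose("binnenpaneel", parts))     # ['binnen', 'paneel']
print(decompose("dakversteviging", parts))  # ['dak', 'versteviging']
```

The translations of the recovered parts can then be looked up in the bilingual dictionary exactly like ordinary words.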

The method is language-pair independent and is not restricted to a predefined set of PoS patterns to identify valid candidate terms. In a second step, we use a general-purpose corpus and the n-gram frequencies of the automotive corpus to determine the specificity of the candidate terms.

The candidate terms are generated in several steps, as illustrated below for example (3).

(3) Fr: Tableau de commande de climatisation automatique
En: Automatic air conditioning control panel

1. Selection of all anchor chunks (minimal chunks that could be linked together) and lexical links within the anchor chunks:

tableau de commande – control panel
climatisation – air conditioning
commande – control
tableau – panel

2. Combination of each NP + PP chunk:

commande de climatisation automatique – automatic air conditioning control
tableau de commande de climatisation automatique – automatic air conditioning control panel

3. Stripping of the adjectives from the anchor chunks:

commande de climatisation – air conditioning control
tableau de commande de climatisation – air conditioning control panel

4.1 Filtering candidate terms

To filter our candidate terms, we keep the following criteria in mind:

• each entry in the extracted lexicon should refer to an object or action that is relevant for the domain (the notion of termhood, which is used to express "the degree to which a linguistic unit is related to domain-specific context" (Kageura and Umino, 1996));

• multiword terms should present a high degree of cohesiveness (the notion of unithood, which expresses the "degree of strength or stability of syntagmatic combinations or collocations" (Kageura and Umino, 1996));

• all term pairs should contain valid translation pairs (translation quality is also taken into consideration).

To measure the termhood criterion and to filter out general vocabulary words, we applied Log-Likelihood filters on the French single-word terms. In order to filter on low unithood values, we calculated the Mutual Expectation Measure for the multiword terms in both source and target language.

4.1.1 Log-Likelihood Measure

The Log-Likelihood measure (LL) should allow us to detect single word terms that are distinctive enough to be kept in our bilingual lexicon (Daille, 1995). This metric considers word frequencies weighted over two different corpora (in our case a technical automotive corpus and the more general purpose corpus "Le Monde"¹), in order to assign high LL-values to words having much higher or lower frequencies than expected. We implemented the formula for both the expected values and the Log-Likelihood values as described by Rayson and Garside (2000).

¹ http://catalog.elra.info/product_info.php?products_id=438

Manual inspection of the Log-Likelihood figures confirmed our hypothesis that more domain-specific terms in our corpus were assigned high LL-values. We experimentally defined the threshold for Log-Likelihood values corresponding to distinctive terms on our development corpus. Example (4) shows some translation pairs which are filtered out by applying the LL threshold.

(4) Fr: cependant – En: however – It: tuttavia – Du: echter
Fr: choix – En: choice – It: scelta – Du: keuze
Fr: continuer – En: continue – It: continuare – Du: verdergaan
Fr: cadre – En: frame – It: cornice – Du: frame (erroneous filtering)
Fr: allégement – En: lightening – It: alleggerire – Du: verlichten (erroneous filtering)

4.1.2 Mutual Expectation Measure

The Mutual Expectation measure as described by Dias and Kaalep (2003) is used to measure the degree of cohesiveness between words in a text. This way, candidate multiword terms whose components do not occur together more often than expected by chance get filtered out. In a first step, we calculated all n-gram frequencies (up to 8-grams) for our four automotive corpora and then used these frequencies to derive the Normalised Expectation (NE) values for all multiword entries.
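The NE and ME computations can be derived from plain n-gram counts; a minimal sketch, where the counts and corpus size are invented for illustration and relative frequencies serve as probability estimates:

```python
from collections import Counter

def mutual_expectation(ngram, counts, total):
    """ME of an n-gram in the spirit of Dias and Kaalep (2003):
    NE = prob(n-gram) / average prob of its (n-1)-grams, where each
    (n-1)-gram is obtained by leaving out one word of the n-gram;
    ME = relative frequency of the n-gram * NE."""
    n = len(ngram)
    p = counts[ngram] / total
    # the n sub-sequences obtained by dropping one word at a time
    sub_probs = [counts[ngram[:i] + ngram[i + 1:]] / total for i in range(n)]
    avg = sum(sub_probs) / n
    if avg == 0:
        return 0.0
    ne = p / avg   # Normalised Expectation
    return p * ne  # Mutual Expectation

# invented toy counts over a corpus of 10,000 tokens
counts = Counter({
    ("control", "panel"): 50,
    ("control",): 80,
    ("panel",): 60,
})
me = mutual_expectation(("control", "panel"), counts, 10000)
```

Candidate multiword pairs whose ME values fall below the experimentally set thresholds are then discarded.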

The NE value of an n-gram is computed as specified by the formula of Dias and Kaalep:

NE = prob(n-gram) / ((1/n) · Σ prob((n−1)-grams))   (3)

The Normalised Expectation value expresses the cost, in terms of cohesiveness, of the possible loss of one word in an n-gram. The higher the frequency of the (n−1)-grams, the smaller the NE, and the smaller the chance that it is a valid multiword expression. The final Mutual Expectation (ME) value is then obtained by multiplying the NE values by the n-gram frequency. This way, the Mutual Expectation between n words in a multiword expression is based on the Normalised Expectation and the relative frequency of the n-gram in the corpus.

We calculated Mutual Expectation values for all candidate multiword term pairs and filtered out incomplete or erroneous terms having ME values below an experimentally set threshold (below 0.005 for both source and target multiword, or below 0.0002 for one of the two multiwords in the translation pair). The following incomplete candidate terms in example (5) were filtered out by applying the ME filter:

(5) Fr: fermeture embout – En: end closing – It: chiusura terminale – Du: afsluiting deel
(should be: Fr: fermeture embout de brancard – En: chassis member end closing panel – It: chiusura terminale del longherone – Du: afsluiting voorste deel van langsbalk)

4.2 Evaluation

The terminology extraction module was tested on all sentences from the three test corpora. The output was manually labeled and the annotators were asked to judge both the translational quality of the entry (both languages should refer to the same referential unit) as well as the relevance of the term in an automotive context. Three labels were used: OK (valid entry), NOK (not a valid entry) and MAYBE (in case the annotator was not sure about the relevance of the term).

First, the impact of the statistical filtering was measured on the bilingual term extraction. Secondly, we compared the output of our system with the output of a commercial bilingual terminology extraction module and with the output of a set of standard monolingual term extraction modules. Since the annotators labeled system output, the reported scores all refer to precision scores. In future work, we will develop a gold standard corpus which will enable us to also calculate recall scores.

4.2.1 Impact of filtering

Table 5 shows the difference in performance for both single and multiword terms with and without filtering. Single-word filtering seems to have a bigger impact on the results than multiword filtering. This can be explained by the fact that our candidate multiword terms are generated from anchor chunks (chunks aligned with a very high precision) that already answer to strict syntactical constraints. The annotators also mentioned the difficulty of judging the relevance of single word terms for the automotive domain (no clear distinction between technical and common vocabulary).

                  NOT FILTERED              FILTERED
                  OK      NOK     MAY       OK      NOK     MAY
FR-EN   Sing w    82%     17%     1%        86.5%   12%     1.5%
        Mult w    81%     16.5%   2.5%      83%     14.5%   2.5%
FR-IT   Sing w    80.5%   19%     0.5%      84.5%   15%     0.5%
        Mult w    69%     30%     1.0%      72%     27%     1.0%
FR-DU   Sing w    72%     25%     3%        75%     22%     3%
        Mult w    83%     15%     2%        84%     14%     2%

Table 5: Impact of statistical filters on single and multiword terminology extraction

4.2.2 Comparison with bilingual terminology extraction

We compared the three filtered bilingual lexicons (French versus English-Italian-Dutch) with the output of a commercial state-of-the-art terminology extraction program, SDL MultiTerm Extract². MultiTerm is a statistically based system that first generates a list of candidate terms in the source language (French in our case) and then looks for translations of these terms in the target language. We ran MultiTerm with its default settings (default noise-silence threshold, default stopword list, etc.) on a large portion of our parallel corpus that also contains all test sentences³. We ran our system (where term extraction happens on a sentence per sentence basis) on the three test sets.

² www.translationzone.com/en/products/sdlmultitermextract
³ 70,000 sentences seemed to be the maximum size of the corpus that could be easily processed within MultiTerm Extract.

Table 6 shows that even after applying statistical filters, our term extraction module retains a much higher number of candidate terms than MultiTerm.

         # Extracted terms   # Terms after filtering   MultiTerm
FR-EN    4052                3386                      1831
FR-IT    4381                3601                      1704
FR-DU    3285                2662                      1637

Table 6: Number of terms before and after applying Log-Likelihood and ME filters

Table 7 lists the results of both systems and shows the differences in performance for single and multiword terms. The following observations can be made:

• The performance of both systems is comparable for the extraction of single word terms, but our system clearly outperforms MultiTerm when it comes to the extraction of more complex multiword terms.

• Although the alignment results for French-Italian were very good, we do not achieve comparable results for Italian multiword extraction. This can be due to the fact that the syntactic structure is very similar in both languages. As a result, smaller syntactic chunks are linked. However, one can argue that, just because of the syntactic resemblance of both languages, the need for complex multiword terms is less prominent in closely related languages, as translators can just paste smaller noun phrases together in the same order in both languages. Take the following example:

déposer – l'embout – de brancard
togliere – il terminale – del sottoporta

We can recompose the larger compound l'embout de brancard or il terminale del sottoporta by translating the smaller parts in the same order (l'embout – il terminale and de brancard – del sottoporta).

• Despite the worse alignment results for Dutch, we achieve good accuracy results on the multiword term extraction. Part of that can be explained by the fact that French and Dutch use a different compounding strategy: whereas French compounds are created by concatenating prepositional phrases, Dutch usually tends to concatenate noun phrases (even without inserting spaces between the different compound parts). This way we can extract larger Dutch chunks that correspond to several French chunks, for instance:

Fr: régulateur – de pression carburant.
Du: brandstofdrukregelaar.

                 ANCHOR CHUNK APPROACH     MULTITERM
                 OK      NOK     MAY       OK      NOK     MAY
FR-EN   Sing w   86.5%   12%     1.5%      77%     21%     2%
        Mult w   83%     14.5%   2.5%      47%     51%     2%
        Total    84.5%   13.5%   2%        64%     34%     2%
FR-IT   Sing w   84.5%   15%     0.5%      85%     14%     1%
        Mult w   72%     27%     1.0%      65%     34%     1%
        Total    77.5%   22%     1%        76.5%   22.5%   1%
FR-DU   Sing w   75%     22%     3%        64.5%   33%     2.5%
        Mult w   84%     14%     2%        49.5%   49.5%   1%
        Total    79.5%   20%     2.5%      58%     40%     2%

Table 7: Precision figures for our term extraction system and for SDL MultiTerm Extract

4.2.3 Comparison with monolingual terminology extraction

In order to gain insight into the performance of our terminology extraction module, without considering the validity of the bilingual terminology pairs, we contrasted our extracted English terms with state-of-the-art monolingual terminology systems. As we want to include both single words and multiword terms in our technical automotive lexicon, we only considered ATR systems which extract both categories. We used the implementation of these systems from Zhang et al. (2008), which is freely available⁴.

⁴ http://www.dcs.shef.ac.uk/˜ziqizhang/resources/tools/

We compared our system against 5 other ATR systems:

1. Baseline system (Simple Term Frequency)

2. Weirdness algorithm (Ahmad et al., 2007), which compares term frequencies in the target and reference corpora

3. C-value (Frantzi and Ananiadou, 1999), which uses term frequencies as well as unit-hood filters (to measure the collocation strength of units)

4. Glossex (Kozakov et al., 2004), which uses term frequency information from both the target and reference corpora and compares term frequencies with frequencies of the multiword components

5. TermExtractor (Sclano and Velardi, 2007), which is comparable to Glossex but introduces the "domain consensus", which "simulates the consensus that a term must gain in a community before being considered a relevant domain term"

For all of the above algorithms, the input automotive corpus is PoS-tagged and linguistic filters (selecting nouns and noun phrases) are applied to extract candidate terms. In a second step, stopwords are removed and the same set of extracted candidate terms (1105 single words and 1341 multiwords) is ranked differently by each algorithm. To compare the performance of the ranking algorithms, we selected the top terms (300 single and multiword terms) produced by all algorithms and compared these with our top candidate terms, which are ranked by descending Log-Likelihood (calculated on the BNC corpus) and Mutual Expectation values. Our filtered list of unique English automotive terms contains 1279 single words and 1879 multiwords in total. About 10% of the terms do not overlap between the two term lists. All candidate terms have been manually labeled by linguists. Table 8 shows the results of this comparison.

                SINGLE WORD TERMS        MULTIWORD TERMS
                OK      NOK     MAY      OK      NOK     MAY
Baseline        80%     19.5%   0.5%     84.5%   14.5%   1%
Weirdness       95.5%   3.5%    1%       96%     2.5%    1.5%
C-value         80%     19.5%   0.5%     94%     5%      1%
Glossex         94.5%   4.5%    1%       85.5%   14%     0.5%
TermExtr.       85%     15%     0%       79%     20%     1%
AC approach     85.5%   14.5%   0%       90%     8%      2%

Table 8: Results for monolingual term extraction on the English part of the automotive corpus

Although our term extraction module has been tailored towards bilingual term extraction, the results look competitive to monolingual state-of-the-art ATR systems. If we compare these results with our bilingual term extraction results, we can observe that we gain more in performance for multiwords than for single words, which might mean that the filtering and ranking based on the Mutual Expectation works better than the Log-Likelihood ranking.

An error analysis of the results leads to the following insights:

• All systems suffer from partial retrieval of complex multiwords (e.g. management ecu instead of engine management ecu, chassis leg end piece closure instead of chassis leg end piece closure panel).

• We manage to extract coherent sets of multiwords that can be associated with a given concept, which could be useful for automatic ontology population (e.g. gearbox casing, gearbox casing earth, gearbox casing earth cable, gearbox control, gearbox control cables, gearbox cover, gearbox ecu, gearbox ecu initialisation procedure, gearbox fixing, gearbox lower fixings, gearbox oil, gearbox oil cooler protective plug).

• Sometimes smaller compounds are not extracted because they belong to the same syntactic chunk (e.g. we extract passenger compartment assembly, passenger compartment safety, passenger compartment side panel, etc. but not passenger compartment as such).

5 Conclusions and further work

We presented a bilingual terminology extraction module that starts from sub-sentential alignments in parallel corpora and applied it on three different parallel corpora that are part of the same automotive corpus. Comparisons with standard terminology extraction programs show an improvement of up to 20% for bilingual terminology extraction and competitive results (85% to 90% accuracy) for monolingual terminology extraction. In the near future we want to experiment with other filtering techniques, especially to measure the domain distinctiveness of terms, and work on a gold standard for measuring recall next to accuracy. We will also investigate our approach on languages which are more distant from each other (e.g. French – Swedish).

Acknowledgments

We would like to thank PSA Peugeot Citroën for funding this project.

References

K. Ahmad, L. Gillam, and L. Tostevin. 2007. University of Surrey participation in TREC8: Weirdness indexing for logical document extrapolation and retrieval (WILDER). In Proceedings of the Eighth Text REtrieval Conference (TREC-8).

S. Ananiadou. 1994. A methodology for automatic term recognition. In Proceedings of the 15th Conference on Computational Linguistics, pages 1034–1038.

R.H. Baayen, R. Piepenbrock, and H. van Rijn. 1993. The CELEX lexical database on CD-ROM.

I. Dagan and K. Church. 1994. Termight: identifying and translating technical terminology. In Proceedings of Applied Natural Language Processing, pages 34–40.

B. Daille. 1995. Study and implementation of combined techniques for automatic extraction of terminology. In J. Klavans and P. Resnik, editors, The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pages 49–66. MIT Press, Cambridge, Massachusetts; London, England.

G. Dias and H. Kaalep. 2003. Automatic extraction of multiword units for Estonian: Phrasal verbs. Languages in Development, 41:81–91.

K.T. Frantzi and S. Ananiadou. 1999. The C-value/NC-value domain independent method for multiword term extraction. Journal of Natural Language Processing, 6(3):145–180.

K. Kageura and B. Umino. 1996. Methods of automatic term recognition: a review. Terminology, 3(2):259–289.

L. Kozakov, Y. Park, T.-H. Fin, Y. Drissi, Y.N. Doganata, and T. Confino. 2004. Glossary extraction and knowledge in large organisations via semantic web technologies. In Proceedings of the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference (Semantic Web Challenge Track).

J. Kupiec. 1993. An algorithm for finding noun phrase correspondences in bilingual corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics.

L. Macken and W. Daelemans. 2009. Aligning linguistically motivated phrases. In S. Verberne, H. van Halteren, and P.-A. Coppen, editors, Selected Papers from the 18th Computational Linguistics in the Netherlands Meeting, pages 37–52, Nijmegen, The Netherlands.

L. Macken, E. Lefever, and V. Hoste. 2008. Linguistically-based sub-sentential alignment for terminology extraction from a bilingual automotive corpus. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 529–536, Manchester, United Kingdom.

R.C. Moore. 2002. Fast and accurate sentence alignment of bilingual corpora. In Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: From Research to Real Users, pages 135–144, Tiburon, California.

F.J. Och and H. Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

P. Rayson and R. Garside. 2000. Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora, 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000), pages 1–6.

H. Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK.

F. Sclano and P. Velardi. 2007. TermExtractor: a web application to learn the shared terminology of emergent web communities. In Proceedings of the 3rd International Conference on Interoperability for Enterprise Software and Applications (I-ESA 2007).

A. Van den Bosch, G.J. Busser, W. Daelemans, and S. Canisius. 2007. An efficient memory-based morphosyntactic tagger and parser for Dutch. In Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, pages 99–114, Leuven, Belgium.

V. Vandeghinste. 2008. A Hybrid Modular Machine Translation System. LoRe-MT: Low Resources Machine Translation. Ph.D. thesis, Centre for Computational Linguistics, KULeuven.

Z. Zhang, J. Iria, C. Brewster, and F. Ciravegna. 2008. A comparative evaluation of term recognition algorithms. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008).
