A step towards the detection of semantic variants of terms in technical documents

Thierry Hamon and Adeline Nazarenko
Laboratoire d'Informatique de Paris-Nord
Université Paris-Nord
Avenue J-B Clément
93430 Villetaneuse, FRANCE
thierry.hamon@lipn.univ-paris13.fr
adeline.nazarenko@lipn.univ-paris13.fr

Cécile Gros
EDF-DER-IMA-TIEM-SOAD
1 Avenue du Général de Gaulle
92141 Clamart CEDEX, FRANCE
cecile.gros@der.edfgdf.fr

Abstract

This paper reports the results of a preliminary experiment on the detection of semantic variants of terms in a French technical document. The general goal of our work is to help the structuring of terminologies. Two kinds of semantic variants can be found in traditional terminologies: strict synonymy links and fuzzier relations like see-also. We have designed three rules which exploit general dictionary information to infer synonymy relations between complex candidate terms. The results have been examined by a human terminologist. The expert judged that half of all the pairs of terms are relevant for semantic variation. He validated a significant part of the detected links as synonymy. Moreover, it appeared that numerous errors are due to a few misinterpreted links: they could be eliminated by a few exception rules.

1 Introduction

1.1 Structuring a terminology

The work presented here is part of an industrial project on a Technical Document Consultation System (Gros et al., 1996) at the French electricity company EDF. The goal is to develop tools to help a terminologist in the construction of a structured terminology (cf. figure 1) providing:

• terms of a domain, i.e. simple or complex lexical units pointing out accurate concepts in a technical document (Bourigault, 1992);
• semantic links such as the see-also relation.

This can be viewed as a two-step process. The candidate terms (i.e. lexical units which can be terms if a domain expert validates them) are first automatically extracted from the technical document with a Terminology Extraction Software (LEXTER) (Bourigault, 1992). The list of candidate terms is then structured into a semantic network. We focus on the latter point by detecting semantic variants, especially synonyms.

Ligne aérienne (overhead line)
    See_also : Départ aérien (overhead outlet)
    Synonym  : Liaison électrique aérienne (overhead electric link)
Ligne simple (single circuit line)
    Is_a     : Ligne aérienne (overhead line)
Ligne multiterne (multiple circuit line)
    Is_a     : Ligne aérienne (overhead line)
    Synonym  : Ligne double (double circuit line)

Figure 1: Example of a structured terminology in the electric domain.

In order to build a structured terminology, we thus attempt to link candidate terms extracted from a French technical document¹. For instance, from synonyms such as matériel (equipment) / équipement (fittings), marche (running) / fonctionnement (working) and normal (normal) / bon (right), we infer a synonymy link between the candidate terms matériel électrique (electric equipment) / équipement électrique (electrical fittings) and marche normale (normal running) / bon fonctionnement (right working).

¹ As the terms used in this paper have been extracted from French documents, their translation, especially for synonymy, does not always show the same nuance as the original.
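The structured terminology of figure 1 can be pictured as a small graph with typed links. The Python sketch below is only an illustration under stated assumptions: the class name `Terminology`, its method names, and the decision to store synonym and see-also links in both directions (while keeping is_a directed) are hypothetical choices made for the example and are not described in the paper.

```python
from collections import defaultdict

# Link types taken from figure 1. Treating the first two as symmetric
# is an assumption made for this sketch.
SYMMETRIC = {"synonym", "see_also"}
DIRECTED = {"is_a"}


class Terminology:
    """Hypothetical container for typed links between terms."""

    def __init__(self):
        self._links = defaultdict(set)

    def add_link(self, source, link_type, target):
        """Record a typed link; symmetric types are stored in both directions."""
        if link_type not in SYMMETRIC | DIRECTED:
            raise ValueError(f"unknown link type: {link_type}")
        self._links[source].add((link_type, target))
        if link_type in SYMMETRIC:
            self._links[target].add((link_type, source))

    def related(self, term, link_type):
        """Return the terms linked to `term` by `link_type`, sorted."""
        return sorted(t for lt, t in self._links.get(term, ()) if lt == link_type)


# The entries of figure 1.
terminology = Terminology()
terminology.add_link("ligne aérienne", "see_also", "départ aérien")
terminology.add_link("ligne aérienne", "synonym", "liaison électrique aérienne")
terminology.add_link("ligne simple", "is_a", "ligne aérienne")
terminology.add_link("ligne multiterne", "is_a", "ligne aérienne")
terminology.add_link("ligne multiterne", "synonym", "ligne double")

print(terminology.related("ligne double", "synonym"))
print(terminology.related("ligne simple", "is_a"))
```

With this representation, a consultation tool could follow any link type from a term without caring how the link was obtained (manually or by automatic inference).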

modèle (model):
    <1> canon (canon), étalon (standard), exemplaire (copy), exemple (example), plan (plan)
    <2> sujet (subject), maquette (maquette)
    <3> héros (hero), type (type)
    <4> échantillon (sample), spécimen (sample)
    <5> standard (standard), type (type), prototype (prototype)
    <6> maquette (model)
    <7> gabarit (size), moule (mould), patron (pattern)

Figure 2: Example of an entry from the dictionary Le Robert.

1.2 Using a general language dictionary for specialized corpora

As domain-specific semantic information is seldom available, our aim is to evaluate the relevance and usefulness of general semantic resources for the detection of synonymy between candidate terms.

For this study, we used a French general dictionary, Le Robert, supplied by the Institut National de la Langue Française (INaLF). It provides synonyms and analogical words distributed among the different senses (cf. figure 2) of each word entry. It is exploited as a machine-readable synonym dictionary.

We use a 200 000 word corpus about an electric power plant. Its size is typical of technical documents. It is very technical if one considers the dictionary lemma coverage for this corpus (45%). For two other available documents dealing with software engineering and electric network planning, the dictionary lemma coverage is 65% and 57% respectively. In that respect the chosen corpus is the worst case for this experiment.

The present corpus has been analyzed by the Terminology Extraction Software LEXTER, which extracted 12 043 candidate terms (2 831 nouns, 597 adjectives and 8 615 noun phrases). Each complex candidate term (ligne d'alimentation, supply line) is analyzed into a head (ligne, line) and an expansion (alimentation, supply). It is part of a syntactic network (cf. figure 3).

2 Method for the detection of synonymous terms

Terminological variation includes morphological (inflectional, derivational) variants and syntactic variants (coordinated and [...] terms), but also semantic variants (synonyms, hyperonyms) of controlled terms. In this experiment, we attempt to infer synonymy links between candidate terms.

2.1 Semantic variation and synonymy relation

Semantic variation. Semantic variation includes relations (e.g. synonymy and see-also) between [...] of the same grammatical category, even if one may also take into consideration phenomena such as elliptic relations or combinations of synonymy and derivation relations (e.g. heat and thermal) where the categories may be different.

Fuzzier relations such as the traditional see-also relations of terminologies are also very useful. Once a link is established between two terms, it is sometimes easy for the terminology users to interpret. Moreover, for applications such as document retrieval, the link itself is often more important than its exact type.

Synonymy. We use a synonymy definition close to that of WordNet (Miller et al., 1993). It is defined as an equivalence relation between terms having the same meaning in a particular context. The transitivity rule cannot be applied to the links extracted from the dictionary. Indeed, while synonymy is sometimes very contextual in the dictionary, the links appear in the data without context information, so transitivity would produce a great deal of errors. For instance, the synonymy links between the adjectives polaire (polar) and glacial (icy) and between the adjectives glacial (cold) and insensible (insensitive) would lead to a wrong synonymy link between polaire and insensible.

Moreover, tests carried out on dictionary samples show that the relevant links which

could be added thanks to the transitivity rule already exist in the dictionary. For instance, the following words are pairwise synonymous: logement (accommodation), demeure (residence), domicile (residence) and habitation (house).

We consider all links provided by the dictionary as expressing a synonymy relation between simple candidate terms, and we design a two-step automatic method to infer links between complex candidate terms.

[Diagram: candidate terms connected by head (H) and expansion (E) links. Nodes shown: ligne (line); alimentation (supply); de la ligne (of the line); ligne aérienne (overhead line); ligne simple (single line); ligne double (double line); ligne d'alimentation (supply line); ligne aérienne haute tension (high voltage overhead line); ligne aérienne moyenne tension (middle voltage overhead line); capacité de transit de la ligne (transit capacity of the line); coût d'investissement de la ligne (cost of investment of the line); ordre de déclenchement (order of tripping); déclenchement de la ligne (tripping of the line); longueur de la ligne (size of the line); puissance caractéristique de la ligne (characteristic power of the line); (...)]

Figure 3: Fragment of the syntactic network (H = head, E = expansion).

2.2 First step: Dictionary data filtering

In order to reduce the database, we first filter the dictionary links that are relevant for the studied document. For instance, the link matériel (equipment) / équipement (fittings) is selected because both its ends, matériel and équipement, exist in the studied corpus. For this document, 3 369 synonymy links between 1 542 simple terms are preserved.

Table 1 shows the results of the filtering step with regard to the coverage of our corpus by the dictionary.

                                                      Nouns   Adjectives   Total
Number of simple terms extracted                      2 831          597   3 428
Number of retained words at the filtering step        1 134          408   1 542
Percentage of retained words at the filtering step      40%          68%     45%

Table 1: Coverage of the corpus by the dictionary.

2.3 Second step: Detection of synonymous candidate terms

Assuming that the semantics and the synonymy of the complex candidate terms are compositional, we design three rules to detect synonymy relations between candidate terms. Considering two candidate terms, if one of the following conditions is met, a synonymy link is added to the terminological network:

- the heads are identical and the expansions are synonymous (collecteur général (general collector) / collecteur commun (common collector));
- the heads are synonymous and the expansions are identical (matériel électrique (electric equipment) / équipement électrique (electrical fittings));
- the heads are synonymous and the expansions are synonymous (marche normale (normal running) / bon fonctionnement (right working)).
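The filtering step and the three rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `filter_links` and `infer_links`, the representation of a complex term as a (head, expansion) pair of lemmas, and the toy data are all hypothetical. In particular, the sketch assumes that heads and expansions have already been lemmatized (e.g. normale becomes normal), so that they can be looked up directly in the dictionary links.

```python
from itertools import combinations


def filter_links(dictionary_links, corpus_words):
    """First step: keep the dictionary links whose two ends occur in the corpus."""
    return {
        frozenset((a, b))
        for a, b in dictionary_links
        if a != b and a in corpus_words and b in corpus_words
    }


def infer_links(terms, word_links):
    """Second step: one pass of the three rules over all pairs of complex terms.

    `terms` maps each complex term to its (head, expansion) lemmas.
    Returns a dict mapping each inferred pair to the number of the rule that fired.
    """
    inferred = {}
    for t1, t2 in combinations(sorted(terms), 2):
        (h1, e1), (h2, e2) = terms[t1], terms[t2]
        heads_syn = frozenset((h1, h2)) in word_links
        exps_syn = frozenset((e1, e2)) in word_links
        if h1 == h2 and exps_syn:        # rule 1: same head, synonymous expansions
            rule = 1
        elif heads_syn and e1 == e2:     # rule 2: synonymous heads, same expansion
            rule = 2
        elif heads_syn and exps_syn:     # rule 3: both synonymous
            rule = 3
        else:
            continue
        inferred[frozenset((t1, t2))] = rule
    return inferred


# Toy data built from the examples of the paper (lemmatized by hand).
terms = {
    "collecteur général": ("collecteur", "général"),
    "collecteur commun": ("collecteur", "commun"),
    "matériel électrique": ("matériel", "électrique"),
    "équipement électrique": ("équipement", "électrique"),
    "marche normale": ("marche", "normal"),
    "bon fonctionnement": ("fonctionnement", "bon"),
}
dictionary_links = [
    ("général", "commun"),
    ("matériel", "équipement"),
    ("marche", "fonctionnement"),
    ("normal", "bon"),
    ("logement", "demeure"),  # dropped by the filter: absent from the toy corpus
]
corpus_words = {word for pair in terms.values() for word in pair}

word_links = filter_links(dictionary_links, corpus_words)
for pair, rule in infer_links(terms, word_links).items():
    print(sorted(pair), "rule", rule)
```

Links are stored as unordered pairs and no transitive closure is computed, in keeping with the remark of section 2.1 that transitivity cannot safely be applied to dictionary links.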

We first use the dictionary links as a bootstrap to detect synonymy links between complex candidate terms. Then, we iterate the process by including the newly detected links in our base until no new link can be found. In the present experiment, the process ends after three iterations.

3 Results and study of the detected links

3.1 Various detected links

Synonymy links. 396 links between complex candidate terms (i.e. noun phrases) are inferred by this method. An expert of the domain validated 37% of them (i.e. 146 links, cf. table 2) as real synonymy links: hauteur d'eau (water height) / niveau d'eau (level of water), détérioration notable (notable deterioration) / dégradation importante (important damage) (cf. figure 4).

                      Number   Percentage
Validated links          146          37%
Unvalidated links        250          63%
Total                    396         100%

Table 2: Results of the link validation.

Most of the synonymy links between candidate terms are detected at the first iteration (383 links out of 396). The majority of the validated links are given by the first two rules: 89 validated links out of 206 with the first rule (admission d'air (air intake) / entrée d'air (air entry)), 49 out of 105 with the second (toit flottant (floating roof) / toit mobile (movable roof) and collecteur général (general collector) / collecteur commun (common collector)). Obviously, the last rule has a lower precision rate: 8 out of 85 (fausse manoeuvre (wrong operation) / mauvaise manipulation (bad handling)). However, it infers important links which are difficult to detect by hand.

Term 1                                                 Term 2
détérioration notable (notable deterioration)          dégradation importante (important damage)
fausse manoeuvre (wrong operation)                     mauvaise manipulation (bad handling)
action de l'opérateur (action of the operator)         intervention de l'opérateur (intervention of the operator)
capacité interne (internal capacity)                   volume interne (internal volume)
capacité totale (total capacity)                       volume total (total volume)
capacité utile (useful capacity)                       volume utile (useful volume)
limite de solubilité (limit of solubility)             seuil de solubilité (solubility threshold)
marche manuelle (manual running)                       fonctionnement manuel (manual working)
tests périodiques (periodic tests)                     essais périodiques (periodic trials)
hauteur d'eau (water height)                           niveau d'eau (level of water)
panneau de commande (control panel)                    tableau de commande (control board)

Figure 4: Examples of synonymy links between complex candidate terms.

Other useful links. On the whole, the expert judged that half of the detected links are useful for structuring the terminology, even if he rejected some of them as real synonymy links (cf. figure 5). Our method detects different types of links: meronymy, antonymy, relations between close concepts, connected parts of a whole mechanism, etc.

The meronymy links are the most numerous after synonymy (rapport de sûreté (safety report) / analyse de sûreté (safety analysis)). In this example, whereas rapport (report) and analyse (analysis) are given as synonyms by the general language dictionary (which is context-free), their technical meanings in our document are more specific. Therefore, rapport de sûreté is a meronym rather than a synonym of analyse de sûreté in the studied document.

Other detected links make it possible to group candidate terms which refer to related concepts. For instance, we detected a link between the device ligne de vidange (draining line) and the place point de purge (blow-down point), which is relevant since a draining line ends at a blow-down point. Likewise, it is useful to link fin de vidange (draining end), which designates an operation, and destination des purges (blow-down destination), which is the corresponding equipment.

The expert considered that the link between the candidate terms commande mécanique (mechanical control) / ordre automatique (automatic order) expresses an antonymy relation, although it is inferred from the dictionary synonymy relation mécanique (mechanical) / automatique (automatic). It appears that those adjectives have a particular meaning in the present corpus. Therefore, every link detected from this "synonymy" link is an antonymy one.

Those links express various relations that are sometimes difficult to name, even for the expert. Such links are important in a terminology.

Term 1                                                 Term 2
essai en usine (test in plant)                         expérience d'exploitation (experiment of exploitation)
ligne de vidange (draining line)                       point de purge (blow-down point)
fonction d'un temps (function of a time)               effet d'une température (effect of a temperature)
froid normal (normal cold)                             refroidissement correct (correct cooling)
rapport de sûreté (safety report)                      analyse de sûreté (safety analysis)
solution d'acide borique (solution of boric acid)      dissolution de l'acide borique (dissolving of the boric acid)
température attendue (expected temperature)            temps attendu (expected time)
température normale (normal temperature)               temps normal (normal time)
organes de commande (control devices)                  organes d'ordre (order devices)
gros débit (big flow)                                  plein débit (full flow)
activité importante (important activity)               activité élevée (high activity)
commande mécanique (mechanical control)                ordre automatique (automatic order)
risques de corrosion (risk of corrosion)               risques de destruction (risk of destruction)

Figure 5: Examples of rejected links between complex candidate terms.

3.2 Polysemy, elision and metaphor

Most real errors are due to the lack of context information for polysemic words and to the noisy data existing in the dictionary. For instance, the French word temps means either time or weather. According to the dictionary, temps (weather) is a synonym of température (temperature)², but this meaning is excluded from the present corpus. Since we cannot distinguish the different meanings, the synonymy of temps (time) and température is taken for granted. Temps attendu (expected time) and température attendue (expected temperature) are thus given as synonymous. This type of wrong link is rather frequent in the list presented to the expert: between 10 and 20 links out of 396.

² It would be more precise to interpret them as analogous words.

In addition, about ten wrong links are due to the elision of common terms in the domain. For instance, the term activité (activity), which actually corresponds to the term radioactivité (radioactivity) in the document, is given as a synonym of énergie (energy) in the dictionary. We have detected links such as activité haute (high activity) / haute énergie (high energy).

As regards metaphor, we have observed that it preserves semantic relations. For instance, in graph theory, the link arbre (tree) / feuille (leaf) can be inferred from the meronymy information of a general dictionary.

Those types of wrong links are easily identified during the validation. Some exception rules can be designed to first regroup those links

and then eliminate them. With that aim, we plan to use dictionary definitions.

3.3 Evaluation

The inferred links express not only synonymy, but also other relations which may be difficult to name. Apart from real errors, these fuzzy see-also relations are useful in the context of a consultation system.

The results of this first experiment are encouraging. Although the precision rate and the number of links are low (37%, 396 links), the use of complementary methods (e.g. detection of syntactic variants) would make it possible to propagate these links and increase their number. Also, the use of other knowledge sources or different methods (Habert et al., 1996) is necessary to increase the precision rate and find links between more technical candidate terms.

As regards the benefit of such a method, the terminology acquisition by an expert would take tens of hours, while the automatic extraction takes one hour and the validation of the links was done in two hours.

The main difficulty is to evaluate the recall of the results, because there is no standard reference giving all the relevant relations in the document. One may think that the comparison with links manually detected by an expert is the best evaluation, but such manual detection is subjective. Regarding validation by several experts, it is well known that such validation would give different results depending on the background of each expert (Szpakowicz et al., 1996). So, we are reduced to comparing our results with those obtained by different methods, even though they are not perfect either. We are planning to compare the clusters found by our method with those of the clustering method of (Assadi, 1997) to study how the results overlap and complement each other.

4 Related works

Variant detection in specialized corpora must be taken into account for information retrieval. This complex operation involves the semantic as well as the morphological and syntactic levels. (Jacquemin, 1996) designs a unification-based partial parser, FASTER, which analyses raw technical text while meta-rules detect morpho-syntactic variants of controlled terms (blood cell, blood mononuclear cell). By using morphological and part-of-speech modules, the system is extended to verbal phrases (tree cutting, tree have been cut down) (Klavans et al., 1997). Dealing with syntactic paraphrase in the general language, (Dras, 1997) proposes a similar representation by using the STAG formalism to detect syntactically related sentences. Because we deal with the semantic level, our work is complementary to those.

Semantic variation is rarely studied in specialized domains. Works on word similarity and word sense disambiguation are generally based on statistical methods designed for large or even very large corpora (Hindle, 1990; Agirre and Rigau, 1996). Therefore, they cannot be applied to technical documents, which usually are medium-size corpora. However, dealing with data that have already been linguistically filtered, (Assadi, 1997) aims at statistically building rough clusters, supposing that similar candidate terms have similar expansions. Then he relies on human expertise for the semantic interpretation. It differs from our work, which tries to make the semantic relations explicit automatically.

In order to disambiguate noun objects in a short text (30 000 words), (Li et al., 1995) design heuristic rules using semantic similarity information in WordNet and verbs as context. Their system disambiguates an encouraging number of noun-verb pairs if one considers single and multiple senses assigned to a word.

In (Basili et al., 1997), the lexical knowledge base WordNet (Miller et al., 1993) is used as a bootstrap for verb disambiguation. They tune it to the domain of the studied document by taking into account the contexts in which the verbs are used. This tuning leads both to the elimination of certain semantic categories and to the addition of new ones. For instance, the category contact is created for the verb to record. The resulting sense classification is thus a better description of the specialized meanings of the verbs.

Our symbolic and dictionary-based approach is close to that of (Basili et al., 1997). Both use general language information (traditional dictionary vs. WordNet) for specialized corpora. However, their goals differ: disambiguation vs. semantic relation identification.

5 Conclusion and future works

The use of a synonym dictionary and the rules we have designed for detecting synonymous candidate terms make it possible to extract an encouraging number of links in a very technical corpus. An expert validated these links. More than one third of the detected links are synonymy relations. Besides synonymy, our method detects various kinds of semantic variants. Wrong links due to polysemy can be easily eliminated with exception rules by comparing selectional patterns and generalized contexts (Basili et al., 1993; Grishman and Sterling, 1994).

Our work shows that general semantic data are useful for structuring a terminology and for detecting synonyms in a corpus of specialized language. The results show that semantic variants can be automatically detected. Of course, the number of acquired links is relatively low, but our method is not to be used in isolation.

Acknowledgment

This work is the result of a collaboration with the Direction des Etudes et Recherche (DER) d'Electricité de France (EDF). We thank Marie-Luce Picard from EDF and Benoît Habert from ENS Fontenay-St Cloud for their help, Didier Bourigault and Jean-Yves Hamon from the Institut de la Langue Française (INaLF) for the dictionary, and Henry Boccon-Gibod for the validation of the results.

References

E. Agirre and G. Rigau. 1996. Word sense disambiguation using conceptual density. In Proceedings of COLING'96, pages 16-22, Copenhagen, Denmark.

H. Assadi. 1997. Knowledge acquisition from texts: Using an automatic clustering method based on noun-modifier relationship. In Proceedings of ACL'97 - Student Session, Madrid, Spain.

Roberto Basili, Maria Teresa Pazienza, and Paola Velardi. 1993. Acquisition of selectional patterns in sublanguages. Machine Translation, 8:175-201.

Roberto Basili, Michelangelo Della Rocca, and Maria Teresa Pazienza. 1997. Contextual word sense tuning and disambiguation. Applied Artificial Intelligence, 11:235-262.

D. Bourigault. 1992. Surface grammatical analysis for the extraction of terminological noun phrases. In Proceedings of COLING'92, pages 977-981, Nantes, France.

Mark Dras. 1997. Representing paraphrases using synchronous tree adjoining grammars. In Proceedings of the 1997 Australian NLP Summer Workshop, Sydney, Australia.

Ralph Grishman and John Sterling. 1994. Generalizing automatically generated selectional patterns. In Proceedings of COLING'94, volume 3, pages 742-747, Kyoto.

C. Gros, H. Assadi, N. Aussenac-Gilles, and A. Courcelle. 1996. Task models for technical documentation accessing. In Proceedings of EKAW'96, Nottingham.

Benoit Habert, Elie Naulleau, and Adeline Nazarenko. 1996. Symbolic word clustering for medium-size corpora. In Proceedings of COLING'96, volume 1, pages 490-495, Copenhagen, Denmark, August.

D. Hindle. 1990. Noun classification from predicate-argument structures. In Proceedings of ACL'90, pages 268-275, Pittsburgh, PA.

C. Jacquemin. 1996. A symbolic and surgical acquisition of terms through variation. In S. Wermter, E. Riloff, and G. Scheler, editors, Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing, pages 425-438, Springer.

J. Klavans, C. Jacquemin, and E. Tzoukermann. 1997. A natural language approach to multi-word term conflation. In Proceedings of the third Delos Workshop - Cross-Language Information Retrieval.

Xiaobin Li, Stan Szpakowicz, and Stan Matwin. 1995. WordNet-based algorithm word sense disambiguation. In Proceedings of IJCAI-95, pages 1368-1374, Montreal, Canada.

G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. 1993. Introduction to WordNet: An on-line lexical database. Technical Report CSL Report 43, Cognitive Science Laboratory, Princeton.

Stan Szpakowicz, Stan Matwin, and Ken Barker. 1996. WordNet-based word sense disambiguation that works for short texts. Technical Report TR-96-03, Department of Computer Science, University of Ottawa, Ontario, Canada.
