Ontology Learning and Its Application to Automated Terminology Translation

Roberto Navigli and Paola Velardi, Università di Roma La Sapienza

Aldo Gangemi, Institute of Cognitive Sciences and Technology

The OntoLearn system for automated ontology learning extracts relevant domain terms from a corpus of text, relates them to appropriate concepts in a general-purpose ontology, and detects taxonomic and other semantic relations among the concepts. The authors used it to automatically translate multiword terms from English to Italian.

Although the IT community widely acknowledges the usefulness of domain ontologies, especially in relation to the Semantic Web,1,2 we must overcome several barriers before they become practical and useful tools. Thus far, only a few specific research environments have adopted ontologies. (The "What Is an Ontology?" sidebar provides a definition and some background.) Many in the computational-linguistics research community use WordNet,3 but large-scale IT applications based on it require heavy customization.

Thus, a critical issue is ontology construction—identifying, defining, and entering concept definitions. In large, complex application domains, this task can be lengthy, costly, and controversial, because people can have different points of view about the same concept. Two main approaches aid large-scale ontology construction. The first facilitates manual ontology engineering by providing natural language processing tools, including editors, consistency checkers, mediators to support shared decisions, and ontology import tools. The second approach relies on machine learning and automated language-processing techniques to extract concepts and ontological relations from structured and unstructured data such as databases and text. Few systems exploit both approaches. The first approach predominates in most development toolsets such as Kaon, Protégé, Chimaera, and WebOnto, but some systems also implement machine learning techniques.1

Our OntoLearn system is an infrastructure for automated ontology learning from domain text. It is the only system, as far as we know, that uses both natural language processing and machine learning techniques, and it is part of a more general ontology engineering architecture.4,5 Here, we describe the system and an experiment in which we used a machine-learned tourism ontology to automatically translate multiword terms from English to Italian. The method can apply to other domains without manual adaptation.

OntoLearn architecture

Figure 1 shows the elements of the architecture. Using the Ariosto language processor,6 OntoLearn extracts terminology from a corpus of domain text, such as specialized Web sites and document warehouses, or documents exchanged among members of a virtual community. It then filters the terminology using natural language processing and statistical techniques that perform comparative analysis across different domains, or contrastive corpora. Next, we use the WordNet and SemCor3 lexical knowledge bases to semantically interpret the terms. We then relate concepts according to taxonomic (kind-of) and other semantic relations, generating a domain concept forest. We use WordNet and our rule-based inductive-learning method to extract such relations.

Finally, OntoLearn integrates the domain concept forest with WordNet to create a pruned and specialized view of the domain ontology. We then use ontology editing, validation, and management tools4,5 that are part of the more general architecture to enrich, correct, and update the generated ontology.

[Figure 1. The OntoLearn architecture: a domain corpus feeds extraction of candidate terminology (with the Ariosto language processor), filtering of domain terminology (against contrastive corpora), semantic term interpretation (with WordNet and SemCor), identification of taxonomic and similarity relations, identification of other semantic relations (with a machine-learned rule base), and ontology integration, yielding a domain concept forest and, finally, the domain ontology.]

The main novel aspect of our machine learning method is semantic interpretation. We identify the right senses (concepts) for complex domain term components and the semantic relations between them. For example, the result of this process on the term transport company selects the sense "company-enterprise" for company (as opposed to the "social gathering" or "crew" senses) and the sense "commercial enterprise" for transport (as opposed to the "exchange of molecules," "overwhelming emotion," and "transport of magnetic tape" senses), and it associates a purpose relation (a company for transporting goods and people) between the two concepts.

We know of nothing similar in the ontology learning literature. Many described methods first extract domain terms using various statistical methods, then detect the taxonomic and other types of relations between terms. The literature1,7–9 uses the notions of domain term and domain concept interchangeably, but no semantic interpretation of terms actually takes place. For example, the concept digital printing technology is considered a kind of printing technology by virtue of simple string inclusion.7 However, printing has four senses in WordNet and technology has two. The WordNet lexical ontology thus contains eight possible concept combinations for printing technology. Identifying the correct sense combination is important in determining the taxonomic relations between a new domain concept and an existing ontology. Furthermore, semantic interpretation is relevant in view of document semantic indexing and retrieval. For example, a query for hotel facility could retrieve documents including concepts related to it by a kinship relation, such as swimming pool or conference room. Semantic interpretation is also useful for automatic translation, as we later demonstrate. Knowing the right senses of the complex term components in the source language—given a bilingual dictionary—greatly simplifies the problem of selecting the correct translation for each component in the target language.

OntoLearn has three main phases: terminology extraction, semantic interpretation, and creation of a specialized view of WordNet.

Phase 1: Terminology extraction

Terminology can be considered the surface appearance of relevant domain concepts. Candidate terminological expressions are usually captured with shallow techniques that range from stochastic methods to more sophisticated syntactic approaches. Obviously, richer syntactic information positively influences the quality of the result to be input to statistical filtering. We use Ariosto to parse documents in the application domain and extract a list T_c of syntactically plausible terminological candidates. Examples include compounds (credit card), adjective-nouns (public transport), and base noun phrases (board of directors).

High frequency in a corpus is a property observable for terminological as well as nonterminological expressions, such as last week. We measure the specificity of a terminology candidate with respect to the target domain via comparative analysis across different domains. To this end, we define a specific domain relevance score. A quantitative definition of the DR can be given according to the amount of information captured in the target corpus relative to the entire collection of corpora. More precisely, given a set of n domains {D_1, ..., D_n}, the domain relevance of a term t in class D_k is computed as

DR_{t,k} = \frac{P(t \mid D_k)}{\sum_{j=1}^{n} P(t \mid D_j)},

where P(t \mid D_k) is estimated by

E(P(t \mid D_k)) = \frac{f_{t,k}}{\sum_{t' \in D_k} f_{t',k}},

where f_{t,k} is the frequency of term t in the domain D_k.

A second filter operates on the principle that a term cannot be a clue for a domain D_k unless it appears in several documents; there must be some consensus on using that term in the domain D_k. The domain consensus of term t in class D_k captures those terms that appear frequently across a given domain's documents.


What Is an Ontology?

A domain ontology seeks to reduce or eliminate conceptual and terminological confusion among the members of a user community who need to share various kinds of electronic documents and information. It does so by identifying and properly defining a set of relevant concepts that characterize a given application domain, say, for travel agents or medical practitioners. An ontology specifies a shared understanding of a domain. It contains a set of generic concepts (such as "object," "process," "accommodation," and "single room"), together with their definitions and interrelationships. The construction of its unifying conceptual framework fosters communication and cooperation among people, better enterprise organization, and system interoperability. It also provides such system-engineering benefits as reusability, reliability, and specification.

Ontologies can have different degrees of formality, but they must include metadata such as concepts, relations, axioms, instances, or terms that lexicalize concepts. From the terminological viewpoint, an ontology can even be seen as a vocabulary containing a set of formal descriptions (made up of axioms) that approximate term meanings and enable a consistent interpretation of the terms and their relationships.

To construct an ontology, specialists from several fields must thoroughly analyze the domain by

• Examining the vocabulary that describes the entities that populate it
• Developing formal descriptions of the terms (formalized into concepts, relationships, or instances of concepts) in that vocabulary
• Characterizing the conceptual relations that hold among or within those terms

Philosophical and AI ontologists usually help define basic kinds and structures of concepts considered to be domain independent. These include metaproperties and topmost categories of entities and relationships. Identifying these few basic principles creates a foundational ontology1 and supports a model's generality to ensure reusability across different domains (see Figure A).2 Then domain modelers and knowledge engineers help identify key domain conceptualizations and describe them according to the organizational structure established by the top ontology. The result, the core ontology, usually includes a few hundred application-domain concepts. Although many projects eventually succeed in defining a core domain ontology, populating the third-level specific domain ontology with specific concepts is difficult. The projects that have overcome this barrier3–5 have done so at the price of inconsistencies and limitations.

[Figure A. The three levels of generality in a domain ontology: a foundational ontology at the top, a core ontology below it, and a specific domain ontology at the bottom.]

References

1. A. Gangemi et al., "Sweetening Ontologies with DOLCE," Proc. 13th Int'l Conf. Knowledge Eng. and Knowledge Management (EKAW 02), Springer-Verlag, New York, 2002; www.ladseb.pd.cnr.it/infor/Ontology/Papers/DOLCE-EKAW.pdf.
2. B. Smith and C. Welty, "Ontology: Towards a New Synthesis," Proc. Formal Ontology in Information Systems (FOIS 2001), ACM Press, New York, 2001, pp. 3–9.
3. A. Miller, "WordNet: An On-line Lexical Resource," J. Lexicography, vol. 3, no. 4, Dec. 1990.
4. D. Lenat, "CYC: A Large Scale Investment in Knowledge Infrastructure," Comm. ACM, vol. 38, no. 11, Nov. 1995, pp. 32–38.
5. T. Yokoi, "The EDR Electronic Dictionary," Comm. ACM, vol. 38, no. 11, Nov. 1995, pp. 42–44.

The domain consensus DC is defined as

DC_{t,k} = \sum_{d \in D_k} P_t(d) \log \frac{1}{P_t(d)},

where P_t(d) is the probability that document d includes t.

A linear combination of the two filters obtains the overall terminology weight:

DW_{t,k} = \alpha \, DR_{t,k} + (1 - \alpha) \, DC_{t,k}^{norm},

where DC_{t,k}^{norm} is a normalized entropy and \alpha \in (0, 1). Only the complex terms with a DW value over a given threshold are retained. Because the statistical significance can be influenced by the technicality of the language domain and by the dimension of the training corpus, we determine this threshold empirically.4
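To make the two filters concrete, here is a minimal Python sketch of the three scores, assuming term frequencies have already been counted per domain and per document. The function names, the entropy normalization by the log of the document count, and the default alpha are our illustrative choices, not details of the OntoLearn implementation.

import math

def domain_relevance(term, k, freq):
    # DR_{t,k}: the term's estimated probability in domain D_k, normalized
    # over all domains. freq[j] maps each term to its frequency in D_j.
    def p(j):
        total = sum(freq[j].values())
        return freq[j].get(term, 0) / total if total else 0.0
    denom = sum(p(j) for j in freq)
    return p(k) / denom if denom else 0.0

def domain_consensus(term, docs):
    # DC_{t,k}: entropy of the term's distribution across the documents
    # of D_k. docs is a list of per-document term-frequency dictionaries.
    counts = [d.get(term, 0) for d in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((c / total) * math.log(total / c) for c in counts if c)

def terminology_weight(term, k, freq, docs, alpha=0.5):
    # DW_{t,k} = alpha * DR + (1 - alpha) * normalized DC. Alpha and the
    # acceptance threshold applied to DW are tuned empirically.
    dc_norm = 0.0
    if len(docs) > 1:  # divide by the maximum entropy to normalize
        dc_norm = domain_consensus(term, docs) / math.log(len(docs))
    return alpha * domain_relevance(term, k, freq) + (1 - alpha) * dc_norm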

Phase 2: Semantic interpretation

Semantic interpretation is the process of first determining the right concept (sense) for each component of a complex term (this is known as semantic disambiguation) and then identifying the semantic relations holding among the concepts to build a complex concept. OntoLearn starts by hierarchically arranging the set of validated terms into subtrees according to simple string inclusion (see the code sketch at the end of this introduction). Figure 2 is an example of a lexicalized tree T.

[Figure 2. A lexicalized tree in a tourism domain: service subsumes train service, express service, ferry service, boat service, bus service, coach service, transport service, car service, and taxi service; one level deeper, car ferry service falls under ferry service and public transport service under transport service.]

In the absence of semantic interpretation, however, we cannot fully capture conceptual relationships between senses (for example, the kind-of relation between bus service and public transport service in Figure 2).

The WordNet lexical knowledge base associates a set of senses with each word. Sense names are called synsets (synonym sets). Each word in a synset is marked with its sense number. For example, for sense #3 of transport, WordNet would provide "transportation#4," "shipping#1," and "transport#3." WordNet also provides sense definitions in natural language, called glosses, and several types of lexicosemantic relations, like taxonomic (kind-of), similarity, and part-of relations. WordNet includes over 120,000 words (and over 170,000 synsets) but few domain terms. For example, transport and company are individually included, but not transport company. The SemCor knowledge base is a balanced corpus of semantically annotated sentences in which every word is annotated with a sense tag selected from the WordNet sense inventory for that word. We use SemCor to automatically extract examples of concept co-occurrences.
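As a small illustration of the string-inclusion step mentioned above, the following toy sketch (all names invented, not OntoLearn code) hangs each multiword term under the longest shorter term that it extends on the left:

def build_lexicalized_trees(terms):
    children = {t: [] for t in terms}   # term -> terms directly below it
    roots = set()
    for term in terms:
        words = term.split()
        parent = None
        for i in range(1, len(words)):  # try the longest proper suffix first
            suffix = " ".join(words[i:])
            if suffix in children:
                parent = suffix
                break
        if parent is None:
            roots.add(term)
        else:
            children[parent].append(term)
    return roots, children

roots, kids = build_lexicalized_trees({
    "service", "transport service", "public transport service",
    "ferry service", "car ferry service"})
# roots == {"service"}; kids["ferry service"] == ["car ferry service"]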

Semantic disambiguation

Let t = w_n · ... · w_2 · w_1 be a valid multiword term belonging to a lexicalized tree T. The process of semantic disambiguation associates the appropriate WordNet synset S_k^i with each word w_k in t. So, the sense of t is defined as

S(t) = \{ S_k^i : S_k^i \in Synset(w_k),\; w_k \in t \},

where Synset(w_k) is the set of senses provided by WordNet for word w_k. For instance,

S(transport company) = {{transportation#4, shipping#1, transport#3}, {company#1}}

corresponds to sense #1 of company ("an institution created to conduct business") and sense #3 of transport ("the commercial enterprise of transporting goods and material"). Let t = w_n · ... · w_2 · w_1 be a multiword term, say, public transport service. Let w_1 be the head of the string (in English compounds, the rightmost word).

Semantic disambiguation takes place in three steps. The first is creating semantic nets: for any w_k ∈ t and any synset S_k^i of w_k (where S_k^i is the ith sense of w_k in WordNet), create a semantic net (SN). OntoLearn automatically builds semantic nets by using the following lexicosemantic relations:

• Hyperonymy (car is-a-kind-of vehicle, denoted with →@)
• Hyponymy (its inverse, →~)
• Meronymy (room has-a wall, →#)
• Holonymy (its inverse, →%)
• Pertainymy (dental pertains-to tooth, →\)
• Attribute (dry value-of wetness, →=)
• Similarity (beautiful similar-to pretty, →&)
• Gloss (the concept appears in the definition of another concept, →gloss)
• Topic (the concept often co-occurs with another concept, →topic)

Ariosto obtains the gloss and topic relations by parsing, respectively, the WordNet concept definitions and the SemCor sentences including that sense. OntoLearn extracts every other relation directly from WordNet. As shown by the arrows in Figure 3, to reduce the dimension of a semantic net, we consider only concepts at a distance of not more than three relations from S_k^i (the semantic net center). Figure 3 is an example of a semantic net generated for sense #1 of airplane.

[Figure 3. The semantic net for sense #1 of airplane, linking it to concepts such as craft#2, vehicle#1, aircraft#1, airliner#1, fuselage#1, wing#1, airfoil#1, fuel pod#1, flight#2, and air travel#1 through hyperonymy (→@), hyponymy (→~), meronymy (→#), holonymy (→%), and gloss relations.]

The next step intersects the semantic nets. The algorithm evaluates pairwise, from left to right, alternative intersections. Given two adjacent words w_{i+1} and w_i, let

I = SN(S_{i+1}^j) ∩ SN(S_i^k)

be one possible intersection. For each alternative intersection I, we compute a score that depends on the number and type of semantic patterns connecting the semantic net centers. Semantic patterns must be instances of 13 predefined metapatterns. In the following two metapattern examples, S_1 and S_2 represent the central concepts of each semantic net.

• Topic, if S_1 →topic S_2 (as in the term archeological site, where the two words co-occur and are tagged with sense #1 in a SemCor sentence)
• Gloss + hyperonymy path, if

∃ G, M ∈ Synset_WN : S_1 →gloss G →@,#(≤3) M ←~,%(≤3) S_2

or the symmetric disjunct with the @,# and ~,% paths exchanged.

For instance, in railway company, the gloss of railway#1 contains the word organization, and a hyperonymy path exists from company to organization: company#1 →@ institution#1 →@ organization#1.
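The sketch below shows, under our own simplifying assumptions, how such an intersection-based disambiguation step can be organized. The related function stands in for the WordNet and SemCor lookups, and score_patterns stands in for the 13 metapattern matchers; neither reproduces the actual OntoLearn code.

from itertools import product

def semantic_net(center, related, radius=3):
    # Collect every synset within `radius` relation steps of `center`.
    # `related(s)` yields (relation, synset) pairs, e.g. ("@", "craft#2").
    net, frontier = {center}, {center}
    for _ in range(radius):
        frontier = {t for s in frontier for _, t in related(s)} - net
        net |= frontier
    return net

def best_sense_pair(senses1, senses2, related, score_patterns):
    # Try every pair of candidate synsets for two adjacent words and keep
    # the pair whose nets intersect with the highest metapattern score.
    best, best_score = None, float("-inf")
    for s1, s2 in product(senses1, senses2):
        shared = semantic_net(s1, related) & semantic_net(s2, related)
        score = score_patterns(s1, s2, shared)
        if score > best_score:
            best, best_score = (s1, s2), score
    return best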


Figure 4 shows an example intersection between air-conditioned#1 and room#1, in which the following semantic paths are found:

air-conditioned#1 →gloss heat#6 →@(≤3) artifact#1 ←@(≤3) room#1
air-conditioned#1 →gloss room#1

[Figure 4. Intersection between the semantic nets of air-conditioned#1 and room#1, which meet at artifact#1 (reached from heat#6, heating system#2) and at cool#1 ("giving relief from heat: a cool room"). The central nodes appear in bold; the blue and red arrows represent two instances of metapatterns found at an intersection; →@* denotes a hyperonymy path longer than 1 and →& similarity.]

The last step is finding taxonomic relations. Initially, all the complex terms in a tree T are independently disambiguated. Then, the algorithm detects taxonomic relations between concepts, as in ferry service →@ boat service. OntoLearn infers this information from WordNet on the basis of the synsets now associated with each component of the complex term.

In this phase, since all elements in T are jointly considered, interpretation errors from the previous disambiguation step are corrected. In addition, certain concepts are fused into a unique "semantic domain" on the basis of pertainymy, adjectival similarity, and synonymy relations (for instance, respectively, manor house and manorial house, expert guide and skilled guide, and bus service and coach service). Note that we detect semantic relations between concepts, not words. For example, bus#1 and coach#5 are synonyms, but this relation does not hold for other senses of these two words.

At the end of this step, the lexicalized tree T is reorganized into a domain concept tree D, in which terms have been replaced by concepts and the appropriate taxonomic relations have been detected. Figure 5 shows the domain concept tree D obtained from the lexicalized tree T of Figure 2. Numbers for concepts are shown only when more than one semantic interpretation holds for a term, as for coach service and bus service (for example, sense #3 of bus refers to old cars).

[Figure 5. A domain concept tree. Service subsumes transport service; transport service subsumes car service, coach service#2, public transport service, car service#2, and boat service; lower levels include bus service#2, taxi service, the fused concept coach service/bus service, train service, coach service#3, ferry service, express service, express service#2, and car ferry service.]

At this point, a WordNet synset is associated with each component of a complex term, and taxonomic relations connect the head components of each complex concept. To complete the interpretation process, we must determine the semantic relations that hold between the components of a complex concept. These relations provide richer semantic information for many applications such as semantic indexing, query answering, and automatic translation.

Extracting semantic relations

To extract the relations in a complex concept, we must

• Select an inventory of domain-appropriate semantic relations.
• Learn a formal model to select the relations that hold between pairs of concepts, given ontological information on these concepts.
• Apply the model to semantically relate the components.

First, we selected an inventory of semantic relation types. To this end, we consulted John Sowa's10 formalization of conceptual relations, as well as other studies conducted within the CoreLex11 and EuroWordNet12 systems. Because the literature did not provide systematic definitions for semantic relations, we selected only the more intuitive and widely used. (The WonderWeb project has begun work on providing a well-founded set of rigorously defined semantic relations.)

To begin, we selected a kernel inventory, including the following 10 relations that we found pertinent to the tourism domain (examples of conceptual relations are expressed in a style similar to Sowa's conceptual graphs notation):

• Place (for example, room←PLACE←service, which reads "the service has place in a room" or—when the arrows point to the right—"the room is the place of the service")
• Time (afternoon←TIME←tea)
• Matter (ceramics←MATTER←tile)
• Theme (art←THEME←gallery)
• Manner (bus←MANNER←service)
• Beneficiary (customer←BENEF←service)
• Purpose (booking←PURPOSE←service)
• Object (wine←OBJ←production)
• Attribute (historical←ATTR←town)
• Characteristics (first-class←CHRC←hotel)

You can easily adapt this set or extend it to other domains.

To associate the appropriate relations that hold among the components of a domain concept, we used inductive machine learning.13 In inductive learning, you first manually tag a subset of domain concepts (the learning set) with the appropriate semantic relations and then let an inductive learner build a tagging model. We selected the C4.5 program14 because it produces a decision tree or a set of rules as output. Unlike algebraic or probabilistic learners, decision-tree or rule learners produce output that humans can easily understand and modify.

An inductive-learning system represents the instances of a domain through a feature vector. In our case, instances are concept-relation-concept triples (for example, wine←OBJ←production), in which the type of relation is given only in the learning set.

We explored several possibilities for feature selection. We obtained the best result when we represented each concept component with the complete sequence of its hyperonyms (up to the topmost). We began with a complex term (board of directors or transport company) and, after interpretation, we had an ordered list of concepts in which the rightmost is the concept head (board or company), while the modifiers are placed on the left. Therefore, we have, say, "director#2, board#1" for board of directors and "transport#3, company#1" for transport company. For each complex concept represented in this way, we generated a feature vector that follows the same order as the concept components and represents each concept component with the list of all its hyperonyms. The features are then hyperonym lists. Therefore, we have a list of lists:

feature_vector = { {list_of_hyperonyms}_mod^(1..n), {list_of_hyperonyms}_head }

For example, the feature vector for tourism operator, in which tourism is the modifier and operator is the head, begins with the sequence of hyperonyms for tourism#1:

{tourism#1, commercial_enterprise#2, commerce#1, transaction#1, group_action#1, act, human_action#1}

followed by the sequence of hyperonyms for operator#2:

{operator#2, capitalist#2, causal_agent#1, entity#1, life_form#1, person, individual#1}

Features are converted into a binary representation to obtain vectors of equal length.

We ran several experiments, using two-fold cross-validation and a tagged set of 405 complex concepts—a varying fragment for learning and the other for testing. Overall, the best experiment provided a 6 percent error rate over 405 examples and produced around 20 classification rules. Figure 6 shows three of the extracted rules, along with their confidence factors and a few examples.

Figure 6. Three classification rules with confidence factors:

If in modifier [knowledge_domain#1, knowledge-base#] = 1 then relation THEME (63%)
  Ex.: arts festival, science center

If in modifier [building-material#] = 1 then relation MATTER (50%)
  Ex.: stave church, cobblestone street

If in modifier [conveyance#3, transport#] = 1 and in head [act#1, human_act#] = 1 then relation MANNER (92.2%)
  Ex.: bus service, coach tour
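The sketch below illustrates, with assumed helper names (hyperonyms, vocabulary, tagged_pairs), how such hyperonym chains can be flattened into equal-length binary vectors; in the commented lines, a modern decision-tree learner stands in for the original C4.5 program.

def binary_features(pairs, hyperonyms, vocabulary):
    # pairs: (modifier_synset, head_synset) tuples, e.g.
    # ("tourism#1", "operator#2"). `hyperonyms(s)` returns the complete
    # hyperonym chain of s, and `vocabulary` fixes the feature order so
    # that every vector has the same length.
    vectors = []
    for modifier, head in pairs:
        mod_chain = set(hyperonyms(modifier))
        head_chain = set(hyperonyms(head))
        vectors.append([int(h in mod_chain) for h in vocabulary] +
                       [int(h in head_chain) for h in vocabulary])
    return vectors

# Training a rule-style learner on the manually tagged triples (sketch):
# from sklearn.tree import DecisionTreeClassifier
# model = DecisionTreeClassifier().fit(
#     binary_features(tagged_pairs, hyperonyms, vocabulary), relation_labels)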


Phase 3: Creating a specialized view of WordNet

At the end of the last phase, we obtained a domain concept forest (DCF) showing the taxonomic and other semantic relations among complex domain concepts. We now integrate this DCF with a core domain ontology, if one is available, or automatically create one from WordNet, extending it with the DCF and pruning concepts not relevant to the domain.

After WordNet attaches the domain concept trees under the appropriate nodes, all branches not containing a domain node can be removed from the WordNet hierarchy. An intermediate node in WordNet is pruned whenever the following conditions hold together. The node

• Has no "brother" nodes
• Has only one direct hyponym
• Is not the root of a domain concept tree
• Is not at a distance ≤ 2 from a WordNet topmost concept (this is to preserve a minimal top ontology)

Figure 7 shows an example of pruning the nodes located over the domain concept tree with root wine#1.

[Figure 7. An intermediate step (a) and the final pruning step (b) in the domain concept tree for wine#1. Circles indicate WordNet concepts; the hierarchy runs from entity#1 through nodes such as object#1, substance#1, artifact#1, fluid#1, food#1, liquid#1, beverage#1, alcohol#1, and drug of abuse#1 down to wine#1, and nodes off the domain path are progressively pruned.]
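The four conditions translate directly into a pruning predicate. The following is a minimal sketch with an invented Node type, not the OntoLearn data model:

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    siblings: list = field(default_factory=list)
    hyponyms: list = field(default_factory=list)
    depth: int = 0   # distance from the nearest WordNet topmost concept

def prunable(node, domain_tree_roots):
    # All four conditions must hold for an intermediate node to be removed.
    return (not node.siblings                        # no "brother" nodes
            and len(node.hyponyms) == 1              # only one direct hyponym
            and node.name not in domain_tree_roots   # not a domain-tree root
            and node.depth > 2)                      # keep a minimal top ontology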
Evaluation of OntoLearn

Although a complete field evaluation is still progressing within the European Community project Harmonise, some basic facts indicate our method's validity. After one year, the tourism experts had released the most general layer of the tourism ontology, comprising approximately 300 concepts. After OntoLearn was introduced, the ontology grew to approximately 3,000 concepts within six months. Manual verification of automatically acquired domain concepts took only a few days.

We have also evaluated OntoLearn independently from the ontology engineering process. Starting from a one-million-word corpus of travel descriptions found on dedicated Web sites, the system automatically extracted a terminology of 3,840 terms. Domain experts manually evaluated these terms, obtaining a precision ranging from 72.9 percent to approximately 80 percent and a recall of 52.74 percent. The spread in precision reflects the fact that experts have different intuitions. The experts estimated the recall after automatically extracting 14,000 terminological candidates from a fragment of the corpus and then manually selecting the "true" domain terms. Our automatic terminology filter identified 52.74 percent of these terms.

Because the participants in the EC project were not skilled in evaluating the semantic interpretation algorithm, we evaluated it ourselves. We used a testbed of about 650 extracted complex terms that had been manually assigned to the appropriate WordNet concepts. These terms contributed to 90 syntactic trees created out of a variable number of concepts (from two to about 30) and variable depth (the maximum is four). The number and depth of generated trees is a property of the sublanguage. In an economic domain, which is more technical than tourism, we found a much larger number of hierarchically related domain concepts.

An extensive evaluation of the whole semantic disambiguation process highlighted that some heuristics contribute more than others. In particular, rules that use glosses bring precise semantic information for term disambiguation. In fact, while the inclusion of those heuristics gives a precision of 84.56 percent (hence this number represents the OntoLearn average precision for semantic disambiguation), their exclusion decreases precision to 79 percent. The precision grows to about 89 percent for highly structured subtrees, such as those in Figure 5. We also computed a baseline, selecting for each concept the first WordNet sense (ordered by frequency of use). The baseline obtained a precision of 75.06 percent.

Although the results are encouraging, an objective evaluation of an ontology as a stand-alone artifact is not fully satisfactory. The only possible success indicator is the subjective acceptance-rejection rate provided by ontology engineers after inspecting automatically extracted information. An ontology can be better evaluated when many users access it regularly and use this shared knowledge to better communicate and access prominent information and documents. An ontology can also be objectively evaluated in the context of another software application, in terms of performance improvement. To this end, we designed an ontology-based multiword translation experiment.

Applying OntoLearn to multiword term translation

Automatic translation of complex nouns is highly relevant to a variety of applications, especially where technical terms occur frequently. The best-performing approaches retrieve the correct translation by using bilingual parallel (that is, aligned) corpora. However, parallel corpora are rare, especially when dealing with such sublanguages as medicine, tourism, and commerce. Other methods use monolingual corpora in the source and target language, exploiting context similarity to extract possible term correspondences. Statistical, algebraic, and machine-learning methods select the appropriate translation. Dictionary-based approaches use the translation of constituent words to build the translation of the full complex term. To deal with the combined ambiguity of two languages (each constituent has many senses, and each sense has several translations), using evidence from corpora and statistical methods can reduce the set of possible translations. (See the "Related Work" sidebar for further information.)

To bypass the difficulty of obtaining correct translations, cross-language information retrieval establishes translingual associations between the full query in the source language and the answering text in the target language.15 However, because terminology conveys most of a text's meaning, complex terms need high-quality translations.

Related Work

It is difficult to compare our work to other projects in the literature, because few papers on noun-phrase translation contain evaluation results. In one paper on complex noun translation from English to Japanese,1 the results are quite poor: approximately 34 percent precision and 60 percent recall for a limited test of 10 compound nouns. (In our algorithm, the low recall is due to the poor dictionary. In principle, for a full mapping of synsets from source to target language, the recall is 100 percent.) Another project2 recently obtained better results by using a Web search for generating translation candidates, much as we do, and then applying a filter based on a Bayesian classifier or on statistical filters. The performance obtained with the best combination of methods is 68.6 percent precision and 86.2 percent coverage. Another paper3 reports using a Web search exclusively, without additional filters, to achieve 86 to 87 percent precision on a very-large-scale experiment. In all the papers, however, the translated terms are base noun phrases extracted from a dictionary or an encyclopedia rather than from an automatically extracted terminology.

These results confirm the extreme effectiveness of Web search heuristics. Nonetheless, using an ontology can greatly reduce the search space with respect to word-to-word translation. Overall, the literature concerning multiword term translation contains a limited number of experiments,1 and the results seem modest.

Further related work concerns a large translingual information retrieval (TIR) experiment.4 Here, the comparative recall-precision graph for various methods reaches the maximum value of 35 percent precision at 60 percent recall, but the maximum precision is 74 percent at almost zero recall. More recent results from the TIR track of TREC 2001 (the annual Text Retrieval Conference) show a precision of 40 to 45 percent for the best systems.

Although the number of contributions in the area of ontology learning and construction has considerably increased, experimental data on the utility of ontologies is not available except for Ontology Server,5 which presents an analysis of user distribution and requests. A better performance indicator would have been the number of regular users that access Ontology Server, but the authors mention that regular users are only a small percentage. Although recent efforts are being made in ontology evaluation tools and methods, such as the ONE-T system at Stanford University, the results are methodological rather than experimental.

References

1. I. Sharhzad et al., "Identifying Translations of Compound Nouns Using Non-aligned Corpora," Proc. Multilingual Information Processing and Asian Language Processing (MAL-99 Workshop); http://kibs.kaist.ac.kr/nlprs99/w_mal99.htm.
2. Y. Cao and H. Li, "Base Noun Phrase Translation Using Web Data and the EM Algorithm," Proc. 19th Int'l Conf. Computational Linguistics (COLING 2002), Morgan Kaufmann, San Francisco, 2002, pp. 127–133; http://research.microsoft.com/users/hangli/HP_files/Cao-Li-COLING02.pdf.
3. G. Grefenstette, "The World Wide Web as a Resource for Example-Based Machine Translation Tasks," Translating and the Computer 21, ASLIB; www.xrce.xerox.com/competencies/content-analysis/publications/Documents/P49030/content/gg_aslib.pdf.
4. Y. Yang et al., "Translingual Information Retrieval: Learning from Bilingual Corpora," Artificial Intelligence, vol. 103, nos. 1–2, 30 Aug. 1998, pp. 323–345; http://www-2.cs.cmu.edu/~yiming/publications.html.
5. A. Farquhar et al., "Collaborative Ontology Construction for Information Integration," www.ksl.stanford.edu/KSL_Abstracts/KSL-95-63.html.

The experiment

OntoLearn makes the dictionary-based approach more feasible because it relates a complex term to an unambiguous complex concept, thus eliminating a major source of ambiguity. To verify this claim, we performed an experiment of multiword term translation in the tourism domain, using EuroWordNet12 to translate English into Italian. We used OntoLearn to extract domain terminology, associate each constituent of a term with its EuroWordNet synset in the source language, and derive the appropriate semantic relation.

A complete example of the information derived by OntoLearn for room service is

synset_WN(room) = {room%1}
gloss_WN({room%1}) = "an area within a building enclosed by walls and floor and ceiling"
synset_WN(service) = {service%1}
gloss_WN({service%1}) = "work done by one person or group that benefits another"
{room%1} (PLACE) {service%1}

where synset_WN(w) and gloss_WN(w) are, respectively, the WordNet synset and WordNet definition for a word w.

For each synset, there are three possible cases:

• The translation of the same synset is available in the target language.
• One or more hyponyms, hypernyms, or near synonyms are available in the target language.
• No correspondences exist (the most frequent case in Italian EuroWordNet).

The algorithm selects a corresponding synset only in the first case.

A list of synonyms often serves as a basis for a synset translation, but usually only one word is appropriate to generate it. For example, the translation of sense #1 of center (centre, center, middle, heart, eye) is (centro, cuore), corresponding to words #1 and #4 of the English synset. However, only centro is an appropriate translation in the context of the term health center. To select a translation, we queried the Web with the alternative translations and counted the hits in Google for each alternative. In contrast to other dictionary-based methods, we were dealing with disambiguated terms, which reduces the number of alternate translations. The hit-count method worked well. In some cases, more than one translation was acceptable.
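A minimal sketch of this selection step, assuming a hit_count function that wraps the Web query (no real search API is invoked here):

from itertools import product

def candidate_phrases(per_word_translations):
    # Cartesian product of the translation lists of each constituent,
    # producing every candidate rendering of the full term.
    return [" ".join(words) for words in product(*per_word_translations)]

def pick_translation(per_word_translations, hit_count):
    # Keep the candidate the Web uses most; hit_count(phrase) stands in
    # for the Google hit counts queried in the article.
    return max(candidate_phrases(per_word_translations), key=hit_count)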


The last step was to replace the source language construction for a term (usually a compound or an adjective-noun pair in English) with the appropriate construction in the target language. We used the information provided by the OntoLearn semantic relations to do this. In Italian, a compound corresponds to a postmodified prepositional phrase, and the type of preposition depends on the subsumed semantic relation. We used a mapping between conceptual relations and prepositions to construct the appropriate complex nominal. For example,

{room#1} (PLACE) {service#1}

becomes

SERVIZIO in CAMERA

For some relations, more than one preposition might be appropriate. We selected the most frequently used, again using supporting evidence from the Web.

Using the pertainymy relation in WordNet 1.6 proved to be a good heuristic. When a noun was related to its adjectival realization, we created a postmodified adjectival phrase, usually the most appropriate translation in Italian. For example, the English term folklore festival is better translated as sagra folcloristica (folkloristic festival) than as festival sul folklore (festival on folklore), although the prepositional realization is acceptable.

Adjectival constructs are preferable only for a subset of semantic relations. Even in this case, it would be better to produce more alternatives and use corpus evidence to restrict the set depending upon usage, but we could not apply this heuristic extensively due to the limited number of adjectives encoded in the Italian EuroWordNet.
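A sketch of this relation-driven generation step. The PLACE-to-in entry comes from the room service example in the text; the article does not give its full relation-to-preposition table, so the mapping here is a placeholder:

PREPOSITION = {"PLACE": "in"}   # extend with the article's full mapping

def italianize(head_translation, relation, modifier_translation):
    # {room#1}(PLACE){service#1} -> "servizio in camera": the head noun is
    # postmodified by a prepositional phrase selected by the relation.
    return f"{head_translation} {PREPOSITION[relation]} {modifier_translation}"

print(italianize("servizio", "PLACE", "camera"))   # servizio in camera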
Results

We conducted the experiment on the 405 complex terms extracted from our manually validated tourism corpus. The validation concerns both semantic disambiguation and extraction of semantic relations. The cumulative error rate of OntoLearn on these data was 16.8 percent, 2 percent worse than the performance we computed over the set of 650 complex terms used to evaluate disambiguation. The semantic relation was inappropriate in only three cases; all other errors were due to semantic disambiguation. However, as the translation experiment results later proved, the sense selected by OntoLearn was often very close to the manual one.

We submitted both the validated and the error-prone data to the translation procedure so that we could distinguish between errors that occurred during semantic interpretation and those that occurred during the generation of translations. Unfortunately, the primary obstacle was the poor encoding of Italian EuroWordNet with respect to WordNet 1.5 and 1.6. In the English version, 66,042 synsets are coded for nouns and 17,944 for verbs. In Italian, these numbers drop to 31,806 and 3,873.

Our 405 complex terms were made out of 335 different words. Several words occurred frequently: 44 complex terms with the head center, 45 with the head service, and 16 with the head hotel. OntoLearn associated these words with 317 different synsets, since some words were synonyms. Of these synsets, only 164 had a correspondent in Italian WordNet. For 53, there was a near synonym, hyponym, or hyperonym; for 41, the synset was not encoded, but other unrelated synsets of the source language word were available; and in 59 cases, none of the English word's synsets had a correspondent.

Consequently, we had only 101 complex concepts (113 when manually corrected) for which our translation procedure determined the correspondent synsets in the target language for all components. In some cases, the Italian translation of a synset was rather odd or archaic. For example, the translation of trade in the sense #1 "the commercial exchange of goods and services," as in trade center, is tratta, which connotes more of the "patronage" sense. In other cases, the preferred adjectival realization was not available in Italian WordNet. For example, tour operator is better translated as operatore turistico (adjective) than as operatore del turismo (noun). In other cases, the English term is more popular than its reported Italian translation, as in cyberspace and its translation cyberspazio. In all these cases, even though the translations are poor or barely acceptable, we decided not to prune them from our evaluation because this would be subjective. Rather, we plan to run our next experiment using other target languages, such as Spanish, for which EuroWordNet provides a richer bilingual dictionary. Table 1 summarizes our results.

Table 1. Results of the automated translation experiment for complex domain concepts that have corresponding synsets (percentages and absolute numbers).

Quality of translation       Good       Acceptable   Poor       Total
Manually corrected input     74% (84)   14% (16)     12% (13)   113
OntoLearn-generated input    70% (71)   14% (14)     16% (16)   101

Note in Table 1 that the difference in precision between the manually corrected and the automatically generated input in the first column is 4 percent, which is much less than the OntoLearn error rate. This demonstrates that often an OntoLearn-generated synset is only subtly different in meaning from the manually chosen synset; the translation algorithm selects the right term anyhow. In some cases, two English synsets are coded with the same Italian synset.

To better evaluate our results, we computed a baseline by selecting the most probable synset in WordNet for each complex term component—for example, sense #1. We then evaluated the quality of the translations. In general, the first-sense heuristic produced the right sense combinations for 68 of 113 cases (60.1 percent). When the synsets are not dramatically different (this happens rarely, as when erroneously selecting the "transmission" sense for channel in channel island), the Web search heuristics "adjust" the translation. Therefore, the overall translation quality for the baseline is much higher than expected, given the initial disambiguation error rate. For a fair comparison, we selected the 60 complex terms for which both OntoLearn and the baseline method could produce a translation—that is, all the synsets in a complex concept had a correspondent synset in Italian. In these cases, the OntoLearn-based translation produced four errors (6.6 percent), whereas the first-sense heuristic had seven (11.6 percent).

Although not striking, these results are sufficient, and we believe they could improve significantly with a richer version of Italian EuroWordNet or with yet another EuroWordNet language.

We consider our results to be good, given that the experiment is preliminary and there is room for improvement. The most serious problem is that our approach works well only for those complex terms that correspond with their translation on a part-by-part basis. We found few exceptions in our experiment, but the problem certainly arises for languages such as German or Japanese. In these cases, additional computational devices are needed to split or merge the various parts.

This translation experiment also sought to demonstrate the utility of an automatically learned ontology within an application. This is not a trivial result. To continue our analysis of ontology-sensitive applications, our ongoing research focuses on ontology-based information retrieval.

Acknowledgments

The OntoLearn project was funded in part by the EC projects Fetish ITS-13015 and Harmonise IST-2000-29329 (http://dbs.cordis.lu). The project was also funded in part by other ontology-related projects such as the OntoWeb thematic network (http://ontoweb.aifb.uni-karlsruhe.de) and WonderWeb IST-2001-33052 (http://wonderweb.semanticweb.org).

The Authors

Roberto Navigli is a grant recipient in the Department of Computer Science at the Università di Roma La Sapienza. His research interests include natural language processing and knowledge representation. He received his Laurea degree in computer science from the Università di Roma La Sapienza. Contact him at Univ. di Roma La Sapienza, Dipart. di Informatica, Via Salaria 113, 00196 Roma, Italy; [email protected].

Paola Velardi is a full professor in the Department of Computer Science at the Università di Roma La Sapienza. Her research interests include natural language processing, machine learning, and the Semantic Web. She received her Laurea degree in electrical engineering from the Università di Roma La Sapienza. Contact her at [email protected].

Aldo Gangemi is a research scientist at the Institute for Cognitive Sciences and Technology. His research interests include conceptual modeling, ontological engineering, and the Semantic Web. He received his Laurea degree in philosophy from the Università di Roma La Sapienza. Contact him at [email protected].

References

1. A. Maedche and S. Staab, "Ontology Learning for the Semantic Web," IEEE Intelligent Systems, vol. 16, no. 2, Mar./Apr. 2001, pp. 72–79.
2. D. Fensel et al., "OIL: An Ontology Infrastructure for the Semantic Web," IEEE Intelligent Systems, vol. 16, no. 2, Mar./Apr. 2001, pp. 38–45.
3. A. Miller, "WordNet: An On-line Lexical Resource," J. Lexicography, vol. 3, no. 4, Dec. 1990.
4. R. Navigli and P. Velardi, "Semantic Interpretation of Terminological Strings," Proc. 6th Int'l Conf. Terminology and Knowledge Eng. (TKE 2002), INIST-CNRS, Vandoeuvre-lès-Nancy, France, 2002, pp. 95–100.
5. M. Missikoff, R. Navigli, and P. Velardi, "The Usable Ontology: An Environment for Building and Assessing a Domain Ontology," Proc. 1st Int'l Semantic Web Conf. (ISWC 2002), Lecture Notes in Computer Science, no. 2342, Springer-Verlag, Berlin, 2002, pp. 39–53.
6. R. Basili, M.T. Pazienza, and P. Velardi, "An Empirical Symbolic Approach to Natural Language Processing," Artificial Intelligence, vol. 85, nos. 1–2, Aug. 1996, pp. 59–99.
7. P. Vossen, "Extending, Trimming and Fusing WordNet for Technical Documents," Proc. NAACL 2001 Workshop WordNet and Other Lexical Resources, Assoc. for Computational Linguistics, East Stroudsburg, Pa., 2001, pp. 125–131.
8. E. Morin, "Automatic Acquisition of Semantic Relations between Terms from Technical Corpora," Proc. 5th Int'l Congress Terminology and Knowledge Eng. (TKE 99), Int'l Network for Terminology, Vienna, 1999.
9. E. Agirre et al., "Enriching Very Large Ontologies Using the WWW," ECAI 1st Ontology Learning Workshop, 2000; http://ol2000.aifb.uni-karlsruhe.de.
10. J. Sowa, Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, Boston, 1984.
11. P. Buitelaar, "CoreLex: An Ontology of Systematic Polysemous Classes," Proc. Int'l Conf. Formal Ontology in Information Systems (FOIS 98); www.dfki.de/~paulb/pub.html.
12. "EuroWordNet: Building a Multilingual Database with Wordnets for Several European Languages," www.illc.uva.nl/EuroWordNet.
13. T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
14. J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Francisco, 1993.
15. Y. Cao and H. Li, "Base Noun Phrase Translation Using Web Data and the EM Algorithm," Proc. 19th Int'l Conf. Computational Linguistics (COLING 2002), Morgan Kaufmann, San Francisco, 2002, pp. 127–133; http://research.microsoft.com/users/hangli/HP_files/Cao-Li-COLING02.pdf.