Subword Clusters as Light-Weight for Multilingual Document Retrieval

Udo HAHN Kornel´ MARKO´ Stefan SCHULZ Jena University Freiburg University Hospital & Information Engineering Lab Medical Informatics Department F¨urstengraben30 Stefan-Meier-Str. 26 D-07743 Jena, Germany D-79104 Freiburg, Germany [email protected] [email protected]

Abstract creasingly getting access to these resources, with the Web adding an even more diversified crowd of We introduce a light-weight interlingua for a cross- searchers. Hence, mappings between different lin- language document retrieval system in the medical guistic registers are inevitable to serve the needs of domain. It is composed of equivalence classes of such a heterogeneous search community. Therefore, semantically primitive, language-specific subwords automatically performed intra- and interlingual lex- which are clustered by interlingual and intralingual ical mappings and transformations of equivalent ex- synonymy. Each subword cluster represents a basic pressions become an obvious necessity to support conceptual entity of the language-independent inter- these different user groups in an adequate manner. lingua. Documents, as well as queries, are mapped We here propose an approach which is intended to this interlingua level on which retrieval opera- to meet these particular challenges. At its core tions are performed. Evaluation experiments reveal lies a new type of interlingua the basic entities of that this interlingua-based retrieval model outper- which are composed of semantically minimal sub- forms a direct translation approach. words. From a linguistic perspective, subwords are often closer to formal Porter-style stems (Porter, 1 Introduction 1980) rather than to lexicologically orthodox ba- Medical document retrieval presents a unique com- sic forms, e.g., of verbs or nouns or linguistically bination of challenges for the design and imple- plausible stems. Hence, their merits have to be mentation of retrieval engines. Clinical document shown in (retrieval) experiments. These language- collections have increasingly become available in specific subwords form semantically defined equiv- electronic form (e.g., as Electronic Patient Records alence classes which capture intralingual as well as (EPRs)) and their size is rapidly growing, with es- interlingual (near) synonymy between all subwords timates ranging, for a single clinical site, on the in a single cluster. Thus, they abstract away from order of millions of documents in total, including subtle particularities within and between . hundreds to thousands new documents being added We do not claim to cover the lexicon of general lan- every day. Hence, there is a growing demand for guage but rather restrict ourselves to the terminol- automatic support for content-oriented access and ogy used in the medical domain. browsing of electronic files and EPRs. In Section 2, we elaborate on the lexicological Furthermore, medical document collections are foundation of this interlingua, i.e., the format of inherently multi-lingual. While clinical texts are subwords and their synonymy relations, and their usually written in the native language of the coun- role in the process of morphosemantic normaliza- try, searches in major bibliographic databases (e.g., tion. The usefulness of subwords will be shown in MEDLINE) require a substantial proficiency of retrieval experiments (Section 3), in which we con- (expert-level) English medical terminology. Non- trast our interlingua-based retrieval approach to one native speakers of English, however, often lack this which relies on direct translation only (Section 4). particular competence. Hence, some sort of bridg- ing between synonymous or, at least, related terms 2 Light-Weight Interlingua from different languages has to be provided to make We here introduce the notion of subwords (Section full use of the information these databases hold. 2.1), their organization in terms of an interlingua Finally, the user population of medical document (Section 2.2), some principles underlying the cre- retrieval systems and their search strategies are re- ation and maintenance of the lexicon as well as the ally diverse. Not only physicians, but also nurses, interlingua resource (Section 2.3), and the basic pro- medical insurance companies and patients are in- cedure for morphosemantic analysis (Section 2.4). 2.1 Subwords • Two types of meta relations can be asserted From a linguistic perspective, the proper choice of between synonymy classes: the granularity of the basic lexical units is usually (i) a paradigmatic relation has-meaning, guided by syntactic considerations, i.e., the syntax which relates one ambiguous class to its of words (e.g., inflection or derivation) or the syntax specific readings, as with: of sentences (e.g., in terms of subcategorization or {head} ⇒ {kopf,zephal,caput,cephal,cabec,cefal} valency frames). For the proper choice of subwords, OR {boss,leader,lider,chefe}. however, semantic considerations are key. Espe- (ii) a syntagmatic relation expands-to, which cially in scientific and technical sublanguages, we consists of predefined segmentations in case observe that semantically non-decomposable enti- of utterly short subwords, such as: ties and domain-specific suffixes (e.g., ‘-itis’ (Pacak {myalg} ⇒ {muscle,muskel,muscul} ⊕ {pain, et al., 1980)) are chained in complex word forms schmerz,dor}. such as in ‘pseudo⊕hypo⊕para⊕thyroid⊕ism’, Compared with relationally richer, e.g., WORD- ‘pancreat⊕itis’ or ‘gluco⊕corticoid⊕s’.1 We refer NET based, interlinguas used for cross-language in- to these self-contained, semantically minimal units formation retrieval (Gonzalo et al., 1999; Ruiz et as subwords and motivate their status primarily by al., 1999), we hence incorporate a much more lim- their usefulness for document retrieval rather than ited set of semantic relations and pursue a more by linguistic arguments. restrictive approach to synonymy. We also refrain The minimality criterion is often weaker than, from introducing additional hierarchical relations e.g., for , though it is hard to define in between MIDs because such links can be acquired a general way. For example, given the text token from domain-specific vocabularies, e.g., the Medi- ‘diaphysis’, a linguistically plausible - cal Subject Headings (MeSH, 2004) (cf. experimen- style segmentation might lead to ‘dia⊕phys⊕is’. tal evidence from Mark´oet al. (2004)). From a medical perspective, however, a segmen- tation into ‘diaphys⊕is’ seems much more reason- 2.3 Engineering the Lexicon and Interlingua able because the canonical linguistic decomposi- In the development workflow, the effects of changes tion is far too fine-grained and likely to create too of subword size and granularity are immediately fed many subword ambiguities (which would be harm- back to the developers using word lists to test and ful to precision). Comparable ‘low-level’ segmen- validate both the segmentation and the assignment tations of semantically unrelated tokens such as of MIDs. A collection of parallel texts (abstracts of ‘dia⊕lyt⊕ic’, ‘phys⊕iol⊕ogy’ lead to morpheme- medical publications in English plus either German, style subwords ‘dia’ and ‘phys’, which unwarrant- Spanish or Portuguese) are used to detect errors in edly match ‘dia⊕phys⊕is’, too. The (semantic) the assignment of MIDs. To impose common poli- self-containedness of the chosen subword is also of- cies on the lexicon builders, we developed a main- ten supported by the existence of a synonym, e.g., tenance manual which contains 31 rules. The most for ‘diaphys’ we have ‘shaft’. critical tasks they cover are listed below:

2.2 From Subwords to Interlingua • The proper delimitation of subwords (e.g., Subwords are assembled in a lexical repository, with ‘compat⊕ibility’ vs. ‘compatib⊕ility’); the following considerations in mind: • The decision whether an affix introduces a new meaning which would justify a new entry (e.g., • Subwords are listed, together with their at- ‘neur⊕osis’ vs. ‘neuros⊕is’); tributes such as language (English, German, Portuguese, Spanish) or subword type (stem, • Data-driven decisions, such as to add ‘-otomy’ prefix, suffix, invariant). Each subword is as a synonym of ‘-tomy’ in order to block erro- assigned one or more morpho-semantic class neous segmentations such as ‘nephrotomy’ into identifier(s), we call MID(s), representing the ‘nephr⊕oto⊕my’; corresponding synonymy equivalence class. • The decision to exclude short stems from seg- mentation (such as ‘my-’, ‘ov-’) in order to • Intralingual synonyms and interlingual trans- block false segmentations; lation synonyms of subwords are assigned the same equivalence class (judged within the con- • The decision to locate the appropriate level of text of only). semantic abstraction when equivalence classes are formed, e.g., by grouping {‘hyper-’, ‘high’, 1`⊕' denotes the concatenation operator. ‘elevate’} into the same class; High TSH values suggest the high tsh values suggest the Orthographic • The decision which function words and af- diagnosis of primary hypo- diagnosis of primary hypo- Normalization fixes are excluded from indexing, such as thyroidism ... thyroidism ... ‘and’, ‘-ation’, ‘-able’, and those which are not Erhöhte TSH-Werte erlauben die Orthographic erhoehte tsh-werte erlauben die Diagnose einer primären Hypo- Rules diagnose einer primaeren hypo- ‘dys-’, ‘anti-’, ‘-itis’. thyreose ... thyreose ... Original Morphosyntactic Parser Subword Lexicon In the meantime, the entire subword lexicon (as MID Representation #up# tsh #value# #suggest# high tsh value s suggest the Semantic #diagnost# #primar# #small# diagnos is of primar y hypo of July 2005) contains 68,615 entries, with 22,312 Normalization 2 #thyre# thyroid ism for English, 23,600 for German, 14,892 for Por- #up# tsh #value# #permit# er hoeh te tsh wert e erlaub en die tuguese, and 7,811 for Spanish. All of these entries #diagnost# #primar# #small# Subword diagnos e einer primaer en hypo Thesaurus are related in the thesaurus by 20,916 equivalence #thyre# thyre ose classes. We also found a well-known logarithmic Figure 1: Morphosemantic Processing Pipeline growth behavior as far as the increase of the num- ber of subwords are concerned (Schulz and Hahn, 3 2000). Under this observation, at least the English character substitutions to ease the matching of and German subword lexicons have already reached (parts of) text tokens and entries in the dictionary. their saturation points. The next step in the pipeline is concerned with Our project started from a bilingual German- morphosyntactic parsing. The parser segments the English lexicon, while the Portuguese part was orthographically normalized input stream into a se- added in a later project phase (hence, its size still quence of subwords as found in the lexicon. The lags somewhat behind). All three lexicons and the segmentation results stored in a chart are checked common thesaurus structure were manually con- for morphological plausibility using a finite-state structed, which took us about five person-years. automaton in order to reject invalid segmentations While we simultaneously experimented with vari- (e.g., segmentations without stems or beginning ous subword granularities as well as weaker and with a suffix). If there are ambiguous valid readings stronger notions of synonymy, this manual ap- or incomplete segmentations (due to missing entries proach was even heuristically justified. With a in the lexicon), a series of heuristic rules are applied, much more stable set of criteria for determining which prefer those segmentations with the longest subwords emerging from these experiments, we re- match from the left, the lowest number of unspeci- cently switched from a manual to an automatic fied segments, etc. Whenever the segmentation al- mode for lexicon acquisition. The Spanish sublex- gorithm fails to detect a valid reading, all extracted icon, unlike all other previously built sublexicons, stems of four characters or longer – if available – was the first one generated solely by an automatic are preserved and the remaining fragments are dis- learning procedure. It makes initial use of cognate carded. Otherwise, if no stem longer than four char- relations that can be observed for typologically re- acters can be determined during the segmentation, lated languages (Schulz et al., 2004) and has re- we recover the original word. This method was use- cently been embedded into a bootstrapping method- ful for the preservation of proper names, although a ology which induces new subwords that cannot be dedicated name recognizer is still a desideratum for found by considering merely cognate-style string our system. similarities. This extended acquisition mode makes In the final step, semantic normalization, each heavy use of contextual co-occurrence patterns in subword recognized is substituted by its corre- comparable corpora (Mark´oet al., 2005b). sponding MID. After that step, all synonyms within a language and all translations of semantically 2.4 Morphosemantic Processing equivalent subwords from different languages are Figure 1 depicts how source documents are con- represented by the same MID. verted into an interlingual representation by a three- Composed terms (such as ‘myalg⊕y’) which are step procedure. We start with orthographic nor- linked to their components by the expands-to rela- malization. A preprocessor reduces all capitalized tion are substituted by the MIDs of their compo- characters from input documents to lower-case char- nents, in the same way as if this were performed by acters and, additionally, performs language-specific the parser. Ambiguous classes, i.e., those related by a has-meaning link to two or more classes, produce 2 Just for comparison, the size of WORDNET assem- a sequence of their related MIDs (for interlingua- bling the lexemes of general English in the 2.0 version is on the order of 152,000 entries (http://wordnet. based disambiguation, cf. Mark´oet al. (2005a)). princeton.edu/man/wnstats.7WN, last visited on May 13, 2005). Linguistically speaking, the entries are basic 3For German, e.g., ‘ß’ → ‘ss’, ‘¨a’ → ‘ae’, ‘¨o’ → ‘oe’, ‘¨u’ forms of verbs, nouns, adjectives and adverbs. → ‘ue’ and for Portuguese ‘c¸’ → ‘c’, ‘´u’ → ‘u’, ‘˜o’ → ‘o’. QTR Approach: Machine Translation and Bilingual Dictionaries MSI Approach: Language Independent Morphosemantic Indexing ((E)nglish, (P)ortuguese, (S)panish, (G)erman)

QueryE/P/S/G QueryP/S/G QueryE DocE DocE

Machine Translation: Filtering Stop Words Google Translator

Orthographic Rules (E/P/S/G) Bilingual Morphosemantic UMLS Dictionary Subword- Lexicon (E/P/S/G) Normalization

Subword- Filtering Stop Words Thesaurus

QueryMSI Porter Stemmer DocMSI

Search Engine Search Engine

Index (stems) Index (MIDs)

Filtered and stemmed English documents (extract from 89270656): Filtered and morphosemantically indexed documents (extract from 89270656): Progestogen chosen addit estrogen replac import progestin advers influenc effect #progest# #choose# #overlay# #estrogen# #substitut# #important# #progest# #advers# oral estrogen lipid metabol #influenc# #oro# #estrogen# #lipid# #metabol# Filtered and stemmed English queries: Filtered and morphosemantically indexed English query: Q :#advers# #influenc# #lipid# #progest# #give# #estrogen# #substitut# #therapeut# Q :advers effect lipid progesteron given estrogen replac therapi 1 1 Filtered and morphosemantically indexed Portuguese query: Automatically translated, filtered and stemmed Portuguese queries: Q1:#exist# #influenc# #advers# #lipid# #progest# #give# #linkag# #therapeut# Q1:advers effect lipid exist progesteron given togeth spare therapi estrogen #substitut# #rek# #posit# #estrogen# Automatically translated, filtered and stemmed Spanish queries: Filtered and morphosemantically indexed Spanish query: Q :advers effect exist lipid progesteron given estrogen therapi availabl 1 Q1: #influenc# #advers# #lipid# #progest# #give# #therapeut# #substitut# #estrogen# Automatically translated, filtered and stemmed German queries: Filtered and morpho-semantically indexed German query:

Q1:unwant side effect lipidstoffwechsel gift progesteron östrogenersatztherapi Q1: #give# #non# #desir# #influenc# #collater# #lipid# #metabol# #dispensat# #progest# #estrogen# #substitut# #therapeut# Figure 2: Direct Query Translation (left: QTR) vs. Interlingual Morphosemantic Indexing (right: MSI)

3 Experimental Setting placement therapy?”. In Figure 2, the results of pro- cessing this query and an extract of one retrieved 3.1 Document Corpus document illustrate the two alternative approaches Our experiments were run on the OHSUMED cor- we discuss. (Bold terms co-occur in queries and the pus (Hersh et al., 1994), which constitutes one of document fragment.) the standard IR testbeds for the medical domain. The OHSUMED corpus contains only English- OHSUMED is a subset of the MEDLINE database language documents (and queries). This raises the which contains bibliographic information (author, question of how this collection (or MEDLINE, in title, abstract, index terms, etc.) of life science and general) can be accessed from other languages as biomedicine articles. Because we only considered well. It is a realistic scenario because, unlike in the title and abstract field for each bibliographic sciences with English as the lingua franca, among unit, we obtained a document collection comprised medical doctors native languages are still dominant of 233,445 texts. (115,121 out of all 348,566 doc- in their education and everyday practice. In order to uments contain no abstract and were therefore ig- solve this problem, medical practitioners might re- nored.) Our test collection is made of 41,924,840 sort to translating their native-language search prob- tokens, and the average document length is 179.6 lem to English with the help of current Web technol- tokens (with a standard deviation of 76.4). ogy, e.g., an automatic translation service available Since the OHSUMED corpus was created specif- in a standard Web search engine. Its operation might ically for IR studies, 106 queries are available (ac- further be enhanced by lexical resources as avail- tually 105, because for one query no relevant doc- able from the U.S. National Library of Medicine in uments could be found), including associated rel- support of various non-English languages, e.g. the evance judgments. The average number of query UMLS Metathesaurus (UMLS, 2004) (which cur- terms is 5.1 (with a standard deviation of 1.8). This rently supports – with considerable differences in is a typical query: “Are there adverse effects on coverage – German, French, Spanish, Portuguese, lipids when progesterone is given with estrogen re- Russian, and many others). As a matter of fact, this procedure, direct Query Translation (QTR), reduces 3.2 Search Engine the cross-language retrieval problem to a mono- For an unbiased evaluation, we ran several exper- lingual one. As an alternative, we consider the iments with LUCENE,6 a freely available open- interlingua-based cross-language approach in terms source search engine which combines Boolean of Morphosemantic Indexing (MSI) as introduced searching with a sophisticated ranking model based in Section 2. Both approaches will then be evalu- on TF-IDF. Beside its ranking facility, which ated on the same query and document set. As the achieves results that even can outperform ad- baseline for our experiments, we provide a retrieval vanced vector retrieval systems (Tellex et al., 2003), system operating with the Porter stemmer (Porter, LUCENE has another advantage: it supports a rich 4 1980) and language-specific stop word lists so that query language, like multi-field search, and more the system runs on (original) English documents than ten different query operators. In our experi- with (original) English queries. ments we made use of proximity search, which al- The (human or machine) translation of native- lows to find words within a specified window size. language queries into the target language of the doc- For example, given the query talar fracture∼3, ument collection to be searched (QTR) is a stan- LUCENE finds documents containing the words ‘ta- dard experimental procedure in the cross-language lar’ and ‘fracture’ within three words distance of retrieval community (Eichmann et al., 1998). In each other and allows word swaps (e.g., ‘fracture our experiments, the original English queries were of the talar bone’, ‘talar bone fracture’). In pre- first translated into Portuguese, Spanish and Ger- vious experiments, we discovered that this feature man by medical experts (all native speakers of increases the retrieval performance in any scenario, those languages, with a very good mastery of including the baseline condition. Especially, the ef- both general and medical English). In the sec- fect of considering a window of three items signifi- ond step, the manually translated queries were re- cantly increases the score of clustered matches. This translated into English using the GOOGLE TRANS- becomes particularly important in the segmentation LATOR.5 Admittedly, this tool may not be partic- of complex word forms.7 ularly suited to translate medical terminology (in fact, 17% of the German, 16% of the Portuguese, 4 Experimental Results and 14% of the Spanish query terms were not trans- Three different test scenarios can now be distin- lated). Hence, we additionally used bilingual lex- guished for our retrieval experiments: eme dictionaries derived from the UMLS Metathe- saurus with about 26,000 German-English entries, • BASELINE: The OHSUMED corpus serves as 14,200 entries for Portuguese-English, and 22,900 the baseline of our experiments both in terms for Spanish-English. If no English correspondence of its Porter-stemmed English queries and its could be found, the terms were left untranslated Porter-stemmed English document collection. (this, finally, happened to 7% of the German, as well as 5% of the Portuguese and Spanish query • QTR: German, Portuguese and Spanish terms). Just as in the baseline condition, the stop queries are automatically translated into En- words were removed from both the documents and glish ones using the GOOGLE TRANSLA- the automatically translated queries. The left side of TOR and the UMLS Metathesaurus, which are Figure 2 visualizes this approach which we refer to Porter-stemmed after the translation. These as QTR. queries are directly evaluated on the Porter- As an alternative to QTR, we tested MSI, the stemmed OHSUMED document collection. approach as described in Section 2. Unlike QTR, • MSI: German, Portuguese and Spanish the indexing of documents and queries using MSI queries are automatically transformed into the (after stop word elimination), yields a language- language-independent MSI interlingua (plus independent, semantically normalized index format. lexical remainders). The entire OHSUMED The right side of Figure 2 visualizes the basic com- document collection is also submitted to putation steps for MSI. the MSI procedure. Finally, the MSI-coded

4We used the stemmer available on http://www. 6http://jakarta.apache.org/lucene/docs/ snowball.tartarus.org, last visited on January 2005. index.html, last visited on January 2005. The incorporated stop word lists contained 172 English, 232 7 Otherwise, a document containing ‘append⊕ectomy’ and German, 220 Portuguese, and 329 Spanish entries. ‘thyroid⊕itis’, and another one containing ‘append⊕ic⊕itis’ 5 http://www.google.de/language tools, last and ‘thyroid⊕ectomy’ become indistinguishable after segmen- visited on January 2005. tation. queries are evaluated on the MSI-coded BASELINE 0.5 ENGLISH-MSI OHSUMED document collection, both at an interlingual representation level. 0.4 We take several measurements in comparing the 0.3 performance of QTR and MSI. The first one is the Precision 0.2 average precision at all eleven standard recall points (0.0, 0.1, 0.2, ..., 1.0). These values are depicted in 0.1 Figure 3 for all scenarios we considered. We also 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 calculate the average at the top two recall points Recall 0.5 (0.0 and 0.1). While this data was computed with BASELINE GERMAN-MSI consideration of the first 200 documents under each GERMAN-QTR 0.4 condition, we also calculated the exact precision scores for the top five and top 20 ranked documents. 0.3 As shown in Table 1 (first row), the English-

Precision 0.2 English baseline reaches 0.2 precision on the 11pt average. We also ran an experiment where we MSI- 0.1 indexed the original OHSUMED corpus and english 0.0 queries. This boosted the 11pt average to 0.22. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Clearly, this approach cannot be taken as the base- Recall 0.5 BASELINE line condition, since it confounds the notion of base- PORTUGUESE-MSI PORTUGUESE-QTR line with that of experimental conditions. It is inter- 0.4 esting though, because it reveals some of the poten- tial of MSI for (medical) indexing. 0.3

The German-English MSI result is almost on a Precision 0.2 par with the baseline (0.01 less (0.19)), whereas the German-English QTR result is more than 0.08 0.1 points worse (0.12). Hence, the MSI approach 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 achieved 95% of the baseline performance (quite Recall 0.5 a high score given CLIR standards), whereas QTR BASELINE SPANISH-MSI scored far lower (60%), resulting in a 35 percentage SPANISH-QTR points difference between the two approaches. 0.4

The difference turns out to be less dramatic, 0.3 but still noticeable, in comparing the Portuguese- English MSI and QTR results with the baseline Precision 0.2

(78% for MSI and 52% for QTR, hence, 26 percent- 0.1 age points difference). We may speculate that this 0.0 result is not due to any particularities of the Por- 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 tuguese language, but rather to the uneven invest- Recall ment of effort in building various lexicons (the size of the Portuguese subword lexicon is only two thirds Figure 3: Precision/Recall Graphs for, from top of the corresponding German and English ones; cf. to bottom, the English-English baseline, German- English, Portuguese-English, Spanish-English Section 2). For Spanish, QTR precision averages 40% of the monolingual baseline (0.08), whilst 69% of the baseline is reached for MSI (0.14). This is a significant win given that the Spanish dictionary Medical decision-makers are more often inter- was built in a fully automatic way. ested in a few top-ranked documents. Thus, the ex- Interesting from a realistic retrieval perspective act precision scores for these documents are more is the average gain on the top two recall points. indicative of the performance of the two approaches In Table 1 (second row), the German-English MSI in such a standard medical retrieval context (see Ta- condition achieves a precision of 0.41 (92% of the ble 1, third and fourth row). MSI exceeds QTR by baseline), the Portuguese-English condition yields a 12-17 percentage points for German, 9-10 percent- precision value of 0.36 (80% of the baseline). For age points for Portuguese, and even 21-25 percent- Spanish, still 78% of the monolingual baseline pre- age points for Spanish, considering the top 5, re- cision is reached. spectively top 20, ranked documents. English German Portuguese Spanish BASE MSI QTR MSI QTR MSI QTR MSI 11pt .2031 .2211 (108.9) .1209 (59.5) .1930 (95.0) .1051 (51.7) .1489 (77.6) .0811 (39.9) .1409 (69.0) top 2pt .4503 .4944 (109.8) .2926 (65.0) .4142 (92.0) .2779 (61.7) .3583 (79.6) .1900 (42.2) .3509 (77.9) top 5 .7566 .7301 (96.5) .4603 (60.8) .5528 (73.0) .4396 (58.1) .5094 (67.3) .3566 (47.1) .5170 (68.3) top 20 .6033 .6066 (100.5) .3741 (62.0) .4764 (79.0) .3528 (58.5) .4118 (68.3) .2726 (45.2) .4212 (69.8)

Table 1: Standard Precision/Recall Table (% of Baseline in Brackets)

5 Related Work pre-processing, including lemmatization. Although very good results are communicated, these are not After more than a decade of active research, cross- comparable to ours because the authors use a home- language information retrieval (CLIR) has made grown document and query collection, as well as considerable achievements (Grefenstette, 1998; baselines diverging from ours. Gey et al., 2002). From a methodological point Eichmann et al. (1998) report on CLIR experi- of view, the field is divided into dictionary-based ments for French and Spanish using the same test vs. corpus-based approaches (Oard and Diekema, collection as we do (OHSUMED), and the UMLS 1998). Since corpus-based approaches depend on Metathesaurus for query translation, achieving 71% the availability of large parallel corpora, which are of baseline for Spanish and 61% for French. With mostly out of reach for technical sublanguages, the vector space engine they employ, their overall most efforts in CLIR are centered around either 11pt performance (0.24) is slightly above the one query translation or document translation (Rosem- for the search engine we use (0.20). This, however, blat et al., 2003). McCarley (1999) reports on a does not compromise our results since our experi- translation model, which incorporates both query ments are aimed at comparing the performance of and document translation and outperforms either two different CLIR methods and not at comparing translation direction. A more recent strategy for ma- different search engine architectures. Moreover, the chine translation based CLIR is the use of commer- search engine we employ is more in line with cur- cial software for query processing (Savoy, 2003), rent clinical and Web retrieval engines and the re- which usually provides only poor support of tech- quirements they have to fulfil. nical sublanguages. For medical terminology and other sublanguages, non-specialized multilingual lexicons (e.g., based on WordNet) also offer lim- 6 Conclusions ited support only (Gonzalo et al., 1999). Hence, we We presented an interlingua approach to cross- were faced with the need to construct a multilingual language retrieval on a medical document collec- medical lexicon from scratch. tion. It is based on subword clusters, i.e., equiva- The success of dictionary-based CLIR largely de- lence classes of subwords which capture intra- and pends on the coverage of the lexicon, tools for con- interlingual synonymy.8 Compared with a direct- flating morphological variants, phrase and proper translation approach in which queries are translated name recognition as well as word sense disambigua- by online translators, the light-weight interlingua tion (Pirkola et al., 2001). We optimize the lexical approach fared well. We achieved a remarkable coverage by limiting the lexicon to semantically rel- benefit for German document retrieval in terms of evant subwords of the medical domain. This also 95% reaching the English baseline. The results for helps us in dealing with morphological variation, Portuguese and Spanish are weaker (78% and 69%, including single-word decomposition. The latter respectively), but this can be attributed to the current is a very common phenomenon, especially in Ger- underspecification of both lexicons we assume. man medical terminology (Schulz and Hahn, 2000) and cannot be sufficiently treated by dictionary-free techniques (Savoy, 2002). This might explain the 7 Acknowledgements poor results for German in the SAPHIRE retrieval This work was funded by Deutsche Forschungs- system which uses the UMLS Metathesaurus for se- gemeinschaft (DFG), grant Klar 640/5-2, and the mantic indexing (Hersh and Donohoe, 1998). EU Network of Excellence Semantic Interoperabil- The UMLS, together with WORDNET, is also ity and Data Mining in Biomedicine (NoE 507505). the lexical basis of the approach pursued by the 8 Please, check the MORPHOSAURUS system at http:// MUCHMORE project (Volk et al., 2002). Here, con- www.morphosaurus.net, our implemetation of the cept mapping occurs after various steps of linguistic interlingua-based MSI approach. References Annual Review of Information Science and Tech- nology (ARIST), Vol. 33: 1998, pages 223–256. D. Eichmann, M. Ruiz, and P. Srinivasan. 1998. Cross-language information retrieval with the M. Pacak, L. Norton, and G. Dunham. 1980. Mor- phosemantic analysis of -itis forms in medical UMLS Metathesaurus. In SIGIR’98 – Proceed- ings of the 21st Annual International ACM SIGIR language. Methods of Information in Medicine, Conference, pages 72–80. Melbourne, Australia, 19(2):99–105. August 24-28, 1998. A. Pirkola, T. Hedlund, H. Keskustalo, and K. J¨arvelin. 2001. Dictionary-based cross-lan- F. Gey, N. Kando, and C. Peters. 2002. Cross- guage information retrieval: Problems, methods, language information retrieval: A research and research findings. Information Retrieval, roadmap. SIGIR Forum, 36(1):72–80. 4(3/4):209–230. J. Gonzalo, F. Verdejo, and I. Chugur. 1999. Using M. Porter. 1980. An algorithm for suffix stripping. EUROWORDNET in a concept-based approach to Program, 14(3):130–137. cross-language text retrieval. Applied Artificial G. Rosemblat, D. Gemoets, A. Browne, and Intelligence, 13(7):647–678. T. Tse. 2003. Machine translation-supported G. Grefenstette, editor. 1998. Cross-Language In- cross-language information retrieval for a con- formation Retrieval. Kluwer. sumer health resource. In AMIA’03 – Proceed- W. Hersh and L. Donohoe. 1998. SAPHIRE In- ings of the 2003 Annual Symposium., pages 564– ternational: A tool for cross-language informa- 568. Washington, D.C., November 8-12, 2003. tion retrieval. In AMIA’98 – Proceedings of the M. Ruiz, A. Diekema, and P. Sheridan. 1999. CIN- 1998 Annual Fall Symposium., pages 673–677. DOR conceptual interlingua document retrieval: Orlando, FL, November 7-11, 1998. TREC-8 evaluation. In Proceedings of the 8th W. Hersh, C. Buckley, T. Leone, and D. Hickam. Text REtrieval Conference (TREC-8), pages 597– 1994. OHSUMED: An interactive retrieval eval- 606. Gaithersburg, MD, November 17-19, 1999. uation and new large test collection for research. J. Savoy. 2002. Recherche d’information dans In SIGIR’94 – Proceedings of the 17th Annual In- des corpus plurilingues. Ingenierie´ des Systemes` ternational ACM SIGIR Conference, pages 192– d’Information, 7(1/2):63–92. 201. Dublin, Ireland, 3-6 July 1994. J. Savoy. 2003. Report of CLEF-2003 multilin- K. Mark´o, U. Hahn, S. Schulz, P. Daumke, and gual tracks. In Working Notes for the 2003 CLEF P. Nohama. 2004. Interlingual indexing across Workshop. Trondheim, Norway, 21-22 August. different languages. In RIAO 2004 – Conference S. Schulz and U. Hahn. 2000. Morpheme-based, Proceedings, pages 82–99. Avignon, France, 26- cross-lingual indexing for medical document re- 28 April 2004. trieval. International Journal of Medical Infor- K. Mark´o,S. Schulz, and U. Hahn. 2005a. Unsu- matics, 59(3):87–99. pervised multilingual word sense disambiguation S. Schulz, K. Mark´o,E. Sbrissia, P. Nohama, and via an interlingua. In AAAI’05 – Proceedings of U. Hahn. 2004. Cognate mapping: A heuristic the 20th National Conference on Artificial Intel- strategy for the semi-supervised acquisition of a ligence, pages 1075–1080. Pittsburgh, PA, USA, Spanish lexicon from a Portuguese seed lexicon. July 9-13, 2005. In COLING 2004 – 20th International Confer- K. Mark´o,S. Schulz, A. Medelyan, and U. Hahn. ence on Computational , pages 813– 2005b. Bootstrapping dictionaries for cross- 819. Geneva, Switzerland, August 23-27, 2004. language information retrieval. In SIGIR 2005 S. Tellex, B. Katz, J. Lin, A. Fernandes, and G. Mar- – Proceedings of the 28th Annual International ton. 2003. Quantitative evaluation of passage re- ACM SIGIR Conference. Salvador, Brazil, Au- trieval algorithms for question answering. In SI- gust 15-19, 2005. GIR 2003 – Proceedings of the 26th Annual Inter- J. McCarley. 1999. Should we translate the docu- national ACM SIGIR Conference, pages 41–47. ments or the queries in cross-language informa- Toronto, Canada, July 28 - August 1, 2003. tion retrieval? In Proceedings of the 37th An- UMLS. 2004. Unified Medical Language System. nual Meeting of the ACL, pages 208–214. College Bethesda, MD: National Library of Medicine. Park, MD, USA, 20-26 June 1999. M. Volk, B. Ripplinger, S. Vintar, P. Buitelaar, MeSH. 2004. Medical Subject Headings. Beth- D. Raileanu, and B. Sacaleanu. 2002. Seman- esda, MD: National Library of Medicine. tic annotation for concept-based cross-language D. Oard and A. Diekema. 1998. Cross-language medical information retrieval. International information retrieval. In M. E. Williams, editor, Journal of Medical Informatics, 67(1/3):79–112.