Subword Clusters As Light-Weight Interlingua for Multilingual Document Retrieval
Total Page:16
File Type:pdf, Size:1020Kb
Subword Clusters as Light-Weight Interlingua for Multilingual Document Retrieval Udo HAHN Kornel´ MARKO´ Stefan SCHULZ Jena University Freiburg University Hospital Language & Information Engineering Lab Medical Informatics Department F¨urstengraben30 Stefan-Meier-Str. 26 D-07743 Jena, Germany D-79104 Freiburg, Germany [email protected] [email protected] Abstract creasingly getting access to these resources, with the Web adding an even more diversified crowd of We introduce a light-weight interlingua for a cross- searchers. Hence, mappings between different lin- language document retrieval system in the medical guistic registers are inevitable to serve the needs of domain. It is composed of equivalence classes of such a heterogeneous search community. Therefore, semantically primitive, language-specific subwords automatically performed intra- and interlingual lex- which are clustered by interlingual and intralingual ical mappings and transformations of equivalent ex- synonymy. Each subword cluster represents a basic pressions become an obvious necessity to support conceptual entity of the language-independent inter- these different user groups in an adequate manner. lingua. Documents, as well as queries, are mapped We here propose an approach which is intended to this interlingua level on which retrieval opera- to meet these particular challenges. At its core tions are performed. Evaluation experiments reveal lies a new type of interlingua the basic entities of that this interlingua-based retrieval model outper- which are composed of semantically minimal sub- forms a direct translation approach. words. From a linguistic perspective, subwords are often closer to formal Porter-style stems (Porter, 1 Introduction 1980) rather than to lexicologically orthodox ba- Medical document retrieval presents a unique com- sic forms, e.g., of verbs or nouns or linguistically bination of challenges for the design and imple- plausible stems. Hence, their merits have to be mentation of retrieval engines. Clinical document shown in (retrieval) experiments. These language- collections have increasingly become available in specific subwords form semantically defined equiv- electronic form (e.g., as Electronic Patient Records alence classes which capture intralingual as well as (EPRs)) and their size is rapidly growing, with es- interlingual (near) synonymy between all subwords timates ranging, for a single clinical site, on the in a single cluster. Thus, they abstract away from order of millions of documents in total, including subtle particularities within and between languages. hundreds to thousands new documents being added We do not claim to cover the lexicon of general lan- every day. Hence, there is a growing demand for guage but rather restrict ourselves to the terminol- automatic support for content-oriented access and ogy used in the medical domain. browsing of electronic files and EPRs. In Section 2, we elaborate on the lexicological Furthermore, medical document collections are foundation of this interlingua, i.e., the format of inherently multi-lingual. While clinical texts are subwords and their synonymy relations, and their usually written in the native language of the coun- role in the process of morphosemantic normaliza- try, searches in major bibliographic databases (e.g., tion. The usefulness of subwords will be shown in MEDLINE) require a substantial proficiency of retrieval experiments (Section 3), in which we con- (expert-level) English medical terminology. Non- trast our interlingua-based retrieval approach to one native speakers of English, however, often lack this which relies on direct translation only (Section 4). particular competence. Hence, some sort of bridg- ing between synonymous or, at least, related terms 2 Light-Weight Interlingua from different languages has to be provided to make We here introduce the notion of subwords (Section full use of the information these databases hold. 2.1), their organization in terms of an interlingua Finally, the user population of medical document (Section 2.2), some principles underlying the cre- retrieval systems and their search strategies are re- ation and maintenance of the lexicon as well as the ally diverse. Not only physicians, but also nurses, interlingua resource (Section 2.3), and the basic pro- medical insurance companies and patients are in- cedure for morphosemantic analysis (Section 2.4). 2.1 Subwords • Two types of meta relations can be asserted From a linguistic perspective, the proper choice of between synonymy classes: the granularity of the basic lexical units is usually (i) a paradigmatic relation has-meaning, guided by syntactic considerations, i.e., the syntax which relates one ambiguous class to its of words (e.g., inflection or derivation) or the syntax specific readings, as with: of sentences (e.g., in terms of subcategorization or fheadg ) fkopf,zephal,caput,cephal,cabec,cefalg valency frames). For the proper choice of subwords, OR fboss,leader,lider,chefeg. however, semantic considerations are key. Espe- (ii) a syntagmatic relation expands-to, which cially in scientific and technical sublanguages, we consists of predefined segmentations in case observe that semantically non-decomposable enti- of utterly short subwords, such as: ties and domain-specific suffixes (e.g., ‘-itis’ (Pacak fmyalgg ) fmuscle,muskel,musculg ⊕ fpain, et al., 1980)) are chained in complex word forms schmerz,dorg. such as in ‘pseudo⊕hypo⊕para⊕thyroid⊕ism’, Compared with relationally richer, e.g., WORD- ‘pancreat⊕itis’ or ‘gluco⊕corticoid⊕s’.1 We refer NET based, interlinguas used for cross-language in- to these self-contained, semantically minimal units formation retrieval (Gonzalo et al., 1999; Ruiz et as subwords and motivate their status primarily by al., 1999), we hence incorporate a much more lim- their usefulness for document retrieval rather than ited set of semantic relations and pursue a more by linguistic arguments. restrictive approach to synonymy. We also refrain The minimality criterion is often weaker than, from introducing additional hierarchical relations e.g., for morphemes, though it is hard to define in between MIDs because such links can be acquired a general way. For example, given the text token from domain-specific vocabularies, e.g., the Medi- ‘diaphysis’, a linguistically plausible morpheme- cal Subject Headings (MeSH, 2004) (cf. experimen- style segmentation might lead to ‘dia⊕phys⊕is’. tal evidence from Mark´oet al. (2004)). From a medical perspective, however, a segmen- tation into ‘diaphys⊕is’ seems much more reason- 2.3 Engineering the Lexicon and Interlingua able because the canonical linguistic decomposi- In the development workflow, the effects of changes tion is far too fine-grained and likely to create too of subword size and granularity are immediately fed many subword ambiguities (which would be harm- back to the developers using word lists to test and ful to precision). Comparable ‘low-level’ segmen- validate both the segmentation and the assignment tations of semantically unrelated tokens such as of MIDs. A collection of parallel texts (abstracts of ‘dia⊕lyt⊕ic’, ‘phys⊕iol⊕ogy’ lead to morpheme- medical publications in English plus either German, style subwords ‘dia’ and ‘phys’, which unwarrant- Spanish or Portuguese) are used to detect errors in edly match ‘dia⊕phys⊕is’, too. The (semantic) the assignment of MIDs. To impose common poli- self-containedness of the chosen subword is also of- cies on the lexicon builders, we developed a main- ten supported by the existence of a synonym, e.g., tenance manual which contains 31 rules. The most for ‘diaphys’ we have ‘shaft’. critical tasks they cover are listed below: 2.2 From Subwords to Interlingua • The proper delimitation of subwords (e.g., Subwords are assembled in a lexical repository, with ‘compat⊕ibility’ vs. ‘compatib⊕ility’); the following considerations in mind: • The decision whether an affix introduces a new meaning which would justify a new entry (e.g., • Subwords are listed, together with their at- ‘neur⊕osis’ vs. ‘neuros⊕is’); tributes such as language (English, German, Portuguese, Spanish) or subword type (stem, • Data-driven decisions, such as to add ‘-otomy’ prefix, suffix, invariant). Each subword is as a synonym of ‘-tomy’ in order to block erro- assigned one or more morpho-semantic class neous segmentations such as ‘nephrotomy’ into identifier(s), we call MID(s), representing the ‘nephr⊕oto⊕my’; corresponding synonymy equivalence class. • The decision to exclude short stems from seg- mentation (such as ‘my-’, ‘ov-’) in order to • Intralingual synonyms and interlingual trans- block false segmentations; lation synonyms of subwords are assigned the same equivalence class (judged within the con- • The decision to locate the appropriate level of text of medicine only). semantic abstraction when equivalence classes are formed, e.g., by grouping f‘hyper-’, ‘high’, 1`⊕' denotes the concatenation operator. ‘elevate’g into the same class; High TSH values suggest the high tsh values suggest the Orthographic • The decision which function words and af- diagnosis of primary hypo- diagnosis of primary hypo- Normalization fixes are excluded from indexing, such as thyroidism ... thyroidism ... ‘and’, ‘-ation’, ‘-able’, and those which are not Erhöhte TSH-Werte erlauben die Orthographic erhoehte tsh-werte erlauben die Diagnose einer primären Hypo- Rules diagnose einer primaeren hypo- ‘dys-’, ‘anti-’, ‘-itis’. thyreose ... thyreose ... Original Morphosyntactic Parser Subword Lexicon In the