
Adding a Medical Lexicon to an English Parser

Peter Szolovits, PhD
MIT Laboratory for Computer Science, Cambridge, MA

ABSTRACT

We present a heuristic method to map lexical (syntactic) information from one lexicon to another, and apply the technique to augment the lexicon of the Link Grammar Parser with an enormous medical vocabulary drawn from the Specialist lexicon developed by the National Library of Medicine. This paper presents and justifies the mapping method and addresses technical problems that have to be overcome. It illustrates the utility of the method with respect to a large corpus of emergency department notes.

INTRODUCTION

Computer-readable clinical data are today largely stored as unstructured text.1 Although a few projects report impressive progress toward turning such text into structured data [2,5], much remains to be done to capture a comprehensive representation of the content of medical texts by computer.

Because many other research and application communities face the same problem, we have an opportunity to share tools with others to help parse, formalize, extract and organize meaning from text. A practical impediment to such sharing, however, is that most language processing tools built by those outside medical informatics have no knowledge of the medical lexicon, and therefore treat a large fraction of medical words as if they bore no syntactic information. This deficiency makes such tools practically inapplicable to medical texts. For example, in trying to parse unstructured text notes recorded in the Emergency Department of a large urban tertiary care pediatric hospital (ED), the Link Grammar Parser (LP) [3,6] was able to recognize only 5,168 of the 13,689 distinct words2 (38%) in a half-million (494,762) word corpus. This is despite the fact that LP has one of the larger lexicons distributed with such tools, holding syntactic descriptions of 49,121 words (including about 1,000 short phrases). Except as noted otherwise, we treat phrases exactly the same as words.

One of the richest available sources of medical lexical information today is the UMLS's Specialist Lexicon [4]. The 2001 version, which we use here, contains lexical information on 235,197 words (including 75,121 short phrases), and is a more complete lexicon than others for purposes of analyzing medical text. For the ED corpus, for example, 7,328 of the distinct words (54%) appear directly in the Specialist Lexicon. Of these words, 2,284 do not appear in the LP lexicon, yet account for 17% of the distinct words in the corpus.3

Our goal in this paper is to describe a method for expanding the usable lexicon of LP (the target) by adding lexical definitions of words present in the Specialist Lexicon (source) but not in LP. Because the nature of lexical descriptions varies greatly among different lexicons, one cannot simply copy definitions from one to another. Methods similar to those presented here should be applicable to other pairs of language processing systems that offer significantly different coverage of vocabularies. The next section describes the lexical information available for categorizing words in both the Link Parser and Specialist lexicons, and the method by which we map Specialist words to Link Parser words. Later, we describe the results of this mapping and show a dramatic increase in the lexicon of the LP. Finally, we discuss ways to address inconsistency between lexicons, capitalization, still missing words, and the treatment of words with unique lexical descriptions.
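The coverage figures quoted above (e.g., LP recognizing 5,168 of the 13,689 distinct ED words) reduce to a simple set computation over distinct words. A minimal sketch, assuming an illustrative tokenizer and toy data rather than the actual corpus or lexicon files:

```python
import re
from typing import Set

def distinct_words(text: str) -> Set[str]:
    # Case-folded alphabetic tokens; this mirrors footnote 2's choice of
    # treating different capitalizations as instances of the same word.
    return {tok.lower() for tok in re.findall(r"[A-Za-z][A-Za-z'-]*", text)}

def coverage(corpus_words: Set[str], lexicon: Set[str]) -> float:
    # Fraction of distinct corpus words that the lexicon knows.
    return len(corpus_words & lexicon) / len(corpus_words) if corpus_words else 0.0

corpus = distinct_words("Pt c/o rhinorrhea and cough. Rhinorrhea improved.")
lexicon = {"rhinorrhea", "and", "cough", "improved"}
print(f"{coverage(corpus, lexicon):.0%}")  # 4 of the 7 distinct words are known
```

The tokenizer here is an assumption for illustration; any tokenization that yields distinct word types would support the same computation.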

1 Advanced clinical information systems store at least some of the following types of data in coded and structured form: laboratory data, recorded signals and images, pharmacy orders, and billing-related codes. If other clinical data are captured at all, items such as medical history, physical examination findings, progress notes, discharge summaries, radiology and pathology notes, etc., are nevertheless stored simply as text.

2 This assumes that different capitalizations of a word are nevertheless instances of the same word. If we distinguish different capitalizations of a word, there are 17,372 distinct words in the corpus. We discuss issues of capitalization below.

3 It may be surprising that even a lexicon developed for medical use is missing 46% of the distinct words in this corpus. We examine this in the Discussion section.

METHODS

Every lexicon uses some language of lexical descriptors to specify lexical information about each word or word sense. Although a language processing system's knowledge about a particular word may be quite extensive, often the bulk of that knowledge is about the meaning of the word rather than the lexical constraints on it. Therefore, lexical descriptions tend to be relatively small, and many words, though they may differ greatly in meaning, will have the same lexical description. Central to our method of mapping lexical structure is the notion of indiscernibility: two words are indiscernible in a specific lexicon just in case that lexicon assigns the same lexical descriptors to both words.

Both Specialist and LP usually contain separate entries in the lexicon for a word that has senses corresponding to different parts of speech. For example, "running" is marked in Specialist as either a present participle or as a third-person count or uncount noun; LP's analysis makes it either a verb or a gerund. Consequently, we treat each such word/part-of-speech pair as a word sense, and apply our heuristic mapping method to such word senses.
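Indiscernibility as defined above lends itself to a direct computation: group word senses by their complete descriptor sets. A minimal sketch, with invented descriptor strings standing in for actual Specialist codes:

```python
# Sketch of computing indiscernibility classes: two word senses are
# indiscernible iff the lexicon assigns them exactly the same set of
# lexical descriptors. Descriptor strings below are invented examples,
# not actual Specialist codes.
from collections import defaultdict
from typing import Dict, FrozenSet, List

def indiscernible_sets(
    descriptors: Dict[str, FrozenSet[str]]
) -> Dict[FrozenSet[str], List[str]]:
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for sense, descs in descriptors.items():
        groups[descs].append(sense)  # identical descriptor sets share a key
    return groups

senses = {
    "execute/verb": frozenset({"infinitive", "tran=np"}),
    "cholecystectomize/verb": frozenset({"infinitive", "tran=np"}),
    "running/noun": frozenset({"count", "uncount"}),
}
groups = indiscernible_sets(senses)
# execute/verb and cholecystectomize/verb fall into the same class
print(sorted(groups[frozenset({"infinitive", "tran=np"})]))
```

Using a frozenset as the grouping key makes "same set of lexical descriptors" literal: any two senses with equal descriptor sets land in the same class regardless of the order in which descriptors were listed.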
Heuristic mapping. The basic idea of the mapping is simple: assume that w is a word of the source lexicon for which no lexical information is known in the target lexicon. If there is a word x in the source lexicon that is indiscernible from w, and if x has a lexical definition in the target lexicon, then it is at least plausible to assign to w the same lexical definition in the target lexicon that x has.

Note that neither lexicon contains distinct entries for traditional word senses, such as the distinction between a bank (financial institution) and a (river) bank, perhaps because these are distinctions in meaning, not lexical structure. Specialist has a total of 246,014 word senses, and LP has 58,926.

This simple picture is complicated by a handful of difficulties: lexical ambiguity and inconsistency between different lexicons, different lexical treatment of capitalized and lower-case words, words whose lexical descriptions are unique, and words that do not appear at all in either lexicon. To make mapping practical, we have had to overcome, at least in part, each of these. The following sections describe in detail the lexical descriptors available for words in both the Specialist and LP lexicons, and then address each of the above-listed difficulties in turn.

Within the Specialist lexicon, we consider two word senses to be indiscernible if and only if they have exactly the same set of lexical descriptions according to the knowledge sources enumerated below. For example, though the words "execute" and "cholecystectomize" may share little in meaning, they have exactly the same lexical markings in Specialist. Both are verbs, marked as "infinitive" and "pres(fst_sing, fst_plur, thr_plur, second)", and both admit the complement structures "tran=np" or "ditran=np, pphr(for, np)".

The Specialist lexicon contains 3,357 distinct sets of indiscernible entries among its 246,014 total word senses.
Fully 2,574 of the 3,357 indiscernible sets in Specialist are singletons, meaning that only one entry is contained in the set, because the lexical markings of that entry are unique. (See the discussion below.) By contrast, the largest indiscernible set contains 64,149 entries, about ¼ of the entire lexicon. Figure 1 shows the distribution of bin sizes after excluding singletons. Thus, there are 417 indiscernible sets of sizes 2 or 3, 200 of size 4-10, 85 of size 11-32, ... and 2 of size > 32,000.

Fig. 1. Histogram of the number of indiscernible lexical sets in Specialist plotted against the size of the set. Abscissa is log10 of the maximum set size to fall into that bin. Successive bins correspond to units of 0.5 in log10 of the bin size.

Structure of the Specialist lexicon. The following lexical knowledge relevant to our approach is encoded in Specialist. We extract the information from the listed relational tables.
1. Part of speech. (lragr)
2. Agreement/inflection code: first, second and third person; singular and plural; tense and negation (for verbs, modals and auxiliaries); count/uncount for nouns and det's. (lragr)
3. Complements: a complex system for describing the types of complements taken by verbs, nouns and adjectives, including linguistic types of the various complementation patterns, prepositions, etc. (lrcmp)
4. Position and modification types for adjectives and adverbs. (lrmod)
5. Other features. (lrmod for adjectives and adverbs)

We ignore lexical markers that relate only to "closed" classes of words such as pronouns, because we expect that such words will already be defined and lexically described in the target lexicon. We also ignore inflectional markers because both source and target lexicons already include inflected word forms. For example, both "run" and "running" appear as distinct entries in both lexicons.

Structure of the LP lexicon. The LP lexicon associates with each word sense a complex Boolean formula of features. These are drawn from a set of 104 basic features, augmented by additional markers that specify whether a matching feature is expected on a word to the left or right in the sentence, whether a match on this feature is optional, required, or possible but penalized, and many additional sub-specializations of the basic features to enforce further lexical constraints. Indiscernibility of two lexical descriptions in LP should properly be based on computing equivalence classes among these formulae. This computation is rather complex, however, because fragments of formulae that conjoin constraints pointing in the same direction (i.e., to the left or right) do not commute.

Figure 2. Our heuristic maps w to the definition of a word sense in V, say, d(v2). Each x indiscernible from w potentially identifies a set of indiscernible word senses in the target lexicon, I(f(x)). We choose the one that has the largest number of word senses in common with Xw.
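The binning behind Fig. 1 (successive bins of width 0.5 in log10 of the set size, singletons excluded) can be reproduced with a short sketch; the set sizes below are illustrative, not the real distribution:

```python
# Sketch of the Fig. 1 binning: a non-singleton indiscernible set of size n
# falls into the bin whose upper edge is the next multiple of 0.5 in
# log10(n). The set sizes here are invented for illustration.
import math
from collections import Counter

def bin_index(set_size: int) -> float:
    # Upper edge of the bin, in units of 0.5 in log10 of the set size.
    return math.ceil(math.log10(set_size) / 0.5) * 0.5

sizes = [2, 3, 4, 10, 11, 32, 64149]
histogram = Counter(bin_index(n) for n in sizes if n > 1)  # exclude singletons
print(sorted(histogram.items()))
```

Under this rule sizes 2-3 share the bin with upper edge 0.5 (10^0.5 ≈ 3.2), sizes 4-10 the bin with upper edge 1.0, and so on, matching the ranges quoted in the text up to rounding at the bin edges.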
Fortunately, almost all sets of words that share the same formula are defined together in the LP lexicon; therefore, we can assume that two words are lexically indiscernible just in case they are defined by the exact same formula.

The LP lexicon contains 1,324 sets of indiscernible word senses. Of these, 767 are singletons, and the largest set contains 11,107 word senses.

Mapping Formalism. Let W be the set of word senses (word × part-of-speech) in the source lexicon and V be the set of word senses in the target lexicon. For each w ∈ W, let Xw = {x | x is indiscernible from w in the source lexicon}. Define f(v) to be the lexical formula for v in the target lexicon, if v is defined there, or ⊥ otherwise. Further, let Dw = {f(x) | x ∈ Xw and f(x) ≠ ⊥}; this is the set of definitions in the target lexicon that belong to word senses indiscernible from w in the source lexicon. Clearly, we will want to associate one of the definitions in Dw with w via our heuristic mapping, but if there are several, which one? Define I(d) = {v | f(v) = d}; this is the set of indiscernible word senses in the target lexicon that share the lexical description d. For each d ∈ Dw, we compute the number of word senses in common4 between I(d) and Xw, and choose the definition that yields the largest overlap: m(w) = argmax over d ∈ Dw of ||Xw ∩ I(d)|| is then the lexical descriptor of w in the target lexicon. Ties go to the largest ||I(d)||. Figure 2 shows a schematic version of the mapping algorithm.

For example, consider the word "cholecystectomize", a verb defined as described above in Specialist, but unknown to LP. In Specialist, it is indiscernible from 20 other words, including "adduce", "admire", "appropriate", "execute", ... The verb senses of these words, when looked up in LP as verbs, yield four distinct formulae for Dw. For each, we compute the set of LP-indiscernible words whose verb senses go with that formula. Only one of these word sets (the largest) has an intersection greater than 1 with the words grouped by Specialist with "cholecystectomize"; hence the LP formula for that set is the appropriate definition of "cholecystectomize" in LP.

4 One of these sets is in the source lexicon, whereas the other is in the target. Two word senses are "in common" when they are both senses of the same word and when the part-of-speech descriptions from both lexicons are consistent with each other. For example, LP contains eight markers that are all consistent with Specialist's "noun."

RESULTS

The mapping process is quite productive. From the 246,014 word senses in the Specialist lexicon, it constructed 200,264 new entries (word senses) for the LP lexicon. This approximately quintuples the size of that lexicon, and certainly adds a huge number of lexical definitions for medical terms and phrases. Of the new word senses, 73,780 are actually phrases (containing a space, hyphen or comma), and many of these phrases bear no specific lexical information in Specialist that is not obvious from their component words. For example, "chronic relapsing pancreatitides" is simply marked as a third person plural count noun, which is also true of "pancreatitides". For now, we have chosen to retain these phrases in the augmented lexicon, but may choose to drop them in the future.

How good are the mapped definitions of the remaining 126,484 new word senses introduced into the target lexicon? To answer this question by an exhaustive study of each new definition would be nearly as difficult as creating that many new definitions for LP from scratch. Nevertheless, we can argue, both from the logic of the mapping process and from an examination of its results, that, at least in the large, the mapping seems to make reasonable definitions.
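As a concrete restatement of the Mapping Formalism, the rule m(w) = argmax over d ∈ Dw of ||Xw ∩ I(d)||, with ties going to the larger ||I(d)||, can be sketched directly. The toy lexicons below are invented, and the part-of-speech consistency check of footnote 4 is omitted for brevity:

```python
# Sketch of the mapping heuristic: among the target-lexicon definitions d
# carried by words indiscernible from w, choose the one whose class I(d)
# overlaps Xw the most; break ties by the larger ||I(d)||.
# Toy data below are invented, not real Specialist or LP entries.
from typing import Dict, Optional, Set

def map_definition(
    w: str,
    x_w: Set[str],        # Xw: words indiscernible from w in the source
    f: Dict[str, str],    # target lexicon: word -> lexical formula
) -> Optional[str]:
    d_w = {f[x] for x in x_w if x in f}  # Dw: candidate definitions
    if not d_w:
        return None       # no indiscernible word is defined in the target
    inv: Dict[str, Set[str]] = {}
    for v, d in f.items():               # I(d) = {v | f(v) = d}
        inv.setdefault(d, set()).add(v)
    # argmax over overlap with Xw; ties go to the largest ||I(d)||
    return max(d_w, key=lambda d: (len(x_w & inv[d]), len(inv[d])))

target = {"adduce": "V1", "admire": "V1", "execute": "V1", "appropriate": "A7"}
x_w = {"adduce", "admire", "execute", "appropriate"}
print(map_definition("cholecystectomize", x_w, target))  # → V1
```

Here "cholecystectomize" receives formula V1 because three of its Specialist-indiscernible companions carry it in the target, whereas only one carries A7, mirroring the worked example in the text.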
All of the new word sense definitions fall into only 59 indiscernible sets in LP. The five largest of these sets account for 181,653 of the 200,264 new word sense definitions (or 109,406 of the 126,484 new definitions if we exclude phrases). These include (1) third person singular uncount nouns ("dissuasion, genaconazole, vantocil"), (2) third person plural count nouns ("pinealomas, tracheotomies, perfects"; the latter seems to be an error, but is so defined in Specialist), (3) positive stative adjectives that are either predicative or one of the forms of attributive known to Specialist ("vasomotory, atherogenic, seminiferous"), (4) third person singular count or uncount nouns ("antilipopolysaccharide, chrysiasis, choline"), and (5) third person singular count nouns ("diplococcus, lumbricus, milliunit"). Five other sets each define between one and three thousand related words, e.g., past tense or past participle verbs ("nebulized, imprecated, autosensitized"). Nearly thirty sets define fewer than ten word senses, some as few as one. For a few of these, the mapping heuristic seems to choose poor definitions, which have been manually fixed.

The largest twenty mapped sets, which account for nearly all the newly defined word senses in LP, belong in fact to the most populous grammatical categories known to either Specialist or LP. They contain just the wealth of medical terms that we had hoped to introduce to LP's lexicon. The fact that all these words fall into only a handful of indiscernible sets suggests that many subtle syntactic distinctions for these new words may be missing (and may, indeed, have been missing even in the Specialist lexicon). Nevertheless, the most critical information about part of speech, gender and number agreement, and perhaps a few other syntactic features has been successfully transferred.

FURTHER DISCUSSION

Lexical Inconsistency. We have noted instances where LP makes finer distinctions among lexical word senses than Specialist does. Another source of potential inconsistency between these two lexicons is the following: in LP, if a word has two lexical senses, writers of the LP lexicon have a choice of whether to include the word twice in the lexicon, with separate formulae defining the two senses, or to include it only once, with a formula that is the disjunction of the formulae for the two senses. When the latter is done, it is usually impossible to tell the intended part of speech of the word, which causes some difficulties for our mapping algorithm.

Capitalization. Surprisingly many words (1,482) in the Specialist lexicon occur with multiple possible capitalizations. For about half of these words (611), the lexical features of the different capitalizations are actually the same. For example, both "aborigine" and "Aborigine" are marked with exactly the same lexical information. Any language processing program must handle capitalized versions of normally lower-cased words because of the convention in many Western languages to capitalize the first word of each sentence.

In the rest of the cases, different lexical information goes with different capitalizations. "DIP" is said to be a positive adjective, whereas "dip" can be a noun in third person singular or plural, count or uncount, or a verb that is either infinitive or present tense, first or second person singular or plural, or third person plural. These words, spelled the same, appear to have no connection to each other. By contrast, "East" differs from "east" only in that the noun meaning of the word is marked as proper for the capitalized case. "Digitalis" is either a third person singular uncount noun or a third person plural uncount noun, whereas "digitalis" is said to be only the first of these, perhaps in error. In mapping words from one lexicon to another, we currently treat capitalized, upper-case, and lower-case versions of the same word as unrelated. Obviously, this could be improved, at least for common cases.

Unique Lexical Descriptions. In discussing the Specialist lexicon, we observed that 2,574 of the 3,357 sets of indiscernible words were singletons. Although this represents only a small fraction (<1.5%) of the total number of entries in the Specialist lexicon, our mapping technique cannot work for these words. Fortunately, many of these are common English words that earn their unique lexical descriptor because of the many idiosyncratic ways in which they may be used. The verb sense of the word "take", for example, may be used in forms such as "take ill", "take stock", "take umbrage", "take to something", "take into consideration", "take back", "take after", and many other forms. It is not surprising that no other word will have exactly the same complement structure. "Take" and 2,106 other words whose indiscernible sets are singletons are already defined in the base vocabulary of LP. Of the remaining 468 words, 211 are actually short phrases containing words separated by spaces or hyphens, such as "cross-react" or "masking tape". These might be properly handled, at least at the lexical level, simply as a result of the lexical features of their component words.

For the remaining 257 words in this category, it seems mostly that the Specialist lexicon has over-specialized their lexical descriptions. For example, "periarthritis" is marked to take the complement "of shoulder", which is certainly a plausible complement but by no means necessary or even common. These words could be mapped successfully by eliminating their complementation features altogether, or at least by reducing their number until the word falls into a non-singleton discernibility set. We have not yet implemented this likely improvement.

Unrecognized Words. We noted that a large fraction of words in the ED corpus do not appear in the Specialist lexicon. Examination of those missing words shows that the vast majority belong to the following categories:
1. Misspellings: e.g., rhinnhorea, rhinnorhea, rhinorhea, rhiorrhea, and rhnorrhea all appear as incorrect variants of rhinorrhea.
2. Proper names. These are mostly people's names, but also include some place and institution names and medical brand names. Some brand names appear to be in the lexicon, though most are not. Some, though not most, names of places and people also occur in the lexicon, especially if they are associated with specific diseases.
3. Numbers concatenated to units of measure or other words: e.g., 10mmHg, 15kg, 145pm, 156HR, 15attacks, 24gauge. 2,016 of the 13,689 distinct words in the ED corpus fall into this category. That is almost 15%, or about one third of the ED words not found in the Specialist Lexicon.
4. Abbreviations not known to the lexicon: e.g., tib, tox, trach, ul, vag, ven, yo, x10, x10d, and x38d.
5. Prepended compounds: compound words consisting of a prefix such as "hyper", "non", "para", "un", etc., combined with known words. Lexical decomposition algorithms should find these, though they are not statically represented in Specialist.

Only a small fraction of these missing words should, arguably, be in the lexicon. A few examples are: "soupy", "snowbank", "popsicle", "Pedialyte". Using a spelling corrector, a lexical analyzer that identifies and suggests lexical descriptors for the prepended compounds, and a program to recognize and expand many more abbreviations could yield a practical solution to dealing with most of these unknown words. These challenges will arise when using any lexicon, and are really orthogonal to the focus of this paper.

CONCLUSION

We have demonstrated a heuristic technique for mapping syntactic information about words defined in one lexicon to a different lexicon. We have applied this method to add syntactic descriptions of two hundred thousand medical words and phrases to the Link Parser lexicon, thereby vastly increasing its knowledge of medical terminology. We have also shown that the combined new lexicon contains nearly all of the words appearing in a half-million word corpus of notes from an emergency department, with the exceptions of the categories described above. The translated entries from the Specialist Lexicon to LP are available for interested users to download at http://www.medg.lcs.mit.edu/projects/text/, along with some additional technical details of the mapping process that could not fit in a short paper. We are currently studying the use of the augmented LP grammar to extract meaning from the ED corpus. An earlier version has found application in a research project that uses conversational interaction with internet users to learn new facts about a domain [1]. The approach presented here should be applicable to transferring syntactic knowledge among other lexicons as well, and could, therefore, ease the burden of equipping general-purpose language processing systems with specialized vocabularies tailored to specific domains of discourse.

Acknowledgement. This research has been supported (in part) by contract 290-00-0020 from the Agency for Healthcare Research and Quality.

REFERENCES
1. Chklovski T. Using analogy to acquire commonsense knowledge from human contributors. AITR-2003-002, PhD Thesis, MIT AI Lab.
2. Friedman C. A broad-coverage natural language processing system. Proc AMIA 2000; 270-4.
3. Grinberg D, Lafferty J, Sleator D. A robust parsing algorithm for link grammars. Report CMU-CS-95-125, and Proc. Fourth Int. Workshop on Parsing Technologies, Prague, September 1995.
4. McCray AT, Aronson AR, Browne AC, et al. UMLS knowledge for biomedical language processing. Bull Med Libr Assoc 1993; 81(2):184-94.
5. Sager N, Lyman M, Bucknall C, Nhan N, Tick LJ. Natural language processing and the representation of clinical data. J Am Med Inform Assoc 1994; 1:142-60.
6. Sleator D, Temperley D. Parsing English with a Link Grammar. Third International Workshop on Parsing Technologies, August 1993.