2nd Workshop on Building and Evaluating Resources for Biomedical Text Mining

Tuesday, 18th May 2010

Valletta, Malta

Organisers:

Sophia Ananiadou, Kevin Cohen, Dina Demner-Fushman

Workshop Programme

9:15 – 9:30 Welcome

9:30 – 10:30 Invited Talk (chair: Sophia Ananiadou) Pierre Zweigenbaum, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI-CNRS), France

10:30 – 11:00 Coffee break

11:00 – 12:30 Session 1 (chair: Kevin Cohen)

11:00 Spelling Correction in Clinical Notes with Emphasis on First Suggestion Accuracy Jon Patrick, Mojtaba Sabbagh, Suvir Jain and Haifeng Zheng

11:25 Automatically Building a Repository to Support Evidence Based Practice Dina Demner-Fushman, Joanna Karpinski and George Thoma

11:50 An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature Harsha Gurulingappa, Roman Klinger, Martin Hofmann-Apitius and Juliane Fluck

12:15 A Task-Oriented Extension of Chinese MeSH Concepts Hierarchy Xinkai Wang and Sophia Ananiadou

12:40 – 14:10 Lunch break

14:10 – 15:10 Invited talk (chair: Sophia Ananiadou) Simonetta Montemagni, Istituto di Linguistica Computazionale (ILC-CNR), Italy

15:10 – 16:55 Session 2 (chair: Dina Demner-Fushman)

15:10 Structuring of Status Descriptions in Hospital Patient Records Svetla Boytcheva, Ivelina Nikolova, Elena Paskaleva, Galia Angelova, Dimitar Tcharaktchiev and Nadya Dimitrova

15:35 Annotation of All Coreference in Biomedical Text: Guideline Selection and Adaptation K. Bretonnel Cohen, Arrick Lanfranchi, William Corvey, William A. Baumgartner Jr., Christophe Roeder, Philip V. Ogren, Martha Palmer and Lawrence Hunter

16:00 – 16:30 Coffee break

16:30 Contribution of Syntactic Structures to Single Word Term Extraction Xing Zhang and Alex Chengyu Fang

16:55 – 17:10 Concluding remarks (chair: Sophia Ananiadou)


Workshop Organisers

• Sophia Ananiadou, NaCTeM, University of Manchester, UK

• Kevin Cohen, Center for Computational Pharmacology and The MITRE Corporation, USA

• Dina Demner-Fushman, National Library of Medicine, USA

Workshop Programme Committee

• Olivier Bodenreider, National Library of Medicine, USA

• Wendy Chapman, University of Pittsburgh, USA

• Aaron Cohen, Oregon Health and Science University, USA

• Hongfang Liu, Georgetown University Medical Center, USA

• Martin Krallinger, National Biotechnology Center, Spain

• John McNaught, NaCTeM, University of Manchester, UK

• John Pestian, Computational Medicine Center, University of Cincinnati, USA

• Andrey Rzhetsky, University of Chicago, USA

• Jian Su, Institute for Infocomm Research, Singapore

• Junichi Tsujii, University of Tokyo, Japan and NaCTeM, University of Manchester, UK

• Yoshimasa Tsuruoka, JAIST, Japan

• Karin Verspoor, Center for Computational Pharmacology, University of Colorado, USA

• Xinglong Wang, University of Manchester, UK

• Bonnie Webber, University of Edinburgh, UK

• John Wilbur, NCBI, NLM, NIH, USA

• Pierre Zweigenbaum, LIMSI-CNRS, France

Table of Contents

Foreword Sophia Ananiadou, Kevin Cohen and Dina Demner-Fushman ………………... 1

Spelling Correction in Clinical Notes with Emphasis on First Suggestion Accuracy Jon Patrick, Mojtaba Sabbagh, Suvir Jain and Haifeng Zheng ……………….. 2

Automatically Building a Repository to Support Evidence Based Practice Dina Demner-Fushman, Joanna Karpinski and George Thoma ………………. 9

An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature Harsha Gurulingappa, Roman Klinger, Martin Hofmann-Apitius and Juliane Fluck ………………………………………………………………. 15

A Task-Oriented Extension of the Chinese MeSH Concepts Hierarchy Xinkai Wang and Sophia Ananiadou …………………………………………… 23

Structuring of Status Descriptions in Hospital Patient Records Svetla Boytcheva, Ivelina Nikolova, Elena Paskaleva, Galia Angelova, Dimitar Tcharaktchiev and Nadya Dimitrova …………………………………. 31

Annotation of All Coreference in Biomedical Text: Guideline Selection and Adaptation K. Bretonnel Cohen, Arrick Lanfranchi, William Corvey, William A. Baumgartner Jr., Christophe Roeder, Philip V. Ogren, Martha Palmer and Lawrence Hunter …………………………………………. 37

Contribution of Syntactic Structures to Single Word Term Extraction Xing Zhang and Alex Chengyu Fang …………………………………………… 42

Author Index

Sophia Ananiadou, 23
Galia Angelova, 31
William A. Baumgartner Jr., 37
Svetla Boytcheva, 31
K. Bretonnel Cohen, 37
William Corvey, 37
Dina Demner-Fushman, 9
Nadya Dimitrova, 31
Alex Chengyu Fang, 42
Juliane Fluck, 15
Harsha Gurulingappa, 15
Martin Hofmann-Apitius, 15
Lawrence Hunter, 37
Suvir Jain, 2
Joanna Karpinski, 9
Roman Klinger, 15
Arrick Lanfranchi, 37
Ivelina Nikolova, 31
Philip V. Ogren, 37
Martha Palmer, 37
Elena Paskaleva, 31
Jon Patrick, 2
Christophe Roeder, 37
Mojtaba Sabbagh, 2
Dimitar Tcharaktchiev, 31
George Thoma, 9
Xinkai Wang, 23
Xing Zhang, 42
Haifeng Zheng, 2

FOREWORD

This volume contains the papers accepted at the 2nd Workshop on Building and Evaluating Resources for Biomedical Text Mining, held at LREC 2010 in Malta. Over the last decade, biomedical text mining has become one of the driving application areas for the NLP community, resulting in a series of very successful yearly specialist workshops at ACL since 2002 (BioNLP), as well as the launch of the BioMed special interest group in 2008.

In the past, most of the work has focused on solving specific problems, often using task-tailored and private data sets. This data was rarely reused, particularly outside the efforts of the providers. This has changed in recent years, as many research groups have made available resources that were built either purposely or as by-products of research or evaluation efforts. A number of projects, initiatives and organisations have been dedicated to building and providing biomedical text mining resources (e.g., the GENIA suite of corpora, PennBioIE, the TREC Genomics track, BioCreative, Yapex, LLL05, BOOTStrep, JNLPBA, KDD data, Medstract, BioText, etc.). There is an increasing need for community-wide discussions on the design, availability and interoperability of resources for bio-text mining, following on from specific applications (such as gene name identification and protein-protein interactions in BioCreative I/II, the BioNLP'09 shared task on event extraction, and recognition of clinically relevant entities and relations in the i2b2 challenges) to the use of common resources and tools in real-life applications.

The papers in this volume reflect the current transitional state of the biomedical text mining field and range from named entity recognition in MEDLINE abstracts (Gurulingappa et al., Zhang and Fang) to the reuse of annotation tools and guidelines to fully annotate 97 journal articles (Cohen et al.). The corpus of 97 fully annotated articles will be made publicly available upon completion of syntactic, semantic, and discourse structure annotation, with the emphasis on coreference annotation.

The biomedical text mining field is expanding in two directions: enrichment and use of cross-lingual resources and work in resource-poor languages on the one hand, and significant inroads into processing clinical narrative on the other. The enrichment and evaluation of cross-lingual resources are exemplified in the work on extension of the Chinese Medical Subject Headings vocabulary (Wang and Ananiadou). Spanning a resource-poor language and the clinical domain, Boytcheva et al. focus on the mining of Bulgarian clinical notes for descriptions of patients' status and complications. Patrick et al. focus on spelling errors in clinical notes and present a detailed description of a spelling suggestion generator. Clinical notes are linked to MEDLINE citations in the collection under construction at NIH, as described by Demner-Fushman et al.

We wish to thank the authors for submitting papers for consideration, and the members of the programme committee for their time and effort during the review process. We would also like to thank our invited speakers, Simonetta Montemagni and Pierre Zweigenbaum, for their contributions.

Sophia Ananiadou, Kevin Cohen and Dina Demner-Fushman

Spelling Correction in Clinical Notes with Emphasis on First Suggestion Accuracy

Jon Patrick, Mojtaba Sabbagh, Suvir Jain, Haifeng Zheng

Health Information Technology Research Laboratory School of IT, University of Sydney [email protected], [email protected], [email protected], [email protected]

Abstract Spelling correction in a large collection of clinical notes is a relatively unexplored research area and a serious problem, as typically 30% of tokens are unknown. In this paper we explore techniques to devise a spelling suggestion generator for clinical data, with an emphasis on the accuracy of the first suggestion. We use a combination of a rule-based suggestion generation system and a context-sensitive ranking algorithm to obtain the best results. The ranking is achieved using a language model created by bootstrapping from the corpus. The described method is embedded in a practical workflow that we use for Knowledge Discovery and Reuse in clinical records.

1. Introduction

Our work specializes in processing corpora of medical texts ((Wang and Patrick, 2008), (Ryan et al., 2008), (Zhang and Patrick, 2007) and (Asgari and Patrick, 2008)). These corpora usually contain many years of patient records from hospital clinical departments. Discovered knowledge about the texts needs to be reused immediately to automate certain language extraction tasks that support the workflow of staff in the hospital.

However, to be able to extract the required knowledge, first various lexical verification steps need to be undertaken, such as expansion of abbreviations and acronyms, recognition of named entities, drug dosage details etc. An initial analysis of a corpus is to identify unknown words and non-words (tokens containing any non-alphabetic characters), which typically constitute 30% of a corpus. A large majority of unknown words are misspellings but a significant number (10%-15%) are meaningful and need to be correctly interpreted.

The nature of the requisite processing techniques is more complex than simple spelling correction because the medical texts also contain various abbreviations, acronyms, names of people, places and even a number of neologisms. In such a text, every word that does not appear in a dictionary need not be an invalid word. Therefore, during the process of knowledge discovery, extreme care has to be taken that valid tokens are not misclassified as spelling errors. For abbreviations, acronyms and named entities, we apply separate workflows for their detection and resolution.

Our study is based on those words in the corpus that have been designated by medical experts as genuine spelling mistakes. These words include both medical and non-medical words. The diverse nature of these words renders traditional spell checkers ineffective and requires a novel spell checking system for their resolution.

Currently the nature of spelling correction of this system is 'offline', i.e., these suggestions are not being generated during entry of data in the hospitals. Instead, these methods are applied to the corpus as a whole. We look to implement this system as an 'online' process in the near future.

Our knowledge base to begin the knowledge discovery includes the following resources:

1. Systematised Nomenclature of Medicine-Clinical Terms, i.e., SNOMED-CT: 99860 words
2. The Moby Lexicon: 354992 words
3. The Unified Medical Language System, i.e., UMLS: 427578 words
4. Lists of abbreviations and acronyms discovered in previously processed clinical corpora: 1483 words
5. Gazetteers built of named entities discovered in previously processed clinical corpora: 19264 words
6. List of the gold standard corrections of misspellings in previously processed texts: 78889 words

Combined, this knowledge base provides us with 758,481 unique words at our disposal. The fact that the test corpus of 20 million tokens, which contained 57,523 words, initially had 24,735 unknown tokens, i.e., tokens which did not exist in our knowledge base, only shows the infinite diversity in clinical notes and the magnitude and complexity of the problem.
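To make the initial lexical analysis above concrete, the following minimal sketch flags corpus tokens that are absent from a combined knowledge base of the kind just listed. The file names and the simple tokeniser are illustrative assumptions, not the authors' implementation.

```python
import re

def load_knowledge_base(paths):
    """Union of the lexicon files (SNOMED-CT, Moby, UMLS, gazetteers, ...),
    one entry per line, lower-cased for case-insensitive lookup."""
    kb = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            kb.update(line.strip().lower() for line in f if line.strip())
    return kb

def classify_tokens(text, kb):
    """Split corpus tokens into known words, unknown words and non-words
    (tokens containing non-alphabetic characters)."""
    known, unknown, non_words = [], [], []
    for tok in re.findall(r"\S+", text):
        if not tok.isalpha():
            non_words.append(tok)      # e.g. dosages, dates, "3/10"
        elif tok.lower() in kb:
            known.append(tok)
        else:
            unknown.append(tok)        # candidate misspellings, abbreviations, names
    return known, unknown, non_words

# Hypothetical usage:
# kb = load_knowledge_base(["snomed_ct.txt", "moby.txt", "umls.txt",
#                           "abbreviations.txt", "gazetteers.txt", "gold_corrections.txt"])
# _, unknown, non_words = classify_tokens(open("notes.txt", encoding="utf-8").read(), kb)
```

The unknown list produced this way is the input to the suggestion-generation rules described in Section 3.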

2. Related Work

Various works have explored the problem of spelling correction in the past, but literature on spelling correction in clinical notes, which is a distinct problem, is quite sparse. A previous work (Crowell et al., 2004) used word frequency based sorting to improve the ranking of suggestions generated by programs like GSpell and Aspell. Their method does not actually detect any misspellings nor generate suggestions. Work (Ruch, 2002) has also been done to study contextual spelling correction to improve the effectiveness of an IR system. Another group (Tolentino et al., 2007) created a prototype spell checker using UMLS and WordNet as their sources of knowledge. See (Mykowiecka and Marciniak, 2006) for a program for automatic spelling correction in mammography reports. They utilized edit distances and bigram probabilities and their method is applied to a very specific domain. None of these methods scale up satisfactorily to the size and diversity of our problem.

Due to the lack of previous work on spelling correction in our very large medical corpus (60 million tokens), we were motivated to devise our own method.

The remainder of this paper is organized in the following manner. Section 3 contains details about the suggestion generation, the context-sensitive ranking algorithm and the heuristics employed by our method. Section 4 gives details about the data sets that we used in our experiments. Section 5 presents the experiments with the results. Sections 6, 7 and 8 include the interpretations, possible improvements in future work and the conclusion.

3. Method

Our method for spelling correction uses a combination of rule-based suggestion generation and a context-sensitive ranking algorithm.

3.1. Suggestion Generation

The suggestion generator that we used follows a series of rules to choose the suggestions that would be ranked in subsequent processing. The following are the various rules that are applied to the unknown word to choose valid suggestions. It should be noted that if the algorithm finds that a rule matches at least one valid suggestion, it does not process the subsequent rules. It simply passes on the computed set of suggestions to the ranking algorithm. For example, if the 'Edit Distance One' algorithm finds at least one suggestion that matches a word in our knowledge base, the subsequent rules are not processed. Otherwise, the next rule would have been processed, and so on.

Table 1 provides the average individual contribution of each of the following rules if they are used in isolation from the rest.

1. Edit Distance 1: Edit distance (also known as Damerau-Levenshtein distance (Damerau, 1964), (Levenshtein, 1966)) is a "distance" (string metric) between two strings, i.e., finite sequences of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two characters. The suggestions are generated by simulating deletion, transposition, alteration and insertion in one character position of the unknown word, and all those strings that are verified as legal from our knowledge base are put into the bag of suggestions.

Misspellings like blokage: (blockage), patholigical: (pathological) and maliase: (malaise) are corrected.

This method is based on Damerau's (Damerau, 1964) findings, according to which a majority of all spelling mistakes are represented by these actions. This list covers common typographic errors such as those caused due to pressing adjacent keys. It also covers those misspellings where the typist forgot to press a key or exchanged adjacent letters.

The suggestions are sorted according to their frequency in the corpus or the knowledge base, the differences of which are evaluated later.

It must be noted that these suggestions are not passed to the subsequent context-sensitive ranking algorithm. This is a heuristic that we have employed. The sorting according to the word frequency greatly increased the accuracy of the suggestions generated by this process (see Table 2). In the results we have shown the accuracy achieved by sorting both with corpus frequencies (CFSORT) and knowledge base frequencies (KBSORT) and their combination (CFKBSORT).

2. Missing white space between 2 words: In this step, most of the tokens which consist of two valid words concatenated due to missing white space are identified. This is achieved by inserting a space between every adjacent letter pair and then checking if the two substrings exist in the knowledge base. The set of suggestions generated by this function is passed on to the ranking algorithm.

Misspellings like patientrefused: (patient refused) are corrected.

3. Edit Distance 2: In this step we compute all the strings within edit distance 2 of the misspelling. This is achieved by finding all strings within edit distance 1 of the unknown word and repeating the process for all generated words. This method too has a good coverage and covers misspellings which might have a single error at 2 distinct positions in the word. It also covers those words which have multiple errors at a single site. Of these, all those strings that exist in our database are passed to the ranking algorithm.

Misspellings like statarating: (saturating) are corrected.

4. Phonetic: This process generates all the valid strings which are phonetically similar to the misspelling. The set of strings generated is verified against the knowledge base and all those strings which are valid are passed on to the ranking algorithm.

The algorithm used here is the "Double Metaphone" algorithm. The Double Metaphone search algorithm is a phonetic algorithm and is the second generation of the Metaphone algorithm (Phillips, 2000). It is called "Double" because it can return both a primary and a secondary code for a string; this accounts for some ambiguous cases as well as for multiple variants of surnames with common ancestry. For example, encoding the name "Smith" yields a primary code of SM0 and a secondary code of XMT, while the name "Schmidt" yields a primary code of XMT and a secondary code of SMT; both have XMT in common.

Misspellings like simptamotolgi: (symptomatology) are corrected.
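Before the remaining rules, the candidate generation of rules 1 and 2 above can be sketched as follows. This is a minimal illustration, assuming `knowledge_base` is the combined lexicon already loaded; it is not the authors' code.

```python
import string

ALPHABET = string.ascii_lowercase

def edit_distance_1(word):
    """All strings one edit away: deletion, transposition, alteration, insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    alters = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + alters + inserts)

def rule_edit_distance_1(word, knowledge_base):
    """Rule 1: keep only generated strings that are legal in the knowledge base."""
    return {cand for cand in edit_distance_1(word) if cand in knowledge_base}

def rule_missing_space(word, knowledge_base):
    """Rule 2: split at every position; keep splits whose halves are both known."""
    return {f"{word[:i]} {word[i:]}"
            for i in range(1, len(word))
            if word[:i] in knowledge_base and word[i:] in knowledge_base}

# e.g. rule_edit_distance_1("blokage", kb)       -> {"blockage"}
#      rule_missing_space("patientrefused", kb)  -> {"patient refused"}
```

Rules 3, 6 and 7 below are compositions of these two operations, and rules 4 and 5 replace the character edits with phonetic codes.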

5. Phonetic edit distance 1: In our medical lexicon, many words are long and complex. Typists often tend to miss one syllable or do a contraction. With this function, such words will be coded by phonetics and then we generate all words with edit distance one for this phonetic code. The suggestion list is populated by all words which have the same phonetic code as we generated here. It will largely extend our suggestion list.

Misspellings like amns: (amounts) and ascultion: (auscultation) are corrected.

6. Edit distance 1 and concatenated word: The combination of rules 1 and 2 is used. There are often words in the text where a valid word is joined with an invalid one, which is usually within 1-edit distance of its correction. Such cases are dealt with at this stage. As before, the set of suggestions is passed on to the ranking algorithm.

Misspellings like affectinback: (affecting back) are corrected.

7. Multiple concatenated words: Sometimes, 3 or more words have been concatenated due to missing white space. Such words are filtered here and the suggestions are passed on to the ranking algorithm.

Misspellings like explorationforbleedingand: (exploration for bleeding and) are corrected.

3.2. Context-Sensitive Ranking Algorithm

Previous studies (see (Elmi and Evens, 1998), (Golding and Schabes, 1996) and (Kukich, 1992)) have suggested that spelling correction is much more effective when the method takes into account the context in which the word occurs. We take into account the context considering the misspelling, one word before it and one word after it. Unlike some previous studies, we did not experiment with context defined as adjacent letters. Since our corpus is large, we could get reliable estimates if we considered the surrounding words as the context.

For this purpose, we used the CMU-Cambridge Statistical Language Modelling Toolkit (Clarkson and Rosenfeld, 1997). It is a suite of UNIX software tools to facilitate the construction and testing of statistical language models. It includes tools to process general textual data into word frequency lists and vocabularies, n-gram counts and backoff models.

The toolkit was used to build a vocabulary using all the words in the corpus which occurred more than once. Using these words, a language model was constructed. This tool was used to calculate all unigram, bigram and trigram probabilities along with their corresponding backoff values. The probability values were smoothed to increase the reliability.

One of the salient features of our method is that the corpus was bootstrapped to correct its own mistakes. This was done by ranking the suggestions using the language model and the frequency of the misspellings in the corpus itself. This method helped promote the more relevant suggestions in the suggestion set to the top positions. This method greatly increased the accuracy of the first suggestion.

3.3. Complex Frequency Ranking Algorithms

The basic ranking algorithm for candidate corrections is a frequency-based technique. (Crowell et al., 2004) proposed their frequency-based model to improve the spelling suggestion rank in medical queries. Our trained dictionaries have two kinds of frequency values. The first is a knowledge base word frequency, which computes the number of times the same word is repeated when we are training the dictionary set. The second is a corpus word frequency, which uses the word frequencies in the corpus. As shown in previous works, frequency-based methods can improve the first suggestion accuracy of spelling correction; however, the dictionary configuration is a real problem, and is fairly hard to optimize. At this stage, we train our frequency values based on the following three methods: the first is based on corpus word frequency; the second is based on knowledge base word frequency; the third is based on the combined frequency in the knowledge base and corpus. Considering there are many misspellings from missing white space(s), it is not fair to compute the frequency of the misspelling as a whole or just compute the frequency of the first separated word. This paper introduces two new frequency ranking algorithms:

1. First two words ranking: This algorithm computes the frequency of the first two words when a suggestion contains more than one word.

2. First and last word ranking: This algorithm computes the combined frequency of the first and last word of a suggestion.

3.4. Heuristics Employed

A set of the interesting heuristics that was evaluated to improve the first suggestion's accuracy is also provided.

1. Suggestions generated by Edit Distance 1 were not ranked by the context-sensitive ranking algorithm. They were instead sorted by their word frequencies in the knowledge base. The rationale behind this strategy was that suggestions that occur more frequently in the knowledge base are more likely to occur in the corpus. The accuracy was found to drop sharply when we tried to rank these suggestions with the context-sensitive method instead.

2. The misspellings of adverbs ending with 'ly' were not being corrected in a satisfactory manner. For this purpose, we devised a simple heuristic. If a misspelling ended with 'ly', then the first suggestion in the suggestion set ending with 'ly' was arbitrarily promoted to the first position in the suggestion set. The application of this heuristic enabled us to correct almost all such misspelt adverbs.

For example, if the misspelling was accidentaly, the first suggestion might be accidental whereas the correct suggestion accidentally might be third or fourth in the list.
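The context-sensitive ranking of Section 3.2 can be illustrated with the sketch below. The paper builds its model with the CMU-Cambridge toolkit; here a plain dictionary of smoothed trigram log-probabilities and two frequency tables are assumed, and the frequency fallback of Heuristic 5 (described later) is folded in when trigram evidence is missing. The names are illustrative only.

```python
import math

def score(prev_word, candidate, next_word, trigram_logprob, corpus_freq, kb_freq):
    """Score a candidate correction by its trigram probability in context
    (one word before, one word after), falling back to combined corpus and
    knowledge-base frequency when the trigram is unseen."""
    lp = trigram_logprob.get((prev_word, candidate, next_word))
    if lp is not None:
        return lp
    freq = corpus_freq.get(candidate, 0) + kb_freq.get(candidate, 0)
    return -50.0 + math.log(freq + 1)   # keep fallback scores below real trigram scores

def rank_suggestions(prev_word, suggestions, next_word,
                     trigram_logprob, corpus_freq, kb_freq):
    """Return the bag of suggestions sorted so the most likely correction is first."""
    return sorted(suggestions,
                  key=lambda cand: score(prev_word, cand, next_word,
                                         trigram_logprob, corpus_freq, kb_freq),
                  reverse=True)
```

Because the trigram counts come from the (partially misspelt) corpus itself, this is also where the bootstrapping effect described above enters: the corpus is used to rank corrections of its own errors.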

3. The orthographic case for the misspelling was not considered. All words were converted to lower case. It was noted that although words beginning with upper case do give an idea if the word is a named entity, many such named entities did not begin with upper case. Therefore, to simplify processing, the case was ignored for all words.

4. The Edit Distance methods used Inverse Edit Distance instead of direct edit distance due to the large size of our dictionaries. The use of Direct Edit Distance would have led to unnecessarily long processing times. Instead, Inverse Edit Distance allowed us to calculate all the possible strings and then check the dictionaries for them.

5. In some cases, one or more suggestions have the same probability when we try to rank them by the language model. This was due to sparseness in the corpus. In such cases, we summed the frequency of the suggestion in the corpus and the frequency of the suggestion in the knowledge base and included them in the computation of the language model. This helped us get a more reliable ranking for the suggestion.

4. Data Set

1. The training data consisted of misspellings in a corpus of clinical records of the Emergency Department at the Concord Hospital, Sydney. The total number of unique words in the corpus was 57,523. This corpus contained 7442 misspellings for which we already had gold standard corrections verified by expert analysis. We found that this set served our purpose quite well because it contained a good mixture of medical as well as non-medical words. It also contained many rare words and some very convoluted misspellings which could not be corrected by usual spelling correction techniques. Interestingly, due to this reason the methods show a higher accuracy on the test data than the training data.

2. The test data contained 2 separate data sets as described below.

• The first data set consisted of 12,438 words in the Concord Emergency Department Corpus which has been described above. These words were hitherto unknown words which did not exist in our knowledge base. These words were manually corrected and along with their gold standard they serve as one of the test data sets of our method.

• The second data set consists of 65,429 misspellings from the Intensive Care Unit patient records at the Royal Prince Alfred Hospital, Sydney, the RPAH-ICU corpus. This corpus contained a total of 164,302 unique words.

5. Experiments and Results

5.1. Experiments

Initial experiments started with a simple spell checker to establish a baseline. At this stage the spell checker only checked for a known word, a word with 1-edit distance or 2-edit distance, as described in Section 3.1. The first suggestion was matched with the gold standard and the accuracy was calculated. This method gave an accuracy of 60.52% on our test data and 63.78% on the training data. We then incrementally added the different heuristics to further improve the accuracy. The percentage increase with the addition of each step is presented in Table 2. Finally, the accuracy reached 62.31% on the training data and 75.11% on the test data.

At this point, sorting of the suggestions based on their frequency in the knowledge base was incorporated. Words which are more frequent in the knowledge base should logically be ranked higher in the suggestion list. This option was only available because of the large knowledge base. Therefore, sorting the suggestions based on their word frequencies in the corpus was also used. In both cases, the accuracy improved by a significant margin, but the sorting based on corpus frequencies provided slightly higher accuracy than that based on knowledge base frequencies.

Consider an example to illustrate why sorting based on knowledge base frequencies is useful. The word haemodynamically has been misspelt in 381 different ways in previously processed texts. In the ICU corpus alone, it occurs 30,162 times in its correct form. This makes it highly probable that if 'haemodynamically' is present anywhere in the suggestion list, it deserves to be among the top suggestions.

Similarly, sorting based on corpus word frequencies is also helpful and is a method which can be used for any large corpus. Also, the texts belong to a single department, and a great deal of medical terms that are used in one department would not have been used at all in another. Therefore this method helps improve the ranking especially for words that are typical of the specific corpus. The Phone1 (phonetic edit distance 1) algorithm works better in the Concord-ED collection because it contains more complex errors than the training set and the RPAH-ICU data set.

Sorting based on combined frequencies in the corpus and knowledge base was slightly more effective than these methods individually (see Table 2).

After trying a number of different methods to generate suggestions, the next improvement was generated from experimenting with ranking suggestions. When the number of suggestions is large, ranking is difficult because there are a lot of high-frequency small words which are within 'Edit Distance 1' or 'Edit Distance 2' of many misspellings. So, some dictionaries were progressively removed from our knowledge base. As shown in Table 3, removing each dictionary improves the result slightly. While having rich resources may increase the coverage (recall), it injects noise into the suggestions and leads to a decrease in accuracy. We need to find a tradeoff between the coverage and the number of suggestions for each misspelling. By reducing the number of dictionaries, the number of suggestions is reduced. The last row of this table shows the effect of a language model on the overall accuracy. For this, a language model was created from all the lexically verified words that occurred more than once in the specific medical text that is being processed. The language model stores all the word unigrams, bigrams and trigrams with their corresponding backoff values and probability values. The language model probabilities were added to the corpus and knowledge base frequencies as explained in Heuristic 5.
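The frequency-based sorting variants compared above (CFSORT, KBSORT, CFKBSORT) and the first-suggestion accuracy measure used throughout this section can be sketched as follows; the function names and data structures are illustrative assumptions, not the authors' code.

```python
def sort_suggestions(suggestions, corpus_freq=None, kb_freq=None):
    """CFSORT, KBSORT or CFKBSORT, depending on which frequency tables are supplied."""
    def combined(word):
        c = corpus_freq.get(word, 0) if corpus_freq else 0
        k = kb_freq.get(word, 0) if kb_freq else 0
        return c + k
    return sorted(suggestions, key=combined, reverse=True)

def first_suggestion_accuracy(misspellings, gold, suggest):
    """Fraction of misspellings whose top-ranked suggestion equals the gold correction.
    `suggest` is any callable returning a ranked list of candidates."""
    correct = 0
    for m in misspellings:
        candidates = suggest(m)
        if candidates and candidates[0] == gold[m]:
            correct += 1
    return correct / len(misspellings)
```

Evaluating such a loop over the gold-standard corrections after each added rule or sorting strategy yields the incremental figures reported in Table 2.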

Rule | Percentage | Description
Edit distance 1 | 69.80 | Edit distance one
Two words | 6.73 | Missing white space
Edit distance 2 | 85.52 | Contains Edit distance one and two
Phonetic | 42.52 | Pure Phonetic
Two word edit distance 1 | 7.09 | Contains Two words and Edit distance one
Multi word | 6.86 | Contains two or more concatenated words
Phonetic edit distance 1 | 84.78 | Pure Phonetic and Phonetic with Edit distance one

Table 1: Mean Individual Contribution of each Spelling Correction Algorithm to the Concord-ED Corpus

Rule | Training data | Concord-ED | RPAH-ICU
(Baseline) Edit Distance 1 and 2 | 60.62 | 63.78 | 62.28
Adding 'Two words' | 62.31 | 74.99 | 66.42
Adding 'Phonetic' | 62.31 | 75.11 | 66.57
Adding 'Two word edits 1' | 62.31 | 75.11 | 66.58
Adding 'Multi word' | 62.31 | 75.11 | 66.61
Adding 'Sorting based on corpus word frequency' | 87.19 | 93.17 | 81.24
Adding 'Sorting based on knowledge base word frequency' | 83.41 | 93.07 | 79.99
Adding 'Sorting based on combined frequency in knowledge base and corpus' | 87.25 | 93.54 | 81.45
Replace 'Phonetic' with 'Phonetic edit distance 1' | 87.24 | 93.43 | 81.69
Adding 'First two words ranking' | 87.25 | 92.89 | 81.75
Adding 'First and last word ranking' | 87.26 | 93.05 | 81.83

Table 2: Percentage Accuracy After Addition of each Process Described in Section 3.1
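The two multi-word ranking scores in the last rows of Table 2 can be sketched roughly as follows (an illustration under the assumption that frequencies come from the same corpus or knowledge-base tables; not the authors' implementation).

```python
def first_two_words_score(suggestion, freq):
    """'First two words ranking': frequency of the first two words of a
    multi-word suggestion (e.g. a repaired missing-white-space token)."""
    words = suggestion.split()
    return sum(freq.get(w, 0) for w in words[:2])

def first_last_word_score(suggestion, freq):
    """'First and last word ranking' (FLW): combined frequency of the first and
    last word, skipping the short high-frequency words in between."""
    words = suggestion.split()
    if len(words) == 1:
        return freq.get(words[0], 0)
    return freq.get(words[0], 0) + freq.get(words[-1], 0)

# ranked = sorted(candidates,
#                 key=lambda s: first_last_word_score(s, corpus_freq), reverse=True)
```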

The reason the RPAH-ICU data collection was used here is that it contains a complete set of all errors, and so it is a good test set for this method. The language model was applied to the suggestions computed after application of each of the sorting strategies. The result drops a little because the frequency ranking is very good and adding each ranking method may worsen the accuracy.

5.2. Observations

1. Each of the three data sets that has been used in this experiment is unique in itself. This is the reason that the accuracies are very different for each data set. But analysis of an individual data set's results shows that the increase with the addition of sorting and/or the language model was equally pronounced. The training set has some very complex misspellings which could not be corrected by a general heuristic. The Concord Test Corpus had more non-medical words and fewer medical terms. The RPAH-ICU corpus had more of a mix of medical terms and non-medical words.

2. As shown in Table 2, both the 'First two words ranking' and 'First and last word ranking' algorithms improved the accuracy, but obviously the latter is better. According to the analysis of over 200 multiple concatenated word cases, the small word coverage of 'First two words ranking' is 89 against 81 of 'First and last word ranking'. The small words are defined as prepositions, conjunctions, pronouns and some common adjectives, adverbs, verbs and nouns, which have fairly short word length and fairly high frequency value. The short length makes them easier to be considered as suggestions, and the high frequency value can adversely affect the suggestion ranking. Thus, FLW ranking is more suitable as it skips the short words and only considers the first and last word.

Fig 1. Variation of Accuracy of Suggestions with Word Length

3. An analysis of the results showed that the accuracy of our method was found to increase as the word length increased.

Rule | Correct Num (out of 65429) | RPAH-ICU corpus (%)
Remove UMLS | 54038 | 82.59
Remove SNOMED-CT | 53769 | 82.18
Remove Both | 54857 | 83.84
Remove Gazetteers | 53614 | 81.94
Remove All | 55197 | 84.36
Remove All + Adding Language Model | 54985 | 84.04

Table 3: Percentage Accuracy After Removing Dictionaries and Adding Language Model

This is because the bag of suggestions for longer words contains fewer suggestions and we can easily rank them. Figure 1 shows the gradual increase of the first suggestion's accuracy with increasing word length. It also shows the decrease in accuracy of the second suggestion and manual corrections with increasing word length. Spikes at word lengths 17 and 20 are anomalies due to the low frequency of very long tokens in the corpus.

4. The results show that an exhaustive knowledge base is a boon for a spell checker.

5. With a large corpus, both corpus word frequencies and context-based probabilities are quite effective. A combination of these two techniques yields improved results.

6. Possible Improvements

The authors believe that the method described in this paper can be improved by one or more of the following:

1. Both language model based sorting and simple word frequency sorting to improve the ranking of suggestions have given us comparable results. This does not mean that the use of the language model is futile. We believe that there is room for improvement in the language model. We have used only trigram probabilities but context is often beyond trigrams. In the future, we plan to implement a more sophisticated language model to augment the spelling correction system.

2. Some words could have been properly corrected with a good US-UK spelling disambiguation system. It was seen that sometimes the first suggestion was anesthetize but the corresponding gold standard was anaesthetize. In our case, the obvious work-around to this problem was adding both to the gold standard. In some cases, we looked at the WordNet synonym list of the first suggestion. For non-medical terms, this method often worked fine. But far better than all such methods would have been the use of a robust US-UK spelling disambiguation system with good coverage.

3. Sometimes when the misspelling was that of a verb, the spelling suggestions preserved the root but not the morphology. A system which incorporates such functionality would be another possible improvement.

4. A small percentage of words in the gold standard are erroneous and this slightly lowered the accuracy.

5. A limitation of the context-sensitive ranking algorithm was that the suggestion generated did not always exist in the corpus itself. We tried countering this error by adding the knowledge base word frequencies to the trigram probabilities. A semantic approach to such ambiguous cases would be more effective and, at the same time, elegant.

7. Conclusions

We have presented a method for spelling correction based on combining heuristic-based suggestion generation and ranking algorithms based on word frequencies and trigram probabilities. We have achieved high accuracies on both the test data sets with both the methods. We believe the results are satisfactory and an improvement upon previous work. We look to implement more sophisticated methods in the future to utilize the context in a more effective manner. This spelling suggester has been incorporated in the group's workflow and is used in the manual process of spelling correction to great effect.

8. References

P. Asgari and J. Patrick. 2008. The new frontier - analysing clinical notes for translation research - back to the future. In Violaine Prince and Mathieu Roche, editors, Information Retrieval in Biomedicine: Natural Language Processing for Knowledge Integration, pp 357–378. IGI Global.
P.R. Clarkson and R. Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings ESCA Eurospeech, pp 2701–2710, September.
Jonathan Crowell, Qing Zeng, Long Ngo, and Eve-Marie Lacroix. 2004. A frequency-based technique to improve the spelling suggestion rank in medical queries. J Am Med Inform Assoc., 11:179–185.
F.J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171–176, March.
Mohammed Ali Elmi and Marthe Evens. 1998. Spelling correction using context. In Proceedings of the 17th International Conference on Computational Linguistics, volume 98, pp 360–364.
A. Golding and Y. Schabes. 1996. Combining trigram-based and feature-based methods for context-sensitive spelling correction. In Proc. 34th ACL, pp 71–78.
K. Kukich. 1992. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439.

V.I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10:707.
Agnieszka Mykowiecka and Małgorzata Marciniak. 2006. Domain-driven automatic spelling correction for mammography reports. Intelligent Information Processing and Web Mining, 5:521–530.
Lawrence Phillips. 2000. The double metaphone search algorithm. Technical report, C/C++ Users Journal.
Patrick Ruch. 2002. Using contextual spelling correction to improve retrieval effectiveness in degraded text collections. In Proceedings of the 19th International Conference on Computational Linguistics, volume 1, pp 1–7. Association for Computational Linguistics.
A. Ryan, J. Patrick, and R. Herkes. 2008. Introduction of enhancement technologies into the intensive care service, Royal Prince Alfred Hospital, Sydney. Health Information Management Journal, 37(1):39–44.
Herman D Tolentino, Michael D Matters, Wikke Walop, Barbara Law, Wesley Tong, Fang Liu, Paul Fontelo, Katrin Kohl, and Daniel C Payne. 2007. A UMLS-based spell checker for natural language processing in vaccine safety. BMC Med Inform Decis Mak., 7(3), February.
F. Wang and J. Patrick. 2008. Mapping clinical notes to medical terminologies at point of care. In D. Demner-Fushman, S. Ananiadou, K. Bretonnel Cohen, J. Pestian, J. Tsujii, and B. Webber, editors, Proc of BioNLP 2008, pp 102–103. Association for Computational Linguistics (ACL).
Y. Zhang and J. Patrick. 2007. Extracting semantics in a clinical scenario. In Health Knowledge and Data Mining Workshop, Ballarat, Research and Practice in Information Technology, volume 249, pp 217–224.

Automatically Building a Repository to Support Evidence Based Practice

Dina Demner-Fushman, Joanna L. Karpinski, George R. Thoma Lister Hill National Center for Biomedical Communications (LHNCBC) U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD 20894 {ddemner|karpinskij|gthoma}@mail.nih.gov

Abstract Our long-term goal is to find, store, update, and provide access to key facts needed to support clinical decision making. Presently, the facts are extracted automatically from clinical narrative and biomedical literature sources, primarily MEDLINE, and stored in the Repository for Informed Decision Making. We envision expert community validation of the extracted facts and peer-reviewed direct deposit of key facts in the future. The key facts reported in publications do not change and can be extracted in advance and retrieved as needed. We chose an alternative approach to building the repository: extraction of key facts for specific clinical tasks and clinical scenarios. Combined with providing extracted facts (linked to the original publications) as service at the point of care, this approach allows capturing clinicians’ relevance judgments and leads to gradual, weakly-supervised construction of a collection of facts and documents pertaining to specific clinical situations and expert judgments of relevance and quality of the documents. In this paper, we demonstrate our approach to corpus construction using the clinical task of patient care plan development. We provide an overview of the process and focus on the automatic construction of PubMed queries – an essential step in finding documents containing key facts.

Section 5 presents the structure of our repository and the 1. Introduction mechanism for obtaining relevance judgments as part of Collections containing biomedical documents, key facts the healthcare workflow. We conclude with a discussion extracted from the documents, and judgments on of the preliminary results of our approach to building the relevance and value of the documents to specific clinical repository. tasks and questions are essential for many reasons, including clinical decision support and further 2. Patient Care Plan Development development of biomedical natural language processing methods. Our research into methods for building Mobility Problem: limited mobility due to huge mass right arm. collections of key clinical facts relatively fast and at low Skin Problem: st 3 sacral decub cost is motivated by the well-known desire of clinicians to GI Problem: nausea/vomiting-new onset have research evidence provided in the form of Respiratory Problem: emphysema /smoking cessation Psychosocial Problem: depression bottom-line advice (Ely et al., 2005) on the one hand, and Neurological/Cognition Problem: Declining cognitive function significant manual efforts presently needed to find and Pain Problem: pain rigth arm summarize key facts in the biomedical domain, on the Pain Problem: R arm tumor pain other hand. For example, creation of the 2007 Text Pain Problem: patient c/o intermittent pain to surgical site REtrieval Conference (TREC) Genomics track collection Pain Goals: pt able to do ADL with minimal pain involved extensive interviewing of biologists to obtain Pain Goals: Pt able to rate pain <3/10. real-life questions of interest to biological domain, as well Pain Goals: pt with pca hydromorphone. cont. dose. also as recruiting judges with significant domain knowledge, reciving bupivicaine 0.25% via epineural typically in the form of a PhD in a life science (Roberts et Pain Interventions: cont with pca dosing, prn hydromorphone al., 2009). also to be given for break through pain We propose inferring clinical questions using formal Respiratory Interventions: pt moved to ICU early AM for persistent dyspnea unknown etiology includ wheezing, representation of a patient’s case, rather than actively tachypnea despite ok saturations and adequate pain control: plan soliciting information needs. The second step of our for further diagnostics repository building process is fairly typical for building collections and consists of literature retrieval. Our Figure 1: Semi-structured interdisciplinary team note. The retrieval process is complicated by the fact that rather than format combines problem types restricted to controlled having a short description of information need provided vocabulary (shown in bold) and free text description of the by a user or a relevant document (which a search engine problems entered by the team members could use to find similar documents) we are presented with the description of the patient’s status and need to find Care plan development starts with assessment of a documents relevant both to the patient’s status and the patient’s status1. The assessment results are documented clinical task to be performed. We describe our approach to automatic construction of expert queries in Section 3 and 1 The assessment part of the note entered into a patient’s chart is the evaluation of the method in Section 4. 
The rest of the sometimes preceded by patient’s description of the patient's paper is organized as follows: in Section 2 we introduce current condition (mostly in narrative form) and registration of the clinical task of patient care plan development and the objective conditions, such as vital signs, patient’s status framework for formal representation of a patient’s case. observed by the clinician during examination, results of laboratory tests, and other observations.

9 in the care plan as descriptions of the patient’s problems. relevant results, presented the most relevant results at the The form of the description ranges from a list of problems top of the results display, and retrieved a total number of selected from a controlled vocabulary to a narrative hits that could be easily perused by a busy clinician in two summary of the problems. Once the problems are minutes or less. These strings were constructed using the established, the clinician reviews each problem and patient’s primary diagnosis (Chief Complaint) and establishes goals to be achieved while addressing the problems found in the interdisciplinary notes (IDP problems, and plans interventions to achieve the goals. An Problem). example of a de-identified note derived from the interdisciplinary notes used to build our collection is 3.1 PICO Representation of Patient Records shown in Figure 1. Ideally, a clinician would seek The Problem and Intervention extraction modules of the evidence support for all three steps of care plan CQA system were used to represent patients’ cases. Given development. a clinical note, the system automatically generates a The elements of the care plan are in essence the same question frame using one of the Named Entity elements that are used in the evidence based practice Recognition (NER) tools (MetaMap (Aronson, 2001), (EBP) framework for finding information to ensure the NER modules of the NLM experimental search engine best possible care in a given clinical situation. The Essie (Ide et al., 2007) or CQA internal dictionary-based elements of the framework, called PICO, are: the NER module) and a set of rules for extraction of the description of the patient and problem, intended elements of a clinical scenario. interventions and comparisons, and desirable outcomes (Richardson et al., 1995). Clearly, these are the problems, IDP Pain Problem: patient c/o intermittent pain to surgical site interventions, and goals sections of the care plan and we, IDP Pain Interventions: cont with pca dosing, prn hydromorphone therefore, can use the framework developed within EBP also to be given for break through pain for construction of well-formed clinical questions to NER (MetaMap Mappings): Intermittent pain [Sign or Symptom]; formally represent the clinical situation. We use an Surgical Site (Operative site) [Spatial Concept]; Hydromorphone existing EBP-based question-answering system, CQA, [Organic Chemical,Pharmacologic Substance]; …; Pain [Sign or (Demner-Fushman and Lin, 2007) to find and extract key Symptom] facts from publications relevant to the patient’s case. The CQA Problem(s): Intermittent pain CQA Patient: CQA system showed good performance answering CQA Intervention(s): Hydromorphone clinical questions when queries and clinical scenario frames were developed manually. Manual formulation of Figure 2: Clinical question frame is used to formally the query is not ideally suited for use in clinical setting represent a patient’s note using NER because it interrupts the workflow and is often perceived as less useful than spending the time with the patient (Bond, 2007). 
To generate question frames (see Figure 2 for an example), To replace manual initiation of the search for key facts, the CQA system extracts from the NER output concepts we developed an automatic process for extracting that belong to the following semantic groups: information from a patient’s record and properly Problems/findings (meant to represent a patient’s problem formulating a query to identify appropriate evidence list), Interventions, and Anatomy (which provides details using the National Library of Medicine (NLM) resource, about the patient). The semantic groups are based on the PubMed®. Our solution to automatic construction of Unified Medical Language System® (UMLS®) clinical questions and initiation of the search process is (Lindberg et al., 1993) Metathesaurus semantic types. The described next. Problems/findings semantic group is based on the UMLS semantic group Disorders (McCray et al., 2001). The Interventions group includes therapeutic and diagnostic 3. Query Formulation Algorithm procedures, drugs, and drug delivery devices. The Anatomy group includes semantic types in the anatomy To identify search strategies that would yield relevant and physiology groups excluding those on the cell and results, we manually developed a set of reference PubMed molecular level (for example, Cell or Molecular search strings to analyze for text elements, query forms, Function). and search processes that are most likely to yield successful search results. The set was developed by a 3.2 Analysis of Reference Queries medical librarian (the second author) using 254 records of We evaluated the manually constructed search string patient encounters from 52 patients selected from a 2 dataset of more than 4500 patient encounters. The 254 using SAS® hypothesis testing (we used the SAS 9.0 records were selected because the formal representations SURVEYLOGISTIC Procedure with the Cumulative Logit logistic regression model and Fisher's Scoring of the encounters using the PICO framework and simple 3 searches (that combined all identified PICO elements) optimization) and SPSS linear and cubic regression retrieved at least one MEDLINE® citation or other analysis. The set of citations retrieved by each search evidence (for example, a MedlinePlus® article). The 2 reference queries retrieved the greatest proportion of http://www.sas.com/ 3 http://www.spss.com/
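The question-frame construction described in Section 3.1 can be illustrated with a minimal sketch. NER output (for example, MetaMap mappings) is assumed here to be a list of (concept, semantic type) pairs, and concepts are routed into the Problems/findings, Interventions and Patient (anatomy) slots by semantic group; the group memberships shown are simplified examples, not the full UMLS groupings.

```python
# Illustrative subsets of the semantic groups used for slot filling
PROBLEM_TYPES = {"Sign or Symptom", "Disease or Syndrome", "Finding", "Neoplastic Process"}
INTERVENTION_TYPES = {"Pharmacologic Substance", "Therapeutic or Preventive Procedure",
                      "Diagnostic Procedure", "Medical Device"}
ANATOMY_TYPES = {"Body Part, Organ, or Organ Component", "Body Location or Region"}

def build_question_frame(ner_output):
    """ner_output: iterable of (concept, semantic_type) pairs from an NER tool.
    Returns a PICO-style clinical question frame."""
    frame = {"problems": [], "interventions": [], "patient": []}
    for concept, sem_type in ner_output:
        if sem_type in PROBLEM_TYPES:
            frame["problems"].append(concept)
        elif sem_type in INTERVENTION_TYPES:
            frame["interventions"].append(concept)
        elif sem_type in ANATOMY_TYPES:
            frame["patient"].append(concept)
    return frame

# build_question_frame([("Intermittent pain", "Sign or Symptom"),
#                       ("Hydromorphone", "Pharmacologic Substance")])
# -> {'problems': ['Intermittent pain'], 'interventions': ['Hydromorphone'], 'patient': []}
```

The resulting frame mirrors the CQA Problem(s)/Intervention(s) slots shown in Figure 2 and feeds the query formulation step described next.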

10 string was evaluated on the following criteria: to the Chief Complaint string) Overall success of the query (evaluated by the second 3.2. Identify terms in the IDP Problem field that are author on a scale of zero to four, 0=no hits, 4=ideal explicitly subheadings. Apply those to the Chief results set) as a function of the number of relevant Complaint string. (eg. Text phrase ―surgery on results in the top 10, number of hits retrieved, and the Tuesday‖ would give the subheading ―surgery‖ positions of all relevant results to the Chief Complaint terms.) Total number of citations retrieved 4. Construct IDP Problem string: Number of relevant citations in top 10 4.1. Use dependency parser to identify relationships Position of the first relevant citation between MeSH terms Number of queries executed prior to success or 4.2. Remove second-child terms termination 4.3. Combine parent terms with Boolean ―AND‖ Number of review article citations retrieved 4.4. Combine first-child terms with Boolean ―OR‖ Total number of relevant review article citations in and nest this string. top 10 5. Combine Chief Complaint and IDP Problem strings Position of first relevant review article citation. with Boolean ―AND.‖

Five variables of query construction were evaluated for Chief Complaint: Diffused B Cell Lymphoma their effects on quality of search results: IDP Problem Text: Maintain perfusion of right hand and 1. Increased use of Medical subject headings (MeSH® fingers, preserve neural function right hand and wrist, track terms) -- controlled vocabulary terms assigned to hematoma for status MEDLINE citations during NLM manual indexing Baseline Query: Diffused B Cell Lymphoma AND perfusion process. AND right hand AND fingers AND neural function AND wrist 2. Varied use of the Chief Complaint AND hematoma 3. Increased use of advanced search strategies (use of Advanced Query: (Diffused B Cell Lymphoma) AND subheadings when appropriate; identification of (Perfusion AND (Hand OR Fingers) AND (Nerves OR Wrist) AND hematoma) terms to search as major topic headings) 4. Use of complex query forms (use of Boolean Figure 3: Automatic query formulation strategies AND/OR/NOT; addition of nested search strings) 5. Application of search limits to retrieve only review articles to reduce total number of results retrieved 6. Run iterative searches as necessary: when the retrieved set is too large. 6.1. If, when Chief Complaint and IDP Problem strings are combined and hits retrieved are less SAS hypothesis testing identified variable 1, increased than or equal to 5, then combine any multiple use of MeSH terms, variable 3, increased use of advanced Chief Complaint terms with Boolean ―OR.‖ search strategies, and variable 4, use of complex query Re-execute search. forms, as statistically significant (p < 0.001) to a 6.2. If retrieved set is less than or equal to three, then successful search outcome. SPSS regression analysis was re-execute the search using only the IDP then performed on the number of MeSH terms in the Problem string. query, revealing that searches that used between two and 6.3. If the results set retrieved from step (a) is five MeSH terms were far more likely to be successful between 20 and 100 hits, then search for all than searching using fewer than two or more than five Chief Complaint terms as major topic headings. MeSH terms. These factors were used to guide the 6.4. If retrieved set is still greater than or equal to 75, development of the query formulation process. limit results display to review articles. We evaluated the developed algorithm on 30 additional 3.3 Query Formulation Rules randomly-selected patient encounter records using a naïve ANDing of all identified PICO elements as the baseline. Based on the experience gained during creation of the Figure 3 demonstrates the differences in the advanced and reference queries and the SAS and SPSS analysis results, baseline query formulation. the second author derived the following rules for the automatic query formulation: 4. Experimental Evaluation of the Query 1. Identify MeSH terms Formulation Algorithm 2. Construct Chief Complaint String: If multiple MeSH terms are identified, combine terms with Boolean We evaluated citation sets retrieved by the baseline and ―AND‖. advanced search strategies using the evaluation criteria 3. Prior to constructing IDP Problem string for the reference searches (scale of zero to four; zero 3.1. Extract any subheading terms (Identify terms of being the lowest, with no results retrieved, and four being an identical semantic type to PubMed the highest, with the greatest overall relevancy and subheadings and apply to Chief Complaint usability of the results). Table 1 presents the results of this string (i.e. 
A drug name in the IDP Problem evaluation. field would give the subheading ―drug therapy‖
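The query-formulation rules above can be sketched roughly as follows; MeSH mapping and dependency parsing are assumed to be done upstream, the helper names are hypothetical, and only the core Boolean assembly (rules 2, 4 and 5) is shown.

```python
def build_chief_complaint_string(chief_complaint_mesh):
    """Rule 2: AND together the MeSH terms identified in the Chief Complaint."""
    return " AND ".join(chief_complaint_mesh)

def build_idp_problem_string(parent_terms, first_child_terms):
    """Rule 4 (simplified): AND the parent MeSH terms, OR the first-child terms,
    and nest the OR group; second-child terms are assumed already removed."""
    parts = list(parent_terms)
    if first_child_terms:
        parts.append("(" + " OR ".join(first_child_terms) + ")")
    return " AND ".join(parts)

def build_query(chief_complaint_mesh, parent_terms, first_child_terms):
    """Rule 5: combine the Chief Complaint and IDP Problem strings with AND."""
    cc = build_chief_complaint_string(chief_complaint_mesh)
    idp = build_idp_problem_string(parent_terms, first_child_terms)
    return f"({cc}) AND ({idp})"

# Approximates the shape of the Advanced Query in Figure 3:
# build_query(["Diffused B Cell Lymphoma"],
#             ["Perfusion", "hematoma"], ["Hand", "Fingers"])
```

The iterative widening and narrowing steps of rule 6 (ORing Chief Complaint terms, retrying with the IDP Problem string alone, restricting to major topic headings or review articles) would wrap around this function based on the number of hits returned by PubMed.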

11 Advanced for the placement of the first relevant result. Of the ten Relevance Ranking Baseline search sets derived with the baseline algorithm that contained 0 (no results retrieved) 4 0 relevant citations in the top ten results, the mean location was 5.5; median location was 6. Of the 30 sets derived 1,2 (not relevant) 26 20 with the advanced search algorithm, mean location of the 3,4 (relevant) 0 10 first relevant result was 2.4, with a median of 2. We observed that placement of the first relevant result was Table 1: Comparison of results retrieved using the elevated by a mean of four citations when retrieved using baseline and advanced search strategies. the advanced search algorithm. The overall number of relevant results in the top ten Assuming that a search query that retrieved results citations retrieved increased from a mean of 0.4 relevant receiving a 0, 1, or 2 relevance score were not likely to be citations in the top ten using the baseline algorithm to a useful to a clinician, and that only those receiving a 3 or 4 mean of 2.3 relevant citations in the top ten using the should be considered successful retrievals that would be advanced search algorithm. useful for patient care plan development, none of the We attribute better performance of the advanced search to results retrieved by the baseline strategy would thus be reduction of the number of search terms and establishing considered useful for care plan development, compared better relations between the terms due to dependency with 33.3% (10) of the results retrieved with the updated parsing and rules (as opposed to ANDing all terms found algorithm. in the note.) Result sets retrieved for all 30 queries were also evaluated

Figure 4: Obtaining expert judgments at the point of care

5. Obtaining Relevance Judgments at the The next section provides an overview of the system Point of Care that delivers evidence to an EHR and, at the same time, Obtaining relevance judgment by the intended provides information to the system that automatically consumers of evidence at the point of care is a builds the repository for informed decision making. non-trivial task. Ideally, it has to involve minimal effort and be perceived as part of the workflow. We hope to 5.1 System for Evidence Based Practice achieve this goal by providing a service that delivers Support key facts extracted by the above described tools Delivery of evidence to the point of care starts when a directly to an electronic patient record (EHR).

12 clinician requests evidence 4 . The EHR generates a evidence deliveries. request to our service. The request sent to our service Recently, clinicians requested that the system allows contains the Chief Complaint and the de-identified modifying the automatically constructed query and patient’s note (similar to the one shown in Figure 1). repeating the search. The search box shown in the top Our system then extracts the PICO elements and MeSH part of Figure 4 is about to be deployed to the EHR. We terms found in the note using the CQA modules are looking forward to augmenting our collection with described in Section 3.1; constructs the query following manually corrected searches. rules described in Section 3; searches MEDLINE; and responds with an overview of the retrieved evidence to 6. Related Work be displayed in the EHR. Figure 4 shows the part of the To the best of our knowledge, the proposed approach to information dashboard delivered to the EHR that building a collection is new. The mechanism for contains the overview of evidence. An overview of the capturing experts’ judgments is related to collaborative evidence delivery system is provided in (Demner- filtering widely used in commercial systems to predict Fushman et al., 2008). a user’s interest by collecting information about many The evidence provided to clinicians in the information users’ taste in music, books, etc. (Goldberg et al., 1992), dashboard consists of the titles of MEDLINE citations and adaptive information retrieval (Jose et al., 2008). (linked to the citation and full text paper, if available) Our capturing of relevance judgments provided to and the summary of key facts extracted from the improve colleagues experience is related to secondary citation using the CQA system. use of biomedical literature, such as using inclusion or Logging of the clinicians’ navigation of the dashboard citation of a paper by the American College of and their judgments is described next. Physicians (ACP) Journal Club as an indication of the paper’s high quality and relevance to a specific clinical 5.2 Capturing Expert Judgments task (Aphinyanaphongs et al., 2005); use of MeSH The key facts are displayed under the title of a heading indexing as the reference standard in MEDLINE citation on demand, The ―thumbs up‖ and information extraction task (Aronson et al., 2008); and ―thumbs down‖ icons to the left of each article allow use of journal descriptors for word sense for a quick one-click judgment. When a clinician clicks disambiguation (Humphrey et al., 2005). one of the thumbs icons, the smiley- or sad-face icons We also believe to have developed a new approach to are displayed to illustrate judgment results. At the same automatic query creation. Several approaches have time, judgments linked to the citation PubMed unique been previously taken to the process of automating identifier and the unique identifiers of the clinical query construction using contextual information while scenario are stored in the repository, maintained as filtering for the best quality information. MySQL database. The system also registers if the Of these systems, many retrieve relevant information judgment is based on viewing the title and the summary by relying on the categorization of information within alone, or after following the link to the full citation. the EHR to automatically identify terms and use those to populate the search fields of an appropriate evidence 5.3 Preliminary Results resource. 
KnowlegeLink system of drug information The system is under evaluation through delivering retrieval provides users with a search button within the evidence to an EHR at a major clinical center since EHR record where medication names appear (Maviglia August 2009. Although viewing evidence is the third et al., 2006). The system identifies drug names using popular activity (after accessing information about text parsing and automatically populates the URL of drugs and the specifics of the patient’s case), the one of two pre-specified drug databases (MicroMedEx absolute number of viewed citations is small (the 350 or SkolarMD) with the name of the drug. Within the followed links constitute less than 0.02% of all individual drug databases, users are able to select the interactions with the information dashboard). At the type of information about the drug they wish to search same time, we obtained expert relevance judgments for for (therapy, adverse effects, etc.) Cimino’s (2007) 267 citations (116 positive and 151 negative). The InfoButtons/InfoButton Manager system links the EHR relatively large proportion of papers with judgments to a repository of clinical questions and answers based compared to the total number of viewed papers is not on the user’s selection of a limited number of different surprising. The primary goal of judging is to improve categories of information from the electronic record. A other interdisciplinary team members experience when system designed by Rosenbloom et al. (2005) for use looking for evidence support: papers judged negatively with Vanderbilt University’s CPOE system similarly for a given scenario are lowered in rank (based on the provided users with a method of automatically cumulative judgment scores), whereas papers judged retrieving information relevant to patient care. Using positively are promoted in rank at the subsequent terms derived from the active diagnosis and medication categories in the CPOE system, basic keyword searches

4 could be constructed in PubMed at the point and The approach to initiating the process largely depends on moment of care. the EHR. In our current setting, the request is issued when a clinician clicks on the EBP tab of the EHR.

13 7. Conclusions Demner-Fushman D, Seckman C, Fisher C, Hauser SE, This paper presents an approach and an ongoing effort Clayton J, Thoma GR. A Prototype System to towards building a collection of clinical scenarios (that Support Evidence-based Practice. AMIA Annu serve as topics of interest), combined with documents Symp Proc. 2008 Nov 6:151-5. automatically retrieved to augment the scenarios with Ely JW, Osheroff JA, Chambliss ML, Ebell MH, evidence, and expert judgments on the relevance of Rosenbaum ME. (2005) Answering physicians' retrieved documents to the clinical scenario. The clinical questions: obstacles and potential solutions. benefits of the proposed approach are in the relatively Journal of the American Medical Informatics low cost and minimal supervision in construction of the Association, 12(2), pp. 217--224. collection, as well as obtaining expert judgments at the Goldberg D, Nichols D, Oki BM, Terry D. (1992) point of care. Using collaborative filtering to weave an Some of the drawbacks of the approach are in slow rate information tapestry, Communications of the ACM, of obtaining judgments, sparseness of judgments and 35 (12), pp. 61—70 lack of reasons for judgments. The last two issues have Humphrey SM, Rogers WJ, Kilicoglu H, Demner- been extensively studied in the context of TREC Fushman D, Rindflesch TC. (2005). Word Sense (Voorhees and Harman, 2005), but might need Disambiguation by Selecting the Best Semantic re-evaluation in the context of clinical applications. We Type Based on Journal Descriptor Indexing: plan to speed up the collection process through using Preliminary Experiment. Journal of the American the system in EBP educational sessions at the clinical Society for Information Science and Technology, center. During the sessions we hope not only to obtain 57(1), pp. 96--113. more relevance judgments faster, but also to capture Ide NC, Loane RF, Demner-Fushman D. Essie: a reasoning behind the judgments. concept-based search engine for structured We are less concerned about the quality of obtained biomedical text. J Am Med Inform Assoc. 2007 judgments – those are provided by members of May-Jun;14(3):253-63. tightly-knit teams for other members with the purpose Jose JM, Joho H, van Rijsbergen CJ. (2008). Adaptive of improving provided care, therefore we expect Information Retrieval, Information Processing & high-quality judgments. Management, 44(6), pp. 1819—1821 Lindberg DA, Humphreys BL, McCray AT. (1993) The 8. Acknowledgements Unified Medical Language System. Methods of information in medicine, 32(4), pp. 281--291. This work was supported by the Intramural Research Maviglia SM, Yoon CS, Bates DW, Kuperman G. Program of the NIH, National Library of Medicine. (2006). KnowledgeLink: impact of context-sensitive 9. References information retrieval on clinicians’ information needs. Journal of the American Medical Informatics Aphinyanaphongs Y, Tsamardinos I, Statnikov A, Association, 13(1), pp. 67--73. Hardin D, Aliferis CF. (2005). Text Categorization McCray AT, Burgun A, Bodenreider O. Aggregating Models for High-Quality Article Retrieval in UMLS semantic types for reducing conceptual Internal Medicine. Journal of the American Medical complexity. Stud Health Technol Inform. 2001;84(Pt Informatics Association, 12(2), pp. 207--216. 1): 216–20. Aronson AR. (2001) Effective mapping of biomedical Richardson W, Wilson MC, Nishikawa J, Hayward text to the UMLS Metathesaurus: the MetaMap RSA. (1995). The well-built clinical question: a key program. Proc AMIA Symp, pp. 
17-21. to evidence-based decisions. ACP Journal Club, 123, Aronson AR, Mork JG, Neveol A, Shooshan SE, pp. A-12. Demner-Fushman D. (2008). Methodology for Roberts PM, Cohen AM, Hersh WR. (2009). Tasks, Creating UMLS Content Views Appropriate for topics and relevance judging for the TREC Biomedical Natural Language Processing. In Genomics Track: five years of experience evaluating Proceedings of the Annual Symposium of the biomedical text information retrieval systems. American Medical Informatics Association (AMIA). Information Retrieval, 12(1), pp. 81 -- 97. Washington, DC: AMIA, pp. 21--25. Rosenbloom ST, Geissbuhler AJ, Dupont WD, Giuse Bond CS. (2007). Nurses and computers. An DA, Talbert DA, Tierney WM, Plummer WD, Stead international perspective on nurses’ requirements. WW, Miller RA. (2005). Effect of CPOE user Medinfo, 12(1), pp. 228—232. interface design on user-initiated access to Cimino JJ. (2007). An integrated approach to educational and patient information during clinical computer-based decision support at the point of care. care. Journal of the American Medical Informatics Transactions of the American Clinical and Association, 12(4), PP. 458--473. Climatological Association, 118, pp. 273--288. Voorhees EM, Harman DK. (2005). TREC: Experiment Demner-Fushman D, Lin J. (2007). Answering Clinical and Evaluation in Information Retrieval. Cambridge, Questions with Knowledge-Based and Statistical MA: The MIT Press Techniques. Computational Linguistics, 33(1) pp. 63--103.

14 An Empirical Evaluation of Resources for the Identification of Diseases and Adverse Effects in Biomedical Literature

Harsha Gurulingappa∗†, Roman Klinger∗, Martin Hofmann-Apitius∗†, and Juliane Fluck∗

∗Fraunhofer Institute for Algorithms and Scientific Computing Schloss Birlinghoven, 53754 Sankt Augustin, Germany †Bonn-Aachen International Center for Information Technology Dahlmannstraße 2, 53113 Bonn, Germany [email protected], {roman.klinger, martin.hofmann-apitius, and juliane.fluck}@scai.fraunhofer.de

Abstract The mentions of human health perturbations such as the diseases and adverse effects denote a special entity class in the biomedical literature. They help in understanding the underlying risk factors and develop a preventive rationale. The recognition of these named entities in texts through dictionary-based approaches relies on the availability of appropriate terminological resources. Although few resources are publicly available, not all are suitable for the text mining needs. Therefore, this work provides an overview of the well known resources with respect to human diseases and adverse effects such as the MeSH, MedDRA, ICD-10, SNOMED CT, and UMLS. Individual dictionaries are generated from these resources and their performance in recognizing the named entities is evaluated over a manually annotated corpus. In addition, the steps for curating the dictionaries, rule-based acronym disambiguation and their impact on the dictionary performance is discussed. The results show that the MedDRA and UMLS achieve the best recall. Besides this, MedDRA provides an additional benefit of achieving a higher precision. The combination of search results of all the dictionaries achieve a considerably high recall. The corpus is available on http://www.scai.fraunhofer.de/disease-ae-corpus.html

1. Introduction condition that impairs the bodily functions and is associated with physiological discomfort or dysfunction. Similarly, an In the field of biomedical sciences, a huge amount of un- adverse effect is a health impairment that occurs as a re- structured textual data is generated every year in the form sult of intervention of a drug, treatment or therapy (Ahmad, of research articles, patient health records, clinical reports, 2003). The severity of adverse effects can range from mild medical narratives and patents (Karsten and Suominen, signs or symptoms such as nausea and abdominal discom- 2009; Cohen and Hersh, 2005). Enormous efforts have fort to irreversible damage such as perinatal death. There- been invested in parallel to extract potentially useful infor- fore, the mentions of both diseases and adverse effects in mation from these textual records (Wang et al., 2009; Chen free texts denote special entity classes for the medical ex- et al., 2008). Therefore, automatic processing of literature perts, clinical professionals as well as health care compa- data has gained popularity since over a decade, for exam- nies (Hauben and Bate, 2009; Forster et al., 2005). This ple named entity recognition or key concept identification not only helps in understanding the underlying hypotheti- (Smith et al., 2008). cal causes but also provide rationale means to prevent or di- Named entity recognition serves as a basis for biomedi- agnose such abnormal medical conditions. Specially in the cal text mining in order to have key entities tagged before clinical scenario, recognizing the adverse effects in medical they can be subjected to relationship mining or semantic literature can support the clinical decision making (Stricker text interpretation. It deals with the identification of bound- and Psaty, 2004). aries of terms in the text that represent biologically mean- Some research work has been done in the past for the ingful objects of interest such as genes, proteins, or dis- identification of diseases and adverse effects. Jimeno et eases. Quite a lot of work has been done for the recogni- al. (2008) proposed a statistical solution for the identifi- tion of gene and protein names. For example, the BioCre- cation of diseases in a corpus of annotated sentences. They AtIvE competitions address the challenges associated with reused the corpus that was provided by Ray and Craven the gene name recognition and normalization (Krallinger et (2001) but the corpus has a limitation of being restricted al., 2008). Nevertheless, some groups have proposed dif- to OMIM1 diseases only that mostly include genetic disor- ferent solutions for the identification of other interesting ders. Neveol et al. (2009) utilized the same corpus as well classes of biomedical entities such as drug names (Segura- as PubMed2 user queries for the detection of disease names. Bedmar et al., 2008; Hettne et al., 2009) or disease names They adapted a statistical model and a natural language pro- (Jimeno et al., 2008). However, in comparison to the gene cessing algorithm within their framework. Leaman et al. and protein name recognition, only a little work has been (2009) proposed a machine learning based technique for invested for the recognition of disease names and particu- the identification of diseases in a corpus containing over larly adverse effects in the free texts. 
This is partly due to a fact that the availability of annotated corpora is limited and 1Online Mendelian Inheritance in Man (OMIM): they are of high cost for generation. http://www.ncbi.nlm.nih.gov/omim/ A disease in the context of human health is an abnormal 2http://www.ncbi.nlm.nih.gov/pubmed/

15 2,500 sentences from PubMed. This corpus is made pub- thesaurus serves as a reference terminology and an on- licly available as the Arizona Disease Corpus (AZDC)3 but tology providing a broad coverage of cancer domain in- the annotations are restricted to the diseases only and do cluding cancer related diseases, findings, abnormalities, not contain information about adverse effects. Curino et al. gene products, drugs, and chemicals. Similarly, there (2005) proposed a machine learning based solution for min- are databases that include very specific organ or disease ing adverse effects of specific drugs from the web pages. class related information such as the autoimmune disease They generated an adverse effect dictionary from the re- database (Karopka et al., 2006) and the DSM-IV Codes10 sources provided by the FDA4. However, the corpus uti- which is specific to mental disorders. On the other hand, lized by Curino et al. (2005) is not openly available. Mc- sources such as the ICD-10, the UMLS and the MedDRA11 Cray et al. (2001) proposed a statistical solution for map- provide a wider coverage of diseases, signs, symptoms, and ping the terms in the corpus to the UMLS concepts. They abnormal findings irrespective of any kind of disease or any determined the likelihood of a given UMLS string being affected organ system. All these resources have their own found or not found in the corpus. A classical example of a advantages and areas of applicability. Therefore, the survey tool for mapping the text to biomedical concepts in UMLS5 made here includes only those resources that encompass in- meta-thesaurus is the MetaMap program (Aronson, 2001). formation about medical abnormalities that are associated Several terminological resources are available that pro- with the entire human physiology. vide information about diseases and adverse effects. Few From all the resources introduced here, individual dictio- well known examples include the MeSH6 thesaurus, the naries were generated and evaluated over a manually anno- UMLS7 meta-thesaurus, the ICD-108, and the NCI9 the- tated corpus. Although, the MeSH, ICD-10, MedDRA, and saurus. These resources serve as a good basis for the SNOMED CT are already included as source vocabularies dictionary-based named entity recognition in text but not within the UMLS, these resources were separately down- all of them essentially suit the text mining needs. Although loaded from their respective official websites. The main some of these resources have been utilized individually in reason is because when the terms from the source vocab- the past for the detection of disease names (Jimeno et al., ularies are imported into the UMLS, they undergo a series 2008; Chun et al., 2006), there is no common platform of term modification steps 12. This generates an impression where most of these resources have been collectively eval- that the terms present in the UMLS may not be identical to uated. the terms present in the source vocabularies. Therefore, in The aim of this work is to provide an overview of the dif- order to validate the hypothesis of suitability of the individ- ferent data sources and evaluate the general usability of the ual resources for text mining, they were treated as indepen- contained disease and adverse effect terminology for named dent terminologies. entity recognition. Although, a small set of corpus is avail- Medical Subject Headings (MeSH) is a controlled vocab- able that contain sentences annotated with disease names, ulary thesaurus from the NLM13. 
It is used by NLM for in- there is no freely available corpus containing the PubMed dexing articles from the PubMed database as well as books, abstracts that are annotated with diseases as well as adverse documents, and audiovisuals acquired by the library (Co- effects. Therefore, a newly annotated corpora is made pub- letti and Bleich, 2001). In MeSH, the terms are arranged licly available. in a hierarchical order that are associated with synonyms and term variants. A subset of MeSH that corresponds to 2. Terminological Resources the category Diseases (tree concepts with node identifiers starting with ‘C’) was extracted to generate a dictionary Dictionary-based named entity recognition approaches covering diseases and adverse effects. The MeSH dictio- rely on comprehensive terminologies containing frequently nary contains over 4,500 entries. used synonyms and spelling variants. Such resources in- Medical Dictionary for Regulatory Activities (Med- clude databases, ontologies, controlled vocabularies and DRA) is a standardized medical terminology that was thesauri. This section gives an overview of the available developed to share regulatory information internationally data sources for diseases and adverse effects. Examples of about medical products used by human (Merrill, 2008). synonyms and term variants associated with the MeSH dis- It provides a hierarchical structure of terms that include ease concepts are provided in Table 1. signs, symptoms, diseases, diagnosis, therapeutic indica- Different resources have been designed to meet the needs tions, medical procedures, and familial histories. The Med- of different user groups whereas some of them include cer- DRA dictionary contains over 20,000 entries associated tain disease specific information. For example, the NCI with synonyms and term variants. International Classification of Diseases (ICD-10) is 3http://diego.asu.edu/downloads/AZDC/ 4Food and Drug Administration (FDA): http://www.fda.gov/ 10Diagnostic and Statistical Manual of 5http://www.nlm.nih.gov/research/umls/ Mental Disorders (DSM) 4th Edition: 6Medical Subject Headings (MeSH): http://www.psych.org/mainmenu/research/dsmiv/dsmivtr.aspx http://www.nlm.nih.gov/mesh/ 11Medical Dictionary for Regulatory Activities (MedDRA): 7Unified Medical Language System (UMLS): http://www.meddramsso.com/ http://www.nlm.nih.gov/research/umls/ 12http://www.nlm.nih.gov/research/umls/knowledge sources/ 8International Classification of Diseases Edition-10 (ICD-10): metathesaurus/source faq.html#what involved http://apps.who.int/classifications/ apps/icd/icd10online/ 13National Library of Medicine (NLM): 9National Cancer Institute (NCI): http://nciterms.nci.nih.gov/ http://www.nlm.nih.gov/

16 14 100000 maintained by WHO and it is used to classify diseases MeSH MedDRA and heath problems recorded in many types of health and ICD-10 SNOMED CT UMLS vital reports including death certificates and health records. 10000 The ICD-10 provides terms that are hierarchically ordered according to the organ system that is being affected. Un- like other resources, the ICD provides a flat list of terms 1000 and does not include synonyms or term variants. The com- plete ICD-10 was used for generating the dictionary and it 100 contains over 70,000 entries altogether. Systematized Nomenclature of Medicine–Clinical 10 Terms (SNOMED CT) 15 is a comprehensive clinical ter- minology that is maintained and distributed by IHTSDO16 1 (Cornet, 2009). It covers most areas of clinical information Count of entities with a certain number synonyms (logarithmic scale) 20 40 60 80 100 120 140 such as diseases, findings, procedures, microorganisms, Number of synonyms pharmaceuticals etc. The SNOMED CT concepts are organized into hierarchies and the sub-hierarchy that corre- Figure 1: Plot of the synonym count distribution for all the sponds to Disorder was used to generate a dictionary. The analyzed dictionaries SNOMED CT dictionary contains over 90,000 concepts associated with synonyms and term variants. less than 20 synonyms. Few entries in the UMLS, MeSH, Unified Medical Language System (UMLS) is a very and MedDRA17 are associated with as much as more than large, multipurpose, and multilingual meta-thesaurus that 60 synonyms. Resources with high number of synonyms contains information about biomedical and health related are of great value for dictionary-based named entity recog- concepts (Browne et al., 2003). Overall, the UMLS has nition approaches. They help to overcome a high false neg- more than 2 million concepts that are associated with syn- ative rate but may pose a risk of high number of false posi- onyms and relationships between them. The concepts in tives requiring a dedicated curation. the UMLS are categorized into semantic groups. The se- Since UMLS is the largest resource, a survey was con- mantic group Disorders contains semantic subgroups such ducted to check the percentage of synonyms that over- as Acquired Abnormality, Disease or Syndrome, Mental or lap with synonyms in rest of the resources. The syn- Behavioral Dysfunction, Sign or Symptom, etc. Although, onym comparison between the different resources was per- the downloadable subset of the UMLS enclose large subsets formed using a simple case-insensitive string match (i. e. of concepts from sub-thesauri such as the ICD-9, ICD-10, only complete string matches were accepted). About 96 % SNOMED CT, and MeSH, the level of ambiguity it con- of the MeSH and 23 % of the MedDRA synonyms are tains has been well demonstrated (Aronson, 2000; Rind- present in UMLS. Only 4 % of the ICD-10 and 13 % of the flesch and Aronson, 1994). Therefore, we presumed to test SNOMED CT synonyms are covered by UMLS. Hence, the the UMLS separately in addition to its constituent sources. outcome of this survey showed that integrating the smaller All concepts in the Disorders semantic group of the UMLS resources with UMLS would account for an enhanced ter- were used to generate a dictionary. This dictionary contains minology coverage. over 120,000 entries altogether. Although, there is an enormous variation in size of the dic- 3. Dictionary Characteristics tionaries used, their adaptability for finding terms in the text is questionable. 
A manual survey was performed con- The dictionaries generated for the recognition of diseases cerning the quality of information contained in each of and adverse effects were analyzed with regard to the fol- these dictionaries. The UMLS and SNOMED CT contained lowing properties: over 20,000 terms each that had special characters such as • Total number of entries, ‘@’, ‘#&’, ‘[X]’, etc. enclosed within the terms. Exam- ples of such ambiguous terms found in the UMLS are 5- • Number of synonyms provided, and @FLUOROURACIL TOXICITY and Congestive heart fail- • Availability of mappings to other data sources ure #&124. A large subset of terms were too long and descriptive composed of more than 10 words. Such syn- Table 2 provides a quantitative estimate of the entities onyms are seldom found in the text. An example of such present in the raw dictionaries. The UMLS has the largest descriptive term found in ICD-10 is Nondisplaced fracture collection of disease and adverse effect data followed by the of lateral condyle of right femur, initial encounter for closed SNOMED CT. Figure 1 shows the distribution of synonyms fracture. ICD-10 has nearly 35,000 long descriptive terms for all the analyzed dictionaries. Since the ICD-10 does not provide synonyms and term variants, it is visible only as a 17MedDRA, the Medical Dictionary for Regulatory Activities point in Figure 1. A large part of all the dictionaries contain terminology is the international medical terminology developed under the auspices of the International Conference on Harmoniza- 14World Health Organization (WHO): http://www.who.int/en/ tion of Technical Requirements for Registration of Pharmaceuti- 15http://www.nlm.nih.gov/research/umls/Snomed cals for Human Use (ICH). MedDRA is a registered trademark of 16International Health Terminology Standards Development the International Federation of Pharmaceutical Manufacturers and Organisation (IHTSDO): http://www.ihtsdo.org/ Associations (IFPMA)

17 ID Concept Synonyms D000292 Pelvic Inflammatory Disease Adnexitis, Inflammatory Disease; Pelvic, Inflammatory Pelvic Disease; Pelvic Disease, Inflammatory D002534 Brain Hypoxia Anoxia, Brain; Anoxic Brain Damage; Brain Anoxia; Brain Hypoxia; Cere- bral Hypoxia; Encephalopathy, Hypoxic; Hypoxic Brain Damage; Hypoxic Encephalopathy

Table 1: Examples of synonyms and term variants associated with the concepts in the MeSH database.

MeSH MedDRA ICD-10 SNOMED CT UMLS No. of entries 4,350 20,515 74,830 92,376 112,341 No. of synonyms (incl. concepts) 42,631 69,121 74,830 170,561 295,773 Percentage of synonyms covered by UMLS 96 % 23 % 4 % 13 % 100 % Mappings no yes no yes yes

Table 2: A quantitative analysis of the dictionaries generated for the disease and side effect named entity recognition. Total number of entries, number of synonyms, percentage of synonyms covered by UMLS, and the availability of inter data source mappings for individual dictionaries are reported. For the UMLS coverage, all synonyms of all the entries were compared.

which constitutes nearly 50 % of the entire dictionary. Ac- (more seriously) agranulocytosis are associated with pro- cording to the experience of curators, MeSH and MedDRA cainamide, and a frequent adverse effect requiring cessa- were regarded as the specialized resources with consider- tion of therapy is the development of systemic lupus ery- ably low level of ambiguity. Nevertheless, few vague en- thematosus. (PMID: 2285495), the term systemic lupus tries such as Acting out, Alcohol Consumption, and Child- erythematosus occurs as an adverse effect associated with hood were encountered in these dictionaries. procainamide treatment. In contrary, the sentence IL-17 expression was found to be associated with many inflam- 4. Corpus Characteristics and Annotation matory diseases in humans, such as rheumatoid arthritis, asthma, systemic lupus erythematosus and allograft rejec- For evaluating the performance of named entity recognition tion and many in vitro studies have indicated a proinflam- systems, an annotated corpus is necessary. Since, there is matory function for IL-17. (PMID: 20338742) contains no freely available corpus that contains annotations of dis- systemic lupus erythematosus as a disease associated with ease and adverse effect entities, a corpus containing 400 certain gene function. In such cases, the annotators were randomly selected MEDLINE abstracts was generated us- strictly insisted to use the contextual information for an- ing ‘Disease OR Adverse effect’ as a PubMed query. This notating the entities. Entities that overlap with semantic evaluation corpus was annotated by two individuals who classes disease and adverse effect are difficult to be recog- hold a Master’s degree in life sciences. All the abstracts nized unless a context-based disambiguation is performed. were annotated with two entity classes, i. e., disease and Altogether, there were 178 annotated entities had an over- adverse effect. In order to obtain a good estimate of the lap with the classes disease and adverse effect. level of agreement between the annotators, they were in- sisted to carry out the task independently. First, one anno- 5. Results of Dictionary Performance tator participated in the development of a guideline for an- notation. The corpus was iteratively annotated by this per- For the identification of named entities in text, the ProMiner son along with the standardization of the annotation rules. (Hanisch et al., 2005) system was used along with different Later, the second person annotated the whole corpus based dictionaries. The text searching with ProMiner was per- on the annotation guideline generated by the first annota- formed using the raw or unprocessed dictionaries as well as tor. This procedure formed an evaluation corpus of 400 with the processed dictionaries. The search was performed abstracts containing 1428 disease and 813 adverse effect using case-insensitive, word order-sensitive and the longest annotations. Recognizing the boundaries without consider- string match as constraints. ing the different classes in the evaluation corpus, the inter- The performance of the ProMiner runs with different dic- annotator agreement F1 score and kappa (κ) between the tionaries was evaluated using the Precision and Recall. The two annotators are 84 % and 89 % respectively which indi- evaluations were performed for the complete match as well cates a substantial agreement. 
as partial match between the annotated entities and the dic- The annotation of disease and adverse effect entities were tionary terms. A partial match is a situation where either the performed very sensitively taking the context into account. left boundary or the right boundary of the annotated entity Several instances occurred where the disease names and ad- and the ProMiner search result are matched. verse effect names were the same. For example, in the sen- The results with raw dictionaries and such a simple search tence Hypersensitivity reactions including fever, rash and strategy gives a rough estimate of the coverage of different

18 MeSH MedDRA ICD-10 SNOMED CT UMLS No. of entries 4,335 18,273 37,263 84,292 100,871 No. of synonyms (incl. concepts) 42,531 57,017 37,263 146,545 243,602

Table 3: A quantitative analysis of the curated dictionaries applied for the disease and side effect named entity recognition. Total number of entries and number of synonyms present within the individual dictionaries are reported.

Raw Curated Disambiguation Dictionary Match type All DIS AE All DIS AE All DIS AE MeSH Complete 0.54/0.43 0.46 0.40 0.61/0.43 0.46 0.40 0.61/0.43 0.46 0.40 Partial 0.73/0.58 0.64 0.51 0.80/0.57 0.62 0.51 0.80/0.57 0.62 0.51

MedDRA Complete 0.48/0.62 0.64 0.59 0.57/0.61 0.63 0.59 0.60/0.61 0.62 0.59 Partial 0.55/0.72 0.76 0.68 0.67/0.72 0.75 0.68 0.69/0.71 0.74 0.68

ICD-10 Complete 0.46/0.10 0.10 0.10 0.57/0.15 0.10 0.19 0.57/0.15 0.10 0.19 Partial 0.59/0.15 0.15 0.14 0.66/0.19 0.14 0.23 0.57/0.19 0.14 0.23

SNOMED CT Complete 0.38/0.18 0.18 0.18 0.40/0.20 0.22 0.18 0.43/0.18 0.20 0.15 Partial 0.66/0.28 0.33 0.23 0.69/0.34 0.39 0.28 0.71/0.34 0.39 0.28

UMLS Complete 0.18/0.58 0.60 0.55 0.33/0.57 0.60 0.54 0.36/0.57 0.60 0.54 Partial 0.25/0.73 0.74 0.71 0.43/0.72 0.73 0.71 0.46/0.72 0.73 0.71

Combined Complete 0.12/0.75 0.80 0.70 0.18/0.76 0.81 0.71 0.19/0.76 0.80 0.71 Partial 0.14/0.92 0.92 0.91 0.21/0.91 0.92 0.89 0.22/0.91 0.92 0.89

Table 4: Comparison of the performance of different dictionaries tested over the evaluation corpus. The results are reported for the complete matches and partial matches of annotated classes disease (DIS), adverse effect (AE) and a combination of both the classes (All). For a combination of both the classes, i. e. All, the precision and recall values are reported. For the classes DIS and AE, only the recall values are reported. ‘Combined’ indicates the performance achieved by combining the results of all the dictionaries. dictionaries and the effort that has to be invested to curate by the annotation guideline. Perhaps, our principle anno- them. Table 4 shows the search results obtained with every tators would annotate such a textual description with Spas- individual dictionary when complete matches and partial tic paraplegia and T-cell lymphotropic virus - 1 infection as matches were considered. The highest recall for complete two distinct entities rather than annotating the entire phrase matches were achieved by the MedDRA dictionary (62 %) as a single entity. and the UMLS dictionary (58 %). The recall of ICD-10 Comparison of the results of complete matches and par- was the lowest of all dictionaries covering only 10 % of the tial matches in Table 4 shows the granularity of informa- entities annotated in the corpus. Unlike the other dictionar- tion covered by different data sources and the textual ex- ies, ICD-10 lacks information about the synonyms and term plications. The UMLS and MedDRA achieved an over- variants which hinders it from covering different types of all recall of 73 % and 72 % respectively for the partial variants mentioned in the text. The combination of results matches whereas the combined results of all the dictio- of all the dictionaries lead to a promising recall of 75 %. naries achieved a highest recall of 92 %. This provides Another important observation is the low recall (18 %) at- an indication that the terms contained in these dictionar- tained by the SNOMED CT dictionary. Although, this dic- ies cover the head nouns associated with the disease and tionary contains over 90,000 entries with 170,561 different adverse effect entities but does not include different enu- terms, its usability for finding entities in the text seems ex- merations used in the literature. For example, in the case of tremely limited. One reason is because of the descriptive progressive neurodegenerative disorder, only neurodegen- nature of most of the terms present in the SNOMED CT erative disorder was identified whereas the adjective pro- vocabulary such as Spastic paraplegia associated with T- gressive was not covered. Based on the experience of the cell lymphotropic virus - 1 infection. Although such long curators and the results from Table 4, nearly 10 % of the descriptive terms provide substantial information about the mismatches are caused by the medical adjectives such as medical condition, they are not quite often used in the lit- chronic, acute, and idiopathic that are frequently used in erature. Additional reasons are the perception of named texts but not provided by the resources. Another source of entities in annotator’s mind as well as the style adopted mismatch is the anatomical information often attached to

19 the disease entity in texts. For example, in the case of vagi- 5.1. Dictionary Curation nal squamous cell carcinoma, only the squamous cell car- The dictionaries were processed and filtered based on a sub- cinoma was recognized whereas the remaining anatomical set of pre-defined rules in order to reduce the level of ambi- substring remained unidentified. guity associated with them. Most of the rules were adapted The highest precision rates for the complete matches were from Hanisch et al. (2005) and Aronson (1999). The rules achieved by the MeSH dictionary (0.54) and the Med- that were applied for processing the dictionaries are listed DRA dictionary (0.48) hence validating the curator’s opin- below. All the rules were used in common to all the ana- ion about the quality of these resources. The lowest preci- lyzed dictionaries. sion of 18 % was achieved by the UMLS dictionary. The Remove very short tokens: Single character alphanumeri- precision after combining the results of different dictio- cals that appear as individual synonyms were removed. For naries was considerably low due to the overlapping false example, ‘5’ was mentioned as a synonym of the concept positives generated by different dictionaries. The low pre- Death Related to Adverse Event in the UMLS. cision is due to the presence of noisy terms such as dis- Remove terms containing special characters: Remove ease or response within the dictionaries. The amount of all the terms that contain unusual special characters such such noisy terms considerably varies among the different as ‘@’, ‘:’ and ‘&#’. An examples of such term resources with UMLS having the highest. Therefore, the in SNOMED CT is Heart anomalies: [bulbus/septum] curation of dictionaries is necessary in order to achieve bet- [patent foramen ovale] . ter performance. Experiences from the previously reported Remove underspecifications: Substrings such as NOS, dictionary-based named entity approaches let us assume NES and not elsewhere classified were removed away from that the precision could be greatly improved by the dictio- the terms. Such strings were often encountered at endings nary curation. of the dictionary terms. An example of such a term from Since the MedDRA dictionary achieved the highest recall, MedDRA is Congenital limb malformation, NOS the true positive matches obtained with this dictionary were Remove very long terms: Very long and descriptive terms mapped to the MedDRA level-2 superclasses in order to that contains more than 10 words were removed. An ex- analyze the distribution of disease and adverse effect termi- ample of such a term found in SNOMED CT is Pancreas nology over the complete MedDRA hierarchy. The analysis multiple or unspecified site injury without mention of open of distribution of annotated entities over the MedDRA sub- wound into cavity. Although such long terms do not appear hierarchies is shown in Table 5 and Table 6. From the Med- in the text, filtering them from the dictionary gradually re- DRA tree distribution of disease or adverse effect matches, duces the run time of the process. it is difficult to understand whether the entity is of kind dis- Remove unusual brackets: Unusual substrings that often ease or an adverse event. Here an additional context will appear within the brackets were removed from the terms. be necessary to classify the matches into their respective Examples of such terms found in SNOMED CT include classes. [X]Papulosquamous disorders and [D]Trismus. 
Remove noisy terms: The ProMiner with different dictio- MedDRA Superclass No. of annotated entities naries was run over an independent corpus of 100,000 ab- Infections and infestations 110 stracts that were randomly selected from MEDLINE. The Psychiatric disorders 83 500 most frequently occurring terms matched with the in- Neoplasms benign, malig- 83 dividual dictionaries were manually investigated to remove nant and unspecified the most frequently occurring false positives. This process Nervous system disorders 47 will improve the precision of entity recognition during the Blood and lymphatic system 38 subsequent runs. disorders In addition to dictionary curation, the configuration of the ProMiner system was readjusted to match the possessive Table 5: Analysis of the top five most frequently occurring terms (e. g. Alzheimer’s disease) that contain ‘’s’ substring disease entities distributed over different MedDRA level-2 at the word endings. After the end of the dictionary pro- superclasses. cessing and filtering, the number of entries and synonyms that remained in the individual dictionaries can be found in Table 3. The MeSH dictionary sustained minimum changes with only 15 entries being removed whereas ICD-10 un- MedDRA Superclass No. of annotated entities derwent a large noticeable change. The size of the ICD-10 Cardiac disorders 96 dictionary was reduced to nearly half of the previously used Infections and Infestations 93 raw dictionary. The search results obtained with every in- Injury, poisoning and proce- 29 dividual curated dictionary can be found in Table 4. dural complications As the result of dictionary curation, the performance of all Vascular disorders 23 the dictionaries improved remarkably well. For the com- Gastrointestinal disorders 19 plete matches, the precision of UMLS dictionary raised by 15 % with a drop in recall by just 1 %. Other dictionaries Table 6: Analysis of the top five most frequently occurring that benefited well from the curation process are ICD-10 adverse effect entities distributed over different MedDRA and MedDRA with raise in their precision by 11 % and 9 % level-2 superclasses. respectively. SNOMED CT showed only 2 % increase in

20 the precision. The recall of all the dictionaries changed DRA being substantially smaller than UMLS reported com- marginally except for ICD-10. Processing the synonyms paratively low false positive rate. Finally, a combination of of ICD-10 increased its recall on adverse effect entities by all the dictionaries reported the highest recall indicating the 9 % with an overall raise in the recall by 5 % for both the diversity of terms provided by different resources. annotated classes. 6. Conclusions 5.2. Acronym Disambiguation A survey of the performance of different resources for the In spite of processing the dictionaries by removing the identification of diseases and adverse effects in texts was noisy terms as well as lexical modification of the synonyms, performed. An outcome of the survey upheld the MedDRA the acronyms present in the dictionaries turned out to be an- as a compatible resource for the text mining needs hav- ALL other source of frequent false positives. For example, ing its recall competitive to the UMLS meta-thesaurus with Acute Lymphoid Leukemia which is an acronym for gen- considerably fair precision upon processing. The UMLS erated a considerable noise. Therefore, acronyms present being the largest resource does not include all the names in all the dictionaries that have two to four characters were that are covered by the smaller resources. Hence, the com- collected in a separate acronym list. Whenever there is a bination of the search results from all the terminologies lead match between the term in the acronym list and the text to- to a high increase in recall. This indicates a need for intel- kens, a rule was defined in order to accept or neglect the ligent ways to integrate and merge the information spread match. This disambiguation facility is available within the across different resources. The amount of work that needs ProMiner system. The acronym disambiguation rule ac- to be invested to curate very large resources such as the cepts the match based on two criteria and they are: SNOMED CT and UMLS is also shown. • The match should be case sensitive. In addition to the performance comparison, the effect of dictionary curation and a limited manual investigation of • The acronym as well as any one of its synonym in the noisy terms shows to be effective. A rule-based process- the respective dictionary should co-occur anywhere ing coupled with the dictionary curation can substantially within in the same abstract. improve the performance of the named entity recognition. In future, we will investigate more enhanced dictionary cu- For example, the term ALL is associated with 17 synonyms ration methods for improving the performance of dictionar- in the MedDRA dictionary. Any case sensitive match be- ies. Nevertheless, the performance of rule-based and ma- tween the ALL and tokens in the text would be accepted if chine learning-based approaches for identifying the disease any one synonym of the ALL occurs within the same ab- and adverse effect named entities needs to be tested. stract. The search results obtained with the individual cu- rated dictionaries in addition to the acronym disambigua- 7. References tion can be found in Table 4. Considering the complete matches, the acronym disambiguation raised the precision S. R. Ahmad. (2003). Adverse drug event monitoring at of MedDRA, SNOMED CT and UMLS dictionaries by 3 % the Food and Drug Administration. Journal of General each. The performance of MeSH and ICD-10 remain un- Internal Medicine, 18(1), pp. 57–60. 
affected indicating the presence of less acronyms within A. R. Aronson. (1999). Filtering the UMLS Metathe- them. There was a marginal decline (less than 2 %) in the saurus for MetaMap. Technical report, National recall of the dictionaries after applying the disambiguation Library of Medicine, MD, USA. Available at rule. http://skr.nlm.nih.gov/papers/references/filtering99.pdf. In summary, the experiments demonstrated that the perfor- A. R. Aronson. (2000). Ambiguity in the UMLS mance of a simple search strategy using individual dictio- Metathesaurus. Technical report, National Li- naries for the identification of diseases or adverse effects is brary of Medicine, MD, USA. Available at low. However, the precision of the dictionary look-up can http://skr.nlm.nih.gov/papers/references/ambiguity00.pdf. be improved with the help of curation as well as rule-based A. R. Aronson. (2001). Effective mapping of biomedical filtering (e. g. the one adopted here for disambiguating the text to the UMLS Metathesaurus: the MetaMap program. acronyms). When the performance of different dictionar- Proceedings of the AMIA Symposium, pp. 17–21. ies was compared, the MeSH and the MedDRA showed the A. C. Browne, G. Divita, A. R. Aronson, and A. T. McCray. highest quality with comparably low false positive rate and (2003). UMLS language and vocabulary tools. Proceed- low ambiguity. The UMLS and SNOMED CT having the ings of the AMIA Symposium, p. 798. size five times as greater than MedDRA or MeSH reported E. S. Chen, G. Hripcsak, H. Xu, M. Markatou, and C. Fried- low precision although there was an improvement after the man. (2008). Automated acquisition of disease drug subsequent curation. Depending on the user-specific needs, knowledge from biomedical and clinical documents: an the UMLS and MedDRA cover large parts of the elemen- initial study. Journal of the American Medical Informat- tary disease names but does not include sufficient medical ics Association, 15(1), pp. 87–98. adjectives and anatomical specifications within the terms. H. Chun, Y. Tsuruoka, J. Kim, R. Shiba, N. Nagata, Although, a sufficient effort has been invested to curate the T. Hishiki, and J. Tsujii. (2006). Extraction of gene- SNOMED CT and UMLS, the amount of noise they contain disease relations from Medline using domain dictionar- overweighs their performance. The MedDRA and UMLS ies and machine learning. Pacific Symposium on Bio- dictionaries demonstrated a competitive recall but the Med- computing, pp. 4–15.

21 A. M. Cohen and W. H. Hersh. (2005). A survey of current G. H. Merrill. (2008). The MedDRA paradox. AMIA An- work in biomedical text mining. Briefings in Bioinfor- nual Symposium Proceedings, pp. 470–474. matics, 6(1), pp. 57–71. A. Neveol, W. Kim, J. W. Wilbur, and Z. Lu. (2009). Ex- M. H. Coletti and H. L. Bleich. (2001). Medical subject ploring two biomedical text genres for disease recogni- headings used to search the biomedical literature. Jour- tion. In BioNLP ’09: Proceedings of the Workshop on nal of the American Medical Informatics Association, BioNLP, pp. 144–152. 8(4), pp. 317–323. S. Ray and M. Craven. (2001). Representing sentence R. Cornet. (2009). Definitions and qualifiers in SNOMED structure in hidden markov models for information ex- CT. Methods of Information in Medicine, 48(2), traction. In IJCAI’01: Proceedings of the 17th inter- pp. 178–183. national joint conference on Artificial intelligence, pp. C. A Curino, Y. Jia, B. Lambert, P. M. West, and C. Yu. 1273–1279, San Francisco, CA, USA. Morgan Kauf- (2005). Mining officially unrecognized side effects of mann Publishers Inc. drugs by combining web search and machine learning. T. C. Rindflesch and A. R. Aronson. (1994). Ambigu- In Proceedings of the 14th ACM international confer- ity resolution while mapping free text to the UMLS ence on Information and knowledge management, pp. Metathesaurus. Proceedings of the Annual Symposium 365–372. on Computer Applications in Medical Care, pp. 240– A. J. Forster, J. Andrade, and C. van Walraven. (2005). 244. Validation of a discharge summary term search method I. Segura-Bedmar, P. Martinez, and M. Segura-Bedmar. to detect adverse events. Journal of the American Medi- (2008). Drug name recognition and classification in cal Informatics Association, 12(2), pp. 200–206. biomedical texts. A case study outlining approaches un- D. Hanisch, K. Fundel, H. Mevissen, R. Zimmer, and derpinning automated systems. Drug Discovery Today, J. Fluck. (2005). Prominer: rule-based protein and 13(17-18), pp. 816–823. gene entity recognition. BMC Bioinformatics, 6 Suppl L. Smith, L. K. Tanabe, R. J. Ando, C. J. Kuo, I. F. 1, pp. S14. Chung, C.N. Hsu, Y. S. Lin, R. Klinger, C. M. Friedrich, K. Ganchev, M. Torii, H. Liu, B. Haddow, C. A. Stru- M. Hauben and A. Bate. (2009). Decision support meth- ble, R. J. Povinelli, A. Vlachos, W. A. Baumgartner, ods for the detection of adverse events in post-marketing L. Hunter, B. Carpenter, R. T. Tsai, H. J. Dai, F. Liu, data. Drug Discovery Today, 14(7-8), pp. 343–357. Y. Chen, C. Sun, S. Katrenko, P. Adriaans, C. Blaschke, K. M. Hettne, R. H. Stierum, M. J. Schuemie, P. J. Hendrik- R. Torres, M. Neves, P. Nakov, A. Divoli, M. Mana- sen, B. J. Schijvenaars, E. M. van Mulligen, J. Kleinjans, Lopez, J. Mata, and W. J. Wilbur. (2008). Overview of and J. A. Kors. (2009). A dictionary to identify small BioCreative II gene mention recognition. Genome Biol- molecules and drugs in free text. Bioinformatics, 25(22), ogy, 9 Suppl 2, pp. S2. pp. 2983–2991. B. H. Stricker and B. M. Psaty. (2004). Detection, verifica- A. Jimeno, E. Jimenez-Ruiz, V. Lee, S. Gaudan, tion, and quantification of adverse drug reactions. British R. Berlanga, and D. Rebholz-Schuhmann. (2008). As- Medical Journals, 329(7456), pp. 44–47. sessment of disease named entity recognition on a corpus X. Wang, G. Hripcsak, M. Markatou, and C. Friedman. of annotated sentences. BMC Bioinformatics, 9 Suppl 3, (2009). Active computerized pharmacovigilance using pp. S3. 
natural language processing, statistics, and electronic T. Karopka, J. Fluck, H. Mevissen, and A. Glass. (2006). health records: a feasibility study. Journal of the Amer- The Autoimmune Disease Database: a dynamically com- ican Medical Informatics Association, 16(3), pp. 328– piled literature-derived database. BMC Bioinformatics, 337. 7, pp. 325. H. Karsten and H. Suominen. (2009). Mining of clini- cal and biomedical text and data: editorial of the spe- cial issue. International Journal of Medical Informatics, 78(12), pp. 786–787. M. Krallinger, A. Morgan, L. Smith, F. Leitner, L. Tanabe, J. Wilbur, L. Hirschman, and A. Valencia. (2008). Eval- uation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biology, 9 Suppl 2, pp. S1. R. Leaman, C. Miller, and G. Gonzalez. (2009). Enabling Recognition of Diseases in Biomedical Text with Ma- chine Learning: Corpus and Benchmark. In Handbook of the 3rd International Symposium on Languages in Bi- ology and Medicine. A. T. McCray, O. Bodenreider, J. D. Malley, and A. C. Browne. (2001). Evaluating umls strings for natural lan- guage processing. AMIA Annual Symposium Proceed- ings, pp. 448–452.

22 A Task-Oriented Extension of the Chinese MeSH Concepts Hierarchy

Xinkai Wang and Sophia Ananiadou National Centre for Text Mining School of Computer Science University of Manchester United Kindom E-mail: [email protected], [email protected]

Abstract The Medical Subject Headings (MeSH) thesaurus is used not only as a vocabulary for indexing and cataloguing biomedical articles, but also constitutes an important linguistic resource for text mining in the biomedical domain. As part of our research, the Chinese translation of English MeSH, Chinese Medical Subject Headings (CMeSH), will be used to provide essential terms and concepts to a Chinese-English c ross-lingual i nformation re trieval s ystem. T he original MeSH uses a small, accurate and c oncise set of terms designed for indexing and cataloguing. Such a set of terms does not, however, lend itself well to information retrieval applications, in which the ability to expand queries using term synonyms is important to maximise relevant search results. In this paper, we propose a new approach to extending the MeSH concept hierarchy (the MeSH Tree) based on an online version of the CMeSH term list. We use Google to collect synonyms for Chinese terms and calculate weights for each Chinese term and its synonyms. Our extension has been evaluated on a Chinese-English biomedical information re trieval a pplication. The r esults i ndicate t hat the extension of CM eSH improves the performance of the information retrieval application. Furthermore, the extended resource should also be helpful in other related research.

1. Introduction

The Medical Subject Headings (MeSH) thesaurus, which was developed and released by the National Library of Medicine (http://www.nlm.nih.gov/mesh/), is widely accepted as the standard vocabulary used for indexing, cataloguing, and searching for biomedical and health-related information and documents. For example, Lowe and Barnett (1994) report using MeSH to index medical literature. Cooper and Miller (1998) compare lexical and statistical methods used to extract a list of suggested MeSH terms from the narrative part of electronic patient medical records. More recently, many researchers (Guo et al., 2004; Abdou and Savoy, 2007; Lu et al., 2008) have employed MeSH terms to evaluate or improve biomedical information retrieval applications. In addition, MeSH terms are treated as the standard vocabulary to which terms from other resources are mapped (Elkin et al., 1988; Shultz, 2006). The MeSH vocabulary has also been employed in the construction of Chinese medical ontologies. Zhou et al. (2007) attempt to discover novel gene networks and functional knowledge of genes using a significant bibliographic database of traditional Chinese medicine; in their research, MeSH disease headings are applied to generate the index data for gene and disease MEDLINE literature.

As part of our own research, we have made use of a MeSH-related resource, i.e., Chinese Medical Subject Headings (CMeSH), to evaluate and improve the performance and effectiveness of a Chinese-English biomedical cross-lingual information retrieval application. CMeSH, which has been translated and is maintained by The Institute of Medical Information of the Chinese Academy of Medical Sciences, retains the terms and concepts of the English MeSH and their relations. Only a small number of studies have so far attempted to use CMeSH to improve the performance of natural language processing (NLP) applications, such as information retrieval or information extraction systems. Qin and Feng (1999) apply CMeSH terms to improve the indexing quality of Chinese abstracts from 1977 concerning family planning and gynecology, whilst Li et al. (2001) develop an information retrieval system with the help of CMeSH terms. The main reason why MeSH terms have not been more widely adopted in Chinese biomedical text processing lies in the philosophy of MeSH design. As MeSH terms are intended to index and catalogue the biomedical literature, they must be represented succinctly, concisely, and accurately. The Chinese translation of MeSH, CMeSH, inherits these features. This means that the original CMeSH terms have no synonyms: each English term has one and only one Chinese translation.

In order to expand the coverage of terms in CMeSH, and also to make it a more useful resource for our research, we have extended the original CMeSH with synonyms and term weights, and have integrated the extended set of terms into the MeSH concept hierarchy, i.e. the MeSH Tree. An online version of CMeSH (http://www2.chkd.cnki.net/kns50/Dict/dict_list.aspx?firstLetter=A) is the starting point of our work. We have designed an algorithm which exploits the Google search engine to automatically collect synonyms of each original CMeSH term and to calculate the frequency of each extracted term. Based on these frequencies, we have defined a formula to compute a weight for each Chinese term. The corresponding English term's weight is not computed, for the reasons discussed in Section 4. Finally, a Chinese-English cross-lingual information retrieval (CLIR) system, based on the Lemur toolkit, has been employed to evaluate the extended CMeSH Tree.

2. Background

2.1 MeSH

MeSH consists of a controlled vocabulary, coupled with a hierarchical tree structure. The controlled vocabulary contains several types of concepts, namely Publication Types, Geographics, Qualifiers, Descriptors, and Supplementary Concept Records.
• Publication Types or Publication Characteristics are used to indicate the genre of the indexed item, rather than its contents, e.g., 'Historical Article'.
• Geographics include continents, regions, countries, and other geographic subdivisions; they are not used to characterize subject content but rather physical location.
• Descriptors are the main concepts or headings. They are used to index the catalogue, and to search biomedical documents. Examples include 'Dementia' and 'Carcinoma in Situ'.
• Qualifiers, also known as Subheadings, are used for indexing and cataloguing in conjunction with Descriptors. There are 95 Qualifiers (MeSH 2009), which provide a convenient means of grouping together those documents which are concerned with a particular aspect of a subject. For example, 'Liver/drug effects' indicates that the article or book is not about the 'liver' in general, but about the effect of drugs on the liver.
• Supplementary Concept Records (SCRs) are used to index chemicals, drugs, and other concepts. Unlike Descriptors, SCRs have no tree numbers (see below).
The MeSH Tree is a hierarchy of MeSH descriptors, in which each descriptor is allocated a tree number representing the position of the node in the tree. The following example illustrates the structure and organization of the MeSH Tree.

……
Dementia;C10.228.140.380
AIDS Dementia Complex;C10.228.140.380.070
Alzheimer Disease;C10.228.140.380.100
Aphasia, Primary Progressive;C10.228.140.380.132
Creutzfeldt-Jakob Syndrome;C10.228.140.380.165
Dementia, Vascular;C10.228.140.380.230
CADASIL;C10.228.140.380.230.124
……

On each line, the text before the semi-colon constitutes a MeSH term. In the remainder of the paper, we refer to these as 'English terms'. After the semi-colon, the string starting with a Latin letter and followed by digits and dots represents a tree number, which encodes the term's position within the tree. The version of the MeSH Tree used in this study is the MeSH Tree 2008, which has 24,763 unique terms and 48,442 tree nodes.

2.2 CMeSH

CMeSH is published by The Institute of Medical Information of the Chinese Academy of Medical Sciences, and consists of two different versions, i.e., a paper version and an electronic version. The official CMeSH contains three parts: a Chinese translation of MeSH, traditional Chinese medical subject headings, and the Special Classification for Medicine of the China Library Classification. The usual usage of CMeSH is to index and catalogue biomedical literature in a library, or to provide standard keywords to describe journal articles and conference papers.
An online version of the CMeSH term list is available at http://www2.chkd.cnki.net/kns50/Dict/dict_list.aspx?firstLetter=A. The example below illustrates the Chinese counterpart of the English MeSH example presented above.

Dementia 痴呆
AIDS Dementia Complex 艾滋病痴呆复合征
Alzheimer Disease 阿尔茨海默病
Aphasia, Primary Progressive 失语, 原发进行性
Creutzfeldt-Jakob Syndrome 克-亚综合征
Dementia, Vascular 痴呆, 血管性
CADASIL 大脑常染色体显性动脉病合并皮层下梗塞及脑白质病

In contrast to research achievements using the original MeSH, the usage of CMeSH is currently largely limited to acting as a gold standard for indexing and cataloguing biomedical documents or for assigning indexing terms in IR systems. There is very little work that reports on evaluating cross-lingual information retrieval with CMeSH or on improving information extraction via CMeSH terms. Analysing the social or economic factors which limit the usage of CMeSH is the duty of economists; our focus is on the deficiencies of the original CMeSH, based on our task-oriented requirements. From the above example, we can conclude that:
• There are no term weights for CMeSH terms.
• Each English term has one and only one Chinese translation.
Term weights are essential to text mining and NLP algorithms based on probabilistic and statistical models. Without term weights, CMeSH can thus function only as a traditional word list. In the cross-lingual information retrieval task, our experiments have shown the high degree to which term weights contribute towards the improvement of retrieval performance (see Section 5.2).
Another issue of the original CMeSH is that many Chinese translations are missing. Like other languages, the Chinese language can express a particular concept in multiple ways. For example, 'Alzheimer Disease' is translated as '阿尔茨海默病' in the original CMeSH. However, it can also be written as 'Alzheimer 病', '阿滋海默症', '老年性痴呆', or 'Alzheimer 氏病'. The original CMeSH thus lacks the ability to provide synonyms for a particular term. Our results have shown that the availability of such synonyms can also increase task performance.

The research described in this paper attempts to overcome the above-mentioned issues of the original CMeSH. Firstly, Google is used to collect web pages which may contain candidate translations. Following this, linguistic rules and a term extraction tool are applied to identify candidate terms from these web pages, and the frequency of each term is calculated. Finally, each term's weight is computed according to the frequency of the term and that of its English equivalent.

3. Related Work on Ontology Evaluation

The extended CMeSH Tree constitutes an ontology. Evaluating the effectiveness of this ontology is a critical step of our research. In general, ontology evaluation cannot be compared to evaluation tasks in information retrieval or classic natural language processing tasks such as part-of-speech (POS) tagging, because the notions of precision and recall cannot easily be defined. Methodologies used to evaluate ontologies generally fall under one of the following approaches:
• Testing the ontology in an application and evaluating the result (Porzel and Malaka, 2004); also called application-based evaluation;
• Comparing the ontology to a 'gold standard' (Maedche and Staab, 2002);
• Human evaluation of the ontology according to a set of predefined criteria, standards, requirements, etc. (Lozano-Tello and Gómez-Pérez, 2004);
• Comparing the ontology with a set of data (e.g., a collection of documents) from the domain to be covered by the ontology (Brewster et al., 2004); also called data-driven evaluation.
Evaluation of ontologies is in general carried out at three basic levels: vocabulary, taxonomy, and (non-taxonomic) semantic relations. We do not intend to evaluate the isa hierarchy (taxonomy) or the non-taxonomic relations (semantic relations) of the extended CMeSH Tree, because our work does not add new tree nodes to the MeSH concept hierarchy. Moreover, given that the MeSH Tree, as part of the Unified Medical Language System (UMLS), has been assessed by human experts against a set of criteria (Kumar and Smith, 2003; Smith, 2006), our evaluation of the extended CMeSH Tree will serve only to evaluate the enhanced ontology vocabulary. In order to do this, we have tested the extended CMeSH Tree within a CLIR application.

4. Extension of CMeSH

Figure 1 illustrates the workflow of the process of extending CMeSH. In this chart, the dashed lines represent the steps of obtaining the frequencies of English terms, while solid lines correspond to the steps of extending the original CMeSH, including computing the weights of Chinese translations. The details of the algorithm are explained in the subsections that follow.

Figure 1: Extension workflow (the English MeSH Tree and the Chinese MeSH term list are aligned; Google search, linguistic filtering and C-value filtering yield candidate Chinese translations and their frequencies, from which term weights are computed to produce the extended CMeSH Tree)

4.1 Alignment

Alignment is the operation by which the English MeSH Tree terms are matched with the corresponding Chinese MeSH terms. In our experiments, we found that approximately 3% of English terms in the MeSH Tree had no translation in the online CMeSH term list, and that about 8.1% of Chinese terms had no matching MeSH Tree terms. In order to resolve this issue, both English terms without Chinese translations and Chinese CMeSH terms without English counterparts were ignored in the subsequent steps of processing. Following alignment, the MeSH Tree has the following appearance.

Dementia;C10.228.140.380
痴呆
AIDS Dementia Complex;C10.228.140.380.070
艾滋病痴呆复合征
Alzheimer Disease;C10.228.140.380.100
阿尔茨海默病
Aphasia, Primary Progressive;C10.228.140.380.132
失语, 原发进行性
Creutzfeldt-Jakob Syndrome;C10.228.140.380.165
克-亚综合征
Dementia, Vascular;C10.228.140.380.230
痴呆, 血管性
CADASIL;C10.228.140.380.230.124
大脑常染色体显性动脉病合并皮层下梗塞及脑白质病
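To make the alignment step concrete, the following Python sketch joins an English MeSH Tree (term to tree number) with a Chinese term list (English term to Chinese translation) and drops unmatched entries on both sides. The variable names and the toy data are ours, introduced only for illustration; they are not part of the original resources.

```python
# Hedged sketch of the alignment step: join English MeSH Tree entries with the
# online CMeSH term list and ignore terms that have no counterpart.
mesh_tree = {
    "Dementia": "C10.228.140.380",
    "Alzheimer Disease": "C10.228.140.380.100",
    "Carcinoma in Situ": "C04.557.470.200.240",   # no Chinese entry in this toy list
}
cmesh = {
    "Dementia": "痴呆",
    "Alzheimer Disease": "阿尔茨海默病",
    "AIDS Dementia Complex": "艾滋病痴呆复合征",   # no matching tree node in this toy tree
}

def align(mesh_tree, cmesh):
    """Return {tree_number: (english_term, chinese_term)} for matched terms only."""
    aligned = {}
    for english, tree_number in mesh_tree.items():
        chinese = cmesh.get(english)
        if chinese is not None:          # English terms without a translation are ignored
            aligned[tree_number] = (english, chinese)
    return aligned                       # Chinese terms without English counterparts drop out implicitly

print(align(mesh_tree, cmesh))
# {'C10.228.140.380': ('Dementia', '痴呆'), 'C10.228.140.380.100': ('Alzheimer Disease', '阿尔茨海默病')}
```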

4.2 Linguistic Filtering

Each Chinese term in the aligned MeSH Tree was queried using Google. This caused a set of relevant documents to be returned, most of which were written in Chinese. In order to extract candidate terms from the returned documents, a set of two-level linguistic rules was applied to filter the document set and to discover candidate terms. The rules take the form of regular expressions, which are based on the following linguistic characteristics of Chinese biomedical texts.
1) Most Chinese terms have suffixes. In the above example, '-复合征' (meaning 'complex'), '-综合征' (meaning 'syndrome'), and '-病' (the general name for all kinds of disease) are all suffixes.
2) Some Chinese terms contain 'inner' keywords, which can help to identify terms. For example, in '失语, 原发进行性' and '痴呆, 血管性' (their normal forms are '原发进行性失语' and '血管性痴呆' respectively), '-性' is an important character that, when used between two adjacent verbs and nouns, indicates that the first word describes the term after it, and thus indicates a high probability of the presence of a term.
3) Compared to the clear suffixes of Chinese terms, it can be more difficult to recognize the start of a term in a stream of characters. Fortunately, terms are often followed by synonyms, which are often introduced by a particular set of phrases. For instance, in the sentence '阿尔茨海默病(Alzheimer disease, AD), 又称早老性痴呆,……', '又称', which means 'that is' or 'i.e.', can be a good indicator for determining the beginning of the term. Other similar phrases are '简称' (abbreviated as), '也叫' (that is), '也称' (that is), '还叫' (that is), '叫做' (named as or called), etc.
4) Some symbols (e.g. brackets and parentheses) can play the role of delimiters which define the boundaries of a term. In the sentence '…可以减少人们患早老性痴呆(阿尔茨海默氏症) 的危险,…', the phrase between parentheses is a term whose meaning is 'Alzheimer Disease'. Such symbols may, however, cause ambiguity. For example, the chemical term '1-(4-氟苯基)-1,3-二氢-5-异苯并呋喃腈' (citalopram) contains brackets and a comma. Without special rules, the extracted candidates would be '4-氟苯基' and '1-(4-氟苯基)-1', which are clearly incorrect and not terms. Thus, it is necessary to apply constraints to rules that exploit these symbols.
5) Many Chinese terms start with an English word. For example, '阿尔茨海默病' can also be written as 'Alzheimer 症' or 'AD 症'.
According to these linguistic features, we firstly define four word lists:
• SUFF: This list contains 347 suffixes, such as '-复合征', '-病', '-腈', '-烃', etc.
• SYMB: This is a list of symbols which may function as delimiters, e.g. '(', ')', '[', ']', ',', '，', '。', etc.
• PREF: This list defines the phrases which are considered as prefix indicators, like '又称', '别名', '还叫', etc. Note that these phrases themselves cannot be part of a term.
• INPT: This is a list of special Chinese characters or words whose appearance in a phrase indicates a high probability that the phrase is a term. For example, '-性-', '-化-', '-式-', '-特发-', etc. are included in the list.
Twenty-three regular expression rules have been constructed based on these four sets of characters. These rules can be grouped into two levels: the first-level rules, using a maximum length matching strategy, are employed to extract the terms whose penultimate symbols are in SUFF or which are followed by symbols in the SYMB list. The second-level rules, based on the results of the first-level rules, determine the start points of candidate terms.
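A much simplified Python sketch of this kind of rule is given below; the suffix and prefix lists are tiny illustrative stand-ins for the real SUFF and PREF lists, only the suffix branch of the first-level rules is shown, and the twenty-three rules and the SYMB/INPT lists are not reproduced here.

```python
import re

# Toy stand-ins for the SUFF and PREF lists described above (hypothetical, much shorter).
SUFF = ["复合征", "综合征", "痴呆", "病", "症"]
PREF = ["又称", "简称", "也叫", "也称", "还叫", "叫做"]

# First-level rule (simplified): a maximal run of Chinese/Latin characters ending in a known suffix.
first_level = re.compile(r"[A-Za-z\u4e00-\u9fff\-]+?(?:%s)" % "|".join(SUFF))

# Second-level rule (simplified): a leading prefix indicator such as 又称 marks
# the start point of the term and is stripped from the candidate.
prefix_marker = re.compile(r"^(?:%s)" % "|".join(PREF))

def extract_candidates(snippet):
    candidates = []
    for match in first_level.finditer(snippet):
        candidates.append(prefix_marker.sub("", match.group(0)))
    return candidates

snippet = "阿尔茨海默病(Alzheimer disease, AD), 又称早老性痴呆, ……"
print(extract_candidates(snippet))   # ['阿尔茨海默病', '早老性痴呆']
```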

复合征’, ‘-病’, ‘-腈’, ‘-烃’, and etc.

• SYMB This is a list of symbols which may ++ log10 (()()ffct 0.5et 0.5 ) function as delimiters, e.g. ‘(’, ‘)’, ‘[’, ‘]’, ‘,’, ‘,’, w' = Exp −− Exp   ‘。’, and etc. 2  • PREF This list defines the phrases which are

considered as prefix indicators, like ‘又称’, ‘别 where f is the frequency of Chinese translation and f 名 ’, ‘ 还叫’, et c. Note t hat these p hrases ct et is t he f requency of E nglish t erm. B oth frequencies themselves cannot be one part of a term. correspond t o t he number of oc currences returned by • INPT This i s a l ist of the s pecial C hinese Google. T he weight o f the Chinese t ranslation c an be characters or wo rds w hose appearance in a computed by the sigmoid function w’. If the frequency of phrase i ndicates a high pr obability that the the Chinese translation is greater than that of English term phrase is a term. For example, ‘-性-’, ‘-化-’, ‘- (which m eans t hat C hinese translation i s more p opular 式-’, ‘-特发-’, and etc. are included in the list. than the English equivalents), then we increase its weight. Twenty-three regular e xpression r ules have be en constructed based on these four sets of characters. These
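A minimal sketch of this weighting step in Python, assuming the formula as reconstructed above; the function name and the illustrative hit counts are ours, not the authors'.

```python
import math

def chinese_term_weight(f_ct, f_et):
    """Weight of a Chinese translation, given Google hit counts for the Chinese
    term (f_ct) and for its English MeSH term (f_et), following the formula of
    Section 4.4 as reconstructed above."""
    w_prime = math.exp(-math.exp(-math.log10((f_ct + 0.5) / (f_et + 0.5)) / 2.0))
    if f_ct > f_et > 0:      # Chinese translation more popular than the English term
        return w_prime + 1.0
    return w_prime

# Illustrative hit counts (hypothetical, not taken from the paper)
print(round(chinese_term_weight(f_ct=120000, f_et=45000), 4))  # boosted weight above 1
print(round(chinese_term_weight(f_ct=3000, f_et=45000), 4))    # small, un-boosted weight
```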

4.5 The Final CMeSH Resource

After merging the weight values with the Chinese and English terms, the final CMeSH Tree has the following representation.

Dementia;C10.228.140.380
痴呆:0.343881162222
痴呆症:1.425371472485
失智:0.314097771253
AIDS Dementia Complex;C10.228.140.380.070
艾滋病痴呆综合征:0.099850335887
AIDS 痴呆综合征:0.050615806911
AIDS 痴呆症候群:0.018721486519
艾滋病痴呆复合征:0.080638707889
艾滋病痴呆复合症:0.00000853004
爱滋病痴呆复合症:0.00000461272
Alzheimer Disease;C10.228.140.380.100
阿尔茨海默病:0.097398853027
Alzheimer 病:0.04592354867
阿滋海默症:0.18626155397
老年性痴呆:0.244575613782
Alzheimer 氏病:0.073412383701
早老性痴呆:0.074511778
Alzheimer 氏症:0.041794385
Aphasia, Primary Progressive;C10.228.140.380.132
失语, 原发性进行性:0.000000211796
原发性进行性失语:0.317609856764
原发性进行性失语症:0.208916619538
Creutzfeldt-Jakob Syndrome;C10.228.140.380.165
克-亚综合征:0.014768214
Creutzfeldt-Jakob 病:0.324075159688
Creutzfeldt-Jakob 综合征:0.001337330099
早老痴呆症:0.264388092697
克-雅氏综合征:0.005840557652
克-雅氏病:0.119031301389
克雅氏病:0.119031301389
库贾氏病:0.203074988059
牛海绵状脑病:1.398253302721
疯牛病:1.431304403242
皮质-纹体-脊髓变性:0.006199180668
克鲁兹弗得-雅柯病:0.002491919204
库雅氏症:0.029140724567
Dementia, Vascular;C10.228.140.380.230
痴呆, 血管性:0.226335681686
血管性痴呆:0.36594277919
血管梗塞型痴呆症:0.000230068283
血管型失智症:0.081200049502
CADASIL;C10.228.140.380.230.124
大脑常染色体显性动脉病合并皮层下梗塞及脑白质病:0.00000043303
常染色体显性遗传病合并皮质下梗死和白质脑病:0.00000043303
CADASIL 病:0.02736615094
遗传性多发梗死痴呆病:0.003007920526
伴皮层下梗死和白质脑病的常染色体显性遗传性脑动脉病:0.014594966461
伴皮质下梗死和白质脑病的常染色体显性遗传性脑动脉病:0.01608595828
显性遗传脑动脉病伴皮层下梗死及脑白质病:0.00005469028

The extended CMeSH enriches the original resource with both synonyms and term weights. Moreover, Chinese terms and their synonyms are mapped to English MeSH terms with their tree numbers.

5. Evaluation

A CLIR system has been employed to evaluate the scope of the vocabulary in the extended CMeSH Tree. The difference between cross-lingual information retrieval and mono-lingual information retrieval is that CLIR requires a stage in which either the queries are translated from the source language into the target language (in which the document set is written), or else the document set is translated into the language in which the queries are expressed. In this study, we translate Chinese queries into English and carry out information retrieval on English documents. The CMeSH Tree is used to translate and/or expand the Chinese query terms.

5.1 Experimental Description

The toolkit used for constructing the CLIR system is Lemur (http://www.lemurproject.org/). The document collections are the TREC Genomics data from 2006 and 2007 (Hersh et al., 2006, 2007), which contain a total of 162,259 biomedical papers (11.9 GB). The indexing and retrieval algorithm is Okapi BM25; the parameters used for Okapi BM25 are the system default values. All documents are indexed for document-level retrieval. Indexing of the TREC Genomics documents does not involve stemming. Stopwords are removed using the PubMed stop list (http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helppubmed&part=pubmedhelp&rendertype=table&id=pubmedhelp.T43).
The TREC Genomics 2006 and 2007 tasks provide 64 sentences as queries and a gold standard relevance judgement, i.e., each query has been associated with a set of relevant documents in the TREC Genomics data set by human experts, except Query 173 and Query 180. There are also scripts and utilities to calculate retrieval performance. In order to obtain the Chinese queries, we make use of the same strategy reported in Levow et al. (2004), in which the original queries are manually translated into other languages. Here, we manually translate the TREC Genomics queries from English into Chinese. For instance, Query 161 in the 2006 task is:
'What is the role of IDE in Alzheimer's disease'.
The corresponding Chinese translation is as follows:
'在阿尔茨海默病中 IDE 的作用是什么'.
The Chinese query sentences need to be segmented before translating them into English queries. As our work is intended to evaluate the effectiveness of the CMeSH Tree terms rather than the performance of the information retrieval itself, and as there is no high-performance POS tagger for the Chinese biomedical domain, we manually segment the Chinese queries into word sequences. This process ensures that errors are not introduced by automatic word segmentation. In the above example, the segmentation result is as follows:
'在 阿尔茨海默病 中 IDE 的 作用 是 什么'.

We then remove words from the query whose grammatical categories do not correspond to one of the following: nouns or noun phrases, verbs (except link verbs and auxiliary verbs), and adjectives. Words without Chinese characters, like 'IDE', are also retained in the query. By carrying out a preliminary experiment (i.e. performing mono-lingual information retrieval on the TREC Genomics data to compare the performance of two word selection policies: all words and selected words), we found that the above categories of words play a more important role in retrieving relevant documents than prepositions, adverbs, postpositions, interrogatives, etc. Following this filtering strategy, the above-mentioned example now contains the following words:
'阿尔茨海默病 IDE 作用'.
All the following experiments were carried out on queries which were segmented and filtered according to the above description.
The segmented and filtered Chinese queries are translated into English queries using either a domain dictionary or the CMeSH Tree, depending on the experiment being undertaken (see the next section). English queries are represented as Indri queries using the 'Indri query language' (Strohman, 2005), which is based on the 'Inquery query language'. Finally, Lemur's Indri search engine retrieves the relevant documents and computes the performance measures, such as mean average precision (MAP) and average precision (AP).

5.2 Experiments

CMeSH Tree terms were applied to translate Chinese terms into English equivalents. The quality of the CMeSH Tree is thus reflected in the performance of the CLIR. To fully evaluate the extended CMeSH Tree, we designed four experiments. The baseline experiment makes use of a domain dictionary to translate queries. The other three experiments are aimed at evaluating a) the CMeSH Tree terms themselves, b) the term weights within the CMeSH Tree, and c) the CMeSH Tree hierarchy.

1. Baseline
For the baseline experiment, we employed a free domain dictionary, '谷歌金山词霸 2.0' (Google and Kingsoft Dictionary 2.0) (http://g.iciba.com/), to translate Chinese query terms into English counterparts. The policy for out-of-vocabulary (OOV) words is to ignore all unknown words.

2. CMeSH Tree term translation
For this experiment, the extended CMeSH Tree terms were used to translate Chinese queries. Terms or words which were not present in the CMeSH Tree were ignored during translation. In the following discussion, this experiment is referred to as 'term_t'.

3. CMeSH Tree term translation with weights
This experiment, subsequently referred to as 'term_w', had the aim of evaluating the effectiveness of our term weighting algorithm. Whenever a Chinese term was found in the CMeSH Tree, the weight of that Chinese term was passed to the English translation. The Indri query consists of these translations and their weights. Here, we inherit the OOV processing policy used in the 'term_t' experiment.

4. CMeSH Tree term translation with hierarchy expansion
In this experiment, we expanded Chinese queries according to the hierarchical structure of the CMeSH Tree. Spasić and Ananiadou (2005) describe an algorithm to compute the tree similarity (TS):

    ts(C1, C2) = 2 * common(C1, C2) / (depth(C1) + depth(C2))

where C1 and C2 are the classes related to Term 1 and Term 2 respectively, common(C1, C2) denotes the number of common classes in the paths leading from the root to the given classes, and depth(C) is the number of classes in the path connecting the root and the given class. In this study, common(C1, C2) is subject to the following conditions:

    common(C1, C2) = depth(C2)
    common(C1, C2) = depth(C1) - 1, where depth(C1) = depth(C2)

If the value of the 'common' function is depth(C2), then Term 2 is the parent node of Term 1; the second condition indicates that Term 2 is a sibling of Term 1. Using this constraint, the TS algorithm expands a Chinese query term only with its siblings and parent in the CMeSH Tree.
After expanding the original Chinese query terms, the CMeSH term list is used to translate the expanded queries into English. For this experiment, term weights are ignored, because the experiment is intended to evaluate the ontology hierarchy. We name this experiment 'term_h'.
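A small Python sketch of how such a parent/sibling expansion over MeSH-style tree numbers might look (our own illustrative code, not the authors' implementation):

```python
def depth(tree_number):
    """Number of classes on the root path, e.g. depth('C10.228.140.380') == 4."""
    return len(tree_number.split("."))

def common(c1, c2):
    """Number of common classes on the two root paths (shared tree-number prefix length)."""
    shared = 0
    for a, b in zip(c1.split("."), c2.split(".")):
        if a != b:
            break
        shared += 1
    return shared

def ts(c1, c2):
    """Tree similarity of Spasić and Ananiadou (2005), as quoted above."""
    return 2.0 * common(c1, c2) / (depth(c1) + depth(c2))

def expand(query_node, tree_nodes):
    """Keep only the parent and siblings of query_node, as in the term_h experiment."""
    expansions = []
    for node in tree_nodes:
        if node == query_node:
            continue
        c = common(query_node, node)
        is_parent = (c == depth(node))                                        # condition 1
        is_sibling = (depth(node) == depth(query_node) and c == depth(query_node) - 1)  # condition 2
        if is_parent or is_sibling:
            expansions.append((node, ts(query_node, node)))
    return expansions

nodes = ["C10.228.140.380", "C10.228.140.380.100", "C10.228.140.380.230", "C10.228.140.380.230.124"]
# For Alzheimer Disease (…380.100), the parent Dementia and the sibling Dementia, Vascular are kept.
print(expand("C10.228.140.380.100", nodes))
```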

5.3 Results and Discussion

Table 1 shows the Mean Average Precision (MAP) of each experiment, i.e., the mean value of all queries' average precision.

          baseline   term_t   term_w   term_h
  2006    0.2622     0.2857   0.3014   0.2706
  2007    0.1735     0.1813   0.1899   0.1712

Table 1: MAPs for the four experiments

According to the results of experiment term_t, we can conclude that the extended CMeSH has a positive effect on IR results. Compared with the baseline, the MAPs have been increased by 2.35% (for the 2006 task) and 0.78% (for the 2007 task) with the help of the extended CMeSH terms. The experiment term_w proves that the term weighting algorithm can improve the performance of Chinese-English CLIR greatly – from the baseline's results of 26.22% (2006 task) and 17.35% (2007 task) to 30.14% (2006 task) and 18.99% (2007 task) respectively. The reasons for these improvements are: 1) our CMeSH extension provides more Chinese terms than the dictionary; 2) our term weighting algorithm succeeds in assigning terms reasonable weight values.
The results of the term_h experiment are not as expected. For the 2006 task, the MAP is only slightly better than that of the baseline experiment, whilst for the 2007 task, it is the worst result of all four experiments. The reason for this poor performance is that our simple query expansion technique introduces too many terms into the queries, which reduces the precision of the search engine.
Figure 2 shows the Average Precision (AP) for each query in all four experiments.

Figure 2: AP of each query in the four experiments (per-query average precision curves for baseline, term_t, term_w and term_h)

Figure 2 provides another perspective on the CMeSH Tree extension: it illustrates how well the CMeSH Tree extension performs on each query. These results are consistent with those presented in Table 1. For example, in most cases, the curve of experiment term_w is higher than the other curves, which indicates that CMeSH Tree term translation with term weights gives the best result among the experiments. The chart also gives details of the performance on each query. For instance, Query 224 obtains its best AP in the baseline experiment; CMeSH terms decrease its AP considerably, by 55.67% (for term_t), 35.87% (for term_w), and 46.02% (for term_h), compared with the result of the baseline. Meanwhile, Query 220 is greatly improved by CMeSH term translation with term weights, but the term_h strategy reduces its performance.

6. Conclusions

In this paper, we have proposed a task-oriented extension of the Chinese MeSH Tree. We have employed the Google search engine to collect Chinese synonyms and to calculate weights for them. We have evaluated our extension to the tree using a Chinese-English cross-lingual information retrieval system in the biomedical domain, and have analysed the scope and effectiveness of the extended CMeSH Tree terms and their weights. The results of our experiments illustrate that the extended CMeSH Tree significantly improves the performance of the CLIR application. The enhanced performance of the CLIR serves to demonstrate the quality of the extended CMeSH Tree. It is intended that this new linguistic resource will also help others in future research.
Future work will include the following:
1. Ensuring that all terms in the original English MeSH Tree have corresponding Chinese translations (a small number are still missing);
2. Computing the weights of English as well as Chinese terms;
3. Finding English synonyms of English MeSH terms;
4. Evaluating the performance of the extended CMeSH Tree in other NLP applications.

7. Acknowledgements

We sincerely thank Mr. Paul Thompson of the National Centre for Text Mining for his help with proof-reading and comments regarding this paper. We also thank the two anonymous reviewers for their helpful comments.

8. References

Abdou, Samir, and Savoy, Jacques (2007). Searching in MEDLINE: Query expansion and manual indexing evaluation. Information Processing and Management, Volume 44, Issue 2, 2008, pp. 781--789.
Brewster, C., Alani, H., Dasmahapatra, S., and Wilks, Y. (2004). Data driven ontology evaluation. In Proceedings of the International Conference on Language Resources and Evaluation, Lisbon, pp. 24--30.
Cooper, Gregory F., and Miller, Randolph F. (1998). An Experiment comparing Lexical and Statistical Methods for Extracting MeSH Terms from Clinic Free Text. Journal of the American Medical Informatics Association, Volume 5, Issue 1, 1998, pp. 62--75.
Elkin, Peter L.; Cimino, James J.; Lowe, Henry J.; Aronow, David B.; Payne, Tom H.; Pincetl, Pierre S.; and Barnett, G. Octo (1988). Mapping to MeSH: The Art of Trapping MeSH Equivalence from within Narrative Text. In Proceedings of the Twelfth Annual Symposium on Computer Applications in Medical Care, IEEE Computer Society Press, pp. 185--190.
Frantzi, Katerina, Ananiadou, Sophia, and Mima, Hideki (2000). Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2), pp. 117--132.
Guo, Y., Harkema, H., and Gaizauskas, R. (2004). Sheffield University and the TREC 2004 genomics track: Query expansion using synonymous terms. In Proceedings of the Thirteenth Text REtrieval Conference. Gaithersburg, MD: Department of Commerce, National Institute of Standards and Technology, pp. 16--19.
Hersh, W., Cohen, A. M., Roberts, P., and Rekapalli, H. K. (2006). TREC 2006 genomics track overview. In Proceedings of the Fifteenth Text REtrieval Conference. Gaithersburg, MD: Department of Commerce, National Institute of Standards and Technology.
Hersh, W., Cohen, A. M., Roberts, P., and Rekapalli, H. K. (2007). TREC 2007 genomics track overview. In Proceedings of the Sixteenth Text REtrieval Conference. Gaithersburg, MD: Department of Commerce, National Institute of Standards and Technology.
Kumar, A., and Smith, B. (2003). The Unified Medical Language System and the Gene Ontology: Some Critical Reflections. In KI 2003: Advances in AI, pp. 135--148.
Levow, Gina-Anne, Oard, Douglas W., and Resnik, Philip (2004). Dictionary-Based Techniques for Cross-Language Information Retrieval. Information Processing & Management, 41(3), pp. 523--547.
Li, Danya, Hu, Tiejun, Zhu, Wenyan, Qian, Qing, Ren, Huiling, Li, Junlian, and Yang, Bin (2001). Retrieval System for the Chinese Medical Subject Headings (in Chinese). Chinese Journal of Medical Library, Issue 4, 2001.
Lozano-Tello, A., and Gómez-Pérez, A. (2004). Ontometric: A method to choose the appropriate ontology. Journal of Database Management, 15(2), pp. 1--18.
Lowe, H. J., and Barnett, G. O. (1994). Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches. Journal of the American Medical Association, 271(14), pp. 1103--1108.
Lu, Zhiyong, Kim, Won, and Wilbur, W. John (2008). Evaluation of query expansion using MeSH in PubMed. Information Retrieval, Volume 12, Issue 1, 2009, pp. 69--80.
Lynam, T. R., Clarke, C. L. A., and Cormack, G. V. (2001). Information Extraction with Term Frequencies. In Proceedings of the First International Conference on Human Language Technology Research, pp. 1--4.
Maedche, A., and Staab, S. (2002). Measuring similarity between ontologies. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web, pp. 251--263.
Porzel, R., and Malaka, R. (2004). A task-based approach for ontology evaluation. In Proceedings of the ECAI 2004 Workshop on Ontology Learning and Population.
Qin, Yayun, and Feng, Qichang (1999). 中文医学主题词表(机读版)在文献标引中的应用 (in Chinese). Journal of Medical Intelligence, Issue 5, 1999.
Shultz, M. (2006). Mapping of medical acronyms and initialisms to Medical Subject Headings (MeSH) across selected systems. Journal of the Medical Library Association, Volume 94, Issue 4, pp. 410--414.
Smith, B. (2006). From concepts to clinical reality: an essay on the benchmarking of biomedical terminologies. Journal of Biomedical Informatics, 39(3), pp. 288--298.
Spasić, I., and Ananiadou, S. (2005). A flexible measure of contextual similarity for biomedical terms. Pacific Symposium on Biocomputing 10, pp. 197--208.
Strohman, T., Metzler, D., Turtle, H., and Croft, W. B. (2005). Indri: A language-model based search engine for complex queries. In Proceedings of the International Conference on Intelligence Analysis, May 2-6, extended paper.
Zhou, Xuezhong, Liu, Baoyan, Wu, Zhaohui, and Feng, Yi (2007). Integrative mining of traditional Chinese medicine literature and MEDLINE for functional gene networks. Artificial Intelligence in Medicine, 41(2), pp. 87--104.

Structuring of Status Descriptions in Hospital Patient Records

Svetla Boytcheva1, Ivelina Nikolova2, Elena Paskaleva2, Galia Angelova2, Dimitar Tcharaktchiev3 and Nadya Dimitrova4

1State University of Library Studies and Information Technologies, Sofia, Bulgaria 2Institute for Parallel Processing, Bulgarian Academy of Sciences, Sofia, Bulgaria 3University Specialized Hospital for Active Treatment of Endocrinology, Medical University, Sofia, Bulgaria 4National Oncological Hospital, Sofia, Bulgaria E-mail: [email protected], {iva, hellen, galia}@lml.bas.bg, [email protected], [email protected]

Abstract In this article we present a text analysis system designed to extract key information from clinical text in Bulgarian language. The extracted and structured features will be further employed for classification of patient cases, effective information retrieval and further processing in different medical tasks. Using shallow analysis within an Information Extraction (IE) approach, the system builds structured descriptions of patient status and complications. We discuss some particularities of the medical language of Bulgarian patient records, functionality of our current prototype, and evaluation results regarding the IE tasks we tackle at present.

1. Introduction

Patient Records (PRs) contain much textual information. When studying hospital PRs in Bulgaria, which discuss a single hospital episode, one discovers that the most important findings, opinions, summaries and recommendations are stated as free text, while clinical data usually supports the textual statements or provides clarification of particular facts. Thus the essence of patient-related information is communicated as an unstructured text message from one medical expert addressed to another medical expert. In addition, the clinical documents present only partial information about the patients, so some kind of aggregation is needed to provide a complex view of the patient's health status. Automatic analysis of biomedical text is a complex task which requires various linguistic and conceptual resources (Spasic et al., 2005). Despite the difficulties and challenges, however, there are industrial systems and research prototypes in many natural languages which aim at the extraction of features from patient-related texts. The application of language technologies to medical PRs is thus viewed as an advanced but standard task which is a must in health informatics.
In this paper we present an IE prototype which extracts important facts from hospital PRs of patients diagnosed with different types of diabetes. The extracted and structured features will be employed for classification of patient cases, effective information retrieval and further processing in different medical tasks. The system, called EVTIMA, is under development in a running project for medical text processing in the Bulgarian language. It should be classified as an ontology-driven IE system, following the classification in (Spasic et al., 2005). The article is structured as follows: section 2 briefly overviews related approaches; section 3 presents the specific settings of the IE task and the characteristics of the medical texts; section 4 describes the prototype and its functionality; section 5 deals with the evaluation; and section 6 contains the conclusion.

2. Related Work

When designing our system, its rules for shallow analysis and the training corpus, we have studied carefully the CLEF site (CLEF 2008). Other systems which process patient symptoms and diagnosis treatment data are: the Medical Language Extraction and Encoding System, which was designed for radiology reports and later extended to other domains such as discharge summaries (Friedman, 1997); (caTIES, 2006), which processes surgical pathology reports; the open-source NLP system Health Information Text Extraction (HITEx, 2006); and the Clinical Text Analysis and Knowledge Extraction System cTAKES (Savova et al., 2008). Interesting and useful ideas about the processing of medical terminology and the derivation of terminological variants are given in (Valderrábanos et al., 2002). Negative statements in Bulgarian patient-related texts are studied in (Boytcheva et al., 2005).

3. IE in the EVTIMA System

The system presented here deals with anonymised PRs supported by the Hospital Information System (HIS) of the University Specialised Hospital for Active Treatment of Endocrinology "Acad. I. Penchev" at Medical University, Sofia. The current HIS facilitates PR structuring since the diagnosis, encoded in ICD-10 (the International Classification of Diseases v. 10), is selected via a menu. The drugs prescribed to the patient, which treat the illness causing the particular hospital stay, are also supported via the so-called Computerised Provider Order Entry. In this way some information is structured and easy to find.

However, in the PR discussion of the case history, the previous diseases and their treatments are described as unstructured text only. In addition, in the hospital archive we find PRs as separate text files, and these PRs consist of free text. Therefore our prototype needs to recognise the ICD terms, drug names, patient age and sex, family risk factors and so on. To do so we perform the steps shown in Figure 1.

Figure 1: Major IE steps for PR processing

3.1 Medical Language in Hospital PRs

The length of PR texts in Bulgarian hospitals is usually 2-3 pages. The document is organised in the following sections: (i) personal details; (ii) diagnoses of the leading and accompanying diseases; (iii) anamnesis (personal medical history), including current complaints, past diseases, family medical history, allergies, risk factors; (iv) patient status, including results from physical examination; (v) laboratory and other test findings; (vi) progress notes; (vii) discussion; (viii) treatment; (ix) recommendations. Despite the clear PR fragmentation, there are various problems for automatic text processing. Bulgarian medical texts contain a specific mixture of terminology in Latin (about 1%) and Cyrillic letters, and Latin terms transcribed with Cyrillic letters. The terms occur in the text with a variety of wordforms, which is typical for the highly-inflectional Bulgarian language. The major part of the text consists of short declarative sentences and sentence phrases without agreement, often without proper punctuation marks. There are various kinds of typos in the original text which cannot be properly dealt with within our research project. In the evaluation we consider only normalised texts with standard abbreviations and without spelling errors, because we aim at a research study. Our present experimental corpus is formed of 1150 PRs with some 6400 words, about 2000 of them being medical terms.
Another PR text particularity is that descriptions are often missing. It is common for Bulgarian clinicians not to describe in the PR the organs where no pathological deviations have been found. As reported in (Boytcheva et al., 2009), only 86% of the PRs in our corpus discuss explicitly the patient status regarding skin colour, 63% - fat tissue, about 42% - skin turgor and elasticity, and 13% - skin hydration. In order to fill in the maximum number of slots in our IE template, default values have to be assigned. The presence of these values will be important in a subsequent case classification. The PRs also contain a lot of numerical values of analyses and clinical test data (about 16%) which are the subject of additional studies.

3.2 Available Resources

A lexicon of 30 000 Bulgarian lexemes, which is part of a large general-purpose lexical database with 70 000 lexemes, serves as the basic dictionary for morphological analysis of Bulgarian medical text. The International Classification of Diseases (ICD-10) contains 10 970 terms. The Bulgarian version of ICD-10 has no clinical extension, i.e. some medical terms had to be extracted from the PR corpus. So far we have extended the basic lexicon by medical terminology containing 5 288 terms. Constructing the dictionaries is a continuous process; the Latin terms have to be taken into consideration as well.
Within our project, biomedical NLP for Bulgarian texts starts from scratch due to the lack of language resources as well as ontological resources and tools. For instance, no Named Entity Recognition module has been implemented for Bulgarian entities in the medical domain; the syntactic rules for partial analysis are constructed for the first time; and there are no conceptual resources labelled by Bulgarian medical vocabulary, except ICD-9 and ICD-10. It is curious to note that the list of drugs and medications is supported with Latin names by the Bulgarian Drug Agency, even for drugs produced in Bulgaria (BDA 2010), but in the PR texts the medications are predominantly referred to in Bulgarian, so the drug-related vocabulary is also compiled on the fly in two languages.

4. Present Functionality

At present our system performs the following tasks: text segmentation and normalisation, and text analysis. Here we discuss the latter.

In order to extract the status of a particular patient organ, the IE system has to find in the PR text a term referring to some anatomic organ which is important for diabetes, e.g. thyroid gland, skin, limbs/legs, eyes, neck etc. For this purpose the system first automatically segments the text into several semantic zones as listed above (see section 3.1). The subsequent text analysis is applied separately to the epicrisis' parts in order to extract structured knowledge from the free text. Up to now the anamnesis and the patient status areas of the epicrisis are being processed. From the anamnesis we extract features like age, sex, sickness name, type and duration, and from the patient status - the organs' condition. Of major interest are the limbs, skin and thyroid gland, because many of the diabetes complications are especially related to aggravations in these organs.
The status section contains mostly short declarative sentences in the present tense which describe several organs; in the current corpus we find most often 20-30 different anatomic organs and status conditions. So it is important to identify the description boundaries for each particular organ. The scope is determined by using domain knowledge concerning anatomic organs and their parts, which shows where an organ description ends and another one begins. Domain knowledge helps significantly in designing text analysis rules.

Figure 2. Fragment of conceptual model of anatomic organs, their parts, and some diseases.

Figure 2 shows an excerpt of the semantic network which reflects important relations between organs: isa, part-of and has-location. Knowledge about limbs and their parts helps to scope the description of limb status. Let us consider the following example:

Крайници – отслабени пулсации на a. dorsalis pedis двустранно. Претибиални и перималеоларни отоци. Oнихомикоза, tinea pedis. Сукусио реналис – (-) отр. двустранно.

Limbs – reduced dorsal pedal pulse on both feet. Pretibial and perimaleolar edema. Onychomycosis, tinea pedis. Succusio renalis – bilateral negative (-).

Here the IE system finds the term limbs in the first sentence and runs the IE process in order to extract the limbs' status. The second sentence contains adjectives which are attributes related to limbs - "претибиални" (leg) and "перималеоларни" (ankle). The third sentence contains the terms "онихомикоза" (onychomycosis), a fungal infection of the nails, and "tinea pedis", denoting a fungal foot infection. Both entities denote diseases of the lower legs. Mapping these terms to the concepts and relations in Fig. 2, the IE system considers sentences 1-3 as a continuous status description of the limbs. The fourth sentence contains "succusio renalis", which refers to kidney percussion, and this is a signal that the limbs description is completed. Moreover, "succusio renalis" occurs in a new sentence, as we know empirically that new organ descriptions start in another sentence despite the fact that all statements are mixed into one paragraph. Another example is:

Крайници - без отоци, варикозни промени. Запазени пулсации на периферните артерии, запазени повърхностна, термо и вибрационна чувствителност. Затруднена и болезнена походка, използва помощни средства.

Limbs – without oedema, varicose changes. Palpable peripheral arteries pulse, preserved tactile, thermo and vibratory sensation. Walks with difficulty, algetic gait, uses assistive devices.

The status of the peripheral arteries is important for patients with diabetes (which we know due to domain communication knowledge provided by the medical experts), so the second sentence is considered as a continuation of the limbs discussion. Here the occurrence of the word "gait" in the third sentence signals the completion of the limbs description, which in this case is split across the first and second sentences. Another aspect of domain communication knowledge is the default that under limbs the medical experts usually understand lower limbs; parts of the upper limbs should be mentioned explicitly to denote arms, forearms etc. About 96% of all PRs in our training corpus contain organ descriptions in this format (Boytcheva et al., 2009). In this way the shallow analysis by cascades of regular expressions, which is proposed in (Boytcheva et al., 2009), proves to be a successful approach for structuring the statements concerning patient status.
For each particular organ there is a predefined template, where text units are stored as conceptual entities during the domain interpretation phase. In our corpus there are four manners of presenting the organs and their features:
1. General – by giving some default value, e.g. "без патологични промени, без особености" (without pathological changes, without specifics), "със запазена/нормална характеристика" (with preserved/normal characteristics) etc. Figure 3 presents the obligatory IE template fields for limbs. If the PR contains only general statements, the template stores the obligatory fields for the four limbs;
2. Explicit – the PR text contains particular specific values. The characteristic name might be missing, since the attribute is sufficient to recognise the feature: e.g. "preserved peripheral pulsations" instead of "preserved pulsations of the peripheral arteries". The attributes are described by a variety of expressions; e.g. for the "volume of the thyroid gland" the value "normal" can be represented as "not enlarged", "not palpably enlarged", "not palpable". These explicit statements are processed at the fourth and fifth steps of the algorithm. Depending on the depth of the body-parts ontology with related anatomic organs (AO) for the processed AO, and according to the details encountered in the PR text, the dynamically generated templates can consist of different levels of nested fields (Figure 4 and Figure 5);
3. Partial – the text contains descriptions of organ parts, not of the main AO. For instance, the limbs status can be expressed as, e.g., "atrophic changes of the legs skin with pretibial oedema". The IE system then adds additional characteristics to the template at steps four to six of the algorithm shown in Figure 1, with final refinement at step eight when the default values are filled in;
4. By diagnosis – sometimes a diagnosis is given instead of the organ description, e.g. "onychomycosis, tinea pedis". This information is captured at the fourth and fifth steps of the algorithm in Figure 1.

Figure 3: General Template for Limbs

Figure 4: Template with specific fields about the lower limbs

Figure 4 illustrates the case when the respective PR contains specific information concerning only the lower limbs - light swelling. For the upper limbs, which are not explicitly described, the IE system then sets the default value. The same decision is made for the field "Peripheral Artery Pulsations", which is set to "preserved" for both the upper and lower limbs.

Figure 5: Template with fields about the lower left limb

The PR behind the template in Figure 5 contains specific information about the "left ankle"; the IE system then infers that the missing values should be set to the default limb attribute values.
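As a rough illustration of this template-filling strategy, the Python sketch below fills a drastically simplified limbs template with regular expressions over a status sentence and assigns defaults to slots that remain empty. The field names, patterns and default values are invented for the example and are far simpler than the cascades of rules used in EVTIMA.

```python
import re

# Simplified limbs template defaults for slots not explicitly described in the PR text
# (hypothetical values, for illustration only).
DEFAULTS = {"oedema": "без отоци", "peripheral_artery_pulsations": "запазени"}

# Toy patterns over the status description of the limbs.
PATTERNS = {
    "oedema": re.compile(r"(без\s+отоци|[а-я]+\s*отоци)", re.IGNORECASE),
    "peripheral_artery_pulsations": re.compile(r"(запазени|отслабени)\s+пулсации", re.IGNORECASE),
}

def fill_limbs_template(status_text):
    template = {}
    for field, pattern in PATTERNS.items():      # cf. steps 4-5: regular expressions / extra rules
        match = pattern.search(status_text)
        if match:
            template[field] = match.group(1)
    for field, default in DEFAULTS.items():      # cf. step 8: defaults for missing descriptions
        template.setdefault(field, default)
    return template

text = "Крайници – отслабени пулсации на a. dorsalis pedis двустранно. Претибиални и перималеоларни отоци."
print(fill_limbs_template(text))
# {'oedema': 'перималеоларни отоци', 'peripheral_artery_pulsations': 'отслабени'}

print(fill_limbs_template("Крайници – претибиални отоци."))
# {'oedema': 'претибиални отоци', 'peripheral_artery_pulsations': 'запазени'}  (default applied)
```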

5. Evaluation

In a previous paper we have evaluated the recognition of skin characteristics (Boytcheva et al., 2009). These results are presented here to allow for a more complete assessment of our IE results. The extraction of attributes was evaluated using a corpus of 197 PRs as a training set (set1) and another set of 1500 PRs as a test set (set2). The evaluation is made organ by organ, since each AO description is analysed independently. There are PRs which do not contain any descriptions relevant to the tested AOs, but these are removed from the evaluation figures.
Usually the Information Extraction performance is assessed in terms of three measures. The precision is calculated as the number of correctly extracted AO descriptions divided by the number of all recognised AO descriptions in the test set. The recall is calculated as the number of correctly extracted AO descriptions divided by the number of all available AO descriptions in the test set (some of them may remain unrecognised by the particular IE module). Thus the precision measures the success, and the recall the recognition ability and "sensitivity", of the algorithms. The F-measure (harmonic mean of precision and recall) is defined as

F = 2 × Precision × Recall / (Precision + Recall).

Table 1 summarises the precision, recall and F-measure of correctly extracted descriptions for age, sex, diagnosis, diabetes duration and the AOs skin, neck, thyroid gland and limbs. For each test the extraction algorithm, shown in Figure 1, uses organ-specific regular expressions.
The cases of incorrect analysis are due to more complex syntactic structures in the PR text, which need to be handled by a syntactic analyser (parser) and a deeper approach to sentence analysis.
Since we are using nested rules for capturing features, spelling errors in (or the lack of) expressions which are to be matched on an upper level may prevent the further matching of the rules in the nested fields. We have noticed that this was one of the reasons for the comparatively low recall value for the diabetes duration (its recognition rule is nested in the diabetes acronym and diabetes type recognition rules). The recall for the feature sex is surprisingly low. This can be explained by the fact that there were too few samples in the anonymised training corpus, and when testing on a new dataset the available rules could not capture the relevant features. The main reason is the new author styles encountered in the enlarged corpus. These inconsistencies are covered by continuous updating and adjustment of the IE rules for shallow analysis.

Feature                  Precision %   Recall %   F-measure %
#1 (age)                 88.89         90.00      89.44
#2 (sex)                 80.00         50.00      61.54
#3 (diagnosis)           98.28         96.67      97.47
#4 (diabetes duration)   96.00         83.33      89.22
#5 (skin)                95.65         73.82      81.33
#6 (neck)                95.65         88.00      91.67
#7 (thyroid gland)       94.94         90.36      92.59
#8 (limbs)               93.41         85.00      89.01

Table 1: IE precision, recall and F-measure evaluation (#1 - age; #2 - sex; #3 - diagnosis; #4 - diabetes duration; #5 - skin; #6 - neck; #7 - thyroid gland; #8 - limbs)

Test set3            Number of PRs with          Percentage of PRs with
                     explicit text description   explicit characteristics
Ankle                201                         99.5%
Leg                  201                         99.5%
Peripheral artery    201                         99.5%
Feet                 8                           3.98%
Skin                 11                          5.47%
Nails                15                          7.46%
Others               21                          10.45%
Gait                 3                           1.49%

Table 2: Availability of status statements in 201 randomly-selected PRs

Template filled in at:           ankle   leg   peripheral artery   feet   skin   nails   others
Step 4 by regular expressions    150     148   199                 6      9      14      18
Step 5 by extra rules            40      40    1                   0      0      0       0
Step 6 by correlations           0       0     0                   1      1      0       2
Step 8 by defaults               9       11    0                   0      0      0       0
Total:                           199     199   200                 7      10     14      20

Table 3: IE performance steps 4-8 for limbs descriptions: number of PRs from where status statements are extracted at each step

We see that the simple regular expressions work relatively well and produce enough input for statistical observations of patient status data. It is also clear that further efforts are required for proper development of the extraction algorithms in order to cover sophisticated language expressions. The automatic recognition of the scope of quantifiers, negation, and temporal qualifications is a challenge to be met in the future.
Regarding the limbs, an additional in-depth evaluation was done on a set of 201 PRs (set3) which were randomly selected from set2. Some PRs in set3 contain no explicit discussion of certain limb characteristics. Table 2 gives the numbers of available or missing descriptions for each attribute; it turns out that the status of the ankles, legs and the peripheral arteries is explicitly mentioned in almost all PRs.
The IE algorithm shown in Figure 1 was run on set3 and we have counted manually how the processing at steps 4-8 contributes to the final success rates. The results are summarised in Table 3. It is clear that an essential share of the successful recognitions is achieved at steps four and five of the algorithm. The regular expressions, applied at step 4, recognise the majority of the descriptions. Extra rules are applied at step 5, to extract phrasal units including the negative descriptions.
Some text descriptions of limbs status contain complex statements including embedded sentences and clinical tests; for this reason the IE recall is relatively low (85%, see Table 1, row 8). The skin feature extraction has also obtained a comparatively low recall, which could be explained by the fact that there is a large variety of lexical expressions describing the values of some skin characteristics, combined with a variety of the authors' expression manners.
In section 4 we have shown that the occurrence of the word "gait" signals the end of the limbs description. Please note that according to Table 2, this is helpful for 1.49% of all PRs to be analysed, so many other "hints" need to be acquired and declared in the domain ontology.

6. Conclusion

Our system is the first one which supports medical text mining for Bulgarian. So far we have evaluated its performance on several IE tasks dealing with the anamnesis and patient status zones of the PRs. It shows the complexity of medical text processing, which is due to the complexity of the medical domain and the particularities of the medical texts written in a specific, well-established style. The role of explicitly-declared domain knowledge is shown; it supports the information extraction algorithms by providing constraints and inference mechanisms. At the same time the article illustrates the obstacles to building semantic systems in the medical domain: this requires much effort for the construction of the conceptual resources as well as the lexicons and grammatical knowledge in the case of text processing from scratch. Despite the difficulties, the paper shows that certain facts can be extracted relatively easily. These promising results support the claim that the Information Extraction approach is helpful for obtaining the specific medical statements which are described in the PR texts. We plan to develop algorithms for discovering more complex relations and other dependences, which is a target for our future work.

7. Acknowledgements

The research work presented in this paper is supported by grant DO 02-292/December 2008 "Effective search of conceptual information with applications in medical informatics", funded by the Bulgarian National Science Fund in 2009-2011.

8. References

BDA Bulgarian Drug Agency (2010). See the site http://www.bda.bg/index.php?lang=en
Boytcheva, S., A. Strupchanska, E. Paskaleva, and D. Tcharaktchiev (2005). Some Aspects of Negation Processing in Electronic Health Records. In Proc. of the Int. Workshop "Language and Speech Infrastructure for Information Access in the Balkan Countries", held in conjunction with RANLP-05, Borovets, Bulgaria, pp. 1-8.
Boytcheva, S., I. Nikolova, E. Paskaleva, G. Angelova, D. Tcharaktchiev and N. Dimitrova (2009). Extraction and Exploration of Correlations in Patient Status Data. In: Savova, G., V. Karkaletsis and G. Angelova (Eds.), Biomedical Information Extraction, Proc. of the Int. Workshop held in conjunction with RANLP-09, Borovets, Bulgaria, pp. 1-7.
caTIES Cancer Text Information Extraction System (2006). See https://cabig.nci.nih.gov/tools/caties.
CLEF Clinical E-Science Framework (2008), University of Sheffield. See http://nlp.shef.ac.uk/clef/.
Friedman, C. (1997). Towards a comprehensive medical language processing system: methods and issues. Proc. AMIA Annual Fall Symposium, pp. 595-599.
HITEx Health Information Text Extraction (2006). See https://www.i2b2.org/software/projects/hitex/hitex_manual.html.
Savova, G. K., K. Kipper-Schuler, J. D. Buntrock, and Ch. G. Chute (2008). UIMA-based Clinical Information Extraction System. LREC 2008 Workshop W16: Towards enhanced interoperability for large HLT systems: UIMA for NLP.
Spasic, I., S. Ananiadou, J. McNaught, and A. Kumar (2005). Text mining and ontologies in biomedicine: Making sense of raw text. Briefings in Bioinformatics, Vol. 6(3), Oxford University Press, pp. 239-251.
Valderrábanos, A., A. Belskis, and L. I. Moreno (2002). Multilingual Terminology Extraction and Validation. In Proc. LREC 2002 (3rd Int. Conf. on Language Resources and Evaluation), Gran Canaria.

Annotation of all coreference in biomedical text: Guideline selection and adaptation

K. Bretonnel Cohen1,2, Arrick Lanfranchi2, William Corvey2, William A. Baumgartner Jr.1, Christophe Roeder1, Philip V. Ogren1,3, Martha Palmer2, Lawrence E. Hunter1

1: Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, USA
2: Department of Linguistics, University of Colorado at Boulder, Boulder, Colorado, USA
3: Department of Computer Science, University of Colorado at Boulder, Boulder, Colorado, USA

Abstract
This paper describes an effort to build a corpus of full-text journal articles in which every co-referring noun phrase is annotated. The identity and appositive relations were marked up. Several annotation schemas were evaluated and are described here; the OntoNotes guidelines were selected. Biomedical journal articles required a number of adaptations to the OntoNotes guidelines, mainly doing away with the notion of generics, which also had implications for the handling of nominal modifiers. Domain experts and linguists were evaluated with respect to their ability to function as annotators, and both were found to be effective. Progress is reported with about one third of the project done; inter-annotator agreement at this stage is 0.684 by the MUC metric.

1. Introduction
The Colorado Richly Annotated Full Text (CRAFT) corpus is a set of 97 full-text journal articles that is currently being annotated in a joint project between the University of Colorado School of Medicine and the Linguistics Department of the University of Colorado at Boulder. The corpus contains about 597,000 words of text and is being annotated for a number of types of linguistic information, including part of speech and treebanking, and semantic information, including a variety of types of named entities. Additionally, we have included some discourse structure annotation in the form of coreference.
This project is unlike any other that we are aware of in that it includes marking all coreferential relations of identity and apposition between all noun phrases in the full text of a large body of publicly available biomedical publications. Other work (Gasperin, 2006; Gasperin et al., 2007) has done annotation of full biomedical text, but only of noun phrases referring to biological entities. In contrast, we mark up coreference between any and all noun groups in the documents. IIR has done full coreference and appositive annotation of full-text journal articles, but on a smaller set, and has not made them publicly available. The CRAFT corpus is about twice the size of the IIR document set and will be made publicly available. This is the most ambitious project of its kind of which we are aware.

2. Methods and Results
2.1. Selection of annotation guidelines
One of the desiderata of the project was to contribute to the development of a standard for coreference annotation by adopting a pre-existing set of annotation guidelines, rather than developing our own de novo. To this end, a number of publicly available annotation guidelines were evaluated. (We did not consider projects that only tackled pronominal anaphora, to the exclusion of full noun phrase coreference, or projects that tackled clinical data.) We were initially relatively agnostic as to desiderata, other than that we wanted the guidelines to include a full range of coreferential and bridging anaphoric relations, as well as part/whole relationships.

2.1.1. OntoNotes
OntoNotes (Hovy et al., 2006) is a large, multi-center project to create a multi-lingual, multi-genre corpus annotated at a variety of linguistic levels, including coreference (Pradhan et al., 2007). As part of the OntoNotes project, the BBN Corporation prepared a set of annotation guidelines. They are not publicly available. (The version currently available online at the Linguistic Data Consortium is a full version number out of date, compared to the version that we used.) However, (Pradhan et al., 2007) gives the flavor of the approach.
The OntoNotes guidelines include events, pronominal and full anaphora and coreferents, and verbs as markables. Predicative nouns are not treated as coreferential. There is a separate relation for appositives. Nominal premodifiers are markables, with some restrictions that we discuss below in the section on domain-specific changes to the guidelines. The guidelines only include the identity and appositive relations, with set membership and part/whole not included, but this turned out not to be relevant to our project, since funding constraints precluded annotating these relations.

2.1.2. Gasperin
Gasperin (Gasperin, 2006; Gasperin et al., 2007) is the only previous attempt that we are aware of to annotate the full text of biomedical journal articles. Her project involved annotating five such articles. The annotation guidelines reflect a nuanced attempt to capture the semantics of the biological domain. They do this in part by only selecting biomedical entities as markables, but more importantly by defining a domain-relevant set of relations. In addition to coreference, the relations are three types of associative relations: homology; related "biotype" (e.g. a gene and its protein, or a gene and a subsequence of that gene); and the set/member relation.
Appositives and predicative nouns are both treated as coreferential. Premodifiers are not specifically addressed in the guidelines. Pronouns are excluded completely. Probably the most striking aspect of Gasperin's guidelines is their definition of markables: they are limited to "bio-entities." This was a major mismatch with our goals, which included annotating all coreferential entities.

2.1.3. GENIA project at IIR
The Institute for Infocomm Research in Singapore and the Department of Computer Science spearheaded an annotation project involving the 2,000 abstracts in the GENIA corpus (Yang et al., 2004a; Yang et al., 2004b) as well as 43 full papers. They annotated four relations: identity, appositives, pronouns (considered separately from other identity relations), and relative pronouns. The annotation of markables included a minimal string that would suffice for identification. As in the case of the other projects besides OntoNotes, they annotated only nouns, not verbs; the guidelines allowed for these, but they did not find any while preparing the guidelines. Premodifiers were not considered markables. Predicative nouns were marked when they were definite.

2.1.4. MUC7
The groundbreaking MUC7 guidelines (Hirschman, 1997) have underlain a number of subsequent coreference annotation projects. The scheme covered only a single relation, identity. The annotation of markables included a minimal string that would suffice for identification of the coreferring element. Noun phrases and pronouns were included as markables. Gerunds were excluded as markables. Appositives and predicate nominals were both marked, as coreferential. Prenominal modifiers were annotated only in the case where they could be linked to something other than another prenominal modifier.
A number of authors have critiqued various aspects of the MUC7 coreference annotation guidelines, e.g. (van Deemter and Kibble, 2001). A strong point of the MUC7 guidelines is that they make an attempt to deal with changing numerical values, as in "The results of this analysis showed that the statistical support for the linkage of the C57BL/6 locus on Chromosome 3 for ANA increased from logarithm of odds (LOD) 5.4 to LOD 6.4", which other guidelines do not, although even the MUC7 organizers were not happy with their approach to this. We note that a number of other problems pointed out with the MUC7 guidelines, including situations where the referential status of the markable is unclear, are present in all of the other guidelines that we examined as well.

2.2. Domain-specific changes to the guidelines
After reviewing the pre-existing guidelines, senior annotators marked up a sample full-text article, following the OntoNotes guidelines. We found the OntoNotes guidelines to be a good match to our conception of how coreference should be annotated, and noted that they have responded to a number of critiques of earlier guidelines. For example, compared to the MUC-7 guidelines, the treatment of appositives in terms of heads and attributes rather than separate mentions is an improvement in terms of referential status, as is the handling of predicative nouns. The inclusion of verbs and events is a desirable increase in scope. The guidelines are more detailed, as well. They were also attractive from a political point of view: we wanted to adopt a widely accepted set of guidelines, and they seemed like a good candidate for this due to their association with a large project. We adopted them; however, the nature of the biomedical domain required a major adaptation of the guidelines.

2.2.1. Generics
The OntoNotes guidelines make crucial reference to a category of nominal that they refer to as a generic. Generics include:
• bare plurals
• indefinite noun phrases
• abstract and underspecified nouns
The status of generics in the annotation guidelines is that they cannot be linked to each other via the IDENTITY relation. They can be linked with subsequent non-generics, but never to each other, so every generic starts a new IDENTITY chain (assuming that it does corefer with subsequent markables).
The notion of a generic is problematic in the biomedical domain. The reason for this is that any referring expression in a biomedical text is or should be a member of some biomedical ontology, be it in the set of Open Biomedical Ontologies, the Unified Medical Language System, or a nascent ontology. As such, it has the status of a named entity. To take an example from BBN, consider the status of cataract surgery in the following:

  Allergan Inc. said it received approval to sell the PhacoFlex intraocular lens, the first foldable silicone lens available for cataract surgery. The lens' foldability enables it to be inserted in smaller incisions than are now possible for cataract surgery.

According to the OntoNotes guidelines, cataract surgery is a generic, by virtue of being abstract or underspecified, and therefore the two noun phrases are not linked to each other via the IDENTITY relation.
However, cataract surgery is a concept within the Unified Medical Language System (Concept Unique Identifier C1705869), where it occurs as part of the SNOMED Clinical Terms. As such, it is a named entity like any other biomedical ontology concept, and should not be considered generic. Indeed, it is easy to find examples of sentences in the biomedical literature in which we would want to extract information about the term cataract surgery when it occurs in contexts in which the OntoNotes guidelines would consider it generic:
• Intravitreal administration of 1.25 mg bevacizumab at the time of cataract surgery was safe and effective in preventing the progression of DR and diabetic maculopathy in patients with cataract and DR. (PMID 19101420)
• Acute Endophthalmitis After Cataract Surgery: 250 Consecutive Cases Treated at a Tertiary Referral Center in the Netherlands. (PMID 20053391)
• TRO can present shortly after cataract surgery and lead to serious vision threatening complications. (TRO is thyroid-related orbitopathy; PMID 19929665)
In these examples, we might want to extract an IS_ASSOCIATED_WITH relation between [...], [...], and [...]. This makes it important to be able to resolve coreference with them.
Thus, our project's guidelines do not consider there to be generics as such in the genre and domain that it is concerned with. (Note that in our guidelines, as in the OntoNotes project, indefinite noun phrases are used to start new IDENTITY chains and are not linked with previous markables, but this is because they are discourse-new, not because we consider them to be generics.)

2.2.2. Prenominal modifiers
A related issue concerned the annotation of prenominal modifiers. The OntoNotes guidelines call for prenominal modifiers to be annotated only when they are proper nouns. However, since we considered all entities to be named entities, our guidelines called for annotation of prenominal modifiers regardless of whether or not they were proper nouns, as such.

2.3. The annotation schema
2.3.1. Noun groups
The basic unit of annotation in the project is the base noun phrase. We defined this as one or more nouns and any sequence of leftward determiners, adjectives, and conjunctions not separated by a preposition or other noun-phrase-delimiting part of speech, and rightward modifiers such as relative clauses and prepositional phrases.
Thus, all of the following would be considered base noun phrases:
• striatal volume
• neural number
• striatal volume and neural number
• the structure of the basal ganglia
• It
These were not pre-annotated; the annotators selected their spans themselves. This is one potential source of lack of interannotator agreement. Base noun phrases were annotated only when they participated in one of the two relationships that we targeted.

2.3.2. Definitions of the two relations
The two relations that are annotated in the corpus are the IDENTITY relation and the APPOSITIVE relation. The identity relation holds when two units of annotation refer to the same thing in the world. The appositive annotation holds when two noun phrases, or a noun phrase and an appositive, are adjacent and not linked by a copula or other linking word.

2.3.3. Details of the annotation schema
More specifically, the annotation schema is defined as:
IDENTITY chain: An IDENTITY chain is a set of base noun phrases and/or appositives that refer to the same thing in the world. It can contain any number of elements.
Base noun phrase: Discussed above.
APPOSITIVE: Either the head or the attributes may themselves be appositives.
Nonreferential pronoun: All nonreferential pronouns are included in this single class.
Thus, an example set of annotations would be: All brains analyzed in this study are part of [the Mouse Brain Library]_a ([MBL]_b). [The MBL]_c is both a physical and Internet resource. (PMID 11319941)
• APPOSITIVE chain: The Mouse Brain Library_a, MBL_b
• IDENTITY chain: the Mouse Brain Library_a, The MBL_c

2.4. Training of the annotators
We hired and trained biologists and linguists as a group. Annotators were given a lecture on the phenomenon of coreference and on how to recognize coreferential and appositive relations, as well as nonreferential pronouns. They were then given a non-domain-specific practice document. Following a separate session on the use of the annotation tool, they were given an actual document to annotate. This document is quite challenging, and exercised all of the necessary annotation skills. We began with paired annotation, then introduced a second document for each annotator to mark up individually. Once annotators moved on to individual training annotation, they met extensively with a senior annotator to discuss questions and review their final annotations.
During the initial training phase, we paired biologists with linguists and had them work on the same article independently, then compare results. This turned out to be an unnecessary step, and we soon switched to having annotators work independently from the beginning.

2.5. Two populations of annotators
We hired two very different types of annotators: linguistics graduate students, and biologists at varying levels of education and with varying specialties. Impressionistically, we did not notice any difference in their performance. The biologists were able to grasp the concept of coreference, and the linguists did not find their lack of domain knowledge to be an obstacle to annotation. Both the biologists and the linguists had opportunities to clarify issues via email and meetings with the senior annotators.

2.6. The annotation process
Most articles are single-annotated, but a subset of fifteen will be double-annotated by random pairs of annotators to calculate inter-annotator agreement.
The length of the articles means that a single IDENTITY chain can extend over an exceptionally long distance. To cope with this, annotators typically marked up single paragraphs as a whole, and then linked entities in that paragraph to earlier mentions in the document.
In the case of questions, annotators had access to senior annotators via email and meetings.
Annotation was done using Knowtator, a Protégé plug-in (Ogren, 2006a; Ogren, 2006b).

2.7. Progress to date
Table 1 shows the total number of documents, IDENTITY chains, APPOSITIVE chains, nonreferential pronouns, and base noun phrases with about one third of the project done. (In calculating these numbers, when a document had been double-annotated, we took the average of the two annotators for that document.) These numbers give some idea of the scale of the task of annotating all coreferential noun phrases in full-text scientific journal articles, along with the data in Table 2, where we see the averages. There we note especially that the average time to annotate a single article is twenty hours. This number should be useful for estimating the funding needed for future annotation projects of this sort.

Documents annotated | 31
Total IDENT chains | 6,650
Total APPOS chains | 1,230
Total nonreferential pronouns | 296
Total base noun phrases | 27,870
Table 1: Total documents and markables to date. Base noun phrase count includes only those base noun phrases included within an IDENT or APPOS chain.

Average time per document | 20 hours
Average IDENT chains per document | 215
Average APPOS chains per document | 40
Average nonreferential pronouns per document | 10
Average base noun phrases per document | 899
Table 2: Average markables to date. Base noun phrase count includes only those base noun phrases included within an IDENT or APPOS chain.

The low number of nonreferential pronouns is striking, but accords with the work of Gasperin, who reported numbers of nonreferential pronouns in her full-text articles that were so low that she omitted them from the annotation schema. In contrast, the high number of base noun phrases is quite striking. Recall that our process includes the annotators marking the boundaries of the base noun phrases themselves; this high number suggests that significant time savings might be realized by pre-marking the syntactic constituents of the papers. We also note from early work by Hirschman et al. that this would likely increase our inter-annotator agreement.
Average inter-annotator agreement over a set of ten articles is .684 by the MUC metric. We give a number of other metrics in Table 3 (MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), CEAF (Luo, 2005), and Krippendorff's alpha (Passonneau, 2004; Krippendorff, 1980)). We note that the value for Krippendorff's alpha is lower than the 0.67 that Krippendorff indicates must be obtained before values can be conclusive, but no other IAA values for projects using the OntoNotes guidelines have been published to compare these numbers to. Note again that these are preliminary numbers representing progress to date only and do not represent the inter-annotator agreement values for the completed project.

Metric | Average
MUC | .684
Class-B3 | .858
Entity-B3 | .750
Mention-based CEAF | .644
Entity-based CEAF | .480
Krippendorff's alpha | .619
Table 3: Inter-annotator agreement values for a sample of ten documents. A variety of metrics is reported.

3. Discussion
We found that published and unpublished annotation guidelines for coreference differ widely with respect to the definitions of markables, handling of appositives, handling of predicative noun phrases, handling of nominal premodifiers, and the inclusion of other relations in addition to coreference. A number of them are compared and contrasted here.
We also found that a candidate for a standard set of coreference guidelines, the OntoNotes guidelines, required considerable adaptation to fit the nature of the biomedical domain, particularly with respect to the notion of generics, which play a large role in the OntoNotes guidelines. The description of our annotation process describes how these guidelines can be incorporated into an active annotation project, and our report on progress to date gives an early indication of the scale of the data that will result from annotation to these guidelines.
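As a concrete reference point for the agreement figures reported above, the link-based MUC score of Vilain et al. (1995) can be re-implemented in a few lines. This is an illustrative, independent sketch, not the Mayo Clinic scoring code acknowledged below; chains are given as sets of mention identifiers, and precision is recall with the key and response roles swapped.

```python
def muc_score(key, response):
    """MUC recall, precision and F1 (Vilain et al., 1995).
    key, response: lists of chains; each chain is a set of hashable mention ids."""
    def recall(gold_chains, sys_chains):
        num, den = 0, 0
        for chain in gold_chains:
            # Partition the gold chain by the system chains; mentions not found
            # in any system chain each count as their own partition.
            partitions, singletons = set(), 0
            for mention in chain:
                for i, sys_chain in enumerate(sys_chains):
                    if mention in sys_chain:
                        partitions.add(i)
                        break
                else:
                    singletons += 1
            p = len(partitions) + singletons
            num += len(chain) - p
            den += len(chain) - 1
        return num / den if den else 0.0

    r = recall(key, response)
    p = recall(response, key)   # precision = recall with the roles swapped
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f

# Toy example: the two annotators agree on one of two coreference links.
ann1 = [{"m1", "m2", "m3"}]
ann2 = [{"m1", "m2"}, {"m3", "m4"}]
print(muc_score(ann1, ann2))   # (0.5, 0.5, 0.5)
```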

Acknowledgements
We thank the BBN Corporation for sharing the OntoNotes coreference annotation guidelines with us, and the other writers of guidelines for making their guidelines publicly available. We also thank Guergana Savova and Jiaping Zheng of the Mayo Clinic for sharing their inter-annotator agreement code with us. Two anonymous reviewers improved the paper.

4. References
Amit Bagga and Breck Baldwin. 1998. Algorithms for scoring coreference chains. In Proceedings of the Linguistic Coreference Workshop at the First International Conference on Language Resources and Evaluation (LREC '98), pages 563-566.
Caroline Gasperin, Nikiforos Karamanis, and Ruth Seal. 2007. Annotation of anaphoric relations in biomedical full-text articles using a domain-relevant scheme. In Proceedings of DAARC 2007.
Caroline Gasperin. 2006. Semi-supervised anaphora resolution in biomedical texts. In Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis, pages 96-103. Association for Computational Linguistics.
L. Hirschman. 1997. MUC-7 coreference task definition.
Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: The 90% solution. In Human Language Technology Conference of the NAACL, Companion Volume, pages 57-60.
Klaus Krippendorff. 1980. Content Analysis: An Introduction to Its Methodology (Commtext Series). SAGE Publications, September.
Xiaoqiang Luo. 2005. On coreference resolution performance metrics. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 25-32.
Philip Ogren. 2006a. Knowtator: a Protégé plugin for annotated corpus construction. In HLT-NAACL 2006 Companion Volume.
Philip Ogren. 2006b. Knowtator: a plug-in for creating training and evaluation data sets for biomedical natural language systems. In The International Protégé Conference, pages 73-76.
Rebecca J. Passonneau. 2004. Computing reliability for coreference annotation. In Proceedings of the Language Resources and Evaluation Conference.
Sameer S. Pradhan, Lance Ramshaw, Ralph Weischedel, Jessica MacBride, and Linnea Micciulla. 2007. Unrestricted coreference: identifying entities and events in OntoNotes. In ICSC '07: Proceedings of the International Conference on Semantic Computing, pages 446-453.
Kees van Deemter and Rodger Kibble. 2001. On coreferring: Coreference in MUC and related annotation schemes. Computational Linguistics, 26(4):629-637.
Marc Vilain, John Burger, John Aberdeen, Dennis Connolly, and Lynette Hirschman. 1995. A model-theoretic coreference scoring scheme. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 45-52.
Xiao Feng Yang, Jian Su, Guo Dong Zhou, and Chew Lim Tan. 2004a. A NP-cluster based approach to coreference resolution. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), pages 226-232.
Xiaofeng Yang, Guodong Zhou, Jian Su, and Chew Lim Tan. 2004b. Improving noun phrase coreference resolution by matching strings. In IJCNLP04, pages 326-333.

Contribution of Syntactic Functions to Single Word Term Extraction

Xing Zhang1 and Alex Chengyu Fang2

Dialogue Systems Group, Department of Chinese, Translation and Linguistics, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong SAR
[email protected] [email protected]

Abstract

This paper investigates what contributions syntactic functions can make towards single word term extraction. It examines the probabilistic relations between medical terms and their syntactic functions; by probabilistic relations, we mean the relations between term occurrence ratios and different paths of syntactic functions. An Automatic Term Extraction (ATE) system based on such extended syntactic information is built to find out which paths of syntactic functions are good indicators of terms, after training on a large medical corpus drawn from MEDLINE. Accordingly, term candidates occurring in these syntactic paths are assigned higher weights for better probabilistic estimates, and the syntactic paths most helpful for identifying terms are found. One linguistically motivated method, SF-Value, is proposed to weight the termhood of term candidates. Experimental results show that, in addition to multiword terms, single word terms are extracted dominantly and at a fairly good recall. The syntactic behaviour of single word terms thus proves especially effective in selecting single word terms. Overall, this work studies the actual usage of terms in real texts rather than a static description of their internal structure; it dynamically characterizes patterns of term usage in greater depth, and this information will in turn contribute to practical ATE systems.

1. Introduction
Terms usually refer to the linguistic manifestation of concepts in a specific domain. More specifically, terms are the linguistic expression of the concepts of special communication and are organized into systems of terms which ideally reflect the associated conceptual system (Ananiadou, 1994). Generally, terms are divided into single word terms and multi word terms. Past work holds differing opinions on the level of challenge these two types of terms pose. Wermter and Hahn (2005) think the recognition of single-word terms usually does not pose any particular challenge, and that it is multiword terms that are much more difficult. Other researchers believe single word terms are much more difficult to recognize, because semantic information is needed to distinguish between the general usage of a word and its terminological usage (Eumeridou et al., 2004) and it is statistically very difficult to capture domain-specific single-word terms (Sclano & Velardi, 2007).
With respect to applications of ATE, much work has been devoted to multiword extraction. Methods used include morpho-syntactic properties (Daille, 1996) and classic statistical measures such as TF·IDF (Salton, 1988), Mutual Information (Church, 1989), the Log-Likelihood Ratio (Dunning, 1993), the Dice Factor (Smadja et al., 1996), etc. The C-Value measure (Frantzi & Ananiadou, 1998) is widely considered the state-of-the-art model for ATE, and can also perform well on other languages such as Japanese (Mima & Ananiadou, 2001) and Slovene (Vintar, 2004).
As for single word term extraction, limited work has been done. Termight (Dagan & Church, 1994) simply defines single-word candidates by taking the list of all words that occur in the document and do not appear in a standard stop-list of "noise" words. Xu et al. (2002) designed a TFIDF-based single word term classifier. Bernhard (2006) presents a pattern-based technique to extract single word terms, which is based on classical word-forming units, e.g. prefixes (extra-, anti-), initial combining forms (hydro-, pharmaco-) and suffixes (-ism). Corpora comparison methods were used in Rayson & Garside (2000), Baroni & Bernardini (2004), and Kit & Liu (2008).
This paper aims to investigate what contributions syntactic functions can make towards single word term extraction. It presents a linguistically grounded method to measure the termhood of single word terms. The intuition of this work is that terms tend to play certain kinds of syntactic functions more prominently, and this syntactic behavior of terms can be captured as termhood by computing term ratios in different syntactic paths. A syntactic path in this work refers to a path of syntactic functions of one NP. Specifically, it is defined as the concatenation of elementary syntactic functions tagged by the Survey Parser.
Term ratios are defined as the frequencies of term occurrences in each syntactic path over all term occurrence frequencies in all syntactic paths. This work studies the correlation between terms and syntactic functions through careful analysis of real experiments with theoretical insights. The corpus it uses is built from MEDLINE, and its performance is compared with two existing term extractors, TerMine and TermExtractor (Sclano & Velardi, 2007). (MEDLINE is the National Library of Medicine's premier bibliographic database; MeSH is the U.S. National Library of Medicine's controlled vocabulary used for indexing articles for MEDLINE.)

2. Methods and Experiments
This study proposes a method to measure the probabilistic relations between terms and their syntactic functions. The method is a weighting scheme based on the term ratios in the syntactic paths of term candidates in parsed texts. The overall architecture of the system includes three major modules. The first module gets abstracts from MEDLINE; it creates the experimental corpus from the MEDLINE database. The second module creates a term list from MeSH and annotates the corpus according to this term list; another major function of the second module is to compute term ratios in different syntactic paths. The third module uses the knowledge of term ratios in different syntactic paths to assign different weights to different syntactic paths and then compute the Syntactic Function Value (SF-Value) of each term candidate using the following formula:

SF-Value = Σ_{i=1}^{n} FSS_i × WSS_i

This formula contains two parameters: FSS_i is the frequency of syntactic path_i, WSS_i is the weight of syntactic path_i, and n is the number of syntactic paths in which the term candidate occurs. WSS_i is computed beforehand from training on a corpus annotated with MeSH terms. It is computed as the proportion of the term occurrence frequency in this syntactic path among the total term occurrence frequencies of all syntactic paths. Therefore, WSS_i is higher for syntactic paths that are more likely to be filled by terms than other syntactic paths, and SF-Value is higher for terms that are present in syntactic functions with a higher WSS_i. Moreover, SF-Value is higher for NPs that occur more often in syntactic paths that are themselves more often occupied by terms.

2.1 Resource Building and Processing
2.1.1 Corpora Building up
This study built a small subset of MEDLINE abstracts based on a controlled search of the database using the keyword internal medicine. This search produces 252,033 abstracts (until 17 July 2008). Each abstract consists of a single title and a number of sentences. One sub-corpus of 360 abstracts was manually checked by human professionals for possible tagging mistakes of syntactic functions after they were parsed by the Survey Parser (Fang, 1996). A list of medical terms was created from Medical Subject Headings (MeSH 2009) beforehand. This MeSH term list consists of 602,436 terms, of which 430,848 are multi word terms and 171,588 are single word terms. These corpora are then terminologically annotated: noun phrases that match the term list are tagged as terms.
This manually checked sub-corpus is of 82,055 words and is further divided randomly into ten subsets of 36 abstracts each. Each time, nine of the ten subsets are used for training, and the one left out is used as the testing set; the whole procedure is repeated 10 times. Such ten-fold cross-validation allows the greatest possible amount of data to be used for training in each iteration and also allows accuracy to be predicted for unseen data sets.
The gold standard used in this work is the number of true MeSH terms in the testing corpora. In order to get all true MeSH terms in the testing corpora, N-grams (N from 1 to 10) are first extracted from each corpus and matched against the MeSH term list. The N-gram matching method is employed in order to avoid effects from parsers, because different parsers output different NP lists, which would lead to different term candidates at the outset; the absolute number of MeSH terms should therefore be parser-independent. The following table (Table 1) presents basic statistics of the testing subsets.

2.1.2 Survey Parser
The Survey Parser was first designed to complete the syntactic annotation of the International Corpus of English. It effectively parses sentences in many layers with detailed syntactic functions. The unique feature of the parsing scheme is that it analyzes the syntactic functions of the constituent paths and represents them in the form of a parsing tree. The Survey Parser also classifies syntactic functions into two major kinds: one is phrasal functions, and the other is clausal functions, which correspond to the basic elements of English sentences such as subject, verb, object, complement and adverbial. For example, if cells is tagged as a term, its syntactic path may be recorded as "NPHD-N%PC-NP%A-PP", which means that it is a noun with the function NP head, which is part of a larger NP with the function preposition complement, which is in turn part of a prepositional phrase with the function adverbial.

2.1.3 Stop List
In this experiment, a stop word list is created, which consists of a few frequent grammatical words, such as definite articles, demonstrative and possessive adjectives, and indefinite articles. Words in the stop list are uninformative for terminology extraction. The aim of using a 'stop word' list is to remove very frequent words which are not considered to carry terminological meanings.
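The WSS and SF-Value definitions above translate directly into a frequency-counting procedure. A minimal sketch, assuming each training occurrence of a MeSH term has already been paired with its syntactic path, and each test NP with the paths and frequencies in which it occurs; the variable and function names (train_wss, sf_value, etc.) are hypothetical, not from the paper.

```python
from collections import Counter

def train_wss(train_term_paths):
    """train_term_paths: one syntactic path per MeSH-term occurrence in the
    training corpus. WSS for a path is its share of all term occurrences."""
    counts = Counter(train_term_paths)
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}

def sf_value(candidate_paths, wss):
    """candidate_paths: {syntactic_path: frequency} for one NP in the test text.
    SF-Value = sum over paths of FSS_i * WSS_i."""
    return sum(freq * wss.get(path, 0.0) for path, freq in candidate_paths.items())

# Toy example with three paths observed in training.
wss = train_wss(["SU-NP", "SU-NP", "OD-NP", "PC-NP%A-PP"])
print(sf_value({"SU-NP": 3, "PC-NP%A-PP": 1}, wss))   # 3*0.5 + 1*0.25 = 1.75
```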

Testing corpora | # of words | # of parsed sentences | All MeSH terms | Multi MeSH terms | Single MeSH terms
subset 1 | 7,929 | 356 | 466 | 199 | 267
subset 2 | 7,576 | 313 | 429 | 120 | 309
subset 3 | 8,355 | 374 | 447 | 117 | 330
subset 4 | 7,693 | 325 | 479 | 152 | 327
subset 5 | 8,562 | 391 | 490 | 137 | 353
subset 6 | 8,562 | 357 | 494 | 138 | 356
subset 7 | 8,562 | 334 | 525 | 151 | 374
subset 8 | 8,353 | 381 | 515 | 140 | 375
subset 9 | 8,675 | 375 | 475 | 134 | 341
subset 10 | 7,788 | 343 | 524 | 157 | 367
Table 1: Basic Statistics of Testing Corpora

2.2 Experimental Setup
After parsing the training and testing texts with the Survey Parser, the experiment is realized in the second and third modules introduced earlier. The second module mainly annotates the training corpora with MeSH terms and computes the term ratios of syntactic paths. The procedure includes the following steps:
• Match NPs in the training corpora with terms in the MeSH term list.
• Record the syntactic paths in which MeSH terms are identified.
• Calculate the term occurrence frequencies of these syntactic paths and compute their term ratios.
• Compute a weight (WSS) for each of these syntactic paths.
The third module computes the SF-Value for each term candidate:
• Input the testing texts and extract all NPs with their frequencies.
• Use the stop list to filter the NPs, and delete those with a length larger than 10 words.
• Produce an NP list.
• For each NP, compute the frequencies of the syntactic paths in which this NP occurs.
• Compute the SF-Value for each NP, and arrange the NPs in descending order.
• Set a threshold value for SF-Value; NPs with an SF-Value above this threshold are considered terms.
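The "All / Multi / Single MeSH terms" counts in Table 1 come from the exhaustive N-gram matching (N = 1 to 10) against the MeSH list described in Section 2.1.1, so that the gold standard does not depend on any particular parser's NP output. A minimal sketch under that description; mesh_terms is a hypothetical pre-loaded set of lower-cased MeSH strings, and whether duplicate matches are counted once or per occurrence is an assumption left to the caller.

```python
def gold_mesh_terms(tokens, mesh_terms, max_n=10):
    """Collect MeSH terms in a tokenised document by exhaustive N-gram matching
    (N = 1..max_n); every matching occurrence is recorded."""
    found = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in mesh_terms:
                found.append(candidate)
    return found

# Single- vs multi-word split, as reported in Table 1:
hits = gold_mesh_terms("the basal ganglia of the mouse".split(), {"basal ganglia", "mouse"})
single = [t for t in hits if " " not in t]   # ['mouse']
multi = [t for t in hits if " " in t]        # ['basal ganglia']
```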
This study adopts all the symbols used in the Survey Parser; for example, SU stands for subject, PC stands for prepositional complement, and so on (see Appendix 1). '%' is used to indicate a node of a higher level, and '+' is used to indicate two nodes of the same level.

2.2.1 Results of Term Ratios of Syntactic Paths
From the first module, around three hundred kinds of syntactic paths are recorded if clausal functions are taken as ending nodes. However, only a few paths account for most occurrences, such as SU-NP and PC-NP%A-PP, while each of the rest has quite a low frequency. Therefore, in order to deal with sparse data while keeping the distinctive features of these syntactic functions to a great extent, this research conflates those syntactic paths that share the same beginning and ending nodes. This conflation promotes the ranking of some syntactic paths by grouping together paths with extremely low term occurrences.

2.2.2 Syntactic Paths Ending in Clausal Functions
The following table lists the syntactic paths ranked highest in the training corpora (see Table 2).

Syntactic Paths | Frequency | Ratio
SU-NP | 2515 | 17.57%
OD-NP | 1657 | 11.58%
DEFUNC-NP | 491 | 3.43%
A-PP | 315 | 2.20%
VB-VP | 280 | 1.96%
PC-NP | 241 | 1.68%
APPOS-NP | 205 | 1.43%
NPPR-AJP+NPHD-N%PC-NP%A-PP | 156 | 1.09%
CS-NP | 145 | 1.01%
NPPR-AJP+NPHD-N%SU-NP | 91 | 0.64%
Table 2: Top Ten Term Ratios in Syntactic Paths Ending in Clausal Functions

From the table above, we can see that terms take the function SU (subject) most frequently, followed by those taking the function OD (direct object). The third ranking is the function DEFUNC (detached function), followed by the function A (adverbial). The accumulated ratios of these ten kinds of syntactic paths total around 45%, which indicates that nearly half of the terms occur in these ten paths.

2.2.3 Conflation of Syntactic Paths with the Same Beginning Node and Ending Node
Besides the paths discussed earlier, there are another 570 kinds of paths, each with a term ratio below 0.5%. Therefore, these paths are conflated before weights are allocated to them. The principle is that syntactic paths with the same beginning syntactic functions and the same ending clausal functions are conflated into a single group. For example:

Syntactic Paths | Term Ratios
NPPR-AJP+NPHD-N%PC-NP%NPPO-PP%SU-NP | 0.199%
NPPR-AJP+NPHD-N%PC-NP%NPPO-PP%PC-NP%NPPO-PP%SU-NP | 0.003%
NPPR-AJP+NPHD-N%NPPO-PP%NPPO-PP%NPPO-PP%SU-NP | 0.001%
Table 3: Conflation of Syntactic Paths

In this table, the starting syntactic functions of these three syntactic paths are all NPPR-AJP+NPHD-N, which means a node of NPPR-AJP together with a node of NPHD-N, and the ending syntactic functions are all SU-NP. Therefore, these three paths are conflated as NPPR-AJP+NPHD-N%SU-NP, and their term ratios add up to 0.203% after conflation.

Syntactic Paths | Frequency | Ratio
PC-NP%A-PP | 2855 | 25.29%
SU-NP | 1996 | 17.68%
OD-NP | 1296 | 11.48%
PC-NP%SU-NP | 711 | 6.30%
PC-NP%NPPO-PP | 431 | 3.82%
PC-NP%OD-NP | 426 | 3.77%
DEFUNC-NP | 371 | 3.29%
VB-VP | 300 | 2.66%
NPPR-AJP+NPHD-N%A-PP | 233 | 2.06%
A-PP | 230 | 2.04%
Table 4: Top 10 Syntactic Paths after Conflation

From the table above, the syntactic path PC-NP%A-PP is promoted to first place, while SU-NP and OD-NP rank next. The number of syntactic paths is reduced to 128 altogether, and the term ratio of each syntactic path is increased, which means the weighting strength of each path is enhanced.
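The conflation rule (same beginning node, same ending clausal function) can be expressed as a simple grouping over the path strings, summing the term ratios of the grouped paths. A sketch under the assumption that paths are strings with '%' separating levels, as in the examples above; the helper name conflate_paths is hypothetical.

```python
from collections import defaultdict

def conflate_paths(path_ratios):
    """path_ratios: {syntactic_path: term_ratio}. Paths sharing the same first
    node and the same ending clausal function are merged, and their ratios added."""
    merged = defaultdict(float)
    for path, ratio in path_ratios.items():
        nodes = path.split("%")
        key = nodes[0] + "%" + nodes[-1] if len(nodes) > 1 else path
        merged[key] += ratio
    return dict(merged)

# The three example paths from Table 3 collapse into one group:
ratios = {
    "NPPR-AJP+NPHD-N%PC-NP%NPPO-PP%SU-NP": 0.00199,
    "NPPR-AJP+NPHD-N%PC-NP%NPPO-PP%PC-NP%NPPO-PP%SU-NP": 0.00003,
    "NPPR-AJP+NPHD-N%NPPO-PP%NPPO-PP%NPPO-PP%SU-NP": 0.00001,
}
print(conflate_paths(ratios))   # ≈ {'NPPR-AJP+NPHD-N%SU-NP': 0.00203}
```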

3. Results Analysis

3.1 Comparison with Existing Term Extraction Systems
TerMine is the online service provided by the National Centre for Text Mining at the University of Manchester. It mainly employs C-Value to extract terms. As C-Value is designed for multiword terms, TerMine extracts multi-word terms only.

TermExtractor is the online service provided by the Linguistic Computing Laboratory of the University of Roma "La Sapienza". It uses domain relevance, domain consensus and lexical cohesion to weight term candidates. It lets users set the word length of terms to be extracted; therefore, if the minimum length is set to 1, single word terms will be extracted.
For comparison, the ten testing corpora are uploaded to TerMine and TermExtractor separately, and the results returned are matched against the MeSH term list from the 2009 MeSH files. As we can see from Table 5, the ATE system using SF-Value extracts many more single MeSH terms than either TerMine or TermExtractor. Take testing subset 1 as an example: TerMine output 1,239 terms, of which 110 are multi-word MeSH terms; TermExtractor output 266 terms, of which 51 are MeSH terms and only 2 are single word terms. Comparatively, the NPs extracted by SF-Value number 2,350, of which 1,023 are single-word NPs; among all these NPs, 419 are MeSH terms, and 309 of those are single MeSH terms, accounting for 73 percent.

3.2 Evaluation
For the evaluation of SF-Value, an automatic method is implemented in this ATE system. Within the interval from the minimum SF-Value to the maximum SF-Value, a set of threshold values is generated automatically; each time, the threshold is increased by a preset amount. Precision and recall are computed with respect to each threshold, and the F-score is computed as:

F-score = 2 × (precision × recall) / (precision + recall)

From Table 6, we can see that the recall of single MeSH terms is very high, with an average value of 0.86. The average F-score over all 10 testing corpora is 0.30 before conflation of syntactic paths. It is worth noting that the F-scores of these 10 testing corpora are significantly improved after conflating syntactic paths (p<0.05), and the average F-score reaches 0.36 after conflation. Based on these results, we find that SF-Value is especially effective in measuring the termhood of single word terms, which are an important part of term extraction.

Testing corpora | TerMine (all candidates / multi MeSH / single MeSH) | TermExtractor (all candidates / MeSH / single MeSH) | ATE system based on SF-Value (all NPs / multi NPs / single NPs / all MeSH / multi MeSH / single MeSH)
subset 1 | 1239 / 110 / 0 | 266 / 51 / 2 | 2350 / 1327 / 1023 / 419 / 110 / 309
subset 2 | 1179 / 114 / 0 | 261 / 46 / 1 | 2138 / 1196 / 942 / 374 / 112 / 262
subset 3 | 1296 / 119 / 0 | 277 / 65 / 4 | 2361 / 1379 / 982 / 391 / 122 / 269
subset 4 | 1384 / 132 / 0 | 277 / 68 / 4 | 2399 / 1386 / 1013 / 400 / 137 / 263
subset 5 | 1407 / 120 / 0 | 291 / 59 / 1 | 2577 / 1560 / 1017 / 423 / 134 / 289
subset 6 | 1265 / 127 / 0 | 305 / 60 / 3 | 2432 / 1459 / 979 / 438 / 130 / 308
subset 7 | 1388 / 132 / 0 | 273 / 62 / 1 | 2432 / 1428 / 1004 / 455 / 147 / 308
subset 8 | 1327 / 126 / 0 | 316 / 72 / 4 | 2568 / 1533 / 1035 / 451 / 132 / 319
subset 9 | 1387 / 123 / 0 | 218 / 49 / 1 | 2471 / 1450 / 1021 / 425 / 124 / 301
subset 10 | 1355 / 153 / 0 | 257 / 61 / 5 | 2417 / 1437 / 980 / 463 / 163 / 300
Table 5: Term Candidates Produced by TerMine, TermExtractor and the ATE System based on SF-Value

Testing corpora | Single MeSH terms by N-gram | Single MeSH terms by SF-Value | Recall | F-score before conflation | Threshold for F-score before conflation | F-score after conflation | Threshold for F-score after conflation
subset 1 | 267 | 309 | 1 | 0.34 | 0.03 | 0.39 | 0.13
subset 2 | 309 | 262 | 0.85 | 0.31 | 0.01 | 0.35 | 0.10
subset 3 | 330 | 269 | 0.82 | 0.29 | 0.06 | 0.37 | 0.10
subset 4 | 327 | 263 | 0.80 | 0.30 | 0.06 | 0.35 | 0.13
subset 5 | 353 | 289 | 0.82 | 0.28 | 0 | 0.33 | 0.08
subset 6 | 356 | 308 | 0.87 | 0.30 | 0 | 0.34 | 0.08
subset 7 | 374 | 308 | 0.82 | 0.31 | 0 | 0.35 | 0.13
subset 8 | 375 | 319 | 0.85 | 0.29 | 0.01 | 0.38 | 0.13
subset 9 | 341 | 301 | 0.88 | 0.29 | 0.01 | 0.34 | 0.13
subset 10 | 367 | 300 | 0.82 | 0.33 | 0.06 | 0.39 | 0.13
Average | | | 0.86 | 0.30 | | 0.36 |
Table 6: Performance of the ATE System based on SF-Value
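The threshold sweep behind Table 6 is straightforward to reproduce once every NP has an SF-Value. A minimal sketch, with hypothetical inputs scored_nps (NP string to SF-Value) and gold_terms (the MeSH gold standard for the subset); it reports the best F-score over an evenly spaced set of thresholds.

```python
def best_threshold(scored_nps, gold_terms, steps=100):
    """Sweep thresholds from the minimum to the maximum SF-Value and return
    (best_f, threshold, precision, recall) over the candidate NPs."""
    lo, hi = min(scored_nps.values()), max(scored_nps.values())
    best = (0.0, lo, 0.0, 0.0)
    for k in range(steps + 1):
        threshold = lo + (hi - lo) * k / steps
        predicted = {np for np, score in scored_nps.items() if score >= threshold}
        if not predicted:
            continue
        tp = len(predicted & gold_terms)
        precision = tp / len(predicted)
        recall = tp / len(gold_terms) if gold_terms else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f > best[0]:
            best = (f, threshold, precision, recall)
    return best

# Toy example: two of the three candidate NPs are gold terms.
print(best_threshold({"diagnosis": 0.89, "we": 1.89, "lymphoma": 0.0001},
                     {"diagnosis", "lymphoma"}))
```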

4. Conclusion and Future Work
This research shows that there is a probabilistic correlation between the syntactic functions of an NP in sentences and the termhood of this NP. One weighting measure, SF-Value, is implemented to compute the termhood of terms. This measure incorporates linguistic knowledge about the syntactic properties of terms and can assign effective values to term candidates. By setting threshold values, both multi word terms and single word terms can be selected effectively; in particular, single word terms can be selected at a dominant rate. The syntactic properties of single word terms can be measured by such a simple statistical value, which can be considered a practical indicator of single word terms.
The most innovative aspect of this research is the exploration of the contribution of syntactic functions to recognizing and extracting single word terms from texts. This approach represents a novel, linguistically motivated perspective in the area of terminology extraction. Most importantly, unlike other ATE systems that combine various terminology extraction techniques, this system relies mainly on the syntactic properties of terms, with the aim of studying the relations between terms and their syntactic functions. This feature allows us to obtain reliable statistics on term occurrences. Therefore, this research is valuable in that it shows the direct correlation between term occurrence and the syntactic functions of an NP, and it also demonstrates the effectiveness of such a method in distinguishing single word terms from non-terms.
In the future, this method should be validated for different domains. Moreover, during the linguistic processing of these corpora, it was found that different parsers present linguistic information of different granularities, which will subsequently affect the accuracy of term recognition. Therefore, different parsing results may be tested to find out how the performance of ATE can be influenced.

5. Acknowledgements
Research described in this article was supported in part by research grants from City University of Hong Kong (Project No. 7002387, 7008002, and 9610126).

6. References
Ananiadou, S. (1994). A methodology for automatic term recognition. In Proceedings of COLING 94, pp. 1034-1038.
Baroni, M. and Bernardini, S. (2004). Boot-CaT: Bootstrapping Corpora and Terms from the Web. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), pp. 1313-1316.
Bernhard, D. (2006). Multilingual Term Extraction from Domain-specific Corpora Using Morphological Structure. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL-06): Posters and Demonstrations, pp. 171-174.
Church, K. (1989). Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the ACL, Vancouver, pp. 76-83.
Dagan, I. & Church, K. (1994). Termight: identifying and translating technical terminology. In Proceedings of the Fourth Conference on Applied Natural Language Processing, pp. 34-40.
Daille, B. (1996). Study and Implementation of Combined Techniques for Automatic Extraction of Terminology. In J. Klavans & P. Resnik (Eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language, Cambridge, MA: The MIT Press, pp. 49-66.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1), pp. 61-74.
Eumeridou, E., Nkwenti-Azeh, B. & McNaught, J. (2004). An Analysis of Verb Subcategorization Frames in Three Special Language Corpora with a View towards Automatic Term Recognition. Computers and the Humanities 38, pp. 37-60.
Fang, A.C. (1996). The Survey Parser: Design and Development. In S. Greenbaum (Ed.), Comparing English World Wide: The International Corpus of English, Oxford: Oxford University Press, pp. 142-160.
Frantzi, K., Ananiadou, S. & Tsujii, J. (1998). The C-value/NC-value Method of Automatic Recognition of Multi-word Terms. In Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, pp. 585-604.
Sclano, F. and Velardi, P. (2007). TermExtractor: A Web application to learn the common terminology of interest groups and research communities. In Proceedings of the 9th Conference on Terminology and Artificial Intelligence, Sophia Antipolis, France.
Mima, H. and Ananiadou, S. (2001). An Application and Evaluation of the C/NC-Value Approach for the Automatic Term Recognition of Multi-Word Units in Japanese. International Journal on Terminology 6(2), pp. 175-194.
Kit, C. & Liu, X. (2008). Measuring Mono-word Termhood by Rank Difference via Corpus Comparison. Terminology 14(2), pp. 204-229.
Medical Subject Headings: http://www.nlm.nih.gov/pubs/factsheets/mesh.html.
Rayson, P. and Garside, R. (2000). Comparing Corpora using Frequency Profiling. In Proceedings of the ACL Workshop on Comparing Corpora, pp. 1-6.
Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management: an International Journal, 24(5), pp. 513-523.
Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: a statistical approach. Computational Linguistics, 22(1), pp. 1-38.
Vintar, S. (2004). Comparative Evaluation of C-value in the Treatment of Nested Terms. Memura 2004 - Methodologies and Evaluation of Multiword Units in Real-World Applications. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, pp. 54-57.
TerMine. (2009). NaCTeM website (http://nactem.ac.uk).
Xu, F., Kurz, D., Piskorski, J., and Schmeier, S. (2002). Term extraction and mining term relations from free-text documents in the financial domain. In Proceedings of the 5th International Conference on Business Information Systems (BIS'02), Poznan, Poland.
Wermter, J. and Hahn, U. (2005). Massive Biomedical Term Discovery. In Proceedings of the 8th International Conference on Discovery Science, pp. 281-293.


Appendix

1. Survey Parser Symbols

The phrasal categories and functions: Adverb (AVP): Premodifier (AVPR), Head (AVHD), Postmodifier (AVPO); Adjective (AJP): Premodifier (AJPR), Head (AJHD), Postmodifier (AJPO); Determiner (DTP): Premodifier (DTPR), Predeterminer (DTPE), Central determiner (DTCE), Postdeterminer (DTPS), Postmodifier (DTPO); Noun (NP): Determiner (DT), Premodifier (NPPR), Head (NPHD), Postmodifier (NPPO); Prepositional (PP): Modifier (PMOD), Prepositional (P), Complement (PC); Subordinator (SUBP): Modifier (SUBMO), Head (SUBHD); Verb (VP): Operator (OP), Auxiliary verb (AVB), Main verb (MVB).

The clausal categories and functions: Subject (SU), Provisional subject (PRSU), Notional subject (NOSU), Verb (VB), Predicate (PRED), Direct object (OD), Indirect object (OI), Provisional object (PROD), Notional object (NOOD), Subject complement (CS), Object complement (CO), Transitive complement (CT), Focus complement (CF), Adverbial (A), Cleft operator (CLOP), Existential operator (EXOP), Imperative operator (IMPOP), Interrogative operator (INTOP), Inversion operator (INVOP), Coordinator (COOR), Detached function (DEFUNC), Discourse marker (DISMK), Clause element (ELE), Focus (FOC), Linker (LK), Punctuation (PUNC), Subordinator (SUB).

2. Top Single Term Candidates Ranked by SF-Value (from testing subset 10)

The leftmost column takes two values, 1 or 0: 1 indicates that the NP is a true MeSH term, 0 indicates that it is not.

0  0.170745  background
0  0.132059  glanzmann
1  0.132059  thrombasthenia
0  0.179134  agt
0  0.197231  disorder
1  0.171540  alloantibodies
1  0.286158  autoantibodies
1  0.132059  paraproteins
1  0.887126  diagnosis
0  0.069618  assays
0  0.025571  consuming
0  0.823444  tests
1  0.457820  time
0  1.888486  we
0  0.632440  case
0  0.233235  detection
1  0.327183  methods
1  0.276609  male
0  0.000121  count
1  0.000121  lymphoma
0  1.743777  results
0  0.078183  normal
1  0.294236  ristocetin
0  0.077491  response
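Given the symbol inventory above and the path notation used throughout the paper ('-' joins a function to its category, '%' climbs to the parent node, '+' joins sister nodes), a path string can be unpacked mechanically. A small illustrative parser, not part of the Survey Parser itself:

```python
def parse_syntactic_path(path):
    """Split a Survey-Parser-style path such as 'NPHD-N%PC-NP%A-PP' into levels.
    Each level may contain several sister nodes joined by '+'; each node is a
    (function, category) pair joined by '-'."""
    levels = []
    for level in path.split("%"):          # '%' separates increasingly higher nodes
        nodes = []
        for node in level.split("+"):      # '+' separates sister nodes on one level
            function, _, category = node.partition("-")
            nodes.append((function, category))
        levels.append(nodes)
    return levels

# 'cells' as an NP head, inside a preposition complement, inside an adverbial PP:
print(parse_syntactic_path("NPHD-N%PC-NP%A-PP"))
# [[('NPHD', 'N')], [('PC', 'NP')], [('A', 'PP')]]
```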
