Extracting Molecular Binding Relationships from Biomedical Text
Thomas C. RINDFLESCH
National Library of Medicine
8600 Rockville Pike
Bethesda, MD, 20894
[email protected]

Jayant V. RAJAN
University of Rochester
Rochester, NY, 14620
Jayant.Rajan@mc.rochester.edu

Lawrence HUNTER
National Cancer Institute
7550 Wisconsin Avenue
Bethesda, MD, 20894
[email protected]
Abstract

ARBITER is a Prolog program that extracts assertions about macromolecular binding relationships from biomedical text. We describe the domain knowledge and the underspecified linguistic analyses that support the identification of these predications. After discussing a formal evaluation of ARBITER, we report on its application to 491,000 MEDLINE abstracts, during which almost 25,000 binding relationships suitable for entry into a database of macromolecular function were extracted.

Introduction

Far more scientific information exists in the literature than in any structured database. Convenient access to this information could significantly benefit research activities in various fields. The emerging technology of information

approximately 500,000 MEDLINE citations relevant to molecular binding affinity.

Our decision to apply information extraction technology to binding relationships was guided not only by the biological importance of this phenomenon but also by the relatively straightforward syntactic cuing of binding predications in text. The inflectional forms of a single verb, bind, indicate this relationship in the vast majority of cases, and our initial work is limited to these instances. For example, our goal in this project is to extract the binding predications in (2) from the text in (1).

(1) CC chemokine receptor 1 (CCR1) is expressed in neutrophils, monocytes, lymphocytes, and eosinophils, and binds the leukocyte chemoattractant and hematopoiesis regulator macrophage inflammatory protein (MIP)-1 alpha, as well as several related CC chemokines.

(2)
encoded in text into a more readily accessible format, typically a template with slots named for the participants in the scenario of interest. The template for molecular binding can be thought of as a simple predication with predicate "bind" and two arguments which participate (symmetrically) in the relationship: BINDS(

type that categorizes the concept into subareas of biology or medicine. Examples pertinent to binding terminology include the semantic types 'Amino Acid, Peptide, or Protein' and 'Nucleotide Sequence'. The SPECIALIST Lexicon (with associated lexical access tools) supplies
phrases and proceeds in a series of cascaded steps that depend on existing domain knowledge as well as several small, special-purpose resources in order to determine whether each noun phrase encountered is to be considered a binding term.

As the first step in the process, an existing program, MetaMap (Aronson et al. 1994), attempts to map each simple noun phrase to a concept in the UMLS Metathesaurus. The semantic type for concepts corresponding to successfully mapped noun phrases is then checked against a small subset of UMLS semantic types referring to bindable entities, such as 'Amino Acid, Peptide, or Protein', 'Nucleotide Sequence', 'Carbohydrate', 'Cell', and 'Virus'. For concepts with a semantic type in this set, the corresponding noun phrase is considered to be a binding term.

The heads of noun phrases that do not map to a concept in the Metathesaurus are tested against a small set of general "binding words," which often indicate that the noun phrase in which they appear is a binding term. The set of binding words includes such nouns as cleft, groove, membrane, ligand, motif, receptor, domain, element, and molecule.

The head of a noun phrase that did not submit to the preceding steps is examined to see whether it adheres to the morphologic shape of a normal English word. In this context such a word is often an acronym not defined locally and indicates the presence of a binding term (Fukuda et al. 1998). A normal English word has at least one vowel and no digits, and a text token that contains at least one letter and is not a normal English word functions as a binding word in this context.

The final step in identifying binding terms is to join contiguous simple noun phrases qualifying as binding terms into a single macro-noun phrase. Rindflesch et al. (1999) use the term "macro-noun phrase" to refer to structures that include reduced relative clauses (commonly introduced by prepositions or participles) as well as appositives. Two binding terms joined by a form of be are also treated as though they formed a macro-noun phrase, as in Jel42 is an IgG which binds ...

The results of identifying binding terms (and thus potential binding arguments) are given in (4) for the sentence in (3). In (4) evidence supporting identification as a binding term is given in braces. Note that in the underspecified syntactic analysis, prepositional phrases are treated as (simple) noun phrases that have a preposition as their first member.

(3) Jel42 is an IgG which binds to the small bacterial protein, HPr and the structure of the complex is known at high resolution.

(4) [binding_term([head(Jel42)], {Morphology Shape Rule}
                  [aux(is)],
                  [det(an), head(IgG)] {Metathesaurus}
                 ),
     [pron(which)],
     [verb(binds)],
     binding_term([prep(to), det(the), mod(small), mod(bacterial), head(protein), punc(,)], {Metathesaurus}
                  [head(HPr)] {Morphology Shape Rule}
                 ),
     [conj(and)],
     [det(the), head(structure)],
     binding_term([prep(of), det(the), head(complex)] {General Binding Word}
                 ),
     [aux(is)],
     [verb(known)],
     [prep(at), mod(high), head(resolution), punc(.)]]

1.2 Identifying binding terms as arguments of relationships

Before addressing the strategy for determining the arguments of binding predications, we discuss the general treatment of macro-noun phrases during the second part of the processing. Although ARBITER attempts to recover complete macro-noun phrases during the first phase, only the most specific (and biologically useful) part of a macro-noun phrase is recovered during the extraction of binding predications. Terms referring to specific molecules are more useful than those referring to general classes of bindable entities, such as receptor, ligand, protein, or molecule. The syntactic head of a macro-noun phrase (the first simple noun phrase in the list) is not always the most specific or most useful term in the construction.
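The steps for labeling a simple noun phrase as a binding term (Metathesaurus semantic type, general binding word, morphological shape) can be illustrated with a minimal Python sketch. This is not ARBITER's Prolog code. The function names are hypothetical, the `semantic_types` argument is a stand-in for whatever MetaMap would return, the two word lists contain only the examples named in the text, and treating a, e, i, o, u as the vowels is an assumption.

```python
from typing import Iterable, Optional

# Semantic types the text names as examples of bindable entities (partial list).
BINDABLE_TYPES = {
    "Amino Acid, Peptide, or Protein",
    "Nucleotide Sequence",
    "Carbohydrate",
    "Cell",
    "Virus",
}

# General binding words named in the text (partial list; no plural handling here).
GENERAL_BINDING_WORDS = {
    "cleft", "groove", "membrane", "ligand", "motif",
    "receptor", "domain", "element", "molecule",
}

VOWELS = set("aeiou")  # assumption: the source does not define "vowel"


def is_normal_english_word(token: str) -> bool:
    """A normal English word has at least one vowel and no digits."""
    has_vowel = any(ch in VOWELS for ch in token.lower())
    has_digit = any(ch.isdigit() for ch in token)
    return has_vowel and not has_digit


def has_binding_shape(token: str) -> bool:
    """A token with at least one letter that is not a normal English word."""
    has_letter = any(ch.isalpha() for ch in token)
    return has_letter and not is_normal_english_word(token)


def label_binding_term(head: str,
                       semantic_types: Optional[Iterable[str]] = None) -> Optional[str]:
    """Return the evidence label for a binding term, or None.

    semantic_types is None when the noun phrase did not map to a concept
    (hypothetical stand-in for a MetaMap lookup).
    """
    if semantic_types is not None:
        # Mapped phrases are judged only by their semantic type.
        return "Metathesaurus" if BINDABLE_TYPES & set(semantic_types) else None
    if head.lower() in GENERAL_BINDING_WORDS:
        return "General Binding Word"
    if has_binding_shape(head):
        return "Morphology Shape Rule"
    return None
```

Under these assumptions, the unmapped heads Jel42 (contains digits) and HPr (contains no vowel) from sentence (3) are both labeled by the shape rule, while an unmapped head such as structure is not labeled at all.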
The Specificity Rule for determining the most specific part of the list of simple binding terms constituting a macro-noun phrase chooses the first simple term in the list which has either of the following two characteristics:

a) The head was identified by the Morphology Shape Rule.
b) The noun phrase maps to a UMLS concept having one of the following semantic types: 'Amino Acid, Peptide, or Protein', 'Nucleic Acid, Nucleoside, or Nucleotide', 'Nucleotide Sequence', 'Immunologic Factor', or 'Gene or Genome'.

For example, in (5), the second simple term, TNF-alpha promoter, maps to the Metathesaurus with semantic type 'Nucleotide Sequence' and is thus considered to be the most specific term in this complex-noun phrase.

(5) binding_term(
      [transcriptionally active kappaB motifs],
      [in the TNF-alpha promoter],
      [in normal cells])

In identifying binding terms as arguments of a complete binding predication, as indicated above, we examine only those binding relations cued by some form of the verb bind (bind, binds, bound, and binding). The list of minimal syntactic phrases constituting the partial parse of the

relationships asserted in the text are cued by the gerundive or participial form binding. In this syntactic predication, the resources available from the underspecified syntactic parse serve quite well as the basis for correctly identifying the arguments of the binding relationship.

The most common argument configuration associated with binding is for both arguments to occur to the right, cued by prepositions, most commonly of and to; however, other frequent patterns are of-by and to-by. Another method of argument cuing for binding is for the subject of the predication to function syntactically as a modifier of the head binding in the same simple noun phrase. The object in this instance is then cued by either of or to (to the right). A few other patterns are seen and some occurrences of binding do not cue a complete predication; either the subject is missing or neither argument is explicitly mentioned. However, the examples in (6) fairly represent the interpretation of binding.

(6) These results suggest that 2 amino acids, Thr-340 and Ser-343, play important but distinct roles in promoting the binding of arrestin to rhodopsin.
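The Specificity Rule stated earlier in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the `SimpleTerm` class and `most_specific` function are hypothetical names, the evidence labels attached to the first and third terms of example (5) are guesses (the text only says that the second term maps to 'Nucleotide Sequence'), and falling back to the syntactic head when no term qualifies is an assumption, since the text does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Semantic types listed in characteristic (b) of the Specificity Rule.
SPECIFIC_TYPES = frozenset({
    "Amino Acid, Peptide, or Protein",
    "Nucleic Acid, Nucleoside, or Nucleotide",
    "Nucleotide Sequence",
    "Immunologic Factor",
    "Gene or Genome",
})


@dataclass(frozen=True)
class SimpleTerm:
    """One simple binding term inside a macro-noun phrase (hypothetical type)."""
    text: str
    evidence: str  # e.g. "Metathesaurus", "General Binding Word", "Morphology Shape Rule"
    semantic_types: FrozenSet[str] = frozenset()


def most_specific(terms: List[SimpleTerm]) -> Optional[SimpleTerm]:
    """Return the first term satisfying characteristic (a) or (b)."""
    for term in terms:
        if term.evidence == "Morphology Shape Rule":   # characteristic (a)
            return term
        if term.semantic_types & SPECIFIC_TYPES:       # characteristic (b)
            return term
    # Assumption: fall back to the syntactic head (first term) if nothing qualifies.
    return terms[0] if terms else None


# Example (5); evidence for the first and third terms is assumed for illustration.
example_5 = [
    SimpleTerm("transcriptionally active kappaB motifs", "General Binding Word"),
    SimpleTerm("in the TNF-alpha promoter", "Metathesaurus",
               frozenset({"Nucleotide Sequence"})),
    SimpleTerm("in normal cells", "Metathesaurus", frozenset({"Cell"})),
]
chosen = most_specific(example_5)
```

With these inputs `chosen` is the second term, in the TNF-alpha promoter, which agrees with the outcome described for (5). Note that 'Cell' is a bindable type but is not among the types in characteristic (b), so the third term would not be selected even if it came first.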
one binding term precedes the verb. We therefore use the strategy summarized in (7) for re-

BINDS
Although the particular underspecified syntactic analysis used in the identification of binding predications in the biomedical research literature is limited in several important ways, it appears adequate to enable this project with a useful level of effectiveness, and this is supported by evaluation.

in (12), where the failure to identify ran as a binding term caused ARBITER to miss the correct binding predication asserted in this sentence (indicated by "-FN->").

(12) Requirement of guanosine triphosphate-bound ran for signal-mediated nuclear protein export.
-FN->
identify the predication marked as a false negative.

3 Application

As an initial application of ARBITER we ran the program on 491,356 MEDLINE citations, which were retrieved using the same search strategy responsible for the gold standard. During this run, 331,777 sentences in 192,997 citations produced 419,782 total binding assertions. Extrapolating from the gold standard evaluation, we assume that this is about half of the total binding predications asserted in the citations processed and that somewhat less than three quarters of those extracted are correct.

The initial list of 419,982 binding triples represents what ARBITER determined was asserted in the text being processed. Many of these assertions, such as those in (14), while correct, are too general to be useful.

(14)

• 27,345 bindings between a GenBank entry and a UMLS Metathesaurus concept
• 5,569 unique relationships among pairs of entries (involving 11,617 unique entries)

Conclusion

The cooperation of structured domain knowledge and underspecified syntactic analysis enables the extraction of macromolecular binding relationships from the research literature. Although our implementation is domain-specific, the underlying principles are amenable to broader applicability.

ARBITER makes a distinction between first labeling binding terms and then identifying certain of these terms as arguments in a binding predication. The first phase of this processing is dependent on biomedical domain knowledge accessible from the UMLS. Applying the tech-
References

Ait-Mokhtar S. and Chanod J.-P. (1997) Incremental finite-state parsing. Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 72-9.
Appelt D. E. and Israel D. (1997) Tutorial on building information extraction systems. Fifth Conference on Applied Natural Language Processing.
Aronson A. R., Rindflesch T. C., and Browne A. C. (1994) Exploiting a large thesaurus for information retrieval. Proceedings of RIAO 94, pp. 197-216.
Benson D. A., Boguski M. S., Lipman D. J., Ostell J., and Ouelette B. F. (1998) GenBank. Nucleic Acids Research, 26/1, pp. 1-7.
Blaschke C., Andrade M. A., Ouzounis C., and Valencia A. (1999) Automatic extraction of biological information from scientific text: protein-protein interactions. Intelligent Systems for Molecular Biology (ISMB), 7, pp. 60-7.
Church K. W. (1988) A stochastic parts program and noun phrase parser for unrestricted text. Proceedings of the Second Conference on Applied Natural Language Processing, pp. 136-143.
Craven M. and Kumlien J. (1999) Constructing biological knowledge bases by extracting information from text sources. Intelligent Systems for Molecular Biology (ISMB), 7, pp. 77-86.
Cutting D. R., Kupiec J., Pedersen J. O., and Sibun P. (1992) A practical part-of-speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing.
Fukuda F., Tsunoda T., Tamura A., and Takagi T. (1998) Toward information extraction: Identifying protein names from biological papers. Pacific Symposium on Biocomputing (PSB), 3, pp. 705-16.
Hearst M. A. (1999) Untangling text data mining. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 3-10.
Hindle D. (1983) Deterministic parsing of syntactic non-fluencies. Proceedings of the 21st Annual Meeting of the Association for Computational Linguistics, pp. 123-8.
Humphreys B. L., Lindberg D. A. B., Schoolman H. M., and Barnett G. O. (1998) The Unified Medical Language System: An informatics research collaboration. Journal of the American Medical Informatics Association, 5/1, pp. 1-13.
McCray A. T., Srinivasan S., and Browne A. C. (1994) Lexical methods for managing variation in biomedical terminologies. Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care, pp. 235-9.
McDonald D. D. (1992) Robust partial parsing through incremental, multi-algorithm processing. In "Text-Based Intelligent Systems," P. S. Jacobs, ed., pp. 83-99.
Morin E. and Jacquemin C. Projecting corpus-based semantic links on a thesaurus. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 389-96.
MUC-7. Message Understanding Conference Proceedings, http://www.muc.saic.com.
Rajan J. V., Hunter L., and Rindflesch T. C. (In prep.) Mining MEDLINE.
Rindflesch T. C. (1995) Integrating natural language processing and biomedical domain knowledge for increased information retrieval effectiveness. Proceedings of the 5th Annual Dual-Use Technologies and Applications Conference, pp. 260-5.
Rindflesch T. C., Hunter L., and Aronson A. R. (1999) Mining molecular binding terminology from biomedical text. Proceedings of the AMIA Annual Symposium, pp. 127-131.
Rindflesch T. C., Tanabe L., Weinstein J. N., and Hunter L. (2000) EDGAR: Extraction of drugs, genes and relations from the biomedical literature. Pacific Symposium on Biocomputing (PSB), 5, pp. 514-25.
Tersmette K. W. F., Scott A. F., Moore G. W., Matheson N. W., and Miller R. E. (1988) Barrier word method for detecting molecular biology multiple word terms. Proceedings of the 12th Annual Symposium on Computer Applications in Medical Care, pp. 207-11.
Voorhees E. M. and Harman D. K. (1998) The Seventh Text Retrieval Conference.
Vourtilainen A. and Padro L. (1997) Developing a hybrid NP parser. Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 80-7.
Wacholder N., Ravin Y., and Choi M. (1997) Disambiguation of proper names in text. Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 202-208.
Weischedel R., Meteer M., Schwartz R., Ramshaw L., and Palmucci J. (1993) Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19/2, pp. 359-382.
Zhai C. (1997) Fast statistical parsing of noun phrases for document indexing. Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 312-31.