Extracting Biochemical Interactions from MEDLINE Using a Link Grammar Parser

Extracting Biochemical Interactions from MEDLINE Using a Link Grammar Parser Jing Ding Daniel Berleant Jun Xu Andy W. Fulmer Department of Electrical and Computer Engineering, The Procter and Gamble Company Iowa State University, Ames, IA Cincinnati, OH {dingjing, berleant}@iastate.edu {xu.j.1, fulmer.aw}@pg.com Abstract sentences to be practical on large corpora. Yet MEDLINE abstracts are full of complex sentences, so without effi- Many natural language processing approaches at vari- cient handling of complex sentences, the overall perform- ous complexity levels have been reported for extracting ance and application of any NLP interaction mining sys- biochemical interactions from MEDLINE. While some tem is limited. algorithms using simple template matching are unable to In this paper, we propose extracting biochemical inter- deal with the complex syntactic structures, others exploit- actions from MEDLINE using a link grammar parser ing sophisticated parsing techniques are hindered by (LGP) [4]. The parser has good performance on complex greater computational cost. This study investigates link sentences, attributable to its balance between expressive grammar parsing for extracting biochemical interactions. power and computational complexity. First, in section 2, Link grammar parsing can handle many syntactic struc- we briefly discuss syntactic structures of MEDLINE sen- tures and is computationally relatively efficient. We ex- tences, and give a review of the strengths and limitations perimented on a sample MEDLINE corpus. Although the of the current NLP interaction mining approaches in the parser was originally developed for conversational Eng- literature. In section 3 we give a brief introduction to link lish and made many mistakes in parsing sentences from grammar and a link grammar parser. In section 4, we re- the biochemical domain, it nevertheless achieved better port the experiment results of using the parser on a sample overall performance than a co-occurrence-only method. MEDLINE corpus. Section 5 contains a discussion of the Customizing the parser for the biomedical domain is ex- LGP’s expressive power and computational complexity. pected to improve its performance further. Future developments and a conclusion follow. 2. Related work 1. Introduction Biochemical interactions described in MEDLINE ab- MEDLINE is a rich source for mining biochemical in- stracts are rarely stated as simply as “protein A activates teractions for various tasks, such as populating databases protein B.” Various syntactic structures are used to com- of interacting proteins, constructing networks of protein pact several interactions, as well as other information, into interactions, and assisting human experts to sift through a single sentence. Among the most frequently used are the most relevant documents. Many algorithms have been nominalization (converting a predicate to a noun phrase) proposed in the literature, falling into two broad catego- and coordination (combining two or more predicates with ries, statistical approaches and natural language processing coordinating conjunctions). It is not uncommon in (NLP) approaches. More attention has been paid to the MEDLINE abstracts for a single sentence (or fragment) to latter, probably due to the fact that the single sentence describe a network of interactions. Consider an example (NLP’s main focus) is a good choice of text unit for min- from MEDLINE: ing biochemical interactions [3], and many algorithms and Gamma-aminobutyric acid mediation of the inhibi- tools can be borrowed from computational linguistics. In tory effect of nitric oxide on the arginine vaso- the existing reports, NLP approaches were used to analyze pressin and oxytocin responses to insulin-induced sentences with grammars of various levels of expressive hypoglycemia. (PMID: 8952001) power and computational complexity. However, it is chal- This sentence fragment has four instances of nominali- lenging to find a good balance between the two factors. zation and one of coordination. There are five biochemi- Most systems suffered from either not enough expressive cals in the sentence (ten possible interacting pairs). Three power to deal with complex sentence structures, or too interactions are explicitly described, NO → GABA, NO much computational overhead when processing complex → AVP and NO → OXT), and four are implied (GABA → AVP, GABA → OXT, insulin → AVP and insulin → parser in their system to confirm their own noun phrase OXT (NO means nitric oxide, AVP means arginine vaso- grouping rules, ignoring other output of the parser. In such pressin, OXT means oxytocin, and GABA means Gamma- systems, part of the computational resources goes into aminobutyric acid. All are nominalized. Among the three generating parse results that are irrelevant to the task at non-interacting pairs, only AVP/OXT is obvious because hand, which is a waste of computing resources. Finally, of the coordination evidenced by the coordinating con- none of these algorithms dealt with coordination. junction “and.” The complexity shown here illustrates the Coordination occurs in a sentence when it contains a challenges that NLP interaction mining algorithms face in shared structure. The sharing avoids duplication, so that this domain. the sentence is more compact than if sharing had not oc- Natural Language Processing is still in its relative in- curred. That is the main reason why this syntactic structure fancy, so automatic full discourse analysis and semantic is so widely used in MEDLINE abstracts and elsewhere. understanding are beyond the capability of today’s com- Coordination can be applied to various sentence compo- puting systems. Current NLP algorithms, therefore, try nents. For example, various ways to simplify the syntactic structures, or focus Protein A activates proteins B and C. on specific subsets of the structures. Protein A activates protein B and protein C. The syntactically simplest algorithms predict interac- Protein A activates protein B, and inhibits protein C. tion from a pair of co-occurring terms, as in PathBinder … [2]. Template matching algorithms require more evidence All of these examples use coordination to avoid saying, of interaction than just co-occurrence of the terms. One for example, “Protein A activates protein B. Protein A type of templates consists of two slots for interacting bio- activates protein C.” This kind of complexity cannot eas- chemicals and an interactor (an interaction-related word) ily be handled by simple template matching. Note that the slot, such as in [6] and [10]. Blaschke et al. [1] added a coordinating components (B and C) are not related to each constraint that the interactor must be between the two other. They are put together only because they are related terms. An implicit assumption behind such templates is to a common third party. Therefore, coordination should that there is no crossover among predicates or nominalized be used to rule out interactions between the coordinating predicates, so they performed poorly on multiple-instance components. nominalization and coordination. For example, a “term- In the next section, we introduce a link grammar parser, interactor-term” template will miss two explicitly stated which deals with coordination. In addition, the output of interactions in the example sentence given above (NO → the parser is also suitable for extracting relationships be- AVP and NO → OXT), falsely include a non-interaction tween non-coordinating terms. (GABA/insulin), either miss two implied interactions (insulin → AVP and insulin → OXT) or include a non- 3. Link grammar and the link grammar interaction (NO/insulin) depending on whether or not “re- parser sponses” can fill the interactor slot. In addition, both “mediation” and “inhibitory” can fill the interactor slot be- Link grammar was first introduced by Sleator and tween GABA and NO. Temperley to simplify English grammar with a context- To better deal with nominalization, Leroy and Chen [5] free grammar [8]. The basic idea of link grammar is to tried to first find noun phrases using templates built connect pairs of words in a sentence with various links. around the preposition “of,” and then fill the phrases into Each word is viewed as a block with connectors coming main templates built around the preposition “by.” Other out. There are various types of connectors, and connectors researchers took advantage of progress in the NLP field. may point to the right or to the left. A valid sentence may For example commercial and open-source part-of-speech have more than one complete linkage, just as a sentence taggers and parsers were experimented with, such as in [7] may have several meanings. and [11]. However, these approaches have some limita- Grinberg et al. [4] developed a robust parser to imple- tions. First, parsing is difficult, especially in face of the ment the link grammar. It has a dictionary of about 60,000 complexity of MEDLINE sentences. For example, Yaku- words, and can recognize a wide range of English syntac- shiji et al. [11] experimented with a full parser on 179 sen- tic phenomena: noun-verb agreement, questions, impera- tences taken from MEDLINE. In only 66 (30%) were cor- tives, complex and irregular verbs, many types of nouns, rect parse results obtained. Second, the parsing results past- or present-participles in noun phrases, commas, a (parse trees and grouped phrases) may or may not help variety of adjective types, prepositions, adverbs, relative extract interactions. Consider the example sentence

Load more