Analysis of Link Grammar on Biomedical Dependency Corpus

Analysis of Link Grammar on Biomedical Dependency Corpus Targeted at Protein-Protein Interactions Sampo Pyysalo, Filip Ginter, Tapio Pahikkala, Jeppe Koivula Jorma Boberg, Jouni JÄarvinen, Tapio Salakoski MediCel Ltd., Turku Centre for Computer Science (TUCS) Haartmaninkatu 8 and Dept. of computer science, University of Turku 00290 Helsinki, Finland, LemminkÄaisenkatu 14A [email protected] 20520 Turku, Finland, ¯rst name.last [email protected].¯ Abstract parser with the UMLS Specialist2 lexicon, and Ding et al. (2003) perform a basic evaluation In this paper, we present an evaluation of the of LG performance on biomedical text. As both Link Grammar parser on a corpus consisting papers suggest, LG will require modi¯cations in of sentences describing protein-protein interac- order to provide a correct analysis of grammat- tions. We introduce the notion of an interac- ical phenomena that are rare in general English tion subgraph, which is the subgraph of a de- text, but common in biomedical language. Im- pendency graph expressing a protein-protein in- plementing such modi¯cations is a major e®ort teraction. We measure the performance of the that requires a careful analysis of the perfor- parser for recovery of dependencies, fully correct mance of the LG parser to identify the most linkages and interaction subgraphs. We analyze common causes of parsing failures and to target the causes of parser failure and report speci¯c modi¯cation e®orts. causes of error, and identify potential modi¯ca- While Szolovits (2003) does not attempt to tions to the grammar to address the identi¯ed evaluate parser performance at all and Ding issues. We also report and discuss the e®ect of et al. (2003) provide only an informal evalua- an extension to the dictionary of the parser. tion on manually simpli¯ed sentences, we focus on a more formal evaluation of the LG parser. 1 Introduction For the purpose of this study and also for sub- The challenges of processing the vast amounts of sequent research of biomedical information ex- biomedical publications available in databases traction with the LG parser, we have developed such as MEDLINE have recently attracted a a hand-annotated corpus consisting of unmod- considerable interest in the Natural Language i¯ed sentences from publications. We use this Processing (NLP) research community. The corpus to evaluate the performance of the LG task of information extraction, commonly tar- parser and to identify problems and potential geting entity relationships, such as protein- improvements to the grammar and parser. protein interactions, is an often studied prob- lem to which various NLP methods have been 2 Link Grammar and parser applied, ranging from keyword-based methods The Link Grammar and its parser represent an (see, e.g., Ginter et al. (2004)) to full syntactic implementation of a dependency-based compu- analysis as employed, for example, by Craven tational grammar. The result of LG analysis for and Kumlien (1999), Temkin and Gilder (2003) a sentence is a labeled undirected simple graph, and Daraselia et al. (2004). whose nodes represent the words of the sentence In this paper, we focus on the syntactic anal- and whose edges and their labels express the ysis component of an information extraction grammatical relationships between the words. system targeted to ¯nd protein-protein inter- In LG terminology, the graph is called a link- actions from the dependency output produced age links 1 , and its edges are called . The linkage by the Link Grammar (LG) parser of Sleator must be planar (i.e., links must not cross) when and Temperley (1991). Two recent papers study drawn above the words of the sentence, and the LG in the context of biomedical NLP. The work labels of the links must satisfy the linking con- by Szolovits (2003) proposes a fully automated straints speci¯ed for each word in the grammar. method to extend the dictionary of the LG A connected linkage is termed complete. 1http://www.link.cs.cmu.edu/link/ 2http://www.nlm.nih.gov/research/umls/ 15 findings suggest that PIP2 binds to proteins such as profilin Figure 1: Annotation example. The interaction of two proteins, PIP2 and pro¯lin, is stated by the words binds to. The links joining these words form the interaction subgraph (drawn with solid lines). Due to the structural ambiguity of natural names of at least two proteins that are known language, several linkages can typically be con- to interact. A domain expert annotated these structed for an input sentence. In such cases, sentences for protein names and for words stat- the LG parser enumerates all linkages allowed ing their interactions. Of these sentences, 1114 by the grammar. A post-processing step is then described at least one protein-protein interac- employed to enforce a number of additional con- tion. straints. The number of linkages for some sen- Thereafter, we performed a dependency anal- tences can be very high, making post-processing ysis and produced annotation of dependencies. and storage prohibitively expensive. This prob- To minimize the amount of mistakes, each sen- lem is addressed in the LG parser by de¯ning tence was independently annotated by two an- kmax, the maximal number of linkages to be notators and di®erences were then resolved by post-processed. If the parsing algorithm pro- discussion. The assigned dependency structure duces more than kmax linkages, the output is was produced according to the LG linkage con- reduced to kmax linkages by random sampling. ventions. Link types were not included in the The linkages are then ordered from best to worst annotation, and no cycles were introduced in using heuristic goodness criteria. the dependency graphs. All ambiguities where In order to be usable in practice, a parser is the LG parser is capable of at least enumerat- typically required to provide a partial analysis ing all alternatives (such as prepositional phrase of a sentence for which it cannot construct a full attachment) were enforced in the annotation. analysis. If the LG parser cannot construct a A random sample consisting of 300 sentences, complete linkage for a sentence, the connected- including 28 publication titles, has so far been ness requirement is relaxed so that some words fully annotated, giving 7098 word-to-word de- do not belong to the linkage at all. The LG pendencies. This set of sentences is the corpus parser is also time-limited. If the full set of we refer to in the following sections. linkages cannot be constructed in a given time An information extraction system targeted tmax, the parser enters a panic mode, in which it at protein-protein interactions and their types performs an e±cient but considerably restricted needs to identify three constituents that express parse, resulting in reduced performance. The an interaction in a sentence: the proteins in- parameters tmax and kmax set the trade-o® be- volved and the word or phrase that states their tween the qualitative performance and the re- interaction and suggests the type of this inter- source e±ciency of the parser. action. To extract this information from a LG linkage, the links connecting these items must 3 Corpus annotation and interaction be recovered correctly by the parser. The fol- subgraphs lowing de¯nition formalizes this notion. To compile a corpus of sentences describing De¯nition 1 (Interaction subgraph) The protein-protein interactions, we ¯rst selected interaction subgraph for an interaction between pairs of proteins that are known to interact two proteins A and B in a linkage L is the 3 from the Database of Interacting Proteins . We minimal connected subgraph of L that contains entered these pairs as search terms into the A, B, and the word or phrase that states their PubMed retrieval system. We then split the interaction. publication abstracts returned by the searches into sentences and included titles. These were The recovery of a connected component con- again searched for the protein pairs. This gave taining the protein names and the interaction us a set of 1927 sentences that contain the word is not su±cient: by the de¯nition of a complete linkage, such a component is always 3http://dip.doe-mbi.ucla.edu/ present. Consequently, the exact set of links 16 that forms the interaction subgraph must be re- time tmax for producing a normal parse was ex- covered. hausted and the parser entered panic mode, (2) For each interaction stated in a sentence, sentences where linkages were sampled because the corpus annotation speci¯es the proteins in- more than kmax linkages were produced, and volved and the interaction word. The interac- (3) stable sentences for which neither of these tion subgraph for each interaction can thus be occurred. A full analysis of all linkages that extracted automatically from the corpus. Be- the grammar allows is only possible for stable cause the corpus does not contain cyclic depen- sentences. For sentences in the other two cat- dencies, the interaction subgraphs are unique. egories, random e®ects may a®ect the results: 366 interaction subgraphs were identi¯ed from sentences for which more than kmax linkages are the corpus, one for each described interaction. produced are subject to randomness in sam- The interaction subgraphs can be partially over- pling, and sentences where the parser enters lapping, because a single link can be part of panic mode were always subject to subsequent more than one interaction subgraph. Figure 1 sampling in our experiments. shows an example of an annotated text frag- ment. 5 Evaluation To evaluate the ability of the LG parser to pro- 4 Evaluation criteria duce correct linkages, we increased the number of stable sentences by setting the tmax param- We evaluated the performance of the LG parser eter to 10 minutes and the kmax parameter to according to the following three quantitative cri- 10000 instead of using the defaults tmax = 30 teria: seconds and kmax = 1000.

Load more