<<

Analysis of Link Grammar on Biomedical Dependency Corpus Targeted at Protein-Protein Interactions Sampo Pyysalo, Filip Ginter, Tapio Pahikkala, Jeppe Koivula Jorma Boberg, Jouni J¨arvinen, Tapio Salakoski MediCel Ltd., Turku Centre for Computer Science (TUCS) Haartmaninkatu 8 and Dept. of computer science, University of Turku 00290 Helsinki, Finland, Lemmink¨aisenkatu 14A [email protected] 20520 Turku, Finland, first name.last [email protected].fi

Abstract parser with the UMLS Specialist2 lexicon, and Ding et al. (2003) perform a basic evaluation In this paper, we present an evaluation of the of LG performance on biomedical text. As both Link Grammar parser on a corpus consisting papers suggest, LG will require modifications in of sentences describing protein-protein interac- order to provide a correct analysis of grammat- tions. We introduce the notion of an interac- ical phenomena that are rare in general English tion subgraph, which is the subgraph of a de- text, but common in biomedical language. Im- pendency graph expressing a protein-protein in- plementing such modifications is a major effort teraction. We measure the performance of the that requires a careful analysis of the perfor- parser for recovery of dependencies, fully correct mance of the LG parser to identify the most linkages and interaction subgraphs. We analyze common causes of failures and to target the causes of parser failure and report specific modification efforts. causes of error, and identify potential modifica- While Szolovits (2003) does not attempt to tions to the grammar to address the identified evaluate parser performance at all and Ding issues. We also report and discuss the effect of et al. (2003) provide only an informal evalua- an extension to the dictionary of the parser. tion on manually simplified sentences, we focus on a more formal evaluation of the LG parser. 1 Introduction For the purpose of this study and also for sub- The challenges of processing the vast amounts of sequent research of biomedical information ex- biomedical publications available in databases traction with the LG parser, we have developed such as MEDLINE have recently attracted a a hand-annotated corpus consisting of unmod- considerable interest in the Natural Language ified sentences from publications. We use this Processing (NLP) research community. The corpus to evaluate the performance of the LG task of , commonly tar- parser and to identify problems and potential geting entity relationships, such as protein- improvements to the grammar and parser. protein interactions, is an often studied prob- lem to which various NLP methods have been 2 Link Grammar and parser applied, ranging from keyword-based methods The Link Grammar and its parser represent an (see, e.g., Ginter et al. (2004)) to full syntactic implementation of a dependency-based compu- analysis as employed, for example, by Craven tational grammar. The result of LG analysis for and Kumlien (1999), Temkin and Gilder (2003) a sentence is a labeled undirected simple graph, and Daraselia et al. (2004). whose nodes represent the words of the sentence In this paper, we focus on the syntactic anal- and whose edges and their labels express the ysis component of an information extraction grammatical relationships between the words. system targeted to find protein-protein inter- In LG terminology, the graph is called a link- actions from the dependency output produced age links 1 , and its edges are called . The linkage by the Link Grammar (LG) parser of Sleator must be planar (i.e., links must not cross) when and Temperley (1991). Two recent papers study drawn above the words of the sentence, and the LG in the context of biomedical NLP. The work labels of the links must satisfy the linking con- by Szolovits (2003) proposes a fully automated straints specified for each word in the grammar. method to extend the dictionary of the LG A connected linkage is termed complete.

1http://www.link.cs.cmu.edu/link/ 2http://www.nlm.nih.gov/research/umls/

15 findings suggest that PIP2 binds to proteins such as profilin

Figure 1: Annotation example. The interaction of two proteins, PIP2 and profilin, is stated by the words binds to. The links joining these words form the interaction subgraph (drawn with solid lines).

Due to the structural ambiguity of natural names of at least two proteins that are known language, several linkages can typically be con- to interact. A domain expert annotated these structed for an input sentence. In such cases, sentences for protein names and for words stat- the LG parser enumerates all linkages allowed ing their interactions. Of these sentences, 1114 by the grammar. A post-processing step is then described at least one protein-protein interac- employed to enforce a number of additional con- tion. straints. The number of linkages for some sen- Thereafter, we performed a dependency anal- tences can be very high, making post-processing ysis and produced annotation of dependencies. and storage prohibitively expensive. This prob- To minimize the amount of mistakes, each sen- lem is addressed in the LG parser by defining tence was independently annotated by two an- kmax, the maximal number of linkages to be notators and differences were then resolved by post-processed. If the parsing algorithm pro- discussion. The assigned dependency structure duces more than kmax linkages, the output is was produced according to the LG linkage con- reduced to kmax linkages by random sampling. ventions. Link types were not included in the The linkages are then ordered from best to worst annotation, and no cycles were introduced in using heuristic goodness criteria. the dependency graphs. All ambiguities where In order to be usable in practice, a parser is the LG parser is capable of at least enumerat- typically required to provide a partial analysis ing all alternatives (such as prepositional phrase of a sentence for which it cannot construct a full attachment) were enforced in the annotation. analysis. If the LG parser cannot construct a A random sample consisting of 300 sentences, complete linkage for a sentence, the connected- including 28 publication titles, has so far been ness requirement is relaxed so that some words fully annotated, giving 7098 word-to-word de- do not belong to the linkage at all. The LG pendencies. This set of sentences is the corpus parser is also time-limited. If the full set of we refer to in the following sections. linkages cannot be constructed in a given time An information extraction system targeted tmax, the parser enters a panic mode, in which it at protein-protein interactions and their types performs an efficient but considerably restricted needs to identify three constituents that express parse, resulting in reduced performance. The an interaction in a sentence: the proteins in- parameters tmax and kmax set the trade-off be- volved and the word or phrase that states their tween the qualitative performance and the re- interaction and suggests the type of this inter- source efficiency of the parser. action. To extract this information from a LG linkage, the links connecting these items must 3 Corpus annotation and interaction be recovered correctly by the parser. The fol- subgraphs lowing definition formalizes this notion. To compile a corpus of sentences describing Definition 1 (Interaction subgraph) The protein-protein interactions, we first selected interaction subgraph for an interaction between pairs of proteins that are known to interact two proteins A and B in a linkage L is the 3 from the Database of Interacting Proteins . We minimal connected subgraph of L that contains entered these pairs as search terms into the A, B, and the word or phrase that states their PubMed retrieval system. We then split the interaction. publication abstracts returned by the searches into sentences and included titles. These were The recovery of a connected component con- again searched for the protein pairs. This gave taining the protein names and the interaction us a set of 1927 sentences that contain the word is not sufficient: by the definition of a complete linkage, such a component is always 3http://dip.doe-mbi.ucla.edu/ present. Consequently, the exact set of links

16 that forms the interaction subgraph must be re- time tmax for producing a normal parse was ex- covered. hausted and the parser entered panic mode, (2) For each interaction stated in a sentence, sentences where linkages were sampled because the corpus annotation specifies the proteins in- more than kmax linkages were produced, and volved and the interaction word. The interac- (3) stable sentences for which neither of these tion subgraph for each interaction can thus be occurred. A full analysis of all linkages that extracted automatically from the corpus. Be- the grammar allows is only possible for stable cause the corpus does not contain cyclic depen- sentences. For sentences in the other two cat- dencies, the interaction subgraphs are unique. egories, random effects may affect the results: 366 interaction subgraphs were identified from sentences for which more than kmax linkages are the corpus, one for each described interaction. produced are subject to randomness in sam- The interaction subgraphs can be partially over- pling, and sentences where the parser enters lapping, because a single link can be part of panic mode were always subject to subsequent more than one interaction subgraph. Figure 1 sampling in our experiments. shows an example of an annotated text frag- ment. 5 Evaluation To evaluate the ability of the LG parser to pro- 4 Evaluation criteria duce correct linkages, we increased the number of stable sentences by setting the tmax param- We evaluated the performance of the LG parser eter to 10 minutes and the kmax parameter to according to the following three quantitative cri- 10000 instead of using the defaults tmax = 30 teria: seconds and kmax = 1000. When parsing the • Number of dependencies recovered corpus using these parameters, 28 sentences fell into the panic category, 61 into the sampled • Number of fully correct linkages category, and 211 were stable. The measured parser performance for the corpus is presented • Number of interaction subgraphs recovered in Table 1. While the fraction of sentences that have a The number of recovered dependencies gives an fully correct linkage as the first linkage is quite estimate of the probability that a dependency low (approximately 7%), for 28% of sentences will be correctly identified by the LG parser the parser is capable of producing a fully cor- (this criterion is also employed by, e.g., Collins rect linkage. Performance was especially poor et al. (1999)). The number of fully correct link- for the publication titles in the corpus. Because ages, i.e. linkages where all annotated depen- titles are typically fragments not containing a dencies are recovered, measures the fraction of verb, and LG is designed to model full clauses, sentences that are parsed without error. How- the parser failed to produce a fully correct link- ever, a fully correct linkage is not necessary to age for any of the titles. extract protein-protein interactions from a sen- The performance for recovered interaction tence; to estimate how many interactions can subgraphs is more encouraging, as 25% of the potentially be recovered, we measure the num- subgraphs were recovered in the first linkage and ber of interaction subgraphs for which all de- more than half in the best linkage. Yet many in- pendencies were recovered. teraction subgraphs remain unrecovered by the For each criterion, we measure the perfor- parser: the results suggest an upper limit of mance for the first linkage returned by the approximately 60% to the fraction of protein- parser. However, the first linkage as ordered protein interactions that can be recovered from by the heuristics of the LG parser was often not any linkage produced by the unmodified LG. the best (according to the criteria above) of the In the following sections we further analyze the linkages returned by the parser. To separate reasons why the parser fails to recover all de- the effect of the heuristics from overall LG per- pendencies. formance, we identify separately for each of the three criteria the best linkage among the link- 5.1 Panics ages returned by the parser, and we also report No fully correct linkages and very few interac- performance for the best linkages. tion subgraphs were found in the panic mode. We further divide the parsed sentences into This effect may be partly due to the complex- three categories: (1) sentences for which the ity of the sentences for which the parser en-

17 Category Criterion Linkage Stable Sampled Panic Overall Dependency First linkage 3242 (80.0%) 1376 (74.3%) 569 (52.3%) 5187 (73.1%) Best linkage 3601 (86.6%) 1576 (85.0%) 620 (57.0%) 5797 (81.7%) Total 4157 1853 1088 7098

Fully correct First linkage 22 (10.4%) 0 (0.0%) 0 (0.0%) 22 (7.3%) Best linkage 79 (37.4%) 6 (9.8%) 0 (0.0%) 85 (28.3%) Total 211 61 28 300

Interaction First linkage 75 (30.5%) 16 (20.2%) 0 (0.0%) 91 (24.9%) subgraph Best linkage 156 (63.4%) 49 (62.0%) 4 (9.8%) 209 (57.1%) Total 246 79 41 366

Table 1: Parser performance. The fraction of fulfilled criteria is shown by category (the criteria and categories are explained in Section 4). The total rows give the number of criteria for each category, and the overall column gives combined results for all categories. tered panic mode. The effect of panics can der the sentences so that linkages that are more be better estimated by forcing the parser to likely to be correct are presented first. The bypass standard parsing and to directly apply heuristics are based on examination and intu- panic options. For the 272 sentences where the itions on general English, and may not be opti- parser did not enter the panic mode, 77% of mal for biomedical text. Note in Table 1 that dependencies were recovered in the first link- both for recovered full linkages and interaction age. When these sentences were parsed in forced subgraphs, the number of items that were recov- panic mode, 67% of dependencies were recov- ered in the best linkage is more than twice the ered, suggesting that on average parses in panic number recovered in the first linkage, suggesting mode recover approximately 10% fewer depen- that a better ordering heuristic could dramat- dencies than in standard parsing mode. Simi- ically improve the performance of the parser. larly, the number of fully correct first linkages Such improvements could perhaps be achieved decreased from 22 to 6 and the number of inter- by tuning the heuristics to the domain or by action subgraphs recovered in the first linkage adopting a probabilistic ordering model. from 91 to 65. These numbers indicate that panics are a significant cause of error. 6 Failure analysis Experiments indicate than on a 1GHz ma- A significant fraction of dependencies were not chine approximately 40% of sentences can be recovered in any linkage, even in sentences fully parsed in under a second, 80% in under 10 where resources were not exhausted. In order to seconds and 90% within 10 minutes; yet approx- identify reasons for the parser failing to recover imately 5% of sentences take more than an hour the correct dependencies, we analyze sentences to fully parse. With tmax set to 10 minutes, the for which it is certain that the grammar cannot total parsing time was 165 minutes. produce a fully correct linkage. We thus ana- Long parsing times are caused by ambiguous lyzed the 132 stable sentences for which some sentences for which the parser creates thousands dependencies were not recovered. or even millions of alternative linkages. In ad- For each sentence, we attempt to identify the dition to simply increasing the time limit, the reason for the failure of the parser. For each fraction of sentences where the parser enters the identified reason, we manually edit the sentence panic mode could therefore be reduced by re- to remove the source of failure. We repeat this ducing the ambiguity of the sentences, for ex- procedure until the parser is capable of pro- ample, by extending the dictionary of the parser ducing a correct parse for the sentence. Note (see Section 7). that this implies that also the interaction sub- graphs in the sentence are correctly recovered, 5.2 Heuristics and therefore the reasons for failures to recover When several linkages are produced for a sen- interaction subgraphs are a subset of the iden- tence, the LG parser applies heuristics to or- tified issues. The results of the analysis are

18 Reason for failure Cases Unknown grammatical structure 72 (34.4%) Dictionary issue 54 (25.8%) capping protein and actin genes Unknown word handling 35 (16.7%) Sentence fragment 27 (12.9%) Ungrammatical sentence 17 (8.1%) capping protein and actin genes Other 4 (1.9%) Figure 2: Multiple modifier coordination prob- Table 2: Results of failure analysis lem. Above: correct linkage disallowed by the LG parser. Below: solution by chaining modi- summarized in Table 2. In many of the sen- fiers. tences, more than one reason for parser failure was found; in total 209 issues were identified in 6.2 Unknown grammatical structures the 132 sentences. The results are described in The method of the LG implementation for pars- more detail in the following sections. ing coordinations was found to be a frequent cause of failures. A specific coordination prob- 6.1 Fragments and ungrammatical lem occurs with multiple noun-modifiers: the sentences parser assumes that coordinated constituents As some of the analyzed sentences were taken can be connected to the rest of the sentence from publication titles, not all of them were through exactly one word, and the grammar at- full clauses. To identify further problems when taches all noun-modifiers to the head. Biomed- parsing fragments not containing a verb, the ical texts frequently contain phrases that cause phrase “is explained” and required these requirements to conflict: for example, in were added to these fragments, a technique used the phrase “capping protein and actin genes” also by Ding et al. (2003). The completed frag- (where “capping protein genes” and “actin ments were then analyzed for potential further genes” is the intended parse), the parser allows problems. only one of the words “capping” and “protein” A number of other ungrammatical sentences to connect to the word “genes”, and is thus un- were also encountered. The most common able to produce the correct linkage (for illustra- problem was the omission of determiners, but tion, see Figure 2(a)). some other issues such as missing possessive This multiple modifier coordination issue markers and errors in agreement (e.g., “expres- could be addressed by modifying the grammar sions. . . has”) were also encountered. to chain modifiers (Figure 2(b)). This alterna- Ungrammatical sentences pose interesting tive model is adopted by another major depen- challenges for parsing. Because many authors dency grammar, the EngCG-based Connexor are not native English speakers, a greater toler- Machinese. The problem could also be ad- ance for grammatical mistakes should allow the dressed by altering the coordination handling parser to identify the intended parse for more system in the parser. sentences. Similarly, the ability to parse publi- Other identified grammatical structures not cation titles would extend the applicability of known to the parser were number postmodifiers the parser; in some cases it may be possible to nouns (e.g., “serine 38”), specifiers in paren- to extract information concerning the key find- theses (e.g., “profilin mutant (H119E)”), coor- ings of a publication from the title. However, dination with the phrase “but not”, and various while relaxing completeness and correctness re- unknown uses of colons and quotes. Single in- quirements, such as mandatory determiners and stances of several distinct unknown grammati- subject-predicate agreement, would allow the cal structures were also noted (e.g., “5 to 10”, parser to create a complete linkage for more sen- “as expected from”, “most concentrated in”). tences, it would also be expected to lead to in- Most of these issues can be addressed by local creased ambiguity for all sentences, and subse- modifications to the grammar. quent difficulties in identifying the correct link- age. If the ability to parse titles is considered 6.3 Unknown word handling important, a potential solution not incurring The LG parser assigns unknown words to cate- this cost would be to develop a separate version gories based on morphological or other surface of the grammar for parsing titles. clues when possible. For remaining unknown

19 words, parses are attempted by assigning the recognition. However, the performance of cur- words to the generic noun, verb and adjective rent NE recognition systems is not perfect, and types in all possible combinations. it is not clear what the effect of adopting such Some problems with the unknown word pro- a method would be on parser performance. cessing method were encountered during analy- sis; for example, the assumption that unknown 7 Dictionary extension capitalized words are proper nouns often caused Szolovits (2003) describes an automatic method failures, especially in sentences beginning with for mapping lexical information from one lexi- an unknown word. Similarly, the assumption con to another, and applies this method to aug- that words containing a hyphen behave as ad- ment the LG dictionary with terms from the ex- jectives was violated by a number of unknown tensive UMLS Specialist lexicon. The extension verbs (e.g., “cross-links”). introduces more than 125,000 new words into Another problem that was noted occurred the LG dictionary, more than tripling its size. with lowercase unknown words that should be We evaluated the effect of this dictionary exten- treated as proper nouns: because LG does not sion on LG parser performance using the criteria allow unknown lowercase words to act as proper described above. The fraction of distinct tokens nouns, the parser assigns incorrect structure to in the corpus found in the parser dictionary in- a number of phrases containing words such as creased from 52% to 72% with the dictionary “actin”. Improving unknown word handling re- extension, representing a significant reduction quires some modifications to the LG parser. in uncertainty. This decrease was coupled with 6.4 Dictionary issues a 32% reduction in total parsing time. Because the LG parser is unable to produce Cases where the LG dictionary contains a word, any linkage for sentences where it cannot iden- but not in the sense in which it appears in a tify a verb (even incorrectly), extending the dic- sentence, almost always lead to errors. For ex- tionary significantly reduced the ability of LG ample, the LG dictionary does not contain the to extract dependencies in titles, where the frac- word “assembly” in the sense “construction”, tion of recovered dependencies fell from the al- causing the parser to erroneously require a de- 4 ready low value of 67% to 55%. terminer for “protein assembly” . A related For the sentences excluding titles, the benefits frequent problem occurred with proper names of the dictionary extension were most significant headed by a common noun, where the parser ex- for sentences that were in the panic category pects a for such names (e.g., “myosin when using the unextended LG dictionary; 12 heavy chain”), and fails when one is not present. of these 28 sentences could be parsed without These issues are mostly straightforward to ad- panic with the dictionary extension. In the first dress in the grammar, but difficult to identify linkage of these sentences, the fraction of re- automatically. covered dependencies increased by 8%, and the 6.5 Biomedical entity names fraction of recovered interaction subgraphs in- Many of the causes for parser failure discussed creased from zero to 15% with the dictionary above are related to the presence of biomed- extension. ical entity names. While the causes for fail- The overall effect of the dictionary extension ures relating to names can be addressed in the was positive but modest, with no more than grammar, the existence of biomedical named en- 2.5% improvement for either the first or best tity (NE) recognition systems (for a recent sur- linkages for any criterion, despite the threefold vey, see, e.g., Bunescu et al. (2004)) suggests increase in dictionary size. This result agrees an alternative solution: NEs could be identified with the failure analysis: most problems can- in preprocessing, and treated as single (proper not be removed by extending the dictionary and noun) tokens during the parse. During failure must instead be addressed by modifications of analysis, 59 cases (28% of all cases) were noted the grammar or parser. where this procedure would have eliminated the error, assuming that no errors are made in NE 8 Conclusion We have presented an analysis of Link Gram- 430 distinct problematic word definitions were iden- tified, including “breakdown”, “composed”, “factor”, mar performance using a custom dependency “half”, “independent”, “localized”, “parallel”, “pro- corpus targeted at protein-protein interactions. moter”, “segment”, “upstream” and “via”. We introduced the concept of the interaction

20 subgraph and reported parser performance for References three criteria: recovery of dependencies, in- Razvan Bunescu, Ruifang Ge, Rohit J. Kate, teraction subgraphs and fully correct linkages. Edward M. Marcotte, Raymond J. Mooney, While LG was able to recover 73% of dependen- Arun Kumar Ramani, and Yuk Wah Wong. 2004 cies in the first linkage, only 7% of sentences had (to appear). Comparative experiments on learn- a fully correct first linkage. However, fully cor- ing information extractors for proteins and their rect linkages are not required for information ex- interactions. Artificial Intelligence in Medicine. Special Issue on Summarization and Information traction, and we found that 25% of interaction Extraction from Medical Documents. subgraphs were recovered in the first linkage. Michael Collins, Jan Hajic, Lance Ramshaw, and Resource exhaustion was found to be a signif- Christoph Tillmann. 1999. A statistical parser icant cause of poor performance. Furthermore, for Czech. In 37th Annual Meeting of the Associ- an evaluation of performance in the case when ation for Computational Linguistics, pages 505– optimal heuristics for ordering linkages are ap- 512. Association for Computational Linguistics, plied indicated that the fraction of recovered in- Somerset, New Jersey. teraction subgraphs could be more than doubled Mark Craven and Johan Kumlien. 1999. Construct- (to 57%) by optimal heuristics. ing biological knowledge bases by extracting in- formation from text sources. In T. Lengauer, To further analyze the cases where the parser R. Schneider, P. Bork, D. Brutlag, J. Glasgow, cannot produce a correct linkage, we carefully H HW Mewes, and Zimmer R., editors, Proceed- examined the sentences and were able to iden- ings of the 7th International Conference on Intel- tify five problem types. For each identified ligent Systems in Molecular Biology, pages 77–86. case, we discussed potential modifications for AAAI Press, Menlo Park, CA. addressing the problems. We also considered Nikolai Daraselia, Anton Yuryev, Sergei Egorov, the possibility of using a named entity recogni- Svetalana Novichkova, Alexander Nikitin, and tion system to improve parser performance and Ilya Mazo. 2004. Extracting human protein in- found that 28% of LG failures would be avoided teractions from MEDLINE using a full-sentence parser. Bioinformatics, 20(5):604–611. by a flawless named entity recognition system. Jing Ding, Daniel Berleant, Jun Xu, and Andy W. We evaluated the effect of the dictionary ex- Fulmer. 2003. Extracting biochemical interac- tension proposed by Szolovits (2003), and found tions from medline using a link grammar parser. that while it significantly reduced ambiguity In B. Werner, editor, Proceedings of the 15th and improved performance for the most ambigu- IEEE International Conference on Tools with Ar- ous sentences, overall improvement was only tificial Intelligence, pages 467–471. IEEE Com- 2.5%. This indicates that extending the dic- puter Society, Los Alamitos, CA. tionary is not sufficient to address the perfor- Filip Ginter, Tapio Pahikkala, Sampo Pyysalo, mance problems and that modifications to the Jorma Boberg, Jouni J¨arvinen, and Tapio Salakoski. 2004. Extracting protein-protein inter- grammar and parser are necessary. action sentences by applying rough set data anal- The quantitative analysis of LG performance ysis. In S. Tsumoto, R. Slowinski, J. Komorowski, confirms that, in its current state, LG is not well and J.W. Grzymala-Busse, editors, Lecture Notes suited to the IE task discussed. However, in the in Computer Science 3066. Springer, Heidelberg. failure analysis we have identified a number of Daniel D. Sleator and Davy Temperley. 1991. Pars- specific issues and problematic areas for LG in ing english with a link grammar. Technical Re- parsing biomedical publications, and suggested port CMU-CS-91-196, Department of Computer improvements for adapting the parser to this Science, Carnegie Mellon University, Pittsburgh, PA. domain. The examination and implementation Peter Szolovits. 2003. Adding a medical lexicon to of these improvements is a natural follow-up of an english parser. In Mark Musen, editor, Pro- this study. Our initial experiments suggest that ceedings of the 2003 AMIA Annual Symposium, it is indeed possible to implement general so- pages 639–643. American Medical Informatics As- lutions to many of the discussed problems, and sociation, Bethesda, MD. such modifications would be expected to lead to Joshua M. Temkin and Mark R. Gilder. 2003. Ex- improved applicability of LG to the biomedical traction of protein interaction information from domain. unstructured text using a context-free grammar. Bioinformatics, 19(16):2046–2053. 9 Acknowledgments This work has been supported by Tekes, the Finnish National Technology Agency.

21