A Protein Interaction Extraction System using a Link Grammar Parser from Biomedical Abstracts
World Academy of Science, Engineering and Technology
International Journal of Biomedical and Biological Engineering Vol:1, No:5, 2007

PIELG: A Protein Interaction Extraction System using a Link Grammar Parser from Biomedical Abstracts

Rania A. Abul Seoud, Nahed H. Solouma, Abou-Baker M. Youssef, and Yasser M. Kadah, Senior Member, IEEE

Rania A. Abul Seoud is with the Department of Computer Engineering, Faculty of Engineering, Fayoum University, Fayoum, Egypt. E-mail: [email protected].
Nahed H. Solouma is with the Laser Institute, Helwan University, Giza, Egypt. E-mail: [email protected].
Abou-Baker M. Youssef is with the Department of Biomedical Engineering, Faculty of Engineering, Cairo University, Giza, Egypt. E-mail: [email protected].
Yasser M. Kadah is with the Department of Biomedical Engineering, Faculty of Engineering, Cairo University, Giza, Egypt. E-mail: [email protected].

International Science Index, Biomedical and Biological Engineering Vol:1, No:5, 2007 waset.org/Publication/4889
International Scholarly and Scientific Research & Innovation 1(5) 2007 282 scholar.waset.org/1307-6892/4889

Abstract—Due to the ever-growing number of publications about protein-protein interactions, information extraction from text is increasingly recognized as one of the crucial technologies in bioinformatics. This paper presents PIELG, a Protein Interaction Extraction System using a Link Grammar Parser from biomedical abstracts. PIELG uses the linkage given by the Link Grammar Parser to start a case-based analysis of the contents of various syntactic roles as well as their linguistically significant and meaningful combinations. The system uses phrasal-prepositional verb patterns to overcome preposition combination problems. The recall and precision are 74.4% and 62.65%, respectively. Experimental evaluations with two other state-of-the-art extraction systems indicate that the PIELG system achieves better performance. For further evaluation, the system is augmented with a graphical package (Cytoscape) for extracting protein interaction information from sequence databases. The result shows that the performance is remarkably promising.

Keywords—Link Grammar Parser, Interaction extraction, protein-protein interaction, Natural language processing.

I. INTRODUCTION

Proteomics is aimed at understanding protein-protein interactions. The function of a protein can be characterized more precisely through knowledge of protein-protein interactions. Protein-protein interactions are important for many biological functions. They link many proteins in the cell into large connected interaction networks. Each protein can have one or more of many roles in the network. Moreover, networks of interacting proteins provide a first level of understanding of the cellular mechanism.

Solving many tragic and costly problems in human health care, such as tissue loss or organ failure, requires continuously updated information about protein-protein interactions. The field of applications that repair or replace portions of, or whole, living tissues (e.g., bone, dentine, or bladder) using living cells is named Tissue Engineering (TE). For example, dentine formation is the process of regenerating dental tissues by tissue engineering principles and technology. Dentine formation is governed by biological mediators or growth factors (proteins) and by interactions amongst different proteins. Dentine formation therefore needs the support of continuously updated information about protein-protein interactions.

Research in the last decade has produced a large amount of information about the protein functions involved in the dentine formation process. The generated data is highly connected; hence, such data should be made easily available. Scientists in that field are aided by many online databases covering different aspects of protein function, such as the protein-protein interaction databases DIP [1] and BOND [2], as well as CSNDB [3] and SPAD [4]. However, since they depend on human experts, they rarely store more than a few thousand of the best-known protein relationships and do not contain the most recently discovered facts and experimental details.

The information about the protein-protein interactions involved in the dentine formation process is scattered throughout numerous publications in scientific journals and/or abstracts. According to the U.S. National Library of Medicine [5], PubMed [6] includes over 17 million citations from MEDLINE and other life science journals for biomedical articles dating back to the 1950s, with about 40,000 being added each month. Hence, manually collecting data from such resources to update databases becomes impractically time-consuming. Thus, a scalable, robust system for discovering the protein interactions involved in the dentine formation process provides a major information extraction tool for molecular biologists in tissue engineering laboratories, allowing them to automatically extract updated biological data about protein-protein interactions and transfer it from unstructured form to a structured form for use in their respective applications.

In this paper we present the PIELG system. PIELG is a Protein Interaction Extraction System using a Link Grammar Parser from biomedical abstracts. PIELG is a fully automated system for extracting protein interactions from natural language texts. Our approach tags protein names with the help of protein names and linguistic ontologies. The system extracts complete interactions by analyzing the matching contents of syntactic roles and their linguistically significant combinations. The system uses phrasal-prepositional verb patterns to overcome preposition combination problems. PIELG is implemented purely in Perl on the Linux platform. The recall and precision are 47.4% and 62.65%, respectively. Experimental evaluations with two other state-of-the-art extraction systems indicate that the PIELG system achieves better performance. For further evaluation, the PIELG system is augmented with a graphical package (Cytoscape) for extracting protein interaction information from sequence databases. The augmentation process evaluates the extracted interactions by drawing their pathways and then comparing those pathways with the pathways stored in Cytoscape. Our experimental results show that the PIELG system achieves better performance without the need for manual pattern creation (by the user), which is required by other systems.

II. RELATED WORK

The goal of relationship extraction is to detect occurrences of a pre-specified type of relationship between a pair of entities of given types. While the type of the entities is usually very specific (e.g. genes, proteins or drugs), the type of relationship may be very general (e.g. any biochemical association) or very specific (e.g. a regulatory relationship). Many approaches have been proposed for information extraction (IE) from scientific publications, ranging from simple statistical methods to advanced natural language processing (NLP) systems. The first step towards information extraction was to recognize the names of proteins, genes, drugs and other molecules [7]. The next step was to recognize interaction events between such entities [8]. Basic information extraction approaches rely on the matching of pre-specified templates (patterns) or rules. A number of groups reported application

into a structure from which relationships can be readily extracted. Many natural language processing approaches at various complexity levels have been used successfully to extract various classes of data from biological texts, including protein-protein interactions.

More advanced systems utilizing shallow parsing techniques have been described to extract protein interactions [12]. Unlike word-based pattern matchers, shallow parsers [13] perform partial decomposition of sentence structure. They identify certain phrasal components and extract local dependencies between them without reconstructing the structure of an entire sentence. In some cases, shallow parsers are used in combination with various heuristic and statistical methods [14].

The most promising candidates for a practical information extraction system are ones based on full-sentence parsing, as they deal with the structure of an entire sentence and therefore are potentially more accurate. However, full parsers are significantly slower and require more memory. The problem of parsing ambiguity can be reduced by employing domain-specific context-sensitive grammars. This approach has been implemented in a system called MedLee [15]. Another system, GENIES [16], utilizes a grammar-based NLP engine for information extraction. Context-free parsing systems, on the other hand, are general enough to be applicable to any domain, but completely generic systems seem to be impractical and inefficient.

The Pathway Assist system uses MedScan [17], an NLP system for the biomedical domain that tags the entities in text and produces a semantic tree. Slot-filler-type rules are engineered based on the semantic tree representation to extract relationships from text. Recently, it has been extended as GeneWays [18], which also provides a Web interface that allows users to search and submit papers of interest for analysis. The BioRAT [19] system uses manually engineered templates that combine lexical and semantic information to identify protein interactions. Grammar