Term Generalization and Synonym Resolution for Biological Abstracts: Using the Gene Ontology for Subcellular Localization Prediction

Term Generalization and Synonym Resolution for Biological Abstracts: Using the Gene Ontology for Subcellular Localization Prediction Alona Fyshe Duane Szafron Department of Computing Science Department of Computing Science University of Alberta University of Alberta Edmonton, Alberta T6G 2E8 Edmonton, Alberta T6G 2E8 [email protected] [email protected] Abstract domain. Our prototype method predicts the subcellular localization of proteins (the part of the biolog- The field of molecular biology is growing ical cell where a protein performs its function) by at an astounding rate and research findings performing text classification on related journal ab- are being deposited into public databases, stracts. such as Swiss-Prot. Many of the over In the last two decades, there has been explosive 200,000 protein entries in Swiss-Prot 49.1 growth in molecular biology research. Molecular bi- lack annotations such as subcellular lo- ologists organize their findings into a common set calization or function, but the vast major- of databases. One such database is Swiss-Prot, in ity have references to journal abstracts de- which each entry corresponds to a protein. As of scribing related research. These abstracts version 49.1 (February 21, 2006) Swiss-Prot con- represent a huge amount of information tains more than 200,000 proteins, 190,000 of which that could be used to generate annotations link to biological journal abstracts. Unfortunately, a for proteins automatically. Training clas- much smaller percentage of protein entries are anno- sifiers to perform text categorization on tated with other types of information. For example, abstracts is one way to accomplish this only about half the entries have subcellular localiza- task. We present a method for improving tion annotations. This disparity is partially due to text classification for biological journal the fact that humans annotate these databases manu- abstracts by generating additional text fea- ally and cannot keep up with the influx of data. If a tures using the knowledge represented in computer could be trained to produce annotations by a biological concept hierarchy (the Gene processing journal abstracts, proteins in the Swiss- Ontology). The structure of the ontology, Prot database could be curated semi-automatically. as well as the synonyms recorded in it, are leveraged by our simple technique to sig- Document classification is the process of cate- nificantly improve the F-measure of sub- gorizing a set of text documents into one or more cellular localization text classifiers by as of a predefined set of classes. The classification much as 0.078 and we achieve F-measures of biological abstracts is an interesting specializa- as high as 0.935. tion of general document classification, in that scientific language is often not understandable by, nor written for, the lay-person. It is full of specialized 1 Introduction terms, acronyms and it often displays high levels Can computers extract the semantic content of aca- of synonymy. For example, the “PAM complex”, demic journal abstracts? This paper explores the use which exists in the mitochondrion of the biologi- of natural language techniques for processing bio- cal cell is also referred to with the phrases “pre- logical abstracts to answer this question in a specific sequence translocase-associated import motor” and 17 Proceedings of the BioNLP Workshop on Linking Natural Language Processing and Biology at HLT-NAACL 06, pages 17–24, New York City, June 2006. c 2006 Association for Computational Linguistics “mitochondrial import motor”. This also illustrates a data set by mining Medline for abstracts contain- the fact that biological terms often span word bound- ing a yeast gene name, which achieved F-measures aries and so their collective meaning is lost when in the range [0.31,0.80]. F-measure is defined as text is whitespace tokenized. 2rp To overcome the challenges of scientific lan- f = r + p guage, our technique employs the Gene Ontology (GO) (Ashburner et al, 2000) as a source of expert where p is precision and r is recall. They expanded knowledge. The GO is a controlled vocabulary of their training data to include extra biological infor- biological terms developed and maintained by biol- mation about each protein, in the form of amino acid ogists. In this paper we use the knowledge repre- content, and raised their F-measure by as much as sented by the GO to complement the information 0.05. These results are modest, but before Stapley present in journal abstracts. Specifically we show et al. most localization classification systems were that: built using text rules or were sequence based. This was one of the first applications of SVMs to bio- • the GO can be used as a thesaurus logical journal abstracts and it showed that text and • the hierarchical structure of the GO can be used amino acid composition together yield better results to generalize specific terms into broad concepts than either alone. Properties of proteins themselves were again used • simple techniques using the GO significantly to improve text categorization for animal, plant and improve text classification fungi subcellular localization data sets (Hoglund¨ et al, 2006). The authors’ text classifiers were Although biological abstracts are challenging based on the most distinguishing terms of docu- documents to classify, solving this problem will ments, and they included the output of four pro- yield important benefits. With sufficiently accurate tein sequence classifiers in their training data. They text classifiers, the abstracts of Swiss-Prot entries measure the performance of their classifier using could be used to automatically annotate correspond- what they call sensitivity and specificity, though ing proteins, meaning biologists could more effi- the formulas cited are the standard definitions of ciently identify proteins of interest. Less time spent recall and precision. Their text-only classifier for sifting through unannotated proteins translates into the animal MultiLoc data set had recall (sensitivity) more time spent on new science, performing impor- in the range [0.51,0.93] and specificity (precision) tant experiments and uncovering fresh knowledge. [0.32,0.91]. The MultiLocText classifiers, which 2 Related Work include sequence-based classifications, have recall [0.82,0.93] and precision [0.55,0.95]. Their overall Several different learning algorithms have been ex- and average accuracy increased by 16.2% and 9.0% plored for text classification (Dumais et al, 1998) to 86.4% and 94.5% respectively on the PLOC an- and support vector machines (SVMs) (Vapnik, imal data set when text was augmented with addi- 1995) were found to be the most computationally ef- tional sequence-based information. ficient and to have the highest precision/recall break- Our method is motivated by the improvements even point (BEP, the point where precision equals that Stapley et al. and Hoglund¨ et al. saw when they recall). Joachims performed a very thorough evalu- included additional biological information. How- ation of the suitability of SVMs for text classifica- ever, our technique uses knowledge of a textual na- tion (Joachims, 1998). Joachims states that SVMs ture to improve text classification; it uses no infor- are perfect for textual data as it produces sparse mation from the amino acid sequence. Thus, our ap- training instances in very high dimensional space. proach can be used in conjunction with techniques Soon after Joachims’ survey, researchers started that use properties of the protein sequence. using SVMs to classify biological journal abstracts. In non-biological domains, external knowledge Stapley et al. (2002) used SVMs to predict the sub- has already been used to improve text categoriza- cellular localization of yeast proteins. They created tion (Gabrilovich and Markovitch, 2005). In their 18 research, text categorization is applied to news docu- Set of ments, newsgroup archives and movie reviews. The Proteins authors use the Open Directory Project (ODP) as a source of world knowledge to help alleviate prob- a lems of polysemy and synonymy. The ODP is a Retrieve Abstracts hierarchy of concepts where each concept node has links to related web pages. The authors mined these web pages to collect characteristic words for each Set of concept. Then a new document was mapped, based Abstracts on document similarity, to the closest matching ODP b concept and features were generated from that con- Process cept’s meaningful words. The generated features, Abstracts along with the original document, were fed into an SVM text classifier. This technique yielded BEP as high as 0.695 and improvements of up to 0.254. Data Set 1 Data Set 2 Data Set 3 We use Gabrilovich and Markovitch’s (2005) idea to employ an external knowledge hierarchy, in our case the GO, as a source of information. It has Figure 1: The workflow used to create data sets used been shown that GO molecular function annotations in this paper. Abstracts are gathered for proteins in Swiss-Prot are indicative of subcellular localiza- with known localization (process a). Treatments are tion annotations (Lu and Hunter, 2005), and that GO applied to abstracts to create three Data Sets (pro- node names made up about 6% of a sample Medline cess b). corpus (Verspoor et al, 2003). Some consider GO terms to be too rare to be of use (Rice et al, 2005), however we will show that although the presence of cess begins with a set of proteins with known sub- GO terms is slight, the terms are powerful enough to cellular localization annotations (Figure 1). For this improve text classification. Our technique’s success we use Proteome Analyst’s (PA) data sets (Lu et al, may be due to the fact that we include the synonyms 2004; Szafron et al, 2004). The PA group used these of GO node names, which increases the number of data sets to create very accurate subcellular classi- GO terms found in the documents.

Term Generalization and Synonym Resolution for Biological Abstracts: Using the Gene Ontology for Subcellular Localization Prediction

Uniloc: a Universal Protein Localization Site Predictor For

Evidence That Subcellular Localization of a Bacterial Membrane Protein Is Achieved by Diffusion and Capture

Bird Eye View of Protein Subcellular Localization Prediction

Subcellular Localization of a Bacterial Regulatory RNA

Subcellular Localization in Bacteria: from Envz/Ompr to Transertion

Vimentin) ScaOld

Physiological Roles of Zinc Transporters: Molecular and Genetic Importance in Zinc Homeostasis

Towards a Systematics for Protein Subcellular Location

Cytosolic, Nuclear and Nucleolar Localization Signals Determine Subcellular Distribution and Activity of the NF-Κb Inducing Kinase NIK

The Cell-Shape Protein Mrec Interacts with Extracytoplasmic Proteins Including Cell Wall Assembly Complexes in Caulobacter Crescentus

Antibody-Based Subcellular Localization of the Human Proteome

Molecular Characterizations of Subcellular Localization Signals in the Nucleocapsid Protein of Porcine Epidemic Diarrhea Virus