Automatic Classification of Protein Functions from the Literature

Comparative and Functional Genomics
Comp Funct Genom 2003; 4: 75–79.
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.241

Conference Review

Christian Blaschke and Alfonso Valencia*
Protein Design Group, CNB-CSIC, Madrid, Spain

*Correspondence to: Alfonso Valencia, Protein Design Group, National Center for Biotechnology, CNB-CSIC, Cantoblanco, Madrid E-28049, Spain. E-mail: [email protected]

Received: 6 November 2002
Accepted: 29 November 2002

Copyright 2003 John Wiley & Sons, Ltd.

Key activities in genomics and proteomics are related to the use of classifications (ontologies). The three main activities are:

1. Annotation of gene function after large-scale genome sequencing projects or in protein databases.
2. Annotation of the possible function common to sets of genes with similar expression profiles (determined after DNA array experiments).
3. Annotation of the characteristic functions of groups of proteins, protein complexes and biological pathways (determined by proteomics approaches).

The process of annotation requires the work of human experts who, by reading the available literature and using their expert knowledge, are able to generate abstractions regarding the common function of genes and proteins.

This process is extremely demanding and time consuming and usually has to be repeated many times (e.g. for every different clustering process applied to the same DNA array experiment). In addition, the pointers to the evidence used by the annotators must be recorded (the experimental evidence should be linked to the annotations).

The GO ontology [2] has become a de facto standard that helps to solve some of these problems, providing fast access to the annotations of various functional characteristics (metabolic, chemistry, localization). However, a number of the requirements described above are still not addressed, including the need to keep pointers to the underlying information up to date and the need to provide reproducible evidence (e.g. why this group of sequences was assigned to that function). Furthermore, the annotation of functions at levels that differ from the ones represented in GO, or genomes not yet included, also remains to be solved.

Statistical information extraction is a field with a long history that has only recently been applied to biology. One of the best-known applications has been integrated into PubMed to enhance the search functionality for Medline queries [21,22]. Special applications were dedicated to knowledge base construction [7,14], analysis of DNA microarray data [13,19] and improvements of the search capabilities in document collections [16].

Systems for the annotation of groups of genes (gene products)

These methods were developed for the functional annotation of protein families, results of DNA array expression experiments, and data from proteomics. They are based on statistical information extraction techniques that basically use term frequencies and their distribution in the text to characterize a given text corpus. Here the criterion is whether a term appears more frequently in a given document set than in other comparable documents. Terms will have high (statistical significance) values if they are frequent in the documents under consideration and not frequent in the rest of the groups.

These systems extract (from a set of documents) specific terms that are related to a number of genes or proteins of interest, detect highly informative sentences in the text and rank the documents according to their information content, and thus reduce the number of documents that have to be reviewed for a set of genes or proteins.

Specific applications related to the annotation of protein families and gene clusters

This basic engine has been adapted for the analysis of protein families [1,3], clusters of genes from DNA array experiments [4,15] and protein complexes (not published), e.g. for functionally related proteins implicated in vacuolar ATP synthesis, the system extracts terms such as vacuolar acidification, vacuolar ATPase or proton-translocating ATPase; for proteins related to the cell cycle, terms such as microtubule, spindle pole or kinesin-related (showing the strong relationship between cell division and the cytoskeleton) are extracted.

The main result of expression array experiments is the discovery of sets of genes with similar expression patterns (expression-based gene clusters). The underlying assumption is that these clusters are related by their participation in common biological processes. These similarities are reflected in the individual publications about the genes in a given cluster and can be detected by our methods.

We analysed gene expression data, including the experiments in yeast published by Eisen et al. [9]. These experiments monitored the expression of yeast cells in 79 separate experiments, including diauxic shift, mitotic cell cycle, sporulation, and temperature and reducing shocks. The system was applied to the 254 genes that showed significant changes in gene expression, corresponding to 10 clusters. Figure 1 shows part of the results generated by the system.

The results obtained for a cluster containing genes related to DNA replication initiation and entrance into the cell cycle, including cell division control (CDC) genes such as cdc47 and cdc54, genes related to minichromosome maintenance (mcm2 and mcm3), and dbf2, a protein kinase related to cell division, illustrate the quality of the terms extracted by the system. The terms extracted were related to minichromosome maintenance (mcm2, mcm3, mcm4, mcm5, cdc46, mcm proteins, mcm family, mcm genes, minichromosome maintenance, maintenance mcm, mis5, chromosome loss), DNA synthesis (licensing factor, replicate, replication licensing, replication origins, autonomously replicating, DNA replication, DNA synthesis, S-phase), phosphorylation (protein kinase, dbf2, phosphorylate) and cell cycle (cdc46, cdc47, cdc21, cdc54, cell cycle).

Figure 1. Results of an analysis of a DNA array expression experiment. The figure shows extracted keywords for each cluster (on the left) and the corresponding expression profiles (middle). This output was produced with an implementation of the methods described in the publications mentioned above by Alma Bioinformatica S.L. (www.almabioinfo.com) under the name AlmaTextMiner

Problems with gene and protein names

The main challenge is still the correct identification of all the pertinent text, i.e. all the documents (abstracts) related to the genes (or gene products) under analysis. Using current techniques it is not difficult to identify the entities in the text (this is the part of the analysis called ‘named entity recognition’ in the context of information extraction) but, because of ambiguities in the names of biological objects and the absence of a strict nomenclature for gene and protein names, these entities are difficult to classify.

The problem is related to the different ways in which the same gene (or gene product) can be referred to in text. Our previous evaluation shows that even in well-annotated, manually curated databases a substantial number of records contain protein names that cannot be identified in the literature [5].

A number of groups have addressed this problem using different technologies [11,17,20] and Franzén et al. recently reported some improvements [10]. Still, no perfect solution is at hand. The problem is not only due to the number of ways in which gene and protein names can be written (e.g. IL6, IL-6 or IL 6, for interleukin 6); other issues are that names can be part of other names (e.g. Cdc7 and Cdc7 protein kinase), that they can include non-protein name parts (e.g. RNA in RNA polymerase II), and that they can refer to classes of proteins rather than a specific protein (e.g. Fus3p and Kss1p are MAP kinases, CLN1 and CLN2 are G1/S cyclins). Furthermore, a protein can be referred to by its specific name or by its general class name (cyclin B or ‘this cyclin’), and gene and protein names can easily be confused with the names of other biological entities (e.g. drugs).

Creation of ontologies for the specific annotation of protein function

Following the initial methodology for the extraction of information for the functional annotation of genes and proteins, we implemented a system that would detect similarities between genes based on the literature published about them [6]. These similarities were then used to cluster genes and to construct tree-like structures with genes at the leaves and concepts (terms) at the nodes of the tree, each one of them weighted by its statistical significance.

The system produces similar annotations to those given by GO, particularly for aspects related to biochemical function. Additionally, the terms and document links that justify the proposed relations are provided, to assist human experts in building the ontologies.

For example, the system associated a number of genes implicated in cell cycle regulation, e.g. G1 and G2 cyclins, the protein kinases cdc28 and cdc7, and DNA polymerases, because they all shared a number of significant terms (e.g. G1 cyclin, cell cycle, s-phase) that indicate that they are implicated [...]

The system includes facilities for: (a) the use of large repositories of published knowledge to enrich the information associated with the existing ontologies; (b) the automatic suggestion of the classification of new entities; and (c) the introduction of possible new pointers to the corresponding literature. The process has the advantage of producing annotations that can be directly related to the corresponding sentences and documents, providing additional words and concepts to supplement those provided in the human-driven ontologies.

The utility of automatic classification systems may be particularly relevant for database annotators during the construction of ontologies for new organisms.
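
The term-frequency criterion described under "Systems for the annotation of groups of genes (gene products)" can be illustrated with a minimal sketch. It ranks terms by how much more often they occur in the abstracts of a gene group than in a background document set. The z-score-like statistic, the function name score_terms, the add-one smoothing and the toy documents are all assumptions made for illustration; the text above does not specify the statistic actually used by the system.

```python
import math
from collections import Counter


def score_terms(group_docs, background_docs):
    """Rank terms by over-representation in group_docs relative to background_docs.

    Each document is a list of already-extracted terms. The score is a simple
    z-score of a proportion difference (an illustrative assumption, not the
    published statistic).
    """
    # Document frequency: number of documents in which each term occurs.
    group_df = Counter(t for doc in group_docs for t in set(doc))
    back_df = Counter(t for doc in background_docs for t in set(doc))
    n_group, n_back = len(group_docs), len(background_docs)

    scores = {}
    for term, k in group_df.items():
        p_group = k / n_group
        # Add-one smoothing keeps the background rate strictly between 0 and 1.
        p_back = (back_df[term] + 1) / (n_back + 2)
        # Standard error of the group proportion under the background rate.
        se = math.sqrt(p_back * (1 - p_back) / n_group)
        scores[term] = (p_group - p_back) / se
    # Highest-scoring (most group-specific) terms first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: abstracts for one gene cluster versus other clusters.
group_docs = [
    ["mcm2", "dna replication", "cell cycle"],
    ["mcm3", "dna replication"],
    ["dna replication", "protein kinase"],
]
background_docs = [
    ["cell cycle", "protein kinase"],
    ["cell cycle"],
    ["protein kinase", "heat shock"],
    ["cell cycle", "heat shock"],
]

ranked = score_terms(group_docs, background_docs)
for term, score in ranked:
    print(f"{term}\t{score:.2f}")
```

In this toy example a term that occurs in every group document and never in the background ranks first, while a term that is common in the background receives a negative score, which mirrors the criterion that terms should be frequent in the documents under consideration and not frequent in the rest of the groups.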
