Kersey Focus.Indd

SYSTEMS BIOLOGY: A USER’S GUIDE REVIEW Linking publication, gene and protein data Paul Kersey and Rolf Apweiler The computational reconstruction of biological systems, ‘systems biology’, is necessarily dependent on the existence of well- annotated data sets defining and describing the components of these systems, especially genes and the proteins they encode. Information about these components can be accessed either through structured bioinformatics databases, which store basic chemical and functional information abstracted from (or supplementing) the scientific literature, or through the literature itself, which is richer in content but essentially unstructured. Initial alignment User’s protein Protein domain The systematic application of automated high-throughput molecular Curation sequence classification biology techniques has led to the generation of an immense quantity of of InterPro and member MVNELENNVRRASTLTNE Ran BP1 EVH1 data compared with that produced by the hypothesis-driven applica- databases tion of earlier technologies. For example, advances in DNA sequencing User query technology have made the study of complete genomes routine, and have been followed by similar progress in the study of transcripts, proteins Recursive and metabolites, although these generate far more complex datasets. In database search InterProScan fact, the need for databases that store molecular biological data, and that Expanded alignment Cached allow analysis through computational algorithms, was apparent long results before experimental techniques were as powerful — and as data gen- Annotation erative — as they are today. The first attempts to catalogue information Method look-up about proteins began in 1965, and by 1984 these had evolved into the scan 1 2 Predictive model Literature Protein Information Resource , the Protein Data Bank (which stores curation a a information about protein structures and was founded in 1971) and 12 23 GO terms the EMBL data library3 (the first database to store information about x1 x2 x3 a nucleic-acid sequences, founded in 1981). Today, these databases and 21 InterPro entry their successors have been joined by numerous other resources, which y1 y2 y3 store information on chemical entities, gene expression, molecular inter- Figure 1 Protein analysis using InterProScan. Curators of member databases actions and biochemical pathways. These databases often contain more of InterPro identify proteins that share a known functional domain and raw data than could be practically published in a conventional hard-copy manually identified sequences can be supplemented through iterative journal. However, the interpretation of these data is still dependent on searching of the public databases until a comprehensive set is found. The inference drawn from hypothesis-driven experimentation, the details sequences are aligned and a classifying function is built from the alignment, which is designed to recognise other proteins that posses the same domain. of which reside in free-text articles. The value of bioinformatics data is Many of these classifiers are built using hidden Markov models. Alternative thus utterly dependent on the ability to make the correct links from the classifying functions judged (by curators) to identify the same domain are sequences to the scientific literature, and to extract the information that grouped into a single entry in the InterPro database, and annotated with relevant information (expressed both in free text and in the terms of the GO literature contains. controlled vocabulary). InterProScan is a programme that allows users to This article is a user’s guide to linking gene, protein and publication characterize a sequence by applying these classifying functions. Performance data. It considers the principal primary resources in which such data can can be improved by precomputing the results for known sequences. be found, as well as the difficulties and potential pitfalls when linking them, and some of the general approaches and specific tools that can be variety of available data, including transcriptome, proteome, metabo- used to overcome these problems. We also discuss the semantic web, a lome, molecular and genetic interaction, and image data continues to recent technological framework proposed for the integration of distrib- rise. However, this deluge of data is also an opportunity for development uted data that is attracting particular interest within the bioinformatics and has led to the growth of the discipline of systems biology, which community. The need for new solutions is acute, as the quantity and aims to reconstruct entire biological systems through the modelling of their components. This requires that all components are clearly identi- Paul Kersey and Rolf Apweiler are in the EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK. fied and well-described, which for a large system means, in practice, that e-mail: [email protected] descriptive annotation must be inferred, often automatically, from related NATURE CELL BIOLOGY VOLUME 8 | NUMBER 11 | NOVEMBER 2006 1183 © 2006 Nature Publishing Group KKerseyersey ffocus.inddocus.indd 11831183 116/10/066/10/06 113:04:363:04:36 REVIEW SYSTEMS BIOLOGY: A USER’S GUIDE Nucleotide databases INSD Literature 3 4 5 Nucleic-acid sequence archive The EMBL –Genbank –DDBJ International Nucleotide Sequence Gene Database (INSD) is the product of a collaborative effort to maintain Look-up PubMed prediction Sequence similarity tools, a single global archive for nucleotide sequences. As most other bioin- search Reference translation PubMed data sets utilities Text mining formatic data is dependent on inference from nucleotide sequence, it is arguably the single most important bioinformatics resource. Its com- Protein sequence Structure UniProtKB prediction by prehensive coverage is maintained through the editorial policy of most MVNELENNVRRASTLTNE PDB homology Look-up structure scientific journals — submitting authors have to deposit the sequence InterPro-GO lookup information that is associated with articles they publish. The joint data- of known sequence base currently contains over 80 million sequences. Ran BP1 EVH1 InterPro The information contained within the archive, however, is not uni- Protein domain composition formly well annotated. The contents of existing databases are used in InterPro GO most methods for gene prediction, and thus, once these databases have mapping been updated, earlier predictions may differ from those that could now 0000187 0043507 Activation of MAPK Positive regulation GOA activity of JNK activity be made. In addition, individual submitters are free to choose what Protein structure annotation they consider pertinent to add to a sequence. As a result, 0007252 Activation of JNK activity nothing can be inferred from the absence of information in database Protein structure prediction GO functional annotation records — this particular information may simply have been unknown to, or ignored by, the submitter. Figure 2 A schematic representation of a typical workflow for bioinformatics With the advent of whole genome sequencing, a new class of database analysis. The figure shows a typical workflow for bioinformatics analysis, has emerged for data from higher eukaryotic species, in which sequence progressing from sequence to functional annotation, protein structure and literature. The complete analysis may be carried out entirely within and automatically generated annotation is made available in advance of a bioinformatics warehousing system (such as SRS), or as a sequence of not only experimental confirmation, but even the existence of defini- separate operations performed in different environments. Starting with the tive methods for gene prediction. The Ensembl6 database of metazoan sequence of a gene or protein, identical and/or similar sequences are identified genome annotation is a good example of such a resource. The contents of in the public databases. The database records describing these sequences also contain general information about the sequence, curated links to structural this type of database are subject to frequent revision, especially when the information and relevant scientific literature. Protein sequence itself can sequence has only recently been deciphered, but they often provide more also be directly analysed to predict its domain composition and structure. complete and up-to-date annotation than the traditional archives. These may provide a more reliable indication of protein function than overall sequence similarity. It is often useful to express the results of these analyses in standard controlled vocabularies (such as GO), to allow the comparison and Protein databases correspondence of information derived from different resources. Most efforts at recording functional information have occurred in protein databases. Swiss-Prot, a subsection of the UniProt Knowledgebase7 components in other systems. Therefore, the establishment and exploita- (UniProtKB) that is manually curated from the scientific literature, is tion of reliable connections between the underlying source data is not only widely considered to be the richest and most reliable store of functional necessary for primary sequence analysis, but is also an essential part of information, and is frequently used to infer information about proteins not the platform needed to support the computational, synthetic approaches yet subjected to specific experimental study. The world-wide Protein Data to

Kersey Focus.Indd

Annual Scientific Report 2013 on the Cover Structure 3Fof in the Protein Data Bank, Determined by Laponogov, I

The HUPO Proteomics Standards Initiative Meeting: Towards Common Standards for Exchanging Proteomics Data Hinxton, Cambridge, UK, 19–20 October 2002

The European Bioinformatics Institute in 2020: Building a Global Infrastructure of Interconnected Data Resources for the Life Sciences Charles E

2003 Mulder Nucl Acids Res {22

MPGM: Scalable and Accurate Multiple Network Alignment

EMBL-EBI-Overview.Pdf

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

Expert Review of Proteomics

Multi-Platform Discovery of Haplotype-Resolved Structural Variation in Human Genomes

UC Riverside UC Riverside Previously Published Works

Interpro: the Integrative Protein Signature Database

Speaker Biographies