Bioinformatics: from Genome Data to Biological Knowledge Fviiguel a Andrade* and Chris Sander?

675 Bioinformatics: from genome data to biological knowledge fviiguel A Andrade* and Chris Sander? Recently, molecular biologists have sequenced about a dozen knowledge from one set of biological molecules to another, bacterial genomes and the first eukaryotic genome. We can both within one species and between species. The arc now obtain answers to detailed questions about the complete of evolutionary information transfer is in the skill set of set of genes of an organism. Bioinformatics methods are bioinformacics. increasingly used for attaching biological knowledge to long lists of genes, assigning genes to biological pathways, Bioinformatics is a science of recent creation that uses bio- comparing the gene sets of different species, identifying logical data and knowledge scored in computer databases, SF zcificity factors, and describing sets of highly conserved complemenred by computational methods, to derive new proteins common to all domains of life. Substantial progress biological knowledge. It is a theoretical biology firmly has recently been made in the availability of primary and grounded in comprehensive and detailed experimental added-value databases, in the development of algorithms facts. Currently, bioinformatics is making a key concribu- and of network information services for genome analysis. tion to the organization and analysis of the massive amount The pharmaceutical industry has greatly benefited from the of biological data from genome sequencing projects accumulation of sequence data through the identification and, increasingly, from other areas of ‘high-throughput’, of targets and candidates for the development of drugs, ‘massively parallel’, robotized and miniaturized methods vaccines, diagnostic markers and therapeutic proteins. of biological experimentation. The classical progression of the pharmaceutical discovery Addresses process goes from drug target (typically a receptor protein EMBL-EBI, Genome Campus, Cambridge, CBlO 1 SD, UK *e-mail: andradeaembi-ebi.ac.uk or enzyme) to lead compound to drug. With the arrival te-mail: [email protected] of multiple genome sequences, bioinformatics is already making practical contributions in target identification and Current Opinion in Biotechnology 1997, 8:675-683 in the design of combinatorial libraries based on detailed http://biomednet.com/elecref/0958166900800675 knowledge of one or more protein structures. 0 Current Biology Ltd ISSN 0958-l 669 This review focuses on recent bioinformatics methods Abzeviations applied co protein sequences, functions, structures and EL-ST basic local alignment search tool EST expressed sequence tag pathways. Genome analysis can be performed at different ORF open reading frame levels of complexity: individual sequences, complete genomes, and sets of genomes. We will review applications at each level, from protein sequence analysis to compara- Introduction: information transfer and tive genomics. evolution Life involves the storage, handling and transformation of Prediction of protein function information. The basic information necessary to construct The richest information in genome sequences is in the and manage a living organism is contained in its genome. set of genes coding for functional proteins. As only a T:;is information is transformed into physical reality by minority of proteins have their function known by direct molecular processes in living cells. Parts of the genome experiment, information transfer from one protein t0 are translated into proteins. Other parts of the genome anocher, by evolutionary analogy, often yields the first regulate the expression of these proteins. The translated indications about the function of a newly sequenced gene. proteins fold into highly specific three-dimensional struc- The information transfer relies on the notion (shown tures. The proteins are targeted to their precise cellular experimentally to be true in numerous cases) that proteins location where they perform their function. And so on. similar in sequence are also similar in function. The transfer of functional information from one protein to hlany details of these processes can be unraveled in another can, however, lead to substantial errors unless a th J molecular biologist’s laboratory. Biological experimen- number of technical difficulties are properly addressed tation, however, is generally laborious, expensive, slow (e.g., protein domain organization, hierarchial definition of and rarely comprehensive. Fortunately, as a result of protein functionality). evolution and of physicochemical constraints, biological organization is redundant: elementary molecular processes In practice, given a new protein sequence, the following are multiply adapted and reused in different concexrs and questions arise. Does the protein belong to a known in different species. This redundancy can be wonderfully protein family about which functional information is exploited by using evolutionary comparison and system available? If so, what is the correct alignment and is a analysis to transfer biochemical, genetic and cell-biological 3D structure available to build a model by homology? 676 Pharmaceutical biotechnology HOW closely is the protein related to the members of the capability by comparing the query protein’s protein family, family and what does this imply about the similarity in as found by the sequence-sequence search, with the function? Which region of the protein matches the family, sequence database in a second and further passes (this and does the match occur in a region known to relate to the appears to be similar to MOST [4], a sequence-sequen::e functional properties of the protein? Within that region, comparison method also based on protein family gentr- are the known or apparent functional residues conserved? ation from BLAST results). In general, a set of aligned While well-trained experts in sequence analysis are often sequences can be organized (i.e., grouped and aligned) into able to address these questions using standard database an emerging family in a variety of ways to define a ‘profile’ search software, considerable effort is going into algorithm or family ‘model’. Such profiles aim at capturing the key and software development in these areas. Here, we can functionally constrained features of the family. As a result, only cover selected and recent achievements. profile-sequence comparisons are a &ore powerful search tool than mere sequence-sequence comparisons [5,6]. The ability to assess membership in a particular protein family depends critically on the alignment search used. The next level of database searches involves profile-prof, e Most searches are based on pairwise sequence-sequence comparison, with increased power for the detection of comparison. The most widely used method, the basic remotely related family members; however, the availability local alignment search tool (BLAST) algorithm [l], of well-curated and comprehensive datasets of protein relates similar sequence fragments. This algorithm has family profiles is still limited (see Table 1). In rare been improved by two recent adaptations (WU-BLAST2 cases, discovery of similarities in 3D structure, without [2]; PSI-BLAST [3-l). PSI-BLAST extends the search apparent sequence similarity, can lead to the unification Table 1 Selected databases and information services. Database Content World Wide Web URL Reference or site author DNA and protein sequences EMBL Nucleotide www.embl-ebi.ac.uk/ebi~docs/embl_dblebi/topembl.html [411 sequences GenBank Nucleohde and www.ncbi.nlm.nih.govlWeblGenbank/index.html Ml protein sequences SwissProt Protein sequences expasy.hcuge.ch/sprot/sprot-top.html 1431 PIR Protein sequences www-nbrf.georgetown.edu/pir/ 1441 Genome sequencing Genome completion www.mcs.anl.govlhome/gaasterl/genomes.html T Gaasterland projects list Gene identification GeneMark Gene identification amber.biology.gatech.edu/-william/genemark.html 1451 Grail Gene identification compbio.ornl.gov/Grail-1.31 [451 Gene finder Gene identification dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html V Solovyez Frame Frame-shift detection www.sander.embl-ebi.ac.uk/frame/ [471 Functional genome analysis GeneQuiz Automatic protein www.sander.embl-ebi.ac.uk/genequiz I1 11 function annotation Magpie Genome analysis www.mcs.anl.gov/home/gaasterl/magpie.html 1481 Pedant Automatic analysis pedant.mips.biochem.mpg.de/frishman/pedant.html 1491 of proteins Complete Genomes at Genome analysis www.ncbi.nlm.nih.gov/Complete_Genomes WI NCBI and comparison Protein families and sequence patterns Prosite Protein sites expasy.hcuge.ch/sprotlprosite.html [501 and patterns Blocks Protein patterns wwwblocksfhcrcorg 1511 FYam Protein families www.sanger.ac.uk/Software/Pfam/ (521 and profiles Database links SRS Biological database srs.embl-ebi.ac.uk:5OOOlsrs5/ [I21 browser There are many more World Wide Web sites relevant to bioinformatics. The selection here relates to the discussion in the text and, in some cases, new developments. Bioinformatics: from genome data to biological knowledge Andrade and Sander 677 of functional families into functional superfamilies, with of the GeneQuiz software system [ll] for large scale an attendant increase in database search power [7’,8*]. sequence analysis in 1994, several groups have developed F nally, given a reliable multiple-sequence alignment systems with varying degrees of automation (see Table 1). from any method, the analysis of specificity determinants (residues characteristically conserved within a subfamily GeneQuiz [I l] aims at the

Bioinformatics: from Genome Data to Biological Knowledge Fviiguel a Andrade* and Chris Sander?

Ontology-Based Methods for Analyzing Life Science Data

Ploidetect Enables Pan-Cancer Analysis of the Causes and Impacts of Chromosomal Instability

BIOINFORMATICS APPLICATIONS NOTE Pages 380-381

Are You an Invited Speaker? a Bibliometric Analysis of Elite Groups for Scholarly Events in Bioinformatics

Full List of PCAWG Consortium Working Groups and Writing

Biological Pathways Exchange Language Level 3, Release Version 1 Documentation

Systems Biology Graphical Notation: Process Description Language Level 1

I S C B N E W S L E T T

Mapping the Protein Universe

A General Mathematical Framework for Understanding the Behavior of Heterogeneous Stem Cell Regeneration Arxiv:1903.11448V1 [Q-B

Bioinformatics 2 -- Lecture 3

Generating Functional Protein Variants with Variational Autoencoders