346 © 2001 Schattauer GmbH
What is Bioinformatics? A Proposed Definition and Overview of the Field
N. M. Luscombe, D. Greenbaum, M. Gerstein Department of Molecular Biophysics and Biochemistry Yale University, New Haven, USA
related to this new field has been surging, Summary 1. Introduction and now comprise almost 2% of the Background: The recent flood of data from genome annual total of papers in PubMed. sequences and functional genomics has given rise to Biological data are being produced at a This unexpected union between the two new field, bioinformatics, which combines elements of biology and computer science. phenomenal rate [1]. For example as subjects is attributed to the fact that life Objectives: Here we propose a definition for this of April 2001, the GenBank repository of itself is an information technology; an new field and review some of the research that is nucleic acid sequences contained organism’s physiology is largely deter- being pursued, particularly in relation to transcriptional 11,546,000 entries [2] and the SWISS- mined by its genes, which at its most basic regulatory systems. PROT database of protein sequences con- can be viewed as digital information.At the Methods: Our definition is as follows: Bioinformatics tained 95,320 [3]. On average, these databa- same time, there have been major advances is conceptualizing biology in terms of macromolecules ses are doubling in size every 15 months [2]. in the technologies that supply the initial (in the sense of physical-chemistry) and then applying In addition, since the publication of data; Anthony Kervalage of Celera recent- “informatics” techniques (derived from disciplines the H. influenzae genome [4], complete ly cited that an experimental laboratory such as applied maths, computer science, and statis- sequences for nearly 300 organisms have can produce over 100 gigabytes of data a tics) to understand and organize the information been released, ranging from 450 genes to day with ease [5]. This incredible pro- associated with these molecules, on a large-scale. over 100,000. Add to this the data from the cessing power has been matched by devel- Results and Conclusions: Analyses in bioinformatics predominantly focus on three types of large datasets myriad of related projects that study gene opments in computer technology; the most available in molecular biology: macromolecular struc- expression, determine the protein structu- important areas of improvements have tures, genome sequences, and the results of function- res encoded by the genes, and detail how been in the CPU, disk storage and Internet, al genomics experiments (eg expression data). these products interact with one another, allowing faster computations, better data Additional information includes the text of scientific and we can begin to imagine the enormous storage and revolutionalised the methods papers and “relationship data” from metabolic path- quantity and variety of information that is for accessing and exchanging data. ways, taxonomy trees, and protein-protein interaction being produced. networks. Bioinformatics employs a wide range As a result of this surge in data, compu- of computational techniques including sequence and ters have become indispensable to biologi- structural alignment, database design and data cal research. Such an approach is ideal 1.1 Aims of Bioinformatics mining, macromolecular geometry, phylogenetic tree construction, prediction of protein structure and because of the ease with which computers In general, the aims of bioinformatics are function, gene finding, and expression data clustering. can handle large quantities of data and three-fold. First, at its simplest bioinfor- The emphasis is on approaches integrating a variety of probe the complex dynamics observed in matics organises data in a way that allows computational methods and heterogeneous data nature. Bioinformatics, the subject of the researchers to access existing information sources. Finally, bioinformatics is a practical discipline. current review, is often defined as the appli- and to submit new entries as they are We survey some representative applications, such as cation of computational techniques to produced, e.g. the Protein Data Bank for finding homologues, designing drugs, and performing understand and organise the information 3D macromolecular structures [6, 7]. While large-scale censuses. Additional information pertinent associated with biological macromolecules. data-curation is an essential task, the in- to the review is available over the web at Fig. 1 shows that the number of papers formation stored in these databases is http://bioinfo.mbb.yale.edu/what-is-it. essentially useless until analysed. Thus the purpose of bioinformatics extends much Keywords further. The second aim is to develop tools Bioinformatics, Genomics, Introduction, Transcription Updated version of an invited review paper and resources that aid in the analysis of Regulation that appeared in Haux, R., Kulikowski, C. (eds.) (2001). IMIA Yearbook of Medical Informatics data. For example, having sequenced a par- Method Inform Med 2001; 40: 346–58 2001: Digital Libraries and Medicine, pp. 83–99. ticular protein, it is of interest to compare it Stuttgart: Schattauer. with previously characterised sequences.
Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 347 What is Bioinformatics?
2001).At the next level are protein sequenc- es comprising strings of 20 amino acid- letters. At present there are about 400,000 known protein sequences [3], with a typical bacterial protein containing approximately 300 amino acids. Macromolecular struc- tural data represents a more complex form of information. There are currently 15,000 entries in the Protein Data Bank, PDB [6, 7], containing atomic structures of pro- teins, DNA and RNA solved by x-ray crystallography and NMR. A typical PDB file for a medium-sized protein contains the xyz-coordinates of approximately 2,000 atoms. Scientific euphoria has recently centred on whole genome sequencing. As with the raw DNA sequences, genomes consist of strings of base-letters, ranging from 1.6 million bases in Haemophilus influenzae Fig. 1 Plot showing the growth of scientific publications in bioinformatics between 1973 and 2000. The histogram bars (left vertical axis) counts the total number of scientific articles relating to bioinformatics, and the black line (right vertical [10] to 3 billion in humans [11, 12]. The axis) gives the percentage of the annual total of articles relating to bioinformatics. The data are taken from PubMed. Entrez database [13] currently has com- plete sequences for nearly 300 archaeal, bacterial and eukaryotic organisms. In This needs more than just a simple text- finally some of the major practical applica- addition to producing the raw nucleotide based search, and programs such as FASTA tions of bioinformatics. sequence, a lot of work is involved in [8] and PSI-BLAST [9] must consider what processing this data. An important aspect constitutes a biologically significant match. of complete genomes is the distinction Development of such resources dictates between coding regions and non-coding expertise in computational theory, as well regions -‘junk’ repetitive sequences making as a thorough understanding of biology. 2. “…the INFORMATION up the bulk of base sequences especially in The third aim is to use these tools to ana- associated with these eukaryotes. Within the coding regions, lyse the data and interpret the results in a genes are annotated with their translated biologically meaningful manner. Traditio- Molecules…” protein sequence, and often with their nally, biological studies examined individu- cellular function. al systems in detail, and frequently com- Table 1 lists the types of data that are pared them with a few that are related. In analysed in bioinformatics and the range of bioinformatics, we can now conduct global topics that we consider to fall within the Bioinformatics – a Definition1 analyses of all the available data with the field. Here we take a broad view and in- aim of uncovering common principles that clude subjects that may not normally be (Molecular) bio – informatics: bioinfor- apply across many systems and highlight listed. We also give approximate values matics is conceptualising biology in novel features. describing the sizes of data being discussed. terms of molecules (in the sense of Phy- In this review, we provide a systematic We start with an overview of the sources sical chemistry) and applying “informa- definition of bioinformatics as shown in of information. Most bioinformatics analy- tics techniques” (derived from disci- Box 1. We focus on the first and third aims ses focus on three primary sources of data: plines such as applied maths, computer just described, with particular reference to DNA or protein sequences, macromolecu- science and statistics) to understand and the keywords: information, informatics, lar structures and the results of functional organise the information associated organisation, understanding, large-scale genomics experiments. Raw DNA se- with these molecules, on a large scale. In and practical applications. Specifically, we quences are strings of the four base-letters short, bioinformatics is a management discuss the range of data that are currently comprising genes, each typically 1,000 bases information system for molecular biolo- being examined, the databases into which long. The GenBank [2] repository of gy and has many practical applications. they are organised, the types of analyses nucleic acid sequences currently holds a 1As submitted to the Oxford English that are being conducted using transcrip- total of 12.5 billion bases in 11.5 million Dictionary. tion regulatory systems as an example, and entries (all database figures as of April
Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 Method Inform Med 4/2001 For personal or educational use only. No other uses without permission. All rights reserved. 348 Luscombe, Greenbaum, Gerstein
Table 1 Sources of data used in bioinformatics, the quantity of each type of data that is currently (April 2001) available, quences. While more biological informa- and bioinformatics subject areas that utilize this data. tion can be derived from a single structure than a protein sequence, the lack of depth in the latter is compensated by analysing larger quantities of data.
3. “… ORGANISE the Infor- mation on a LARGE SCALE…” 3.1 Redundancy and Multiplicity of Data A concept that underpins most research methods in bioinformatics is that much of the data can be grouped together based on biologically meaningful similarities. For example, sequence segments are often repeated at different positions of genomic DNA [27]. Genes can be clustered into those with particular functions (eg enzy- matic actions) or according to the meta- bolic pathway to which they belong [28], although here, single genes may actually possess several functions [29]. Going further, distinct proteins frequently have comparable sequences – organisms often have multiple copies of a particular gene through duplication and different species have equivalent or similar proteins that were inherited when they diverged from each other in evolution. At a structural level, we predict there to be a finite number of different tertiary structures – estimates More recent sources of data have been tities of data when experiments are con- range between 1,000 and 10,000 folds from functional genomics experiments, of ducted for larger organisms and at more [30, 31] – and proteins adopt equivalent which the most common are gene expres- time-points. structures even when they differ greatly in sion studies.We can now determine expres- Further genomic-scale data include sequence [32]. As a result, although the sion levels of almost every gene in a given biochemical information on metabolic number of structures in the PDB has cell on a whole-genome level, however pathways, regulatory networks, protein- increased exponentially, the rate of discov- there is currently no central repository for protein interaction data from two-hybrid ery of novel folds has actually decreased. this data and public availability is limited. experiments, and systematic knockouts of There are common terms to describe the These experiments measure the amount of individual genes to test the viability of an relationship between pairs of proteins or mRNA that is produced by the cell [14-18] organism. the genes from which they are derived: under different environmental conditions, What is apparent from this list is the analogous proteins have related folds, but different stages of the cell cycle and differ- diversity in the size and complexity of dif- unrelated sequences, while homologous ent cell types in multi-cellular organisms. ferent datasets. There are invariably more proteins are both sequentially and structu- Much of the effort has so far focused on the sequence-based data than others because rally similar. The two categories can some- yeast [19-24] and human genomes [25, 26]. of the relative ease with which they can be times be difficult to distinguish especially if One of the largest dataset for yeast has produced.This is partly related to the great- the relationship between the two proteins made approximately 20 time-point meas- er complexity and information-content of is remote [33, 34]. Among homologues, it is urements for 6,000 genes [19]. However, individual structures or gene expression useful to distinguish between orthologues, there is potential for much greater quan- experiments compared to individual se- proteins in different species that have evolv-
Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 349 What is Bioinformatics?
ed from a common ancestral gene, and straightforward to access and cross- producing multiple sequence alignments paralogues, proteins that are related by reference these sources of information be- [48], and searching for functional domains gene duplication within a genome [35]. cause of differences in nomenclature and from conserved sequence motifs in such Normally, orthologues retain the same file formats. alignments. Investigations of structural function while paralogues evolve distinct, At a basic level, this problem is fre- data include prediction of secondary and but related functions [36]. quently addressed by providing external tertiary protein structures, producing An important concept that arises from links to other databases. For example in methods for 3D structural alignments [49, these observations is that of a finite “parts PDBsum, web-pages for individual struc- 50], examining protein geometries using list” for different organisms [37-39]: an tures direct the user towards corresponding distance and angular measurements, calcu- inventory of proteins contained within an entries in the PDB, NDB, CATH, SCOP lations of surface and volume shapes and organism, arranged according to different and SWISS-PROT databases. At a more analysis of protein interactions with other properties such as gene sequence, protein advanced level, there have been efforts to subunits, DNA, RNA and smaller mole- fold or function. Taking protein folds as an integrate access across several data sources. cules. These studies have lead to molecular example, we mentioned that with a few One is the Sequence Retrieval System, SRS simulation topics in which structural data exceptions, the tertiary structures of pro- [41], which allows flat-file databases to be are used to calculate the energetics in- teins adopt one of a limited repertoire indexed to each other; this allows the user volved in stabilising macromolecular struc- of folds. As the number of different fold to retrieve, link and access entries from tures, simulating movements within macro- families is considerably smaller than the nucleic acid, protein sequence, protein molecules, and computing the energies number of genes, categorising the proteins motif, protein structure and bibliographic involved in molecular docking. The increa- by fold provides a substantial simplification databases. Another is the Entrez facility sing availability of annotated genomic of the contents of a genome. Similar sim- [42], which provides similar gateways to sequences has resulted in the introduction plifications can be provided by other attri- DNA and protein sequences, genome of computational genomics and proteomics butes such as protein function. As such, we mapping data, 3D macromolecular structu- – large-scale analyses of complete genomes expect this notion of a finite parts list to res and the PubMed bibliographic database and the proteins that they encode. Re- become increasingly common in future [43].A search for a particular gene in either search includes characterisation of protein genomic analyses. database will allow smooth transitions to content and metabolic pathways between Clearly, an essential aspect of managing the genome it comes from, the protein different genomes, identification of interac- this large volume of data lies in developing sequence it encodes, its structure, biblio- ting proteins, assignment and prediction of methods for assessing similarities between graphic reference and equivalent entries for gene products, and large-scale analyses of different biomolecules and identifying all related genes. In our own group, we have gene expression levels. Some of these re- those that are related. There are well-docu- developed the SPINE [44] and PartsList search topics will be demonstrated in our mented classifications for all of the main [39] web resources; these databases inte- example analysis of transcription regula- types of data we described earlier. Al- grate many types of experimental data and tory systems. though detailed descriptions of these clas- organise them using the concept of the Other subject areas we have included in sification systems are beyond the scope of finite “parts list” we described above. Table 1 are:development of digital libraries the current review, they are of great impor- for automated bibliographical searches, tance as they ease comparisons between knowledge bases of biological information genomes and their products. Links to the from the literature, DNA analysis methods major databases are available from our in forensics, prediction of nucleic acid struc- supplementary website. 4. “…UNDERSTAND and tures, metabolic pathway simulations, and Organise the Information…” linkage analysis – linking specific genes to different disease traits. Having examined the data, we can discuss In addition to finding relationships be- 3.2 Data Integration the types of analyses that are conducted.As tween different proteins, much of bioin- The most profitable research in bioinfor- shown in Table 1, the broad subject areas in formatics involves the analysis of one type matics often results from integrating mul- bioinformatics can be separated according of data to infer and understand the obser- tiple sources of data [40]. For instance, the to the type of information that is used. For vations for another type of data. An exam- 3D coordinates of a protein are more useful raw DNA sequences, investigations involve ple is the use of sequence and structural if combined with data about the protein’s separating coding and non-coding regions, data to predict the secondary and tertiary function, occurrence in different genomes, and identification of introns, exons and structures of new protein sequences [51]. and interactions with other molecules. In promoter regions for annotating genomic These methods, especially the former, are this way, individual pieces of information DNA [45, 46]. For protein sequences, ana- often based on statistical rules derived are put in context with respect to other lyses include developing algorithms for from structures, such as the propensity for data. Unfortunately, it is not always sequence comparisons [47], methods for certain amino acid sequences to produce
Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 Method Inform Med 4/2001 For personal or educational use only. No other uses without permission. All rights reserved. 350 Luscombe, Greenbaum, Gerstein
Paradigm shifts during the past couple of decades have taken much of biology away from the laboratory bench and have allowed the integration of other scientific disciplines, specifically computing. The result is an expansion of biological research in breadth and depth. The vertical axis demonstrates how bioinformatics can aid rational drug design with minimal work in the wet lab. Starting with a single gene sequence, we can determine with strong certainty, the protein sequence. From there, we can determine the structure using structure prediction techniques. With geometry calculations, we can further resolve the protein’s surface and through molecular simulation determine the force fields surrounding the molecule. Finally docking algorithms can provide predictions of the ligands that will bind on the protein surface, thus paving the way for the design of a drug specific to that molecule. The horizontal axis shows how the influx of biological data and advances in computer technology have broadened the scope of biology. Initially with a pair of proteins, we can make comparisons between the between sequences and structures of evolutionary related proteins. With more data, algorithms for multiple alignments of several proteins become necessary. Using multiple sequences, we can also create phylogenetic trees to trace the evolutionary development of the proteins in question. Finally, with the deluge of data we currently face, we need to construct large databases to store, view and deconstruct the information. Alignments now become more complex, requiring sophisticated scoring schemes and there is enough data to compile a genome census – a genomic equivalent of a population census – providing comprehensive statistical accounting of protein features in genomes.
Fig. 2 Organizing and understanding biological data
different secondary structural elements. transferred between homologous proteins analysis in two dimension, depth and Another example is the use of structural [55]. breadth. The first is represented by the data to understand a protein’s function; vertical axis in the figure and outlines a here studies have investigated the rela- possible approach to the rational drug tionship different protein folds and their design process. The aim is to take a single functions [52, 53] and analysed similarities 4.1 The Bioinformatics Spectrum gene and follow through an analysis that between different binding sites in the ab- Fig. 2 summarises the main points we maximises our understanding of the sence of homology [54]. Combined with raised in our discussions of organising protein it encodes. Starting with a gene similarity measurements, these studies pro- and understanding biological data – the sequence, we can determine the protein vide us with an understanding of how much development of bioinformatics techniques sequence with strong certainty. From there, biological information can be accurately has allowed an expansion of biological prediction algorithms can be used to calcu-
Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 351 What is Bioinformatics?
late the structure adopted by the protein. son methods such as text search and one- 6.1 Structural Studies Geometry calculations can define the dimensional alignment algorithms. Motif shape of the protein’s surface and molecu- and pattern identification for multiple As of April 2001, there were 379 structures lar simulations can determine the force sequences depend on machine learning, of protein-DNA complexes in the PDB. fields surrounding the molecule. Finally, clustering and data-mining techniques. 3D Analyses of these structures have provided using docking algorithms, one could structural analysis techniques include Eu- valuable insight into the stereochemical identify or design ligands that may bind clidean geometry calculations combined principles of binding, including how par- the protein, paving the way for designing a with basic application of physical chemis- ticular base sequences are recognized drug that specifically alters the protein’s try, graphical representations of surfaces and how the DNA structure is quite often function. In practise, the intermediate steps and volumes, and structural comparison modified on binding. are still difficult to achieve accurately, and and 3D matching methods. For molecular A structural taxonomy of DNA-binding they are best combined with experimental simulations, Newtonian mechanics, quan- proteins, similar to that presented in SCOP methods to obtain some of the data, for tum mechanics, molecular mechanics and and CATH, was first proposed by Harrison example characterising the structure of the electrostatic calculations are applied. In [56] and periodically updated to accom- protein of interest. many of these areas, the computational modate new structures as they are solved The aim of the second dimension, the methods must be combined with good [57].The classification consists of a two-tier breadth in biological analysis, is to compare statistical analyses in order to provide an system: the first level collects proteins into a gene or gene product with others. Ini- objective measure for the significance of eight groups that share gross structural tially, simple algorithms can be used to the results. features for DNA-binding, and the second compare the sequences and structures of a comprises 54 families of proteins that are pair of related proteins. With a larger num- structurally homologous to each other. ber of proteins, improved algorithms can be Assembly of such a system simplifies the used to produce multiple alignments, and 6. Transcription Regulation – comparison of different binding methods; it extract sequence patterns or structural a Case Study in Bioinformatics highlights the diversity of protein-DNA templates that define a family of proteins. complex geometries found in nature, but Using this data, it is also possible to con- DNA-binding proteins have a central role also underlines the importance of inter- struct phylogenetic trees to trace the evolu- in all aspects of genetic activity within an actions between -helices and the DNA tionary path of proteins. Finally, with even organism, participating in processes such as major groove, the main mode of binding in more data, the information must be stored transcription, packaging, rearrangement, over half the protein families. While the in large-scale databases. Comparisons replication and repair. In this section, we number of structures represented in the become more complex, requiring multiple focus on the studies that have contributed PDB does not necessarily reflect the rela- scoring schemes, and we are able to con- to our understanding of transcription tive importance of the different proteins in duct genomic scale censuses that provide regulation in different organisms. Through the cell, it is clear that helix-turn-helix, comprehensive statistical accounts of this example, we demonstrate how bio- zinc-coordinating and leucine zipper motifs protein features, such as the abundance of informatics has been used to increase our are used repeatedly.These provide compact particular structures or functions in diffe- knowledge of biological systems and also frameworks to present the -helix on the rent genomes. It also allows us to build illustrate the practical applications of the surfaces of structurally diverse proteins. At phylogenetic trees that trace the evolution different subject areas that were briefly a gross level, it is possible to highlight the of whole organisms. outlined earlier. We start by considering differences between transcription factor structural analyses of how DNA-binding domains that “just” bind DNA and those proteins recognise particular base se- involved in catalysis [58]. Although there quences. Later, we review several genomic are exceptions, the former typically 5. “… applying INFORMATICS studies that have characterised the nature approach the DNA from a single face and TECHNIQUES…” of transcription factors in different orga- slot into the grooves to interact with base nisms, and the methods that have been used edges. The latter commonly envelope the The distinct subject areas we mention to identify regulatory binding sites in the substrate, using complex networks of require different types of informatics tech- upstream regions. Finally, we provide an secondary structures and loops. niques. Briefly, for data organisation, the overview of gene expression analyses that Focusing on proteins with -helices, the first biological databases were simple flat have been recently conducted and suggest structures show many variations, both in files. However with the increasing amount future uses of transcription regulatory ana- amino acid sequences and detailed geo- of information, relational database lyses to rationalise the observations made metry. They have clearly evolved indepen- methods with Web-page interfaces have in gene expression experiments. All the dently in accordance with the requirements become increasingly popular. In sequence results that we describe have been found of the context in which they are found. analysis, techniques include string compari- through computational studies. While achieving a close fit between the
Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 Method Inform Med 4/2001 For personal or educational use only. No other uses without permission. All rights reserved. 352 Luscombe, Greenbaum, Gerstein