What Is Bioinformatics? a Proposed Definition and Overview of the Field
Total Page:16
File Type:pdf, Size:1020Kb
346 © 2001 Schattauer GmbH What is Bioinformatics? A Proposed Definition and Overview of the Field N. M. Luscombe, D. Greenbaum, M. Gerstein Department of Molecular Biophysics and Biochemistry Yale University, New Haven, USA related to this new field has been surging, Summary 1. Introduction and now comprise almost 2% of the Background: The recent flood of data from genome annual total of papers in PubMed. sequences and functional genomics has given rise to Biological data are being produced at a This unexpected union between the two new field, bioinformatics, which combines elements of biology and computer science. phenomenal rate [1]. For example as subjects is attributed to the fact that life Objectives: Here we propose a definition for this of April 2001, the GenBank repository of itself is an information technology; an new field and review some of the research that is nucleic acid sequences contained organism’s physiology is largely deter- being pursued, particularly in relation to transcriptional 11,546,000 entries [2] and the SWISS- mined by its genes, which at its most basic regulatory systems. PROT database of protein sequences con- can be viewed as digital information.At the Methods: Our definition is as follows: Bioinformatics tained 95,320 [3]. On average, these databa- same time, there have been major advances is conceptualizing biology in terms of macromolecules ses are doubling in size every 15 months [2]. in the technologies that supply the initial (in the sense of physical-chemistry) and then applying In addition, since the publication of data; Anthony Kervalage of Celera recent- “informatics” techniques (derived from disciplines the H. influenzae genome [4], complete ly cited that an experimental laboratory such as applied maths, computer science, and statis- sequences for nearly 300 organisms have can produce over 100 gigabytes of data a tics) to understand and organize the information been released, ranging from 450 genes to day with ease [5]. This incredible pro- associated with these molecules, on a large-scale. over 100,000. Add to this the data from the cessing power has been matched by devel- Results and Conclusions: Analyses in bioinformatics predominantly focus on three types of large datasets myriad of related projects that study gene opments in computer technology; the most available in molecular biology: macromolecular struc- expression, determine the protein structu- important areas of improvements have tures, genome sequences, and the results of function- res encoded by the genes, and detail how been in the CPU, disk storage and Internet, al genomics experiments (eg expression data). these products interact with one another, allowing faster computations, better data Additional information includes the text of scientific and we can begin to imagine the enormous storage and revolutionalised the methods papers and “relationship data” from metabolic path- quantity and variety of information that is for accessing and exchanging data. ways, taxonomy trees, and protein-protein interaction being produced. networks. Bioinformatics employs a wide range As a result of this surge in data, compu- of computational techniques including sequence and ters have become indispensable to biologi- structural alignment, database design and data cal research. Such an approach is ideal 1.1 Aims of Bioinformatics mining, macromolecular geometry, phylogenetic tree construction, prediction of protein structure and because of the ease with which computers In general, the aims of bioinformatics are function, gene finding, and expression data clustering. can handle large quantities of data and three-fold. First, at its simplest bioinfor- The emphasis is on approaches integrating a variety of probe the complex dynamics observed in matics organises data in a way that allows computational methods and heterogeneous data nature. Bioinformatics, the subject of the researchers to access existing information sources. Finally, bioinformatics is a practical discipline. current review, is often defined as the appli- and to submit new entries as they are We survey some representative applications, such as cation of computational techniques to produced, e.g. the Protein Data Bank for finding homologues, designing drugs, and performing understand and organise the information 3D macromolecular structures [6, 7]. While large-scale censuses. Additional information pertinent associated with biological macromolecules. data-curation is an essential task, the in- to the review is available over the web at Fig. 1 shows that the number of papers formation stored in these databases is http://bioinfo.mbb.yale.edu/what-is-it. essentially useless until analysed. Thus the purpose of bioinformatics extends much Keywords further. The second aim is to develop tools Bioinformatics, Genomics, Introduction, Transcription Updated version of an invited review paper and resources that aid in the analysis of Regulation that appeared in Haux, R., Kulikowski, C. (eds.) (2001). IMIA Yearbook of Medical Informatics data. For example, having sequenced a par- Method Inform Med 2001; 40: 346–58 2001: Digital Libraries and Medicine, pp. 83–99. ticular protein, it is of interest to compare it Stuttgart: Schattauer. with previously characterised sequences. Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 347 What is Bioinformatics? 2001).At the next level are protein sequenc- es comprising strings of 20 amino acid- letters. At present there are about 400,000 known protein sequences [3], with a typical bacterial protein containing approximately 300 amino acids. Macromolecular struc- tural data represents a more complex form of information. There are currently 15,000 entries in the Protein Data Bank, PDB [6, 7], containing atomic structures of pro- teins, DNA and RNA solved by x-ray crystallography and NMR. A typical PDB file for a medium-sized protein contains the xyz-coordinates of approximately 2,000 atoms. Scientific euphoria has recently centred on whole genome sequencing. As with the raw DNA sequences, genomes consist of strings of base-letters, ranging from 1.6 million bases in Haemophilus influenzae Fig. 1 Plot showing the growth of scientific publications in bioinformatics between 1973 and 2000. The histogram bars (left vertical axis) counts the total number of scientific articles relating to bioinformatics, and the black line (right vertical [10] to 3 billion in humans [11, 12]. The axis) gives the percentage of the annual total of articles relating to bioinformatics. The data are taken from PubMed. Entrez database [13] currently has com- plete sequences for nearly 300 archaeal, bacterial and eukaryotic organisms. In This needs more than just a simple text- finally some of the major practical applica- addition to producing the raw nucleotide based search, and programs such as FASTA tions of bioinformatics. sequence, a lot of work is involved in [8] and PSI-BLAST [9] must consider what processing this data. An important aspect constitutes a biologically significant match. of complete genomes is the distinction Development of such resources dictates between coding regions and non-coding expertise in computational theory, as well regions -‘junk’ repetitive sequences making as a thorough understanding of biology. 2. “…the INFORMATION up the bulk of base sequences especially in The third aim is to use these tools to ana- associated with these eukaryotes. Within the coding regions, lyse the data and interpret the results in a genes are annotated with their translated biologically meaningful manner. Traditio- Molecules…” protein sequence, and often with their nally, biological studies examined individu- cellular function. al systems in detail, and frequently com- Table 1 lists the types of data that are pared them with a few that are related. In analysed in bioinformatics and the range of bioinformatics, we can now conduct global topics that we consider to fall within the Bioinformatics – a Definition1 analyses of all the available data with the field. Here we take a broad view and in- aim of uncovering common principles that clude subjects that may not normally be (Molecular) bio – informatics: bioinfor- apply across many systems and highlight listed. We also give approximate values matics is conceptualising biology in novel features. describing the sizes of data being discussed. terms of molecules (in the sense of Phy- In this review, we provide a systematic We start with an overview of the sources sical chemistry) and applying “informa- definition of bioinformatics as shown in of information. Most bioinformatics analy- tics techniques” (derived from disci- Box 1. We focus on the first and third aims ses focus on three primary sources of data: plines such as applied maths, computer just described, with particular reference to DNA or protein sequences, macromolecu- science and statistics) to understand and the keywords: information, informatics, lar structures and the results of functional organise the information associated organisation, understanding, large-scale genomics experiments. Raw DNA se- with these molecules, on a large scale. In and practical applications. Specifically, we quences are strings of the four base-letters short, bioinformatics is a management discuss the range of data that are currently comprising genes, each typically 1,000 bases information system for molecular biolo- being examined, the databases