Review Paper
Yearbook of Medical Informatics 2001

What is bioinformatics? An introduction and overview

N.M. Luscombe, D. Greenbaum, M. Gerstein
Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, USA

Abstract: A flood of data means that many of the challenges in biology are now challenges in computing. Bioinformatics, the application of computational techniques to analyse the information associated with biomolecules on a large scale, has now firmly established itself as a discipline in molecular biology, and encompasses a wide range of subject areas from structural biology and genomics to gene expression studies. In this review we provide an introduction and overview of the current state of the field. We discuss the main principles that underpin bioinformatics analyses, look at the types of biological information and databases that are commonly used, and finally examine some of the studies that are being conducted, particularly with reference to transcription regulatory systems.

Bioinformatics - a definition¹

(Molecular) bio-informatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying "informatics techniques" (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

¹ As submitted to the Oxford English Dictionary

Introduction

Biological data are being produced at a phenomenal rate [1]. For example, as of August 2000, the GenBank repository of nucleic acid sequences contained 8,214,000 entries [2] and the SWISS-PROT database of protein sequences contained 88,166 [3]. On average, these databases are doubling in size every 15 months [2]. In addition, since the publication of the H. influenzae genome [4], complete sequences for over 40 organisms have been released, ranging from 450 genes to over 100,000. Add to this the data from the myriad of related projects that study gene expression, determine the protein structures encoded by the genes, and detail how these products interact with one another, and we can begin to imagine the enormous quantity and variety of information that is being produced.

As a result of this surge in data, computers have become indispensable to biological research. Such an approach is ideal because of the ease with which computers can handle large quantities of data and probe the complex dynamics observed in nature. Bioinformatics, the subject of the current review, is often defined as the application of computational techniques to understand and organise the information associated with biological macromolecules. This unexpected union between the two subjects is largely attributed to the fact that life itself is an information technology; an organism’s physiology is largely determined by its genes, which at their most basic can be viewed as digital information. At the same time, there have been major advances in the technologies that supply the initial data; Anthony Kerlavage of Celera recently stated that an experimental laboratory can produce over 100 gigabytes of data a day with ease [5]. This incredible processing power has been matched by developments in computer technology; the most important areas of improvement have been in the CPU, disk storage and Internet, allowing faster computations, better data storage and revolutionised methods for accessing and exchanging data.

Aims of bioinformatics

The aims of bioinformatics are three-fold. First, at its simplest bioinformatics organises data in a way that allows researchers to access existing information and to submit new entries as they are produced, eg the Protein Data Bank for 3D macromolecular structures [6,7]. While data-curation is an essential task, the information stored in these databases is essentially useless until analysed. Thus the purpose of bioinformatics extends much further. The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterised sequences. This needs more than just a simple text-based search, and programs such as FASTA [8] and PSI-BLAST [9] must consider what comprises a biologically significant match. Development of such resources dictates expertise in computational theory as well as a thorough understanding of biology. The third aim is to use these tools to analyse the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail, and frequently compared them with a few that are related. In bioinformatics, we can now conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlighting novel features.

In this review, we provide an introduction to bioinformatics. We focus on the first and third aims just described, with particular reference to the keywords underlined in the definition: information, informatics, organisation, understanding, large-scale and practical applications. Specifically, we discuss the range of data that are currently being examined, the databases into which they are organised, the types of analyses that are being conducted using transcription regulatory systems as an example, and finally some of the major practical applications of bioinformatics.

“…the INFORMATION associated with these molecules…”

Table 1 lists the types of data that are analysed in bioinformatics and the range of topics that we consider to fall within the field. Here we take a broad view and include subjects that may not normally be listed. We also give approximate values describing the sizes of data being discussed.

Table 1. Sources of data used in bioinformatics, the quantity of each type of data that is currently (August 2000) available, and bioinformatics subject areas that utilise this data.

Data source | Data size | Bioinformatics topics
Raw DNA sequence | 8.2 million sequences (9.5 billion bases) | Separating coding and non-coding regions; Identification of introns and exons; Gene product prediction; Forensic analysis
Protein sequence | 300,000 sequences (~300 amino acids each) | Sequence comparison; Multiple sequence alignment algorithms; Identification of conserved sequence motifs
Macromolecular structure | 13,000 structures (~1,000 atomic coordinates each) | Secondary, tertiary structure prediction; 3D structural alignment algorithms; Protein geometry measurements; Surface and volume shape calculations; Intermolecular interactions; Molecular simulations (force-field calculations, molecular movements, docking predictions)
Genomes | 40 complete genomes (1.6 million – 3 billion bases each) | Characterisation of repeats; Structural assignments to genes; Phylogenetic analysis; Genomic-scale censuses (characterisation of protein content, metabolic pathways); Linkage analysis relating specific genes to diseases
Gene expression | largest: ~20 time point measurements for ~6,000 genes | Correlating expression patterns; Mapping expression data to sequence, structural and biochemical data
Other data:
Literature | 11 million citations | Digital libraries for automated bibliographical searches; Knowledge databases of data from literature
Metabolic pathways | | Pathway simulations

We start with an overview of the sources of information: these may be divided into raw DNA sequences, protein sequences, macromolecular structures, genome sequences, and other whole genome data. Raw DNA sequences are strings of the four base-letters comprising genes, each typically 1,000 bases long. The GenBank repository of nucleic acid sequences currently holds a total of 9.5 billion bases in 8.2 million entries (all database figures as of August 2000). At the next level are protein sequences comprising strings of 20 amino acid-letters. At present there are about 300,000 known protein sequences, with a typical bacterial protein containing approximately 300 amino acids. Macromolecular structural data represents a more complex form of information. There are currently 13,000 entries in the Protein Data Bank, PDB, most of which are protein structures. A typical PDB file for a medium-sized protein contains the xyz coordinates of approximately 2,000 atoms.

Scientific euphoria has recently centred on whole genome sequencing. As with the raw DNA sequences, genomes consist of strings of base-letters, ranging from 1.6 million bases in Haemophilus influenzae to 3 billion in humans. An important aspect of complete genomes is the distinction between coding regions and non-coding regions – 'junk' repetitive sequences making up the bulk of base sequences especially in eukaryotes. We can now measure expression levels of almost every gene in a given cell on a whole-genome level, although public availability of such data is still limited. Expression level measurements are made under different environmental conditions, different stages of the cell cycle and different cell types in multi-cellular organisms. Currently the largest dataset for yeast has made approximately 20 time-point measurements for 6,000 genes [10]. Other genomic-scale data include biochemical information on metabolic pathways, regulatory networks, protein-protein interaction data from two-hybrid experiments, and systematic knockouts of individual genes to test the viability of an organism.

What is apparent from this list is the diversity in the size and complexity of different datasets. There are invariably more sequence-based data than structural data because of the relative ease with which they can be produced. This is partly related to the greater complexity and information-content of individual structures compared to individual sequences. While more biological information can be derived from a single structure than a protein sequence, the lack of depth in the latter is remedied by analysing larger quantities of data.

“… ORGANISE the information on a LARGE SCALE …”

Redundancy and multiplicity of data

A concept that underpins most research methods in bioinformatics is that much of this data can be grouped together based on biologically meaningful similarities. For example, sequence segments are often repeated at different positions of genomic DNA [11]. Genes can be clustered into those with particular functions (eg enzymatic actions) or according to the metabolic pathway to which they belong [12], although here, single genes may actually possess several functions [13]. Going further, distinct proteins frequently have comparable sequences – organisms often have multiple copies of a particular gene through duplication, while different species have equivalent or similar proteins that were inherited when they diverged from each other in evolution. At a structural level, we predict there to be a finite number of different tertiary structures – estimates range between 1,000 and 10,000 folds [14,15] – and proteins adopt equivalent structures even when they differ greatly in sequence [16]. As a result, although the number of structures in the PDB has increased exponentially, the rate of discovery of novel folds has actually decreased.

There are common terms to describe the relationship between pairs of proteins or the genes from which they are derived: analogous proteins have related folds, but unrelated sequences, while homologous proteins are both sequentially and structurally similar. The two categories can sometimes be difficult to distinguish, especially if the relationship between the two proteins is remote [17, 18]. Among homologues, it is useful to distinguish between orthologues, proteins in different species that have evolved from a common ancestral gene, and paralogues, proteins that are related by gene duplication within a genome [19]. Normally, orthologues retain the same function while paralogues evolve distinct, but related functions [20].

An important concept that arises from these observations is that of a finite “parts list” for different organisms [21,22]: an inventory of proteins contained within an organism, arranged according to different properties such as gene sequence, protein fold or function. Taking protein folds as an example, we mentioned that with a few exceptions, the tertiary structures of proteins adopt one of a limited repertoire of folds. As the number of different fold families is considerably smaller than the number of gene families, categorising the proteins by fold provides a substantial simplification of the contents of a genome. Similar simplifications can be provided by other attributes such as protein function. As such, we expect this notion of a finite parts list to become increasingly common in future genomic analyses.

Clearly, an essential aspect of managing this large volume of data lies in developing methods for assessing similarities between different biomolecules and identifying those that are related. Below, we discuss the major databases that provide access to the primary sources of information, and also introduce some secondary databases that systematically group the data (Table 2). These classifications ease comparisons between genomes and their products, allowing the identification of common themes between those that are related and highlighting features that are unique to some.


Table 2. List of URLs for the databases that are cited in the review.

Database | URL
Protein sequence (primary)
SWISS-PROT | www.expasy.ch/sprot/sprot-top.html
PIR-International | www.mips.biochem.mpg.de/proj/protseqdb
Protein sequence (composite)
OWL | www.bioinf.man.ac.uk/dbbrowser/OWL
NRDB | www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Protein
Protein sequence (secondary)
PROSITE | www.expasy.ch/prosite
PRINTS | www.bioinf.man.ac.uk/dbbrowser/PRINTS/PRINTS.html
Pfam | www.sanger.ac.uk/Pfam/
Macromolecular structures
Protein Data Bank (PDB) | www.rcsb.org/pdb
Nucleic Acids Database (NDB) | ndbserver.rutgers.edu/
HIV Protease Database | www.ncifcrf.gov/CRYS/HIVdb/NEW_DATABASE
ReLiBase | www2.ebi.ac.uk:8081/home.html
PDBsum | www.biochem.ucl.ac.uk/bsm/pdbsum
CATH | www.biochem.ucl.ac.uk/bsm/cath
SCOP | scop.mrc-lmb.cam.ac.uk/scop
FSSP | www2.embl-ebi.ac.uk/dali/fssp
Nucleotide sequences
GenBank | www.ncbi.nlm.nih.gov/Genbank
EMBL | www.ebi.ac.uk/embl
DDBJ | www.ddbj.nig.ac.jp
Genome sequences
Entrez genomes | www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
GeneCensus | bioinfo.mbb.yale.edu/genome
COGs | www.ncbi.nlm.nih.gov/COG
Integrated databases
InterPro | www.ebi.ac.uk/interpro
Sequence retrieval system (SRS) | www.expasy.ch/srs5
Entrez | www.ncbi.nlm.nih.gov/Entrez

Protein sequence databases

Protein sequence databases are categorised as primary, composite or secondary. Primary databases contain over 300,000 protein sequences and function as a repository for the raw data. Some more common repositories, such as SWISS-PROT [3] and PIR-International [23], annotate the sequences as well as describe the proteins’ functions, their domain structures and post-translational modifications. Composite databases such as OWL [24] and the NRDB [25] compile and filter sequence data from different primary databases to produce combined non-redundant sets that are more complete than the individual databases and also include protein sequence data from the translated coding regions in DNA sequence databases (see below). Secondary databases contain information derived from protein sequences and help the user determine whether a new sequence belongs to a known protein family. One of the most popular is PROSITE [26], a database of short sequence patterns and profiles that characterise biologically significant sites in proteins. PRINTS [27] expands on this concept and provides a compendium of protein fingerprints – groups of conserved motifs that characterise a protein family. Motifs are usually separated along a protein sequence, but may be contiguous in 3D-space when the protein is folded. By using multiple motifs, fingerprints can encode protein folds and functionalities more flexibly than PROSITE. Finally, Pfam [28] contains a large collection of multiple sequence alignments and profile Hidden Markov Models covering many common protein domains. Pfam-A comprises accurate manually compiled alignments while Pfam-B is an automated clustering of the whole SWISS-PROT database. These different secondary databases have recently been incorporated into a single resource named InterPro [29].

Structural databases

Next we look at databases of macromolecular structures. The Protein Data Bank, PDB [6,7], provides a primary archive of all 3D structures for macromolecules such as proteins, RNA, DNA and various complexes. Most of the ~13,000 structures (August 2000) are solved by x-ray crystallography and NMR, but some theoretical models are also included. As the information provided in individual PDB entries can be difficult to extract, PDBsum [30] provides a separate Web page for every structure in the PDB displaying detailed structural analyses, schematic diagrams and data on interactions between different molecules in a given entry. Three major databases classify proteins by structure in order to identify structural and evolutionary relationships: CATH [31], SCOP [32], and FSSP [33]. All comprise hierarchical structural taxonomies in which groups of proteins increase in similarity at lower levels of the classification tree. In addition, numerous databases focus on particular types of macromolecules. These include the Nucleic Acids Database, NDB [34], for structures related to nucleic acids, the HIV protease database [35] for HIV-1, HIV-2 and SIV protease structures and their complexes, and ReLiBase [36] for receptor-ligand complexes.
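The short sequence patterns held in secondary databases such as PROSITE can be searched for with ordinary regular expressions. The sketch below is a simplified illustration, not PROSITE's own software: it assumes the commonly documented PROSITE pattern syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, "<" and ">" for sequence ends), none of which is described in this review. The function names and the test sequence are hypothetical; the pattern is the widely quoted N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression.

    Simplified: anchors inside brackets (e.g. "[G>]") are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional repeat count.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # excluded residues
            core = "[^" + core[1:-1] + "]"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = "(?=" + prosite_to_regex(pattern) + ")"
    return [m.start() for m in re.finditer(regex, sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "MKNGSAPLNPTA"))
```

A lookahead is used so that overlapping occurrences are all reported; profiles and fingerprints need richer scoring than this pattern matching provides.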


Nucleotide and Genome sequences

As described previously, the biggest excitement currently lies with the availability of complete genome sequences for different organisms. The GenBank [2], EMBL [37] and DDBJ [38] databases contain DNA sequences for individual genes that encode protein and RNA products. Much like the composite protein sequence database, the Entrez nucleotide database [39] compiles sequence data from these primary databases.

As whole-genome sequencing is often conducted through international collaborations, individual genomes are published at different sites. The Entrez genome database [40] brings together all complete and partial genomes in a single location and currently represents over 1,000 organisms (August 2000). In addition to providing the raw nucleotide sequence, information is presented at several levels of detail including: a list of completed genomes, all chromosomes in an organism, detailed views of single chromosomes marking coding and non-coding regions, and single genes. At each level there are graphical presentations, pre-computed analyses and links to other sections of Entrez. For example, annotations for single genes include the translated protein sequence, sequence alignments with similar genes in other genomes and summaries of the experimentally characterised or predicted function. GeneCensus [41] also provides an entry point for genome analysis with an interactive whole-genome comparison from an evolutionary perspective. The database allows building of phylogenetic trees based on different criteria such as ribosomal RNA or protein fold occurrence. The site also enables multiple genome comparisons, analysis of single genomes and retrieval of information for individual genes. The COGs database [20] classifies proteins encoded in 21 completed genomes on the basis of sequence similarity. Members of the same Cluster of Orthologous Groups, COG, are expected to have the same 3D domain architecture and often, similar functions. The most straightforward application of the database is to predict the function of uncharacterised proteins through their homology to characterised proteins, and also to identify phylogenetic patterns of protein occurrence – for example, whether a given COG is represented across most or all organisms or in just a few closely related species.

Gene expression data

A most recent source of genomic-scale data has been from expression experiments, which quantify the expression levels of individual genes. These experiments measure the amount of mRNA or protein products that are produced by the cell. For the former, there are three main technologies: the cDNA microarray [42-44], Affymetrix GeneChip [45] and SAGE methods [46]. The first method measures relative levels of mRNA abundance between different samples, while the last two measure absolute levels. Most of the effort in gene expression analysis has concentrated on the yeast and human genomes and as yet, there is no central repository for this data. For yeast, the Young [10], Church [47] and Samson datasets [48] use the GeneChip method, while the Stanford cell cycle [49], diauxic shift [50] and deletion mutant datasets [51] use the microarray. Most measure mRNA levels throughout the whole yeast cell cycle, although some focus on a particular stage in the cycle. For humans, the main application has been to understand expression in tumour and cancer cells. The Molecular Portraits of Breast Tumours [52], Lymphoma and Leukaemia Molecular Profiling [53] projects provide data from microarray experiments on human cancer cells.

The technologies for measuring protein abundance are currently limited to 2D gel electrophoresis followed by mass spectrometry [54]. As gels can only routinely resolve about 1,000 proteins [55], only the most abundant can be visualised. At present, data from these experiments are only available from the literature [56,57].

Data integration

The most profitable research in bioinformatics often results from integrating multiple sources of data [58]. For instance, the 3D coordinates of a protein are more useful if combined with data about the protein’s function, occurrence in different genomes, and interactions with other molecules. In this way, individual pieces of information are put in context with respect to other data. Unfortunately, it is not always straightforward to access and cross-reference these sources of information because of differences in nomenclature and file formats.

At a basic level, this problem is frequently addressed by providing external links to other databases; for example in PDBsum, web-pages for individual structures direct the user towards corresponding entries in the PDB, NDB, CATH, SCOP and SWISS-PROT. At a more advanced level, there have been efforts to integrate access across several data sources. One is the Sequence Retrieval System, SRS [59], which allows flat-file databases to be indexed to each other; this allows the user to retrieve, link and access entries from nucleic acid, protein sequence, protein motif, protein structure and bibliographic databases. Another is the Entrez facility [39], which provides similar gateways to DNA and protein sequences, genome mapping data, 3D macromolecular structures and the PubMed bibliographic database [60]. A search for a particular gene in either database will allow smooth transitions to the genome it comes from, the protein sequence it encodes, its structure, bibliographic reference and equivalent entries for all related genes.

“…UNDERSTAND and organise the information…”

Having examined the data, we can discuss the types of analyses that are conducted. As shown in Table 1, the broad subject areas in bioinformatics can be separated according to the sources of information that are used in the studies. For raw DNA sequences, investigations involve separating coding and non-coding regions, and identification of introns, exons and promoter regions for annotating genomic DNA [61,62]. For protein sequences, analyses include developing algorithms for sequence comparisons [63], methods for producing multiple sequence alignments [64], and searching for functional domains from conserved sequence motifs in such alignments. Investigations of structural data include prediction of secondary and tertiary protein structures, producing methods for 3D structural alignments [65,66], examining protein geometries using distance and angular measurements, calculations of surface and volume shapes and analysis of protein interactions with other subunits, DNA, RNA and smaller molecules. These studies have led to molecular simulation topics in which structural data are used to calculate the energetics involved in stabilising macromolecular structures, simulating movements within macromolecules, and computing the energies involved in molecular docking. The increasing availability of annotated genomic sequences has resulted in the introduction of computational genomics and proteomics – large-scale analyses of complete genomes and the proteins that they encode. Research includes characterisation of protein content and metabolic pathways between different genomes, identification of interacting proteins, assignment and prediction of gene products, and large-scale analyses of gene expression levels. Some of these research topics will be demonstrated in our example analysis of transcription regulatory systems.

Other subject areas we have included in Table 1 are development of digital libraries for automated bibliographical searches, knowledge bases of biological information from the literature, DNA analysis methods in forensics, prediction of nucleic acid structures, metabolic pathway simulations, and linkage analysis – linking specific genes to different disease traits.

In addition to finding relationships between different proteins, much of bioinformatics involves the analysis of one type of data to infer and understand the observations for another type of data. An example is the use of sequence and structural data to predict the secondary and tertiary structures of new protein sequences [67]. These methods, especially the former, are often based on statistical rules derived from structures, such as the propensity for certain amino acid sequences to produce different secondary structural elements. Another example is the use of structural data to understand a protein’s function; here studies have investigated the relationship between different protein folds and their functions [68,69] and analysed similarities between different binding sites in the absence of homology [70]. Combined with similarity measurements, these studies provide us with an understanding of how much biological information can be accurately transferred between homologous proteins [71].

The bioinformatics spectrum

Figure 1 summarises the main points we raised in our discussions of organising and understanding biological data – the development of bioinformatics techniques has allowed an expansion of biological analysis in two dimensions, depth and breadth. The first is represented by the vertical axis in the figure and outlines a possible approach to rational drug design. The aim is to take a single gene and follow through an analysis that maximises our understanding of the protein it encodes. Starting with a gene sequence, we can determine the protein sequence with strong certainty. From there, prediction algorithms can be used to calculate the structure adopted by the protein. Geometry calculations can define the shape of the protein’s surface and molecular simulations can determine the force fields surrounding the molecule. Finally, using docking algorithms, one could identify or design ligands that may bind the protein, paving the way for designing a drug that specifically alters the protein’s function. In practice, the intermediate steps are still difficult to achieve accurately, and they are best combined with experimental methods to obtain some of the data, for example characterising the structure of the protein of interest.

The aim of the second dimension, the breadth in biological analysis, is to compare a gene with others. Initially, simple algorithms can be used to compare the sequences and structures of a pair of related proteins. With a larger number of proteins, improved algorithms can be used to produce multiple alignments, and extract sequence patterns or structural templates that define a family of proteins. Using this data, it is also possible to construct phylogenetic trees to trace the evolutionary path of proteins. Finally, with even more data, the information must be stored in large-scale databases. Comparisons become more complex, requiring multiple scoring schemes, and we are able to conduct genomic scale censuses that provide comprehensive statistical accounts of protein features, such as the abundance of particular structures or functions in different genomes. It also allows us to build phylogenetic trees that trace the evolution of whole organisms.
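The simplest comparison on the breadth axis, between a pair of sequences, can be illustrated with a global alignment score computed by dynamic programming (the Needleman-Wunsch approach). This is a minimal sketch rather than the method used by any program cited in this review: the match, mismatch and gap values are arbitrary assumptions, whereas practical tools score residue pairs with substitution matrices and use more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    Keeps only one previous row of the dynamic programming table,
    so memory use is proportional to len(b).
    """
    # Row 0: aligning an empty prefix of a against prefixes of b (all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # prefix of a against empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            ))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("AC", "AGC"))
```

Scoring many pairs this way is what becomes expensive as databases grow, which is one reason the multiple scoring schemes and heuristics mentioned above are needed.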


Fig. 1. Paradigm shifts during the past couple of decades have taken much of biology away from the laboratory bench and have allowed the integration of other scientific disciplines, specifically computing. The result is an expansion of biological research in breadth and depth. The vertical axis demonstrates how bioinformatics can aid rational drug design with minimal work in the wet lab. Starting with a single gene sequence, we can determine, with strong certainty, the protein sequence. From there, we can determine the structure using structure prediction techniques. With geometry calculations, we can further resolve the protein’s surface and through molecular simulation determine the force fields surrounding the molecule. Finally, docking algorithms can provide predictions of the ligands that will bind on the protein surface, thus paving the way for the design of a drug specific to that molecule. The horizontal axis shows how the influx of biological data and advances in computer technology have broadened the scope of biology. Initially, with a pair of proteins, we can make comparisons between the sequences and structures of evolutionarily related proteins. With more data, algorithms for multiple alignments of several proteins become necessary. Using multiple sequences, we can also create phylogenetic trees to trace the evolutionary development of the proteins in question. Finally, with the deluge of data we currently face, we need to construct large databases to store, view and deconstruct the information. Alignments now become more complex, requiring sophisticated scoring schemes, and there is enough data to compile a genome census – a genomic equivalent of a population census – providing comprehensive statistical accounts of protein features in genomes.
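The first step on the vertical axis, going from a gene sequence to a protein sequence, is the most mechanical and can be sketched in a few lines. The example assumes the standard genetic code and a coding DNA sequence that starts in frame; both are assumptions on our part, as are the function and variable names. It does not handle introns, alternative genetic codes or ambiguous bases.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame coding DNA sequence up to the first stop codon.

    A trailing incomplete codon is ignored; bases other than A, C, G, T
    raise KeyError.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))
```

The later steps on the axis (structure prediction, simulation, docking) have no comparably simple rule, which is why the text describes them as difficult to achieve accurately.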


“… applying INFORMATICS TECHNIQUES…”

The distinct subject areas we mention require different types of informatics techniques. Briefly, for data organisation, the first biological databases were simple flat files. However, with the increasing amount of information, relational database methods with Web-page interfaces have become increasingly popular. In sequence analysis, techniques include string comparison methods such as text search and 1-dimensional alignment algorithms. Motif and pattern identification for multiple sequences depends on machine learning, clustering and data-mining techniques. 3D structural analysis techniques include Euclidean geometry calculations combined with basic application of physical chemistry, graphical representations of surfaces and volumes, and structural comparison and 3D matching methods. For molecular simulations, Newtonian mechanics, quantum mechanics, molecular mechanics and electrostatic calculations are applied. In many of these areas, the computational methods must be combined with good statistical analyses in order to provide an objective measure for the significance of the results.

Transcription regulation – a case study in bioinformatics

DNA-binding proteins have a central role in all aspects of genetic activity within an organism, participating in processes such as transcription, packaging, rearrangement, replication and repair. In this section, we focus on the studies that have contributed to our understanding of transcription regulation in different organisms. Through this example, we demonstrate how bioinformatics has been used to increase our knowledge of biological systems and also illustrate the practical applications of the different subject areas that were briefly outlined earlier.

We start by considering structural analyses of how DNA-binding proteins recognise particular base sequences. Later, we review several genomic studies that have characterised the nature of transcription factors in different organisms, and the methods that have been used to identify regulatory binding sites in the upstream regions. Finally, we provide an overview of gene expression analyses that have been recently conducted and suggest future uses of transcription regulatory analyses to rationalise the observations made in gene expression experiments. All the results that we describe have been found through computational studies.

Structural studies

As of August 2000, there were 379 structures of protein-DNA complexes in the PDB. Analyses of these structures have provided valuable insight into the stereochemical principles of binding, including how particular base sequences are recognised and how the DNA structure is quite often modified on binding.

A structural taxonomy of DNA-binding proteins, similar to that presented in SCOP and CATH, was first proposed by Harrison [72] and periodically updated to accommodate new structures as they are solved [73]. The classification consists of a two-tier system: the first level collects proteins into eight groups that share gross structural features for DNA-binding, and the second comprises 54 families of proteins that are structurally homologous to each other. Assembly of such a system simplifies the comparison of different binding methods; it highlights the diversity of protein-DNA complex geometries found in nature, but also underlines the importance of interactions between α-helices and the DNA major groove, the main mode of binding in over half the protein families. While the number of structures represented in the PDB does not necessarily reflect the relative importance of the different proteins in the cell, it is clear that helix-turn-helix, zinc-coordinating and leucine zipper motifs are used repeatedly. This provides compact frameworks that present the α-helix on the surfaces of structurally diverse proteins. At a gross level, it is possible to highlight the differences between transcription factor domains that “just” bind DNA and those involved in catalysis [74]. Although there are exceptions, the former typically approach the DNA from a single face and slot into the grooves to interact with base edges. The latter commonly envelop the substrate, using complex networks of secondary structures and loops.

Focusing on proteins with α-helices, the structures show many variations, both in amino acid sequences and detailed geometry. They have clearly evolved independently in accordance with the requirements of the context in which they are found. While achieving a close fit between the α-helix and major groove, there is enough flexibility to allow both the protein and DNA to adopt distinct conformations. However, several studies that analysed the binding geometries of α-helices demonstrated that most adopt fairly uniform conformations regardless of protein family. They are commonly inserted in the major groove sideways, with their lengthwise axis roughly parallel to the slope outlined by the DNA backbone. Most start with the N-terminus in the groove and extend out, completing two to three turns within contacting distance of the nucleic acid [75,76].

Given the similar binding orientations, it is surprising to find that the interactions between each amino acid position along the α-helices and nucleotides on the DNA vary considerably between different protein families. However, by classifying the amino acids according to the sizes of their side chains, we are able to rationalise the different interaction patterns. The rules of interactions are based on the simple premise that for a given residue position on α-helices in similar conformations, small amino acids interact with nucleotides that are close in distance and large amino acids with those that are further [76,77]. Equivalent studies for binding by other structural motifs, like β-hairpins, have also been conducted [78]. When considering these interactions, it is important to remember that different regions of the protein surface also provide interfaces with the DNA.

This brings us to look at the atomic level interactions between individual amino acid-base pairs. Such analyses are based on the premise that a significant proportion of specific DNA-binding could be rationalised by a universal code of recognition between amino acids and bases, ie whether certain protein residues preferably interact with particular nucleotides regardless of the type of protein-DNA complex [79]. Studies have considered hydrogen bonds, van der Waals contacts and water-mediated bonds [80-82]. Results showed that about 2/3 of all interactions are with the DNA backbone and that their main role is one of sequence-independent stabilisation. In contrast, interactions with bases display some strong preferences, including the interactions of arginine or lysine with guanine, asparagine or glutamine with adenine and threonine with thymine. Such preferences were explained through examination of the stereochemistry of the amino acid side chains and base edges. Also highlighted were more complex types of interactions where single amino acids contact more than one base-step simultaneously, thus recognising a short DNA sequence. These results suggested that universal specificity, one that is observed across all protein-DNA complexes, indeed exists. However, many interactions that are normally considered to be non-specific, such as those with the DNA backbone, can also provide specificity depending on the context in which they are made.

Armed with an understanding of protein structure, DNA-binding motifs and side chain stereochemistry, a major application has been the prediction of binding either by proteins known to contain a particular motif, or those with structures solved in the uncomplexed form. Most common are predictions for α-helix-major groove interactions – given the amino acid sequence, what DNA sequence would it recognise [77,83]. In a different approach, molecular simulation techniques have been used to dock whole proteins and DNAs on the basis of force-field calculations around the two molecules [84,85].

The reason that both methods have only been met with limited success is that even for apparently simple cases like α-helix-binding, there are many other factors that must be considered. Comparisons between bound and unbound nucleic acid structures show that DNA-bending is a common feature of complexes formed with transcription factors [74,86]. This and other factors such as electrostatic and cation-mediated interactions assist indirect recognition of the nucleotide sequence, although they are not well understood yet. Therefore, it is now clear that detailed rules for specific DNA-binding will be family specific, but with underlying trends such as the arginine-guanine interactions.

Genomic studies

Due to the wealth of biochemical data that are available, genomic studies in bioinformatics have concentrated on model organisms, and the analysis of regulatory systems has been no exception. Identification of transcription factors in genomes invariably depends on similarity search strategies, which assume a functional and evolutionary relationship between homologous proteins. In E. coli, studies have so far estimated a total of 300 to 500 transcription regulators [87], and PEDANT [88], a database of automatically assigned gene functions, shows that typically 2-3% of prokaryotic and 6-7% of eukaryotic genomes comprise DNA-binding proteins. As assignments were only complete for 40-60% of genomes as of August 2000, these figures most likely underestimate the actual number. Nonetheless, they already represent a large quantity of proteins and it is clear that there are more transcription regulators in eukaryotes than other species. This is unsurprising, considering the organisms have developed a relatively sophisticated transcription mechanism.

From the conclusions of the structural studies, the best strategy for characterising DNA-binding of the putative transcription factors in each genome is to group them by homology and analyse the individual families. Such classifications are provided in the secondary sequence databases described earlier and also those that specialise in regulatory proteins such as RegulonDB [89] and TRANSFAC [90]. Of even greater use is the provision of structural assignments to the proteins; given a transcription factor, it is helpful to know the structural motif that it uses for binding, therefore providing us with a better understanding of how it recognises the target sequence. Structural genomics through bioinformatics assigns structures to the protein products of genomes by demonstrating similarity to proteins of known structure [91]. These studies have shown that prokaryotic transcription factors most frequently contain helix-turn-helix motifs [87,92] and eukaryotic factors contain homeodomain type helix-turn-helix, zinc finger or leucine zipper motifs. From the protein classifications in each genome, it is clear that different types of regulatory proteins differ in abundance and families significantly differ in size. A study by Huynen and van Nimwegen [93] has shown that members of a single family have similar functions, but as the requirements of this function vary over time, so does the presence of each gene family in the genome.

Most recently, using a combination of sequence and structural data, we examined the conservation of amino acid sequences between related DNA-binding proteins, and the effect that mutations have on DNA sequence recognition. The structural families described above were expanded to include proteins that are related by sequence similarity, but whose structures remain unsolved. Again, members of the same family are homologous, and probably derive from a common ancestor.

Amino acid conservations were calculated for the multiple sequence alignments of each family [94]. Generally, alignment positions that interact with the DNA are better conserved than the rest of the protein surface, although the detailed patterns of conservation are quite complex. Residues that contact the DNA backbone are highly conserved in all protein families, providing a set of stabilising interactions that are common to all homologous proteins. The conservation of alignment positions that contact bases, and recognise the DNA sequence, is more complex and could be rationalised by defining a 3-class model for DNA-binding. First, protein families that bind non-specifically usually contain several conserved base-contacting residues; without exception, interactions are made in the minor groove where there is little discrimination between base types. The contacts are commonly used to stabilise deformations in the nucleic acid structure, particularly in widening the DNA minor groove. The second class comprises families whose members all target the same nucleotide sequence; here, base-contacting positions are absolutely or highly conserved, allowing related proteins to target the same sequence.

The third, and most interesting, class comprises families in which binding is also specific but different members bind distinct base sequences. Here protein residues undergo frequent mutations, and family members can be divided into subfamilies according to the amino acid sequences at base-contacting positions; those in the same subfamily are predicted to bind the same DNA sequence and those of different subfamilies to bind distinct sequences. On the whole, the subfamilies corresponded well with the proteins' functions and members of the same subfamilies were found to regulate similar transcription pathways. The combined analysis of sequence and structural data described by this study provided an insight into how homologous DNA-binding scaffolds achieve different specificities by altering their amino acid sequences. In doing so, proteins evolved distinct functions, therefore allowing structurally related transcription factors to regulate expression of different genes. Therefore, the relative abundance of transcription regulatory families in a genome depends not only on the importance of a particular protein function, but also on the adaptability of the DNA-binding motifs to recognise distinct nucleotide sequences. This, in turn, appears to be best accommodated by simple binding motifs, such as the zinc fingers.

Given the knowledge of the transcription regulators that are contained in each organism, and an understanding of how they recognise DNA sequences, it is of interest to search for their potential binding sites within genome sequences [95]. For prokaryotes, most analyses have involved compiling data on experimentally known binding sites for particular proteins and building a consensus sequence that incorporates any variations in nucleotides. Additional sites are found by conducting word-matching searches over the entire genome and scoring candidate sites by similarity [96-99]. Unsurprisingly, most of the predicted sites are found in non-coding regions of the DNA [96] and the results of the studies are often presented in databases such as RegulonDB [89]. The consensus search approach is often complemented by comparative genomic studies searching upstream regions of orthologous genes in closely related organisms. Through such an approach, it was found that at least 27% of known E. coli DNA-regulatory motifs are conserved in one or more distantly related bacteria [100].

The detection of regulatory sites in eukaryotes poses a more difficult problem because consensus sequences tend to be much shorter, variable, and dispersed over very large distances. However, initial studies in S. cerevisiae provided an interesting observation for the GATA protein in nitrogen metabolism regulation. While the 5 base-pair GATA consensus sequence is found almost everywhere in the genome, a single isolated binding site is insufficient to exert the regulatory function [101]. Therefore specificity of GATA activity comes from the repetition of the consensus sequence within the upstream regions of controlled genes in multiple copies. An initial study has used this observation to predict new regulatory sites by searching for over-represented oligonucleotides in non-coding regions of yeast and worm genomes [102,103].

Having detected the regulatory binding sites, there is the problem of defining the genes that are actually regulated, commonly termed regulons. Generally, binding sites are assumed to be located directly upstream of the regulons; however, there are different problems associated with this assumption depending on the organism. For prokaryotes, it is complicated by the presence of operons; it is difficult to locate the regulated gene within an operon since it can lie several genes downstream of the regulatory sequence. It is often difficult to predict the organisation of operons [104], especially to define the gene that is found at the head, and there is often a lack of long-range conservation in gene order between related organisms [105]. The problem in eukaryotes is even more severe; regulatory sites often act in both directions, binding sites are usually distant from regulons because of large intergenic regions, and transcription regulation is usually a result of combined action by multiple transcription factors in a combinatorial manner.

Despite these problems, these studies have succeeded in confirming the transcription regulatory pathways of well-characterised systems such as the heat shock response system [99]. In addition, it is feasible to experimentally verify any predictions, most notably using gene expression data.

Gene expression studies

Many expression studies have so far focused on devising methods to cluster genes by similarities in expression profiles. This is in order to determine the proteins that are expressed together under different cellular conditions. Briefly, the most common methods are hierarchical clustering, self-organising maps, and K-means clustering. Hierarchical methods originally derived from algorithms to construct phylogenetic trees, and group genes in a “bottom-up” fashion; genes with the most similar expression profiles are clustered first, and those with more diverse profiles are included iteratively [106-108]. In contrast, the self-organising map [109,110] and K-means methods [111] employ a “top-down” approach in which the user pre-defines the number of clusters for the dataset. The clusters are initially assigned randomly, and the genes are regrouped iteratively until they are optimally clustered.

Given these methods, it is of interest to relate the expression data to other attributes such as structure, function and subcellular localisation of each gene product. Mapping these properties provides an insight into the characteristics of proteins that are expressed together, and also suggests some interesting conclusions about the overall biochemistry of the cell. In yeast, shorter proteins tend to be more highly expressed than longer proteins, probably because of the relative ease with which they are produced [112]. Looking at the amino acid content, highly expressed genes are generally enriched in alanine and glycine, and depleted in asparagine; these are thought to reflect the requirements of amino acid usage in the organism, where synthesis of alanine and glycine is energetically less expensive than that of asparagine. Turning to protein structure, expression levels of the TIM barrel and NTP hydrolase folds are highest, while those for the leucine zipper, zinc finger and transmembrane helix-containing folds are lowest. This relates to the functions associated with these folds; the former are commonly involved in metabolic pathways and the latter in signalling or transport processes [113]. This is also reflected in the relationship with subcellular localisations of proteins, where expression of cytoplasmic proteins is high, but that of nuclear and membrane proteins tends to be low [114,115].

More complex relationships have also been assessed. Conventional wisdom is that gene products that interact with each other are more likely to have similar expression profiles than if they do not [116,117]. However, a recent study showed that this relationship is not so simple [118]. While expression profiles are similar for gene products that are permanently associated, for example in the large ribosomal subunit, profiles differ significantly for products that are only associated transiently, including those belonging to the same metabolic pathway.

As described below, one of the main driving forces behind expression analysis has been to analyse cancerous cell lines [119]. In general, it has been shown that different cell lines (eg epithelial and ovarian cells) can be distinguished on the basis of their expression profiles, and that these profiles are maintained when cells are transferred from an in vivo to an in vitro environment [120]. The basis for their physiological differences was apparent in the expression of specific genes; for example, expression levels of gene products necessary for progression through the cell cycle, especially ribosomal genes, correlated well with variations in cell proliferation rate. Comparative analysis can be extended to tumour cells, in which the underlying causes of cancer can be uncovered by pinpointing areas of biological variations compared to normal cells. For example in breast cancer, genes related to cell proliferation and the IFN-regulated signal transduction pathway were found to be upregulated [52,121]. One of the difficulties in cancer treatment has been to target specific therapies to pathogenetically distinct tumour types, in order to maximise efficacy and minimise toxicity. Thus, improvements in cancer classifications have been central to advances in cancer treatment. Although the distinction between different forms of cancer – for example subclasses of acute leukaemia – has been well established, it is still not possible to establish a clinical diagnosis on the basis of a single test. In a recent study, acute myeloid leukaemia and acute lymphoblastic leukaemia were successfully distinguished based on the expression profiles of these cells [53]. As the approach does not require prior biological knowledge of the diseases, it may provide a generic strategy for classifying all types of cancer.

Clearly, an essential aspect of understanding expression data lies in understanding the basis of transcription regulation. However, analysis in this area is still limited to preliminary analyses of expression levels in yeast mutants lacking key components of the transcription initiation complex [10,122].

“… many PRACTICAL APPLICATIONS…”

Here, we describe some of the major uses of bioinformatics.

Finding Homologues

As described earlier, one of the driving forces behind bioinformatics is the search for similarities between different biomolecules. Apart from enabling systematic organisation of data, identification of protein homologues has some direct practical uses. The most obvious is transferring information between related proteins. For example, given a poorly characterised protein, it is possible to search for homologues that are better understood and, with caution, apply some of the knowledge of the latter to the former. Specifically with structural data, theoretical models of proteins are usually based on experimentally solved structures of close homologues [123]. Similar techniques are used in fold recognition, in which tertiary structure predictions depend on finding structures of remote homologues and checking whether the prediction is energetically viable [124]. Where biochemical or structural data are lacking, studies could be made in low-level organisms like yeast and the results applied to homologues in higher-level organisms such as humans, where experiments are more demanding.

An equivalent approach is also employed in genomics. Homologue-finding is extensively used to confirm coding regions in newly sequenced genomes and functional data is frequently transferred to annotate individual genes. On a larger scale, it also simplifies the problem of understanding complex genomes by analysing simple organisms first and then applying the same principles to more complicated ones – this is one reason why early structural genomics projects focused on Mycoplasma genitalium [91].

Ironically, the same idea can be applied in reverse. Potential drug targets are quickly discovered by checking whether homologues of essential microbial proteins are missing in humans. On a smaller scale, structural differences between similar proteins may be harnessed to design drug molecules that specifically bind to one structure but not another.

Rational Drug Design

One of the earliest medical applications of bioinformatics has been in aiding rational drug design. Figure 2 outlines the commonly cited approach, taking the MLH1 gene product as an example drug target. MLH1 is a human gene encoding a mismatch repair protein (mmr) situated on the short arm of chromosome 3 [125]. Through linkage analysis and its similarity to mmr genes in mice, the gene has been implicated in nonpolyposis colorectal cancer [126]. Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can then be used to find homologues in model organisms, and based on sequence similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could design molecules that could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.

Large-scale censuses

Although databases can efficiently store all the information related to genomes, structures and expression datasets, it is useful to condense all this information into trends and facts that users can readily understand. Broad generalisations help identify interesting subject areas for further detailed analysis, and place new observations in a proper context. This enables one to see whether they are unusual in any way.

Through these large-scale censuses, one can address a number of evolutionary, biochemical and biophysical questions. For example, are specific protein folds associated with certain phylogenetic groups? How common are different folds within particular organisms? And to what degree are folds shared between related organisms? Does this extent of sharing parallel measures of relatedness derived from traditional evolutionary trees? Initial studies show that the frequency of folds differs greatly between organisms and that the sharing of folds between organisms does in fact follow traditional phylogenetic classifications [21,41]. We can also integrate data on protein functions; given that the particular protein folds are often related to specific biochemical functions [68,69], these findings highlight the diversity of metabolic pathways in different organisms [20,105].
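The census questions above (how common is each fold within an organism, and how many folds do two organisms share) reduce to simple counting once fold assignments are available. The Python sketch below is illustrative only: the function names, the organism labels and the fold assignments are hypothetical, and the Jaccard index is used here as one plausible measure of sharing rather than the measure used in the cited surveys.

```python
from collections import Counter
from itertools import combinations


def fold_frequencies(fold_assignments):
    """Count how often each fold occurs among one organism's proteins."""
    return Counter(fold_assignments)


def fold_sharing(genomes):
    """Return the Jaccard index of fold sets for every pair of organisms.

    `genomes` maps an organism name to the list of folds assigned to its
    proteins (one entry per protein).
    """
    sharing = {}
    for org_a, org_b in combinations(sorted(genomes), 2):
        folds_a, folds_b = set(genomes[org_a]), set(genomes[org_b])
        union = folds_a | folds_b
        shared = folds_a & folds_b
        sharing[(org_a, org_b)] = len(shared) / len(union) if union else 0.0
    return sharing


if __name__ == "__main__":
    # Hypothetical fold assignments, invented for this example.
    genomes = {
        "org_A": ["TIM barrel", "zinc finger", "TIM barrel"],
        "org_B": ["TIM barrel", "leucine zipper"],
        "org_C": ["zinc finger", "leucine zipper", "TIM barrel"],
    }
    print(fold_frequencies(genomes["org_A"]).most_common(1))
    for pair, value in fold_sharing(genomes).items():
        print(pair, round(value, 2))
```

The resulting pairwise values could be compared with distances from a conventional phylogenetic tree to ask whether fold sharing parallels traditional measures of relatedness.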

Fig. 2. Above is a schematic outlining how scientists can use bioinformatics to aid rational drug discovery. MLH1 is a human gene encoding a mismatch repair protein (mmr) situated on the short arm of chromosome 3. Through linkage analysis and its similarity to mmr genes in mice, the gene has been implicated in nonpolyposis colorectal cancer. Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can be used to find homologues in model organisms, and based on sequence similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could design molecules that could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.
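The first step in this scheme, deriving the probable amino acid sequence from a nucleotide sequence, can be sketched in Python using the standard genetic code. This is a minimal illustration, assuming a coding sequence that starts in frame; the function name is hypothetical, the example sequence is a toy string and not the MLH1 sequence, and real translation software also handles alternative reading frames, ambiguous bases and non-standard codes.

```python
# Standard genetic code, with codons ordered by base in the sequence T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate an in-frame DNA coding sequence, stopping at a stop codon.

    Trailing bases that do not form a complete codon are ignored, and
    unrecognised codons (e.g. containing N) are written as 'X'.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        residue = CODON_TABLE.get(dna[i:i + 3], "X")
        if residue == "*":      # stop codon ends the protein
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    # Toy sequence for illustration only.
    print(translate("ATGGCCAAATGA"))
```

The resulting protein string is what would then be passed to a sequence search tool to look for homologues, as the caption describes.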

As we discussed earlier, one of the localisations of proteins and their inter- usually involves compiling expression most exciting new sources of genomic actions with each other [127-129]. In data for cells affected by different information is the expression data. conjunction with structural data, we can diseases [131], eg cancer [53,132, Combining expression information with then begin to compile a map of all protein- 133] and ateriosclerosis [134], and structural and functional classifications protein interactions in an organism. comparing the measurements against of proteins we can ask whether the normal expression levels. Identifi- high occurrence of a protein fold in a Further applications in medical cation of genes that are expressed genome is indicative of high expression sciences differently in affected cells provides levels [112]. Further genomic scale data Most recent applications in the a basis for explaining the causes of that we can consider in large-scale medical sciences have centred on illnesses and highlights potential drug surveys include the subcellular gene expression analysis [130]. This targets. Using the process described

Yearbook of Medical Informatics 2001 95 Review Paper in Figure 2, one would design conducted – with reference to trans- The Protein Data Bank. A computer-based compounds that bind the expressed cription regulatory systems – and finally archival file for macromolecular structures. Eur J Biochem 1977;80(2):319-24. protein, or perhaps more importantly, looked at several practical applications 7. Berman HM, Westbrook J, Feng Z, Gilliland the transcription regulator has caused of the field. G, Bhat TN, Weissig H, et al. The Protein the change in expression levels. Given Data Bank. Nucleic Acids Res a lead compound, microarray experi- Two principal approaches underpin 2000;28(1):235-42. ments can then be used to evaluate all studies in bioinformatics. First is 8. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc responses to pharmacological inter- that of comparing and grouping the Natl Acad Sci U S A 1988;85(8):2444-2448. vention, [135,136] and also provide data according to biologically meaning- 9. Altschul SF, Madden TL, Schaffer AA, early tests to detect or predict the ful similarities and second, that of Zhang J, Zhang Z, Miller W, et al. Gapped toxicity of trial drugs. analysing one type of data to infer and BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Further advances in bioinformatics understand the observations for another Acids Res. 1997;25(17):3389-3402. combined with experimental genomics type of data. These approaches are 10. Holstege FC JE, Wyrick JJ, Lee TI, for individuals are predicted to reflected in the main aims of the field, Hengartner CJ, Green MR, Golub TR, revolutionalise the future of healthcare. which are to understand and organise Lander ES, Young RA. Dissecting the A typical scenario for a patient may the information associated with biolo- regulatory circuitry of a eukaryotic genome. Cell 1998;95(5):717-728. start with post-natal genotyping to gical molecules on a large scale. 
As a 11. Pedersendagger AG, Jensendagger LJ, assess susceptibility or immunity from result, bioinformatics has not only Brunak S, Staerfeldt HH, Ussery DW. A specific diseases and pathogens. With provided greater depth to biological DNA structural atlas for Escherichia coli. J this information, a unique combination investigations, but added the dimension Mol Biol 2000;299(4):907-930. 12. Kanehisa M, Goto S. KEGG: kyoto of vaccines could be prescribed, mini- of breadth as well. In this way, we are encyclopedia of genes and genomes. Nucleic mising the healthcare costs of unneces- able to examine individual systems in Acids Res 2000;28(1):27-30. sary treatments and anticipating the detail and also compare them with 13. Jeffery CJ. Moonlighting proteins. TIBS onslaught of diseases later in life. those that are related in order to 1999;24(1):8-11. Regular lifetime screenings could lead uncover common principles that apply 14. Chothia C. Proteins. One thousand families for the molecular biologist [news]. Nature to guidance for nutrition intake and across many systems and highlight 1992;357(6379):543-4. early detections of any illnesses [137]. unusual features that are unique to 15. Orengo CA, Jones DT, Thornton JM. In addition, drug-based treatments some. Protein superfamilies and domain could be tailored specifically to the superfolds. Nature 1994;372(6507):631-4. Acknowledgements 16. Lesk AM, Chothia C. How different amino patient and disease, thus providing the acid sequences determine similar protein most effective course of medication We thank Patrick McGarvey for comments structures: the structure and evolutionary with minimal side-effects [138]. Given on the manuscript. dynamics of the globins. J Mol Biol the present rate of development, such 1980;136(3):225-70. a scenario in healthcare appears to be 17. Russell RB, Saqi MA, Sayle RA, Bates PA, References Sternberg MJ. Recognition of analogous possible in the not too distant future. 
and homologous protein folds: analysis of 1. Reichhardt T. It’s sink or swim as a tidal sequence and structure conservation. J Mol Conclusions wave of data approaches. Nature Biol 1997;269(3):423-39. With the current deluge of data, 1999;399(6736):517-20. 18. Russell RB, Saqi MA, Bates PA, Sayle RA, computational methods have become 2. Benson DA, Karsch-Mizrachi I, Lipman Sternberg MJ. Recognition of analogous DJ, Ostell J, Rapp BA, Wheeler DL. and homologous protein folds—assessment indispensable to biological investiga- GenBank. Nucleic Acids Res 2000;28 of prediction success and associated tions. Originally developed for the (1):15-8. alignment accuracy using empirical analysis of biological sequences, bioin- 3. Bairoch A, Apweiler R. The SWISS-PROT substitution matrices. Protein Eng formatics now encompasses a wide protein sequence database and its 1998;11(1):1-9. 19. Fitch WM. Distinguishing homologous range of subject areas including struc- supplement TrEMBL in 2000. Nucleic Acids Res 2000;28(1):45-8. from analogous proteins. Syst Zool tural biology, genomics and gene ex- 4. Fleischmann RD, Adams MD, White O, 1970;19:99-110. pression studies. In this review, we Clayton RA, Kirkness EF, Kerlavage AR, 20. Tatusov RL, Koonin EV, Lipman DJ. A provided an introduction and overview et al. Whole-genome random sequencing genomic perspective on protein families. Science 1997;278(5338):631-7. of the current state of field. In and assembly of Haemophilus influenzae Rd. Science 1995;269 (5223):496-512. 21. Gerstein M, Hegyi H. Comparing genomes particular, we discussed the types of 5. Drowning in data. The Economist 26 June in terms of protein structure: surveys of a biological information and databases 1999. finite parts list. FEMS Microbiol Rev that are commonly used, examined 6. Bernstein FC, Koetzle TF, Williams GJ, 1998;22(4):277-304. some of the studies that are being Meyer EF, Jr., Brice MD, Rodgers JR, et al. 22. Skolnick J, Fetrow JS. From genes to protein

96 Yearbook of Medical Informatics 2001 Review Paper

structure and function: novel applications of computational approaches in the genomic era. TIBtech 2000;18:34-39.
23. McGarvey PB, Huang H, Barker WC, Orcutt BC, Garavelli JS, Srinivasarao GY, et al. PIR: a new resource for bioinformatics. Bioinformatics 2000;16(3):290-291.
24. Bleasby AJ, Akrigg D, Attwood TK. OWL - a non-redundant composite protein sequence database. Nucleic Acids Res 1994;22(17):3574-3577.
25. Bleasby AJ, Wootton JC. Construction of validated, non-redundant composite protein sequence databases. Protein Eng 1990;3(3):153-159.
26. Hofmann K, Bucher P, Falquet L, Bairoch A. The PROSITE database, its status in 1999. Nucleic Acids Res 1999;27(1):215-219.
27. Attwood TK, Croning MD, Flower DR, Lewis AP, Mabey JE, Scordis P, et al. PRINTS-S: the database formerly known as PRINTS. Nucleic Acids Res 2000;28(1):225-227.
28. Bateman A, Birney E, Durbin R, Eddy SR, Howe KL, Sonnhammer EL. The Pfam protein families database. Nucleic Acids Res 2000;28(1):263-266.
29. Attwood TK, Flower DR, Lewis AP, Mabey JE, Morgan SR, Scordis P, et al. PRINTS prepares for the new millennium. Nucleic Acids Res 1999;27(1):220-225.
30. Laskowski RA, Hutchinson EG, Michie AD, Wallace AC, Jones ML, Thornton JM. PDBsum: a Web-based database of summaries and analyses of all PDB structures. TIBS 1997;22(12):488-490.
31. Pearl FM, Lee D, Bray JE, Sillitoe I, Todd AE, Harrison AP, et al. Assigning genomic sequences to CATH. Nucleic Acids Res 2000;28(1):277-282.
32. Lo Conte L, Ailey B, Hubbard TJ, Brenner SE, Murzin AG, Chothia C. SCOP: a structural classification of proteins database. Nucleic Acids Res 2000;28(1):257-259.
33. Holm L, Sander C. Touring protein fold space with Dali/FSSP. Nucleic Acids Res 1998;26(1):316-319.
34. Berman HM, Olson WK, Beveridge DL, Westbrook J, Gelbin A, Demeny T, et al. The Nucleic Acid Database. A comprehensive relational database of three-dimensional structures of nucleic acids. Biophys J 1992;63(3):751-759.
35. Vondrasek J, Wlodawer A. Database of HIV proteinase structures. TIBS 1997;22(5):183.
36. Hendlich M. Databases for protein-ligand complexes. Acta Cryst D 1998;54(1):1178-1182.
37. Baker W, van den Broek A, Camon E, Hingamp P, Sterk P, Stoesser G, et al. The EMBL nucleotide sequence database. Nucleic Acids Res 2000;28(1):19-23.
38. Okayama T, Tamura T, Gojobori T, Tateno Y, Ikeo K, Miyazaki S, et al. Formal design and implementation of an improved DDBJ DNA database with a new schema and object-oriented . Bioinformatics 1998;14(6):472-8.
39. Schuler GD, Epstein JA, Ohkawa H, Kans JA. Entrez: molecular biology database and retrieval system. Methods Enzymol 1996;266:141-62.
40. Tatusova TA, Karsch-Mizrachi I, Ostell JA. Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics 1999;15(7-8):536-43.
41. Lin J, Gerstein M. Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res 2000;10(6):808-18.
42. Eisen MB, Brown PO. DNA arrays for analysis of gene expression. Methods Enzymol 1999;303:179-205.
43. Cheung VG, Morley M, Aguilar F, Massimi A, Kucherlapati R, Childs G. Making and reading microarrays. Nat Genet 1999;21(1 Suppl):15-9.
44. Duggan DJ, Bittner M, Chen Y, Meltzer P, Trent JM. Expression profiling using cDNA microarrays. Nat Genet 1999;21(1 Suppl):10-4.
45. Lipshutz RJ FS, Gingeras TR, Lockhart DJ. High density synthetic oligonucleotide arrays. Nat Gen 1999;21(1):20-24.
46. Velculescu VE ZL, Zhou, W Traverso, G St Croix, B Vogelstein B, Kinzler KW. Serial Analysis of Gene Expression Detailed Protocol. 1999.
47. Roth FP HJ, Estep PW, Church GM. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat Biotechnol 1998;16(10):939-45.
48. Jelinsky SA, Samson LD. Global response of Saccharomyces cerevisiae to an alkylating agent. Proc Natl Acad Sci U S A 1999;96(4):1486-91.
49. Cho RJ, Campbell MJ, Winzeler EA, Steinmetz L, Conway A, Wodicka L, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell 1998;2(1):65-73.
50. DeRisi JL, Iyer VR, Brown PO. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997;278(5338):680-6.
51. Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science 1999;285(5429):901-6.
52. Perou CM, Sorlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, et al. Molecular portraits of human breast tumours. Nature 2000;406(6797):747-52.
53. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531-7.
54. Celis JE, Gromov P. 2D protein electrophoresis: can it be perfected? Curr Opin Biotechnol 1999;10(1):16-21.
55. Pandey A, Mann M. Proteomics to study genes and genomes. Nature 2000;405(6788):837-46.
56. Futcher B, Latter GI, Monardo P, McLaughlin CS, Garrels JI. A sampling of the yeast proteome. Mol Cell Biol 1999;19(11):7357-68.
57. Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 1999;17(10):994-9.
58. Gerstein M. Integrative database analysis in structural genomics. Nature Struct Biol 2000;7:960-3.
59. Etzold T, Ulyanov A, Argos P. SRS: system for molecular biology data banks. Methods Enzymol 1996;266:114-28.
60. Wade K. Searching Entrez PubMed and uncover on the internet [news]. Aviat Space Environ Med 2000;71(5):559.
61. Zhang MQ. Promoter analysis of co-regulated genes in the yeast genome. Comput Chem 1999;23(3-4):233-50.
62. Boguski MS. Biosequence exegesis. Science 1999;286(5439):453-5.
63. Miller C, Gurd J, Brass A. A RAPID for sequence database comparisons: application to the identification of vector contamination in the EMBL databases. Bioinformatics 1999;15(2):111-21.
64. Gonnet GH, Korostensky C, Benner S. measures of multiple sequence alignments [In Process Citation]. J Comput Biol 2000;7(1-2):261-76.
65. Orengo CA, Taylor WR. SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol 1996;266:617-35.
66. Orengo CA. CORA - topological fingerprints for protein structural families. Protein Sci 1999;8(4):699-715.
67. Russell RB, Sternberg MJ. Structure prediction. How good are we? Curr Biol 1995;5(5):488-90.
68. Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M, Laskowski RA, et al. Protein folds and functions. Structure 1998;6(7):875-84.
69. Hegyi H, Gerstein M. The relationship between protein structure and function: a


comprehensive survey with application to the yeast genome. J Mol Biol 1999;288(1):147-64.
70. Russell RB, Sasieni PD, Sternberg MJE. Supersites within superfolds. Binding site similarity in the absence of homology. J Mol Biol 1998;282(4):903-18.
71. Wilson CA, Kreychman J, Gerstein M. Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000;297(1):233-49.
72. Harrison SC. A structural taxonomy of DNA-binding domains. Nature 1991;353(6346):715-9.
73. Luscombe NM, Austin SE, Berman HM, Thornton JM. An overview of the structures of protein-DNA complexes. Genome Biology 2000;1(1):1-37.
74. Jones S, van Heyningen P, Berman HM, Thornton JM. Protein-DNA interactions: A structural analysis. J Mol Biol 1999;287(5):877-96.
75. Suzuki M, Gerstein M. Binding geometry of alpha-helices that recognize DNA. Proteins 1995;23(4):525-35.
76. Luscombe NM, Thornton JM. Protein-DNA interactions: a 3D analysis of alpha-helix-binding in the major groove. Manuscript in preparation.
77. Suzuki M, Brenner SE, Gerstein M, Yagi N. DNA recognition code of transcription factors. Protein Eng 1995;8(4):319-28.
78. Suzuki M. DNA recognition by a beta-sheet. Protein Eng 1995;8(1):1-4.
79. Seeman NC, Rosenberg JM, Rich A. Sequence specific recognition of double helical nucleic acids by proteins. Proc Natl Acad Sci U S A 1976;73:804-808.
80. Suzuki M. A framework for the DNA-protein recognition code of the probe helix in transcription factors: the chemical and stereochemical rules [see comments]. Structure 1994;2(4):317-26.
81. Mandel-Gutfreund Y, Schueler O, Margalit H. Comprehensive analysis of hydrogen bonds in regulatory protein DNA-complexes: in search of common principles. J Mol Biol 1995;253(2):370-82.
82. Luscombe NM, Laskowski RA, Thornton JM. Protein-DNA interactions: a 3D analysis of amino acid-base interactions. Manuscript in preparation.
83. Mandel-Gutfreund Y, Margalit H, Jernigan RL, Zhurkin VB. A role for CH...O interactions in protein-DNA recognition. J Mol Biol 1998;277(5):1129-40.
84. Sternberg MJ, Gabb HA, Jackson RM. Predictive docking of protein-protein and protein-DNA complexes. Curr Opin Struct Biol 1998;8(2):250-6.
85. Aloy P, Moont G, Gabb HA, Querol E, Aviles FX, Sternberg MJ. Modelling repressor proteins docking to DNA. Proteins 1998;33(4):535-49.
86. Dickerson RE. DNA bending: the prevalence of kinkiness and the virtues of normality. Nucleic Acids Res 1998;26(8):1906-26.
87. Perez-Rueda E, Collado-Vides J. The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res 2000;28(8):1838-47.
88. Mewes HW, Frishman D, Gruber C, Geier B, Haase D, Kaps A, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2000;28(1):37-40.
89. Salgado H, Santos-Zavaleta A, Gama-Castro S, Millan-Zarate D, Blattner FR, Collado-Vides J. RegulonDB (version 3.0): transcriptional regulation and operon organization in Escherichia coli K-12. Nucleic Acids Res 2000;28(1):65-7.
90. Wingender E, Chen X, Hehl R, Karas H, Liebich I, Matys V, et al. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res 2000;28(1):316-9.
91. Teichmann SA, Chothia C, Gerstein M. Advances in structural genomics. Curr Opin Struct Biol 1999;9(3):390-9.
92. Aravind L, Koonin EV. DNA-binding proteins and evolution of transcription regulation in the archaea. Nucleic Acids Res 1999;27(23):4658-70.
93. Huynen MA, van Nimwegen E. The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol 1998;15(5):583-9.
94. Luscombe NM, Thornton JM. Protein-DNA interactions: an analysis of amino acid conservation and the effect on binding specificity. Manuscript in preparation.
95. Gelfand MS. Prediction of function in DNA sequence analysis. J Comp Biol 1995;1:87-115.
96. Robison K, McGuire AM, Church GM. A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J Mol Biol 1998;284(2):241-54.
97. Thieffry D, Salgado H, Huerta AM, Collado-Vides J. Prediction of transcriptional regulatory sites in the complete genome sequence of Escherichia coli K-12. Bioinformatics 1998;14(5):391-400.
98. Mironov AA, Koonin EV, Roytberg MA, Gelfand MS. Computer analysis of transcription regulatory patterns in completely sequenced bacterial genomes. Nucleic Acids Res 1999;27(14):2981-9.
99. Gelfand MS, Koonin EV, Mironov AA. Prediction of transcription regulatory sites in Archaea by a comparative genomic approach. Nucleic Acids Res 2000;28(3):695-705.
100. McGuire AM, Hughes JD, Church GM. Conservation of DNA regulatory motifs and discovery of new motifs in microbial genomes [In Process Citation]. Genome Res 2000;10(6):744-57.
101. Bysani N, Daugherty JR, Cooper TG. Saturation mutagenesis of the UASNTR (GATAA) responsible for nitrogen catabolite repression-sensitive transcriptional activation of the allantoin pathway genes in Saccharomyces cerevisiae. J Bacteriol 1991;173(16):4977-82.
102. Clarke ND, Berg JM. Zinc fingers in Caenorhabditis elegans: finding families and probing pathways. Science 1998;282(5396):2018-22.
103. van Helden J, Andre B, Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998;281(5):827-42.
104. Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J. Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci U S A 2000;97(12):6652-7.
105. Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS, Borodovsky M, et al. Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli. Curr Biol 1996;6(3):279-91.
106. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998;95(25):14863-8.
107. Wen X, Fuhrman S, Michaels GS, Carr DB, Smith S, Barker JL, et al. Large-scale temporal gene expression mapping of central development. Proc Natl Acad Sci U S A 1998;95(1):334-9.
108. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A 1999;96(12):6745-50.
109. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 1999;96(6):2907-12.
110. Toronen P, Kolehmainen M, Wong G, Castren E. Analysis of gene expression data using self-organizing maps. FEBS Lett 1999;451(2):142-6.
111. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic . Nat Genet 1999;22(3):281-5.


112. Jansen R, Gerstein M. Analysis of the yeast transcriptome with structural and functional categories: characterizing highly expressed proteins. Nucleic Acids Res 2000;28(6):1481-8.
113. Gerstein M, Jansen R. The current excitement in bioinformatics, analysis of whole-genome expression data: how does it relate to protein structure and function. Current Opinion in Structural Biology 2000;10:574-84.
114. Drawid A, Gerstein M. A Bayesian System Integrating Expression Data with Sequence Patterns for Localizing Proteins: Comprehensive Application to the Yeast Genome. J Mol Biol 2000;301:1059-75.
115. Drawid A, Jansen R, Gerstein M. Genome-wide analysis relating expression level with protein subcellular localisation. TIGS 2000;16:426-30.
116. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science 1999;285(5428):751-3.
117. Eisenberg D, Marcotte EM, Xenarios I, Yeates TO. Protein function in the post-genomic era. Nature 2000;405(6788):823-6.
118. Jansen R, Greenbaum D, Gerstein M. Relating whole-genome expression data with protein-protein interactions. Manuscript in preparation.
119. Marx J. Medicine. DNA arrays reveal cancer in its many forms. Science 2000;289(5485):1670-2.
120. Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, et al. Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet 2000;24(3):227-35.
121. Perou CM, Jeffrey SS, van de Rijn M, Rees CA, Eisen MB, Ross DT, et al. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci U S A 1999;96(16):9212-7.
122. Livesey FJ, Furukawa T, Steffen MA, Church GM, Cepko CL. Microarray analysis of the transcriptional network controlled by the photoreceptor homeobox gene Crx. Curr Biol 2000;10(6):301-10.
123. Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 1993;234(3):779-815.
124. Jones DT, Taylor WR, Thornton JM. A new approach to protein fold recognition. Nature 1992;358(6381):86-9.
125. Kok K, Naylor SL, Buys CH. Deletions of the short arm of chromosome 3 in solid tumors and the search for suppressor genes. Adv Cancer Res 1997;71:27-92.
126. Syngal S, Fox EA, Eng C, Kolodner RD, Garber JE. Sensitivity and specificity of clinical criteria for hereditary non-polyposis colorectal cancer associated mutations in MSH2 and MLH1. J Med Gen 2000;37(9):641-645.
127. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000;403(6770):623-7.
128. Ross-Macdonald P, Sheehan A, Friddle C, Roeder GS, Snyder M. Transposon mutagenesis for the analysis of protein production, function, and localization. Methods Enzymol 1999;303:512-32.
129. Mewes HW, Heumann K, Kaps A, Mayer K, Pfeiffer F, Stocker S, et al. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 1999;27(1):44-8.
130. Murray-Rust P. Bioinformatics and drug discovery. Curr Opin Biotechnol 1994;5(6):648-53.
131. Friend SH. How DNA microarrays and expression profiling will affect clinical practice. BMJ 1999;319(7220):1306-7.
132. Tamayo P SD, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A 1999;96(6):2907-12.
133. Perou CM JS, van de Rijn M, Rees CA, Eisen MB, Ross DT, Pergamenschikov A, Williams CF, Zhu SX, Lee JC, Lashkari D, Shalon D, Brown PO, Botstein D. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci 1999;96(16):9212-7.
134. Hiltunen MO, Niemi M, Yla-Herttuala S. Functional genomics and DNA array techniques in atherosclerosis research. Curr Opin Lipidol 1999;10(6):515-9.
135. Colantuoni C, Purcell AE, Bouton CM, Pevsner J. High throughput analysis of gene expression in the human brain. J Neurosci Res 2000;59(1):1-10.
136. Debouck C, Metcalf B. The impact of genomics on drug discovery. Annu Rev Pharmacol Toxicol 2000;40:193-207.
137. Sander C. Genomic medicine and the future of health care. Science 2000;287(5460):1977-8.
138. Ohlstein EH, Ruffolo RR, Jr., Elliott JD. Drug discovery in the next millennium. Annu Rev Pharmacol Toxicol 2000;40:177-91.

Address of the authors:
Nicholas M. Luscombe, Dov Greenbaum, Mark Gerstein*
Department of Molecular Biophysics and Biochemistry
Yale University
266 Whitney Avenue
PO Box 208 114
New Haven CT 06520-8114, USA
[email protected]

*corresponding author
