<<

346 © 2001 Schattauer GmbH

What is ? A Proposed Definition and Overview of the Field

N. M. Luscombe, D. Greenbaum, M. Gerstein Department of Molecular and Yale University, New Haven, USA

related to this new field has been surging, Summary 1. Introduction and now comprise almost 2% of the Background: The recent flood of from annual total of papers in PubMed. sequences and functional has given rise to are being produced at a This unexpected union between the two new field, bioinformatics, which combines elements of and science. phenomenal rate [1]. For example as subjects is attributed to the fact that life Objectives: Here we propose a definition for this of April 2001, the GenBank repository of itself is an technology; an new field and review some of the research that is sequences contained ’s is largely deter- being pursued, particularly in relation to transcriptional 11,546,000 entries [2] and the SWISS- mined by its , which at its most basic regulatory . PROT of sequences con- can be viewed as digital information.At the Methods: Our definition is as follows: Bioinformatics tained 95,320 [3]. On average, these databa- same time, there have been major advances is conceptualizing biology in terms of ses are doubling in size every 15 months [2]. in the technologies that supply the initial (in the sense of physical-chemistry) and then applying In addition, since the publication of data; Anthony Kervalage of Celera recent- “” techniques (derived from disciplines the H. influenzae genome [4], complete ly cited that an experimental laboratory such as , , and statis- sequences for nearly 300 have can produce over 100 gigabytes of data a tics) to understand and organize the information been released, ranging from 450 genes to day with ease [5]. This incredible pro- associated with these , on a large-scale. over 100,000. Add to this the data from the cessing power has been matched by devel- Results and Conclusions: Analyses in bioinformatics predominantly focus on three types of large datasets myriad of related projects that study opments in computer technology; the most available in : macromolecular struc- expression, determine the protein structu- important areas of improvements have tures, genome sequences, and the results of - res encoded by the genes, and detail how been in the CPU, disk storage and Internet, al genomics (eg expression data). these products interact with one another, allowing faster , better data Additional information includes the text of scientific and we can begin to imagine the enormous storage and revolutionalised the methods papers and “relationship data” from metabolic path- quantity and of information that is for accessing and exchanging data. ways, trees, and protein-protein being produced. networks. Bioinformatics employs a wide As a result of this surge in data, compu- of computational techniques including sequence and ters have become indispensable to biologi- , database design and data cal research. Such an approach is ideal 1.1 Aims of Bioinformatics mining, macromolecular , construction, prediction of and because of the ease with which In general, the aims of bioinformatics are function, gene finding, and expression data clustering. can handle large quantities of data and three-fold. First, at its simplest bioinfor- The emphasis is on approaches integrating a variety of probe the complex dynamics observed in matics organises data in a way that allows computational methods and heterogeneous data . Bioinformatics, the subject of the researchers to access existing information sources. Finally, bioinformatics is a practical discipline. current review, is often defined as the appli- and to submit new entries as they are We survey some representative applications, such as cation of computational techniques to produced, e.g. the for finding homologues, designing drugs, and performing understand and organise the information 3D macromolecular structures [6, 7]. While large-scale . Additional information pertinent associated with biological macromolecules. data-curation is an essential task, the in- to the review is available over the web at Fig. 1 shows that the number of papers formation stored in these is http://bioinfo.mbb.yale.edu/what-is-it. essentially useless until analysed. Thus the purpose of bioinformatics extends much Keywords further. The second aim is to develop tools Bioinformatics, Genomics, Introduction, Transcription Updated version of an invited review paper and resources that aid in the analysis of Regulation that appeared in Haux, R., Kulikowski, C. (eds.) (2001). IMIA Yearbook of Medical Informatics data. For example, having sequenced a par- Method Inform Med 2001; 40: 346–58 2001: Digital Libraries and Medicine, pp. 83–99. ticular protein, it is of interest to compare it Stuttgart: Schattauer. with previously characterised sequences.

Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 347 What is Bioinformatics?

2001).At the next level are protein sequenc- es comprising strings of 20 - letters. At present there are about 400,000 known protein sequences [3], with a typical bacterial protein containing approximately 300 amino acids. Macromolecular struc- tural data represents a more complex form of information. There are currently 15,000 entries in the Protein Data Bank, PDB [6, 7], containing atomic structures of pro- teins, DNA and RNA solved by x-ray crystallography and NMR. A typical PDB file for a medium-sized protein contains the xyz-coordinates of approximately 2,000 . Scientific euphoria has recently centred on whole genome . As with the raw DNA sequences, consist of strings of base-letters, ranging from 1.6 million bases in Fig. 1 Plot showing the growth of scientific publications in bioinformatics between 1973 and 2000. The bars (left vertical axis) counts the total number of scientific articles relating to bioinformatics, and the black line (right vertical [10] to 3 billion in humans [11, 12]. The axis) gives the percentage of the annual total of articles relating to bioinformatics. The data are taken from PubMed. Entrez database [13] currently has com- plete sequences for nearly 300 archaeal, bacterial and eukaryotic organisms. In This needs more than just a simple text- finally some of the major practical applica- addition to producing the raw based search, and programs such as FASTA tions of bioinformatics. sequence, a lot of work is involved in [8] and PSI-BLAST [9] must consider what processing this data. An important aspect constitutes a biologically significant match. of complete genomes is the distinction Development of such resources dictates between coding regions and non-coding expertise in computational theory, as well regions -‘junk’ repetitive sequences making as a thorough understanding of biology. 2. “…the INFORMATION up the bulk of base sequences especially in The third aim is to use these tools to ana- associated with these eukaryotes. Within the coding regions, lyse the data and interpret the results in a genes are annotated with their translated biologically meaningful manner. Traditio- Molecules…” protein sequence, and often with their nally, biological studies examined individu- cellular function. al systems in detail, and frequently com- Table 1 lists the types of data that are pared them with a few that are related. In analysed in bioinformatics and the range of bioinformatics, we can now conduct global topics that we consider to fall within the Bioinformatics – a Definition1 analyses of all the available data with the field. Here we take a broad view and in- aim of uncovering common principles that clude subjects that may not normally be (Molecular) bio – informatics: bioinfor- apply across many systems and highlight listed. We also give approximate values matics is conceptualising biology in novel features. describing the sizes of data being discussed. terms of molecules (in the sense of Phy- In this review, we provide a systematic We start with an overview of the sources sical chemistry) and applying “informa- definition of bioinformatics as shown in of information. Most bioinformatics analy- tics techniques” (derived from disci- Box 1. We focus on the first and third aims ses focus on three primary sources of data: plines such as applied maths, computer just described, with particular reference to DNA or protein sequences, macromolecu- science and ) to understand and the keywords: information, informatics, lar structures and the results of functional organise the information associated organisation, understanding, large-scale genomics experiments. Raw DNA se- with these molecules, on a large scale. In and practical applications. Specifically, we quences are strings of the four base-letters short, bioinformatics is a management discuss the range of data that are currently comprising genes, each typically 1,000 bases information for molecular biolo- being examined, the databases into which long. The GenBank [2] repository of gy and has many practical applications. they are organised, the types of analyses nucleic acid sequences currently holds a 1As submitted to the Oxford English that are being conducted using transcrip- total of 12.5 billion bases in 11.5 million Dictionary. tion regulatory systems as an example, and entries (all database figures as of April

Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 Method Inform Med 4/2001 For personal or educational use only. No other uses without permission. All rights reserved. 348 Luscombe, Greenbaum, Gerstein

Table 1 Sources of data used in bioinformatics, the quantity of each type of data that is currently (April 2001) available, quences. While more biological informa- and bioinformatics subject areas that utilize this data. tion can be derived from a single structure than a protein sequence, the lack of depth in the latter is compensated by analysing larger quantities of data.

3. “… ORGANISE the Infor- mation on a LARGE SCALE…” 3.1 Redundancy and Multiplicity of Data A concept that underpins most research methods in bioinformatics is that much of the data can be grouped together based on biologically meaningful similarities. For example, sequence segments are often repeated at different positions of genomic DNA [27]. Genes can be clustered into those with particular functions (eg enzy- matic actions) or according to the meta- bolic pathway to which they belong [28], although here, single genes may actually possess several functions [29]. Going further, distinct frequently have comparable sequences – organisms often have multiple copies of a particular gene through duplication and different have equivalent or similar proteins that were inherited when they diverged from each other in . At a structural level, we predict there to be a finite number of different tertiary structures – estimates More recent sources of data have been tities of data when experiments are con- range between 1,000 and 10,000 folds from experiments, of ducted for larger organisms and at more [30, 31] – and proteins adopt equivalent which the most common are gene expres- time-points. structures even when they differ greatly in sion studies.We can now determine expres- Further genomic-scale data include sequence [32]. As a result, although the sion levels of almost every gene in a given biochemical information on metabolic number of structures in the PDB has on a whole-genome level, however pathways, regulatory networks, protein- increased exponentially, the rate of discov- there is currently no central repository for protein interaction data from two-hybrid ery of novel folds has actually decreased. this data and public availability is limited. experiments, and systematic knockouts of There are common terms to describe the These experiments measure the amount of individual genes to test the viability of an relationship between pairs of proteins or mRNA that is produced by the cell [14-18] organism. the genes from which they are derived: under different environmental conditions, What is apparent from this list is the analogous proteins have related folds, but different stages of the and differ- diversity in the size and complexity of dif- unrelated sequences, while homologous ent cell types in multi-cellular organisms. ferent datasets. There are invariably more proteins are both sequentially and structu- Much of the effort has so far focused on the sequence-based data than others because rally similar. The two categories can some- yeast [19-24] and human genomes [25, 26]. of the relative ease with which they can be times be difficult to distinguish especially if One of the largest dataset for yeast has produced.This is partly related to the great- the relationship between the two proteins made approximately 20 time-point meas- er complexity and information-content of is remote [33, 34]. Among homologues, it is urements for 6,000 genes [19]. However, individual structures or useful to distinguish between orthologues, there is potential for much greater quan- experiments compared to individual se- proteins in different species that have evolv-

Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 349 What is Bioinformatics?

ed from a common ancestral gene, and straightforward to access and cross- producing multiple sequence alignments paralogues, proteins that are related by reference these sources of information be- [48], and searching for functional domains within a genome [35]. cause of differences in nomenclature and from motifs in such Normally, orthologues retain the same file formats. alignments. Investigations of structural function while paralogues evolve distinct, At a basic level, this problem is fre- data include prediction of secondary and but related functions [36]. quently addressed by providing external tertiary protein structures, producing An important concept that arises from links to other databases. For example in methods for 3D structural alignments [49, these observations is that of a finite “parts PDBsum, web-pages for individual struc- 50], examining protein using list” for different organisms [37-39]: an tures direct the user towards corresponding distance and angular measurements, calcu- inventory of proteins contained within an entries in the PDB, NDB, CATH, SCOP lations of surface and volume shapes and organism, arranged according to different and SWISS-PROT databases. At a more analysis of protein interactions with other properties such as gene sequence, protein advanced level, there have been efforts to subunits, DNA, RNA and smaller mole- fold or function. Taking protein folds as an integrate access across several data sources. cules. These studies have lead to molecular example, we mentioned that with a few One is the Sequence Retrieval System, SRS simulation topics in which structural data exceptions, the tertiary structures of pro- [41], which allows flat-file databases to be are used to calculate the energetics in- teins adopt one of a limited repertoire indexed to each other; this allows the user volved in stabilising macromolecular struc- of folds. As the number of different fold to retrieve, link and access entries from tures, simulating movements within macro- families is considerably smaller than the nucleic acid, protein sequence, protein molecules, and the number of genes, categorising the proteins motif, protein structure and bibliographic involved in molecular docking. The increa- by fold provides a substantial simplification databases. Another is the Entrez facility sing availability of annotated genomic of the contents of a genome. Similar sim- [42], which provides similar gateways to sequences has resulted in the introduction plifications can be provided by other attri- DNA and protein sequences, genome of and butes such as protein function. As such, we mapping data, 3D macromolecular structu- – large-scale analyses of complete genomes expect this notion of a finite parts list to res and the PubMed bibliographic database and the proteins that they . Re- become increasingly common in future [43].A search for a particular gene in either search includes characterisation of protein genomic analyses. database will allow smooth transitions to content and metabolic pathways between Clearly, an essential aspect of managing the genome it comes from, the protein different genomes, identification of interac- this large volume of data in developing sequence it encodes, its structure, biblio- ting proteins, assignment and prediction of methods for assessing similarities between graphic reference and equivalent entries for gene products, and large-scale analyses of different and identifying all related genes. In our own group, we have gene expression levels. Some of these re- those that are related. There are well-docu- developed the SPINE [44] and PartsList search topics will be demonstrated in our mented classifications for all of the main [39] web resources; these databases inte- example analysis of transcription regula- types of data we described earlier. Al- grate many types of experimental data and tory systems. though detailed descriptions of these clas- organise them using the concept of the Other subject areas we have included in sification systems are beyond the scope of finite “parts list” we described above. Table 1 are:development of digital libraries the current review, they are of great impor- for automated bibliographical searches, tance as they ease comparisons between knowledge bases of biological information genomes and their products. Links to the from the literature, DNA analysis methods major databases are available from our in forensics, prediction of nucleic acid struc- supplementary website. 4. “…UNDERSTAND and tures, metabolic pathway simulations, and Organise the Information…” linkage analysis – linking specific genes to different disease traits. Having examined the data, we can discuss In addition to finding relationships be- 3.2 the types of analyses that are conducted.As tween different proteins, much of bioin- The most profitable research in bioinfor- shown in Table 1, the broad subject areas in formatics involves the analysis of one type matics often results from integrating mul- bioinformatics can be separated according of data to infer and understand the obser- tiple sources of data [40]. For instance, the to the type of information that is used. For vations for another type of data. An exam- 3D coordinates of a protein are more useful raw DNA sequences, investigations involve ple is the use of sequence and structural if combined with data about the protein’s separating coding and non-coding regions, data to predict the secondary and tertiary function, occurrence in different genomes, and identification of introns, exons and structures of new protein sequences [51]. and interactions with other molecules. In promoter regions for annotating genomic These methods, especially the former, are this way, individual pieces of information DNA [45, 46]. For protein sequences, ana- often based on statistical rules derived are put in context with respect to other lyses include developing for from structures, such as the propensity for data. Unfortunately, it is not always sequence comparisons [47], methods for certain amino acid sequences to produce

Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 Method Inform Med 4/2001 For personal or educational use only. No other uses without permission. All rights reserved. 350 Luscombe, Greenbaum, Gerstein

Paradigm shifts during the past couple of decades have taken much of biology away from the laboratory bench and have allowed the integration of other scientific disciplines, specifically computing. The result is an expansion of biological research in breadth and depth. The vertical axis demonstrates how bioinformatics can aid rational with minimal work in the wet lab. Starting with a single gene sequence, we can determine with strong certainty, the protein sequence. From there, we can determine the structure using structure prediction techniques. With geometry calculations, we can further resolve the protein’s surface and through molecular simulation determine the force fields surrounding the . Finally docking algorithms can provide predictions of the ligands that will bind on the protein surface, thus paving the way for the design of a drug specific to that molecule. The horizontal axis shows how the influx of biological data and advances in computer technology have broadened the scope of biology. Initially with a pair of proteins, we can make comparisons between the between sequences and structures of evolutionary related proteins. With more data, algorithms for multiple alignments of several proteins become necessary. Using multiple sequences, we can also create phylogenetic trees to trace the evolutionary development of the proteins in question. Finally, with the deluge of data we currently face, we need to construct large databases to store, view and deconstruct the information. Alignments now become more complex, requiring sophisticated scoring schemes and there is enough data to compile a genome – a genomic equivalent of a census – providing comprehensive statistical of protein features in genomes.

Fig. 2 Organizing and understanding biological data

different secondary structural elements. transferred between homologous proteins analysis in two dimension, depth and Another example is the use of structural [55]. breadth. The first is represented by the data to understand a protein’s function; vertical axis in the figure and outlines a here studies have investigated the rela- possible approach to the rational drug tionship different protein folds and their design . The aim is to take a single functions [52, 53] and analysed similarities 4.1 The Bioinformatics Spectrum gene and follow through an analysis that between different binding sites in the ab- Fig. 2 summarises the main points we maximises our understanding of the sence of [54]. Combined with raised in our discussions of organising protein it encodes. Starting with a gene similarity measurements, these studies pro- and understanding biological data – the sequence, we can determine the protein vide us with an understanding of how much development of bioinformatics techniques sequence with strong certainty. From there, biological information can be accurately has allowed an expansion of biological prediction algorithms can be used to calcu-

Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 351 What is Bioinformatics?

late the structure adopted by the protein. son methods such as text search and one- 6.1 Structural Studies Geometry calculations can define the dimensional alignment algorithms. Motif shape of the protein’s surface and molecu- and pattern identification for multiple As of April 2001, there were 379 structures lar simulations can determine the force sequences depend on machine , of protein-DNA complexes in the PDB. fields surrounding the molecule. Finally, clustering and data-mining techniques. 3D Analyses of these structures have provided using docking algorithms, one could structural analysis techniques include Eu- valuable insight into the stereochemical identify or design ligands that may bind clidean geometry calculations combined principles of binding, including how par- the protein, paving the way for designing a with basic application of physical chemis- ticular base sequences are recognized drug that specifically alters the protein’s try, graphical representations of surfaces and how the DNA structure is quite often function. In practise, the intermediate steps and volumes, and structural comparison modified on binding. are still difficult to achieve accurately, and and 3D methods. For molecular A structural taxonomy of DNA-binding they are best combined with experimental simulations, Newtonian mechanics, quan- proteins, similar to that presented in SCOP methods to obtain some of the data, for tum mechanics, molecular mechanics and and CATH, was first proposed by Harrison example characterising the structure of the electrostatic calculations are applied. In [56] and periodically updated to accom- protein of interest. many of these areas, the computational modate new structures as they are solved The aim of the second dimension, the methods must be combined with good [57].The classification consists of a two-tier breadth in biological analysis, is to compare statistical analyses in order to provide an system: the first level collects proteins into a gene or gene product with others. Ini- objective measure for the significance of eight groups that share gross structural tially, simple algorithms can be used to the results. features for DNA-binding, and the second compare the sequences and structures of a comprises 54 families of proteins that are pair of related proteins. With a larger num- structurally homologous to each other. ber of proteins, improved algorithms can be Assembly of such a system simplifies the used to produce multiple alignments, and 6. Transcription Regulation – comparison of different binding methods; it extract sequence patterns or structural a Case Study in Bioinformatics highlights the diversity of protein-DNA templates that define a family of proteins. complex geometries found in nature, but Using this data, it is also possible to con- DNA-binding proteins have a central role also underlines the importance of inter- struct phylogenetic trees to trace the evolu- in all aspects of genetic activity within an actions between -helices and the DNA tionary path of proteins. Finally, with even organism, participating in processes such as major groove, the main of binding in more data, the information must be stored transcription, packaging, rearrangement, over half the protein families. While the in large-scale databases. Comparisons and repair. In this section, we number of structures represented in the become more complex, requiring multiple focus on the studies that have contributed PDB does not necessarily reflect the rela- scoring schemes, and we are able to con- to our understanding of transcription tive importance of the different proteins in duct genomic scale censuses that provide regulation in different organisms. Through the cell, it is clear that helix-turn-helix, comprehensive statistical accounts of this example, we demonstrate how bio- zinc-coordinating and leucine zipper motifs protein features, such as the abundance of informatics has been used to increase our are used repeatedly.These provide compact particular structures or functions in diffe- knowledge of biological systems and also frameworks to present the -helix on the rent genomes. It also allows us to build illustrate the practical applications of the surfaces of structurally diverse proteins. At phylogenetic trees that trace the evolution different subject areas that were briefly a gross level, it is possible to highlight the of whole organisms. outlined earlier. We start by considering differences between transcription factor structural analyses of how DNA-binding domains that “just” bind DNA and those proteins recognise particular base se- involved in catalysis [58]. Although there quences. Later, we review several genomic are exceptions, the former typically 5. “… applying INFORMATICS studies that have characterised the nature approach the DNA from a single face and TECHNIQUES…” of transcription factors in different orga- slot into the grooves to interact with base nisms, and the methods that have been used edges. The latter commonly envelope the The distinct subject areas we mention to identify regulatory binding sites in the substrate, using complex networks of require different types of informatics tech- upstream regions. Finally, we provide an secondary structures and loops. niques. Briefly, for data organisation, the overview of gene expression analyses that Focusing on proteins with -helices, the first biological databases were simple flat have been recently conducted and suggest structures show many variations, both in files. However with the increasing amount future uses of transcription regulatory ana- amino acid sequences and detailed geo- of information, relational database lyses to rationalise the observations made metry. They have clearly evolved indepen- methods with Web-page interfaces have in gene expression experiments. All the dently in accordance with the requirements become increasingly popular. In sequence results that we describe have been found of the context in which they are found. analysis, techniques include string compari- through computational studies. While achieving a close fit between the

Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 Method Inform Med 4/2001 For personal or educational use only. No other uses without permission. All rights reserved. 352 Luscombe, Greenbaum, Gerstein

-helix and major groove, there is enough interactions of arginine or lysine with 6.2 Genomic Studies flexibility to allow both the protein and guanine, asparagine or glutamine with DNA to adopt distinct conformations. adenine and threonine with thymine. Such Due to the wealth of biochemical data that However, several studies that analysed the preferences were explained through exami- are available, genomic studies in bioin- binding geometries of -helices demon- nation of the stereochemistry of the amino formatics have concentrated on model strated that most adopt fairly uniform con- acid side chains and base edges. Also organisms, and the analysis of regulatory formations regardless of . highlighted were more complex types of systems has been no exception. Identification They are commonly inserted in the major interactions where single amino acids of transcription factors in genomes invari- groove sideways, with their lengthwise axis contact more than one base-step simulta- ably depends on similarity search strate- roughly parallel to the slope outlined by neously, thus recognising a short DNA gies, which assume a functional and evolu- the DNA backbone. Most start with the sequence. These results suggested that tionary relationship between homologous N-terminus in the groove and extend out, universal specificity, one that is observed proteins. In E. coli, studies have so far completing two to three turns within across all protein-DNA complexes, indeed estimated a total of 300 to 500 transcription contacting distance of the nucleic acid [59, exists. However, many interactions that are regulators [71] and PEDANT [72], a data- 60]. normally considered to be non-specific, base of automatically assigned gene funct- Given the similar binding orientations, it such as those with the DNA backbone, can ions, shows that typically 2-3% of pro- is surprising to find that the interactions also provide specificity depending on the karyotic and 6-7% of eukaryotic genomes between each amino acid position along context in which they are made. comprise DNA-binding proteins.As assign- the -helices and on the DNA Armed with an understanding of ments were only complete for 40-60% of vary considerably between different pro- protein structure, DNA-binding motifs and genomes as of August 2000, these figures tein families. However, by classifying the side chain stereochemistry, a major applica- most likely underestimate the actual num- amino acids according to the sizes of their tion has been the prediction of binding ber. Nonetheless, they already represent a side chains, we are able to rationalise the either by proteins known to contain a parti- large quantity of proteins and it is clear that different interactions patterns. The rules of cular motif, or those with structures solved there are more transcription regulators interactions are based on the simple pre- in the uncomplexed form. Most common in eukaryotes than other species. This is mise that for a given residue position on are predictions for -helix-major groove unsurprising, considering the organisms -helices in similar conformations, small interactions – given the amino acid se- have developed a relatively sophisticated amino acids interact with nucleotides that quence, what DNA sequence would it transcription mechanism. are close in distance and large amino acids recognise [61, 67]. In a different approach, From the conclusions of the structural with those that are further [60, 61]. Equi-va- molecular simulation techniques have been studies, the best strategy for characterising lent studies for binding by other structural used to dock whole proteins and on DNA-binding of the putative transcription motifs, like -hairpins, have also been con- the basis of force-field calculations around factors in each genome is to group them ducted [62]. When considering these the two molecules [68, 69]. by homology and to analyse the individual interactions, it is important to remember The that both methods have families. Such classifications are provided that different regions of the protein surface been met with limited success is because in the secondary sequence databases also provide interfaces with the DNA. even for apparently simple cases like - described earlier and also those that This brings us to look at the atomic level helix-binding, there are many other factors specialise in regulatory proteins such as interactions between individual amino that must be considered. Comparisons RegulonDB [73] and TRANSFAC [74]. acid-base pairs. Such analyses are based on between bound and unbound nucleic acid Of even greater use is the provision of the premise that a significant proportion of structures show that DNA-bending is a structural assignments to the proteins; specific DNA-binding could be rationalised common feature of complexes formed with given a transcription factor, it is helpful to by a universal code of recognition between transcription factors [58, 70].This and other know the that it uses for amino acids and bases, ie whether certain factors such as electrostatic and cation- binding, therefore providing us with a protein residues preferably interact with mediated interactions assist indirect better understanding of how it recognises particular nucleotides regardless of the recognition of the nucleotide sequence, the target sequence. type of protein-DNA complex [63]. Studies although they are not well understood yet. through bioinformatics assigns structures have considered hydrogen bonds, van der Therefore, it is now clear that detailed rules to the protein products of genomes by Waals contacts and water-mediated bonds for specific DNA-binding will be family demonstrating similarity to proteins of [64-66]. Results showed that about 2/3 of all specific, but with underlying trends such as known structure [75]. These studies have interactions are with the DNA back- the arginine-guanine interactions. shown that prokaryotic transcription fac- bone and that their main role is one of tors most frequently contain helix-turn- sequence-independent stabilisation. In helix motifs [71, 76] and eukaryotic factors contrast, interactions with bases display contain homeodomain type helix-turn- some strong preferences, including the helix, zinc finger or leucine zipper motifs.

Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 353 What is Bioinformatics?

From the protein classifications in each specific but different members bind distinct approach, it was found that at least 27% of genome, it is clear that different types of base sequences. Here protein residues known E. coli DNA-regulatory motifs are regulatory proteins differ in abundance and undergo frequent , and family conserved in one or more distantly related families significantly differ in size. A study members can be divided into subfamilies bacteria [84]. by Huynen and van Nimwegen [77] has according to the amino acid sequences The detection of regulatory sites in shown that members of a single family have at base-contacting positions; those in the eukaryotes poses a more difficult problem similar functions, but as the requirements same subfamily are predicted to bind the because consensus sequences tend to be of this function vary over time, so does same DNA sequence and those of different much shorter, variable, and dispersed over the presence of each gene family in the subfamilies to bind distinct sequences. On very large distances. However, initial stud- genome. the whole, the subfamilies corresponded ies in S. cerevisiae provided an interesting Most recently, using a combination of well with the proteins’ functions and mem- observation for the GATA protein in nitro- sequence and structural data, we examined bers of the same subfamilies were found to gen regulation. While the 5 the conservation of amino acid sequences regulate similar transcription pathways. base-pair GATA is between related DNA-binding proteins, The combined analysis of sequence and found almost everywhere in the genome, a and the effect that mutations have on structural data described by this study pro- single isolated binding site is insufficient to DNA sequence recognition. The structural vided an insight into how homologous exert the regulatory function [85]. There- families described above were expanded to DNA-binding scaffolds achieve different fore specificity of GATA activity comes include proteins that are related by sequence specificities by altering their amino acid from the repetition of the consensus se- similarity, but whose structures remain sequences. In doing so, proteins evolved quence within the upstream regions of con- unsolved. Again, members of the same distinct functions, therefore allowing trolled genes in multiple copies. An initial family are homologous, and probably derive structurally related transcription factors to study has used this observation to predict from a common ancestor. regulate expression of different genes. new regulatory sites by searching for over- Amino acid conservations were calculat- Therefore, the relative abundance of tran- represented in non-coding ed for the multiple sequence alignments scription regulatory families in a genome regions of yeast and worm genomes [86, of each family [78]. Generally, alignment depends, not only on the importance of a 87]. positions that interact with the DNA are particular protein function, but also in the Having detected the regulatory binding better conserved than the rest of the pro- adaptability of the DNA-binding motifs to sites, there is the problem of defining the tein surface, although the detailed patterns recognise distinct nucleotide sequences. genes that are actually regulated, commonly of conservation are quite complex. Residues This, in turn, appears to be best accommo- termed regulons. Generally, binding sites that contact the DNA backbone are highly dated by simple binding motifs, such as the are assumed to be located directly upstream conserved in all protein families, providing zinc fingers. of the regulons; however there are different a set of stabilising interactions that are Given the knowledge of the transcription problems associated with this assumption common to all homologous proteins. The regulators that are contained in each depending on the organism. For prokary- conservation of alignment positions that organism, and an understanding of how otes, it is complicated by the presence of contact bases, and recognise the DNA they recognise DNA sequences, it is of operons; it is difficult to locate the regulat- sequence, are more complex and could be interest to search for their potential bind- ed gene within an operon since it can rationalised by defining a three-class model ing sites within genome sequences [79]. several genes downstream of the regulatory for DNA-binding. First, protein families For prokaryotes, most analyses have in- sequence. It is often difficult to predict the that bind non-specifically usually contain volved compiling data on experimentally organisation of operons [88], especially to several conserved base-contacting residues; known binding sites for particular proteins define the gene that is found at the head, without exception, interactions are made in and building a consensus sequence that in- and there is often a lack of long-range con- the minor groove where there is little corporates any variations in nucleotides. servation in gene order between related discrimination between base types. The Additional sites are found by conducting organisms [89]. The problem in eukaryotes contacts are commonly used to stabilise word-matching searches over the entire is even more severe; regulatory sites often deformations in the nucleic acid structure, genome and scoring candidate sites by act in both directions, binding sites are particularly in widening the DNA minor similarity [80-83]. Unsurprisingly, most of usually distant from regulons because of groove. The second class comprise families the predicted sites are found in non-coding large intergenic regions, and transcription whose members all target the same regions of the DNA [80] and the results of regulation is usually a result of combined nucleotide sequence; here, base-contacting the studies are often presented in databases action by multiple transcription factors in a positions are absolutely or highly conser- such as RegulonDB [73]. The consensus combinatorial manner. ved allowing related proteins to target the search approach is often complemented by Despite these problems, these studies same sequence. comparative genomic studies searching have succeeded in confirming the transcrip- The third, and most interesting, class upstream regions of orthologous genes in tion regulatory pathways of well-character- comprises families in which binding is also closely related organisms. Through such an ized systems such as the heat shock response

Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 Method Inform Med 4/2001 For personal or educational use only. No other uses without permission. All rights reserved. 354 Luscombe, Greenbaum, Gerstein

system [83]. In addition, it is feasible to relates to the functions associated with the distinction between different forms of experimentally verify any predictions, most these folds; the former are commonly in- – for example subclasses of acute notably using gene expression data. volved in metabolic pathways and the latter leukaemia – has been well established, it is in signalling or transport processes [98]. still not possible to establish a clinical diag- This is also reflected in the relationship nosis on the basis of a single test. In a with subcellular localisations of proteins, recent study, acute myeloid leukaemia and 6.3 Gene Expression Studies where expression of cytoplasmic proteins is acute lymphoblastic leukaemia were suc- Many expression studies have so far high, but nuclear and membrane proteins cessfully distinguished based on the ex- focused on devising methods to cluster tend to be low [99, 100]. pression profiles of these cells [26]. As the genes by similarities in expression profiles. More complex relationships have also approach does not require prior biological This is in order to determine the proteins been assessed. Conventional wisdom is that knowledge of the diseases, it may provide a that are expressed together under different gene products that interact with each other generic strategy for classifying all types of cellular conditions. Briefly, the most com- are more likely to have similar expression cancer. mon methods are , profiles than if they do not [101, 102]. How- Clearly, an essential aspect of under- self-organising maps, and K- cluster- ever, a recent study showed that this rela- standing expression data lies in understand- ing. Hierarchical methods originally de- tionship is not so simple [103]. While ex- ing the basis of transcription regulation. rived from algorithms to construct phyloge- pression profiles are similar for gene pro- However, analysis in this area is still limited netic trees, and group genes in a “bottom- ducts that are permanently associated, for to preliminary analyses of expression levels up” fashion; genes with the most similar example in the large ribosomal subunit, in yeast mutants lacking key components of expression profiles are clustered first, and profiles differ significantly for products the transcription initiation complex [19, those with more diverse profiles are that are only associated transiently, includ- 107]. included iteratively [90-92]. In contrast, the ing those belonging to the same metabolic self-organising map [93, 94] and K-means pathway. methods [95, 96] employ a “top-down” As described below, one of the main approach in which the user pre-defines the driving forces behind expression analysis 7 “… many PRACTICAL number of clusters for the dataset. The has been to analyse cancerous cell lines APPLICATIONS…” clusters are initially assigned randomly, and [104]. In general, it has been shown that dif- the genes are regrouped iteratively until ferent cell lines (eg epithelial and ovarian Here, we describe some of the major uses they are optimally clustered. cells) can be distinguished on the basis of of bioinformatics. Given these methods, it is of interest to their expression profiles, and that these relate the expression data to other attri- profiles are maintained when cells are 7.1 Finding Homologues butes such as structure, function and sub- transferred from an in vivo to an in vitro cellular localisation of each gene product. environment [105]. The basis for their phy- As described earlier, one of the driving Mapping these properties provides an siological differences were apparent in the forces behind bioinformatics is the search insight into the characteristics of proteins expression of specific genes; for example, for similarities between different biomole- that are expressed together, and also sug- expression levels of gene products neces- cules.Apart from enabling systematic orga- gest some interesting conclusions about the sary for progression through the cell cycle, nisation of data, identification of protein overall biochemistry of the cell. In yeast, especially ribosomal genes, correlated well homologues has some direct practical uses. shorter proteins tend to be more highly with variations in cell proliferation rate. The most obvious is transferring informa- expressed than longer proteins, probably Comparative analysis can be extended to tion between related proteins. For example, because of the relative ease with which they tumour cells, in which the underlying given a poorly characterised protein, it is are produced [97]. Looking at the amino causes of cancer can be uncovered by possible to search for homologues that are acid content, highly expressed genes are pinpointing areas of biological variations better understood and with caution, apply generally enriched in alanine and glycine, compared to normal cells. For example in some of the knowledge of the latter to the and depleted in asparagine; these are , genes related to cell prolif- former. Specifically with structural data, thought to reflect the requirements of eration and the IFN-regulated trans- theoretical models of proteins are usually amino acid usage in the organism, where duction pathway were found to be upregu- based on experimentally solved structures synthesis of alanine and glycine are energe- lated [25, 106]. One of the difficulties in of close homologues [108]. Similar tech- tically less expensive than asparagine.Turn- cancer treatment has been to target specific niques are used in fold recognition in which ing to protein structure, expression levels of therapies to pathogenetically distinct tu- tertiary structure predictions depend on the TIM barrel and NTP hydrolase folds mour types, in order to maximise efficacy finding structures of remote homologues are highest, while those for the leucine and minimise toxicity. Thus, improvements and checking whether the prediction is zipper, zinc finger and transmembrane in cancer classifications have been central energetically viable [109]. Where biochem- helix-containing folds are lowest. This to advances in cancer treatment. Although ical or structural data are lacking, studies

Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 355 What is Bioinformatics?

Fig. 3 Above is a schematic outlining how can use bioinformatics to aid rational . MLH1 is a human gene encoding a mismatch repair protein (mmr) situated on the short arm of chromosome 3. Through linkage analysis and its similarity to mmr genes in mice, the gene has been implicated in nonpolyposis colorectal cancer. Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can be determined using translation software. Sequence search techniques can be used to find homologues in model organisms, and based on sequence similarity, it is possible to model the structure of the human protein on experimentally characterised structures. Finally, docking algorithms could design molecules that could bind the model structure, leading the way for biochemical assays to test their biological activity on the actual protein.

could be made in low-level organisms like simplifies the problem of understanding missing in humans. On a smaller scale, yeast and the results applied to homo- complex genomes by analysing simple structural differences between similar pro- logues in higher-level organisms such as organisms first and then applying the teins may be harnessed to design drug humans, where experiments are more same principles to more complicated ones – molecules that specifically bind to one demanding. this is one reason why early structural structure but not another. An equivalent approach is also employed genomics projects focused on Mycoplasma in genomics. Homologue-finding is exten- genitalium [75]. sively used to confirm coding regions in Ironically, the same idea can be applied newly sequenced genomes and functional in reverse. Potential drug targets are quickly 7.2 Rational Drug Design data is frequently transferred to annotate discovered by checking whether homo- One of the earliest medical applications of individual genes. On a larger scale, it also logues of essential microbial proteins are bioinformatics has been in aiding rational

Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 Method Inform Med 4/2001 For personal or educational use only. No other uses without permission. All rights reserved. 356 Luscombe, Greenbaum, Gerstein

drug design. Fig. 3 outlines the commonly protein folds are often related to specific this way, we are able to examine individual cited approach, taking the MLH1 gene pro- biochemical functions [52, 53], these find- systems in detail and also compare them duct as an example drug target. MLH1 is a ings highlight the diversity of metabolic with those that are related in order to un- human gene encoding a mismatch repair pathways in different organisms [36, 89]. cover common principles that apply across protein (mmr) situated on the short arm of As we discussed earlier, one of the most many systems and highlight unusual fea- chromosome 3 [110]. Through linkage ana- exciting new sources of genomic information tures that are unique to some. lysis and its similarity to mmr genes in mice, is the expression data. Combining expression the gene has been implicated in nonpoly- information with structural and functional Acknowledgments We thank Patrick McGarvey for comments on posis colorectal cancer [111]. Given the classifications of proteins we can ask the manuscript. nucleotide sequence, the probable amino whether the high occurrence of a protein acid sequence of the encoded protein can fold in a genome is indicative of high ex- be determined using translation software. pression levels [97]. Further genomic scale Sequence search techniques can then be data that we can consider in large-scale sur- References used to find homologues in model orga- veys include the subcellular localisations of 1. Reichhardt T. It’s sink or swim as a tidal wave of data approaches. Nature 1999. 399 (6736): nisms, and based on sequence similarity, it proteins and their interactions with each 517-20. is possible to model the structure of the other [114-116]. In conjunction with struc- 2. Benson DA, et al. GenBank. Nucleic Acids human protein on experimentally character- tural data, we can then begin to compile a Res 2000; 28 (1): 15-8. 3. Bairoch A, Apweiler R. The SWISS-PROT ized structures. Finally, docking algorithms map of all protein-protein interactions in protein and its supplement could design molecules that could bind the an organism. TrEMBL in 2000. Nucleic Acids Res 2000; 28 model structure, leading the way for bio- (1): 45-8. chemical assays to test their biological 4. Fleischmann RD, et al. Whole-genome ran- dom sequencing and assembly of Haemo- activity on the actual protein. philus influenzae Rd. Science 1995; 269 (5223): 496-512. 8. Conclusions 5. Drowning in data. The Economist 1999 (26 With the current deluge of data, compu- June 1999). 7.3 Large-scale Censuses 6. Bernstein FC, et al. The Protein Data Bank. A tational methods have become indispens- computer-based archival file for macromolec- Although databases can efficiently store all able to biological investigations. Originally ular structures. Eur J Biochem 1977; 80 (2): the information related to genomes, struc- developed for the analysis of biological se- 319-24. tures and expression datasets, it is useful to quences, bioinformatics now encompasses 7. Berman HM, et al. The Protein Data Bank. Nucleic Acids Res 2000; 28 (1): 235-42. condense all this information into under- a wide range of subject areas including 8. Pearson WR, Lipman DJ. Improved tools for standable trends and facts that users can , genomics and gene ex- biological sequence comparison. Proc Natl readily understand. Broad generalisations pression studies. In this review, we provided Acad Sci USA 1988; 85 (8): 2444-8. help identify interesting subject areas for an introduction and overview of the cur- 9. Altschul SF, et al. Gapped BLAST and PSI- BLAST: a new generation of protein database further detailed analysis, and place new ob- rent of field. In particular, we discus- search programs. Nucleic Acids Res 1997; 25 servations in a proper context. This enables sed the types of biological information and (17): 3389-402. one to see whether they are unusual in any databases that are commonly used, exa- 10. Fleischmann RD, et al. Whole-genome ran- dom sequencing and assembly of Haemo- way. mined some of the studies that are being con- philus influenzae Rd. Science 1995; 269 Through these large-scale censuses, one ducted – with reference to transcription (5223): 496-512. can address a number of evolutionary, bio- regulatory systems – and finally looked at 11. Lander ES, et al. Initial sequencing and analy- chemical and biophysical questions. For several practical applications of the field. sis of the . Nature 2001; 409: 860-921. example, are specific protein folds associat- Two principal approaches underpin all stud- 12. Venter JC, et al. The sequence of the human ed with certain phylogenetic groups? How ies in bioinformatics. First is that of com- genome. Science 2001; 291 (5507): 1304-51. common are different folds within partic- paring and grouping the data according to 13. Tatusova TA, Karsch-Mizrachi I, Ostell JA. ular organisms? And to what degree are biologically meaningful similarities and sec- Complete genomes in WWW Entrez: data representation and analysis. Bioinformatics folds shared between related organisms? ond, that of analysing one type of data to 1999; 15 (7-8): 536-43. Does this extent of sharing parallel meas- infer and understand the observations for 14. Eisen MB, Brown PO. DNA arrays for analy- ures of relatedness derived from traditional another type of data. These approaches are sis of gene expression. Methods Enzymol, evolutionary trees? Initial studies show reflected in the main aims of the field, 1999; 303: 179-205. 15. Cheung VG, et al. Making and reading micro- that the of folds differs greatly which are to understand and organise the arrays. Nat Genet 1999; 21 (1 Suppl): 15-9. between organisms and that the sharing of information associated with biological 16. Duggan DJ, et al. Expression profiling using folds between organisms does in fact follow molecules on a large scale. As a result, cDNA microarrays. Nat Genet 1999. 21 traditional phylogenetic classifications bioinformatics has not only provided great- (1 Suppl): 10-4. 17. Lipshutz RJ, et al. High density synthetic [37, 112, 113].We can also integrate data on er depth to biological investigations, but arrays. Nat Genet 1999; 21 (1): protein functions; given that the particular added the dimension of breadth as well. In 20-4.

Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved. 357 What is Bioinformatics?

18. Velculescu VE, et al. Serial Analysis of Gene 37. Gerstein M, Hegyi H. Comparing genomes in sequence, structure and function through Expression. Detailed Protocol 1999. terms of protein structure: surveys of a finite traditional and probabilistic scores. J Mol Biol 19. Holstege FC, et al. Dissecting the regulatory parts list. FEMS Microbiol Rev 1998; 22 (4): 2000; 297 (1): 233-49. circuitry of a eukaryotic genome. Cell 1998; 95 277-304. 56. Harrison SC. A structural taxonomy of DNA- (5): 717-28. 38. Skolnick J, Fetrow JS. From genes to protein binding domains. Nature 1991; 353 (6346): 20. Roth FP, Estep PW, Church GM. Finding structure and function: novel applications of 715-9. DNA regulatory motifs within unaligned non- computational approaches in the genomic era. 57. Luscombe NM, et al.An overview of the struc- coding sequences clustered by whole-genome Trends Biotech 2000; 18: 34-9. tures of protein-DNA complexes. Genome mRNA quantitation. Nat Biotech 1998; 16 39. Qian J, et al. PartsList: a web-based system for Biology 2000; 1 (1): 1-37. (10): 939-45. dynamically protein folds based on 58. Jones S, et al. Protein-DNA interactions: A 21. Jelinsky SA, Samson LD. Global response of disparate attributes, including whole-genome structural analysis. J Mol Biol 1999; 287 (5): Saccharomyces cerevisiae to an alkylating expression and interaction information. 877-96. agent. Proc Natl Acad Sci USA 1999; 96 (4): Nucleic Acids Res 2001; 29 (8): 1750-64. 59. Suzuki M, Gerstein M. Binding geometry of 1486-91. 40. Gerstein M. Integrative database analysis in alpha-helices that recognize DNA. Proteins structural genomics. Nat Struct Biol 2000; 7 22. Cho RJ, et al. A genome-wide transcriptional 1995; 23 (4): 525-35. Suppl: 960-3. 60. Luscombe NM, Thornton JM. Protein-DNA analysis of the mitotic cell cycle. Mol Cell 41. Etzold T, Ulyanov A, Argos P. SRS: informa- 1998; 2 (1): 65-73. interactions: a 3D analysis of alpha-helix- tion retrieval system for molecular biology data binding in the major groove. Manuscript in 23. DeRisi JL, Iyer VR, Brown PO. Exploring the banks. Methods Enzymol 1996; 266: 114-28. preparation. metabolic and genetic control of gene expres- 42. Schuler GD, et al. Entrez: molecular biology 61. Suzuki M, et al. DNA recognition code of sion on a genomic scale. Science 1997; 278 database and retrieval system. Methods (5338): 680-6. transcription factors. Protein Eng 1995; 8 (4): Enzymol 1996; 266: 141-62. 319-28. 24. Winzeler EA, et al. Functional characteriza- 43. Wade K. Searching Entrez PubMed and 62. Suzuki M. DNA recognition by a -sheet. tion of the S. cerevisiae genome by gene uncover on the internet. Aviat Space Environ Protein Eng 1995; 8 (1): 1-4. deletion and parallel analysis. Science 1999; Med 2000; 71 (5): 559. 63. Seeman NC, Rosenberg JM, Rich A. Sequence 285 (5429): 901-6. 44. Bertone P, et al. SPINE: An integrated specific recognition of double helical nucleic tracking database and datamining approach 25. Perou CM, et al. Molecular portraits of human acids by proteins. Proc Natl Acad Sci USA for high-throughput structural proteomics, breast tumours. Nature 2000; 406 (6797): 1976; 73: 804-8. enabling the determination of the properties 747-52. 64. Suzuki M. A framework for the DNA-protein of readily characterized proteins. Nucleic 26. Golub TR, et al. Molecular classification of recognition code of the probe helix in Acids Res. In Press. cancer: class discovery and class prediction by transcription factors: the chemical and stereo- 45. Zhang MQ. Promoter analysis of co-regulated gene expression monitoring. Science 1999; 286 chemical rules. Structure 1994; 2 (4): 317-26. (5439): 531-7. genes in the yeast genome. Comput Chem 1999; 23 (3-4): 233-50. 65. Mandel-Gutfreund Y, Schueler O, Margalit H. 27. Pedersendagger AG, et al. A DNA structural 46. Boguski MS. Biosequence exegesis. Science Comprehensive analysis of hydrogen bonds in atlas for Escherichia coli. J Mol Biol 2000; 299 1999; 286 (5439): 453-5. regulatory protein-DNA complexes: in search (4): 907-30. 47. Miller C, Gurd J, Brass A. A RAPID of common principles. J Mol Biol 1995; 253 28. Kanehisa M; Goto S. KEGG: kyoto encyclo- for sequence database comparisons: (2): 370-82. pedia of genes and genomes. Nucleic Acids application to the identification of vector 66. Luscombe NM, Laskowski RA, Thornton JM. Res 2000; 28 (1): 27-30. contamination in the EMBL databases. Bio- Protein-DNA interactions: a 3D analysis of 29. Jeffery CJ. Moonlighting proteins. TIBS 1999; informatics 1999; 15 (2): 111-21. amino acid-base interactions. Nucleic Acids 24 (1): 8-11. 48. Gonnet GH, Korostensky C, Brenner S. Res. In Press. 30. Chothia, C. Proteins. One thousand families measures of multiple sequence 67. Mandel-Gutfreund Y, Margalit H. Quantita- for the molecular . Nature 1992; 357 alignments. J Comput Biol 2000; 7 (1-2): tive parameters for amino acid-base inter- (6379): 543-4. 261-76. action: inplications for prediction of protein- DNA binding sites. Nucleic Acids Res 1998; 31. Orengo CA, Jones DT, Thornton JM. Protein 49. Orengo CA, Taylor WR. SSAP: sequential 26: 2306-12. superfamilies and domain superfolds. Nature structure alignment program for protein struc- 1994; 372 (6507): 631-4. ture comparison. Methods Enzymol 1996; 266: 68. Sternberg MJ, Gabb HA, Jackson RM. Predic- tive docking of protein-protein and protein- 32. Lesk AM, Chothia C. How different amino 617-35. 50. Orengo CA. CORA – topological fingerprints DNA complexes. Curr Opin Struct Biol 1998; acid sequences determine similar protein 8 (2): 250-6. structures: the structure and evolutionary for protein structural families. Protein Sci 1999; 8 (4): 699-715. 69. Aloy P, et al. Modelling repressor proteins dynamics of the globins. J Mol Biol 1980; 136 docking to DNA. Proteins 1998; 33 (4): (3): 225-70. 51. Russell RB, Sternberg MJ. Structure predic- tion. How good are we? Curr Biol 1995; 5 (5): 535-49. 33. Russell RB, et al. Recognition of analogous 488-90. 70. Dickerson RE. DNA-binding: the prevalence and homologous protein folds: analysis of 52. Martin AC, et al. Protein folds and functions. of kinkiness and the virtues of normality. sequence and structure conservation. J Mol Structure 1998; 6 (7): 875-84. Nucleic Acids Res 1998; 26 (8): 1906-26. Biol 1997; 269 (3): 423-39. 53. Hegyi H, Gerstein M. The relationship be- 71. Perez-Rueda E, Collado-Vides J. The reper- 34. Russell RB, et al. Recognition of analogous tween protein structure and function: a com- toire of DNA-binding transcriptional regula- and homologous protein folds – assessment of prehensive survey with application to the tors in Escherichia coli K-12. Nucleic Acids prediction success and associated alignment yeast genome. J Mol Biol 1999; 288 (1): 147-64. Res 2000; 28 (8): 1838-47. accuracy using empirical substitution matri- 54. Russell RB, Sasieni PD, Sternberg MJE. 72. Mewes HW, et al. MIPS: a database for geno- ces. Protein Eng 1998; 11 (1): 1-9. Supersites within superfolds. Binding site mes and protein sequences. Nucleic Acids Res 35. Fitch WM. Distinguishing homologous from similarity in the absence of homology. J Mol 2000; 28 (1): 37-40. analogous proteins. Syst Zool 1970; 19: 99-110. Biol 1998; 282 (4): 903-18. 73. Salgado H, et al. RegulonDB (version 3.0): 36. Tatusov RL, Koonin EV, Lipman DJ. A geno- 55. Wilson CA, Kreychman J, Gerstein M. As- transcriptional regulation and operon orga- mic perspective on protein families. Science sessing annotation transfer for genomics: nization in Escherichia coli K-12. Nucleic 1997; 278 (5338): 631-7. quantifying the relations between protein Acids Res 2000; 28 (1): 65-7.

Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 Method Inform Med 4/2001 For personal or educational use only. No other uses without permission. All rights reserved. 358 Luscombe, Greenbaum, Gerstein

74. Wingender E, et al. TRANSFAC: an integrated 89. Tatusov RL, et al. Metabolism and evolution protein-protein interactions. Manuscript in system for gene expression regulation. Nucleic of Haemophilus influenzae deduced from a preparation. Acids Res 2000; 28 (1): 316-9. whole-genome comparison with Escherichia 104. Marx J. DNA arrays reveal cancer in its many 75. Teichmann SA, Chothia C, Gerstein M. coli. Curr Biol 1996; 6 (3): 279-91. forms. Science 2000; 289 (5485): 1670-2. Advances in structural genomics. Curr Opin 90. Eisen MB, et al. and display 105. Ross DT, et al. Systematic variation in gene Struct Biol 1999; 9 (3): 390-9. of genome-wide expression patterns. Proc expression patterns in human cancer cell lines. 76. Aravind L, Koonin EV. DNA-binding pro- Natl Acad Sci USA 1998; 95 (25): 14863-8. Nat Genet 2000; 24 (3): 227-35. teins and evolution of transcription regulation 91. Wen X, et al. Large-scale temporal gene ex- 106. Perou CM, et al. Distinctive gene expression in the archaea. Nucleic Acids Res 1999; 27 pression mapping of central patterns in human mammary epithelial cells (23): 4658-70. development. Proc Natl Acad Sci USA 1998; and breast . Proc Natl Acad Sci USA 77. Huynen MA, van Nimwegen E.The frequency 95 (1): 334-9. 1999; 96 (16): 9212-7. distribution of gene family sizes in complete 92. Alon U, et al. Broad patterns of gene ex- 107. Livesey FJ, et al. analysis of the genomes. Mol Biol Evol 1998; 15 (5): 583-9. pression revealed by clustering analysis of transcriptional network controlled by the 78. Luscombe NM, Thornton JM. Protein-DNA tumor and normal colon tissues probed by photoreceptor homeobox gene Crx. Curr Biol interactions: an analysis of amino acid conser- oligonucleotide arrays. Proc Natl Acad Sci 2000; 10 (6): 301-10. vation and the effect on binding specificity. USA 1999; 96 (12): 6745-50. 108. Sali A, Blundell TL. Comparative protein Manuscript in preparation. 93. Tamayo P, et al. Interpreting patterns of gene modelling by satisfaction of spatial restraints. 79. Gelfand MS. Prediction of function in DNA expression with self-organizing maps: meth- Journal of Molecular Biology 1993; 234 (3): . J Comp Biol 1995; 1: ods and application to hematopoietic 779-815. 87-115. differentiation. Proc Natl Acad Sci USA 1999; 109. Jones DT, Taylor WR, Thornton JM. A new 80. Robison K, McGuire AM, Church GM. A 96 (6): 2907-12. approach to protein fold recognition. Nature comprehensive of DNA-binding site 94. Toronen P, et al. Analysis of gene expression 1992; 358 (6381): 86-9. matrices for 55 proteins applied to the data using self-organizing maps. FEBS Lett 110. Kok K, Naylor SL, Buys CH. Deletions of the complete Escherichia coli K-12 genome. J Mol 1999; 451 (2): 142-6. short arm of chromosome 3 in solid tumors Biol 1998; 284 (2): 241-54. 95. Tavazoie S, et al. Systematic determination of and the search for suppressor genes.Advances 81. Thieffry D, et al. Prediction of transcriptional genetic . Nat Genet 1999; in Cancer Research 1997; 71: 27-92. regulatory sites in the complete genome 22 (3): 281-5. 111. Syngal S, et al. Sensitivity and specificity of sequence of Escherichia coli K-12. Bioinfor- 96. Subrahmanyam YV, et al. RNA expression clinical criteria for hereditary non-polyposis matics 1998; 14 (5): 391-400. patterns change dramatically in human neu- colorectal cancer associated mutations in 82. Mironov AA. et al. Computer analysis of trophils exposed to bacteria. Blood 2001; 97 MSH2 and MLH1. Journal Med Genet 2000; transcription regulatory patterns in completely (8): 2457-68. 37 (9): 641-5. sequenced bacterial genomes. Nucleic Acids 97. Jansen R, Gerstein M. Analysis of the yeast 112. Lin J, Gerstein M. Whole-genome trees based Res 1999; 27 (14): 2981-9. with structural and functional on the occurrence of folds and orthologs: 83. Gelfand MS, Koonin EV, Mironov AA. categories: characterizing highly expressed pro- implications for comparing genomes on Prediction of transcription regulatory sites in teins. Nucleic Acids Res 2000; 28 (6): 1481-8. different levels. Genome Res 2000; 10 (6): Archaea by a comparative genomic approach. 98. Gerstein M, Jansen R. The current excitment 808-18. Nucleic Acids Res 2000; 28 (3): 695-705. in bioinformatics, analysis of whole-genome 113. Harrison PM, Echols N, Gerstein MB. Digging 84. McGuire AM, Hughes JD, Church GM. expression data: how does it relate to protein for dead genes: an analysis of the characteris- Conservation of DNA regulatory motifs and structure and function. Curr Opin Struct Biol tics of the pseudogene population in the discovery of new motifs in microbial genomes. 2000; 10: 574-84. Caenorhabditis elegans genome. Nucleic Genome Res 2000; 10 (6): 744-57. 99. Drawid A, Gerstein M. A Bayesian System Acids Res 2001; 29 (3): 818-30. 85. Bysani N, Daugherty JR, Cooper TG. Saturation Integrating Expression Data with Sequence 114. Uetz P, et al. A comprehensive analysis of mutagenesis of the UASNTR (GATAA) Patterns for Localizing Proteins: Compre- protein-protein interactions in Saccharomyces responsible for nitrogen catabolite repression- hensive Application to the Yeast Genome. J cerevisiae. Nature 2000; 403 (6770): 623-7. sensitive transcriptional activation of the Mol Biol 2000; 301:1059-75. 115. Ross-Macdonald P, et al. Transposon muta- allantoin pathway genes in Saccharomyces 100. Drawid A, Jansen R, Gerstein M. Genom- genesis for the analysis of protein production, cerevisiae. J Bacteriol 1991; 173 (16): 4977-82. wide analysis relating expression level with function, and localization. Methods Enzymol 86. Clarke ND, Berg JM. Zinc fingers in Caeno- protein subcellular localisation. Trends Genet 1999; 303: 512-32. rhabditis elegans: finding families and probing 2000; 16: 426-30. 116. Mewes HW, et al. MIPS: a database for pathways. Science 1998; 282 (5396): 2018-22. 101. Marcotte EM, et al. Detecting protein func- genomes and protein sequences. Nucleic 87. van Helden J, Andre B, Collado-Vides J. tion and protein-protein interactions from Acids Res 1999; 27 (1): 44-8. Extracting regulatory sites from the upstream genome sequences. Science 1999; 285 (5428): region of yeast genes by computational 751-3. Correspondence to: analysis of oligonucleotide frequencies. J Mol 102. Eisenberg D, et al. Protein function in the Mark Gerstein Biol 1998; 281 (5): 827-42. post-genomic era. Nature 2000; 405 (6788): Department of and Biochemistry 88. Salgado H, et al. Operons in Escherichia coli: 823-6. Yale University, 266 Whitney Avenue genomic analyses and predictions. Proc Natl 103. Jansen R, Greenbaum D, Gerstein M. Relat- PO Box 208114, New Haven CT 06520-8114, USA Acad Sci USA, 2000; 97 (12): 6652-7. ing whole-genome expression data with E-Mail: [email protected]

Method Inform Med 4/2001 Downloaded from www.methods-online.com on 2012-01-26 | IP: 129.174.206.135 For personal or educational use only. No other uses without permission. All rights reserved.