1 The Challenges Facing Genomic Informatics

Temple F. Smith

What are these areas of intense research labeled bioinformatics and functional genomics? If we take literally much of the recently published "news and views," it seems that the often stated claim that the last century was the century of physics, whereas the twenty-first will be the century of biology, rests significantly on these new research areas. We might therefore ask: What is new about them? After all, computational or mathematical biology has been around for a long time. Surely much of bioinformatics, particularly that associated with evolution and genetic analyses, does not appear very new. In fact, the related work of researchers like R. A. Fisher, J. B. S. Haldane, and Sewall Wright dates nearly to the beginning of the 1900s. The modern analytical approaches to genetics, evolution, and ecology rest directly on their and similar work. Even genetic mapping easily dates to the 1930s, with the work of T. S. Painter and his students of Drosophila (still earlier if you include T. H. Morgan's work on X-linked markers in the fly). Thus a short historical review might provide a useful perspective on this anticipated century of biology and allow us to view the future from a firmer foundation.

First of all, it should be helpful to recognize that it was very early in the so-called century of physics that modern biology began, with a paper read by Hermann Muller at a 1921 meeting in Toronto. Muller, a student of Morgan's, stated that although of submicroscopic size, the gene was clearly a physical particle of complex structure, not just a working construct! Muller noted that the gene is distinct from its product, and that it is normally duplicated unchanged, but once mutated, the new form is in turn duplicated faithfully. The next 30 years, from the early 1920s to the early 1950s, were some of the most revolutionary in the science of biology.
In my original field of physics, the great insights of relativity and quantum mechanics were already being taught to undergraduates; in biology, the new one-gene-one-enzyme concept was leading researchers to new understandings in biochemistry, genetics, and evolution. The detailed physical nature of the gene and its product was soon obtained. By midcentury, the unique linear nature of the protein and the gene was essentially known from the work of Frederick Sanger (Sanger 1949) and Erwin Chargaff (Chargaff 1950). All that remained was John Kendrew's structural analysis of sperm whale myoglobin (Kendrew 1958) and James Watson and Francis Crick's double helical model for DNA (Watson and Crick 1953). Thus by the mid-1950s, we had seen the physical gene and one of its products, and the motivation was in place to find them all. Of course, the genetic code needed to be determined and restriction enzymes discovered, but modern molecular biology was on its way.

We might say that much of the last century was the century of applied physics, and the last half of the century was applied molecular biochemistry, generally called molecular biology! So what happened to create bioinformatics and functional genomics? It was, of course, the wealth of sequence data, first protein and then genomic. Both are based on some very clever chemistry and the late 1940s molecular sizing by chromatography. Frederick Sanger's sequencing of insulin (Sanger 1956) and Wally Gilbert and Allan Maxam's sequence of the lactose operator from E. coli (Maxam and Gilbert 1977) showed that it could be done. Thus, in principle, all genetic sequences, including the human genome, were determinable; and, if determinable, they were surely able to be engineered, suggesting that the economics and even the ethics of biological research were about to change. The revolution was already visible to some by the 1970s.

The science or discipline of analyzing and organizing sequence data defines for many the bioinformatics realm. It had two somewhat independent beginnings. The older was the attempt to relate amino acid sequences to the three-dimensional structure and function of proteins. The primary focus was the understanding of the sequence's encoding of structure and, in turn, the structure's encoding of biochemical function. Beginning with the early work of Sanger and Kendrew, progress continued such that, by the mid-1960s, Margaret Dayhoff (Dayhoff and Eck 1966) had formally created the first major database of protein sequences. By 1973, we had the start of the database of X-ray crystallographically determined protein atomic coordinates under Tom Koetzle at the Brookhaven National Laboratory.

From early on, Dayhoff seemed to understand that there was other very fundamental information available in sequence data, as shown in her many phylogenetic trees.
This was articulated most clearly by Emile Zuckerkandl and Linus Pauling as early as 1965 (Zuckerkandl and Pauling 1965): within the sequences lay their evolutionary history. There was a second fossil record to be deciphered. It was that recognition that forms the true second beginning of what is so often thought of as the heart of bioinformatics, comparative sequence analyses. The seminal paper was by Walter Fitch and Emanuel Margoliash, in which they constructed a phylogenetic tree from a set of cytochrome c sequences (Fitch and Margoliash 1967). With the advent of more formal analysis methods (Needleman and Wunsch 1970; Smith and Waterman 1981; Wilbur and Lipman 1983) and larger datasets (GenBank was started at Los Alamos in 1982), the marriage between sequence analysis and computer science emerged as naturally as it had with the analysis of tens of thousands of diffraction spots in protein structure determination a decade before. As if proof was needed that comparative sequence analysis was of more than academic interest, Russell Doolittle (Doolittle et al. 1983) demonstrated that we could explain the oncogene v-sis's properties as an aberrant growth factor by assuming that related functions are carried out by sequence-similar proteins.

By 1990, nearly all of the comparative sequence analysis methods had been refined and applied many times. The result was a wealth of new functional and evolutionary hypotheses. Many of these led directly to new insights and experimental validation. This in turn made the 40 years between 1950 and 1990 the years that brought reality to the dreams seeded in those wondrous previous 40 years of genetics and biochemistry. It is interesting to note that during this same 40 years, computers developed from the wartime monsters through the university mainframes and the lab bench workstation to the powerful personal computer. In fact, Doolittle's early successful comparative analysis was done on one of the first personal computers, an Apple II. The link between computers and molecular biology is further seen in the justification of initially placing GenBank at the Los Alamos National Laboratory rather than at an academic institution. This was due in large part to the laboratory's then immense computer resources, which in the year 2000 can be found in a top-of-the-line laptop!

What was new was the data and the anticipated amount of it. Note that the human genome project was being formally initiated by 1990. Within the century's final decade, the genomes of more than two dozen microorganisms, along with yeast and C. elegans, the worm, would be completely sequenced. By the summer of the new century's very first year, the fruit fly genome would be sequenced, as well as 85 percent of the entire human genome. Although such sequencing was envisioned as possible by the late 1970s, no one foresaw the wealth of full genomic sequences that would be available at the start of the new millennium.

What challenges remained at the informatics level? Major database problems and some additional algorithm development will still surely come about.
And, even though we still cannot predict a protein's structure or function directly from its sequence, de novo, straightforward sequence comparisons with such a wealth of data can generally infer both function and structure from the identification of close homologues previously analyzed. Yet it has slowly become obvious that there are at least four major problems here: first, most "previously analyzed" sequences obtained their annotation via sequence comparative inheritance, and not by any direct experimentation; second, many proteins carry out very different cellular roles even when their biochemical functions are similar; third, there are even proteins that have evolved to carry out functions distinct from those carried out by their close homologues (Jeffery 1999); and, finally, many proteins are multidomained and thus multifunctional, but identified by only one function. When we compound these facts with the lack of any universal vocabulary throughout much of molecular biology, there is great confusion, even with interpreting standard sequence similarity analysis. Even more to the point of the future of bioinformatics is knowing that the function of a protein or even the role in the cell played by that function is only the starting point for asking real biological questions.

Asking questions beyond what biochemistry is encoded in a single protein or protein domain is still challenging. However, asking what role that biochemistry plays in the cell, which many refer to as functional genomics, is clearly even more challenging from the computational side. The analysis of genes and gene networks and their regulation may be even more complicated. Here we have to deal with alternatively spliced gene products with potentially distinct functions and highly degenerate short DNA regulatory words. So far, sequence comparative methods have had limited success in these cases.

What will be the future role of computation in biology in the first few decades of this century? Surely many of the traditional comparative sequence analyses, including homologous extension protein structure modeling and DNA signal recognition, will continue to play major roles. As already demonstrated, standard statistical and clustering methods will be used on gene expression data. It is obvious, however, that the challenge for the biological sciences is to begin to understand how the genome parts list encodes cellular function: not the function of the individual parts, but that of the whole cell and organism. This, of course, has been the motivation underlying most of molecular biology over the last 20 years. The difference now is that we have the parts lists for multiple cellular organisms. These are complete parts lists rather than just a couple of genes identified by their mutational or other effects on a single pathway or cellular function. The past logic is now reversible: rather than starting with a pathway or physiological function, we can start with the parts list either to generate testable models or to carry out large-scale exploratory experimental tests.
The latter, of course, is the logic behind the mRNA expression chips, whereas the former leads to experiments to test new regulatory network or metabolic pathway models. The design, analysis, and refinement of such complex models will surely require new computational approaches.

The analysis of the RNA expression data requires the identification of various correlations between individual gene expression profiles and between those profiles and different cellular environments or types. These, in turn, require some model concepts as to how the behavior of one gene may affect that of others, both temporally and spatially. Some straightforward analyses of RNA expression data have identified many differences in gene expression in cancer versus noncancer cells (Golub et al. 1999) and for different growth conditions (Eisen et al. 1998). Such data have also been used in an attempt to identify common or shared regulatory signals in yeast (Hughes et al. 2000).
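
As one concrete illustration of the correlation analysis described above, the following minimal Python sketch computes Pearson correlation coefficients between gene expression profiles and lists the strongly correlated or anticorrelated pairs. It is a sketch under stated assumptions, not the method of any of the studies cited: the gene names, the expression values, the function names `pearson` and `correlated_pairs`, and the 0.9 threshold are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations of each measurement from its profile mean.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / sqrt(var_x * var_y)


def correlated_pairs(profiles, threshold=0.9):
    """Return (gene_a, gene_b, r) for every gene pair with |r| >= threshold."""
    genes = sorted(profiles)
    pairs = []
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            r = pearson(profiles[gene_a], profiles[gene_b])
            if abs(r) >= threshold:
                pairs.append((gene_a, gene_b, r))
    return pairs


# Hypothetical expression levels across five conditions (invented numbers).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises together with geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],   # no strong relationship to geneA
}

if __name__ == "__main__":
    for gene_a, gene_b, r in correlated_pairs(profiles):
        print(f"{gene_a} vs {gene_b}: r = {r:+.2f}")
```

With these invented values, geneA, geneB, and geneC form strongly correlated or anticorrelated pairs, whereas geneD pairs with none of them. Real datasets involve thousands of genes, so noise and the sheer number of pairwise comparisons would also need to be taken into account.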

Yet expression data's full potential is not close to being realized. In particular, when gene expression data can be fully coupled to protein expression, modification, and activity, the very complex genetic networks should begin to come into view. In higher animals, for example, proteins can be complex products of genes through alternate exon splicing. We can anticipate that mRNA-based microarray expression analysis will be replaced by exon expression analysis. Here again, modeling will surely play a critical role, and the type of computational biology envisioned by population and evolutionary geneticists such as Wright may finally become a reality. This, the extraction of how the organism's range of behavior or environment responses is encoded in the genome, is the ultimate aim of functional genomics.

Many people in what is now called bioinformatics will recall that much of the wondrous mathematical modeling and analysis associated with population and evolutionary biology was at best suspect and at worst ignored by molecular biologists over the last 30 years or so. At the beginning of the new millennium, perhaps those thinkers should be viewed as being ahead of their time. Note, it was not that serious mathematics is not necessary to understand anything as complex as interacting populations, but only that the early biomodelers did not have the needed data! Today we are rapidly approaching the point where we can measure not only a population's genetic variation, but nearly all the genes that might be associated with a particular environmental response. It is the data that has created the latest aspect of the biological revolution.
Just imagine what we will be able to do with a dataset composed of distributions of genetic variation among different subpopulations of fruit fly living in distinctly different environments, or what we might learn about our own evolution by having access to the full range of human and other primate genetic variation for all 40,000 to 100,000 human genes?

It is perhaps best for those anticipating the challenges of bioinformatics and computational genomics to think about how biology is likely to be taught by the end of the second decade of this century. Will the complex mammalian immune system be presented as a logical evolutionary adaptation of an early system for cell-cell communication that developed into a cell-cell recognition system, and then self-nonself recognition? Will it become obvious that the use by yeast of the G-protein coupled receptors to recognize mating types would become one of the main components of nearly all higher organisms' sensor systems? Like physics, where general rules and laws are taught at the start and the details are left for the computer, biology will surely be presented to future generations of students as a set of basic systems that have been duplicated and adapted to a very wide range of cellular and organismic functions following basic evolutionary principles constrained by Earth's geological history.

References

Chargaff, E. (1950). Chemical specificity of the nucleic acids and mechanisms of their enzymatic degradation. Experientia 6: 201-208.
Dayhoff, M. O., and Eck, R. V. (1966). Atlas of Protein Sequence and Structure. Silver Spring, MD: NBRF Press.
Doolittle, R. F., Hunkapiller, M. W., Hood, L. E., Devare, S. G., Robbins, K. C., Aaronson, S. A., and Antoniades, H. N. (1983). Simian sarcoma virus onc gene, v-sis, is derived from the gene (or genes) encoding a platelet-derived growth factor. Science 221(4607): 275-277.
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95(25): 14863-14868.
Fitch, W. M., and Margoliash, E. (1967). Construction of phylogenetic trees. A method based on mutation distances as estimated from cytochrome c sequences is of general applicability. Science 155: 279-284.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286(5439): 531-537.
Hughes, J. D., Estep, P. W., Tavazoie, S., and Church, G. M. (2000). Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296(5): 1205-1214.
Jeffery, C. J. (1999). Moonlighting proteins. Trends Biochem. Sci. 24(1): 8-11.
Kendrew, J. C. (1958). The three-dimensional structure of a myoglobin. Nature 181: 662-666.
Maxam, A. M., and Gilbert, W. (1977). A new method for sequencing DNA. Proc. Natl. Acad. Sci. USA 74(2): 560-564.
Needleman, S. B., and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453.
Sanger, F. (1949). Cold Spring Harbor Symposia on Quantitative Biology 14: 153-160.
Sanger, F. (1956). The structure of insulin. In Currents in Biochemical Research, Green, D. E. ed. New York: Interscience.
Smith, T. F., and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147: 195-197.
Watson, J. D., and Crick, F. H. C. (1953). Genetic implications of the structure of deoxyribonucleic acid. Nature 171: 964-967.
Wilbur, W. J., and Lipman, D. J. (1983). Rapid similarity searches of nucleic acid and protein data banks. Proc. Natl. Acad. Sci. USA 80(3): 726-730.
Zuckerkandl, E., and Pauling, L. C. (1965). Molecules as documents of evolutionary history. J. Theoret. Biol. 8: 357-358.