1 The Challenges Facing Genomic Informatics
Temple F. Smith

What are these areas of intense research labeled bioinformatics and functional genomics? If we take literally much of the recently published "news and views," it seems that the often-stated claim that the last century was the century of physics, whereas the twenty-first will be the century of biology, rests significantly on these new research areas. We might therefore ask: What is new about them? After all, computational or mathematical biology has been around for a long time. Surely much of bioinformatics, particularly that associated with evolution and genetic analyses, does not appear very new. In fact, the related work of researchers like R. A. Fisher, J. B. S. Haldane, and Sewall Wright dates nearly to the beginning of the 1900s. The modern analytical approaches to genetics, evolution, and ecology rest directly on their work and on similar work by others. Even genetic mapping easily dates to the 1930s, with the work of T. S. Painter and his students of Drosophila (still earlier if you include T. H. Morgan's work on X-linked markers in the fly). Thus a short historical review might provide a useful perspective on this anticipated century of biology and allow us to view the future from a firmer foundation.

First of all, it should be helpful to recognize that it was very early in the so-called century of physics that modern biology began, with a paper read by Hermann Muller at a 1921 meeting in Toronto. Muller, a student of Morgan's, stated that although of submicroscopic size, the gene was clearly a physical particle of complex structure, not just a working construct! Muller noted that the gene is distinct from its product, and that it is normally duplicated unchanged, but once mutated, the new form is in turn duplicated faithfully.

The next 30 years, from the early 1920s to the early 1950s, were some of the most revolutionary in the science of biology.
In my original field of physics, the great insights of relativity and quantum mechanics were already being taught to undergraduates; in biology, the new one-gene-one-enzyme concept was leading researchers to new understandings in biochemistry, genetics, and evolution. The detailed physical nature of the gene and its product was soon obtained. By midcentury, the unique linear nature of the protein and the gene was essentially known from the work of Frederick Sanger (Sanger 1949) and Erwin Chargaff (Chargaff 1950). All that remained was John Kendrew's structural analysis of sperm whale myoglobin (Kendrew 1958) and James Watson and Francis Crick's double helical model for DNA (Watson and Crick 1953). Thus by the mid-1950s, we had seen the physical gene and one of its products, and the motivation was in place to find them all. Of course, the genetic code needed to be determined and restriction enzymes discovered, but the beginning of modern molecular biology was on its way.

We might say that much of the last century was the century of applied physics, and the last half of the century was applied molecular biochemistry, generally called molecular biology!

So what happened to create bioinformatics and functional genomics? It was, of course, the wealth of sequence data, first protein and then genomic. Both are based on some very clever chemistry and the late 1940s molecular sizing by chromatography. Frederick Sanger's sequencing of insulin (Sanger 1956) and Wally Gilbert and Allan Maxam's sequence of the lactose operator from E. coli (Maxam and Gilbert 1977) showed that it could be done. Thus, in principle, all genetic sequences, including the human genome, were determinable; and, if determinable, they were surely able to be engineered, suggesting that the economics and even the ethics of biological research were about to change. The revolution was already visible to some by the 1970s.
The science or discipline of analyzing and organizing sequence data defines for many the bioinformatics realm. It had two somewhat independent beginnings. The older was the attempt to relate amino acid sequences to the three-dimensional structure and function of proteins. The primary focus was the understanding of the sequence's encoding of structure and, in turn, the structure's encoding of biochemical function. Beginning with the early work of Sanger and Kendrew, progress continued such that, by the mid-1960s, Margaret Dayhoff (Dayhoff and Eck 1966) had formally created the first major database of protein sequences. By 1973, we had the start of the database of X-ray crystallographically determined protein atomic coordinates under Tom Koetzle at the Brookhaven National Laboratory.

From early on, Dayhoff seemed to understand that there was other very fundamental information available in sequence data, as shown in her many phylogenetic trees. This was articulated most clearly by Emile Zuckerkandl and Linus Pauling as early as 1965 (Zuckerkandl and Pauling 1965): within the sequences lay their evolutionary history. There was a second fossil record to be deciphered.

It was that recognition that forms the true second beginning of what is so often thought of as the heart of bioinformatics, comparative sequence analyses. The seminal paper was by Walter Fitch and Emanuel Margoliash, in which they constructed a phylogenetic tree from a set of cytochrome sequences (Fitch and Margoliash 1967). With the advent of more formal analysis methods (Needleman and Wunsch 1970; Smith and Waterman 1981; Wilbur and Lipman 1983) and larger datasets (GenBank was started at Los Alamos in 1982), the marriage between sequence analysis and computer science emerged as naturally as it had with the analysis of tens of thousands of diffraction spots in protein structure determination a decade before.
As if proof was needed that comparative sequence analysis was of more than academic interest, Russell Doolittle (Doolittle et al. 1983) demonstrated that we could explain the oncogene v-sis's properties as an aberrant growth factor by assuming that related functions are carried out by sequence-similar proteins.

By 1990, nearly all of the comparative sequence analysis methods had been refined and applied many times. The result was a wealth of new functional and evolutionary hypotheses. Many of these led directly to new insights and experimental validation. This in turn made the 40 years between 1950 and 1990 the years that brought reality to the dreams seeded in those wondrous previous 40 years of genetics and biochemistry.

It is interesting to note that during this same 40 years, computers developed from the wartime monsters through the university mainframes and the lab bench workstation to the powerful personal computer. In fact, Doolittle's early successful comparative analysis was done on one of the first personal computers, an Apple II. The link between computers and molecular biology is further seen in the justification for initially placing GenBank at the Los Alamos National Laboratory rather than at an academic institution. This was due in large part to the laboratory's then immense computer resources, which in the year 2000 can be found in a top-of-the-line laptop!

What was new to computational biology was the data and the anticipated amount of it. Note that the human genome project was being formally initiated by 1990. Within the century's final decade, the genomes of more than two dozen microorganisms, along with yeast and C. elegans, the worm, would be completely sequenced. By the summer of the new century's very first year, the fruit fly genome would be sequenced, as well as 85 percent of the entire human genome.
Although such sequencing was envisioned as possible by the late 1970s, no one foresaw the wealth of full genomic sequences that would be available at the start of the new millennium.

What challenges remained at the informatics level? Major database problems and some additional algorithm development will still surely come about. And, even though we still cannot predict a protein's structure or function directly from its sequence, de novo, straightforward sequence comparisons with such a wealth of data can generally infer both function and structure from the identification of close homologues previously analyzed. Yet it has slowly become obvious that there are at least four major problems here: first, most "previously analyzed" sequences obtained their annotation via sequence comparative inheritance, and not by any direct experimentation; second, many proteins carry out very different cellular roles even when their biochemical functions are similar; third, there are even proteins that have evolved to carry out functions distinct from those carried out by their close homologues (Jeffery 1999); and, finally, many proteins are multidomained and thus multifunctional, but identified by only one function. When we compound these facts with the lack of any universal vocabulary throughout much of molecular biology, there is great confusion, even with interpreting standard sequence similarity analysis. Even more to the point of the future of bioinformatics is knowing that the function of a protein or even the role in the cell played by that function is only the starting point for asking real biological questions.

Asking questions beyond what biochemistry is encoded in a single protein or protein domain is still challenging. However, asking what role biochemistry plays in the life of the cell, which many refer to as functional genomics, is clearly even more challenging from the computational side.
The analysis of genes and gene networks and their regulation may be even more complicated. Here we have to deal with alternatively spliced gene products with potentially distinct functions and with highly degenerate short DNA regulatory words.
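To make the phrase "highly degenerate short DNA regulatory words" concrete, the following is a minimal sketch in Python, not a method taken from this chapter. It assumes a regulatory word can be written with the standard IUPAC nucleotide ambiguity codes. The function names, the example motif CANNTG, and the example sequence are illustrative assumptions only.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate a degenerate motif (hypothetical helper) into a regex string."""
    return "".join(IUPAC[base] for base in motif.upper())


def n_words(motif):
    """Count how many distinct concrete DNA words the degenerate motif covers."""
    count = 1
    for base in motif.upper():
        count *= len(IUPAC[base].strip("[]"))
    return count


def find_sites(motif, sequence):
    """Return (position, matched word) for every match, including overlapping ones."""
    # A lookahead with a capture group reports overlapping occurrences.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    motif = "CANNTG"                  # illustrative degenerate word
    sequence = "GGCACGTGTTCAGCTGAA"   # made-up sequence for demonstration
    print(n_words(motif))             # number of concrete words covered
    print(find_sites(motif, sequence))
```

The point of the sketch is the difficulty the text alludes to: a six-letter word with two fully degenerate positions already stands for 16 different concrete sequences, so a short motif of this kind is expected to match by chance many times in a genome-sized sequence. Pattern matching alone therefore cannot separate functional regulatory sites from background.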