
SEQUENCE ANALYSIS: Contributions by Ulam to Molecular Genetics

by Walter B. Goad

Los Alamos Science Special Issue 1987

Lord Rayleigh once introduced a key idea with “It is tolerably obvious once remarked . . .” Yes, I think now, that is just how it was: Stan Ulam providing us with a steady stream of ideas and observations “tolerably obvious” only in retrospect, and then striking in the way they became integral to one’s tangible world of evolved and evolving forms and actions. Here I would like to sketch ideas developed during the sixties and seventies as an avalanche of detail, still growing, gathered about the way sequences of nucleotide bases in DNA encode instructions for development and propagation of living organisms. Stan showed us a very general way of thinking precisely about relationships among sequences, in particular, how to devise quantitative measures of relationship that, together with the computer, are of immense help in ferreting out meaning in the very great quantities of data now pouring forth.

I met Stan soon after arriving in Los Alamos at the end of 1950. I came ostensibly to finish a thesis begun at Duke under Lothar Nordheim, who had arrived several months earlier while I stayed in Durham awaiting security clearance. At last a telegram came from Carson Mark that read, “Your clearance not available.” An anxious telephone call established that the “not” had been garbled in transit from “now.” I was immediately swept up in the thermonuclear program, kept busy with the rest dissecting schemes and designs, and sometimes new phenomena, usually standing around a blackboard. Introducing the right factors, right at least in order of magnitude, was both vital and enjoyably competitive, laced with humor (esoteric, malicious, or plain) and an occasional flash of ego. The key, of course, was to discern the dominant phenomenon and to estimate its role in the matter at hand. One always had a feeling, almost visceral, as to how deeply an argument was rooted in the web of our knowledge of physics and mathematics. Stan habitually turned things to view from a variety of directions, much as he would see an algebraic structure topologically, and vice versa, and often supplied the connection that dispelled a gathering fog.

Around 1960 Jim Tuck invited Leonard Lerman, who was in the thick of the gathering revolution in biology and then at the University of Colorado, to visit Los Alamos. The “phage group” gathered loosely around Max Delbrück had established a mode of analysis that is still driving the biological revolution: Changes in a single DNA molecule are amplified by biological reproduction, usually in a microorganism, to the macroscopic level; there the consequences of those changes, however ramified, can be studied with the resources of physics and chemistry. The amplification is made possible by an immensely powerful, and growing, armory of molecular tools based on enzymes that carry out specific operations on specific DNA’s. As we grasped those ideas from Leonard and began to see the clarity and concreteness with which the mechanisms of life would emerge from such analysis, many of us were galvanized. We soon responded in a way typical of the culture, organizing a seminar, hungrily seeking out the many aspects of the subject. As I recall, the seminar continued through the sixties and early seventies with a varying membership but with Stan, Jim Tuck, George Bell, and me as regulars. We were frequently visited, and enormously encouraged, by Ted Puck, who has built a distinguished school of molecular and cell biology at the University of Colorado and who was, and is, exceedingly optimistic about the contribution systematic theory can make to biology.

A quick tour of systematic theory inevitably would start with Darwin’s grand synthesis. For physicists a key way point would be the publication in 1944 of Erwin Schrödinger’s short book What Is Life?, which equates that grand question with one congenial to physicists: What generates “negentropy,” the high degree of order that living systems are continually creating from the environment? Ever since, theorists of all kinds have looked to the formulation of some powerful physical theory of life. Short of that, what we do know is that living systems escape from the determinism of ordinary chemistry by interposing molecular adaptors to control molecular interactions. An example is provision by the complex protein structure of hemoglobin of an effective interaction between O2 molecules that is completely unrelated to their interactions as free molecules: Within a hemoglobin molecule up to four O2’s bind at distinct sites and thus effectively stick together. Furthermore, three or four stick more tightly than one or two. So, where there is much oxygen, four are tightly bound; where there is little, departure of one causes the others to more easily depart. Invoking the adaptor principle, Francis Crick predicted the existence of what are now called transfer RNA’s: small RNA molecules, a particular species of which adapts each three-base codon to molecules of a particular amino acid. A Zen-like consciousness of physical necessity (for the way in which electrons and nuclei, and thus atoms and molecules, do what they must) leads first to puzzlement at living systems and then to resolution: Molecular adaptors free the logic of higher levels of organization to adopt and express a logic of their own, exploiting, not circumventing, physical necessity.

Proteins and RNA’s provide an array of complex and highly specific adaptors, and their structures are encoded in sequences of bases in DNA. To a large extent the double-helical structure of DNA wraps the information-conveying part of the DNA into a protected interior and so in the main removes chemical constraints on the propagation and selection of sequences.

Working on DNA as a substrate, evolution has produced the marvelously complex web of living systems we see today. The working hypothesis, to which no exception is yet known, is that all of the information for propagation and development of individual organisms is encoded somehow in the sequence of four bases, adenine (A), thymine (T), guanine (G), and cytosine (C), along the DNA molecules (or, in some cases, RNA molecules) that compose its genome. The “somehow” includes the great triumphs of the past two decades, the present frontiers of molecular biology, and, undoubtedly, a great deal that we do not now even glimpse.

Less than a decade after Watson and Crick determined the structure of DNA, researchers at the laboratories of Nirenberg, Khorana, and Ochoa fully worked out the “genetic code” by which the base sequences of particular segments of DNA (genes) are translated into sequences of amino acids that fold up as particular proteins. For a few years many people felt that, in principle, DNA function was now completely understood. But in the mid seventies methods were worked out for determining sequences of bases in DNA, and it almost immediately emerged that not even the sequences that are translated into proteins are simple, continuous coding sequences. The last few years have seen the discovery of a great many distinct “signals” that control the replication of DNA and the expression of genes. However, it is not yet known how the action of those signals is coordinated, as it must be, to yield the patterns seen during reproduction and development. On the other hand, an outline is emerging of the organization within DNA of repetitive sequences, which make up a substantial fraction of the genome in higher organisms. That organization may or may not have signaling capabilities, but it is almost surely important in evolution. Perhaps most striking of all is the growing knowledge of phenomena, such as the mobility and duplication of pieces of DNA and its rearrangement, that introduce into the genome a degree of dynamism far beyond what classical genetics had led us to suspect.

Most of this was yet to come in the late sixties, when the amino-acid sequences of a few proteins were the only biological sequences known. However, it was already clear that the information on which a cell acts is encoded in sequences of bases, and the question of how to characterize relationships among sequences hundreds or thousands of bases long was
at hand. With his almost visceral feeling for representation of natural phenomena by general mathematical structures, Stan immediately framed the question in terms of defining a distance between sequences or, more generally, of defining a usable metric space of sequences (Ulam 1972). This he did by considering certain elementary base changes by which one sequence might be transformed into a second: replacement of one base by another and insertion or deletion of a base. (Combinations of these changes can result from errors in DNA replication, chromosomal crossover during meiosis, insertion of viral or other DNA, or the action of mutagens.) Obviously, one sequence can be transformed into another by more than one set of elementary changes, as shown in the accompanying figure. What Stan proposed was to compute a measure, a “size,” for each such set and to define as the distance between the sequences the minimum value of the measure.

In simplest form the measure is a sum of weights, one for each of the elementary changes that compose a transformation set. The set corresponding to the minimum measure (the distance between the two sequences) can be interpreted as the minimal mutational path by which one sequence could have evolved from the other. In 1974 Stan, with Bill Beyer, Temple Smith, and Myron Stein, applied the idea of distance to discerning evolutionary relationships among various species from variations in the amino-acid sequences of a protein they all share. Also in 1974 Peter Sellers, after hearing Stan talk at Rockefeller University, proved that such a distance can indeed satisfy the conditions of a metric, the most demanding of which is satisfaction of a triangle inequality. Without that, one’s sense of what it means for some among several sequences to be close and others distant would be quite unreliable.

Finding the distance between two sequences of length N by brute force, that is, by computing the measures for all the possible sets of elementary changes, requires on the order of N! computer operations. An algorithm for determining the distance in N² operations was discovered by the biologists Saul Needleman and Christian Wunsch in 1970 and independently by Sellers in 1974. Essentially, the algorithm proceeds by induction: The minimal set of changes needed to transform the first n bases of one sequence into the first m bases of the other is found by extending already computed minimal transformations of shorter subsequences; then n and m are increased, and so on until the ends of the sequences have been reached.

By the end of the 1970s, it was apparent that DNA sequencing would take off, and that investigators from all areas of biology, biomedicine, and bioagriculture would increasingly apply it to their particular research problems. It was also obvious that computer manipulation and analysis of sequences, much of it flowing from Stan’s idea for a metric, would play an increasingly large role in exploiting the information. Mike Waterman had joined Beyer and Smith in working on sequence analysis, and Minoru Kanehisa, a postdoc from Japan, and I made genetic sequences and their analysis our principal preoccupation from then on. In 1982 a consortium of federal agencies funded GenBank, the national genetic-sequence data bank. Los Alamos collects and organizes the sequence data, and Bolt Beranek and Newman Inc. distributes them to users. By the end of 1986, DNA sequences totaling about 15 million bases, from several hundred species, had been deposited in GenBank.

In the 1980s a series of problems in sequence comparison has been faced with varying degrees of success. One problem now solved concerns global versus local closeness (closeness, that is, in the sense of a distance between sequences). Often of interest are sequences that are close to each other although embedded in otherwise unrelated longer sequences. Peter Sellers first introduced the important distinction between local and overall closeness in 1980. A measure suited to the local problem (essentially the number of weighted changes per base, formulated so that the algorithm of Needleman, Wunsch, and Sellers can still be used) was introduced in slightly different forms by Kanehisa and me in 1982 and by Smith and Waterman in 1981.

Another class of problems stems from the sheer quantity of data: examining 15 million bases, even with an N² algorithm, requires hundreds of hours on a Cray. That problem has been reasonably successfully dealt with by prescreening sequences for likely candidates for significant relationships. A table of pointers to the locations of short subsequences (a simple hash table) is created and searched for short matching sequences. At this writing the method is being implemented with new hardware features of the Cray XMP. For a general review of sequence-comparison algorithms, see Goad 1986; for a review that emphasizes mathematical aspects, see Waterman 1984.

Devising a metric appropriate to the investigation at hand is probably not a problem that can be precisely posed, much less solved. A simple metric in which each elementary change is given the same weight may well suffice when the object of study is a virus under great pressure to preserve a small genome. But such a metric may show misleading relationships when applied to segments of DNA from a more complicated organism, as Fitch and Smith found in 1983 for mammalian hemoglobins. Some relationships may depend on similarities in three-dimensional structure of DNA that are preserved through a set of sequences, as may be the case for the elements that control initiation of expression of particular genes. To discover such relationships, one needs a measure of structural similarity, expressed of course in terms of sequences. That problem is just beginning to be faced. A good sense of the problem, and of the limitations of sequence comparison, is given by analogy to another idea of Stan’s. He proposed that perception, and thought itself, be considered in terms of a metric space. This frames the question: How is the distance between the visual fields corresponding to, say, two tables (which will vary greatly with circumstances) computed in our brains so that it is small compared with the distance between the visual fields corresponding to a table and a chair? Clearly the metric appropriate to a particular class of problems depends on the mechanisms one hopes to discover or illuminate.

Mathematical analysis has spread into nearly every corner of molecular genetics; its spread and development are still accelerating. In early 1986 the Department of Energy took the initiative in seriously exploring sequencing of the complete human genome, some 3 billion bases. In that project computerized management and analysis of information will play a key role.

Speaking of sequence analysis, GenBank, and all that, Stan once said, “I started all this.” Yes.

Further Reading

S. M. Ulam. 1972. Some ideas and prospects in biomathematics. Annual Review of Biophysics and Bioengineering 1: 277–291.

William A. Beyer, Myron L. Stein, Temple F. Smith, and Stanislaw M. Ulam. 1974. A molecular sequence metric and evolutionary trees. Mathematical Biosciences 19: 9–25.

Peter H. Sellers. 1974. On the theory and computation of evolutionary distances. SIAM Journal on Applied Mathematics 26: 787.

Saul B. Needleman and Christian D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443.

Peter H. Sellers. 1980. The theory and computation of evolutionary distances: Pattern recognition. Journal of Algorithms 1: 359–373.

Walter B. Goad and Minoru I. Kanehisa. 1982. Pattern recognition in nucleic acid sequences. I. A general method for finding local homologies and symmetries. Nucleic Acids Research 10: 247.

T. F. Smith and M. S. Waterman. 1981. Identification of common molecular subsequences. Journal of Molecular Biology 147: 195–197.

Walter B. Goad. 1986. Computational analysis of genetic sequences. Annual Review of Biophysics and Biophysical Chemistry 15: 79–95.

Michael S. Waterman. 1984. General methods of sequence comparison. Bulletin of Mathematical Biology 46: 473–500.

Walter M. Fitch and Temple F. Smith. 1983. Optimal sequence alignments. Proceedings of the National Academy of Sciences of the United States of America 80: 1382–1386.

Walter B. Goad received a B.S. in physics from Union College in 1945 and a Ph.D., also in physics, from Duke University in 1954. He has been a member of the staff of Los Alamos since 1950. In 1982 he received a Distinguished Performance Award from the Laboratory in recognition of his efforts at establishing GenBank, and in 1987 he was named a Fellow of the Laboratory. Until recently he directed the activities of GenBank, in which he continues to participate. His research focuses primarily on analysis of biological sequences. He is a Fellow of the American Physical Society and of the American Association for the Advancement of Science.

DISTANCE BETWEEN DNA SEQUENCES

Consider the two short DNA sequences GTTAAGGCGGGAA and GTTAGAGAGGAAA. As shown in (a), one of these can be transformed into the other by four base substitutions. If the “weight” assigned to a base substitution is x, then the “measure” of the set of changes in (a) is 4x. Alternatively, as shown in (b), one sequence can be transformed into the other by two base insertions, two base deletions, and two base substitutions. Since base insertions (deletions) occur less frequently than do base substitutions, the weight y assigned to an insertion (deletion) is different from that assigned to a substitution; in particular y is assigned a value greater than that of x. The measure of the set of changes illustrated in (b) is 2x + 4y, which is greater than 4x. The distance between the two given sequences is defined as the minimum of the measures calculated for all possible sets of elementary changes that transform one sequence into the other.
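
The distance defined in the sidebar on the two 13-base sequences, and the inductive N² procedure credited to Needleman, Wunsch, and Sellers, can be illustrated with a short dynamic program. What follows is a minimal sketch, not the code of any of those authors. The function name sequence_distance is hypothetical, and the weights x = 1 and y = 2 are illustrative assumptions; the article requires only that y be greater than x.

```python
def sequence_distance(a: str, b: str, x: float = 1.0, y: float = 2.0) -> float:
    """Minimum total weight of elementary changes that turn a into b.

    x is the weight of one base substitution; y is the weight of one
    insertion or deletion. Both defaults are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # d[i][j] holds the distance between the first i bases of a
    # and the first j bases of b.
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * y  # delete all i bases of a
    for j in range(1, m + 1):
        d[0][j] = j * y  # insert all j bases of b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend already computed minimal transformations of
            # shorter prefixes by one elementary change (or a match).
            substitute = d[i - 1][j - 1] + (0.0 if a[i - 1] == b[j - 1] else x)
            delete = d[i - 1][j] + y
            insert = d[i][j - 1] + y
            d[i][j] = min(substitute, delete, insert)
    return d[n][m]


if __name__ == "__main__":
    print(sequence_distance("GTTAAGGCGGGAA", "GTTAGAGAGGAAA"))
```

With these illustrative weights the two sidebar sequences come out at distance 4, that is 4x, the four-substitution path labeled (a). The table has (n + 1)(m + 1) entries and each is filled in constant time, which is where the N² operation count comes from.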

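
The prescreening step described in the article (a table of pointers to the locations of short subsequences, searched for short matches) might look like the following sketch. The function names build_kmer_index and shared_kmers, the word length k = 4, and the use of a Python dict as the hash table are assumptions made for illustration; the article does not describe the Cray implementation at this level of detail.

```python
from collections import defaultdict


def build_kmer_index(seq: str, k: int) -> dict:
    """Map every length-k subsequence of seq to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def shared_kmers(query: str, index: dict, k: int) -> list:
    """Return (query_position, indexed_position) pairs of exact k-base matches."""
    hits = []
    for qpos in range(len(query) - k + 1):
        # dict.get does not create new entries, so the index is left unchanged.
        for dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, dpos))
    return hits


if __name__ == "__main__":
    index = build_kmer_index("GTTAAGGCGGGAA", 4)
    print(shared_kmers("GTTAGAGAGGAAA", index, 4))
```

Only sequence pairs that share such short exact matches would then be passed on to the much costlier N² distance computation, which is the sense in which the table acts as a prescreen.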