
PERSPECTIVES

NATURE REVIEWS | GENETICS | VOLUME 1 | DECEMBER 2000 | © 2000 Macmillan Magazines Ltd

nomic environment that includes various social values, research practices and business pressures. We are mindful that, in some situations, modifying patent law may reduce one problem (such as permitting more competition), while magnifying others (such as reducing incentives to conduct research and development). Nevertheless, although care must be taken, this debate needs to progress to ensure that patenting practices, as applied to genetic material, fulfil the ultimate objective of encouraging the development of genetic technologies into products for the public’s good.

Update – note added in proof
The recent announcement that scientists will share a patent over a disease-related gene⁴³ with a patient advocacy group, who provided the researchers with blood and tissue samples⁴⁴, is a positive sign that researchers take seriously their moral responsibility to donors. Such steps are in agreement with recent policy statements issued by HUGO⁴⁵. Binding legal measures would help to ensure that researchers and companies who comply with this type of ethical norm do not face unfair competition from those who do not.

Timothy Caulfield is at the Health Law Institute, University of Alberta, Edmonton, Alberta, T6G 2H5, Canada. E. Richard Gold is at the Faculty of Law, University of Western Ontario, London Ontario, N6A 3K7, Canada. Mildred K. Cho is at the Center for Biomedical Ethics, Stanford University, Stanford, California 94304, USA. Correspondence to: [email protected]

Links
DATABASE LINKS: BRCA1 | APOE
FURTHER INFORMATION: American College of Medical Genetics | Unesco’s 1997 Universal Declaration on the Human Genome and Human Rights | European Patent Office | United States Patent Office | Canadian Patent Office | Japanese Patent Office | patent on the ‘onco-mouse’ | European Patent Convention | Incyte Pharmaceuticals | United States Supreme court case of Diamond versus Chakrabarty | United States Patent Office’s recent interim guidelines

1. Rifkin, J. The Biotech Century (Penguin Putnam, New York, 1998).
2. American College of Medical Genetics, Position Statement on Gene Patents and Accessibility of Gene Testing (1999). www.faseb.org/genetics/acmg/pol-34.htm
3. Sarma, L. Biopiracy: Twentieth century imperialism in the form of international agreements. Temple International and Comparative Law Journal 13, 107–136 (1999).
4. Thomas, S. et al. Ownership of the human genome. Nature 380, 387–388 (1996).
5. Thomas, S. in The Commercialization of Genetic Research: Ethical, Legal and Policy Issues (eds Caulfield, T. & Williams-Jones, B.) 55–62 (Kluwer Academic/Plenum Publishing, New York, 1999).
6. Nau, J. Y. Brevetabilité des gènes humains: le comité d’éthique en désaccord avec la directive européenne. Le Monde 15 June (2000).
7. Kolata, G. Special Report: Who owns your genes? New York Times 15 May (2000).
8. Ramirez, A. School given patent to clone humans. National Post 16 May (2000).
9. Sagar, A., Daemmrich, A. & Ashiya, M. The tragedy of commoners: biotechnology and its publics. Nature Biotechnol. 18, 2–4 (2000).
10. Angell, M. Is academic medicine for sale? N. Engl. J. Med. 20, 1516–1518 (2000).
11. Pottagem, A. The inscription of life in law: gene, patents, and bio-politics. The Modern Law Review 61, 740–765 (1998).
12. Gold, E. R. Body Parts: Property Rights and the Ownership of Human Biological Materials (Georgetown Univ. Press, Washington DC, 1996).
13. Caulfield, T. & Gold, E. R. Whistling in the wind: reframing the genetic patent debate. Forum for Applied Research and Public Policy 15, 75–79 (2000).
14. Ernst and Young’s Fourth Report on the Canadian Biotechnology Industry. Can. Biotechnol. ‘97: Coming of Age (Ernst and Young, 1997).
15. President and Fellows of Harvard v. Commissioner of Patents (August 3, 2000) No. A-334–398 (Fed. Crt of Appeals).
16. Nottingham, S. Eat Your Genes (St. Martin’s, New York, 1999).
17. Roberts, T. Why not patent plants? Patent World 113, 14–16 (1999).
18. Schehr, R. & Fox, J. Human genome bombshell. Nature Biotechnol. 18, 365 (2000).
19. Marcus, A. Owning a gene: patent pending. Nature Med. 2, 728–729 (1996).
20. Nelkin, D. & Andrews, L. Homo economicus: Commercialization of body tissue in the age of biotechnology. Hastings Center Report 28, 30–39 (1998).
21. Heller, M. & Eisenberg, R. Can patents deter innovation? The anticommons in biomedical research. Science 280, 698–701 (1998).
22. Knoppers, B. M. Status, sale and patenting of human genetic material: an international survey. Nature Genet. 22, 23–26 (1999).
23. Bunk, S. Researchers feel threatened by disease gene patents. The Scientist 13, 7 (1999).
24. Academy of Clinical Laboratory Physicians and Scientists. ACLPS Resolution: Exclusive Licenses for Diagnostic Tests Approved by the ACLPS Executive Council 06/03/99 (1999). http://depts.washington.edu/labweb.aclps/license/htm
25. Cho, M. K. in Preparing for the Millennium: Laboratory Medicine in the 21st Century, December 4–5, 1998, 2nd edn 47–53 (AACC, Washington DC, 1998).
26. Caulfield, T. & Gold, E. R. Genetic testing, ethical concerns, and the role of patent law. Clin. Genet. 57, 370–375 (2000).
27. Bruzzone, L. The research exemption: a proposal. Am. Intell. Prop. Law Assoc. QL 21, 52 (1993).
28. Parker, D. Patent infringement exemptions for life science research. Houston J. Intl Law 16, 615 (1994).
29. Gold, E. R. in Commercialization of Genetic Research: Ethical, Legal and Policy Issues (eds Caulfield, T. & Williams–Jones, B.) 63–78 (Plenum, New York, 1999).
30. Schissel, A., Merz, J. F. & Cho, M. K. Survey confirms fear about licensing of genetic tests. Nature 402, 118 (1999).
31. Blumenthal, D. et al. Withholding Research Results in Academic Life Science: Evidence From a National Survey of Faculty. J. Am. Med. Assoc. 277, 1224 (1997).
32. Caulfield, T. The commercialization of human genetics: a discussion of issues relevant to Canadian consumers. J. Consumer Policy 21, 483–526 (1998).
33. Packer, K. & Webster, A. Patenting culture in science: reinventing the scientific wheel of credibility. Science, Technology and Human Values 21, 425–445 (1996).
34. Blumenthal, D. Academic–industry relationships in the life sciences. J. Am. Med. Assoc. 268, 3344 (1992).
35. Straus, J. Intellectual property issues in genome research. Genome Digest 3, 1–2 (1996).
36. Barton, J. Reforming the patent system. Science 287, 1933–1934 (2000).
37. United States Patent and Trade Mark Office. Interim Utility Guidelines (1999).
38. Holtzman, N. Are genetic tests adequately regulated? Science 286, 409 (1999).
39. Kodish, E. Commentary: Risks and benefits, testing and screening, cancer, genes and dollars. J. Law Med. Ethics 25, 252–255 (1997).
40. Brower, V. News: Testing, testing, testing? Nature Med. 3, 131–132 (1997).
41. Weiss, R. Genetic testing’s human toll. Washington Post 21 July (1999).
42. Cowan, D. Tort liability of patentee licensors. J. Patent Office Soc. 64, 87–104 (1982).
43. Le Saux, O. et al. Mutations in a gene encoding an ABC transporter cause pseudoxanthoma elasticum. Nature Genet. 25, 223–227 (2000).
44. Smaglik, P. Tissue donors use their influence in deal over gene patent terms. Nature 407, 821 (2000).
45. Human Genome Organization Ethics Committee. Genetic benefit sharing. Science 290, 49 (2000).

TIMELINE

The origins of bioinformatics
Joel B. Hagen

Bioinformatics is often described as being in its infancy, but computers emerged as important tools in molecular biology during the early 1960s. A decade before DNA sequencing became feasible, computational biologists focused on the rapidly accumulating data from biochemistry. Without the benefits of supercomputers or computer networks, these scientists laid important conceptual and technical foundations for bioinformatics today.

It is tempting to trace the origins of bioinformatics to the recent convergence of DNA sequencing, large-scale genome projects, the internet and supercomputers¹⁻³. However, some scientists who claim that bioinformatics is in its infancy acknowledge that computers were important tools in molecular biology a decade before DNA sequencing became feasible⁴. Although the pioneers of computational biology did not use the term ‘bioinformatics’ to describe their work, they had a clear vision of how computer technology, mathematics and molecular biology could be fruitfully combined to answer fundamental questions in the life sciences.

Three important factors facilitated the emergence of computational biology during the early 1960s. First, an expanding collection of amino-acid sequences provided both a
source of data and a set of interesting problems that were infeasible to solve without the number-crunching power of computers. Second, the idea that macromolecules carry information became a central part of the conceptual framework of molecular biology. Although some historians and philosophers have questioned the theoretical significance of this idea for modern molecular biology⁵⁻⁷, it seems likely that thinking in terms of macromolecular information provided an important conceptual link between molecular biology and the computer science from which formal information theory had arisen. Third, high-speed digital computers, which had developed from weapons research programmes during the Second World War, finally became widely available to academic biologists. Not all biologists had, or wanted to have, access to these machines but, by 1960, scarcity of computers was no longer a serious stumbling block for the development of computational biology.

Sequencing
The idea that proteins carry information encoded in linear sequences of amino acids is commonplace today, but it has a relatively short history. This idea first emerged during the decades following the Second World War, a time that one main participant, Emil Smith, later described as a “heroic period” in protein biochemistry⁸. The watershed event of this period was the first successful sequencing of a complete protein, INSULIN, by Frederick Sanger and his colleagues at Cambridge University⁸⁻¹⁰ during the decade 1945–1955 (FIG. 1).

Figure 1 | Frederick Sanger at the Nobel prize ceremony in 1980. (Photograph kindly provided by the MRC, Laboratory of Molecular Biology, Cambridge, UK.)

Sanger’s achievement, for which he was awarded the 1958 Nobel Prize in chemistry, firmly established the polypeptide theory of protein structure. First formulated in 1902, this theory had faced considerable scepticism and competition from alternative theories⁹ (FIG. 2). Analytical techniques in protein biochemistry had improved greatly during the 1930s and 1940s, but before Sanger’s work, practically nothing was known about the order of amino acids in any protein. One could, therefore, still cling to the belief that proteins were structurally simple or even that they had no definite structure at all. As the biochemist Paul Zamecnik later recalled, these lingering concerns were “blown away” by Sanger’s work, which quickly dispelled any doubts that each protein was characterized by a unique primary structure¹¹.

Sequencing insulin was a case of problem solving by a master chemist who used great scientific skill in separating and identifying the fragments of protein degradation¹². At the same time, however, other biochemists were developing more refined methods that would transform the laborious analytical process used by Sanger and his co-workers. The Edman degradation reaction, by which biochemists could sequentially remove and identify individual amino acids from the amino terminus of a short peptide, was a great improvement over the more tedious methods used by Sanger⁸,⁹. The use of ion exchange columns and other innovations in CHROMATOGRAPHY and electrophoresis also made sequencing more efficient. Just as significantly, the entire process of separating and identifying amino acids was rapidly becoming automated. Using semi-automated techniques, researchers led by Stanford Moore and William Stein at the Rockefeller Institute were able to sequence the 124 amino acids in RIBONUCLEASE in about half the time that Sanger’s group had spent deciphering the sequence of the 51 amino acids in insulin¹³,¹⁴. Automation sent a shock wave through the biochemical community, because it promised to transform sequencing into a routine procedure carried out, not by master chemists, but by competent laboratory technicians⁸. By the late 1960s, Pehr Edman had designed the ‘sequenator’, a fully automated sequencing machine that implemented his already widely used degradation reaction¹⁵.

Timeline | Some early milestones in protein and peptide sequencing

Years marked on the timeline: 1951, 1953, 1957, 1960, 1961, 1962, 1963, 1965, 1966, 1967, 1969.

Proteins and peptides shown (number of amino acids):
Oxytocin, Vasopressin: 9
Tobacco mosaic virus coat protein: 159
Insulin (α-chain): 21
Glucagon: 29
Cytochrome c: 105
Lysozyme: 129
Trypsinogen: 216
Immunoglobulin (γ-chain): 446
Insulin (β-chain)*: 30
Ribonuclease (1963): 124
Haemoglobin (α-chain): 141
Myoglobin: 153
Glyceraldehyde-3-phosphate dehydrogenase: 340
Haemoglobin (β-chain): 145
Human growth hormone (1969): 188

*The complete primary structure of insulin, including the positions of the disulphide bonds, was published in 1955. (Dates in parentheses are for revisions of the originally published sequences; the numbers given are the numbers of amino acids.) Source: L.R. Croft, Handbook of Protein Sequence Analysis: A Compilation of Sequences of Proteins with an Introduction to the Methodology (John Wiley, Chichester, 1980).

These innovations encouraged many laboratories to begin sequencing proteins and rapidly expanded the library of amino-acid sequences (TIMELINE).

Figure 2 | Alternative theories of protein structure. Although the polypeptide theory was widely accepted by the time of the Second World War, there remained cautious doubt about protein structure when Frederick Sanger began his work on insulin. a | An influential group of biochemists during the early decades of the twentieth century had argued that proteins were amorphous colloids, with no definite molecular structure. Although this idea steadily lost ground, a few biochemists clung to the idea that polypeptides were merely artefacts formed when colloidal proteins were denatured. Other structural theories surfaced during the 1930s. b | According to the cyclol theory, proteins were honeycomb-like fabrics formed from interlocking cyclical subunits. These alternative theories were not strongly supported but, until insulin was sequenced, they were not completely ruled out. c | According to periodicity theories, proteins were simple, repetitive chains of amino acids.

Macromolecular information
Once the polypeptide theory became firmly established and methods for sequencing proteins were readily available, the idea of proteins as information-carrying macromolecules became widespread. This general idea developed within three broadly overlapping contexts: the genetic code, the three-dimensional structure of a protein in relation to its function, and protein evolution.

The genetic code. Concurrent developments in the molecular biology of the gene provided a compelling theoretical context for discussing how genetic information was transferred from a sequence of nucleotides to a sequence of amino acids. However, the sequencing of DNA and RNA presented formidable technical hurdles that were not fully overcome until the early 1970s (REF. 16). So, although molecular biologists learned a great deal about the genetic code, the actual nucleotide sequences of genes remained largely unknown during the 1960s. With a growing collection of amino-acid sequences, the idea of molecular information could therefore be explored with proteins in ways not applicable to nucleic acids.

Protein structure. From a purely biochemical perspective, one could ask about the causal relationship between the information carried in the primary structure of a protein and the three-dimensional configuration of the active molecule. Experiments carried out by Christian Anfinsen and his colleagues at the National Institutes of Health in the late 1950s showed that, after being denatured, ribonuclease spontaneously refolded to regain its original enzymatic activity¹⁷. This was taken as compelling evidence that the sequence of amino acids completely specified the three-dimensional structure of the protein. Of course, in practical terms, knowing the sequence did not necessarily allow biochemists to correctly predict the secondary and tertiary structures of a protein. But sequence data played a key role in interpreting the X-ray diffraction images used by John Kendrew and Max Perutz (FIG. 3) to determine the three-dimensional structures of MYOGLOBIN and HAEMOGLOBIN¹⁸. Combining the biochemical techniques of sequence analysis with the biophysical techniques of X-RAY CRYSTALLOGRAPHY seemed to hold the key to understanding how the molecular information in a sequence of amino acids causes a protein to fold into a specific, often highly complex, three-dimensional configuration¹³,¹⁴,¹⁷.

Figure 3 | Max Perutz, who shared the 1962 Nobel prize in chemistry with John Kendrew. (Photograph kindly provided by the MRC, Laboratory of Molecular Biology, Cambridge, UK.)

Protein evolution. The idea that linear information could determine the structure and function of proteins fits squarely within a dominant tradition in twentieth-century biochemistry: the ‘protein paradigm’¹⁹. For several decades before 1960, biochemists had focused their efforts on providing mechanistic explanations for how enzymes, protein hormones, antibodies and respiratory pigments worked. This episode has attracted considerable historical interest⁹,¹²,²⁰, but historians have paid much less attention to how macromolecular information could be thought of in an explicitly phylogenetic context. This phylogenetic approach was a significant departure from traditional biochemical thinking about structure and function, which had largely ignored evolutionary questions. Indeed, historians have often stressed the conflicts between evolutionary biology and the more experimental sciences such as biochemistry. However, during the 1960s, biochemists and molecular biologists were increasingly drawn to evolutionary questions. For example, Emile Zuckerkandl and Linus Pauling referred to proteins and nucleic acids as “semantides”, whose information-carrying sequences of subunits could be used to document evolutionary history²¹. Derived from ‘semanteme’, the fundamental unit of meaning used by linguists to study human speech, semantides were to be the analogous biochemical units (hence the chemical suffix -ide) for evolutionary studies.

How would the evolutionary information carried by semantides be used? Zuckerkandl and Pauling imagined a new field of ‘paleogenetics’ that would use the rigorous laboratory techniques of biochemistry and molecular biology to answer evolutionary questions traditionally studied by palaeontologists and naturalists. Paleogeneticists, who soon
NATURE REVIEWS | GENETICS VOLUME 1 | DECEMBER 2000| 233 © 2000 Macmillan Magazines Ltd PERSPECTIVES became more commonly referred to as mole- have emphasized the importance of instru- trated by the career of Margaret Oakley cular evolutionists, had several approaches at mentation8,9,19 but, with the exception of John Dayhoff31,32. Trained in quantum chemistry their disposal. Comparisons of similar pro- Kendrew’s use of computers for elucidating and mathematics, she became interested in teins, such as myoglobin and haemoglobin, the three-dimensional structure of proteins and molecular evolution around provided evidence for molecular evolution by myoglobin18,27, the historical role of digital 1960. As associate director of the newly estab- gene duplication. Comparison of homolo- computers during the 1960s has been virtual- lished National Biomedical Research gous proteins drawn from various species ly ignored. Even in the case of Kendrew, com- Foundation, an organization founded specifi- could be used to trace phylogenetic relation- puters have not been viewed as contributing cally to encourage the development of com- ships among both the proteins themselves decisively to the discovery process. puter applications, Dayhoff was well situated and the species that carried them. In some Nonetheless, digital computers were well suit- to explore mathematical approaches for cases, such comparisons could also be used to ed to deal with the types of numerical data analysing amino-acid sequence data (FIG. 4). recreate the ancestral proteins from which that protein biochemists were generating in Continuously funded by grants from the present-day molecules evolved. Assuming growing amounts. National Institutes of Health throughout the that amino-acid substitution rates were rela- By the early 1960s, computers were becom- 1960s and with further support from the tively constant within a given protein, paleo- ing widely available to academic researchers. 
National Science Foundation, the National geneticists had a ‘MOLECULAR CLOCK’ by which According to surveys conducted at the begin- Aeronautics and Space Administration, and evolutionary events might be reliably dated. ning of the decade, 15% of colleges and univer- the IBM corporation, Dayhoff moved on sev- These claims, particularly the idea of a sities in the United States had at least one com- eral fronts. Her initial project was writing a molecular clock, were enormously controver- puter on campus, and most principal research series of FORTRAN programs to determine sial and provided a source of conflict between universities were purchasing so-called ‘second the amino-acid sequences of protein mole- molecular evolutionists and traditional natu- generation’ computers, based on transistors, to cules33,34. Taking the overlapping peptide frag- ralists22–24. Sequence analysis also had to com- replace the older vacuum-tube models28.The ments from the partial digestion of a protein, pete with well-established molecular tech- first high-level programming language, FOR- the programs deduced all of the possible niques, such as the immunological measures TRAN (formula ), was introduced sequences that were consistent with the data. used by Morris Goodman, Allan Wilson, by the International Business Machines (IBM) Conceptually similar to the puzzle-solving Vincent Sarich and others to unravel phylo- corporation in 1957. It was particularly well approach that biochemists claimed to have genetic relationships23. Encounters among suited to scientific applications, and compared used in the early sequencing investigations of competing groups of biologists at profes- with the earlier machine languages, it was rela- insulin and ribonuclease10,13,14,35,Dayhoff’s sional meetings could be bruising, but these tively easy to learn. 
For the first time, detailed computer programs arrived at the correct confrontations should not eclipse the knowledge of computer architecture was not sequence for a small protein (ribonuclease) important synthesis of evolutionary biology, needed to write a computer program. This within a few minutes. The same feat had protein biochemistry and computer science important innovation in computer software taken a team of humans many months to that was beginning to emerge during the early encouraged the growth of computational biol- accomplish. Similar programs written by 1960s, which laid an evolutionary foundation ogy. At the same time, there was a concerted other computational biologists at about the for the bioinformatics of today24–26. effort by governmental agencies and the com- same time claimed to successfully sequence puter industry to foster the development of hypothetical proteins up to 750 amino acids Emergence of computational biology academic computing in the life sciences29,30. in length36. Significantly, even during the early Historical studies of protein biochemistry The attraction of computers is well illus- development of these programs, Dayhoff and her contemporaries realized that the logic of Glossary sequence analysers could also be directly applied to nucleic acids when gene sequences CHROMATOGRAPHY insulin-dependent diabetes mellitus. A chemical analysis technique that uses a process of finally became available. separating gases, liquids or solids from mixtures or MOLECULAR CLOCK Computer programs for sequence analysis solutions by selective adsorption. The hypothesis that, in any given gene or DNA followed the principles initiated by the auto- sequence, mutations accumulate at an approximately matic amino-acid analyser used by Stein and CYTOCHROMES constant rate in all evolutionary lineages as long as the 37,38 Proteins whose function is to carry electrons or protons gene or the DNA sequence retains its original function. Moore . 
In both cases, the objective was to (hydrogen ions) by virtue of the reversible develop quickly a library of sequences that charging/discharging of an iron atom or iron/sulphur MYOGLOBIN could be used for studies in comparative bio- atoms in the centre of the protein. Cytochromes are An oxygen-carrying muscle protein that makes oxygen chemistry and molecular evolution. To pro- central molecules of electron transport in the process available to the muscles for contraction. known as oxidative phosphorylation. Cytochromes are mote this end, Dayhoff established the Atlas of divided into four groups (a, b, c, d) according to their RIBONUCLEASE Protein Sequence and Structure, an annual ability to absorb or transmit certain colours of light. A that hydrolyses RNA. publication that attempted to catalogue all known amino-acid sequences. Although HAEMOGLOBIN X-RAY CRYSTALLOGRAPHY rudimentary by today’s standards, the Atlas Protein present in red blood cells that reversibly binds Study of the molecular structure of crystalline oxygen for transport to tissues. compounds through X-ray diffraction techniques. served as the first database for molecular biol- When an X-ray beam bombards a crystal, the atomic ogy, and it became an indispensable resource INSULIN structure of the crystal causes the beam to scatter for early computational research25,31,32,39.It A protein hormone secreted by β cells of the pancreas. (diffract) in a specific pattern. X-ray crystallography eventually evolved into a major online data- Insulin is important in the regulation of glucose provides information on the positions of individual base, the Protein Information Resource (PIR), metabolism, generally promoting the cellular use of atoms in the crystal, the distances between atoms, the glucose. It is also an important regulator of protein and angles of the atomic bonds and other features of established in 1983, and it provided an lipid metabolism. Insulin is used as a drug to control molecular geometry. 
important point of departure for other com-

234 | DECEMBER 2000 | VOLUME 1 www.nature.com/reviews/genetics © 2000 Macmillan Magazines Ltd PERSPECTIVES

1960s, several biologists developed computer algorithms for determining sequence homol- ogy and aligning related sequences to account for deletions or insertions43,45,46. Walter Fitch’s moving-window approach searched for nonrandom alignments by com- paring all possible combinations of sequences of a given length (say 30 amino acids) along two protein molecules. For each of the thousands of comparisons, the com- puter calculated the minimum number of mutations required to convert one sequence to the other. This was done using a matrix of the number of mutations needed to substi- tute one amino acid for any other on the basis of the genetic code. The computer was instructed to print out all sequences whose similarity could not be accounted for by chance. Fitch’s approach was further elabo- rated by others, notably Saul Needleman and Figure 4 | The IBM 7090 computer, which Margaret Dayhoff used for her early work. This famous Christian Wunsch, whose algorithm remains computer was one of the first solid-state machines and was used widely in business and defence one of the standard methods for sequence settings, as well as scientific applications. alignment47. The computational approach (Photograph courtesy of IBM archives.) used by Dayhoff and Richard Eck was simi- lar, but their MDM (mutation data matrix) putational biologists, who soon began build- this data could be used to build a phylogenet- or PAM (per cent accepted mutation) matri- ing their own molecular databases40. ic tree that was remarkably similar to those ces were based on the probability of substi- Critics have pointed out that most of the based on more traditional taxonomic charac- tuting a given amino acid with any other. entries in Dayhoff’s Atlas were interspecific ters42. 
variations of a few well-studied proteins such as cytochrome c, and that the number of known protein sequences remained fairly small throughout the 1960s (ref. 40). What should not be overlooked, however, is how the early comparative studies of homologous proteins opened up the field of molecular evolution. Although phylogenetic analysis of amino-acid sequences could be done by hand (ref. 41), computers proved immensely valuable in this regard. From the beginning, theoretical biologists realized that in most cases the number of possible phylogenetic trees was so great that it would be infeasible for a human to evaluate even a small fraction of them. If every amino acid in even a small protein was to be considered a separate character, then finding the most likely tree was clearly an appropriate task for digital computers. Early molecular evolutionists such as Russell Doolittle, who began studying protein phylogenies without computers, quickly added them to their research programmes by the late 1960s.

The potential for using computers for phylogenetic analysis was dramatically demonstrated for cytochrome c, the respiratory pigment found in all aerobic cells. By the mid-1960s, the protein had been sequenced for a wide variety of plants, animals, fungi and microbes. In a now classic article, Walter Fitch and Emanuel Margoliash showed how

In this, and in the similar computer programs concurrently devised by Dayhoff's team (refs 43, 44), pairwise comparisons were made among homologous amino-acid sites on cytochromes drawn from 20 species. The computer calculated the mutation distances, or the minimum number of steps required to convert one cytochrome to another. Starting with a simple three-branched tree, subsequent branches were added sequentially in a way that would minimize the mutation distances. These early computer programs did not attempt exhaustive searches for the simplest tree, but left this partly to human intuition. The investigator chose a different initial subset of three proteins, then the computer constructed a second tree. The new tree was discarded if it turned out to be less satisfactory than the original one. During the research leading to their article, Fitch and Margoliash examined 40 trees in an attempt to find the optimal one.

The molecular phylogenies constructed by early computational biologists rested on the assumption that the proteins being compared were homologous. In cytochrome c trees, the proteins from various species were so similar that there was no question that they shared a close common ancestor. However, detecting homology, and distinguishing it from chance similarity, in more distantly related proteins was recognized as an important problem by molecular evolutionists. During the late

These probabilities were estimated by counting the occurrence of each amino-acid substitution in families of very similar proteins (for example, cytochromes or haemoglobins) taken from the Atlas of Protein Sequence and Structure. The matrices proved useful for more distantly related proteins as well, and quickly became standard tools for detecting homology and aligning sequences. PAM matrices continue to be used today and have stimulated the development of several more refined methods (ref. 3).

Even the most challenging computational problems in bioinformatics had precursors in the 1960s. For example, the biophysicist Cyrus Levinthal and a team of researchers (ref. 48) used one of the first large, time-sharing computers at the Massachusetts Institute of Technology to construct three-dimensional models of cytochrome c. A visual display of the molecule was projected on an oscilloscope screen. Researchers could control the rotation of the model using a hand-operated device similar to a track ball and could manipulate the model using either a keyboard or a light pen. Because of limitations in the speed of even the most powerful computers of the day, this early modelling effort did not have an immediate impact on biochemistry or molecular biology. However, it was an important historical bridge between earlier mechanical and stereoscopic modelling techniques and the advanced computer models of today.
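The remark above that the number of possible phylogenetic trees quickly outgrows what a person could evaluate can be made concrete. The following is a minimal sketch, assuming unrooted, strictly bifurcating trees with labelled tips (an assumption of this example; the text does not specify the tree type). Under that assumption the count for n tips is the standard result (2n − 5)!!, the product of the odd numbers up to 2n − 5. The function name is hypothetical.

```python
def count_unrooted_trees(n_tips: int) -> int:
    """Number of distinct unrooted bifurcating trees on n labelled tips.

    Assumes the standard combinatorial result (2n - 5)!! for n >= 3.
    """
    if n_tips < 3:
        raise ValueError("need at least three tips")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n = 3).
    for odd in range(3, 2 * n_tips - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 20):
        print(n, count_unrooted_trees(n))
```

For the 20 species in the cytochrome c study this gives roughly 2.2 × 10^20 candidate trees, which puts the 40 trees actually examined by Fitch and Margoliash in perspective.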

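The counting procedure described above for the PAM matrices, tallying each amino-acid substitution within families of very similar proteins, can be illustrated with a short sketch. This is a simplified illustration and not Dayhoff's actual procedure: it assumes pre-aligned sequences of equal length and compares every pair of sequences directly. The toy sequences and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_substitutions(aligned_family):
    """Tally unordered amino-acid substitutions across all sequence pairs."""
    lengths = {len(seq) for seq in aligned_family}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to equal length")
    counts = Counter()
    for seq_a, seq_b in combinations(aligned_family, 2):
        for res_a, res_b in zip(seq_a, seq_b):
            if res_a != res_b:
                # Sort each pair so that A->E and E->A are pooled together.
                counts[tuple(sorted((res_a, res_b)))] += 1
    return counts


def substitution_frequencies(counts):
    """Convert raw tallies into relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical toy alignment, not real cytochrome data.
    family = ["GDVEKGK", "GDVAKGK", "GDIEKGK"]
    counts = count_substitutions(family)
    for pair, freq in sorted(substitution_frequencies(counts).items()):
        print(pair, round(freq, 3))
```

Pooling each pair in sorted order treats substitutions as symmetric, which is a design choice of this sketch; a tally that tracked direction would need ancestral sequences that the toy data do not provide.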
NATURE REVIEWS | GENETICS VOLUME 1 | DECEMBER 2000 © 2000 Macmillan Magazines Ltd

Conclusions

By 1970, computational biologists had developed a diverse set of techniques for analysing molecular structure, function and evolution. Although originally designed for studying proteins, many of these computing techniques could be adapted for studying nucleic acids. Some of these techniques survive today or have lineal descendants that are used in bioinformatics. In other cases, they stimulated the development of more refined techniques to correct deficiencies in the original methods. Although the nascent field was later revolutionized by the advent of genome projects, large-scale computer networks, immense databases, supercomputers and powerful desktop computers, today's bioinformatics also rests on the important intellectual and technical foundations laid by scientists at an earlier period in the computer era.

Joel B. Hagen is at the Department of Biology, Radford University, Radford, Virginia 24142, USA. e-mail: [email protected]

Links

DATABASE LINKS insulin | ribonuclease | myoglobin | haemoglobin | Protein Information Resource | cytochrome c
FURTHER INFORMATION IBM history | National Biomedical Research Foundation | History of visualization of biological macromolecules
ENCYCLOPEDIA OF LIFE SCIENCES DNA sequencing | Sanger, Frederick | Protein secondary structures: predictions | Molecular clocks | Linus Carl Pauling

1. Lake, J. A. & Moore, J. E. Phylogenetic analysis and comparative genomics. Trends Guide Bioinformatics 22–23 (1998).
2. Howard, K. The bioinformatics gold rush. Sci. Am. 283, 58–63 (2000).
3. Rashidi, H. H. & Buehler, L. K. Bioinformatics Basics: Applications in Biological Science and Medicine (CRC Press, Boca Raton, 2000).
4. Boguski, M. S. Bioinformatics – a new era. Trends Guide Bioinformatics 1–3 (1998).
5. Kay, L. Who wrote the book of life? Information and the transformation of molecular biology, 1945–1955. Science Cont. 8, 609–634 (1995).
6. Kay, L. Cybernetics, information, life: The emergence of scriptural representations of heredity. Configurations 5, 23–91 (1997).
7. Sarkar, S. in The Philosophy and History of Molecular Biology: New Perspectives (ed. Sarkar, S.) 187–231 (Kluwer Academic, Dordrecht, 1996).
8. Smith, E. L. in The Origins of Modern Biochemistry: A Retrospective on Proteins (eds Srinivasan, P. R., Fruton, J. S. & Edsall, J. T.) 107–118 (New York Academy of Sciences, New York, 1979).
9. Fruton, J. S. Proteins, Enzymes, Genes: The Interplay of Chemistry and Biology (Yale Univ. Press, New Haven, 1999).
10. Sanger, F. Chemistry of insulin. Science 129, 1340–1344 (1959).
11. Zamecnik, N. H. in The Origins of Modern Biochemistry: A Retrospective on Proteins (eds Srinivasan, P. R., Fruton, J. S. & Edsall, J. T.) 269–294 (New York Academy of Sciences, New York, 1979).
12. Fruton, J. S. A Skeptical Biochemist (Harvard Univ. Press, Cambridge, Massachusetts, 1992).
13. Stein, W. H. & Moore, S. The chemical structure of proteins. Sci. Am. 204, 81–92 (1961).
14. Moore, S. & Stein, W. H. Chemical structures of pancreatic ribonuclease and deoxyribonuclease. Science 180, 458–464 (1973).
15. Edman, P. & Begg, G. A protein sequenator. Eur. J. Biochem. 1, 80–91 (1967).
16. Sanger, F. Sequences, sequences, and sequences. Annu. Rev. Biochem. 57, 1–28 (1988).
17. Anfinsen, C. B. Principles that govern the folding of protein chains. Science 161, 223–230 (1973).
18. Olby, R. C. The 'Mad Pursuit': X-Ray crystallographers' search for the structure of hemoglobin. Hist. Phil. Life Sci. 7, 171–193 (1985).
19. Kay, L. The Molecular Vision of Life (Oxford Univ. Press, New York, 1993).
20. Srinivasan, P. R., Fruton, J. S. & Edsall, J. T. (eds) The Origins of Modern Biochemistry: A Retrospective on Proteins (New York Academy of Sciences, New York, 1979).
21. Zuckerkandl, E. & Pauling, L. Molecules as documents of evolutionary history. J. Theor. Biol. 8, 357–366 (1965).
22. Zuckerkandl, E. On the molecular evolutionary clock. J. Mol. Evol. 26, 34–64 (1987).
23. Dietrich, M. Paradox and persuasion. Negotiating the place of molecular evolution within evolutionary biology. J. Hist. Biol. 31, 85–111 (1998).
24. Hagen, J. B. Naturalists, molecular biologists, and the challenges of molecular evolution. J. Hist. Biol. 32, 321–341 (1999).
25. Jungck, J. R. & Friedman, R. M. Mathematical tools for molecular genetics data: An annotated bibliography. Bull. Math. Biol. 46, 699–744 (1984).
26. Hagen, J. B. The introduction of computers into systematic research in the United States during the 1960s. Studies Hist. Phil. Biol. Biomed. Sci. (in the press).
27. Perutz, M. Early days of protein crystallography. Meth. Enzymol. 114, 3–18 (1985).
28. Anonymous. Computing in the university. Datamation 8, 27–30 (1962).
29. Ledley, R. S. Digital electronic computers in biomedical sciences. Science 130, 1225–1234 (1959).
30. Ledley, R. S. Use of Computers in Biology and Medicine (McGraw–Hill, New York, 1965).
31. Hunt, L. T. Margaret O. Dayhoff, 1925–1983. DNA 2, 87–98 (1983).
32. Hunt, L. T. Margaret Oakley Dayhoff, 1925–1983. Bull. Math. Biol. 46, 467–472 (1984).
33. Dayhoff, M. O. & Ledley, R. S. Comprotein: A computer program to aid primary protein structure determination. Proc. Fall Joint Comp. Conf. 22, 262–274 (1962).
34. Dayhoff, M. O. Computer aids to protein sequence determination. J. Theor. Biol. 8, 97–112 (1965).
35. Thompson, E. O. The insulin molecule. Sci. Am. 192, 36–41 (1955).
36. Bernhard, S. A., Bradley, D. F. & Duda, W. L. Automatic determination of amino acid sequences. IBM J. Res. Dev. 7, 246–251 (1963).
37. Spackman, D. D., Stein, W. H. & Moore, S. Automatic recording apparatus for use in the chromatography of amino acids. Anal. Chem. 30, 1190–1206 (1958).
38. Mason, E. E. & Bulgren, W. G. Computer Applications in Medicine (Charles C. Thomas, Springfield, Illinois, 1964).
39. Fitch, W. M. Book review of M. O. Dayhoff, Atlas of Protein Sequence and Structure. Syst. Zool. 22, 196 (1972).
40. Doolittle, R. F. Some reflections on the early days of sequence searching. J. Mol. Med. 75, 239–241 (1997).
41. Doolittle, R. F. & Blombäck, B. Amino-acid sequence investigations of fibrinopeptides from various mammals: Evolutionary implications. Nature 202, 147–152 (1964).
42. Fitch, W. M. & Margoliash, E. Construction of phylogenetic trees. Science 155, 279–284 (1967).
43. Eck, R. V. & Dayhoff, M. O. Atlas of Protein Sequence and Structure (National Biomedical Research Foundation, Silver Spring, Maryland, 1966).
44. Dayhoff, M. O. Computer analysis of protein evolution. Sci. Am. 221, 87–95 (1969).
45. Fitch, W. M. An improved method of testing for evolutionary homology. J. Mol. Biol. 16, 9–16 (1966).
46. Dayhoff, M. O. & Eck, R. Atlas of Protein Sequence and Structure 1967–1968 (National Biomedical Research Foundation, Silver Spring, Maryland, 1968).
47. Needleman, S. B. & Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970).
48. Levinthal, C. Molecular model-building by computer. Sci. Am. 214, 42–52 (1966).

TIMELINE

Genetic disease since 1945

M. Susan Lindee

Although hereditary disease has been recognized for centuries, only recently has it become the prevailing explanation for numerous human pathologies. Before the 1970s, physicians saw genetic disease as rare and irrelevant to clinical care. But, by the 1990s, genes seemed to be critical factors in virtually all human disease. Here I explore some perspectives on how and why this happened, by looking at two genetic diseases: familial dysautonomia and phenylketonuria.

Human hereditary disease has a long recorded history but, for most of the twentieth century, geneticists knew more about hereditary pathology in the mouse or the fruitfly than in human beings. In the past thirty years, however, pathology as related to our genes has become the focus of intense medical, scientific and corporate interest. Many complicated bodily states have been reconfigured as genetic diseases and, by the 1990s, heredity was the prevailing explanation for virtually all disease states. Even infectious disease has been rhetorically subsumed under genetic mechanisms of inherent vulnerability and genetically driven immune system responses. The model in which all bodily states have an underlying genetic cause dominates concepts of disease in the industrialized world, particularly in English-speaking nations and even more so in the United States.

One possible way to understand the new centrality of genetic disease to scientific research, medical education, medical theory and the broader culture would be to
