Perspective

The Roots of Bioinformatics

David B. Searls* Independent Consultant, Philadelphia, Pennsylvania, United States of America

Citation: Searls DB (2010) The Roots of Bioinformatics. PLoS Comput Biol 6(6): e1000809. doi:10.1371/journal.pcbi.1000809
Published June 24, 2010
Copyright: © 2010 David B. Searls. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The author received no specific funding for this article.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: [email protected]
PLoS Computational Biology | www.ploscompbiol.org | June 2010 | Volume 6 | Issue 6 | e1000809

Introduction

Every new scientific discipline or methodology reaches a point in its maturation where it is fruitful for it to turn its gaze inward, as well as backward. Such introspection helps to clarify the essential structure of a field of study, facilitating communication, pedagogy, standardization, and the like, while retrospection aids this process by accounting for its beginnings and underpinnings.

In this spirit, PLoS Computational Biology is launching a new series of themed articles tracing the roots of bioinformatics. Essays from prominent workers in the field will relate how selected scientific, technological, economic, and even cultural threads came to influence the development of the field we know today. These are not intended to be review articles, nor personal reminiscences, but rather narratives from individual perspectives about the origins and foundations of bioinformatics, and are expected to provide both historical and technical insights. Ideally, these articles will offer an archival record of the field’s development, as well as a human face on an important segment of science, for the benefit of current and future workers.

Upcoming articles, already commissioned, will cover the roots of bioinformatics in structural biology, in evolutionary biology, and in artificial intelligence, with more in the works. These topics are obviously very broad, and so are likely to be subdivided or otherwise revisited in future installments by authors with varying perspectives. Topics and authors will be chosen at the discretion of the editors along lines broadly corresponding to the usual content of this journal.

The author, having been asked to serve as Series Editor by the Editor-in-Chief, will endeavor to maintain a uniform flow of articles solicited from luminaries in the field. As a starting point to the series, I offer below a few vignettes and reflections on some longer-term influences that have shaped the discipline. I first consider the unique status of bioinformatics vis-à-vis science and technology, and then explore historical trends in biology and related fields that anticipated and prepared the way for bioinformatics. Examining the context of key moments when computers were first taken up by early adopters reveals how deep the roots of bioinformatics go.

The Nature of Bioinformatics

Many who draw a distinction between bioinformatics and computational biology portray the former as a tool kit and the latter as science. All would allow that the science informs the tools and the tools enable the science; in any case, bioinformatics and computational biology are near enough cousins that their origins and early influences are likely to be commingled as well. Therefore, this article and series will construe bioinformatics broadly, bearing in mind it can thus be expected to have a dual nature. This duality echoes another that goes back to Aristotle, between “episteme” (knowledge, especially scientific) and “techne” (know-how, in the sense of craft or technology). The power of bioinformatics might be seen as arising from their harmonious combination, in the Greek tradition, lending it emergent capabilities beyond the simple intersection of computers and biology, or indeed of science and engineering.

A Bioinformatics Revolution?

Many commentators refer to the “bioinformatics revolution.” If there has been one, was it a revolution in techne, like the Industrial Revolution, or in episteme, like the Scientific Revolution? Or was it both? The former suggests quantum leaps in scale and capability through automation, which seems to apply to bioinformatics almost by definition, while the latter implies an actual shift in worldview, raising a more philosophical question.

In Thomas Kuhn’s famous conception of scientific revolutions, the early stages of paradigm formation are freewheeling and unstructured, while being effectively cut off from the pre-existing scientific milieu by their very novelty and an inherent incommensurability [1]. (The overused word “paradigm” can be excused in this context because it was Kuhn who instigated its overuse.) At some point, such “pre-science” becomes consolidated, establishes norms and templates, and settles into a “normal science” phase that allows for efficient discovery within a prevailing paradigm. Many would agree that the heady early days of bioinformatics had a makeshift feel, which has since matured into a more coherent, productive discipline with an established canon.

But before claiming the exalted status of a Kuhnian paradigm shift, it should be noted that Kuhn had in mind rather broader disciplines of science than bioinformatics, which was erected within and in relation to the comprehensive pre-existing scaffoldings of biology and computer science. To the extent that bioinformatics is a subsidiary or derivative field, it might call more for an evolutionary than a revolutionary model of development, of a sort some critics of Kuhn have advocated [2,3]. From this perspective, its novelty and force perhaps derive from hybrid vigor rather than spontaneous generation, and it would seem to be more enabling than overturning, and thus primarily an advance in techne. Whether its rapid uptake and substantial impact qualify it as a technological revolution, or merely an evolutionary saltation, is perhaps only a matter of semantics.

In Kuhn’s semantics, though, scientific revolutions produce profound shifts in our literal perception of reality. A computational perspective may radically change attitudes toward data, or even models of data, but it seems unlikely to fundamentally alter our sense of reality in the domain of biology. Still, true believers may argue that the “computational thinking” movement [4] as applied to biology, and perhaps even a view of life itself as a form of computation [5], does indeed rise to the level of a paradigm shift and a true revolution in episteme. We will explore a few such ideas below.

The Role of Tools

A philosophical stance called realism essentially views episteme as independent of techne, holding that scientific truth is ultimately separable from how we measure or model it. But some assign tools a more prominent and persistent role. The Nobel laureate physicist P. W. Bridgman’s influential notions of operationalism sought to reduce all scientific concepts to the literal means by which they are measured (that is, to operational definitions to be taken at face value rather than as describing some underlying idealization), so as not to overinterpret or heedlessly conflate such concepts [6]. Thus, temperature would be defined in terms of thermometers rather than thermodynamics. Decades before computer scientists conceived of operational semantics and abstract data types, Bridgman considered a scientific concept “synonymous with the corresponding set of operations” [7]. Though controversial in physics circles, operationalization was seized upon by certain “soft” sciences like sociology as a way of achieving a more respectable exactitude.

The “hardening” of biology in the 20th century involved a reductionist convergence with chemistry and physics, enhanced by improving instrumentation, as well as new quantitative overlays to the legacies of Linnaeus, Mendel, and Darwin. This often called for operationalizations, such as that of “enzyme” in terms of a measured activity, or that of the much-debated concept of “species” [3]. The practice predated but has lately been reinforced by bioinformatics. Computers, with their notorious literal-mindedness, require the same sort of “tightening up” of descriptive language as that urged by Bridgman [6], and have promoted ever more explicitly operational definitions, for example, of “gene”, in terms of the biological operations applied to DNA sequences [8].

Bridgman felt that by first recognizing clearly the distinction between operationally defined concepts such as gravitational and inertial mass, deeper insights like Einstein’s equivalence principle would come more naturally. Today, operational definitions of biological concepts such as “gene” and “pathway”, distinguished as to whether they are probed by methods genetic, biochemical, or biophysical, are providing new insights as they are similarly integrated, with appropriate caution, by bioinformatic methods.

The Instrumental Gene

Even scientific theories can be considered techne. Instrumentalism, an idea that goes back to the earliest days of the scientific revolution, takes a very pragmatic, almost mechanical view of theories, that they should be viewed merely as tools for predicting or explaining observations as opposed to directly describing objective reality [9]. Thus genetics was at first purely instrumental; regardless of any conviction that the gene had a physical basis, it was in practice a conceptual tool [10]. Instrumentalism doesn’t ask whether a theory is true or false, but treats it as a sort of anonymous function taking data as input and producing predictions or explanations as output, the quality of which determines the appeal of the theory. Whether or not this is an adequate formulation of a scientific theory, it may be as good a definition as any of a bioinformatics application.

For a taste of the pre-molecular instrumental conception of genes, consider the moment in 1911 when Alfred Sturtevant made a key contribution. While still an undergraduate at Columbia University, he won a seat in the legendary “fly room” of T. H. Morgan’s lab, which was busy identifying Drosophila mutants and counting offspring of various crosses. One day, upon realizing that multiple pairwise linkage strengths could not only be viewed inversely as distances but also collapsed onto a single dimension, he related that he “went home and spent most of the night (to the neglect of my undergraduate homework) in producing the first chromosome map” [11]. Long before the advent of bioinformatics, we nevertheless glimpse something of its “style” in this approach to data transformation, integration, and visualization, not to mention the fact that the youngest scientists often seem most adept at data-crunching (evidently even without benefit of a computer literacy surpassing that of their elders).

Bioinformatics and Genes

The gene concept has undergone a steady evolution, in varying degrees instrumental and operational [12,13]. The work of Barbara McClintock, for example, did much to ground the instrumental gene in physical locations on chromosomes by 1929 (though soon she in turn introduced instrumental notions of transposition and “controlling elements” that only became instantiated decades later in transposons, operons, and other regulatory apparatus, resulting in her belated Nobel Prize in 1983 [14]). Bioinformatics has played an increasingly important role in this evolution. Mark Gerstein notes that by the 1970s and 1980s, through a combination of cloning and sequencing techniques and then computational gene identification (whether by similarity or protein-coding signature), the working definition of a gene was reduced to a literal open reading frame of sequence (digitized data, in other words, critically dependent on electronic storage and algorithms), and that by the 1990s the gene had become for most practical purposes an annotated database entry [13]. Gerstein goes on to assert that the latest metaphor for genes is as “subroutines in the genomic operating system,” which suggests entirely new senses of operationalism and instrumentalism in biology, with a natural role for bioinformatics.

Yet operationalism and instrumentalism are often challenged in philosophical circles today, where they are considered to be “anti-realist” in their seeming disregard for the actual physical objects and processes underlying scientific concepts. In fact, it would appear that scientific progress is made when operational concepts are joined up, as by Einstein, or when instrumental concepts are mapped to successively more material forms, as by Sturtevant, McClintock, and eventually James Watson and Francis Crick. But this only bears out the functional utility of these “isms,” whose persistence suggests some underlying truth; they seem to wrestle with important concepts such as abstraction and reification (that is, concretization of abstractions as “first-class objects” for further manipulation) that are natural to and even promoted by the computational sciences. One thing they certainly assert is that it is a mistake to trivialize the role of tools in science as mere means to an end, as scientific ground truth may be hard to disentangle from those tools in the final analysis.

Bioinformatics before Bioinformatics

Bioinformatics is far from being the first discipline to straddle the duality of episteme and techne. Mathematics is also
considered a tool, vis-à-vis science, and here it is even more apparent how inseparable is the tool from the underlying scientific reality. Indeed, since Galileo and Newton, a common sentiment has been that science is never so successful as when its laws and explanations can be reduced to mathematical expression. Historically this had not been biology’s forte, but early in the 20th century statistics and numerical analysis began to establish footholds in the field. Computers eventually carried these methods to new heights, though mainly by automating them rather than changing their underlying methodologies. Yet “pure” computer science is itself discrete math, separable from hardware, and soon this also would come to bear on a newly digital biology. As the following narrative suggests, the roots of bioinformatics may be detected in a mathematization of biology on many fronts, which machines only served to accelerate. The middle of the 20th century witnessed the key transitions.

Mathematics Sets the Scene

The development of modern statistics was to a significant degree driven by its application to biology in the work of Francis Galton in the 19th century [15] and R. A. Fisher in the 20th [16]. Fisher helped put both Mendelism and Darwinism on a firm mathematical footing by 1930, and he is also credited with being the first to apply a computer to biology, albeit almost offhandedly. In a 1950 note giving tables of solutions to a differential equation developed for population genetics, Fisher says simply “I owe this tabulation to Dr. M.V. Wilkes and Mr. D.J. Wheeler, operating the EDSAC electronic computer.” [17] EDSAC, the Electronic Delay Storage Automatic Calculator, was built at the University of Cambridge Mathematical Laboratory; it is considered the first truly practical stored-program computer and the inspiration for the first text on computer programming in 1951 [18].

As biology became more quantitative throughout the 20th century, it increasingly assumed a “statistical frame of mind” [19]. In addition, naturalists adopted numerical methods for population modeling, and biochemists for enzyme kinetics; such applications remain the core topics of mathematical biology texts today. As noted, statistics and numerical analysis were considerably empowered by computers, but later these disciplines in turn contributed substantially to entirely new methods such as machine learning and multiscale mathematical modeling that are now central elements of bioinformatics.

Today’s systems biology has a pedigree extending back at least to the first half of the 20th century. The biologist Ludwig von Bertalanffy began work on his holistic General System Theory then [20], while Norbert Wiener’s cybernetics added an engineering math perspective in the 1950s encompassing feedback and regulatory systems that was influenced not only by early computer science, but also by evolutionary biology and cognitive science [21]. Network theory is often attributed to Gestalt social psychologists in the 1930s, but was productively merged with mathematical graph theory by 1956 [22].

Developmental biology began a long flirtation with math upon the publication in 1917 of D’Arcy Thompson’s On Growth and Form, which was technically elegant and visually striking, albeit mostly descriptive [23]. Computing pioneer Alan Turing turned to biology during the tragic denouement of his life and was responsible in 1952 for a classic work in spatial modeling of morphogenesis [24], proposing a reaction-diffusion model of pattern formation that has only recently gained strong experimental support [25]. In this period Turing used the Manchester University Mark I, another trailblazing stored-program machine, to model biological growth in systems such as the Fibonacci patterns in fir cones described by D’Arcy Thompson [26]. Turing’s labors on these problems are evident in page after page of calculations interspersed with dense machine code subroutines set down in his own hand, now archived at King’s College, Cambridge [27].

Turing’s Legacy

Turing’s bequest to biology is far more sweeping, though, insofar as bioinformatics would eventually embody a broad computational mathematization of the life sciences. The changes would be not only quantitative but also qualitative. As Fisher realized, “conventional” applications of numerical analysis could be taken to new levels, visualized as never before, and often freed from the necessity of closed-form solutions, by the sheer power of computers. But qualitatively, Turing’s first efforts at biological computing began to shift the focus from the equations to the phenomena, from calculation to modeling. Moreover, Turing’s overall legacy would soon foster a new perspective founded in discrete math, information theory, and symbolic reasoning, catalyzing trends that may already have been inchoate in the new molecular biology.

It is interesting to speculate whether Turing’s turn toward biology, had he lived much past the discovery of the double helix, would have caused him to recognize and embrace this pivotal moment when biology became digital. He could not have failed to remark (as others soon would [28,29]) how biological macromolecules incarnated his virtual automata, with biopolymers for tapes and enzymes to read and write them. Moreover, as a veteran of Bletchley Park and the wartime cryptanalysis effort, he might well have been drawn into the frenzy to decipher the genetic code that played out in the decade after his death.

In 1943 Turing had visited the US to share British codebreaking methods and met often with Claude Shannon, who was working on similar problems at Bell Labs [30]. Shannon’s efforts on cryptanalysis were closely tied to his work in communication that, within the decade, would give rise to the new field of information theory. Turing took the opportunity to show him his 1936 paper on the Universal Turing Machine, since Shannon had been responsible in 1937 for the first rigorous application of Boolean logic as a formal basis for digital design, which to that point had comprised much more ad hoc arrangements of circuit elements. This contribution, which constituted Shannon’s Master’s thesis, is accorded great significance in the history of computing, but what has been all but forgotten is his 1940 PhD thesis, entitled “An Algebra for Theoretical Genetics” [31]. In this work, Shannon formalized population genetics just as he had circuit design, after spending an instructive summer at the Cold Spring Harbor Laboratory. Today it would be labeled bioinformatics.

One is left to wonder whether Turing and Shannon ever touched on biology during their lunchtime discussions. The geneticist James Crow feels that Shannon might well have extended his PhD work to have significant impact in the field but for the fact that he was drawn irresistibly to communication theory, first by the war and then by the lush technical milieu of Bell Labs [32]. It is intriguing to think that two giants of computer science and mathematics may have come so close to committing their careers to biology.

Enter the Physicists

Instead it was physicists, some of them veterans of the Manhattan Project, who migrated to the new molecular biology and helped imbue it with their mathematical sensibilities. The attraction can be discerned in Erwin Schrödinger’s famous
wartime lectures and 1946 book What is Life? [33], which influenced Francis Crick and in turn was stimulated by the work of physicist-turned-biologist Max Delbrück, mentor to James Watson. In this slim volume, Schrödinger posits that chromosomes constitute Morse-like “code-scripts” of which “the all-penetrating mind, once conceived by Laplace, to which every causal connection lay immediately open, could tell from their structure whether the egg would develop … into a black cock or a speckled hen …” (pp. 20–21). Later, he suggests that some such executive in fact resides in the chromosomes themselves: that they are not only script but also machinery. This programmatic conceit, in itself strikingly evocative of Turing’s self-referential automata and associated proofs, foretold the scramble to solve the puzzle of how the DNA sequence mapped to the other structures of life.

One of the first responses Watson and Crick had to their seminal 1953 paper was a letter from the physicist George Gamow, unknown to them, who 5 years before had proposed the Big Bang [34]. Gamow was already fascinated by biology, being friends with Delbrück and having published a popularization of a broad swath of science entitled One Two Three…Infinity, which included an exposition of fly genetics showing Morgan and Sturtevant’s map [35]. Gamow’s remarkable letter reimagined the DNA in each chromosome as a long number written in base four, so as to open up its analysis to number theory. He was soon calling this “the number of the beast,” suggesting that it varied only slightly among individuals, “whereas the numbers representing the members of two different species must show larger differences” [36]. Not only did Gamow thus neatly frame the future of sequence bioinformatics, but he went on to pose the question of the genetic code for the first time in purely formal terms, that is, in Crick’s words, “not cluttered up with a lot of unnecessary chemical details” (quoted by Judson [30]). Postulating a collinearity of DNA with proteins (having seen Sanger’s as yet fragmentary insulin sequences), the question for Gamow was how to “translate” the four-letter code to a 20-letter code.

Crick credited him with the simple combinatoric analysis that triplets of DNA bases would suffice [37], but Gamow seems almost to have recoiled from the prodigal degeneracy implied by the leftover information content (i.e., 4³ = 64 triplets for only 20 amino acids). Certainly Gamow’s first model was overly complicated, involving as it did an overlapping and thereby non-degenerate code, as well as attempting to account for a direct translation from the DNA helix to the polypeptide by a physical docking [38]. (This perhaps reflects Schrödinger’s errant instinct that chromosomes should be self-sufficient machines, or just enthusiasm for the astonishing implications of base pairing in the Watson-Crick model.) Still, Gamow set the game in motion, and served with great verve as its master of ceremonies.

Codebreaking

A letter written in 1954 by Gamow to the biologist Martynas Ycas, preserved in the Library of Congress complete with marginal scrawls and cartoon drawings, suggests the tenor of the times: “After the collapse of triplet (major+2 minors) system a new suggestion was made by Edward Teller busy as he was with H bomb, and Oppenheimer. The idea is that each following aa. is defined by two bases … and the preceeding AA. Looks good! The ‘preceeding AA’ is characterized only by beeing [sic] ‘small’, ‘medium’ or ‘large.’ Last week I have discovered in Los Alamos the possibility of putting that system on Maniac, and this seems to be possible” [39].

What is most significant here is not the next ill-conceived model to which Gamow had turned, but rather the reference to MANIAC I, the Mathematical Analyzer, Numerical Integrator and Computer built to do weapons research by Nicholas Metropolis (of Monte Carlo fame) [40]. Once it was known that RNA directed protein synthesis, Gamow and Ycas did indeed use MANIAC to run a series of Monte Carlo simulations, first trying in 1954 to salvage overlapping codes, and when those proved untenable, testing in 1955 whether observed amino acid frequencies in proteins were likely to arise from non-overlapping triplet code translations [41]. (Metropolis also worked with others soon afterwards to computationally model cell multiplication and tumor cell populations [42,43].)

These first MANIAC runs, requiring hundreds of hours, represent a new bioinformatics milestone, extending Turing’s mathematical modeling of outward phenotypic patterns to stochastic modeling of the informational mechanics of life. As Lily Kay remarks, by “blurring the boundary between theory, experiment, and simulation … MANIAC had become the site of an artificial reality” [44]. Among the many scientists whom Gamow induced to take a run at the genetic code was Herbert Simon, who dabbled in this at the very moment he was beginning to apply computers to general problem-solving [44]. Simon would soon co-found the discipline of artificial intelligence, another fundament of bioinformatics, and another field deeply indebted to Turing. Gamow also recruited Robert Ledley, who in 1955 wrote a theoretical paper suggesting how computerized symbolic reasoning could apply not only to the genetic code but also to enzymatic pathways, portending modern pathway inference techniques [45]. Ledley went on to promote computer-based medical diagnosis and protein sequence tools and databases.

The Urge to Model

The non-overlapping code Gamow and Ycas had arrived at by 1955 made an odd assumption, that the order of bases in each triplet was irrelevant. No doubt this was again motivated by a desire to dispose of degeneracy, as this scheme effectively did by collapsing permutation classes, but in some degree it may simply reflect the surrounding upheaval: biology was becoming an information science even as information science itself was aborning. After all, for the first half of the 20th century the prevailing mindset had been that DNA comprised repeating identical tetranucleotides, and that proteins were amorphous with no set linear sequence [46]. In his first letter to Watson and Crick, Gamow even suggested that genes were not localized, but smeared over the chromosome like a Fourier transform [34], his physicist’s instincts flying in the face of all genetics since Morgan and Sturtevant. Gamow’s biochemistry was initially just as naïve. He had scant basis to assume that exactly 20 amino acids were encoded, since others were known to occur naturally, if more rarely, and his first list of 20 actually included some of these and omitted valid ones [37]. Gamow’s quantitative skills and fresh perspective were valuable and he learned quickly (much like computer scientists who came to biology later), but his concerted campaign to deduce the transcriptional and translational machinery on theoretical grounds seems a bit feverish in retrospect.

Even Crick was not immune, proposing a so-called “comma-free” code that utilized relatively few triplets as codons, but artfully chosen such that only one reading frame would be possible [47]. By chance, the math dictated that the capacity of such an unambiguous comma-free triplet code would be exactly 20 codons, making the theory immensely appealing, and dead wrong in the event. However, comma-free
codes (as generalized to prefix codes) assumed great importance in computer science by way of Shannon’s information theory, which strove to quantify, characterize, and ultimately ascribe utility to the very sort of degeneracy with which Gamow was contending [48]. While these theoretical excursions of Gamow and Crick foreshadow the future importance of Turing and Shannon to bioinformatics, they also exemplify how beautiful math, much less numerology, can run afoul of biological reality. Nowadays it is a truism that bioinformatics should not get too far ahead of the data, yet we see that the instrumentalist urge to model is nothing new.

In fact, no amount of computational modeling or theory could by itself have discerned the full details of the genetic code, which by the early 1960s fell to bench scientists like the late Marshall Nirenberg to elucidate by means of cell-free translation systems and radioactive tracers. The US National Institutes of Health maintains in its archives pages from Nirenberg’s lab notebooks, which include sprawling spreadsheet-like tables of hand-entered data, with multiple panels taped together and chaotically annotated [49]. It appears that he was literally drawing conclusions directly on the data sheets, outlining in red pencil the significant entries (as indeed might a cryptographer), such that the genetic code is seen emerging pictorially from the raw data. One senses that the carefully arrayed rows and columns of data, constituting an exhaustive all-against-all probe of triplet codes versus amino acids, were a harbinger of something new in biology; if it were done today, someone would no doubt label it the “codome.”

Codifying Biology

Gamow’s theoretical instincts were very much in the mold of Delbrück who, in his Nobel-winning 1943 paper with Salvador Luria, confirmed the basic tenets of Darwinism in bacteria through a profound interpretation of a trivial experiment [50]; to this end, they deployed reasoning that anticipated by 40 years the stochastic coalescent theory now prominent in population genetics and the analysis of polymorphism [51]. Physicists and statisticians brought to the biological table a degree of comfort with formalism, not only in math but also in language and logic, that would also typify computer science. A similar esteem for logic and formalism was also apparent earlier in the century in the philosophical movement called logical positivism, a major inspiration for Bridgman’s operationalism [6,9].

The logical positivists of the Vienna Circle between the wars felt that the time was ripe to reduce all of science (in fact all knowledge) to a pure empiricism, by which the only admissible statements would be those verifiable by direct observation. In the process they rejected all things metaphysical, and in fact felt that their efforts should go to serving science by following in its wake and providing a “rational reconstruction” of it in symbolic logic and formalized language. This entailed a strongly reductionist view of scientific theories and concepts, and faith in what Rudolf Carnap called the “Unity of Science” [52].

Today, when we codify biology in comprehensive formal ontologies, enforcing the stringent terminological and relational definitions demanded by computational structures, we are following in the footsteps of the Vienna Circle. We should take heed, because logical positivism did not survive the half-century. Among many critics, W. V. O. Quine attacked its reductionist tenets, holding that science is more like what he called a “Web of Belief” than a neat logic diagram, with complex interwoven structures creating mutually supporting bits of evidence and theory [53]. (One would be tempted to load it into Cytoscape.) Quine’s views are more compatible with probabilistic networks and connectionism, and with the current assertions by systems biologists that the 50-year run of reductionism in molecular biology has played itself out [54]. Luckily, bioinformatics is adaptable.

Computing Structures

Crystallographers were early adopters of computers in aid of their laborious calculations of Fourier syntheses and the like, beginning mainly with home-brew analog computers, but by the late 1940s gradually shifting to IBM punchcard tabulators programmed via plugboards (recognizable descendants of those used for the 1890 census) [55]. The first crystallographic applications of stored-program computers were done on EDSAC [56] and the Manchester Mark II [57] in 1952–1953. However, these were used for inorganic structures. The first application of computers to protein crystallography, which some consider the real forerunner of today’s bioinformatics, was in fact for the first high-resolution structure, that of myoglobin, in 1958 [58].

By the 1960s, crystallographers were enthusiastic users of burgeoning computer technology, not just for the tedious core calculations but for many related routines as well; dozens of codes were written in the new FORTRAN and ALGOL programming languages, as opposed to being “hand-coded” at machine level [55]. This activity extended to visualization, including interactive molecular graphics first done by Cyrus Levinthal at the Massachusetts Institute of Technology, using an early time-sharing mainframe connected to an oscilloscope display of a wireframe model controlled by a prototypic trackball [59]. Of this, Levinthal wrote in 1966: “It is too early to evaluate the usefulness of the man-computer combination in solving real problems of molecular biology. It does seem likely, however, that only with this combination can the investigator use his ‘chemical insight’ in an effective way” [59].

Crystallographers went on to accumulate myriad structures and from these gained many “chemical insights” into life. Since the time of Sturtevant, geneticists as well had been doing mutant screens and maps that were undertaken not to test hypotheses in the first instance, but to gather grist for the mill of hypothesis generation. We tend to think of data-driven research as a recent innovation, and of the genome, proteome, and all the other “omes” as concepts uniquely enabled by technology, bioinformatics, and audacious scale. Indeed, omics is sometimes criticized as “high-tech stamp collecting” [60], but this could also have described Darwin’s time on the Beagle. In fact, the groundwork for omics was laid long ago, and with it the data-rich, information-centric modality that came into its own with the advent of computers.

Computing Traits

The first electronic computation of linkage was performed by H. R. Simpson at the Rothamsted Experimental Station (where R. A. Fisher had created the statistical theory of experimental design) in 1958, on an early room-sized business model, the Elliott 401 [61]. However, as noted above and in a recent history by A. W. F. Edwards [62], this introduction of computers to genetics was merely the culmination of a continuous evolution from Mendel, through Morgan and Sturtevant, to Fisher and many other statisticians, theorists, and experimentalists.

The intellectual heirs of Linnaeus and Darwin were beginning to feel the influence of computing in this same period, spearheaded by math. George Gaylord Simpson, who perhaps most embodied the “modern synthesis” of paleontology, ge-

PLoS Computational Biology | www.ploscompbiol.org 5 June 2010 | Volume 6 | Issue 6 | e1000809 netics, and evolution, showed by 1944 how studied characteristics, an existing classifi- computation by biologists was for purposes the mathematics of population genetics cation system for reference, and cladistic of laboratory information management, pioneered by Fisher could relate to the methods with explicit rules and formal with little sense that it would ever be good fossil record [63], and brought a focus to logic for establishing evolutionary histories for more than straightforward data acqui- evolutionary rates that presaged the mo- (despite the tension between pheneticists sition, reduction, and storage. By the same lecular clock hypothesis central to modern and cladists, which is still evident in token, theoretical computer scientists who phylogenetic reconstruction. Simpson had bioinformatics today). In return, comput- first encountered biology sometimes in 1939 co-authored the first book on ers had a prodigious effect on systematics, seemed less interested in nature than in quantitative methods in biology proper shaping the mathematics used, promoting citing motivating examples for string algo- [64], and went on to devise operational formality of methods, and most impor- rithms or combinatoric problems with little metrics for ecologists to assess similarity of tantly, enabling the molecular systematics regard for their practical application. habitats based on the range of taxa found that was about to explode on the scene. In Happily, as with the mutual stimulation in them [65]. 
(Other statisticians provided a few short years, with the work of between biological taxonomy and compu- estimators for species diversity within Dayhoff, Fitch, and many others, protein tational classification methods, the subse- habitats [66], and ecologists were quick structures and evolutionary trees would quent history of bioinformatics took a to adapt Shannon entropy to this purpose come together in a powerful synergy that decidedly more syncretic turn, often as a [67], as eventually would bioinformati- still informs much of bioinformatics. result of felicitous collaborations. cians for sequence motif analysis.) These Sneath later recollected that population Even when individuals are willing, were hand calculations as long as the data biologists proved open to numerical tax- institutions and policies can make or break were limited to a few combinations, but onomy (though Fisher, characteristically, cross-disciplinary studies, in any field. when similarity metrics were adapted by worried that it didn’t have an exact Carnap, the logical positivist, undertook others to classification of species based on statistical basis), while evolutionary biolo- advanced training in both physics and increasing numbers of traits, the problem gists were at first more dubious [74]. philosophy, and wrote a doctoral thesis at soon grew to become as onerous as had Traditional taxonomists felt most threat- the University of Jena on an axiomatiza- been the crystallographers’ hand labors. ened of all; David Hull tells of a conten- tion of space-time. Both the physics and tious meeting where one indignantly philosophy departments found the work Computing Trees asked, ‘‘You mean to tell me that taxon- interesting, but as a dissertation both A phenetic operationalization of taxon- omists can be replaced by computers?’’ turned it away, each saying it was more omy (i.e., clustering by overall similarity) and was answered, ‘‘No, some of you can pertinent to the other field. 
A no doubt invited automation. In 1957, P. H. A. be replaced by an abacus’’ [3]. G. G. exasperated Carnap rewrote it with an Sneath first applied a computer to classi- Simpson himself was receptive but, realiz- undeniable philosophical cast and received fying bacteria, using a relatively advanced ing the tectonic shift that was at hand, was his degree from that department without Elliott 405 [68]; for readers not so almost wistful in addressing his colleagues further ado in 1921 [75]. Many who equipped, he also showed how to simulate (quoted in [73]): ‘‘We may as well realize entered bioinformatics only a few decades the computations by superimposing pho- that the day is upon us when for many of ago might empathize, but it hardly seems tographic negatives on which the data our problems, taxonomic and otherwise, an orphan discipline today, with major were encoded as transparent dots. The freehand observation and rattling off funding initiatives, training programs, and next year he published a follow-up with elementary statistics on desk calculators entire institutes devoted to it. the wonderfully Tom Swiftian title ‘‘An will no longer suffice. The zoologist of the For reasons such as these, a retrospec- Electro-Taxonomic Survey of Bacteria’’ future … often is going to have to work tive view of the roots of bioinformatics is [69]. Then, in 1960 an IBM staff math- with a mathematical statistician, a pro- likely to be a social history as much as ematician, Taffee Tanimoto, worked with grammer, and a large computer. Some of anything, tracing the interaction of scien- David Rogers of the New York Botanical you may welcome this prospect, but others tific disciplines down to the level of Garden to apply computers to plant may find it dreadful’’. university environments, scientific con- classification [70]. (Their similarity metric, claves, individual collaborations, and net- bearing Tanimoto’s name, is commonly The Bioinformatic Synthesis works of interaction. 
Indeed, the impor- used today in cheminformatics to compare tance of the sociology of science to its compounds; in fact, by 1957 there had Despite Simpson’s ambivalence, the progress is considered one of the main already been amazingly advanced work most salient feature of the development of intellectual legacies of Kuhn’s work, even done on computational chemical structure bioinformatics has been its success as an discounting his theories of scientific revo- search by the National Bureau of Stan- interdisciplinary enterprise. The combina- lution [3,9]. dards for the US Patent Office [71].) tion of biology and computer science seems The tentativeness and doubt voiced by Though the idea of quantifying relation- increasingly to be syncretic rather than pioneers like Levinthal and Simpson have ships went back to the previous century, eclectic—not simply one of juxtaposition faded. The insights of Fisher, Turing, and computers thus helped to precipitate the and coexistence, but a substantial merging Shannon now underpin the standard new field of ‘‘numerical taxonomy’’ with of systems with different worldviews, meth- repertoire of bioinformatics tools. The the appearance of the 1963 book of that ods, and cultures. At an even more theoretical intuitions of Delbru¨ck and name by Sneath and Robert Sokal [72], fundamental level, beyond any disciplinary Gamow drive those tools, and the empir- which also broached the idea of extending boundaries, it represents a successful syn- ical sensibilities of Sturtevant, McClintock, numerical approaches to phylogeny. thesis of episteme and techne. and Nirenberg are embedded in them. As related by Joel Hagen [73], compu- At first, it may have appeared more like a Whether this is revolution or evolution, tational research in classification soon marriage of convenience than of true the story of how it came to pass—the roots came to be driven by biological systemat- minds. 
Notwithstanding the examples cited of bioinformatics—should make compel- ics with its very large datasets of well- above, much of the early adoption of ling reading.

References

1. Kuhn TS (1962) The structure of scientific revolutions. Chicago: University of Chicago Press. 210 p.
2. Toulmin SE (1972) Human understanding: the collective use and evolution of concepts. Princeton: Princeton University Press. 520 p.
3. Hull DL (1988) Science as a process: an evolutionary account of the social and conceptual development of science. Chicago: University of Chicago Press. 600 p.
4. Wing JM (2006) Computational thinking. Commun ACM 49: 33–35.
5. Regev A, Shapiro E (2002) Cells as computation. Nature 419: 343.
6. Godfrey-Smith P (2003) Theory and reality: an introduction to the philosophy of science. Chicago: University of Chicago Press. 272 p.
7. Hempel CG (1966) Philosophy of natural science. Englewood Cliffs (New Jersey): Prentice-Hall. 116 p.
8. Pesole G (2008) What is a gene? An updated operational definition. Gene 417: 1–4.
9. Rosenberg S (2005) The philosophy of science: a contemporary introduction. 2nd edition. New York: Routledge. 213 p.
10. Griffiths PE, Stotz K (2007) Gene. In: Hull DL, Ruse M, eds. The Cambridge companion to the philosophy of biology. New York: Cambridge University Press. pp 85–102.
11. Crow JF (1988) A diamond anniversary: the first chromosomal map. Genetics 118: 1–3.
12. Griffiths PE, Stotz K (2006) Genes in the post-genomic era. Theor Med Bioeth 27: 499–521.
13. Gerstein MB, Bruce C, Rozowsky JS, Zheng D, Du J, et al. (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res 17: 669–681.
14. Comfort NC (2001) From controlling elements to transposons: Barbara McClintock and the Nobel Prize. Trends Genet 17: 475–478.
15. Bulmer MG (2003) Francis Galton: pioneer of heredity and biometry. Baltimore: The Johns Hopkins University Press. 376 p.
16. Fisher RA (1930) The genetical theory of natural selection. Variorum edition, 2000. New York: Oxford University Press. 318 p.
17. Fisher RA (1950) Gene frequencies in a cline determined by selection and diffusion. Biometrics 6: 353–361.
18. Wilkes MV, Wheeler DJ, Gill S (1951) The preparation of programs for an electronic digital computer. Cambridge: Addison-Wesley. 167 p.
19. Hagen J (2003) The statistical frame of mind in systematic biology from quantitative zoology to biometry. J Hist Biol 36: 353–384.
20. von Bertalanffy L (1968) General system theory: foundations, development, applications. New York: George Braziller. 289 p.
21. Wiener N (1948) Cybernetics: or control and communication in the animal and the machine. Cambridge: MIT Press. 194 p.
22. Cartwright D, Harary F (1956) Structural balance: a generalization of Heider’s theory. Psychol Rev 63: 277–293.
23. Thompson D’AW (1917) On growth and form. Canto edition, 1992. Cambridge: Cambridge University Press. 346 p.
24. Turing AM (1952) The chemical basis of morphogenesis. Philos Trans R Soc Lond B Biol Sci 237: 37–72.
25. Maini PK, Baker RE, Chuong CM (2006) The Turing model comes of molecular age. Science 314: 1397–1398.
26. Turing AM (2004) The essential Turing: seminal writings in computing, logic, philosophy, artificial intelligence, and artificial life plus the secrets of enigma. Copeland BJ, ed. New York: Oxford University Press. 622 p.
27. Turing Digital Archive (1978) AM Turing’s notes on morphogenesis, contributed by NE Hoskin. Available: http://www.turingarchive.org/browse.php/C/24-27. Accessed 24 May 2010.
28. Pattee HH (1961) On the origin of macromolecular sequences. Biophys J 1: 683–710.
29. Stahl WR, Goheen HE (1963) Molecular algorithms. J Theor Biol 5: 266–287.
30. Hodges A (1992) Alan Turing: the enigma. New York: Walker & Company. 608 p.
31. Shannon CE (1940) An algebra for theoretical genetics [PhD thesis]. Cambridge (Massachusetts): Department of Mathematics, Massachusetts Institute of Technology. Available: http://dspace.mit.edu/handle/1721.1/11174. Accessed 24 May 2010.
32. Crow JF (2001) Shannon’s brief foray into genetics. Genetics 159: 915–917.
33. Schrödinger E (1946) What is life? the physical aspect of the living cell. New York: MacMillan. 91 p.
34. Gamow G (8 July 1953) Letter from G. Gamow to J. D. Watson and F. H. Crick. Appendix in Watson JD (2001) Girls, genes and Gamow: after the double helix. New York: Knopf. 259 p.
35. Gamow G (1947) One two three…infinity. New York: Viking Press. 340 p.
36. Judson HF (1979) The eighth day of creation: the makers of the revolution in biology. New York: Simon and Schuster. 686 p.
37. Crick FH (1990) What mad pursuit: a personal view of scientific discovery. New York: Basic Books. 208 p.
38. Gamow G (1954) Possible relation between deoxyribonucleic acid and protein structures. Nature 173: 318–319.
39. Gamow G (2 July 1954) Letter from G. Gamow to M. Ycas. Available: http://www.loc.gov/exhibits/treasures/trr115.html. Accessed 24 May 2010.
40. Anderson HL (1986) Metropolis, Monte Carlo, and the MANIAC. Los Alamos Science 14: 96–108. Available: http://library.lanl.gov/cgi-bin/getfile?00326886.pdf. Accessed 24 May 2010.
41. Gamow G, Ycas M (1955) Statistical correlation of protein and ribonucleic acid composition. Proc Natl Acad Sci USA 41: 1011–1019.
42. Hoffman JG, Metropolis N, Gardiner V (1955) Study of tumor cell populations by Monte Carlo methods. Science 122: 465–466.
43. Gardiner V, Hoffman JG, Metropolis N (1956) Digital computer studies of cell multiplication by Monte Carlo methods. J Natl Cancer Inst 17: 175–188.
44. Kay LE (2000) Who wrote the book of life? a history of the genetic code. Stanford: Stanford University Press. 441 p.
45. Ledley RS (1955) Digital computational methods in symbolic logic, with examples in biochemistry. Proc Natl Acad Sci USA 41: 498–511.
46. Trifonov EN (2000) Earliest pages of bioinformatics. Bioinformatics 16: 5–9.
47. Hayes B (1998) The invention of the genetic code. Am Sci 86: 8–14.
48. Shannon CE, Weaver W (1949) The mathematical theory of communication. Urbana: University of Illinois Press. 117 p.
49. Office of NIH History, US National Institutes of Health (2004) Photo of Marshall Nirenberg’s laboratory notebook. In Deciphering the genetic code: Marshall Nirenberg [online exhibit]. Available: http://history.nih.gov/exhibits/nirenberg/popup_htm/05_chart_lg.htm. See also Adams J (2008) Sequencing human genome: the contributions of Francis Collins and Craig Venter. Nature Education 1(1). Available: http://www.nature.com/scitable/topicpage/Sequencing-Human-Genome-the-Contributions-of-Francis-686. Accessed 24 May 2010.
50. Luria SE, Delbrück M (1943) Mutations of bacteria from virus sensitivity to virus resistance. Genetics 28: 491–511.
51. Rosenberg NA, Nordborg M (2002) Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat Rev Genet 3: 380–390.
52. Carnap R (1934) The unity of science. London: Kegan. 101 p.
53. Quine WV, Ullian JS (1970) The web of belief. New York: Random House. 96 p.
54. Strange K (2005) The end of “naïve reductionism”: rise of systems biology or renaissance of physiology? Am J Physiol Cell Physiol 288: C968–C974.
55. Cranswick LMD (2008) Busting out of crystallography’s Sisyphean prison: from pencil and paper to structure solving at the press of a button. Acta Crystallogr A64: 65–87.
56. Bennett JM, Kendrew JC (1952) The computation of Fourier syntheses with a digital electronic calculating machine. Acta Crystallogr 5: 109–116.
57. Ahmed FR, Cruickshank DWJ (1953) Crystallographic calculations on the Manchester University electronic digital computer (Mark II). Acta Crystallogr 6: 765–769.
58. Kendrew JC, Bodo G, Dintzis HM, Parrish RG, Wyckoff H, Phillips DC (1958) A three-dimensional model of the myoglobin molecule obtained by X-ray analysis. Nature 181: 662–666.
59. Levinthal C (1966) Molecular model-building by computer. Sci Am 214: 42–52.
60. Hunter DJ (2006) Genomics and proteomics in epidemiology: treasure trove or “high-tech stamp collecting”? Epidemiology 17: 487–489.
61. Simpson HR (1958) The estimation of linkage by an electronic computer. Ann Hum Genet 22: 356–361.
62. Edwards AWF (2005) Linkage methods in human genetics before the computer. Hum Genet 118: 515–530.
63. Simpson GG (1944) Tempo and mode in evolution. New York: Columbia University Press. 237 p.
64. Simpson GG, Roe A (1939) Quantitative zoology: numerical concepts and methods in the study of recent and fossil animals. New York: McGraw-Hill. 414 p.
65. Simpson GG (1960) Notes on the measurement of faunal resemblance. Am J Sci 258A: 300–311.
66. Simpson EH (1949) Measurement of diversity. Nature 163: 688.
67. Margalef R (1958) Information theory in ecology. Gen Syst 3: 36–71.
68. Sneath PHA (1957) The application of computers to taxonomy. J Gen Microbiol 17: 201–226.
69. Sneath PHA, Cowan ST (1958) An electro-taxonomic survey of bacteria. J Gen Microbiol 19: 551–565.
70. Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132: 1115–1118.
71. Ray LC, Kirsch RA (1957) Finding chemical records by digital computers. Science 126: 814–819.
72. Sokal RR, Sneath PHA (1963) Principles of numerical taxonomy. San Francisco: WH Freeman. 359 p.
73. Hagen JB (2001) The introduction of computers into systematic research in the United States during the 1960s. Stud Hist Phil Biol & Biomed Sci 32: 291–314.
74. Sneath PHA (1995) Thirty years of numerical taxonomy. Systematic Biol 44: 281–298.
75. Murzi M (2001) Rudolf Carnap. In: Fieser J, Dowden B, eds. The Internet encyclopedia of philosophy. Available: http://www.iep.utm.edu/carnap. Accessed 24 May 2010.
