
What is Biological Informatics? ( http://www.esp.org/rjr/canberra.pdf )

Robert J. Robbins Fred Hutchinson Cancer Research Center 1100 Fairview Avenue North, LV-101 Seattle, Washington 98109

[email protected] http://www.esp.org/rjr (206) 667 2920

Abstract

In the last 25 years, Moore's Law has transformed society, delivering exponentially better computers at exponentially lower prices. Biological informatics is the application of powerful, affordable information technology to the problems of biology. With $2500 desktop PCs now delivering more raw computing power than the first Cray, bioinformatics is rapidly becoming the critical technology for 21st-century biology.

DNA is legitimately seen as a biological mass-storage device, making bioinformatics a sine qua non for genomic research. Other areas of biological investigation are equally information rich: an exhaustive tabulation of the Earth's biodiversity would involve a cross-index of the millions of known species against the approximately 500,000,000,000,000 square meters of the Earth's surface.
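The surface-area figure in the abstract can be checked with a few lines of arithmetic. The sketch below assumes a spherical Earth with a mean radius of 6,371 km; the species count is a placeholder chosen only to stand in for "millions" and is not a figure from the talk.

```python
import math

# Assumption: spherical Earth with mean radius 6,371 km.
EARTH_MEAN_RADIUS_M = 6.371e6

# Assumption: placeholder for "millions of known species"; not from the talk.
KNOWN_SPECIES_ASSUMED = 2_000_000

# Surface area of a sphere: 4 * pi * r^2, in square meters.
surface_m2 = 4 * math.pi * EARTH_MEAN_RADIUS_M ** 2

# One cell per (species, square meter) pair in an exhaustive cross-index.
cross_index_cells = KNOWN_SPECIES_ASSUMED * surface_m2

print(f"Earth surface: {surface_m2:.2e} square meters")
print(f"Cross-index cells: {cross_index_cells:.2e}")
```

The computed area is close to the 500,000,000,000,000 square meters quoted above; the size of the cross-index scales linearly with whatever species count is assumed.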

Bioinformatics is also becoming a scholarly discipline in its own right, melding information science with computer science, seasoning it with engineering methods, and applying it to the most information-rich component of the known universe: the Biosphere.

What is Biological Informatics?

What are the differences among:
• biological informatics
• bioinformatics
• informatics
• computational biology

What is Bioinformatics?

Semantic tuning:
• X-informatics is the application of information technology to discipline X, with emphasis on persistent data stores.
• Computational X is the application of information technology to discipline X, with emphasis on analytical algorithms.

Bioinformatics is:
• the use of computers (and persistent data structures) in pursuit of biological research.
• an emerging new discipline, with its own goals, research program, and practitioners.
• the sine qua non for 21st-century biology.
• the most significantly underfunded component of 21st-century biology.
• all of the above.

Topics

• Biotechnology and information technology will be the “magic” technologies of the 21st Century.
• Moore’s Law constantly transforms IT (and everything else).
• Information Technology (IT) has a special relationship with biology.
• 21st-century biology will be based on bioinformatics.
• Bioinformatics is emerging as an independent discipline.
• A connected, federated information infrastructure for 21st-century biology is needed.
• Current support for public bio-information infrastructure seems inadequate.

Introduction: Magical Technology

Magic

To a person from 1897, much current technology would seem like magic. What technology of 2097 would seem magical to a person from 1997?

Candidate:
• Biotechnology so advanced that the distinction between living and non-living is blurred.
• Information technology so advanced that access to information is immediate and universal.

Moore’s Law Transforms InfoTech (and everything else)

Moore’s Law: The Statement

Every eighteen months, the number of transistors that can be placed on a chip doubles.

Gordon Moore, co-founder of Intel...
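The consequences of an eighteen-month doubling period are easy to work out numerically. The sketch below is illustrative only: it assumes that performance at constant cost tracks the transistor count, which the statement itself does not guarantee, and the function names are invented for this example.

```python
import math

# From the statement above: one doubling every eighteen months.
DOUBLING_PERIOD_YEARS = 1.5


def growth_factor(years, doubling_period=DOUBLING_PERIOD_YEARS):
    """Multiplicative improvement after `years` of steady doubling."""
    return 2 ** (years / doubling_period)


def years_to_factor(factor, doubling_period=DOUBLING_PERIOD_YEARS):
    """Years of steady doubling needed to improve by `factor`."""
    return doubling_period * math.log2(factor)


# Improvement over the 25 years mentioned in the abstract.
print(f"25 years: about {growth_factor(25):,.0f}x")

# Time for cost at constant performance to fall 10,000-fold.
print(f"10,000x: about {years_to_factor(10_000):.1f} years")
```

Under these assumptions, 25 years yields roughly a 100,000-fold improvement, and a 10,000-fold drop in cost at constant performance takes about two decades.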

Moore’s Law: The Effect

[Figure, built up over several slides: a log-scale plot against Time. Performance (constant cost) rises and Cost (constant performance) falls, each on an axis running from 10 to 100,000. Points labeled D, P, C, and A are added to the plot in successive builds.]

Three Phases of Novel IT Applications
• It’s Impossible
• It’s Impractical
• It’s Overdue

Relevance for biology?

Cost (constant performance)

[Figure, built up over several slides: a log-scale cost axis from 1,000 to 10,000,000 plotted against the years 1975 to 2005. Labels added in successive builds appear near the following cost levels: University Purchase (10,000,000), Department Purchase (1,000,000), Project Purchase (100,000), and Personal Purchase (10,000). The final build adds "Unknown Purchases".]

IT-Biology Synergism

IT is Special

Information Technology:
• affects the performance and the management of tasks
• allows the manipulation of huge amounts of highly complex data
• is incredibly plastic (programming and poetry are both exercises in pure thought)
• improves exponentially (Moore’s Law)

Biology is Special

Life is Characterized by:
• individuality
• historicity
• contingency
• high (digital) information content

No law of large numbers, since every living thing is genuinely unique.

IT-Biology Synergism

• Physics needs calculus, the method for manipulating information about statistically large numbers of vanishingly small, independent, equivalent things.
• Biology needs information technology, the method for manipulating information about large numbers of dependent, historically contingent, individual things.

For it is in relation to the statistical point of view that the structure of the vital parts of living organisms differs so entirely from that of any piece of matter that we physicists and chemists have ever handled in our laboratories or mentally at our writing desks.

Erwin Schrödinger. 1944. What is Life?

Genetics as Code

[The] chromosomes ... contain in some kind of code-script the entire pattern of the individual's future development and of its functioning in the mature state. ... [By] code-script we mean that the all-penetrating mind, once conceived by Laplace, to which every causal connection lay immediately open, could tell from their structure whether [an egg carrying them] would develop, under suitable conditions, into a black cock or into a speckled hen, into a fly or a maize plant, a rhododendron, a beetle, a mouse, or a woman.

Erwin Schrödinger. 1944. What is Life?

One Human Sequence

We now know that Schrödinger’s mysterious human “code-script” consists of 3.3 billion base pairs of DNA.

Typed in 10-pitch font, one human sequence would stretch for more than 5,000 miles. Digitally formatted, it could be stored on one CD-ROM. Biologically encoded, it fits easily within a single cell.

A variant of this factoid actually made it into Ripley’s Believe It or Not, but that’s another story...

For details on Ripley’s interest in DNA, see HTTP://LX1.SELU.COM/~rjr/factoids/genlen.html
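The 5,000-mile figure can be reproduced directly. This is a back-of-the-envelope sketch: it assumes one typed character per base pair, and the 2-bits-per-base encoding is a common convention for the four-letter DNA alphabet rather than something the slides specify.

```python
BASE_PAIRS = 3.3e9        # human sequence length, from the slide
CHARS_PER_INCH = 10       # 10-pitch type is ten characters per inch
INCHES_PER_MILE = 63_360  # 5,280 feet * 12 inches

# Assumption: one typed character per base pair.
typed_length_miles = BASE_PAIRS / CHARS_PER_INCH / INCHES_PER_MILE

# Assumption: 2 bits per base (A, C, G, T), no compression or metadata.
BITS_PER_BASE = 2
packed_bytes = BASE_PAIRS * BITS_PER_BASE / 8

print(f"Typed length: about {typed_length_miles:,.0f} miles")
print(f"Packed size: about {packed_bytes / 1e6:,.0f} million bytes")
```

Whether the packed size fits on a particular CD-ROM depends on the disc capacity and on any compression applied, neither of which the slide states.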

Bio-digital Information

DNA is a highly efficient digital storage device:
• There is more mass-storage capacity in the DNA of a side of beef than in all the hard drives of all the world’s computers.
• Storing all of the (redundant) information in all of the world’s DNA on computer hard disks would require that the entire surface of the Earth be covered to a depth of three miles in Conner 1.0 GB drives.

Genomics: An Example

Human Genome Project - Goals

– construction of a high-resolution genetic map of the human genome;
– production of a variety of physical maps of all human chromosomes and of the DNA of selected model organisms;
– determination of the complete sequence of human DNA and of the DNA of selected model organisms;
– development of capabilities for collecting, storing, distributing, and analyzing the data produced;
– creation of appropriate technologies necessary to achieve these objectives.

USDOE. 1990. Understanding Our Genetic Inheritance. The U.S. Human Genome Project: The First Five Years.

Infrastructure and the HGP

Progress towards all of the [Genome Project] goals will require the establishment of well-funded centralized facilities, including a stock center for the cloned DNA fragments generated in the mapping and sequencing effort and a data center for the computer-based collection and distribution of large amounts of DNA sequence information.

National Research Council. 1988. Mapping and Sequencing the Human Genome. Washington, DC: National Academy Press. p. 3

GenBank Totals (Release 103)

DIVISION                                      Entries  Per Cent      Base Pairs  Per Cent

Phage Sequences (PHG)                           1,313    0.074%       2,138,810    0.184%
Viral Sequences (VRL)                          45,355    2.568%      44,484,848    3.834%
Bacterial Sequences (BCT)                      38,023    2.153%      88,576,641    7.634%
Plant, Fungal, and Algal Sequences (PLN)       44,553    2.523%      92,259,434    7.951%
Invertebrate Sequences (INV)                   29,657    1.679%     105,703,550    9.110%
Rodent Sequences (ROD)                         36,967    2.093%      45,437,309    3.916%
Primate Sequences (PRI1–2)                     75,587    4.280%     134,944,314   11.630%
Other Mammals (MAM)                            12,744    0.722%      12,358,310    1.065%
Other Vertebrate Sequences (VRT)               17,713    1.003%      17,040,159    1.469%
High-Throughput Genome Sequences (HTG)          1,120    0.063%      72,064,395    6.211%
Genome Survey Sequences (GSS)                  42,628    2.414%      22,783,326    1.964%
Structural RNA Sequences (RNA)                  4,802    0.272%       2,487,397    0.214%
Sequence Tagged Sites Sequences (STS)          52,824    2.991%      18,161,532    1.565%
Patent Sequences (PAT)                         87,767    4.970%      27,593,724    2.378%
Synthetic Sequences (SYN)                       2,577    0.146%       5,698,945    0.491%
Unannotated Sequences (UNA)                     2,480    0.140%       1,933,676    0.167%
EST1-17                                     1,269,737   71.905%     466,634,317   40.217%

TOTALS                                      1,765,847  100.000%   1,160,300,687  100.000%
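The percentage columns in the Release 103 table follow from the raw counts. The sketch below recomputes them for three divisions copied from the table and adds a mean entry length, which is a derived quantity not shown on the slide; the function names are invented for this example.

```python
# Totals and a few divisions copied from the Release 103 table.
TOTAL_ENTRIES = 1_765_847
TOTAL_BASE_PAIRS = 1_160_300_687

DIVISIONS = {
    # code: (entries, base pairs)
    "EST": (1_269_737, 466_634_317),
    "PRI": (75_587, 134_944_314),
    "BCT": (38_023, 88_576_641),
}


def shares(code):
    """Return (percent of entries, percent of base pairs) for a division."""
    entries, base_pairs = DIVISIONS[code]
    return (100 * entries / TOTAL_ENTRIES,
            100 * base_pairs / TOTAL_BASE_PAIRS)


def mean_entry_length(code):
    """Average base pairs per entry (derived; not shown in the table)."""
    entries, base_pairs = DIVISIONS[code]
    return base_pairs / entries


for code in DIVISIONS:
    entry_pct, bp_pct = shares(code)
    print(f"{code}: {entry_pct:.3f}% of entries, {bp_pct:.3f}% of base pairs, "
          f"mean length {mean_entry_length(code):,.0f} bp")
```

The contrast is the point of the exercise: EST records dominate the entry count while contributing a much smaller share of the base pairs, because each EST entry is short.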

Base Pairs in GenBank

[Figure, shown in two builds: total base pairs in GenBank, on a vertical axis from 0 to 1,200,000,000, plotted against GenBank release numbers 0 to 105, with the years 1987 to 1997 marked along the top.]

Growth in GenBank is exponential. More data were added in the last ten weeks than were added in the first ten years of the project. At this rate, what’s next...

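ABI Bass-o-Matic Sequencer

[Figure: a mock sequencer with a numeric keypad (7 8 9 +, 4 5 6 -, 1 2 3, 0), Enter and Defrost buttons, a display reading TGCGCATCGCGTATCGATAG, and a speed setting in gB/sec.]

In with the sample, out with the sequence...

What’s Really Next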

The post-genome era in biological research will take for granted ready access to huge amounts of genomic data. The challenge will be understanding those data and using the understanding to solve real-world problems...

Base Pairs in GenBank

[Figure: net changes in base pairs per release, on a vertical axis from 0 to 140,000,000, plotted against GenBank release numbers 0 to 105, with the years 1987 to 1997 marked along the top.]

Base Pairs in GenBank (Percent Increase)

[Figure: percent increase by year, 1985 to 1996, on a vertical axis from 0% to 100%; average = 56%.]

Projected Base Pairs

[Figure, built up over several slides: projected base pairs on a log-scale axis from 1 to 100,000,000,000,000, plotted against the years 1990 to 2025. A second axis, marked 5,000, 50,000, and 500,000 opposite the top three decades, shows the projected size of the sequence database as the number of base pairs per individual medical record in the US.]

Assumed annual growth rate: 50% (less than current rate)

Is this crazy?
• One trillion bp by 2015
• 100 trillion by 2025

Maybe not...

Note too: The amount of digital data necessary to store 10^14 bases of DNA is only a fraction of the data necessary to describe the world’s microbial biodiversity at one square meter resolution...
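The projection can be reproduced from the assumed 50% annual growth rate. This sketch takes the Release 103 total of 1,160,300,687 base pairs as the starting point and assumes that release corresponds to 1997, the last year marked on the growth charts; both the starting year and the function names are assumptions made for illustration.

```python
import math

START_YEAR = 1997            # assumption: Release 103 taken as the 1997 baseline
START_BP = 1_160_300_687     # Release 103 total base pairs
ANNUAL_GROWTH = 0.50         # assumed growth rate stated on the slide


def projected_bp(year):
    """Projected database size in base pairs at the start of `year`."""
    return START_BP * (1 + ANNUAL_GROWTH) ** (year - START_YEAR)


def year_reaching(target_bp):
    """Fractional year in which the projection reaches `target_bp`."""
    years = math.log(target_bp / START_BP) / math.log(1 + ANNUAL_GROWTH)
    return START_YEAR + years


print(f"One trillion bp reached around {year_reaching(1e12):.1f}")
print(f"100 trillion bp reached around {year_reaching(1e14):.1f}")

# The second axis pairs 100 trillion bp with 500,000 bp per medical record,
# which implies this many records:
implied_records = 1e14 / 500_000
print(f"Implied number of medical records: {implied_records:,.0f}")
```

Under these assumptions the projection crosses one trillion base pairs shortly before 2015 and 100 trillion around 2025, consistent with the slide.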

21st Century Biology: Post-Genome Era

The Post-Genome Era

Post-genome research involves:
• applying genomic tools and knowledge to more general problems
• asking new questions, tractable only to genomic or post-genomic analysis
• moving beyond the structural genomics of the human genome project and into the functional genomics of the post-genome era

Suggested definition:

• functional genomics = biology

An early analysis:

Walter Gilbert. 1991. Towards a paradigm shift in biology. Nature, 349:99.

Paradigm Shift in Biology

To use [the] flood of knowledge, which will pour across the computer networks of the world, biologists not only must become computer literate, but also change their approach to the problem of understanding life.

Walter Gilbert. 1991. Towards a paradigm shift in biology. Nature, 349:99.

The new paradigm, now emerging, is that all the ‘genes’ will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical. An individual scientist will begin with a theoretical conjecture, only then turning to experiment to follow or test that hypothesis.

Walter Gilbert. 1991. Towards a paradigm shift in biology. Nature, 349:99.

Case of Microbiology

< 5,000 known and described bacteria

5,000,000 base pairs per genome

25,000,000,000 TOTAL base pairs

If a full, annotated sequence were available for all known bacteria, the practice of microbiology would match Gilbert’s prediction.
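The microbiology total is a single multiplication, and comparing it with the GenBank Release 103 total gives a sense of scale. The sketch below treats 5,000 bacteria as an upper bound, as the slide does, and uses its round per-genome figure.

```python
KNOWN_BACTERIA = 5_000             # upper bound from the slide
BASE_PAIRS_PER_GENOME = 5_000_000  # round figure from the slide
GENBANK_TOTAL_BP = 1_160_300_687   # Release 103 total, for comparison

# Sequence needed to cover every known, described bacterium once.
total_bp = KNOWN_BACTERIA * BASE_PAIRS_PER_GENOME

# How many times larger that is than all of GenBank at Release 103.
ratio_to_genbank = total_bp / GENBANK_TOTAL_BP

print(f"Total: {total_bp:,} base pairs")
print(f"About {ratio_to_genbank:.1f} times the Release 103 total")
```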

21st Century Biology: The Science

Fundamental Dogma

[Figure: a chain running DNA, RNA, Proteins, Circuits, Phenotypes, Populations.]

The fundamental dogma of molecular biology is that genes act to create phenotypes through a flow of information from DNA to RNA to proteins, to interactions among proteins (regulatory circuits and metabolic pathways), and ultimately to phenotypes. Collections of individual phenotypes, of course, constitute a population.
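The first two steps of this information flow, DNA to RNA to protein, can be sketched in a few lines. This is a deliberately simplified illustration: it treats the input as the coding strand, uses only the handful of standard-genetic-code codons needed for the example, and the input string is an illustrative fragment rather than a database record.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "AUG": "M", "GUG": "V", "CAU": "H", "CUG": "L", "ACU": "T",
    "CCU": "P", "GAG": "E", "AAG": "K", "UAA": "*",  # "*" marks a stop codon
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read codons in frame from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Assumption: an illustrative coding fragment, not taken from the slides.
dna = "ATGGTGCATCTGACTCCTGAGGAGAAGTAA"
rna = transcribe(dna)
print(rna)
print(translate(rna))
```

A full implementation would need the complete 64-codon table and would have to handle splicing, strand orientation, and reading-frame selection, none of which is attempted here.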

Fundamental Dogma

[Figure, built up over several slides: the chain DNA, RNA, Proteins, Circuits, Phenotypes, Populations, annotated with databases. Existing databases: map databases and GenBank, EMBL, DDBJ beside DNA; SwissPROT, PDB, PIR beside Proteins. Candidate databases, each shown with a question mark: Development, Gene Expression, Metabolism, Regulatory Pathways, Clinical Data, Neuroanatomy, Biodiversity, Molecular Epidemiology, Comparative Genomics.]

Although a few databases already exist to distribute molecular information, the post-genomic era will need many more to collect, manage, and publish the coming flood of new findings.

If this extension covers functional genomics, then “functional genomics” is equivalent to biology.

21st Century Biology: The Literature

Electronic Data Publishing

[Figure: overlapping excerpts of database entries for beta hemoglobin:
• PIR: [HBHU] Hemoglobin beta chain (human, chimpanzee, pygmy chimpanzee, and gorilla); type protein, molecular weight 15867, length 146, sequence beginning VHLTPEEKSAVTALWG.
• GDB: locus detail view; symbol HBB, name hemoglobin, beta; MIM number 141900; location 11p15.5; followed by a polymorphism table of probes and enzymes.
• OMIM: *141900 HEMOGLOBIN--BETA LOCUS, with a summary of the early chromosomal mapping work.
• GenBank: [HUMHBB] Human beta globin region; accession numbers J00179, J00093, J00094, J00096, J00158, J00159, J00160, J00161; keywords and the start of the nucleotide sequence.]

Many findings will be published in structured databases...

Electronic Scholarly Publishing

Scientific papers will also be published in electronic format, perhaps even including some important papers originally published only in print format.

HTTP://WWW.ESP.ORG For example, the ESP site is dedicated to the electronic publishing of scientific and other scholarly materials. Of particular interest are the history of science, genetics, , and genome research.

The Classical Genetics: Foundations series provides ready access to typeset-quality, electronic editions of important publications that can otherwise be very difficult to find.

For example, “Hardy” (of Hardy-Weinberg) is a name well known to most students of biology, but few have access to his original publications dealing with the basic principles of population genetics.

And when they do have access, they may be surprised to find that all of Hardy’s biological writings consist of a single, one-page letter to the editor of Science.

21st Century Biology: The People

Human Resources Issues

• Reduction in need for non-IT staff
• Increase in need for IT staff, especially “information engineers”

In modern biology, a general trend is to convert expert work into staff work and finally into computation. New expertise is required to design, carry out, and interpret continuing work.

Elbert Branscomb: “You must recognize that some day you may need as many computer scientists as biologists in your labs.”

Craig Venter: “At TIGR, we already have twice as many computer scientists on our staff.”

Exchange at DOE workshop on high-throughput sequencing.

New Discipline of Informatics

What is Informatics?

[Figure: a conveyor belt running from Computer Science Research, through Informatics, to Biological Application Programs.]

Informatics is a conveyor belt that carries the fruits of basic computer science research to the biological community, transforming them along the way from theory to practical applications.

Informatics combines expertise from:
• domain science (e.g., biology)
• computer science
• library science
• management science

All tempered with an engineering mindset...

[Figure: Library Science, Computer Science, and Mgt Science on one side; Medical Informatics, Bio Informatics, and Other Informatics on the other; IS in the center, together with Domain Knowledge and Engineering Principles.]

Engineering Mindset

Engineering is often defined as the use of scientific knowledge and principles for practical purposes. While the original usage restricted the word to the building of roads, bridges, and objects of military use, today's usage is more general and includes chemical, electronic, and even mathematical engineering.

Parnas, David Lorge. 1990. Computer, 23(1):17-22.

... or even information engineering.

Engineering education ... stresses finding good, as contrasted with workable, designs. Where a scientist may be happy with a device that validates his theory, an engineer is taught to make sure that the device is efficient, reliable, safe, easy to use, and robust.

Parnas, David Lorge. 1990. Computer, 23(1):17-22.

The assembly of working, robust systems, on time and on budget, is the key requirement for a federated information infrastructure for biology.

Informatics Triangle

An unfolded tetrahedron can represent the relationships among management science (MS), computer science (CS), library science (LS), and biological science (BS), as they connect around a core of information science (IS), or informatics.

[Figure: an unfolded tetrahedron with faces labeled IS and corners labeled MS, CS, LS, and BS.]

This might also be drawn like a methane molecule...

[Figure: IS at the center, bonded to BS, MS, CS, and LS.]

Federated Information Infrastructure

ODN Model

[Figure: the layered Open Data Network model:
• Layer 4, Applications: Financial Services, Interactive Education, Fax, Remote Login, Audio Server, Information Browsing, Teleconferencing, Electronic Mail, Video Server
• Layer 3, Middleware Services: Privacy, Security, File Systems, Name Servers, Service Directories, Electronic Money, Storage Repositories, Multisite Coordination
• Layer 2, Transport Services and Representation Standards (fax, video, audio, text, etc.), above the Open Bearer Service Interface
• Layer 1, ODN Bearer Service
• Network Technology Substrate: Dial-up Modems, LANs, Point-to-Point Circuits, ATM, SMDS, Wireless, Direct Broadcast Satellites, Frame Relay]

Several years ago, a report, Realizing the Information Future, laid out a vision of an Open Data Network, in which any information appliance could be operated over generic networking protocols...

National Information Infrastructure

[Figure: grid partitioning the NII by technology (analog, digital) and by use: commercial uses (E, T, C, other) and non-commercial uses (Edu, Lib, Res)]

A national information infrastructure can be partitioned according to technology (analog vs digital) and usage (commercial vs non-commercial). Scientific use of the internet involves the non-commercial, digital component. Entertainment, telephony, and communication now primarily fall in the commercial, analog realm.

FIIST & NII

The research component of the NII contains a Federated Information Infrastructure for Science and Technology.

[Figure: the NII grid (analog vs digital, commercial vs non-commercial) with the research (Res) component expanded. FIIST (science & technology) divides into FIIS (science) and FIIE (engineering); FIIS contains FIIs for disciplines such as climatology, geography, chemistry, physics, geology, and biology; FII (biology) contains FIIs such as systematics, physiology, structural biology, botany, and genomics; FII (genomics) contains FIIs for human, mouse, E. coli, and Arabidopsis.]

FIIST

[Figure: hierarchy, reconstructed from the slide labels]

FIIST (science & technology)
  FIIS (science)
    FII (climatology)
    FII (geography)
    FII (chemistry)
    FII (physics)
    FII (geology)
    FII (biology)
    FII (•••)
  FIIE (engineering)

FIIB

[Figure: hierarchy, reconstructed from the slide labels]

FII (biology)
  FII (ecology)
  FII (systematics)
  FII (physiology)
  FII (structural biology)
  FII (botany)
  FII (zoology)
  FII (genomics)
    FII (human)
    FII (mouse)
    FII (E. coli)
    FII (Arabidopsis)
    FII (•••)
  FII (•••)

Funding for Bio-Information Infrastructure

Call for Change

Among the many new tools that are or will be needed (for 21st-century biology), some of those having the highest priority are:
• bioinformatics
• computational biology
• functional imaging tools using biosensors and biomarkers
• transformation and transient expression technologies
• nanotechnologies

Impact of Emerging Technologies on the Biological Sciences: Report of a Workshop. NSF-supported workshop, held 26-27 June 1995, Washington, DC.

The Problem

• IT moves at “Internet Speed” and responds rapidly to market forces.
• IT will play a central role in 21st Century biology.
• Current levels of support for public bio-information infrastructure are too low.
• Reallocation of federal funding is difficult, and subject to political pressures.
• Federal-funding decision processes are ponderously slow and inefficient.

Public Funding of Bio-Databases

The challenges:
• providing adequate funding levels
• making timely, efficient decisions

IT Budgets
A Reality Check

Rhetorical Question

Which is likely to be more complex:
• identifying, documenting, and tracking the whereabouts of all parcels in transit in the US at one time
• identifying, documenting, and analyzing the structure and function of all individual genes in all economically significant organisms; then analyzing all significant gene-gene and gene-environment interactions in those organisms and their environments

Business Factoids

United Parcel Service:
• uses two redundant 3 Terabyte (yes, 3000 GB) databases to track all packages in transit.
• has 4,000 full-time employees dedicated to IT
• spends one billion dollars per year on IT
• has an income of 1.1 billion dollars, against revenues of 22.4 billion dollars

Business Comparisons

Company                    Revenues ($)       IT Budget ($)    Pct of Revenues
Chase-Manhattan          16,431,000,000       1,800,000,000        10.95 %
AMR Corporation          17,753,000,000       1,368,000,000         7.71 %
Nation’s Bank            17,509,000,000       1,130,000,000         6.45 %
Sprint                   14,235,000,000         873,000,000         6.13 %
IBM                      75,947,000,000       4,400,000,000         5.79 %
MCI                      18,500,000,000       1,000,000,000         5.41 %
Microsoft                11,360,000,000         510,000,000         4.49 %
United Parcel            22,400,000,000       1,000,000,000         4.46 %
Bristol-Myers Squibb     15,065,000,000         440,000,000         2.92 %
Pfizer                   11,306,000,000         300,000,000         2.65 %
Pacific Gas & Electric   10,000,000,000         250,000,000         2.50 %
Wal-Mart                104,859,000,000         550,000,000         0.52 %
K-Mart                   31,437,000,000         130,000,000         0.41 %

Federal Funding of Biomedical-IT

Appropriate funding level:
• approx. 5-10% of research funding
• i.e., 1-2 billion dollars per year

Source of estimate:
- Experience of IT-transformed industries.
- Current support for IT-rich biological research.
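
The percentages in the Business Comparisons table, and the arithmetic behind the 1-2 billion dollar estimate, can be checked in a few lines. The following is only an illustrative sketch: the revenue and IT-budget figures are copied from the table, the function names are hypothetical, and the research base of roughly 20 billion dollars is not stated on the slides. It is simply the figure implied by pairing 1 billion with 5% and 2 billion with 10%, which is itself an assumption about how the range was meant.

```python
# Revenues and IT budgets in US dollars, copied from the Business Comparisons table.
COMPANIES = {
    "Chase-Manhattan": (16_431_000_000, 1_800_000_000),
    "AMR Corporation": (17_753_000_000, 1_368_000_000),
    "Nation's Bank": (17_509_000_000, 1_130_000_000),
    "Sprint": (14_235_000_000, 873_000_000),
    "IBM": (75_947_000_000, 4_400_000_000),
    "MCI": (18_500_000_000, 1_000_000_000),
    "Microsoft": (11_360_000_000, 510_000_000),
    "United Parcel": (22_400_000_000, 1_000_000_000),
    "Bristol-Myers Squibb": (15_065_000_000, 440_000_000),
    "Pfizer": (11_306_000_000, 300_000_000),
    "Pacific Gas & Electric": (10_000_000_000, 250_000_000),
    "Wal-Mart": (104_859_000_000, 550_000_000),
    "K-Mart": (31_437_000_000, 130_000_000),
}


def it_share_pct(revenue, it_budget):
    """Return the IT budget as a percentage of revenue, rounded to two decimals."""
    return round(100 * it_budget / revenue, 2)


def implied_research_base(it_dollars, share_pct):
    """Return the research budget for which it_dollars is share_pct percent.

    Hypothetical helper: the slides give only the share and the dollar range,
    so the base computed here is an inference, not a quoted figure.
    """
    return it_dollars * 100 // share_pct


shares = {name: it_share_pct(rev, it) for name, (rev, it) in COMPANIES.items()}

# Assumed pairing of the two ranges: 1 billion at 5%, 2 billion at 10%.
low_base = implied_research_base(1_000_000_000, 5)
high_base = implied_research_base(2_000_000_000, 10)

if __name__ == "__main__":
    for name, pct in shares.items():
        print(f"{name:<24}{pct:>6.2f} %")
    print(f"Implied research base: {low_base:,} to {high_base:,} dollars")
```

Under that pairing both ends of the range imply the same base, so the estimate reads as 5-10% of a fixed research budget rather than as a range of budgets.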

Conference on Biological Informatics

Conference Sessions
• Overview of Biological Informatics
• Biodiversity Informatics
• Environmental Informatics
• Molecular Informatics
• Medical / Neuroinformatics
• Teaching and Training in Informatics

Slides:

http://www.esp.org/rjr/canberra.pdf