GGENESENES IN INAACTIONCTION VIEWPOINT

ECTION The ENCODE (ENCyclopedia Of S DNA Elements) Project

PECIAL The ENCODE Project Consortium*. S

The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional approaches, such as cDNA-cloning efforts elements in the human sequence. The pilot phase of the Project is focused on (4, 5) and chip-based transcriptome analyses a specified 30 megabases (È1%) of the sequence and is organized as (6, 7), have revealed the existence of many an international consortium of computational and laboratory-based scientists transcribed sequences of unknown function. working to develop and apply high-throughput approaches for detecting all sequence As a reflection of this complexity, about 5% elements that confer biological function. The results of this pilot phase will guide of the human genome is evolutionarily future efforts to analyze the entire human genome. conserved with respect to rodent genomic sequences, and therefore is inferred to be With the complete human genome sequence namics; undoubtedly, additional, yet-to-be- functionally important (8, 9). Yet only about now in hand (1–3), we face the enormous defined types of functional sequences will one-third of the sequence under such selec- challenge of interpreting it and learning how also need to be included. tion is predicted to encode (1, 2). to use that information to understand the To illustrate the magnitude of the chal- Our collective knowledge about putative biology of human health and disease. The lenge involved, it only needs to be pointed out functional, noncoding elements, which repre- ENCyclopedia Of DNA Elements (ENCODE) that an inventory of the best-defined func- sent the majority of the remaining functional Project is predicated on the belief that a tional components in the human genome— sequences in the human genome, is remark- comprehensive catalog of the structural and the -coding sequences—is still incom- ably underdeveloped at the present time. functional components encoded in the hu- plete for a number of reasons, including the An added level of complexity is that many man genome sequence will be critical for fragmented of human . Even with functional genomic elements are only active understanding human biology well enough essentially all of the human genome sequence or expressed in a restricted fashion—for to address those fundamental aims of bio- in hand, the number of protein-coding genes example, in certain types or at particular medical research. Such a complete catalog, can still only be estimated (currently 20,000 developmental stages. Thus, one could envi- or Bparts list,[ would include protein-coding to 25,000) (3). Non–protein-coding genes are sion that a truly comprehensive inventory of genes, non–protein-coding genes, transcrip- much less well defined. Some, such as the functional elements might require high- tional regulatory elements, and sequences ribosomal RNA and tRNA genes, were iden- throughput analyses of every human cell type that mediate chromosome structure and dy- tified several decades ago, but more recent at all developmental stages. The path toward executing such a comprehensive study is not *Affiliations for all members of the ENCODE Con- Hartman, J. Rozowsky, O. Emanuelsson, T. Royce, S. clear and, thus, a major effort to determine sortium can be found on Science Online at www. Chung, M. Gerstein, Z. Lian, J. Lian, Y. Nakayama, S. sciencemag.org/cgi/content/full/306/5696/636/DC1. Weissman,V.Stolc,W.Tongprasit,H.Sethi). how to conduct such studies is warranted. .To whom correspondence should be addressed. Additional ENCODE Pilot Phase Participants: To complement ongoing large-scale proj- E-mail: [email protected] British Columbia Cancer Agency Genome Sciences ects that are contributing to the identification ENCODE Project Scientific Management: Centre (S. Jones, M. Marra, H. Shin, J. Schein); Broad of conserved elements in the human genome National Human Genome Research Institute (E. A. Institute (M. Clamp, K. Lindblad-Toh, J. Chang, D. B. Feingold, P. J. Good, M. S. Guyer, S. Kamholz, L. Liefer, Jaffe, M. Kamal, E. S. Lander, T. S. Mikkelsen, J. Vinson, M. (10) (www.intlgenome.org) and to the isola- K. Wetterstrand, F. S. Collins). C. Zody); Children’s Hospital Oakland Research Institute tion and characterization of human full- Initial ENCODE Pilot Phase Participants: (P.J.deJong,K.Osoegawa,M.Nefedov,B.Zhu);National length cDNAs (11), the ENCODE Project Affymetrix, Inc. (T. R. Gingeras, D. Kampa, E. A. Sekinger, Human Genome Research Institute/Computational (www.genome.gov/ENCODE) was launched J. Cheng, H. Hirsch, S. Ghosh, Z. Zhu, S. Patel, A. GenomicsUnit(A.D.Baxevanis,T.G.Wolfsberg); to develop and implement high-throughput Piccolboni, A. Yang, H. Tammana, S. Bekiranov, P. National Human Genome Research Institute/Molecular Kapranov, R. Harrison, G. Church, K. Struhl); Ludwig Genetics Section (F. S. Collins, G. E. Crawford, J. Whittle, methods for identifying functional elements Institute for Cancer Research (B. Ren, T. H. Kim, L. O. I.E.Holt,T.J.Vasicek,D.Zhou,S.Luo);NIHIntramural in the human genome. ENCODE is being Barrera, C. Qu, S. Van Calcar, R. Luna, C. K. Glass, M. G. Sequencing Center/National Human Genome Research implemented in three phases—a pilot phase, Rosenfeld); Municipal Institute of Medical Research (R. Institute (E. D. Green, G. G. Bouffard, E. H. Margulies, M. E. a technology development phase, and a Guigo, S. E. Antonarakis, E. Birney, M. Brent, L. Pachter, Portnoy,N.F.Hansen,P.J.Thomas,J.C.McDowell,B. A. Reymond, E. T. Dermitzakis, C. Dewey, D. Keefe, F. Maskeri,A.C.Young,J.R.Idol,R.W.Blakesley);National production phase. In the pilot phase, the Denoeud, J. Lagarde, J. Ashurst, T. Hubbard, J. J. Library of Medicine (G. Schuler); Pennsylvania State ENCODE Consortium (see below) is eval- Wesselink, R. Castelo, E. Eyras); Stanford University University (W. Miller, R. Hardison, L. Elnitski, P. Shah); uating strategies for identifying various (R. M. Myers, A. Sidow, S. Batzoglou, N. D. Trinklein, S. J. The Institute for Genomic Research (S. L. Salzberg, M. types of genomic elements. The goal of Hartman,S.F.Aldred,E.Anton,D.I.Schroeder,S.S. Pertea, W. H. Majoros); University of California, Santa Marticke,L.Nguyen,J.Schmutz,J.Grimwood,M.Dickson, Cruz (D. Haussler, D. Thomas, K. R. Rosenbloom, H. the pilot phase is to identify a set of pro- G.M.Cooper,E.A.Stone,G.Asimenos,M.Brudno); Clawson, A. Siepel, W. J. Kent). cedures that, in combination, can be applied UniversityofVirginia(A.Dutta,N.Karnani,C.M.Taylor, ENCODE Technology Development Phase Participants: cost-effectively and at high-throughput to H. K. Kim, G. Robins); University of Washington (G. Boston University (Z. Weng, S. Jin, A. Halees, H. Burden, accurately and comprehensively character- Stamatoyannopoulos, J. A. Stamatoyannopoulos, M. U. Karaoz, Y. Fu, Y. Yu, C. Ding, C. R. Cantor); Mas- Dorschner, P. Sabo, M. Hawrylycz, R. Humbert, J. sachusetts General Hospital (R. E. Kingston, J. Dennis); ize large regions of the human genome. The Wallace,M.Yu,P.A.Navas,M.McArthur,W.S.Noble); NimbleGenSystems,Inc.(R.D.Green,M.A.Singer,T.A. pilot phase will undoubtedly reveal gaps in Wellcome Trust Sanger Institute (I. Dunham, C. M. Koch, Richmond,J.E.Norton,P.JFarnham,M.J.Oberley,D.R. the current set of tools for detecting func- R.M.Andrews,G.K.Clelland,S.Wilcox,J.C.Fowler,K.D. Inman);NimbleGenSystems,Inc.(M.R.McCormick,H. tional sequences, and may reveal that some James, P. Groth, O. M. Dovey, P. D. Ellis, V. L. Wraight, A. J. Kim, C. L. Middle, M. C. Pirrung); University of California, methods being used are inefficient or Mungall, P. Dhami, H. Fiegler, C. F. Langford, N. P. Carter, SanDiego(X.D.Fu,Y.S.Kwon,Z.Ye);Universityof D. Vetrie); Yale University (M. Snyder, G. Euskirchen, A. E. Massachusetts Medical School (J. Dekker, T. M. Tabuchi, unsuitable for large-scale utilization. Some Urban, U. Nagalakshmi, J. Rinn, G. Popescu, P. Bertone, S. N. Gheldof, J. Dostie, S. C. Harvey). of these problems should be addressed in

636 22 OCTOBER 2004 VOL 306 SCIENCE www.sciencemag.org G ENES IN A CTION S PECIAL the ENCODE technology development region. To ensure that existing data sets and tium, who are working in a highly collabora- phase (being executed concurrently with knowledge would be effectively utilized, tive way to implement and evaluate a set of the pilot phase), which aims to devise new roughly half of the 30 Mb was selected computational and experimental approaches laboratory and computational methods that manually. The two main criteria used for the for identifying functional elements. The pilot S improve our ability to identify known func- manual selection were as follows: (i) the phase began in September 2003 with the ECTION tional sequences or to discover new func- presence of extensively characterized genes funding of eight projects (table S1) that in- tional genomic elements. The results of the and/or other functional elements; and (ii) the volve the application of existing technologies first two phases will be used to determine availability of a substantial amount of com- to the large-scale identification of a variety of the best path forward for analyzing the parative sequence data. For example, the functional elements in the ENCODE targets, remaining 99% of the human genome in a genomic segments containing the "-and$- specifically genes, promoters, enhancers, cost-effective and comprehensive produc- globin clusters were chosen because of repressors/silencers, , origins of repli- tion phase. the wealth of data available for these loci cation, sites of replication termination, tran- (12). On the other hand, the region encom- scription factor binding sites, methylation ENCODE Targets passing the CFTR (cystic fibrosis trans- sites, (DNase I) hyper- The defining feature of the ENCODE pilot membrane conductance regulator) gene sensitive sites, modifications, and phase is the uniform focus on a selected 30 was selected because of the extensive multispecies conserved sequences of yet un- Mb of the human genome. Each pilot-phase amount of multispecies sequence data avail- known function (Fig. 1). Genetic variation participant will study the entire set of able (13). Once the manual selections had within the conserved sequences is also being ENCODE targets—44 discrete regions that been made, the remaining targets were determined. The methodological approaches together encompass È1% of the human chosen at random by means of an algorithm being employed in the pilot phase include genome. All approaches will thus be tested that ensured that the complete set of targets transcript and chromatin immunoprecipitation– on a relatively large amount of genomic se- represented the range of gene content and microarray hybridization (ChIP-chip; see quence, allowing an assessment of the ability level of nonexonic conservation (relative to below) analyses with different microarray of each to be applied at large scale. The use mouse) found in the human genome. The platforms, computational methods for finding of a single target set will allow the results of locations and characteristics of the 44 genes and for identifying highly conserved different approaches to be directly compared ENCODE target regions (along with addi- sequences, and expression reporter assays. with one another. tional details about their selection) are In addition to the initial eight, other groups The set of ENCODE targets was chosen available at the UCSC ENCODE Genome have since joined the ENCODE Consortium. to represent a range of genomic features Browser (www.genome.ucsc.edu/ENCODE/ These include groups doing comparative (www.genome.gov/10005115). It was first regions_build34.html). sequencing specifically for ENCODE, groups decided that a number of smaller regions coordinating databases for sequence-related (0.5 to 2 Mb) distributed across many dif- The ENCODE Consortium and other types of ENCODE data, and groups ferent chromosomes should be chosen, as The pilot phase is being undertaken by a conducting studies on specific sequence opposed to (for example) a single 30-Mb group of investigators, the ENCODE Consor- elements (table S1). Beyond these current

Hypersensitive Sites

DNA Replication Sites

CH3CO CH3 Epigenetic Modifications

ChIP-chip other Computational DNase Reporter Microarray Predictions Digestion Assays Hybridization and RT-PCR

Gene

Long-range regulatory elements cis-regulatory elements (enhancers, repressors/ silencers, insulators) (promoters, Transcript factor binding sites)

Fig. 1. Functional genomic elements being identified by the ENCODE pilot phase. The indicated methods are being used to identify different types of functional elements in the human genome.

www.sciencemag.org SCIENCE VOL 306 22 OCTOBER 2004 637 GGENESENES IN INAACTIONCTION

participants, the ENCODE Consortium is tion of known elements and for identification determine common patterns of genetic vari- open to all interested academic, government, of heretofore unknown elements. ation (18). This will result in a quantitative

ECTION and private-sector investigators, as long as view of the evolutionary constraints on

S they abide by established Consortium guide- Research Plans conserved regions. lines (www.genome.gov/10006162), which Each group is using one or more high- require the commitment to work on the throughput approaches to detect a specific Data Management and Analysis entire set of ENCODE targets, to partici- genomic element(s). In some cases, multiple Capturing, storing, integrating, and display-

PECIAL pate in all Consortium activities, to make a platforms are being evaluated in comparable ing the diverse data generated will be S significant intellectual contribution, and to experiments. For example, several types of challenging. Data that can be directly linked release data in accordance with the policies microarrays [e.g., oligonucleotide arrays made to genomic sequence will be managed at the specifically established for the project (see by different technologies, polymerase chain UCSC Genome Browser (www.genome. below). reaction (PCR)–amplicon arrays] are being ucsc.edu/ENCODE) (19). Other data types The parallel technology development used to identify transcribed regions. The pilot will be stored either at available public phase (table S2) is intended to expand the project participants are primarily using a databases [e.g., the GEO ( ‘‘tool box’’ available for high-throughput restricted number of cell lines to identify Omnibus) (www.ncbi.nlm.nih.gov/geo) and identification of functional genomic ele- functional elements. This approach is being ArrayExpress (www.ebi.ac.uk/arrayexpress) ments, and includes projects to develop new taken for practical purposes, but it has the sites for microarray data] or on publicly methods both for more efficient identifica- limitation that not all cell types will be accessible Web sites specifically developed surveyed, and therefore some elements with by ENCODE Consortium participants. An tissue-restricted function may not be identi- ENCODE portal will also be established to Platypus Monotremata fied in the pilot phase. To facilitate compari- index these data, allowing users to query Marsupialia Opossum son of data generated on different platforms different data types regardless of location. Elephant Proboscidea Tenrec Insectivora and by different approaches, a common set of Access to metadata associated with each Armadillo Hyracoidea reagents is being included whenever appro- experiment will be provided. The ENCODE Hedgehog Edentata priate. So far, the common reagents chosen pilot phase will make use of the MAGE stan- Shrew Insectivora include two cell lines (HeLa S3, a cervical dard for representing microarray data (20), Bat Chiroptera adenocarcinoma, and GM06990, an Epstein- and data standards for other data types will Bovine Arteriodactyla Barr virus–transformed B-lymphocyte) and be developed as needed. Cat Carnivora two antibodies [one for the general tran- Figure 3 uses the early pilot-phase data Dog scription factor TAFII250 (14) and another for one of the ENCODE target regions Rat for the inducible STAT-1 (ENr231) to illustrate how the ENCODE Mouse Rodentia (15)]. data will be presented in a way that will cap- Guinea pig The ENCODE pilot phase also includes a ture the innovation of the ENCODE Project’s Rabbit Lagomorpha Galago component that is generating sequences of goal of developing a full representation of Mouse lemur the genomic regions that are orthologous to functional genomic elements. The results of Dusky titi the ENCODE target regions from a large set the different analyses are presented as paral- Owl monkey of nonhuman vertebrates (www.nisc.nih.gov/ lel tracks aligned with the genomic sequence. Marmoset open_page.html?/projects/encode/index.cgi) For example, the gene structures for known Colobus Primates (16 ). This will allow ENCODE to identify genes are shown; future ENCODE data will Macaque the quality and amount of comparative se- include the precise locations of 5¶ transcrip- Baboon quence data necessary to accurately identify tion start sites, intron/ boundaries, and 3¶ Orangutan evolutionarily conserved elements, and to polyadenylation sites for each gene in the Chimpanzee develop more powerful computational tools ENCODE targets. In addition, efforts will be Human for using comparative sequence data to infer made to confirm all gene predictions in these Fig. 2. Mammals for which genomic sequence biological function. An initial set of 10 ver- regions. Positions of the evolutionarily con- is being generated for regions orthologous to tebrates have been selected for targeted served regions, as detected by analyzing theENCODEtargets.Genomicsequencesof sequencing on the basis of multiple factors, sequences from several organisms, are shown theENCODEtargetsarebeinggeneratedfor including phylogenetic position (Fig. 2 and can be correlated with results of other the indicated mammalian species. The current and table S3) and the availability of a bacteri- studies. This presentation will allow a com- plans are to produce high-quality finished al artificial chromosome (BAC) library. In prehensive depiction of the results from all of (blue), comparative-grade finished (red), or assembled whole-genome shotgun (green) addition to this ENCODE-specific effort, the ENCODE components, enabling both sequence, as indicated. High-quality finished comparative sequence data are also being comparisons of different methods and of the reflects highly accurate and contiguous se- captured from ongoing whole-genome se- reproducibility of the data from different quence, with a best-faith effort used to resolve quencing projects, including those for mouse, laboratories. The former is illustrated by com- all difficult regions (26). Comparative-grade rat, dog, chicken, cow, chimpanzee, macaque, parison of two different methods for localiz- finished reflects sequence with greater than frog, and zebrafish. A unique RefSeq accession ing regions, one based on reporter eightfold coverage that has been subjected to additional manual refinement to ensure accu- number (17) is being assigned for the sequence constructs containing sequences around puta- rate order and orientation of all sequence of each ENCODE-orthologous target region in tive transcription initiation sites (21) and the contigs (16). In the case of whole-genome each species, with periodic data freezes. other involving chromatin immunoprecipita- shotgun sequence, the actual coverage and A feature of the evolutionarily conserved tion (ChIP) with an antibody to RNA poly- quality may vary. Other vertebrate species for elements to be assayed is sequence variation. merase (RNAP) and hybridization to DNA which sequences orthologous to the ENCODE This will be accomplished by resequencing microarrays (chip) (22) (so-called ChIP-chip) targets are being generated include chicken, frog, and zebrafish (not shown). A complete list PCR-amplified fragments from genomic to identify sequences bound by components of the ENCODE comparative sequencing efforts DNA of 48 individuals, the same samples of the transcriptional machinery. Reproduc- is provided in table S3. being used by the HapMap Consortium to ibility is illustrated by the comparison of

638 22 OCTOBER 2004 VOL 306 SCIENCE www.sciencemag.org G ENES IN A CTION S PECIAL

Base Position 148450000 148500000 148550000 148600000 148650000 148700000 148750000 148800000 148850000 Human mRNAs Conservation chimp mouse rat S

chicken ECTION RNA Transcripts / Affymetrix RNA Transcripts / Yale RefSeq Genes MGC Genes Promoters / Stanford ChIP-RNAP / Affymetrix ChIP-RNAP / Ludwig DNaseI HS / NHGRI DNaseI HS / DNA Rep 0-2 hr / UVa DNA Rep 2-4 hr / UVa

Fig. 3. UCSC Genome Browser display of representative ENCODE data. The sequences tested for promoter activity in a reporter assay is labeled as genomic coordinates for ENCODE target ENr231 on chromosome 1 are Promoters/Stanford. The positions of transcripts identified by oligonucleo- indicated along the top. The different tracks are labeled at the left with the tide microarray hybridization (RNA Transcripts/Affymetrix and RNA Tran- source of the data. The Conservation track shows a measure of evolutionary scripts/Yale) and sequences detected by ChIP/chip analysis (ChIP-RNAP/ conservation based on a phylogenetic hidden Markov model (phylo-HMM) Ludwig and ChIP-RNAP/Affymetrix) are indicated. The DNA replication (27). Multiz (28) alignments of the human, chimpanzee, mouse, rat, and tracks show segments that are detected to replicate during specified chicken assemblies were used to generate the species tracks. RefSeq Genes, intervals of S phase in synchronized HeLa cells. The 0-2 hours and 2-4 hours MGC Genes indicate the mapping of mRNA transcripts from RefSeq (17) tracks show segments that replicate during the first and second 2 hours and MGC (24) projects, respectively, whereas the track labeled Human periods of S phase. The DNA fragments released by DNase I cleavage were mRNAs represents all mRNAs in GenBank. Other tracks represent verified identified either from CD4þ cells (DNaseI HS/NHGRI) or from K562 cells data from the ENCODE Consortium (23). The track with the location of (DNaseI HS/Regulome). studies conducted by two laboratories within First, ‘‘data verification’’ refers to the assess- best path forward for the identification of the ENCODE Consortium, which analyzed ment of the reproducibility of an experi- functional elements in the human genome. different biological starting materials using ment; ENCODE data will be released once By the conclusion of the pilot phase, the 44 the ChIP-chip approach and found an 83% they have been experimentally shown to be ENCODE targets will inevitably be the most concordance in the identified RNAP-binding reliable. Because different types of experi- well-characterized regions in the human sites in the region (23). Representation of the mental data will require different means of genome and will likely be the basis of many ENCODE data in this manner will afford a demonstrating such reliability, the Consor- future genome studies. For example, other synthetic and integrative view of the func- tium will identify a minimal verification large genomics efforts, such as the Mamma- tional structure of each of the target regions standard necessary for public release of each lian Gene Collection (MGC) program (24) and, ultimately, the entire human genome. data type. These standards will be posted and the International HapMap Project (25), Each research group will analyze and on the ENCODE Web site. Subsequently, are already coordinating their efforts to publish its own data to evaluate the experi- ENCODE pilot-phase participants will use ensure effective synergy with ENCODE ac- mental methods it is using and to elucidate other experimental approaches to ‘‘validate’’ tivities. Although the identification of all new information about the biological function the initial experimental conclusion. This en- functional genomic elements remains the goal of the identified sequence elements. In addi- riched information will also be deposited in of the ENCODE project, attaining such tion, the Consortium will organize, analyze, the public databases. comprehensiveness will be challenging, and publish on all ENCODE data available on Second, the report of the Fort Lauderdale especially in the short term. For example, specific subjects, such as multiple sequence meeting recognized that deposition in a not all types of elements, such as centromeres, alignments, gene models, and comparison of public database is not equivalent to publica- telomeres, and other yet-to-be defined ele- different technological platforms to identify tion in a peer-reviewed journal. Thus, the ments, will be surveyed in the pilot project. specific functional elements. At the conclusion NHGRI and ENCODE participants respect- Nonetheless, this project will bring together of the pilot phase, the ENCODE Consortium fully request that, until ENCODE data are scientists with diverse interests and exper- expects to compare different methods used by published, users adhere to normal scientific tise, who are now focused on assembling the the Consortium members and to recommend a etiquette for the use of unpublished data. functional catalog of the human genome. set of methods to use for expanding this project Specifically, data users are requested to cite into a full production phase on the entire the source of the data (referencing this References and Notes genome. paper) and to acknowledge the ENCODE 1. International Human Genome Sequencing Consor- tium, Nature 409, 860 (2001). Consortium as the data producers. Data users 2. J. C. Venter et al., Science 291, 1304 (2001). Data Release and Accessibility are also asked to voluntarily recognize the 3. International Human Genome Sequencing Consor- The National Human Genome Research In- interests of the ENCODE Consortium and its tium, Nature, in press. 4. Y. Okazaki et al., Nature 420, 563 (2002). stitute (NHGRI) has identified ENCODE as members to publish initial reports on the 5. T. Ota et al., Nat. Genet. 36, 40 (2004). a ‘‘community resource project’’ as defined generation and analyses of their data as pre- 6. J. L. Rinn et al., Genes Dev. 17, 529 (2003). ataninternationalmeetingheldinFort viously described. Along with these publi- 7. P. Kapranov, V. I. Sementchenko, T. R. Gingeras, Brief Lauderdale in January 2003 (www.wellcome. cations, the complete annotations of the Funct. Genomic Proteomic 2, 47 (2003). 8. International Rat Sequencing Consortium, Nature ac.uk/doc_WTD003208.html). Accordingly, functional elements in the initial ENCODE 428, 493 (2004). the ENCODE data-release policy (www. targets will be made available at both the 9. International Mouse Sequencing Consortium, Nature genome.gov/ENCODE) stipulates that data, UCSC ENCODE Genome Browser and the 420, 520 (2002). 10. D. Boffelli, M. A. Nobrega, E. M. Rubin, Nat. Rev. once verified, will be deposited into public ENSEMBL Browser (www.ensembl.org). Genet. 5, 456 (2004). databases and made available for all to use 11. T. Imanishi et al., PLoS Biol. 2, 856 (2004). without restriction. Conclusion 12. T. Evans, G. Felsenfeld, M. Reitman, Annu. Rev. Cell Biol. 6, 95 (1990). Two concepts associated with this data- ENCODE will play a critical role in the next 13. J. W. Thomas et al., Nature 424, 788 (2003). release policy deserve additional discussion. phase of genomic research by defining the 14. S. Ruppert, E. H. Wang, R. Tjian, Nature 362, 175 (1993).

www.sciencemag.org SCIENCE VOL 306 22 OCTOBER 2004 639 GGENESENES IN INAACTIONCTION

15. K. Shuai, C. Schindler, V. R. Prezioso, J. E. Darnell Jr., 23. ENCODE Consortium, unpublished data. G. Weinstock, G. Churchill, M. Eisen, S. Elgin, S. Elledge, Science 258, 1808 (1992). 24. Mammalian Gene Collection (MGC) Project Team, J. Rine, and M. Vidal. We thank D. Leja, and M. 16. R. W. Blakesley et al., Genome Res. 14, 2235 (2004). Genome Res.14, 2121 (2004). Cichanowski for their work in creating figures for this

ECTION 17. K. D. Pruitt, T. Tatusova, D. R. Maglott, Nucleic Acids 25. International HapMap Consortium, Nature 426, 789 paper. Supported by the National Human Genome

S Res. 31, 34 (2003). (2003). Research Institute, the National Library of Medicine, 18. International HapMap Consortium, Nat. Rev. Genet. 26. A. Felsenfeld, J. Peterson, J. Schloss, M. Guyer, the Wellcome Trust, and the Howard Hughes Medical 5, 467 (2004). Genome Res. 9, 1 (1999). Institute. 19. W. J. Kent et al., Genome Res. 12, 996 (2002). 27. A. Siepel, D. Haussler, in Statistical Methods in 20. P. T. Spellman et al., Genome Biol 3, RESEARCH0046 Molecular Evolution,R.Nielsen,Ed.(Springer,New

PECIAL (2002). York, in press). Supporting Online Material

S 21. N. D. Trinklein et al., Genome Res. 14, 62 (2004). 28. M. Blanchette et al., Genome Res. 14, 708 (2004). www.sciencemag.org/cgi/content/full/306/5696/636/ 22. B. Ren, B. D. Dynlacht, Methods Enzymol. 376, 304 29. The Consortium thanks the ENCODE Scientific Ad- DC2 (2004). visory Panel for their helpful advice on the project: Tables S1 to S3

VIEWPOINT Systems Biology and New Technologies Enable Predictive and Preventative Medicine

Leroy Hood,1* James R. Heath,2,3 Michael E. Phelps,3 Biaoyang Lin1

Systems approaches to disease are grounded in the idea that disease-perturbed ic knockout strain had a distinct pattern of protein and gene regulatory networks differ from their normal counterparts; we have perturbed gene expression, with hundreds been pursuing the possibility that these differences may be reflected by multi- of mRNAs changing per knockout. About parameter measurements of the blood. Such concepts are transforming current 15% of the perturbed mRNAs potentially en- diagnostic and therapeutic approaches to medicine and, together with new tech- coded secreted proteins (8). If gene expres- nologies, will enable a predictive and preventive medicine that will lead to per- sion in diseased tissues also reveals patterns sonalized medicine. characteristic of pathologic, genetic, or envi- ronmental changes that are, in turn, reflected Biological information is divided into the dig- metabolic networks). The cis control ele- in the pattern of secreted proteins in the ital information of the genome and the envi- ments, together with transcription factors, blood, then perhaps blood could serve as a ronmental cues that arise outside the genome. regulate the levels of expression of individual diagnostic window for disease analysis. Fur- Integration of these types of information leads genes. They also form the linkages and ar- thermore, protein and gene regulatory net- to the dynamic execution of instructions as- chitectures of the gene regulatory networks works dynamically changed upon exposure sociated with the development of organisms that integrate dynamically changing inputs of yeast to an environmental perturbation (9). and their physiological responses to their en- from signal transduction pathways and pro- The dynamic progression of disease should vironments. The digital information of the vide dynamically changing outputs to the similarly be reflected in temporal change(s) genome is ultimately completely knowable, batteries of genes mediating physiological from the normal state to the various stages implying that biology is unique among the and developmental responses (5, 6). The of disease-perturbed networks. sciences, in that biologists start their quest hypothesis that is beginning to revolutionize for understanding systems with a knowable medicine is that disease may perturb the nor- Systems Approaches to Prostate core of information. Systems biology is a sci- mal network structures of a system through Cancer entific discipline that endeavors to quantify genetic perturbations and/or by pathological Cancer arises from multiple spontaneous all of the molecular elements of a biological environmental cues, such as infectious agents and/or inherited mutations functioning in system to assess their interactions and to in- or chemical carcinogens. networks that control central cellular events tegrate that information into graphical net- (10–12). It is becoming clear from our re- work models (1–4) that serve as predictive Systems Approaches to Model Systems search that the evolving states of prostate hypotheses to explain emergent behaviors. and Implications for Disease cancer are reflected in dynamically changing The genome encodes two major types of A model of a metabolic process (galactose expression patterns of the genes and proteins information: (i) genes whose proteins exe- utilization) in yeast was developed from exist- within the diseased cells. cute the functions of life and (ii) cis control ing literature data to formulate a network hy- A first step toward constructing a systems elements. Proteins may function alone, in pothesis that was tested and refined through a biology network model is to build a com- complexes, or in networks that arise from series of genetic knockouts and environmental prehensive expressed-mRNA database on protein interactions or from proteins that perturbations (7). Messenger RNA (mRNA) the cell type of interest. We have used a are interconnected functionally through small concentrations were monitored for all 6000 technology called multiple parallel signature molecules (such as signal transduction or genes in the genome, and these data were sequencing (MPSS) (13) to sequence a com- integrated with protein/protein and protein/ plementary DNA (cDNA) library at a rate of 1Institute for Systems Biology, Seattle, WA, USA. DNA interaction data from the literature by a a million sequences in a single run and to 2Department of Chemistry, California Institute of 3 graphical network program (Fig. 1). detect mRNA transcripts down to one or a Technology, Pasadena, CA, USA. Department of The model provided new insights into the few copies per cell. A database containing Molecular and Medical Pharmacology, The David Geffen School of Medicine at the University of control of a metabolic process and its in- more than 20 million mRNA signatures was California Los Angeles, Los Angeles, CA, USA. teractions with other cellular processes. It constructed for normal prostate tissues and *To whom correspondence should be addressed. also suggested several concepts for systems an androgen-sensitive prostate cancer cell E-mail: [email protected] approaches to human disease. Each genet- line, LNCaP, in four states: androgen-starved,

640 22 OCTOBER 2004 VOL 306 SCIENCE www.sciencemag.org