DOI: 10.1126/Science.1196914 , 1775 (2010); 330 Science , Et Al. Mark B. Gerstein Modencode Project Genome by the Caenorhabditis

Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project Mark B. Gerstein, et al. Science 330, 1775 (2010); DOI: 10.1126/science.1196914 This copy is for your personal, non-commercial use only. If you wish to distribute this article to others, you can order high-quality copies for your colleagues, clients, or customers by clicking here. Permission to republish or repurpose articles or portions of articles can be obtained by following the guidelines here. The following resources related to this article are available online at www.sciencemag.org (this infomation is current as of December 26, 2010 ): Updated information and services, including high-resolution figures, can be found in the online version of this article at: http://www.sciencemag.org/content/330/6012/1775.full.html Supporting Online Material can be found at: http://www.sciencemag.org/content/suppl/2010/12/20/science.1196914.DC1.html This article cites 65 articles, 22 of which can be accessed free: http://www.sciencemag.org/content/330/6012/1775.full.html#ref-list-1 on December 27, 2010 This article has been cited by 1 articles hosted by HighWire Press; see: http://www.sciencemag.org/content/330/6012/1775.full.html#related-urls This article appears in the following subject collections: Genetics http://www.sciencemag.org/cgi/collection/genetics www.sciencemag.org Downloaded from Science (print ISSN 0036-8075; online ISSN 1095-9203) is published weekly, except the last week in December, by the American Association for the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. Copyright 2010 by the American Association for the Advancement of Science; all rights reserved. The title Science is a registered trademark of AAAS. RESEARCH ARTICLES genes as a human and all of the information necessary to specify the major tissues and cell types of metazoans. Integrative Analysis of the From the project start in 2007 (2), the C. elegans modENCODE groups had by February 2010 collected 237 genome-wide data sets (table Caenorhabditis elegans Genome S1) bearing on gene structure, RNA expression profiling, chromatin structure and regulation, and by the modENCODE Project evolutionary conservation. To ensure the complete- ness and standardization of modENCODE data, 1,2,3 1,2 4 1,2 Mark B. Gerstein, *† Zhi John Lu, * Eric L. Van Nostrand, * Chao Cheng, * all data sets were submitted to the modENCODE 5,6 7,8 1,2 1 9 Bradley I. Arshinoff, * Tao Liu, * Kevin Y. Yip, * Rebecca Robilotto, * Andreas Rechtsteiner, * Data Coordinating Center; hand curated with ex- 10 1 11 5 12 Kohta Ikegami, * Pedro Alves, * Aurelien Chateigner, * Marc Perry, * Mitzi Morris, * tensive, structured metadata; validated for com- 1 5,22 1 13 14,15 Raymond K. Auerbach, * Xin Feng, * Jing Leng, * Anne Vielle, * Wei Niu, * pleteness; and checked for consistency before 12 2,3 1,2 16 4 Kahn Rhrissorrakrai, * Ashish Agarwal, Roger P. Alexander, Galt Barber, Cathleen M. Brdlik, release at www.modencode.org. 10 4 11 13 16 Jennifer Brennan, Jeremy Jean Brouillet, Adrian Carr, Ming-Sin Cheung, Hiram Clawson, Analyses of these data reveal (i) directly sup- 11 17 18 19 38 Sergio Contrino, Luke O. Dannenberg, Abby F. Dernburg, Arshad Desai, Lindsay Dick, ported protein-coding genes containing 5′ and 3′ 18 3 9 10 14 20 Andréa C. Dosé, Jiang Du, Thea Egelhofer, Sevinc Ercan, Ghia Euskirchen, Brent Ewing, ends and alternative splice junctions; (ii) sets of 21 19 21 20 11 Elise A. Feingold, Reto Gassmann, Peter J. Good, Phil Green, Francois Gullier, noncoding RNAs, including RNAs belonging to 12 21 1 23 24 Michelle Gutwein, Mark S. Guyer, Lukas Habegger, Ting Han, Jorja G. Henikoff, known classes and previously unknown types; (iii) 29 16 17 26 17 Stefan R. Henz, Angie Hinrichs, Heather Holster, Tony Hyman, A. Leo Iniguez, gene expression and transcription factor (TF)– 15 10 28 16 5 Judith Janette, Morten Jensen, Masaomi Kato, W. James Kent, Ellen Kephart, binding profiles across developmental stages; (iv) 23 1,2 23 13 30 Vishal Khivansara, Ekta Khurana, John K. Kim, Paulina Kolasinska-Zwierz, Eric C. Lai, genomic locations bound by many of the TFs 13 20 31 5 1 Isabel Latorre, Amber Leahey, Suzanna Lewis, Paul Lloyd, Lucas Lochovsky, analyzed, designated as HOT (high-occupancy 21 32 11 20 33 Rebecca F. Lowdon, Yaniv Lubling, Rachel Lyne, Michael MacCoss, Sebastian D. Mackowiak, target) regions; (v) a hierarchy of candidate regu- 12 34 12 20 Marco Mangone, Sheldon McKay, Desirea Mecenas, Gennifer Merrihew, latory interactions among TFs and its relationship 27 19 20 24 18 on December 27, 2010 David M. Miller III, Andrew Muroyama, John I. Murray, Siew-Loon Ooi, Hoang Pham, to the network of microRNAs (miRNAs) and their 9 20 33 25 17 Taryn Phippen, Elicia A. Preston, Nikolaus Rajewsky, Gunnar Rätsch, Heidi Rosenbaum, targets; (vi) differences in histone modifications 1,2 11 5 26 2 Joel Rozowsky, Kim Rutherford, Peter Ruzanov, Mihail Sarov, Rajkumar Sasidharan, and nuclear-envelope interactions between the 1,2 12 32 7,8 1 28 Andrea Sboner, Paul Scheid, Eran Segal, Hyunjin Shin, Chong Shou, Frank J. Slack, centers and arms of autosomes and between auto- 35 11 27 31 7 Cindie Slightam, Richard Smith, William C. Spencer, E. O. Stinson, Scott Taing, somes and the X chromosome; (vii) evidence 9 20 19 15 31 Teruaki Takasaki, Dionne Vafeados, Ksenia Voronina, Guilin Wang, Nicole L. Washington, for chromatin-mediated epigenetic transmission 10 35 1,2 25,36 5 14 Christina M. Whittle, Beijing Wu, Koon-Kiu Yan, Georg Zeller, Zheng Zha, Mei Zhong, of the memory of gene expression from adult 10 13 9 Xingliang Zhou, modENCODE Consortium,‡ Julie Ahringer, † Susan Strome, † germ cells to embryos; and (viii) predictive mod- 12,37 11 7,8 15 4,35 Kristin C. Gunsalus, † Gos Micklem, † X. Shirley Liu, † Valerie Reinke, † Stuart K. Kim, † els that relate chromatin state to TF-binding sites 20 24 12,37 4,14 LaDeana W. Hillier, † Steven Henikoff, † Fabio Piano, † Michael Snyder, † and to expression levels of protein- and miRNA- www.sciencemag.org 5,6,34 10 20 Lincoln Stein, † Jason D. Lieb, † Robert H. Waterston † encoding genes. The summation of features annotated through these functional data sets provides a potential We systematically generated large-scale data sets to improve genome annotation for the nematode explanation for most of the conserved sequences Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling in the C. elegans genome and lays the foundation – across a developmental time course, genome-wide identification of transcription factor binding for further study of how the genome of a multi- sites, and maps of chromatin organization. From this, we created more complete and accurate cellular organism accurately directs development gene models, including alternative splice forms and candidate noncoding RNAs. We constructed and maintains homeostasis. Downloaded from hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different The Transcriptome patterns of chromatin composition and histone modification were revealed between chromosome Accurate and comprehensive annotation of all arms and centers, with similarly prominent differences between autosomes and the X chromosome. RNA transcripts (the transcriptome) provides a Integrating data types, we built statistical models relating chromatin, transcription factor binding, and framework for interpreting other genomic features, gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome. such as TF-binding sites and chromatin marks. At the project’s inception [WS170; WormBase omplete genome sequences provide a ila melanogaster so as to systematically annotate versions used for specific analyses can be found view of the full instruction set of an or- the functional genomic elements in these orga- in (6)], the C. elegans genome lacked direct ex- Cganism. However, understanding the nisms (2). perimental support for about one third of pre- functional content of a genome requires more Given its intermediate complexity between dicted splice junctions, and some of these than DNA sequence. To address this need, in single-celled eukaryotes and mammals, C. elegans predictions were erroneous (7, 8). Many genes 2003 the U.S. National Human Genome Re- offers an outstanding system for studies of ge- lacked transcript start sites and polyadenylate search Institute (NHGRI) initiated the Encyclo- nome organization and function. C. elegans was [poly(A)] addition sites; exons and even whole pedia of DNA Elements (ENCODE) project in the first multicellular organism with a fully de- genes were missing. To address these deficiencies, order to study the human genome in greater depth fined cell lineage, a nervous system reconstructed cDNA-based evidence was obtained through high- (1). Recognizing the importance of well-annotated through serial-section electron microscopy, and throughput sequencing (RNA-seq), reverse tran- model genomes, in 2007 the NHGRI initiated a sequenced genome (3–5). Its 100.3-Mb genome scription polymerase chain reaction (RT-PCR)/ the model organism ENCODE (modENCODE) is only about eight times larger than that of S. rapid amplification of cDNA ends (RACE), and project on Caenorhabditis elegans and Drosoph- cerevisiae, and yet it contains almost as many tiling arrays from a variety of stages, conditions, www.sciencemag.org SCIENCE VOL 330 24 DECEMBER 2010 1775 RESEARCH ARTICLES and tissues (tables S1, S3, and S4). Analysis of overlapped with those detected with RNA-seq, to annotate protein-coding sequences (CDSs) and the data yielded previously unrecognized protein- providing independent support for 37,830 of 5′ and 3′ untranslated regions (UTRs). coding genes, refined the structure of known these features (fig. S9). In addition, mass spec- For each of the 19 stages and conditions, we protein-coding genes, revealed the dynamics of trometry proved that of 359 tested, 73 single- built transcript sets purely on the basis of RNA-seq expression and alternative splicing, provided evi- exon genes produced protein.

Load more