DATABASES and TOOLS for BROWSING GENOMES Ewan Birney,1 Michele Clamp,2 and Tim Hubbard2


31 Jul 2002 11:44 AR AR167-GG03-12.tex AR167-GG03-12.sgm LaTeX2e(2002/01/18) P1: IBD

Annu. Rev. Genomics Hum. Genet. 2002. 3:293–310
doi: 10.1146/annurev.genom.3.030502.101529
Copyright © 2002 by Annual Reviews. All rights reserved

DATABASES AND TOOLS FOR BROWSING GENOMES

Ewan Birney,1 Michele Clamp,2 and Tim Hubbard2
1European Bioinformatics Institute (EMBL-EBI), 2Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom; e-mail: [email protected], [email protected], [email protected]

Key Words genome sequence, gene prediction, relational database, open source software, distributed annotation

Abstract To maximize the value of genome sequences they need to be integrated with other types of biological data and with each other. The entire collection of data then needs to be made available in a way that is easy to view and mine for complex relationships. The recently determined vertebrate genome sequences of human and mouse are so large that building the infrastructure to manage these datasets is a major challenge. This article reviews the database systems and tools for analysis that have so far been developed to address this.

INTRODUCTION

The human genome sequence represents the first bounded biological dataset concerning our species. Having access to it is a landmark because of the limits it sets on the problem of understanding biology as a whole. It has allowed us to assemble something equivalent to an “edge” of a multidimensional jigsaw puzzle. This is only a first step in completeness, but at least it gives us a feeling for size and boundaries and directs us toward the “middle” of the puzzle that now needs to be filled in. It is the first step toward other complete datasets: the complete set of genes, the complete set of proteins, the complete set of molecular interactions in the cell, etc.
Completeness changes the way we ask questions: “Is this gene involved in this function?” becomes “which gene carries out this function?” These datasets will be determined by a combination of experimental work and computational analysis, but in the context of the genome sequence. Genome sequences provide a framework around which all this biological knowledge can potentially be organized, so each layer of data will lead to a greater understanding of layers of organization of biological systems above it.

The availability of several closely related genome sequences (e.g., mouse, rat) brings the possibility of building lists of molecular features common to all vertebrate species and those that are unique to our own. Evolutionary similarities between individual proteins can be identified across the whole of life. Among vertebrates, conservation between genome sequences goes beyond similarities between protein-coding genes and extends to gene order, resulting in large syntenic blocks of many megabases. Other more distant nonvertebrate genomes, such as fly or worm, cannot be usefully compared to human at the level of chromosomal organization (except in rare cases such as the HOX gene cluster), but orthologous genes and proteins can be identified. Analysis of conserved networks of equivalent proteins in distant species will lead to understanding how common cellular systems work and what makes species different from each other. It will also increase our understanding of how biology accommodates change through evolution and variation within populations of individuals. Refining and extending this set of orthologies will have a substantial impact on the development of medical treatments, as these relate studies of molecular systems in model organisms to human.
Because all biological data is in some way information about how biology as a whole is organized, it is most valuable when systematically organized and integrated. Having these large collections of raw data, which include protein and RNA as well as genome sequences and structures, protein and RNA expression patterns, and cellular localization images, has created a huge need for databases to store information, provide access, and add value. For a current snapshot of the huge range of biological databases a good source is the annual special database issue of Nucleic Acids Research, published each January. It lists 339 databases in its opening review article in 2002 (5).

Long before the first complete genome sequences of free-living cells were determined, groups from around the world had been tackling the issues of (a) building repositories for raw data, (b) adding annotation to this raw data, and (c) providing higher level structure and organization. Examples of repositories are the public DNA sequence databases of EMBL (37), GenBank (6), and DDBJ (38) as well as the public protein structure database PDB (40). Examples of annotation databases are Flybase, which maintained annotation around the genetics of Drosophila long before a genome sequence was available (10), and SwissProt, which adds functional annotation to protein sequences largely originating from mRNA and genomic sequencing (3). Examples of organizational databases are Pfam, which groups protein sequence domains into families, thereby showing evolutionary relationships between paralogous proteins within an organism and orthologous proteins between organisms (4); SCOP, which groups more distantly related proteins together by structural similarity (26); and KEGG, which organizes proteins and ligands into networks of enzymic processes and regulatory networks (20).
These examples are only illustrative, as the roles are not even clear cut in these cases, e.g., the DNA sequence repositories and organizational databases both contain some annotation.

What differentiates the above databases from genome-sequence-based databases is that the former are all founded around primary sets of data that are currently unbounded, whereas in the case of the latter the primary dataset is essentially bounded. We do not know how many protein folds there are. We do know the complete DNA sequence for many organisms. One of the results of the integration of these existing databases with genome-sequence databases is that we can propagate this completeness to identify where the gaps in our knowledge lie. For example, we can identify the places in the human genome where there is evidence for a gene but where we do not yet have the full-length mRNA transcript or the protein sequence for which it codes. We can identify which protein sequences in a complete genome can be associated to a known three-dimensional structure and which need to be targeted for X-ray or nuclear magnetic resonance (NMR) structure determination, such as those determined by structural genomics projects. We can identify which proteins are not part of any known cellular pathway and thus need to be targeted for functional analysis.

A major current challenge for all these database projects is to increase their integration by means that may include propagating information upward from the complete genomes. One of the problems that urgently needs to be addressed in this integration is the maintenance of evidence trails linking derived annotation with the source of its evidence. For example, a protein of unknown function is labeled as a kinase because of a weak sequence homology to another protein that is known to be a kinase.
Later it is discovered that the weak homology between the sequences was false and was due to a frameshift error in one of the protein sequences. Because most databases do not track the relationship between annotation and the evidence that supported it, the ‘kinase’ label is likely to persist even when the justification for it has vanished. Ideally all objects in all databases should have stable, versioned identifiers, and the relationships between them should be recorded so that when a sequence version changes it can be automatically determined that any evidence that relies on it needs to be reevaluated.

When considering errors introduced by propagated annotation, it is worth also remembering the range of quality and completeness of data that is presented in databases as a whole. Ideally we would have a complete set of exact experimental measurements while being able to compute the behavior of biology exactly, in order to predict perturbations to naturally occurring systems. Currently we can do neither. For example, we have experimental methods for extracting and sequencing mRNA from cells; however, there are many transcripts that are present in such small quantities or for such a transient period that they have never been isolated. Similarly, our ability to “compute” biology is currently very limited. As is discussed below, automatic prediction of gene structures in higher organisms gives good levels of accuracy, but it is still prone to a variety of errors. Between these two extremes of prediction and experimental determination lies a third type of annotation, that of manual curation. The combination of human skills and experience, coupled frequently with knowledge of the scientific literature, means that careful manual annotation is less error prone and more complete than automatic annotation, although it is much slower and harder to maintain. In fact, prediction and curation complement each other well.
The volume of data from newly sequenced genomes makes automatic annotation essential, where speed is essentially only limited by the available central processing unit (CPU) and frequent updates are required to take account of the ever-increasing amount of biological data available to support annotations. Therefore when making use of annotation, or ‘data’ (such as protein sequences) based on annotation (gene structures from genome sequence), users must be aware of whether it is based on prediction, curation, or experimentation; what the expected accuracy is; and when it was last updated.
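
The evidence-trail idea discussed above (stable, versioned identifiers plus recorded links between an annotation and the sequences that support it) can be illustrated with a small sketch. This is not the schema of any database reviewed here; the class names, the accessions (PROT_A, PROT_B, and so on), and the `stale_annotations` helper are all hypothetical and chosen only for illustration. Each annotation records whether it comes from prediction, curation, or experiment, when it was last updated, and which versioned objects it relies on, so that a version change flags the annotation for reevaluation.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Dict, List


class Source(Enum):
    """How an annotation was produced (the three types named in the text)."""
    PREDICTION = "prediction"
    CURATION = "curation"
    EXPERIMENT = "experiment"


@dataclass(frozen=True)
class VersionedId:
    """A stable accession plus a version bumped whenever the sequence changes."""
    accession: str
    version: int

    def __str__(self) -> str:
        return f"{self.accession}.{self.version}"


@dataclass
class Annotation:
    """A label on a sequence, with its provenance and its evidence trail."""
    subject: VersionedId
    label: str
    source: Source
    last_updated: date
    evidence: List[VersionedId] = field(default_factory=list)


def stale_annotations(annotations: List[Annotation],
                      current_versions: Dict[str, int]) -> List[Annotation]:
    """Return annotations whose subject or evidence points at an old version.

    Accessions missing from current_versions are treated as unchanged.
    """
    stale = []
    for ann in annotations:
        refs = [ann.subject, *ann.evidence]
        if any(current_versions.get(r.accession, r.version) != r.version
               for r in refs):
            stale.append(ann)
    return stale


# Hypothetical data: PROT_A was labeled a kinase from homology to PROT_B.1.
annotations = [
    Annotation(VersionedId("PROT_A", 1), "kinase", Source.PREDICTION,
               date(2002, 1, 15), [VersionedId("PROT_B", 1)]),
    Annotation(VersionedId("PROT_C", 2), "transporter", Source.CURATION,
               date(2002, 3, 1), [VersionedId("PROT_D", 4)]),
]

# PROT_B has since been corrected (e.g., a frameshift fix), giving version 2.
current_versions = {"PROT_A": 1, "PROT_B": 2, "PROT_C": 2, "PROT_D": 4}

for ann in stale_annotations(annotations, current_versions):
    print(f"re-evaluate {ann.subject}: '{ann.label}' "
          f"({ann.source.value}, updated {ann.last_updated})")
```

In this toy dataset only the ‘kinase’ label on PROT_A is flagged, because its supporting sequence has moved to a newer version. A production system would more likely hold these links in relational tables, but the principle of storing the evidence link alongside the annotation is the same.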