Introduction to Genomes with Ensembl

Dr. Giulietta M. Spudich, Ensembl Outreach Team
21 May 2012

Objectives
• What information about a gene can I find?
• What about a region of the genome?
• How do I navigate the data?

Introduction
• 1977: first genome to be sequenced (5 kb)
• 2004: finished human sequence (3 Gb)
• Large amounts of raw DNA sequence data

Genome Sequencing
Fragment BAC clones → Sequence → Assemble contigs → Assemble scaffolds → Genome sequence

[Slide shows a screenful of raw genomic sequence, beginning CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAG]

The Ensembl genome browser: making it interesting
• Gene: splice variants, proteins, non-coding RNA
• Allele: small- and large-scale sequence variation, phenotype associations
• Conserved sequence: whole genome alignments, protein trees
• Regulation: potential promoters and enhancers, DNA methylation
• User upload, custom data
Figure adapted from the ENCODE project: www.nature.com/nature/focus/encode/

Genome Browsers
• Ensembl Genome Browsers: http://www.ensemblgenomes.org
• NCBI Map Viewer: http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser: http://genome.ucsc.edu

Ensembl is Used Worldwide
Top users: UK, US, Canada, China, France, Germany, Italy, Japan, Spain

Data Volume Challenge
• UniProtKB/Swiss-Prot (reviewed; e.g. Q8IU82): 536,029 protein sequences (25,871 human)
• UniProtKB/TrEMBL: 22,128,511 (217,918 human)
• NCBI RefSeq (reviewed; e.g. NP_006570, NM_006579): 15,744,232 (24,539 human)
www.uniprot.org

A consensus set of protein coding sequences
• Reaching a consensus coding sequence (CCDS) set for human and mouse.
• 26,473 (human), 22,187 (mouse), as of Sept 2011
• If you see a “CCDS ID”, the coding sequence is agreed upon.
Genome Res. 2009 Jul;19(7):1316-23. Epub 2009 Jun 4

What are the gold transcripts?
Figure legend: UTR, Coding, Intron

VEGA/Havana (human, mouse, zebrafish)
• Ensembl: automatic annotation pipeline; gene building all at once (whole genome)
• Havana: manual curation; reviewed by experts
• VEGA: Vertebrate Genome Annotation

Genes and Transcripts in Ensembl
High quality:
• CCDS transcripts
• Ensembl/Havana merged (gold) transcripts

Ensembl/Havana
• Transcripts are from Ensembl, from Havana, or from both (Ensembl/Havana merged, “gold”)
Figure labels: Havana (00_), Ensembl (20_), Havana (00_)

Gene Names in Ensembl
• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• For non-human species, a species code is added after “ENS”: MUS (Mus musculus) for mouse, giving ENSMUSG###; DAR (Danio rerio) for zebrafish, giving ENSDARG###

Ensembl Features
• The gene set
• Comparative analysis
• Variation and regulation
• BioMart (data export)
• Display of external data (DAS)
• Programmatic access via the Perl API
• Open source

Objectives
• What information about a gene can I find?
• What about a region of the genome?
• How do I navigate the data?
See our coursebook for walk-throughs and exercises using our browser: http://www.ensembl.org/info/website/tutorials/coursebook.pdf

Variation
• Nucleotide level
  • Single nucleotide polymorphisms (SNPs)
  • Small insertions and deletions (indels)
  • Microsatellites (short tandem repeats)
• Structural
  • Copy number variations (CNVs)
  • Large insertions and deletions

Sequence displays
• Gene: Sequence
• Transcript: cDNA
• Transcript: Exons

Comparative Genomics
69 species in e!67 (Ensembl release 67)

Ensembl tools

Phenotype for a gene

How is all this information organised?
• Ensembl Views (website)
• Ensembl Database (open source)
• BioMart (“data mining tool”)

Help and documentation
• Comments and questions? [email protected]
• Mailing lists: [email protected], [email protected]
• Course online: www.ensembl.info/ecourse
• Our tutorials page: www.ensembl.org/info/website/tutorials
• YouTube channel: www.youtube.com/user/EnsemblHelpdesk

Follow us
• Facebook: www.facebook.com/Ensembl.org
• Twitter: https://twitter.com/Ensembl
• Come visit our blog! www.ensembl.info

Publications
http://www.ensembl.org/info/about/publications.html
• Flicek, P. et al. Ensembl 2012. Nucleic Acids Res 40:D84-90 (2012). http://nar.oxfordjournals.org/content/40/D1/D84.long
• Xosé M. Fernández-Suárez and Michael K. Schuster. Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010). www.ncbi.nlm.nih.gov/pubmed/20521244
• Giulietta M. Spudich and Xosé M. Fernández-Suárez. Touring Ensembl: A practical guide to genome browsing. BMC Genomics 11:295 (2010). www.biomedcentral.com/1471-2164/11/295

Ensembl Team
• Ensembl: Paul Flicek (EBI), Steve Searle (Wellcome Trust Sanger Institute)
• Software: Andy Yates, Stephen Keenan, Monika Komorowska, Rhoda Kinsella, Thomas Maurel, Kieron Taylor
• Comparative Genomics: Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Matthieu Muffato, Miguel Pignatelli
• Regulation: Ian Dunham, Ikhlak Ahmed, Nathan Johnson, Thomas Juettemann, Steven Wilder
• Variation: Fiona Cunningham, Laurent Gil, Sarah Hunt, Will McLaren, Graham Ritchie, Anja Thormann
• Analysis and Annotation: Bronwen Aken, Amonida Zadissa, Dan Barrell, Susan Fairley, Carlos García Girón, Thibaut Hourlier, Andreas Kähäri, Rishi Nag, Magali Ruffier, Simon White
• Web Team: Anne Parker, Ridwan Amode, Simon Brent, Bethan Pritchard, Harpreet Riat, Dan Sheppard, Steve Trevanion
• Outreach: Giulietta M. Spudich, Jeff Almeida-King, Denise Carvalho-Silva, Bert Overduin, Michael Schuster
• Ensembl Genomes: Paul Kersey, Paul Derwent, Jay Humphrey, Arnaud Kerhornou, Eugene Kulesha, Nick Langridge, Uma Maheswari, Mark McDowall, Michael Nuhn, Helder Pedro, Claudia Rato da Silva, Dan Staines, Iliana Toneva
• Ensembl Strategy: Ewan Birney, Richard Durbin, Paul Flicek, Jen Harrow, Tim Hubbard, Glenn Proctor, Steve Searle
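
Worked example: reading an Ensembl ID
The naming scheme on the “Gene Names in Ensembl” slide (ENS, an optional species code, one letter for the feature type, then a number) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names parse_stable_id, StableId, FEATURE_TYPES and SPECIES_CODES are hypothetical and not part of any Ensembl software, the species table holds only the three species named in the slides, and version suffixes or other ID formats are not handled.

```python
import re
from typing import NamedTuple, Optional

# Feature-type letters taken from the "Gene Names in Ensembl" slide.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Species codes named in the slides; human IDs carry no code.
# Illustrative and deliberately incomplete (an assumption of this sketch).
SPECIES_CODES = {"": "Homo sapiens", "MUS": "Mus musculus", "DAR": "Danio rerio"}

# "ENS", optional species letters, one feature-type letter, then digits.
_STABLE_ID = re.compile(r"^ENS([A-Z]*)([GTPE])(\d+)$")


class StableId(NamedTuple):
    species: Optional[str]  # None when the species code is not in SPECIES_CODES
    feature: str            # "gene", "transcript", "peptide" or "exon"
    number: str             # digit string, leading zeros preserved


def parse_stable_id(stable_id: str) -> StableId:
    """Split an ID such as ENSMUSG00000017167 into its three parts."""
    match = _STABLE_ID.match(stable_id.strip().upper())
    if match is None:
        raise ValueError(f"not an ENS-style ID: {stable_id!r}")
    code, letter, digits = match.groups()
    return StableId(SPECIES_CODES.get(code), FEATURE_TYPES[letter], digits)


if __name__ == "__main__":
    print(parse_stable_id("ENSMUSG00000017167"))
    # StableId(species='Mus musculus', feature='gene', number='00000017167')
```

Because the feature-type letter is always the last letter before the digits, the pattern lets the species code be any run of letters and looks the code up afterwards, so an unlisted species yields species=None rather than an error.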
