Introduction to Genomes with Ensembl
Dr. Giulietta M. Spudich
Ensembl Outreach Team
Objectives
What information about a gene can I find? What about a region of the genome? How do I navigate the data?
Introduction
1977: first genome to be sequenced (5 kb)
2004: finished human sequence (3 Gb)
Large amounts of raw DNA sequence data
Genome Sequencing
[Figure: sequencing workflow with the labels Fragment, BAC clones, Sequence, Assemble, Contigs, Scaffolds, Genome sequence]
[Figure: a block of raw genome sequence beginning CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAG]
The Ensembl genome browser: making it interesting
[Figure labels: Regulation, Conserved sequence, Gene, Allele]
• Splice variants, proteins, non-coding RNA
• Small and large scale sequence variation, phenotype associations
• Whole genome alignments, protein trees
• Potential promoters and enhancers, DNA methylation
• User upload, custom data
Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/
21 May 2012
Genome Browsers
• Ensembl Genome Browsers http://www.ensemblgenomes.org
• NCBI Map Viewer http://www.ncbi.nlm.nih.gov/mapview/
• UCSC Genome Browser http://genome.ucsc.edu
Ensembl is Used Worldwide
Top users:
UK, US, Canada, China, France, Germany, Italy, Japan, Spain
Data Volume Challenge
• UniProtKB/Swiss-Prot (reviewed): 536,029 (25,871 human) protein sequences, e.g. Q8IU82
• UniProtKB/TrEMBL: 22,128,511 (217,918)
• NCBI RefSeq (reviewed): 15,744,232 (24,539), e.g. NP_006570, NM_006579
www.uniprot.org
A consensus set of protein coding sequences
• Reaching a consensus coding sequence set for human and mouse.
• 26,473 (human), 22,187 (mouse) (as of Sept 2011)
• If you see a “CCDS ID”, the coding sequence is agreed upon.
Genome Res. 2009 Jul;19(7):1316-23. Epub 2009 Jun 4
What are the gold transcripts?
[Figure legend: UTR, Coding, Intron]
• Ensembl: automatic annotation pipeline; gene building all at once (whole genome)
• VEGA/Havana (human, mouse, zebrafish): manual curation, reviewed by experts
VEGA: Vertebrate Genome Annotation
Genes and Transcripts in Ensembl
High Quality:
• CCDS transcripts
• Ensembl/Havana merged (gold) transcripts
Ensembl/Havana
• Transcripts are from: Ensembl, Havana, or Ensembl/Havana (both, "gold")
[Figure labels: Both ("gold"); Havana (00_); Ensembl (20_)]
Gene Names in Ensembl
• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID
• For non-human species, a species code is inserted after "ENS": MUS (Mus musculus) gives ENSMUSG###; DAR (Danio rerio, zebrafish) gives ENSDARG###
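The naming scheme above can be checked mechanically. The following is a minimal Python sketch, assuming an ID is "ENS", an optional upper-case species code, one of the four feature letters listed on this slide, and a run of digits. The names `parse_stable_id` and `StableId` are hypothetical helpers for illustration, not part of any Ensembl library, and the IDs in the usage note are only examples.

```python
import re
from typing import NamedTuple

# Feature-type letters as listed on the slide.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# "ENS" + optional species code + feature letter + digits.
# The species group is greedy but backtracks, so "ENSG..." yields an empty code.
STABLE_ID = re.compile(r"^ENS([A-Z]*)([GTPE])(\d+)$")


class StableId(NamedTuple):
    species_code: str  # "" for human, "MUS" for mouse, "DAR" for zebrafish
    feature: str       # "gene", "transcript", "peptide" or "exon"
    number: str        # the numeric part, kept as text to preserve leading zeros


def parse_stable_id(stable_id: str) -> StableId:
    """Split an Ensembl-style stable ID into species code, feature type and number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {stable_id!r}")
    species_code, letter, number = match.groups()
    return StableId(species_code, FEATURE_TYPES[letter], number)
```

For example, `parse_stable_id("ENSMUSG00000041147")` reports the species code `MUS` and the feature type `gene`, while `parse_stable_id("ENSG00000139618")` reports an empty species code, which this sketch treats as human.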
Ensembl Features
• The gene set.
• Comparative analysis
• Variation and regulation
• BioMart (data export)
• Display of external data (DAS)
• Programmatic access via the Perl API
• Open Source
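As an illustration of programmatic access, here is a minimal Python sketch. It does not use the Perl API named above; it swaps in Ensembl's REST service (rest.ensembl.org), which these slides do not mention, because it can be called from any language. The helper names `lookup_url` and `lookup` are hypothetical, and the sketch assumes the `lookup/id` endpoint with a JSON content type.

```python
import json
import urllib.request

# Assumption: the public Ensembl REST server, not mentioned in the slides.
REST_SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id: str) -> str:
    """Build the URL that asks the REST service to describe one stable ID."""
    return f"{REST_SERVER}/lookup/id/{stable_id}?content-type=application/json"


def lookup(stable_id: str) -> dict:
    """Fetch the JSON description of a stable ID (requires network access)."""
    with urllib.request.urlopen(lookup_url(stable_id), timeout=30) as response:
        return json.load(response)
```

Calling `lookup("ENSG00000139618")` needs a network connection and returns a dictionary describing that gene; `lookup_url` alone can be used to inspect the request without sending it.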
Objectives
What information about a gene can I find? What about a region of the genome? How do I navigate the data?
See our coursebook for walk-throughs and exercises using our browser: http://www.ensembl.org/info/website/tutorials/coursebook.pdf
Variation
• Nucleotide level
  • Single nucleotide polymorphism (SNP)
  • Small insertions and deletions (InDels)
  • Microsatellites (short tandem repeats)
• Structural
  • Copy number variations (CNV)
  • Large insertions and deletions
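The split between nucleotide-level and structural variation can be sketched by comparing the lengths of the reference and alternate alleles. This is a simplified Python illustration, not how Ensembl classifies variants: the function name `classify_variant` is hypothetical, the 50 bp boundary between "small" and "large" is an assumption, and microsatellites and copy number variations are not handled.

```python
def classify_variant(ref: str, alt: str, large_threshold: int = 50) -> str:
    """Classify a variant from its reference and alternate allele strings.

    large_threshold is an assumed cut-off (in bases) between small and
    large insertions/deletions; the slides do not give a number.
    """
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = abs(len(ref) - len(alt))
    if size == 0:
        # Same length but more than one base differs in position or content.
        return "substitution"
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    scale = "large" if size >= large_threshold else "small"
    return f"{scale} {kind}"
```

With these assumptions, `classify_variant("A", "G")` gives `"SNP"` and `classify_variant("ACGT", "A")` gives `"small deletion"`.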
Sequence displays
• Gene: Sequence
• Transcript: cDNA
• Transcript: Exons
Comparative Genomics
69 species in e!67
Ensembl tools
Phenotype for a gene
How is all this information organised?
• Ensembl Views (Website)
• Ensembl Database (open source)
• BioMart "DataMining tool"
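To make the BioMart entry concrete, here is a minimal Python sketch that builds a BioMart XML query and the URL that would submit it. The function names `build_query` and `query_url` are hypothetical. The service address, the dataset name `hsapiens_gene_ensembl`, and the filter and attribute names in the usage note are assumptions about the current Ensembl mart and may differ between releases.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: the martservice endpoint on the main Ensembl site.
MART_SERVICE = "https://www.ensembl.org/biomart/martservice"


def build_query(dataset: str, filters: dict, attributes: list) -> str:
    """Return a BioMart XML query: filters restrict rows, attributes pick columns."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",        # tab-separated output
        "header": "1",             # include a header row
        "uniqueRows": "1",         # drop duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def query_url(query_xml: str) -> str:
    """URL-encode the XML into the 'query' parameter of the mart service."""
    return MART_SERVICE + "?" + urllib.parse.urlencode({"query": query_xml})
```

A call such as `build_query("hsapiens_gene_ensembl", {"ensembl_gene_id": "ENSG00000139618"}, ["ensembl_gene_id", "external_gene_name"])` produces the XML for a one-gene export; passing the result to `query_url` gives a link that can be opened in a browser or fetched by a script.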
Help and documentation
• Comments and questions? [email protected]
• Mailing lists [email protected], [email protected]
• Course online www.ensembl.info/ecourse
• Our tutorials page www.ensembl.org/info/website/tutorials
• YouTube channel www.youtube.com/user/EnsemblHelpdesk
Follow us
• Facebook www.facebook.com/Ensembl.org
• Twitter https://twitter.com/Ensembl
• Come visit our blog! www.ensembl.info
Publications http://www.ensembl.org/info/about/publications.html
• Flicek, P. et al. Ensembl 2012. Nucleic Acids Res 40:D84-90 (2012) http://nar.oxfordjournals.org/content/40/D1/D84.long
• Xosé M. Fernández-Suárez and Michael K. Schuster Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010) www.ncbi.nlm.nih.gov/pubmed/20521244
• Giulietta M Spudich and Xosé M Fernández-Suárez Touring Ensembl: A practical guide to genome browsing BMC Genomics 11:295 (2010) www.biomedcentral.com/1471-2164/11/295
Ensembl Team
Ensembl: Paul Flicek (EBI), Steve Searle (Wellcome Trust Sanger Institute)
Software: Andy Yates, Stephen Keenan, Monika Komorowska, Rhoda Kinsella, Thomas Maurel, Kieron Taylor
Comparative Genomics: Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Matthieu Muffato, Miguel Pignatelli
Regulation: Ian Dunham, Ikhlak Ahmed, Nathan Johnson, Thomas Juettemann, Steven Wilder
Variation: Fiona Cunningham, Laurent Gil, Sarah Hunt, Will McLaren, Graham Ritchie, Anja Thormann
Analysis and Annotation: Bronwen Aken, Amonida Zadissa, Dan Barrell, Susan Fairley, Carlos García Girón, Thibaut Hourlier, Andreas Kähäri, Rishi Nag, Magali Ruffier, Simon White
Web Team: Anne Parker, Ridwan Amode, Simon Brent, Bethan Pritchard, Harpreet Riat, Dan Sheppard, Steve Trevanion
Outreach: Giulietta M. Spudich, Jeff Almeida-King, Denise Carvalho-Silva, Bert Overduin, Michael Schuster
Ensembl Genomes: Paul Kersey, Paul Derwent, Jay Humphrey, Arnaud Kerhornou, Eugene Kulesha, Nick Langridge, Uma Maheswari, Mark McDowall, Michael Nuhn, Helder Pedro, Claudia Rato da Silva, Dan Staines, Iliana Toneva
Ensembl Strategy: Ewan Birney, Richard Durbin, Paul Flicek, Jen Harrow, Tim Hubbard, Glenn Proctor, Steve Searle