Introduction to with Ensembl

Dr. Giulietta M. Spudich

Ensembl Outreach Team

Objectives

• What information about a gene can I find?
• What about a region of the genome?
• How do I navigate the data?

Introduction

1977: 1st genome to be sequenced (5 kb)

2004: finished human sequence (3 Gb)

Large amounts of raw DNA sequence data

Genome Sequencing

[Figure: genome sequencing workflow. Labels: Fragment, BAC clones, Sequence, Assemble, Contigs, Scaffolds, Genome sequence. Background: raw DNA sequence text.]

The Ensembl genome browser: making it interesting

[Figure labels: Regulation, Conserved sequence, Gene, Allele]

• Splice variants, proteins, non-coding RNA
• Small and large scale sequence variation, phenotype associations
• Whole genome alignments, protein trees
• Potential promoters and enhancers, DNA methylation
• User upload, custom data

Figure adapted from the ENCODE project www..com/nature/focus/encode/

21 May 2012

Genome Browsers

• Ensembl Genome Browsers http://www.ensemblgenomes.org

• NCBI Map Viewer http://www.ncbi.nlm.nih.gov/mapview/

• UCSC Genome Browser http://genome.ucsc.edu

Ensembl is Used Worldwide

Top users:

UK, US, Canada, China, France, Germany, Italy, Japan, Spain

Data Volume Challenge

• UniProtKB/Swiss-Prot (reviewed): 536,029 (25,871 human) protein sequences; example accession Q8IU82
• UniProtKB/TrEMBL: 22,128,511 (217,918)
• NCBI RefSeq (reviewed): 15,744,232 (24,539); example accessions NP_006570, NM_006579

www.uniprot.org

A consensus set of protein coding sequences

• Reaching a consensus coding sequence set for human and mouse.

• 26,473 (human), 22,187 (mouse) (as of Sept 2011)

• If you see a “CCDS ID”, the coding sequence is agreed upon.

Genome Res. 2009 Jul;19(7):1316-23. Epub 2009 Jun 4

What are the gold transcripts?

[Figure legend: UTR, Coding, Intron]

Ensembl and VEGA/Havana

• Ensembl: automatic annotation pipeline; gene building all at once (whole genome)

• VEGA/Havana (human, mouse, zebrafish): manual curation, reviewed by experts

VEGA: Vertebrate Genome Annotation

Genes and Transcripts in Ensembl

High Quality:

• CCDS transcripts

• Ensembl/Havana merged (gold) transcripts

Ensembl/Havana

• Transcripts are from: Ensembl, Havana, or Ensembl/Havana (both, “gold”)

[Figure labels: Both (“gold”), Havana (00_), Ensembl (20_), Havana (00_)]

Gene Names in Ensembl

• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID

• For non-human species a species code is added after ENS:
  MUS (Mus musculus) for mouse: ENSMUSG###
  DAR (Danio rerio) for zebrafish: ENSDARG###
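The naming scheme above can be sketched as a small parser. This is a minimal illustration, not Ensembl's own code: the function name parse_stable_id and the two lookup tables are hypothetical, the species table holds only the codes named on this slide, and the example IDs are for illustration only.

```python
import re

# Feature-type letters listed on the slide.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Species codes named on the slide; an empty code means human.
# Assumption: a real lookup table would need many more entries.
SPECIES_CODES = {"": "Homo sapiens", "MUS": "Mus musculus", "DAR": "Danio rerio"}

# "ENS", an optional species code, one feature letter, then digits.
# The lazy group lets the feature letter be found by backtracking.
_STABLE_ID = re.compile(r"^ENS([A-Z]*?)([GTPE])(\d+)$")


def parse_stable_id(stable_id):
    """Split an Ensembl-style ID into species, feature type and number."""
    match = _STABLE_ID.match(stable_id.strip().upper())
    if match is None:
        raise ValueError("not an Ensembl-style ID: %r" % stable_id)
    code, letter, digits = match.groups()
    return {
        "species": SPECIES_CODES.get(code, "unknown (code %s)" % code),
        "feature": FEATURE_TYPES[letter],
        "number": int(digits),
    }


if __name__ == "__main__":
    for example in ("ENSG00000139618", "ENSMUSG00000017167", "ENSDART00000012345"):
        print(example, parse_stable_id(example))
```

Calling parse_stable_id("ENSMUSG00000017167") reports a Mus musculus gene; an ID with a species code that is not in the table is still parsed, but its species is reported as unknown.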

Ensembl Features

• The gene set.

• Comparative analysis

• Variation and regulation

• BioMart (data export)

• Display of external data (DAS)

• Programmatic access via the Perl API

• Open Source

Objectives

• What information about a gene can I find?
• What about a region of the genome?
• How do I navigate the data?

See our coursebook for walk-throughs and exercises using our browser: http://www.ensembl.org/info/website/tutorials/coursebook.pdf

Variation

• Nucleotide level
  • Single nucleotide polymorphism (SNP)
  • Small insertions and deletions (InDels)
  • Microsatellites (short tandem repeats)

• Structural
  • Copy number variations (CNV)
  • Large insertions and deletions
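As a rough illustration of the size-based classes above, the sketch below labels a variant from its reference and alternate alleles. It is an assumption-laden example, not how Ensembl classifies variation: the function name classify_variant is hypothetical, the 50 bp cut-off between "small" and "large" is an arbitrary illustrative default, and microsatellites and copy number variations are not detected because they cannot be told apart from allele strings alone.

```python
def classify_variant(ref, alt, large_threshold=50):
    """Label a variant by comparing allele lengths.

    ref and alt are allele strings; "-" stands for an empty allele.
    large_threshold is a hypothetical cut-off (in bases) between
    small and large insertions or deletions.
    """
    ref = "" if ref == "-" else ref.upper()
    alt = "" if alt == "-" else alt.upper()
    if ref == alt:
        raise ValueError("reference and alternate alleles are identical")
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    diff = len(alt) - len(ref)
    if diff == 0:
        # Same length but more than one base: not covered by the slide's classes.
        return "multi-nucleotide substitution"
    kind = "insertion" if diff > 0 else "deletion"
    size = "large" if abs(diff) >= large_threshold else "small"
    return size + " " + kind


if __name__ == "__main__":
    print(classify_variant("A", "G"))
    print(classify_variant("-", "TTA"))
    print(classify_variant("ACGT", "A"))
```

Keeping the threshold as a parameter makes the assumption explicit, since the slide does not say where small variants end and structural ones begin.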

Sequence displays

Gene: Sequence

Transcript: cDNA

Transcript: Exons

Comparative Genomics

69 species in e!67

Ensembl tools

Phenotype for a gene

How is all this information organised?

• Ensembl Views (Website)

• Ensembl Database (open source)

• BioMart “DataMining tool”

Help and documentation

• Comments and questions? [email protected]

• Mailing lists [email protected], [email protected]

• Course online www.ensembl.info/ecourse

• Our tutorials page www.ensembl.org/info/website/tutorials

• YouTube channel www.youtube.com/user/EnsemblHelpdesk

Follow us

• Facebook www.facebook.com/Ensembl.org

• Twitter https://twitter.com/Ensembl

• Come visit our blog! www.ensembl.info

Publications http://www.ensembl.org/info/about/publications.html

• Flicek, P. et al. Ensembl 2012 Nucleic Acids Res 40:D84-90 (2012) http://nar.oxfordjournals.org/content/40/D1/D84.long

• Xosé M. Fernández-Suárez and Michael K. Schuster Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in 1.15.1-1.15.48 (2010) www.ncbi.nlm.nih.gov/pubmed/20521244

• Giulietta M Spudich and Xosé M Fernández-Suárez Touring Ensembl: A practical guide to genome browsing BMC Genomics 11:295 (2010) www.biomedcentral.com/1471-2164/11/295

Ensembl Team

Ensembl: Paul Flicek (EBI), Steve Searle (Wellcome Trust Sanger Institute)

Software: Andy Yates, Stephen Keenan, Monika Komorowska, Rhoda Kinsella, Thomas Maurel, Kieron Taylor

Comparative Genomics: Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Matthieu Muffato, Miguel Pignatelli

Regulation: Ian Dunham, Ikhlak Ahmed, Nathan Johnson, Thomas Juettemann, Steven Wilder

Variation: Fiona Cunningham, Laurent Gil, Sarah Hunt, Will McLaren, Graham Ritchie, Anja Thormann

Analysis and Annotation: Bronwen Aken, Amonida Zadissa, Dan Barrell, Susan Fairley, Carlos García Girón, Thibaut Hourlier, Andreas Kähäri, Rishi Nag, Magali Ruffier, Simon White

Web Team: Anne Parker, Ridwan Amode, Simon Brent, Bethan Pritchard, Harpreet Riat, Dan Sheppard, Steve Trevanion

Outreach: Giulietta M. Spudich, Jeff Almeida-King, Denise Carvalho-Silva, Bert Overduin, Michael Schuster

Ensembl Genomes: Paul Kersey, Paul Derwent, Jay Humphrey, Arnaud Kerhornou, Eugene Kulesha, Nick Langridge, Uma Maheswari, Mark McDowall, Michael Nuhn, Helder Pedro, Claudia Rato da Silva, Dan Staines, Iliana Toneva

Ensembl Strategy: , Richard Durbin, Paul Flicek, Jen Harrow, Tim Hubbard, Glenn Proctor, Steve Searle