
with Ensembl

Dr. Giulietta M. Spudich
European Bioinformatics Institute
http://www.ebi.ac.uk/~gspudich/workshop_presentations/Torino

Today

• Ensembl introduction and browser exercises
• BioMart (data mining with Ensembl)
• Lunch break
• Game – review!
• Variations or Comparative
• Course end and feedback session (17.00 end)

Objectives

• What information about a gene can I find?
• What about sequence variation?
• How do I navigate the data?

Introduction

• 1977: first genome to be sequenced (5 kb)
• 2004: finished human sequence (3 Gb)

Large amounts of raw DNA sequence data

Genome sequence

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAG
CTTACTCCGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCA
TTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATT
GCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGC
...

The Ensembl genome browser: making it interesting

[Figure labels: Regulation, Conserved sequence, Gene, Allele]

• Splice variants, proteins, non-coding RNA
• Small and large scale sequence variation, phenotype associations
• Whole genome alignments, protein trees
• Potential promoters and enhancers, DNA methylation
• User upload, custom data

Figure adapted from the ENCODE project www..com/nature/focus/encode/

17 February 2012

Genome Browsers

• Ensembl Genome Browsers http://www.ensemblgenomes.org

• NCBI Map Viewer http://www.ncbi.nlm.nih.gov/mapview/

• UCSC Genome Browser http://genome.ucsc.edu

Ensembl is Used Worldwide

Top users:

UK, US, Canada, China, France, Germany, Italy, Japan, Spain

What is known?

Proteins and cDNA/mRNA sequences from the research community are found in:

• UniProtKB/Swiss-Prot (manually curated)
• UniProtKB/TrEMBL
  www.uniprot.org

• NCBI RefSeq (manually curated) www.ncbi.nlm.nih.gov/RefSeq

*Is there any consensus between these groups?

CCDS

• Reaching a consensus coding sequence set for human and mouse.

• 26,473 (human)
• 22,187 (mouse)
  (*as of Sept 2011)

• If you see a “CCDS ID”, the coding sequence is agreed upon.

Genome Res. 2009 Jul;19(7):1316-23. Epub 2009 Jun 4

What are the gold transcripts?

[Figure legend: UTR, Coding, Intron]

VEGA/Havana (human, mouse, zebrafish)

• Automatic annotation pipeline (Ensembl): gene building all at once (whole genome)

• Manual curation (Havana): case-by-case basis

VEGA: Vertebrate Genome Annotation

Genes and Transcripts in Ensembl

High Quality:

• CCDS transcripts

• Ensembl/Havana merged transcripts

Ensembl/Havana

• Transcripts are from:
  • Ensembl
  • Havana
  • Ensembl/Havana merge

[Figure labels: Merged (“gold”), Havana (00_), Ensembl (20_), Havana (00_)]

Gene Names in Ensembl

• ENSG### Ensembl Gene ID
• ENST### Ensembl Transcript ID
• ENSP### Ensembl Peptide ID
• ENSE### Ensembl Exon ID

• For non-human species a species code is inserted after "ENS":
  • MUS (Mus musculus) for mouse: ENSMUSG###
  • DAR (Danio rerio) for zebrafish: ENSDARG###
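The naming scheme above can be sketched as a small parser. This is only an illustration, not an official Ensembl tool: the function name `parse_stable_id` is hypothetical, and the pattern assumes a three-letter species code followed by an 11-digit number, which may not hold for every species or ID type. Only the species codes shown on the slide are translated.

```python
import re

# Feature-type letters, as listed on the slide.
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Species codes from the slide; human IDs carry no code.
# The table is deliberately incomplete: unknown codes are returned unchanged.
SPECIES_CODES = {"": "Homo sapiens", "MUS": "Mus musculus", "DAR": "Danio rerio"}

# Assumption: optional three-letter species code, then an 11-digit number.
STABLE_ID = re.compile(r"^ENS([A-Z]{3})?([GTPE])(\d{11})$")


def parse_stable_id(stable_id):
    """Split an Ensembl-style stable ID into species, feature type and number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {stable_id!r}")
    code = match.group(1) or ""
    return {
        "species": SPECIES_CODES.get(code, code),
        "feature": FEATURE_TYPES[match.group(2)],
        "number": int(match.group(3)),
    }


if __name__ == "__main__":
    for example in ("ENSG00000139618", "ENSMUSG00000000001", "ENSDARG00000000001"):
        print(example, parse_stable_id(example))
```

A name that does not follow the pattern, such as a gene symbol, raises `ValueError` rather than being guessed at.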

15 of 27 Ensembl Features

• The gene set.

• Comparative analysis

• Variation and regulation

• BioMart (data export)

• Display of external data (DAS)

• Programmatic access via the Perl API

• Open Source

Objectives

• What information about a gene can I find?
• What about sequence variation?
• How do I navigate the data?

Today: Variations

• Nucleotide level
  • Single nucleotide polymorphism (SNP)
  • Small insertions and deletions (InDels)
  • Microsatellites (short tandem repeats)

• Structural
  • Copy number variations (CNV)
  • Large insertions and deletions
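As a rough illustration of the categories above, the sketch below sorts a variant into a class from its reference and alternative alleles alone. It is a simplification under stated assumptions: the function `classify_variant` and the 50 bp cut-off between small and large events are hypothetical choices (conventions differ), a missing allele is written as "-", and microsatellites and copy number variations are not detected because they need more than a single allele pair.

```python
# Hypothetical cut-off between small indels and large (structural) events.
STRUCTURAL_MIN_BP = 50


def classify_variant(ref, alt):
    """Classify a variant from its reference and alternative allele strings."""
    # Assumption: "-" denotes an absent allele, so it counts as length zero.
    ref = "" if ref == "-" else ref.upper()
    alt = "" if alt == "-" else alt.upper()
    if ref == alt:
        raise ValueError("reference and alternative alleles are identical")

    if len(ref) == 1 and len(alt) == 1:
        return "SNP"

    size = abs(len(ref) - len(alt))
    if size == 0:
        # Same length but more than one base changed.
        return "substitution"

    kind = "insertion" if len(alt) > len(ref) else "deletion"
    scale = "large" if size >= STRUCTURAL_MIN_BP else "small"
    return f"{scale} {kind}"


if __name__ == "__main__":
    for ref, alt in (("A", "G"), ("-", "AT"), ("ATG", "A"), ("A" * 100, "A")):
        print(ref[:10], alt[:10], classify_variant(ref, alt))
```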

Sequence displays

• Gene: Sequence
• Transcript: cDNA
• Transcript: Exons

Ensembl tools

Phenotype for a gene

Objectives

• What information about a gene can I find?
• What about sequence variation?
• How do I navigate the data?

How is all this information organised?

• Ensembl Views (Website)

• Ensembl Database (open source)

• BioMart ‘Data mining tool’

Help and documentation

• Comments and questions? [email protected]

• Mailing lists [email protected], [email protected]

• Course online www.ensembl.info/ecourse

• Our tutorials page www.ensembl.org/info/website/tutorials

• YouTube channel www..com/user/EnsemblHelpdesk

Follow us

• Facebook www.facebook.com/Ensembl.org

• Twitter https://twitter.com/Ensembl

• Come visit our blog! www.ensembl.info

Publications http://www.ensembl.org/info/about/publications.html

• Flicek, P. et al. Ensembl 2012. Nucleic Acids Res 40:D84-90 (2012) http://nar.oxfordjournals.org/content/40/D1/D84.long

• Xosé M. Fernández-Suárez and Michael K. Schuster. Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010) www.ncbi.nlm.nih.gov/pubmed/20521244

• Giulietta M. Spudich and Xosé M. Fernández-Suárez. Touring Ensembl: A practical guide to genome browsing. BMC Genomics 11:295 (2010) www.biomedcentral.com/1471-2164/11/295

Ensembl Team

Ensembl: Paul Flicek (EBI), Steve Searle (Sanger Institute)

Software: Andy Yates, Stephen Keenan, Monika Komorowska, Rhoda Kinsella, Ian Longden, Thomas Maurel, Kieron Taylor

Comparative Genomics: Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Matthieu Muffato, Miguel Pignatelli

Regulation: Ian Dunham, Nathan Johnson, Daniel Sobral, Steven Wilder

Variation: Fiona Cunningham, Laurent Gil, Jackie MacArthur, Will McLaren, Graham Ritchie

Analysis and Annotation: Bronwen Aken, Amonida Zadissa, Dan Barrell, Susan Fairley, Carlos García Girón, Thibaut Hourlier, Andreas Kähäri, Rishi Nag, Magali Ruffier, Simon White

Web Team: Anne Parker, Ridwan Amode, Simon Brent, Bethan Pritchard, Harpreet Riat, Steve Trevanion (VEGA)

Outreach: Giulietta M. Spudich, Jeff Almeida-King, Denise Carvalho-Silva, Bert Overduin, Michael Schuster (QC)

Ensembl Genomes: Paul Kersey, Paul Derwent, Jay Humphrey, Arnaud Kerhornou, Eugene Kulesha, Nick Langridge, Uma Maheswari, Mark McDowall, Michael Nuhn, Helder Pedro, Dan Staines, Iliana Toneva

Ensembl Strategy: ..., Richard Durbin, Paul Flicek, Jen Harrow, ..., Glenn Proctor, Steve Searle

What data are you interested in?

Write down 1-2 bullet points on a Post-it note and put it up at the front of the room.