Ensembl – An Overview

Dr. Giulietta M. Spudich, Ensembl Outreach
Twitter: #Ensembl

EBI is an Outstation of the European Molecular Biology Laboratory. EMBL-EBI

This talk …

• Genome Sequencing and Browsers
• Ensembl Data: Genes, Variation, Comparative Genomics, Regulation
• Access

Beginnings …

1995: First free-living organism sequenced: the bacterium Haemophilus influenzae (1.8 million bp)

2001: First draft of the human sequence (3 Gb)

2004: ‘Finished’ human sequence

2014: Polished human sequence with haplotypes (GRCh38)

Today’s genomics – human

1000 Genomes Project

ENCODE

COURTESY OF NIH

THOMAS POROSTOCKY; SOURCE: MEETINGZONE

Today’s genomics – other species

Ensembl – Access to …

Sister project …

Bacteria, Protists, Plants, Fungi, (non-vertebrate) Metazoa

Raw sequence

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC
CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG
TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTGCACTGCTGCGCCTCTGCTG
CGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAGA
TTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCAAAAAAGAACTGCACCTCTGGA
GCGGACTTATTTACCAAGCATTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAG
AGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATA
AGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAG
ACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGAAGAATC
TGAACATAAAAACAACAATTACGAACCAAACCTATTTAAAACTCCACAAAGGAAACCATCTTA
TAATCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAGGGCTGACTCTGCCGCTGTAC
CAATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGT
AGACTAAAAGTCTTCGCACAGTGAAAT

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC
CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG
TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTACTAAAATGGATCAAGCAGAT
GATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAG

AATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAGTGAAAGT
CCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCA

Ensembl – unlocking the code

Figure labels: Regulation, Conserved sequence, Gene, Allele

• Splice variants, proteins, non-coding RNA
• Small- and large-scale sequence variation, phenotype associations
• Whole genome alignments, protein trees
• Potential promoters and enhancers, DNA methylation
• User upload, custom data

Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/

06 March 2014

Challenge: the number of gene/protein sequences keeps increasing

• UniProtKB/Swiss-Prot (e.g. Q8IU82): 542,258
• UniProtKB/TrEMBL: 51,616,950
• NCBI RefSeq (e.g. NP_006570): 37,371,278

Is there a consensus?

• Reaching a consensus coding sequence set for human and mouse.

• Human: 29,045 CCDS IDs, 18,683 Ensembl Gene IDs (e74)
• Mouse: 23,093 CCDS IDs, 19,988 Ensembl Gene IDs (e72)

The GENCODE set
www.gencodegenes.org

• Ensembl has long been respected for its high-quality gene sets
• GENCODE genes = Ensembl Automatic Pipeline + Havana Manual Annotation (+ Yale pseudogenes)
• GENCODE is used by ENCODE, 1000 Genomes, and other projects.


Ensembl Variation

Aims:
• Collect, integrate and annotate all known variants
• Provide tools for comparison to other genomic data
• Provide a framework for access and to improve understanding

Practical applications of variation

Molecular and clinical medicine
• Diagnosis, detection and treatment:
  – e.g. myotonic dystrophy, fragile X syndrome, inherited colon cancer, familial breast cancer
• "Custom drugs"

DNA forensics
• Identification of suspects
• Catastrophe victims
• Endangered species

Agriculture, livestock breeding
• Disease-, insect-, and drought-resistant crops
• Healthier, disease-resistant animals
• Marker-assisted breeding
• More nutritious produce
• Reducing the costs of agriculture

Anthropology, evolution, and human migration

Variation Sources

• dbSNP (1000 Genomes, ClinVar, etc.)
• ESP (Exome Sequencing Project)
• UniProt
• COSMIC
• HGMD_Public
• NHGRI-GWAS
• & more …

www.ensembl.org/info/genome/variation/sources_documentation

Variation in the Browser

Ensembl Variant Effect Predictor

New Interface!

• Uses an Ensembl gene set to annotate:
  – SNPs
  – Indels
  – Variants in regulatory regions
  – Structural variants
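The VEP can also be reached through the Ensembl REST API. The sketch below only builds the request URL for a single-base substitution and leaves the network call to an optional helper. Everything specific in it is an assumption that does not come from this talk and should be checked against the current REST documentation: the server address (https://rest.ensembl.org), the endpoint pattern vep/:species/region/:region/:allele, the JSON content type, and the example variant. The names vep_region_url and fetch_json are hypothetical helpers, not part of any Ensembl library.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: address of the public Ensembl REST server (not given in the talk).
REST_SERVER = "https://rest.ensembl.org"


def vep_region_url(species, chrom, start, end, strand, allele, server=REST_SERVER):
    """Build a GET URL of the assumed form vep/:species/region/:region/:allele."""
    if strand not in (1, -1):
        raise ValueError("strand must be 1 or -1")
    # Region notation assumed here: chromosome:start-end:strand
    region = f"{chrom}:{start}-{end}:{strand}"
    return (
        f"{server}/vep/{quote(species)}/region/"
        f"{quote(region, safe=':-')}/{quote(allele)}"
        "?content-type=application/json"
    )


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical example variant: a C allele at position 22125503 on human chromosome 9.
example_url = vep_region_url("human", "9", 22125503, 22125503, 1, "C")
print(example_url)
# consequences = fetch_json(example_url)  # uncomment to query the live service
```

Keeping URL construction separate from the network call means the request can be inspected or logged before anything is sent.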

Web interface, Perl script, REST API, XML

• Publication: McLaren et al. 2010

Ensembl Comparative Genomics

Figure: phylogenetic tree of species (tip labels are species names such as Felis_catus; most are not recoverable from the extracted slide text)

Image obtained using Dendroscope (D.H. Huson and C. Scornavacca, Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks, Systematic Biology, 2012)

Whole genome alignments

Homo sapiens  ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---ACTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Ancestral sequences  ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Pan troglodytes  ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAAGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCACTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Ancestral sequences  ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Gorilla gorilla gorilla  ......
Ancestral sequences  ......
Pongo abelii  ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TGTGCTGCACCTGTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Ancestral sequences  ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-TAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Macaca mulatta  ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-CAGGCG-GCAGAGGTGGAAC--TGCTGCTGGC------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Ancestral sequences  ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC------TCTG-CCGCGCCTCGGGTCTTTTCTGCGGCG
Callithrix jacchus  ACGT-GG--TCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGACC--TGCTG-TGTC------TCTG-CCGCGCCTCCGGTCTTTTCTGCGACG
Ancestral sequences  ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------CCGCGTCTCGGGTCTTTTCTGCGGCA
Mus musculus  ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGGGAGCCGT-G------AGGCGTTGCCGTCAGT-CAGCT------ACCGCTGC------
Ancestral sequences  ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGTGAGAAGT-G------AGGCGGTGCCGTCCGT-CAGCT------ACCGCAAC------
Rattus norvegicus  ACGGCGC--AGAGCGCGGGCTTTTCGCAGGAGCGTGAGAAGT-G------AGGCGGCGCCGTCCGT-CAGCG------GCCGCAAC------
Ancestral sequences  ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TCCTT-CAGCT------CCGCGTCTCGGGTCTTTTCTGCGGCA
Oryctolagus cuniculus  ACGT-GC--CCAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAA-AAGGCT-ATGGAGGCGGAGC--TCCTT-CAGCT------CCGCGTCTGGGGTCTTGCCTAGGGCA
Ancestral sequences  ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------CCGCGTCTCGGGTCTTTTCTGCGGCA
Bos taurus  ACAT-ATCCCGAGAGCAGGCTTTTGGCGCGAGAATCTGAAAC-CCGGTGGGCGGAGGTGCGGC--TGCTG-AAGTTTG------C--TGTCTCGGGCGG-T------
Ancestral sequences  ACGT-GCTCCGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCGAGCGGAGGCGGAGC--TGCTG-GGGCTCC------C--TGTCTCGGGTGG-TTCTGTGGCA
Canis lupus familiaris  ......
Ancestral sequences  ......
Equus caballus  ACGT-GCTCAGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAAGAAGGCAAGCGGAGGCGGAGT--TGCTG-GGGCTCC------C--TGACTGGGGTGG-TTGTGTGGCA

Clade labels shown beside the alignment: Great apes, Old world monkeys, Primates, Rodents, Glires, Boroeutherian, Laurasiatheria
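Because every row of a whole genome alignment shares the same columns, two rows can be compared column by column. The sketch below computes a simple percent identity; it assumes gaps are written as "-" and that columns with a gap in either row are skipped, which is only one of several possible conventions. The function name percent_identity is a hypothetical helper, and the two inputs are just the first 20 columns of the Homo sapiens and Mus musculus rows above, since the full rows may have lost gap characters during text extraction.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of gap-free aligned columns in which both rows carry the same base."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    compared = 0
    matches = 0
    for base_a, base_b in zip(row_a, row_b):
        # Skip columns where either species has a gap.
        if base_a == gap or base_b == gap:
            continue
        compared += 1
        if base_a.upper() == base_b.upper():
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# First 20 alignment columns copied from the slide (prefixes only).
human_prefix = "ACGT-GG--CCAGCGCGGGC"
mouse_prefix = "ACGG-GC--AGAGCGCGGGC"

identity = percent_identity(human_prefix, mouse_prefix)
print(f"human vs mouse, first 20 columns: {identity:.1f}% identical")
```

A value computed this way depends on how gap columns are treated, so it should not be compared directly with identities reported by other tools.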

Gene expression: The basic model

Figure labels: mRNA; Transcription Factors (Activation, Repression); RNA polymerase complex; Transcription Factor Binding Sites; Promoter; Gene (scale bar: 2 nm)

Available data

Regulation (ENCODE + …)

Open source – access our data!

• Ensembl Views (Website, ftp)

• Ensembl Database (Perl API, REST API, MySQL)

• BioMart – Quick Data Retrieval (Web interface, Bioconductor, Galaxy, biomaRt)

Ensembl is used worldwide

Top users:

UK, US, Canada, China, France, Germany, Italy, Japan, Spain

Workshops Worldwide (2013)

What’s coming? (2014)

New Assemblies:
• GRCh38 (and all the updated annotation): www.ensembl.info/blog (category GRCh38)
• Baboon
• Vervet monkey
• Amazon molly
• Crab-eating macaque (Pre.ensembl.org)
• Hedgehog (Pre.ensembl.org)

New BLAST

New Regulatory Build
www.ensembl.info/blog/2013/12/26/the-new-ensembl-regulatory-annotation

Learn more

• Comments and questions? [email protected]

• YouTube channel www.youtube.com/user/EnsemblHelpdesk

• Mailing lists [email protected], [email protected]

• Courses online www.ensembl.info/ecourse

• Our tutorials page www.ensembl.org/info/website/tutorials

Follow us

• Facebook www.facebook.com/Ensembl.org

• Twitter https://twitter.com/Ensembl

• Come visit our blog! www.ensembl.info

Acknowledgements

Funding: European Commission Framework Programme 7