Exploring with Ensembl

Prague, June 2014

Dr. Giulietta M. Spudich, Ensembl

EBI is an Outstation of the European Molecular Biology Laboratory. EMBL-EBI

This Course

 Monday EBI Introduction (Giulietta) Ensembl (Giulietta) ArrayExpress (Karyn)  Tuesday Phylogenies (Laura)  Wednesday  (Sandra)

www.ebi.ac.uk/training/Prague2014


This Morning

• Ensembl Introduction
• Ensembl Browser Walkthrough
• Hands-on Exercises
• 11 AM Coffee
• BioMart
• 1 PM Lunch

Beginnings …

1995: First genome of a free-living organism sequenced: the bacterium Haemophilus influenzae (1.8 million bp)

2001: First draft of the human sequence (3 Gb)

2004: ‘Finished’ human sequence

2014: Polished human sequence with haplotypes (GRCh38)

Today’s genomics – human

1000 Genomes Project

ENCODE

Image credits: courtesy of NIH; Thomas Porostocky (source: MeetingZone)

Today’s genomics – other species

Ensembl – Access to …

Ensembl Genomes – Expanding Species

Bacteria, Protists, Plants, Fungi, (non-vertebrate) Metazoa

Ensembl Genomes – Examples

http://ensemblgenomes.org

• Metazoa: Ixodes scapularis, Schistosoma mansoni, Anopheles gambiae
• Protists: Leishmania major, Plasmodium falciparum
• Plants: Arabidopsis thaliana, Triticum aestivum
• Bacteria (10,000+): Acetobacter tropicalis, Escherichia coli
• Fungi: Saccharomyces cerevisiae, Schizosaccharomyces pombe

Raw sequence

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC
CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG
TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTGCACTGCTGCGCCTCTGCTG
CGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAGA
TTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCAAAAAAGAACTGCACCTCTGGA
GCGGACTTATTTACCAAGCATTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAG
AGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATA
AGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAG
ACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGAAGAATC
TGAACATAAAAACAACAATTACGAACCAAACCTATTTAAAACTCCACAAAGGAAACCATCTTA
TAATCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAGGGCTGACTCTGCCGCTGTAC
CAATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGT
AGACTAAAAGTCTTCGCACAGTGAAAT
CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC
CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG
TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTACTAAAATGGATCAAGCAGAT
GATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAG
AATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAGTGAAAGT
CCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCA

Ensembl – unlocking the code

Regulation

Conserved sequence Gene Allele

• Splice variants, proteins, non-coding RNA
• Small and large scale sequence variation, phenotype associations
• Whole genome alignments, protein trees
• Potential promoters and enhancers, DNA methylation
• User upload, custom data

Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/

This talk …

• Genome Sequencing and Browsers
• Ensembl Data
  – Genes
  – Variation
  – Regulation
• Access

Challenge: the number of gene/protein sequences keeps increasing
• UniProtKB/Swiss-Prot (e.g. Q8IU82): 542,258
• UniProtKB/TrEMBL: 51,616,950
• NCBI RefSeq (e.g. NP_006570): 37,371,278

Is there a consensus?

• Reaching a consensus coding sequence set for human and mouse.

• Human: 29,045 CCDS IDs, 18,683 Ensembl Gene IDs (e74)
• Mouse: 23,093 CCDS IDs, 19,988 Ensembl Gene IDs (e72)

The GENCODE set

www.gencodegenes.org

• Human and mouse
• Not just the CDS (coding sequence)
• Ensembl contributes genes to GENCODE
• Havana contributes genes to GENCODE
• GENCODE is used by ENCODE, 1000 Genomes, and other projects.

This talk …

• Genome Sequencing and Browsers
• Ensembl Data
  – Genes
  – Variation
  – Comparative Genomics
  – Regulation
• Access

Ensembl Variation

Aims:
• Collect, integrate and annotate all known variants
• Provide tools for comparison to other genomic data
• Provide a framework for access and to improve understanding

Practical applications of variation

Molecular and clinical medicine
• Diagnosis, detection and treatment:
  – e.g. myotonic dystrophy, fragile X syndrome, inherited colon cancer, familial breast cancer
• "Custom drugs"

DNA forensics
• Identification of suspects
• Catastrophe victims
• Endangered species

Agriculture, livestock breeding
• Disease-, insect-, and drought-resistant crops
• Healthier, disease-resistant animals
• Marker-assisted breeding
• More nutritious produce
• Reducing the costs of agriculture

Anthropology, evolution, and human migration

Variation Sources

• dbSNP (1000 Genomes, ClinVar, etc.)
• ESP (Exome Sequencing Project)
• UniProt
• COSMIC
• HGMD_Public
• NHGRI-GWAS
• & more …

www.ensembl.org/info/genome/variation/sources_documentation

Variation in the Browser

Ensembl Variant Effect Predictor

New Interface!

• Uses an Ensembl gene set to annotate:
  – SNPs
  – Indels
  – Variants in regulatory regions
  – Structural variants

Interfaces: REST API, Web interface, Perl script, XML
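As a rough illustration of the REST API route listed above, the following Python sketch builds a Variant Effect Predictor request for a known variant identifier and pulls one field out of the reply. The server address `https://rest.ensembl.org`, the endpoint path `/vep/:species/id/:id`, the `most_severe_consequence` field, and the helper names `vep_id_url`, `fetch_vep` and `most_severe` are assumptions that do not come from these slides; check the current Ensembl REST documentation before relying on them.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumption: public Ensembl REST server (not stated on the slides).
REST_SERVER = "https://rest.ensembl.org"


def vep_id_url(variant_id, species="human"):
    """Build the URL for a VEP lookup by variant identifier (e.g. an rsID)."""
    return f"{REST_SERVER}/vep/{quote(species)}/id/{quote(variant_id)}"


def fetch_vep(variant_id, species="human", timeout=30):
    """Send the request and return the decoded JSON reply (needs network access)."""
    request = urllib.request.Request(
        vep_id_url(variant_id, species),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def most_severe(records):
    """Collect the 'most_severe_consequence' value from each VEP record, if present."""
    return [record.get("most_severe_consequence") for record in records]
```

Calling `fetch_vep` with an rsID requires an internet connection; the list it returns can be passed straight to `most_severe`. The web interface and the Perl script remain the better choice for large input files.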

• Publication: McLaren et al. 2010

Ensembl Comparative Genomics

• Whole Genome Alignments
• Gene Trees
• Homologues
• Protein Families

[Figure: circular species tree of the genomes in Ensembl]

Image obtained using Dendroscope (D.H. Huson and C. Scornavacca, Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks, Systematic Biology, 2012)

Whole genome alignments

Homo sapiens             ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---ACTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Ancestral sequences      ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Pan troglodytes          ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAAGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCACTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Ancestral sequences      ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Gorilla gorilla gorilla  ......
Ancestral sequences      ......
Pongo abelii             ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TGTGCTGCACCTGTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Ancestral sequences      ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-TAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Macaca mulatta           ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-CAGGCG-GCAGAGGTGGAAC--TGCTGCTGGC------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Ancestral sequences      ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC------TCTG-CCGCGCCTCGGGTCTTTTCTGCGGCG
Callithrix jacchus       ACGT-GG--TCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGACC--TGCTG-TGTC------TCTG-CCGCGCCTCCGGTCTTTTCTGCGACG
Ancestral sequences      ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------CCGCGTCTCGGGTCTTTTCTGCGGCA
Mus musculus             ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGGGAGCCGT-G------AGGCGTTGCCGTCAGT-CAGCT------ACCGCTGC------
Ancestral sequences      ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGTGAGAAGT-G------AGGCGGTGCCGTCCGT-CAGCT------ACCGCAAC------
Rattus norvegicus        ACGGCGC--AGAGCGCGGGCTTTTCGCAGGAGCGTGAGAAGT-G------AGGCGGCGCCGTCCGT-CAGCG------GCCGCAAC------
Ancestral sequences      ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TCCTT-CAGCT------CCGCGTCTCGGGTCTTTTCTGCGGCA
Oryctolagus cuniculus    ACGT-GC--CCAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAA-AAGGCT-ATGGAGGCGGAGC--TCCTT-CAGCT------CCGCGTCTGGGGTCTTGCCTAGGGCA
Ancestral sequences      ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------CCGCGTCTCGGGTCTTTTCTGCGGCA
Bos taurus               ACAT-ATCCCGAGAGCAGGCTTTTGGCGCGAGAATCTGAAAC-CCGGTGGGCGGAGGTGCGGC--TGCTG-AAGTTTG------C--TGTCTCGGGCGG-T------
Ancestral sequences      ACGT-GCTCCGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCGAGCGGAGGCGGAGC--TGCTG-GGGCTCC------C--TGTCTCGGGTGG-TTCTGTGGCA
Canis lupus familiaris   ......
Ancestral sequences      ......
Equus caballus           ACGT-GCTCAGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAAGAAGGCAAGCGGAGGCGGAGT--TGCTG-GGGCTCC------C--TGACTGGGGTGG-TTGTGTGGCA

Clade labels on the accompanying tree: Great apes, Old world monkeys, Primates, Rodents, Glires, Boroeutherian, Laurasiatheria

This talk …

• Genome Sequencing and Browsers
• Ensembl Data
  – Genes
  – Variation
  – Comparative Genomics
  – Regulation
• Access

Gene expression: The basic model

[Figure labels: mRNA; transcription factors (activation, repression); RNA polymerase complex; transcription factor binding sites; promoter; gene; scale bar 2 nm]

Available data

Regulation (ENCODE + …)

This talk …

• Genome Sequencing and Browsers
• Ensembl Data
  – Genes
  – Variation
  – Comparative Genomics
  – Regulation
• Access

Open source: access our data!

• Ensembl Views (Website, ftp)

• Ensembl Database (Perl API, REST API, MySQL)

• BioMart – Quick Data Retrieval (Web interface, Bioconductor, Galaxy, biomaRt)

Ensembl is used worldwide

Top users:

UK, US, Canada, China, France, Germany, Italy, Japan, Spain

Learn more

• Comments and questions? [email protected]

• YouTube channel www.youtube.com/user/EnsemblHelpdesk

• Mailing lists [email protected], [email protected]

• Courses online www.ensembl.info/ecourse

• Our tutorials page www.ensembl.org/info/website/tutorials

Follow us

• Facebook www.facebook.com/Ensembl.org

• Twitter https://twitter.com/Ensembl

• Come visit our blog! www.ensembl.info

Acknowledgements

Funding European Commission Framework Programme 7