Genome Viewing
Ensembl – An Overview
Twitter: #Ensembl
Dr. Giulietta M. Spudich, Ensembl Outreach
06 March 2014
EMBL-EBI: EBI is an Outstation of the European Molecular Biology Laboratory.

This talk …
• Genome Sequencing and Browsers
• Ensembl Data
• Genes
• Variation
• Comparative Genomics
• Regulation
• Access

Beginnings …
• 1995: First free-living organism sequenced: the bacterium Haemophilus influenzae (1.8 million bp)
• 2001: First draft of the human sequence (3 Gb)
• 2004: ‘Finished’ human sequence
• 2014: Polished human sequence with haplotypes (GRCh38)

Today’s genomics – human
• 1000 Genomes Project
• ENCODE
Image credits: COURTESY OF NIH THOMAS POROSTOCKY; SOURCE: MEETINGZONE

Today’s genomics – other species

Ensembl – Access to …

Sister project …
Bacteria, Protists, Plants, Fungi, (non-vertebrate) Metazoa

Raw sequence
CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC
CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG
TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTGCACTGCTGCGCCTCTGCTG
CGCCTCGGGTGTCTTTTGCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAGA
TTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCAAAAAAGAACTGCACCTCTGGA
GCGGACTTATTTACCAAGCATTGGAGGAATATCGTAGGTAAAAATGCCTATTGGATCCAAAG
AGAGGCCAACATTTTTTGAAATTTTTAAGACACGCTGCAACAAAGCAGATTTAGGACCAATA
AGTCTTAATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAG
ACTAAAATGGATCAAGCAGATGATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAGAAGAATC
TGAACATAAAAACAACAATTACGAACCAAACCTATTTAAAACTCCACAAAGGAAACCATCTTA
TAATCAGCTGGCTTCAACTCCAATAATATTCAAAGAGCAAGGGCTGACTCTGCCGCTGTAC
CAATCTCCTGTAAAAGAATTAGATAAATTCAAATTAGACTTAGGAAGGAATGTTCCCAATAGT
AGACTAAAAGTCTTCGCACAGTGAAAT

CGCCGGGAGAAGCGTGAGGGGACAGATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTC
CGGCCAAAAAAGAACTGCACCTCTGGAGCGGACTTATTTACCAAGCATTGGAGGAATATCG
TAGGTAAAAATGCCTATTGGATCCAAAGAGAGGCCAACATTACTAAAATGGATCAAGCAGAT
GATGTTTCCTGTCCACTTCTAAATTCTTGTCTTAG
AATTGGTTTGAAGAACTTTCTTCAGAAGCTCCACCCTATAATTCTGAACCTGCAGTGAAAGT
CCTGTTGTTCTACAATGTACACATGTAACACCACAAAGAGATAAGTCA

Ensembl – unlocking the code
Figure labels: Regulation, Conserved sequence, Gene, Allele
• Splice variants, proteins, non-coding RNA
• Small and large scale sequence variation, phenotype associations
• Whole genome alignments, protein trees
• Potential promoters and enhancers, DNA methylation
• User upload, custom data
Figure adapted from the ENCODE project: www.nature.com/nature/focus/encode/

Challenge: number of gene/protein sequences increases
• UniProtKB/Swiss-Prot (e.g. Q8IU82): 542,258
• UniProtKB/TrEMBL: 51,616,950
• NCBI RefSeq (e.g. NP_006570): 37,371,278

Is there a consensus?
• Reaching a consensus coding sequence (CCDS) set for human and mouse.
• Human: 29,045 CCDS IDs - 18,683 Ensembl Gene IDs (e74)
• Mouse: 23,093 CCDS IDs - 19,988 Ensembl Gene IDs (e72)

The GENCODE set
www.gencodegenes.org
• Ensembl has long been respected for its high-quality gene sets
• GENCODE genes = Ensembl Automatic Pipeline + Havana Manual Annotation (+ Yale pseudogenes)
• GENCODE is used by ENCODE, 1000 Genomes, and other projects.

Ensembl Variation
Aims:
• Collect, integrate and annotate all known variants
• Provide tools for comparison to other genomic data
• Provide a framework for access and to improve understanding

Practical applications of variation
Molecular and clinical medicine
• Diagnosis, detection and treatment:
  – e.g. myotonic dystrophy, fragile X syndrome, inherited colon cancer, familial breast cancer
• Pharmacogenomics "custom drugs"
DNA forensics
• Identification of suspects
• catastrophe victims
• endangered species
Agriculture, livestock breeding
• Disease-, insect-, and drought-resistant crops
• Healthier, disease-resistant animals
• Marker-assisted breeding
• More nutritious produce
• Reducing the costs of agriculture
Anthropology, evolution, and human migration

Variation Sources
• dbSNP (1000 Genomes, ClinVar, etc.)
• ESP (Exome Sequencing Project)
• UniProt
• COSMIC
• HGMD_Public
• NHGRI-GWAS
• & more …
www.ensembl.org/info/genome/variation/sources_documentation

Variation in the Browser

Ensembl Variant Effect Predictor
• Uses an Ensembl gene set to annotate:
  – SNPs
  – Indels
  – Variants in regulatory regions
  – Structural variants
• New Interface!
• Web interface, Perl script, REST API, XML
• Publication: McLaren et al. 2010 (Bioinformatics)

[Figure: circular species tree centred on "Ensembl", with tips labelled by species name, e.g. Felis_catus, Danio_rerio, Gadus_morhua]
Image obtained using Dendroscope (D.H. Huson and C. Scornavacca, Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks, Systematic Biology, 2012)

Comparative Genomics
Whole genome alignments

Homo sapiens             ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---ACTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Ancestral sequences      ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Pan troglodytes          ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAAGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCACTG-CTGCGCCTCGGGTGTCTTTTGCGGCG
Ancestral sequences      ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TCTGCTGCGCCTCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Gorilla gorilla gorilla  ........................................................................................................................
Ancestral sequences      ........................................................................................................................
Pongo abelii             ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC-TAGGCG-GCAGAGGCGGAGC--CGCTG-TGGC---TGTGCTGCACCTGTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Ancestral sequences      ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-TAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC--------------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Macaca mulatta           ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAT-CAGGCG-GCAGAGGTGGAAC--TGCTGCTGGC--------------TCTG-CTGCGCCTCGGGTCTCTTTTGCGGCG
Ancestral sequences      ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGAGC--TGCTG-TGGC--------------TCTG-CCGCGCCTCGGGTCTTTTCTGCGGCG
Callithrix jacchus       ACGT-GG--TCAGCGCGGGCTTGTGGCGCGAGCGTCTGAAAT-GAGGCG-GCAGAGGCGGACC--TGCTG-TGTC--------------TCTG-CCGCGCCTCCGGTCTTTTCTGCGACG
Ancestral sequences      ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA
Mus musculus             ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGGGAGCCGT-G----------AGGCGTTGCCGTCAGT-CAGCT-----------------ACCGCTGC-------------------
Ancestral sequences      ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGTGAGAAGT-G----------AGGCGGTGCCGTCCGT-CAGCT-----------------ACCGCAAC-------------------
Rattus norvegicus        ACGGCGC--AGAGCGCGGGCTTTTCGCAGGAGCGTGAGAAGT-G----------AGGCGGCGCCGTCCGT-CAGCG-----------------GCCGCAAC-------------------
Ancestral sequences      ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TCCTT-CAGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA
Oryctolagus cuniculus    ACGT-GC--CCAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAA-AAGGCT-ATGGAGGCGGAGC--TCCTT-CAGCT------------------CCGCGTCTGGGGTCTTGCCTAGGGCA
Ancestral sequences      ACGT-GC--CGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCG-GCGGAGGCGGAGC--TGCTG-CGGCT------------------CCGCGTCTCGGGTCTTTTCTGCGGCA
Bos taurus               ACAT-ATCCCGAGAGCAGGCTTTTGGCGCGAGAATCTGAAAC-CCGGTGGGCGGAGGTGCGGC--TGCTG-AAGTTTG----------------C--TGTCTCGGGCGG-T---------
Ancestral sequences      ACGT-GCTCCGAGAGCGGGCTTTTGGCGCGAGCGTCTGAAAT-AAGGCGAGCGGAGGCGGAGC--TGCTG-GGGCTCC----------------C--TGTCTCGGGTGG-TTCTGTGGCA
Canis lupus familiaris   ........................................................................................................................
Ancestral sequences      .......................................................................................................................

Clade labels on the accompanying tree: Great apes, Old world monkeys, Primates, Rodents, Glires, Boroeutherian, Laurasiatheria
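
The whole genome alignment above can be read column by column: species that diverged recently agree at almost every position, while more distant species differ more often. The Python below is a minimal sketch of that comparison and is not part of the original slides, nor is it how Ensembl scores conservation. The names `percent_identity` and `GAP_CHARS` are illustrative, and skipping both "-" and "." columns is an assumption about how gaps and the dotted rows should be treated. The three rows are the first 42 columns of the Homo sapiens, Pan troglodytes and Mus musculus rows.

```python
# Characters that carry no base: "-" is an alignment gap, "." is assumed
# to mark a row for which no sequence is shown.
GAP_CHARS = {"-", "."}


def percent_identity(row_a: str, row_b: str) -> float:
    """Fraction of identical bases over columns where both rows have a base."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same number of columns")
    compared = 0
    matches = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # skip columns where either row has no base
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no columns with a base in both rows")
    return matches / compared


# First 42 alignment columns, copied from the rows in the alignment.
HUMAN = "ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC"
CHIMP = "ACGT-GG--CCAGCGCGGGCTTGTGGCGCGAGCTTCTGAAAC"
MOUSE = "ACGG-GC--AGAGCGCGGGCTTTTCGCGGGAGCGGGAGCCGT"

if __name__ == "__main__":
    for name, row in (("Pan troglodytes", CHIMP), ("Mus musculus", MOUSE)):
        identity = percent_identity(HUMAN, row)
        print(f"Homo sapiens vs {name}: {identity:.1%}")
```

Over this short excerpt the human and chimpanzee rows are identical, while the human and mouse rows differ at several columns, which matches the ordering of the species in the alignment.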