Browsing Genomes with Ensembl Annotation
• Recent years have seen the release of large amounts of sequence data
• Raw sequence data are not very useful on their own; they are most valuable when accompanied by comprehensive, good-quality annotation

[Slide image: a block of raw, unannotated genomic DNA sequence]

The Ensembl project
• The Ensembl project was started in 1999, some years before the draft human genome was completed
• Joint project between the Sanger Institute and the EBI

Goals:
• To provide automated but accurate gene annotation
• Open source
• Integrate this annotation with other available biological data (directly or via the Distributed
Annotation System, DAS)
• Both web and programmatic interfaces

Ensembl team
Ensembl: Paul Flicek (EBI), Steve Searle (Sanger Institute)
Software: Glenn Proctor, Andreas Kähäri, Stephen Keenan, Rhoda Kinsella, Eugene Kulesha, Ian Longden, Iliana Toneva
Comparative Genomics: Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Matthieu Muffato, Miguel Pignatelli
Functional Genomics: Ian Dunham, Nathan Johnson, Daniel Sobral, Steven Wilder
Variation: Fiona Cunningham, Laurent Gil, Pontus Larsson, Will McLaren, Graham Ritchie
Analysis and Annotation: Jan-Hinnerck Vogel, Bronwen Aken, Susan Fairley, Thibaut Hourlier, Magali Ruffier, Simon White, Amy Tang, Amonida Zadissa
Web Team: Anne Parker, Ridwan Amode, Simon Brent, Maurice Hendrix, Bethan Pritchard, Steve Trevanion (VEGA)
Outreach: Xosé M Fernández, Jeff Almeida-King, Bert Overduin, Michael Schuster (QC), Giulietta Spudich, Jana Vandrovcova
Systems & Support: Guy Coates, James Beal, Gen-Tao Chiang, Peter Clapham, Simon Kelley, Shelley Goddard, Tracy Mumford, Kerry Smith
Research: Benoît Ballester, Petra Catalina Schwalie, André Faure, Markus Fritz, Damian Keefe, Alison Meynert, Dace Ruklisa, Mikhail Spivakov, David Thybert, Sander Timmer, Albert Vilella
Vertebrate Genomics: Chao-Kung Chen, Laura Clarke, Jonathan Hinton, Zam Iqbal, Vasudev Kumanduri, Ilkka Lappalainen, Edoardo Marcora, Pablo Marín, Damian Smedley, Richard Smth, Phil Wilkinson, Holly Zheng-Bradley
Ensembl Genomes: Paul Kersey, Paul Derwent, Matthias Haimel, Alan Horne, Arnaud Kerhornou, Uma Maheswari, Michael Nuhn, Dan Staines, Andy Yates
VectorBase: Dan Lawson, Gautier Koscielny, Karyn Megy
Zebrafish: Kerstin Howe, Kim Brugger, Will Chow, Britt Reimholz, James Torrance
Ensembl Strategy: Ewan Birney, Richard Durbin, Tim Hubbard

Species in Ensembl
• Ensembl focuses on vertebrates; more than 50 species are available

Extending the taxonomic space

Data in Ensembl
Core data:
• Genomic sequence
• Gene / transcript / protein models
• External references
• Mapped
cDNAs, proteins, microarray probes, BAC clones, cytogenetic bands, repeats, markers, etc.
Comparative data:
• Orthologs and paralogs, protein families, whole genome alignments, syntenic regions
Variation data:
• Sequence variants, structural variants, phenotypes, linkage disequilibrium
Regulatory data:
• “Best guess” set of regulatory elements

Gene models
• Ensembl genes - Automatic annotation
  • Genome-wide determination using the Ensembl genebuild pipeline, based on biological evidence
  • Highly consistent, updated fairly frequently
  • Protein sequences: UniProt Knowledge Base (UniProtKB)
  • cDNA sequences: INSDC (ENA, GenBank and DDBJ)
• Havana genes - Manual curation
  • Reviewed determination, hand-checked on a case-by-case basis
  • Highly accurate, but very labour intensive
  • Protein sequences: UniProt Knowledge Base (UniProtKB)
  • cDNA sequences: INSDC (ENA, GenBank and DDBJ)
  • EST sequences: INSDC (ENA, GenBank and DDBJ)

GENCODE gene set
• Ensembl-Havana merge: the super-set of both approaches
• Ensembl transcripts are merged into Havana transcripts; this requires a perfect splice-site match at every exon boundary
• Both high quality and high accuracy
• Distributed as the GENCODE gene set of the Encyclopedia Of DNA Elements (ENCODE) project (http://www.gencodegenes.org/)
• Participates in the Consensus Coding Sequence (CCDS) consortium
• The basis of gene trees and of orthologue and paralogue predictions

Gene annotation – graphical view
[Screenshot: annotated transcripts for the SMAD2 gene. Callouts: Chromosomal location; Experimental evidence (RNA sequences, protein sequences); Assembly (genome sequence built from overlapping contigs); Pop-up window with more information and links]

Gene models based on RNA-Seq data
• In-house pipeline development, based on BWA and Exonerate
• Pioneered for Gorilla and Zebrafish
• Illumina® Human Body Map 2.0 data
• For the moment, only one transcript per gene is created
• Intron features illustrate alternative splicing events
• But … further analysis and validation is required
• Future Goal:
Merge with conventional cDNA and protein-based annotation

[Screenshot: SLC25A3 gene region, Ensembl Homo sapiens version 64.37 (GRCh37), Chromosome 12: 98,987,369 - 98,995,946 (8.58 Kb), forward strand. Tracks: chromosome band q23.1; tissue tracks (skeletal muscle, lung, liver, brain, blood); CCDS set (CCDS9065.1, CCDS9066.1); Ensembl/Havana transcripts SLC25A3-008, -201, -001, -005, -013, -004, -015, -014, -016, -007 and -002 (protein coding) and SLC25A3-006 (nonsense mediated decay)]

How to access the data?
• www.ensembl.org
• uswest.ensembl.org
• useast.ensembl.org
• archive.ensembl.org
• pre.ensembl.org

Location view
Information is displayed in different views:
• Species summary
• Location
• Gene
• Transcript
• Variation
• Regulation
Additional information can be added using view-specific displays and the “Configure this page” button
[Screenshot callouts: Individual “tracks”; Click and drag the mouse to select a region, or use the navigation buttons]

Comparative genomics

Gene trees - homologues prediction
[Screenshot callouts: Gene tree; Multiple Sequence Alignment; Ideograms]

Gene trees
[Legend: Gene (red); Duplication node (red); Speciation node (blue); Paralogue (blue); Collapsed sub-tree]

Location view
[Screenshot callouts: Conservation Scores; Multiple Sequence Alignment; BLASTZ Conservation Tracks; tBLAT Conservation Tracks; Multi-species view]

Location view - Synteny
[Screenshot callouts: Mouse chromosomes; Human chromosome 18; List of orthologous genes located within the syntenic region]

Functional genomics
Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

Regulatory features
[Screenshot: an example of a promoter-associated region in the “Region in detail” view]
Good predictors of promoter
regions
• DNase I hypersensitivity sites (marks of accessible chromatin)
• H3K4me3 (trimethylation of histone H3 at lysine 4)
• RNA polymerase II recruitment

TFBS annotation
• Positions of putative TF binding sites within the annotated regulatory regions
• For each transcription factor (TF) that has both ChIP-seq data and a publicly available position weight matrix (PWM)
• PWMs are taken from the JASPAR database

Externally curated data
• cisRED – regulatory motifs
• miRanda – miRNA targets
• VISTA – enhancers
• MeDIP and reduced representation bisulphite sequencing (RRBS) from ENCODE – methylation data
• eQTL data

Variation

Types of Variation
• Germline
• Somatic
• Large scale (structural)
  • Many are copy number variants (CNVs)
• Small scale (sequence)
  • Single Nucleotide Polymorphisms (SNPs)
  • Deletion/insertion polymorphisms (DIPs or indels)

Variation sources
Structural
• Database of Genomic Variants Archive (DGVa) / Database of genomic structural variation (dbVar)
• Several Affymetrix and Illumina arrays
DAS sources:
• DECIPHER (DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources)
• DGV loci (Database of Genomic Variants)
Sequence
• Most sequence variants are imported from the dbSNP repository (imported data: alleles, flanking sequences, frequencies; calculated data: position, synonymous status, amino acid change)
• Several microarrays
• Ensembl variations (from comparison to other individuals/strains/breeds)
• HGMD (Human Gene Mutation Database) Public
• Leiden Open Variation Database
• UniProtKB
• COSMIC (Catalogue of Somatic Mutations in Cancer)

Variation display
[Screenshot callouts: Phenotype; Consequence]
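
The small-scale variant classes listed under "Types of Variation" can be told apart from the alleles alone. The following is a minimal Python sketch of that idea, not Ensembl's own code: the function name `classify_small_variant`, the use of "-" for an absent allele, and the simplified length-based rules are all assumptions made for illustration.

```python
def classify_small_variant(ref: str, alt: str) -> str:
    """Classify a small-scale sequence variant from two allele strings.

    Assumed convention for this sketch: alleles are nucleotide strings and
    "-" (or an empty string) stands for an absent allele.
    """
    # Normalise: drop the "-" placeholder and ignore case.
    ref = ref.replace("-", "").upper()
    alt = alt.replace("-", "").upper()

    if ref == alt:
        return "no variation"
    if len(ref) == 1 and len(alt) == 1:
        # One base replaced by another single base.
        return "SNP"
    if len(ref) > len(alt):
        # Simplification: any net loss of bases is reported as a deletion.
        return "deletion"
    if len(ref) < len(alt):
        # Simplification: any net gain of bases is reported as an insertion.
        return "insertion"
    # Same length, more than one base changed.
    return "substitution"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("AT", "-"), ("-", "C"), ("AC", "GT")]:
        print(f"{ref}/{alt}: {classify_small_variant(ref, alt)}")
```

Deletions and insertions together make up the DIP/indel class named on the slide. A real pipeline would also have to handle multi-allelic sites and alleles that share flanking bases, which this sketch ignores.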