Browsing with EnsEMBL Annotation

• Recent years have seen the release of large amounts of sequence data
• Raw sequence data are not very useful on their own; they are most valuable when accompanied by comprehensive, good-quality annotation

[Slide figure: a full screen of unannotated genomic DNA sequence, beginning CCCAACAAGAATGTAAAATCTTTAAGTGCC]

The Ensembl project

• The Ensembl project was started in 1999, some years before the draft human genome was completed
• Joint project between the Sanger Institute and the EBI

Goals: • To provide automated but accurate annotation

• Open source

• Integrate this annotation with other available biological data (directly or via the Distributed Annotation System, DAS)

• Both web and programmatic interfaces

Ensembl team

Ensembl Paul Flicek (EBI), Steve Searle (Sanger Institute)

Software Glenn Proctor, Andreas Kähäri, Stephen Keenan, Rhoda Kinsella, Eugene Kulesha, Ian Longden, Iliana Toneva

Comparative Genomics Javier Herrero, Kathryn Beal, Stephen Fitzgerald, Leo Gordon, Matthieu Muffato, Miguel Pignatelli

Functional Genomics Ian Dunham, Nathan Johnson, Daniel Sobral, Steven Wilder

Variation Fiona Cunningham, Laurent Gil, Pontus Larsson, Will McLaren, Graham Ritchie

Analysis and Annotation Jan-Hinnerck Vogel, Bronwen Aken, Susan Fairley, Thibaut Hourlier, Magali Ruffier, Simon White, Amy Tang, Amonida Zadissa

Web Team Anne Parker, Ridwan Amode, Simon Brent, Maurice Hendrix, Bethan Pritchard, Steve Trevanion (VEGA)

Outreach Xosé M Fernández, Jeff Almeida-King, Bert Overduin, Michael Schuster (QC), Giulietta Spudich, Jana Vandrovcova

Systems & Support Guy Coates, James Beal, Gen-Tao Chiang, Peter Clapham, Simon Kelley, Shelley Goddard, Tracy Mumford, Kerry Smith

Research Benoît Ballester, Petra Catalina Schwalie, André Faure, Markus Fritz, Damian Keefe, Alison Meynert, Dace Ruklisa, Mikhail Spivakov, David Thybert, Sander Timmer, Albert Vilella

Vertebrate Genomics Chao-Kung Chen, Laura Clarke, Jonathan Hinton, Zam Iqbal, Vasudev Kumanduri, Ilkka Lappalainen, Edoardo Marcora, Pablo Marín, Damian Smedley, Richard Smith, Phil Wilkinson, Holly Zheng-Bradley

Ensembl Genomes Paul Kersey, Paul Derwent, Matthias Haimel, Alan Horne, Arnaud Kerhornou, Uma Maheswari, Michael Nuhn, Dan Staines, Andy Yates

VectorBase Dan Lawson, Gautier Koscielny, Karyn Megy

Zebrafish Kerstin Howe, Kim Brugger, Will Chow, Britt Reimholz, James Torrance

Ensembl Strategy Ewan Birney, Richard Durbin, Tim Hubbard

Species in Ensembl

• Ensembl focuses on vertebrates, with more than 50 species available

Extending the taxonomic space

Data in Ensembl

Core data: • Genomic sequence • Gene and transcript models • External references • Mapped cDNAs, microarray probes, BAC clones, cytogenetic bands, repeats, markers etc.

Comparative data: • orthologs and paralogs, protein families, whole genome alignments, syntenic regions

Variation data: • sequence variants, structural variants, phenotypes, linkage disequilibrium

Regulatory data: • “best guess” set of regulatory elements

Gene models

• Ensembl - Automatic annotation
• Genome-wide determination using the Ensembl genebuild pipeline, based on biological evidence
• Highly consistent and frequently updated
• Protein sequences: UniProt Knowledge Base (UniProtKB)
• cDNA sequences: INSDC (ENA, GenBank and DDBJ)

• Havana genes - Manual curation
• Reviewed determination, hand-checked on a case-by-case basis
• Highly accurate, but very labour intensive
• Protein sequences: UniProt Knowledge Base (UniProtKB)
• cDNA sequences: INSDC (ENA, GenBank and DDBJ)
• EST sequences: INSDC (ENA, GenBank and DDBJ)

GENCODE gene set

• The Ensembl-Havana merge is the superset of both approaches
• Ensembl transcripts are merged into Havana transcripts; this requires a perfect splice-site match at every exon boundary

• Both high quality and high accuracy
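As a rough illustration of the splice-site rule for the merge, the sketch below treats a transcript as a list of exon (start, end) pairs and considers two models mergeable only when their intron chains are identical. The function names, the coordinate representation and the example coordinates are assumptions made for this example; the actual genebuild merge code is not shown in these slides and may apply further checks.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive, all on one strand


def intron_chain(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return the introns implied by a list of exons as (start, end) pairs."""
    ordered = sorted(exons)
    # An intron runs from one past an exon's end to one before the next exon's start.
    return [(left[1] + 1, right[0] - 1) for left, right in zip(ordered, ordered[1:])]


def splice_sites_match(a: List[Exon], b: List[Exon]) -> bool:
    """True when both transcripts share every splice junction (identical intron chains)."""
    return intron_chain(a) == intron_chain(b)


# Hypothetical coordinates, for illustration only.
havana = [(100, 200), (300, 400), (500, 650)]
ensembl_same = [(120, 200), (300, 400), (500, 600)]     # differs only at the outer ends
ensembl_shifted = [(100, 200), (303, 400), (500, 650)]  # one splice site shifted by 3 bp

print(splice_sites_match(havana, ensembl_same))     # True
print(splice_sites_match(havana, ensembl_shifted))  # False
```

Comparing intron chains rather than exons lets the two models differ at their outer ends while still agreeing on every splice junction. Single-exon models have empty intron chains, so they would need a separate overlap test.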

• Distributed as the GENCODE gene set of the Encyclopedia of DNA Elements (ENCODE) project (http://www.gencodegenes.org/)

• Participates in the Consensus Coding Sequence (CCDS) consortium

• The basis of gene trees and of orthologue and paralogue predictions

Gene annotation – graphical view

[Slide figure: annotated transcripts for the SMAD2 gene. Callouts: chromosomal location; experimental evidence (RNA sequences, protein sequences); assembly (genome sequence built from overlapping contigs); pop-up window with more information and links]

Gene models based on RNA-Seq data

• In-house pipeline development, based on BWA and Exonerate
• Pioneered for gorilla and zebrafish
• Illumina® Human Body Map 2.0 data

• For the moment, only one transcript per gene is created
• Intron features illustrate alternative splicing events
• But … further analysis and validation are required

• Future goal: merge with conventional cDNA- and protein-based annotation

SLC25A3 gene

[Slide figure: Ensembl location view of the SLC25A3 gene, Homo sapiens version 64.37 (GRCh37), chromosome 12: 98,987,369 - 98,995,946 (8.58 kb, band q23.1). Tracks shown: skeletal muscle, lung, liver, brain and blood; CCDS set (CCDS9065.1, CCDS9066.1); Ensembl/Havana transcripts SLC25A3-001, -002, -004, -005, -007, -008, -013, -014, -015, -016 and -201 (protein coding) and SLC25A3-006 (nonsense mediated decay)]

How to access the data?

www.ensembl.org uswest.ensembl.org useast.ensembl.org

archive.ensembl.org

pre.ensembl.org

Location view

[Slide figure: location view. Callouts:
• Information is displayed in different views: species summary, location, gene, transcript, variation, regulation
• Additional information can be added using view-specific displays and the “Configure this page” button
• Individual “tracks”
• Click and drag the mouse to select a region, or use the navigation buttons]

Comparative genomics

Gene trees - homologue prediction

[Slide figure: gene tree; multiple ideograms]

Gene trees

[Slide figure: gene tree. Legend: gene (red); duplication node (red); speciation node (blue); paralogue (blue); collapsed sub-tree]

Location view

[Slide figure: conservation scores; multiple sequence alignment; BLASTZ conservation tracks; TBLAT conservation tracks; multi-species view]

Location view - Synteny

[Slide figure: human chromosome 18 and mouse, with a list of orthologous genes located within the syntenic region]

Functional genomics

Credits: Darryl Leja (NHGRI), Ian Dunham (EBI)

Regulatory features

An example of a promoter-associated region in the “Region in detail” view

Good predictors of promoter regions

• DNase I hypersensitive sites (marks of accessible chromatin)
• H3K4me3 (trimethylation of histone H3 at lysine 4)
• RNA polymerase II recruitment

TFBS annotation

• Positions of putative transcription factor (TF) binding sites within the annotated regulatory regions
• Annotated for each TF that has both ChIP-seq data and a publicly available position weight matrix (PWM)
• PWMs are taken from the JASPAR database

Externally curated data

• cisRED – regulatory motifs
• miRanda – miRNA targets
• VISTA – enhancers
• MeDIP and reduced representation bisulphite sequencing (RRBS) from ENCODE – methylation data
• eQTL data

Variation

Types of Variation

• Germline • Somatic

• Large scale (structural)
  • Many are copy number variants (CNVs)
• Small scale (sequence)
  • Single nucleotide polymorphisms (SNPs)
  • Deletions/insertions (DIPs or indels)
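The small-scale classes can be told apart from the reference and alternate alleles alone. The sketch below is a simplified classifier; the function name, the use of "-" for an empty allele and the class labels are assumptions for illustration and do not reproduce Ensembl's own variation class terms.

```python
def classify_small_variant(ref: str, alt: str) -> str:
    """Classify a sequence variant from its reference and alternate alleles.

    Alleles are plain nucleotide strings; "-" denotes an empty allele.
    """
    ref = ref.replace("-", "").upper()
    alt = alt.replace("-", "").upper()
    if ref == alt:
        return "no change"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"            # single base replaced by another single base
    if len(ref) == 0:
        return "insertion"      # nothing in the reference, bases added
    if len(alt) == 0:
        return "deletion"       # reference bases removed
    if len(ref) == len(alt):
        return "substitution"   # multi-base replacement of equal length
    return "indel"              # replacement that changes the length


for ref, alt in [("G", "T"), ("-", "AC"), ("CTT", "-"), ("AT", "GC"), ("A", "ACG")]:
    print(ref, alt, classify_small_variant(ref, alt))
```

Structural variants such as CNVs are not covered here: they are normally described by genomic intervals and copy number rather than by explicit allele strings.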

Variation sources

Structural
• Database of Genomic Variants archive (DGVa) / Database of Genomic Structural Variation (dbVar)
• Several Affymetrix and Illumina arrays

DAS sources:
• DECIPHER (DatabasE of Chromosomal Imbalance and Phenotype in Humans using Ensembl Resources)
• DGV loci (Database of Genomic Variants)

Sequence
• Most sequence variants are imported from the dbSNP repository (imported data: alleles, flanking sequences, frequencies; calculated data: position, synonymous status, amino acid change)
• Several microarrays
• Ensembl variations (from comparison to other individuals/strains/breeds)
• HGMD (Human Gene Mutation Database) Public
• Leiden Open Variation Database
• UniProtKB
• COSMIC (Catalogue of Somatic Mutations in Cancer)

Variation display

[Slide figure: variation display, with phenotype and consequence panels]

Ensembl Tools

Variant effect predictor

Custom Annotation

• Upload of conventional annotation exchange formats
  • GFF, GTF, BED, WIG
• Indexed binary file formats: BAM, BigBed, BigWig
  • Attachment of indexed files to the genome browser via URLs
  • Fast random access via HTTP range requests

[Slide figure: individual reads showing a G/T variant that co-localises with a known SNP]

Users can visualise results from next-generation sequencing experiments in Ensembl

DAS

• Data export • Data upload to location view, gene view

[Slide figure: attached DAS tracks]

How to access the data?

www.ensembl.org/downloads.html
www.ensembl.org/info/docs/api/index.html

FTP download

http://www.ensembl.org/info/data/ftp/index.html
ftp://ftp.ensembl.org/pub/

• Genomic, cDNA, ncRNA and protein sequence (FASTA)
• Annotated sequence (EMBL / GenBank)
• Gene sets (GTF)
• Resequencing alignments of individuals / strains (EMF)
• Variants (GVF)
• Whole-genome multiple alignments (EMF)
• Gene-based multiple alignments (EMF)
• Constrained elements (BED)
• Regulatory features (GFF)
• Database dumps (MySQL)

BioMart

• Data mining tool
• Originally developed for Ensembl (EnsMart)
• Joint project between the European Bioinformatics Institute (EBI) and the Ontario Institute for Cancer Research (OICR)
• Central portal: http://www.biomart.org
• Used by many large data resources
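Besides the web interface, a mart can be queried programmatically by sending an XML document that names a dataset, the filters to apply and the attributes to return. The sketch below builds such a query, using chromosome 5 as an example filter value. The service URL, the dataset name `hsapiens_gene_ensembl` and the filter and attribute names are assumptions that vary between marts and releases, so check them against the mart you actually use. The helper names `build_query` and `run_query` are made up for this example.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint; dataset, filter and attribute names below are also assumptions.
MARTSERVICE_URL = "http://www.ensembl.org/biomart/martservice"


def build_query(dataset: str, filters: dict, attributes: list) -> str:
    """Build a BioMart XML query string for a single dataset."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    header = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return header + ET.tostring(query, encoding="unicode")


def run_query(xml_query: str) -> str:
    """Send the query to the mart service and return the raw TSV text (needs network access)."""
    url = MARTSERVICE_URL + "?" + urllib.parse.urlencode({"query": xml_query})
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


xml_query = build_query(
    dataset="hsapiens_gene_ensembl",
    filters={"chromosome_name": "5"},
    attributes=["ensembl_gene_id", "external_gene_name"],
)
print(xml_query)
# rows = run_query(xml_query)  # uncomment to contact the server
```

Keeping query construction separate from the network call means the XML can be inspected, or pasted into the web interface, before anything is sent.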

• Feature-focused
• Mix and match queries
• “Instant” refresh of selected set
• Flexible output to HTML table, FASTA, CSV, TSV, Excel …
• For example, all Ensembl genes on chromosome 5 in GTF format, etc.

Ensembl support

Ensembl helpdesk
• Private mailing list
• General enquiries, feedback and support
[email protected]

Ensembl announcements
• Public mailing list
• Low-volume, announcement of new releases
[email protected]

Ensembl developers
• Public mailing list
• Good for technical support
[email protected]

Blog: http://www.ensembl.info
YouTube: http://www.youtube.com/user/EnsemblHelpdesk
Facebook: http://www.facebook.com/Ensembl.org
Twitter: http://twitter.com/Ensembl