Ensembl Tools

EBI is an Outstation of the European Molecular Biology Laboratory.

Questions?

• We’ve muted all the mics
• Ask questions in the Chat box in the webinar interface
• I will check the Chat box periodically for questions
• There’s no threading, so please respond with @name

Objectives

• What is Ensembl?
• What tools are available in Ensembl?
• How to use the online tools in Ensembl
• Where to go for help and documentation

Overview

• Introduction to Ensembl
• BLAST/BLAT
  • Sequence searching
• Assembly Converter
  • Convert files between assemblies
• Data Slicer
  • Pull out sections of VCF and BAM files
• File Chameleon
  • Custom download of reference files for NGS analysis
• Variant Effect Predictor (VEP)
  • Analyse your own variants

Introduction

Why do we need genome browsers?

1977: 1st genome to be sequenced (5 kb)
2004: finished human sequence (3 Gb)

[Slide shows a full page of raw, unannotated DNA sequence: CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGACGCCATC...]

We need to make the data mean something…

http://www.ncbi.nlm.nih.gov/mapview
http://www.ensembl.org
http://genome.ucsc.edu

Ensembl Features

• Gene builds for ~70 species
• Gene trees
• Regulatory build (ENCODE)
• Variation display and VEP
• Display of user data
• BioMart (data export)
• Programmatic access via the APIs
• Completely Open Source

Access scales

[Figure: access methods arranged by scale, from one by one, through groups, to whole genome: Main browser, Mobile site, BioMart, REST API, API, VEP, MySQL, FTP]

Vertebrate species on Ensembl

Image obtained using Dendroscope: D.H. Huson and C. Scornavacca, Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks, Systematic Biology, 2012

Non-vertebrates on Ensembl

Bacteria, Protists, Fungi, Metazoa, Plants

www.ensemblgenomes.org

Ensembl and Ensembl Genomes

              Ensembl                       Ensembl Genomes
Released      2000                          2009
Species       Vertebrates (fly, worm and    Non-vertebrates (protists, plants,
              yeast as outgroups)           fungi, metazoa, bacteria)
Annotation    by Ensembl                    in collaboration with the scientific communities
URL           www.ensembl.org               www.ensemblgenomes.org

Release cycle

[Figure: release cycle from release 89 (May 2017) to release 90 (July 2017), 2-3 months apart. A release can include: new/updated interfaces, new genome assemblies, updated regulation data, updated variation data, underlying software updates, Compara on new genomes, updated gene sets]

Ensembl Tools

Tools allow:
• Interpretation and processing of your own data
• Custom download of Ensembl data for further analysis

BLAST/BLAT for sequence searching

• Find Ensembl sequences that match your sequence using BLAST/BLAT
• Search:
  • DNA sequences
  • Protein sequences
  • Short sequences (e.g. primers, morpholinos, siRNAs)
• Search against:
  • Genomic sequences
  • cDNA sequences
  • Protein sequences

Hands on – BLAST/BLAT

• I’ve designed a pair of primers for RT-PCR against human BRCA2
• I want to make sure they don’t have any non-specific hits that will mess up my RT-PCR results
• The sequences are:

>fwd
GAGGACTCCTTATGTCCAAATTT
>rev
GAGAATCAGCTTCTGGGGTAATAA

Assembly converter

• You have data mapped to an old genome assembly
• You want to update your data to map it to a new one

What is a genome assembly?

Sequence reads
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA
CAGCTGTCCCAGATGAC
ACTTAACTTCCCTCCCAGCTGTCC
GGGCTCCGCCTTCAGCTC
TCCCAGCTGTCCCAGATGACGCCATC
AACTTCCCTCCCAGCT
CGGCCTTTGGGCTCC
TCCGCCTTCAGCTCAAGACTTAACTTC
CAGATGACGCC

Match up overlaps

CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA   AACTTCCCTCCCAGCT    CAGATGACGCC
            TCCGCCTTCAGCTCAAGACTTAACTTC  TCCCAGCTGTCCCAGATGACGCCATC
        GGGCTCCGCCTTCAGCTC   ACTTAACTTCCCTCCCAGCTGTCC
CGGCCTTTGGGCTCC                             CAGCTGTCCCAGATGAC
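The overlap-matching step can be illustrated with a toy greedy merge in Python. This is only a sketch for the nine short reads on this slide: the function names `overlap` and `greedy_assemble` are made up for illustration, and real assemblers also have to cope with sequencing errors, repeats and reads from the reverse strand.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads):
    """Toy assembler: repeatedly merge the pair of contigs with the longest overlap."""
    # Reads that sit entirely inside another read add no new sequence.
    contigs = [r for r in dict.fromkeys(reads)
               if not any(r != other and r in other for other in reads)]
    while len(contigs) > 1:
        # Score every ordered pair (a followed by b) and keep the best one.
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(contigs)
                      for j, b in enumerate(contigs) if i != j)
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)]
        contigs.append(merged)
    return contigs[0]


# The nine sequence reads from the slide.
READS = [
    "CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA",
    "CAGCTGTCCCAGATGAC",
    "ACTTAACTTCCCTCCCAGCTGTCC",
    "GGGCTCCGCCTTCAGCTC",
    "TCCCAGCTGTCCCAGATGACGCCATC",
    "AACTTCCCTCCCAGCT",
    "CGGCCTTTGGGCTCC",
    "TCCGCCTTCAGCTCAAGACTTAACTTC",
    "CAGATGACGCC",
]

assembly = greedy_assemble(READS)
print(assembly)
```

Every read ends up as a substring of the merged sequence, which is the sense in which the assembly "explains" the reads.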

Genome assembly
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGACGCCATC

Genome contigs

[Figure: a genome assembly built from the contigs BL102, AL476, CM553 and IM768]

Reference alleles

[Figure: contigs BL102, AL476, CM553 and IM768, with a variable position highlighted in the sequence (AGTCGTAGCTAGC, TAGGCCATAGGCGA)]

Frequency T = 0.05, frequency G = 0.95
G is the allele in all primates
T causes disease susceptibility
Perhaps G should be the reference allele?
We can replace the region with a new contig

Genome Gaps

[Figure: contigs BL102, AL476, CM553 and IM768 with a gap between them]

Gap in the genome caused by:
● Poor sequencing at this region
● No contig was ever cloned

We can fill in the gap with a new contig

Incorrectly assembled contigs

[Figure: contigs BL102, AL476, CM553 and IM768 shown before and after their order is corrected]

New genome assemblies

• Fixing errors in the genome produces a new genome assembly
• New genome assemblies mean re-mapping of all genome features
• Ensembl will stop updating the old assembly when a new one is brought in
• You’ve got data mapped to the old assembly and you want to compare to the up-to-date Ensembl annotation

Assembly converter

• Converts genome coordinates to a different genome assembly
• Works with:
  • BED (simple coordinates)
  • GFF (gene, transcript and exon coordinates)
  • GTF (gene, transcript and exon coordinates)
  • WIG (values plotted against the genome)
  • VCF (variants)

Hands-on – Assembly converter

• We’re going to convert a small BED file from the human genome assembly GRCh37 to the more recent GRCh38
• BED is a simple feature format which lists the chromosome, start coordinate and end coordinate of each feature, optionally followed by a name.
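As a rough illustration of how simple the BED layout is, the three features of this exercise (P1 to P3) can be read with a few lines of Python. This is a sketch only: `BedFeature`, `parse_bed_line` and `feature_length` are illustrative names, and it assumes four whitespace-separated columns per line. It does not convert anything between assemblies; that step still needs the Assembly Converter.

```python
from collections import namedtuple

# Illustrative record type: chromosome, start, end, feature name.
BedFeature = namedtuple("BedFeature", "chrom start end name")


def parse_bed_line(line):
    """Parse one BED line with at least four columns."""
    chrom, start, end, name = line.split()[:4]
    return BedFeature(chrom, int(start), int(end), name)


def feature_length(feature):
    """BED starts are 0-based and ends are exclusive, so length is end - start."""
    return feature.end - feature.start


# The GRCh37 features used in this hands-on exercise.
BED_TEXT = """\
5 36821734 37091336 P1
5 36731578 36978408 P2
5 36908654 37108773 P3
"""

features = [parse_bed_line(line) for line in BED_TEXT.splitlines() if line.strip()]
for f in features:
    print(f.name, f.chrom, f.start, f.end, feature_length(f))
```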

5 36821734 37091336 P1
5 36731578 36978408 P2
5 36908654 37108773 P3

Data Slicer for variants

• Whole genome VCF files are unwieldy
  • They contain all variants in the genome
  • They contain all genotypes from all individuals studied
• Sometimes you just want to analyse a small region and one population
• The Data Slicer allows you to take a slice of a VCF and narrow down to only individuals and populations of interest
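To make the idea of "slicing" concrete, here is a minimal Python sketch of what such a filter does to a VCF: keep the records in one region and only the genotype columns of selected samples. It is not the Data Slicer's own code. The function name `slice_vcf`, the sample names S1 to S3 and the three records are all invented for illustration, and INFO fields such as allele frequencies are passed through unchanged rather than recalculated.

```python
def slice_vcf(lines, chrom, start, end, keep_samples):
    """Yield VCF lines restricted to one region and a subset of sample columns."""
    keep_idx = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line  # meta-information lines pass through untouched
        elif line.startswith("#CHROM"):
            cols = line.split("\t")
            # The first nine columns are fixed; sample columns follow them.
            keep_idx = [i for i, name in enumerate(cols)
                        if i < 9 or name in keep_samples]
            yield "\t".join(cols[i] for i in keep_idx)
        elif line:
            cols = line.split("\t")
            if cols[0] == chrom and start <= int(cols[1]) <= end:
                yield "\t".join(cols[i] for i in keep_idx)


# Invented example data: three samples, three records.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "16\t89978000\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1",
    "16\t89980000\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\t0|0\t1|0",
    "17\t89980000\t.\tG\tA\t.\tPASS\t.\tGT\t0|0\t0|0\t0|1",
]

sliced = list(slice_vcf(EXAMPLE_VCF, "16", 89978527, 89987385, {"S1", "S3"}))
for row in sliced:
    print(row)
```

In practice the list of samples would come from a population panel file rather than being typed in by hand.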

• Data Slicer currently only accesses the 1000 Genomes data
• It is only available for human and only on GRCh37

Hands on – Data Slicer

• I want to get a VCF of the region containing the MC1R gene for the British population
• MC1R is found at 16:89978527-89987385 in GRCh37
• The three-letter code for the British population in 1000 Genomes is GBR

FTP

• Files of our complete database:
  • Genomic, cDNA, CDS, ncRNA and protein sequence (FASTA)
  • Annotated sequence (EMBL, GenBank)
  • Gene sets (GTF, GFF)
  • Whole-genome multiple and gene-based multiple alignments (MAF)
  • Variants (VCF, GVF)
  • Constrained elements (BED)
  • Regulatory features (BED, BigWig)
  • RNA-Seq files (BAM, BigWig)
  • MySQL database

Access FTP

Your favourite FTP client

FTP site ftp://ftp.ensembl.org/pub/

FTP downloads page http://www.ensembl.org/info/data/ftp/index.html

FTP files are big

• Multiple Mb/Gb
• Lots of time to download/unzip
• Do you really need this data?
• Make sure it’s the right file before you download.

File Chameleon for NGS analysis

• Although files on the Ensembl FTP site are in a standard format, different tools define the standards differently (sigh!)
• Your NGS analysis tool might need files that are slightly different to the Ensembl formats
• File Chameleon allows you to download files with these adjustments

Hands on – File Chameleon

• I need a GFF3 file of cat for my RNA-seq analysis.
• My tool requires:
  • UCSC-style naming like chr1
  • Only genes shorter than 4 Mb
  • Transcript IDs in every line
• We will use File Chameleon to download this customised file.

Analyse your own variants with the VEP

• Find out the effects of your own variants on Ensembl genes
• Analyse whole genome variant calls
• Filter variants to find those that might be interesting

Your own variant data

Variant coordinates
1   881907    881906    -/C  +
5   140532    140532    T/C  +
12  1017956   1017956   T/A  +
2   946507    946507    G/C  +
14  19584687  19584687  C/T  -
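The coordinate format above has five whitespace-separated columns: chromosome, start, end, alleles (reference/alternative) and strand. A small Python sketch of reading it is given here; `parse_variant_line` is an illustrative name, not part of the VEP. Note the first example line: an insertion is written with a start one greater than its end and "-" as the reference allele.

```python
def parse_variant_line(line):
    """Parse one 'chrom start end ref/alt strand' line into a dict."""
    chrom, start, end, alleles, strand = line.split()[:5]
    start, end = int(start), int(end)
    ref, alt = alleles.split("/", 1)
    if ref == "-":
        kind = "insertion"    # nothing in the reference; start == end + 1
    elif alt == "-":
        kind = "deletion"     # reference bases removed
    elif len(ref) == 1 and len(alt) == 1:
        kind = "SNV"          # single base changed
    else:
        kind = "substitution"
    return {
        "chrom": chrom, "start": start, "end": end,
        "ref": ref, "alt": alt,
        "strand": 1 if strand == "+" else -1,
        "kind": kind,
    }


# Two of the example lines from the slide.
insertion = parse_variant_line("1 881907 881906 -/C +")
snv = parse_variant_line("14 19584687 19584687 C/T -")
print(insertion["kind"], snv["kind"], snv["strand"])
```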

HGVS notation
ENST00000285667.3:c.1047_1048insC
5:g.140532T>C
NM_153681.2:c.7C>T
ENSP00000439902.1:p.Ala2233Asp
NP_000050.2:p.Ile2285Val
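Each HGVS string above has the same three-part shape: a reference sequence, a one-letter coordinate type (g for genomic, c for coding DNA, p for protein, and so on) and the change itself. The Python sketch below only splits a string into those parts; it does not validate the change, and the names `HGVS_PATTERN` and `split_hgvs` are illustrative.

```python
import re

# reference ":" type "." change, e.g. 5:g.140532T>C
HGVS_PATTERN = re.compile(r"^(?P<reference>[^:]+):(?P<level>[cgmnpr])\.(?P<change>.+)$")

LEVEL_NAMES = {
    "g": "genomic",
    "c": "coding DNA",
    "n": "non-coding DNA",
    "m": "mitochondrial",
    "r": "RNA",
    "p": "protein",
}


def split_hgvs(notation):
    """Split an HGVS string into (reference, level name, change)."""
    match = HGVS_PATTERN.match(notation.strip())
    if match is None:
        raise ValueError("not a recognised HGVS string: %r" % notation)
    return (match.group("reference"),
            LEVEL_NAMES[match.group("level")],
            match.group("change"))


for example in ["ENST00000285667.3:c.1047_1048insC",
                "5:g.140532T>C",
                "ENSP00000439902.1:p.Ala2233Asp"]:
    print(split_hgvs(example))
```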

VCF
#CHROM  POS      ID         REF  ALT
20      14370    rs6054257  G    A
20      17330    .          T    A
20      1110696  rs6040355  A    G,T
20      1230237  .          T    .

Variant IDs
rs41293501
COSM327779
rs146120136
FANCD1:c.475G>A
rs373400041

Variation types

1) Small scale, in one or a few nucleotides of a gene
• Small insertions and deletions (DIPs or indels)
• Single nucleotide polymorphism (SNP)

A G A C T T G A C C T G T C T - A A C T G G A
T G A C T T G A C - T G T C T G A A C G G G A

2) Large scale, in chromosomal structure (structural variation)
• Copy number variations (CNV)
• Large deletions/duplications, insertions, translocations

[Figure: deletion, duplication, insertion, translocation]

Variation consequences

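[Figure: a gene model from the ATG start codon to the poly-A tail (AAAAAAA), with variant consequences labelled by location: Regulatory, 5’ Upstream, 5’ UTR, Coding (synonymous), Coding (non-synonymous), Splice site, Intronic, 3’ UTR, 3’ Downstream]

Consequence terms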

http://www.ensembl.org/info/docs/variation/predicted_data.html

Predicting missense effects – SIFT and PolyPhen

SIFT and PolyPhen score changes in amino acid sequence based on:
• How well conserved the protein is
• The chemical change in the amino acid
• 3D structure and domains (PolyPhen only)

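• SIFT and PolyPhen are predictions, not facts
• A prediction will never be as good as experimental validation

SIFT PolyPhen

[Figure: two score scales, each running from 0 to 1. SIFT: scores below 0.05 are Deleterious, scores from 0.05 up to 1 are Tolerated. PolyPhen: Benign at the low end, then Possibly damaging, then Probably damaging at the high end (marks shown on the scale: 0.1, 0.2)]

Use the VEP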

http://www.ensembl.org/info/docs/tools/vep/index.html

Species that work with the VEP

?

+ everything in Plants, Fungi, Metazoa, Protists and Bacteria

Set up a cache

- Speed up your VEP script with an offline cache.
- Use prebuilt caches for Ensembl species.
- Or make your own from GTF and FASTA files, even for genomes not in Ensembl.

http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html

VEP plugins

• Plugins add extra functionality to the VEP
• They may extend, filter or manipulate the output of the VEP
• Plugins may make use of external data or code
• Available on the web tool and with the script

Hands on

• We’re going to look at a set of four variants to find out what genes they hit and what effect they have on them.

9 128328461 128328461 A/- + var1
9 128322349 128322349 C/A + var2
9 128323079 128323079 C/G + var3
9 128322917 128322917 G/A + var4

Questions?

• We’ve muted all the mics
• Ask questions in the Chat box in the webinar interface
• I will check the Chat interface
• There’s no threading, so please respond with @name

Host an Ensembl course

We can teach an Ensembl course at your institute for free (except trainers’ expenses). Email us: [email protected]

Browser course
½-2 day course on the Ensembl browser, aimed at wet-lab scientists. One trainer.

REST API course
1-2 day course on the Ensembl Perl API, aimed at bioinformaticians. 1-2 trainers.

http://training.ensembl.org/

Help and documentation

Course online: http://www.ebi.ac.uk/training/online/subjects/11
Tutorials: www.ensembl.org/info/website/tutorials
Flash animations: www.youtube.com/user/EnsemblHelpdesk, http://u.youku.com/Ensemblhelpdesk
Email us: [email protected]
Ensembl public mailing lists: [email protected], [email protected]
Follow us: www.facebook.com/Ensembl.org, @Ensembl, www.ensembl.info

Publications

http://www.ensembl.org/info/about/publications.html

Aken, B. et al Ensembl 2017 Nucleic Acids Research http://europepmc.org/articles/PMC5210575

Xosé M. Fernández-Suárez and Michael K. Schuster Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010) www.ncbi.nlm.nih.gov/pubmed/20521244

Giulietta M Spudich and Xosé M Fernández-Suárez Touring Ensembl: A practical guide to genome browsing BMC Genomics 11:295 (2010) www.biomedcentral.com/1471-2164/11/295

Ensembl Acknowledgements

The Entire Ensembl Team

Funding

Co-funded by the European Union