Ensembl Tools
EBI is an Outstation of the European Molecular Biology Laboratory.
Questions?
• We’ve muted all the mics
• Ask questions in the Chat box in the webinar interface
• I will check the Chat box periodically for questions
• There’s no threading so please respond with @name
Objectives
• What is Ensembl?
• What tools are available in Ensembl?
• How to use the online tools in Ensembl
• Where to go for help and documentation
Overview
• Introduction to Ensembl
• BLAST/BLAT
  • Sequence searching
• Assembly Converter
  • Convert files between genome assemblies
• Data Slicer
  • Pull out sections of VCF and BAM files
• File Chameleon
  • Custom download of reference files for NGS analysis
• Variant Effect Predictor (VEP)
  • Analyse your own variants
Introduction
Why do we need genome browsers?
1977: 1st genome to be sequenced (5 kb)
2004: finished human sequence (3 Gb)
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGACGCCATCTGAAATTTCTTGGAA
ACACGATCACTTTAACGGAATATTGCTGTTTTGGGGAAGTGTTTTACAGCTGCTGGGCACGCTGTATTTGCCTTACTTAAGC
[the slide continues with a full page of raw genomic sequence]
We need to make the data mean something…
http://www.ncbi.nlm.nih.gov/mapview
http://www.ensembl.org
http://genome.ucsc.edu
Ensembl Features
• Gene builds for ~70 species
• Gene trees
• Regulatory build (ENCODE)
• Variation display and VEP
• Display of user data
• BioMart (data export)
• Programmatic access via the APIs
• Completely Open Source
Access scales
Scale of access: One by one, Groups, Whole genome
Access routes shown: Main browser, Mobile site, BioMart, REST API, Perl API, VEP, MySQL, FTP
Vertebrate species on Ensembl
Image obtained using Dendroscope:
D.H. Huson and C. Scornavacca, Dendroscope 3: An interactive tool for rooted phylogenetic trees and networks, Systematic Biology, 2012
Non-vertebrates on Ensembl Genomes
Bacteria, Protists, Fungi, Metazoa, Plants
www.ensemblgenomes.org
Ensembl and Ensembl Genomes
              Ensembl                                          Ensembl Genomes
Released      2000                                             2009
Species       Vertebrates (fly, worm and yeast as outgroups)   Non-vertebrates (protists, plants, fungi, metazoa, bacteria)
Annotation    by Ensembl                                       in collaboration with the scientific communities
URL           www.ensembl.org                                  www.ensemblgenomes.org
Release cycle
Release 89: May 2017. Release 90: July 2017. A new release every 2-3 months.
• New/updated interfaces
• New genome assemblies
• Updated regulation data
• Updated variation data
• Underlying software updates
• Compara on new genes and genomes
• Updated gene sets
Ensembl Tools
Tools allow:
• Interpretation and processing of your own data
• Custom download of Ensembl data for further analysis
BLAST/BLAT for sequence searching
• Find Ensembl sequences that match your sequence using BLAST/BLAT
• Search:
  • Nucleotide sequences
  • Protein sequences
  • Short sequences (e.g. primers, morpholinos, siRNAs)
• Search against:
  • Genomic sequences
  • cDNA sequences
  • Protein sequences
Hands-on – BLAST/BLAT
• I’ve designed a pair of primers for RT-PCR against human BRCA2
• I want to make sure they don’t have any non-specific hits that will mess up my RT-PCR results
• The sequences are:
>fwd
GAGGACTCCTTATGTCCAAATTT
>rev
GAGAATCAGCTTCTGGGGTAATAA
Assembly converter
• You have data mapped to an old genome assembly
• You want to update your data to map it to a new one
What is a genome assembly?
Sequence reads
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA
CAGCTGTCCCAGATGAC
ACTTAACTTCCCTCCCAGCTGTCC
GGGCTCCGCCTTCAGCTC
TCCCAGCTGTCCCAGATGACGCCATC
AACTTCCCTCCCAGCT
CGGCCTTTGGGCTCC
TCCGCCTTCAGCTCAAGACTTAACTTC
CAGATGACGCC
Match up overlaps
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA
CGGCCTTTGGGCTCC
        GGGCTCCGCCTTCAGCTC
            TCCGCCTTCAGCTCAAGACTTAACTTC
                             ACTTAACTTCCCTCCCAGCTGTCC
                                 AACTTCCCTCCCAGCT
                                         TCCCAGCTGTCCCAGATGACGCCATC
                                            CAGCTGTCCCAGATGAC
                                                     CAGATGACGCC
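The overlap-matching step can be sketched in a few lines of Python. This is only a toy illustration of the idea using the nine reads from the slide, not how a real assembler works: it assumes error-free reads from a single strand and greedily merges the pair with the longest suffix/prefix overlap. The function names `overlap` and `greedy_assemble` are made up for this example.

```python
READS = [
    "CGGCCTTTGGGCTCCGCCTTCAGCTCAAGA",
    "CAGCTGTCCCAGATGAC",
    "ACTTAACTTCCCTCCCAGCTGTCC",
    "GGGCTCCGCCTTCAGCTC",
    "TCCCAGCTGTCCCAGATGACGCCATC",
    "AACTTCCCTCCCAGCT",
    "CGGCCTTTGGGCTCC",
    "TCCGCCTTCAGCTCAAGACTTAACTTC",
    "CAGATGACGCC",
]


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads):
    """Toy assembler: repeatedly merge the two sequences with the largest overlap."""
    # Drop duplicates and reads that sit entirely inside another read.
    unique = list(dict.fromkeys(reads))
    contigs = [r for r in unique if not any(r != o and r in o for o in unique)]
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


contigs = greedy_assemble(READS)
for contig in contigs:
    print(len(contig), contig)
```

With these reads the merging ends in a single sequence that contains every read. Real genomes have repeats, sequencing errors and both strands, which is why assemblies contain gaps and mistakes that later assemblies correct.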
Genome assembly
CGGCCTTTGGGCTCCGCCTTCAGCTCAAGACTTAACTTCCCTCCCAGCTGTCCCAGATGACGCCATC
Genome contigs
[Figure: the genome is built from contigs BL102, AL476, CM553 and IM768]
Reference alleles
[Figure: contigs BL102, AL476, CM553 and IM768, with the sequence AGTCGTAGCTAGC TAGGCCATAGGCGA highlighted]
• Frequency T = 0.05, frequency G = 0.95
• G is the allele in all primates
• T causes disease susceptibility
• Perhaps G should be the reference allele?
• We can replace the region with a new contig
Genome Gaps
[Figure: contigs BL102, AL476, CM553 and IM768 with a gap between them]
Gap in the genome caused by:
● Poor sequencing at this region
● No contig was ever cloned
We can fill in the gap with a new contig
Incorrectly assembled contigs
[Figure: the same contigs (BL102, AL476, CM553, IM768) shown in two different orders, with CM553 and AL476 swapped]
New genome assemblies
• Fixing errors in the genome produces a new genome assembly
• New genome assemblies mean re-mapping of all genome features
• Ensembl will stop updating the old assembly when a new one is brought in
• You’ve got data mapped to the old assembly and you want to compare to the up-to-date Ensembl annotation
Assembly converter
• Converts genome coordinates to a different genome assembly
• Works with:
  • BED (simple coordinates)
  • GFF (gene, transcript and exon coordinates)
  • GTF (gene, transcript and exon coordinates)
  • WIG (values plotted against the genome)
  • VCF (variants)
Hands-on – Assembly converter
• We’re going to convert a small BED file from the human genome assembly GRCh37 to the more recent GRCh38
• BED is a simple features format which lists the start and end coordinate of the feature
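To make the BED layout concrete, here is a minimal Python sketch that reads the three features used in this exercise. It only parses the file; it does not convert between assemblies (that is what the Assembly Converter is for). The names `BedFeature` and `parse_bed` are made up for this example. BED start coordinates are 0-based and the end is exclusive, so the feature length is simply end minus start.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def parse_bed(text):
    """Parse the first four BED columns (chrom, start, end, optional name)."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        name = fields[3] if len(fields) > 3 else ""
        features.append(BedFeature(fields[0], int(fields[1]), int(fields[2]), name))
    return features


BED_TEXT = """\
5 36821734 37091336 P1
5 36731578 36978408 P2
5 36908654 37108773 P3
"""

features = parse_bed(BED_TEXT)
for f in features:
    # Show each feature as a 1-based region string plus its length.
    print(f.name, f"{f.chrom}:{f.start + 1}-{f.end}", f.end - f.start, "bp")
```

Keeping the name column (P1, P2, P3) makes it easy to match each feature in the converted file back to its original coordinates.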
5 36821734 37091336 P1
5 36731578 36978408 P2
5 36908654 37108773 P3
Data Slicer for variants
• Whole genome VCF files are unwieldy
  • They contain all variants in the genome
  • They contain all genotypes from all individuals studied
• Sometimes you just want to analyse a small region and one population
• The Data Slicer allows you to take a slice of a VCF and narrow down to only individuals and populations of interest
• Data Slicer currently only accesses the 1000 Genomes data
• It is only available for human and only on GRCh37
Hands-on – Data Slicer
• I want to get a VCF of the region containing the MC1R gene for the British population
• MC1R is found at 16:89978527-89987385 in GRCh37
• The three-letter code for the British population in 1000 Genomes is GBR
FTP
• Files of our complete database:
  • Genomic, cDNA, CDS, ncRNA and protein sequence (FASTA)
  • Annotated sequence (EMBL, GenBank)
  • Gene sets (GTF, GFF)
  • Whole-genome multiple and gene-based multiple alignments (MAF)
  • Variants (VCF, GVF)
  • Constrained elements (BED)
  • Regulatory features (BED, BigWig)
  • RNA-Seq files (BAM, BigWig)
  • MySQL database
Access FTP
Your favourite FTP client
FTP site: ftp://ftp.ensembl.org/pub/
FTP downloads page: http://www.ensembl.org/info/data/ftp/index.html
FTP files are big
• Multiple Mb/Gb
• Lots of time to download/unzip
• Do you really need this data?
• Make sure it’s the right file before you download
File Chameleon for NGS analysis
• Although files on the Ensembl FTP site are in a standard format, different tools define the standards differently (sigh!)
• Your NGS analysis tool might need files that are slightly different to the Ensembl formats
• File Chameleon allows you to download files with these adjustments
Hands-on – File Chameleon
• I need a GFF3 file of cat for my RNA-seq analysis
• My tool requires:
  • UCSC-style chromosome naming like chr1
  • Only genes shorter than 4 Mb
  • Transcript IDs in every line
• We will use File Chameleon to download this customised file
Analyse your own variants with the VEP
• Find out the effects of your own variants on Ensembl genes
• Analyse whole genome variant calls
• Filter variants to find those that might be interesting
Your own variant data
Variant coordinates
1 881907 881906 -/C +
5 140532 140532 T/C +
12 1017956 1017956 T/A +
2 946507 946507 G/C +
14 19584687 19584687 C/T -
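The columns in this coordinate format are chromosome, start, end, alleles (reference/alternative) and strand, with an optional identifier at the end. An insertion is written with start = end + 1 and "-" as the reference allele, as in the first line. A minimal Python sketch of reading these lines follows; the function name `parse_variant_line` and the labels it returns are made up for this example and are not VEP consequence terms.

```python
def parse_variant_line(line):
    """Split one whitespace-separated variant line into named fields."""
    fields = line.split()
    chrom, start, end, alleles, strand = fields[:5]
    start, end = int(start), int(end)
    ref, alt = alleles.split("/", 1)

    if ref == "-" and start == end + 1:
        kind = "insertion"   # nothing is replaced, so start is one past end
    elif alt == "-":
        kind = "deletion"
    elif len(ref) == 1 and len(alt) == 1:
        kind = "SNV"
    else:
        kind = "other"       # e.g. multiple alternative alleles or block substitutions

    return {
        "chrom": chrom,
        "start": start,
        "end": end,
        "ref": ref,
        "alt": alt,
        "strand": strand,
        "id": fields[5] if len(fields) > 5 else None,
        "kind": kind,
    }


LINES = [
    "1 881907 881906 -/C +",
    "5 140532 140532 T/C +",
    "14 19584687 19584687 C/T -",
    "9 128328461 128328461 A/- + var1",
]

variants = [parse_variant_line(line) for line in LINES]
for v in variants:
    print(v["chrom"], v["start"], v["end"], v["ref"], v["alt"], v["kind"])
```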
HGVS notation
ENST00000285667.3:c.1047_1048insC
5:g.140532T>C
NM_153681.2:c.7C>T
ENSP00000439902.1:p.Ala2233Asp
NP_000050.2:p.Ile2285Val
VCF
#CHROM POS ID REF ALT
20 14370 rs6054257 G A
20 17330 . T A
20 1110696 rs6040355 A G,T
20 1230237 . T .
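As a rough guide to reading the VCF columns shown here, the sketch below parses only the first five fields. In VCF a "." means a missing value (no ID, or no alternative allele) and a comma in ALT separates multiple alternative alleles. The function name `parse_vcf_line` is made up for this example; real VCF files are tab-separated and have further columns (QUAL, FILTER, INFO and genotypes) that are ignored here.

```python
def parse_vcf_line(line):
    """Parse CHROM, POS, ID, REF and ALT from one VCF data line."""
    chrom, pos, var_id, ref, alt = line.split()[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),                                # 1-based position
        "id": None if var_id == "." else var_id,        # "." means no identifier
        "ref": ref,
        "alts": [] if alt == "." else alt.split(","),   # "." means no alternative allele
    }


VCF_TEXT = """\
#CHROM POS ID REF ALT
20 14370 rs6054257 G A
20 17330 . T A
20 1110696 rs6040355 A G,T
20 1230237 . T .
"""

# Lines starting with "#" are header lines, not variants.
records = [
    parse_vcf_line(line)
    for line in VCF_TEXT.splitlines()
    if line.strip() and not line.startswith("#")
]
for r in records:
    print(r["chrom"], r["pos"], r["id"], r["ref"], r["alts"])
```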
Variant IDs
rs41293501
COSM327779
rs146120136
FANCD1:c.475G>A
rs373400041
Variation types
1) Small scale in one or few nucleotides of a gene
• Small insertions and deletions (DIPs or indels)
• Single nucleotide polymorphism (SNP)
A G A C T T G A C C T G T C T - A A C T G G A
T G A C T T G A C - T G T C T G A A C G G G A
2) Large scale in chromosomal structure (structural variation)
• Copy number variations (CNV)
• Large deletions/duplications, insertions, translocations
[Figure: deletion, duplication, insertion, translocation]
Variation consequences
[Figure: variant positions along a gene: Regulatory, 5’ Upstream, 5’ UTR, Coding (synonymous), Coding (non-synonymous), Splice site, Intronic, 3’ UTR, 3’ Downstream]
Consequence terms
http://www.ensembl.org/info/docs/variation/predicted_data.html
Predicting missense effects – SIFT and PolyPhen
SIFT and PolyPhen score changes in amino acid sequence based on:
• How well conserved the protein is
• The chemical change in the amino acid
• 3D structure and domains (PolyPhen only)
• SIFT and PolyPhen are predictions, not facts
• A prediction will never be as good as experimental validation
[Figure: SIFT scale from 1 to 0: Tolerated, then Deleterious below 0.05]
[Figure: PolyPhen scale from 1 to 0: Probably damaging, Possibly damaging, Benign; axis marks 0.1 and 0.2]
Use the VEP
http://www.ensembl.org/info/docs/tools/vep/index.html
Species that work with the VEP
+ everything in Plants, Fungi, Metazoa, Protists and Bacteria
Set up a cache
- Speed up your VEP script with an offline cache.
- Use prebuilt caches for Ensembl species.
- Or make your own from GTF and FASTA files, even for genomes not in Ensembl.
http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html
VEP plugins
• Plugins add extra functionality to the VEP
• They may extend, filter or manipulate the output of the VEP
• Plugins may make use of external data or code
• Available on the web tool and with the script
Hands-on
• We’re going to look at a set of four variants to find out what genes they hit and what effect they have on them.
9 128328461 128328461 A/- + var1
9 128322349 128322349 C/A + var2
9 128323079 128323079 C/G + var3
9 128322917 128322917 G/A + var4
Questions?
• We’ve muted all the mics
• Ask questions in the Chat box in the webinar interface
• I will check the Chat interface
• There’s no threading so please respond with @name
Host an Ensembl course
We can teach an Ensembl course at your institute for free (except the trainers’ expenses).
Email us: [email protected]
Browser course: ½-2 day course on the Ensembl browser, aimed at wet-lab scientists. One trainer.
REST API course: 1-2 day course on the Ensembl Perl API, aimed at bioinformaticians. 1-2 trainers.
http://training.ensembl.org/
Help and documentation
Course online: http://www.ebi.ac.uk/training/online/subjects/11
Tutorials: www.ensembl.org/info/website/tutorials
Flash animations: www.youtube.com/user/EnsemblHelpdesk, http://u.youku.com/Ensemblhelpdesk
Email us: [email protected]
Ensembl public mailing lists: [email protected], [email protected]
Follow us: www.facebook.com/Ensembl.org, @Ensembl, www.ensembl.info
Publications
http://www.ensembl.org/info/about/publications.html
Aken, B. et al. Ensembl 2017. Nucleic Acids Research. http://europepmc.org/articles/PMC5210575
Xosé M. Fernández-Suárez and Michael K. Schuster. Using the Ensembl Genome Server to Browse Genomic Sequence Data. Current Protocols in Bioinformatics 1.15.1-1.15.48 (2010). www.ncbi.nlm.nih.gov/pubmed/20521244
Giulietta M. Spudich and Xosé M. Fernández-Suárez. Touring Ensembl: A practical guide to genome browsing. BMC Genomics 11:295 (2010). www.biomedcentral.com/1471-2164/11/295
Acknowledgements
The Entire Ensembl Team
Funding
Co-funded by the European Union