Training materials
• Ensembl training materials are protected by a CC BY license http://creativecommons.org/licenses/by/4.0/
• If you wish to re-use these materials, please credit Ensembl for their creation
• If you use Ensembl for your work, please cite our papers http://www.ensembl.org/info/about/publications.html

Variation data in Ensembl
Erin Haskell
@ensembl / @ensemblgenomes

Questions?
○ We’ve muted all of your microphones
○ Join our Slack workspace and ask questions (link in your registration confirmation email)
○ My Ensembl colleagues Emily Perry and Astrid Gall will respond during the talk
○ Please use @username to reply to a specific person

Course exercises
All materials and exercises located here: http://www.ebi.ac.uk/training/online/course/ensembl-browser-webinar-series-2016

Get help with the exercises
• Use the exercise solutions in the online course
• Join our Slack workspace and discuss the exercises with everybody in dedicated channels (register to get sent a link)
• Email us [email protected]

This webinar course
Date | Webinar topic | Instructor
4th Sept | Introduction to Ensembl ✔ | Astrid Gall
4th Sept | Ensembl genes ✔ | Emily Perry
6th Sept | Variation data in Ensembl and the Ensembl VEP | Erin Haskell
6th Sept | Comparing genes and genomes with Ensembl Compara | Astrid Gall
11th Sept | Finding features that regulate genes – the Ensembl Regulatory Build | Emily Perry
11th Sept | Data export with BioMart | Erin Haskell
13th Sept | Uploading your data to Ensembl | Astrid Gall
13th Sept | Introduction to the Ensembl REST APIs | Emily Perry

Variation data in Ensembl
Erin Haskell
@ensembl / @mycoacia

Session structure
Presentation:
  Part 1: Ensembl variation data
  Part 2: The Ensembl Variant Effect Predictor (VEP)
Demo:
  Part 1: Viewing variation data in the browser
  Part 2: Using the VEP
Exercises: Available on the Train online site

Session Overview
Ensembl variation data
- What types of variants are in Ensembl?
- Where does the data come from?
- What are the biological consequences of variants?
- Things to watch out for

The Ensembl Variant Effect Predictor (VEP) tool
- What data can I use with the VEP?
- Identifying known variants
- Predicting consequences for novel variants

What types of variant are in Ensembl?
Two broad categories:
1. Sequence variants (small alterations, ≤50 bp)
2. Structural variants (larger alterations, >50 bp)
ensembl.org/info/genome/variation/index.html

Variant type 1: Sequence variants
● Single nucleotide polymorphisms (SNP/SNV)
    ref    ...TTGACGTA...
    alt    ...TTGGCGTA...
● Small insertions & deletions
    ref    ...TTGACGTA...
    ins    ...TTGAGCGTA...
    del    ...TTG-CGTA...
    indel  ...TTGGCTCGTA...
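The four examples above can be told apart from the allele strings alone. The following is a minimal sketch, not Ensembl's own classifier: the function name is hypothetical, a dash is assumed to stand for an empty allele, and the "substitution" label for equal-length multi-base changes is an assumption that goes beyond the examples on this slide.

```python
def classify_sequence_variant(ref: str, alt: str) -> str:
    """Classify a small variant from its reference and alternate alleles.

    A dash ("-") is treated as an empty allele (assumption).
    """
    ref = ref.replace("-", "").upper()
    alt = alt.replace("-", "").upper()
    if ref == alt:
        raise ValueError("reference and alternate alleles are identical")
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"            # one base replaced by another
    if not ref:
        return "insertion"      # bases gained, nothing removed
    if not alt:
        return "deletion"       # bases removed, nothing gained
    if len(ref) == len(alt):
        return "substitution"   # same-length multi-base change (assumed label)
    return "indel"              # bases removed and a different number gained


# The slide's examples, reduced to the bases that differ from ref
print(classify_sequence_variant("A", "G"))    # ...TTGACGTA... vs ...TTGGCGTA...
print(classify_sequence_variant("-", "G"))    # ...TTGAGCGTA...
print(classify_sequence_variant("A", "-"))    # ...TTG-CGTA...
print(classify_sequence_variant("A", "GCT"))  # ...TTGGCTCGTA...
```

Reducing each example to the differing bases keeps the sketch short; working from full aligned sequences would need an alignment step that is not shown here.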
http://www.ensembl.org/info/genome/variation/prediction/classification.html

Variant type 2: Structural variants
● Copy number variation (CNV)
  [Figure: reference sequence compared with a copy number gain and a copy number loss]
● Inversion - nucleotide sequence inverted at same position
  [Figure: reference sequence compared with the inverted sequence]
● Translocation - nucleotide sequence moved to a new position
  [Figure: reference sequence compared with a translocation on the same chromosome and a translocation to a different chromosome]
http://www.ensembl.org/info/genome/variation/prediction/classification.html

Where does the data come from?
The Ensembl variation process
Variant import → Quality control → Linked data → Ensembl analysis

Ensembl variation process: Import
Import variant data from publicly available archives and data repositories: EVA ...and many many more
http://www.ensembl.org/info/genome/variation/species/sources_documentation.html

Data import: 23 species with variation data
http://www.ensembl.org/info/genome/variation/species/species_data_types.html

Data import: 27 species with variation data
Division | Number of species with variation data
Bacteria | 0
Fungi | 8
Metazoa | 4
Plants | 12
Protists | 3
http://ensemblgenomes.org/info/genomes?variation=1

Ensembl variation process: QC
● Mapping to reference assembly
  ○ GRCh37 GRCh38
● Checks on alleles
● Checks for IUPAC ambiguity codes
● Excluding ‘suspect’ variants
http://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#quality_control

Ensembl variation process: Linked data
Import ‘accessory’ data
● Phenotype/disease
● Allele frequencies
● Publication data
http://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html

Linked data: 1000 Genomes Project
Sequencing 2,500 individuals at 4X coverage
[Map: sampled populations ITU, STU, FIN, GBR, CHB, CEU, IBR, GIH, JPT, TSI, CDX, MXL, PUR, ASW, GWD, PJL, BEB, CHS, MSL, YRI, LWK, CLM, ACB, ESN, KHV and PEL, coloured by region: America, Africa, Europe, East Asia, Central-South Asia]
http://www.internationalgenome.org

Linked data: gnomAD allele frequencies
The Genome Aggregation Database (gnomAD) provides allele frequency data from 7 different populations
[Figure: sample number]
macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/

Ensembl variation process: Analysis
Ensembl predicts:
● Variant consequences
● Protein function prediction
● Linkage disequilibrium data
● Variant conservation across species
http://www.ensembl.org/info/genome/variation/prediction/index.html

Analysis: Variant consequence terms
Standardised variant consequence terms as defined by http://www.sequenceontology.org
http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html

Analysis: Pathogenicity scores
- For missense variants only
- Two prediction algorithms:
  - SIFT (Sorting Intolerant From Tolerant)
  - PolyPhen (Polymorphism Phenotyping)

Score changes in amino acid sequence based on:
- How conserved the amino acid is
- The chemical change in the amino acid
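As a rough illustration of how such a score becomes a label, here is a minimal sketch for SIFT only. It assumes the 0.05 cut-off marked on the SIFT scale in this section, and it assumes that a score of exactly 0.05 counts as tolerated. The function and constant names are hypothetical. PolyPhen is left out because its category boundaries are not spelled out here.

```python
SIFT_DELETERIOUS_BELOW = 0.05  # cut-off taken from the SIFT scale in this section


def sift_label(score: float) -> str:
    """Turn a SIFT score (0 to 1) into a qualitative label.

    Low scores mean the amino acid change is predicted to be poorly tolerated.
    Treating exactly 0.05 as "tolerated" is an assumption of this sketch.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores range from 0 to 1")
    return "deleterious" if score < SIFT_DELETERIOUS_BELOW else "tolerated"


for example_score in (0.0, 0.02, 0.3, 1.0):
    print(example_score, sift_label(example_score))
```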
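ensembl.org/info/genome/variation/predicted_data.html#sift

Analysis: Pathogenicity scores
[Figure: SIFT and PolyPhen score scales, each running from 0 to 1. SIFT: Deleterious from 0 to 0.05, Tolerated from 0.05 to 1. PolyPhen: Benign at the low end, then Possibly damaging, then Probably damaging at the high end, with 0.1 and 0.2 marked on the scale.]

Analysis: Linkage disequilibrium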
Linkage Disequilibrium (LD)
“the non-random association of alleles at 2 or more loci within a given population”
or
“how often two variants or specific sequences are inherited together”

Analysis: Linkage disequilibrium
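To make "non-random association" concrete, the sketch below computes two standard LD statistics for a pair of biallelic loci: D, the difference between the observed haplotype frequency and the frequency expected if the alleles were independent, and r². This is a textbook calculation under assumed inputs, not the code behind the Ensembl LD calculator. The function name and the example frequencies are hypothetical.

```python
from typing import Tuple


def ld_stats(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of haplotypes carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b  # zero when the two alleles associate at random
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denominator


# Alleles combine at random: no LD
print(ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5))
# Allele A is always found together with allele B: complete LD
print(ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5))
```

An r² near 0 means the alleles are inherited independently, while an r² near 1 means that knowing the allele at one locus almost determines the allele at the other.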
The Linkage Disequilibrium (LD) calculator
Within a genomic region...
For a list of variants...
For a defined area surrounding your variant...

Ensembl variation process
Variant import → Quality control → Linked data → Ensembl analysis

Where can I find this data?
● Website www.ensembl.org
● Variant Effect Predictor (VEP)
● BioMart
● Programmatically:
  ○ Perl API (including VEP)
  ○ REST API

Note: Reference & alternate alleles
[Figure: contigs BL102, AL476, CM553 and IM768]
AGTCGTAGCTAGCAAGGCCATAGGCGA
Alleles: A, G
Frequency = 0.01, frequency = 0.99
G is the ancestral allele
A causes disease susceptibility
A is the allele in the contig used
∴ A is the reference allele
∴ G is the alternate allele
∴ Alleles are A/G

Note: Reference & alternate alleles
http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=12:120999079-121000079;v=rs1169305;vdb=variation;vf=829489

Note: Allele strand
+ve strand: AGTCGTAGCTAGC T/G AGGCCATAGGCGA
-ve strand: TCGCCTATGGCCT A/C GCTAGCTACGACT
Exon sequence: TATGGCCT A/C GCTAGC
Alleles in database = T/G
Alleles in gene = A/C
Alleles = A/C -ve strand or T/G +ve strand
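The relationship between the two allele strings is a reverse complement. The sketch below shows the conversion for the sequences on this slide; the function names are hypothetical and the code is an illustration, not part of any Ensembl tool.

```python
# Translation table mapping each base to its complement
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the sequence as read 5'->3' on the opposite strand."""
    return sequence.translate(_COMPLEMENT)[::-1]


def flip_alleles(allele_string: str) -> str:
    """Convert an allele string such as "T/G" to the opposite strand."""
    return "/".join(reverse_complement(a) for a in allele_string.split("/"))


# Alleles reported on the +ve strand, as seen from a gene on the -ve strand
print(flip_alleles("T/G"))

# The flanking sequence swaps sides as well as being complemented
left_flank, right_flank = "AGTCGTAGCTAGC", "AGGCCATAGGCGA"
print(reverse_complement(right_flank), flip_alleles("T/G"), reverse_complement(left_flank))
```

Note that complementing an A/T or C/G variant gives back the same pair of letters in a different order, so for those variants the strand cannot be inferred from the allele string alone.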
Alleles = A/C or T/G
Often lack further info

Demonstration
- Finding variants in a gene of interest, MCM6
- Finding variants at a genomic location of interest
- Finding out more information about a specific variant, rs4988235

The Variant Effect Predictor
McLaren et al. 2016 europepmc.org/abstract/MED/27268795

What does the VEP do?
A tool to predict and annotate the functional consequences of variants
Your variant data
• Affected gene, transcript and protein sequence
• Splicing consequences
• Regulatory consequences
• Known variants:
  + Pathogenicity /
  + Frequency data
  + Literature citations

Variant data input formats
Variant coordinates (Ensembl default)
  1 881907 881906 -/C +
  5 140532 140532 T/C +
  12 1017956 1017956 T/A +
  2 946507 946507 G/C +
  14 19584687 19584687 C/T -

HGVS notation
  ENST00000285667.3:c.1047_1048insC
  5:g.140532T>C
  NM_153681.2:c.7C>T
  ENSP00000439902.1:p.Ala2233Asp
  NP_000050.2:p.Ile2285Val

VCF
  #CHROM POS ID REF ALT
  20 14370 rs6054257 G A
  20 17330 . T A
  20 1110696 rs6040355 A G,T
  20 1230237 . T .

Variant IDs
  rs41293501
  COSM327779
  rs146120136
  FANCD1:c.475G>A
  rs373400041
http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#input

VEP features: finding known variants
Are your variants already known?
○ dbSNP
○ COSMIC
○ ClinVar
○ ESP
○ HGMD-Public
○ PhenCode

How common are your variant alleles in different populations?
○ 1000 Genomes
○ ESP
○ ExAC projects
○ gnomAD

Phenotype/disease, clinical significance
○ OMIM
○ Orphanet
○ GWAS catalog
○ ClinVar

VEP features: consequence prediction
Consequence predictions (choose multiple databases)
○ Ensembl
○ RefSeq
○ Merged
○ GENCODE basic
Does your variant overlap regulatory regions?
○ ENCODE
○ BLUEPRINT
○ NIH Epigenomics Roadmap
○ Can be limited to regulatory regions observed in specific cell types.

Pathogenicity predictions
○ SIFT
○ PolyPhen
○ via plugins: CADD, FATHMM, LRT, MutationTaster, and many more!
Plugin info: http://www.ensembl.info/ecode/category/vep-plugins/

VEP features: plugins
● Plugins add extra functionality to the VEP.
● They may extend, filter or manipulate the output of the VEP.
● Plugins may make use of external data or code.
● Available on the web tool and with the script.
Plugin info: http://www.ensembl.info/ecode/category/vep-plugins/

Use VEP with any species
● Access through the web browser, REST API or Perl API
● Use prebuilt caches for Ensembl species.
...and for all species in
http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html

Use VEP with any species
● Speed up your VEP script with an offline cache.
● Or make your own from GTF and FASTA files - even for genomes not in Ensembl.
http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html

Using VEP
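Of the access routes mentioned earlier (web browser, REST API, Perl API), the REST API is the quickest to sketch. The example below builds a request for the VEP "id" endpoint of the public Ensembl REST server, using rs4988235 from the first demonstration. The helper names are hypothetical, and the structure of the returned JSON is not described in these slides, so the sketch does not assume any particular fields.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

REST_SERVER = "https://rest.ensembl.org"


def vep_id_url(variant_id: str, species: str = "human") -> str:
    """Build the URL that asks VEP about a known variant identifier."""
    return f"{REST_SERVER}/vep/{quote(species)}/id/{quote(variant_id)}"


def fetch_json(url: str, timeout: float = 30.0):
    """Send a GET request and decode the JSON response (needs network access)."""
    request = Request(url, headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


url = vep_id_url("rs4988235")
print(url)
# To run the query, uncomment the next two lines:
# results = fetch_json(url)
# print(json.dumps(results, indent=2))
```

Keeping URL construction separate from the network call makes it possible to check the request without contacting the server. For large numbers of variants, the offline cache described above is likely a better fit than one request per variant.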
ensembl.org/info/docs/tools/vep/index.html

Demonstration
We have identified four variants on human chromosome nine:
- A deletion at 128328461
- C->A at 128322349
- C->G at 128323079
- G->A at 128322917
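As a sketch of how these variants could be prepared for the VEP, the three single-base changes can be written in the Ensembl default input format shown earlier (chromosome, start, end, alleles, strand). The forward strand is an assumption, as the list does not state it. The deletion is left out because the deleted base or bases are not given here. The function name is hypothetical.

```python
def ensembl_default_line(chromosome, position, ref, alt, strand="+"):
    """Format a single-base change as one line of Ensembl default VEP input.

    For a single-base change the start and end coordinates are the same.
    The "+" strand default is an assumption of this sketch.
    """
    return f"{chromosome} {position} {position} {ref}/{alt} {strand}"


# (position, reference allele, alternate allele) for the three substitutions
single_base_changes = [
    (128322349, "C", "A"),
    (128323079, "C", "G"),
    (128322917, "G", "A"),
]

lines = [ensembl_default_line(9, pos, ref, alt) for pos, ref, alt in single_base_changes]
print("\n".join(lines))
```

The printed lines can be pasted into the VEP web form or saved to a file for the command-line script.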
We will use the Ensembl VEP to find out:
- Are any of my variants already known?
- What genes are affected by my variants?
- Do any of my variants affect gene regulation?

Questions?

This webinar course
Date | Webinar topic | Instructor
4th Sept | Introduction to Ensembl ✔ | Astrid Gall
4th Sept | Ensembl genes ✔ | Emily Perry
6th Sept | Variation data in Ensembl and the Ensembl VEP ✔ | Erin Haskell
6th Sept | Comparing genes and genomes with Ensembl Compara | Astrid Gall
11th Sept | Finding features that regulate genes – the Ensembl Regulatory Build | Emily Perry
11th Sept | Data export with BioMart | Erin Haskell
13th Sept | Uploading your data to Ensembl | Astrid Gall
13th Sept | Introduction to the Ensembl REST APIs | Emily Perry

Coming up!
Comparing genes and genomes with Ensembl Compara
Ensembl Compara allows you to perform detailed analysis of gene models between species.
During this webinar we take a look at the gene trees and homologues of a set of genes, and at whole genome alignments between pairs and groups of species.
Starting in ∼5 minutes!
Astrid Gall