
Training materials

• Ensembl training materials are protected by a CC BY license http://creativecommons.org/licenses/by/4.0/

• If you wish to re-use these materials, please credit Ensembl for their creation

• If you use Ensembl for your work, please cite our papers http://www.ensembl.org/info/about/publications.html

Variation data in Ensembl

Erin Haskell

[email protected]

@ensembl / @ensemblgenomes

Questions?

○ We’ve muted all of your microphones

○ Join our Slack workspace and ask questions (link in your registration confirmation email)

○ My Ensembl colleagues will respond during the talk

Emily Perry Astrid Gall

○ Please use @username to reply to a specific person

Course exercises

All materials and exercises located here: http://www.ebi.ac.uk/training/online/course/ensembl-browser-webinar-series-2016

This text will be replaced by a YouTube (link to YouKu too) video of the webinar and a pdf of the slides. A link to exercises and their solutions will appear in the page hierarchy. The “next page” will be the exercises.

Get help with the exercises

• Use the exercise solutions in the online course

• Join our Slack workspace and discuss the exercises with everybody in dedicated channels (register to get sent a link)

• Email us [email protected]

This webinar course

Date       Webinar topic                                                        Instructor
4th Sept   Introduction to Ensembl ✔                                            Astrid Gall
           Ensembl genes ✔                                                      Emily Perry
6th Sept   Variation data in Ensembl and the Ensembl VEP                        Erin Haskell
           Comparing genes and genomes with Ensembl Compara                     Astrid Gall
11th Sept  Finding features that regulate genes – the Ensembl Regulatory Build  Emily Perry
           Data export with BioMart                                             Erin Haskell
13th Sept  Uploading your data to Ensembl                                       Astrid Gall
           Introduction to the Ensembl REST APIs                                Emily Perry

Variation data in Ensembl

Erin Haskell

[email protected]

@ensembl / @mycoacia

Session structure

Presentation:
- Part 1: Ensembl variation data
- Part 2: The Ensembl Variant Effect Predictor (VEP)

Demo:
- Part 1: Viewing variation data in the browser
- Part 2: Using the VEP

Exercises: Available on the Train online site

Session Overview

Ensembl variation data
- What types of variants are in Ensembl?
- Where does the data come from?
- What are the biological consequences of variants?
- Things to watch out for

The Ensembl Variant Effect Predictor (VEP) tool
- What data can I use with the VEP?
- Identifying known variants
- Predicting consequences for novel variants

What types of variant are in Ensembl?

Two broad categories:

1. Sequence variants (small alterations ≤50bp)

2. Structural variants (larger alterations ≥50bp)

ensembl.org/info/genome/variation/index.html

Variant type 1: Sequence variants

● Single nucleotide polymorphisms (SNP/SNV)
  ref ...TTGACGTA...
  alt ...TTGGCGTA...

● Small insertions & deletions
  ref ...TTGACGTA...
  ins ...TTGAGCGTA...
  del ...TTG-CGTA...
      ...TTGGCTCGTA...
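The ref/alt examples above can be told apart mechanically by trimming the bases that both sequences share and looking at what is left. The following is a minimal sketch of that idea, not Ensembl's own classification code; the function name and the returned labels are assumptions chosen for illustration.

```python
def classify_sequence_variant(ref: str, alt: str) -> str:
    """Label a small variant by comparing a reference and an alternate fragment.

    Both fragments must cover the same region; '-' gap characters are ignored.
    """
    ref = ref.replace("-", "").upper()
    alt = alt.replace("-", "").upper()
    shortest = min(len(ref), len(alt))

    # Trim the bases shared at the start of both fragments.
    prefix = 0
    while prefix < shortest and ref[prefix] == alt[prefix]:
        prefix += 1

    # Trim the bases shared at the end, without overlapping the prefix.
    suffix = 0
    while suffix < shortest - prefix and ref[-1 - suffix] == alt[-1 - suffix]:
        suffix += 1

    ref_allele = ref[prefix:len(ref) - suffix]
    alt_allele = alt[prefix:len(alt) - suffix]

    if not ref_allele and not alt_allele:
        return "no change"
    if not ref_allele:
        return "insertion"
    if not alt_allele:
        return "deletion"
    if len(ref_allele) == 1 and len(alt_allele) == 1:
        return "SNV"
    if len(ref_allele) == len(alt_allele):
        return "substitution"
    return "indel"


for label, fragment in [("alt", "TTGGCGTA"), ("ins", "TTGAGCGTA"), ("del", "TTG-CGTA")]:
    print(label, classify_sequence_variant("TTGACGTA", fragment))
```

Trimming from both ends means the function reports only the bases that actually differ, which is why the unlabelled fourth example on the slide comes out as a combined insertion and deletion.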

http://www.ensembl.org/info/genome/variation/prediction/classification.html

Variant type 2: Structural variants

● Copy number variant (CNV) [figure: Ref, Gain, Loss]

● Inversion - nucleotide sequence inverted at same position [figure: Ref, Invert]

● Translocation - nucleotide sequence moved to a new position [figure: Ref; Translocated: same chromosome; Translocated: diff chromosome]

http://www.ensembl.org/info/genome/variation/prediction/classification.html

Where does the data come from?

The Ensembl variation process

Variant import → Quality control → Linked data → Ensembl analysis

Ensembl variation process: Import

Import variant data from publicly available archives and data repositories.

Sources include EVA ...and many, many more

http://www.ensembl.org/info/genome/variation/species/sources_documentation.html

Data import: 23 species with variation data

http://www.ensembl.org/info/genome/variation/species/species_data_types.html

Data import: 27 species with variation data

Division   Number of species with variation data
Bacteria   0
Fungi      8
Metazoa    4
Plants     12
Protists   3

http://ensemblgenomes.org/info/genomes?variation=1

Ensembl variation process: QC

● Mapping to reference assembly
  ○ GRCh37 GRCh38
● Checks on
● Checks for IUPAC ambiguity codes
● Excluding ‘suspect’ variants
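One of the checks listed above, spotting IUPAC ambiguity codes in an allele string, is easy to illustrate. This is only a sketch of the idea and not the Ensembl QC pipeline, which the slides do not show; the function name is hypothetical, and the allele string uses the `A/G` style seen elsewhere in this talk.

```python
# IUPAC nucleotide codes that stand for more than one base.
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHVN")


def ambiguous_alleles(allele_string: str) -> list[str]:
    """Return the alleles in an 'A/G'-style string that contain ambiguity codes."""
    flagged = []
    for allele in allele_string.split("/"):
        if any(base in IUPAC_AMBIGUITY_CODES for base in allele.upper()):
            flagged.append(allele)
    return flagged


print(ambiguous_alleles("T/G"))   # nothing to flag
print(ambiguous_alleles("A/R"))   # R stands for A or G
```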

http://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#quality_control

Ensembl variation process: Linked data

Import ‘accessory’ data
● Phenotype/disease
● Frequencies
● Publication data

http://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html

Linked data: 1000 Genomes Project

Sequencing 2,500 individuals at 4X coverage

[Map of sampled populations: ITU, STU, FIN, GBR, CHB, CEU, IBR, GIH, JPT, TSI, CDX, MXL, PUR, ASW, GWD, PJL, BEB, CHS, MSL, YRI, LWK, CLM, ACB, ESN, KHV, PEL. Legend: America, Africa, East Asia, Central-South Asia]

http://www.internationalgenome.org

Linked data: GnomAD allele frequencies

The Genome Aggregation Database provides allele frequency data from 7 different populations

[Figure: sample number for each population]

macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/

Ensembl variation process: Analysis

Ensembl predicts:
● Variant consequences
● function prediction
● data
● Variant conservation across species

http://www.ensembl.org/info/genome/variation/prediction/index.html

Analysis: Variant consequence terms

Standardised variant consequence terms as defined by http://www.sequenceontology.org

http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html

Analysis: Pathogenicity scores

- For missense variants only
- Two prediction algorithms:
  - SIFT (Sorting Intolerant From Tolerant)
  - PolyPhen (Polymorphism Phenotyping)

Score changes in sequence based on:
- How conserved the amino acid is
- The chemical change in the amino acid
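As a rough illustration of how a SIFT score is usually turned into a label, the sketch below applies the commonly quoted 0.05 cutoff, where lower scores are more likely to be damaging. The function name is hypothetical, and treating a score of exactly 0.05 as tolerated is an assumption. PolyPhen is left out because its cutoffs are not stated here; note that it runs the other way, with higher scores being more damaging.

```python
def sift_label(score: float, cutoff: float = 0.05) -> str:
    """Turn a SIFT score (0 to 1) into a qualitative prediction.

    Assumption: scores below the cutoff are 'deleterious', all others 'tolerated'.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores lie between 0 and 1")
    return "deleterious" if score < cutoff else "tolerated"


for score in (0.01, 0.3, 1.0):
    print(score, sift_label(score))
```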

ensembl.org/info/genome/variation/predicted_data.html#sift

Analysis: Pathogenicity scores

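[Figure: score scales from 0 to 1. SIFT: Deleterious from 0 to 0.05, Tolerated above 0.05. PolyPhen: Benign, Possibly damaging and Probably damaging in order of increasing score, with tick marks at 0.1 and 0.2.]

Analysis: Linkage disequilibrium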

Linkage Disequilibrium (LD) “the non-random association of alleles at 2 or more loci within a given population”

or

“how often two variants or specific sequences are inherited together”

Analysis: Linkage disequilibrium
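"Non-random association" can be made concrete for two biallelic loci: compare how often two alleles occur on the same haplotype with how often they would if they were independent. The sketch below computes the standard D, D' and r² statistics from frequencies supplied by the caller. It is a textbook calculation, not the code behind the Ensembl LD calculator, and the function and parameter names are hypothetical.

```python
def ld_statistics(p_a: float, p_b: float, p_ab: float) -> dict[str, float]:
    """Compute D, D' and r^2 for two biallelic loci.

    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    p_ab : frequency of haplotypes carrying both A and B
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both loci must be polymorphic (0 < frequency < 1)")

    # D is the excess of AB haplotypes over the independence expectation.
    d = p_ab - p_a * p_b

    # r^2 scales D^2 by the allele frequencies at both loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

    # D' scales D by the largest magnitude it could take at these frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max

    return {"D": d, "D_prime": d_prime, "r2": r2}


# Alleles always inherited together, then fully independent alleles.
print(ld_statistics(0.5, 0.5, 0.5))
print(ld_statistics(0.5, 0.5, 0.25))
```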

The Linkage Disequilibrium (LD) calculator

Within a genomic region...

For a list of variants...

For a defined area surrounding your variant...

Ensembl variation process

Variant import → Quality control → Linked data → Ensembl analysis

Where can I find this data?

● Website www.ensembl.org
● Variant Effect Predictor (VEP)
● BioMart
● Programmatically:
  ○ Perl API (including VEP)
  ○ REST API

Note: Reference & alternate alleles

[Figure: an assembly built from contigs BL102, AL476, CM553 and IM768, with a variant in the sequence AGTCGTAGCTAGCAAGGCCATAGGCGA carrying alleles A and G]

A frequency = 0.01, G frequency = 0.99
G is the ancestral allele
A causes disease susceptibility
A is the allele in the contig used
∴ A is the reference allele
∴ G is the alternate allele
∴ Alleles are A/G

Note: Reference & alternate alleles

http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=12:120999079-121000079;v=rs1169305;vdb=variation;vf=829489

Note: Allele strand

Forward strand (5'→3'): AGTCGTAGCTAGC[T/G]AGGCCATAGGCGA
Reverse strand (5'→3'): TCGCCTATGGCCT[A/C]GCTAGCTACGACT

Exon sequence: TATGGCCT[A/C]GCTAGC

Alleles in database = T/G
Alleles in gene = A/C
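The switch between database alleles (forward strand) and gene alleles (reverse strand) is a reverse complement. The helper names below are hypothetical; the sequences are the ones from this slide.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return sequence.translate(_COMPLEMENT)[::-1]


def flip_allele_string(alleles: str) -> str:
    """Convert an allele string such as 'T/G' to the opposite strand."""
    flipped = []
    for allele in alleles.split("/"):
        # '-' marks an absent allele and has no complement.
        flipped.append(allele if allele == "-" else reverse_complement(allele))
    return "/".join(flipped)


print(flip_allele_string("T/G"))
print(reverse_complement("AGTCGTAGCTAGCTAGGCCATAGGCGA"))
```

Each allele is reverse-complemented as a whole, so multi-base alleles are also read in the correct direction on the other strand.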

Alleles = A/C -ve strand or T/G +ve strand

Alleles = A/C or T/G
Often lack further info

Demonstration

- Finding variants in a gene of interest, MCM6

- Finding variants at a genomic location of interest

- Finding out more information about a specific variant, rs4988235

The Variant Effect Predictor

McLaren et al 2016 europepmc.org/abstract/MED/27268795

What does the VEP do?

A tool to predict and annotate the functional consequences of variants

Your variant data
• Affected gene, transcript and protein sequence
• Splicing consequences
• Regulatory consequences
• Known variants:
  + Pathogenicity /
  + Frequency data
  + Literature citations

Variant data input formats

Variant coordinates (Ensembl default)
  1   881907    881906    -/C  +
  5   140532    140532    T/C  +
  12  1017956   1017956   T/A  +
  2   946507    946507    G/C  +
  14  19584687  19584687  C/T  -

HGVS notation
  ENST00000285667.3:c.1047_1048insC
  5:g.140532T>C
  NM_153681.2:c.7C>T
  ENSP00000439902.1:p.Ala2233Asp
  NP_000050.2:p.Ile2285Val

VCF
  #CHROM  POS      ID         REF  ALT
  20      14370    rs6054257  G    A
  20      17330    .          T    A
  20      1110696  rs6040355  A    G,T
  20      1230237  .          T    .

Variant IDs
  rs41293501
  COSM327779
  rs146120136
  FANCD1:c.475G>A
  rs373400041

http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#input

VEP features: finding known variants

Are your variants already known?
○ dbSNP ○ COSMIC ○ ClinVar ○ ESP ○ HGMD-Public ○ Phencode

How common are your variant alleles in different populations?
○ 1000 Genomes ○ ESP ○ ExAC projects ○ GnomAD

Phenotype/disease, clinical significance
○ OMIM ○ Orphanet ○ GWAS catalog ○ ClinVar

VEP features: consequence prediction

Consequence predictions (choose multiple databases)
○ Ensembl ○ RefSeq ○ Merged ○ GENCODE basic

Does your variant overlap regulatory regions?
○ ENCODE ○ BLUEPRINT ○ NIH Epigenomics Roadmap
○ Can be limited to regulatory regions observed in specific cell types.

Pathogenicity predictions
○ SIFT ○ PolyPhen
○ via plugins: CADD, FATHMM, LRT, MutationTaster, and many more!

Plugin info: http://www.ensembl.info/ecode/category/vep-plugins/

VEP features: plugins

● Plugins add extra functionality to the VEP
● They may extend, filter or manipulate the output of the VEP.
● Plugins may make use of external data or code.
● Available on the web tool and with the script.

Plugin info: http://www.ensembl.info/ecode/category/vep-plugins/

Use VEP with any species

● Access through the web browser, REST API or Perl API

● Use prebuilt caches for Ensembl species.

...and for all species in

http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html

Use VEP with any species

● Speed up your VEP script with an offline cache.

● Or make your own from GTF and FASTA files - even for genomes not in Ensembl.

http://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html

Using VEP

ensembl.org/info/docs/tools/vep/index.html

Demonstration

We have identified four variants on chromosome nine:
- A deletion at 128328461
- C->A at 128322349
- C->G at 128323079
- G->A at 128322917
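Variants like these need to be written in one of the VEP input formats before they can be submitted. The sketch below builds lines in the Ensembl default format (chromosome, start, end, alleles, strand) for the three substitutions. The helper name is hypothetical. The deletion is left out of the example because the deleted base is not given here; the helper would accept it as the `ref` argument with `-` as `alt`.

```python
def vep_default_line(chrom: str, start: int, ref: str, alt: str, strand: str = "+") -> str:
    """Format one variant in the Ensembl default VEP input format.

    start : first affected base; for an insertion (ref '-') it is the base
            just after the insertion point, so that start = end + 1.
    """
    if ref == "-":
        end = start - 1              # insertion: start is one greater than end
    else:
        end = start + len(ref) - 1   # SNV or deletion: exact base coordinates
    return f"{chrom} {start} {end} {ref}/{alt} {strand}"


substitutions = [
    (128322349, "C", "A"),
    (128323079, "C", "G"),
    (128322917, "G", "A"),
]
for position, ref, alt in substitutions:
    print(vep_default_line("9", position, ref, alt))
```

The insertion rule reproduces the first line of the variant coordinates example shown earlier, `1 881907 881906 -/C +`.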

We will use the Ensembl VEP to find out:
- Are any of my variants already known?
- What genes are affected by my variants?
- Do any of my variants affect gene regulation?

Coming up!

Comparing genes and genomes with Ensembl Compara

Ensembl Compara allows you to perform detailed analysis of gene models between species.

During this webinar we take a look at the gene trees and homologues of a set of genes, and at whole genome alignments between pairs and groups of species.

Starting in ∼5 minutes!

Astrid Gall