Training materials
• Ensembl training materials are protected by a CC BY license http://creativecommons.org/licenses/by/4.0/
• If you wish to re-use these materials, please credit Ensembl for their creation
• If you use Ensembl for your work, please cite our papers http://www.ensembl.org/info/about/publications.html

Variation data in Ensembl
Erin Haskell
[email protected]
@ensembl / @ensemblgenomes

Questions?
○ We’ve muted all of your microphones
○ Join our Slack workspace and ask questions (link in your registration confirmation email)
○ My Ensembl colleagues will respond during the talk: Emily Perry, Astrid Gall
○ Please use @username to reply to a specific person

Course exercises
All materials and exercises are located here: http://www.ebi.ac.uk/training/online/course/ensembl-browser-webinar-series-2016
This text will be replaced by a YouTube (link to YouKu too) video of the webinar and a pdf of the slides.
A link to exercises and their solutions will appear in the page hierarchy.
The “next page” will be the exercises.

Get help with the exercises
• Use the exercise solutions in the online course
• Join our Slack workspace and discuss the exercises with everybody in dedicated channels (register to get sent a link)
• Email us [email protected]

This webinar course
Date        Webinar topic                                                          Instructor
4th Sept    Introduction to Ensembl ✔                                              Astrid Gall
            Ensembl genes ✔                                                        Emily Perry
6th Sept    Variation data in Ensembl and the Ensembl VEP                          Erin Haskell
            Comparing genes and genomes with Ensembl Compara                       Astrid Gall
11th Sept   Finding features that regulate genes – the Ensembl Regulatory Build    Emily Perry
            Data export with BioMart                                               Erin Haskell
13th Sept   Uploading your data to Ensembl                                         Astrid Gall
            Introduction to the Ensembl REST APIs                                  Emily Perry

Variation data in Ensembl
Erin Haskell
[email protected]
@ensembl / @mycoacia

Session structure
Presentation:
  Part 1: Ensembl variation data
  Part 2: The Ensembl Variant Effect Predictor (VEP)
Demo:
  Part 1: Viewing variation data in the browser
  Part 2: Using
the VEP
Exercises:
  Available on the train online site

Session Overview
Ensembl variation data
- What types of variants are in Ensembl?
- Where does the data come from?
- What are the biological consequences of variants?
- Things to watch out for
The Ensembl Variant Effect Predictor (VEP) tool
- What data can I use with the VEP?
- Identifying known variants
- Predicting consequences for novel variants

What types of variant are in Ensembl?
Two broad categories:
1. Sequence variants (small alterations ≤50bp)
2. Structural variants (larger alterations >50bp)
ensembl.org/info/genome/variation/index.html

Variant type 1: Sequence variants
● Single nucleotide polymorphisms (SNP/SNV)
    ref    ...TTGACGTA...
    alt    ...TTGGCGTA...
● Small insertions & deletions
    ref    ...TTGACGTA...
    ins    ...TTGAGCGTA...
    del    ...TTG-CGTA...
    indel  ...TTGGCTCGTA...
http://www.ensembl.org/info/genome/variation/prediction/classification.html

Variant type 2: Structural variants
● Copy number variation (CNV)
    [Figure: Ref, Gain, Loss]
● Inversion - nucleotide sequence inverted at same position
    [Figure: Ref, Invert]
● Translocation - nucleotide sequence moved to a new position
    [Figure: Ref; Translocated: same chromosome; Translocated: diff chromosome]
http://www.ensembl.org/info/genome/variation/prediction/classification.html

Where does the data come from?

The Ensembl variation process
Variant import → Quality control → Linked data → Ensembl analysis

Ensembl variation process: Import
Import variant data from publicly available archives and data repositories.
EVA ...and many many more
http://www.ensembl.org/info/genome/variation/species/sources_documentation.html

Data import: 23 species with variation data
http://www.ensembl.org/info/genome/variation/species/species_data_types.html

Data import: 27 species with variation data
Division    Number of species with variation data
Bacteria    0
Fungi       8
Metazoa     4
Plants      12
Protists    3
http://ensemblgenomes.org/info/genomes?variation=1

Ensembl variation process: QC
● Mapping to reference assembly
  ○ GRCh37 GRCh38
● Checks on alleles
● Checks for IUPAC ambiguity codes
● Excluding ‘suspect’ variants
http://www.ensembl.org/info/genome/variation/prediction/variant_quality.html#quality_control

Ensembl variation process: Linked data
Import ‘accessory’ data
● Phenotype/disease
● Allele frequencies
● Publication data
http://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html

Linked data: 1000 genomes project
Sequencing 2,500 individuals at 4X coverage
[Map of sampled populations: ITU STU FIN GBR CHB CEU IBR GIH JPT TSI CDX MXL PUR ASW GWD PJL BEB CHS MSL YRI LWK CLM ACB ESN KHV PEL. Regions: America, Africa, Europe, East Asia, Central-South Asia]
http://www.internationalgenome.org

Linked data: GnomAD allele frequencies
The Genome Aggregation Database provides allele frequency data from 7 different populations
[Figure: sample number per population]
macarthurlab.org/2017/02/27/the-genome-aggregation-database-gnomad/

Ensembl variation process: Analysis
Ensembl predicts:
● Variant consequences
● Protein function prediction
● Linkage disequilibrium data
● Variant conservation across species
http://www.ensembl.org/info/genome/variation/prediction/index.html

Analysis: Variant consequence terms
Standardised variant consequence terms as defined by http://www.sequenceontology.org
http://www.ensembl.org/info/genome/variation/prediction/predicted_data.html
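The allele checks listed under the QC step earlier in this section can be illustrated with a small sketch. This is not Ensembl's QC pipeline: the function name `check_allele_string` and the specific rules are assumptions made for illustration. It only shows what "checks on alleles" and "checks for IUPAC ambiguity codes" might look like for an allele string such as A/G.

```python
# Illustrative sketch only: hypothetical helper, NOT Ensembl's actual QC code.
UNAMBIGUOUS_BASES = set("ACGT")
# IUPAC nucleotide codes that stand for more than one base.
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHVN")


def check_allele_string(allele_string):
    """Return a list of problems found in an allele string such as 'A/G'."""
    problems = []
    alleles = allele_string.upper().split("/")
    if len(alleles) < 2:
        problems.append("fewer than two alleles")
    if len(set(alleles)) != len(alleles):
        problems.append("duplicate alleles")
    for allele in alleles:
        if allele == "-":
            # '-' marks the absent sequence in an insertion or deletion.
            continue
        if not allele:
            problems.append("empty allele")
        elif any(base in IUPAC_AMBIGUITY_CODES for base in allele):
            problems.append(f"IUPAC ambiguity code in allele {allele}")
        elif not set(allele) <= UNAMBIGUOUS_BASES:
            problems.append(f"unexpected character in allele {allele}")
    return problems


for example in ("A/G", "-/C", "T/N", "A/A"):
    print(example, check_allele_string(example))
```

An empty list means the allele string passed these simple checks; a real pipeline would apply further rules, such as comparing the alleles against the reference assembly.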

Analysis: Pathogenicity scores
- For missense variants only
- Two prediction algorithms:
  - SIFT (Sorting Intolerant From Tolerant)
  - PolyPhen (Polymorphism Phenotyping)
Score changes in amino acid sequence based on:
- How conserved the amino acid is
- The chemical change in the amino acid
ensembl.org/info/genome/variation/predicted_data.html#sift

Analysis: Pathogenicity scores
[Figure: SIFT and PolyPhen score scales, each running from 0 to 1. SIFT: scores below 0.05 are Deleterious, scores of 0.05 and above are Tolerated. PolyPhen: Benign, Possibly damaging and Probably damaging, in order of increasing score; the scale also carries tick marks at 0.1 and 0.2.]

Analysis: Linkage disequilibrium
Linkage Disequilibrium (LD)
“the non-random association of alleles at 2 or more loci within a given population”
or
“how often two variants or specific sequences are inherited together”

Analysis: Linkage disequilibrium
The Linkage Disequilibrium (LD) calculator
- Within a genomic region...
- For a list of variants...
- For a defined area surrounding your variant...

Ensembl variation process
Where can I find this data?
● Website www.ensembl.org
● Variant Effect Predictor (VEP)
● BioMart
● Programmatically:
  ○ Perl API (including VEP)
  ○ REST API

Note: Reference & alternate alleles
[Figure: contigs BL102, AL476, CM553 and IM768 aligned over the sequence AGTCGTAGCTAGCAAGGCCATAGGCGA, carrying A or G at the variant position]
A frequency = 0.01, G frequency = 0.99
G is the ancestral allele
A causes disease susceptibility
A is allele in the contig used
∴ A is the reference allele
∴ G is the alternate allele
∴ Alleles are A/G

Note: Reference & alternate alleles
http://www.ensembl.org/Homo_sapiens/Variation/Population?db=core;r=12:120999079-121000079;v=rs1169305;vdb=variation;vf=829489

Note: Allele strand
+ve strand:     AGTCGTAGCTAGC T/G AGGCCATAGGCGA
-ve strand:     TCGCCTATGGCCT A/C GCTAGCTACGACT
Exon sequence:  TATGGCCT A/C GCTAGC
Alleles in database = T/G
Alleles in gene = A/C
Alleles = A/C -ve strand or T/G +ve strand
Alleles = A/C or T/G: often lack further info

Demonstration
- Finding variants in a gene of interest, MCM6
- Finding variants at a genomic location of interest
- Finding out more information about a specific variant, rs4988235

The Variant Effect Predictor
McLaren et al 2016 europepmc.org/abstract/MED/27268795

What does the VEP do?
A tool to predict and annotate the functional consequences of variants
Your variant data →
• Affected gene, transcript and protein sequence
• Splicing consequences
• Regulatory consequences
• Known variants:
  + Pathogenicity
  + Frequency data
  + Literature citations

Variant data input formats
Variant coordinates (Ensembl default)
    1 881907 881906 -/C +
    5 140532 140532 T/C +
    12 1017956 1017956 T/A +
    2 946507 946507 G/C +
    14 19584687 19584687 C/T -
HGVS notation
    ENST00000285667.3:c.1047_1048insC
    5:g.140532T>C
    NM_153681.2:c.7C>T
    ENSP00000439902.1:p.Ala2233Asp
    NP_000050.2:p.Ile2285Val
VCF
    #CHROM POS ID REF ALT
    20 14370 rs6054257 G A
    20 17330 . T A
    20 1110696 rs6040355 A G,T
    20 1230237 . T .
Variant IDs
    rs41293501
    COSM327779
    rs146120136
    FANCD1:c.475G>A
    rs373400041
http://www.ensembl.org/info/docs/tools/vep/vep_formats.html#input

VEP features: finding known variants
Are your variants already known?
○ dbSNP
○ COSMIC
○ ClinVar
○ ESP
○ HGMD-Public
○ Phencode
How common are your variant alleles in different populations?
○ 1000 Genomes
○ ESP
○ ExAC projects
○ GnomAD
Phenotype/disease, clinical significance
○ OMIM
○ Orphanet
○ GWAS catalog
○ ClinVar

VEP features: consequence prediction
Consequence predictions (choose multiple databases)
○ Ensembl
○ RefSeq
○ Merged
○ GENCODE basic
Does your variant overlap regulatory regions?
○ ENCODE
○ BLUEPRINT
○ NIH Epigenomics Roadmap
○ Can be limited to regulatory regions observed in specific cell types.
Pathogenicity predictions
○ SIFT
○ PolyPhen
○ via plugins: CADD, FATHMM, LRT, MutationTaster, and many more!
Plugin info: http://www.ensembl.info/ecode/category/vep-plugins/

VEP features: plugins
● Plugins add extra functionality to the VEP
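
To make the relationship between two of the VEP input formats above more concrete, here is a minimal sketch of how a VCF record might be rewritten in the Ensembl default (variant coordinates) format. The function name `vcf_line_to_ensembl_default` is hypothetical, and the VEP reads VCF directly, so no such conversion is needed in practice. The sketch assumes that VCF indels carry a shared leading base, that an insertion is written with start = end + 1 (as in the example 1 881907 881906 -/C +), and, as a simplification, that each alternate allele is emitted as its own line.

```python
# Illustrative sketch only: hypothetical helper, not part of the VEP.
def vcf_line_to_ensembl_default(line):
    """Convert one VCF data line to Ensembl default format lines.

    Returns a list with one 'chrom start end ref/alt +' string per
    alternate allele. Header lines and records without an alternate
    allele ('.') give an empty list.
    """
    if line.startswith("#"):
        return []
    chrom, pos, _variant_id, ref, alts = line.split()[:5]
    records = []
    for alt in alts.split(","):
        if alt == "." or alt == ref:
            continue  # no alternate allele to report
        r, a, start = ref, alt, int(pos)
        # VCF indels include the base before the event; trim the shared prefix.
        while r and a and r[0] == a[0]:
            r, a, start = r[1:], a[1:], start + 1
        # For an insertion r is empty, so end = start - 1 (start = end + 1).
        end = start + len(r) - 1
        # VCF alleles are given on the forward strand, hence '+'.
        records.append(f"{chrom} {start} {end} {r or '-'}/{a or '-'} +")
    return records


print(vcf_line_to_ensembl_default("20 14370 rs6054257 G A"))
print(vcf_line_to_ensembl_default("1 881906 . T TC"))
```

The second call uses a made-up VCF record chosen so that the result corresponds to the insertion example shown under Variant coordinates.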