Introduction to Bioinformatics online course: IBT

Practical Assignment

Module topic: Genomics Module Contact session title: Human variation Trainer: Dr Colleen J. Saunders

Participant name: Date:

Exploring Human Variation

Introduction

In this practical session, you will investigate the Variant Call Format (VCF) file in more detail. You will then annotate a VCF file using the wANNOVAR web server and investigate the annotated file. Lastly, you will explore some SNPs using databases such as OMIM and SNPedia.

Tools used in this session

ANNOVAR documentation: http://annovar.openbioinformatics.org/en/latest/
wANNOVAR web server: http://wannovar.usc.edu/
OMIM: http://www.omim.org/
SNPedia: https://www.snpedia.com/index.php/SNPedia

Please note

Hand-in information: If you are formally enrolled in the IBT course, please upload your completed assignment to the Vula ‘Practical Assignments’ tab. Take note of the final hand-in date, which will be indicated on Vula.

Task 1: Understanding the Variant Call Format File

Task 1: Instructions

Download the text file named “H3ABioNet_Genomics_4_Task1.txt”. Open this file in a text editor or in Excel. This is a VCF file containing only 3 variants. Consult the notes for today's lecture 3 on the VCF file format, together with the 1000 Genomes Project explanation of VCF file formatting (available at the link below).

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40/

Using these resources, answer the following questions about the variants in the VCF file.

1. Which version of the VCF file format is this file?
2. By looking at the meta-data, which human genome build was used as the reference genome to call these variants?
3. Only 1 of these variants passes the quality filters applied to the data. What is the dbSNP rsID for that variant?
4. For that variant (identified in Q3), is the genotype data provided phased or unphased?
5. What is the genotype of our sample for that variant?
6. What is the read depth for that variant in this sample?
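The questions above all come down to reading specific parts of a VCF file: the `##` meta-data lines, the FILTER column, and the per-sample genotype fields. As a rough illustration only, the Python sketch below parses a small made-up VCF snippet. The variant IDs, positions and values are hypothetical placeholders and are not taken from the assignment file; `parse_vcf` and `describe_genotype` are illustrative helper names, not part of any library.

```python
# Hypothetical VCF text for illustration only; NOT the assignment file.
EXAMPLE_VCF = """\
##fileformat=VCFv4.0
##reference=exampleBuild
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1
20\t14370\trsEXAMPLE1\tG\tA\t29\tPASS\tNS=1;DP=14\tGT:GQ:DP\t0|1:48:8
20\t17330\trsEXAMPLE2\tT\tA\t3\tq10\tNS=1;DP=11\tGT:GQ:DP\t0/0:49:3
"""


def parse_vcf(text):
    """Split VCF text into a meta-data dict and a list of variant records."""
    meta = {}
    columns = []
    records = []
    for line in text.splitlines():
        if line.startswith("##"):
            # Meta-data line, e.g. ##fileformat=VCFv4.0 (first value per key is kept)
            key, _, value = line[2:].partition("=")
            meta.setdefault(key, value)
        elif line.startswith("#"):
            # Column header line: #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT <sample>
            columns = line[1:].split("\t")
        elif line.strip():
            fields = dict(zip(columns, line.split("\t")))
            # Pair the FORMAT keys (GT:GQ:DP) with the first sample's values (0|1:48:8)
            sample_column = columns[9]
            fields["sample"] = dict(
                zip(fields["FORMAT"].split(":"), fields[sample_column].split(":"))
            )
            records.append(fields)
    return meta, records


def describe_genotype(gt):
    """A '|' separator marks a phased genotype; '/' marks an unphased one."""
    return "phased" if "|" in gt else "unphased"


meta, records = parse_vcf(EXAMPLE_VCF)
print("File format:", meta["fileformat"])
print("Reference:", meta["reference"])

# Keep only variants that passed all filters
passing = [r for r in records if r["FILTER"] == "PASS"]
for r in passing:
    gt = r["sample"]["GT"]
    print(r["ID"], gt, describe_genotype(gt), "read depth:", r["sample"]["DP"])
```

To try the same approach on a real file, the text could be read with `open(path).read()` and passed to `parse_vcf`; the sketch assumes a single-sample, tab-delimited file.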

Task 2: Variant annotation using wANNOVAR

Task 2: Instructions

In this task you will annotate a VCF file from a simulated data set (H3ABioNet_Genomics_4_GIAB.vcf_.txt). Please note that the file is in VCF format but has been saved as a .txt file so that everyone can open it, irrespective of operating system or text editor. The file can also be viewed in a Linux terminal using a command such as less or more.

Open the lecture slides outlining the Practical Assignment instructions and work through the instructions for Task 2 step-by-step. Then answer the questions below. You will need to open the wANNOVAR web server, as well as the ANNOVAR documentation page in your browser (the links are in the information section at the beginning of this document).

1. What are the different kinds of information provided in the Func.refgene and Gene.refgene fields?
2. What do the GERP++, PhyloP and SiPhy fields indicate?
3. What information is provided in the ExonicFunc.refgene field?
4. Give the dbSNP rsIDs for the variants found within the F5 gene.
5. Is either of these SNPs found in the ClinVar database? If so, which disease is it associated with?
6. Do any of the functional prediction algorithms (found under Filter-based annotation in the ANNOVAR documentation) suggest that this variant may be deleterious?
7. What is our sample's genotype at these two loci?

8. How many variants are found in BRCA1? Where are they located within the gene?
9. Are any of these variants considered pathogenic?
10. Compare the minor allele frequencies of these variants in the 1000 Genomes dataset and the ExAC dataset.
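Several of the questions above involve filtering the annotated table by gene and comparing two frequency columns. If the annotated results are downloaded as a CSV file, this can be sketched in Python as below. This is a minimal illustration under stated assumptions: the rows are made-up placeholders, the frequency column names `freq_1000g` and `freq_exac` are hypothetical stand-ins (the real column headers depend on the wANNOVAR output and should be checked in the downloaded file), and a '.' is assumed to mark a missing value.

```python
import csv
import io

# Hypothetical annotated table: gene names, positions and frequencies are placeholders.
EXAMPLE_CSV = """\
Chr,Start,Ref,Alt,Func.refgene,Gene.refgene,ExonicFunc.refgene,freq_1000g,freq_exac
1,1000,G,A,exonic,GENE_A,nonsynonymous SNV,0.012,0.015
1,2000,C,T,intronic,GENE_A,.,0.200,.
2,3000,T,C,exonic,GENE_B,synonymous SNV,.,0.001
"""


def to_float(value):
    """Convert a frequency field to float; return None for '.' or other non-numeric text."""
    try:
        return float(value)
    except ValueError:
        return None


def variants_in_gene(rows, gene):
    """Return the rows whose Gene.refgene field matches the given gene name exactly."""
    return [row for row in rows if row["Gene.refgene"] == gene]


rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))

for row in variants_in_gene(rows, "GENE_A"):
    f_1000g = to_float(row["freq_1000g"])
    f_exac = to_float(row["freq_exac"])
    # Print the location in the gene alongside both frequencies for side-by-side comparison
    print(row["Chr"], row["Start"], row["Func.refgene"], f_1000g, f_exac)
```

The same questions can of course be answered by sorting and filtering the table in a spreadsheet; the sketch only shows the logic of the comparison.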

Task 3: Variant exploration and prioritisation

Task 3: Instructions

Open the OMIM and SNPedia websites in your browser, and search for the interesting variants that we looked at in Task 2 (rs6025; rs1800595; rs80357064). For each of these three variants, write a short paragraph describing:
 Which specific diseases have they been implicated in?
 Which allele is the disease-associated allele?
 What protein does the associated gene code for?
 By looking at our sample's genotype, what does this mean for this individual?
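Interpreting the sample's genotype means translating the GT field of the VCF record into actual alleles and counting how many copies of the disease-associated allele the individual carries. The Python sketch below illustrates that step. The REF, ALT and GT values used are hypothetical and do not correspond to the three variants above; `genotype_alleles`, `count_allele` and `zygosity` are illustrative helper names.

```python
def genotype_alleles(ref, alt, gt):
    """Translate a GT string such as '0/1' or '1|1' into a list of alleles.

    In a VCF record, index 0 refers to the REF allele and indices 1, 2, ...
    refer to the comma-separated ALT alleles. A '.' means the call is missing.
    """
    alleles = [ref] + alt.split(",")
    separator = "|" if "|" in gt else "/"
    return [None if i == "." else alleles[int(i)] for i in gt.split(separator)]


def count_allele(ref, alt, gt, allele):
    """Count how many copies of the given allele the genotype contains."""
    return genotype_alleles(ref, alt, gt).count(allele)


def zygosity(copies):
    """Describe a diploid genotype by the number of copies of the allele of interest."""
    return {
        0: "no copies of the allele",
        1: "heterozygous carrier",
        2: "homozygous for the allele",
    }[copies]


# Hypothetical record: REF = G, ALT = A, sample genotype 0/1, allele of interest A
copies = count_allele("G", "A", "0/1", "A")
print(genotype_alleles("G", "A", "0/1"), "->", zygosity(copies))
```

What a given number of copies means for the individual depends on the variant and the disease, which is what the OMIM and SNPedia entries are meant to help you establish.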