Exome Sequencing
Jong Kyoung Kim

SNP

• A single-nucleotide polymorphism (SNP) is a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. > 1%).

• For example, at a specific base position in the human genome, the base C may appear in most individuals, but in a minority of individuals the position is occupied by base A. There is a SNP at this specific base position, and the two possible nucleotide variations – C or A – are said to be alleles for this base position.

SNV

• A single-nucleotide variant (SNV) is a variation in a single nucleotide with no frequency threshold; it may also arise in somatic cells.

Types of SNPs

SNPs
• Non-coding
• Coding
  • Synonymous
  • Non-synonymous
    • Missense
    • Nonsense

Pileup format

• A plain-text format that summarizes reads’ bases at each chromosome position by stacking or “piling up” aligned reads.

• The per-base summary of the alignment data created in a pileup can then be used to identify variants and to determine the genotypes of the sampled individuals.

• The samtools mpileup subcommand creates pileups from BAM files, and this tool is the first step in samtools-based variant calling pipelines.

Example: Ebola

Ebola reference genome

• We will align Ebola sequencing data against the 1976 Mayinga reference genome.

• Create a directory to hold the reference genome and all indices:
mkdir -p ~/reference/ebola

• Get the Ebola genome in FASTA format:
efetch -db=nuccore -format=fasta -id=AF086833 > ~/reference/ebola/1976.fa

Build an index with bwa

bwa index ~/reference/ebola/1976.fa

Align a paired-end dataset

$ fastq-dump -X 10000 --split-files SRR1972739

$ bwa mem -t 10 -R "@RG\tID:SRR1972739\tSM:ebola\tPL:Illumina" ~/reference/ebola/1976.fa SRR1972739_1.fastq SRR1972739_2.fastq > SRR1972739.sam

Sort and index

• Sorting a large number of alignments can be very computationally intensive, so samtools sort has options that allow you to increase the memory allocation and parallelize sorting across multiple threads.

• To index a position-sorted BAM file, we use:
samtools index SRR1972739.sorted.bam

• This creates a file named SRR1972739.sorted.bam.bai, which contains the index for the BAM file.

Indexing the reference genome

samtools mpileup

• samtools mpileup requires an input BAM file (sorted and indexed).

• We also supply a reference genome in FASTA format through the -f option.

Line in the pileup format

AF086833.2 239 T 58 ...... ,,..,,...... ,....,....,,,,,...... ,, ., DDhlDlDmDDDDDDDCJIJeIJqJJJJEGJCJJJJEJJJo EEEFEHHHHHHHHHEEFE

• Column 1 (AF086833.2): reference sequence name
• Column 2 (239): position in the reference sequence, 1-indexed
• Column 3 (T): reference base at this position
• Column 4 (58): depth of aligned reads at this position
• Column 6: base qualities

• Column 5: the read bases; this column encodes the match type
  • Period (.): a reference sequence match on the forward strand.
  • Comma (,): a reference sequence match on the reverse strand.
  • A, T, C, G, or N: a mismatch on the forward strand.
  • a, t, c, g, or n: a mismatch on the reverse strand.
  • ^ and $: start and end of a read; the mapping quality of the alignment is given by the character after ^.
  • Insertions: a plus sign (+), followed by the length of the insertion, followed by the inserted sequence.
  • Deletions: a minus sign (-), followed by the length of the deletion, followed by the deleted sequence.
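As a rough illustration of how the read-bases column can be decoded, the sketch below tallies the symbols described above in plain Python. The function name summarize_pileup_bases and the category labels are invented for this example and are not part of samtools; any character not covered above is simply counted as "other". The example line is the position-236 record from the Ebola pileup with its wrapped fragments joined.

```python
import re
from collections import Counter


def summarize_pileup_bases(bases: str) -> Counter:
    """Tally the match types encoded in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            # Read start: the next character is the mapping quality, skip both.
            i += 2
        elif ch == "$":
            # Read end marker: carries no base of its own.
            i += 1
        elif ch in "+-":
            # Indel: sign, decimal length, then that many sequence characters.
            length = re.match(r"\d+", bases[i + 1:]).group()
            counts["insertion" if ch == "+" else "deletion"] += 1
            i += 1 + len(length) + int(length)
        else:
            if ch == ".":
                counts["match_forward"] += 1
            elif ch == ",":
                counts["match_reverse"] += 1
            elif ch in "ACGTN":
                counts["mismatch_forward"] += 1
            elif ch in "acgtn":
                counts["mismatch_reverse"] += 1
            else:
                counts["other"] += 1
            i += 1
    return counts


# Position 236 of the Ebola pileup (fragments joined, columns tab-separated).
line = (
    "AF086833.2\t236\tT\t59\t"
    "CCCCCCCC" "CCCCCCCCCCCccCCccCCCCCCCCcCCCCcCCCCccccCCCCCCCCC" "ccC\t"
    "P..DDglDmDlDDDDDDDEIIJaHHpJIEJEFIBJJIJDJ" "JJmDDDDHGHFFFFFFDD;"
)
chrom, pos, ref, depth, bases, quals = line.split("\t")
print(chrom, pos, ref, depth, dict(summarize_pileup_bases(bases)))
```

Because this line contains only C and c, the tally holds nothing but the two mismatch categories.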

AF086833.2 236 T 59 CCCCCCCCCCCCCCCCCCCccCCccCCCCCCCCcCCCCcCCCCccccCCCCCCCCCccC P..DDglDmDlDDDDDDDEIIJaHHpJIEJEFIBJJIJDJJJmDDDDHGHFFFFFFDD;

• At this position, all reads disagree with the reference base (T).

Variant Call Format (VCF)

• While pileups are simply per-position summaries of the data in aligned reads, variant and genotype calls require making inferences from noisy alignment data.

• Most variant calling approaches utilize probabilistic frameworks to make reliable inferences in spite of low coverage, poor base qualities, possible misalignments, and other issues.

• Additionally, methods can increase power to detect variants by jointly calling variants on many individuals simultaneously. Each individual simply needs to be identified through the SM tags in the @RG lines in the SAM header.
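To make the role of the SM tag concrete, here is a small sketch that splits an @RG header line into its tags and reads the sample name from it. The helper name parse_read_group is hypothetical, and the example line mirrors the read group passed to bwa mem earlier.

```python
def parse_read_group(rg_line: str) -> dict:
    """Split one tab-separated @RG header line into a {tag: value} dictionary."""
    fields = rg_line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Each remaining field looks like TAG:value; split on the first colon only.
    return dict(field.split(":", 1) for field in fields[1:])


read_group = parse_read_group("@RG\tID:SRR1972739\tSM:ebola\tPL:Illumina")
print(read_group["SM"])  # the sample name used to group reads by individual
```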

• The variant call format (VCF) is a data representation format used to describe variations in the genome. A VCF file may contain information on any number of samples and can be thought of as a single database that summarizes the final results of multiple experiments in a single file.

• VCF is composed of two sections: a header section and a record section.

• bcftools: a toolset to filter, combine and subselect VCF files.

A two-step workflow

• Calling variants with samtools and bcftools is a two-step process.

• In the first step, samtools mpileup called with the -v or -g option will generate genotype likelihoods for every site in the genome. These results will be returned in VCF if -v is used, or in BCF (the binary analog of VCF) if -g is used.

• In the second step, bcftools call will filter these results so that only variant sites remain, and call genotypes for all individuals at these sites.

• Misalignments in low-complexity regions are a major cause of erroneous SNP calls.

• To address this, samtools mpileup enables Base Alignment Quality (BAQ), which uses an HMM to adjust base qualities to reflect not only the probability of an incorrect base call, but also the probability of a particular base being misaligned.

Generating VCF

VCF header

• The VCF header section is located at the beginning of the VCF file and consists of lines that start with ##.

• The VCF header contains the specifications of the various terms that are used throughout the file.

File format

• A single ‘fileformat’ field is always required, must be the first line in the file, and details the VCF format version number.

• For example: ##fileformat=VCFv4.2

Filter format

• FILTERs that have been applied to the data should be described as follows:

##FILTER=<ID=ID,Description="description">

INFO field

• INFO fields should be described as follows (the first four keys are required; Source and Version are recommended):
##INFO=<ID=ID,Number=number,Type=type,Description="description",Source="source",Version="version">

• Type: Integer, Float, Flag, Character, and String

• Number: an integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value should be 1; if the INFO field describes a pair of numbers, then this value should be 2, and so on.

• Number: there are also certain special characters used to define special cases:
  • If the field has one value per alternate allele, then this value should be ‘A’.
  • If the field has one value for each possible allele (including the reference), then this value should be ‘R’.
  • If the field has one value for each possible genotype (more relevant to the FORMAT tags), then this value should be ‘G’.
  • If the number of possible values varies, is unknown, or is unbounded, then this value should be ‘.’.

Format field

• Genotype fields specified in the FORMAT field should be described as follows:

##FORMAT=<ID=ID,Number=number,Type=type,Description="description">

Header line syntax

• The header line names the 8 fixed, mandatory columns. These columns are as follows: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO

• If genotype data is present in the file, these are followed by a FORMAT column header, then an arbitrary number of sample IDs.

Data lines

AF086833.2 230 . C T 225 . DP=69;VDB=0.99941;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,49,7;MQ=60 GT:PL 1/1:255,169,0

• Tab-delimited
• Missing values are specified with a dot (.)

Data lines: fixed fields

• CHROM: An identifier from the reference genome

• POS: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM.

• ID: Semicolon-separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). If there is no identifier available, then the missing value should be used.

• REF: Reference allele on the forward strand. Each base must be one of A,C,G,T,N (case insensitive).

• ALT: Comma-separated list of alternate non-reference alleles on the forward strand. Each base must be one of A,C,G,T,N,* (case insensitive). If there are no alternative alleles, then the missing value should be used.

• QUAL: Phred-scaled quality score for the assertion made in ALT, i.e. −10 log10 prob(call in ALT is wrong). If ALT is ‘.’ (no variant) then this is −10 log10 prob(variant), and if ALT is not ‘.’ this is −10 log10 prob(no variant). If unknown, the missing value should be specified. Higher QUAL scores indicate that the variant caller is more confident in a call.

• FILTER: filter status. PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for the filters that failed, e.g. “q10;s50” might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples.
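The Phred scaling used by QUAL, and the way a FILTER value breaks down into codes, can be sketched in a few lines of Python. The function names below are hypothetical helpers written for this illustration; they are not part of bcftools or any VCF library.

```python
import math


def phred_to_prob(q: float) -> float:
    """Invert Q = -10 * log10(p): return the probability behind a Phred score."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Convert a probability (0 < p <= 1) into a Phred-scaled score."""
    return -10 * math.log10(p)


def failed_filters(filter_field: str):
    """Return None for a missing value, [] for PASS, else the failed filter codes."""
    if filter_field == ".":
        return None
    if filter_field == "PASS":
        return []
    return filter_field.split(";")


# QUAL=225 in the example data line is a vanishingly small probability of "no variant".
print(phred_to_prob(225))
print(failed_filters("q10;s50"))
```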

• INFO: additional information. INFO fields are encoded as a semicolon-separated series of short keys with optional values, in the format key=data[,data].

Data lines: genotype fields

• The VCF format supports a variable amount of genotype information by concatenating many values together into a single column. Values are separated by colons (:), and the FORMAT column describes the order of the values in each genotype column.

• For example, the data line mentioned above has a FORMAT column of GT:PL and the entry for ebola is 1/1:255,169,0. This means that the key GT has a value 1/1, and the key PL has a value 255,169,0.
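The pairing of FORMAT keys with sample values can be sketched as follows. This is a minimal illustration rather than a full VCF parser: parse_vcf_record is a made-up name, values are kept as strings, and the sample names are passed in by hand, whereas a real file supplies them on the #CHROM header line.

```python
def parse_vcf_record(line: str, sample_names: list) -> dict:
    """Split one tab-delimited VCF data line into fixed fields, INFO and genotypes."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = cols[:8]

    # INFO: semicolon-separated keys, each with an optional =value part.
    info_dict = {}
    for item in info.split(";"):
        if item in ("", "."):
            continue
        key, has_value, value = item.partition("=")
        info_dict[key] = value if has_value else True  # flags carry no value

    record = {
        "CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref, "ALT": alt,
        "QUAL": qual, "FILTER": filt, "INFO": info_dict, "samples": {},
    }

    # Genotype columns: FORMAT gives the key order for every sample column.
    if len(cols) > 9:
        keys = cols[8].split(":")
        for name, column in zip(sample_names, cols[9:]):
            record["samples"][name] = dict(zip(keys, column.split(":")))
    return record


line = "\t".join([
    "AF086833.2", "230", ".", "C", "T", "225", ".",
    "DP=69;VDB=0.99941;SGB=-0.693147;MQSB=1;MQ0F=0;AC=2;AN=2;DP4=0,0,49,7;MQ=60",
    "GT:PL", "1/1:255,169,0",
])
record = parse_vcf_record(line, ["ebola"])
print(record["samples"]["ebola"])
```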

• GT: genotype, encoded as allele values separated by either / or |. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. For diploid calls, examples could be 0/1, 1|0, or 1/2. For haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondrion, only one allele value should be given; a triploid call might look like 0/0/1.
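A short sketch of how GT indices map back to bases is given below. The helper decode_genotype is hypothetical; it treats | as the phased separator and returns None for a missing allele (.).

```python
import re


def decode_genotype(gt: str, ref: str, alt: str):
    """Translate a GT string into allele sequences using the REF and ALT fields."""
    # Index 0 is REF; indices 1, 2, ... follow the comma-separated ALT list.
    alleles = [ref] + ([] if alt == "." else alt.split(","))
    phased = "|" in gt
    calls = [
        None if index == "." else alleles[int(index)]
        for index in re.split(r"[/|]", gt)
    ]
    return calls, phased


print(decode_genotype("1/1", "C", "T"))    # the ebola sample in the example data line
print(decode_genotype("0/0/1", "A", "G"))  # a triploid call
```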

• PL: Phred-scaled genotype likelihoods. For biallelic loci these are always in the order ref/ref, ref/alt, alt/alt. All genotype likelihoods are rescaled so that the likelihood of the most likely genotype is 1 (so its Phred-scaled likelihood is 0).

A workflow for calling variants

Overview

• A typical process involves the following interconnected steps:
1. Align reads to reference
2. Correct and refine alignments
3. Determine variants from the alignments
4. Filter the resulting variants for the desired characteristics
5. Annotate filtered variants

• There is no universal rule, method or protocol that would always produce correct answers or guarantee some range of optimality.

fastq-dump

sample="SRR1972739"
RG="@RG\tID:$sample\tSM:$sample\tLB:$sample\tPL:Illumina"
REF="home/jkkim/reference/ebola/1976.fa"
fastq-dump --split-files $sample

bwa mem and samtools

bwa mem -t 4 -R $RG $REF ${sample}_1.fastq ${sample}_2.fastq | samtools sort > ${sample}.sorted.bam
samtools index ${sample}.sorted.bam

samtools mpileup

samtools mpileup -uvf $REF ${sample}.sorted.bam > ${sample}.mpileup.vcf

bcftools call

/opt/genomics/tools/bcftools-1.4.1/bcftools call --ploidy 1 -vm -Ov ${sample}.mpileup.vcf > ${sample}.bcftools.vcf
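The filtering step listed in the overview is not shown in these commands. As a purely illustrative sketch, the Python below keeps header lines and drops data lines whose QUAL is missing or below a threshold. The function name filter_by_qual, the default threshold of 30, and the second data line in the example are all assumptions made for this illustration; in practice a dedicated tool such as bcftools would normally do this job.

```python
def filter_by_qual(lines, min_qual=30.0):
    """Yield VCF header lines unchanged and data lines with QUAL >= min_qual."""
    for line in lines:
        if line.startswith("#"):
            yield line  # header and column-name lines pass through
            continue
        qual = line.rstrip("\n").split("\t")[5]  # QUAL is the sixth fixed column
        if qual != "." and float(qual) >= min_qual:
            yield line


vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "AF086833.2\t230\t.\tC\tT\t225\t.\tDP=69",
    "AF086833.2\t999\t.\tA\tG\t12\t.\tDP=5",  # hypothetical low-quality record
]
for kept in filter_by_qual(vcf_lines):
    print(kept)
```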