UNIVERSITY OF CALIFORNIA RIVERSIDE

SNP Calling Using Genotype Model Selection on High-Throughput Sequencing Data

A Dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Applied Statistics

by

Gabriel Hiroshi Murillo

December 2012

Dissertation Committee:

Dr. Xinping Cui, Chairperson
Dr. Daniel Jeske
Dr. Thomas Girke

Copyright by
Gabriel Hiroshi Murillo
2012

The Dissertation of Gabriel Hiroshi Murillo is approved:

Committee Chairperson

University of California, Riverside

Acknowledgments

I am most grateful to my advisor, Dr. Xinping Cui, for her enthusiasm, dedication and encouragement. Without her guidance, this dissertation would not exist.

I would also like to thank Dr. Na You, for the many helpful discussions we had.

To my parents.

ABSTRACT OF THE DISSERTATION

SNP Calling Using Genotype Model Selection on High-Throughput Sequencing Data

by

Gabriel Hiroshi Murillo

Doctor of Philosophy, Graduate Program in Applied Statistics
University of California, Riverside, December 2012
Dr. Xinping Cui, Chairperson

Recent advances in high-throughput sequencing (HTS) promise revolutionary impacts in science and technology, including the areas of disease diagnosis, pharmacogenomics, and mitigating antibiotic resistance. An important way to analyze the increasingly abundant HTS data is through the use of single nucleotide polymorphism (SNP) callers. Considering a selection of popular HTS SNP calling procedures, it becomes clear that many rely mainly on base-calling and read mapping quality values. Thus there is a need to consider other sources of error when calling SNPs, such as those occurring during genomic sample preparation. Genotype Model Selection (GeMS), a novel method of consensus and SNP calling which accounts for genomic sample preparation errors, is therefore presented. Simulation studies demonstrate that GeMS has the best balance of sensitivity and positive predictive value (PPV) among a selection of popular SNP callers. Real data analyses also support this conclusion.

As an extension to the aforementioned single sample GeMS, the multiple sample Genotype Model Selection (multiGeMS) method is also given. A simulation study and a real data analysis demonstrate that multiGeMS has a good balance of sensitivity and PPV when compared to a selection of popular multiple sample SNP callers.

Contents

List of Figures

List of Tables

1 Introduction

2 Background
    2.1 Biological Background
    2.2 High Throughput (Next Generation) Sequencing
    2.3 File Format Specifications

3 Literature Review
    3.1 Single Sample SNP Callers
        3.1.1 MAQ
        3.1.2 gigaBayes
        3.1.3 Atlas-SNP2
        3.1.4 SNVMix
        3.1.5 Method Comparison
    3.2 Multiple Sample SNP Callers
        3.2.1 SAMtools
        3.2.2 GATK
        3.2.3 “Cross-Sample”

4 Single Sample Genotype Model Selection (GeMS)
    4.1 Data Preparation
    4.2 Procedure
    4.3 Validation
        4.3.1 Simulation Analysis
        4.3.2 Real Data Analysis
            4.3.2.1 The Arabidopsis sup1ros1 dataset
            4.3.2.2 The Thermoanaerobacter sp. X514 Xw2010 dataset
        4.3.3 Computational Performance
    4.4 Discussion
        4.4.1 Haploid GeMS Analysis
        4.4.2 Prior Probabilities

5 Multiple Sample Genotype Model Selection (multiGeMS)
    5.1 Data Preparation
    5.2 Procedure
        5.2.1 Single Sample GeMS Review
        5.2.2 EM Algorithm
            5.2.2.1 E-step
            5.2.2.2 M-step
            5.2.2.3 Initial Values
            5.2.2.4 Convergence
        5.2.3 SNP and Consensus Calling
    5.3 Validation
        5.3.1 Simulation Analysis
        5.3.2 Real Data Analysis

6 Future Work
    6.1 Refinements
    6.2 metaGeMS

Bibliography

Appendix A Intuition Behind the EM Algorithm

Appendix B The Local False Discovery Rate
    B.1 FDR, Positive FDR and Bayesian FDR
    B.2 Local FDR

Appendix C Filtering Alignment Files
    C.1 SAM/BAM Bitwise Flags
    C.2 Minimally Recommended Practices

List of Figures

2.1 Description of some genetic variants
2.2 HTS data analysis pipeline
2.3 Aligned reads with variants
2.4 FASTA file example
2.5 FASTQ file example

3.1 Multiple sample SNP calling data notation

4.1 q and w parameter relationship
4.2 Sensitivity and PPV plot of SNP caller performance
4.3 Zoomed-in view of sensitivity and PPV plot of SNP caller performance
4.4 Arabidopsis SNP call Venn diagram
4.5 X514 SNP call Venn diagram

5.1 EM algorithm iterations
5.2 multiGeMS simulation study samples
5.3 50 samples of 1000 Genomes Project data, Part 1
5.4 50 samples of 1000 Genomes Project data, Part 2

C.1 Alignment File Filtering

List of Tables

2.1 IUPAC Nucleotide Base Codes
2.2 Simplified, relative comparison of Sanger and Illumina DNA sequencing technologies
2.3 Some base-calling procedures
2.4 Some alignment procedures
2.5 Some SNP calling procedures

3.1 Atlas-SNP2 prior probabilities
3.2 Comparison of surveyed SNP calling methods

4.1 Settings for GeMS SNP calling
4.2 GeMS model notation
4.3 Diploid q^g for Y_j^l ∼ Categorical(q^g)
4.4 Options used in SNP calling of simulated single sample data
4.5 Single sample simulation SNP caller sensitivity
4.6 Single sample simulation SNP caller PPV
4.7 GeMS simulation results summary
4.8 3 model GeMS sensitivity
4.9 3 model GeMS PPV
4.10 Options used in SNP calling of the Arabidopsis dataset
4.11 Arabidopsis SNP call proportions
4.12 Options used in SNP calling of the X514 dataset
4.13 Single sample simulated data SNP calling computer specifications
4.14 Time to completion of single sample SNP calling procedures
4.15 Average memory used by single sample SNP calling procedures
4.16 Maximum memory used by single sample SNP calling procedures
4.17 Haploid q^g for Y_j^l ∼ Categorical(q^g)

5.1 Settings for multiGeMS SNP calling
5.2 multiGeMS model notation
5.3 Options used in SNP calling of simulated multiple sample data
5.4 multiGeMS simulation results summary
5.5 Options used in SNP calling of the 50 samples of 1000 Genomes Project data
5.6 multiGeMS 1000 Genomes Project data results summary

B.1 Possible outcomes from m hypothesis tests
B.2 lFDR prior and density notation

C.1 Selected bitwise flags from SAM file FLAG field

Chapter 1

Introduction

In the computing industry, the so-called “Moore’s Law” is a famous prediction that the number of transistors on a microprocessor will double approximately every 2 years. Moore’s Law is a prominent growth benchmark, and few technological trends have outpaced it. However, since 2008, when Sanger-based DNA sequencing began to give way to ‘next generation’ technologies, high-throughput sequencing (HTS) [49], [61] has been one of those trends.

Data collected by the National Human Genome Research Institute illustrate the incredible speed at which high-throughput sequencing is decreasing the cost of DNA sequencing. In 2001, the cost to sequence a human-sized genome was approximately $100,000,000. By 2006, the cost had decreased by a factor of 10 to $10,000,000. Similar decreases happened in 2008, 2009 and 2011 [65]. Currently, two companies are planning to release technology that enables a human-sized genome to be sequenced for $1,000 by the end of 2012 [23], while at least one other company is aiming for the $100 genome [11]. At this rate, it can be expected that within the coming years, getting one’s whole genome sequenced could become as commonplace as a blood test is today.

Without knowing what the exact costs of DNA sequencing will be in the next few years, one thing is clear: the genomic revolution is accelerating and will likely touch all our lives in a profound way. Since their inception, HTS technologies have essentially been limited to research organizations; however, they are now expected to become accessible to all those in industrialized nations. Many people will then be able to accurately discover which diseases they and members of their families may be prone to, and to live their lives in a way that minimizes the harmful effects of those possible diseases. Further, pharmacogenomics [50] holds the promise of accurately predicting which types of medical treatment, and pharmaceutical medication in particular, will provide the most good and least harm for every patient’s genetic profile.

Humans are not the only organisms that can be sequenced. In fact, all life is encoded in nucleic acid molecules such as DNA and RNA. Thus the applications of HTS go far beyond that of human genomics. Other areas of public health research that are being positively affected by HTS include the fight against antimicrobial resistance and the mitigation of infectious disease pathogens. Likewise, beyond public health, HTS is beginning to influence research in the agricultural sciences, environmental sciences and the science of alternative energy sources¹.

Due to the massive decrease in the costs of DNA sequencing, there appears to be no shortage of HTS data available for scientists to analyze. Thus the bottleneck in these analyses is not the amount of data itself, but the group of statistical and computational algorithms used to analyze this large amount of data. Indeed, more accurate and computationally efficient data analysis tools are needed to translate the raw sequencing data into findings useful for medical workers, patients and our global society in general.

¹ Due to all of these changes, legal and ethical concerns involving insurance policies, employment requirements, and genetic discrimination and disclosure in general will need to be addressed. Thus this genomic revolution is expected to be a case of ‘technology outpacing morality’.

To this end, this dissertation chronicles the development, computational implementation and validation of the novel statistical algorithm Genotype Model Selection (GeMS). GeMS is open source and freely available [52] to all who would like to explore single nucleotide polymorphisms (SNPs) [10] in sequencing data. Not only can GeMS be used by seasoned biologists, statisticians and other scientists to further research discoveries, but it can also be used to train future scientists with hands-on data analysis experience.

Chapter 2

Background

2.1 Biological Background

Before we begin a discussion of statistical methods used in single nucleotide polymorphism (SNP) calling, we must first understand what the data “looks like”. This requires that we first understand the basics of DNA. Deoxyribonucleic acid, or DNA, has often been compared to the blueprints of life: it is a molecular code that contains all of the instructions needed to describe our personal physical features, some personality traits, and how our bodies can maintain themselves.

In many living things, DNA is organized into chromosomes and has the often illustrated double-helix shape. The two strands of the double helix are composed of bases, or nucleotides, named Adenine, Cytosine, Guanine and Thymine. These bases are commonly identified by their abbreviations: A, C, G, and T, respectively. It is important to note that A always binds with T and C always binds with G. This means that each strand determines the other, so we only need to consider one. Sometimes the identity of a nucleotide cannot be clearly determined. In this case we can use the IUPAC nucleotide base codes as displayed in table 2.1.

Symbol   Meaning
A        Adenine
C        Cytosine
G        Guanine
T        Thymine
M        A or C
R        A or G
W        A or T
S        C or G
Y        C or T
K        G or T
V        A or C or G
H        A or C or T
D        A or G or T
B        C or G or T
N        A or C or G or T

Table 2.1: A listing of the 15 possible combinations of nucleotide base codes by the IUPAC [56].
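As a small illustration of how the codes in table 2.1 can be used in software, the following Python sketch stores the 15 codes in a dictionary and looks up the code for a set of candidate nucleotides. The names `IUPAC_CODES` and `iupac_code` are hypothetical and are not taken from GeMS or any other package.

```python
# Each IUPAC symbol from table 2.1 mapped to the nucleotides it may stand for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}

# Reverse lookup: an unordered set of nucleotides -> its IUPAC symbol.
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def iupac_code(bases):
    """Return the IUPAC symbol for a non-empty collection of nucleotides."""
    return BASES_TO_CODE[frozenset(bases)]


if __name__ == "__main__":
    print(iupac_code("AT"))        # an ambiguous A-or-T call
    print(iupac_code(["G", "G"]))  # repeated bases collapse to one symbol
```

For example, a site that could be either A or T is written as W, while a site where nothing is known is written as N.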

As we know, we received some characteristics from our father and some from our mother. This is because when we were conceived, we were given a set of 23 chromosomes from our father and another set of 23 chromosomes from our mother. Genomes, and thus organisms, that have two sets of chromosomes are called diploid (e.g. humans) and genomes with one set are called haploid (e.g. bacteria).

With much excitement in the scientific community, the Human Genome Project (HGP) was finished in 2003 after an impressive collaboration of work from around the globe. It took $2.7 billion (1991 USD) and 13 years [27] to complete and it has been considered a triumph of human history. One of the main goals of the Human Genome Project was to create a human reference genome.

Reference Sequence            ...CATCATCATCAT...
                                       ^
Homozygous SNP Example        ...CATCATGATCAT...
                              ...CATCATGATCAT...
Heterozygous SNP Example 1    ...CATCATGATCAT...
                              ...CATCATCATCAT...
Heterozygous SNP Example 2    ...CATCATGATCAT...
                              ...CATCATTATCAT...
Homozygous Insertion          ...CATCATGCATCAT...
                              ...CATCATGCATCAT...
                              ...

Figure 2.1: Here are a few of the many types of possible variations between a reference genome and that of a particular organism.

A species’ reference genome is a representation of the genome sequence for any member of that species. It can be created by the (de novo) sequencing of an individual organism’s genome or a pool of genomes from a group of organisms of the same species. Even if the organism is diploid, that is, it has two sets of chromosomes, the reference genome generally represents only one set of chromosomes¹. For multi-chromosomal organisms, the reference genome is constructed by concatenating each chromosome sequence together. An index is used to track positions on this reference genome as well as where chromosomes start and finish.
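The concatenation and index described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names, the toy chromosome sequences and the choice of 1-based inclusive coordinates are hypothetical, and real reference indexes are stored in dedicated file formats.

```python
def concatenate_chromosomes(chromosomes):
    """Join (name, sequence) pairs into one string and record where each starts and ends.

    Returns the concatenated genome and an index mapping each chromosome name
    to its (start, end) positions, 1-based and inclusive.
    """
    index = {}
    pieces = []
    position = 0
    for name, sequence in chromosomes:
        index[name] = (position + 1, position + len(sequence))
        pieces.append(sequence)
        position += len(sequence)
    return "".join(pieces), index


def locate(index, genome_position):
    """Translate a position on the concatenated genome to (chromosome, offset)."""
    for name, (start, end) in index.items():
        if start <= genome_position <= end:
            return name, genome_position - start + 1
    raise ValueError("position lies outside the reference genome")


if __name__ == "__main__":
    # Toy two-chromosome "genome"; the sequences are made up for illustration.
    genome, index = concatenate_chromosomes([("chr1", "CATCAT"), ("chr2", "GATTACA")])
    print(genome, index, locate(index, 8))
```

With this layout, a single integer identifies any site on the reference, and the index recovers which chromosome the site belongs to.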

As there are differences between individuals’ outward appearances, there are also differences between individuals’ genomes. It is often stated that the genomes of any two randomly selected people are 99.9% the same. Thus, scientists are interested in that 0.1% difference between individuals since they believe that some of these differences may be used to realize some of the public health benefits described in Chapter 1.

It is important to discuss the types of genomic differences that can occur. See figure 2.1 for some visual examples. First, imagine that we have the human reference genome and the genome of another person whom we will call Adam. Imagine that a particular site on the human reference genome, such as the one marked in figure 2.1, contains the nucleotide C but Adam, in his two sets of chromosomes, has the nucleotides G and G. Since there is a single nucleotide difference between the reference genome and Adam’s genome, Adam has a single nucleotide polymorphism (abbreviated SNP and pronounced “snip”) at this site. Moreover, in this example, Adam has a homozygous SNP since the nucleotides on both sets of chromosomes are identical.

¹ The reason for this is that in the past, when sequencing was more costly, the benefit received from sequencing both sets of chromosomes was not worth the extra cost to do so. This is because at the vast majority of sites, the two sets of chromosomes are equal, i.e. they are homozygous.

If, however, one of Adam’s sets of chromosomes has a C whereas the other has a G, then Adam would have a heterozygous SNP since the nucleotides between his sets of chromosomes do not match and there is still a difference between the reference genome and his genome.
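The distinction between homozygous and heterozygous SNPs can be written as a short rule. The Python sketch below assumes that the true diploid genotype at a site is already known, which is exactly what a SNP caller must estimate in practice; the function name `classify_site` is hypothetical.

```python
def classify_site(reference_base, genotype):
    """Classify a diploid site against the reference genome.

    genotype is a pair of nucleotides, one from each set of chromosomes.
    """
    first, second = genotype
    if first == second == reference_base:
        return "no SNP"
    if first == second:
        # Both sets of chromosomes agree with each other but not with the reference.
        return "homozygous SNP"
    # The two sets of chromosomes differ, so at least one differs from the reference.
    return "heterozygous SNP"


if __name__ == "__main__":
    print(classify_site("C", ("G", "G")))  # Adam's homozygous example
    print(classify_site("C", ("C", "G")))  # Adam's heterozygous example
    print(classify_site("C", ("G", "T")))  # heterozygous example 2 of figure 2.1
```

Note that a site where both of the sample's nucleotides differ from the reference and from each other, as in the second heterozygous example of figure 2.1, is still a heterozygous SNP.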

Further, insertions and deletions are often seen when a sample genome is compared to the reference genome. An insertion in the sample genome means that an extra nucleotide is present in the sample’s genome when compared to the reference genome. Likewise, a deletion in the sample genome means that a nucleotide is missing from the sample’s genome when compared to the reference genome. Collectively, insertions and deletions are known as ‘indels’. The ideas of heterozygosity and homozygosity can apply to indels as well. There are also other common genomic variations that are beyond the scope of this discussion.

2.2 High Throughput (Next Generation) Sequencing

As described in Chapter 1, since the middle to late 2000s, there has been a shift from the ‘first generation’ of sequencing, using Sanger technology, to newer ‘next generation’ methods [49]. Though the Sanger method dominated for nearly two decades and gave us the human reference, it is very costly and not very fast, and thus there was not much data to analyze. Next generation sequencing (NGS), which falls under the category of high throughput sequencing (HTS), however, has both advantages and disadvantages. In particular, this discussion will focus on the ‘next generation’ Illumina sequencing technology [26]. Though it is much cheaper, faster and generates a massive amount of data, Illumina sequencing is less accurate than the Sanger method, produces shorter reads and requires new statistical methods to handle the massively increased data throughput. See table 2.2 for a relative comparison between these 2 sequencing technologies.

Technology                       Sanger                Illumina
Technical name                   Chain termination     Sequencing by synthesis
Technology generation moniker    ‘First generation’    ‘Next generation’
Popular commercially             1980s - mid 2000s     Late 2000s - present
Amount of data produced          Low                   Much higher
Read length                      Long                  Shorter
Accuracy                         High                  Not as high
Speed                            Slow                  Much faster
Cost                             Expensive             Much cheaper
Statistical methods              Mostly developed      In development

Table 2.2: A very simplified, relative comparison of Sanger and Illumina DNA sequencing technologies.

1. Acquisition, Fragmentation, Amplification
        ↓
2. Sequencing (raw intensity data)
        ↓
3. Base-Calling (quality scores)
        ↓
4. Alignment (quality scores)
        ↓
5. Consensus and Variant (SNP) Calling

Figure 2.2: This figure identifies the major steps undertaken in most HTS data analyses.

The following is a very simplified account of how HTS data from the Illumina sequencing hardware can be used to find variants between an organism’s genome and its species’ reference genome. Please see figure 2.2 for a summary of the following steps.

1. First, many genome samples are acquired from the organism. This can be done in a variety of ways, one of which is swabbing some skin cells from inside the mouth. This step is known as acquisition.

2. It would be hard to sequence the genome from one end to the other since the genome sequence is usually very long (> 3 billion base pairs in humans). So we break up the genome samples into small sections called reads. In existing technologies, the read size is often 50, 100 or more base pairs (bp). This step is known as fragmentation.

3. Though it is now easier to sequence the individual reads than the entire genome, it is still difficult to get a clear picture of the actual bases on each read since they are molecular in size. So we attach these reads to different parts of a slide and amplify them by making colonies of read copies, since it is easier to sequence a colony of read copies than an individual read. This step is called amplification. Taken together, the acquisition, fragmentation and amplification steps are also known as the genomic sample preparation procedure.

4. To sequence a colony of amplified reads, each base pair, or cycle, is photographed 4 times. Each of these photographs yields a raw image which filters for one of the 4 bases A, C, G and T. Thus an intensity value for each base is recorded at each cycle. This step is called sequencing.

5. The base-calling step happens when the raw intensity data is processed by a base-calling algorithm. The simplest base-calling algorithm just chooses the base with the highest raw intensity value as the base-call, though many would choose to use a more statistically sophisticated model-based base-calling algorithm. See table 2.3 for a partial listing of available base-callers. Most researchers use the default Illumina Bustard base-caller that comes with the Illumina hardware. In addition to the actual determination of each base, most base-calling algorithms provide a base-calling quality score for each base. Essentially this quality score gives the probability of an incorrectly called base.

Name               Year   Author(s)               Notes
BING [33]          2010   Kriseman et al.         NGS Pipeline
Srfim [9]          2009   Corrada-Bravo et al.    Model-based
Ibis [29]          2009   Kircher et al.          Machine Learning
BayesCall [28]     2009   Kao et al.
PIQA [47]          2009   Martinez-Alcantara      NGS Pipeline
Swift [66]         2009   Whiteford et al.
Rolexa [57]        2008   Rougemont et al.
Alta-Cyclic [18]   2008   Erlich et al.
Bustard [25]       N/A    Illumina

Table 2.3: Some base-calling procedures. Srfim is perhaps the most statistically interesting, though Illumina’s Bustard is the most popular as it is the default choice for those with Illumina hardware.

GAGTTATATCGCTTCCATGA
GAGTTTTATCGCTTCCATGACGCACAAGTT
GAGTTTTATCGCTTCCATGACGCAGAAGTT
GAGTTTTATGGCTTCCATGACGCACAAGTTAACACTTTCG
          GCTTCCATGACGCACAAGTTAACACTTTCGGATATTTCTG
                    CGCACAAGTTAACACTTTCGGATTTTTCTGATGAGTCGAA
                    CGCACAAGTTAACACTTTCGGATTTTTCTGATGAGTCGAA
                              AACACTTTCGGATATTTCTGATGAGTGGAA...
                                        GATTTTTCTGATGAGTCGAA...
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAA...
123456789012345678901234567890123456789012345678901234567890...

Figure 2.3: This is an illustration of some reads that have been aligned to a reference, along with possible SNPs. Non-reference alleles are in red.

6. The problem we now have is that breaking up the original genome samples in the fragmentation stage results in the loss of information as to where each read goes. But since we have the reference genome, we can match the reads to the reference, which is called the alignment step. See figure 2.3, which demonstrates what reads aligned to a reference genome would look like. Also, see table 2.4, which gives a partial listing of alignment software packages. Since the organism will have some differences from the reference, we cannot require every alignment to be perfect. For example, if a particular read matches all but 1 of the nucleotides, we can still align the read by assuming that there might be a base-calling error, SNP or indel at the mismatched location. Now that we have aligned our reads, each read will have an alignment quality score which indicates how well the read matches its aligned location. A base-call that is aligned to a specific location on the reference genome is frequently termed an allele aligned to that position. The set of alleles aligned to a location is often called the pileup at that location.

Name             Year   Author(s)           Notes
Bowtie [34]      2009   Langmead et al.     RNA-Seq friendly
SOAP2 [42]       2009   Li, R. et al.
Mosaik [63]      2009   Strömberg et al.
BWA [38]         2009   Li, H. et al.       Burrows-Wheeler transform
Novoalign [64]   N/A    Novocraft           For purchase
MAQ [40]         2008   Li, H. et al.       Precursor to BWA

Table 2.4: Some alignment procedures. The theory behind the alignment methods requires a strong computer science background.

7. Generally the base-calling quality and alignment quality scores are considered along with the allele pileup in the consensus and variant calling step. Based on these scores, we can now determine the consensus genotype of each position on the sample organism’s genome. The consensus genome sequence is the sequence we decide belongs to the organism based on the aligned reads. Again, the simplest way to determine the consensus genotype is to just output the mode of the aligned alleles. For instance, looking at figure 2.3, it is safe to say from the allele pileups that the first 5 consensus genotypes are GAGTT. However, we encounter an issue at the 6th location on the reference genome, as there is one non-reference allele in the pileup. Specifically, the reference at this location is T and the allele pileup contains 3 T nucleotides and 1 A nucleotide. Naturally one would gravitate to calling the consensus genotype as T but there are situations where this may not be prudent. For instance, it is possible that the T nucleotides have low base-calling and alignment quality scores associated with them and the A nucleotide has high base-calling and alignment quality scores. In this case, we might call the consensus nucleotide as W (A or T). Thus it is clear that we need to incorporate more than just the allele pileup to get an accurate consensus call. The Genotype Model Selection (GeMS) procedure, as explained in chapter 4, will give more information on calling consensus genotypes.

Name              Year   Author(s)           Notes
VarScan 2 [31]    2012   Koboldt et al.
Bambino [15]      2011   Edmonson et al.
piCALL [4]        2011   Bansal et al.       Based on SNIP-Seq
GATK [48], [13]   2010   McKenna et al.      Popular
FreeBayes [19]    2010   Garrison et al.     Based on gigaBayes
SNIP-Seq [3]      2010   Bansal et al.
SNVMix2 [21]      2010   Goya et al.         Mixture model
Slider II [44]    2010   Malhis and Jones    Coverage independent
Atlas-SNP2 [59]   2010   Shen et al.         Logistic reg priors
inGAP [54]        2010   Qi et al.           Based on POLYBAYES
SAMtools [39]     2009   Li, H. et al.       Popular
VarScan [30]      2009   Koboldt et al.
SNVMix1 [58]      2009   Shah et al.         Mixture model
SOAPsnp [41]      2009   Li, R. et al.       Similar to MAQ
Slider [43]       2009   Malhis et al.       Merge-sort approach
MAQ [40]          2008   Li, H. et al.       Error dependency
gigaBayes [46]    N/A    Marth et al.        Based on POLYBAYES
POLYBAYES [45]    1999   Marth et al.        Pre-NGS

Table 2.5: Some SNP calling procedures. This is a partial listing of the SNP callers considered during the research phase of this dissertation.

8. Variant calling builds on consensus calling and adds the step of determining whether there is enough evidence to report a variant. There are many possible variants, such as those shown in figure 2.1, but we will focus on SNP calling. Looking back to figure 2.3, we considered the fact that there might be a heterozygous SNP at location 6 on the reference genome. Based on our model, if W has by far the strongest genotype likelihood compared to the other genotype likelihoods then we should be inclined to call a SNP here. Given non-extreme base-calling and alignment quality scores, the sites with just one non-reference allele in figure 2.3 are less likely to harbor a SNP than the sites that have a higher proportion of non-reference alleles. For instance, without knowing the base-calling and alignment quality values, it would be easier to label sites 25 and 44 as homozygous and heterozygous SNPs, respectively. The quality scores and other information allow GeMS, as explained in chapter 4, to accurately call SNPs.
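To make the pileup and the "mode of the aligned alleles" idea from steps 6 and 7 concrete, the following Python sketch rebuilds the pileups of figure 2.3 and applies the naive consensus rule. It is not the GeMS procedure: it ignores all quality scores, and the read start positions are our own assumption, inferred by matching each read in the figure against the reference. The function names are hypothetical.

```python
from collections import Counter

# First 60 reference bases transcribed from figure 2.3.
REFERENCE = "GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAA"

# (1-based start position, read sequence); start positions are inferred by
# matching each read to the reference, not produced by an alignment program.
ALIGNED_READS = [
    (1, "GAGTTATATCGCTTCCATGA"),
    (1, "GAGTTTTATCGCTTCCATGACGCACAAGTT"),
    (1, "GAGTTTTATCGCTTCCATGACGCAGAAGTT"),
    (1, "GAGTTTTATGGCTTCCATGACGCACAAGTTAACACTTTCG"),
    (11, "GCTTCCATGACGCACAAGTTAACACTTTCGGATATTTCTG"),
    (21, "CGCACAAGTTAACACTTTCGGATTTTTCTGATGAGTCGAA"),
    (21, "CGCACAAGTTAACACTTTCGGATTTTTCTGATGAGTCGAA"),
    (31, "AACACTTTCGGATATTTCTGATGAGTGGAA"),
    (41, "GATTTTTCTGATGAGTCGAA"),
]


def build_pileups(reads, reference_length):
    """Collect, for every reference site, the alleles of all reads covering it."""
    pileups = {site: [] for site in range(1, reference_length + 1)}
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            site = start + offset
            if site <= reference_length:
                pileups[site].append(base)
    return pileups


def naive_consensus(pileup):
    """Return the most frequent allele, or 'N' if no read covers the site.

    Ties are broken by whichever allele was seen first in the pileup.
    """
    if not pileup:
        return "N"
    return Counter(pileup).most_common(1)[0][0]


if __name__ == "__main__":
    pileups = build_pileups(ALIGNED_READS, len(REFERENCE))
    for site in (6, 25, 44):
        counts = dict(Counter(pileups[site]))
        print(site, REFERENCE[site - 1], counts, naive_consensus(pileups[site]))
```

At site 6 the pileup holds 3 T alleles and 1 A allele, so the mode agrees with the reference T, whereas at sites 25 and 44 the mode is a non-reference allele. The sketch shows why allele counts alone are a blunt instrument: it cannot express a heterozygous call such as W, and it treats a low-quality allele exactly like a high-quality one.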

2.3 File Format Specifications

When taking part in the analyses involved in figure 2.2, an HTS data analyst will likely need to work with a variety of file formats, just as such an analyst would want to work with many of the software packages listed in tables 2.3, 2.4 and 2.5. Six such useful file formats are explained below:

• FASTA [7] The FASTA file specification is the simplest of the 6 useful file formats described in section 2.3. FASTA files serve as text based containers of nucleotide strings. With respect to SNP calling, FASTA files will most commonly be used to contain reference genome sequences. For a single nucleotide string FASTA file, such as a single chromosome reference genome, the first line is reserved for the sequence information and usually begins with the greater-than symbol (“>”). The remaining lines contain the nucleotide sequence in the correct order as given by the physical nucleotide string. It is generally recommended that no lines are longer than 80 characters and sometimes other line lengths, such as 60, are used. The first 6 lines of a toy example FASTA file are given in figure 2.4.

> Alphanumeric | sequence | code | sequence name keywords
ACGATGCGATACAAAAAAAAAAAAAAGATAACGATAGATTTTTTAGACAGATACCCAGAC
ACCCAGATAGCAGACCCCCCCAGTACAGCAATGACCCGGGGCATACGACCCCCCCCTACT
CCCCCCCCCTCAGACCATGGATGGCGGGGGGGGGGGGGGACACGATGAGATCCAGAGTCA
GCCCCCAGAGATCACGACATACGATCAGACTACGTTTTTTTTTACATCACGACATCAGAC
CAGCATATTTTAGAGGGGGAAACGACACACAGCAGACCCCCCCCCCCCCCCAGACAGCAG

Figure 2.4: The first six lines of a FASTA file example.

• FASTQ [35] The FASTQ file specification adds some more information to the FASTA file specification. Instead of a general nucleotide sequence, FASTQ files account for short reads, specifically those defined in section 2.2. Information about each short read sequence is given in a 4 line section of a FASTQ file, i.e. lines 1-4, 5-8, 9-12, etc. Thus FASTQ files will always have 4 times as many lines as the number of short reads described by the FASTQ file. The first line of these 4 line sections always gives the read identifier and begins with an “at” (‘@’) symbol. These sequence identifiers usually have a specific format depending on the type of reads represented by the FASTQ file. The second of these 4 lines always gives the base-calls of the read sequence. The third line always begins with a “plus” (‘+’) symbol and is optionally followed by the same sequence identifier as in the first line. The fourth line gives the base-calling quality values associated with the corresponding base-calls in the second line. Since these base-call quality values are encoded in the Phred ASCII character-integer scheme, the user needs to be aware of the exact Phred offset value used (usually 33 or 64). Additionally, HTS read data can be “paired-end” or “mate-pair”. This means that the beginning and end of a long nucleotide string are sequenced but the exact length of the center nucleotide string is not known. If this is the case then the data is usually given with a pair of FASTQ files, and a read at a certain location in one FASTQ file will be paired with the read in the same location of the paired FASTQ file. If the data is known to be a part of multiple samples, then generally each FASTQ file will represent no more than 1 sample. Figure 2.5 gives a toy example that shows the first reads of a pair of FASTQ files that contain paired-end data. The example uses the Illumina sequence identifier format.

@INSTRUMENT:LANE:TILE:X_POS:Y_POS:#0/1
ACGATGCGATACAAAAAAAAAAAAAAGATAACGATAGATTTTTTAGACAGATACCCAGAC
+
OFPXQXZTa`aTaa`aJaVVJJV`ZWZ`aaaaaa`aTXOFNOOGOONNZTZTBQXWSSaT

@INSTRUMENT:LANE:TILE:X_POS:Y_POS:#0/2
ACCCAGATAGCAGACCCCCCCAGTACAGCAATGACCCGGGGCATACGACCCCCCCCTACT
+
VZZRaaOYOY`ZTTZQQaXXVaaNV`OYO`X`WXWUXYYWZXWVTZVVW`aaTTBBBYBZ

Figure 2.5: The first reads of a pair of FASTQ files that contain paired end data.

• SAM [22] The SAM file format is one of the most complicated of these 6 commonly used file formats. SAM files contain the alignment information output after running a FASTQ file through an alignment procedure that supports SAM format output. Usually SAM format files have extensive header sections that give details about the alignment. Following this header section is the read section. By default, the order of this read section is generally the order given in the FASTQ file; however, users have some flexibility in reordering this section. Each line represents a read from the FASTQ file and gives much read alignment information including the read sequence, the base-call quality scores and the mapping quality score. If the data is known to be a part of multiple samples, then generally each SAM file will represent no more than 1 sample. There is generally a wealth of information contained in these files and it takes much experience to fully utilize all this information.

• BAM [22] The BAM file format is a binary version of the text-based SAM format.

Though the SAM format describes the layout of the BAM format, it is the BAM

format that is used most often. The reason for the ubiquity of the binary BAM

format is that BAM files are usually much smaller in size than SAM files, but they

contain the same amount of information. Even so, BAM files can easily be found in

the gigabyte range with respect to size. Additionally, BAM files have an advantage

with respect to computational performance when compared to the SAM format.

The premier software suite to work with these SAM/BAM files is SAMtools [39].

• PILEUP [36] Most SNP calling procedures now support BAM file input, however

prior to this, the PILEUP format was more popular. One reason could be the

ease with which PILEUP files are parsed. Since many SNP calling procedures consider

each site independently, the PILEUP format can deliver the essential data without

any preprocessing. The main difference between the SAM/BAM formats and the

PILEUP format, is that the PILEUP gives a site’s allele pileup, the base-calling

quality values and optionally the alignment quality values on one line. In order to

get this information from a SAM/BAM file, the SNP caller would need to parse

many lines to gather all the alleles piled up at a site from all the reads that covered

the site. As with SAM/BAM files, if the data is known to be a part of multiple

samples, then generally each PILEUP file will represent no more than 1 sample.

The advantage that the SAM/BAM file formats have over the PILEUP format is

the wealth of information available about all the alignments. In this way, we see

that the PILEUP format is truly a gutted alignment file made from a SAM/BAM

file.

• VCF [53] The VCF file format is also one of the most complicated out of these 6

commonly used file formats. In this way, it is very similar to the SAM file. The

VCF file is composed of 2 parts, an extensive header and a detailed site-by-site

listing of variant information. As with the SAM file, the information contained in

the header and in the variant listing will depend on the variant caller used. The

variant listing will include the reference sequence position, the reference allele and

the variant genotype. If the data is known to be a part of multiple samples, then

generally the one VCF file will be used to account for all the variants in all the

samples. In this way, the VCF format is in general different from the other file

formats, discussed above, which usually only account for just 1 sample. There is

an extensive amount of additional information contained in these files and it takes

much experience to fully utilize this information. The BCF file format is a binary

version of the VCF file format. It shares the advantages that the BAM file format

has over the SAM file.

Chapter 3

Literature Review

To begin this literature review, the distinction between single sample and multiple sample data should be made. Single sample HTS data is generally assumed to be sequencing data of genomic reads that were taken from only 1 organism¹. Broadly speaking, ‘multiple sample HTS data’ can include samples from different species, but the term usually refers to sequencing data of genomic reads that were taken from more than 1 organism of the same species.

3.1 Single Sample SNP Callers

As most single sample SNP callers work with similar types of data, there are quite a few similarities between the SNP callers. Thus instead of considering the details of all SNP callers introduced, we will consider a few representative SNP calling procedures, namely MAQ, gigaBayes, Atlas-SNP2 and SNVMix. Then we will compare major points of interest between them.

¹ If we work under the assumption that the samples were collected from different places on the organism which would harbor different genomes and/or during different time periods which would harbor different genomes, then accounting for these differences, a multiple genome sample framework would be preferable. This situation, though, is beyond the scope of this document.

As covered in the simplified storyline in section 2.2, for a given genomic site,

SNP calling methods first try to determine the consensus genotype and then they call a SNP if there is a relatively high amount of confidence that the consensus genotype is non-reference. To determine this confidence level, as well as the consensus genotype,

SNP callers use a variety of information beyond the basics of the allele pileup and the base-calling and alignment quality scores. To incorporate all of this information into the

SNP calling decision, almost all of the available SNP calling procedures use some form of Bayes’ theorem². Given that $B$ is some set such that $P(B) > 0$ and that $A_1, A_2, A_3, \ldots, A_n$ form a partition of a sample space, a basic form of Bayes’

theorem states the following.

\[
P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{P(B)} = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)} \tag{3.1}
\]

A large difference between the SNP calling procedures, however, is just how the partition of the {Ai} in Bayes’ theorem is structured.

3.1.1 MAQ

As alluded to above, MAQ [40] uses Bayes’ theorem to call a consensus. MAQ considers only the 2 most frequently aligned alleles $b$ and $b'$ covering each particular site³.

The reason for this is that, given a diploid genome, there really only should be up to two nucleotides represented in the reads as more than two nucleotides would generally

² There are some SNP callers which do not make use of Bayes’ theorem. One such example is the heuristic VarScan procedure [30] which in essence reports all variant alleles at genomic sites that meet certain criteria.

³ Note that each of the profiled methods is described using a different type of notation. Though notational unity is preferred, the ideas between the SNP calling methods can be both similar and diverse, and thus attempting to combine all these ideas under a unified notational scheme leads to logical difficulties. To resolve this issue, the following method summaries will use the notation as described in their respective papers.

represent an error. From these two most frequent nucleotides, MAQ assumes that there are three possible genotypes, namely $g \in \{\langle b, b\rangle, \langle b', b\rangle, \langle b', b'\rangle\}$, where $b, b' \in \{A, C, G, T\}$ and the angled brackets, $\langle\,\rangle$, denote an unordered set. Let $n$ be the combined total of $b$ and $b'$ alleles covering the site and let $k$ be the number of called $b$ alleles.

The data $D$ includes $\epsilon_i = 10^{-Q_i/10}$, where $Q_i = \min$(mapping quality of the $i$th read covering the site, base-calling quality for the base on the $i$th read covering the site). The consensus is called to be $\hat{g} = \arg\max_g P(g \mid D)$ where,

\[
P(g \mid D) = \frac{P(g)\,P(D \mid g)}{\sum_s P(s)\,P(D \mid s)}. \tag{3.2}
\]

The prior probabilities are fixed as follows: $P(\langle b, b'\rangle) = r$ and $P(\langle b, b\rangle) = P(\langle b', b'\rangle) = \frac{1-r}{2}$, where $r = 0.2$ at known SNP sites and $r = 0.001$ elsewhere. Thus,

MAQ works best, as intended, when prior knowledge of known SNP sites is available.

As for developing the conditional probability $P(D \mid g)$, first let us consider $g = \langle b', b\rangle$. For this genotype, the MAQ reference [40] simply states,

\[
P(D \mid \langle b, b'\rangle) \approx \frac{\binom{n}{k}}{2^{n}} = \frac{\text{“\# of ways to choose the $b$ bases”}}{\text{“\# of subsets of $n$ elements”}}. \tag{3.3}
\]

The other conditional probabilities, $P(D \mid \langle b, b\rangle)$ and $P(D \mid \langle b', b'\rangle)$, have more complicated formulae. First, the case of independent bases and uniform errors is considered using a binomial distribution to describe the total number of observed $b$ bases in $n$ total bases covering the site. Then base dependencies are modeled and different error rates are taken into account. The final conditional probabilities rely on the $\epsilon_i$ and the MAQ-defined error dependency coefficient. After a consensus is called, a filter is implemented to determine whether the confidence in a non-reference consensus is high enough to call a SNP.
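As a rough illustration of how the fixed priors and equation 3.3 combine in Bayes’ theorem, the following Python sketch computes a three-genotype posterior. It is not MAQ’s implementation: the homozygous likelihoods use only the simple case of independent bases with a single uniform error rate `eps` (a hypothetical parameter), and the dependency modeling and per-read $\epsilon_i$ are deliberately left out.

```python
from math import comb


def maq_like_posterior(n, k, r=0.001, eps=0.01):
    """Posterior over the three genotypes built from alleles b and b'.

    n   : combined count of b and b' alleles at the site
    k   : number of called b alleles
    r   : prior of the heterozygote (0.2 at known SNP sites, 0.001 elsewhere)
    eps : assumed uniform base error rate (simplification, not MAQ's final model)
    """
    prior = {"b,b": (1 - r) / 2, "b,b'": r, "b',b'": (1 - r) / 2}
    likelihood = {
        # Binomial with uniform errors: every b' observed under <b,b> is an error.
        "b,b": comb(n, k) * (1 - eps) ** k * eps ** (n - k),
        # Heterozygote likelihood from equation 3.3.
        "b,b'": comb(n, k) / 2 ** n,
        "b',b'": comb(n, k) * eps ** k * (1 - eps) ** (n - k),
    }
    joint = {g: prior[g] * likelihood[g] for g in prior}
    total = sum(joint.values())
    return {g: joint[g] / total for g in joint}


def consensus(posterior):
    """The consensus genotype is the one with the largest posterior probability."""
    return max(posterior, key=posterior.get)
```

For example, `consensus(maq_like_posterior(n=20, k=10, r=0.2))` selects the heterozygote under these assumptions, since an even split of alleles is far more likely under equation 3.3 than under either homozygous model.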

3.1.2 gigaBayes

gigaBayes is the basis for the FreeBayes [19] SNP caller. In turn, the gigaBayes website [46] mentions that the gigaBayes procedure is an HTS implementation of the

POLYBAYES [45] method. Regarding POLYBAYES, Bayes’ theorem is again used, but with a different partition as the following equation illustrates,

\[
P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{\sum_{\{Y'\}} P(X \mid Y')\,P(Y')}. \tag{3.4}
\]

Here $X$ represents the observed alleles covering a specific site and $Y$ represents the set of the true unknown alleles of $X$. Recall that MAQ partitioned Bayes’ theorem over $g \in \{\langle b, b\rangle, \langle b', b\rangle, \langle b', b'\rangle\}$, where $b, b' \in \{A, C, G, T\}$. In comparison, gigaBayes partitions over

all the possible true Y covering a certain site. The conditional probabilities, P (X|Y ),

depend on the base-calling quality scores associated with X. Based on the Y with the

largest posterior probability, a consensus genotype is chosen. As with MAQ, a filter is

implemented to determine whether the confidence in a non-reference consensus is high

enough to call a SNP.

3.1.3 Atlas-SNP2

A large difference between Atlas-SNP2 [59] and many other SNP callers, is

that Atlas-SNP2 does not seek to determine a consensus sequence. Instead, first a list

of candidate SNP sites is established from the aligned read data. Then Bayes’ theorem

is used to calculate the posterior probabilities of a SNP existing at the candidate site

versus a SNP not existing at the candidate site. Thus the Atlas-SNP2 Bayesian partition

is essentially “SNP” and “No SNP” as the following posterior probability demonstrates.

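\[
P(\mathrm{SNP} \mid S, c) = \frac{P(S \mid \mathrm{SNP}, c)\,P(\mathrm{SNP} \mid c)}{P(S \mid \mathrm{SNP}, c)\,P(\mathrm{SNP} \mid c) + P(S \mid \text{No SNP}, c)\,P(\text{No SNP} \mid c)} \tag{3.5}
\]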

Here “SNP” means that among the alleles covering the candidate SNP site, there is a significant proportion of non-reference alleles. Atlas-SNP2 considers all of the mapped reads to assess the probability of a variant genotype using $S$ and $c$ as seen in the above equation. $S$ refers to the measured signal of the reference genotype and $c$ refers to the variant base coverage. Notably, $S$ is a function of the

$P(\text{correct base call})_i$, which is modeled with the following logistic regression model,

\[
\log\left(\frac{P(\text{correct base call})_i}{1 - P(\text{correct base call})_i}\right) = \alpha + b_1\,\mathrm{RawQuality} + b_2\,\mathrm{NQS} + b_3\,\mathrm{Dist} \tag{3.6}
\]

where $i = 1, 2, \ldots, n$ indexes the reads mapped to the site of interest. RawQuality is the base-calling quality. NQS is a Boolean variable indicating whether the neighboring quality standard (NQS) is met; by default this requires that the quality score of the substitution base call is greater than 20 and that the quality score of each of the five flanking alleles on either side is greater than 15. Finally, Dist is the distance of the base from one end of the read, normalized by the entire read length. This distance essentially gives the relative position of the base on the read.

This logistic model was trained on bacterial genome data. It initially had 9 predictors chosen a priori, but a stepwise procedure selected these 3 predictors as the most significant. Notably, the information contained within the NQS and Dist variables is not utilized by any of the other three methods described here.
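The three predictors of equation 3.6 can be sketched in Python as follows. This is only an illustration under stated assumptions: the coefficients `alpha`, `b1`, `b2` and `b3` must be supplied by the caller (the trained values are not given here), positions with fewer than five flanking bases on either side are treated as failing the NQS, and the distance is measured from the start of the read. None of these choices is confirmed by the text above.

```python
import math


def nqs_pass(quals, pos, center_min=20, flank_min=15, flank=5):
    """Neighboring quality standard for the base at index `pos` of a read.

    Assumption: a base without `flank` neighbors on both sides fails the NQS.
    """
    if pos - flank < 0 or pos + flank >= len(quals):
        return False
    if quals[pos] <= center_min:
        return False
    neighbors = quals[pos - flank:pos] + quals[pos + 1:pos + flank + 1]
    return all(q > flank_min for q in neighbors)


def relative_position(pos, read_length):
    """Distance of the base from the read start, normalized by the read length."""
    return (pos + 1) / read_length


def p_correct(raw_quality, nqs, dist, alpha, b1, b2, b3):
    """Inverse logit of the linear predictor in equation 3.6."""
    eta = alpha + b1 * raw_quality + b2 * float(nqs) + b3 * dist
    return 1.0 / (1.0 + math.exp(-eta))
```

With all coefficients set to zero the function returns 0.5 for every base, which is a quick sanity check that the inverse logit is wired correctly.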

                 c ≤ 2    c > 2
P(SNP|c)          0.1      0.9
P(No SNP|c)       0.9      0.1

Table 3.1: Prior probabilities used in the Atlas-SNP2 procedure.

The priors of the Bayes’ theorem in equation 3.5 were also trained from the bacterial data. The fixed priors that resulted in the greatest differences in the posterior

SNP probabilities are listed in table 3.1. As one would expect, a SNP is called when the posterior probability of “SNP” is higher than that of “No SNP”.
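Equation 3.5 and the priors of table 3.1 can be combined in a few lines of Python. The sketch below takes the two conditional probabilities of the signal $S$ as inputs, because their exact form is not reproduced here; the function names are hypothetical.

```python
def prior_snp(c):
    """P(SNP | c) from table 3.1, where c is the variant base coverage."""
    return 0.1 if c <= 2 else 0.9


def posterior_snp(lik_snp, lik_no_snp, c):
    """Equation 3.5, given P(S | SNP, c) and P(S | No SNP, c) as inputs."""
    p = prior_snp(c)
    numerator = lik_snp * p
    return numerator / (numerator + lik_no_snp * (1 - p))


def call_snp(lik_snp, lik_no_snp, c):
    """A SNP is called when P(SNP | S, c) exceeds P(No SNP | S, c)."""
    return posterior_snp(lik_snp, lik_no_snp, c) > 0.5
```

When the two likelihoods are equal the posterior reduces to the prior, so the coverage threshold at $c = 2$ alone decides the call in that case.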

3.1.4 SNVMix

The Bayesian partition used in SNVMix1 [58] and SNVMix2 [21] (collectively referred to as SNVMix) is similar to that of MAQ but differs in how the alleles are defined. First, let us denote the target genotype at a particular position by $G = k \in \{aa, ab, bb\}$. Here, $a$ refers to the reference base and $b$ refers to any non-reference base. Thus $aa$ is homozygous reference, $ab$ is heterozygous and $bb$ is homozygous non-reference. Note that the genotypes $ab$ and $bb$ constitute SNPs.

Thus it is appropriate to assume that $G \sim \mathrm{Mult}(1, \pi)$, where $\pi$ refers to the probabilities of the genotypes $\{aa, ab, bb\}$. We can also denote the random reference allele count, i.e. the number of times the reference allele is in the allele pileup, by $X$, and we can assume that $X \mid G = k \sim \mathrm{Bin}(N, \mu_k)$. Here $N = A + B$ is the sum of the observed counts of the reference and non-reference bases, respectively. $\mu_k$ is the probability that, given the target genotype is $k$, a randomly sampled base will match the reference base. Thus intuitively, we know that $\mu_{aa} \approx 1$, $\mu_{ab} \approx 0.5$ and $\mu_{bb} \approx 0$. Now using the law of total probability we see,

\[
P(X) = \sum_k P(X \mid G = k)\,P(G = k) = \sum_k \pi_k \binom{N}{A} \mu_k^{A} (1 - \mu_k)^{B}. \tag{3.7}
\]

Now, with equation 3.7, we can calculate,

\[
P(G = k' \mid X, \pi, \mu) = \frac{\pi_{k'} \binom{N}{A} \mu_{k'}^{A} (1 - \mu_{k'})^{B}}{\sum_k \pi_k \binom{N}{A} \mu_k^{A} (1 - \mu_k)^{B}}, \tag{3.8}
\]

which gives us the posterior probability of each genotype. From this posterior probability one can call the consensus genotype and then determine if there is a SNP or not. Note that the mapping and base-calling qualities are not used in the above formulation. This is where SNVMix2 improves on SNVMix1: it adds the usage of these quality scores.
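Equation 3.8 is straightforward to evaluate once $\pi$ and $\mu$ are available. The Python sketch below does so for the three genotypes; the function name and any numeric values passed to it are illustrative placeholders, not fitted SNVMix parameters.

```python
from math import comb


def snvmix1_posterior(A, B, pi, mu):
    """Posterior P(G = k | X, pi, mu) of equation 3.8 for k in (aa, ab, bb).

    A, B : observed counts of reference and non-reference bases
    pi   : genotype probabilities, ordered (aa, ab, bb)
    mu   : per-genotype probabilities that a sampled base matches the reference
    """
    N = A + B
    joint = [p * comb(N, A) * m ** A * (1 - m) ** B for p, m in zip(pi, mu)]
    total = sum(joint)
    return [j / total for j in joint]


# Illustrative parameter values only, following mu_aa ~ 1, mu_ab ~ 0.5, mu_bb ~ 0.
example_pi = (1 / 3, 1 / 3, 1 / 3)
example_mu = (0.99, 0.5, 0.01)
```

The binomial coefficient is shared by the numerator and the denominator and therefore cancels; it is kept in the code only to mirror the equation.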

It is also of interest to note how the above parameters were estimated. Given $T$, the number of nucleotide positions being considered, the complete data log-likelihood is as follows,

\[
\log P(X_1, \ldots, X_T \mid \pi, \mu) = \sum_{i=1}^{T} \log \sum_{k=1}^{K} \pi_k \binom{N_i}{A_i} \mu_k^{A_i} (1 - \mu_k)^{B_i}. \tag{3.9}
\]

The parameter $\theta = (\pi, \mu)$ is fit to the data using a maximum a posteriori (MAP⁴) EM algorithm and a hierarchical Bayesian structure.

3.1.5 Method Comparison

Though there are many differences between the profiled single sample SNP callers, they are comparable in certain ways. Table 3.2 gives a summary of the following

4 comparisons.

⁴ Related to Fisher’s MLE, a MAP estimate is a mode of the posterior distribution. It can be used to obtain a point estimate of an unobserved quantity based on the observed data.

1. All the surveyed methods use some form of Bayes’ theorem and thus prior and conditional probabilities must be supplied. The big difference between the 4 applications of Bayes’ theorem is the choice of partition. MAQ partitions Bayes’ theorem over $g \in \{\langle b, b\rangle, \langle b', b\rangle, \langle b', b'\rangle\}$ where $b$ and $b'$ are the two most frequently observed bases at a particular site. gigaBayes partitions over all the possible true bases, modeled collectively as $Y$, covering a certain site. Atlas-SNP2 partitions over the possibilities “SNP” and “No SNP”. SNVMix partitions over $G \in \{aa, ab, bb\}$, where $a$ refers to the reference base and $b$ refers to any non-reference base.

2. MAQ assumes that the reads are correlated. gigaBayes appears to model all the

bases covering a site together. There is no indication that Atlas-SNP2 or SNVMix

assume that the reads are correlated.

3. For each allele covering a potential SNP site, Atlas-SNP2 takes into account the

quality of the flanking alleles and the position of the allele on the read. This may

be useful because there is an observable trend that base-calling qualities decrease

down a read. MAQ, gigaBayes and SNVMix do not indicate that they use this

information.

4. All four profiled SNP callers utilize base-calling quality values, but MAQ and

SNVMix2 also use alignment quality values.

3.2 Multiple Sample SNP Callers

As explained in the introduction of chapter 3, multiple sample HTS data usually refers to sequencing data of genomic reads that were taken from more than 1 organism of the same species. Within each sample or individual, the data structure is the same as that in the single-sample case. As shown in figure 3.1, we can denote by $G_i$ the true genotype of sample $i$ at a certain position and by $D_i$ the data about the alleles covering this position after alignment. The following discussion assumes diploid genomes.

                     MAQ               gigaBayes     Atlas-SNP2        SNVMix2
Bayes’ Partition     Combinations of   over all Y    “SNP”, “No SNP”   hom. ref., het. ref.,
                     top 2 alleles                                     non-ref.
Read Correlation     Yes               Maybe         No                No
Flanking & Distance  No                No            Yes               No
Qualities Used       Base-calling,     Base-calling  Base-calling      Base-calling,
                     Alignment                                         Alignment

Table 3.2: This is a comparison of some of the significant details between the surveyed SNP calling methods.

Figure 3.1: Notation used in multiple sample SNP calling. (Diagram: for each of Sample 1, Sample 2, ..., Sample s, the genotype $G_i$ is shown alongside the data $D_i$.)

3.2.1 SAMtools

The SAMtools [37] SNP calling algorithm uses the EM algorithm to estimate

parameters over multiple samples at a particular site. For an intuitive discussion on

the EM algorithm see section A of the appendix. Letting $h_i \in \{0, 1, 2\}$ denote the number of non-reference alleles in genotype $G_i$, we can then define the statistic $H$, which SAMtools uses to determine whether or not a particular site is a SNP.

\[
H = \sum_{i=1}^{s} h_i \sim \begin{pmatrix} 0 & 1 & 2 & \cdots & 2s \\ \theta_0 & \theta_1 & \theta_2 & \cdots & \theta_{2s} \end{pmatrix}, \tag{3.10}
\]

where $\theta$ is the probability parameter of the categorical distribution of $H$ such that $\sum_{k=1}^{2s} \theta_k = 1$. Here, $\theta_k$ essentially gives the probability that $H = k$. Thus if $H$ is small then a SNP should not be called but if $H$ is large then the associated site is definitely a potential SNP site.

The complete likelihood function including all the sample data and the missing value H, L(θ0, θ1, . . . , θ2s|D1,D2,...,Ds,H), can be abbreviated with the expression

L(θ|D,H) or the probability P (D,H|θ). We then express this likelihood for ease of use with the EM algorithm.

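\[
\begin{aligned}
L(\theta \mid D, H) &= P(D, H \mid \theta) \\
&= P(D \mid H, \theta)\,P(H \mid \theta) \\
&= \prod_{k=1}^{2s} \left\{ P(D \mid H = k)\,P(H = k \mid \theta) \right\}^{I(H=k)} \\
&= \prod_{k=1}^{2s} \left\{ P(D \mid H = k)\,\theta_k \right\}^{I(H=k)}
\end{aligned} \tag{3.11}
\]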

To explain the second step, we first note a consequence of conditional probabilities, that $P(A, B) = P(A \mid B)P(B)$. From this we can show that $P(A, B, C) = P(A \mid B, C)P(B \mid C)P(C)$ and then we see that $P(A, B \mid C) = \frac{P(A, B, C)}{P(C)} = P(A \mid B, C)P(B \mid C)$. For the next step, $P(H \mid \theta)$ is a categorical density so that $P(H \mid \theta) = \prod_{k=1}^{2s} P(H = k \mid \theta)^{I(H=k)}$, and since $H$ is given, the conditional probability $P(D \mid H, \theta)$ does not depend on $\theta$.

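The complete log-likelihood is then obtained by taking the log of $L(\theta \mid D, H)$.

\[
\begin{aligned}
l(\theta \mid D, H) &= \log \prod_{k=1}^{2s} \left\{ P(D \mid H = k)\,\theta_k \right\}^{I(H=k)} \\
&= \sum_{k=1}^{2s} I(H = k) \left\{ \log \theta_k + \log P(D \mid H = k) \right\}
\end{aligned} \tag{3.12}
\]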

To accomplish the E-step of the EM algorithm, we first must determine the Q function at the tth iteration, namely Q(θ|θ(t)).

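\[
\begin{aligned}
Q(\theta \mid \theta^{(t)}) &= E_{H \mid D, \theta^{(t)}}\left[ l(\theta \mid D, H) \right] \\
&= E_{H \mid D, \theta^{(t)}}\left[ \sum_{k=1}^{2s} I(H = k) \left\{ \log \theta_k + \log P(D \mid H = k) \right\} \right] \\
&= \sum_{k=1}^{2s} E_{H \mid D, \theta^{(t)}}\left[ I(H = k) \left\{ \log \theta_k + \log P(D \mid H = k) \right\} \right] \\
&= \sum_{k=1}^{2s} E\left[ I(H = k) \mid D, \theta^{(t)} \right] \left\{ \log \theta_k + \log P(D \mid H = k) \right\} \\
&= \sum_{k=1}^{2s} e_k \left\{ \log \theta_k + \log P(D \mid H = k) \right\}
\end{aligned} \tag{3.13}
\]

Here $e_k$ abbreviates $E\left[ I(H = k) \mid D, \theta^{(t)} \right]$, as the last two lines show.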

Likewise the M-step is handled by taking the argmax of θ over the Q function.

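\[
\begin{aligned}
\theta^{(t+1)} &= \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}) \\
&= \arg\max_{\theta} \sum_{k=1}^{2s} e_k \left\{ \log \theta_k + \log P(D \mid H = k) \right\}
\end{aligned} \tag{3.14}
\]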

Note that $e_k$ does not depend on $\theta$ (though it does depend on the constant $\theta^{(t)}$), so $e_k$ is treated as a constant when the Q function is maximized. After choosing

initial values, the EM algorithm specifies that we iterate through the E-Step and M-Step

until we reach convergence.
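The E-step and M-step above can be sketched in Python. Two points in this sketch go beyond the derivation and should be read as assumptions. First, $e_k$ is computed as the posterior probability $P(H = k \mid D, \theta^{(t)}) \propto P(D \mid H = k)\,\theta_k^{(t)}$, and the M-step uses the standard closed form $\theta_k \propto e_k$ obtained by maximizing $\sum_k e_k \log \theta_k$ subject to $\sum_k \theta_k = 1$. Second, the expectations are averaged over several sites so that the iteration is not degenerate; with a single site the update is simply $\theta_k^{(t+1)} = e_k$. The index $k$ runs over $0, \ldots, 2s$ in the code to cover the full support shown in equation 3.10, and the likelihood vectors are inputs supplied by the caller.

```python
def e_step(lik, theta):
    """e_k = P(H = k | D, theta) for one site, given lik[k] = P(D | H = k)."""
    joint = [l * t for l, t in zip(lik, theta)]
    total = sum(joint)
    return [j / total for j in joint]


def em(site_liks, n_iter=100, tol=1e-10):
    """Estimate theta from per-site likelihood vectors (illustrative multi-site EM)."""
    K = len(site_liks[0])
    theta = [1.0 / K] * K  # uniform initial values
    for _ in range(n_iter):
        expectations = [e_step(lik, theta) for lik in site_liks]
        # M-step: theta_k is proportional to the summed expectations e_k.
        new_theta = [sum(col) / len(site_liks) for col in zip(*expectations)]
        converged = max(abs(a - b) for a, b in zip(new_theta, theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta
```

The term $\log P(D \mid H = k)$ in equation 3.14 does not involve $\theta$, which is why it drops out of the M-step in this sketch.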

3.2.2 GATK

GATK [20] follows a model very similar to that of SAMtools. In particular, the same H statistic, as described in equation 3.10, is used by GATK to determine whether the site in question is a potential variant site. GATK partly does this by estimating the posterior probability of H given D, as follows.

\[
P(H = k \mid D) = \frac{P(D \mid H = k)\,P(H = k)}{\sum_y P(D \mid H = y)\,P(H = y)} \tag{3.15}
\]

The prior probabilities, $P(H = k)$, are given as a function of the population-specific heterozygosity and $k$. The conditional probabilities, $P(D \mid H = k)$, are computed

as follows, given the sum of mutually exclusive event probabilities.

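\[
\begin{aligned}
P(D \mid H = k) &= P\left( D \,\middle|\, \sum_{i=1}^{s} h_i = k \right) \\
&= \sum_{\{G_1, G_2, \ldots, G_s\} \in \Gamma}\ \prod_{i=1}^{s} P(D_i \mid G_i)
\end{aligned} \tag{3.16}
\]

where $\Gamma = \left\{ G_1, G_2, \ldots, G_s : \sum_{i=1}^{s} h_i = k \right\}$.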
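Equations 3.15 and 3.16 can be illustrated with a brute-force Python sketch. It assumes each sample’s genotype is summarized by its non-reference allele count $h_i \in \{0, 1, 2\}$, takes the per-sample likelihoods $P(D_i \mid G_i)$ and the prior $P(H = k)$ as inputs, and enumerates all $3^s$ configurations, so it is only practical for a handful of samples. It is not how GATK itself evaluates the sum.

```python
from itertools import product


def lik_given_h(sample_liks, k):
    """P(D | H = k) as in equation 3.16.

    sample_liks[i][h] is P(D_i | G_i) for a genotype with h non-reference
    alleles (h = 0, 1, 2); this indexing is an assumption of the sketch.
    """
    total = 0.0
    for config in product(range(3), repeat=len(sample_liks)):
        if sum(config) != k:
            continue
        term = 1.0
        for i, h in enumerate(config):
            term *= sample_liks[i][h]
        total += term
    return total


def posterior_h(sample_liks, prior):
    """P(H = k | D) for k = 0, ..., 2s as in equation 3.15."""
    s = len(sample_liks)
    joint = [lik_given_h(sample_liks, k) * prior[k] for k in range(2 * s + 1)]
    z = sum(joint)
    return [j / z for j in joint]
```

A caller would pass a prior vector of length $2s + 1$; a uniform vector can be used for experimentation, although the prior described above depends on the heterozygosity and $k$.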

3.2.3 “Cross-Sample”

The final multiple sample SNP caller that we consider is described in the article

entitled, “A cross-sample statistical model for SNP detection in short-read sequencing

data” [51]. Since this SNP caller does not appear to have been named, it will be referred

to as the “Cross-Sample” SNP caller. There are a couple of reasons why “Cross-Sample”

stands apart from other multiple sample SNP callers, including the fact that base-calling

and alignment scores are not considered. In particular, “Cross-Sample” really stands

out as one of the few SNP callers that attempts to address multiple testing issues. The

obvious multiple testing problem manifested in SNP calling is the risk that a SNP caller will call many more SNPs than is reasonable as indicated from the data. In statistical jargon, this can be restated as the inflation of the number of sites which are determined to be significant when a large number of statistical inferences are considered simultaneously.

The natural place to start when tackling a multiple testing problem is the definition of the statistical hypotheses. For “Cross-Sample”, $H_0^l$: Location $l$ is homozygous-reference (no SNP present) for all ($i \in \{1, 2, \ldots, s\}$) samples. Many steps are taken when testing this hypothesis, such as calculating the estimated local false discovery rate $\mathrm{lFDR}_l = P(\delta_i^l = 0\ \forall\, i \mid X)$, where $\delta_i^l = I$(sample $i$ not assigned to homozygous-reference for location $l$) and $X$ is a vector of generated $\{A, C, G, T\}$ counts given estimated model parameters. See section B in the appendix for a brief overview of the local false discovery rate (lFDR). In particular, that appendix section will explain why we begin the expression of lFDR with the word “uninteresting”, that is, with respect to its SNP status. $\mathrm{lFDR}_l$ can then be expressed in the following way.

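\[
\begin{aligned}
\mathrm{lFDR}_l &= P(\text{Location } l \text{ is “uninteresting”} \mid X) \\
&= P(\delta_i^l = 0\ \forall\, i \mid X) \\
&= P(\delta_1^l = 0, \delta_2^l = 0, \ldots, \delta_s^l = 0 \mid X) \\
&= \prod_i P(\delta_i^l = 0 \mid X_i^l) \\
&= \prod_i \left[ 1 - P(\delta_i^l = 1 \mid X_i^l) \right]
\end{aligned} \tag{3.17}
\]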

In the above, the estimator of $P(\delta_i^l = 1 \mid X_i^l)$ is denoted by $\hat{\delta}_i^l$. $\hat{\delta}_i^l$ is equivalent to the estimated $P(\delta_i^l = 1 \mid X)$ probability, which is also equivalent to $1 - P$(the genotype of location $l$ on sample $i$ is homozygous reference). One issue that is encountered in the above product is that each $[1 - \hat{\delta}_i^l]$ is weighted equally. This becomes problematic as some samples have very low coverage, whereas others have very high coverage. Despite ignoring alignment and base-calling quality scores, “Cross-Sample” still acknowledges the influence of different coverage values. As a remedy, the following “arbitrary” coverage weights $c_i^l$ are introduced.

\[
c_i^l = \begin{cases} 0, & \text{if } n_i^l < 3 \\ n_i^l, & \text{if } 3 \le n_i^l \le 23 \\ 23, & \text{if } n_i^l > 23 \end{cases} \tag{3.18}
\]

We can thus add the $c_i^l$ weights to the $[1 - \hat{\delta}_i^l]$ terms using a weighted geometric mean, as is utilized in the following.

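\[
\begin{aligned}
\widehat{\mathrm{lFDR}}_l &= \left\{ \prod_i [1 - \hat{\delta}_i^l]^{c_i^l} \right\}^{1/\sum_i c_i^l} \\
&= \exp\left[ \log \left\{ \prod_i [1 - \hat{\delta}_i^l]^{c_i^l} \right\}^{1/\sum_i c_i^l} \right] \\
&= \exp\left[ \frac{1}{\sum_i c_i^l} \log \left\{ \prod_i [1 - \hat{\delta}_i^l]^{c_i^l} \right\} \right] \\
&= \exp\left[ \frac{1}{\sum_i c_i^l} \sum_i \log \left\{ [1 - \hat{\delta}_i^l]^{c_i^l} \right\} \right] \\
&= \exp\left[ \frac{\sum_i c_i^l \log(1 - \hat{\delta}_i^l)}{\sum_i c_i^l} \right]
\end{aligned} \tag{3.19}
\]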

“Cross-Sample” calculates $\widehat{\mathrm{lFDR}}$ for all sites and then marks those sites with $\widehat{\mathrm{lFDR}} \le 0.1$ as putative SNPs. All the putative sites are then genotyped with a separate

model and if the genotype is found to be non-reference then a SNP is called. The

“Cross-Sample” paper [51] also includes a simulation which shows that their method can

conservatively estimate the true lFDR when the data is generated from the model they

are fitting.
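The coverage weights of equation 3.18 and the weighted geometric mean of equation 3.19 translate directly into Python. In this sketch the per-sample estimates $\hat{\delta}_i^l$ and coverages $n_i^l$ are inputs, the function names are hypothetical, and two edge cases are handled by assumption: a site whose weights are all zero returns 1, and a weighted sample with $\hat{\delta}_i^l = 1$ returns 0.

```python
import math


def coverage_weight(n):
    """Coverage weight c of equation 3.18 for a sample with coverage n."""
    if n < 3:
        return 0
    return min(n, 23)


def weighted_lfdr(delta_hat, coverage):
    """Estimated lFDR of equation 3.19 for one site.

    delta_hat[i] : estimated probability that sample i is not homozygous reference
    coverage[i]  : coverage of sample i at the site
    """
    weights = [coverage_weight(n) for n in coverage]
    total = sum(weights)
    if total == 0:
        return 1.0  # assumption: no usable coverage, treat the site as uninteresting
    log_sum = 0.0
    for c, d in zip(weights, delta_hat):
        if c == 0:
            continue
        if d >= 1.0:
            return 0.0  # assumption: avoids log(0) when a sample is certainly variant
        log_sum += c * math.log(1.0 - d)
    return math.exp(log_sum / total)


def is_putative_snp(delta_hat, coverage, threshold=0.1):
    """Sites with an estimated lFDR at or below 0.1 are marked as putative SNPs."""
    return weighted_lfdr(delta_hat, coverage) <= threshold
```

Working on the log scale, as in the last line of equation 3.19, avoids underflow when many samples contribute small factors.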

Chapter 4

Single Sample Genotype Model

Selection (GeMS)

In section 3.1 and in particular table 3.2, we considered the quality scores used by a select group of single sample SNP callers. Though not all available single sample

SNP callers were featured, it can be said that many single sample HTS SNP callers rely mainly on base-calling and read mapping or alignment quality scores. As noted in section 2.2, base-calling errors can occur when the raw HTS light intensity data is converted into short nucleotide sequences and mapping errors can occur when those short sequences are aligned to a reference genome. However, there is an apparent need to account for errors that can occur during the genomic sample preparation process. In particular, such errors can occur during the acquisition of a genomic sample, the random fragmentation of the sample into short read fragments and the amplification of those fragments.

In contrast to the aforementioned SNP callers, Genotype Model Selection (GeMS) [67], a novel SNP detection procedure which accounts for genomic preparation errors, is described in this section. At sites whose read pileup data indicates a possible SNP, GeMS maximizes all possible genotype likelihoods over a parameter which is associated with such genomic preparation errors. The genotype associated with the largest likelihood is called by GeMS as the consensus genotype. Further, when the consensus genotype differs from the reference genotype, the Dixon outlier test is used to determine whether a SNP should be called at the site.

Setting               Default   Description
Reference             0.9       Only sites with a reference allele ratio less than this
                                proportion value will be considered for SNP calling analysis
Deletion placeholder  0.05      Sites with a deletion placeholder proportion greater than
                                this value will not be admitted for SNP calling analysis
Base-calling quality  17        Alleles with base-calling quality Phred scores less than
                                this value will be removed from the allele pileup
Mapping quality       20        Alleles with mapping quality Phred scores less than this
                                value will be removed from the allele pileup
Maximum coverage      255       The maximum number of alleles that will be analyzed at
                                each site
Ploidy                Diploid   Either haploid or diploid; the ploidy setting determines
                                which genotype likelihoods need to be computed at every site
Priors                Uniform   Users can supply non-uniform genotype prior probability
                                values if this information is available

Table 4.1: The settings for single sample GeMS SNP calling.

4.1 Data Preparation

When initiating the GeMS SNP calling procedure, the user first determines certain settings and then the data is prepared before the actual procedure takes place.

See table 4.1 for a summary of the available settings.

As mentioned in section 2.3, the standard file format is the

SAM format and its binary counterpart, the BAM format. These file formats are able to

hold much information about the alignment of the short reads to the reference sequence.

In particular, during alignment certain reads are flagged with undesirable characteristics. These characteristics and steps to remove these reads are given in appendix C.

The resulting BAM alignment file can then be converted into the PILEUP format for processing by GeMS.

The first step that GeMS takes to prepare the data is to “clean” the allele pileups at each site. In particular, read start markers (with symbol “^”) and their associated read mapping quality scores, read end markers (with symbol “$”), complete insertions (e.g. “+3AGT” indicates a 3bp insertion on the read of the previously listed allele) and complete deletions (e.g. “-4AGCT” indicates a 4bp deletion on the read of the previously listed allele) are given in the allele pileups but are not used by the GeMS procedure. After GeMS removes these objects, the allele pileups should be solely composed of reference (forward strand symbol “.” and reverse strand symbol “,”) and non-reference (forward strand symbols “ACGTN” and reverse strand symbols “acgtn”) alleles.
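The cleaning step described above can be sketched as a single pass over the base column of a PILEUP line. This is an illustrative Python version with a hypothetical function name, not the GeMS source code; it assumes a well-formed pileup string in which every “+” or “-” is followed by the indel length in digits.

```python
def clean_pileup(bases):
    """Remove read start/end markers and complete indels from a pileup string.

    Deletion placeholders ("*") are kept here because they are handled by a
    later filtering step.
    """
    out = []
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            i += 2  # skip the marker and the mapping-quality character after it
        elif ch == "$":
            i += 1  # skip the read end marker
        elif ch in "+-":
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            if j == i + 1:
                raise ValueError("indel marker without a length")
            length = int(bases[i + 1:j])
            i = j + length  # skip the length digits and the indel sequence
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Reading the indel length as a multi-digit number matters: for an insertion written as “+12…”, the twelve inserted bases must be skipped, not just the first one or two.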

To save on computing resources, GeMS only considers sites that could possibly be SNPs. In particular, sites with allele pileups whose reference proportion is greater than the user specified amount (e.g. default 90%) are generally not considered to be potential

SNP sites. Likewise, GeMS next filters out sites with a high proportion (e.g. greater than 5%) of deletion placeholders (with symbol “*”), as these deletions, or misalignments that appear to be deletions, could result in false positive SNP calls.

Now that the list of potential SNP sites has been established, the allele pileups associated with these sites undergo additional filtering. First, all alleles with a base-calling or mapping quality Phred score lower than the user specified amounts (the GeMS defaults of 17 and 20, respectively, are shared by many other SNP callers) are discarded from the allele pileups. This is done to prevent low quality data from misguiding the model. Second, all “N” or “n” base-calls and the remaining deletion placeholders are removed as these are not factored into the GeMS likelihood model. Third, any allele pileups longer than the user specified amount (the GeMS default is 255, which is also the value that SAMtools hardcodes into its model [37]) are randomly trimmed down to that amount to conserve computing resources. Finally, the SNP candidate sites with their refined allele pileups and associated base-calling and mapping quality scores are analyzed by the GeMS SNP calling procedure.
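The site selection and pileup refinement steps can be summarized in a short Python sketch using the default settings of table 4.1. The function names and the tuple layout `(base, base_quality, mapping_quality)` are hypothetical, the proportions are computed over the cleaned pileup including deletion placeholders, and a site whose reference proportion equals the threshold exactly is kept; these details are assumptions rather than documented GeMS behavior.

```python
import random


def is_candidate_site(bases, max_ref=0.9, max_del=0.05):
    """Decide whether a cleaned pileup string is a potential SNP site."""
    n = len(bases)
    if n == 0:
        return False
    ref_prop = sum(ch in ".," for ch in bases) / n
    del_prop = bases.count("*") / n
    return ref_prop <= max_ref and del_prop <= max_del


def refine_pileup(pileup, min_bq=17, min_mq=20, max_cov=255, rng=None):
    """Filter a list of (base, base_quality, mapping_quality) tuples.

    Drops low-quality alleles, "N"/"n" calls and deletion placeholders, then
    randomly trims the pileup to at most max_cov alleles.
    """
    kept = [
        (b, bq, mq)
        for b, bq, mq in pileup
        if bq >= min_bq and mq >= min_mq and b not in ("N", "n", "*")
    ]
    if len(kept) > max_cov:
        rng = rng or random.Random()
        kept = rng.sample(kept, max_cov)
    return kept
```

Passing a seeded `random.Random` instance as `rng` makes the trimming reproducible, which is convenient when comparing runs.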

4.2 Procedure

As mentioned above, the GeMS procedure utilizes genotype likelihood maximization and detects SNPs using Dixon’s outlier test. To begin a description of the details of the GeMS procedure, let us use the notation from table 4.2 and assume that we are working with HTS data from a diploid organism. Details on the GeMS procedure for a haploid organism are given in section 4.4.1.

In the following discussion, the location superscript $l$ will be suppressed as the same procedure is applied to all possible SNP locations. GeMS calls the consensus genotype as $\arg\max_g P(D \mid G = g)\,\pi_g$, where $D$ gives the observed allele pileup data and $\pi_g$ is the genotype prior probability for genotype $g$. Given that the true genotype is $G$, we would expect that each observed allele $X_j$ is one of the haplotypes of $G$, but this is not necessarily true because of the errors that can occur during the genomic sample preparation, base-calling and alignment.

Notation               Explanation
l                      location on the reference genome
G^l                    unobserved genotype at location l
n^l                    number of reads aligned to location l
j ∈ {1, 2, ..., n^l}   index of reads aligned to location l
D^l                    observed allele data at location l,
                       = {X_j^l, B_j^l, M_j^l : j ∈ {1, 2, ..., n^l}}
g ∈ {AA, ..., GT}      index for the 10 possible diploid genotypes
p_g^l                  probability that g is the true genotype at location l
X_j^l                  observed allele on read j aligned to location l
Y_j^l                  unobserved original allele associated with X_j^l
B_j^l                  base-calling quality score of X_j^l
M_j^l                  mapping quality score of X_j^l
w_j^l                  “weight” of X_j^l, (= 1 − 10^(−0.1 min{B_j^l, M_j^l}))
k ∈ {A, C, G, T}       index for the 4 DNA nucleotides
q_k^(g,l)              = P(Y_j^l = k | g), where q_k^(g,l) is a function of a
                       single parameter q
π_g^l                  genotype prior probability, uniform by default

Table 4.2: GeMS model notation. The l superscript is suppressed in the following for convenience.

Given the assumption that the unobserved genotype is $G = g$, let $Y_j$ be defined as the original allele, that is, the particular nucleotide present before base-calling, associated with the observed allele $X_j$. As with $X_j$, we would expect that each $Y_j$ is one of the haplotypes of $G$, but again this is not necessarily true. The reason for this is that the $Y_j$ are subject to errors that can occur during the acquisition, fragmentation and amplification steps of the genomic sample preparation. Since $Y_j$ cannot be observed, it is a latent random variable, but it is assumed to follow a discrete four point distribution as given by table 4.3. In particular, $Y_j \sim \mathrm{Categorical}(q^g)$, where $q$ is the small probability that $Y_j$ equals an allele different from the haplotypes of the assumed genotype. The $q_k^g$ variable, which is equal to $P(Y_j = k \mid g)$, is a function of a single parameter $q$ that is estimated by maximizing the genotype likelihood over $q$. This $q$ parameter is what helps GeMS to stand out from other SNP callers, as $Y_j$ and $q$ essentially model the impact of genomic sample preparation errors. As an example of interpreting $q$ considering table 4.3, if the assumed genotype is AA, then there is a small probability $q$ that $Y_j$ will equal each of the nucleotides C, G or T. Similarly, if the assumed genotype is CT, then there is a small probability $q$ that $Y_j$ will equal each of the nucleotides A or G.

Model  g    q_A^g      q_C^g      q_G^g      q_T^g
1      AA   1−3q       q          q          q
2      CC   q          1−3q       q          q
3      GG   q          q          1−3q       q
4      TT   q          q          q          1−3q
5      AC   (1−2q)/2   (1−2q)/2   q          q
6      AG   (1−2q)/2   q          (1−2q)/2   q
7      AT   (1−2q)/2   q          q          (1−2q)/2
8      CG   q          (1−2q)/2   (1−2q)/2   q
9      CT   q          (1−2q)/2   q          (1−2q)/2
10     GT   q          q          (1−2q)/2   (1−2q)/2

Table 4.3: The ten possible q^g = {q_A^g, q_C^g, q_G^g, q_T^g} for Y_j^l ∼ Categorical(q^g) given a diploid organism.
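As an illustration only, the rows of table 4.3 can be generated programmatically. The following is a minimal Python sketch, not the GeMS implementation; the names `NUCLEOTIDES`, `GENOTYPES` and `genotype_probs` are hypothetical and introduced here for convenience.

```python
NUCLEOTIDES = "ACGT"
GENOTYPES = ["AA", "CC", "GG", "TT", "AC", "AG", "AT", "CG", "CT", "GT"]


def genotype_probs(genotype, q):
    """Return {k: P(Y = k | g)} for one genotype, following table 4.3."""
    haplotypes = set(genotype)
    # Each nucleotide outside the genotype receives probability q; the
    # remaining mass is split evenly among the genotype's own alleles.
    on_target = (1.0 - (4 - len(haplotypes)) * q) / len(haplotypes)
    return {k: (on_target if k in haplotypes else q) for k in NUCLEOTIDES}


if __name__ == "__main__":
    for g in GENOTYPES:
        print(g, genotype_probs(g, 0.01))
```

For a homozygous genotype the single haplotype allele receives 1 − 3q, and for a heterozygous genotype each of the two haplotype alleles receives (1 − 2q)/2, so every row sums to one.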

Beyond accounting for genomic sample preparation errors, the GeMS procedure also accounts for possible base-calling and alignment errors using the Phred-scaled Bj and Mj scores for each observed allele Xj. Bj, the base-calling quality score, is associated with the probability of a base-call being incorrect. Mj, the alignment quality score, is associated with the probability of a short read being misaligned. Explicitly, the probabilities of these errors are given by the following.

\[
P(\text{Incorrect Base-Call}) = 10^{-0.1 B_j}
\quad \text{and} \quad
P(\text{Incorrect Alignment}) = 10^{-0.1 M_j}
\tag{4.1}
\]

From Bj and Mj, we can then define the accuracy or weight of an aligned allele, wj, in a couple of different ways. The GeMS model utilizes the following wj.

\[
\begin{aligned}
w_j &= \min\{P(\text{Correct Base-Call}),\, P(\text{Correct Alignment})\} \\
    &= 1 - 10^{-0.1 \min\{B_j, M_j\}}
\end{aligned}
\tag{4.2}
\]
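Equations (4.1) and (4.2) translate directly into code. The short Python sketch below is illustrative; the function names `phred_error` and `allele_weight` are hypothetical and not taken from the GeMS software.

```python
def phred_error(score):
    """Error probability implied by a Phred-scaled quality score, equation (4.1)."""
    return 10.0 ** (-0.1 * score)


def allele_weight(base_quality, mapping_quality):
    """Weight w_j of an aligned allele, equation (4.2)."""
    # The lower of the two scores gives the larger error probability,
    # so the weight is the smaller of the two "correct" probabilities.
    return 1.0 - phred_error(min(base_quality, mapping_quality))


if __name__ == "__main__":
    print(allele_weight(base_quality=20, mapping_quality=30))
```

For example, an allele with B = 20 and M = 30 is limited by the base-calling score, giving a weight of about 0.99.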

If Xj is the result of a correct base-call and alignment, then we can assume that Xj = Yj. Thus GeMS uses the following conditional distribution for Xj given Yj.

\[
\begin{aligned}
P(X = Y \mid Y) &= w, \\
P(X \neq Y \mid Y) &= 1 - w \quad \text{and} \\
P(X \neq Y, X = k \mid Y) &= \frac{1 - w}{3} \quad \text{for } k \in \{A, C, G, T\}.
\end{aligned}
\tag{4.3}
\]

The last line in equation (4.3) indicates that the probability that X equals a specific k, where Y ≠ k, is one third of the probability that X ≠ Y. The reason for this is that if Y is equal to a certain nucleotide, then there are 3 other nucleotides which X can equal, given that X ≠ Y.

Figure 4.1 portrays the relationship between the q and w parameters. These parameters are assumed to be independent under the GeMS model. We first see that q is an important parameter which models the transition from the true unobserved genotype

G to the original alleles {Yj}. Likewise, the {wj} which combine the base-calling and mapping quality information, tell us much about the transition from the original alleles

{Yj} to the observed alleles {Xj}. For added interpretation, if there is a lot of variability in the observed alleles {Xj} and yet the {wj} values are high, then we can imagine that the estimated q parameter is relatively high, as q indicates the probability that the {Yj} equal alleles different from the haplotypes of the assumed genotype.

Assuming that the reads, and thus the observed alleles, {Xj}, are independent, the following constitutes the GeMS likelihood.

Unobserved Genotype G  --(q)-->  Original Alleles {Yj}  --({wj})-->  Observed Alleles {Xj}

Figure 4.1: The relationship between the q and w parameters.

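\[
\begin{aligned}
L(q^g) &= P(D \mid G = g, q) \\
       &= \prod_{j=1}^{n} P(X_j \mid G = g, q) \\
       &= \prod_{j=1}^{n} \sum_{k \in \{A, C, G, T\}} \left[ P(X_j \mid Y_j = k) \right] P(Y_j = k \mid G = g, q) \\
       &= \prod_{j=1}^{n} \sum_{k \in \{A, C, G, T\}} \left[ w_j^{I(X_j = k)} \left( \frac{1 - w_j}{3} \right)^{I(X_j \neq k)} \right] q_k^g
\end{aligned}
\tag{4.4}
\]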

Since the likelihood in equation (4.4) depends on the assumed genotype g, GeMS calculates the estimated likelihood, $L(\hat{q}^g)$, for each of the 10 genotypes. $L(\hat{q}^g)$ is estimated by maximizing the likelihood over q ∈ (0, 0.25]. The range of q is so chosen because the q = 0.25 situation implies that all alleles are equally likely. Further, q > 0.25 indicates the problematic assumption that the alleles from the assumed genotype g are less likely than the alleles not from g.
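To make equation (4.4) and the maximization over q concrete, the following Python sketch evaluates the log-likelihood for one assumed genotype and maximizes it on a grid. It is a sketch under stated assumptions: the grid search is an illustrative choice (the optimizer used by GeMS is not described here), and the names `genotype_probs`, `log_likelihood` and `maximize_over_q` are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"


def genotype_probs(genotype, q):
    """Return {k: P(Y = k | g)} as in table 4.3."""
    haplotypes = set(genotype)
    on_target = (1.0 - (4 - len(haplotypes)) * q) / len(haplotypes)
    return {k: (on_target if k in haplotypes else q) for k in NUCLEOTIDES}


def log_likelihood(genotype, q, alleles, weights):
    """Logarithm of L(q^g) in equation (4.4) for observed alleles and weights."""
    probs = genotype_probs(genotype, q)
    total = 0.0
    for x, w in zip(alleles, weights):
        # Sum over the latent original allele Y_j = k: P(X_j | Y_j = k) * q_k^g.
        site = sum((w if x == k else (1.0 - w) / 3.0) * probs[k] for k in NUCLEOTIDES)
        total += math.log(site)
    return total


def maximize_over_q(genotype, alleles, weights, steps=2500):
    """Grid search over q in (0, 0.25]; returns (q_hat, maximized log-likelihood)."""
    grid = (0.25 * i / steps for i in range(1, steps + 1))
    candidates = ((q, log_likelihood(genotype, q, alleles, weights)) for q in grid)
    return max(candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    alleles, weights = ["A"] * 10, [0.99] * 10
    for g in ("AA", "AC", "CC"):
        print(g, maximize_over_q(g, alleles, weights))
```

With ten high-quality A alleles, the estimate of q under the AA model is driven to the bottom of the grid, while under the CC model it is pushed to the upper bound of 0.25, which reflects a poorly fitting genotype.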

With respect to genotype prior probabilities, GeMS implicitly uses non-informative or uniform priors by default. However, if prior information is known, then the GeMS model can accommodate this information in the form of π = {πg}. Given the priors {πg}, the GeMS consensus genotype is defined using the following posterior probability.

\[
\operatorname*{argmax}_{g} P(G = g \mid D) = \operatorname*{argmax}_{g} L(\hat{q}^g)\, \pi_g
\tag{4.5}
\]
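Equation (4.5) amounts to an argmax over the ten genotypes. A minimal Python sketch follows; the function name `consensus_genotype` and the dictionary-based interface are assumptions made for illustration only.

```python
def consensus_genotype(max_likelihoods, priors=None):
    """Return the genotype g maximizing L(q_hat^g) * pi_g, as in equation (4.5)."""
    if priors is None:
        # Uniform (non-informative) priors: every genotype gets equal weight.
        priors = {g: 1.0 / len(max_likelihoods) for g in max_likelihoods}
    return max(max_likelihoods, key=lambda g: max_likelihoods[g] * priors[g])


if __name__ == "__main__":
    likelihoods = {"AA": 0.6, "AC": 0.5}
    print(consensus_genotype(likelihoods))
    print(consensus_genotype(likelihoods, priors={"AA": 0.1, "AC": 0.9}))
```

The second call shows how informative priors can change the consensus genotype even when the likelihoods are unchanged.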

For the GeMS SNP calling procedure, a necessary, though not sufficient, condition to call a SNP is satisfied when the consensus genotype differs from the reference genotype. It would be reasonable, however, to also require that the consensus genotype likelihood be significantly larger than the other genotype likelihoods. Thus, let us denote the 10 ordered posterior probabilities P(G = g|D) ∝ P(D|G = g)πg with the following order statistics.

\[
P_{(1)} = \min_{g} P(G = g \mid D) \le P_{(2)} \le \dots \le P_{(9)} \le P_{(10)} = \max_{g} P(G = g \mid D)
\tag{4.6}
\]

One way to determine if the consensus genotype likelihood is significantly larger than the other genotype likelihoods is to use Dixon’s Q test [14]. Dixon’s Q test was developed to detect outliers and thus can statistically detect the dominance of the consensus likelihood. Since each location of interest has a sample size of 10 posterior probability values, the appropriate Q test statistic is as follows (see footnote 1).

\[
Q = \frac{P_{(10)} - P_{(9)}}{P_{(10)} - P_{(2)}}
\tag{4.7}
\]

This Q statistic tests if the maximum value is an outlier by considering the gap between the largest estimated posterior probability (P(10) = max_g P(G = g|D)) and the second largest (P(9)), standardized by what is effectively the range of the estimated posterior probabilities. The range in the denominator of equation (4.7) would normally be defined as P(10) − P(1), but P(10) − P(2) is more robust to very small values of P(1). The user can control how conservative this Q test is by changing their predefined α value. A SNP is called if both the consensus genotype is different from the reference and Dixon’s Q test is significant at the α value (i.e. if the p-value < α). The computational algorithm used to compute the Dixon Q test p-value is based on the R [55] outliers package [32].

1 See equation (4.11) for the Q test statistic given the haploid case.
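The Q statistic of equation (4.7) can be sketched in a few lines of Python. This sketch computes the statistic only; the conversion to a p-value, which the text attributes to the R outliers package, is deliberately omitted. The function name `dixon_q` is hypothetical.

```python
def dixon_q(posteriors):
    """Dixon Q statistic of equation (4.7) for 10 posterior probabilities."""
    ordered = sorted(posteriors)
    if len(ordered) != 10:
        raise ValueError("equation (4.7) is stated for exactly 10 genotypes")
    gap = ordered[-1] - ordered[-2]    # P(10) - P(9)
    spread = ordered[-1] - ordered[1]  # P(10) - P(2), robust to a tiny P(1)
    # If all values from P(2) upward coincide, there is no dominant genotype.
    return gap / spread if spread > 0 else 0.0


if __name__ == "__main__":
    print(dixon_q([1, 2, 3, 4, 5, 6, 7, 8, 9, 19]))
```

Because Q is a ratio of differences, it is unchanged if the posterior probabilities are rescaled, so unnormalized values of L(q̂^g)πg can be passed directly.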

4.3 Validation

It is one thing to propose a SNP calling model and quite another to validate it as being computationally efficient and competitive, with respect to accuracy, when compared to other popular SNP callers. In the following, simulations and real data analyses indicate that, while being computationally efficient, GeMS has the best performance balance of sensitivity and positive predictive value among the tested SNP callers.

It is to be understood that there are advantages and disadvantages to both simulation studies and real data analyses. For instance, though the true locations of all simulated SNPs are known in a simulation study, simulated data has the general appearance of being contrived or artificial. Beyond this, there are generally many variables that come into play with simulated data. Thus it is difficult, if not impossible, to be objective when simulating datasets. On the other hand, though real data analyses involve real data, it is usually very difficult to get reliable and complete information on the location of the SNPs. Further, as an analogue to the many variables found in simulating data, there are generally many datasets available to researchers, and it is often hard to determine which dataset should be used to convey fair validation results. Because of these issues, the following results are intended to be reproducible and transparent (i.e. necessary steps to attain the results will be given).

4.3.1 Simulation Analysis

To validate the GeMS method, extensive simulations were run which demonstrate that GeMS has the best balance of sensitivity and positive predictive value amongst a select group of other popular competing SNP callers. The simulated short read data was generated based on the reference genome of the haploid bacterial species Thermoanaerobacter sp. X514 and real Illumina Genome Analyzer short read data from the same species. Full details on the options used during the simulation, alignment and SNP calling are found in table 4.4.

The MAQ [40] simulation tools simutrain and simulate were used to produce FASTQ format short read data with differing amounts of reads. The read amounts were chosen such that the resulting alignment files would have the minimum average coverage levels of 5, 10, 20, 50, 100, 200, 500, 1000 and 2000. The higher simulated coverage levels are representative of current microbial sequencing data, whereas the lower simulated coverage levels are more indicative of whole genome sequencing on organisms with larger genomes. The simulation procedure was trained on existing HTS data using the MAQ simutrain procedure; thus the simulated base-calling quality scores were realistic to a certain degree. The default settings of MAQ simulate were chosen and thus the simulated mutation rate was set to be 0.1%. By default, 10% of these mutations were simulated as indels; the other 90% were SNPs, both homozygous and heterozygous. As mentioned earlier, this simulated data cannot perfectly represent the intricate details observed in real data, but it is sufficient for the purposes of controlling and evaluating the performances of various SNP callers.
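As a rough guide to how a read amount relates to a coverage level, the Python sketch below computes the number of read pairs needed for a target raw coverage. It is an approximation under stated assumptions: it uses the 2.5 Mbp genome size and 40bp paired-end reads listed in table 4.4, it assumes that the -N option counts read pairs, and it ignores reads lost to alignment and quality filtering, so it does not reproduce the exact values used in this study. The function name `read_pairs_for_coverage` is hypothetical.

```python
import math


def read_pairs_for_coverage(target_coverage, genome_length, read_length, paired=True):
    """Approximate number of reads (or read pairs) giving a target raw coverage."""
    # Each sequenced fragment contributes one or two reads' worth of bases.
    bases_per_fragment = read_length * (2 if paired else 1)
    return math.ceil(target_coverage * genome_length / bases_per_fragment)


if __name__ == "__main__":
    for coverage in (5, 10, 20, 50, 100, 200, 500, 1000, 2000):
        print(coverage, read_pairs_for_coverage(coverage, 2_500_000, 40))
```

Since filtering removes reads, the coverage measured on the refined alignment files will generally be lower than the raw coverage computed this way.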

The short read aligner BWA [38] was used to align the simulated reads. After the refined BAM and PILEUP files were gathered, the following SNP callers were run on the alignment files: GeMS, SAMtools mpileup which utilizes BCFtools, the SAMtools implementation of the MAQ SNP caller model, FreeBayes, GATK’s Unified Genotyper, the SAMtools implementation of the SOAPsnp procedure, Atlas-SNP2, VarScan and SNVMix2. A few of the SNP callers offer optional filters, based on BAM alignment file information, that end-users can utilize to refine their SNP calling results. It is theoretically possible to combine any such SNP calling filters with the results of any SNP caller. As the current interest is in the SNP callers themselves and not SNP caller filters, these optional SNP caller filters were disabled where possible.

Simulation Study      Specifications and non-default options used
Reference organism    Thermoanaerobacter sp. X514 (2.5Mbp)
Simulation package    MAQ-0.7.1 simulate and simutrain based on the Xw2010 dataset as related in table 4.12
Simulation options    -N was chosen to simulate maximum average coverage values of 5, 10, 20, 50, 100, 500, 1000 and 2000
Sample size           Single sample
Ploidy                X514 is haploid but MAQ simulate was implemented in diploid mode by default
Read Length           40bp
Mate pair             Paired-end
Quality encoding      Illumina 1.6+
Alignment options     BWA-0.5.9 aln -I, sampe and SAMtools-0.1.16 view -q 20
Pileup options        SAMtools-0.1.16 view -q 20 | pileup -Q 17 -B -s -
GeMS options          α = 0.05
MAQ options           SAMtools-0.1.9 pileup -vc
FreeBayes options     FreeBayes-0.9.0 -4 -m 20 -q 17 -R 20,17
SAMtools options      SAMtools-0.1.16 mpileup -Q 17 -g and bcftools view -vcg
GATK options          GenomeAnalysisTK-1.0.5336 -T UnifiedGenotyper -bad_mates -mbq 17 -mmq 20
SOAPsnp options       SAMtools-0.1.9 pileup -avc
Atlas-SNP2 options    Atlas-SNP2-1.1 -s -f 1000000
VarScan options       VarScan-v2.2.5 --min-coverage 1 --min-reads2 1 --min-avg-qual 17
SNVMix2 options       SNVMix2-0.11.8 -m shah_lobular_snvmix_model.txt -t MB -q 16 -Q 19

Table 4.4: Options used in the simulation, alignment and SNP calling of the simulated single sample data. SAMtools view is just one procedure for refining the alignment file for SNP calling or any other downstream analyses. For more information on using SAMtools view to filter alignment files, see appendix C. The program options were chosen in the interest of fairness, objectivity and reasonableness. The default options of a program were preferred, unless these options were unrealistic, incompatible with the analysis or gave an advantage to any SNP caller. The shah_lobular_snvmix_model.txt file is available at http://compbio.bccrc.ca/wp-content/uploads/2009/10/shah_lobular_snvmix_model.txt.

Impartiality is an ideal very important to this SNP caller comparison. Thus, though there are more than 8 SNP callers available for use, the 8 SNP callers chosen for comparison with GeMS allow for a relatively fair comparison. Many other SNP callers were considered but not included because of a certain feature which would reduce the level of impartiality in the comparison. In particular, each represented SNP caller needed to be compatible with the specifications as listed in table 4.4. For example, the SNP caller Slider II [44] was designed to call SNPs after running its own alignment procedure, and hence Slider II was not included in the SNP caller comparison. Also, to a reasonable extent, the options of the alignment and SNP caller procedures were chosen to maximize the fairness of the comparison. Default options were preferred unless the default options were not applicable or gave an unfair advantage to any particular SNP caller. A summary of the options used in the simulation study is recorded in table 4.4.

The criteria used to evaluate the 9 SNP callers are sensitivity and positive predictive value (PPV). Sensitivity, which is identical to recall, can be described as

“SNP call detection power” and is defined as follows.

\[
\frac{\#\,\text{True Positives}}{\#\,\text{True Positives} + \#\,\text{False Negatives}}
= \frac{\#\,\text{“true called SNPs”}}{\#\,\text{“true SNPs”}}
\tag{4.8}
\]

PPV, which is identical to precision, can be described as “SNP call accuracy” and is defined as follows.

\[
\frac{\#\,\text{True Positives}}{\#\,\text{True Positives} + \#\,\text{False Positives}}
= \frac{\#\,\text{“true called SNPs”}}{\#\,\text{“called SNPs”}}
\tag{4.9}
\]
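Equations (4.8) and (4.9) can be computed from two sets of site positions. The Python sketch below is illustrative; the function name `sensitivity_and_ppv` and the use of integer positions as site identifiers are assumptions.

```python
def sensitivity_and_ppv(true_snps, called_snps):
    """Return (sensitivity, PPV) as defined in equations (4.8) and (4.9)."""
    true_snps, called_snps = set(true_snps), set(called_snps)
    true_positives = len(true_snps & called_snps)
    # Undefined ratios (no true SNPs or no calls) are reported as NaN.
    sensitivity = true_positives / len(true_snps) if true_snps else float("nan")
    ppv = true_positives / len(called_snps) if called_snps else float("nan")
    return sensitivity, ppv


if __name__ == "__main__":
    print(sensitivity_and_ppv({100, 200, 300, 400}, {100, 200, 500}))
```

In this toy example two of the four true SNPs are called (sensitivity 0.5) and two of the three calls are true (PPV of about 0.67).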

Typically, a one-number summary of the sensitivity and PPV is given by the harmonic mean (HM) of the sensitivity and PPV. The harmonic mean is preferred to the arithmetic mean because an “average” is desired of 2 rates or ratios. The harmonic mean of the n values x1, x2, . . . , xn is given by the following.

\[
HM = \left( \frac{1}{n} \sum_{i=1}^{n} x_i^{-1} \right)^{-1}
   = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}
\tag{4.10}
\]

The complete sensitivity and PPV results, over all the simulated coverage levels, are displayed in tables 4.5 and 4.6. Further, the harmonic mean of sensitivity over all the coverage levels, the harmonic mean of PPV over all the coverage levels, and the harmonic mean of both sensitivity and PPV (see footnote 2) over all coverage levels, are listed in table 4.7.
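A short Python sketch of equation (4.10) follows, together with a numerical check of the identity for equal-length samples stated in footnote 2. The function name `harmonic_mean` and the sample values `xs` and `ys` are illustrative assumptions.

```python
def harmonic_mean(values):
    """Harmonic mean of positive values, equation (4.10)."""
    values = list(values)
    return len(values) / sum(1.0 / v for v in values)


if __name__ == "__main__":
    # GeMS row of table 4.7: sensitivity HM and PPV HM.
    print(round(harmonic_mean([0.9471, 0.9940]), 4))

    # Identity for two samples of equal size n.
    xs, ys = [0.8, 0.9, 0.95], [0.99, 0.97, 0.5]
    print(harmonic_mean([harmonic_mean(xs), harmonic_mean(ys)]))
    print(harmonic_mean(xs + ys))
```

Applied to the GeMS sensitivity and PPV harmonic means (0.9471 and 0.9940), the function returns a value that rounds to 0.9700, in agreement with the last column of table 4.7.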

Similar to the function of ROC curve charts, PPV and sensitivity can also be plotted in the form of a precision-recall (PR) chart. Thus the performance of each SNP caller can be viewed graphically. A PR chart plots the PPV or precision on the y-axis and sensitivity or recall on the x-axis. Figure 4.2 and its corresponding “zoomed-in” version in figure 4.3 display the simulation results on a precision-recall plot.

2 In this case, the following fact about harmonic means is of interest. Given the n values x1, x2, . . . , xn and the n values y1, y2, . . . , yn, it can be shown that HM(HMx, HMy) = HMxy. In other words, the harmonic mean of the harmonic means of {x1, x2, . . . , xn} and {y1, y2, . . . , yn} is equal to the harmonic mean of {x1, x2, . . . , xn, y1, y2, . . . , yn}.

Sensitivity   5       10      20      50      100     200     500     1000    2000    Caller HM
VarScan       0.9261  0.9613  0.9643  0.9704  0.9749  0.9684  0.9751  0.9676  0.9733  0.9644
SNVMix2       0.9261  0.9613  0.9643  0.9704  0.9749  0.9684  0.9751  0.9676  0.9733  0.9644
FreeBayes     0.9216  0.9600  0.9621  0.9659  0.9712  0.9656  0.9697  0.9626  0.9707  0.9608
MAQ           0.8353  0.9410  0.9630  0.9704  0.9749  0.9684  0.9751  0.9676  0.9725  0.9499
GeMS          0.8264  0.9287  0.9607  0.9704  0.9749  0.9684  0.9751  0.9676  0.9733  0.9471
SAMtools      0.7449  0.9045  0.9554  0.9672  0.9726  0.9679  0.9728  0.9653  0.9716  0.9295
GATK          0.7070  0.9094  0.9594  0.9681  0.9735  0.9679  0.9737  0.9676  0.9729  0.9242
Atlas-SNP2    0.6002  0.8720  0.9585  0.9690  0.9731  0.9679  0.9742  0.9676  0.9729  0.8967
SOAPsnp       0.8353  0.9410  0.9630  0.9704  0.9735  0.6439  0.6486  0.9604  0.9672  0.8550
Coverage HM   0.7984  0.9301  0.9612  0.9691  0.9737  0.9319  0.9377  0.9660  0.9720  0.9311

Table 4.5: Single sample simulation SNP caller sensitivity. The simulated coverage values are indicated in the top row. The column harmonic means (HM) at each coverage level are given in the bottom row. The other rows are sorted by the “Caller HM” column, which gives the harmonic means of the rows over all the coverage levels.

PPV           5       10      20      50      100     200     500     1000    2000    Caller HM
SAMtools      1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
GeMS          0.9784  0.9828  0.9958  0.9991  0.9967  0.9995  0.9968  0.9981  0.9991  0.9940
SOAPsnp       0.9740  0.9696  0.9682  0.9747  0.9695  0.9674  0.9795  0.9912  0.9924  0.9762
MAQ           0.9740  0.9696  0.9673  0.9717  0.9611  0.9725  0.9755  0.9777  0.9661  0.9706
FreeBayes     0.9637  0.9454  0.9370  0.9611  0.9640  0.9665  0.9710  0.9771  0.9763  0.9623
GATK          0.9956  0.9899  0.9890  0.9863  0.9731  0.9707  0.9467  0.8924  0.8499  0.9523
Atlas-SNP2    0.9656  0.9580  0.9251  0.8563  0.7471  0.7049  0.7134  0.7237  0.7212  0.7996
VarScan       0.8836  0.8327  0.7489  0.6797  0.6374  0.6373  0.6667  0.6871  0.6877  0.7094
SNVMix2       0.8855  0.8362  0.7525  0.6856  0.5737  0.5183  0.4599  0.3579  0.2578  0.5105
Coverage HM   0.9560  0.9386  0.9096  0.8817  0.8340  0.8149  0.8028  0.7622  0.6943  0.8357

Table 4.6: Single sample simulation SNP caller PPV. The simulated coverage values are indicated in the top row. The column harmonic means (HM) at each coverage level are given in the bottom row. The other rows are sorted by the “Caller HM” column, which gives the harmonic means of the rows over all the coverage levels.

              Sensitivity  PPV     Harmonic Mean
GeMS          0.9471       0.9940  0.9700
SAMtools      0.9295       1.0000  0.9634
FreeBayes     0.9608       0.9623  0.9615
MAQ           0.9499       0.9706  0.9601
GATK          0.9242       0.9523  0.9380
SOAPsnp       0.8550       0.9762  0.9116
Atlas-SNP2    0.8967       0.7996  0.8454
VarScan       0.9644       0.7094  0.8175
SNVMix2       0.9644       0.5105  0.6676

Table 4.7: GeMS simulation results summary. Harmonic mean of sensitivity over all coverage levels, harmonic mean of PPV over all coverage levels, and the harmonic mean of both sensitivity and PPV over all coverage levels, listed in descending order of the harmonic mean of both sensitivity and PPV.

[Figure 4.2 plot. x-axis: Sensitivity (Recall) = TP/(TP+FN), from 0.0 to 1.0; y-axis: PPV (Precision) = TP/(TP+FP), from 0.0 to 1.0. One series per SNP caller: GeMS, FreeBayes, MAQ, SAMtools, GATK, SOAPsnp, Atlas-SNP2, VarScan, SNVMix2. Point labels give the approximate coverage: '1' = 5, '2' = 10, '3' = 20, '4' = 50, '5' = 100, '6' = 200, '7' = 500, '8' = 1000, '9' = 2000.]

Figure 4.2: Sensitivity and PPV or precision-recall plot of SNP caller performance. SNP caller performance across 9 coverage levels and 9 SNP callers. See figure 4.3 for a zoomed-in view.

It is evident from table 4.7 that the GeMS SNP calling procedure has the best performance balance of the nine tested SNP callers. By “performance balance”, we mean precisely the last column of table 4.7, the harmonic mean of both sensitivity and PPV.

SAMtools, FreeBayes and MAQ (recall the SAMtools model is based on MAQ) offer the next best performance levels after GeMS. However, appealing to tables 4.5 and 4.6, it is clear that at every coverage level, GeMS is more sensitive than SAMtools and offers more PPV than FreeBayes and MAQ.

[Figure 4.3 plot. x-axis: Sensitivity (Recall) = TP/(TP+FN), from 0.80 to 1.00; y-axis: PPV (Precision) = TP/(TP+FP), from 0.80 to 1.00. One series per SNP caller: GeMS, FreeBayes, MAQ, SAMtools, GATK, SOAPsnp, Atlas-SNP2, VarScan, SNVMix2. Point labels give the approximate coverage: '1' = 5, '2' = 10, '3' = 20, '4' = 50, '5' = 100, '6' = 200, '7' = 500, '8' = 1000, '9' = 2000.]

Figure 4.3: Zoomed-in view of sensitivity and PPV or precision-recall plot of SNP caller performance. SNP caller performance across 9 coverage levels and 9 SNP callers. See figure 4.2 for a full plot view.

As mentioned in section 3.1.4, SNVMix2 partitions over the genotype categories homozygous reference, heterozygous reference and non-reference. SAMtools uses the same genotype category partition. This is a large difference when compared with the

GeMS SNP calling model which considers all 10 possible genotypes at each site. It was of interest to determine exactly how the 10 model approach of GeMS compares to the 3 model approaches of both SNVMix2 and SAMtools, and thus a special version of

GeMS, which only considers the genotype categories homozygous reference, heterozygous reference and non-reference, was run. The 3 model GeMS analysis results, contained in tables 4.8 and 4.9, indicate that considering only the 3 stated models reduces the sensitivity at every coverage level when compared to considering the 10 possible genotype models. This finding holds true given the otherwise unchanged GeMS procedure and offers one reasonable explanation for the lower sensitivity of SAMtools when compared to the regular 10 model GeMS method. The results also show that the PPV of the 3 model GeMS is not always superior to the 10 model GeMS.

We can also examine the simulated SNPs that were not called by any of the

SNP callers. In total over the 9 simulated datasets, there are 706 of these false negative sites. It should be remembered that SNP callers work on the alignment data and not the HTS read data. It is thus understandable that 604 of these 706 sites actually had no coverage which means that SNPs could not be called. Among the 102 remaining sites,

94 were uniformly covered by the reference allele. As with the 604 sites mentioned above, this is probably a result of alignment errors and thus, again, SNPs could not be called.

Sensitivity     5       10      20      50      100     200     500     1000    2000
10 Model GeMS   0.8264  0.9287  0.9607  0.9704  0.9749  0.9684  0.9751  0.9676  0.9733
3 Model GeMS    0.3767  0.7536  0.9438  0.9668  0.9735  0.9679  0.9737  0.9671  0.9725
SAMtools        0.7449  0.9045  0.9554  0.9672  0.9726  0.9679  0.9728  0.9653  0.9716

Table 4.8: Sensitivity of 10 model GeMS, 3 model GeMS and SAMtools. The average simulated coverage values are indicated in the first row. With the otherwise unchanged GeMS procedure, these results demonstrate that utilizing only 3 models reduces the sensitivity at every coverage level when compared to considering all the 10 possible genotype models.

PPV             5       10      20      50      100     200     500     1000    2000
10 Model GeMS   0.9784  0.9828  0.9958  0.9991  0.9967  0.9995  0.9968  0.9981  0.9991
3 Model GeMS    0.9976  0.9948  0.9981  0.9995  0.9967  0.9995  0.9991  0.9991  0.9991
SAMtools        1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000

Table 4.9: PPV of 10 model GeMS, 3 model GeMS and SAMtools. The average simulated coverage values are indicated in the first row. With the otherwise unchanged GeMS procedure, these results demonstrate that the PPV of the 3 model GeMS is not always superior to the 10 model GeMS.

The 8 remaining sites had allele pileups consisting of exactly 1 non-reference allele that was associated with a low base-calling quality score (B < 17). Table 4.4 shows us that most of the SNP callers would not even consider this 1 non-reference allele due to quality filters.

4.3.2 Real Data Analysis

As explained in section 4.3, both simulated and real dataset analyses have advantages and disadvantages. This is why the two are complementary, in that together they help us understand how the different SNP callers compare to each other. In the following datasets, there is some indication as to the type of SNPs that should be present, but complete information on the SNPs is not known. However, the real data does give us a look into how each of the considered SNP callers works in a non-simulated situation.

4.3.2.1 The Arabidopsis sup1ros1 dataset

In this analysis of real data, the SNP calls of GeMS, SAMtools and GATK on

HTS data from an Arabidopsis sup1ros1 ecotype are compared. A description of this dataset and the options used in the analysis are given in table 4.10. Since it was desired to compare the SNP calls of each SNP caller individually, 3 SNP callers were chosen.

Both SAMtools and GATK were chosen because of their popularity as HTS SNP callers and their relatively good performance in the simulation study. GATK and SAMtools called 5,341 and 4,577 sites as putative SNPs respectively, on the region of interest which included sites 24,218,085–26,019,264 on chromosome 5. The GeMS α value was set to be 0.005 to get confident SNP calls in this SNP caller comparison.

Arabidopsis sup1ros1   Specifications and non-default options used
Organism Reference     Columbia TAIR9 (121Mbp)
Sample size            Single sample
Ploidy                 Diploid
Read Length            110bp
Mate pair              Single-end
Quality encoding       Illumina 1.5+
Alignment options      BWA-0.5.9 aln -I, samse and SAMtools-0.1.16 view -q 20
Pileup options         SAMtools-0.1.16 view -q 20 | pileup -Q 17 -B -s -
GeMS options           GATK priors, α = 0.005
GATK options           GenomeAnalysisTK-1.0.5336 -T UnifiedGenotyper -bad_mates -mbq 17 -mmq 20
SAMtools options       SAMtools-0.1.16 mpileup -Q 17 -g and bcftools view -vcg

Table 4.10: Options used in the alignment and SNP calling of the Arabidopsis sup1ros1 dataset. SAMtools view is just one procedure for refining the alignment file for SNP calling or any other downstream analyses. For more information on using SAMtools view to filter alignment files, see appendix C. The program options were chosen in the interest of fairness, objectivity and reasonableness. The default options of a program were preferred, unless these options were unrealistic, incompatible with the analysis or gave an advantage to any SNP caller.

The dataset, and in particular, the region of interest mentioned above, is known to be sequenced from a highly homozygous mutation. Thus the zygosity of the consensus calls associated with the SNP calls will be of interest to examine. In the interest of fairness, the prior probabilities used by GATK were employed in the GeMS model.

Specifically, GATK by default assigns the highest prior probability to the homozygous reference genotype and lesser prior probabilities to each of the 9 other homozygous and heterozygous genotypes. A description of the prior probabilities used is given in section 4.4.2. The following paragraphs will examine the different sections of the Venn diagram shown in figure 4.4, which was generated from the GeMS, GATK and SAMtools SNP call sets.

Let us examine the zygosity of the consensus calls associated with each of the SNP call sets. 1.6% of the GeMS SNP calls were heterozygous, compared with 6.8% from GATK and 2.9% from SAMtools. This result is in favor of GeMS, as it is known that the vast majority of the mutations should be homozygous. Specifically, among the 4,452 SNP calls from GeMS, 73 were called as heterozygous. GATK called all of these 73 sites as heterozygous and SAMtools called 63 (of the 73) as heterozygous. Expanding to all the SNP calls, it can be seen that GeMS shares 99.7% of its SNPs with GATK and 92.6% of its SNPs with SAMtools. These findings indicate that the majority of the GeMS SNP calls are supported by both GATK and SAMtools.

[Figure 4.4 Venn diagram. SNP call counts by region: GeMS only, 10; GATK only, 477; SAMtools only, 30; GeMS and GATK only, 319; GeMS and SAMtools only, 2; GATK and SAMtools only, 424; all three, 4121. Unique objects: All = 5383; GeMS = 4452; GATK = 5341; SAMtools = 4577.]

Figure 4.4: Venn diagram of the SNP call sets of GeMS, GATK and SAMtools. The GeMS prior probability values were set to those of GATK.

Considering the 319 sites identified as SNPs by both GeMS and GATK but not by SAMtools brings more insight. In particular, the allele pileup data demonstrates that these sites were called correctly as SNPs, thus indicating that the SAMtools procedure exhibits low sensitivity in this dataset. First, 10 of these sites are called as heterozygous by both GeMS and GATK. These sites appear to be clear SNP calls as the percentage of the most frequently aligned non-reference alleles spanned between 43% and 75%. The 309 remaining sites feature a high percentage of the most frequently aligned non-reference allele, which is at least 83.3%. 259 out of the 309 sites (or 83.8%), in fact, are covered uniformly by only one type of non-reference allele. The coverage at these 259 sites ranged from 8 to 50.

Considering the 424 sites which are called as SNPs by both GATK and SAMtools, but not by GeMS, indicates the difference between the GATK priors and the default GeMS non-informative priors. A characteristic common to most of these sites is low coverage; in fact, 87.7% of these sites have coverage levels less than 8. The prior probabilities of GATK also have a strong influence on the GeMS SNP calls. For example, 94.9% of the sites mentioned above are covered by just one type of non-reference allele, but the GeMS likelihoods were such that the Dixon test was not significant, i.e. the estimated GeMS likelihoods associated with the variant genotypes were not significantly larger than those of the other possible genotypes. In this case, the bias of the GATK priors toward the reference genotypes may have prevented GeMS from calling SNPs at these sites. As alluded to in section 4.4.2, when the coverage is low, these GATK prior probability values may reduce a SNP caller’s sensitivity. Thus, when prior information is not available, non-informative or uniform prior probability values are recommended.

To see the change that can happen when we utilize non-informative priors compared with GATK’s priors, we can reevaluate the 424 sites mentioned above with the default uniform priors of GeMS. When doing so, 98.9% of the low coverage sites (coverage less than 8) are called by GeMS to be SNPs. These include all the sites covered uniformly by only one type of non-reference allele. When considering the sites with coverage ≥ 8, GATK and SAMtools call 98.1% and 92.3% of these sites as heterozygous SNPs respectively. Since we know that this dataset has a homozygous mutation bias, these SNP calls seem suspect.

The exclusive SNP calls by GATK and SAMtools are also of interest. There are 477 sites that are exclusive to the GATK SNP call set. Among these sites 73.8% have low coverage, i.e. coverage less than 8, and 98.4% of the sites with coverage ≥ 8 were called to be heterozygous. GeMS with non-informative priors calls 92.3% of these low coverage sites as SNPs.

There are 30 sites which are exclusive to the SAMtools SNP call set. 15 of these 30 sites have coverage less than 8. 2 of these 30 sites are called to be heterozygous

SNPs by SAMtools. The 13 remaining sites were excluded from the GeMS analysis because they are potential deletions. As mentioned above in section 4.1, sites where the allele pileup is over 5% deletions are excluded from the GeMS analysis since they are considered potential deletion mutations.

There are 10 exclusive GeMS SNP sites and 2 sites which are called as SNPs by GeMS and SAMtools but not by GATK. The allele pileups of these 12 sites show clear SNP characteristics. In particular, all of these 12 sites are covered with just one type of non-reference allele with coverage ≥ 8 (specifically 9-16). Thus we see the superior sensitivity of GeMS. An interesting artifact with the 10 exclusive GeMS SNP sites is that these sites are composed of two 5bp MNPs or multiple nucleotide polymorphisms.

One large difference between the simulated dataset and this real dataset, is that multiple simulated datasets were generated for various coverage levels. Thus to explore different coverage levels in this real dataset, we can randomly remove short reads such that the average coverage levels become approximately 5, 10, 20 and 33, where 33 is the average coverage of the complete dataset. It is reasonable to randomly remove short reads not only because the data is single-end, but more importantly because the BWA aligner aligns reads independently [24].
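The random removal of reads can be sketched as independent thinning, where each read is kept with probability equal to the ratio of target to current coverage. This is an assumption made for illustration; the exact removal procedure used for this dataset is not specified here, and the function name `downsample_reads` is hypothetical.

```python
import random


def downsample_reads(reads, current_coverage, target_coverage, seed=0):
    """Keep each read independently so that expected coverage drops to the target."""
    keep_fraction = min(1.0, target_coverage / current_coverage)
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [read for read in reads if rng.random() < keep_fraction]


if __name__ == "__main__":
    reads = [f"read_{i}" for i in range(33000)]
    for target in (20, 10, 5):
        kept = downsample_reads(reads, current_coverage=33, target_coverage=target)
        print(target, len(kept))
```

Because each read is kept or dropped independently, this approach is only appropriate for single-end data, as in this dataset; paired-end data would require keeping or dropping both mates together.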

               33      20      10      5
GeMS E/T       0.0022  0.0010  0.0000  0.0000
GATK E/T       0.0893  0.0799  0.0915  0.1325
SAMtools E/T   0.0066  0.0145  0.0139  0.0429
I/OT           0.7656  0.7342  0.5230  0.1029

Table 4.11: SNP call E/T and I/OT proportions listed at coverage levels 33 (complete dataset), 20, 10 and 5. E/T represents the proportion of the Exclusive SNPs to Total SNPs for each SNP caller. I/OT represents the proportion of the SNP calls in the 3-way Intersection (of the GeMS, GATK and SAMtools SNP call sets) to the Overall Total unique SNP calls of all the 3 SNP callers.

As demonstrated in the Venn diagram shown in figure 4.4, a site that is called by all 3 SNP callers is generally viewed as a confident SNP call site. Similarly, if a site is only called by 1 SNP caller, then the SNP call is not viewed with as much confidence.

Since in real datasets the true SNP sites are not known, one validation metric could be the proportion of Exclusive SNPs to Total SNPs (E/T) for each SNP caller. Another metric to consider could be the proportion of the SNP calls in the 3-way Intersection (of the GeMS, GATK and SAMtools SNP call sets) to the Overall Total unique SNP calls of all the 3 SNP callers (I/OT). Table 4.11 lists these E/T and I/OT proportions for the coverage levels of 5, 10, 20 and 33. It is clear that the GeMS E/T value is less than that of GATK and SAMtools at every coverage level. This fact suggests that the GeMS SNP calls can be viewed as more accurate than those of the other SNP callers. Another observation is that the I/OT proportion drops dramatically as the coverage drops. This observation reaffirms the recommendation, also supported by the simulation study, that users exercise caution when using SNP callers on low coverage data.
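The two proportions can be sketched in a few lines of Python, assuming each caller's output has already been reduced to a set of site positions. The function names and the example positions are hypothetical and are not taken from the analysis above.

```python
def exclusive_to_total(calls, others):
    """E/T: fraction of one caller's SNP sites that no other caller reports."""
    if not calls:
        return 0.0
    exclusive = calls - set().union(*others)
    return len(exclusive) / len(calls)


def intersection_to_overall_total(call_sets):
    """I/OT: sites called by every caller over sites called by any caller."""
    union = set().union(*call_sets)
    if not union:
        return 0.0
    shared = set.intersection(*call_sets)
    return len(shared) / len(union)


# Hypothetical site positions, for illustration only.
gems = {101, 205, 310, 422}
gatk = {101, 205, 310, 555, 610}
samtools = {101, 205, 422, 700}

et_gems = exclusive_to_total(gems, [gatk, samtools])
et_gatk = exclusive_to_total(gatk, [gems, samtools])
iot = intersection_to_overall_total([gems, gatk, samtools])
```

In this toy example every GeMS site is shared with at least one other caller, so its E/T is 0, while two of the five GATK sites are exclusive.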

4.3.2.2 The Thermoanaerobacter sp. X514 Xw2010 dataset

It is important to note that many of the aforementioned SNP callers are only available for diploid data. However, the only SNP caller that performed well in the

Figure 4.5: Venn diagram of the SNP calls by GeMS and FreeBayes (24 exclusive to GeMS, 21 shared, 10 exclusive to FreeBayes).

simulation study and that has a haploid option is FreeBayes. Thus only GeMS and

FreeBayes are compared in this (haploid) bacterial dataset. The bacterial species Thermoanaerobacter sp. X514 was sequenced in 2008 and again in 2010 (hence the name of the Xw2010 dataset). With respect to SNPs, the 2008 sequencing data is presumed to be largely consistent with the Thermoanaerobacter sp. X514 reference genome. In

contrast, it is expected that the Xw2010 data may exhibit some variants due to the short

life cycle of bacteria.

To mitigate the effect of inconclusive SNP calls, SNP calls with coverage values ≤ 3 were filtered out of the analysis. With the GeMS α value set to 0.05, GeMS

called 45 SNPs. As seen in the Venn diagram in figure 4.5, FreeBayes called a total of 31

SNPs, though 21 of the FreeBayes SNPs are also shared by GeMS. All of the alignment

and SNP call options used in this analysis are recorded in table 4.12.

Among the allele pileups at the 55 unique SNP call sites mentioned above,

there are generally only 2 major alleles, the reference and a particular alternative allele.

The 10 exclusive FreeBayes calls can be categorized into either the 2 sites which are

characterized by many deletions or the 8 sites which have reference allele coverage rates

| X514 Xw2010 | Specifications and non-default options used |
|---|---|
| Organism Reference | Thermoanaerobacter sp. X514 (2.5Mbp) |
| Sample size | Single sample |
| Ploidy | Haploid |
| Read Length | 40bp |
| Mate pair | Single-end |
| Quality encoding | Illumina 1.6+ |
| Alignment options | BWA-0.5.9 aln -I, samse and SAMtools-0.1.13 view -q 20 |
| Pileup options | SAMtools-0.1.16 view -q 20 \| pileup -Q 17 -B -s - |
| GeMS options | α = 0.05 |
| FreeBayes options | FreeBayes-0.9.0 -4 -p 1 -m 20 -q 17 -R 20,17 |

Table 4.12: Options used in the alignment and SNP calling of the Thermoanaerobacter sp. X514 dataset. SAMtools view is just one procedure for refining the alignment file for SNP calling or any other downstream analyses. For more information on using SAMtools view to filter alignment files, see appendix C. The program options were chosen in the interest of fairness, objectivity and reasonableness. The default options of a program were preferred, unless these options were unrealistic, incompatible with the analysis or gave an advantage to any SNP caller.

that range from 45% to 75%. Thus GeMS compares favorably, as among the 24 exclusive

GeMS calls, one site has a reference coverage rate of 48% but the 23 other SNP call sites

have reference coverage rates of 12%-44%. Based on these reference allele coverage rates,

it is evident that the exclusive GeMS calls are more certainly variants than the FreeBayes

exclusive calls. Also, it can be inferred that GeMS has a higher sensitivity than FreeBayes

because of the relatively large number of strong exclusive GeMS SNPs.

4.3.3 Computational Performance

For the statistician, metrics such as sensitivity and PPV are very important.

For the bioinformatician, other metrics, such as procedure processing time and memory

usage are also very important. Thus the computational performance of the simulation

study in section 4.3.1 was recorded for each of the SNP callers on each of the 9 simulated

HTS datasets.

| Hardware | Specifications |
|---|---|
| CPU | Intel Xeon E5420 2.50GHz Quad Core |
| RAM | 16GB DDR2 ECC |
| HDD | NFS 7,500 rpm |

Table 4.13: Computer hardware specifications for the single sample simulated data SNP calling.

For objectivity, each procedure was run using the same hardware without other tasks being run during the SNP calling procedures. The computer hardware specifications are given in table 4.13.

There were three metrics that were used to evaluate the computational performance of the SNP callers: process completion time in seconds, and average and maximum memory usage in megabytes (MB) during process completion. These computational performance results are displayed in tables 4.14, 4.15 and 4.16. These tables show that the VarScan procedure could not be completed with the above mentioned hardware for coverage levels 500, 1000 and 2000. The VarScan results for these coverage levels were gathered when running VarScan on a computer with greater specifications.

Turning first to table 4.14, we see that GeMS completed considerably faster than GATK, SAMtools (the MAQ successor) and FreeBayes. It is true that other SNP callers are faster than GeMS, such as SNVMix2, SOAPsnp and MAQ; however, recalling table 4.7, we note again that these SNP callers did not do as well as GeMS in the harmonic mean of sensitivity and PPV criterion.

It is of interest to note that the maximum memory used during the use of the 9

SNP callers over the 9 simulated datasets was 3136MB. This occurred while Atlas-SNP2 was running on the largest simulated dataset with a coverage level of 2000. Thus many

| Time (Sec) | 5 | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 2000 | Caller Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| SNVMix2 | 2 | 3 | 4 | 9 | 15 | 29 | 77 | 176 | 346 | 74 |
| SOAPsnp | 2 | 3 | 6 | 13 | 25 | 52 | 131 | 265 | 548 | 116 |
| MAQ | 4 | 5 | 7 | 14 | 26 | 54 | 137 | 268 | 562 | 120 |
| GeMS | 12 | 17 | 28 | 60 | 112 | 212 | 339 | 512 | 854 | 239 |
| GATK | 33 | 41 | 53 | 99 | 154 | 264 | 495 | 727 | 1208 | 342 |
| FreeBayes | 16 | 26 | 45 | 99 | 196 | 381 | 916 | 1828 | 3639 | 794 |
| SAMtools | 24 | 43 | 81 | 192 | 383 | 756 | 1872 | 3729 | 7444 | 1614 |
| VarScan | 572 | 687 | 929 | 1597 | 2728 | 5003 | NA | NA | NA | 1919 |
| Atlas-SNP2 | 236 | 336 | 510 | 1029 | 1954 | 4041 | 13669 | 55554 | 267224 | 38284 |
| Coverage Avg | 100 | 129 | 185 | 346 | 621 | 1199 | 2204 | 7882 | 35228 | 4833 |

Table 4.14: Time to completion of the single sample SNP calling procedures in seconds. The simulated coverage levels are given in the first row. The column averages at each coverage level are given in the bottom row. The other rows are sorted by the “Caller Avg” column, which indicates the arithmetic means of the rows over all the 9 coverage levels. The VarScan procedure could not be completed for coverage levels 500, 1000 and 2000 given the computing setup as described in table 4.13.

| Avg Memory (MB) | 5 | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 2000 | Caller Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| SOAPsnp | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* |
| SNVMix2 | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* |
| FreeBayes | 0* | 0* | 0* | 0* | 0* | 0* | 5 | 13 | 16 | 4 |
| MAQ | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| SAMtools | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| VarScan | 48 | 48 | 48 | 48 | 48 | 48 | NA | NA | NA | 48 |
| GeMS | 163 | 232 | 210 | 205 | 201 | 200 | 200 | 203 | 203 | 202 |
| GATK | 262 | 338 | 176 | 368 | 1209 | 220 | 258 | 234 | 321 | 376 |
| Atlas-SNP2 | 104 | 99 | 105 | 145 | 182 | 275 | 593 | 1098 | 2444 | 561 |
| Coverage Avg | 71 | 87 | 67 | 92 | 189 | 90 | 140 | 202 | 381 | 139 |

Table 4.15: Average memory used by the single sample SNP calling procedures in megabytes (MB). The simulated coverage levels are given in the first row. The column averages at each coverage level are given in the bottom row. The other rows are sorted by the “Caller Avg” column, which indicates the arithmetic means of the rows over all the 9 coverage levels. The VarScan procedure could not be completed for coverage levels 500, 1000 and 2000 given the computing setup as described in table 4.13. The notation “0*” indicates procedures that used less than 16MB of memory.

| Max Memory (MB) | 5 | 10 | 20 | 50 | 100 | 200 | 500 | 1000 | 2000 | Caller Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| SOAPsnp | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* |
| SNVMix2 | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* | 0* |
| FreeBayes | 0* | 0* | 0* | 0* | 0* | 0* | 16 | 16 | 16 | 5 |
| MAQ | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| SAMtools | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| VarScan | 48 | 48 | 48 | 48 | 48 | 48 | NA | NA | NA | 48 |
| GeMS | 336 | 384 | 368 | 384 | 384 | 400 | 400 | 400 | 400 | 384 |
| GATK | 304 | 384 | 256 | 400 | 1392 | 304 | 480 | 320 | 528 | 485 |
| Atlas-SNP2 | 176 | 192 | 208 | 256 | 320 | 464 | 848 | 1504 | 3136 | 789 |
| Coverage Avg | 103 | 119 | 105 | 128 | 245 | 142 | 226 | 288 | 518 | 197 |

Table 4.16: Maximum memory used by the single sample SNP calling procedures in megabytes (MB). The simulated coverage levels are given in the first row. The column averages at each coverage level are given in the bottom row. The other rows are sorted by the “Caller Avg” column, which indicates the arithmetic means of the rows over all the 9 coverage levels. The VarScan procedure could not be completed for coverage levels 500, 1000 and 2000 given the computing setup as described in table 4.13. The notation “0*” indicates procedures that used less than 16MB of memory.

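| Model | $g$ | $q_A^g$ | $q_C^g$ | $q_G^g$ | $q_T^g$ |
|---|---|---|---|---|---|
| 1 | A | $1 - 3q$ | $q$ | $q$ | $q$ |
| 2 | C | $q$ | $1 - 3q$ | $q$ | $q$ |
| 3 | G | $q$ | $q$ | $1 - 3q$ | $q$ |
| 4 | T | $q$ | $q$ | $q$ | $1 - 3q$ |

Table 4.17: The four possible $q^g = \{q_A^g, q_C^g, q_G^g, q_T^g\}$ for $Y_j^l \sim \text{Categorical}(q^g)$ given a haploid organism.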

personal computers should be able to handle all of the SNP calling analyses done in

section 4.3.1.

4.4 Discussion

4.4.1 Haploid GeMS Analysis

For analysis with haploid organisms, the general GeMS procedure as discussed

in section 4.2 holds except that there are only four possible genotypes (i.e. g ∈ {A, C, G, T}) rather than the 10 (i.e. g ∈ {AA, CC, GG, TT, AC, AG, AT, CG, CT, GT}) of diploid organisms. Thus there are only four models to consider for the model selection. See table 4.17 for the $q^g$ associated with each of the four possible genotypes of a haploid organism. Since only four genotype likelihoods are computed, the Dixon’s Q statistic also changes accordingly, as seen in equation (4.11). It is of interest that the denominator of equation (4.11) is set up the same in principle as that of equation (4.7), though with an explicit difference.

\[
Q = \frac{P_{(4)} - P_{(3)}}{P_{(4)} - P_{(1)}} \tag{4.11}
\]
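A minimal Python sketch of equation (4.11) follows. It assumes that the four inputs are the haploid genotype probabilities and that $P_{(1)} \le \dots \le P_{(4)}$ denote their order statistics. The function name `dixon_q_haploid` and the tie-handling convention are illustrative, and the critical value used to judge significance is not reproduced here.

```python
def dixon_q_haploid(probs):
    """Dixon's Q for the largest of four genotype probabilities, following equation (4.11)."""
    if len(probs) != 4:
        raise ValueError("expected one probability per haploid genotype (A, C, G, T)")
    p1, _, p3, p4 = sorted(probs)      # order statistics P(1) <= ... <= P(4)
    spread = p4 - p1
    if spread == 0:
        return 0.0                     # all four values tie (convention of this sketch)
    return (p4 - p3) / spread          # gap to the runner-up over the full range


q_stat = dixon_q_haploid([0.90, 0.05, 0.03, 0.02])
```

A value of Q close to 1 means the top genotype is well separated from the runner-up relative to the overall range of the four probabilities.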

4.4.2 Prior Probabilities

Prior probabilities are generally used when some prior information is known.

Thus, when prior probabilities are consistent with the truth, the results are expected to be more accurate with the prior probabilities than without. Conversely, if unrealistic prior probabilities are used then the final results are likely to be misleading.

The homozygous reference genotype is assigned the largest prior probability in many

SNP callers. For example, we can clearly see this in the prior probability values used in

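GATK. Given $\varepsilon = 10^{-3}$,

\[
\begin{aligned}
P(\text{heterozygous genotype}) &= \varepsilon, \\
P(\text{homozygous non-reference genotype}) &= \frac{\varepsilon}{2} \quad \text{and} \\
P(\text{homozygous reference genotype}) &= 1 - \frac{3\varepsilon}{2}.
\end{aligned}
\tag{4.12}
\]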

The homozygous reference genotype is assigned the highest prior probability because the vast majority of sites are correctly genotyped as homozygous reference. Thus such a prior probability assignment works for the vast majority of sites. However, this prior probability setup hinders sites that actually do harbor SNPs from being called by SNP callers, especially when the coverage is low. Thus, GeMS uses non-informative priors by default so as not to hinder SNP call set sensitivity. The user of GeMS may provide useful prior information if it is available.

Chapter 5

Multiple Sample Genotype Model Selection (multiGeMS)

As HTS data became more plentiful, it became clear that single sample analyses would not always be sufficient. Hence, few, if any, HTS analysis procedures currently being developed are limited to just single sample analyses. In this way, the GeMS procedure was extended to a multiple sample version, which is known as multiGeMS. It should be made known that multiGeMS is not just the trivial single sample application of GeMS to every sample in a multiple sample dataset. Rather multiGeMS estimates parameters both within each sample and over all the samples to call SNPs.

To briefly review the introduction to section 3.2 and figure 3.1, multiple sample alignment data is simply an extension of single sample alignment data. Instead of information regarding one sample’s reads aligned to a reference genome, multiple sample alignment data contains information regarding read alignments of multiple samples to the same reference genome. See table 5.2 for the notation used in connection with multiple sample HTS data.

| Setting | Default | Description |
|---|---|---|
| Sample reference proportion | 0.8 | Only sites with at least one sample where the reference allele ratio is less than this value will be considered for SNP calling analysis |
| Sample deletion placeholder | 0.05 | Sites with any sample that has a deletion placeholder proportion greater than this value will not be admitted for SNP calling analysis |
| Base-calling quality | 17 | Alleles with base-calling quality Phred scores less than this value will be removed from the allele pileups |
| Mapping quality | 20 | Alleles with mapping quality Phred scores less than this value will be removed from the allele pileups |
| Sample maximum coverage | 255 | The maximum number of alleles that will be analyzed at each sample at each site |
| ε | $10^{-6}$ | EM algorithm convergence criterion threshold |
| Maximum EM iterations | 10,000 | The maximum number of EM algorithm iterations to be run at each site |
| Ploidy | Diploid | Either haploid or diploid, the ploidy setting determines which genotype likelihoods need to be computed at every site |
| Priors | Uniform | Users can supply non-uniform genotype prior probability values if this information is available |
| Threads | 1 | The number of threads to be used in a parallelized analysis |

Table 5.1: The settings for multiGeMS SNP calling.

5.1 Data Preparation

Just as with GeMS, when initiating the multiGeMS SNP calling procedure, the user first determines certain settings and then the data is prepared before the actual procedure takes place. Though the data preparation in multiGeMS is similar to that of

GeMS, there are key differences. See table 5.1 for a summary of the available settings.

Generally multiple sample alignment data is available in multiple SAM or BAM format alignment files. Usually the aligned reads of each sample are located in the corresponding alignment file, i.e. each alignment file will usually contain alignment data for one and only one sample. Before converting these alignment files into the PILEUP format needed by multiGeMS, it is wise to filter out reads with undesirable alignment characteristics as explained in appendix C. However, in contrast to the usual behavior of SAM/BAM files, each PILEUP file must contain the alignment data for one and only one sample for multiGeMS to work properly.

As with GeMS, the first step that multiGeMS does to prepare the PILEUP data is to “clean” the allele pileups at each site. In particular, read start markers (with symbol “^”) and their associated read mapping quality score, read end markers (with symbol “$”), complete insertions (e.g. “+3AGT” indicates a 3bp insertion on the read of the previously listed allele) and complete deletions (e.g. “-4AGCT” indicates a 4bp deletion on the read of the previously listed allele) are given in the allele pileups but are not used by the multiGeMS procedure. After multiGeMS removes these objects, the allele pileups of each sample in consideration should be solely composed of reference (forward strand symbol “.” and reverse strand symbol “,”) and non-reference (forward strand symbols “ACGTN” and reverse strand symbols “acgtn”) alleles.

Then multiGeMS filters out sites where any sample has a high proportion (e.g. greater than 5%) of deletion placeholders (with symbol “*”), as these deletions, or misalignments that appear to be deletions, could result in false positive SNP calls. Also, any sample allele pileups longer than the user specified amount (the multiGeMS default is 255, which is also the value that SAMtools hardcodes into its model [37]) are randomly trimmed down to that amount. To further save on computing resources, multiGeMS only considers sites where any sample could possibly harbor a SNP. In particular, only sites where at least one sample has an allele pileup whose reference proportion is less than the user specified amount (default 80%) are considered during the multiGeMS procedure.

Now that the list of potential SNP sites has been established, the sample allele pileups associated with these sites undergo additional filtering. First, all “N” or “n” base-calls and the remaining deletion placeholders are removed as these are not factored into the multiGeMS likelihood model. Second, all alleles with a base-calling or mapping quality Phred score lower than the user specified amounts (the multiGeMS defaults of 17 and 20, respectively, are shared by many other SNP callers) are discarded from the sample allele pileups. This is done to prevent low quality data from misguiding the model. Finally, the SNP candidate sites with their refined sample allele pileups and associated base-calling and mapping quality scores are analyzed by the multiGeMS SNP calling procedure.
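The cleaning and site screening steps described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the multiGeMS implementation: the function names are hypothetical, the proportions are assumed to be taken over all symbols remaining in a cleaned pileup, and the random trimming and quality filtering steps are omitted.

```python
import re


def clean_pileup(bases):
    """Strip read-start markers (with their mapping-quality character), read-end
    markers and indel annotations, leaving one symbol per aligned read."""
    kept = []
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":                 # read start: skip marker and the quality character
            i += 2
        elif ch == "$":               # read end
            i += 1
        elif ch in "+-":              # indel: sign, length digits, then that many bases
            match = re.match(r"\d+", bases[i + 1:])
            if match is None:
                raise ValueError(f"indel at offset {i} has no length")
            i += 1 + len(match.group()) + int(match.group())
        else:
            kept.append(ch)
            i += 1
    return "".join(kept)


def passes_site_filters(sample_pileups, max_deletion=0.05, max_reference=0.8):
    """Site-level screens on cleaned pileups (one string per sample)."""
    covered = [p for p in sample_pileups if p]
    # Reject the site if any sample has too many deletion placeholders.
    if any(p.count("*") / len(p) > max_deletion for p in covered):
        return False
    # Keep the site only if some sample has a low enough reference proportion.
    return any((p.count(".") + p.count(",")) / len(p) < max_reference for p in covered)


cleaned = clean_pileup("^F.,$+2AGa-1Ct*")
```

Here `cleaned` is `".,at*"`: the read-start marker with its quality character, the read-end marker and both indel annotations are dropped, while the deletion placeholder is kept for the proportion check.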

5.2 Procedure

To begin a description of the details of the multiGeMS procedure, let us use the notation from table 5.2 and assume that we are working with multiple samples of

HTS data from the same species of diploid organism. As before, the location superscript l will be suppressed in the following discussion as the same procedure is applied to all possible SNP locations.

Recall from section 3.2 that the SNP calling procedures of both SAMtools and GATK revolve around the summary statistic defined in equation 3.10. This statistic simply estimates the non-reference alleles in the sample genotypes, hence sample genotype estimates themselves are not used in calling SNPs. In contrast, multiGeMS uses genotype probabilities, both over all samples and with respect to each sample, to call SNPs.

| Notation | Explanation |
|---|---|
| $s$ | number of samples |
| $i \in \{1, 2, \ldots, s\}$ | sample index |
| $l$ | location on the reference genome |
| $n_i^l$ | number of reads of sample $i$ aligned to location $l$ |
| $j \in \{1, 2, \ldots, n_i^l\}$ | index of sample $i$ reads aligned to location $l$ |
| $X_{ij}^l$ | sample $i$ observed allele on read $j$ aligned to location $l$ |
| $B_{ij}^l$ | base-calling quality score of $X_{ij}^l$ |
| $M_{ij}^l$ | mapping quality score of $X_{ij}^l$ |
| $w_{ij}^l$ | “weight” of $X_{ij}^l$, $\left(= 1 - 10^{-0.1 \min\{B_{ij}^l, M_{ij}^l\}}\right)$ |
| $D_i^l$ | observed allele data of sample $i$ at location $l$, $= \{X_{ij}^l, B_{ij}^l, M_{ij}^l : j \in \{1, 2, \ldots, n_i^l\}\}$ |
| $G_i^l$ | unobserved genotype of sample $i$ at location $l$ |
| $Y_{ij}^l$ | unobserved original allele associated with $X_{ij}^l$ |
| $g \in \{AA, \ldots, GT\}$ | index for the 10 possible diploid genotypes |
| $p_g^l$ | $= P(G_i^l = g)$, the probability that $g$ is the true genotype at location $l$; this probability is the same for all samples $i$ and is thus estimated over all the samples |
| $k \in \{A, C, G, T\}$ | index for the 4 DNA nucleotides |
| $q_k^{g,l}$ | $= P(Y_{ij}^l = k \mid g)$, where $q_k^{g,l}$ is a function of 1 $q$ parameter estimated over samples |
| $\theta^l$ | set of 11 parameters, 10 $p_g^l$ and 1 $q^l$ in the $q_k^{g,l}$ function, at location $l$ |
| $(t)$ | iteration index superscript on parameter estimates |
| $e_{i,g}^{l,(t)}$ | $= P(G_i^l = g \mid D_i^l, \theta^{l,(t)})$; this probability is different from $p_g^l$ in that it is conditioned on $(D_i^l, \theta^{l,(t)})$ and thus is different for different samples $i$ |
| $\ast$ | superscript for final iteration value |
| $\pi_{i,g}^l$ | genotype prior probability, uniform by default |
| $\mathrm{lFDR}^l$ | statistic which determines whether location $l$ is to be called a SNP, borrowed from “Cross-Sample” |
| $c_i^l$ | coverage weight of sample $i$ at location $l$, used in calculation of $\widehat{\mathrm{lFDR}}^l$ |
| $\phi$ | lFDR threshold value, call SNP if $\widehat{\mathrm{lFDR}}^l \le \phi$ |

Table 5.2: multiGeMS model notation. The l superscript is suppressed in the following for convenience.

5.2.1 Single Sample GeMS Review

For each diploid sample, $G_i$ has a categorical distribution with possible outcomes $G_i = g \in \{AA, CC, GG, TT, AC, AG, AT, CG, CT, GT\}$. Without conditioning on the sample data $\{D_i\}$ or any estimated parameters in $\theta$, each $G_i$ for $i = 1, 2, \ldots, s$ has the same categorical probabilities $p = \{p_g\} = \{P(G_i = g)\}$, where $\sum_g p_g = 1$. $D_i \mid G_i$ follows the distribution below, which was originally given, without reference to sample $i$, as the single sample GeMS model likelihood in equation (4.4).

\[
\begin{aligned}
L(q^g) &= P(D_i \mid G_i = g, q) \\
&= \prod_{j=1}^{n_i} P(X_{ij} \mid G_i = g, q) \\
&= \prod_{j=1}^{n_i} \sum_{k \in \{A,C,G,T\}} \big[ P(X_{ij} \mid Y_{ij} = k) \big] \, P(Y_{ij} = k \mid G_i = g, q) \\
&= \prod_{j=1}^{n_i} \sum_{k \in \{A,C,G,T\}} \left[ w_{ij}^{I(X_{ij} = k)} \left( \frac{1 - w_{ij}}{3} \right)^{I(X_{ij} \neq k)} \right] q_k^g
\end{aligned}
\tag{5.1}
\]

Recall that in single sample GeMS, the called consensus genotype is defined in equation (4.5) to be $\operatorname{argmax}_g L(\hat{q}^g)\pi_g$. Furthermore, a SNP is called if the called consensus genotype differs from the reference genotype and the Dixon’s Q test is significant for $P_{(10)} = \max_g P(G_i = g \mid D_i)$.
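As a concrete illustration of equation (5.1), the following Python sketch evaluates the log-likelihood for the simpler haploid case, where $q_k^g$ is $1 - 3q$ when $k = g$ and $q$ otherwise (table 4.17), with the read weight $w_{ij} = 1 - 10^{-0.1 \min\{B_{ij}, M_{ij}\}}$ from table 5.2. The function names, the tuple layout of `reads` and the example values are assumptions made for illustration only.

```python
import math

NUCLEOTIDES = "ACGT"


def weight(base_q, map_q):
    """w = 1 - 10^(-0.1 * min(B, M)), with B and M given as Phred scores."""
    return 1.0 - 10.0 ** (-0.1 * min(base_q, map_q))


def log_likelihood_haploid(reads, g, q):
    """log P(D_i | G_i = g, q) for one sample, haploid version of equation (5.1).

    reads: list of (allele, base_quality, mapping_quality) tuples.
    """
    total = 0.0
    for allele, base_q, map_q in reads:
        w = weight(base_q, map_q)
        per_read = 0.0
        for k in NUCLEOTIDES:
            emit = w if allele == k else (1.0 - w) / 3.0    # P(X = allele | Y = k)
            q_kg = 1.0 - 3.0 * q if k == g else q           # q_k^g from table 4.17
            per_read += emit * q_kg
        total += math.log(per_read)
    return total


# Hypothetical pileup: three reads supporting A and one supporting C.
reads = [("A", 30, 30)] * 3 + [("C", 30, 30)]
ll_a = log_likelihood_haploid(reads, "A", q=0.01)
ll_c = log_likelihood_haploid(reads, "C", q=0.01)
```

With these inputs the genotype A receives the larger log-likelihood, as expected when most reads carry that allele. Working on the log scale avoids underflow when the coverage is high.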

5.2.2 EM Algorithm

As described in appendix section A, the EM algorithm is a powerful tool to

iteratively estimate unknown quantities. The SNP calling procedure of multiGeMS uses

the EM algorithm to estimate the sample genotype probabilities (ei,g = P (Gi = g|Di, θ)) and the q parameter at each potential SNP site. The following will contain a discussion of the E-step, M-step, initial values and convergence of the EM algorithm.

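Assuming that the samples, and thus their respective genotypes ($\{G_i\}$) and sample data ($\{D_i\}$), are independent, the complete multiGeMS log-likelihood with unobserved data $G = (G_1, G_2, \ldots, G_s)$ is as follows.

\[
\begin{aligned}
l(\theta \mid D, G) &= \log P(D, G \mid \theta) \\
&= \log P(D \mid G, \theta) \, P(G \mid \theta) \\
&= \log P(D_1, D_2, \ldots, D_s \mid G, \theta) \prod_{i=1}^{s} P(G_i \mid \theta) \\
&= \log \prod_{i=1}^{s} \Big[ P(D_i \mid G_i, q) \, P(G_i \mid \theta) \Big] \\
&= \log \prod_{i=1}^{s} \prod_{g \in \{AA, \ldots, GT\}} \Big[ P(D_i \mid G_i = g, q) \, P(G_i = g \mid p_g) \Big]^{I(G_i = g)} \\
&= \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} I(G_i = g) \Big\{ \log P(D_i \mid G_i = g, q) + \log p_g \Big\}
\end{aligned}
\tag{5.2}
\]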

To explain the second line of equation (5.2), we first note a consequence of conditional probabilities, that $P(A, B) = P(A \mid B)P(B)$. From this we can show that $P(A, B, C) = P(A \mid B, C)P(B \mid C)P(C)$, and then we see that $P(A, B \mid C) = \frac{P(A, B, C)}{P(C)} = P(A \mid B, C)P(B \mid C)$. For the fourth line, note that $P(D_i \mid G, \theta) = P(D_i \mid G_i, q)$, that is to say that $D_i$ depends on $G_i$ and $q$ in $(G, \theta)$, but not on the other variables within. For the fifth and sixth lines, recall that $P(G_i)$ is a categorical density where $p_g = P(G_i = g)$. Finally, and with great importance, $P(D_i \mid G_i = g, q)$ is given by equation (5.1).

5.2.2.1 E-step

The E-step begins with the definition of the Q function, which is conditioned on the data D and the tth parameter estimate of θ = {p, q} which is notated as θ(t).

The Q function is given in the following.

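\[
\begin{aligned}
Q(\theta \mid \theta^{(t)}) &= E_{G \mid D, \theta^{(t)}} \Big[ l(\theta \mid D, G) \Big] \\
&= E_{G \mid D, \theta^{(t)}} \Bigg[ \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} I(G_i = g) \Big\{ \log P(D_i \mid G_i = g, q) + \log p_g \Big\} \Bigg] \\
&= \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} E_{G \mid D, \theta^{(t)}} \Big[ I(G_i = g) \Big\{ \log P(D_i \mid G_i = g, q) + \log p_g \Big\} \Big] \\
&= \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} E \Big[ I(G_i = g) \,\Big|\, D, \theta^{(t)} \Big] \Big\{ \log P(D_i \mid G_i = g, q) + \log p_g \Big\} \\
&= \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} E \Big[ I(G_i = g) \,\Big|\, D_i, \theta^{(t)} \Big] \Big\{ \log P(D_i \mid G_i = g, q) + \log p_g \Big\} \\
&= \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} e_{i,g}^{(t)} \Big\{ \log P(D_i \mid G_i = g, q) + \log p_g \Big\}
\end{aligned}
\tag{5.3}
\]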

It is noted that the conditional expectation of the complete log-likelihood focuses on the random $I(G_i = g)$. This conditional expectation, notated with $e_{i,g}^{(t)}$, is crucial to the E-step and is estimated in the following way.

\[
\begin{aligned}
e_{i,g}^{(t)} &= E \Big[ I(G_i = g) \,\Big|\, D, \theta^{(t)} \Big] \\
&= P(G_i = g \mid D_i, \theta^{(t)}) \\
&= \frac{P(G_i = g, D_i \mid \theta^{(t)})}{P(D_i \mid \theta^{(t)})} \\
&= \frac{P(D_i \mid G_i = g, q^{(t)}) \, P(G_i = g \mid p_g^{(t)})}{\sum_h P(D_i \mid G_i = h, q^{(t)}) \, P(G_i = h \mid p_h^{(t)})} \\
&= \frac{P(D_i \mid G_i = g, q^{(t)}) \, p_g^{(t)}}{\sum_h P(D_i \mid G_i = h, q^{(t)}) \, p_h^{(t)}}
\end{aligned}
\tag{5.4}
\]

To explain the third line, earlier it was shown that $P(A, B \mid C) = P(A \mid B, C)P(B \mid C)$ and thus $P(A \mid B, C) = \frac{P(A, B \mid C)}{P(B \mid C)}$. On the fourth line, we see an application of $P(A, B \mid C) = P(B \mid A, C)P(A \mid C)$ and the law of total probability partitioning on the genotypes $g$. Finally, we note that $\sum_g e_{i,g}^{(t)} = 1$ for $i \in \{1, 2, \ldots, s\}$.
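Equation (5.4) amounts to normalizing, within each sample, the products of genotype likelihood and genotype probability. A minimal Python sketch is given below, assuming the log-likelihoods $\log P(D_i \mid G_i = g, q^{(t)})$ have already been computed. The nested-list layout and the function name `e_step` are illustrative choices, and the subtraction of the row maximum is a numerical safeguard added in this sketch rather than part of the derivation.

```python
import math


def e_step(log_lik, p):
    """Equation (5.4): e[i][g] is proportional to P(D_i | G_i = g, q) * p[g].

    log_lik[i][g] holds log P(D_i | G_i = g, q^(t)); p[g] holds p_g^(t).
    """
    e = []
    for row in log_lik:
        terms = [ll + math.log(pg) if pg > 0 else float("-inf")
                 for ll, pg in zip(row, p)]
        top = max(terms)
        unnorm = [math.exp(t - top) for t in terms]   # subtract the max to avoid underflow
        total = sum(unnorm)
        e.append([u / total for u in unnorm])         # each row sums to 1
    return e


# Two genotypes and one sample, purely for illustration.
e = e_step([[math.log(0.2), math.log(0.8)]], [0.5, 0.5])
```

With equal genotype probabilities the result is just the normalized likelihoods, which is also the form used for the initial values in equation (5.12).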

5.2.2.2 M-step

The M-step is usually described with the following statement.

\[
\theta^{(t+1)} = \operatorname*{argmax}_{\theta} Q(\theta \mid \theta^{(t)}) \tag{5.5}
\]

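Since $\theta = \{p, q\}$, we can divide the overall M-step into 11 different statements (10 for the $\{p_g\}$ and 1 for $q$) of the M-step. Starting with the M-step for $q$, we have the following.

\[
\begin{aligned}
q^{(t+1)} &= \operatorname*{argmax}_{q} Q(\theta \mid \theta^{(t)}) \\
&= \operatorname*{argmax}_{q} \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} e_{i,g}^{(t)} \log P(D_i \mid G_i = g, q)
\end{aligned}
\tag{5.6}
\]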

Since maximizing the Q function over q is not affected by pg, listing the log pg part of the Q function is not necessary. As there is no apparent closed form for q(t+1), we can just maximize the Q function over q to determine q(t+1).

The M-step for each of the $p_g$ parameters has the same form, which depends on the genotype $g$. As before, maximizing the Q function over $p_g$ is not affected by $q$ and so listing the $\log P(D_i \mid G_i = g, q)$ part of the Q function is not necessary. The M-step for the $p_g$ is as follows.

\[
\begin{aligned}
p_g^{(t+1)} &= \operatorname*{argmax}_{p_g} Q(\theta \mid \theta^{(t)}) \\
&= \operatorname*{argmax}_{p_g} \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} e_{i,g}^{(t)} \log p_g
\end{aligned}
\tag{5.7}
\]

One way to solve equation (5.7) is to use the Lagrange multipliers technique. By taking into account the constraint $\sum_g p_g = 1$, we have the following Λ function.

\[
\Lambda(p_g) = \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} e_{i,g}^{(t)} \log p_g - \lambda \Big( \sum_g p_g - 1 \Big) \tag{5.8}
\]

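Differentiating equation (5.8) with respect to $\lambda$ and $p_g$ yields the following system of equations.

\[
\sum_g p_g = 1 \quad \text{and} \quad p_g = \frac{\sum_{i=1}^{s} e_{i,g}^{(t)}}{\lambda} \tag{5.9}
\]

Solving equation (5.9) for $\lambda$ yields the following expression.

\[
\lambda = \sum_g \sum_{i=1}^{s} e_{i,g}^{(t)} \tag{5.10}
\]

Thus finally, the closed form expression for $p_g^{(t+1)}$ is given in the following.

\[
\begin{aligned}
p_g^{(t+1)} &= \frac{\sum_{i=1}^{s} e_{i,g}^{(t)}}{\sum_h \sum_{i=1}^{s} e_{i,h}^{(t)}} \\
&= \frac{\sum_{i=1}^{s} e_{i,g}^{(t)}}{\sum_{i=1}^{s} \sum_h e_{i,h}^{(t)}} \\
&= \frac{\sum_{i=1}^{s} e_{i,g}^{(t)}}{s} \quad \text{for } g \in \{AA, \ldots, GT\}
\end{aligned}
\tag{5.11}
\]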

The last line of equation (5.11) is a consequence of the fact that $\sum_g e_{i,g}^{(t)} = 1$ for $i \in \{1, 2, \ldots, s\}$. Also, this line indicates that $p_g^{(t+1)}$ is equal to the average of the $e_{i,g}^{(t)}$ values over the $s$ samples.
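Both halves of the M-step can be sketched in Python. The update for $p$ follows equation (5.11) directly. For $q$, which has no closed form, the sketch below substitutes a golden-section search; this is one possible numerical maximizer and not necessarily the one multiGeMS uses. It assumes the objective $q \mapsto \sum_i \sum_g e_{i,g}^{(t)} \log P(D_i \mid G_i = g, q)$ has a single peak on the search interval, here taken to be (0, 0.25). The function names are illustrative.

```python
import math


def update_p(e):
    """Equation (5.11): p_g^(t+1) is the mean of e[i][g] over the s samples."""
    s = len(e)
    return [sum(row[g] for row in e) / s for g in range(len(e[0]))]


def update_q(objective, lo=1e-9, hi=0.25 - 1e-9, tol=1e-8):
    """Maximize objective(q) on (lo, hi) by golden-section search (assumes one peak)."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - ratio * (b - a), a + ratio * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc < fd:                    # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + ratio * (b - a)
            fd = objective(d)
        else:                          # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - ratio * (b - a)
            fc = objective(c)
    return (a + b) / 2.0


p_new = update_p([[0.9, 0.1], [0.5, 0.5]])
q_new = update_q(lambda q: -(q - 0.1) ** 2)   # toy objective peaking at q = 0.1
```

In an actual run the toy objective would be replaced by the weighted log-likelihood sum of equation (5.6), evaluated with the current $e_{i,g}^{(t)}$ values.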

5.2.2.3 Initial Values

To determine a “smart” initial value that will quickly lead to convergence in

the EM algorithm, one can either begin in the E-step or the M-step. The following is a

demonstration as to how one can begin the EM algorithm on the E-step.

It seems reasonable to base our $q^{(1)}$ and $\{p_g^{(1)}\}$ parameters on the $e^{(0)} = \big[ P(G_i = g \mid D_i) \big]_{(i,g)}$ array. Appealing to equation (5.4) and initially assuming that all the genotype probabilities are equal (i.e. $p_{AA}^{(0)} = p_{CC}^{(0)} = \cdots = p_{GT}^{(0)}$), we can set the following definition.

\[
e_{i,g}^{(0)} = \frac{P(D_i \mid G_i = g, \hat{q}_i^g)}{\sum_h P(D_i \mid G_i = h, \hat{q}_i^h)} \tag{5.12}
\]

Here we see that we are essentially applying the single sample GeMS model to each sample. As in the single sample GeMS model, $\hat{q}_i^g$ is determined by maximizing $P(D_i \mid G_i = g, q)$ over $q \in (0, 0.25)$. From equation (5.12), we can get the following initial parameter values for $q$ and $p_g$.

\[
q^{(1)} = \operatorname*{argmax}_{q} \sum_{i=1}^{s} \sum_{g \in \{AA, \ldots, GT\}} e_{i,g}^{(0)} \log P(D_i \mid G_i = g, q) \tag{5.13}
\]

\[
p_g^{(1)} = \frac{\sum_{i=1}^{s} e_{i,g}^{(0)}}{s} \quad \text{for } g \in \{AA, \ldots, GT\} \tag{5.14}
\]

\[
e^{(0)} \rightarrow \theta^{(1)} \rightarrow e^{(1)} \rightarrow \cdots \rightarrow e^{*} \rightarrow \theta^{*}
\]

Figure 5.1: A graphical sequence of iterations between the E-step, which updates the e(t), and the M-step, which updates the θ(t), until convergence, which is indicated with an asterisk (∗), is reached.

5.2.2.4 Convergence

Given the initial values in equations (5.13) and (5.14), we can continue to iterate between the E-step and the M-step until convergence is met. If the final iteration value is to be indicated with an asterisk (∗), then it can be seen in figure 5.1 that θ will go through the same number of iterations as e.

\[
\left| \begin{pmatrix} p^{(t)} \\ q^{(t)} \end{pmatrix} - \begin{pmatrix} p^{(t-1)} \\ q^{(t-1)} \end{pmatrix} \right| \tag{5.15}
\]

The state of convergence is determined after each estimation of the θ parameters. In multiGeMS, if the maximum component of equation (5.15) is greater than a user specified ε (which is $10^{-6}$ by default), then the iteration is continued once more. When this convergence criterion is no longer true, the EM algorithm is completed and the final $(e^*, \theta^*)$ estimates are retained.
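The iteration and stopping rule can be sketched as a small Python driver. It assumes θ is stored as a flat list $(p_{AA}, \ldots, p_{GT}, q)$ and takes the E-step and M-step as caller-supplied functions. The name `run_em` and its signature are hypothetical, while the defaults for `eps` and `max_iter` mirror the settings listed in table 5.1.

```python
def run_em(e0, e_step, m_step, eps=1e-6, max_iter=10_000):
    """Alternate M- and E-steps from e^(0) until no parameter moves by more than eps.

    e_step(theta) returns the e array; m_step(e) returns theta as a flat list.
    """
    e = e0
    theta = m_step(e)                          # theta^(1)
    for _ in range(2, max_iter + 1):
        e = e_step(theta)                      # e^(t-1) from theta^(t-1)
        new_theta = m_step(e)                  # theta^(t)
        change = max(abs(a - b) for a, b in zip(new_theta, theta))
        theta = new_theta
        if change <= eps:                      # largest component no longer exceeds eps
            break
    return e, theta


# Toy stand-ins with a fixed point at theta = [2.0], only to exercise the loop.
e_final, theta_final = run_em(0.0,
                              e_step=lambda theta: theta[0],
                              m_step=lambda e: [0.5 * e + 1.0])
```

The toy functions are not a genotype model; they simply contract toward a fixed point so that the stopping rule can be seen to trigger.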

5.2.3 SNP and Consensus Calling

Assuming that convergence has been reached in θ, and that no genotype prior probability information has been specified, then the consensus genotype call for sample i is as follows.

\[
\operatorname*{argmax}_{g} e_{i,g}^{*} = \operatorname*{argmax}_{g} P(D_i \mid G_i = g, q^{*}) \, p_g^{*} \tag{5.16}
\]

However, if prior information is specified in the form of $\pi_{i,g}$, then the consensus genotype call for sample i is given as follows.

\[
\operatorname*{argmax}_{g} P(D_i \mid G_i = g, q^{*}) \, p_g^{*} \, \pi_{i,g} \tag{5.17}
\]

As described in section 3.2.3, the “Cross-Sample” parameter $\hat{\delta}_i$ is equivalent to 1 − P(the genotype on sample i is homozygous reference). With respect to the notation used by multiGeMS, this value is equivalent to $1 - e_{i,\text{ref}}$, where g = ref indicates the reference nucleotide at the particular location of interest on the reference genome. Since the multiGeMS parameter estimates are compatible with the “Cross-Sample” lFDR estimator as given in equation (3.19) and as seen in equation (5.18), this estimator is used by multiGeMS to call SNPs.

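\[
\begin{aligned}
\widehat{\mathrm{lFDR}} &= \exp \left[ \frac{\sum_i c_i \log(1 - \hat{\delta}_i)}{\sum_i c_i} \right] \\
&= \exp \left[ \frac{\sum_i c_i \log(e_{i,\text{ref}}^{*})}{\sum_i c_i} \right]
\end{aligned}
\tag{5.18}
\]

multiGeMS calculates $\widehat{\mathrm{lFDR}}$ for all possible SNP sites and then calls as SNPs those sites where $\widehat{\mathrm{lFDR}} \le \phi$. As with “Cross-Sample”, the multiGeMS default value for $\phi$ is 0.1.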
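Equation (5.18) is a coverage-weighted geometric mean of the per-sample homozygous reference probabilities. A minimal Python sketch follows, assuming the converged $e_{i,\text{ref}}^{*}$ values and the coverage weights $c_i$ are supplied as plain lists. The function names and the handling of zero probabilities are illustrative choices.

```python
import math


def lfdr_hat(e_ref, weights):
    """Equation (5.18): exp of the coverage-weighted mean of log e*_{i,ref}."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("no sample carries a positive coverage weight")
    log_sum = 0.0
    for c, e in zip(weights, e_ref):
        if c == 0:
            continue                 # zero-weight samples do not contribute
        if e <= 0.0:
            return 0.0               # log(0) = -inf, so the weighted mean collapses to 0
        log_sum += c * math.log(e)
    return math.exp(log_sum / total_weight)


def is_snp(e_ref, weights, phi=0.1):
    """Call a SNP when the estimated lFDR does not exceed the threshold phi."""
    return lfdr_hat(e_ref, weights) <= phi


# Hypothetical values for two samples at one site.
strong = lfdr_hat([0.01, 0.01], [10, 10])
mixed = lfdr_hat([0.9, 0.001], [30, 10])
```

In the second example one well-covered sample that looks homozygous reference outweighs a lightly covered sample carrying the variant, so the estimate stays above the default threshold of 0.1.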

It is of interest to note that the “Cross-Sample” article states that its chosen coverage weights $c_i$ were arbitrarily made. Thus the coverage weights that multiGeMS utilizes are different and are given in the following.

\[
c_i =
\begin{cases}
0, & \text{if } n_i < 1 \\
n_i, & \text{if } 2 \le n_i \le 200 \\
200, & \text{if } n_i > 200
\end{cases}
\tag{5.19}
\]

5.3 Validation

As in section 4.3, it is of interest to see how multiGeMS compares to other popular SNP callers. Thus in the following, a simulation study and a real data analysis will indicate that multiGeMS has a good performance balance of sensitivity and positive predictive value among the tested SNP callers. As before, it is difficult to objectively compare the different SNP callers with both simulated and real data, and so all the program settings and options will be given for reproducibility and transparency.

5.3.1 Simulation Analysis

The main motivation for the simulation analysis was to see how multiGeMS would compare to other SNP callers when the simulated data is relatively good. Thus many of the HTS data simulator’s options (see table 5.3) are kept at their default values, such as the average coverage being at 100x.

It is of interest to see how multiGeMS reacts to simulated population level, group level and individual SNPs. Thus the simulated samples are given the structure illustrated in figure 5.2. The starting point for this structure is the first 1,000,000 bases of the reference genome of the haploid bacterial species Thermoanaerobacter sp. X514.

A commonly stated figure for the frequency of SNPs is 1 in 1,000 sites, so it is with this assumption that we can add SNPs to the population, group and individual levels.

In particular, 4 in 10,000 sites are given population level SNPs i.e. SNPs that all the samples had. In addition, there are 5 groups of 5 samples each, for a total of 25 samples.

Each of these groups features group level SNPs at a rate of 4 per 10,000 sites. The population and group level SNPs are all homozygous SNPs since these SNPs are made to the reference genomes which would be used to generate the HTS data. Finally, individual variants are added to each of the samples at a rate of 2 per 10,000 sites. 10% of these individual variants are simulated as indels, whereas the remaining 90% are simulated as both homozygous and heterozygous SNPs.
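As a rough check on these rates, the expected variant counts per sample can be worked out as follows. This is a back-of-the-envelope calculation which assumes that the stated rates apply uniformly over the 1,000,000 base reference and which ignores sites that happen to be hit at more than one level.

```latex
\[
\begin{aligned}
\text{population level SNPs} &\approx 10^{6} \times 4 \times 10^{-4} = 400 \\
\text{group level SNPs} &\approx 10^{6} \times 4 \times 10^{-4} = 400 \\
\text{individual variants} &\approx 10^{6} \times 2 \times 10^{-4} = 200
  \quad (\approx 20 \text{ indels and } 180 \text{ SNPs}) \\
\text{total SNPs per sample} &\approx 400 + 400 + 180 = 980 \approx 10^{-3} \times 10^{6}
\end{aligned}
\]
```

The total of roughly 980 SNPs per sample is in line with the commonly stated frequency of about 1 SNP in 1,000 sites.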

The results of the simulation study are listed in table 5.4. Four SNP callers, namely GATK, piCALL, VarScan and multiGeMS, all have harmonic means of sensitivity and PPV that round to 0.99, so all four do very well on this simulated dataset. In contrast, the other four SNP callers do either reasonably well (SAMtools and single sample GeMS) or very badly (Bambino and FreeBayes). Single sample GeMS was included in this simulation study to demonstrate that it is better to run multiGeMS and simultaneously account for all the samples than to run single sample GeMS on each of the 25 datasets individually (all of these single sample GeMS SNP calls are pooled for the combined 25 sample SNP call results).

Simulation Study: Specifications and non-default options used
Reference organism: Thermoanaerobacter sp. X514 (truncated to first 1Mbp)
FASTA variants: MAQ-0.7.1 fakemut to add population SNPs with options -r 0.0004 -R 0 and group SNPs with options -r 0.0004 -R 0; individual sample SNPs added with dwgsim-0.1.10 below
Simulation options: dwgsim-0.1.10 -e 0.00006-0.03 -E 0.00006-0.03 -r 0.0002; the default coverage value of 100x was utilized
Sample size: 25
Ploidy: X514 is haploid but dwgsim-0.1.10 was implemented in its default diploid mode
Read Length: 70bp
Mate pair: Paired-end
Quality encoding: Illumina 1.8+
Alignment options: BWA-0.5.9 aln, sampe and SAMtools-0.1.18 view -F 1792 -q 20
Realignment options: GenomeAnalysisTK-1.0.5336 -T RealignerTargetCreator, -T IndelRealigner
Pileup options: SAMtools-0.1.18 mpileup -q 20 -Q 17 -Bs
multiGeMS options: -e 0.000001 -n 0.1
GeMS options: α = 0.05, single sample SNP call results pooled
GATK options: GenomeAnalysisTK-1.0.5336 -T UnifiedGenotyper -mbq 17 -mmq 20
piCALL options: piCALL.v01 --rl 71 --qvoffset 33 --mbq 17
VarScan options: VarScan.v2.2.8 --min-coverage 1 --min-reads2 1 --min-avg-qual 17 --output-vcf 1 --variants
SAMtools options: SAMtools-0.1.18 mpileup -q 20 -Q 17 -g, bcftools view -vcg
Bambino options: Bambino1.04 -min-quality 17 -min-mapq 20 -min-coverage 1 -min-alt-allele-count 1
FreeBayes options: FreeBayes-0.9.0 --min-mapping-quality 20 --min-base-quality 17 --min-supporting-quality 20,17

Table 5.3: Options used in the simulation, alignment and SNP calling of the simulated multiple sample data. SAMtools view is just one procedure for refining the alignment file for SNP calling or any other downstream analyses. For more information on using SAMtools view to filter alignment files, see appendix C. The program options were chosen in the interest of fairness, objectivity and reasonableness. The default options of a program were preferred, unless these options were unrealistic, incompatible with the analysis or gave an advantage to any SNP caller.

[Figure 5.2 diagram: the X514 reference (1Mbp) receives population SNPs at rate 0.0004; five groups A-E each receive group SNPs at rate 0.0004; each group contains five samples (A: 1-5, B: 6-10, C: 11-15, D: 16-20, E: 21-25), and each sample receives individual variants at rate 0.0002, of which 10% are indels.]

Figure 5.2: The samples generated in the multiGeMS simulation study.
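The summary statistic used to rank the SNP callers in table 5.4 can be sketched as follows. This is a minimal Python illustration, assuming the usual definitions of sensitivity (TP / (TP + FN)) and PPV (TP / (TP + FP)); the function names are hypothetical, and the input values are the rounded sensitivity and PPV figures from table 5.4, so the recomputed harmonic means may differ from the table in the last digit.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of true SNPs that were called."""
    return true_positives / (true_positives + false_negatives)


def ppv(true_positives, false_positives):
    """Fraction of called SNPs that are true SNPs."""
    return true_positives / (true_positives + false_positives)


def harmonic_mean(sens, ppv_value):
    """Harmonic mean of sensitivity and PPV; 0.0 if both are zero."""
    if sens + ppv_value == 0:
        return 0.0
    return 2 * sens * ppv_value / (sens + ppv_value)


# (sensitivity, PPV) pairs as reported, already rounded, in table 5.4.
reported = {
    "GATK": (0.9796, 0.9996),
    "multiGeMS": (0.9739, 0.9980),
    "GeMS": (0.9821, 0.8697),
}

if __name__ == "__main__":
    for caller, (sens, ppv_value) in reported.items():
        print(f"{caller}: {harmonic_mean(sens, ppv_value):.4f}")
```

The harmonic mean is pulled toward the smaller of its two inputs, which is why single sample GeMS, despite having the highest sensitivity in the table, ranks below callers with a better PPV.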

5.3.2 Real Data Analysis

For the real data analysis, 50 samples of alignment data from the 1000 Genomes Project [1] were downloaded. Since a substantial range of coverage values was sought, “exome capture” data is used. In particular, this aligned “exome capture” data was restricted to sites 60,001-1,060,000 on chromosome 20. The first 60,000 sites on chromosome 20 are excluded because these sites contain ‘N’ nucleotides in the Human hs37d5.fa genome reference that the HTS read data was aligned to. The full program options and settings are given in table 5.5. The sample identity information is given in figures 5.3 and 5.4.

SNP caller   Count   Sensitivity   PPV      Harmonic Mean
GATK         6671    0.9796        0.9996   0.9895
piCALL       6659    0.9781        0.9998   0.9889
VarScan      6655    0.9777        1.0000   0.9887
multiGeMS    6642    0.9739        0.9980   0.9858
SAMtools     6407    0.9412        1.0000   0.9697
GeMS         7687    0.9821        0.8697   0.9225
Bambino      2539    0.3421        0.9173   0.4984
FreeBayes    2447    0.3348        0.9313   0.4925

Table 5.4: multiGeMS simulation results summary. SNP caller count, sensitivity, PPV and harmonic mean of sensitivity and PPV, listed in descending order of the harmonic mean.

1000 Genomes Data: Specifications and non-default options used
Organism Reference: Human hs37d5.fa (∼3Gbp)
Sample size: 50, for sample identity information see figures 5.3 and 5.4
Ploidy: Diploid
Read Length: Various
Alignment options: Prealigned from BAM files from 1000 Genomes Project [1] data FTP server; added SAMtools-0.1.18 view -F 1792 -q 20 20:60001-1060000
Pileup options: SAMtools-0.1.18 mpileup -q 20 -Q 17 -Bs
multiGeMS options: -e 0.000001 -n 0.1
GeMS options: α = 0.05, single sample SNP call results pooled
piCALL options: piCALL.v01 --rl 100 --qvoffset 33 --mbq 17
VarScan options: VarScan.v2.2.8 --min-coverage 1 --min-reads2 1 --min-avg-qual 17 --output-vcf 1 --variants
GATK options: GenomeAnalysisTK-1.0.5336 -T UnifiedGenotyper -mbq 17 -mmq 20
SAMtools options: SAMtools-0.1.18 mpileup -q 20 -Q 17 -g, bcftools view -vcg
Bambino options: Bambino1.04 -min-quality 17 -min-mapq 20 -min-coverage 1 -min-alt-allele-count 1
FreeBayes options: FreeBayes-0.9.0 --min-mapping-quality 20 --min-base-quality 17 --min-supporting-quality 20,17

Table 5.5: Options used in the alignment and SNP calling of the 50 samples of the 1000 Genomes Project dataset. The 50 samples were chromosome 20 “exome capture” prealigned data (BAM files, not FASTQ files) truncated to sites 60,001-1,060,000. SAMtools view is just one procedure for refining the alignment file for SNP calling or any other downstream analyses. For more information on using SAMtools view to filter alignment files, see appendix C. The program options were chosen in the interest of fairness, objectivity and reasonableness. The default options of a program were preferred, unless these options were unrealistic, incompatible with the analysis or gave an advantage to any SNP caller.
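The SAMtools view filter -F 1792 -q 20 used in tables 5.3 and 5.5 can be unpacked with a short Python sketch. It assumes the FLAG bit meanings given in the SAM format specification and the documented behaviour of -F (drop reads with any of the given bits set) and -q (drop reads with mapping quality below the threshold); the function names are hypothetical and this is not SAMtools code.

```python
# FLAG bits as defined in the SAM format specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "PCR or optical duplicate",
}


def decode_flag(flag):
    """List the names of the FLAG bits that are set in an integer flag."""
    return [name for bit, name in sorted(SAM_FLAG_BITS.items()) if flag & bit]


def passes_filter(read_flag, mapq, exclude_flags=1792, min_mapq=20):
    """Mimic `view -F exclude_flags -q min_mapq` for a single read."""
    return (read_flag & exclude_flags) == 0 and mapq >= min_mapq


if __name__ == "__main__":
    print(decode_flag(1792))
    print(passes_filter(read_flag=99, mapq=60))    # ordinary paired read
    print(passes_filter(read_flag=1123, mapq=60))  # same read marked duplicate
```

Since 1792 = 256 + 512 + 1024, the filter removes non-primary alignments, reads that failed quality checks and duplicates, and then keeps only reads with mapping quality of at least 20.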

HG00253.chrom20.ILLUMINA.bwa.GBR.exome.20111114.bam
HG00255.chrom20.ILLUMINA.bwa.GBR.exome.20111114.bam
HG00256.chrom20.ILLUMINA.bwa.GBR.exome.20111114.bam
HG00332.chrom20.ILLUMINA.bwa.FIN.exome.20111114.bam
HG00371.chrom20.ILLUMINA.bwa.FIN.exome.20111114.bam
HG00534.chrom20.ILLUMINA.bwa.CHS.exome.20111114.bam
HG00557.chrom20.ILLUMINA.bwa.CHS.exome.20111114.bam
HG00956.chrom20.ILLUMINA.bwa.CDX.exome.20111114.bam
HG01107.chrom20.ILLUMINA.bwa.PUR.exome.20111114.bam
HG01168.chrom20.ILLUMINA.bwa.PUR.exome.20111114.bam
HG01197.chrom20.ILLUMINA.bwa.PUR.exome.20111114.bam
HG01198.chrom20.ILLUMINA.bwa.PUR.exome.20111114.bam
HG01251.chrom20.ILLUMINA.bwa.CLM.exome.20111114.bam
HG01507.chrom20.ILLUMINA.bwa.IBS.exome.20111114.bam
HG01516.chrom20.ILLUMINA.bwa.IBS.exome.20111114.bam
HG01709.chrom20.ILLUMINA.bwa.IBS.exome.20111114.bam
HG01756.chrom20.ILLUMINA.bwa.IBS.exome.20111114.bam
HG01810.chrom20.ILLUMINA.bwa.CDX.exome.20111114.bam
HG01878.chrom20.ILLUMINA.bwa.KHV.exome.20111114.bam
HG01883.chrom20.ILLUMINA.bwa.ACB.exome.20111114.bam
HG01920.chrom20.ILLUMINA.bwa.PEL.exome.20111114.bam
HG02374.chrom20.ILLUMINA.bwa.CDX.exome.20111114.bam
HG02382.chrom20.ILLUMINA.bwa.CDX.exome.20111114.bam
HG02399.chrom20.ILLUMINA.bwa.CDX.exome.20111114.bam
HG02450.chrom20.ILLUMINA.bwa.ACB.exome.20111114.bam

Figure 5.3: First 25 samples used in 50 sample 1000 Genomes Project data analysis. Data can be downloaded from 1000 Genomes Project data FTP server.

NA06984.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
NA11920.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
NA12275.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
NA12828.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
NA18562.chrom20.ILLUMINA.bwa.CHB.exome.20111114.bam
NA18574.chrom20.ILLUMINA.bwa.CHB.exome.20111114.bam
NA18874.chrom20.ILLUMINA.bwa.YRI.exome.20111114.bam
NA18909.chrom20.ILLUMINA.bwa.YRI.exome.20111114.bam
NA18986.chrom20.ILLUMINA.bwa.JPT.exome.20111114.bam
NA19007.chrom20.ILLUMINA.bwa.JPT.exome.20111114.bam
NA19060.chrom20.ILLUMINA.bwa.JPT.exome.20111114.bam
NA19074.chrom20.ILLUMINA.bwa.JPT.exome.20111114.bam
NA19075.chrom20.ILLUMINA.bwa.JPT.exome.20111114.bam
NA19108.chrom20.ILLUMINA.bwa.YRI.exome.20111114.bam
NA19144.chrom20.ILLUMINA.bwa.YRI.exome.20111114.bam
NA19247.chrom20.ILLUMINA.bwa.YRI.exome.20111114.bam
NA19755.chrom20.ILLUMINA.bwa.MXL.exome.20111114.bam
NA20281.chrom20.ILLUMINA.bwa.ASW.exome.20111114.bam
NA20313.chrom20.ILLUMINA.bwa.ASW.exome.20111114.bam
NA20754.chrom20.ILLUMINA.bwa.TSI.exome.20111114.bam
NA20758.chrom20.ILLUMINA.bwa.TSI.exome.20111114.bam
NA20770.chrom20.ILLUMINA.bwa.TSI.exome.20111114.bam
NA20783.chrom20.ILLUMINA.bwa.TSI.exome.20111114.bam
NA20813.chrom20.ILLUMINA.bwa.TSI.exome.20111114.bam
NA20845.chrom20.ILLUMINA.bwa.GIH.exome.20111114.bam

Figure 5.4: Last 25 samples used in 50 sample 1000 Genomes Project data analysis. Data can be downloaded from 1000 Genomes Project data FTP server.

Before the results are presented, some background information should be given. dbSNP [60] is a well known catalog of common variants. Thus “dbSNP concordance” commonly refers to the percentage of SNPs in a particular SNP call set that are also present in the dbSNP database. Another important concept is the “transitions-transversions ratio”, usually abbreviated as the “Ti/Tv ratio”. To understand this concept, one needs to know that the A and G nucleotides are classified as purines and the C and T nucleotides are classified as pyrimidines. A transition is a mutation between purines (A ↔ G) or between pyrimidines (C ↔ T). A transversion is a mutation between a purine and a pyrimidine, in either direction (A ↔ C, A ↔ T, G ↔ C, G ↔ T). Counting each direction separately, there are 4 types of transitions and 8 types of transversions. Thus if mutations were made uniformly at random, the Ti/Tv ratio would settle around 0.5. However, partly because of the chemical structure of these nucleotides, the Ti/Tv ratio for the whole human genome has been empirically observed to be 2.0-2.1 [13]. Likewise, the Ti/Tv ratio for the human exome has been empirically observed to be 3.0-3.3 [13].
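The transition and transversion definitions above can be made concrete with a small Python sketch. It is only an illustration with hypothetical function names, not the VCFtools implementation; it classifies a substitution from its reference and alternate base and confirms that the 12 possible directed substitutions give a Ti/Tv ratio of 0.5.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    if ref == alt or ref not in "ACGT" or alt not in "ACGT":
        raise ValueError(f"not a valid substitution: {ref}>{alt}")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(substitutions):
    """Ti/Tv ratio of an iterable of (ref, alt) base pairs."""
    substitutions = list(substitutions)
    transitions = sum(1 for ref, alt in substitutions if is_transition(ref, alt))
    transversions = len(substitutions) - transitions
    if transversions == 0:
        return float("inf")
    return transitions / transversions


# All 12 directed substitutions between distinct bases, each counted once,
# which corresponds to mutations made uniformly at random.
ALL_SUBSTITUTIONS = list(permutations("ACGT", 2))

if __name__ == "__main__":
    print(ti_tv_ratio(ALL_SUBSTITUTIONS))
```

A call set enriched for false positives tends to drift toward this random value, which is why a low Ti/Tv ratio is commonly read as a sign of poor specificity.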

The dbSNP concordance and the Ti/Tv ratio (calculated with the VCFtools [12] --TsTv procedure) of multiGeMS and 7 other SNP callers are given in table 5.6. First, the dbSNP concordance of multiGeMS is the largest among the SNP callers. Second, multiGeMS does not have the Ti/Tv ratio closest to that expected of an exome dataset, namely around 3.0-3.3. Though piCALL has a higher Ti/Tv ratio, it also called far fewer SNPs, indicating less power when compared to multiGeMS. A possible reason why the Ti/Tv ratios are not near their expected range of 3.0-3.3 is that the “exome capture” of the read data, prior to alignment, does not capture exactly the exome. Additionally, this “exome capture” data was aligned to the entire human hs37d5.fa genome reference, not just the exon locations.

SNP caller   Count   dbSNP Concordance   Ti/Tv
multiGeMS    2383    0.9589              2.2955
piCALL       439     0.9567              2.7931
VarScan      1678    0.9547              2.2456
GATK         3461    0.9382              2.1260
SAMtools     4323    0.8418              1.8740
Bambino      5460    0.6405              1.4256
GeMS         26179   0.1980              1.4964
FreeBayes    40529   0.1213              0.6274

Table 5.6: multiGeMS 1000 Genomes Project data results summary. SNP caller count, concordance with dbSNP [60] and the transitions-transversions (Ti/Tv) ratio, listed in descending order of the dbSNP concordance.
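The dbSNP concordance reported in table 5.6 is the share of called SNP sites that also appear in dbSNP. A minimal Python sketch of that calculation is given below; representing a site as a (chromosome, position) tuple and the function name are assumptions made for illustration, and the toy sites are invented rather than taken from the analysis.

```python
def dbsnp_concordance(called_sites, dbsnp_sites):
    """Fraction of called SNP sites that are also present in dbSNP."""
    called = set(called_sites)
    if not called:
        raise ValueError("the call set is empty")
    return len(called & set(dbsnp_sites)) / len(called)


if __name__ == "__main__":
    # Invented toy data: sites are (chromosome, 1-based position) tuples.
    called = [("20", 60100), ("20", 60250), ("20", 61000), ("20", 62000)]
    known = [("20", 60100), ("20", 60250), ("20", 61000), ("20", 70000)]
    print(dbsnp_concordance(called, known))
```

Because the denominator is the size of the call set, a caller that reports many extra sites, as single sample GeMS and FreeBayes do in table 5.6, is penalized even if it also recovers most known SNPs.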

Chapter 6

Future Work

Essentially, all models are wrong, but some are useful.

–George Box [8]

As a consequence of the pithy statement above, statisticians continue to improve on existing models. Likewise, software engineers continue to work on software packages, even decades after their original release. SNP caller procedures are, of course, no different.

The GeMS, and by extension multiGeMS, procedures have much potential to become a part of a larger modular pipeline for the analysis of HTS data. To accomplish this goal, certain refinements will be beneficial. First, a listing of these possible refinements will be given. Second, an example of a future application of GeMS, that is GeMS applied to metagenomics data, will be explored.

6.1 Refinements

There are a few small refinements needed in the command-line user interface for multiGeMS. Another small refinement would be for multiGeMS to either give a user option for, or automatically identify, which Phred encoding the input data is provided in. A larger model-based refinement that can be incorporated into both single sample GeMS and multiGeMS is the ability to call other types of variants such as indels.
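One way the automatic identification of the Phred encoding might work is sketched below in Python. This is a common heuristic rather than anything implemented in GeMS or multiGeMS: it assumes Phred+33 qualities fall in the ASCII range 33-74 and Phred+64 (including Solexa+64) qualities fall in 59-104, and the function name is hypothetical.

```python
def guess_phred_offset(quality_strings):
    """Guess the Phred ASCII offset (33 or 64) from base quality strings.

    Returns None when the observed characters are compatible with both
    encodings, in which case a user-supplied option would be needed.
    """
    codes = [ord(ch) for quality in quality_strings for ch in quality]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters below ';' cannot occur under a +64 encoding.
        return 33
    if highest > 74:
        # Characters above 'J' are assumed not to occur under +33.
        return 64
    return None


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#5", "HHGF@@"]))  # low characters present
    print(guess_phred_offset(["hhhh^^", "ggfe]]"]))  # high characters present
    print(guess_phred_offset(["@@AB"]))              # ambiguous range
```

The ambiguous case is the reason a user option would still be worth offering alongside automatic detection.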

Finally, GeMS and multiGeMS currently only support input alignment data in the pileup format. In the near future, BAM alignment files should be supported.

Thus GeMS and multiGeMS will also be able to account for more biological or chemical features that are not currently indicated with the pileup format. Also, the VCF format should be supported for GeMS and multiGeMS output files.

6.2 metaGeMS

The study of microorganisms or microbes has been shown to be of increasing importance not only for human and domestic animal pathogen resistance but also in the areas spanning natural ecological services to alternative sources of energy. Further, microbes are important to biology because they allow microevolutionary changes to be studied within relatively short periods of time. Compared with many macro-organisms, microbes are the ideal model system to study microevolution because of the following 6 reasons.

1. Microbes reproduce quickly which allows multiple generation experiments to be conducted.

2. Experimental replication can be facilitated because large populations can be contained with very limited resources.

3. Microbes can be stored in suspended animation and later revived to be compared with groups of their descent.

4. The asexual reproduction of many microbes presents high clonal precision for experimental replication.

5. It is relatively easy to control the environmental variables of successive generations of microbes.

6. There already exists an abundance of genomic data for many species as well as documented techniques for the precise manipulation and analysis of microbe genomes.

Thus the holistic study of the microevolution of microbes, at the ecological, population dynamic, phenotypic and genetic levels, can be revolutionized with HTS.

The existing issue with applying HTS to the study of microbes is that, with current technologies, many microbes of significance cannot be isolated, cultured and thus individually manipulated. To study them, microbiologists must investigate these microbes in their natural habitat, which commonly involves unseparated microbial communities of differing generations and differing species. Sequencing and studying the genomic samples of various organisms from the environment in which these microbial communities live is often referred to as metagenomics.

At the current time, the options for SNP calling with metagenomics data are not very comprehensive, thus GeMS applied to metagenomics data, or what will be called “metaGeMS”, has much potential. The details of the forthcoming metaGeMS procedure are not yet fixed, but will include the following steps. First, the genomes that are highly abundant and reflected in the metagenomics HTS read data will be identified. This step will utilize the abundance of existing microbe genomic reference data available in genomic databases. Second, a dynamic reference genome database will be created based on the previously selected highly abundant genomes. Third, the HTS reads will be assigned to these highly abundant genomes using some type of clustering method. Finally, once these reads are aligned to their associated genomes, some form of multiGeMS can be applied to the assigned and aligned metagenomics reads.

Bibliography

[1] A map of human genome variation from population-scale sequencing. Nature, 467(7319):1061–1073, October 2010.

[2] Cornelis A. Albers, Gerton Lunter, Daniel G. MacArthur, Gilean McVean, Willem H. Ouwehand, and Richard Durbin. Dindel: Accurate indel calls from short-read data. Genome Research, 21(6):961–973, June 2011.

[3] Vikas Bansal, Olivier Harismendy, Ryan Tewhey, Sarah S. Murray, Nicholas J. Schork, Eric J. Topol, and Kelly A. Frazer. Accurate detection and genotyping of SNPs utilizing population sequencing data. Genome research, 20(4):537–545, April 2010.

[4] Vikas Bansal and Ondrej Libiger. A probabilistic method for the detection and genotyping of small indels from population-scale sequence data. Bioinformatics, 27(15):2047–2053, August 2011.

[5] Yoav Benjamini and Yosef Hochberg. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995.

[6] Yoav Benjamini and Daniel Yekutieli. The Control of the False Discovery Rate in Multiple Testing under Dependency. The Annals of Statistics, 29(4):1165–1188, 2001.

[7] NCBI BLAST. Query input and database selection. http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml. Accessed: 10/16/2012.

[8] George E. P. Box and Norman R. Draper. Empirical Model-Building and Response Surfaces (Wiley Series in Probability and Statistics). Wiley.

[9] Héctor C. Bravo and Rafael A. Irizarry. Model-Based Quality Assessment and Base-Calling for Second-Generation Sequencing Data. Biometrics, 66(3):665–674, September 2010.

[10] A. Chakravarti. Single nucleotide polymorphisms:... to a future of genetic medicine. Nature, 409(6822):822–823, 2001.

[11] Genia Corporation. Genia’s integrated circuits enable massively parallel single-molecule DNA sequencing. http://www.geniachip.com. Accessed: 10/16/2012.

[12] Petr Danecek, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, Gerton Lunter, Gabor T. Marth, Stephen T. Sherry, Gilean McVean, Richard Durbin, and 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics (Oxford, England), 27(15):2156–2158, August 2011.

[13] Mark A. DePristo, Eric Banks, Ryan Poplin, Kiran V. Garimella, Jared R. Maguire, Christopher Hartl, Anthony A. Philippakis, Guillermo del Angel, Manuel A. Rivas, Matt Hanna, Aaron McKenna, Tim J. Fennell, Andrew M. Kernytsky, Andrey Y. Sivachenko, Kristian Cibulskis, Stacey B. Gabriel, David Altshuler, and Mark J. Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, advance online publication, April 2011.

[14] W. J. Dixon. Analysis of extreme values. Ann. Math. Statistics, 21:488–506, 1950.

[15] Michael N. Edmonson, Jinghui Zhang, Chunhua Yan, Richard P. Finney, Daoud M. Meerzaman, and Kenneth H. Buetow. Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format. Bioinformatics (Oxford, England), 27(6):865–866, March 2011.

[16] Bradley Efron. Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis. Journal of the American Statistical Association, 99(465):96–104, 2004.

[17] Bradley Efron and Robert Tibshirani. Empirical bayes methods and false discovery rates for microarrays. Genetic epidemiology, 23(1):70–86, June 2002.

[18] Yaniv Erlich, Partha P. Mitra, Melissa delaBastide, W. Richard McCombie, and Gregory J. Hannon. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nature methods, 5(8):679–682, August 2008.

[19] Erik Garrison. Freebayes - the marthlab, October 2010.

[20] Genome Sequencing and Analysis Group. Unified genotyper - gsa, April 2011.

[21] Rodrigo Goya, Mark G. F. Sun, Ryan D. Morin, Gillian Leung, Gavin Ha, Kimberley C. Wiegand, Janine Senz, Anamaria Crisan, Marco A. Marra, Martin Hirst, David Huntsman, Kevin P. Murphy, Sam Aparicio, and Sohrab P. Shah. SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics, 26(6):730–736, March 2010.

[22] The SAM Format Specification Working Group. The sam format specification (v1.4- r985). http://samtools.sourceforge.net/SAM1.pdf. Accessed: 10/16/2012.

[23] Erika C. Hayden. The $1,000 genome: are we there yet? Nature News Blog, January 2012. Accessed: 10/16/2012.

[24] Nils Homer and Stanley Nelson. Improved variant discovery through local re-alignment of short-read next-generation sequencing data using SRMA. Genome Biology, 11(10):R99+, 2010.

[25] Inc. Illumina. Genomestudio software dna sequencing module workflow. http://www.illumina.com/Documents/products/technotes/technote_genomestudio_dna_sequencing_module_workflow.pdf. Accessed: 10/16/2012.

[26] Inc. Illumina. Illumina - sequencing technology. http://www.illumina.com/technology/sequencing_technology.ilmn. Accessed: 10/18/2012.

[27] National Human Genome Research Institute. The human genome project completion: Frequently asked questions. http://www.genome.gov/11006943. Accessed: 10/18/2012.

[28] Wei-Chun C. Kao, Kristian Stevens, and Yun S. Song. BayesCall: A model-based base-calling algorithm for high-throughput short-read sequencing. Genome research, 19(10):1884–1895, October 2009.

[29] Martin Kircher, Udo Stenzel, and Janet Kelso. Improved base calling for the Illumina Genome Analyzer using machine learning strategies. Genome Biology, 10(8):R83+, August 2009.

[30] Daniel C. Koboldt, Ken Chen, Todd Wylie, David E. Larson, Michael D. McLellan, Elaine R. Mardis, George M. Weinstock, Richard K. Wilson, and Li Ding. VarScan: variant detection in massively parallel sequencing of individual and pooled samples. Bioinformatics (Oxford, England), 25(17):2283–2285, September 2009.

[31] Daniel C. Koboldt, Qunyuan Zhang, David E. Larson, Dong Shen, Michael D. McLellan, Ling Lin, Christopher A. Miller, Elaine R. Mardis, Li Ding, and Richard K. Wilson. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research, 22(3):568–576, March 2012.

[32] Lukasz Komsta. Package ‘outliers’, January 2011.

[33] Jeffrey Kriseman, Christopher Busick, Szabolcs Szelinger, and Valentin Dinu. BING: biomedical informatics pipeline for Next Generation Sequencing. Journal of biomedical informatics, 43(3):428–434, June 2010.

[34] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3):R25–10, March 2009.

[35] Heng Li. Fastq format specification. http://maq.sourceforge.net/fastq.shtml. Accessed: 10/16/2012.

[36] Heng Li. Pileup format. http://samtools.sourceforge.net/pileup.shtml. Accessed: 10/16/2012.

[37] Heng Li. Mathematical notes on samtools algorithms, October 2010.

[38] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589–595, March 2010.

[39] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, Richard Durbin, and 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16):2078–2079, August 2009.

[40] Heng Li, Jue Ruan, and Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research, 18(11):1851–1858, November 2008.

[41] Ruiqiang Li, Yingrui Li, Xiaodong Fang, Huanming Yang, Jian Wang, Karsten Kristiansen, and Jun Wang. SNP detection for massively parallel whole-genome resequencing. Genome Research, 19(6):1124–1132, June 2009.

[42] Ruiqiang Li, Chang Yu, Yingrui Li, Tak-Wah Lam, Siu-Ming Yiu, Karsten Kristiansen, and Jun Wang. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967, August 2009.

[43] Nawar Malhis, Yaron Butterfield, Martin Ester, and Steven J. M. Jones. Slider-maximum use of probability information for alignment of short sequence reads and SNP detection. Bioinformatics, 25(1):6–13, January 2009.

[44] Nawar Malhis and Steven J. M. Jones. High quality SNP calling using Illumina data at shallow coverage. Bioinformatics, 26(8):1029–1035, April 2010.

[45] G. T. Marth, I. Korf, M. D. Yandell, R. T. Yeh, Z. Gu, H. Zakeri, N. O. Stitziel, L. Hillier, P. Y. Kwok, and W. R. Gish. A general approach to single-nucleotide polymorphism discovery. Nature genetics, 23(4):452–456, December 1999.

[46] Gabor Marth. Freebayes – the marthlab, February 2009.

[47] A. Martínez-Alcántara, E. Ballesteros, C. Feng, M. Rojas, H. Koshinsky, V. Y. Fofanov, P. Havlak, and Y. Fofanov. PIQA: pipeline for Illumina G1 genome analyzer data quality assessment. Bioinformatics (Oxford, England), 25(18):2438–2439, September 2009.

[48] Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, and Mark A. DePristo. The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9):1297–1303, September 2010.

[49] Michael L. Metzker. Sequencing technologies – the next generation. Nature Reviews Genetics, 11(1):31–46, December 2009.

[50] Urs A. Meyer. Pharmacogenetics - five decades of therapeutic lessons from genetic diversity. Nature reviews. Genetics, 5(9):669–676, September 2004.

[51] Omkar Muralidharan, Georges Natsoulis, John Bell, Daniel Newburger, Hua Xu, Itai Kela, Hanlee Ji, and Nancy Zhang. A cross-sample statistical model for SNP detection in short-read sequencing data. Nucleic Acids Research, 40(1):e5, January 2012.

[52] Gabriel H. Murillo. ngs-snp-calling (gems) - cui lab. http://cui.bioinformatics.ucr.edu/home/software/ngs-snp-calling. Accessed: 10/16/2012.

[53] The 1000 Genomes Project. Vcf (variant call format) version 4.1. http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41. Accessed: 10/16/2012.

[54] Ji Qi, Fangqing Zhao, Anne Buboltz, and Stephan C. Schuster. inGAP: an integrated next-generation genome analysis pipeline. Bioinformatics, 26(1):127–129, January 2010.

[55] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010. ISBN 3-900051-07-0.

[56] NCBI Molecular Biology Review. Nucleotide base codes (iupac). http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/iupac_nt_abbreviations.html. Accessed: 10/18/2012.

[57] Jacques Rougemont, Arnaud Amzallag, Christian Iseli, Laurent Farinelli, Ioannis Xenarios, and Felix Naef. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics, 9(1):431+, 2008.

[58] Sohrab P. Shah, Ryan D. Morin, Jaswinder Khattra, Leah Prentice, Trevor Pugh, Angela Burleigh, Allen Delaney, Karen Gelmon, Ryan Guliany, Janine Senz, Christian Steidl, Robert A. Holt, Steven Jones, Mark Sun, Gillian Leung, Richard Moore, Tesa Severson, Greg A. Taylor, Andrew E. Teschendorff, Kane Tse, Gulisa Turashvili, Richard Varhol, Rene L. Warren, Peter Watson, Yongjun Zhao, Carlos Caldas, David Huntsman, Martin Hirst, Marco A. Marra, and Samuel Aparicio. Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution. Nature, 461(7265):809–813, October 2009.

[59] Yufeng Shen, Zhengzheng Wan, Cristian Coarfa, Rafal Drabek, Lei Chen, Elizabeth A. Ostrowski, Yue Liu, George M. Weinstock, David A. Wheeler, Richard A. Gibbs, and Fuli Yu. A SNP discovery method to assess variant allele probability from next-generation resequencing data. Genome research, 20(2):273–280, February 2010.

[60] S. T. Sherry, M. H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29(1):308–311, January 2001.

[61] Michael Snyder, Jiang Du, and Mark Gerstein. Personal genome sequencing: current approaches and challenges. Genes & Development, 24(5):423–431, March 2010.

[62] John D. Storey. The Positive False Discovery Rate: A Bayesian Interpretation and the q-Value (2003). The Annals of Statistics, 31(6):2013–2035, 2003.

[63] Michael Strömberg and Wan-Ping Lee. Mosaik - the marthlab, December 2010.

[64] Novocraft Technologies. Novocraft.com: Novocraft. http://www.novocraft.com/main/index.php. Accessed: 10/16/2012.

[65] Kris A. Wetterstrand. Dna sequencing costs: Data from the nhgri large-scale genome sequencing program. www.genome.gov/sequencingcosts. Accessed: 10/16/2012.

[66] Nava Whiteford, Tom Skelly, Christina Curtis, Matt E. Ritchie, Andrea Löhr, Alexander W. Zaranek, Irina Abnizova, and Clive Brown. Swift: primary data analysis for the Illumina Solexa sequencing platform. Bioinformatics, 25(17):2194–2199, September 2009.

[67] Na You, Gabriel Murillo, Xiaoquan Su, Xiaowei Zeng, Jian Xu, Kang Ning, Shoudong Zhang, Jiankang Zhu, and Xinping Cui. SNP calling using genotype model selection on high-throughput sequencing data. Bioinformatics, 28(5):643–650, March 2012.

Appendix A

Intuition Behind the EM Algorithm

Consider the observed random variable D, the unobserved random variable G and an unknown vector of parameters θ, all linked together with the likelihood L(θ|D, G) = P(D, G|θ). Consider that if we knew D and θ, we could then estimate G. Likewise, if we knew D and G, we could then estimate θ. But since both G and θ are unknown, an iterative procedure may lead to estimates of both G and θ.

One goal, that of getting the MLE of θ from the marginal likelihood of D,

\[ L(\theta \mid D) = P(D \mid \theta) = \sum_{G} P(D, G \mid \theta), \tag{A.1} \]

is often intractable. One way to achieve this goal is by using the EM algorithm as suggested in the above scenario. The first step is to set θ to some initial value $\theta^{(0)}$. Then in the E-step, we compute the best value for G given the t-th iterate $\theta^{(t)}$. This is often expressed with the Q function, namely,

\[ Q(\theta \mid \theta^{(t)}) = E_{G \mid D, \theta^{(t)}}\left[ \log L(\theta \mid D, G) \right]. \tag{A.2} \]

Then in the M-step, we compute a better estimate of θ, given the just computed E-step expectation, expressed by,

\[ \theta^{(t+1)} = \operatorname*{argmax}_{\theta} \, Q(\theta \mid \theta^{(t)}). \tag{A.3} \]

We then iteratively repeat the E-step and the M-step until we have convergence in (G, θ).
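To make the E-step and M-step concrete, the following Python sketch runs EM on a toy model that is deliberately much simpler than the GeMS genotype model: each observation D_i is a count of successes out of n trials, the hidden label G_i ∈ {0, 1} selects one of two binomial success probabilities, and θ = (w, p0, p1) with w = P(G = 1). The model, data and function names are assumptions chosen for illustration only.

```python
from math import comb, log


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def log_likelihood(counts, n, theta):
    """Marginal log-likelihood log P(D | theta), summing over hidden G."""
    w, p0, p1 = theta
    return sum(
        log((1 - w) * binom_pmf(k, n, p0) + w * binom_pmf(k, n, p1))
        for k in counts
    )


def em_two_binomials(counts, n, theta0, iterations=200, tol=1e-10):
    """EM for a two-component binomial mixture.

    Assumes the starting values keep 0 < w, p0, p1 < 1 so that no
    component receives zero weight during the iterations.
    """
    w, p0, p1 = theta0
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability that G_i = 1 given D_i and theta^(t).
        resp = []
        for k in counts:
            a = (1 - w) * binom_pmf(k, n, p0)
            b = w * binom_pmf(k, n, p1)
            resp.append(b / (a + b))
        # M-step: closed-form maximizer of Q(theta | theta^(t)).
        total1 = sum(resp)
        total0 = len(counts) - total1
        new_w = total1 / len(counts)
        new_p0 = sum((1 - r) * k for r, k in zip(resp, counts)) / (n * total0)
        new_p1 = sum(r * k for r, k in zip(resp, counts)) / (n * total1)
        change = max(abs(new_w - w), abs(new_p0 - p0), abs(new_p1 - p1))
        w, p0, p1 = new_w, new_p0, new_p1
        if change < tol:
            break
    return (w, p0, p1), resp


if __name__ == "__main__":
    data = [1, 2, 1, 0, 2, 9, 8, 9, 10, 8]  # toy counts out of n = 10
    theta, responsibilities = em_two_binomials(data, 10, (0.5, 0.3, 0.7))
    print(theta)
    print([round(r, 3) for r in responsibilities])
```

In this sketch the E-step does not pick a single value of G; it computes the posterior probability of each hidden label, which is what the expectation in (A.2) requires, and the M-step then has a closed form. The marginal log-likelihood (A.1) never decreases from one iteration to the next.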

Appendix B

The Local False Discovery Rate

There are many ways to resolve multiple testing issues, that is, the inflation of the number of significant test statistics when a large number of statistical inferences are considered simultaneously. With regard to SNP calling, the multiple testing problem is manifested in the risk that a SNP caller will call many more SNPs than is reasonable as indicated from the data.

B.1 FDR, Positive FDR and Bayesian FDR

A traditional way to handle the multiple testing problem is to control the false discovery rate (FDR). To define FDR, let us first consider the outcomes of the m hypothesis tests as listed in table B.1.

             Do not reject null   Reject null   Total
True Null    U                    V             m0
False Null   T                    S             m1
Total        W                    R             m

Table B.1: The possible outcomes from m hypothesis tests.

Benjamini and Hochberg [5] defined the false discovery rate (FDR) as follows, where R ∨ 1 sets V/R = 0 when R = 0.

\[ \mathrm{FDR} = E\left[ \frac{V}{R \vee 1} \right] = E\left[ \frac{V}{R} \,\middle|\, R > 0 \right] P(R > 0) \tag{B.1} \]

Similar to FDR, Storey [62] defined the positive false discovery rate (pFDR) as follows.

\[ \mathrm{pFDR} = E\left[ \frac{V}{R} \,\middle|\, R > 0 \right] \tag{B.2} \]

As we see, the pFDR avoids the possibility of having no rejections in FDR by setting P(R > 0) = 1. pFDR also has a nice Bayesian interpretation when described as a function of a test significance region Γ. Given the test statistics $T_1, T_2, \ldots, T_m$ and the random variables $V(\Gamma) = \#\{\text{null } T_i : T_i \in \Gamma\}$ and $R(\Gamma) = \#\{T_i : T_i \in \Gamma\}$, we can specify pFDR(Γ) as follows.

\[ \mathrm{pFDR}(\Gamma) = E\left[ \frac{V(\Gamma)}{R(\Gamma)} \,\middle|\, R(\Gamma) > 0 \right] \tag{B.3} \]

Theorem 1 (Bayesian interpretation of pFDR, Storey [62]) Assuming:

1. Hi = I(i-th null hypothesis is false), for i = 1, ..., m, are the indicators for m identical hypothesis tests, where π0 = 1 − π1 is the prior probability that a null hypothesis is true and Hi ∼ Bern(π1)

2. (Ti, Hi) are i.i.d. random variables where Ti | Hi ∼ (1 − Hi) · F0 + Hi · F1 for some null distribution F0 and alternative distribution F1

Then, independent of i or m,

\[ \mathrm{pFDR}(\Gamma) = P(H = 0 \mid T \in \Gamma) = \frac{\pi_0 \cdot P(T \in \Gamma \mid H = 0)}{\pi_0 \cdot P(T \in \Gamma \mid H = 0) + \pi_1 \cdot P(T \in \Gamma \mid H = 1)} \tag{B.4} \]
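As a numerical illustration of (B.4), the following minimal Python sketch evaluates the posterior form of pFDR for a one-sided region Γ = {T ≤ c}. The choices F0 = N(0, 1), F1 = N(−3, 1), π0 = 0.9 and c = −2, as well as the function name pfdr_bayes, are assumptions made only for this example and are not part of Storey's theorem.

```python
from statistics import NormalDist

def pfdr_bayes(pi0, p_null, p_alt):
    """Evaluate P(H = 0 | T in Gamma) as in (B.4).

    pi0    : prior probability that a null hypothesis is true
    p_null : P(T in Gamma | H = 0)
    p_alt  : P(T in Gamma | H = 1)
    """
    pi1 = 1.0 - pi0
    return pi0 * p_null / (pi0 * p_null + pi1 * p_alt)

# Hypothetical null and alternative distributions (illustration only).
f0 = NormalDist(0.0, 1.0)
f1 = NormalDist(-3.0, 1.0)
c = -2.0  # rejection region Gamma = {T <= c}

value = pfdr_bayes(0.9, f0.cdf(c), f1.cdf(c))
print(round(value, 4))
```

Under these assumed distributions, roughly one fifth of the statistics falling in Γ would be expected to come from true null hypotheses, even though the null probability of Γ is only about 0.023.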

A special case of pFDR is known as “Bayesian FDR” (bFDR). In bFDR, the general significance region of pFDR, Γ, is restricted to the one-sided significance region {T ≤ t}. Using the same π and F notation as above, let F(t) = π0F0(t) + π1F1(t). Efron and Tibshirani [17] define bFDR as follows.

\[ \mathrm{bFDR}(t) = \pi_0 \frac{F_0(t)}{F(t)} = \frac{P(T \le t \mid H = 0) \, P(H = 0)}{P(T \le t)} = P(H = 0 \mid T \le t) \tag{B.5} \]

A nonparametric estimator for bFDR, based on the empirical distribution function \bar{F}(t) = #{Ti ≤ t}/m, is given as follows.

\[ \widehat{\mathrm{bFDR}}(t) = \pi_0 \frac{F_0(t)}{\bar{F}(t)} \tag{B.6} \]
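The estimator in (B.6) can be sketched in a few lines of Python. This is a minimal sketch, assuming a standard normal null distribution F0 and a small made-up list of test statistics; the function name bfdr_hat and its parameters are hypothetical and chosen only for this illustration.

```python
from statistics import NormalDist

def bfdr_hat(t, stats, pi0=1.0, null_cdf=NormalDist(0.0, 1.0).cdf):
    """Estimate bFDR(t) as pi0 * F0(t) / Fbar(t), where Fbar is the
    empirical distribution function of the observed test statistics."""
    m = len(stats)
    count = sum(1 for x in stats if x <= t)
    if count == 0:
        raise ValueError("no test statistics fall at or below t")
    f_bar = count / m
    return pi0 * null_cdf(t) / f_bar

# Made-up test statistics (illustration only).
stats = [-4.0, -3.5, -3.0, -0.5, 0.0, 0.3, 0.8, 1.2, 1.5, 2.0]
print(bfdr_hat(-3.0, stats))
```

Note that the raw estimate is a ratio of a theoretical to an empirical probability and is therefore not guaranteed to stay below 1.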

Thus we can begin to see how all the versions of FDR are connected. First, pFDR(Γ) = P(H = 0 | T ∈ Γ) has the same general Bayesian form as bFDR(t) = P(H = 0 | T ≤ t). Second, when P(R > 0) = 1, FDR = pFDR = P(H = 0 | T ∈ Γ). Thus we see that both pFDR and bFDR are special cases of FDR.

The article by Benjamini and Hochberg [5] does more than just define FDR; it also gives an algorithm to control FDR at a level α. This algorithm, which gives a way to tackle the multiple testing problem, is as follows.

1. Let P1, P2, ..., Pm be the p-values associated with the independent tests T1, T2, ..., Tm and associated hypotheses H1, H2, ..., Hm.

2. Order the p-values with the following order statistics notation, P(1) ≤ P(2) ≤ ... ≤ P(m).

3. For a fixed level α, determine the following (see footnote 1 regarding π0).

\[ i_\alpha = \max\left\{ i : P_{(i)} \le \frac{i}{m} \, \frac{\alpha}{\pi_0} \right\} \]

4. Reject all Hi with Pi ≤ P(iα).
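The four steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the function name benjamini_hochberg, its default arguments and the example p-values are assumptions introduced here.

```python
def benjamini_hochberg(pvalues, alpha=0.05, pi0=1.0):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(pvalues)
    # Step 2: indices of the p-values in increasing order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step 3: largest rank i with P_(i) <= (i / m) * (alpha / pi0).
    i_alpha = 0  # 0 means that no hypothesis is rejected
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha / pi0:
            i_alpha = rank
    # Step 4: reject every hypothesis whose p-value is at most P_(i_alpha).
    rejected = [False] * m
    for idx in order[:i_alpha]:
        rejected[idx] = True
    return rejected

# Made-up p-values (illustration only).
example = [0.039, 0.001, 0.216, 0.008]
print(benjamini_hochberg(example, alpha=0.05))
```

Because the procedure searches for the largest qualifying rank, a p-value that misses its own threshold can still be rejected when a larger p-value meets its threshold.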

The following theorem also shows that there is another connection between FDR and bFDR.

Theorem 2 (bFDR-FDR control equivalence, Efron and Tibshirani [17]) The Benjamini-Hochberg algorithm is equivalent to rejecting all Hi with Ti ≤ tα, where

\[ t_\alpha = \max_{t} \left\{ t : \mathrm{bFDR}(t) \le \alpha \right\}. \]
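A brief sketch of why this equivalence is plausible is given below. It assumes left-tailed p-values Pi = F0(Ti), no ties among the test statistics, and that bFDR is evaluated through its nonparametric estimator (B.6); these assumptions are made here for illustration and are not quoted from [17].

```latex
% Assumptions (illustration only): P_i = F_0(T_i), no ties, and
% \bar{F}(t) = \#\{T_i \le t\}/m is the empirical distribution function.
\begin{align*}
\bar{F}\big(T_{(i)}\big) &= \frac{i}{m}, \\
\widehat{\mathrm{bFDR}}\big(T_{(i)}\big)
  &= \pi_0 \, \frac{F_0\big(T_{(i)}\big)}{\bar{F}\big(T_{(i)}\big)}
   = \pi_0 \, \frac{m \, P_{(i)}}{i}, \\
\widehat{\mathrm{bFDR}}\big(T_{(i)}\big) \le \alpha
  &\iff P_{(i)} \le \frac{i}{m} \, \frac{\alpha}{\pi_0}.
\end{align*}
```

The last line is the Benjamini-Hochberg threshold condition, so the largest order statistic whose estimated bFDR is at most α corresponds to the largest rank selected by the algorithm.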

B.2 Local FDR

Efron [17] also defines local FDR (lFDR) as a “local” version of bFDR. First, let f0(t) and f(t) be the densities corresponding to F0(t) and F(t). Recall that bFDR can be defined for a general rejection region Γ in the following way.

\[ \mathrm{bFDR} = \pi_0 \frac{P_{f_0}(T \in \Gamma)}{P_{f}(T \in \Gamma)} = P(H_i = 0 \mid T_i \in \Gamma) \tag{B.7} \]

Likewise, for infinitesimally “local” rejection regions, Efron [17] defined local FDR (lFDR) at a point t in the following manner.

\[ \mathrm{lFDR}(t) = \pi_0 \frac{f_0(t)}{f(t)} = P(H_i = 0 \mid T_i = t) \tag{B.8} \]

Footnote 1: Note that Benjamini and Hochberg’s 1995 paper [5] took π0 = 1, but Benjamini and Yekutieli’s 2001 paper [6] considers estimating it. There are different strategies for determining π0, including keeping P(H = 0 | Reject H) ∈ [0, 1].

           “Uninteresting”   “Interesting”
Prior      π0                π1 = 1 − π0
Density    f0(t)             f1(t)

Table B.2: The prior probability and density notation of “uninteresting” and “interesting” test statistics, in the lFDR context.

The rationale for defining local FDR is seen in the following quote from Efron, which illuminates the difference in purpose between small-scale and large-scale hypothesis testing.

“Although we are not exactly looking for a needle in a haystack, we do not want the whole haystack either.” That is, single hypothesis tests are often designed with the expectation of rejecting the null hypothesis. In contrast, large-scale simultaneous hypothesis testing, like a screening operation, is intended to identify a small percentage of “interesting” cases among the majority of “uninteresting” cases. In discussing the above, Efron prefers to use “interesting” instead of “significant” and feels that “the proportion of interesting cases is small, perhaps 1% or 5% of [m], but not more than 10%.” [16]

Since Efron assumes that most test statistics will be “uninteresting”, he makes the assumption that π0 ≥ 0.9 and conservatively recommends that users set π0 = 1. In practice, Efron recommends using lFDR as follows to determine whether or not a test is interesting.

1. Given the test statistics T1, T2, ..., Tm where m is large (i.e. m > 100), we would like to know whether Ti was generated according to the “uninteresting” density in table B.2.

2. By using Bayes’ theorem and the mixture density f(t) = π0f0(t) + π1f1(t), note the following.

\[ P(\text{“Uninteresting”} \mid t) = \pi_0 \frac{f_0(t)}{f(t)} \tag{B.9} \]

3. By assuming conservatively that π0 = 1, define the upper bound of lFDR as lFDR^UB(t) ≡ f0(t)/f(t).

4. Report ti as interesting if lFDR^UB(ti) ≤ 0.1. Thresholds other than 0.1 can be used, though Efron recommends that the threshold not be larger than 0.1. Various techniques are given in [16] on how one can estimate lFDR^UB(t) and specifically f0(t).
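The procedure above can be sketched in Python under some simplifying assumptions. Here f0 is taken to be the standard normal density (a theoretical null) and the mixture density f is estimated with a Gaussian kernel density estimate using a rule-of-thumb bandwidth. Both choices are substitutes made for this illustration and are not the estimation techniques of [16]; the function names and the synthetic data are likewise hypothetical.

```python
from statistics import NormalDist, stdev

STANDARD_NORMAL = NormalDist(0.0, 1.0)

def mixture_density(t, stats, bandwidth):
    """Gaussian kernel density estimate of the mixture density f at t."""
    total = sum(STANDARD_NORMAL.pdf((t - x) / bandwidth) for x in stats)
    return total / (len(stats) * bandwidth)

def lfdr_upper_bound(t, stats, bandwidth=None):
    """lFDR^UB(t) = f0(t) / f(t), that is, lFDR with pi0 set to 1."""
    if bandwidth is None:
        # Rule-of-thumb bandwidth; an assumption, not Efron's estimator.
        bandwidth = 1.06 * stdev(stats) * len(stats) ** (-0.2)
    return STANDARD_NORMAL.pdf(t) / mixture_density(t, stats, bandwidth)

# Synthetic, deterministic data: 190 "uninteresting" statistics placed at
# N(0, 1) quantiles and 10 "interesting" ones placed at N(-4, 0.5) quantiles.
null_part = [STANDARD_NORMAL.inv_cdf((i + 0.5) / 190) for i in range(190)]
alt_part = [NormalDist(-4.0, 0.5).inv_cdf((i + 0.5) / 10) for i in range(10)]
stats = null_part + alt_part

# Step 4: report the statistics whose upper bound is at most 0.1.
bandwidth = 1.06 * stdev(stats) * len(stats) ** (-0.2)
interesting = [t for t in stats if lfdr_upper_bound(t, stats, bandwidth) <= 0.1]
print(len(interesting), "statistics reported as interesting")
```

In this synthetic setting, statistics near the center of the null density have an upper bound close to 1, while those near −4 fall well below the 0.1 threshold.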

Appendix C

Filtering Alignment Files

Many HTS data alignment procedures record information about the short reads which are aligned to a reference genome. Short reads with undesirable characteristics can be filtered before downstream analyses, such as SNP calling.

This discussion will focus on the Sequence Alignment/Map (SAM) format ([39], [22]) since it is the emerging file format standard for storing nucleotide sequence alignments. The BAM file format is a compressed, binary version of the SAM file format which can be indexed for computational efficiency. SAMtools [39] is a popular software suite that provides a toolkit for manipulating SAM format alignments.

C.1 SAM/BAM Bitwise Flags

Among the bitwise flags of the SAM/BAM alignment section, the final 3 flags may indicate that the read is undesirable and should be pre-filtered before SNP calling. Table C.1 gives summary information on these flags. The specific definitions of these flags may vary depending on the sequencing and alignment software that is used. As the following is general information about these flags, practitioners are encouraged to consult the documentation of the software that they use.

Bit (Hex)   Hex-to-Decimal   Decimal   Description
0x100       1 × 16^2         256       Non-primary alignment
0x200       2 × 16^2         512       Read fails quality controls
0x400       4 × 16^2         1024      Read is PCR or optical duplicate

Table C.1: Selected bitwise flags from the FLAG field of the SAM/BAM alignment section [22]. The columns provide 1) the bitwise flags written in hexadecimal (hex) with the Unix-like/C “0x” notational prefix, 2) an expression to obtain the flag decimal value from the hexadecimal value, 3) the flag decimal value and 4) a short flag description.

The bit 0x100 flag indicates a non-primary read alignment and read alignment ambiguity. The bit 0x200 flag indicates that certain read quality metrics were not passed. The assessment of these read quality metrics or controls is typically done during pre-alignment sequencing. The bit 0x400 flag indicates that the read is most likely either a polymerase chain reaction (PCR) duplicate or an optical duplicate. PCR duplicates arise when the same parent DNA molecule is repeatedly sequenced over the course of many PCR cycles; thus these PCR duplicates do not offer any unique information. Optical duplicates can occur when, during the image analysis of sequencing, the sequences of one cluster are falsely identified as belonging to another real or illusory cluster.

C.2 Minimally Recommended Practices

Many users choose to filter reads with the above flags to prevent the bias that these reads can introduce in SNP calling. For example, retaining PCR or optical duplicate reads can distort the true allele frequency distribution and lead to false positive or false negative SNP calls. Likewise, retaining reads with established quality control failures or ambiguous alignments can also lead to false SNP calls.

A popular procedure used to filter these reads is samtools view [39] with the -F option. This option takes as its argument the integer sum of the decimal representations of the bitwise flags that are to be filtered out. To demonstrate, suppose we wish to filter out any read that has at least one of the following bitwise flags: 0x100, 0x200 and 0x400. To obtain the correct samtools view -F argument, we simply sum the decimal representations of these bits: 256 + 512 + 1024 = 1792. Then we run samtools view -F 1792 on our BAM file. Likewise, if we wish to filter out only those reads that have either the non-primary alignment flag (0x100) or the PCR/optical duplicate flag (0x400), then we should run samtools view -F 1280 since 256 + 1024 = 1280.
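The flag arithmetic described above can be checked with a short Python sketch. The constant and function names below are hypothetical and introduced only for this illustration; the sketch mirrors the -F behavior of excluding any read with at least one of the selected bits set, and it does not call SAMtools itself.

```python
# Decimal values of the three flags listed in table C.1.
FLAG_NOT_PRIMARY = 0x100  # 256
FLAG_QC_FAIL = 0x200      # 512
FLAG_DUPLICATE = 0x400    # 1024

def filter_argument(*flags):
    """Combine bitwise flags into the integer given to `samtools view -F`."""
    mask = 0
    for flag in flags:
        mask |= flag  # for distinct bits, OR-ing equals summing the decimals
    return mask

def is_filtered(read_flag, mask):
    """True if the read has at least one bit of the mask set."""
    return (read_flag & mask) != 0

print(filter_argument(FLAG_NOT_PRIMARY, FLAG_QC_FAIL, FLAG_DUPLICATE))
print(filter_argument(FLAG_NOT_PRIMARY, FLAG_DUPLICATE))
print(is_filtered(1040, 1280))  # FLAG 1040 = 1024 + 16 includes the duplicate bit
```

Using a bitwise OR rather than a plain sum guards against accidentally counting the same flag twice when building the argument.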

Figure C.1 gives minimally recommended command-lines for filtering a SAM/BAM alignment file, converting the filtered BAM file to a pileup file and running the single sample GeMS SNP caller on the resulting pileup file. The documentation for SAMtools offers more information on these procedures and the recommended usage of their options.

# 0. If starting with SAM, convert SAM to BAM
samtools view -bT ref.fasta aln.sam -o aln.bam
# 1. Sort the BAM and output to srt.bam
samtools sort aln.bam srt
# 2. Index the BAM which creates srt.bam.bai
samtools index srt.bam
# 3. Output mapped(3rd col)/unmapped(4th col) read counts
samtools idxstats srt.bam
# 4. Filter out reads with flags 0x100 and 0x400
samtools view -bF 1280 srt.bam -o flt.bam
# 5. Index the new filtered BAM file
samtools index flt.bam
# 6. Read counts should be less than or equal to before
samtools idxstats flt.bam
# 7. Create pileup file with mapping(q)/base(Q) quality filters
samtools mpileup -q 20 -Q 17 -sf ref.fasta flt.bam > flt.pileup
# 8. Run GeMS on flt.pileup
gems -i flt.pileup -o gems_output.txt

Figure C.1: Minimally recommended command-lines for filtering a SAM/BAM alignment file, converting the filtered BAM file to a pileup file and running the single sample GeMS SNP caller on the resulting pileup file.
