UNIVERSITY OF CALIFORNIA RIVERSIDE
SNP Calling Using Genotype Model Selection on High-Throughput Sequencing Data
A Dissertation submitted in partial satisfaction of the requirements for the degree of
Doctor of Philosophy
in
Applied Statistics
by
Gabriel Hiroshi Murillo
December 2012
Dissertation Committee:
Dr. Xinping Cui, Chairperson
Dr. Daniel Jeske
Dr. Thomas Girke

Copyright by
Gabriel Hiroshi Murillo
2012

The Dissertation of Gabriel Hiroshi Murillo is approved:
Committee Chairperson
University of California, Riverside

Acknowledgments
I am most grateful to my advisor, Dr. Xinping Cui, for her enthusiasm, dedication and encouragement. Without her guidance, this dissertation would not exist.
I would also like to thank Dr. Na You, for the many helpful discussions we had.
To my parents.
ABSTRACT OF THE DISSERTATION
SNP Calling Using Genotype Model Selection on High-Throughput Sequencing Data
by
Gabriel Hiroshi Murillo
Doctor of Philosophy, Graduate Program in Applied Statistics
University of California, Riverside, December 2012
Dr. Xinping Cui, Chairperson
Recent advances in high-throughput sequencing (HTS) promise revolutionary impacts in science and technology, including the areas of disease diagnosis, pharmacogenomics, and mitigating antibiotic resistance. An important way to analyze the increasingly abundant HTS data is through the use of single nucleotide polymorphism (SNP) callers. A survey of popular HTS SNP calling procedures makes clear that many rely mainly on base-calling and read mapping quality values. There is thus a need to account for other sources of error when calling SNPs, such as those occurring during genomic sample preparation. Genotype Model Selection (GeMS), a novel method of consensus and SNP calling that accounts for genomic sample preparation errors, is therefore presented. Simulation studies demonstrate that GeMS has the best balance of sensitivity and positive predictive value (PPV) among a selection of popular SNP callers. Real data analyses also support this conclusion.
As an extension to the aforementioned single sample GeMS, the multiple sample Genotype Model Selection (multiGeMS) method is also given. A simulation study and a real data analysis demonstrate that multiGeMS has a good balance of sensitivity and PPV when compared to a selection of popular multiple sample SNP callers.
Contents
List of Figures ix
List of Tables x
1 Introduction 1
2 Background 4
2.1 Biological Background ...... 4
2.2 High Throughput (Next Generation) Sequencing ...... 7
2.3 File Format Specifications ...... 13
3 Literature Review 18
3.1 Single Sample SNP Callers ...... 18
3.1.1 MAQ ...... 19
3.1.2 gigaBayes ...... 21
3.1.3 Atlas-SNP2 ...... 21
3.1.4 SNVMix ...... 23
3.1.5 Method Comparison ...... 24
3.2 Multiple Sample SNP Callers ...... 25
3.2.1 SAMtools ...... 26
3.2.2 GATK ...... 29
3.2.3 “Cross-Sample” ...... 29
4 Single Sample Genotype Model Selection (GeMS) 32
4.1 Data Preparation ...... 33
4.2 Procedure ...... 35
4.3 Validation ...... 41
4.3.1 Simulation Analysis ...... 42
4.3.2 Real Data Analysis ...... 52
4.3.2.1 The Arabidopsis sup1ros1 dataset ...... 52
4.3.2.2 The Thermoanaerobacter sp. X514 Xw2010 dataset ...... 57
4.3.3 Computational Performance ...... 59
4.4 Discussion ...... 64
4.4.1 Haploid GeMS Analysis ...... 64
4.4.2 Prior Probabilities ...... 65
5 Multiple Sample Genotype Model Selection (multiGeMS) 66
5.1 Data Preparation ...... 67
5.2 Procedure ...... 69
5.2.1 Single Sample GeMS Review ...... 71
5.2.2 EM Algorithm ...... 71
5.2.2.1 E-step ...... 72
5.2.2.2 M-step ...... 74
5.2.2.3 Initial Values ...... 75
5.2.2.4 Convergence ...... 76
5.2.3 SNP and Consensus Calling ...... 77
5.3 Validation ...... 78
5.3.1 Simulation Analysis ...... 78
5.3.2 Real Data Analysis ...... 81
6 Future Work 87
6.1 Refinements ...... 87
6.2 metaGeMS ...... 88
Bibliography 91
Appendix A Intuition Behind the EM Algorithm 96
Appendix B The Local False Discovery Rate 98
B.1 FDR, Positive FDR and Bayesian FDR ...... 98
B.2 Local FDR ...... 101
Appendix C Filtering Alignment Files 104
C.1 SAM/BAM Bitwise Flags ...... 104
C.2 Minimally Recommended Practices ...... 105
List of Figures
2.1 Description of some genetic variants ...... 6
2.2 HTS data analysis pipeline ...... 8
2.3 Aligned reads with variants ...... 10
2.4 FASTA file example ...... 14
2.5 FASTQ file example ...... 15
3.1 Multiple sample SNP calling data notation ...... 26
4.1 q and w parameter relationship ...... 39
4.2 Sensitivity and PPV plot of SNP caller performance ...... 48
4.3 Zoomed-in view of sensitivity and PPV plot of SNP caller performance ...... 49
4.4 Arabidopsis SNP call Venn diagram ...... 54
4.5 X514 SNP call Venn diagram ...... 58
5.1 EM algorithm iterations ...... 76
5.2 multiGeMS simulation study samples ...... 81
5.3 50 samples of 1000 Genomes Project data, Part 1 ...... 83
5.4 50 samples of 1000 Genomes Project data, Part 2 ...... 84
C.1 Alignment File Filtering ...... 107
List of Tables
2.1 IUPAC Nucleotide Base Codes ...... 5
2.2 Simplified, relative comparison of Sanger and Illumina DNA sequencing technologies ...... 8
2.3 Some base-calling procedures ...... 10
2.4 Some alignment procedures ...... 11
2.5 Some SNP calling procedures ...... 12
3.1 Atlas-SNP2 prior probabilities ...... 23
3.2 Comparison of surveyed SNP calling methods ...... 26
4.1 Settings for GeMS SNP calling ...... 33
4.2 GeMS model notation ...... 36
4.3 Diploid q^g for Y_j^l ∼ Categorical(q^g) ...... 37
4.4 Options used in SNP calling of simulated single sample data ...... 43
4.5 Single sample simulation SNP caller sensitivity ...... 46
4.6 Single sample simulation SNP caller PPV ...... 47
4.7 GeMS simulation results summary ...... 47
4.8 3 model GeMS sensitivity ...... 51
4.9 3 model GeMS PPV ...... 51
4.10 Options used in SNP calling of the Arabidopsis dataset ...... 53
4.11 Arabidopsis SNP call proportions ...... 57
4.12 Options used in SNP calling of the X514 dataset ...... 59
4.13 Single sample simulated data SNP calling computer specifications ...... 60
4.14 Time to completion of single sample SNP calling procedures ...... 61
4.15 Average memory used by single sample SNP calling procedures ...... 62
4.16 Maximum memory used by single sample SNP calling procedures ...... 63
4.17 Haploid q^g for Y_j^l ∼ Categorical(q^g) ...... 64
5.1 Settings for multiGeMS SNP calling ...... 67
5.2 multiGeMS model notation ...... 70
5.3 Options used in SNP calling of simulated multiple sample data ...... 80
5.4 multiGeMS simulation results summary ...... 81
5.5 Options used in SNP calling of the 50 samples of 1000 Genomes Project data ...... 82
5.6 multiGeMS 1000 Genomes Project data results summary ...... 86
B.1 Possible outcomes from m hypothesis tests ...... 98
B.2 lFDR prior and density notation ...... 102
C.1 Selected bitwise flags from SAM file FLAG field ...... 105
Chapter 1
Introduction
In the computing industry, the so-called “Moore’s Law” is a famous prediction that the number of transistors on a microprocessor will double approximately every
2 years. As a prominent growth benchmark, few technological trends have out-paced
Moore’s Law. However, since 2008 when Sanger-based DNA sequencing began to give way to ‘next generation’ technologies, high-throughput sequencing [49], [61] (HTS) has been one of those trends.
Data collected by the National Human Genome Research Institute illustrate the incredible speed at which high-throughput sequencing is decreasing the cost of DNA sequencing. In 2001, the cost to sequence a human-sized genome was approximately $100,000,000. In 2006, the cost decreased by a factor of 10 to $10,000,000. Similar decreases happened in 2008, 2009 and 2011 [65]. Two companies are now planning to release technology that enables a human-sized genome to be sequenced for $1,000 by the end of 2012 [23], while at least one other company is shooting for the $100 genome [11]. At this rate, it can be expected that within the coming years, getting one's whole genome sequenced could become as commonplace as a blood test is today.
Without knowing the exact costs of DNA sequencing in the next few years, one thing is clear: the genomic revolution is accelerating and will likely touch all our lives in a profound way. Since their inception, HTS technologies have been essentially limited to research organizations; however, they are now expected to become accessible to all those in industrialized nations. Many people will be able to accurately discover which diseases they and members of their families may be prone to. These individuals will then be able to live their lives in ways that minimize the harmful effects of those possible diseases. Further, pharmacogenomics [50] holds the promise of accurately predicting which types of medical treatment, and pharmaceutical medication in particular, will provide the most good and least harm for each patient's genetic profile.
Humans are not the only organisms that can be sequenced. In fact, all life is encoded in nucleic acid molecules such as DNA and RNA, so the applications of HTS go far beyond human genomics. Other areas of public health research being positively affected by HTS include the fight against antimicrobial resistance and the mitigation of infectious disease pathogens. Likewise, beyond public health, HTS is beginning to influence research in the agricultural sciences, environmental sciences and the science of alternative energy sources1.
Due to the massive decrease in the costs of DNA sequencing, there appears to be no shortage of HTS data available for scientists to analyze. Thus the bottleneck in these analyses is not the amount of data itself, but the group of statistical and computational algorithms used to analyze this large amount of data. Indeed, more accurate and computationally efficient data analysis tools are needed to translate the raw sequencing data into findings useful for medical workers, patients and our global society in general.
1Due to all of these changes, legal and ethical concerns involved in insurance policies, employment requirements and genetic discrimination and disclosure in general, will need to be addressed. Thus this genomic revolution is expected to be a case of ‘technology outpacing morality’.
To this end, this dissertation chronicles the development, computational implementation and validation of the novel statistical algorithm Genotype Model Selection (GeMS). GeMS is open source and freely available [52] to all who would like to explore single nucleotide polymorphisms [10] (SNPs) in sequencing data. Not only can GeMS be used by seasoned biologists, statisticians and other scientists to further research discoveries, but it can also be used to train future scientists with hands-on data analysis experience.
Chapter 2
Background
2.1 Biological Background
Before we begin a discussion of statistical methods used in single nucleotide polymorphism (SNP) calling, we must first understand what the data "looks like". This requires understanding the basics of DNA. Deoxyribonucleic acid, or DNA, has often been compared to the blueprints of life: it is a molecular code that contains all of the instructions needed to describe our personal physical features, some personality traits, and how our bodies can maintain themselves.
In many living things, DNA is organized into chromosomes and has the often illustrated double-helix shape. The two strands of the double helix are composed of bases or nucleotides named Adenine, Cytosine, Guanine and Thymine. These bases are commonly identified with their abbreviations: A, C, G, and T, respectively. It is important to note that A always binds with T and C always binds with G. This means that both strands are essentially equivalent sequences and so we only need to consider

Symbol   Meaning
A        Adenine
C        Cytosine
G        Guanine
T        Thymine
M        A or C
R        A or G
W        A or T
S        C or G
Y        C or T
K        G or T
V        A or C or G
H        A or C or T
D        A or G or T
B        C or G or T
N        A or C or G or T

Table 2.1: A listing of the 15 possible combinations of nucleotide base codes by the IUPAC [56].

one. Sometimes the identity of a nucleotide cannot be clearly determined. In this case we can use the IUPAC nucleotide base codes as displayed in table 2.1.
As we know, we received some characteristics from our father and some from our mother. This is because when we were conceived, we were given a set of 23 chromosomes from our father and another set of 23 chromosomes from our mother. Genomes, and thus organisms, that have two sets of chromosomes are called diploid (e.g. humans) and genomes with one set are called haploid (e.g. bacteria).
With much excitement in the scientific community, the Human Genome Project
(HGP) was finished in 2003 after an impressive collaboration of work from around the globe. It took $2.7 billion (1991 USD) and 13 years [27] to complete and it has been considered a triumph of human history. One of the main goals of the HGP was to create a human reference genome.
A species' reference genome is a representation of the genome sequence for any member of that species. It can be created by the (de novo) sequencing of an individual organism's genome or a pool of genomes from a group of organisms of the same species.

Reference Sequence           ...CATCATCATCAT...

Homozygous SNP Example       ...CATCATGATCAT...
                             ...CATCATGATCAT...

Heterozygous SNP Example 1   ...CATCATGATCAT...
                             ...CATCATCATCAT...

Heterozygous SNP Example 2   ...CATCATGATCAT...
                             ...CATCATTATCAT...

Homozygous Insertion         ...CATCATGCATCAT...
                             ...CATCATGCATCAT...

Figure 2.1: Here are a few of the many types of possible variations between a reference genome and that of a particular organism.

Even if the organism is diploid, that is, it has two sets of chromosomes, the reference genome generally represents only one set of chromosomes1. For multi-chromosomal organisms, the reference genome is constructed by concatenating each chromosome sequence together. An index is used to track positions on this reference genome as well as where chromosomes start and finish.
As there are differences between individuals’ outward appearances, there are also differences between individuals’ genomes. It is often stated that the genomes of any two randomly selected people are 99.9% the same. Thus, scientists are interested in that
0.1% difference between individuals since they believe that some of these differences may be used to realize some of the public health benefits as described in Chapter 1.
It is important to discuss the types of genomic differences that can happen. See
figure 2.1 for some visual examples. First, imagine that we have the human reference genome and the genome of another person whom we will call Adam. Imagine that a particular site on the human reference genome, such as the one underlined in figure 2.1,
1 The reason for this is that in the past, when sequencing was more costly, the benefit received from sequencing both sets of chromosomes was not worth the extra cost to do so. This is because at the vast majority of sites, the two sets of chromosomes are equal, i.e. they are homozygous.
6 contains the nucleotide C but Adam, in his two sets of chromosomes, might have the nucleotides G and G. Since there is a single nucleotide difference between the reference genome and Adam’s genome, this means that Adam has a single nucleotide polymorphism
(abbreviated SNP and pronounced “snip”) at this site. Moreover, in this example, Adam has a homozygous SNP since the nucleotides on both sets of chromosomes are identical.
If however, one of Adam’s sets of chromosomes has a C whereas the other has a G, then Adam would have a heterozygous SNP since the nucleotides between his sets of chromosomes do not match and there is still a difference between the reference genome and his genome.
Further, insertions and deletions are often seen when a sample genome is compared to the reference genome. An insertion means that an extra nucleotide is present in the sample's genome compared to the reference genome. Likewise, a deletion means that a nucleotide is missing from the sample's genome compared to the reference genome. Collectively, insertions and deletions are known as 'indels'. The ideas of heterozygosity and homozygosity apply to indels as well. There are also other common genomic variations that are beyond the scope of this discussion.
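The genotype-to-symbol relationship described above (e.g. a heterozygous A/T genotype written with the ambiguity code W from table 2.1) can be made concrete with a small lookup table. The sketch below is purely illustrative and is not code from this dissertation's software; the helper name is an assumption.

```python
# Map an unordered pair of alleles to its IUPAC ambiguity code (table 2.1).
# Toy illustration only; the function name is hypothetical.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R",
    frozenset("AT"): "W", frozenset("CG"): "S",
    frozenset("CT"): "Y", frozenset("GT"): "K",
}

def genotype_code(allele1, allele2):
    """Return the IUPAC symbol for a diploid genotype."""
    return IUPAC[frozenset((allele1, allele2))]

# A homozygous G/G genotype is just G; a heterozygous A/T genotype is W.
print(genotype_code("G", "G"))  # G
print(genotype_code("A", "T"))  # W
```

Because the lookup key is an unordered set, the allele order does not matter: A/T and T/A both resolve to W.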
2.2 High Throughput (Next Generation) Sequencing
As described in Chapter 1, since the middle to late 2000s, there has been a shift from the 'first generation' of sequencing, using Sanger technology, to newer 'next generation' methods [49]. Though the Sanger method dominated for nearly two decades and gave us the human reference, it is costly and slow, and so produced relatively little data to analyze. Next generation sequencing (NGS), which falls under

Technology                     Sanger              Illumina
Technical name                 Chain termination   Sequencing by synthesis
Technology generation moniker  'First generation'  'Next generation'
Popular commercially           1980s - mid 2000s   Late 2000s - present
Amount of data produced        Low                 Much higher
Read length                    Long                Shorter
Accuracy                       High                Not as high
Speed                          Slow                Much faster
Cost                           Expensive           Much cheaper
Statistical methods            Mostly developed    In development

Table 2.2: A very simplified, relative comparison of Sanger and Illumina DNA sequencing technologies.
1. Acquisition, Fragmentation, Amplification
          ↓
2. Sequencing (raw intensity data)
          ↓
3. Base-Calling (quality scores)
          ↓
4. Alignment (quality scores)
          ↓
5. Consensus and Variant (SNP) Calling
Figure 2.2: This figure identifies the major steps undertaken in most HTS data analyses.
the category of high throughput sequencing (HTS), however, has both advantages and disadvantages. In particular, this discussion will be focusing on the ‘next generation’
Illumina sequencing technology [26]. Though it is much cheaper and faster and generates a massive amount of data, Illumina sequencing is less accurate than the Sanger method, produces shorter reads and requires new statistical methods to handle the massively increased data throughput. See table 2.2 for a relative comparison between these two sequencing technologies.
The following is a simplified storyline explaining how HTS data from the Illumina sequencing hardware can be used to find variants between an organism's genome and its species' reference genome. Please see figure 2.2 for a summary of the following steps.
1. First, many genome samples are acquired from the organism. This can be done in
a variety of ways, one of which is swabbing some skin cells from inside the mouth.
This step is known as acquisition.
2. We realize it would be hard to sequence the genome from one end to the other since
the genome sequence is usually very long (> 3 billion base pairs in humans). So
we then break up the genome samples into small sections called reads. In existing
technologies, the read length is often 50, 100 or even more base pairs (bp). This
step is known as fragmentation.
3. Though it is now easier to sequence the individual reads than the entire genome, it
is still difficult to get a clear picture of the actual bases on each read since they are
at the molecular level in terms of size. So we attach these reads to different parts
of a slide and amplify the reads by making colonies of read copies since it is easier
to sequence a colony of read copies than it is to sequence an individual read. This
step is called amplification. Taken together, the acquisition, fragmentation and
amplification steps are also known as the genomic sample preparation procedure.
4. To sequence a colony of amplified reads, each base pair, or cycle, is photographed
4 times. Each of these photographs yields a raw image which filters for one of the
4 bases A, C, G and T. Thus the intensity values for each base are recorded at
each cycle. This step is called sequencing.
5. The base-calling step happens when the raw intensity data is processed by a base-
calling algorithm. The simplest base-calling algorithm just chooses the highest

Name             Year  Author(s)             Notes
BING [33]        2010  Kriseman et al.       NGS Pipeline
Srfim [9]        2009  Corrada-Bravo et al.  Model-based
Ibis [29]        2009  Kircher et al.        Machine Learning
BayesCall [28]   2009  Kao et al.
PIQA [47]        2009  Martinez-Alcantara    NGS Pipeline
Swift [66]       2009  Whiteford et al.
Rolexa [57]      2008  Rougemont et al.
Alta-Cyclic [18] 2008  Erlich et al.
Bustard [25]     N/A   Illumina

Table 2.3: Some base-calling procedures. Srfim is perhaps the most statistically interesting, though Illumina's Bustard is the most popular as it is the default choice for those with Illumina hardware.

GAGTTATATCGCTTCCATGA
GAGTTTTATCGCTTCCATGACGCACAAGTT
GAGTTTTATCGCTTCCATGACGCAGAAGTT
GAGTTTTATGGCTTCCATGACGCACAAGTTAACACTTTCG
          GCTTCCATGACGCACAAGTTAACACTTTCGGATATTTCTG
                    CGCACAAGTTAACACTTTCGGATTTTTCTGATGAGTCGAA
                    CGCACAAGTTAACACTTTCGGATTTTTCTGATGAGTCGAA
                              AACACTTTCGGATATTTCTGATGAGTGGAA...
                                        GATTTTTCTGATGAGTCGAA...
GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAA...
123456789012345678901234567890123456789012345678901234567890...

Figure 2.3: This is an illustration of some reads that have been aligned to a reference, along with possible SNPs. Non-reference alleles are in red.
raw intensity value as the base-call, though many would choose to use a more
statistically sophisticated model-based base-calling algorithm. See table 2.3 for a
partial listing of available base-callers. Most researchers use the default Illumina
Bustard base-caller that comes with the Illumina hardware. In addition to the actual determination of each base, most base-calling algorithms provide a base-calling
quality score for each base. Essentially this quality score gives the probability of
an incorrectly called base.
6. Now the problem we have is that breaking up the original genome samples in
the fragmentation stage results in the loss of information as to where each read

Name            Year  Author(s)         Notes
Bowtie [34]     2009  Langmead et al.   RNA-Seq friendly
SOAP2 [42]      2009  Li, R. et al.
Mosaik [63]     2009  Strömberg et al.
BWA [38]        2009  Li, H. et al.     Burrows-Wheeler transform
Novoalign [64]  N/A   Novocraft         For purchase
MAQ [40]        2008  Li, H. et al.     Precursor to BWA

Table 2.4: Some alignment procedures. The theory behind the alignment methods requires a strong computer science background.
goes. But since we have the reference genome, we can just match the reads to the
reference, which is called the alignment step. See figure 2.3 which demonstrates
what reads aligned to a reference genome would look like. Also, see table 2.4 which
gives a partial listing of alignment software packages. Since the organism will have
some differences with the reference, we cannot require each alignment to be a perfect
match. For example, if a particular read matches all but 1 of the nucleotides,
we can still align the read by assuming that there might be a base-calling error,
SNP or indel at the mismatched location. Now that we have aligned our reads,
each read will have an alignment quality score which indicates how well the read
matches its aligned location. A base-call that is aligned to a specific location on
the reference genome is frequently termed an allele aligned to that position. The
set of alleles aligned to a location is often called the pileup at that location.
7. Generally the base-calling quality and alignment quality scores are considered along
with the allele pileup in the consensus and variant calling step. Based on these
scores, we can now determine the consensus genotype of each position on the
sample organism’s genome. The consensus genome sequence is the sequence we
decide belongs to the organism based on the aligned reads. Again, the most simple
way to determine the consensus genotype is to just output the mode of the aligned

Name             Year  Author(s)        Notes
VarScan 2 [31]   2012  Koboldt et al.
Bambino [15]     2011  Edmonson et al.
piCALL [4]       2011  Bansal et al.    Based on SNIP-Seq
GATK [48], [13]  2010  McKenna et al.   Popular
FreeBayes [19]   2010  Garrison et al.  Based on gigaBayes
SNIP-Seq [3]     2010  Bansal et al.
SNVMix2 [21]     2010  Goya et al.      Mixture model
Slider II [44]   2010  Malhis Jones     Coverage independent
Atlas-SNP2 [59]  2010  Shen et al.      Logistic reg priors
inGAP [54]       2010  Qi et al.        Based on POLYBAYES
SAMtools [39]    2009  Li, H. et al.    Popular
VarScan [30]     2009  Koboldt et al.
SNVMix1 [58]     2009  Shah et al.      Mixture model
SOAPsnp [41]     2009  Li, R. et al.    Similar to MAQ
Slider [43]      2009  Malhis et al.    Merge-sort approach
MAQ [40]         2008  Li, H. et al.    Error dependency
gigaBayes [46]   N/A   Marth et al.     Based on POLYBAYES
POLYBAYES [45]   1999  Marth et al.     Pre-NGS

Table 2.5: Some SNP calling procedures. This is a partial listing of the SNP callers considered during the research phase of this dissertation.
alleles. For instance, looking at figure 2.3, it will be safe to say from the allele
pileups that the first 5 consensus genotypes are GAGTT. However, we encounter an
issue at the 6th location on the reference genome, as there is one non-reference allele
in the pileup. Specifically, the reference at this location is T and the allele pileup
contains 3 T nucleotides and 1 A nucleotide. Naturally one would gravitate to
calling the consensus genotype as T but there are situations where this may not be
prudent. For instance, it is possible that the T nucleotides can have low base-calling
and alignment quality scores associated with them while the A nucleotide has high
base-calling and alignment quality scores. In this case, we might call the consensus
nucleotide as W (A or T). Thus it is clear that we need to incorporate more than
just the allele pileup to get an accurate consensus call. The Genotype Model
Selection (GeMS) procedure, as explained in chapter 4, will give more information
on calling consensus genotypes.
12 8. Variant calling builds on consensus calling and adds the step of determining whether
there is enough evidence to report a variant. There are many possible variants,
such as those shown in figure 2.1, but we will focus on SNP calling. Looking back to
figure 2.3, we considered the fact that there might be a heterozygous SNP at location
6 on the reference genome. Based on our model, if W has by far the strongest
genotype likelihood compared to the other genotype likelihoods then we should be
inclined to call a SNP here. Given non-extreme base-calling and alignment quality
scores, the sites with just one non-reference allele on figure 2.3 are less likely to
harbor a SNP than the sites that have a higher proportion of non-reference alleles.
For instance, without knowing the base-calling and alignment quality values, it
would be easier to label sites 25 and 44 as homozygous and heterozygous SNPs,
respectively. The quality scores and other information allow GeMS, as explained
in chapter 4, to accurately call SNPs.
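The consensus step described above can be caricatured as a majority vote over the allele pileup. The sketch below (with a hypothetical helper name and an arbitrarily chosen threshold) ignores the base-calling and alignment quality scores entirely, which is precisely the limitation that motivates a model-based caller such as GeMS:

```python
from collections import Counter

def naive_consensus(pileup, min_frac=0.7):
    """Majority-vote consensus call from the aligned alleles at one site.

    Ignores base-calling and alignment quality scores entirely -- the
    weakness that model-based callers like GeMS are built to address.
    Returns 'N' when no allele reaches the min_frac threshold.
    """
    counts = Counter(pileup)
    allele, n = counts.most_common(1)[0]
    return allele if n / len(pileup) >= min_frac else "N"

# Site 6 of figure 2.3: a pileup of three T alleles and one A allele.
print(naive_consensus(["T", "T", "T", "A"]))  # T
# An evenly split site cannot be resolved by counting alone.
print(naive_consensus(["C", "C", "G", "G"]))  # N
```

As the text notes, if the three T bases had low quality values and the lone A had high ones, a quality-aware caller might instead report W (A or T); a pure vote cannot make that distinction.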
2.3 File Format Specifications
When taking part in the analyses involved in figure 2.2, an HTS data analyst will likely need to work with a variety of file formats, just as such an analyst would want to work with many of the software packages listed in tables 2.3, 2.4 and 2.5. Six such useful file formats are explained below:
• FASTA [7] The FASTA file specification is the simplest of the 6 useful file
formats described in section 2.3. FASTA files serve as text based containers of
nucleotide strings. With respect to SNP calling, FASTA files will most commonly
be used to contain reference genome sequences. For a single nucleotide string
FASTA file, such as a single chromosome reference genome, the first line is reserved

> Alphanumeric | sequence | code | sequence name keywords
ACGATGCGATACAAAAAAAAAAAAAAGATAACGATAGATTTTTTAGACAGATACCCAGAC
ACCCAGATAGCAGACCCCCCCAGTACAGCAATGACCCGGGGCATACGACCCCCCCCTACT
CCCCCCCCCTCAGACCATGGATGGCGGGGGGGGGGGGGGACACGATGAGATCCAGAGTCA
GCCCCCAGAGATCACGACATACGATCAGACTACGTTTTTTTTTACATCACGACATCAGAC
CAGCATATTTTAGAGGGGGAAACGACACACAGCAGACCCCCCCCCCCCCCCAGACAGCAG

Figure 2.4: The first six lines of a FASTA file example.
for the sequence information and usually begins with the greater-than symbol (“>”).
The remaining lines contain the nucleotide sequence in the correct order as given
by the physical nucleotide string. It is generally recommended that no lines are
longer than 80 characters and sometimes other line lengths, such as 60, are used.
The first 6 lines of a toy example FASTA file are given in figure 2.4.
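A minimal reader for the layout just described (a header line beginning with ">", with the sequence wrapped over the following lines) might look like the sketch below; the function name is illustrative, not part of any package discussed here.

```python
def read_fasta(lines):
    """Parse FASTA text into a {sequence_name: sequence} dict (toy sketch)."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:           # flush the previous record
                records[name] = "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:                         # sequence lines are concatenated
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

example = [">chr1 toy reference", "ACGATGCGAT", "ACCCAGATAG"]
print(read_fasta(example))  # {'chr1 toy reference': 'ACGATGCGATACCCAGATAG'}
```

Concatenating the wrapped lines recovers the full nucleotide string regardless of whether the file was wrapped at 60 or 80 characters.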
• FASTQ [35] The FASTQ file specification adds some more information to the
FASTA file specification. Instead of a general nucleotide sequence, FASTQ files
account for short reads, specifically those defined in section 2.2. Information about
short read sequences are given in every 4 line section of a FASTQ file, i.e. lines
1-4, 5-8, 9-12, etc. Thus FASTQ files will always have 4 times the number of
lines as the number of short reads described by the FASTQ file. The first line
of these 4 line sections always gives the read identifier and begins with an “at”
(‘@’) symbol. These sequence identifiers usually have a specific format depending
on the type of reads represented by the FASTQ file. The second of these 4 lines
always gives the base-calls of the read sequence. The third line always begins with
a “plus” (‘+’) symbol and is optionally followed by the same sequence identifier in
the first line. The fourth line gives the base-calling quality values associated with
the corresponding base-calls in the second line. Since these base-call quality values
are encoded in the Phred ASCII character-integer scheme, the user needs to be

@INSTRUMENT:LANE:TILE:X_POS:Y_POS:#0/1
ACGATGCGATACAAAAAAAAAAAAAAGATAACGATAGATTTTTTAGACAGATACCCAGAC
+
OFPXQXZTa‘aTaa‘aJaVVJJV‘ZWZ‘aaaaaa‘aTXOFNOOGOONNZTZTBQXWSSaT

@INSTRUMENT:LANE:TILE:X_POS:Y_POS:#0/2
ACCCAGATAGCAGACCCCCCCAGTACAGCAATGACCCGGGGCATACGACCCCCCCCTACT
+
VZZRaaOYOY‘ZTTZQQaXXVaaNV‘OYO‘X‘WXWUXYYWZXWVTZVVW‘aaTTBBBYBZ

Figure 2.5: The first reads of a pair of FASTQ files that contain paired-end data.
aware of the exact Phred offset value used (usually 33 or 64). Additionally, HTS
read data can be "paired-end" or "mate-pair". This means that the beginning and
end of a long nucleotide string are sequenced but the exact length of the center
nucleotide string is not known. If this is the case then the data is usually given
with a pair of FASTQ files, and a read at a certain location in one FASTQ file
will be paired with the read in the same location of the paired FASTQ file. If the
data is known to be a part of multiple samples, then generally each FASTQ file
will represent no more than 1 sample. Figure 2.5 gives a toy example that shows
the first reads of a pair of FASTQ files that contain paired-end data. The example
uses the Illumina sequence identifier format.
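The Phred encoding mentioned above can be decoded in one step: subtracting the offset from a quality character's ASCII value gives the quality score Q, and the corresponding base-call error probability is 10^(-Q/10). A small sketch, assuming the common offset of 33:

```python
def phred_to_error_prob(qual_char, offset=33):
    """Convert one FASTQ quality character to a base-call error probability."""
    q = ord(qual_char) - offset      # Phred quality score
    return 10 ** (-q / 10)           # P(base was called incorrectly)

# With an offset of 33, the character 'I' encodes Q40,
# i.e. a 1-in-10,000 chance that the base was miscalled.
q_char = "I"
print(ord(q_char) - 33)             # 40
print(phred_to_error_prob(q_char))  # 0.0001
```

The same function with `offset=64` handles the older Illumina encoding; misreading the offset shifts every quality score by 31, which is why the text stresses knowing which offset a FASTQ file uses.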
• SAM [22] The SAM file format is one of the most complicated of these 6 commonly used file formats. SAM files contain the alignment information output after
running a FASTQ file through an alignment procedure that supports SAM format
output. Usually SAM format files have extensive header sections that give details
about the alignment. Following this header section is the read section. The order
of this read section is, by default, the order given in the FASTQ file; however,
users have some flexibility in reordering this section. Each line represents a
15 read from the FASTQ file and gives much read alignment information including the
read sequence, the base-call quality scores and mapping quality score. If the data is
known to be a part of multiple samples, then generally each SAM file will represent
no more than 1 sample. There is generally a wealth of information contained in
these files and it takes much experience to fully utilize all this information.
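The per-read lines described above follow the SAM specification's 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), with optional tags after them. A toy parser, using an invented read record purely for illustration:

```python
# The 11 mandatory columns of a SAM alignment line, per the SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
              "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_record(line):
    """Map the mandatory SAM fields; optional tags are kept as raw strings."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    record["POS"] = int(record["POS"])    # 1-based leftmost mapping position
    record["MAPQ"] = int(record["MAPQ"])  # mapping quality score
    record["TAGS"] = cols[11:]
    return record

# A made-up record: one 20 bp read mapped to position 6 of chr1 with MAPQ 37.
rec = parse_sam_record(
    "read1\t0\tchr1\t6\t37\t20M\t*\t0\t0\t"
    "GAGTTTTATCGCTTCCATGA\tIIIIIIIIIIIIIIIIIIII")
print(rec["RNAME"], rec["POS"], rec["MAPQ"])  # chr1 6 37
```

Note that the QUAL column is a Phred-encoded string just like the fourth line of a FASTQ record, so the same offset caveat applies here.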
• BAM [22] The BAM file format is a binary version of the text-based SAM format.
Though the SAM format describes the layout of the BAM format, it is the BAM
format that is used most often. The reason for the ubiquity of the binary BAM
format is that BAM files are usually much smaller in size than SAM files, but they
contain the same amount of information. Even so, BAM files can easily be found in
the gigabyte range with respect to size. Additionally, BAM files have an advantage
with respect to computational performance when compared to the SAM format.
The premier software suite to work with these SAM/BAM files is SAMtools [39].
• PILEUP [36] Most SNP calling procedures now support BAM file input, however
prior to this, the PILEUP format was more popular. One reason could be the
ease with which PILEUP files are parsed. Since many SNP calling procedures consider
each site independently, the PILEUP format can deliver the essential data without
any preprocessing. The main difference between the SAM/BAM formats and the
PILEUP format, is that the PILEUP gives a site’s allele pileup, the base-calling
quality values and optionally the alignment quality values on one line. In order to
get this information from a SAM/BAM file, the SNP caller would need to parse
many lines to gather all the alleles piled up at a site from all the reads that covered
the site. As with SAM/BAM files, if the data is known to be a part of multiple
samples, then generally each PILEUP file will represent no more than 1 sample.
The advantage that the SAM/BAM file formats have over the PILEUP format is
the wealth of information available about all the alignments. In this way, we see
that the PILEUP format is essentially a stripped-down alignment file derived from a
SAM/BAM file.
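A sketch of why PILEUP lines are easy to consume site-by-site: each line already aggregates one site’s pileup, so a SNP caller can process it with a single split. The function name and example line below are invented for illustration; a full parser must also interpret the read-start (^), read-end ($) and indel markers inside the base column, which this sketch ignores.

```python
def parse_pileup_line(line, offset=33):
    """Parse one samtools-style pileup line: chromosome, 1-based
    position, reference base, depth, the allele pileup string, and the
    base-call quality string (optionally followed by mapping qualities).

    Caveat: the base column also encodes read starts (^), read ends ($)
    and indels, which this minimal sketch does not handle."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, depth, bases, quals = fields[:6]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "depth": int(depth),
        "bases": bases,  # '.' and ',' denote matches to the reference
        "base_quals": [ord(c) - offset for c in quals],
    }

site = parse_pileup_line("chr1\t100\tA\t5\t..,.C\tIIIII")
# One line gives everything a site-independent SNP caller needs.
```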
• VCF [53] The VCF file format is also one of the most complicated out of these 6
commonly used file formats. In this way, it is very similar to the SAM file. The
VCF file is composed of 2 parts, an extensive header and a detailed site-by-site
listing of variant information. As with the SAM file, the information contained in
the header and in the variant listing will depend on the variant caller used. The
variant listing will include the reference sequence position, the reference allele and
the variant genotype. If the data is known to be a part of multiple samples, then
generally one VCF file will be used to account for all the variants in all the
samples. In this way, the VCF format generally differs from the other file
formats discussed above, which usually account for just 1 sample. There is
an extensive amount of additional information contained in these files and it takes
much experience to fully utilize this information. The BCF file format is a binary
version of the VCF file format. It shares the advantages that the BAM file format
has over the SAM file.
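The two-part structure of a VCF file, a meta-information header followed by a site-by-site variant listing, can be sketched as below. The helper name and the toy records are invented for illustration; real VCF files also carry per-sample genotype columns and structured INFO fields, which this minimal reader leaves as raw strings.

```python
def parse_vcf(lines):
    """Yield variant records from VCF text, skipping '##' meta lines.

    The '#CHROM' header line names the fixed columns (and any sample
    columns that follow them)."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            continue  # meta-information header
        if line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")
            continue
        yield dict(zip(header, line.split("\t")))

vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t60\tPASS\tDP=20",
]
records = list(parse_vcf(vcf))
```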
Chapter 3
Literature Review
To begin this literature review, the distinction between single sample and multiple sample data should be made. Single sample HTS data is generally assumed to be sequencing data of genomic reads that were taken from only 1 organism1. Broadly speaking, ‘multiple sample HTS data’ can include samples from different species, but the term usually refers to sequencing data of genomic reads taken from more than 1 organism of the same species.
3.1 Single Sample SNP Callers
As most single sample SNP callers work with similar types of data, there are quite a few similarities between the SNP callers. Thus, instead of considering the details of every SNP caller introduced, we will consider a few representative SNP calling procedures, namely MAQ, gigaBayes, Atlas-SNP2 and SNVMix. Then we will compare major points of interest between them.

1If the samples were collected from different places on the organism, or during different time periods, that harbor different genomes, then a multiple genome sample framework accounting for these differences would be preferable. This situation, though, is beyond the scope of this document.
As covered in the simplified storyline in section 2.2, for a given genomic site,
SNP calling methods first try to determine the consensus genotype and then they call a SNP if there is a relatively high amount of confidence that the consensus genotype is non-reference. To determine this confidence level, as well as the consensus genotype,
SNP callers use a variety of information beyond the basics of the allele pileup and the base-calling and alignment quality scores. To incorporate all of this information into the
SNP calling decision, almost all of the available SNP calling procedures use some form of Bayes’ theorem2. Given that B is some set such that P(B) > 0 and that A1, A2, A3, . . . , An form a partition of a sample space, a basic form of Bayes’
theorem states the following:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{P(B)} = \frac{P(B \mid A_i)\,P(A_i)}{\sum_j P(B \mid A_j)\,P(A_j)} \qquad (3.1)$$
A large difference between the SNP calling procedures, however, is just how the partition {Ai} in Bayes’ theorem is structured.
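As a concrete sketch of equation (3.1), the snippet below computes the posterior over a small partition of genotype hypotheses from priors and likelihoods; the denominator is P(B), obtained by the law of total probability. The numbers are illustrative only and are not taken from any profiled SNP caller.

```python
def posterior(priors, likelihoods):
    """Compute P(A_i | B) for each hypothesis A_i in a partition, given
    prior probabilities P(A_i) and likelihoods P(B | A_i), per Eq. (3.1)."""
    joint = {g: priors[g] * likelihoods[g] for g in priors}
    total = sum(joint.values())  # P(B), by total probability
    return {g: joint[g] / total for g in joint}

# Illustrative numbers only: three genotype hypotheses at one site.
priors = {"AA": 0.4995, "AC": 0.001, "CC": 0.4995}
likes = {"AA": 1e-8, "AC": 1e-3, "CC": 1e-6}
post = posterior(priors, likes)
# The posteriors sum to 1; the consensus call is the argmax hypothesis.
```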
3.1.1 MAQ
As alluded to above, MAQ [40] uses Bayes’ theorem to call a consensus. MAQ considers only the 2 most frequently aligned alleles b and b′ covering each particular site3.
The reason for this is that, given a diploid genome, there really only should be up to two nucleotides represented in the reads as more than two nucleotides would generally
2There are some SNP callers which do not make use of Bayes’ theorem. One such example is the heuristic VarScan procedure [30] which in essence reports all variant alleles at genomic sites that meet certain criteria.
3Note that each of the profiled methods is described using the notation of its respective paper. Though notational unity would be preferable, the ideas behind the SNP calling methods are both similar and diverse, and attempting to combine them all under a unified notational scheme leads to logical difficulties.
represent an error. From these two most frequent nucleotides, MAQ assumes that there are three possible genotypes, namely g ∈ {⟨b, b⟩, ⟨b′, b⟩, ⟨b′, b′⟩}, where b, b′ ∈ {A, C, G, T} and the angled brackets, ⟨ ⟩, denote an unordered set. Let n be the combined total of b and b′ alleles covering the site and let k be the number of called b alleles.
The data D includes $\epsilon_i = 10^{-Q_i/10}$, where $Q_i$ = min(mapping quality of the ith read covering the site, base-calling quality for the base on the ith read covering the site).
The consensus is called to be $\hat{g} = \arg\max_g P(g \mid D)$, where

$$P(g \mid D) = \frac{P(g)\,P(D \mid g)}{\sum_s P(s)\,P(D \mid s)}. \qquad (3.2)$$
The prior probabilities are fixed as follows: P(⟨b, b′⟩) = r and P(⟨b, b⟩) = P(⟨b′, b′⟩) = (1 − r)/2, where r = 0.2 at known SNP sites and r = 0.001 elsewhere. Thus,
MAQ works best, as intended, when prior knowledge of known SNP sites is available.
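The fixed MAQ priors described above can be written down directly. This is a minimal sketch: the function name and dictionary keys are shorthand invented here for the three genotypes ⟨b, b⟩, ⟨b′, b⟩ and ⟨b′, b′⟩, not part of MAQ itself.

```python
def maq_genotype_priors(known_snp_site=False):
    """MAQ's fixed genotype priors over {<b,b>, <b',b>, <b',b'>}:
    the heterozygote <b',b> gets prior r (0.2 at known SNP sites,
    0.001 elsewhere) and each homozygote gets (1 - r) / 2."""
    r = 0.2 if known_snp_site else 0.001
    return {"hom_b": (1 - r) / 2, "het": r, "hom_b_prime": (1 - r) / 2}

p = maq_genotype_priors(known_snp_site=True)
# The three priors always sum to 1, forming a proper prior distribution.
```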
As for developing the conditional probability P(D | g), first let us consider g = ⟨b′, b⟩. For
this genotype, the MAQ reference [40] simply states,