USOO9388462B1

(12) United States Patent (10) Patent No.: US 9,388,462 B1

Eltoukh 45) Date of Patent: Jul. 12,9 2016

(54) DNA SEQUENCING AND APPROACHES 6,681,186 B1* 1/2004 Denisov et al...... TO2/20 THEREFOR 2002fO147548 A1* 10, 2002 Walther ...... GON 39.83 (75) Inventor: Helmy Eltoukhy, Woodside, CA (US) 2003/O120471 A1*ck 6/2003 IZmailov ...... GOIN 27s, (73) Assignee: The Board of Trustees of the Leland OTHER PUBLICATIONS States Junior University, Stanford, Svantesson et al. (A mathematical model of the Pyrosequencing reaction system.” Biophysical Chemistry 2004. 110, 129-145).* - r Ewing et al. (“Base-Calling of Automated Sequencer Traces Using (*) Notice: Subject to any disclaimer, the term of this Phred. I. Accuracy Assessment” Genome Research 1998, 8, 175 patent is extended or adjusted under 35 185).* U.S.C. 154(b) by 1193 days. Eltoukhy et al., “Modeling and Base-Calling for DNA Sequencing by-Synthesis.” 2006 IEEE International Conference on Acoustics, (21) Appl. No.: 11/748,266 Speech and Signal Processing, vol. II, pp. 1032-1035, May 14-19. 2006. 1-1. Eriksson, S. "Cytochrome P450 Genotyping by Multiplexed Real (22) Filed: May 14, 2007 Time DNA Sequencing with PyrosequencingTM Technology.” ASSAY and Drug Development Technologies, 2002, 1,49-59.* O O Brent Ewing and Phil Green. “Base-Calling of Automated Sequencer Related U.S. Application Data Traces. UsingPhred.?II.Error?Probabilities.” Genome Research 8: (60) Provisional application No. 60/799,769, filed on May 186-194 (1998). 12, 2006. k . s cited by examiner (51) CI2OInt. Cl. I/68 (2006.01) Primary Examinerin a Yelena G Gakh Assistant Examiner — Michelle Adams C12OGOIN 21/66 1/76 (2006.01) (74)74). AttAttorney, Agent, or FirmFi — Crawford MMaunu PLLC (52) U.S. Cl. 57 ABSTRACT CPC ...... CI2O I/6869 (2013.01); C12O 1/66 (57) - 0 (2013.01); G0IN 2 1/76 (2013.01); Y10T Polymer Sequencing 1S facilitated. According to an example 436/143333 (2015.01); Y10T 436/166666 embodiment of the present invention, a polymer sequencing (2015.01) approach is implemented uS1ng a multitude of polymer speci 58) Field of Classification S h mens for a particular polymer type is implemented. For each (58) Field of Classification Searc step in a polymer sequencing test, non-idealities are catego

USPC ... ------435/94 rized using data obtained from the polymer sequencing test, See application file for complete search history. and in response to the categorized non-idealities, a polymer (56) References Cited sequence is identified for a corresponding step of the polymer sequencing test. With this approach, read lengths achieved with a particular polymer sequencing method can be U.S. PATENT DOCUMENTS improved. 5,527,675 A * 6/1996 Coullet al...... 435/6 6,303.303 B1 * 10/2001 Green et al...... 435/6 17 Claims, 23 Drawing Sheets

e 400

------f" 1. 420 Polymer hemistry Non-Ideality Model 8x

- 430 - 440 ax E. : Estimation/DetectionAlgorithm L. Polyne?Most sequence likel U.S. Patent Jul. 12, 2016 Sheet 1 of 23 US 9,388,462 B1

S

U.S. Patent Jul. 12, 2016 Sheet 2 of 23 US 9,388,462 B1

FIG. 2A

FIG. 2B

0.15 FIG. 2C 0.1 0.05

100 120 140 # Tests U.S. Patent Jul. 12, 2016 Sheet 3 of 23 US 9,388,462 B1

MLSD Estimate N-310 A.N-330

SBS Bound i.

FIG. 3 U.S. Patent US 9,388,462 B1

\

#7“?IH

U.S. Patent Jul. 12, 2016 Sheet 5 of 23 US 9,388,462 B1

f A - G TT

rixis

is

--- l W & - T G C A Nucleotide added

FIG. 5 U.S. Patent Jul. 12, 2016 Sheet 6 of 23 US 9,388,462 B1

(

FIG. 6 U.S. Patent Jul. 12, 2016 Sheet 7 of 23 US 9,388,462 B1

FIG. 7 U.S. Patent Jul. 12, 2016 Sheet 8 of 23 US 9,388,462 B1

Test DAG for Template TAG Subgroup Weights

1 O O

F.G. 8 U.S. Patent Jul. 12, 2016 Sheet 9 of 23 US 9,388,462 B1

0.9 U.S. Patent Jul. 12, 2016 Sheet 10 of 23 US 9,388,462 B1

NOn-linear

Filter

l? -- Linear Filter

FIG. 10 U.S. Patent Jul. 12, 2016 Sheet 11 of 23 US 9,388,462 B1

o 5 10 15 2O 25 3.

Issus 140 xis so gx88s: 180 88.8.xxyyxon 200 220 U.S. Patent Jul. 12, 2016 Sheet 12 of 23 US 9,388,462 B1

# Tests

FIG. 12 U.S. Patent Jul. 12, 2016 Sheet 13 of 23 US 9,388,462 B1

U.S. Patent Jul. 12, 2016 Sheet 14 of 23 US 9,388,462 B1

p=0.95 c=0.04 15 s

O.5

15

1 U.S. Patent Jul. 12, 2016 Sheet 15 of 23 US 9,388,462 B1

Partial-MLSD

if Tests

FIG. 15 U.S. Patent Jul. 12, 2016 Sheet 16 of 23 US 9,388,462 B1

.8

t U.S. Patent Jul. 12, 2016 Sheet 17 Of 23 US 9,388,462 B1

4. it is Tests FIG. 17 (b) U.S. Patent Jul. 12, 2016 Sheet 18 of 23 US 9,388,462 B1

r

FIG. 18 U.S. Patent Jul. 12, 2016 Sheet 19 of 23 US 9,388,462 B1

"o so too so too so sooo so Tests

FIG. 19 U.S. Patent Jul. 12, 2016 Sheet 20 of 23 US 9,388,462 B1

X MLSD Bound y

S5S Bound

WMLSD Bound U.S. Patent Jul. 12, 2016 Sheet 21 of 23 US 9,388,462 B1

O - GOO l sool

400 3CO OO

OO

SOD ------w------

500 400h. 300

O

25 50 75 OO 125 50 S. 20 Tests (b) FIG. 21 U.S. Patent Jul. 12, 2016 Sheet 22 of 23 US 9,388,462 B1

--- imental to redicte

848& Experimental - Predicted

5 t

FIG. 22 (b) U.S. Patent Jul. 12, 2016 Sheet 23 of 23 US 9,388,462 B1

282 29O. 298 306 34 322 330 338 Tests

FIG. 23 US 9,388,462 B1 1. 2 DNA SEQUENCING AND APPROACHES However, despite the past and ongoing enhancements to THEREFOR existing CAE technology, under this regime, the lower limit in terms of cost per mammalian genome is intolerably exces RELATED PATENT DOCUMENTS sive for many applications. 5 This patent document claims the benefit, under 35 U.S.C. These and other characteristics have been challenging to S119(e), of U.S. Provisional Patent Application No. 60/799, polymer sequencing and related applications, for both 769, filed May 12, 2006 and entitled: “Polymer Sequencing sequencing-by-synthesis and cleavage approaches, for a vari and Approaches Therefor.” ety of polymers such as proteins and DNA. FEDERALLY SPONSORED RESEARCHOR 10 SUMMARY DEVELOPMENT The present invention is directed to overcoming the above This invention was made with Government support under mentioned challenges and others related to the types of appli contract HG003571 awarded by the National Institutes of cations discussed above and in other applications. These and Health. The U.S. Government has certain rights in this inven 15 other aspects of the present invention are exemplified in a tion. number of illustrated implementations and applications, FIELD OF THE INVENTION Some of which are shown in the figures and characterized in the claims section that follows. The present invention relates generally to polymer Various aspects of the present invention are applicable to sequencing, and more particularly to arrangements and the sequencing of polymers in a manner that addresses dif approaches for polymer sequencing. ferences among polymers analyzed for a particular sequenc ing approach. Such differences may result, for example, BACKGROUND where a particular step in a synthesis or cleaving approach is 25 not carried out with all polymers in a group of polymers Polymer sequencing, such as protein sequencing or DNA undergoing analysis, or with different tests performed at dif sequencing, is susceptible to errors, misalignment and other ferent times for a particular type of polymer. In accordance issues related to inconsistent synthesis and/or cleavage as with various example embodiments, results from sequencing associated with polymers undergoing analysis. These issues tests are adjusted in consideration of these differences. can result in inaccurate analysis and other related issues. 30 In one embodiment of the present invention, a method for One approach to synthesis-based sequencing is DNA polymer sequencing via a multitude of polymer specimens sequencing-by-synthesis, which generally involves bio for a particular polymer type is implemented. For each step in chemical processes by which a target DNA strand is itera a polymer sequencing test, non-idealities are categorized tively built up to determine its sequence. An example of Such using data obtained from the polymer sequencing test, and in methods is Pyrosequencing, which offers the promise of high 35 throughput and low cost sequencing via efficient mass paral response to the categorized non-idealities, a polymer lelization and its relative simplicity. However, achieving reli sequence is identified for a corresponding step of the polymer able DNA read lengths using Pyrosequencing requires high sequencing test. reagent costs to maximize signal fidelity and thereby has In another embodiment, a method for DNA sequencing is often been prohibitively expensive for applications such as 40 implemented. Results from a DNA sequencing test are mod whole genome shot-gun assembly. eled as a noisy Switched linear system parameterized by Recently, there has been great interest in developing potential DNA sequences and respective DNA bases. Using cheaper, higher throughput de novo DNA sequencing tech the modeling, a DNA sequence is selected from the potential nologies and protocols to jumpstart the next phase of genetic DNA sequences with respect to a probability of matching the inquiry beyond the . Although much 45 test results. progress has been made in the last two decades, whereby Consistent with another embodiment, an apparatus is sequencing cost has been reduced from tens of dollars per implemented for polymer sequencing via a multitude of poly base to a few cents per base today, the cost of sequencing a mer specimens for a particular polymer type. The apparatus single mammalian-sized genome is still in the tens of millions has means for applying a polymer sequencing test to itera of dollars. This exorbitant cost greatly hinders such vital 50 tively synthesize or cleave polymer bases. The apparatus has studies as comparative genomic analysis across species, means for categorizing, for each iteration, the non-idealities detailed Studies of human genetic variation, and analyses of using data obtained from the polymer sequencing test, and difficult-to-culture microbial communities. means for identifying a polymer sequence for a correspond Microarray-based technologies can be used for single ing step of the polymer sequencing test using the categorized nucleotide polymorphism (SNP) analysis; however, these 55 genotyping methods are likely to miss rare differences that non-idealities. may be critical to diagnosing certain conditions as well as fail Another embodiment of the present invention includes a to extract long-range information, Such as genomic rear system for polymer sequencing via a multitude of polymer rangements. specimens for a particular polymer type. A polymersequenc One sequencing approach is Sanger sequencing, which 60 ing device iteratively modifies a multitude of polymer speci uses fluorescence detection of dideoxy terminated fragments mens with respect to their polymer bases. Logic, for each resolved by capillary array electrophoresis (CAE). See, e.g., iterative modification, categorizes the non-idealities for dif B. Ewing, P. Green, “Basecalling of automated sequencer ferent ones of the multitude of polymer specimens using data traces using phred. II. Error probabilities. Genome Research from the polymer sequencing device. Logic, for each iterative 8, pp. 186-194, 1998. This method, first developed in 1977, 65 modification and using the categorized non-idealities, iden now allows the sequencing of read segments of approxi tifies a polymer sequence for a corresponding step of the mately 1000 nucleotides long with reasonable accuracy. polymer sequencing test. US 9,388,462 B1 3 4 The above summary is not intended to describe each illus FIG. 11(c) illustrates the evolution of the time-varying trated embodiment or every implementation of the present impulse response for p=0.95, where ISI but becomes more invention. severe after 100 tests, consistent with embodiments of the present disclosure; BRIEF DESCRIPTION OF THE DRAWINGS FIG. 12 depicts a plot showing decorrelation between dif ferent base types for p=0.99, where a sample sequence is The invention may be more completely understood in con randomly perturbed along only a single base-type and its sideration of the detailed description of various embodiments absolute value of its difference from the original sequence at of the invention that follows in connection with the accom the output is plotted and where interpeak regions correspond panying drawings, in which: 10 to tests of the 3 other base-types and are only slightly affected FIG. 1 shows a system and approach for polymer sequenc by the perturbation, consistent with embodiments of the ing as applicable for implementation in connection with one present disclosure; or more example embodiments of the present invention; FIG. 13 depicts pruning of search space, i.e., sphere decod FIG. 2 shows plots of a pulse response as a function of the ing, consistent with embodiments of the present disclosure; 15 FIG. 14(a) illustrates a performance comparison between number of tests respectively in 2A, 2B and 2C: SBSD and the Stack algorithm, where the dashed line repre FIG. 3 shows plots of pulse responses as a function of a sents the end of the error-free region for SBSD, consistent number of tests in sequencing, according to another example with embodiments of the present disclosure; embodiment of the present invention; FIG. 14(b) illustrates a performance comparison between FIG. 4 is a data flow diagram for an approach to sequenc SBSD and partial-MLSD, where the dashed line represents ing, according to another example embodiment of the present the end of the error-free region for SBSD, consistent with invention; embodiments of the present disclosure; FIG. 5 illustrates a DNA template undergoing Pyrose FIG. 15 depicts a Monte-Carlo simulation showing P. per quencing reaction with the output signal is shown in the inset formance for SBSD and partial-MLSD, consistent with box, consistent with embodiments of the present disclosure; 25 embodiments of the present disclosure; FIG. 6(a) illustrates incomplete incorporation phenom FIG.16(a) depicts a simulated Pyrosequencing output with enon relative to the starting state of the DNA strand popula O=0.04 for p=0.995, consistent with embodiments of the tion, consistent with embodiments of the present disclosure; present disclosure; FIG. 6 (b) illustrates where the base T is added, consistent FIG.16(b) depicts a simulated Pyrosequencing output with with embodiments of the present disclosure; 30 O=0.04 for p=0.95, consistent with embodiments of the FIG. 6 (c) illustrates where the base C is added, consistent present disclosure; with embodiments of the present disclosure; FIG. 17(a) illustrates the probability of correct decoding FIG. 6 (d) illustrates where the base A is added, consistent for SBSD versus the number of tests for p=0.95, where the with embodiments of the present disclosure; dotted line shows the lower bound approximation, consistent FIG. 6 (e) illustrates where the base G is added, consistent 35 with embodiments of the present disclosure; with embodiments of the present disclosure; FIG. 17(b) illustrates the probability of correct decoding FIG. 6 (f) illustrates where the base T is added again result for SBSD versus the number of tests for p=0.995, where the ing in an incorporation event occurring at two different posi dotted line shows the lower bound approximation, consistent tions in the sequence, consistent with embodiments of the with embodiments of the present disclosure; present disclosure; 40 FIG. 18 contains a plot showing d, as a function of the FIG. 7 illustrates a Markov chain depicting simplified number of tests for various values of p, consistent with model of single strand nucleotide incorporation, consistent embodiments of the present disclosure; with embodiments of the present disclosure; FIG. 19 shows a loose upper bound on Pfor MLSD decod FIG. 8 illustrates a directed acyclic graph (DAG) showing ing for p=0.95 and p=0.995, consistent with embodiments of state distribution of DNA template (TAG) population versus 45 the present disclosure; added base where the corresponding fraction of the initial FIG.20(a) contains a plot showing MLSDP approxima Strand population at each node is highlighted in the table and tion versus partial-MLSD Monte Carlo simulation and lower are denoted as Subgroup weights, consistent with embodi bound for SBSD for p=0.95, consistent with embodiments of ments of the present disclosure; the present disclosure; FIG. 9 illustrates an output signal versus the number of 50 FIG.20(b) contains a plot showing MLSDP approxima tests n, for the example case oftemplate, ACGTACGT... and tion versus partial-MLSD Monte Carlo simulation and lower t=1 2 3 4 1 2 3 4. . . . consistent with embodiments of the bound for SBSD for p=0.995, consistent with embodiments present disclosure; of the present disclosure; FIG. 10 illustrates a block diagram of AWGN communi FIG. 21(a) shows experimental Pyrosequencing data in cation channel representation of DNA sequencing-by-Syn 55 relative units (RLU) before correction, consistent with thesis, where the output represents a convolution of the input embodiments of the present disclosure; sequence with the time-varying and signal dependent impulse FIG. 21(b) shows experimental Pyrosequencing data in response, which is non-linearly dependent on the input, con relative units (RLU) after correction, consistent with embodi sistent with embodiments of the present disclosure; ments of the present disclosure; FIG. 11(a) illustrates the evolution of the time-varying 60 FIG.22 contains plots showing accuracy of predicted out impulse response for p=0.95, where ISI is negligible for the put versus experimental data in two different regions of the first base, consistent with embodiments of the present disclo DNA sequence, consistent with embodiments of the present Sure; disclosure; and FIG. 11(b) illustrates the evolution of the time-varying FIG. 23 shows experimental Pyrosequencing output show impulse response for p=0.95, where ISI but becomes more 65 ing breakdown in chemistry due to the adversities of product severe after 50 tests, consistent with embodiments of the inhibition and accumulation, consistent with embodiments of present disclosure; the present disclosure. US 9,388,462 B1 5 6 While the invention is amenable to various modifications hood method, maximum a posteriori method, expectation and alternative forms, specifics thereof have been shown by maximization or approximate versions of these methods or way of example in the drawings and will be described in similar estimation or detection methods is applied to experi detail. It should be understood, however, that the intention is mental Pyrosequencing data to generate results having longer not to limit the invention to the particular embodiments read lengths, and in Some applications (e.g., with pyrose described. On the contrary, the intention is to cover all modi quencing), read lengths exceeding 200 bases. These gener fications, equivalents, and alternatives falling within the spirit ated read lengths, having an order of (e.g., natural or modi and scope of the invention. fied) nucleotide bases in a DNA molecule, can be used to determine the sequence and or structure of proteins encoded DETAILED DESCRIPTION 10 by that DNA as well as simply to infer the DNA sequence The present invention is believed to be applicable to a itself. In certain applications, the results are used to provide variety of different types of processes, devices and arrange bounds on the probability of correctly decoding a given ments for polymer sequencing, and certain aspects have been Sequence. found to be particularly suited for protein-related and DNA 15 According to another example embodiment of the present sequencing. Such as sequencing approaches involving natural invention, a polymer sequence is determined using a pyrose or modified nucleotides or amino acids. While the present quencing approach with a noisy Switched linear system invention is not necessarily so limited, various aspects of the model as discussed above. The model assumes incomplete invention may be appreciated through a discussion of incorporation, and includes read noise and non-specific incor examples using this context. poration due to chemical and/or detector non-idealities. According to an example embodiment of the present inven These non-idealities may also be referred to as de-phasing, tion, a polymer is sequenced by analyzing a plurality of Such de-synchrony, carry-forward, lag or other related terms. The polymers at different stages of assembly. In analyzing the stochastic nature of the incorporation rate is ignored (but can plurality of polymers in steps, certain ones of the polymers be included for a more precise fit), and its mean value pe(0, 1) are at stages of assembly that are different than other ones of 25 is considered (which is defined as the average fraction of the polymers due, for example, to incomplete synthesis or template strands that incorporate the added base, when incor ineffective cleavage. The polymers are analyzed in a manner poration should ideally occur). It should be noted that this is that facilitates the identification of a proper stage of assembly just one embodiment of many models that can be used with for the polymer at a particular step from data including poly similar approaches; in various embodiments, other models mers at different stages of assembly. 30 In one application, the above-discussed approach involves that capture the impact and/or behavior of non-idealities in tracking distortion at different stages of assembly of the poly the polymer chemistry of interest are similarly used. mers, and identifying the Source of the distortion at each For general information regarding sequencing, and for spe stage. Using the identified source of the tracked distortion, cific information regarding approaches to sequencing that test results characterized by the distortion are processed to 35 may be used in connection with one or more example embodi mitigate or remove the distortion from the results and thereby ments herein, reference may be made to H. Eltoukhy and A. El facilitate the identification of a polymer sequence that fits the Gamal, Modeling and Base-calling for DNA Sequencing-by results. synthesis, 2006 IEEE International Conference on Acoustics, According to another example embodiment of the present Speech and Signal Processing, 2006, Volume: 2, pages: II-II invention, a sequencing approach involves the analysis of a 40 (2006), which is fully incorporated herein by reference. polymer such as a protein or DNA via synthesis. A plurality of Incomplete incorporation gives rise to a population oftem polymers corresponding to the polymer to be analyzed is plate strands with Subgroups classified by a representative lag synthesized in a step-by-step type of sequencing approach from the ideal Subgroup. The process of Subgroup formation (e.g., Such as that used with Pyrosequencing). At each step in and evolution can be represented, for example, by a weighted the synthesis, the polymers are analyzed in a manner that 45 facilitates the detection and/or determination of a state of the directed acyclic graph such as the example illustrated in FIG. polymers at each step that represents the proper state of the 4 of the above-referenced publication “Modeling and Base particular polymer. calling for DNA Sequencing-by-synthesis.” Nodes, which In another example embodiment of the present invention, a represent subgroups, are organized in levels each correspond sequencing approach involves the analysis of a polymer via 50 ing to a given test. When incorporation occurs, a fraction p of cleavage or other “deconstruction' type approach. A plurality of polymers of the polymer undergoing analysis are cleaved the Strands in a Subgroup advances to the Succeeding Sub or otherwise deconstructed on a step-by-step basis. At each group, while a fraction 1-p remains. When no incorporation step, the polymers are analyzed in a manner that facilitates the occurs, the Subgroup position remains unchanged, i.e., does detection and/or determination of a state of the polymers at 55 not advance to the right, and a weight of 1 is assigned to the each step that represents the proper state of the particular corresponding edge. The initial population of template polymer. Strands becomes distributed over many different Subgroups, In connection with another example embodiment of the present invention, the process of DNA sequencing-by-Syn with laggard Subgroups contributing to the overall signal at thesis and its nonidealities are modeled as a noisy Switched 60 each test thereby distorting signal quality at longer reads. linear system parameterized by an unknown DNA sequence, These distortions are addressed in connection with various where the Switching is performed by the input test sequence. example embodiments, with a model as described below; for A basecalling problem is then formulated as a parameter detection problem as follows: given a test sequence and its various definitions and/or explanations ofterms as applicable corresponding noisy output sequence, determine the system 65 to the model, reference may be made to the aforesaid publi parameters, i.e., the DNA sequence that minimizes the prob cation entitled “Modeling and Base-calling for DNA ability of decoding error. At least one of a maximum likeli Sequencing-by-synthesis.” US 9,388,462 B1 7 8 In connection with the model, the i' component of the akinto the incorporation rate, p. Depending on the outcome of weight vector at time n can be expressed as each test, Subgroups either advance to the Succeeding Sub group with weight p upon Success or with weight S upon failure. Sinces is typically small (e.g., less than about 0.05), as a good approximation and to simplify the analysis, one Subgroup advancement is considered per test. However, a more precise treatment of non-specific incorporation is a pl(St.),).(St), included in certain embodiments with polymer or sequenc refers to thei' element in the vectorS't, and I(a)=1 ifa-0 and ing-by-synthesis chemistries involving a significant non-ide Zero, otherwise. Thus, the evolution of the weight vector can 10 ality. Where appropriate, data regarding these subgroups is be described by the switched linear relation recorded for the modeling. To include the effect of non-specific incorporation, a, is w,-Aw, 1, for 1snsN, replaced in the definition of A, by where the state-transition matrix 15 d = ..(St).), where 1 - a 1 0 ... O a1n 1 - a2, ... O A = O (2n . . . O and e, otherwise. In connection with the above results, an approach to base O O 1 - a M+1,n calling is implemented as follows: Given T. p, and y, y, ..., y, the sequence matrix S is estimated. The problem is cast as a parameter detection problem, using maximum Hence, each test vector t, selects (or switches to) the appro likelihood detection (MLD) to obtain priate state-transition matrix A, for the current test. 25 The noiseless output X, is defined to be the sum of the contributions from each Subgroup of the template population S* = argnaxfy. ... , y w I S. T. p) that “tests positive' in response to test t. i.e.,

30 - p(S,2-1 ,). Each term in the inner product of St. and w1 represents a contribution from a subgroup. To include read noise, the noise is assumed as additive White Gaussian Noise (WGN) Each homopolymeric region is viewed as producing a par that is independent of the signal and system parameters. With ticular “channel’ pulse in response to a given set of tests over noise added, the output of the system is given by 35 time. Accordingly, each X, is decomposed into the Sum of the shifted pulse responses due to each individual homopoly }, x+y, for 1snsN, meric region as where (xx2 ... xN-HT-111...111). 40 where v., 1snisN are independent, identically distributed N (0.of) random variables. b. 0 O In connection with the above examples, one embodiment b2.1 b2.2 O involves addressing noise involves by effectively (nearly) 45 HT = p . . canceling system noise by consolidating information from a bN. bN2 bN.M single analysis run. For instance, noise can be approximated by averaging error values obtained from experimental data, less a model fit. and FIG. 1 shows an overall system 100 (e.g., model) with 50 bi-wi-1 (St). processing components 110, 120, 130 and 140, in connection with an example embodiment, where the inputt dictates the Each column of H represents the contribution from an indi evolution of the system with parameters S and p. The system vidual homopolymeric region to the resultant X, X2,..., Xy. 100 may be implemented using one or more types of devices FIG. 2 shows plots of a pulse response as a function of the and corresponding logic. For instance, a sequencing-by-Syn 55 number of tests respectively in 2A, 2B and 2C. Note that as thesis device (110) can be operated in connection with corre long as p<1, each homopolymeric region contributes to Suc sponding logic (120, 130, 140) implemented using a device ceeding outputs even for very large N, although possibly Such as a computer processor executing instructions, a pro negligibly, or in communication theory parlance, the channel cessor-readable storage medium configured with processor exhibits severe intersymbol-interference (ISI). executable instructions, a dedicated processor or a program 60 Two features of the model discussed above are selectively mable logic device; moreover, the sequencing-by-synthesis exploited to reduce the computational complexity of base device and corresponding logic may be implemented together calling; first, a limited portion of the puke response is used as on Such a device. significant, and second, tests of different base types are rea If non-specific incorporation is non-negligibly present, it is sonably uncorrelated for values ofp that are close to one. modeled by introducing a non-specific incorporation rate, 65 Taking these properties into consideration, the soft-input Vit erbialgorithm (VA, see, e.g., G. D. Forney, “The Viterbi Oses1 algorithm.” Proc. IEEE, Vol. 61, N. 3, pp. 268-278, March US 9,388,462 B1 9 10 1973) is modified to perform base-calling and, hence, be extracted from a portion of the sequence that is known a approximate maximum-likelihood sequence detection priori. The iterative partial-MLSD algorithm is applied to (MLSD). Pyrosequencing datasets. Four trellises are used (each one with states corresponding A multitude of dataset ranges are used in connection with to only one test type), and symbol-by-symbol detection is various example embodiments. In one implementation, the performed to obtain a rough estimate of the sequence to datasets range in length between 55 and 224 bases, with the initialize the algorithm. A standard VA is performed on each first 170 out of 208 and 205 out of 224 bases of the longest two of the four trellises, with the best surviving paths from the templates, as well as all shorter sequences, correctly decoded. other three trellises filling in the gaps. The paths are extended In another implementation, pyrosequencing systems are used 10 in conjunction with a base-calling approach discussed above one path length at a time, and the trellises are iterated through to read lengths of between about 200-300 bases or more. based on the order specified by the test matrix T. In this Various example embodiments are implemented in con manner, MLSD is approximated using an iterative, partial nection with the approaches discussed hereinbelow, generally MLSD approach. In some applications, a confidence score is describing a communication framework for polymer generated for each base using a soft-output VA (SOVA) or 15 sequencing. List VA (LVA). For example SOVA approaches that may be In this discussion, we develop an algorithm for enhanced implemented in connection with various example embodi automated base-calling in DNA sequencing-by-synthesis ments, reference may be made to J. Hagenauer, P. Hoeher, 'A based approaches predicated on a communication-theoretic Viterbi Algorithm With Soft Decision Outputs and Its Appli framework. Although we focus on Pyrosequencing specifi cations.” Proceedings of IEEE GLOBECOM, pp. 47.1.1- cally, the analytical model and resulting decoding algorithms 47.1.6., November 1989. For example LVA approaches that are suitable for a variety of sequencing-by-synthesis meth may be implemented in connection with various example ods. The basis of the analysis is a chemically defined “com embodiments, reference may be made to N. Seshadri, C-E.W. munication-channel” model. The model is similar to an Addi Sundberg, “List Viterbi Decoding Algorithms with Applica tive White Gaussian Noise (AWGN) channel with severe tions. IEEE Transactions on Communications, Vol. 42, no. 25 intersymbol interference, except that the system impulse 2/3/4, pp. 313-323, 1994. response is a function of the input signal. This communica In Some applications, the above communication analogy is tion theoretic approach to modeling Pyrosequencing makes it used to derive bounds and estimates on the probability of possible to leverage the vast body of work on detection to correct decoding (P). For example, FIG. 3 shows a plot 310 develop accurate base-calling algorithms. We use this frame of the lower bound on Pc obtained assuming symbol-by 30 work, both to perform decoding of the received signal set as symbol (SBS) detection (i.e., with no lookahead) and a plot well as to derive bounds on the probability of decoding error. 320 of a conservative estimate assuming MLSD with worst The rest of the discussion is organized as follows. In the case ISI to Pobtained using a Monte-Carlo simulation of the next section we give a general description of the Pyrose partial-MLSD algorithm (plot 330). quencing chemistry and the non-idealities that limits its read FIG. 4 is a data flow diagram and approach 400 for an 35 length. In Section III, we develop a model of the underlying approach to sequencing, according to another example chemistry. In Section IV, we discuss the similarities between embodiment of the present invention. The arrangement and this model and the AWGN channel with intersymbol interfer approach 400 may be implemented, for example, in connec ence. In Section V, we explore both symbol-by-symbol (SBS) tion with one or more of the example embodiments described and approximate maximum-likelihood (ML) sequence detec above, such as that shown in FIG. 1 and otherwise. Measure 40 tion algorithms for base-calling. In Section VI, we detail the ment/test data 410 is provided to a sequencing processor 450. performance of both SBSD and MLSD interms of probability A non-ideality model function 420 is implemented to model of error versus read length. Finally, in Section VII, we present underlying chemistry and noise in the data 410 using one or experimental data highlighting the performance of the more of a variety of approaches. For instance, individual approximate MLSD algorithm. polymer Subgroups can be tracked together with their contri 45 II. Background bution to the test data. First, we familiarize the reader with some of the biological The modeled chemistry and noise is processed at an esti terms that will be used throughout the discussion. Deoxyri mation/detection algorithm function 430, using the model bonucleic acid or DNA is a polymer or strand composed of data to choose a maximum likely, or a posteriori, sequence only four possible constituent molecules or nucleotide bases; that best corresponds to the measurement/test data 410. The 50 these bases are adenine (A), cytosine (C), guanine (G), and estimation/detection algorithm function 430 then outputs a thymine (T). Moreover, a DNA strand can be either single or most likely polymer sequence 440. double-stranded. Single-stranded DNA (ssDNA) can bind The approaches to sequencing discussed herein are appli with its unique ssDNA complement to form a double cable to a variety of polymers in a variety of manners. The stranded version of itself (dsDNA), whereby A bases bond following example describes Such applications. In one appli 55 only with T bases and C bases bond only with G bases. For cation, preprocessing steps are taken before applying the example, the sequence ACTGGC is complementary to iterative partial-MLSD algorithm to experimental Pyrose TGACCG. The goal of DNA sequencing in general is to quencing data. A baseline correction is performed due to the “read a particular ssDNA strand or template, that is, to changing chemical background signal present as well as inte reliably determine the sequence of its bases, since these bases gration of the total photoemission from each test. The inte 60 can appear in any order. As DNA sequences can be several grated signals are normalized to account for chemical varia hundred millions of bases in length, it is the goal of any de tions from run to run. The scaling factor for Such novo DNA sequencing approach to have as long and as accu normalization is automatically extracted from the histogram rate of a read length as possible to aid in DNA assembly, a of the photoemission values. Additionally, the parameters of process by which short DNA reads are pieced together to the model are estimated, e.g., p. base specific gains, and 65 form the entire sequence. The terms nucleotide and base will others. These parameters are similarly extracted from a subset be used interchangeably throughout this discussion to refer to of the data itself through an iterative fitting procedure or can a nucleotide base. In addition, the term homopolymeric refers US 9,388,462 B1 11 12 to sequences or portions thereof composed of a single base short, a synchronization problem occurs, where by continued type. We next describe the Pyrosequencing chemistry and its analogy with cassette tape reading, instead of using one tape non-idealities in order to provide the reader with better physi head, we have many tape heads with slightly varying read cal intuition regarding the model presented herein for DNA rates, and only the sum of their signals is observed. FIG. 6 sequencing-by-synthesis. illustrates the problem showing Snapshots of the representa A. Pyrosequencing tive template strands versus nucleotide dispensation. In FIG. Pyrosequencing is a DNA sequencing-by-synthesis 6(a), at the start of sequencing, all strands are synchronized method by which an ssDNA is sequenced through an iterative with one another. In FIG. 6(b), the nucleotide T is added: buildup of its complement, the progress of which is inferred however, we now have two representative subgroups, those through detection of the resultant chemical photoemission. 10 that incorporated the T base and those that failed to do so. The chemistry itself consists of a mixture of 4 main compo Sequencing proceeds normally until when in FIG. 6(d), nents: the ssDNA template (strand to be sequenced), poly another subgroup is generated, that is, those that failed to merase to accelerate the rate of nucleotide incorporation, incorporate the added Abase. Finally, FIG. 6(f) shows incor ATP-sulfurylase to help convert pyrophosphate (PPi) into poration at two sites, the “ideal Zero-lag subgroup and at a ATP and luciferase (firefly enzyme) to indirectly quantify the 15 previous incorporation position on a laggard Subgroup. The ATP in solution via light emission. Since each nucleotide problem is of course that both of these subgroups contribute incorporation event results in the release of one PPi molecule, to the output signal. In this way, this arising convoluted signal this release can be detected via the enzymatic cascade pre distorts the output and complicates the decoding of the Pyro cipitated by ATP-sulfurylase and luciferase as the eventual gram to recover its representative underlying sequence. It is emission of a proportional number of photons. Typically, a important to note that the extraneous Subgroups generated by template is sequenced by repeatedly cycling through the addi incomplete incorporation are all lagging in nature, since a tion/dispensation of the 4 possible bases, A, C, G, T, with the Strand only incorporates correctly or does not. resulting incorporation quantified at each step through the Spurious leading Subgroups, which can only be engen detection of the number of photons emitted. The one twist is dered by undesired nucleotide additions, are more of a sec that the incorporation stops only at the first base not equal to 25 ond-order non-ideality and are caused by misincorporation as the current one, i.e., each homopolymeric region is sequenced well as contaminated dispensations. Misincorporation is due at a time. FIG. 5 illustrates this process for the DNA template to the fact that polymerase has a non-zero probability (very AGTTCAG. Ideally, a light signal linearly proportional to the small, i.e., 1 in 10-10) of incorporating the wrong nucle number of bases complementary to that added at the current otide. The problem with this of course is that it counts as a position is detected. Assuming a dispensation order of T. G. 30 valid incorporation event and advances the polymerase to the C, A, we obtain a signal proportional to the run-length code, next position on the template. Contaminated dispensations, 1T-0G-1C-2A, and can infer that the first four bases of the on the other hand, occur when the nucleotide dispensed is template sequence is AGTT. In this way, there is a direct contaminated with dilute concentrations of some or all of the relationship between a template and its Pyrogram, i.e., the other three nucleotides. This contamination is generally on plot of a template's detected light signal versus nucleotide 35 the order of 0.01%, once again a typically negligible level. dispensation, from which the sequence of the template can be The third major class of non-idealities is due to product easily inferred. This overall scheme to DNA sequencing is inhibition and secondary structure interactions of the tem analogous to the serial readout approach employed by a cas plate molecules. Product inhibition refers to the phenomenon sette player, whereby the single-sided DNA molecule is simi whereby the accumulation of chemical byproducts inhibits lar to the tape and the polymerase enzyme represents the tape 40 the forward reactions from proceeding. Fortunately, a clever head. combination of chemistry and engineering can be used to B. Non-Idealities combat this non-ideality, such as, employing reaction cham The above description details the ideal outcome of the bers that confine only the template, while continually flushing Pyrosequencing chemistry, whereby the length of the DNA out accumulated products. The other non-ideality that occurs template to be sequenced can be arbitrarily long. Current 45 in practice is the effect of secondary-structure interactions in approaches to Pyrosequencing, however, can only sequence the template. This can cause a fluctuation of the incorporation up to 40 bases reliably, due to the existence of several non rate at certain stretches where hairpin structures may exist. idealities in the reading process. In practice, there are three This too can be ameliorated through the introduction of addi main sources of signal degradation, in addition to overall tional enzymes to decrease the likelihood of Such interac system noise, that limit realizable read lengths: incomplete 50 tions. incorporation, misincorporation/contaminated additions, and In addition to the above three non-idealities, there is also product inhibition/secondary structure interactions. system noise due to chemical variations arising from fluctu Incomplete incorporation is the main non-ideality respon ating nucleotide dispensations and mixing efficiency, as well sible for the degradation in read lengths. It is predicated on the as measurement noise due to the photosensing device. concept that the polymerase-assisted incorporation reaction 55 For the rest of the discussion, we will concern ourselves itself, only with incomplete incorporation and system noise, since these constitute the dominant distortion phenomena encoun DNA+dNTP olvmerase DNA+PPi, tered using Pyrosequencing. Fortunately, extension of the where dNTP refers to a single nucleotide, is a stochastic framework to include other non-idealities is relatively process with a mean forward and reverse rate. As such, given 60 straightforward. that we are forced to sequence multiple copies of the template III. Sequencing-by-Synthesis Model to amplify the signal, the number of Strands that incorporate To model incomplete incorporation, we first make some the dispensed nucleotide follow a Poisson distribution with a simplifying assumptions, which are well Substantiated by rate determined by the overall forward and reverse reaction empirical data. First, we ignore the stochastic nature of the rates. The stochastic nature of the process implies that not all 65 incorporation rate, and only consider its mean value pe(0, 1), template molecules incorporate when they ideally should and which represents the average fraction oftemplate strands that herein lies the crux of this most significant non-ideality. In incorporate the dispensed nucleotide, when incorporation US 9,388,462 B1 13 14 should ideally occur. For example, p=1 implies that no incom weights capture the evolution of the Subgroups in response to plete incorporation phenomenon exists and consequently no each base addition step. When incorporation occurs, a frac distortion occurs. To illustrate why use of a deterministic tion p of the strands in a subgroup advance to the Succeeding value for p is reasonable, consider a hypothetical Solution Subgroup, while a fraction 1-p remains. Hence a weight of p consisting of a single strand with only one base type. When is assigned to the edge from the parent node corresponding to we add the complementary base, incorporation occurs with the Subgroup to its child on the right and a weight of 1-p is probability p and does not occur with probability 1-p (see assigned to its other child. When no incorporation occurs, the FIG. 7, Markov chain depicting simplified model of single Subgroup position remains unchanged, i.e., does not advance Strand nucleotide incorporation.) Assuming the outcomes of to the right, and a weight of 1 is assigned to the corresponding the tests to be independent, at the end of N incorporation tests, 10 edge. The number of non-Zero Subgroups or nodes at a given the probability of ending up at a certain base position k is level of the graph grows at most linearly with the number of levels, since branch recombination occurs except for the lead given by the binomial probability law, i.e., ing subgroup. From FIG. 8 the reason for the eventual severe distortion in signal quality becomes apparent. The initial 15 population of template strands becomes distributed over many different Subgroups, with laggard Subgroups contribut ing to the overall signal at each nucleotide dispensation. Before describing our model, we introduce the following Now consider Z strands with independent and identically needed definitions. First we assign to each homopolymeric distributed incorporations. The number of strands Z termi region of length le 1,2,..., K}, in the template DNA strand, nating at each position n, for OsnsN, follows a multinomial a "run-length” codeword as follows: probability law AA. . . A () (1000) CC. . . Ce (01.00) GG ... Ge (0010) 25 Z! A TT ... Te (0001) QN (Zo = 30, Z = 31, ... , ZN = 3N) = , in, Although, in general, there is no bound on 1, we limit 1 to a gn=0 maximum of K to simplify the analysis. Moreover, larger =0 values of 1 occur with much lower probabilities in naturally occurring DNA sequences. We then form the 4x(M+1) DNA where 30 sequence matrix S whose columns are the codewords corre sponding to the Mhomopolymeric regions of the given tem plate DNA strand with the (M-1)th column all Zeros. For W example, the oligonucleotide, TAGCGG, is represented by, 6 = PN (n) and X3, = Z. =0 35

The mean and variance of Z, are given by

E(Z) = ZPN (n) and 40 Note that at each iteration, a certain nucleotide is dispensed. Alternatively, we can think of each nucleotide dispensation as a “test, which aims at detecting the presence of a certain AS 45 nucleotide at the current position of the template. Thus, an “A test, for example, attempts to detect the presence of the base A at the current position in the sequence. In this way we O sidestep the details regarding DNA complementariness and Z - co, 7 - 0 as vz. hence this is the terminology we employ for the remainder of 50 the discussion. Although in practice a cyclical testing order is Since in practice Z>10, the standard deviation is very small used, i.e., ACGTACGT ..., for generality, we allow for any and this suggests we are justified in assuming that a deter testing sequence. To indicate which test is performed at each ministic fraction p transition to the Succeeding state. step, we define the DNA test sequence vector t of length N Furthermore, we also assume that p can be empirically such that t=1 if test n is for base A, 2 if for base C, 3 if for base determined and is slowly varying, hence constant throughout 55 G, and 4 if for base T. For example, the test sequence for long stretches of DNA sequencing. TGCATG, is represented by It is clear from the above discussion, that incomplete incor t=4 3 2 1 4 3. poration gives rise to a population of template strands with Next, we define the incorporation vector u of length N as a Subgroups classified by a representative lag from the ideal row vector whose nth element u, is the ideal number of subgroup. It is also clear, given the discrete nature of DNA (in 60 incorporated bases due to test t (i.e., when p=1). For terms of constituent bases), that given a strand of M example, for the DNA template in FIG.8, u=100101. As we homopolymeric regions, we can have at most M-1 subgroups will show soon, u is a function of Sandt. once sequencing has concluded. This process of Subgroup Finally, for each test 1snsN, we define the column weight formation and evolution can be represented by a weighted vector w, of length M+1 to consist of the fractions of the total directed acyclic graph as illustrated in FIG. 8. The nodes, 65 template population in each possible Subgroup after the nth which represent the subgroups, are organized in levels each test, beginning with wo100...O.Thus, by definition, for corresponding to a base addition step. The edges and their each n, W20, for all i, and US 9,388,462 B1 15 16 X,w,1. For the example DNA template sequence in FIG.8. where the weight vectors are the rows of Subgroup weights, i.e., w=1-pp 00", ..., wa1-pp(1-p) p’ 0. Now that we have defined the necessary variables and A. parameters, we can relate them to the observed output at time n, i.e., after the nth test. From the above discussion, it can be seen that the ith is the number of non-Zero entries in u" with index great than component of the weight vector at time n can be expressed in n-4i, m-q(u"'.0)+1, and r(u",t) is the number of non-zero the recursive form values in u" corresponding to test t. Note also that u is a 10 function of S and t, since Wherea, pl(s,..). S. refers to the ith element in the tithrow l, San of S, and I(a)=1 if a>0 and Zero, otherwise. For example, the where again second subgroup weight in ws for the template in FIG. 8 is given by 15

W52 = a1.5 WA + (1 - a 2.5)WA2 From this explicit relation for X, it becomes apparent that each output is simply a sum of a Subset of the terms in a modified binomial expansion. As an example, for the case where t-1 2 3 4 1 2 3 ... I and for DNA sequence, ACG TACGT. . . , the above relation reduces to Combining these recursive equations, we can describe the evolution of the weight vector in the following state-space 25 form (n-1),4 (3) in - 3i - 1 - (1-p). WA,w,1, for 1snsN, (1) X ( i where the state-transition matrix 30 FIG. 9 plots equation (3) versus the number of tests n, for 1 - a1n O O p=0.9. As can be seen, the output for this example has three main phases (i) a simple multiplicative decay where there are a1n 1 - a2, ... O no contributions from lagging Subgroups, (ii) a multiplicative An = O (2n . . . O decay plus one contribution from lagging Subgroups, and (iii) 35 a steady-state where the decline in signal from the leading O O 1 - a M+1,n Subgroups is matched by the addition of signal from lagging OS. FIG. 9 illustrates an output signal versus the number of tests Now, we define the noiseless output x, as the sum of the n, for the example case oftemplate, ACGTACGT... and t=1 contributions from each Subgroup of the template population 40 that “tests positive' in response to test tim, i.e., 2 3 4 1 2 3 4. . . . Now a reasonable question to askis, given the sequence X, X, PSV-1, (2) X2,..., X can S be uniquely determined? This is indeed the wherest, is the tith row of the S matrix defined above. Each case. To show this, we rewrite equation (2) as term in the inner product of St, and W, represents a contri 45 will pS.W. 1, bution from a subgroup. la Since X, is a function of W, it can be thought of as the where m-q(u",0)+1 is the position of the ideal subgroup in output of a linear time varying system described in the State w, after tests t" and s', represents a row from the partial space form, where w, is the state of the system at time n. sequence matrix S", in which only the first m-1 columns We can find an explicit formula for the output V as follows. contain non-zero elements. Thus, X, is the sum of two terms, First, by repeated back-substitution of the state-space equa 50 the contribution from the ideal subgroup after test n and the contribution from lagging groups. By solving for un in the tion (1), we obtain above relation, it can be shown by induction that S can be uniquely determined from X1,X2,..., XM, since given 2-1 x1,..., x, for nsN. S.", t," and w, . X p. a. 55 Xn - PSwn-1f tin F Sin F By careful examination of the form of each term in the above Pin-1 in inner product, it can be shown that for a cyclical testing order 60 In other words, to solve for the next base in the sequence, we calculate the signal contribution from preceding bases, Sub tract that from the observed signal and then normalize the remainder by the size of the ideal subgroup. This implies that 65 we can arbitrarily solve (on an iterative symbol-by-symbol basis) for any read length m given we observe the noiseless output X, and given infinite precision. US 9,388,462 B1 17 18 In practice, of course, the output is corrupted by System communication channel is an important observation, since it noise, which is the second major non-ideality we discussed. allows us to leverage the vast literature on decoding methods We model this as additive White Gaussian Noise (WGN) that for AWGN channel. To further the analogy, we can think of is independent of the signal and system parameters. With each element of the incorporation vector u as an M-PAM noise added, the output of the system is given by symbol (where here M-K--1) transmitted over an AWGN communication channel with memory represented by our y = x+y, for 1snsN, above model of incomplete incorporation. Recall that the where V, 1snsN are independent, identically distributed N noiseless output is given by (0.of) random variables, pilw 10 -1, PSW a-1: V, PSW-1 pill, W-1PSW,-1 where the first term represents the contribution from the cur and rent transmitted symbol u, and the second represents the contribution from other Subgroups. In other words, a non w, A,w, 1. negligible interSymbol interference (ISI) phenomenon exists; 15 however, this ISI is more insidious than in the conventional IV. Communication Channel Analogy communication channel setting as it is not constant but From the results of the previous section, it is clear that depends on the underlying sequence. given a DNA sequence and a test vector, we can solve for the FIG. 10 shows a block diagram of the equivalent commu expected value of the output signal E(y) X, at each time n. nication channel represented by this biological system. The However, it is the reverse procedure that is of much more DNA sequence itself in conjunction with the test vector non practical interest to : Giveny, y, ...,y, how well linearly affects the memory in the channel (i.e., the taps of its can we estimate the underlying sequence matrix S2 In prac discrete-time impulse response). By further borrowing from tice, this type of decoding is generally referred to as base communication theory, it is clear that a causal or a symbol calling. Thus, we wish to determine the underlying sequence by-symbol detector is not optimal due to the inherent corre matrix S. given y1, y2. . . . . yx, t, and p. We can employ 25 lation between received signals. The degree of suboptimality maximum-likelihood detection (MLD) to derive the solution of these methods, however, depends on the severity of the ISI. as follows: Channel Impulse Response To gain some insight into the severity of the channel ISI, we S" = argmaxf(y1,... , y NIS, t) decompose each X, into the sum of the scaled impulse S 30 responses due to each individual homopolymeric region, = argmaxf(y1,... , y Nut, t) since X, is a linear combination of the incorporation events S over the existing subgroups. Mathematically, we decompose = argmaxf(x1 + V1, ... , xN + v Nut, t) the X, as S

W 35 (xx) ... xN-H,N-111... 111'. 33Xg f (x + v u,is t) where

bil () O 40 b2.l b2.2 O W HN = p = argmax X In 1 -(yn-pst wi-1/2c S 2it O. bN. bN2 bN.M

3. gaX 45 and bi, wi-list, (y – ps, w–1). Each column of HN represents the contribution from each homopolymeric region to the resultant X, X ..., X versus 50 test number. Explicitly, we can write each columnjin, as a Thus, we only need to minimize the 1 distance between the shifted and scaled impulse response in D-transform notation observed sequence and the chosen sequence over the search by defining h(D)-hy 'D'D' . . . D''. The impulse space of all possible input sequences. However, the compu response is shifted since the impulse occurs at the time of first tational complexity of this approach is daunting, since given incorporation for the corresponding homopolymeric region the above constraint of at most K bases in a row, we must 55 (i.e., not at time 0), while it is scaled since the length of that search through 4/3(3K)' possible sequences. For M-200 and homopolymeric region is not necessarily one. In this form, we K-5, this is approximately 2.2x10 sequences. Clearly, fur see that for p<1, each homopolymeric region (or transmitted ther insight into the model is necessary in order to reduce the symbol in the communication analogy) contributes to Suc computational complexity. ceeding output even for very large N, although possibly neg The above derivation is analogous to the derivation of maxi 60 ligibly. mum-likelihood detection (MLD) for the conventional Addi From FIG. 11, we can visualize this phenomenon by exam tive White Gaussian Noise (AWGN) channel, where the ML ining the above relation for a specific case. The shifted sequence is the closest sequence in the la sense from the impulse responses (columns of HN) are plotted for received sequence. Indeed, the similarity between this DNA homopolymeric regions 1, 51 and 101, i.e., column 1, 51 and sequencing-by-synthesis model and the AWGN communica 65 101 of HN for that particular sequence. At the beginning of tion channel is an important observation, since it allows us to sequencing (see FIG. 11(a)), the channel response is nearly leverage the vast literature on decoding methods for AWGN memory less for p close to one (i.e., approximately a Kro US 9,388,462 B1 19 20 necker delta), while many testing cycles later (see FIGS. y . . . , y) lies within a sphere of radius K of X=(X, 11(b) and (c)), the response is distributed over a wide range of X, ..., xx) with probability 1-e, when KocO (ca constant). tests. Clearly the overall constraint length of the convolution we can discard all input sequences that correspond to points generated by the channel grows proportionally to read length. outside this noise sphere, since they are highly unlikely to be Furthermore, it is apparent that even the effective constraint 5 the correct sequence (see FIG. 13). In short, we are searching length (non-negligible portion of the impulse response) for the closest lattice point in the skewed output space, x'. grows considerably, a daunting characteristic in terms of Thus, we constrain the search to input sequences, u' that computation, since MLSD's complexity grows exponentially satisfy with constraint length. Looking at computational complexity from the perspective of the incorporation vector, for a mod 10 (4) erate constraint length of 20 and K=5, there are nearly where the relation is constrained to test Subsequences of a 6's3.6x10" possible sequences. In the following section we single base-type. explore several of the approximate MLSD techniques used in Computing admissibility of points in the sphere may prove the communication field. as computationally intensive as simply computing their cor V. Approximate MLSD Algorithm for Base-Calling 15 responding metrics in a particular detection algorithm. Thus, Although we cannot perform full-blown MLSD due to we simplify the above expression in two ways: (1) we only computational complexity requirements, we can nevertheless consider a portion of the output sequence of length L and (2) attempt to approximate it, such that we recover Some addi we linearize and approximate the channel response around tional information that increases the immunity of the output the region of the sequence considered. The latter simplifica to noise. The three classes of appropriately modified algo tion allows us to fully employ the general Pohst enumeration rithms we consider are symbol-by-symbol (SBS) detection procedure, whereby the output space is assumed to be a linear (following causal equalization), the Stack algorithm, and transformation of the input space (see the Appendix section MLSD via the Viterbialgorithm. Before detailing these three for an outline of this procedure). It should be noted however algorithms, we first discuss potential computational saving that the savings afforded by sphere decoding in general is gains through exploitation of the problem model. We then 25 proportional to the SNR of the received signal. compare the performance of these algorithms in the last part B. Approximate MLSD Approaches of this section. We now discuss three approximate ML decoding tech A. Exploiting Problem Structure niques that we appropriately modify for use with our channel In an attempt to lower the complexity of full-blown MLSD, model. one reasonable question to ask would be, why not process the 30 Symbol-by-Symbol Detection: A, C, G, and T test subsequences independently? In other A Zero-order approximation to MLSD is to limit the search words, perform whatever decoding/deconvolution technique set to sequences of length one, or in other words, perform we would like, but treat the responses from the 4 base-types as symbol-by-symbol detection (SBSD) following causal equal completely uncorrelated. This is not feasible in general since ization. The procedure for SBSD follows the approach for given a sequence ACTATA and p=0.8, we would have A 35 solving for the next transmitted symbol in the absence of signals 0.8 0.5696 0.6218 versus 0.8 0.672, 0.6669 for noise outlined above, except that the expression is rounded to ATATA. Clearly, the presence of the C base makes a big the nearest integer. Thus, difference. For p=0.995, on the other hand, the difference between the two, 0.995 0.9851 0.99 versus 0.995 0.99 0.99, is much less significant. FIG. 12 plots the absolute 40 r An - PSW,-1 value of the difference between two sequences at the output tin Stin - - given one sequence is randomly perturbed along one base n-lin type at the input for p=0.99. The interpeak regions in the plot correspond to the test differences of the 3 other base-types where is the nearest integer function and m-q(u"',0)+1 and are only slightly affected by the perturbation. Hence, this 45 as before. This can be thought of as causal Zero-forcing equal plot illustrates how most of the separation distance between ization (ZFE). The main advantage of this approach is its output sequences for a given base is aligned along its base simplicity, since only one expression must be calculated per type rather than evenly distributed throughout the sequence test for base-calling. Furthermore, SBSD can be implemented for p close to one. As such, under the condition that p is very online without delay and thus can be used to generate imme close to one (which is typically the case in practice) and that 50 diate estimates of the underlying sequence. Of course, as with we have a reasonable approximation of the intervening ZFE in practice, unreliable estimates can result when SNR is sequence, we can separately decode the four different test especially low largely due to the high susceptibility to error Subsequences. This implies that under these conditions, we propagation in this regime. can extend our MLSD constraint length, L, by a factor of 4 MLSD via Greedy Algorithms: with only a four times increase in complexity rather than an 55 There are a number of greedy algorithms which can be used exponential increase of (K+1). to approximate MLSD and that are often employed in the Not only can we exploit the decorrelation between base communication field for channels with longtime dispersions. types when p is close to one, but we can also exploit the fact A representative example is the stack algorithm. In this algo that the observed signals, y, y. . . . . y combined with an rithm, a finite number of minimum distance paths along with estimate of the noise variance of, reveal information about 60 their respective metrics are stored in a stack from best to the input signal power. Indeed, we can prune the input search worst, where each path in this case consists of the respective space drastically since we know that the input signal power, values of u, chosen in response to each test. The path with multiplied by the effective impulse response for the stretch of minimum path metric (at the top of the stack) is deleted and observation variables which we are examining, should give a extended at each iteration (with K+1 possibilities). These reasonable estimate of the output signal power. In other 65 K+1 paths and their associated metrics are then added to the words, we can perform sphere decoding or, in this case, Pohst stack. In this way, the stack is continually ordered from least enumeration. Accordingly, given that the point y=(y, cost to highest, and the path at the top is extended at each US 9,388,462 B1 21 22 iteration. Since only a finite number of paths are kept, higher VI. Bounds on the Probability of Correct Decoding cost paths are discarded. The number of paths stored can be The obvious question to ask at this point is, how many increased to indirectly extend the effective sequence length bases can we reliably call? We can answer this question by that the algorithm considers. However, the storage require further exploiting the communication-theoretic framework ments roughly grow exponentially with the Stack algorithms with which we have couched the entire sequencing-by-Syn effective sequence length of consideration, although at a thesis model. It was mentioned that we can think of the smaller rate than with optimal MLSD. Additionally, the stack channel input as an uncoded M-PAM constellation with the algorithm can be used quite effectively with Pohst enumera channel itself introducing severe ISI due to the incomplete tion as outlined in. incorporation phenomenon. With this idea in mind, we can Iterative Partial-MLSD: 10 derive relations for the probability of a single symbol error Finally, we consider the soft-input Viterbialgorithm (VA) and then use that result to derive bounds on the probability of appropriately modified to exploit the problem structure. We correct decoding, P., for SBSD and MLSD. choose a symbol length 4L, which is long enough to capture First we calculate the probability of error for one symbol most of the time dispersion in the channel and short enough to 15 given the rest of the sequence is correct. Mathematically, we meet complexity constraints. We then modify the general VA express this as by using 4 trellises each with symbol length L, thus each having (K+1) possible states. We first perform SBSD as P(tizit- : I|til n-1-,=u 3: n-1 , litt,: N- : W W ), outlined above to obtain a rough estimate of the sequence. We where we write the decoded symbols as il, and the correct then proceed to perform the standard VA on each of the four transmitted symbol as u,. Thinking about what this means trellises, with the best surviving paths from the other three further, given we are certain about all but one homopolymeric trellises (or the rough estimate as available) filling in the gaps. region in the entire sequence, the difference between symbol We extend the paths one path length at a time and iterate un, being equal to a 0, 1,..., or K, is essentially a difference through the trellises based on the order specified by the test represented by a multiple of only its impulse response. Thus, vector, t. Trellis States (of length L) corresponding to 4L 25 at the channel output in the XY space, we still have an M-PAM length sequences are pruned appropriately (in accordance constellation with minimum distance captured by the impulse with our noise sphere) so that highly improbable states are not response of the symbol in question. As such, we can directly traversed saving computation time. In this manner, we use the probability of error result for M-PAM channels to approximate MLSD using an iterative, partial-MLSD ascertain that approach. 30 In many cases, the marginal distribution (or an approxima tion thereof) for each symbol (i.e., each base) would be of benefit, especially for later DNA assembly algorithms that P(a, Eu, i = u, bill billy, y) is 1) o dign ) use consensus estimates obtained from multiple overlapping 35 reads. For instance, this is performed for Sanger sequencing, where d(n) is the distance to the closest sequence with whereby a confidence score for each base is output. Accord respect to the nth symbol and is a function of its impulse ingly, a soft-input soft-output (SISO) variant of the above response, O is the noise level, N is the total length of the approach can be used, such as soft-output VA (SOVA) or List observed output sequence, and M-K--1 is the number of VA (LVA), if such confidence measures are so desired. Fur 40 symbols. The approximation results from the non-linearity of thermore, such soft-output can be used to further refine the an incorporation event in the model, that is, the minimum estimated sequence, if additional a priori information about distance between u-0 and u-1 is slightly different (larger) the underlying sequence is known, such as invalid codewords, than that of the other symbol separation lengths. The shorter in an iterative approach akin to iterative Viterbi for the decod distance can be used to produce a conservative estimate of the ing of concatenated codes. 45 probability of error. C. Performance Comparison A. Lower Bound on P. for SBSD In order to ascertain the FIG. 12 shows a comparison of the SBSD approach versus efficacy of SBSD in terms of some measure of the attainable the stack algorithm and iterative partial-MLSD. These plots read-length, we must first calculate the probability of symbol show typical results for a particular simulated sequence with error. Given the past sequence is correct and that our model is p=0.95 and O=0.04. Clearly, both approximate MLSD algo 50 accurate, the remaining uncertainty lies only in the determi rithms outperform SBSD quite a bit with the iterative partial nation of the current symbol. That is, was it 0, 1,..., or K? We MLSD technique performing best. Although p is much closer can write this probability as P(a,zu, ü, "'u'',y\). to one in practice, a smaller p was chosen for illustrative This boils down to M-PAM with minimum distance between convenience, since signal degradation occurs much more received symbols in this case given by p times the weight of quickly. FIG. 15 shows the results of a 10,000 test input vector 55 the leading group, W, where m=q(u," '.0)+1 is again the Monte Carlo simulation comparing SBSD to MLSD perfor position of the ideal Subgroup in W-1. Accordingly, we can mance in terms of the probability of correct decoding (P) derive a lower bound on the probability of correct decoding Versus the number of tests (a rough measure of read length). (i.e., the probability we have not made a single error) versus To illustrate the efficacy of all of these algorithms, FIG. 16 the number of tests since shows the output around test 50 for the sample sequence of 60 FIG. 14. Clearly, the existing adaptive bandpass filtering or manual base-calling techniques would be futile in this highly distorted signal regime. Unfortunately, the computational complexity of full MLSD is just too prohibitive even for a comparative study under this simulated setting. In the next 65 section, we bound the probability of correct decoding for SBSD and MLSD. US 9,388,462 B1 23 24 -continued the minimum distance between the true sequence and the next closest sequence (with errors in only the last 1 symbols). =(1-3 lots-) Explicitly, 2(M - 1) --- - (-title...) dinin-IS1(n) is agil X hinto (Den D'" & The last inequality follows since we can lower bound the minimum distance for SBS, pw, by p". FIG. 17 plots P. where m(i) q(u'.0)+1. Now dist(n) can be used to pro versus the number of tests for two cases, p=0.95 and p=0.995. vide a better approximation of MLSD performance with the The dotted curve represents the d, approximation and is a adversities of ISI considered. FIG. 200a) plots this approxi mation as well as the lower bound for SBSD against the good fit for p close to one. Notice the effect p has on the Monte-Carlo performance of the partial-MLSD algorithm for accuracy of base-calling: a change of p from 0.95 to 0.995 15 represents a seven-fold improvement in the length of high p=0.95. As can be seen, the MLSD error approximation is accuracy decoding. quite close to the empirical performance of the partial-MLSD B. Bounds on Pc for MLSD algorithm. FIG.20(b) compares MLSD to SBSD for p=0.995. We can further this analysis to derive probability of correct These plots also suggest that a significant portion of the decoding bounds for MLSD. We know from equation (5) that channel's ISI cannot be combated. the probability of error is dependent on d. As alluded to VII. Experimental Results above, calculating d, for MLSD involves finding the In order to perform the iterative partial-MLSD algorithm impulse response for a given symbol, since that characterizes on real biological Pyrosequencing data, there are a few prac the 1 distance between two output sequences. Alternatively tical issues which must be considered. First of all, a baseline di can be thought of as the 1 distance between respective 25 correction must be performed due to the changing chemical X's, when input symbolu, is perturbed by one unit. FIG. 18 background signal present. This causes a slow exponential plots d, as a function of the number of tests for various decay versus time of the baseline. The correction is performed values of p. We can now derive an upper-bound on the prob by subtraction of the approximate baseline obtained via inter ability of correct decoding for MLSD. This bound is optimis polation of the Smoothed interpeak segments. Next, integra tic as it uses the “one-shot” probability of error in its deriva 30 tion of the entire signal is required since the peak values are in tion, which essentially ignores the reduction in d, due to general not directly proportional to the total area under the ISI. Accordingly, curve, i.e., the total photoemission for that incorporation. Finally, we must complete our model with the necessary experimental parameters, e.g., p. nucleotide specific gains, 35 etc., fitted from empirical training data with known underly ing sequences. FIG. 21 shows experimental sequencing data r W N N before and after baseline correction. = Pa = u i = u N, yN) We applied the iterative partial-MLSD approach to the experimental Pyrosequencing data shown in FIG. 21(b). The s Pa,(, -F it;. . . it.il it;iN - iii., y W ) 40 sequenced template consisted of a 208 base length poly merase-chain reaction (PCR) product. The incorporation rate was determined to be close to 0.995 for this particular system setup and Os0.05. The algorithm correctly decoded the first i 280 tests without error (see FIG. 22). This corresponds to 45 error-free sequencing of the first 170 bases. This is quite an sI( - 3. 1) o digo ). improvement over the 30-40 bases that are typically reliably sequenced with commercial Pyrosequencing machines. Unfortunately, due to the prototype systems limitations, FIG. 19 plots the above upper bound for two different incor product inhibition rather than incomplete incorporation pre poration rates and O=0.04. 50 vented any longer reads, as the signal quality degrades The bound presented above is quite optimistic and it is quickly in that regime (see FIG. 22). This can be solved clear from the Monte-Carlo simulation presented in FIG. 15 though using systems which implement slightly better that the iterative partial-MLSD method, although outper reagent handling techniques. With Such a system in conjunc forming SBSD, falls quite short from it. One of two conclu tion with improved base-calling algorithms such as detailed sions can be drawn: either the algorithm itself can be 55 herein, read lengths well over 200-300 bases should be attain improved dramatically or the adverse effects of ISI are too able with Pyrosequencing. Such read lengths would make severe to combat. To investigate this more closely, we must DNA sequencing-by-synthesis methods a potentially much analyze the effect of ISI on d. To do this, we need to find the higher throughput and much more cost effective alternative to time-varying channel impulse response, h(D), for a specific the existing techniques. sequence and testing order. We choose the worst-case 60 sequence and testing order, that is, the example case of equa APPENDIX SECTION tion (3) with incorporation occurring at each test, since this produces the widest dispersion for a given number of tests. Sphere Decoding Approximation We then calculate the worst-case ISI on this channel by per To perform Pohst Enumeration, we can approximate the forming an exhaustive search through all error sequences of 65 inequity (4) by length 1, (eo, e, . . . . e. ). The worst-case ISI is the error sequence with the minimum metric. This metric represents N US 9,388,462 B1 25 26 Where B-Toeplitz (han)D, m=q(u'',0)+1 is the posi 4) with rate p or it is unaffected by the addition. More pre tion of the ideal subgroup in W. after N-L tests, D diag(p. cisely, p. p. p". - - - ). wv-/ww(1)... ww(i)... ww(m), where w(i) p(U(i-1N mod 4)) way (i-1)+(1-p(Uy a 0 O (i, Nimod 4))w (i). Toeplitz(a1, a2, ... , all) = | d2. (il. O It is possible to derive a non-recursive relation between X and the input sequence: (in (in-l ... (il 10

L is the considered constraint length and the impulse response vector h, N is obtained from a first-pass symbol-by-sym x, y (" " via - py. bol (SBS) estimate. A more refined estimate can be obtained by setting B to the appropriate subset of matrix HN (also 15 UN (n - q(Sw, i), Nimoda). derived from the SBS estimate). The primes on Y and K' result from subtracting off the contributions of the first m-1 N(a,b) is a function which returns the number of non-zero Subgroups, that is, Subtracting fromy the row-wise Sum of the incorporations in the sequence vector, a, equivalent to base first m-1 columns of HN and then taking the innerproduct of B(b), q(Si) returns the number of non-zero entries in SN that sum and subtracting that from ki. This gives us a trian from component N up to N-4i, and m is the total number of gular system of linear inequalities which can be back-Substi columns in UN. tuted to solve for the admissible value ofu'. Armed with the above model, we can now attempt to find the In various embodiments, underlying DNA sequences are sequence vector, S which maximizes the aposteriori prob characterized using an approach as follows to model the 25 ability, detection of pyrophosphate and corresponding non-idealities. Sequencing Base-Calling Problem: Given: Y,....Y., where Y,X,+N, where the X, are the sum P(UNY, ... , YN). ofall PPireleased at a given nucleotide additioni correspond 30 ing to base B(i mod 4), N, is iid Gaussian distributed noise (N(0, o) and B(1)=A, B(2)='C', B(3)=G, B(4)=T. Given: Process modeled by incomplete incorporation phe SR = argmaxP(SNY, ... , YN) nomenon with known rate p, where 0

Let us first define the matrix TX, which is a one-to-one trans W formation of the sequence vector, Sy: argmaxX 2OilnP(Sw(i))-(Y, -pus (imod4) w;-) SN 50 W SN (j) if i = i mod 4 argmaxX (Y, -pus (imod4) w; ), O otherwise SN when P(SN (i)) = P(SN (j)) for i + i.

We then crop all Zero columns from T. producing U with 55 m{-N columns. Note that U is an implicit function of S. In the last line Furthermore, since X is the sum of all pyrophosphate mol above, each symbol, 0, . . . , K, of the sequence vector Sy is ecules released in a single addition, we can relate X to was assumed as equally probable, simplifying the solution to the follows: minimum distance sequence between the vector Y . . . 60 Y. . . . Y. and X (a deterministic function of S). In other Xv, p Uv(N mod 4) ww. words, the sequence vector S that minimizes the resultant distance between Y and X is chosen. In general, for a We now must relate the current state to previous states in sequence vector of length N, there are (K+1)Y possible solu order to iteratively propagate the model. The hidden states, tions, making full search quickly intractable with increasing W, at time i, are a function of W, in that there are only two 65 N. Fortunately, this general class of problem arises frequently possibilities: either each component (DNA group with a par in communication theory and is termed maximum-likelihood ticularlag) of W, incorporates the added nucleotide B(imod sequence detection (MLSD). Dynamic programming meth US 9,388,462 B1 27 28 ods (e.g., Viterbialgorithm), produce the maximum-likeli contributions from non-ideal DNA specimens that corre hood solution, however, because of the very deep memory in spond to a current step in the DNA sequencing test. this case, it is not much better than full search in terms of 6. The method of claim 1, further including identifying storage requirements or computational complexity. Assum noise from a source other than the non-ideal DNA specimens ing K is limited to 5 and 300 nucleotide additions, this would 5 and using the identified noise in the modeling of the potential generate a search space of 6's2.8x10°. This can be distortion for the DNA sequencing test. reduced significantly by assuming that the received sequence 7. The method of claim 1, wherein using the computer has a much smaller effective memory; however, even in this processor by executing instructions further causes the com case the search space is prohibitive, since the effective puter processor to determine the potential distortion by Sum memory quickly reaches over about 60 for 300 nucleotide 10 additions. ming contributions from non-ideal DNA specimens accord While the present invention is described with reference to ing to weighting corresponding to a fraction of DNA several particular example embodiments, those skilled in the specimens that successfully advance in each step of the DNA art will recognize that many changes may be made thereto sequencing test. without departing from the spirit and scope of the present 15 8. The method of claim 1, wherein using the computer invention. Such changes may include, for example, sequenc processor by executing instructions further causes the com ing polymers other than those discussed, replacing synthesis puter processor to determine the potential distortion by with another approach Such as cleavage or replacing cleavage assigning a weight corresponding to a probability of DNA with another approach Such as synthesis or other deconstruc specimens existing in each of the Subgroups. tion-type approach, or using approaches in “Modeling and 9. The method of claim 1, wherein using the computer Base-Calling for DNA Sequencing-by-Synthesis'. These and processor by executing instructions further causes the com other approaches as described in the claims below character puter processor to determine the potential distortion by cal ize aspects of the present invention. culating contributions from non-ideal DNA specimens inde What is claimed is: pendently for each base component of the multitude of DNA 1. A method for DNA sequencing using a multitude of 25 specimens. DNA specimens each having the same DNA sequence, the 10. The method of claim 1, wherein the data captured from method comprising: the DNA sequencing test includes photoemission data cap performing at least one step in a DNA sequencing test, tured during pyrosequencing. where each of the at least one step in the DNA sequenc 11. The method of claim 10, wherein modeling the poten ing test is performed on the multitude of DNA speci 30 mens and produces ideal DNA specimens and non-ideal tial distortion as intersymbol interference resulting from DNA specimens, wherein, for each step in the DNA channel pulses further includes normalizing the photoemis sequencing test, the ideal DNA specimens have a corre sion data captured during pyrosequencing based upon an sponding ideal length and the non-ideal DNA specimens integration of photoemission levels from the DNA sequenc have lengths that are different from the corresponding 35 ing test. ideal length, 12. The method of claim 1, wherein using a computer for each of the at least one step in the DNA sequencing test, processor to identify the new base includes applying a partial using a computer processor by executing instructions iterative maximum-likelihood sequence detection (MLSD) to that cause the computer processor to: the data captured from the DNA sequencing test. receive data captured from a corresponding step of the 40 13. The method of claim 1, wherein using a computer DNA sequencing test, processor to identify the new base includes applying a Viterbi determine potential distortion in the received data by algorithm to the data captured from the DNA sequencing test. modeling the potential distortion as interSymbol inter 14. A system for DNA sequencing via a multitude of DNA ference resulting from channel pulses that correspond specimens each having the same DNA sequence, the system to subgroups of DNA specimens that represent poten 45 comprising: tial non-ideal DNA specimens, and a polymer sequencing device that is configured to perform identify, for a corresponding step, a new base by com at least one step in a DNA sequencing test, where each of paring the data received from the DNA sequencing the at least one step in the DNA sequencing test is test to the determined potential distortion. performed on the multitude of DNA specimens and pro 2. The method of claim 1, wherein each step in the DNA 50 duces ideal DNA specimens and non-ideal DNA speci sequencing test is a step in synthesis of the multitude of DNA mens, wherein, for each step in the DNA sequencing specimens. test, the ideal DNA specimens have a corresponding 3. The method of claim 1, wherein each step in the DNA ideal length and the non-ideal DNA specimens have sequencing test is a step in cleavage of the multitude of DNA lengths that are different from the corresponding ideal specimens. 55 length; and 4. The method of claim 1, wherein using the computer logic that includes a processor and processor-readable stor processor by executing instructions further causes the com age medium configured with processor executable puter processor to determine the potential distortion by cat instructions that when executed and for each of the at egorizing the multitude of DNA specimens into Subgroups least one step in the DNA sequencing test, cause the corresponding to incomplete incorporation for at least some 60 processor to: of the multitude of DNA specimens during the DNA sequenc receive data captured from a corresponding step of the ing test. DNA sequencing test, 5. The method of claim 1, wherein using the computer determine potential distortion in the received data by processor by executing instructions further causes the com modeling the potential distortion as interSymbol inter puter processor to determine the potential distortion by com 65 ference resulting from channel pulses that correspond bining contributions from non-ideal DNA specimens from a to subgroups of DNA specimens that represent poten previous step in the DNA sequencing test with one or more tial non-ideal DNA specimens, and US 9,388,462 B1 29 30 identify, for a corresponding step, a new base by com paring the data received from the DNA sequencing test to the determined potential distortion. 15. The system of claim 14, wherein the polymer sequenc ing device performs one of pyrosequencing of DNA and cleaving of DNA. 16. The system of claim 14, wherein the polymer sequenc ing device stores data in a database for Subsequent use by the logic, the data being obtained from DNA sequencing. 17. The system of claim 14, wherein the polymer sequenc 10 ing device and the logic are implemented on a common device.