US009388462B1

(12) United States Patent
Eltoukhy

(10) Patent No.: US 9,388,462 B1
(45) Date of Patent: Jul. 12, 2016

(54) DNA SEQUENCING AND APPROACHES THEREFOR

(75) Inventor: Helmy Eltoukhy, Woodside, CA (US)

(73) Assignee: The Board of Trustees of the Leland Stanford Junior University, Stanford

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1193 days.

(21) Appl. No.: 11/748,266

(22) Filed: May 14, 2007

Related U.S. Application Data

(60) Provisional application No. 60/799,769, filed on May 12, 2006.

(51) Int. Cl.
C12Q 1/68 (2006.01)
C12Q 1/66 (2006.01)
G01N 21/76 (2006.01)

(52) U.S. Cl.
CPC: C12Q 1/6869 (2013.01); C12Q 1/66 (2013.01); G01N 21/76 (2013.01); Y10T 436/143333 (2015.01); Y10T 436/166666 (2015.01)

(58) Field of Classification Search
USPC: 435/94
See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

5,527,675 A * 6/1996 Coull et al. 435/6
6,303,303 B1 * 10/2001 Green et al. 435/6
6,681,186 B1 * 1/2004 Denisov et al. 702/20
2002/0147548 A1 * 10/2002 Walther
2003/0120471 A1 * 6/2003 Izmailov

OTHER PUBLICATIONS

Svantesson et al., "A mathematical model of the Pyrosequencing reaction system," Biophysical Chemistry 2004, 110, 129-145.*
Ewing et al., "Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment," Genome Research 1998, 8, 175-185.*
Eltoukhy et al., "Modeling and Base-Calling for DNA Sequencing-by-Synthesis," 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. II, pp. 1032-1035, May 14-19, 2006.
Eriksson, S., "Cytochrome P450 Genotyping by Multiplexed Real-Time DNA Sequencing with Pyrosequencing™ Technology," ASSAY and Drug Development Technologies, 2002, 1, 49-59.*
Brent Ewing and Phil Green, "Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities," Genome Research 8: 186-194 (1998).

* cited by examiner

Primary Examiner: Yelena G Gakh
Assistant Examiner: Michelle Adams
(74) Attorney, Agent, or Firm: Crawford Maunu PLLC

(57) ABSTRACT

Polymer sequencing is facilitated. According to an example embodiment of the present invention, a polymer sequencing approach is implemented using a multitude of polymer specimens for a particular polymer type. For each step in a polymer sequencing test, non-idealities are categorized using data obtained from the polymer sequencing test, and in response to the categorized non-idealities, a polymer sequence is identified for a corresponding step of the polymer sequencing test. With this approach, read lengths achieved with a particular polymer sequencing method can be improved.

17 Claims, 23 Drawing Sheets

[Representative drawing (system 400): Polymer Chemistry, Non-Ideality Model, Estimation/Detection Algorithm, most likely polymer sequence]

[Drawing sheets 1-9 of 23: FIGS. 2A-2C (x-axis: # Tests); FIG. 3 (MLSD Estimate, SBS Bound); FIG. 5 (x-axis: Nucleotide added); FIG. 6; FIG. 7; FIG. 8 (Test DAG for Template TAG; Subgroup Weights)]
[Drawing sheets 10-23 of 23: FIG. 10 (Non-linear Filter, Linear Filter); FIG. 12 (x-axis: # Tests); sheet 14 (p=0.95, c=0.04); FIG. 15 (Partial-MLSD; x-axis: # Tests); FIG. 17(b) (x-axis: # Tests); FIG. 18; FIG. 19 (x-axis: Tests); sheet 20 (MLSD Bound, SBS Bound, WMLSD Bound); FIG. 21(b) (x-axis: Tests); FIG. 22(b) (Experimental, Predicted); FIG. 23 (x-axis: Tests)]

DNA SEQUENCING AND APPROACHES THEREFOR

RELATED PATENT DOCUMENTS

This patent document claims the benefit, under 35 U.S.C. §119(e), of U.S. Provisional Patent Application No. 60/799,769, filed May 12, 2006 and entitled: "Polymer Sequencing and Approaches Therefor."

FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with Government support under contract HG003571 awarded by the National Institutes of Health. The U.S. Government has certain rights in this invention.

FIELD OF THE INVENTION

The present invention relates generally to polymer sequencing, and more particularly to arrangements and approaches for polymer sequencing.

BACKGROUND

Polymer sequencing, such as protein sequencing or DNA sequencing, is susceptible to errors, misalignment and other issues related to inconsistent synthesis and/or cleavage as associated with polymers undergoing analysis. These issues can result in inaccurate analysis and other related issues.

One approach to synthesis-based sequencing is DNA sequencing-by-synthesis, which generally involves biochemical processes by which a target DNA strand is iteratively built up to determine its sequence. An example of such methods is Pyrosequencing, which offers the promise of high throughput and low cost sequencing via efficient mass parallelization and its relative simplicity. However, achieving reliable DNA read lengths using Pyrosequencing requires high reagent costs to maximize signal fidelity and thereby has often been prohibitively expensive for applications such as whole genome shot-gun assembly.

Recently, there has been great interest in developing cheaper, higher throughput de novo DNA sequencing technologies and protocols to jumpstart the next phase of genetic inquiry beyond the human genome project. However, despite the past and ongoing enhancements to existing CAE technology, under this regime, the lower limit in terms of cost per mammalian genome is intolerably excessive for many applications.

These and other characteristics have been challenging to polymer sequencing and related applications, for both sequencing-by-synthesis and cleavage approaches, for a variety of polymers such as proteins and DNA.

SUMMARY

The present invention is directed to overcoming the above-mentioned challenges and others related to the types of applications discussed above and in other applications. These and other aspects of the present invention are exemplified in a number of illustrated implementations and applications, some of which are shown in the figures and characterized in the claims section that follows.

Various aspects of the present invention are applicable to the sequencing of polymers in a manner that addresses differences among polymers analyzed for a particular sequencing approach. Such differences may result, for example, where a particular step in a synthesis or cleaving approach is not carried out with all polymers in a group of polymers undergoing analysis, or with different tests performed at different times for a particular type of polymer. In accordance with various example embodiments, results from sequencing tests are adjusted in consideration of these differences.

In one embodiment of the present invention, a method for polymer sequencing via a multitude of polymer specimens for a particular polymer type is implemented. For each step in a polymer sequencing test, non-idealities are categorized using data obtained from the polymer sequencing test, and in response to the categorized non-idealities, a polymer sequence is identified for a corresponding step of the polymer sequencing test.

In another embodiment, a method for DNA sequencing is implemented. Results from a DNA sequencing test are modeled as a noisy switched linear system parameterized by potential DNA sequences and respective DNA bases. Using the modeling, a DNA sequence is selected from the potential DNA sequences with respect to a probability of matching the