
US008407013B2

(12) United States Patent
Rogan

(10) Patent No.: US 8,407,013 B2
(45) Date of Patent: Mar. 26, 2013

(54) AB INITIO GENERATION OF SINGLE COPY GENOMIC PROBES

(76) Inventor: Peter K. Rogan, London (CA)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 13/469,531

(22) Filed: May 11, 2012

(65) Prior Publication Data
US 2012/0253689 A1 Oct. 4, 2012

Related U.S. Application Data

(63) Continuation of application No. 12/794,933, filed on Jun. 7, 2010, now Pat. No. 8,209,129, which is a continuation of application No. 11/324,102, filed on Dec. 30, 2005, now Pat. No. 7,734,424.

(60) Provisional application No. 60/687,945, filed on Jun. 7, 2005.

(51) Int. Cl.
G06F 9/00 (2011.01)
C12N 15/11 (2006.01)
C12Q 1/68 (2006.01)

(52) U.S. Cl. 702/20; 536/24.3; 435/6.11

(58) Field of Classification Search: None
See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

6,150,160 A 11/2000 Kazazian, Jr.
6,828,097 B1 12/2004 Knoll et al.
7,014,997 B2 3/2006 Knoll et al.
2003/0022204 A1 1/2003 Lansdorp
2003/0044822 A1 3/2003 Fletcher et al.
2003/0108943 A1 6/2003 Gray et al.
2003/0194718 A1 10/2003 Tomita et al.
2004/0161773 A1 8/2004 Rogan et al.
2004/0241734 A1 12/2004 Davis
2005/0064450 A1 3/2005 Lucas et al.

FOREIGN PATENT DOCUMENTS

WO 0188089 A2 11/2001

OTHER PUBLICATIONS

Altschul, S.F., et al., “Basic Local Alignment Search Tool.” J Mol Biol, 1990, 215/3:403-410.
Bardoni, et al., “Isolation and Characterization of a Family of Sequences Dispersed on the Human X Chromosome.” Cytogenet and Cell Genet, Human Gene Mapping 9, Abstracts of Workshop Presentations, Paris Conference, 1987, p. 575.
Batzoglou, S., et al., “Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction.” Genome Research, 2000, 10:950-958.
Buhler, J., “Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing.” Bioinformatics, 2001, 17/5:419-428.
Carrillo, H., et al., “The Multiple Sequence Alignment Problem in Biology.” SIAM J Applied Math, 1988, 48/5:1073-1082.
Chang, P-C., et al., “Design and Assessment of Fast Algorithm for Identifying Specific Probes for Human and Mouse Genes.” Bioinformatics, 2003, 19/11:1311-1317.
Claverie, J-M., “Computational Methods of the Identification of Genes in Vertebrate Genomic Sequences.” Hum Molec Genet, 1997, 6/10:1735-1744.
Craig, J.M., et al., “Removal of Repetitive Sequences from FISH Probes Using PCR-Assisted Affinity Chromatography.” Hum Genet, 1997, 100/3-4:472-476.
Delcher, A.L., et al., “Alignment of Whole Genomes.” Nucl Acids Res, 1999, 27/11:2369-2376.
Devereux, J., et al., “A Comprehensive Set of Sequence Analysis Programs for the VAX.” Nucl Acids Res, 1984, 12/1:387-395.
Dover, G., et al., “Molecular Drive.” Trends in Genetics, 2002, 18/11:587-589.
Edgar, R.C., et al., “PILER: Identification and Classification of Genomic Repeats.” Bioinformatics, 2005, 21(S1):i152-i158.
Eisenbarth, I., et al., “Long-Range Sequence Composition Mirrors Linkage Disequilibrium Pattern in a 1.13 Mb Region of Human Chromosome 22.” Human Molec Genet, 2001, 10/24:2833-2839.
Faranda, S., et al., “The Human Genes Encoding Renin-Binding Protein and Host Cell Factor are Closely Linked in Xq28 and Transcribed in the Same Direction.” Gene, 1995, 155:237-239.
Healy, J., et al., “Annotating Large Genomes with Exact Word Matches.” Genome Res, 2003, 13:2306-2315.
Howell, M.D., et al., “Rapid Identification of Hybridization Probes for Chromosomal Walking.” Gene, 1987, 55:41-45.
Jareborg, N., et al., “Comparative Analysis of Noncoding Regions of 77 Orthologous Mouse and Human Gene Pairs.” Genome Res, 1999, 9:815-824.
Jurka, J., “Repeats in Genomic DNA: Mining and Meaning.” Curr Opin in Struct Biol, 1998, 8/3:333-337.
Jurka, J., et al., “Censor-A Program for Identification and Elimination of Repetitive Elements from DNA Sequences.” Computers Chem, 1996, 20/1:119-121.
Kent, W.J., et al., “Conservation, Regulation, Synteny, and Introns in a Large-Scale C. briggsae-C. elegans Genomic Alignment.” Genome Res, 2000, 10:1115-1125.
Kent, W.J., “BLAT-The Blast-Like Alignment Tool.” Genome Res., 2002, 12:656-664.
Li, Y-C., et al., “Microsatellites: Genomic Distribution, Putative Functions and Mutational Mechanisms: A Review.” Molec Ecol, 2002, 11:2453-2465.
Lichter, P., et al., “Delineation of Individual Human Chromosomes in Metaphase and Interphase Cells by In Situ Suppression Hybridization Using Recombinant DNA Libraries.” Hum Genet, 1988, 80/3:224-234.
Morgenstern, B., et al., “DIALIGN: Finding Local Similarities by Multiple Sequence Alignment.” Bioinformatics, 1998, 14/3:290-294.
Mottez, E., et al., “Conservation in the 5' Region of the Long Interspersed Mouse L1 Repeat: Implication of Comparative Sequence Analysis.” Nucl Acids Res, 1986, 14/7:3119-3136.
Nakamura, Y., et al., “Variable Number of Tandem Repeat (VNTR) Markers for Human Gene Mapping.” Science, 1987, 235:1616-1622.

(Continued)

Primary Examiner: John S Brusca
(74) Attorney, Agent, or Firm: Tracy Jong Law Firm; Tracy P. Jong

(57) ABSTRACT

Single copy sequences suitable for use as DNA probes can be defined by computational analysis of genomic sequences. The present invention provides an ab initio method for identification of single copy sequences for use as probes which obviates the need to compare genomic sequences with existing catalogs of repetitive sequences. By dividing a target reference sequence into a series of shorter contiguous sequence windows and comparing these sequences with the reference genome sequence, one can identify single copy sequences in a genome. Probes can then be designed and produced from these single copy intervals.
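
The windowing idea summarized in the abstract can be illustrated with a minimal sketch. It rests on several simplifying assumptions that are not from the patent: exact substring counting stands in for the genome comparison step, only one strand is searched, and the function names (`make_windows`, `count_occurrences`, `single_copy_intervals`) and the toy sequences are hypothetical.

```python
from typing import List, Tuple


def make_windows(region: str, length: int, offset: int) -> List[Tuple[int, str]]:
    """Cut `region` into windows of `length` bases, one starting every `offset` bases."""
    if length <= 0 or offset <= 0:
        raise ValueError("length and offset must be positive")
    last_start = max(len(region) - length, 0)
    return [(s, region[s:s + length]) for s in range(0, last_start + 1, offset)]


def count_occurrences(genome: str, query: str) -> int:
    """Count exact, possibly overlapping, occurrences of `query` in `genome`.

    Assumption: exact matching replaces the alignment-based comparison.
    """
    count = 0
    pos = genome.find(query)
    while pos != -1:
        count += 1
        pos = genome.find(query, pos + 1)
    return count


def single_copy_intervals(genome: str, region: str,
                          length: int, offset: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) region coordinates of runs of single copy windows."""
    intervals: List[Tuple[int, int]] = []
    prev_single = None  # index of the most recent single copy window
    for i, (start, window) in enumerate(make_windows(region, length, offset)):
        if count_occurrences(genome, window) != 1:
            continue  # window occurs more than once (or not at all): skip it
        end = start + len(window)
        # Merge only when the previous window was also single copy and touches this one.
        if intervals and prev_single == i - 1 and start <= intervals[-1][1]:
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
        prev_single = i
    return intervals


# Toy data: two unique 8-base stretches, each followed by the same 8-base repeat.
toy_genome = "ACGTTGCA" + "GATTACAG" + "TCTCGGAA" + "GATTACAG"
toy_region = toy_genome[:24]
print(single_copy_intervals(toy_genome, toy_region, 4, 4))  # [(0, 8), (16, 24)]
```

In this toy case the two windows covering the repeated stretch each occur twice in the genome, so they are dropped and the remaining windows merge into two single copy intervals. A realistic implementation would need tolerance for mismatches and both strands, which exact counting does not provide.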
24 Claims, 2 Drawing Sheets

US 8,407,013 B2
Page 2

OTHER PUBLICATIONS

Newkirk, H.L., et al., “Distortion of Quantitative Genomic and Expression Hybridization by Cot-1 DNA: Mitigation of this Effect.” Nucl Acids Res, 2005, 33/22:e191, 8 pages.
Newkirk, H.L., et al., “Determination of Genomic Copy Number with Quantitative Microsphere Hybridization.” Human Mutation, 2006, 27/4:376-386.
Price, A.L., et al., “De Novo Identification of Repeat Families in Large Genomes.” Bioinformatics, 2005, 21(S1):i1351-i1358.
Rogan, P.K., et al., “L1 Repeat Elements in the Human ε-Gγ-Globin Gene Intergenic Region: Sequence Analysis and Concerted Evolution within this Family.” Mol Biol, 1987, 4/4:327-342.
Schwartz, S., et al., “PipMaker-A Web Server for Aligning Two Genomic DNA Sequences.” Genome Res, 2000, 10:577-586.
Smit, A.F.A., “The Origin of Interspersed Repeats in the Human Genome.” Current Opin in Gen & Dev, 1996, 6/6:743-748.
Vermeesch, J.R., et al., “Interstitial Telomeric Sequences at the Junction Site of a Jumping Translocation.” Human Genet, 1997, 99:735-737.
Vincens, P., et al., “A Strategy for Finding Regions of Similarity in Complete Genome Sequences.” Bioinformatics, 1998, 14/8:715-725.
Zhang, Z., et al., “A Greedy Algorithm for Aligning DNA Sequences.” J of Comp Biol, 2000, 7/1-2:203-214.
Gene Expression: vol. 2, Eukaryotic Chromosomes, 1983, Lewin, B., Ed., Wiley, p. 503, Wiley & Sons, Inc., New York City, New York.

U.S. Patent Mar. 26, 2013 Sheet 2 of 2 US 8,407,013 B2

FIG. 2 (flowchart; box text as recovered from the drawing)

202 INPUT
1. Sequence of region
2. Length of subsequence (L)
3. Length offset between subsequences

204 Program ABINITIO.PL creates a set of individual subsequences covering the region for genome comparisons.

206 Script WUBL (input from ABINITIO.PL)
1. Genome comparison with WU-BLASTN
2. Program BLASTPARSE: filter and condense output to hit list based on empirically derived criteria
(Adjacent label, partly illegible: "... subsequences have been analyzed")

208 Program COUNTHITS.PL takes the output from BLASTPARSE.PL.
1. Distill hit list for each interval to a copy number
2. Sort by sequence coordinate
3. Identify intervals with multiple hits (these contain repeat elements)
4. Record single copy intervals as set A

210
1. Group adjacent single copy intervals into contigs {L1...}, which are members of the set A
2. For each contig, create a series of subsequences with small offset up to L from beginning and end of contig with program SUBSEQ

Spawn independent threads: upstream boundary (U), downstream boundary (D)

Call programs until COUNTHITS produces hit count >1 (defines single copy boundary):
1. Script WUBL
2. Program BLASTPARSE
3. Program COUNTHITS

1. For each contig, record coordinates of single copy interval boundaries (U, D)
2. Combine with adjacent single copy contig to define complete interval (A-U, A+D)

AB INITIO GENERATION OF SINGLE COPY GENOMIC PROBES

CROSS REFERENCE TO RELATED APPLICATIONS

This continuation-in-part application claims the benefit of U.S. Ser. No. 60/687,945, filed Jun. 7, 2005, non-provisional application U.S. Ser. No. 11/324,102 filed on Dec. 30, 2005 and now U.S. Pat. No. 7,734,424 issued Jun. 8, 2010, and continuation application U.S. Ser. No. 12/794,933 filed on Jun. 7, 2010, also publication number US 2010-0240880 A1.

... blocking their hybridization, or by deducing the single copy sequences by comparisons of known genomic reference sequences with comprehensive databases of consensus sequences that are representative of established repetitive sequence families and subfamilies (Jurka, Curr Opin Struct Biol. 1998, 8(3):333-7).

Cot-1 DNA is often used to attempt to suppress cross hybridization of repetitive sequences to probes. The problem with attempting to suppress repeat hybridization with Cot-1 DNA is that it can result in enhanced non-specific hybridization between probes and genomic targets. Specifically, it has
