US008407013B2

(12) United States Patent (10) Patent No.: US 8,407,013 B2 Rogan (45) Date of Patent: Mar. 26, 2013

(54) AB INITIO GENERATION OF SINGLE COPY GENOMIC PROBES

(76) Inventor: Peter K. Rogan, London (CA)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 13/469,531

(22) Filed: May 11, 2012

(65) Prior Publication Data
US 2012/0253689 A1 Oct. 4, 2012

Related U.S. Application Data

(63) Continuation of application No. 12/794,933, filed on Jun. 7, 2010, now Pat. No. 8,209,129, which is a continuation of application No. 11/324,102, filed on Dec. 30, 2005, now Pat. No. 7,734,424.

(60) Provisional application No. 60/687,945, filed on Jun. 7, 2005.

(51) Int. Cl.
G06F 9/00 (2011.01)
C12N 15/11 (2006.01)
C12Q 1/68 (2006.01)

(52) U.S. Cl. ...... 702/20; 536/24.3; 435/6.11

(58) Field of Classification Search ...... None
See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
6,150,160 A 11/2000 Kazazian, Jr.
6,828,097 B1 12/2004 Knoll et al.
7,014,997 B2 3/2006 Knoll et al.
2003/0022204 A1 1/2003 Lansdorp
2003/0044822 A1 3/2003 Fletcher et al.
2003/0108943 A1 6/2003 Gray et al.
2003/0194718 A1 10/2003 Tomita et al.
2004/0161773 A1 8/2004 Rogan et al.
2004/0241734 A1 12/2004 Davis
2005/0064450 A1 3/2005 Lucas et al.

FOREIGN PATENT DOCUMENTS
WO 0188089 A2 11/2001

OTHER PUBLICATIONS
Altschul, S.F., et al., “Basic Local Alignment Search Tool.” J Mol Biol, 1990, 215/3:403-410.
Bardoni, et al., “Isolation and Characterization of a Family of Sequences Dispersed on the Human X Chromosome.” Cytogenet and Cell Genet, Human Gene Mapping 9. Abstracts of Workshop Presentations, Paris Conference, 1987, p. 575.
Batzoglou, S., et al., “Human and Mouse Gene Structure: Comparative Analysis and Application to Exon Prediction.” Genome Research, 2000, 10:950-958.
Buhler, J., “Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing.” Bioinformatics, 2001, 17/5:419-428.
Carrillo, H., et al., “The Multiple Sequence Alignment Problem in Biology.” SIAM J Applied Math, 1988, 48/5:1073-1082.
Chang, P-C., et al., “Design and Assessment of Fast Algorithm for Identifying Specific Probes for Human and Mouse Genes.” Bioinformatics, 2003, 19/11:1311-1317.
Claverie, J-M., “Computational Methods of the Identification of in Vertebrate Genomic Sequences.” Hum Molec Genet, 1997, 6/10:1735-1744.
Craig, J.M., et al., “Removal of Repetitive Sequences from FISH Probes Using PCR-Assisted Affinity Chromatography.” Hum Genet, 1997, 100/3-4:472-476.
Delcher, A.L., et al., “Alignment of Whole .” Nucl Acids Res, 1999, 27/11:2369-2376.
Devereux, J., et al., “A Comprehensive Set of Sequence Analysis Programs for the VAX.” Nucl Acids Res, 1984, 12/1:387-395.
Dover, G., et al., “Molecular Drive.” Trends in Genetics, 2002, 18/11:587-589.
Edgar, R.C., et al., “PILER: Identification and Classification of Genomic Repeats.” , 2005, 21(S1):i152-i158.
Eisenbarth, I., et al., “Long-Range Sequence Composition Mirrors Linkage Disequilibrium Pattern in a 1.13 Mb Region of Human Chromosome 22.” Human Molec Genet, 2001, 10/24:2833-2839.
Faranda, S., et al., “The Human Genes Encoding Renin-Binding Protein and Host Cell Factor are Closely Linked in Xq28 and Transcribed in the Same Direction.” , 1995, 155:237-239.
Healy, J., et al., “Annotating Large Genomes with Exact Word Matches.” Res, 2003, 13:2306-2315.
Howell, M.D., et al., “Rapid Identification of Hybridization Probes for Chromosomal Walking.” Gene, 1987, 55:41-45.
Jareborg, N., et al., “Comparative Analysis of Noncoding Regions of 77 Orthologous Mouse and Human Gene Pairs.” Genome Res, 1999, 9:815-824.
Jurka, J., “Repeats in Genomic DNA: Mining and Meaning.” Curr Opin in Struct Biol, 1998, 8/3:333-337.
Jurka, J., et al., “Censor-A Program for Identification and Elimination of Repetitive Elements from DNA Sequences.” Computers Chem, 1996, 20/1:119-121.
Kent, W.J., et al., “Conservation, Regulation, Synteny, and Introns in a Large-Scale C. briggsae-C. elegans Genomic Alignment.” Genome Res, 2000, 10:1115-1125.
Kent, W.J., “BLAT The Blast-Like Alignment Tool.” Genome Res., 2002, 12:656-664.
Li, Y-C., et al., “Microsatellites: Genomic Distribution, Putative Functions and Mutational Mechanisms: A Review.” Molec Ecol, 2002, 11:2453-2465.
Lichter, P., et al., “Delineation of Individual Human Chromosomes in Metaphase and Interphase Cells by In Situ Suppression Hybridization Using Recombinant DNA Libraries.” Hum Genet, 1988, 80/3:224-234.
Morgenstern, B., et al., “DIALIGN: Finding Local Similarities by Multiple .” Bioinformatics, 1998, 14/3:290-294.
Mottez, E., et al., “Conservation in the 5' Region of the Long Interspersed Mouse L1 Repeat: Implication of Comparative Sequence Analysis.” Nucl Acids Res, 1986, 14/7:3119-3136.
Nakamura, Y., et al., “Variable Number of Tandem Repeat (VNTR) Markers for Human Gene Mapping.” Science, 1987, 235:1616-1622.
(Continued)

Primary Examiner: John S Brusca

(74) Attorney, Agent, or Firm: Tracy Jong Law Firm; Tracy P. Jong

(57) ABSTRACT

Single copy sequences suitable for use as DNA probes can be defined by computational analysis of genomic sequences. The present invention provides an ab initio method for identification of single copy sequences for use as probes which obviates the need to compare genomic sequences with existing catalogs of repetitive sequences. By dividing a target reference sequence into a series of shorter contiguous sequence windows and comparing these sequences with the reference genome sequence, one can identify single copy sequences in a genome. Probes can then be designed and produced from these single copy intervals.

24 Claims, 2 Drawing Sheets

OTHER PUBLICATIONS
Newkirk, H.L., et al., “Distortion of Quantitative Genomic and Expression Hybridization by Cot-1 DNA: Mitigation of this Effect.” Nucl Acids Res, 2005, 33/22:e191, 8 pages.
Newkirk, H.L., et al., “Determination of Genomic Copy Number with Quantitative Microsphere Hybridization.” Human Mutation, 2006, 27/4:376-386.
Price, A.L., et al., “De Novo Identification of Repeat Families in Large Genomes.” Bioinformatics, 2005, 21(S1):i1351-i1358.
Rogan, P.K., et al., “L1 Repeat Elements in the Human ε-Gγ-Globin Gene Intergenic Region: Sequence Analysis and Concerted Evolution with this Family.” Mol Biol, 1987, 4/4:327-342.
Schwartz, S., et al., “PipMaker-A Web Server for Aligning Two Genomic DNA Sequences.” Genome Res, 2000, 10:577-586.
Smit, A.F.A., “The Origin of Interspersed Repeats in the Human Genome.” Current Opin in Gen & Dev, 1996, 6/6:743-748.
Vermeesch, J.R., et al., “Interstitial Telomeric Sequences at the Junction Site of a Jumping Translocation.” Human Genet, 1997, 99:735-737.
Vincens, P., et al., “A Strategy for Finding Regions of Similarity in Complete Genome Sequences.” Bioinformatics, 1998, 14/8:715-725.
Zhang, Z., et al., “A Greedy Algorithm for Aligning DNA Sequences.” J of Comp Biol, 2000, 7/1-2:203-214.
Gene Expression: vol. 2. Eukaryotic Chromosomes, 1983, Lewin, B., Ed., Wiley, p. 503, Wiley & Sons, Inc., New York City, New York.

U.S. Patent Mar. 26, 2013 Sheet 2 of 2 US 8,407,013 B2

FIG. 2

202: INPUT
1. SEQUENCE OF REGION
2. LENGTH OF SUBSEQUENCE (L)
3. LENGTH OFFSET BETWEEN SUBSEQUENCES

204: PROGRAM ABINITIO.PL CREATES A SET OF INDIVIDUAL SUBSEQUENCES COVERING REGION FOR GENOME COMPARISONS
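The windowing step in box 204 can be illustrated with a minimal Python sketch. This is not the ABINITIO.PL program itself, whose source is not reproduced here; the function name `make_windows`, the 0-based half-open coordinates, and the choice to let the final window be shorter than L are all assumptions made for illustration.

```python
def make_windows(region_length, window_length, offset):
    """Return 0-based, half-open (start, end) windows that cover a region.

    region_length: length of the target region in nucleotides
    window_length: length of each subsequence (L in FIG. 2)
    offset: distance between the starts of consecutive subsequences
    The last window is truncated at the end of the region (an assumption).
    """
    if window_length <= 0 or offset <= 0:
        raise ValueError("window_length and offset must be positive")
    windows = []
    start = 0
    while start < region_length:
        end = min(start + window_length, region_length)
        windows.append((start, end))
        if end == region_length:
            break  # the region is fully covered
        start += offset
    return windows


# Contiguous windows: offset equals the window length.
print(make_windows(10, 4, 4))
# Overlapping windows: offset smaller than the window length.
print(make_windows(10, 4, 2))
```

Setting the offset equal to L gives the contiguous windows described in the abstract; a smaller offset gives overlapping windows.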

206: SCRIPT WUBL (INPUT FROM ABINITIO.PL)
1. GENOME COMPARISON WITH WU-BLASTN
2. PROGRAM BLASTPARSE: FILTER AND CONDENSE OUTPUT TO HIT LIST BASED ON EMPIRICALLY DERIVED CRITERIA
(LOOP LABEL: SUBSEQUENCES HAVE BEEN ANALYZED)

208: PROGRAM COUNTHITS.PL TAKES THE OUTPUT FROM BLASTPARSE.PL
1. DISTILL HIT LIST FOR EACH INTERVAL TO A COPY NUMBER
2. SORT BY SEQUENCE COORDINATE
3. IDENTIFY INTERVALS WITH MULTIPLE HITS (THESE CONTAIN REPEAT ELEMENTS)
4. RECORD SINGLE COPY INTERVALS AS SET A

210
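The four operations in box 208 amount to a count-and-partition over the filtered hit list. The sketch below is one way to express that in Python, not the COUNTHITS.PL program; the name `classify_intervals` and the representation of a hit as the (start, end) key of the query window that produced it are illustrative assumptions.

```python
from collections import Counter


def classify_intervals(windows, hits):
    """Split query windows into single copy and multi-copy sets.

    windows: list of (start, end) query windows
    hits: iterable with one (start, end) entry per genome hit that
          survived filtering, keyed by the query window that produced it
    Returns (single_copy, repeats), each sorted by sequence coordinate.
    """
    copy_number = Counter(hits)          # 1. distill hit list to a copy number
    ordered = sorted(windows)            # 2. sort by sequence coordinate
    repeats = [w for w in ordered if copy_number[w] > 1]       # 3. multiple hits
    single_copy = [w for w in ordered if copy_number[w] == 1]  # 4. set A
    return single_copy, repeats


windows = [(8, 12), (0, 4), (4, 8)]
hits = [(0, 4), (4, 8), (4, 8), (4, 8), (8, 12)]
single_copy, repeats = classify_intervals(windows, hits)
print(single_copy)
print(repeats)
```

A window with no surviving hit is placed in neither set here; how such windows should be handled is not stated in the figure.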

1. GROUP ADJACENT SINGLE COPY INTERVALS INTO CONTIGS (L1...}, WHICH ARE MEMBERS OF THE SET A
2. FOR EACH CONTIG, CREATE A SERIES OF SUBSEQUENCES WITH SMALL OFFSET UP TO L FROM BEGINNING AND END OF CONTIG WITH PROGRAM SUBSEQ
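Grouping adjacent single copy intervals into contigs is an interval-merging problem. The following Python sketch assumes 0-based half-open coordinates and treats two intervals as adjacent when they touch or overlap; the name `group_into_contigs` is hypothetical and the figure does not specify the exact adjacency rule.

```python
def group_into_contigs(single_copy_intervals):
    """Merge touching or overlapping (start, end) intervals into contigs."""
    contigs = []
    for start, end in sorted(single_copy_intervals):
        if contigs and start <= contigs[-1][1]:
            # Interval touches or overlaps the current contig: extend it.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # Gap before this interval: start a new contig.
            contigs.append([start, end])
    return [tuple(c) for c in contigs]


print(group_into_contigs([(0, 4), (4, 8), (12, 16)]))
```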

SPAWN INDEPENDENT THREADS: UPSTREAM BOUNDARY (U), DOWNSTREAM BOUNDARY (D)
CALL PROGRAMS:
1. SCRIPT WUBL
2. PROGRAM BLASTPARSE
3. PROGRAM COUNTHITS
UNTIL COUNTHITS PRODUCES HIT COUNT >1 (DEFINES SINGLE COPY BOUNDARY)
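The boundary search in this box can be sketched as two independent walks outward from a contig, each stopping when a displaced window is no longer single copy. The Python below is a simplified illustration under stated assumptions: `count_hits(start, end)` is a hypothetical callback standing in for the WUBL, BLASTPARSE and COUNTHITS chain, coordinates are 0-based and half-open, and the step size, window length and chromosome length are caller-supplied parameters that the figure does not fix.

```python
from concurrent.futures import ThreadPoolExecutor


def extend_boundary(edge, direction, window_length, step, limit,
                    chrom_length, count_hits):
    """Return how far a contig edge can move while windows stay single copy.

    direction: -1 for the upstream boundary (U), +1 for downstream (D)
    limit: maximum displacement to try (up to L in FIG. 2)
    count_hits: hypothetical callback returning the genome copy number
                of the window [start, end)
    """
    extension = 0
    shift = step
    while shift <= limit:
        if direction < 0:
            start = edge - shift
            end = start + window_length
        else:
            end = edge + shift
            start = end - window_length
        if start < 0 or end > chrom_length:
            break  # end of the chromosome reached
        if count_hits(start, end) > 1:
            break  # window now overlaps a repeat: boundary found
        extension = shift
        shift += step
    return extension


def refine_contig(contig, window_length, step, limit, chrom_length, count_hits):
    """Search both boundaries in separate threads and return (A-U, A+D)."""
    a_start, a_end = contig
    with ThreadPoolExecutor(max_workers=2) as pool:
        up = pool.submit(extend_boundary, a_start, -1, window_length,
                         step, limit, chrom_length, count_hits)
        down = pool.submit(extend_boundary, a_end, +1, window_length,
                           step, limit, chrom_length, count_hits)
        return (a_start - up.result(), a_end + down.result())


# Toy genome model: repeats occupy [0, 95) and [130, 200).
def toy_count_hits(start, end):
    return 2 if start < 95 or end > 130 else 1


print(refine_contig((100, 120), 10, 1, 20, 200, toy_count_hits))
```

In practice each `count_hits` call would be a genome-wide search, so the two directions are natural candidates for concurrent execution, which is presumably why the figure calls for independent threads.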

1. FOREACH CONTIG, RECORD COORDINATES OF SINGLE COPY INTERVALBOUNDARIES (U.D) 2. COMBINE WITH ADJACENT SINGLE COPY CONTIG TO DEFINE COMPLETE INTERVAL (A-UA+D) US 8,407,013 B2 1. 2 AB INTO GENERATION OF SINGLE COPY blocking their hybridization, or by deducing the single copy GENOMIC PROBES sequences by comparisons of known genomic reference sequences with comprehensive databases of consensus CROSS REFERENCE TO RELATED sequences that are representative of established repetitive APPLICATIONS sequence families and subfamilies (Jurka, Curr Opin Struct Biol. 1998, 8(3):333-7). This continuation-in-part application claims the benefit of Cot-1 DNA is often used to attempt to suppress cross U.S. Ser. No. 60/687,945, filed Jun. 7, 2005, non-provisional hybridization of repetitive sequences to probes. The problem application U.S. Ser. No. 1 1/324,102 filed on Dec. 30, 2005 with attempting to suppress repeat hybridization with Cot-1 and now U.S. Pat. No. 7,734,424 issued Jun. 8, 2010, and 10 DNA is that it can result in enhanced non-specific hybridiza continuation application U.S. Ser. No. 12/794,933 filed on tion between probes and genomic targets. Specifically, it has Jun. 7, 2010, also publication number US 2010-024.0880A1. been demonstrated that Cot-1 added to target DNA actually The contents each of these patent applications and publica enhanced hybridization to genomic probes containing con tions, and Disclosure Document No. 576,582, filed May 3, served repetitive elements (Newkirk, H. L. et al., Nuc. Acids 2005, are each hereby incorporated herein by reference. 15 Res. 2005, 33 (22):e 191). In addition to repetitive sequences, Cot-1 was also found to be enriched for linked single copy REFERENCE TO SEQUENCE LISTING sequences (Newkirk, H. L. et al., Nuc. Acids Res. 2005, APPENDIX 33(22):e 191). 
Adventitious association between these sequences and probes distorts quantitative measurements of Accompanying this application is an electronic EFS-web the probes hybridized to desired genomic targets. This also filing of a gene sequence listings in compliance with 37 CFR affects the reproducibility of hybridization assays with S1.52(e) and 37 CFRS1.821-1.825 in standard ASCII char Sources of genomic DNA, in particular, and can also impact acter and file formats. The file contains Appendixes A, B, C, hybridization to mRNAs that contain repetitive sequences D and E. (typically found in the untranslated regions of transcripts). 25 The increased non-specific hybridization that occurs when FIELD OF THE INVENTION using Cot-1 to block repeat sequence hybridization has par ticularly adverse effects on microarray studies which depend The present invention generally relates to ab initio methods on quantification of signals obtained by hybridization to the of computationally determining the locations of single copy unblocked presumably single copy sequences. intervals in genomes for use as probes. 30 The elimination of Cot-1 DNA, either by sequestering repeats or by blocking their hybridization, was accomplished BACKGROUND by direct synthesis of probes lacking repeat sequences. Knoll et al., U.S. Pat. No. 6,828,097 (termed '097 patent), discloses Conventional hybridization studies with genome-derived a procedure for determining the locations of single copy nucleic acid probes require unlabeled Cot-1 DNA fractions to 35 intervals and design of probes for hybridization to their block cross-hybridization of repetitive sequences contained complementary locations in the human genome. It is dis within these probes in eukaryotic genomes. 
This is necessary, closed that the procedure can be implemented for any genome because to achieve the specificity needed to identify, detector in which a comprehensive catalog of repetitive sequences is quantify unique sequences contained in nucleic acid probes, available. Presumed single copy sequences containing repeti confounding hybridization from repetitive sequences must be 40 tive elements will cross-hybridize to multiple locations in the eliminated. Repetitive sequences comprise at least 50% of the genome. Where hybridization occurs in too many genomic human genome and contain a diverse set of distinct families locations, the lack of specificity adversely impacts the utility (Smit, Curr Opin Genet Dev. 1996, 6(6):743-8). Despite the of the probes in diagnosing disease. Therefore, methods from lack of selection for their function and broad, often variable which single copy sequences can be deduced without requir degrees of orthology, such sequences often display sequence 45 ing a comparison of the genomic sequence with a compre conservation throughout mammalian evolution (Rogan et al. hensive database of consensus repetitive sequence family Mol Biot Evol. 1987, 4(4):327-42: Mottez et al. Nucleic members would represent an improvement over current in Acids Res. 1986, 14(7):31 19-36), principally because they silico methods of identifying single copy intervals and the have properties of semiautonomous transposable elements ensuing probes. that promote frequent amplification during host organism 50 Methods have been developed which can align the evolution, originally termed molecular drive by Dover (Do sequences of different, related, or the same complete ver, Trends Genet. 2002, 18(11):587-9). It is desirable to genomes from which the locations of individual repetitive remove Such sequences in most clinical diagnostic applica sequences in the genome can be inferred. 
One such example tions; because of their ubiquity throughout the genome, their is the maximal unique matching algorithm which builds Suf presence can interfere with the development of probes for 55 fix trees from all maximal length unique matches (MUM) unique regions of the genome that correspond to functional between sequence strings (Delcher et al. Nuc. Acids Res. genes whose structures must be preserved because they are 1999, 27:2369-2376). Repeats can be detected in a genome essential for normal development and health. because they are found in overlapping MUMs that are not Repetitive sequences are often interspersed with unique or necessarily contiguous in that genome. Once repeat sequence single copy genes, especially in eukaryotic genomes, and 60 elements are identified through Such comparisons, families of their removal from genomic probes is essential to ensure that related repeat sequences can be identified through compari diagnostic probes specifically recognize only a single loca Sons of individual family members with the genome sequence tion in the genome. These sequences can be eliminated by itself Another popular method, the BLAT algorithm (Kent et laboratory techniques designed to sequester them away from al. Genome Res. 2002, 12:656-64), is a rapid alignment labeled probes containing both single copy and interspersed 65 method that uses a hash-index algorithm to quickly find repetitive sequences (Lichter et al. Hum Genet. 1988, 80(3): sequences similar to a particular test sequence in a genome; it 224-34; Craig et al. Hum Genet 1997, 100:472-476), by is not, however, an ab initio approach for single copy US 8,407,013 B2 3 4 sequence identification. Other comparative alignment tools individuals or species. These approaches do not involve the useful for detecting repeat sequences include ASSIRC (Vin use of repetitive sequences to infer the presence of single copy cens et al. 
Bioinformatics 1998, 14:715-725), DIALIGN sequence intervals (between adjacent repetitive sequences in (Morgenstern etal Bioinformatics. 1998, 14(3):290-4.), DBA the genome) for the development of useful single copy probes (Jareborg et al. Genome Res. 1999, 9(9):815-24), GLASS from the intervening regions between the deduced repetitive (Batzoglou et al. Genome Res. 2000, 10(7):950-8), LSH sequences. These algorithms therefore produce libraries ALL-PAIRS (Buhler, Bioinformatics. 2001, 17(5):419-28), similar to that used in the 097 patent, and the sequences MEGABLAST (Zhang JComput Biol. 2000, 7(1-2):203-14), contained in these libraries will be similar to those already PIPMaker (Schwartz et al. Genome Res. 2000, 10(4):577 known. These algorithms do not describe inferred single copy 86), SSAHA (www.sanger.ac.uk/Software/analysis/ 10 intervals, or in particular, the use of probes obtained from SSAHA), and WABA (Kent and Zahler Genome Res. 2000, those deduced intervals. 10(8): 1115-25). U.S. application Ser. No. 10/229,058 discloses that SUMMARY OF THE INVENTION sequences can be screened for the presence of known repeti tive sequence families (e.g., Alu elements); however the 15 The present invention relates to the computational design details of these screening procedures are not disclosed. U.S. of nucleic acid probes that exclusively contain sequences application Ser. No. 10/132,002 discloses a procedure for found at a single location in a reference genome sequence. detecting repetitive sequences experimentally, but does not A method is described to identify single copy regions in a disclose the identification of single copy sequences. U.S. target genome interval of known sequence and then preparing application Ser. No. 
10/833,954 discloses that in situ hybrid probes from these regions, principally for the detection of ization of a mixture of single copy and repetitive sequences chromosomal and genomic abnormalities by nucleic acid can be performed in the absence of blocking nucleic acids that hybridization. The method divides the target genome interval prevent cross hybridization of repetitive sequences. A formu into consecutive sequence Subintervals and compares each of lation of a hybridization reagent and washing conditions that the Subintervals with the reference genome sequence. Those could mitigate Such cross-hybridization are disclosed, but no 25 subintervals which are found once within the reference information is provided regarding the location of single copy genome sequence, typically referred to as single copy inter and repetitive sequences within the probe segment. U.S. Ser. vals, serve as sequences that serve as a starting point for No. 10/132,993 discloses laboratory chromatographic meth Subsequent analysis. To more precisely localize the single ods to remove repetitive sequences from genomic DNA to copy sequences, i.e., the single copies of sequences that make probes that are Substantially complementary to single 30 appear within a single copy interval, these Subsequences may copy intervals. In this application, the locations or the specific either be further resected into non-overlapping sub-subinter single copy sequences are not determined prior to experimen vals or they may be modified by selecting windows that tally removing the repeat sequences. A very similar approach overlap the original single copy Subintervals, but which are is described in U.S. application Ser. No. 10/798,949, in which displaced by one or more nucleotides from the original repetitive sequences are Subtracted by hybridization, and 35 genomic coordinates in either the telomeric or centromeric single copy sequences are Subsequently amplified using so direction. 
Typically, as series of overlapping Sub-Subintervals called unique sequence primers. Subtraction hybridization is are derived from the original sequence by extending the Sub not a robust technique, because low- to middle-reiteration interval at one end of the Sub-Subsequence and shortening the frequency repeats are not completely eliminated under the Sub-Subsequence by the same length at the other end. The hybridization conditions typically used in these studies. 40 directionality of the overlapping Sub-Subsequence set is dic Therefore, the selection of these primers could result in the tated by the orientation of the single copy Subsequence adja production of probes that are contaminated with repetitive cent to the Subsequence that contains one or more repeat sequence elements. Similarly, in U.S. application Ser. No. elements. The overlapping Sub-Subsequences are selected so 10/229,058, the repetitive sequences are fractionated by that their displacement moves toward the location of the hybridization methods prior to library production and 45 single copy Subsequence. The overlapping Sub-Subsequences sequencing. Presumably, the single copy sequences would be are compared with the genome reference sequence and the revealed after library enrichment; however U.S. Ser. No. procedure is iterated by progressively decreasing the degree 10/229,058 does not teach how to identify the precise bound of overlap until either the overlapped interval demonstrates aries of these sequences in the genome, and it does not teach multiple regions of similarity in the reference genome or the the method of determining how to identify single copy 50 end of the chromosome is reached. The single copy sequences sequences for use as probes. U.S. Ser. No. 
10/330,089 is the thus obtained are then used to prepare probes either by direct most recent of several continuation applications which infer nucleic acid synthesis, amplification or by retrieval and puri the single copy nature of cloned sequences by their lack of fication of these sequences from recombinant clones or hybridization to total genomic DNA, which is highly genomic DNA. enriched in repetitive elements. The specific single copy 55 In the present application, the probes are labeled and then sequences are not revealed by this approach. Furthermore, the hybridized to chromosomes from patients or cell lines. How present applicants have demonstrated that the single copy ever, those of skill in the art will appreciate that the probes can sequences produced according to this method are contami be fixed on a surface or matrix and hybridized with genomic nated with repetitive sequences, since they are particularly DNA or cDNA from patients or control specimens that have insensitive to the detection of low- to moderate-abundance 60 been labeled by chemical, fluorescent, or radioactive modifi repetitive sequence family members. See U.S. Pat. No. 6,828, cation. With the present invention, it is not necessary to Sup 097, Prosecution History. press hybridization of repetitive sequences with unlabeled While several of these approaches can find locally similar Cot-1 nucleic acids when annealing these probes to their repetitive sequences without comparison to a library of unique chromosomal locations in the genomes of patient sequences (as in Knoll et al., U.S. Pat. No. 6,828,097), their 65 samples or cell line chromosomal DNA. 
objective is to identify repetitive sequences and multiple cop The ab initio methods described in the instant invention are ies of related sequences found in the genomes of different capable of identifying both the same repeat families that have US 8,407,013 B2 5 6 been previously catalogued in the art and new repeat and Subtelomeric chromosome rearrangements associated sequence families that have not been previously recognized in with idiopathic mental retardation, sex chromosome aneup the art. loidy, and monosomy chromosome 22. See, for example, Another advantage of the present invention is that Suchab U.S. Ser. No. 09/854,867. initio methods can be used to deduce single copy sequences in The probes are in the form of nucleic acid fragments or a instances of biological species for which catalogs of repeti collection of labeled nucleic acid fragments whose hybrid tive sequences have not been previously derived. ization to a target sequence can be detected. The invention also pertains to methods of developing, generating and label PARTICULARADVANTAGES OF THE ing or chemical modification of Such probes, and to uses INVENTION 10 thereof. Chemical modifications of such probes can be used to permanently attach them to Solid Surfaces Such as polystyrene Co-pending application U.S. Ser. No. 12/794.933 claims a microspheres or glass slides for Subsequent hybridization to method to identify and produce a single copy sequence in a nucleic acids obtained, for example, from a patient for diag target reference complete genome sequence by Successive nosis of a genetic disorder, such as, for example, the Syn division of the target reference genome sequence into Sub 15 dromes described in U.S. Ser. No. 09/854,867, or of various intervals and comparison of the Subintervals to the target cancers, such as, for example, breast cancer associated with reference sequence using various claimed hybridization con amplification of the HER2/NEU gene, neuroblastoma asso ditions. 
The invention claimed herein is a method of produc ciated with amplification of the N-myc gene, melanoma asso ing a hybridization probe of a target reference complete ciated with chromosome deletions of p16/CDKN2A gene, genome sequence where the probe is limited to probes con chromosome translocations activating oncogenes associated taining single copy and at least one divergent repetitive with Chronic myelogenous leukemia (BCR/ABL1), Acute sequence. The limitation of “identifying a single copy interval lymphocytic leukemia, B-cell lymphoma, prostate carci and at least one contiguous divergent repetitive interval of the noma, chromosome inversions such as that found in Acute target reference sequence wherein at least one Subsequence in Myelogenous leukemia Type M4, and losses of heterozy the target sequence contains a divergent repetitive element 25 gosity for example, monosomies for chromosome 7q., 1p, Suitable for use as a probe that hybridizes to a single location 17p, and 8p. This list of chromosome abnormalities is pro in the target genome' is patentable distinct and an improve vided for purposes of illustrating the types of abnormalities ment over the prior art because it produces an entire category suitable for detection with probes of the art. There are many of probes that the prior art eliminated with the older tech other art-recognized abnormalities which are diagnostic for niques, expanding the repertoire of single copy probes in the 30 neoplasia that involve gain or loss of copies of other genes and genome. One particular advantage realized by the method chromosomes, but result from the same or similar common claimed herein is the production of probes from within the mechanisms of chromosome rearrangement presented in boundaries of short oncogenes and tumor Suppressor genes. these examples. 
No such probes are currently available commercially and Various aspects of the present invention obviate the need to could not be produced with the methods of the prior art. In 35 compare the sequence of the genomic interval from which particular, the closest known prior art is that of inventor and single copy intervals and probes are derived with a database these previously patented methods would have produced pro of existing repetitive sequences. Generally, a genomic Subse duce individual probes that are too short (<1.5 kilobase pairs) quence is compared with the sequence of the complete hap to be used in FISH, since the density of fluorescent labels loid genome that contains that genomic Subsequence. Assum incorporated in such probes is insufficient for reliable and 40 ing the Subsequence is Sufficiently long, there is a high routine visualization by epifluorescence microscopy. probability that it will contain at least one repetitive element, Sometimes also referred to as a repetitive or repeat sequence. BRIEF DESCRIPTION OF THE DRAWINGS Repetitive elements are detected by counting the number of times that the Subsequence occurs in the genome. Typically, FIG. 1 is a block diagram illustrating a user interacting with 45 the presence of more than one copy of a sequence would a computing environment in one embodiment of the inven exclude that sequence from being defined for use as an ab tion. initio single copy probe; however, the presence of the same FIG. 2 is a flow chart depicting exemplary operations for sequence tandemly repeated fewer than 10 times at a single deriving the locations of single copy intervals used in probe location, preferably fewer than 8 times, more preferably production. 50 fewer than 5 times, and still more preferably fewer than 3 times, in the genome may still be useful for detection of DETAILED DESCRIPTION OF THE INVENTION chromosome abnormalities if such internal tandem repetition does not display copy number polymorphism in populations. 
The present invention is concerned with nucleic acid (e.g., The locations of the repetitive elements are determined by DNA or RNA) hybridization probes for detection of genetic 55 aligning the Subsequence with each of the genomic copies or neoplastic disorders, such as for example Monosomy 1 p36 and determining the boundaries of the common multicopy syndrome, Wolf-Hirschorn Syndrome, Cri-du-Chat Syn sequence intervals. Single copy intervals will only align to a drome, Williams Syndrome, Langer-Giedeon Syndrome, single genomic location. Accordingly, repetitive sequences, Chronic myelogenous leukemia, Acute lymphocytic leuke and therefore, single copy sequences as well, are deduced by mia, Aneuploidy for chromosome 13 (eg. Patau Syndrome), 60 ab initio methods rather than being derived from a preexisting Prader-Willi Syndrome, Angelman Syndrome & Chromo repetitive sequence database. Some 15 duplication Syndrome, Acute Myelogenous leuke One aspect of the invention, therefore, is probes that mia Type M4, Rubenstein-Taybi Syndrome, Smith-Mage hybridize with the deduced single copy sequences. The nis Syndrome, Charcot-Marie Tooth Disease Type 1A, probes hereof may be used with any nucleic acid target that Miller-Dieker Syndrome, Alagille Syndrome, Down Syn 65 contains the complementary single copy sequence as well as drome, DiGeorge/Velocardiofacial Syndrome, Schizophre potentially repetitive sequences. These target sequences may nia, Kallman Syndrome, Turner and Leri-Weill Syndromes, include, but are not limited to chromosomal or purified US 8,407,013 B2 7 8 nuclear DNA, heteronuclear RNA, cDNA or mRNA species the diploid genome. Therefore, as additional reference that contain repetitive sequences as integral components of genome sequences from different individuals are publicly the transcript. 
In the ensuing detailed explanation, the usual case of a DNA target sequence and DNA probes is discussed; however, those skilled in the art will understand that the discussion is equally applicable (with art-recognized differences owing to the nature of the target sequences and probes) to other nucleic acid species.
One characteristic of the probes of the present invention is that they are made up of “single copy” or “unique” DNA sequences which are both complementary to at least a portion of the target DNA region of interest and essentially free of sequences complementary to repeat sequences within the genome of which the target region is a part. Accordingly, a probe made up of a single copy or unique sequence is complementary to essentially only one sequence in the corresponding genome. As used herein, a “repeat sequence” or “repetitive sequence” is a sequence which appears at least about twice in the genome of which the target DNA is a part. Typically, a repeat sequence will appear in a genome at least about 5 times, preferably about 50 times, more preferably about 200 times, and even more preferably about 1000 times. Factors affecting the number of times a repeat sequence appears in a genome include, for example, the size of the genome, evolutionary age of the repeat (degree of divergence from other related sequences), the mechanism(s) of copy number increase, and the relevance of pathogens which integrate into the host genome, horizontal genetic transfer (if any), and associative mating between individuals who are heterozygous for repetitive sequence copy number. A repeat sequence will generally have a sequence identity between repeats of at least about 60%, preferably at least about 70%, more preferably at least about 80%, still more preferably at least about 90%, even more preferably at least about 95%, and
available, genomic probes of the art are compared with each reference genome to verify their single copy nature in each of the populations for which the probe is to be employed.
Repeat sequences occur in multiple copies in the haploid genome. The number of copies of any family of related repetitive sequences can range from ten to hundreds of thousands, depending on a number of factors, including, for example, mechanisms of slipped mispairing during DNA replication, amplification by unscheduled DNA replication, expansion or contraction through unequal or illegitimate crossover or gene conversion, transposition, transduction, or viral integration, or retrotransposition. The Alu family of repetitive DNA are exemplary of the latter numerous variety. The copies of a repeat may be clustered or interspersed throughout the genome. Repeats may be clustered in one or more locations in the genome, such as, for example, repetitive sequences occurring near the centromeres of each chromosome, and variable number tandem repeats (VNTRs; Nakamura et al., Science, 1987, 235:1616); or the repeat sequences may be distributed over a single chromosome, such as, for example, repeats found only on the X chromosome as described by Bardoni et al., Cytogenet. Cell Genet., 46:575 (1987); or the repeats may be distributed over all the chromosomes, such as, for example, the Alu (SINE), and L1 (LINE) families of repetitive sequences.
Simple repeats of low complexity can be found within genes but are more commonly found in non-coding genomic sequences. Such repeated elements consist of mono-, di-, tri-, tetra-, or penta-nucleotide core sequence elements arrayed in tandem units. Often the number of tandem units comprising these repeated sequences varies at the identical locations among genomes from different individuals.
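By way of non-limiting illustration only, simple repeats of the kind described above (mono- to penta-nucleotide cores arrayed in tandem) might be located by scanning for consecutive runs of a core element. The Python sketch below is not part of the disclosed software: the function name `find_simple_repeats`, the use of a regular expression, and the `min_units` threshold of four tandem units are hypothetical choices made for the example.

```python
import re

def find_simple_repeats(seq, max_core=5, min_units=4):
    """Return sorted (start, end, core, units) tuples for tandem runs of short cores.

    Assumption: a run counts as a simple repeat once the core occurs at least
    `min_units` times in a row; the source text gives no such threshold.
    """
    seq = seq.upper()
    hits = []
    for core_len in range(1, max_core + 1):
        # One core of core_len bases followed by (min_units - 1) or more copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (core_len, min_units - 1))
        for match in pattern.finditer(seq):
            core = match.group(1)
            # Skip cores that are themselves built from a shorter unit (e.g. "AA"),
            # since those runs are already reported under the shorter core.
            if any(core == core[:k] * (core_len // k)
                   for k in range(1, core_len) if core_len % k == 0):
                continue
            units = (match.end() - match.start()) // core_len
            hits.append((match.start(), match.end(), core, units))
    return sorted(hits)
```

Coordinates are zero-based and end-exclusive, so each reported run can be sliced directly from the input string.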
most preferably about 99%, and will be of sufficient length or have other qualities which would cause it to interfere with the desired specific hybridization of the probe to the target DNA, i.e., the probe would hybridize with one or more copies of the repeat sequence. Generally, a repetitive sequence appears at least about 5 times in the genome, preferably at least about 50 times, and most preferably at least about 200 times and has a length of at least about 20 nucleotides, preferably at least about 40 nucleotides, more preferably at least about 50 nucleotides, still more preferably at least about 75 nucleotides, and even more preferably at least about 100 nucleotides. Repeat sequences can be of any variety, including, for example, tandem, interspersed, palindromic or shared repetitive sequences (with some copies in the target region and some elsewhere in the genome), and can appear near the centromeres of chromosomes, distributed over a single chromosome, or throughout some or all chromosomes.
These repetitive elements can be found by searching for consecutive runs of the core sequence elements in genomic sequences.
As used herein, “sequence identity” refers to a relationship between two or more polynucleotide sequences, namely a reference genome sequence and a test sequence from a genomic region of interest, i.e. containing one or more potential probe sequence(s) to be compared with the reference sequence. Sequence identity is determined by comparing the test sequence to the reference sequence after the sequences have been optimally aligned to produce the highest degree of sequence similarity, as determined by the match between strings of such sequences. Upon such alignment, sequence identity is ascertained on a position-by-position basis, e.g., the sequences are “identical” at a particular position if, at that position, the nucleotides are identical. The total number of such position identities is then divided by the total number of
This definition of a repeat includes closely related members of the same multigene family, since the utility of the probes is related to the unique locations on chromosomes. However, typically, repeat sequences are sufficiently degenerate such that most elements do not express physiologically useful proteins. Nevertheless, repeat sequences may exhibit length polymorphism such that they may be present in some individuals and absent in others. However when this is the case, complex repeats must be distinguished by copy number polymorphisms (which may contain multiple repeat elements and single copy sequences, and indeed, complete genes, in some cases). The instant invention utilizes the current assembly of a single or composite genome. One of skill in the art would recognize that polymorphisms that duplicate or delete repetitive sequence in different individuals will require that probes
nucleotides or residues in the reference sequence to give a percent sequence identity. Sequence identity can be readily calculated by known methods including, but not limited to, those described in Computational Molecular Biology, Lesk, A. N., ed., Oxford University Press, New York (1988); Biocomputing: Informatics and Genome Projects, Smith, D. W., ed., Academic Press, New York (1993); Computer Analysis of Sequence Data, Part I, Griffin, A. M., and Griffin, H. G., eds., Humana Press, New Jersey (1994); Sequence Analysis in Molecular Biology, von Heijne, G., Academic Press (1987); Sequence Analysis Primer, Gribskov, M. and Devereux, J., eds., M. Stockton Press, New York (1991); and Carillo, H., and Lipman, D., SIAM J. Applied Math., 48:1073 (1988). Preferred methods to determine sequence identity are designed to give the largest match between the sequences tested.
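As an illustrative sketch only, the position-by-position calculation of percent sequence identity defined above (identical positions divided by the number of nucleotides in the reference sequence) may be expressed as follows. It is assumed, for the purposes of the example, that the two sequences have already been optimally aligned by some other program and that gaps are written as "-"; the function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_ref, aligned_test, gap="-"):
    """Percent identity of two pre-aligned sequences, relative to the reference length.

    Assumption: both strings come from an existing alignment and therefore
    have equal length, with `gap` marking inserted or deleted positions.
    """
    if len(aligned_ref) != len(aligned_test):
        raise ValueError("aligned sequences must have equal length")
    ref = aligned_ref.upper()
    test = aligned_test.upper()
    # Count positions where both sequences carry the same nucleotide.
    identical = sum(1 for r, t in zip(ref, test) if r == t and r != gap)
    # The denominator is the number of nucleotides in the reference, gaps excluded.
    ref_length = sum(1 for r in ref if r != gap)
    if ref_length == 0:
        raise ValueError("reference sequence contains no nucleotides")
    return 100.0 * identical / ref_length
```

Under this definition a 100-nucleotide reference with five mismatched positions yields 95% identity, consistent with the "up to 5 differences per each 100 nucleotides" illustration given in the text.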
derived therefrom may not be present at a single location in
Methods to determine sequence identity are codified in publicly available computer programs which determine sequence identity between given sequences. Examples of such programs include, but are not limited to, the GCG program package (Devereux, J., et al., Nucleic Acids Research, 12(1):387 (1984)), BLASTP, BLASTN and FASTA (Altschul, S. F. et al., J. Molec. Biol., 215:403-410 (1990)). The BLASTX program is publicly available from NCBI and other sources (BLAST Manual, Altschul, S. et al., NCBI, NLM, NIH, Bethesda, Md. 20894; Altschul, S. F. et al., J. Molec. Biol., 215:403-410 (1990)). These programs optimally align sequences using default gap weights in order to produce the highest level of sequence identity between the test and reference sequences. As an illustration, by a polynucleotide having a nucleotide sequence having at least, for example, 95% “sequence identity” to a reference nucleotide sequence, it is
to detect virtually any type of chromosomal rearrangement, such as, for example, deletions, duplications, insertions, additions, markers, inversions or translocations.
In addition to FISH, computationally determined single copy genomic hybridization probes may be used in a quantitative microsphere suspension hybridization assay to determine copy number of a specific sequence relative to a reference sequence or standard curve (Newkirk et al., Human Mutation, in press (2006)). Those of skill in the art would also recognize that single copy probes used as probes for microarrays would have properties similar to microsphere hybridization, since in both platforms the probes are attached to a solid phase substrate and hybridized to either labeled genomic DNA or to cDNA.
intended that the nucleotide sequence of the given polynucleotide is identical to the reference sequence except that the given polynucleotide sequence may include up to 5 differences per each 100 nucleotides of the reference nucleotide sequence. In other words, in a polynucleotide having a nucleotide sequence having at least 95% identity relative to the reference nucleotide sequence, up to 5% of the nucleotides in the reference sequence may be deleted, inserted, or substituted with another nucleotide, or a number of nucleotides up to 5% of the total nucleotides in the reference sequence may be inserted into the reference sequence. Inversions in either sequence are detected by these computer programs based on the similarity of the reference sequence to the antisense strand of the homologous test sequence. These variants of the reference sequence may occur at the 5' or 3' terminal positions of
Single copy probes have been shown to be more accurate for copy number determination than probes containing repetitive sequences that utilize Cot-1 DNA for suppression of cross hybridization of repetitive elements (Newkirk et al., Nucleic Acids Research 2005, 33(22):e191). Sufficient accuracy is achieved to distinguish normal copy number which is generally two for autosomes from hemizygosity or from three or more alleles. This assay allows for the direct analysis of whole genomic DNA (or RNA) using flow cytometry and if necessary can follow routine cytogenetic analysis without requiring large patient sample quantities, additional blood draws, locus-specific amplifications, or time-consuming genomic purification methods. It is notable therefore that copy number determination at a single locus can be carried out within a complex background of sequences consisting of the complete genome.
the reference nucleotide sequence or anywhere between those terminal positions, interspersed either individually among nucleotides in the reference sequence or in one or more contiguous groups within the reference sequence.
It should be understood that BLAST, BLAT, and similar heuristic algorithms do not provide the sequences of all of the matches (in the genome) above the specified expected value threshold; however, they tend to indicate the degree to which a sequence may be repetitive. Sequences which match numerous genomic locations (generally on the order of hundreds) tend to be quite abundant and well conserved. Sequences which match several genomic locations tend to be either less common or less well conserved between paralogs. Sequences which match a single location in the genome are expected to be single copy, since the stringency of recognizing pairwise
This exquisite level of discrimination achieved by computationally-defined single copy probes can also be used to determine copy number of rare transcripts against the background of the complete transcriptome, or for detection of extremely dilute or low concentrations of specific nucleic acid sequences within heterogeneous solutions of nucleic acids.
In order to develop probes in accordance with the invention, the sequence of the target DNA region must be known. The target region may be an entire chromosome or only portions thereof where rearrangements have been suspected or identified. With this sequence knowledge, the objective is to determine the boundaries of single copy or unique sequences within the target region. This is preferably accomplished by inference from the locations of repetitive sequences within the target region.
matches with the WU-BLAST algorithm has been deliberately relaxed to detect weakly similar genomic copies of any input sequence.
The single copy probes of the invention preferably have a length of at least about 25 nucleotides, preferably at least about 40 nucleotides, more preferably at least about 50 nucleotides, still more preferably at least about 75 nucleotides, and even more preferably at least about 100 nucleotides. Probes of this length are sufficient for Southern blot analyses, bead suspension hybridization, and microarray hybridization. However, if other analyses such as fluorescence in situ hybridization (FISH) are employed, the probes should be somewhat longer, i.e., at least about 500 nucleotides, preferably at least about 1000 nucleotides, and even more preferably at least about 2000 nucleotides in length. Factors used in determining the length of the probes include, for example, the
An important distinction between the method of the instant invention and the other methods is that the target region sequences of the present invention are not compared with known repeat sequences from the corresponding genome, using available computer software. With the instant invention, a catalog of known repeat sequences is, therefore, not a prerequisite to computational recognition of single copy intervals with this software. Therefore, single copy sequences can be derived with the instant invention from any complete genome sequence, so long as a determination of that sequence is completed.
Initially, a genomic or mRNA sequence is identified from which one or more single copy intervals and probes are desired. This test sequence, sometimes also referred to as a target sequence, typically contains at least one repetitive element; however, it is not a requirement that the test sequence contain a repetitive sequence.
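As a small, purely illustrative sketch, the minimum probe lengths recited above (at least about 25 nucleotides for Southern blot, bead suspension and microarray hybridization; at least about 500 nucleotides for FISH) can be captured in a lookup table. The dictionary keys and the function name `is_long_enough` are hypothetical labels chosen for this example, and only the least restrictive "at least about" values are encoded.

```python
# Minimum probe lengths in nucleotides, taken from the least restrictive
# values in the text. The assay labels are assumed names for illustration.
MIN_PROBE_LENGTH = {
    "southern": 25,
    "bead_suspension": 25,
    "microarray": 25,
    "fish": 500,
}

def is_long_enough(probe_length, assay):
    """Return True if a probe meets the minimum length for the named assay."""
    return probe_length >= MIN_PROBE_LENGTH[assay.lower()]
```

A stricter table could instead encode the preferred values (for example 100 nucleotides for solid-phase assays and 2000 nucleotides for FISH).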
type of analysis or hybridization method to be used, sequence specificity (i.e. complexity of the probe), nucleotide content (which dictates the optimal annealing temperature of the probe), the amount of secondary structure that the probe may adopt (which can be predicted with available software programs), and replication timing (synchronous vs. asynchronous) of the genomic target sequence. The probes can be used
In the latter instance, the method does not eliminate any sequence from consideration as a potential probe; it simply verifies that the entire test sequence is non-repetitive. This test sequence is subsequently compared with the reference sequence of the same genome from which the test sequence is derived. Using homology search algorithms common in the art, such as, for example, BLAST or BLAT (see details below), this approach will identify matches with at least 80% identity to genomic sequences. Often weaker orthologies with as little as 70% or 60% identity can also be detected, although this typically requires few or no gaps to be present in the sequence alignment. This level of sensitivity is more than adequate for detection of single copy sequences, since highly divergent repetitive elements form heterologous duplexes that are easily eliminated by hybridizing and washing the probe under high stringency conditions (e.g., 0.1×SSC, 42° C.). These comparisons identify at least
sequence with at least one repeat per kilobase pair in the test region, windows of 0.5 kb sequences are used to determine locations of repeats.
First, end-to-end window comparisons of about 500 bp to about 1000 base pairs (bp) are performed across the entire test sequence. This is akin to a pre-screening function. The length utilized in this embodiment was selected because it is consistent with studies indicating the average distances between interspersed repetitive elements in the human genome.
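The end-to-end window prescreen described above may be sketched, under simplifying assumptions, as follows. In this hypothetical Python example the homology search (BLAST or BLAT in the text) is replaced by a toy `count_hits` function that counts only exact occurrences on either strand of an in-memory reference string; a real search would also report approximate matches. All function names are illustrative.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_hits(window, reference):
    """Toy stand-in for a homology search: exact matches on both strands."""
    def occurrences(pattern):
        count, pos = 0, reference.find(pattern)
        while pos != -1:
            count += 1
            pos = reference.find(pattern, pos + 1)  # allow overlapping matches
        return count
    total = occurrences(window)
    rc = reverse_complement(window)
    if rc != window:  # avoid double counting palindromic windows
        total += occurrences(rc)
    return total

def prescreen(test_seq, reference, window=500):
    """Split test_seq into end-to-end windows and count genomic hits for each.

    Returns (start, end, hits) tuples; hits > 1 flags a candidate repeat.
    """
    results = []
    for start in range(0, len(test_seq), window):
        chunk = test_seq[start:start + window]
        results.append((start, start + len(chunk), count_hits(chunk, reference)))
    return results
```

The default of 500 follows the window range given in the text; windows with a single hit are candidate single copy intervals, and the others are passed on for boundary refinement.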
one region of the genome that matches (or nearly matches, due to genomic polymorphism) that test sequence. The exact and similar matches to the test sequence are termed “hits.” When multiple hits are obtained, the test sequence contains one or more members of a repetitive sequence family or one or more low-copy segmental duplicons. In principle, such intervals are not preferred for probe design since a probe designed using such intervals could potentially hybridize to more than a single genomic locus.
There are mitigating circumstances in which multiple hits may still be suitable for probe design, such as, for example, if the two hits occur at nearly contiguous locations on the chromosome. This can be deduced from the chromosomal coordinates of the sequences in the genome that are similar to the potential probe interval.
The optimal window lengths may be different for other genomes since they would be based on overall repetitive complement in those genomes (determined from kinetic reassociation studies) and the respective genome sizes. This information is available from published sources (Lewin, Eukaryotic Gene Expression, Wiley, 1983). Other factors affecting the selection of a window length include, for example, the degree of resolution desired to determine the boundaries of a single copy sequence, the efficiency (i.e., the amount of time) desired to determine the boundaries of a single copy sequence, the density of repetitive sequences in the genome sequence of interest (i.e. containing potential probe sequences) and the accuracy of sequences in this region of the genome. Accordingly, the test sequence may be divided into
For hybridization by FISH to metaphase chromosomes, these coordinates may be up to approximately 3 million nucleotides apart (it can be more or less than this quantity depending on the level of condensation of the particular genomic region), and the probe signals obtained by FISH will be coincident even at the highest power magnification. For either array-based or microsphere suspension hybridization, however, much higher levels of granularity, i.e., genomic resolution, may be required to precisely localize a genomic target in, for example, a patient specimen.
Typically, 100,000-400,000 bp intervals are tested to design single copy probes in a reasonable length of time (i.e., within 1-2 CPU hours on a modern cluster computer), however it can be appreciated by those of skill in the art that this approach could be applied genome-wide, given sufficient computational power. An advantage of genome-wide precomputation would be that subsequent probe development would only involve looking up relevant single copy intervals to identify the most appropriate primers for amplification of single copy probes using the polymerase chain reaction (PCR) (see U.S. Pat. No. 6,828,097 for details of the PCR reaction to amplify products from deduced single copy genomic intervals).
While it is possible to conduct an exhaustive genome search of every subsequence window in the test sequence, such that the windows overlap and differ by a single nucleotide, this procedure is slow and inefficient. Certain embodiments employ a more efficient approach. The genomic frequency of sequences with test genomic sequence region can be determined to establish optimal parameters of window sizes and displacements based on estimates of the local distribution of repetitive sequences in the test sequence interval.
test segments (i.e., window lengths) of about 20 bp to about 5000 bp, preferably about 100 bp to about 2500 bp, more preferably from about 250 bp to about 1500 bp, still more preferably about 500 bp to about 1000 bp, and most preferably about 1000 bp.
Alternative faster ab initio approaches for detection of repeats have been described based on exact word-matching algorithms based on nucleotide sequences (for example, Healy et al., Genome Res. 13:2306-15, 2003). Here, words are defined as overlapping or non-overlapping sequences of a short uniform length. However such approaches are not comprehensive. It is also stated in this paper that this is not sufficient to ensure that repetitive sequences are completely eliminated from the microarray. Follow up approximate homology searching is performed so that the algorithm is carried out on a single human genome reference sequence. Of course, the human genome is highly polymorphic and the word match algorithm does not consider words containing the polymorphic variants. Therefore, a genomic microarray based on this algorithm alone may fail to detect repetitive sequences that contain polymorphic words. Of course, some of the sequences in the patient DNA hybridizing to those oligonucleotides will be repetitive. This will result in incorrect (vastly increased) copy number measurements. Since this is the signature of what they are trying to detect, i.e., abnormalities, it would result in false-positive identification of copy number changes in these oligonucleotides. However, a low stringency approximate homology search by conventional repeat masking will pick up these sequences. This is why the exact word match procedure must be followed up with conventional repeat-masking (as was done in Healy et al., Genome Res. 13:2306-15, 2003; see U.S. Pat. No.
Initially, the test genomic sequence region is prescreened by comparison with the reference genome sequence in order to determine local density of repetitive sequences within the region. This density can vary considerably within local regions across the euchromatic genome and it is not adequate to assume an average density for any particular region. This density dictates the granularity of the overlapping sequence windows needed to comprehensively find all repetitive sequences in a particular region. A higher density of repetitive sequences necessitates that windows of less than this length be used in the subsequent step of defining the precise locations of the repeats. In a preferred embodiment, for a
6,828,097) to ensure that single copy sequences are synthesized on the microarray chip.
There are three possible outcomes of the prescreen for repetitive sequences: (1) the subsequence can be entirely composed of repetitive sequence, (2) one or more portions of the subsequence may be repetitive, or (3) the subsequence may contain no detectable repetitive sequences. Efficient methods for comparison of test sequences with complete or near complete reference genomes are well known in the art (BLAST and BLAT). If the genome comparison reveals the presence of sequences with high percentages of similar consecutive nucleotides to the test sequence at multiple genomic loci, this indicates the presence of one or more repetitive sequences within the test sequence.
A detailed description of how the method handles each of
An optional step that would reduce future computational expense is to bootstrap a catalog of repetitive elements derived from the ab initio procedure.
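The three prescreen outcomes enumerated above can be illustrated with a short classification sketch. It is assumed here, for the example only, that a prior genome comparison has already yielded the coordinates within the subsequence that matched multiple genomic loci; the function name `classify_subsequence` and the returned labels are hypothetical.

```python
def classify_subsequence(length, repeat_intervals):
    """Classify a subsequence by how much of it is covered by multi-locus matches.

    `repeat_intervals` is a list of (start, end) pairs, zero-based and
    end-exclusive, assumed to come from an earlier genome comparison.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(repeat_intervals):
        # Clip to the subsequence and skip the part already counted,
        # so that overlapping intervals are not counted twice.
        start, end = max(start, last_end), min(end, length)
        if end > start:
            covered += end - start
            last_end = end
    if covered == 0:
        return "single copy"            # outcome (3)
    if covered >= length:
        return "entirely repetitive"    # outcome (1)
    return "partially repetitive"       # outcome (2)
```

Only partially repetitive subsequences need the finer boundary mapping; entirely repetitive ones are discarded as probes and single copy ones are retained whole.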
these outcomes follows: (1) if the paralogous (related or similar) copies span the entire length of the subsequence, then this subsequence is eliminated as a potential hybridization probe. For this class of subsequences, the objective then is to determine how far upstream and downstream of the subsequence the paralogous repeats extend. The adjacent subsequences within the test sequences are then analyzed to determine whether these sequences are similar to multiple genomic loci within the genome over their entire length. The process of analyzing contiguous adjacent subsequences is iterated until, either (a) the adjacent subsequence is found at only a single genomic location, or (b) only a portion of the subsequence shows similarity to multiple genomic locations, that portion determining the boundary of the single copy and multilocus subsequences; (2) pursuant to (b), such partially
Rather than discarding the sequences found to be present more than once per genome, the interface between single copy and repetitive sequence elements could be defined using the aforementioned procedure, which would determine the coordinates of the repeat, and the repeat sequence then catalogued. This could be accomplished by storing the sequences of the repetitive sequences detected in a separate database for subsequent searches. Similar repeats could then be sorted into families and subfamilies by multiple alignments. Subsequent searches will first compare a new sequence with the repeat sequence database, and then to the genome reference sequence as described above. Although this step is not required, it will significantly improve performance of the algorithm to detect single copy intervals, especially as the repeat catalog grows in size.
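The optional bootstrapped repeat catalog described above can be sketched as a lookup that is consulted before the more expensive genome comparison. This Python example is a minimal illustration under stated assumptions: the class `RepeatCatalog`, the function `screen`, and the caller-supplied `genome_hit_count` are hypothetical, matching is by exact substring rather than alignment, and the whole query is stored instead of a fine-mapped repeat.

```python
class RepeatCatalog:
    """In-memory stand-in for the separate repeat sequence database."""

    def __init__(self):
        self._repeats = set()

    def add(self, sequence):
        self._repeats.add(sequence.upper())

    def matches(self, query):
        # Assumption: exact containment in either direction counts as a match;
        # a real catalog search would use an alignment program.
        query = query.upper()
        return any(rep in query or query in rep for rep in self._repeats)

def screen(query, catalog, genome_hit_count):
    """Check the catalog first, then fall back to the genome comparison."""
    if catalog.matches(query):
        return "repetitive (catalog)"      # genome search skipped
    if genome_hit_count(query) > 1:
        catalog.add(query)                 # simplification: store the whole query
        return "repetitive (genome)"
    return "single copy"
```

Each repeat found by the genome comparison enlarges the catalog, so later queries containing that repeat avoid a genome-wide search, which is the saving the text anticipates as the catalog grows.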
repetitive subsequences are again analyzed to determine which portion is contiguous with the relevant adjacent single copy interval. Segments of the subsequence can either be sampled and compared with the genome reference to determine the approximate locations of repetitive domains which are then fine mapped by additional short sequence comparisons, or a relative series of consecutive, short or overlapping sequence windows are progressively tested against the genome sequence until coordinates that match a single location in the haploid genome sequence are found; (3) subsequences that match only a single location in the genome are considered single copy sequences, however exceptions, for example, including non-polymorphic tandemly repeated sequences of no more than about 10 copies, preferably no more than about 8 copies, more preferably no more than about
Repetitive sequence elements defined by the above method can then be deposited in an electronic database where they can be subsequently retrieved for comparisons with other potential sequences containing single copy and repetitive intervals. Since each matched segment contains an individual repetitive element, the element in most instances will not be identical to the consensus sequence of the corresponding repetitive sequence family representative found in, for example, Knoll et al.'s '097 patent, because consensus sequences are derivative sequences that are compiled by selecting the most common nucleotide at a particular position among a set of elements. Various embodiments can be used to screen sequences contained within current repeat libraries in order to ensure that a repetitive sequence is not misassigned as a single copy sequence.
3 copies, and still more preferably no more than about 5 copies found at a single location in the genome may be treated as single copy intervals especially in FISH studies, because of their consistent, unequivocal patterns of hybridization to the genome.
Fine mapping of the approximate repetitive sequence/single copy interval within a subsequence is performed on overlapping sequence intervals by iteratively and unidirectionally displacing the sequence window by a fixed, constant length of, for example, 1 to 20 nucleotides. The new sequence is compared with the reference genome sequence and the number of significant matches in the genome (based on length and percent of identity to the new sequence) is determined. After each comparison, the window is again displaced by this length, compared with the reference genome and this process is iterated until the end of the subsequence is reached.
Finally, this procedure may identify repetitive sequences that are not otherwise recognized with the technology described in other approaches reliant upon an established repeat library because the newly identified sequences are not necessarily represented in existing databases.
Defining the boundary of the single copy interval can occur as follows. As the window moves, the repeat sequence boundary should shift by the length of the sequence displaced through each step. When sufficient steps in one direction have been performed so that there is no longer a match to a repeat sequence, this defines the other boundary of the repeat. Definition of the repeat sequence boundaries on both ends makes the repeat sequence eligible for optional deposition into a repeat sequence database.
The resolution of the single copy window is defined by the
If multiple hits are detected in the genome, then the range of coordinates within the subsequence that contains the repetitive sequence is then refined. This is done by performing a low stringency comparison of the genome and subsequence, preferably with the Smith-Waterman algorithm, however other algorithms may also be used such as BLAST or BLAT. The location of the matching terminal coordinate within the query is determined and this coordinate is recorded. The window is again shifted by 1-20 nucleotides. The length of the pairwise match may increase, remain the same, or decrease. If this length increases, the matching coordinate is again recorded and the window is shifted in the same direction. If it stays the same, the window is also again shifted in the same direction. If the length decreases, then the complete repeat has been found (both boundaries).
length of the smallest sequence displacement (i.e., the nucleotide word length) between iteration cycles used in the definition of the repeat/single copy boundary. The single copy interval sequence can be shortened by at least one word at the repeat boundary to ensure that the entirety of the region selected for probe development is single copy.
Single copy sequences defined by this approach can be used to detect chromosome rearrangements including deletions, insertions, additions, translocations, inversions and any combination of these chromosomal modifications by hybridization. Often, such rearrangements are diagnostic for the detection of genetic diseases and cancer.
Accordingly, among the various aspects of the present invention is a method to identify a single copy sequence in a target reference genomic sequence. The method comprises
The final coordinates of the centromeric and telomeric boundaries of the repetitive sequence are then recorded (and the prior intermediate coordinates are discarded).
determining a number of matches between at least one subsequence of a first screened sequence and a target reference sequence, wherein the target reference sequence comprises the first screened sequence, the first screened sequence is divided into at least two subsequences, and a subsequence of the first screened sequence with a single match to the target reference sequence or a group of contiguous subsequences of the first screened sequence each with a single match to the target reference sequence is identified as a single copy interval of the first screened sequence; determining a number of matches between at least one subsequence of a second screened sequence and the target reference sequence, wherein the second screened sequence comprises a single copy interval of the first screened sequence; the second screened sequence overlaps the single copy interval of the first screened sequence; the subsequences of the first screened
sequence with at least two matches to the target reference sequence as a subsequence containing a repetitive element wherein the single copy sequence is located adjacent to the repetitive element. In another embodiment, the method further comprises the step of identifying a second, distinct subsequence of the screened sequence with at least two matches to the target reference sequence as a subsequence containing a different repetitive element, wherein the single copy interval is located between the first and the second subsequences containing the distinct repetitive elements.
Another aspect of the present invention is a single copy hybridization probe as described herein.
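The fine mapping step described above, in which an overlapping window is displaced unidirectionally by a fixed length and each new window is compared with the genome, may be sketched as follows. This Python example is hypothetical: `count_hits(start, end)` is an assumed caller-supplied function standing in for the low stringency genome comparison, and the window and step values are illustrative picks from within the ranges recited in the text. The optional trimming reflects the statement that the single copy interval can be shortened by at least one word at the repeat boundary.

```python
def fine_map_single_copy(length, count_hits, window=20, step=5, trim=True):
    """Return (start, end) single copy intervals found by a sliding window.

    `count_hits(start, end)` is assumed to return the number of genomic
    matches for the subsequence slice [start, end); more than one match
    marks the window as repetitive.
    """
    intervals = []
    run_start = run_end = None
    seen_repeat = False
    for start in range(0, length - window + 1, step):
        end = start + window
        if count_hits(start, end) > 1:
            if run_start is not None:
                # A repeat closes the current run: shorten it by one step.
                if trim:
                    run_end -= step
                if run_end > run_start:
                    intervals.append((run_start, run_end))
                run_start = None
            seen_repeat = True
        else:
            if run_start is None:
                # A run that follows a repeat starts one step later.
                run_start = start + step if (trim and seen_repeat) else start
            run_end = end
    if run_start is not None and run_end > run_start:
        intervals.append((run_start, run_end))
    return intervals
```

With a smaller `step` the boundaries are resolved more finely at the cost of more genome comparisons, which is the resolution versus efficiency trade-off the text describes.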
sequence are either (i) consecutive non-overlapping subintervals of the second screened sequence or (ii) overlapping non-identical subintervals of the second screened sequence, each containing one nucleotide homologous to the reference sequence that is not present in the adjacent subinterval; and a subsequence of the second screened sequence with a single match to the target reference sequence or a group of contiguous subsequences of the second screened sequence each with a single match to the target reference sequence is identified as a single copy interval of the second screened sequence; and identifying a single copy interval as a single copy sequence of the target reference sequence suitable for use as a single copy hybridization probe. In one embodiment, the subsequences may be at least about 100 consecutive non-overlapping nucleotides, at least about 200 consecutive non-overlapping nucleotides, at least about 400 consecutive non-overlapping nucleotides, at least about 600 consecutive non-overlapping nucleotides, at least about 800 consecutive non-overlapping nucleotides, or even at least about 1000 consecutive non-overlapping nucleotides.
In one embodiment of the invention, the method further comprises the step of determining a number of matches between at least one subsequence of a third screened sequence and the target reference sequence, wherein the third screened sequence comprises a single copy interval of the second screened sequence; the third screened sequence overlaps the single copy interval of the second screened sequence; the subsequences of the third screened sequence are either (i) consecutive non-overlapping subintervals or (ii) overlapping non-identical subintervals, each containing one nucleotide homologous to the reference sequence that is not present in
Such probes may comprise at least one single copy interval or single copy sequence identified according to the methods disclosed herein. In one embodiment, the probes comprise at least two contiguous subsequences of a screened sequence, each having a single match to the target reference sequence.
Referring to FIG. 1, a block diagram illustrates a user 102 interacting with a computing environment in one embodiment of the invention. In the example of FIG. 1, the user 102 interacts with a computing device 104. The computing device 104 has access to one or more computer-readable media such as computer-readable medium 106. The computer-readable medium 106 stores one or more computer-executable components. In this example, the components include a first genome comparison component 108, a second genome comparison component 110, and a subsequence component 112.
The first genome comparison component 108 determines a number of matches between at least one subsequence of a first screened sequence and a target reference sequence. The target reference sequence includes the first screened sequence which is divided into at least two subsequences. A subsequence of the first screened sequence with at least two matches (and preferably more than five matches) to the target reference sequence can be identified as containing a repetitive element. A subsequence of the first screened sequence with a single match to the target reference sequence or a group of contiguous subsequences of the first screened sequence, each with a single match to the target reference sequence is identified as a single copy interval of the first screened sequence.
The second genome comparison component 110 determines a number of matches between at least one subsequence of a second screened sequence and the target reference sequence.
The second screened sequence includes a single screened sequence with a single match to the target reference 45 copy interval of the first screened sequence. The second sequence or a group of contiguous Subsequences of the third screened sequence overlaps the single copy interval of the screened sequence each with a single match to the target first screened sequence. The Subsequences are either (i) con reference sequence is identified as a single copy interval of secutive non-overlapping Subintervals of the second screened the third screened sequence. In another embodiment, the sequence or (ii) overlapping non-identical Subintervals of the method further comprises the step of determining a number of 50 second screened sequence, each containing one nucleotide matches between at least one Subsequence of a fourth homologous to the reference sequence that is not present in screened sequence and the target reference sequence, wherein the adjacent Subinterval. A Subsequence of the second the fourth screened sequence comprises a single copy interval screened sequence with at least two matches (and preferably of the third Screened sequence; the fourth screened sequence more than five matches) to the target reference sequence can overlaps the single copy interval of the third screened 55 be identified as containing a repetitive element. A Subse sequence; the Subsequences the of fourth screened sequence quence of the second screened sequence with a single match are either (i) consecutive non-overlapping Subintervals or (ii) to the target reference sequence or a group of contiguous overlapping non-identical Subintervals, each containing one Subsequences of the second screened sequence each with a nucleotide homologous to the reference sequence that is not single match to the target reference sequence is identified as a present in the adjacent Subinterval; and a Subsequence of the 60 single copy interval of the second screened sequence. 
fourth screened sequence with a single match to the target The Subsequence component 112 identifies a single copy reference sequence or a group of contiguous Subsequences of interval as a single copy sequence of the target reference the fourth screened sequence each with a single match to the sequence Suitable for use as a single copy hybridization target reference sequence is identified as a single copy inter probe. val of the fourth Screened sequence. 65 Hardware, Software, firmware, computer-executable com In still another embodiment, the method further comprises ponents, and/or computer-executable instructions such as the the step of identifying a Subsequence of the screened exemplary components/instructions illustrated in the figures US 8,407,013 B2 17 18 constitute means for determining a number of matches Embodiments of the invention may be described in the between at least one subsequence of the first screened general context of computer-executable instructions. Such as sequence and the target reference sequence, means for deter program modules, executed by one or more computers or mining a number of matches between at least one Subse other devices. Generally, program modules include, but are quence of the second screened sequence and the target refer not limited to, routines, programs, objects, components, and ence sequence, and means for identifying a single copy data structures that perform particular tasks or implement interval as a single copy sequence of the target reference particular abstract data types. The computer-executable sequence Suitable for use as a single copy hybridization instructions may be embodied in any computer programming probe. language or scripting language including, but not limited to, An exemplary operating environment for implementing 10 aspects of the invention (e.g., the computer programs C, C++. C#, and Pert. The computer-executable instructions described herein) such as shown in FIG. 
1 includes a general may be organized into one or more computer-executable purpose computing device Such as computing device 104 components or modules. Aspects of the invention may be executing computer-executable instructions. The computing implemented with any number and organization of such com device 104 typically has at least some form of computer 15 ponents or modules. For example, aspects of the invention are readable media. Computer readable media, which include not limited to the specific computer-executable instructions both volatile and nonvolatile media, removable and non-re or the specific components or modules illustrated in the fig movable media, may be any available medium that may be ures and described herein. Other embodiments of the inven accessed by the general purpose computing device 104. By tion may include different computer-executable instructions way of example and not limitation, computer readable media or components having more or less functionality than illus comprise computer storage media and communication trated and described herein. media. Computer storage media include Volatile and nonvola Aspects of the invention may also be practiced in distrib tile, removable and non-removable media implemented in uted computing environments where tasks are performed by any method or technology for storage of information Such as remote processing devices that are linked through a commu computer readable instructions, data structures, program 25 nications network. In a distributed computing environment, modules or other data. Communication media typically program modules may be located in both local and remote embody computer readable instructions, data structures, pro computer storage media including memory storage devices. 
gram modules, or other data in a modulated data signal Such In operation, the computing device 104 executes computer as a carrier wave or other transport mechanism and include executable instructions such as those illustrated in the figures any information delivery media. Those skilled in the art are 30 to implement embodiments of the invention. familiar with the modulated data signal, which has one or Referring next to FIG. 2, a flow chart depicts exemplary more of its characteristics set or changed in such a manner as operations for deriving the locations of single copy intervals to encode information in the signal. Wired media, Such as a used in probe production. FIG. 2 illustrates one exemplary wired network or direct-wired connection, and wireless implementation of aspects of the invention using computer media, such as acoustic, RF, infrared, and other wireless 35 executable instructions. Other implementations are within the media, are examples of communication media. Combinations scope of embodiments of the invention. For example, the of any of the above are also included within the scope of operations illustrated in FIG.2 may be organized into other computer readable media. The computing device 104 components or application programs. includes or has access to computer storage media in the form In FIG. 2, an ABINITIO.PL script creates a set of indi of removable and/or non-removable, volatile and/or nonvola 40 vidual Subsequences covering a region for genome compari tile memory. The user 102 may enter commands and infor sons. The Script takes as input the following at 202: a genomic mation into the computing device 104 through input devices sequence file, a length of Subsequence, a length of window or user interface selection devices such as a keyboard and a offset between Subsequences, a minimum length of match to pointing device (e.g., a mouse, trackball, pen, or touch pad). 
genomic repeats or paralogs (e.g., for filtering results of Other input devices (not shown) may be connected to the 45 genomic comparisons), and a minimum percentage of match computing device 104. The computing device 104 may oper to genomic repeats or paralogs. If the length of window offset ate in a networked environment using logical connections to is Smaller than the length of Subsequence, the Script produces one or more remote computers. overlapping windows. If the length of window offset is larger Although described in connection with an exemplary com than the length of Subsequence, the script produces Subse puting system environment, aspects of the invention are 50 quences separated by gaps having a length equal to the length operational with numerous other general purpose or special of subsequence minus the length of window offset. If the purpose computing system environments or configurations. length of window offset is equal to the length of Subsequence, The computing system environment is not intended to Suggest the script produces consecutive windows. any limitation as to the scope of use or functionality of aspects The ABINITIO.PL script outputs at 204 a set of individual of the invention. Moreover, the computing system environ 55 Subsequences (e.g., files named by Subsequence boundaries) ment should not be interpreted as having any dependency or to a WUBL script (e.g., a BLAST script) to perform genome requirement relating to any one or combination of compo comparisons. The WUBL script performs the genome com nents illustrated in the exemplary operating environment. parisons at 206 on a cluster computer (e.g., a separate parallel Examples of well known computing systems, environments, job is run simultaneously on a different node). 
Files indicat and/or configurations that may be suitable for use in embodi 60 ing the results of the WUBL genome comparisons are filtered ments of the invention include, but are not limited to, personal by a BLASTPARSE.PL script and condensed to a hit list computers, server computers, hand-held or laptop devices, based on user-provided or empirically-derived criteria. The multiprocessor Systems, microprocessor-based systems, set BLASTPARSE.PL script produces files of filtered output. top boxes, programmable consumer electronics, mobile tele The user 102 may confirm that the comparisons with the phones, network PCs, minicomputers, mainframe computers, 65 genome sequence have been completed using an application distributed computing environments that include any of the program, Such as qstat, which is a Sun-Grid Engine utility to above systems or devices, and the like. monitor processor status. In another embodiment, this con US 8,407,013 B2 19 20 firmation operation is automated and the user 102 is notified by those of skill in the art that the techniques disclosed in the when the comparisons have been completed. examples that follow represent approaches the inventors have The files of filtered output from the BLASTPARSE.PL found function well in the practice of the invention, and thus script are input into a COUNTHITS.PL script for summariz can be considered to constitute examples of modes for its ing. The COUNTHITS.PL script distills at 208 the hit list practice. However, those of skill in the art should, in light of from the BLASTPARSE.PL script for each interval to a copy the present disclosure, appreciate that many changes can be number and sorts by sequence coordinate. The COUN made in the specific embodiments that are disclosed and still THITS.PL script identifies intervals with multiple hits as obtain a like or similar result without departing from the spirit these contain repeat elements and records single copy inter and scope of the invention. 
vals as, for example, Set A. 10 One output of COUNTHITS.PL is a count which contains Example 1 the quantity of hits in the genome found with each Subse quence interval. If the quantity of hits exceeds one, the The following example illustrates how the probes designed sequence is not single copy based on the parameter definitions using the instant invention produce similar results to the that are acceptable by one of skill in the art. These definitions 15 repeat-free probes described in U.S. Pat. No. 6,828,097. Here aim to prevent cross hybridization between a single copy we rederive the single copy intervals shown in Example 1 of probe and other genomic locations that are partially paralo that patent with the present invention. First we determined the gous to the entire potential probe sequence or a portion locations of the repetitive sequences in the human HIRA gene thereof. and flanking regions (SEQ ID NO: 1) and subsequently The single copy intervals in Set A are grouped at 210 into inferred the locations of the single copy intervals therefrom. contigs L1 ... } which are members of the Set A. For each contig, a SUBSEQ program creates a series of Subsequences TABLE 1 with small offset up to the length of subsequence from the Results obtained using the method described beginning and end of the contig. in U.S. Pat. No. 6,828,097 Independent threads are spawned with the series of subse 25 quences having an upstream boundary (U) and a downstream POSITION IN POSITION IN REPEAT boundary (D). The WUBL script, BLASPARSE program, REFERENCE SEQUENCE REPEAT CONSENSUS SEQUENCE and COUNTHITS.PL script are executed at 212 until the COUNTHITS.PL script produces a hit count greater than one Begin Coord End Coord FAMILY Begin Coord End Coord (e.g., defining a single copy boundary). For each contig, the 30 633 653 GC rich 1 21 695 859 (CCG)n 3 172 coordinates of single copy interval boundaries (U. 
D) are 987 1008 GC rich 1 22 recorded and combined with adjacent single copy contigs to 647 1061 MLT2A1 436 1 define a complete interval (A-U, A+D) at 214. 291.3 3O14 MERS8B 239 340 Appendix A includes an example of the ABINITIO.PL 3053 3397 L1M4 2884 3209 35 3398 3698 Alub 303 2 script. Appendix B includes an example of the WUBL script. 3699 393S L1M4 3209 3451 Appendix C includes an example of the BLASTPARSE.PL 4002 446S L1M4c 1469 1003 script. Appendix D includes an example of the COUN 4466 4766 AY 300 1 THITS.PL script. Appendix E includes an example of the 4767 4861 L1M4c 1004 910 486S 5081 AluJo 5 220 SUBSEQ.PL script. SO82 5137 AluSq/x 86 141 40 In another embodiment, the operations for deducing single S138 5211 AuS 76 2 copy intervals use a single program set to analyze a larger S214 5713 L1MEc 2392 1876 sequence and produce a single table that gives the genomic 5740 6031 AuSX 295 6 6O77 6206 L 5O15 4879 copy number of each consecutive or overlapping Subse 6291 6557 L. 4686 4399 quence. Via this table, the system automatically detects the 6560 6600 L1M4c 1457 1497 transitions between repetitive and single copy intervals. The 45 6602 6663 MLT1E1 23 293 boundaries may be refined in increasingly higher resolution 6677 6743 MLT1E1 417 481 6774 6897 L1PB2 9 210 using a programmable iterative procedure. 6878 7S34 L1PB2 1113 1767 The order of execution or performance of the operations 7577 76SS All 312 234 illustrated and described herein is not essential, unless other 7656 8290 L1PB2 177 2376 wise specified. That is, the operations may be performed in 50 8291 8583 ASX 293 1 8584 9844 L1PB2 2376 3758 any order, unless otherwise specified, and the operations may 9845 O143 AuSX 298 include more or less elements than those disclosed herein. 
For O144 1262 L1PB2 3983 S142 example, it is contemplated that executing or performing a 1263 1282 (TAAAA)n 3 22 particular operation or element before, contemporaneously 1283 152S L1PB2 S142 5378 55 1526 1659 Alub 134 with, or after another operation or element is within the scope 1661 1964 Alub 306 of an embodiment of the invention. 1965 2896 L1PB2 5365 6313 Having described the invention in detail, it will be apparent 2897 3179 AuSX 282 1 that modifications and variations are possible without depart 318O 3675 L1PB2 6313 6805 3762 4060 Alub 288 ing from the scope of the invention defined in the appended 4136 4364 Alub 229 60 claims. Furthermore, it should be appreciated that all 4387 4502 FLAMC 117 2 examples in the present disclosure are provided as non-lim 4528 4584 L. 293 2987 iting examples. 4586 5758 L 3O4 4281 S989 61.91 MER1B 337 127 6.191 6223 MER1B 33 1 EXAMPLES 6449 6582 L1M 5265 5393 65 6728 6858 FLAMC 2 143 The following non-limiting examples are provided to fur 81.49 8455 AllSX 1 307 ther illustrate the present invention. It should be appreciated

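The step of inferring single copy intervals from a table of repeat coordinates such as Table 1 amounts to taking the complement of the repeat intervals within the region of interest. The following is a minimal Python sketch of that idea under stated assumptions; it is not the procedure of U.S. Pat. No. 6,828,097 or any of the scripts in the appendices. The function name `single_copy_gaps` and the `min_len` parameter are hypothetical, and the three example rows are the MLT2A1, MER58B and L1M4 entries as printed in Table 1.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # inclusive, 1-based (begin, end) coordinates


def single_copy_gaps(repeats: List[Interval], region: Interval,
                     min_len: int = 1) -> List[Interval]:
    """Return the stretches of `region` that no repeat interval covers.

    `min_len` (an assumed, illustrative filter) drops gaps too short
    to be useful as probes.
    """
    region_begin, region_end = region
    gaps: List[Interval] = []
    cursor = region_begin  # first position not yet assigned to a repeat
    for begin, end in sorted(repeats):
        if end < cursor:
            continue  # repeat lies entirely before the cursor
        if begin > region_end:
            break  # remaining repeats lie beyond the region
        if begin - cursor >= min_len and begin > cursor:
            gaps.append((cursor, begin - 1))
        cursor = max(cursor, end + 1)
    if cursor <= region_end and region_end - cursor + 1 >= min_len:
        gaps.append((cursor, region_end))
    return gaps


# Three rows of Table 1 (reference-sequence coordinates only).
table1_rows = [(647, 1061), (2913, 3014), (3053, 3397)]
probe_candidates = single_copy_gaps(table1_rows, (647, 3397), min_len=100)
print(probe_candidates)
```

With a 100-nucleotide minimum, the only gap reported between these three repeats is 1062 to 2912, which corresponds to the repeat-free interval that the text attributes to the '097 method; the 38-nucleotide gap between the MER58B and L1M4 elements is filtered out.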
The present invention is now shown to provide similar results to the above comparison of a sequence region with a predetermined library of repetitive sequences. The following results were obtained using one embodiment of the present invention.

Initially, the 103 kb HIRA sequence was divided into consecutive non-overlapping intervals of 1000 bp in length to determine the density of repetitive sequences across this genomic region. The sequences of each of these intervals were compared with the May, 2004 human genome reference sequence using the WU-BLAST blastn program. The parameters for these comparisons were modified from default values to pick up the weakest similarities in the genome in order to ensure that even poorly conserved repetitive sequences would be detected. The parameters of the search were: -d human, span2, cpus=2 (number of threads), lcmask, and hspmax=100. Each comparison required approximately 5.8 seconds.

The 103 comparisons of 1 kb each required approximately 6 minutes on an 8 node dual CPU cluster computer, which is comparable or faster than the method described by Knoll et al. in the '097 patent.

After filtering the output with a Blast parsing routine (called from the Bioperl implementation of the Perl language; at www.bioperl.org), and counting the number of significant hits detected for each of the 1000 consecutive sub-intervals of SEQ ID NO: 1, the results are summarized in Table 2. Regarding filtering, we have tested several minimum thresholds for repeat sequence detection in human genomic sequences and each gives similar results. The preferred minimum thresholds for detection are a pairwise match between the test sequence and its genomic counterpart of at least 100 nucleotides in length and 70 percent identity. Equivalent results were obtained, for example, using criteria of at least a 50 nucleotide length match with at least 65 percent identity, since these filters eliminated all but the actual genomic location of the probe. One of skill in the art could appreciate that these criteria are of sufficiently low stringency so as to identify even the weakest members of a potential cross hybridizing repetitive sequence.

TABLE 2
Results of ab initio repeat detection for HIRA gene region from U.S. Pat. No. 6,828,097

Begin coordinate
(SEQ ID NO: 1)     End coordinate     Number hits genome

      1                 1000                7535
   1001                 2000                  20
   2001                 3000                   1
   3001                 4000               51045
   5001                 6000               27018
   6001                 7000                 901
   7001                 8000                6853
   8001                 9000                5504
   9001                10000                8337
  10001                11000               17347
  11001                12000               20284
  12001                13000               21380
  13001                14000               14891
  14001                15000               30794
  18001                19000               23772
  19001                20000               23741
  20001                21000               19360
  21001                22000                   5
  22001                23000                   1
  23001                24000                   1
  24001                25000                   1
  25001                26000               17420
  26001                27000                   1
  27001                28000                   1
  28001                29000               15799
  30001                31000                   1
  31001                32000                   1
  32001                33000                 277
  34001                35000               47220
  35001                36000                5639
  37001                38000               21053
  38001                39000               42981
  39001                40000                   3
  40001                41000               23551
  41001                42000                7546
  42001                43000                1789
  43001                44000               22258
  44001                45000               23320
  45001                46000                   1
  46001                47000                   1
  47001                48000                   1
  48001                49000                   1
  49001                50000               21609
  50001                51000               15465
  51001                52000               12501
  52001                53000                   2
  53001                54000                   2
  54001                55000               22837
  55001                56000               23436
  58001                59000                   1
  59001                60000                   1
  61001                62000               35227
  62001                63000               23960
  63001                64000               23119
  64001                65000               22933
  65001                66000
  66001                67000               23787
  67001                68000                6095
  69001                70000               18850
  70001                71000
  71001                72000                  61
  72001                73000                   2
  73001                74000               20364
  74001                75000               19815
  75001                76000
  76001                77000                   3
  77001                78000
  78001                79000
  79001                80000               23902
  80001                81000                7712
  81001                82000
  82001                83000                   5
  83001                84000
  84001                85000               23677
  85001                86000               23474
  86001                87000                2280
  87001                88000               21328
  88001                89000               21216
  89001                90000               21128
  90001                91000               22559
  91001                92000               44018
  93001                94000                 270
  95001                96000                   1
  96001                97000               22715
  97001                98000                 129
  98001                99000                 154
  99001               100000               21398
 100001               101000                   1
 101001               102000

Consider, for example, the first single copy interval identified with the present invention from positions 2001 to 3000. The method of the '097 patent shows that the interval between positions 1062 and 2913 is free of repetitive sequences. The following demonstrates that the method of the present invention confirms this result and independently can identify single copy intervals delimited by similar coordinates.

The present invention shows that there are sequences with multilocus representation within the flanking subsegments. Within the subsequence defined by the coordinates 1000-2000 there is a match to at least 20 other genomic segments and the sequence defined by 3000-4000 matches at least 51,045 other genomic sequences. The latter interval contains numerous highly conserved SINE and LINE repetitive elements. The short region containing a small portion of a MER58B repeat (2914-3000) contained within the corresponding single copy interval of the present invention is a highly divergent member (24.8% of the sequence differs from a consensus MER58B subfamily repeat) that only includes a small portion of the total repeat element (from positions 239 to 340). Hence for all practical purposes, the 86 nucleotide region that is considered to be repetitive will not cross hybridize with other MER58B repeats in the genome, if the hybridization conditions of the probe designed using the instant technology are set to be stringent (final hybridization wash should be 0.1×SSC, at least 42° C.). Similarly, positions 22001-28000 are found to occur once in the haploid reference genome sequence using the method of the present invention.

To precisely define the boundaries of the single copy domain in this region, we then rerun the analysis of the subsegment defined by coordinates 1000 to 4000 of the initial 103 kb HIRA sequence at much higher resolution. This is carried out either by comparing shorter consecutive subsegments or overlapping subsegments from this region of the HIRA gene. The following table indicates a comparison of consecutive subsegments of 200 nucleotides with the genome reference sequence. The criterion for detecting a repeat was that the minimum length match is at least 60 nucleotides and at least 65% of the nucleotides matched.

TABLE 3
Hits in consecutive subsegments in coordinates 1000-4000

Begin      End      Number hits genome

1001      1200            50
1201      1400             1
1401      1600             1
1601      1800             1
1801      2000             1
2001      2200             1
2201      2400             1
2401      2600             1
2601      2800             1
2801      3000           456
3001      3200             6
3201      3400           136
3601      3800          1059

This analysis indicates that the interval from 1201 through 2800 (a length of 1599 nucleotides) was composed of a single copy sequence (because each of the subsegments in this interval was found to be present once per haploid genome). The centromeric and telomeric boundaries of the single copy interval breaks were within the 1001-1200 and 2801-3000 nucleotide intervals. These results are consistent with the initial analysis of the density of repetitive sequences indicating that positions 1000-2000 and 3000-4000 were partially repetitive.

As an example, we illustrate how the boundary of the repetitive sequence within coordinates 1001-2000 can be even more precisely defined by comparing the sequences of overlapping windows within this region with the genome reference sequence. This is a computationally efficient approach for delineating repetitive sequence boundaries (Vincens et al. Bioinformatics 2002: 18:446-451). The 1 kb subsequences analyzed in the previous step were used to produce a series of subsets, each sequence 200 nucleotides in length, and each beginning 20 nucleotides downstream of the previous sequence (adjacent members contain 180 nucleotides in common). The minimum length pairwise match was 70 nucleotides and paralogous sequences were required to be at least 65% identical. Each of these sequences was compared with that of the reference genome in Table 4. The first two intervals (positions 1001-1200 and 1021-1220) contain one or more members of one or more repetitive sequence families, because these subsegments detect significant length matches to (at least) 50 and at least 118 different genomic locations, respectively. By shifting the centromeric end of the subsequence a further 20 nucleotides in the telomeric direction, the interval defined by positions 1041-1240 of the sequence matches a single genomic location with 100% identity (Query=1041 1240 HIRAcg; Min length of match=70; Min percent identity=65; Number of total hits=3; Number of qualified hits=1; Hit=ref|NC_000022.7|NC_000022, Length=200, Percent id=100, Start hit=17692626, End hit=17692825). This indicates that the single copy interval is expected to begin approximately at this position and this finding is confirmed based on the method of the '097 patent (Table 1; see below). The degree of error in specifying the precise coordinate of the single copy interval is dictated by the amount of nucleotide displacement of each window, which in this case, is 20 nucleotides. It will be evident to those of the art that the coordinates of the 3' or telomeric boundary of this single copy interval can be refined using precisely the same procedure as was used to define the 5' or centromeric end of this interval at 200 nucleotide resolution.

TABLE 4
Detailed refinement of 5' centromeric boundary of a single copy interval in the HIRA gene

Begin      End      Number hits in genome

1001      1200            50
1021      1220           118
1041      1240
1061      1260
1081      1280
1101      1300
1121      1320
1141      1340
1161      1360
1181      1380
1201      1400
1221      1420
1241      1440
1261      1460
1281      1480
1301      1500
1321      1520
1341      1540
1361      1560
1381      1580
1401      1600
1421      1620
1441      1640
1461      1660
1481      1680
1501      1700
1521      1720
1541      1740
1561      1760
1581      1780             1
1601      1800             1
1621      1820             1
1661      1860             1
1681      1880             1
1721      1920             1
1761      1960             1
1781      1980             1

TABLE 5
Analysis of Intermediate Subsequence (minimum 50 nucleotides, 65% identity)

Begin      End      Number hits

2001      2100             1
2101      2200             1
2201      2300             1
2301      2400             1
2401      2500             1
2501      2600             1
2601      2700             1
2701      2800             1
2801      2900             1
2901      3000             1

This moderate resolution (i.e. 100 nts) subsequence analysis at low stringency of the interval containing positions 2001-3000 confirms that the entire region is composed of single copy sequence. We then proceed to analyze the next 1 kb subsequence at moderate (Table 6), and then finally at high (Table 7) resolution.

TABLE 6
Definition of telomeric breakpoint at moderate resolution

Begin      End      Number hits

3001      3100             1
3101      3200             1
3201      3300             1
3301      3400             1
3401      3500          2081
3501      3600           529
3601      3700             1
3701      3800             1
3801      3900           163
3901      4000             1

The results shown in Table 6 suggest that the telomeric boundary of the single copy sequence interval resides between coordinates 3400 and 3500.

TABLE 7
Detailed refinement of the 3' telomeric boundary of the single copy interval in the HIRA gene using overlapping windows (same interval as that analyzed in Table 4)

Begin      End      Number hits

3001      3100             1
3021      3120             2
3041      3140             7
3061      3160             4
3081      3180             2
3101      3200             1
3121      3220             1
3141      3240             1
3161      3260             1
3181      3280             2
3201      3300             6
3221      3320            11
3241      3340            67
3261      3360            63
3281      3380            20
3301      3400            39
3321      3420            36
3341      3440           150
3361      3460           610
3381      3480          1936
3401      3500          2081
3421      3520          2987
3441      3540          3626
3461      3560           330
3481      3580          3479
3501      3600           529
3521      3620          3473
3561      3660           819
3581      3680          1406
3601      3700          2044
3601      3700          2351
3641      3740          1281
3661      3760          1610
3701      3800            22
3721      3820            57
3741      3840           140
3761      3860            19
3781      3880             8
3801      3900           157
3801      3900           163
3821      3920           709
3881      3980            19

The results of the detailed analysis of the subsequence covered by positions 3001-4000 indicate that the end of the first repetitive sequence can be found between positions 3100 and 3120 (positions 3021-3120 were present in 2 copies, whereas 3001-3100 is found only once per genome). Comparing with the results obtained in Table 1, we find that the telomeric boundary determined with the instant invention overlaps highly divergent members of the MER58B and L1M4 subfamilies. The elements contained in the HIRA derived subsequence are respectively 24.5% and 22.8% (with 13.2% insertion/deletion) different from prototypic members of these families. Because of the level of divergence from the consensus elements in the genome, and the limited length of the match to these elements (101 and 47 nucleotides, respectively), probes containing these sequences should not cross hybridize with other genomic locations.

In this example, we have shown that the instant invention enables the definition of a particular single copy interval spanning coordinates 1041 through 3100 within the 103 kb HIRA complete genomic sequence. A probe prepared from this interval would be of adequate length and suitable for use as a genomic probe (for FISH, microsphere, microarray, MAPH, or Southern hybridization) using the method described in U.S. Pat. No. 6,828,097.

Although the non-homologous genomic location is still a very divergent copy, it nevertheless meets our minimum criteria for a repetitive sequence (65 nucleotides in length, and at least a 70% identity). Such a stringent criterion is necessary in order to eliminate the possibility of spurious cross hybridization with divergent repetitive sequences in the genome. This potential sequence similarity may not pose a problem of cross hybridization in actual laboratory experiments; however, due to the cost and labor associated with carrying out those experiments, it is recommended that this sequence not be included in the probe. The match to the non-homologous sequence is indicated below:

>ref|NC_000017.8|NC_000017 Homo sapiens chromosome 17, complete sequence
 Length = 81,860,266

 Plus Strand HSPs:

 Score = 189 (34.4 bits), Expect = 0.54, P = 0.42
 Identities = 63/87 (72%), Positives = 63/87 (72%), Strand = Plus/Plus

Query:       12 CTAACTAAAATAATTG-AGTAAAACTCATAGGTCAAAGGGGAATTCTAATTAAGTGAAAT 70       (SEQ ID NO: 4)
Sbjct: 19011641 CTAAATAACATACTTTTAG-ATAACCCATAGGTCAAAGAAGAAGTC-AA--AAGTGAAAT 19011696 (SEQ ID NO: 5)

Query:       71 TAAAAATGACTTGCAAGAGAATGGTAA 97       (SEQ ID NO: 6)
Sbjct: 19011697 TAAAAAGTATTTAGAACCAAATGAAAA 19011723 (SEQ ID NO: 7)

 Score = 171 (31.7 bits), Expect = 3.5, P = 0.97
 Identities = 63/87 (72%), Positives = 63/87 (72%), Strand = Plus/Plus

Query:       13 TAACTAAAATAATTGAGTAAAACTCATAGGTCAAAGGGGAATTCTAATTAAGTGAAATTA 72       (SEQ ID NO: 8)
Sbjct: 12941025 TAAGTAATATAAGTAAATAAT-C-CATAGGTCAAAGAGGAAAT-T--TTATGGGAAATTA 12941079 (SEQ ID NO: 9)

Query:       73 AAAA--TGACTTGCAAGAGAATGGTAA 97       (SEQ ID NO: 10)
Sbjct: 12941080 AAAACATGTTTTG-AACTGAATGAAAA 12941105 (SEQ ID NO: 11)

Note that there are limitations to the precision of the breakpoints that can be defined by this method. In order to detect repetitive sequence elements that are highly degenerate, it is not appropriate to continue to reduce the length of the search sequence to extremely short segments because the algorithms used to detect repetitive sequences are sensitive to the lengths and composition of divergent genomic copies of such sequences. Repetitive sequences in the human genome often differ significantly both in homology and length from one another and consensus sequences derived from these repeat families, and this degree of sequence divergence challenges the sensitivity of most algorithms to detect repetitive sequence. Sequence comparisons between short test sequences and the genome using most of the common align [...] search interval are dependent on characteristics of the specific repeat sequences that are being detected. There are many eukaryotic species with genomes with families of repetitive sequences that are highly heterogeneous and contain short repetitive elements (e.g., SINE elements in the canine genome, which are often polymorphic in terms of their presence or absence in different animals). The alternative strategy of using precise word matching methods to identify repetitive sequences is itself insensitive to weak homologies between related family members and that lack of sensitivity is only amplified when the sequence being searched is particularly short.

Based on the results in Table 1, the boundaries of cataloged repetitive sequence family members flanking this interval at the centromeric and telomeric ends occur at positions 1061 and at 2913, which are completely consistent with the findings indicated in Tables 3 and 4. The minimum length of this single copy interval, i.e., 1599 nucleotides, would be quite useful for probe production for a variety of applications including fluorescence in situ hybridization, microarray hybridization, Southern analysis, and microsphere suspension array hybridization.
ment methods can fail to detect shorter intervals (e.g., 50-75 45 This same procedure was then repeated for each 1000 by nucleotides) containing members of repeat sequence families Subsegment that was found to be present in single copy in the that are divergent from the majority of family members and initial screen that determined the overall density of repetitive thus the performance of the instant invention can be compro sequences across the HIRA gene region. These presumed mised by comparison of short Subsets of sequences. The single copy Subsegments and the immediately flanking Sub degree of similarity between a test sequence and other related 50 segments which contain repeat sequences are again selected sequences in the genome can vary widely across the length of for more detailed delineation of the boundaries of the single the test sequence. Particular Subintervals with low percentage copy intervals. These regions would include intervals defined identities can falsely indicate that a sequence is present once by positions 21001-26000, 25001-29000, 28001-33000, per genome, even though the overall Subsequence (which 44001-50000, 55001-62000, 64001-67000, 69001-72000, contains this interval) is actually present multiple times in the 55 74001-77000, 76001-80000, 80001-83000, 82001-85000, genome. 93.001-97000, and 100001-102000 (intervals derived from To demonstrate this phenomenon, we attempted to divide Table 2). the 1000 nucleotide subsegments from HIRA into consecu Upon identification of the single copy intervals with the tive, non-overlapping sequences as short as 50 nucleotides present technology, DNA products derived from these inter and search these sequences with the human genome. Most of 60 vals are then amplified, extracted or purified from genomic these 50 nucleotide sequence were found by both BLAST and DNA or from recombinant DNA clones known to contain BLAT only one in the human genome reference, despite these sequences. 
The derivation of Such products and their evidence showing that these sequences were Subsets of hybridization to other nucleic acids (from patients with chro known repetitive family methods. Thus, it might not be obvi mosome abnormalities, for example) by either Southern ous to one of ordinary skill in the art that short contiguous 65 analysis, fluorescence in situ hybridization, attachment to sequences cannot be used to search the genome with high microsphere suspensions, microarrays or other Solid phase efficiency, since recognition of limitations on the length of the surfaces are entirely conventional and well known by those of US 8,407,013 B2 31 32 skill in the art. Examples and procedures for synthesis of Such (with 80% identity) present four times per genome beginning probes that have been developed from computationally in the interval defined by 47651-47661. This boundary is 58 defined sequences of single copy intervals and hybridization nucleotides upstream of the boundary disclosed in U.S. Pat. applications of the instant invention have been carried out by No. 6,828,097. the inventor in the 097 patent. Previously determined single copy interval boundaries in U.S. Pat. No. 6,828,097: 76829-79310 Example 2 Centromeric boundary: Intermediate resolution analysis (1st) delineates single copy boundary between positions HIRA Gene 76801-76850. Fine resolution (2nd) analysis of nucleotides The same approximate 103 kilobase pair length interval 10 comprising the 100,836 by HIRA gene and flanking 76701-76900 indicates that the boundary of a repetitive sequences (SEQ ID NO: 1) was extracted from Genbank sequence occurs between 76880 and 76900. accession NT 001039. Position 1 of this interval corre In other words, the ab initio detects a low copy divergent sponds to position 798,334 of NT 001039. 
This approxi repeat (30% of the nucleotides are discordant) within the mate 103 kb interval was analyzed using the method of the 15 interval between positions 76829 and 76880 that is not found instant invention. The following indicates a comparison of by the method of the U.S. Pat. No. 6,828,097. While this results obtained for design of single copy probes using the indicates that in Some instances, the ab initio method may be method of U.S. Pat. No. 6,828,097 versus the ab initio method more sensitive for detecting single copy intervals than the of the instant invention. The coordinates provided correspond previous approach, one of skill in the art would recognize that to the 103 kb interval from which probes were previously divergent repetitive sequences with this level of sequence derived. divergence do not usually produce cross-hybridization to Unless otherwise noted, initially the sequence region to be other genomic locations under typical laboratory hybridiza tested for repetitive and single copy sequences was separated tion conditions. into consecutive 1000 by intervals, each of which were tested Telomeric boundary: Intermediate resolution (1st) analysis for similarity for other sequences in the genome using WU 25 (using a threshold of detecting repetitive sequences of 65% BLAST as described in Example 1. These were divided into nucleotide identity) indicates boundary between positions 100 nucleotide (nt) intervals usually overlapping one another 794.00 and 79450. Fine resolution analysis (2nd) narrows this by 10-50 nucleotides and each tested for repeats by determin interval to between 794.00 and 794.10, which is 90 nucleotides ing the number of genomic copies of each 100 nt Subsequence from the boundary detected using the method of the 097 with matches >70 nts in length and >=70% identity. 30 patent. 
The ab initio approach fails to detect a portion of an Previously determined single copy interval boundaries in extremely divergent MER3 repeat element which begins at U.S. Pat. No. 6,828,097: positions 55445-60803 position 79310 and ends at 795.01 (which is found using the The initiallow (1 kb) resolution survey of the 103 kb region method of the 097 patent). This element differs by 33% from defined a single copy domain by positions 56,001-60,000 is the consensus MER3 sequence and contains insertions and present in single copy in the genome. The repetitive 35 deletions comprising 13% of that sequence. Because of the sequences adjacent to this interval were identified as follows: weak similarity to other related elements, divergent repetitive Centromeric boundary: 1 iteration localized to positions sequences of this type would not cross-hybridize to other 55001-56000; 2' iteration to 55393-55484 (because 55442 genomic locations under typical laboratory conditions. 55541 is single copy and 55393-55492 is present in 1086 Therefore single copy probes containing Such sequences copies perhaploid genome); 3' iteration to 55.424-55.434. 40 would still hybridize to a single location in the human genome This single copy interval boundary is within 11 nucleotides of under moderately stringent post-hybridization wash condi the boundary determined with the method of U.S. Pat. No. tions. 6,828,097. Previously determined single copy interval boundaries in Telomeric boundary: Boundary iteratively defined with U.S. Pat. No. 6,828,097: positions 21423-25270 increasingly narrower intervals. 
Intermediate resolution (1): 45 Centromeric boundary: At intermediate resolution, the ab positions 60.001-61,000; Higher resolution analysis (2"): initio method finds the boundary between a centromeric we find that the interval from 60.687 to 60,786 is unique in the repeat and the adjacent single copy sequence within the inter genome (1 copy) and the interval from 60,786-60.884 is Val defined by positions 21 101 through 21149. At high reso repetitive (33 copies); Highest resolution (3): positions lution, the boundary is more precisely delineated between 60,767-60,777. This single copy boundary is within 26 nucle 50 positions 21 119 and 21 139 using the default conditions for otides of the boundary determined by the method of U.S. Pat. repeat detection. However, using a lower threshold of detect No. 6,828,097. ing repetitive sequences of 65% nucleotide identity, a weak, Previously determined single copy interval boundaries in highly divergent repetitive sequence (with 67% identity to U.S. Pat. No. 6,828,097: positions 44937-48722 one other element in the genome) is detected within positions Centromeric boundary: 1: Intermediate resolution analy 55 21301-21399. Under typical hybridization conditions, this sis shows that the 5' most repeat ends between positions unlinked repetitive element would not cross-hybridize with a 44991 and 45000. 2"Fine resolution analysis shows that the probe derived from this genomic interval. Application of the boundary is between 44911 and 44921. The interval down method used in the 097 patent indicates that the repetitive stream of 44937 (boundary within an AluJo repeat defined by sequence at the single copy boundary is an L2 element which method of U.S. Pat. No. 6,828,097) is single copy. The ab 60 ends at 21151. The single copy boundary found by the ab initio boundary is within 16 nucleotides of the 097 boundary. 
initio method is thus 12 nucleotides from the boundary dem Telomeric boundary: An L2 repetitive element was shown onstrated in the 097 patent. to begin at 47718, the boundary of the single copy interval Telomeric boundary: At intermediate resolution (1st), the defined by the 097 patent. With the instant invention: the boundary found with the ab initio method between single intermediate resolution (1st) analysis shows that a repeat 65 copy and repetitive sequences falls between 25199 and begins in the interval defined by positions 47601-47700. Fine 25297. The high resolution (2nd), this boundary occurs (2nd) resolution analysis shows that a repetitive sequence within the interval delineated between positions 25280 and US 8,407,013 B2 33 34 25300, which is 10 nucleotides away from the interval bound copy number analysis of these sequences. Using the method ary determined in the 097 patent (position 25270). of the instant invention, we first localized the centromeric CDC2L1 Gene boundary at intermediate resolution between positions 68051 The previously determined boundaries of single copy inter and 68101. This interval was then refined to between posi val based on the method of the 097 patent used to develop tions 68051 and 68061, which is within 20 nucleotides of the probes are positions 8145-17744 of GenBank accession previously determined centromeric single copy repetitive AL03182 (SEQID NO:3). sequence boundary (in the 097 patent). The telomeric bound Ab initio analysis of consecutive 1 kb intervals in AL03182 ary was first determined to occur between 75949 and 75.999 (SEQ ID NO:3) shows that positions 9001-17000 are single and subsequently refined to the interval between positions copy in the human genome. The sequences adjacent to this 10 75971 and 75981 using the ab initio method, which is within interval each contain repetitive sequences. 
Sequences from 23 nucleotides of the previously determined boundary using positions 8001-9000 are present in 117 copies per genome the method of the 097 patent. and sequences from 17001-18000 are present in 1672 copies. Previously determined single copy interval boundaries in To more precisely define the boundaries of the repetitive U.S. Pat. No. 6,828,097: positions 76241-78441 sequences centromeric and telomeric to the single copy inter 15 The second interval in the NECDIN gene region (corre val, each of the flanking regions were further analyzed by sponding to sequences for PCR amplification (SEQID NOS: comparing overlapping genomic intervals with increasingly 441 and 442 of the 097 patent)) has a centromeric bound at shorter displacement. position 76249 and a telomeric bound at 79221 of the same Centromeric boundary: The 1st analysis localized this Genbank accession number. Applying the ab initio method boundary to positions 8151-8200; the 2nd analysis to 8170 iteratively as shown in the previous examples, these intervals 8180. The minimum distance between the boundary of the were found to occur between positions 76241-76251 at the single copy interval determined with the ab initio method and centrometric end and between 78431-78441 at the telomeric the boundary determined by 097 patent is 25 nucleotides. end. Thus, the 10 nucleotide window containing the centro Telomeric boundary: The 1st analysis localized this bound meric bound of the repetitive sequence defined by the ab initio ary to positions 17651-17749; the 2nd analysis to positions 25 method contains the boundary determined using the method 17662-17672. The minimum distance between the boundary of 097 patent, i.e., they are essentially coincident. The ab of the single copy interval determined with the ab initio initio method locates a highly divergent repetitive sequence method and the boundary determined by 097 patent is 72 (70% sequence identity) that was not detected using the nucleotides. 
method of the 097 patent, which accounts for the 800 nucle This 9.5 kilobase interval was divided into two overlapping 30 otide difference between the respective boundary coordi single copy intervals in order to develop probes that could be nates. This divergent repeat would not cause cross-hybridiza easily amplified for hybridization. As in the 097 patent, the tion under the laboratory conditions used for probe interval sequences were used as templates for essentially hybridization. In any case, the interval defined by the ab initio conventional PCR primer selection methods, as described in method is more conservative than the one found using the the 097 patent. The resulting probes from these two intervals 35 method of the 097 patent. Using typical laboratory chromo Substantially overlapped the sequences comprising the somal hybridization conditions (described in the 097 patent), probes of the 097 patent and when labeled by nick transla one of skill in the art will understand that probes derived from tion, produce an identical genomic hybridization patterns this interval will produce hybridization to a single genomic previously obtained with FISH. Differences between results location. produced by the current invention and the 097 patent only 40 Previously determined single copy interval boundaries in occur for short probes (-100 nt) whose sequences fall at or U.S. Pat. No. 6,828,097: positions 94498-99.152 close to the deduced boundary between the single copy and The third interval in the NECDIN gene region (region repetitive sequences (for example, for single copy probes of (corresponding to sequences for PCR amplification (SEQID 100 nt typically used In microsphere hybridization assays). NOS: 439 and 440 of the 097 patent)) has a centromeric Probe design should avoid using probes comprised of 45 bound at position 94498 and a telomeric bound at 99.152 of deduced single copy sequences that are located close to the the Genbank AC006596. 
Applying the ab initio methoditera position of the single copy-repetitive sequence transition. tively as shown in the previous examples, these intervals were NDN (NECDIN) Gene found to occur between positions 94.661-94671 at the centro Three single copy probe intervals were derived from Gen metric end and between 97.691-97701 at the telomeric end. bank accession number: AC006596 (SEQID NO: 2) from the 50 The probe interval obtained using the ab initio method is NECDIN gene on chromosome 15. more conservatively determined than the single copy interval Previously determined single copy interval boundaries in defined by the method of the 097 patent, suggesting that the U.S. Pat. No. 6,828,097: positions 68031-75948 ab initio method identifies unrecognized repetitive sequences For the first interval in the NECDIN gene region, the pre not detected with the 097 method. Indeed the instant inven viously determined single copy interval boundaries (given in 55 tion detects a previously unrecognized highly divergent U.S. Pat. No. 6,828,097; amplified by PCR primers corre repetitive sequence which is present 23 times in the genome sponding to SEQID NOS: 437 and 438 of the '097 patent) are and shows an average 71% identity with the interval 97651 bounded on the centromeric end by position 68031 and at the 97750 in the Necdin gene region. This divergent repeat would telomeric end at position 75948 of AC006596 (SEQID NO: not cause cross-hybridization under the laboratory conditions 2). Sequences between these coordinates are considered 60 used for probe hybridization. Using typical laboratory chro single copy and are not similar to known families of repetitive mosomal hybridization conditions (described in the 097 Sequences. patent), one of skill in the art will understand that probes At 1 kilobase pair resolution, sequences between 69001 derived from this interval will produce hybridization to a and 75000 were found to be present at only this location on single genomic location. 
At the telomeric end of this interval, chromosome 15 as a single copy sequence in the genome. The 65 the ab initio method detects several contiguous simple repeti adjacent intervals consisting of positions 68001-69000 and tive sequence composed of imperfect runs of polynucleotides 75001-76000 contained repetitive sequences based on initial (GO or polydinucleotides (ITG). These are detected as well US 8,407,013 B2 35 36 by the methods of the 097 patent; however because these -continued sequences are relatively short interrupted runs of imperfect my Slen = $seqobj->length; homopolymers, they will not cause cross-hybridization under #print “length, Slen; the laboratory conditions used for probe hybridization and while(Slen > Send ) { can therefore incorporated in most probes developed using # print seen sequence seqob->display id(), start of Seq, i Substr(Sseqobj->seq,1,10),\n': the 097 invention. Nevertheless, the ab initio method does if(Sseqobj->alphabeteq 'dna): recognize even these short, divergent sequences as repetitive SSubseqin = $seqobj-> Subseq(Sbegin,Send): Sequences. Sid = $seqobj->display id(); As demonstrated above, the ab initio method of probe Sidsub = Sbegin. “ . Send. “ . Sid: 10 Snameseg = Sbegin. “ . Send; design can recapitulate in most cases the single copy probe open (OUT, “Snameseg); intervals deduced using the method of the 097 patent. In print OUT"> Sidsub, “\n, Ssubseqin, “\n: those instances where the two methods differ, in nearly all # print">''.Sidsub, \n, Ssubseqin, “\n': cases, the ab initio approach is more sensitive detecting even # insert system call for qsub of wublast job here # job runs the wubl script and then a perl program that has blast weaker similarities (of less than 70% identity) to known parser for each blast run. Results are repetitive elements in the genome than that found with the 15 # appended to a table prior method. The ab initio method may in Some cases pro Sfpresults = “-fDocuments. Snameseg. 
results: duce purer single copy sequence compositions than the meth system ("qsub-cwd -o Sfpresults -e idevinull ~/Documents.?wubl ~/Documents/Snameseg Sminlen Sminperc'); ods of the 097 patent. In the laboratory however, these weak # for example: qsub-o-fDocuments/test wubl sequences similarities are not relevant, since under even mod - Documents, 101 200 erate stringency post-hybridization wash conditions, any close (OUT): duplexes formed with such sequences will be disrupted and Sbegin = Sbegin + Sincr; eliminated, thus preventing cross hybridization between Send = Send + Sincr; these highly divergent repeats at other genomic locations and the designed probes. # Sseqout->write seq.(SSubseqin) All references cited above are hereby incorporated herein 25 Sdate=System (“date'); by reference. print Sdate: Appendix A The following script is an example of the ABINITIO.PL Appendix B Script. The following script is an example of the WUBL script. 30

# gets Subsequences of defined length and increment from input then # sequence P Rogan 2005 echo form: wubl sequence file min length match i min percent match use Bio:SeqIO: 35 echo "sequence name (fasta format): 'S1 use Bio:SeqIO::fasta; echo “Minimum length of match to repeat: 'S2 use Bio::PrimarySeqI: echo “Minimum percent match to repeat: 'S3 use Bio:SeqFeature::Generic; blastin -d “human-span2-i S1-cpus 2-lcmask-hspmax 100 # command line arguments: -warnings -errors -o S1 results # (1) Name of genomic sequence blaspars.pl S1 results $2S3 > blastparse # (2) Length of Subsequence # (3) Length of window increment 40 # (4) Minimum Length of Match to repeats # (5) Minimum Percentage Match to repeat Appendix C system(“date'); The following script is an example of the BLAST system(“pwd'); PARSE.PL script. # get name of sequence SARGV = shift (a) ARGV: 45 chomp SARGV: use Bio::SearchIO: print “processing SARGV.Vn'. use Bio:Tools::BPlite: # this program is called within the wubl script print Params: (1) Name of genomic seq., (2) Length of i Subsequence, (3) Length of increment, \ln(4) Min length of match and (5) 50 # command line parameters : name of blast result file, min length Min percent match to repeats\n': of match, min percent identity Sminlen = 100: Sminperc = 70; -format => Fasta): SARGV = shift (a) ARGV: #initialization of Subsequence extraction chomp SARGV: Sbegin = 1; 55 Sminlen = shift (a)ARGV: Send = shift (a)ARGV: chomp Sminlen; chomp Send; Sminperc = shift (GDARGV: if(Send-2) die “subsequence too short: chomp Sminperc; Sincr = shift (GDARGV: my Sin = new Bio::SearchIO(-format => blast, chomp Sincr; -file => SARGV): if (Sincr< 1) {die “beginning and ending nucleotides of print SARGV “\n: Subsequence are identical: 60 while(my Sresult = Sin->next result) { Sminlen = shift (a)ARGV; chomp Sminlen; print “\nOuery = , Sresult->query name, “\n': Sminperc = shift (GDARGV: chomp Sminperc; print “Min length of match = , Sminlen, “ Min percent identity = # Sseqout 
= Bio:SeqIO->new(-format=>Fasta, -file=>> ”, Sminperc; # output.fa); print “Number of hits = , Sresult->num hits, “\n': # print SARGV.“, Send, “”, Sincr: while(my Shit = Sresult->next hit) { 65 while(my Shsp = Shit->next hsp) { #length of full sequence if Shsp->length (total) > Sminlen ) {
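As an illustration only, the window tiling performed by the loop in ABINITIO.PL (Appendix A) and the hit filter begun in the BLASTPARSE.PL fragment (Appendix C) can be sketched in Python. This is a minimal sketch and not the inventor's code: the names `tile_windows`, `Hsp` and `count_repeat_hits` are hypothetical, the length test mirrors the surviving `$hsp->length('total') > $minlen` comparison, and the `>=` identity test is an assumption taken from the ">=70% identity" wording of Example 2, because the corresponding line of the script is not present in this text.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


def tile_windows(seq_len: int, win_len: int, incr: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-based inclusive (begin, end) windows.

    Mirrors the Appendix A loop: begin = 1, end = win_len, and both are
    advanced by incr while the sequence length is strictly greater than end.
    """
    if win_len < 2:
        raise ValueError("subsequence too short")
    if incr < 1:
        raise ValueError("increment must be at least 1")
    begin, end = 1, win_len
    while seq_len > end:  # same condition as: while ($len > $end)
        yield begin, end
        begin += incr
        end += incr


@dataclass
class Hsp:
    """Hypothetical stand-in for one parsed BLAST high-scoring pair."""
    length: int       # total alignment length in nucleotides
    identity: float   # percent identity over the alignment


def count_repeat_hits(hsps: Iterable[Hsp], min_len: int = 100,
                      min_perc: float = 70.0) -> int:
    """Count HSPs that qualify as genomic copies of the query window.

    Defaults follow the $minlen = 100 and $minperc = 70 values in the scripts.
    """
    return sum(1 for h in hsps if h.length > min_len and h.identity >= min_perc)


if __name__ == "__main__":
    # 100 nt windows displaced by 20 nt, as in the Table 7 refinement.
    for begin, end in tile_windows(161, 100, 20):
        print(begin, end)
```

Note that, as in the original loop, a window whose end coincides exactly with the last nucleotide of the sequence is not emitted; a caller that needs the final window would have to pad the length or change the comparison.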

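The way a telomeric boundary is read off a table of window copy numbers (Tables 6 and 7) can likewise be sketched in Python. This is a hedged illustration under stated assumptions: the function name `telomeric_boundary` is hypothetical, a window is treated as repetitive when its hit count exceeds 1, and the boundary is reported as lying between the end of the last single copy window and the end of the first repetitive window, which is how the text brackets the boundary "between coordinates 3400 and 3500" and "between positions 3100 and 3120". The window values are copied from Tables 6 and 7.

```python
from typing import List, Optional, Tuple

Window = Tuple[int, int, int]  # (begin, end, number of genomic hits)

# First five rows of Table 6 (moderate resolution, non-overlapping windows).
TABLE_6: List[Window] = [
    (3001, 3100, 1), (3101, 3200, 1), (3201, 3300, 1),
    (3301, 3400, 1), (3401, 3500, 2081),
]

# Table 7 (fine resolution, 100 nt windows displaced by 20 nt).
TABLE_7: List[Window] = [
    (3001, 3100, 1), (3021, 3120, 2), (3041, 3140, 7), (3061, 3160, 4),
]


def telomeric_boundary(windows: List[Window]) -> Optional[Tuple[int, int]]:
    """Bracket the transition from single copy to repetitive sequence.

    Scans windows in order of increasing coordinate and returns
    (end of last single copy window, end of first repetitive window),
    or None if no such transition is present.
    """
    prev_end = None
    for begin, end, hits in sorted(windows):
        if hits > 1:
            if prev_end is None:
                return None  # the first window is already repetitive
            return prev_end, end
        prev_end = end
    return None  # every window is single copy


if __name__ == "__main__":
    print(telomeric_boundary(TABLE_6))
    print(telomeric_boundary(TABLE_7))
```

Running the coarse pass first and then re-tiling only the bracketed region with overlapping windows is what narrows the interval at each iteration; a centromeric boundary would be handled symmetrically by scanning the windows in decreasing coordinate order.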
US 8,407,013 B2 45 46 - Continued cctgcacact aaa.catat ct acaaacaagg act ct cacag agt citacgcc attcc cc togg 714. O Caccaccacc acagdaggtg Ctggt atcca cagctgggag atctgaagat ggat cacat c f2OO accgggttct ttgcagacgt t ccc.ca.gcat gggcc.ca.gag cctgg tagcc ccactgggtg 726 O gctagaccca gaagggcaat aataatcacc gcagt ctdgc ticataggaat citc catc cct 732O aggggaaggg gaagtgcacc aaatcaaggg at Caccctgt gggacaaaat aatct caa.ca 7380 gcagoctotg agttccagat titt to cactgaac tagt cta cccaaatgag aagtaat cag 744. O aaaagtaatt ctdgcaataa togacaaaa.ca aggttctata atacctic caa aagaccacac 75OO tagctic ct ca gcaatggat.c caaac caaga ataattacaa agtacattitt cattatttgg 756 O cittaaaaagg cagaacttitt toggctttitt c tttitt cittitt ttttittgaga cagggit citcg 762O

Ctctgttgtc. Caggctggag tec agtggcg tdatct ctga attgccaaag aattic agaag 768 O gctgattatt aagctactica aggagatacc aaaggtgaaa atcaact tca agaaattitta 774. O aaaaat at at aggatatgga tigaaaaatgc ticcagagaaa ticggitat cat aaagaaaaaa 78OO tcaaaaaatc aaaaatcaaa acttctggaa ataaaagaca cacttagaga aatacaaaat 786 O gCactagaaa gtttcaacaa tagaatcaaa gaagtagaag agaga actt C agaattcaaa 7920 gacaag actt tdaatcagac aaaaacaaag aaaaaataat tttittaaaaa aatgaacaaa 798 O gcct coaaga aatttgggat tatgttaaat ggccaaacct aagagtaaga ataaatggtg 804 O titcc taagaa gagaaatcta aaagttctgaa aaacg tattt gtggggatag titgaggaaag 81OO citt.ccctgac cittgctagag atctagacat coaaatacaa gaagctcaaa gaacacctgg 816 O gaaatttatc acaaaaagat catcacccag gtacacagtc atcaggittat ctaaagt caa 822 O gacalaaggaa agaatcttaa gagctgtaag gcaaaag cat Caggit aacct atacacgaaa 828O gcct at C9ga tttitttittitt tagacagag tdttgctttgtcatcCaggc tiggagtgcag 834 O tggtgcaatc ttggct cact gcaat citctg ccd.ccctggit toacgcaatt citcct gcctic 84 OO agcctic cc aa gtagctggga ctacaggc cc ctgccaccag gcctggataa tttttgt att 846 O tittatt agag gtggggtttc accgtgttgg C caggctggit Cttgaact cc taccttaaa 852O tgat coaccc accttggcct C cctaagtgt togattaca caatgagcc actgcgc.ctg 858 O gccagaatac citat cagatt aac agcagat ttct cagcag at accctaca agc.ca.gaagg 864 O gtttgggttc ctatttittag citt cotcaaa caaactaact gcc agccaag aatttagtat 87OO c cagcaaaat taagtgtcat atatgaagga gqcataaagt ctittitt caga caaatgctga 876O gaga atttgc caccaccalag cca.gcactac aagaaatgct aaaaggagtt ctaaatcttg 882O aaacaaaacc titgaaataca ccaaaataga acttic cittaa agcataaaac toacagggit c 888 O tataaaacaa taacaaaatgaaaaaaaaaa aaccaacaaa aaaagaaggt attcagg taa 894 O aaacaag.cat gigtaaataaa acagtacctic acatctogat act aa cattgaatgtaaata 9 OOO gtctaaatgc ticcacttaaa agatacagaa tdgcagaatg gatacaaatc. 
caccalaccala 906 O at atctgcta acacatatgg act cacataa gttgagggta aaggggtgaa aaaagatatt 912 O c catgcaaat acaaac caaa agcgagcaga aatagotatt cittatat cag acaaaacaga 918O ctittaaag.ca acaatagttgaaaaagacaa aaaggga cat tacataatga taaaaggat.c 924 O agtic caacag gaaaat atca caatcctaaa tatatatgca cctagoacgg gagct cocaia 93 OO atttataaaa caattagtact caacgtaag aaatgagata cacagcaa.ca cagta acagc 936 O ggggacttica acactalgaca ggt catcaag acagaaaagc aacaaagaala caatggactt 942O acactatacc ctagaacaaa toggacittaac acatatttac agaac attct acccaacaac 948O

US 8,407,013 B2 51 52 - Continued tggtggtgca CCCtgtaat CCC agctact cqggaggcag aggcaggagg attgattgag 434 O ccagaaggitt gaggccacag taggggaaa aaaaaaaaag agaga.gagag agagagt ctt 44 OO gctatottgc ticaggctggit citcgaatticc tacct caag tdatct tcc c acct cagctt 4 460 Cccaaagtgc tigggattaca ggtgtgagcc accacgc.ctg gctgaaaaaa Cacact atta 452O aacaaagtga gacaaatgaa aatgaaaata caa.catacca aaact tacag tatgcagtga 458 O aagctgat ct caaatcaata atcta acatt acaccittaag gaact agaaa aagaactata 464 O

Cctaaagcta gcagaagaaa ataataaaga taatgggaca agataaatgg aaaataatag 47 OO agataatcaa tdaaac caaa agttgattct ttgaaaagat gaacaaaatt gacaaactitt 476 O tagctagact acataataaa aagagaga.ca agatccaaat aatgaaaatc aaaaatgaaa 482O gcagggacat tacaac caat gccacaaaaa taaaaaagat tataaataag agaac agcat 488 O gaacaact at atgacaataa atctgataac ctacataaaa tdgaaacaac ttaccaagac 494. O tggct cataa agaaattaaa aatctggacg gat ct ctaat gagcaagaaa actgaat caa 5 OOO taaaacaaac cct ct cataa agaaaag.cct aggat catat agcttct ctd atgtatt cita 5060 c caaac actt agagaattaa caccaatcct c ct tccaaaa tagg taggaa cact tcc tat 512 O tt catt citat gaggacagca ttaccctgac aaa.gctagac aaagatact a caagaaaact 518O attagat caa tat cotttgt aaa.ca.gtgac ccaaaaatcc ticaacaaaat gcc agcaaac 524 O agaatticcaa agtacattaa aagaattata cac catgacc aagtgggatt tatt cottga 53 OO atgcaagaat gigtttalacat atgaaaacca at cactgtaa tacat cacat taatgaaata 536 O aaagaaaatt ttaaaatgac acgat catct taatgcagaa aaa.gcatctg agaaaatgca 542O acattctitt c ttgataaaag cact caacaa act aggaatg gaagaaaact atctogacat 548 O agtaaagaccataaataaaa agcc.cacago taa catcgta cittaatggta aaagactaaa 554. O agcttitt cct ttaatat cag gaacaagaga aggatgcct g c titccagcac taatatttaa 5 6.OO cgtagt atta agagtic ctag acagat caat taggcaagga gaagaaataa aaggcaiacca 566. O aattgggaaa aaagaagtaa aattatttct gttcacagat gacatgatct tatatatgga 572 O aaac cctaaa gatticagoga aaaac tacta taaacaaag.c aaaac attct gcc tigcc tdt 578 O. 
ggtact agga agaagctgca agaggactitg CCCtctggcc talaggcaat gtaaagagca 584 O gccaagtatt attgat attt cct caccott cqgct ct cag taaaggatgg tttitt coact 59 OO ctitt caggat gcgatgtata gct citttgta cagcc togcaa cacacaactt aat caccacc 596 O t citctggcca citgccacagg tottacagca gcagt cc cca accttitt cqg cacccaggac 6O20 tggitttittitt ttatggacca gtgggggaggggalagacggit ttcaggataa alactgttcca 608 O cct cagat catcaggcatta gattct cata aggagcacac aacctagat c tict catatgt 614 O gaagttcaca at agggitttgttgctic ctatg agaatttaat gttgctgctg actggtctgt 62OO ggcc.ca.gagg ttggggaccc ctgtc.ttaca Citgaagacca Cagcaaaggg aggct tcct a 626 O agaacagggc Ctggctgggg aggctggagc Cagaacaaag cccaggalacc talaaggtgt 632O ttgcttagtg ccc caacct t c togcttct catttitcct coc atgcacactgaac catgcaa. 638O aggat.ccttg aagttgaaag aaatctgaac Ctttggtgtc. cct gtggtgc actggcagct 644. O caaatcagag tatataaaga gct cotataa tatacataga gttcc tacaa accattgaga 65OO aaaacaaatg gcaacaagta tttcaataga tagttcaaaa aagggaalaca Caagcggctic 656. O ttaa.gcatgt gaaatgatgc tict Cotaiaca agc ctitcttg ggagggctgc tigagt cagca 662O tggctggttg gaagttccacc ctic calaccac agctatattt toccagttat ttgaaggat.c 668O

US 8,407,013 B2 73 74 - Continued gtgatctgcc cacct cqgcc ticc caaatta caggcatgag ccaccacgcc toggcc tigttt 40740 tgtttittatt cittgttittga gacagggit ct cactctgttg cccaggctgc agtggagtga 408 OO tittctgctta citgcaacctic togct tcctgc gtttalaggaa ttctgct gcc ticago: ct cot 4086 O gagtagctgg gactacaggc acgcaccacc atgcc cagot aattitttgta ttitt cagcag 4092 O agatggggitt to accatgat gaccalagctg gtc.tcaaact c ccgg actica agtgatccac 4O98O ctgcct cqgc ctic ccagaat gctaggattg act acaggca tdagccactg. caccaggcct 4104 O c cagtgatgg gtgttittaaa goggct cotcc tdgttitt cat tdagaat caa tacaagaaaa. 411 OO cago cacatc aagaaagtat ctdcatttitc tdtaaggcct ttgttaaaga tactggg act 41160 tgttittattt catttitcc to atataaggat coaac ccaac ctdagaattic agcaccaggc 41.22 O t cctagaa.gc. tcactatact cattctggct ggaaag.cagg aagct cagoc ccaagtgaat 4128O gct cact cac togccactittcaagtgagaaa citgaggatga caaatgtgat gigtggcc cct 41340 tattatacag atcttgctac agctg tattt atggctggac gigt cittgttga gaccctgtgg 414 OO acatagctgc tigaggaacca acccttgttgg caccagg acc caggatatac agt caagatt 41460 ctgtct coag aagccaatct aagggctato citcct ggctt ct cattcagg g tatgcact a 4152O catgagagat aagggccaca gaaagccaca aagcaaacat gag tdtcttic ct caataagc 41.58O aggactgtct gtggcacact gtag tittct c ticaggtgggg aatt cattitt atttattittg 4164 O ttcagaactit gct coatttic aaaatctgag gtttctt cac cittctggitaa goctottaac 417OO t ccct cittga caact acccc gcago acatc ctic ct ggggg gactic ct ct c tattgcc ctd 41760 tott cagotc. cacct cotca cct ctgtgcc actctggcca attttgtaga citgaatgttc. 
41820 ttitt cagotg gttctgat cit gct citctgaa atgcttgttg acttittaaac togcaatgatt 4188 O attatttitt c tttitctaaaa gtatttitttt tdgttcattt ccaaatctgg tttittttittg 41940 at agtg tatt attcttittaa tttittattga gatatatata t cataaactt cqc cactitta 42 OOO aagtatagaa toattagctt ttagoatatt cataaggitta agcaaccatc at cactatot 42O6 O aatgcc agaa cattitt cata atticcaaaaa caaactctgt acccattggit act cact cot 4212 O tatic ct coct c cct cagacc ctdgcaacca citgttact ct actitt ctitta toggattittcc 4218O attctggaca ttt cotatica atggaatcat acaac citatg atcctttgttg actggcttct 4224 O ttct cittagc ataatgtc.tt taaggitttat cqttattgta gcacatgitaa goatt coatt 423 OO cct ctittatt gttgaataat attctattgt atggg taaac catattttgt ttatt catca 4236 O cctgatggac ttittgggcta titt coactitt ttggittatta totacagatg tdaac attca 4242 O tgitatgagtt attgcatgga tatatgttitt caattct cott ggg tatgtat c taggtgtgg 4248 O gaactgctag gtcagatgat aactictatat tttaccattt gaggaacttic cagactgttt 4254 O tccaaaggtg ctaaaagact ttacattcct accagoatgt atatgagggit tocctittaga 426 OO gctaatttitt gtttacagaa tagag tatta atc.tttic ctic acattttgtt ttagottt ca 42 66 O tgtc.cttcaa gacittaa.gca tactittatag togc catctga caattictatg at cacaggitt 4272O tataggacca acactt atta t ctdactato tdatcttitta attacagacic ticcaaatggg 42780 tatgaagitat tt cattgaag titttgctittg tattt coctd atggctaatg atgttgattg 4284 O aattitt tatg tacctgtgct ttgtatatat t citctgcagt ttct citt cag atc.ttittgct 429 OO catttittaag ctgttgttatt cqtctttitta ttgttgaatt gtaaaagtta tittatataat 42960 ttaa attctg gactittaatt agatgggatt togcaaatatt ttct c ccatt c to taggttg 43 O2O t ctittcaatt ctdata.gtgc tittggaagica aaaaagcttt taattitttitt tttitttittitt 43 080

agttgggaag agc catgaat citcaacgaga ttittctaaac atacaaactic catatttcta 98340 caaactittaa aaccagaaac ctagttgttt ttactittctg accittaacct aacttagaaa 98400 taaagctitat tittggit caat t ctaaaataa aatatataga gaatact.gta t ccaagaac a 98460 cgaagtaaaa totgitatgta togaacaataa citgacacatc ttctatataa atgatagota 98520 titat attt ca atttaacaat cittittatatt tdgtoccagg ataaaatgga agagg taagt 98580 cittattittac caaggcc.cag aagtagagitt ataagtic caa gaatttacag c tagtaaatg 98640 atgcagttitt gaaccoagt c tdatt coaca ttcaaggctic attt cactac aacgtgacaa 98700 cittcaagatt aacttattga taaccittaga caagtgctitc. tcaattgggg gtgattittgg 98760 ctaccalatgg acattggcaa tdactggaga tatttittggit tdatacaact agaagttgta 98820 t caact tact ctaagaaggt gtttgttgaag togctactggc atctggtggg tagaggccaa 98880 ggatactaga aac attctat aacatagatg acaat cacac aattittatgg cccaaaatgt 98940 caacct gctgaagttgggaa accctacctt agaggtgtct c tatt at cag ccaagaaaat 99000 gaagaccalaa gtacacccac atttcaataa at actitcttic ticcitgttcta gct accctict 99060 c cactittgaa toggatt coct gcc acaccag agaaaaatgc tigagctgtct gctact tcc c 99120 c ctittagact citcgaaaaac agaaaaagta tott catgct gggtgcagtg gct cacgcct 99180 gtag toccag Ctact cagga tigctgaggca ggaagat coc ttgagcc cag gtatttgagt 99240 ctagotctggg caacactgca agat.cccatt totattaaaa aaaaaagtta tdtttgttgct 99300 t cacaaaatt ggagacatga atatagaa.ca aggaa.gc.cag aaagtgcaat gatggctitta 99360 caaaagcaag cacaattitta Ct99ccacta cccaccc.ctc aggct tccgg Ctgagcagoc 99420 cctgct coca gcagocacca atact cotac agtatttaaa gogoctdct cq gtagt gattic 99480 agaccagoat catcto ct co agggcagtgg taatttacct cittgacagtic cittctgaac a 99540 titccatgaga ttittaaaaga aaggaaattit aacaaagaac aagtttaaaa gaagttaaat 99600 cc catcctgc cagagttgcc at atc.tttitt tttittttittt tttitttittitt gagatggagt 99660 ct cactgtca toaggctgga gtgcagtggc gcgat ct cqg ct cactgcaa cct ctacct c 99720 ccgggcticaa gcgatt ct co tdt ct cagcc ticcc.gagtag ttgggactac aggtgcgtac 99780 caccacgc.cc 
agctaattitt td tatttitta gtagagacgg ggttt cacca tdttggc.ca.g. 99840 gatggit ct cq atctottgac citcgtgat co acctgcct ca gcc tic ccaaa gtgctgggat 99900 tacagg catg togccaccgtg cccagcct ct ttitt attatt act tatt atc. tcagatttct 99960 aaag cactta acagtttggg gag catttitt at atctacgc. tcc caactgagtat cacaag 100020 catact atga agtggittaga gtagatatta t catgct coa ttctagggct gtagagaggc 100080 Ccticagdaat aagggccact gttgttgcttg Ctctgcc agg Ctgatgcact tdgctgctga 100140 tctgagtgta gtct cittatt catttgtacc tagttcgt.ca cittaa.gcatc ttgact tact 100200 tatgtttaca tag cataaaa cittittggcct acaaaggagc titatcaaaaa gogaacaagtt 100260 atatgacagt agaagagagg Ctggagaagg gtaggctatt cacaaaaact gatttaggta 100320 aagcatttta aaataaaaat t citat cagac citt cacacag taaac tagtt cataataa.ca 100380 tttitt.cctitt cotaat caca agcatcataa tt cataatta caatatttgt ccagttctic c 100440 agggaaagat Ctgcacagta Cacat cagga aatgct Cttg gagat Caac a ggagaaatga 100500 aactaaaact caaggctt cq gottt catgc acagtaalacc taact tcatc ttgataaatt 100560 atgaaaacct cagctttctt acctgttitta gccaatattt at cacatacc cat catgtgg 100620 aagaggttgt tdaatgcagt agggcataca gaagggactg aggcagt ccc taccctgggg 100680

ttattittaat atatgtc.cat a catactitt catat cattcc ctic caataag tdaaact tag 2120 titcc tagt cattgagtgtgg totggagitta gtggttt coc tictaccalaat ggaaagagaa 2180 aatagtaact ttacagt caa gaatccagca gacat catcc tatgtgattic agcttaatat 2240 tagcaataga taaagttgat aacacataca ttctaatatg atatggtgag aagga cacct 2300 gact tccatgttatt citt co ccaaaatcca tdgccaaag.c ataat catga gaaaa catca 2360 gatgaatata agtggg taga tattottcaa aatgcct cac ttggaccctt caaatgtgtc 2420 acagt cacat aggagaagtg Cagactgaga aactgtcaca gatgggggot aagaagaaag 2480 ggtgactgac gcagtgttgta t ccttgatta gactic.cgaaa taaaaaaga Cattagcaaa 2540 agaatacaaa gtggatgaaa tatgtataca gtttittagtt ttgtt attag tattt cacca 2600 at attaagtt Cttagtttgg gtaaatgtgg catgattatg taagatgct a gccttagggg 2660 aaacg.cggtg aagggtgtact caaattcac td tactatog titt to actict tctgtaagtic 2720 taaaattatt toaaaataaa aaaattaaaa ttgcaaacat tdt cagattg gatataaaaa 2780 gcaagattica accatatgtt act tatgaac ataatactitt tttitttittitt gag accqagt 2840 tttact ctdt citcc.caggct ggagtgcagt gigtgcaattt cagct cact g caatctotgc 2900 citcc taggitt coagcaattic ttgtgcctica gcct c ccaag taactgaaat tacaggcatg 2960 atccaccacg to cqgctaat ttttgtattt ttagtaaaga cagggitttca atatgttggc 3020 caggctggtc. 
tcaaactic ct gacct caa.gc gat cagocta cct cqgcctic ccaaagtgct 3O8O gggattacag goctagoca ccacaactgg cccaaacata at actittaaa tataatgaca 314 O taaataagtt aaaaggaaag gatggaaaaa tatagatt.cc atgct agtac taagtaaaac 32OO aaaactgtgg to ctattt C aat attggac aaagt catt to agaggaga gaaaact tcc 326 O agaagata at tttacaaata caaacatalaa tacaaatata aatatact at titatic cagaa 332O aacaaattaa cccaagatga gaatctgcaa gagga catgc ataaaactat atttataatt 3380 gacaattitca at atttitt ct citgaataatt gatggaaaaa gcagacagac aaat atccaa 344 O agtacaaaaa gcattgaata acattataaa ccaaattgac ctacctgaca tttatataac 3500 atgtcatcaa ataacaccaa catttctatt atttic ccaaa acacatttaa catgtgcaa.g 3560 ggtcCttgtt Ctgggc catt aaaaaaaaaa aaaaa.cagga ttcaagaagt acaaagtatg 362O gttatt ct ct gacct caagg gaattaaatt aagaatt cac aatatt coct act citttitta 3 680 aaagtaaatc ccacacttct aaataaccat giggccacacic aaacaaatat caatggggaa 37.4 O aaatcaaaat gtatgggata t cattcaaga agt ctittggg aggaat atta tagalactaag 38OO tgcctatatt atgaataaga agggit citcaa atcaatgacc ttagcttagg aatacaaaaa 386 O aggaatagca aattatacac ttagaaagta gaaaaaatga aataataaat gtcatgaaga 392 O acactcaaaa acaatagaga gtgtcaatga aacaacaaaa aaacaaaaac ctdgttt citt 398 O gaaaaattaa gaaaattggit aattatctag toaaactgct cagoatcaaa totagagaacg 404 O atcaaataat attaccaatgtcaataatga gqaaggcago agcactaaag gttcaaaatt 41OO attaac cqga t cqtaatgac ataaaatgaa taaactt cac aattalagaga tiggacaaagt 416 O cctagggtga caaacact cq taaac cctac toagcatacg cccatatata tatgtatata 422 O tccattaaat aagaagaatt tttagittaaa ataattic cqa ctaaggaaac tdttagagat 428O caataagaga gagatatctg. taaggitttitt gtaacatttgaaaattgatt aatgtaaccc 434 O acattagtaa caaagtaa.ca aagaaacaaa atcatcc cag tagggcagaa aaacacgata 44 OO cittaacggtgaaagcttgca toat citctica taatataata gtggtgaaat aataatttitt 4 460

agagaatggit tagalagaca alactalaggca ggaagcc caa gaaataattt tdtgaaaagg 40920 tgaaatttaa gctgataatt aattgaagga taacaagaga gttagcaaag atcaaaggga 40980 agat caagat aaatcCaggc atgitatgtat gtatatataa attacgcatg tatacatata 41040 tgttgttgtaat atatatacat atatatgcac atcatcc cat citgggcc titc atatatatgt 41100 atatgtgitat aatatataca tatatatgcg cataggtgtg tatagtatat a catatatgt 41160 gcqcatagat gtgtatagta tatacatata totgcacata tatatgcaca tatatgtgta 41220 taataagtac acatatatat gcacatatgt gtgtataata tatacatata tatgcacat a 41280 tgttgtgcata tatata cata totgcacata tdtgttgtaat atata catat atgcacatat 41340 atgtgttgtaa tatata cata tatgcacaaa tdtgt attitt atatgcacgt atgtg tatta 41400 tatata cata tatgcacata totgtgtata atatata cat atatgcacat atgtgtgitat 41460 aatatataca tatatgcaca tatgtgtgta taatatatac acatatatgc acatatgtgt 41520 gtataatata tacacatata togcacatatgtgttgtataat atatata cat atatatgcac 41580 atatatgtgt ataatatata tacatatata tdcacatata totgitatgat atatata cat 41640 atatatgaag ggccagagtgaat cacctag atttittctgg toggcc tttac catgagaaat 41700 agcattataa atgggctgag cagcatgtga cacccagttgtc.ttittcttg totgtct coa 41760 cagttgaggc tigcacaagtt aaatatttaa cittcttggitt titt cagotgt gttccagt ca 41820 agagatgtac agagaggttt atctgtgctt titcct tccta catcc tittitt citctitt cagg 41880 gaatgtataa ggaaagticag gagct attgt totcgt atg atggcagtat aaaaa.ca.gct 41940 aaagaaatca tagagaggitt gag cctgaca t ct acaaact gctggacaaa taccalat agc 42000 cacct acttg tat ctatagt ttittggcatg tagaataaaa tot catt citt taagctattg 42060 t cittgttgggt tttittgcttg Ctttgttgcag citcaaag cat ccctaactgg taaagt citcc 42120 aaaaaattct titt citcgt.ct c cc attctgt gtctgg tact cacatgaggg tattactgac 42180 cataggtgga ccc.cgattag gttatgacaa goagagtaat t citat ct cot togctgcagtt 42240 cittagat cag atatgagaac ttaat cagtt ctdggcaatc agg to atgta gattagaact 42300 tccatt catt to atggcaat gttcatgaga atagaattag ggcttctggc tictdaagttt 42360 gtaccactitt ggcatttaga gttat citcag aaaaatgitat 
aattitttitta aaaatticago. 42420 ttgttattta taa.gc.cagtt ttgttatttg citcaagaaat catactaata atggtggtgc 42480 tittctggggt togaagggg aaagaaaggc ticagaac cag gagagagagg alaggt at Cag 42540 ggcago Cctg. taggcaatgg taa.gcaggca gattgtattt aaagagtaaa tigaalaccac 42600 taacgactitg cagact catc taattgacat taggcttitta aaatattgcc citcct tagta 42660 tact cagaat gaattgagaa gggaaagcat Caaagttgag agt ctgctaa gagatgaaga 42720 tgatgtagac atgatgaagg agggtatatt ttggct caa ttgaggaatg gaggatggat 42780 aggtaaggga catggaagat tagatctgga ttct Caggitt to aggcttga gcact cqgtg 42840 aatagtgtga tittttitttitt ttittgagaca gag to tcggit ctdttgtcca ggctggagtg 42900 tagtggcaca at catagotc attgcago: ct tacctic cta ggctcaagtg atcatcc cat 42960 cticago: ct co caagtagttg agactataga agcacac cat cacacctggc taattitttgt 43020 atttitttgta aaggcggagt ct cac catgt tdcc.caggct ggit ct cqaac toctdggct c 43080 aagcgatcct cagcct Coca aagtgctggg attatagatg gtgagcc acc gcacctggcc 43140 ataagtgttga tittgatgaaa tigagaaggg aggtgaaaaa Caggittittgg atgaaaa.ca.g. 43200 taaagagttc atacaaacac toagtgacat gtc.ctaaaag aaatatgagg ttcacaatta 43260 ttaaagatgc ctagotcaag atagagaatc atago cctdc actggagcaa cc catttatc. 43320 cagagtgaaa gcacagagta act agaag.cg gat attctgg gaalactalaga Cattaccacg 43380 tgtag tattg aaggaaaagc tigctacggag act aaaaaca gtacctggitt aagagataga 43440 aagtaa.gc.ca ggagagtgat agagatgaga atcaaaatag cagcatttca agc.calaagga 43500 agtggc caat agtgtcaaac actgttgagt tattagaagt atttgagggg tittatttgca 43560 tittagtagga t ctittgctga taagagaagg gaataaagga gattaagttcaaaggcatga 43620 cgcgtgttca ccctt cactic aggtgagaga taatggtaac tittaact agg gaatgaataa. 
43 680 tgaagatgga gattaattga aaaattgaga aataattggg ggitta cattg C caaaaatgg 4.374 O atgattgatgaaatgctaga aataaaaa.ca gggaggaatc aggtt tatgg C caggtttct 438OO gacatgcaca attttgtgtg tcgitat cagt tactgagttt gtgagaaaag agaaag.caga 43.860 tttacgtggg aggaggatga gttcagttitt atatattittg agtttaacgt aaatgcc agg 4392 O catctaaa.ca gagatgtc.ca tttgattagg gataaatgca taagaaaaga tigcagattaa. 4398O aatgtcatga accitatggat giggaagggat ggatttgcaa agg tatt ct c togctt cacct 4404 O gagcagttta ggcaggacag acact cittct gcttaatcto agacacttac accagctatic 44100 cacacttgat cittagccaaa aggcc.gagaa goaatacacic agctatoctic agg tact tac 4416 O attact ttitt gttcct aaag gcatatgagt ttgggaatct cagagtga gagggaagag 4422O gtgcaggatg gag cattgag gagaacaaat attacatgga aaa.gcagaaa gataact ct c 4428O aaataat atc aagaalacagt gggaagagat taggaataag tagatt tag agaatgtagt 4434 O tittagaaatg C caaaggagg gattggtcag ttgttaaatt tagttgaggc atgaa.gcaat 444 OO aaaataacta agaagtgttt actgaaatta ct cataaaga gattattitt c attittatgaa. 4446 O gaacaattitc aatgggatag ttgtcaacag aaaccaaact t cagogaatt aag tatggag 44520 caggatgtga atataaatga tigtatatatt caatggttga tiggagagata ccagtattga 4458O agacatggcg agatctatat tataaaatgg agttactata caggattggg aatgcatcgt 44 64 O c cataggaat gagacagaag tatgaaatga citgattgatg tatacctgtt gtatctgtgg 447OO cagaaagttg atggtgct to tattitt.ccca gaggagtgtc agggaaagtic aaaatttaag 4476 O acagagaagg aaagtgatga gagagaaaga cagtc.ccaga tigtgtcc cat agaatggaga 44 82O aggcagggga tict tcc cagg agaat citctic atgggag act c cagoagata ttagaaaatt 4488O taatttaccg atatgtacaa gogtaccacca citgcatttct tatttgttcc acaaatgcaa. 4494 O gactgtctica gtatatt cat catat ctdta at cittaagaa aaaccacatg at catgtcaa 45 OOO tgcatgcagg aaaggcatct gacaaaattic aactic ccatt cataacaaaa got ct cago a 45060 atctaggcac agaaaagagc attaccaa.cc tdgtaaag.ca cattataaaa gaaacaacaa 4512 O ctac tact at agittaa catt gct tagtgtg tittaatgacc aaaaactgga tigct tcc ct c 4518O taagattgga goggaaggg ta gag tatgctg. 
tcc actic tita t cact cottt to cacttggt 45240 gatgaaagtic ctagdcagtt Caataagaca ggaaaaggaa gtaaaatgct tacaggctga 45300 aaatgaagaa ataaagctac ttctatttgc agatggcata attgttctatg tagagaatct 45360 caaataatgt ccaaaaaacc atacctgaat taagaagaga ctittagcaat gtcacaagat 45420 acgggg.tcaa cacacaaaac caattacatt totatat acc agcaataact cittggaaaca 45480 gaaatttalaa catttaaaac toagtaccat ttataataac toaaaaatac titatgaatac 45540 atacatcaaa acatatagga t ct ct attitt gaaaagctta taaag cact g attagaaaat 45600 caaaaaatac ctaaataaat ggagagaaat at catgttca tagat cagaa gacticaa.cat 45660 ggtaaacaga t caaacagac atgtaggatt catgcaattt ttatcaaaat cccagcagtt 45720 tatctggaat tdt cittgatt ttggcaccag aagtic cc act ttctaggaat coccitctgtg 45780 ggatgtgaaa aaccocaaat ttittggc.cat gagtaaagaa gattggagaa aaaac tagaa. 45840 aacccatatg gcatcaccca aacaagggct g tatgcattt tactgccaaa toggagacagc 45900 acat attatc tdtttcttgt aattgctgtc actgtttittt toctdaccac taatgcgitat 45960 aaccacgatt togcagttcac agtgat cagt gaattactgt gagctgcaaa togtgaatca 46020 ttctaact ct tdtgacittaa atatgtaaat gaag catgtc gtaat catga gtgtttgtct 46080 gtatttgact ttagctgtgg attaactgtt c tactittgaa totaattttgt gctagttcag 46140 tttitta actt tacaaacctt gagaccatat tittctaataa titcagatagt aaaaacacaa 46200 acaattacaa taccaatgca gcaaggcc.ca gaaggctaaa tdattgttgtt attittaatgg 46260 tacatgaagg acacagacaa citg tattaca aaggtaagta aacaaaacag agcatattgc 46320 acaataggca gaaaaataat gtggggotgg gtatggtaga ggaggttaca tdatctgttgt 46380 gactittgcta gggctg.ccgt aacaaagtac catagattgg gtggcttaag caacaaaaat 4644. 
O citat ct cotc acagtt atgg aggttggaag ticc cagat.ca aggtgtcagt gggttggitt C 465OO cittctggggg cagtgagaac atgatctgtt cotggtc.t.ct ttgcttggct tdtagatggt 4656 O gcagatgact gtc.ttcttitt tdtgtc.ttitt cattatcatc cct ctdtgtg aagactaaat 46 620 tttaccattt aaggatgata taa.gcacgta attctaaaag gaacaaaagt ttcttitt ct c 46 68O tttittcttitt cittittcttitt atttctgtta tttitttggat ttttggtctic ctaaacaaac 46740 actgatgttc agttgaaaat ggcagocact gaattacctt tagtatacca aacaaaccag 46800 caca catcat tatat cattt tattgatttic tatttgaaaa tdagtaaagt tacattacct 46860 ttaaaattat t cqaac attc agtgacatat cotacaagag atatgaggitt cacagttaat 4692 O aaagatgcct agct caagac agagaatcat agc cctocac toggagcaa.cc catttatcca 46980 gaaagtgcag agtaactaga agtggatatt ctdgaaaact aaaac attgt attagttittg 4704 O gtatacaata caaaccagca cacat catta tat catttta ttgattitatgtta acctaca 47100 agttgcattgaaaatgtc.tt toaacaaaca aaatgggaaa ttittgataat agata cattg 4716 O gttctttaca gtgtagagct gactctgaca agt ct tactg. t caat catgc ticcict acaat 4722 O acagcaagtg atgcgtcaaa taatgataac caaaaaaaaa atgcact coa cattttagac 4.728O atgtttattt gaaaaatgga gctittaaatt atcttittggit ttctatogaaa cittitt catta 4734 O aaccacagaa alacatgaaac aaaagattat taa catc.ttt to caaatctgaac tagaatt 474OO tgct catcta tatgcatat c toggcagacag cacaaatgta aatttgc.cag act coattca 4746 O gtctatgaac ttctitat caa agaaaagata ttacctacta aatgcct cac acacatttaa. 4752O tatagaactg ctaaaaaggg gcc tdgtgtg citt acttgtg attittaaggc titt cataatt 4758O aaaatttitt c accacttitt.c agttittctta aaa catacag aaacaagaat cataactitcg 4764 O gctt tatgga aatggaagga gatagcatcc ttacaccitat gcc cacaaga cagcttgcat 477OO tgcggc.ca.gc cqtagaaaag ataccaaaat gttagcc tigc cataaaatca tdttitt caga 4776 O gtatgaaaga agaagaatgt tot ctaatct gaaagcaaat taaggatgag aataaagaga 4782O aggggagaaa aatgcaa.cag aagtgaatat gcttttitt co caaaactgtt ggtgat ctitt 4788 O gaagaggit ca tatggagcct agaaaatgat aagctggctg. 
catttgagtt acgtgatgtt 47940 gtgttctggit togcaacaaaa actaatacag caaaaac agg atgaacaaaa accct catgt 48000 tittagggaaa tdatactatt to agaacacg agaaaagg to atcagaaaag atcagctaag 48060 ttaaatagaa ctittct Ctga gatggagt ct ggctctgtca ccCaggctgg agtgcagtgg 48120 tgcgatctica gct cactgca accitctgcct c ccgggttca agc catcctic ctdcct cagc 48180 citcc tdagta gctagg actg cagg.cgtgca ccatcatgcc toggctaattt ttg tatttitt 48240 agtagagatggggttt Cacc atgtgggit ca ggctggit ctt galacacctga cct Caagcaa. 48300 tctgcc tacg tdagcct c co aaagtgctag gattacaggc gtgaaccacc acaccaggcc 48360 tgttittaaac agaattitt ct caatttcttt ttagaaattig taaattattt agaatacaaa. 48420 tittgattitca caacttcaaa ttacctctgt gctittgaagc cattitt catg acaaagaggg 48480 ttaact tatgatagdatcca atacactitat gaatgttcat aaatcatgga citttitttaca 48540 tgtcagcago ctatatgatg gat ct ctaga tigcaaatgat ct cattaa.ca aacagatagt 48600 citacgaaaat aaccctittaa atacaaagtg agtggtgttt ttittgaaagc tiggacatgaa 48660 tittggtcaaa ttcaaaactic togctgctgct ggtaagtaaa atcctaaata t cittatgtcc 48720 aaac actictt tttgtaaa.ca tatttagcta tdtttittaca toagacittac cactggaatc. 48780 aatgtaatgt gigacittgatg agaacagagc agcaagt caa agtgaattat atgtttgact 48840 gtact caatt titat caccac ataaaataaa agaaagatat catgaaggct gtaggcagta 48900 tagagaaata t tactaaaaa gogaaacagaa gaagaaaaaa tatatatatic ccactgtatic 48960 actggacaga aataaaaatgtcatt cittac ttittaaattgaat attagaa tat cotatag 49020 t catttittaa tttacattct c ct cotaaaa gtcatatgat tacat attitt aagaataact 49080 gaatatagcc tacaatatat aag tatgcaa ttgggaatta aaataaattig citgitaacaag 49140 aaatataaaa cattgttata tttitt catat at attacttg titt attaatc ctato attaa 49200 ttac tactaa ttagcactgt taattagt ct ttgttttgtg taaaaaatgt caggaggctg. 
4926 O aggcaa.gagg at Cactggag gcc aggggitt Caagcc.ca.gc ctaggcaa.ca tagtgagacc 4932O c catct ctac aaaaaattitt aaaattaact aagtgtggtg gcacatctitt gtagt cc cag 4938O Ctactic caga ggctgaggtg ggcagatcat gtgagcctgg gaggttgagg atgcagtgac 49440 c catgatcga gctgctgtact coagcctgg togacagagtg agaacct gtc. tctaaaataa. 495OO atatataaat aaataaataa atgcagttcg tdtaa cataa aaataagtga tatagaataa. 4956 O tagatattitt caaagaaacc tictatttitat atgttatatt aaagtaataa totgtataat 4962O tattatatgt tacatt atta tdatttatt c tdtctgggitt aactictaaaa agttggccac 4968O cittagatata gacaagctga ttctaaaatt aatattgaaa agcaaaggaa citagaac agc 4974 O taaagaaaaa ataacttgta aaaagtgaat taagttaaaa aagtgtgctic taccalattitt 498OO aaggcttaag gcacaattica gcaat caaga cagtgg tatt tag cagaggg ataga cacat 4986 O agat cactgg agcagaatag ataacticaga attagaacca cacaagtaca gccaactgat 4992 O ttittgacaaa gotgcaaaag taattcaatig gaaggatago cittittcaa.ca aatgatgttg 4998O gagcaattag acat cago at gcaccaacaa acc cc caaac Cttcaacata aacco cacac 5004 O tt catacaaa aataaattica aaatggatta cagct ctaaa toggaaaatgt gaatctataa 501OO aacttittaaa agaaaacaca ggggggaaat tt cataaaa tigtgttaga tigcagagat c 5016 O ttaggacacic aaaag cataa tocaccaaag aaagaacgga t caatttgac citcaacaaga 5022 O ttaaaagcta ttatt citctic aaaga cactg giggitttittitt togttgtttitt tttitttittgg 5028O tttgtttctt tttctttittg agacgg.cgtc. tcgct citgtt gcc caggct g gag togctgtg 5034 O gcacaatcto ggct cactgc aagct cogcc ticc caggttc acaccattct c ct gcct cag 504 OO cctic ccaagt agctgggact acaggcgc.cc gccaccacgc ccctictaatt ttittgcatct 50460

attctittgga taaaaggttt ggaaaggact tctgggttgg aagctaaaag titccacacta 55320 at catalagat aaatatggat aaag.catalaa attat cagac agcaaagagc cittgttgcta 55380 ctittaatc cc ttctggatgt gatgcagaaa taggaatcta ct cac catgg atggctic cat 55440 tgaggaaatg gcc.gaact at taactaattt at attgactg cqtgttattt agcatgacag 55500 attagaatta caaggagt ct gaaggaagtt gaaatccacc togccaaatca citcct tcagg 55560 gctitt Cacta actgcataag aggagcacac caatggctga aag cagggct ggggaggtga 55620 agagt catgg actictt tact Caacgcaagg tagtggctag gggaaggct g c catgataca 55680 ggacgctgag aaagaccttg tagggtgcac agatgct acc talagtgcaag gtggcagcCC 55740 gtagggattt tdtgcagaaa aagagaactgaagcaatcct cocagggtgc acaaaagctt 55800 CCttgagtgc acatcaatgg gcc.gacaaag act agaggca gagaattagg aagaagaat a 55860 atct cotaag gctacaggac aagcc aggaa aagagctgga gagtaaagaa aact citc cat 55920 tgaacaataa caacaacaac aacaaaacta gcaagtgggit ttaaagaaaa aaaatacaag 55980 Caggagaaag gagaggacag caaaacacag agacagagag attcc ctata t t cataaaac 56040 ataaaacaag cc.gctgggg.c taatgcatac agagatat cit gaagttct ct c tdgtottaaa 56100 tittgaagctic cqcttaaagg aatgtct tcc tict caccctic aaactatgtgaaattct cag 56160 tgtggagctgaatctaagta tacttgcaaa aaaatat citt atctaactica accqcg tatt 56220 agattgtttic agt cccacgt attattgg to tdatagaaga aagcttgcac tttittctggg 56280 agtaaatat c actittctitca gttct c tattt gttcatacac attat ct cot gataaaaaat 56340 attaagaaat a tagga agag gcagaaaaat gtgaagcatgaacaa.gcaag aaaacaacca 56400 gtagaagaag tottagaggit aatct caatg tdtgaattgg caaggaaaaa ctittaaaaga 56460 actatogataa atatgtttaa gitatctggtg caaaagg tag acaagataca tdagcaaatg 56520 aggattacgg caaaaagata act acaaaaa gogacaaacag aaatgttatt aaagaaaaat 56580 acaa catcaa aatgactaat t cattcaata ggctt catag cagactggat acago agcaa 56640 gagaga caat Cagggalacat gaaggcagaa taaaag cat C caaattgaa acacalaagga 56700 agagagagag aaagagtgaa agc actgaag agc catgagg caatttaaaa atatagt cat 56760 alacacatttg taatcggagg tacaggacaa taggagagaa aataagc gag 
aagaatt Ctt 56820 tgcagtgata atggc.catgg attttgtaaa aatggtgaaa totattatcc tacct accca 56880 atacgcticag tdaatcc.ca.g. tcaggataaa tacatagaaa citcgc.gctitt togct tattitt 56940 tgtc.caactg ctgaaaatca aagataaaat cittaaaagta gctgaatggg agagaacatt 57000 acaatacagt gaaacaaaga ataatgatgt cittattgcaa goaga caatg acaaaac at c 57060 tttgttcatt ttcaaattta ttt tacttitt gaaattgaca gataaaattig tatatgttta 57120 t cct acacag catactgttt tdttittgcag tatatgtcta catacgtaga taacactgtg 57180 gaatagittaa atctagotaa taaagaaatg cattatctoa catagittatc attitttgttga 57240 tgagaacagt taatat coac to cqttaa.cc atttittcaag aaaacaatat at caccatta 57300 actgtagt ca ccattttgta caatagat ct cittgaactta titcct catat ctaactgtaa 57360 atatgtat cotttgaccaac at atc cctaa ccctic cc titt cittctagoca citctago ct c 57420 tgataatcac cattct actic tict acttcta tdtgatc.ttt ttaaaattic c acatgagtga 57480 aat catacag tatttgtc.tt totgtgtctg cctaatcc.ca tittaacgcaa tot cotccag 57540 attcatc.cat gttgctgcaa at cacaggac titcct tcctt titatggctga at agtatto a 57600 gctgttgtata taalacacatt ttatttatcc attitatic cat tdatggacac ttaggttgat 57660

cittgttatat agtaagtgct cagtaaattt togcttaatga gaaggtatat acaatgtaca 60120 cacaattatgtttatattag tittaatatat ttgatataat atgctttaca tat attacat 60180 gcaaat attt atataattaa tattaaacac acacatatat ataatgcagg tagt cct cat 60240 tittgcgcaat agtgttgttga citgcaactica tdcgtgttitt gctittgcaga agat citcagc 60300 taccacagaa ttgttgcaaag taagttacac ttitt cotgtg togtgtaaatgtcagttaaat 60360 attgttttgt ttaaagcaag gactt catag acatagatag atagatgata gatagataca 60420 tagatggata cacagataga tacatagata t citt catatg tatatataca tataattata 60480 taaaagctica citggct attt aaagtaaaag gaatagotac ctattgtggg gttittaaagc 60540 aaatatagat gtaaaataaa togacaatgag alacacaa.cag ttgtaagaag tittaactgga 60600 attata catt gtaaagct ct tattaatgaa gtagatagta tttittgaggit agacago act 60660 aaattaaata totatatt at ataatgtgta accattgaaa aaacaaataa gatataa acc 60720 aaaa.gc.caat gcagaagata aaatggaata ctaaaatact taattagcca taaatgtgaa 60780 acaaatgata acacacagat aaaagaagta ccaatacagt caacttatac totacccitat 60840 cagtaattac gittaaatgtt coatgaaaaa tdgaaga cat ttctaaaata gatttaaaaa 60900 ttataacctic agtgcgatga ttaaattitat gtgtgaattt ggcc catgca gtgcc cagat 60960 atttggittaa acattattitt gtgtgtttct gaaaggatat titcagatgcc taatgtc.ctt 61020 ttaactgaga cat caggtgt titt cotgtct ttggattgga actgacatat tdacttittitt 61080 atggittatga gcc togc.ca.gc ctittggactt aaact acacic ct cagcc titc caagttctta 61140 gct Cttgggc acatactgga act aaaccag tiggctgtcct aagattic cag Cttgctgagt 61200 citcacggtgt agattittggg acttgc.ca.gc ctic cataatc acatgagcca atticcittata 61260 atgtc.tttct citc cct ttct tt catatata tatatataga gagagagaga gagtttggca 61320 ggttgatagt taaaaggatg caaataaa.ca gagga cataa ttacaaagct taagaaagct 61380 ggggtggctic tictag tatt agacaaagta gacttcaaaa galaggagtat gag cagaact 61440 aagggagata ttttgtgaga atgaagaacc aatt cactaa aaaga catag toacacatgt 61500 aatgtg tatg acctaataaa agagataa.ca ctaagttaaa ttgccaaact aaggagggaa 61560 aatagataag totacaactg tdtttggaaa totta acata 
cittctittgag tdattgatag 61620 aaggaccaga ttittaaaaat cagtaaac at actgaagatg togaac agitat gttcaac caa 61680 c tag acctag gcgacattta tagatatat c ticact acago agaatgcatt ttcttittcaa 61740 gtgct cacag aatattittaa acagacticaa togctgggcat taalaccaggit totaatataa 61800 aaattatt.ca agc.cgaatcg ggcatact ct tat cataatg g tattatatt cataattaat 61860 attattaaaa tttgtataaa at atcctaat gtttggaaat taalaccaaaa aataaaaa.ca 61920 gcc catgagt caaagaagaa at cacagtaa aaattatatt ttgaaataac toggc.cggg.cg 61980 cggtggctica Catctgtaat C cago actitt gggaggc.cga cqcagatgga t cacgaggit c 62040 aagagat.cga gaccatcc td gccaa.catgg tdaaaccctg. t ct citactaa aaatacaaaa 62100 attagctggg catggtggcg ggtgcctgta gtcc.ca.gcta Ctcaggaggc tigaggcaaga 62160 gaat cacttg aaccctggag gtggaggitta Cagtgagcca agattgcgcc actgcactcc 62220 agcc toggcaa cagagtgaga citctgtctica aaaataaata aataaataaa taaataaatt 62280 attaaaaaat aacaaaccat aatgtgtaag atgctgctat aatagggcta agagata act 62340 ttatagctitt aaatat citgt act acaaaag agaatatatt taaaatcaat gacataagct 62400 tata cattaa gaatgcaaaa aac aggatta aaataaaaag aaact agaaa aatgtacaga 62460