US 20080085284A1
(19) United States
(12) Patent Application Publication    (10) Pub. No.: US 2008/0085284 A1
Patell et al.    (43) Pub. Date: Apr. 10, 2008

(54) CONSTRUCTION OF A COMPARATIVE DATABASE AND IDENTIFICATION OF VIRULENCE FACTORS COMPARISON OF POLYMORPHIC REGIONS IN CLINICAL ISOLATES OF INFECTIOUS ORGANISMS

(76) Inventors: Villoo Morawala Patell, Bangalore (IN); K.R. Rajyashri, Bangalore (IN); Marc Rodrigue, Marcy (FR); Guy Vernet, Marcy (FR)

Correspondence Address:
SALIWANCHIK LLOYD & SALIWANCHIK
A PROFESSIONAL ASSOCIATION
PO BOX 142950
GAINESVILLE, FL 32614-2950 (US)

(21) Appl. No.: 11/632,108

(22) Filed: Apr. 9, 2007

Publication Classification

(51) Int. Cl.
A61K 31/70 (2006.01)
A61K 39/00 (2006.01)
A61P 43/00 (2006.01)
C07H 21/04 (2006.01)
C12Q 1/68 (2006.01)

(52) U.S. Cl. 424/184.1; 435/6; 514/44; 536/23.7; 536/24.33; 707/102

(57) ABSTRACT

The present invention is directed to novel nucleotide sequences to be used for diagnosis, identification of the strain, typing of the strain and giving orientation to its potential degree of virulence, infectivity and/or latency for all infectious diseases more particularly tuberculosis. The present invention also includes method for the identification and selection of polymorphisms associated with the virulence and/or infectivity in infectious diseases more particularly in tuberculosis by a comparative genomic analysis of the sequences of different clinical isolates/strains of infectious organisms. The regions of polymorphisms, can also act as potential drug targets and vaccine targets. More particularly, the invention also relates to identifying virulence factors of M. tuberculosis strains and other infectious organisms to be included in a diagnostic DNA chip allowing identification of the strain, typing of the strain and finally giving orientation to its potential degree of virulence. Although the present invention has been illustrated with specific reference to the polymorphic region in the Mycobacterium tuberculosis, the said invention is not to be understood and construed as being limited to Tuberculosis but is applicable to all infectious diseases.

Identification of Single Nucleotide Polymorphisms (SNPs) in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG

A total of 1829 SNPs have been identified in the three genomes. Of these, 1825 SNPs are identical in H37Rv and CDC1551, with a different nucleotide in BCG. 1579 of these are in ORFs, while the rest (246) are in non-coding regions. The SNPs in the ORFs are categorized into synonymous and non-synonymous SNPs. The latter are further categorized on the basis of the resulting change in the primary structure of the protein: conservative for no change and non-conservative for a changed primary structure of the encoded protein.

Figure 1: Entity Relationship Model

SNP SEQ SNP Ref pos {PK} SNPid {FK} Ref annotation id {FK} Annotation id {FK} Ref base Query pos Ref aa Query base In cdc1551 Sequence id In h37rv Is nsSNP

annotation SNP analysis Id{PK} ref pos function Organism BCG annotation amino Version ref base class Accession no BCGAA start query name Gene end query pos tag query annotation Product query base Se Protein id query_aa Username EC number is nsSNP Password DBXref qryorf DB xref GOA boworf Type is non cons Strand is iden base Gene name funcannoid Gene link function note

long poly: Accession no, Ref start, Ref end, BCG annotation, BCG orf, Query name, Query start, Query end, Query annotation, Query orf, Funcanno id

indels: Accession no, Ref start, Ref end, BCG annotation, BCG orf, Query name, Query start, Query end, Query annotation, Query orf, Funcanno id
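The long poly and indels entities of Figure 1 carry the same field list, so they map naturally onto two relational tables with identical columns. The following is a minimal sketch using Python's built-in sqlite3 module. The table names, the snake_case column names, the column types and the sample row are all assumptions made for illustration; the figure gives only the field names.

```python
import sqlite3

# Field list shared by the "long poly" and "indels" entities in Figure 1.
# Column types are assumed; the figure lists names only.
FIELDS = [
    ("accession_no", "TEXT"),
    ("ref_start", "INTEGER"),
    ("ref_end", "INTEGER"),
    ("bcg_annotation", "TEXT"),
    ("bcg_orf", "TEXT"),
    ("query_name", "TEXT"),
    ("query_start", "INTEGER"),
    ("query_end", "INTEGER"),
    ("query_annotation", "TEXT"),
    ("query_orf", "TEXT"),
    ("funcanno_id", "INTEGER"),
]


def create_tables(conn):
    """Create one table per entity, both with the shared column list."""
    columns = ", ".join(f"{name} {sqltype}" for name, sqltype in FIELDS)
    for table in ("long_poly", "indels"):
        conn.execute(f"CREATE TABLE {table} ({columns})")


conn = sqlite3.connect(":memory:")
create_tables(conn)

# Hypothetical row, for illustration only (not data from the application).
conn.execute(
    "INSERT INTO indels (accession_no, ref_start, ref_end, query_name) "
    "VALUES (?, ?, ?, ?)",
    ("ACC_PLACEHOLDER", 100, 102, "H37Rv"),
)

# Length of each indel on the reference, using inclusive coordinates.
rows = conn.execute(
    "SELECT query_name, ref_end - ref_start + 1 FROM indels"
).fetchall()
```

Keeping the two entities as separate tables mirrors the figure; a single table with a type column would be an equally plausible design.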

Figure 2: Identification of Single Nucleotide Polymorphisms (SNPs) in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG

A total of 1829 SNPs have been identified in the three genomes. Of these, 1825 SNPs are identical in H37Rv and CDC1551, with a different nucleotide in BCG. 1579 of these are in ORFs, while the rest (246) are in non-coding regions. The SNPs in the ORFs are categorized into synonymous and non-synonymous SNPs. The latter are further categorized on the basis of the resulting change in the primary structure of the protein: conservative for no change and non-conservative for a changed primary structure of the encoded protein.
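The first level of this categorization, synonymous versus non-synonymous, can be sketched by translating the reference codon and the substituted codon and comparing the amino acids. This is a minimal illustration, not the method used to build the database: the function name is hypothetical, the codon table is the standard one (whose codon-to-amino-acid assignments are shared by the bacterial translation table), and reporting a change to a stop codon separately as truncating is an assumption added here.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_snp(ref_codon: str, pos_in_codon: int, query_base: str) -> str:
    """Classify a single-base substitution inside a codon.

    pos_in_codon is 0-based (0, 1 or 2).
    """
    query_codon = (
        ref_codon[:pos_in_codon] + query_base + ref_codon[pos_in_codon + 1:]
    )
    ref_aa = CODON_TABLE[ref_codon.upper()]
    query_aa = CODON_TABLE[query_codon.upper()]
    if ref_aa == query_aa:
        return "synonymous"
    if query_aa == "*":
        # Assumption: a new stop codon is flagged separately.
        return "non-synonymous (truncating)"
    return "non-synonymous"


# GCT -> GCC: both codons encode alanine.
silent = classify_snp("GCT", 2, "C")
# GAT -> GTT: aspartate becomes valine.
missense = classify_snp("GAT", 1, "T")
```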

Fig 3: Identification of indels in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG

A total of 794 indels have been identified in the three genomes. Of these, 237 are present in both H37Rv and CDC1551 with respect to BCG: 178 lie within ORFs and 59 lie outside ORFs.

Fig 4: Identification of long polymorphisms in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG

136 polymorphisms are present in the three genomes, 30 of them identical in CDC1551 and H37Rv. Of these 30 polymorphisms, 22 lie within ORFs while 8 lie outside ORFs.

Figure 5: Display showing a 10 kb region of the BCG genome with three types of annotations: BCG ORFs, SNPs in H37Rv, and SNPs in CDC1551. The color codes mark the following categories: synonymous; non-synonymous or truncated protein; indel or long polymorphism; alignment of H37Rv and CDC1551; polymorphism in the coding region of BCG only; polymorphism in the non-coding region of both BCG and query.

[Genome browser view: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), annotated genes, and coding regions labeled as conserved hypothetical or hypothetical proteins.]

Figure 6 Comparative genomics browser displaying BCG in the upper panel and H37Rv in the bottom panel.

* The segments labeled MUM-* are the perfect matches generated by the MUMmer tool, and the vertical lines show the alignment of the MUM segments in both genomes. The color coding of the ORF's is used to indicate the length of the ORF. This is very helpful to researchers because if an ORF in H37 aligns with an ORF in BCG but they have different colors, then there is a mutation that makes them have different lengths (see for example the genes in the MUM-1280 region).
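The color comparison described above amounts to checking whether two aligned ORFs have the same length. A minimal sketch of that check follows; the class, the function and all coordinates are hypothetical and serve only to illustrate the idea, not the browser's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def length_mismatches(aligned_pairs):
    """Return the aligned ORF pairs whose lengths differ."""
    return [(a, b) for a, b in aligned_pairs if a.length != b.length]


# Hypothetical coordinates, for illustration only.
pairs = [
    # Same length in both genomes: would be drawn in the same color.
    (Orf("bcg_orf_1", 1000, 1899), Orf("h37rv_orf_1", 5000, 5899)),
    # Shorter in H37Rv: would be drawn in different colors.
    (Orf("bcg_orf_2", 2000, 2599), Orf("h37rv_orf_2", 6000, 6299)),
]
flagged = length_mismatches(pairs)
```

Each flagged pair is a candidate for a mutation that changes the ORF length, such as an indel or a premature stop codon.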

[Browser display: BCG genes and ORFs in the upper panel, with indel and SNP tracks for H37Rv and the MUM segment labels.]

Fig 7.1: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 593)

FORWARD PRIMER  REVERSE PRIMER  ORF
1375153
1373397  YES
1373907  1373629..1373649  1374,080.1374O1  YES

Upper Primer: 24-mer 5' GCCGACGCTGCTTGGATGATGAT
Lower Primer: 5' CASCGTTGCCCCGTTGGTAT 3'
DNA 250 pM, Salt 50 mM
Product Tm (%GC Method): 85.3 °C

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.2: PRIMER designed to amplify the polymorphism (Polymorphism ID: 639)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER
BCG  1476918  1476917  1477153  1476701..1476719  1477154..1477177
1478424  1477169  1478659  1478207..1478225  1478660..1478683
147888.  1477626  147916  1478664..1478682

Upper Primer: 19-mer 5' CGGCAGGTTCGTGGTCTCG
Lower Primer: 24-mer 5' GTGGGCGGGTGTAATGTTGAAGG
DNA 250 pM, Salt 50 mM
Product Length: 477 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes including alkA.]
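The mer count and base composition of a listed primer can be checked directly from its sequence. The sketch below uses the 19-mer upper primer of Fig 7.2; the function name is illustrative, and the melting temperature and stability figures in the primer report come from the primer-design software and are not reproduced by this calculation.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


# Upper primer of Fig 7.2, as printed in the figure.
upper_primer = "CGGCAGGTTCGTGGTCTCG"

primer_length = len(upper_primer)      # number of bases (the "mer" count)
primer_gc = gc_content(upper_primer)   # percent GC
```

The computed length agrees with the 19-mer label and with the forward primer span 1476701..1476719 in the table.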

Fig 7.3: PRIMER designed to amplify the polymorphisms (Polymorphism ID: [illegible])

FORWARD PRIMER  REVERSE PRIMER  ORF
2042298  2041737
TRUNCATED
YES  2052686

Upper Primer: 20-mer 5' GCACCGCCGCAGACCACCAA
Lower Primer: 20-mer 5' GC:CGCGCT&AGCCTGACC
DNA 250 pM, Salt 50 mM
Product Length: 426 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes including PPE33.]

Fig 7.4: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 880)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
BCG  2043142  2042301  2043149  2042978..2043002  2043297..2043320  YES
CDC  2050107  2048024  2050108  2049944..2049967  2050262..2050285  TRUNCATED
H37  2052685  2051280  2052686  2052524..2052545  2052840..2052863  TRUNCATED

Upper Primer: 24-mer 5' CTACAAAAAACCGGACGCAGTG
Lower Primer: 5' CAGAAGCCTTATACCATG
DNA 250 pM, Salt 50 mM
Product Length: 343 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]
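The product length reported in Fig 7.4 is consistent with the span from the first base of the forward primer to the last base of the reverse primer on the BCG coordinates. A minimal sketch of that arithmetic follows, assuming 1-based inclusive coordinates; the figure does not state the convention, and the function name is illustrative.

```python
def product_length(forward, reverse):
    """Amplicon length from (start, end) primer coordinates.

    Assumes 1-based inclusive positions on the same coordinate axis,
    with the forward primer upstream of the reverse primer.
    """
    fwd_start, _ = forward
    _, rev_end = reverse
    return rev_end - fwd_start + 1


bcg_forward = (2042978, 2043002)  # Fig 7.4, BCG forward primer
bcg_reverse = (2043297, 2043320)  # Fig 7.4, BCG reverse primer

bcg_product = product_length(bcg_forward, bcg_reverse)  # 343, as in the figure
```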

Fig 7.5: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 1230)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
BCG  2963874  2963854  2965845  2963556..2963576
CDC  3002180  3002178  3004151  3001862..3001882  TRUNCATED
TRUNCATED

Upper Primer: 21-mer 5' ACCCGGCCCGCACTCGTTCA
Lower Primer: [illegible]
DNA 250 pM, Salt 50 mM
Product Tm (%GC Method): 86.4 °C

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.6: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 1348)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER
BCG  3239912  3239908  3241650  3239747..3239765
CDC  3277659  3279397  YES
H37  3283566..3283585

Upper Primer: 19-mer 5' GCGGCGCC&C&TCACATCC 3'
Lower Primer: 20-mer 5' GCCTGCAACGCGCCGAGA8A. 3'
DNA 250 pM, Salt 50 mM
Product Length: 415 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.7: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 1351)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
3242680  3241643  3244405  3242424..3242446  3242893..3242916
CDC  3280427  3279390  3282152  3280171..3280193  3280640..3280663  YES
H37  3286104  3285067  3287829  3285848..3285870  3286317..3286340  YES

Upper Primer: 23-mer 5' TCGGGGACACGATTGGGGAACC
Lower Primer: 5' ACTTGTGCGCCGCCTTGCCTTGTC
Product Length: 493 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.8: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 1561)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
BCG  3736194  3735141  3738380  3736038..3736059  3736493..3736510  YES
3774786  3773733  3777029  3774630..3774651  3775085..3775103  YES
3782550  3781497  3784736  3782394..3782415  3782849..3782867  YES

Upper Primer: 22-mer 5' GAACTCAGTGCGTGGCTCTCG
Lower Primer: 5' GTTC:GTCGGGTGT
DNA 250 pM, Salt 50 mM

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.9: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 228)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER
456342  455438  456403  456113..456130  456464..456487
455414  454510  455475  455181..455205  455536..455559
455323  454419  455384  455094..455111  455445..455468

Upper Primer: 21-mer 5' GATGCCGCCGCGCTGGAGGAG
Lower Primer: 5' GTG.G.C.A.G.A.TGG.T.T.
DNA 250 pM, Salt 50 mM
Product Length: 375 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.10: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 231)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER
BCG  467402  464428  467685  467248..467268  467534..467555
466474  463479  466757  466320..466340  466605..466627
466383  463409  466666  466229..466249  466514..466536

Upper Primer: 21-mer 5' AGCGACGCCGGCACCCACCAT
Lower Primer: 23-mer 5' GCCAGTCCCTCGCCAACCAGTCG
DNA 250 pM, Salt 50 mM
Product Length: 318 bp
Product Tm (%GC Method): 84.1 °C

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.11: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 274)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
561876  561867  562730  561646..561668  562047..562069  YES
562305  562296  563159  562075..562097  562476..562498  YES
560855  560846  561709  560625..560647  561026..561248  YES

Upper Primer: 23-mer 5' ACCGCGCCATTGCTCTTCAGTCC
Lower Primer: 5' GGGTCACGCGGTCATTTG
Product Length: 424 bp
Product Tm (%GC Method): 83.3 °C

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.12: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 285)

STRAIN  END  FORWARD PRIMER  REVERSE PRIMER  ORF
592121  591775..591798  592160..592162
592131  591451  592338  591992..592015  592377..592396  YES
590761  590081  590968  590622..590645  591007..591026

Upper Primer: 24-mer 5' CGGGTCCGGCCTATTTCTTTCTGC
Lower Primer: 5' AGGGGACAGTGGGGTTAC

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes including proC.]

Fig 7.13: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 313)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
671788  671091  672395  671522..671543  671958..671977  YES
671996  671290  672603  671730..671751  672166..67285  YES
H37  670543  669846  671150  670271..670298  670713..670732  YES

Upper Primer: 22-mer 5' CGCTTATCGCTGGACGGTGACT
Lower Primer: 20-mer 5' GCCAGCGAGCGCAGCAAATG
DNA 250 pM, Salt 50 mM
Product Length: 456 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.14: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 433)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
H37  940888  940454  941104  940582..940600  940959..940977  YES

Upper Primer: 19-mer 5' TCGCGGGGATGCTTTGACC 3'
Lower Primer: 5' GGCCGGGCCCGGGTT 3'
DNA 250 pM, Salt 50 mM

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.15: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 874)

STRAIN  SNP  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
2037036  2037034  2037297  2036851..2036874  2037365..2037385  TRUNCATED
2044001  2042343  2044295  2043816..2043839  2044330..2044350
2046579  2044921  2046840  2046394..2046417  2046908..2046928

Upper Primer: 24-mer 5' GCCACCGTTGCCGTATAGCCATCC 3'
Lower Primer: 5' GGGTCCGC&AACTGTA&G 3'
DNA 250 pM, Salt 50 mM
Product Length: 535 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.16: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 2074)

STRAIN  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
3042938  3042939  3042688..3042707  3043072..3043092  YES
CDC  3081086  3081111  3080836..3080855  3081244..3081264  YES
H37  3086361  3086386  3086111..3086130  3086519..3086539

Upper Primer: 5' CGTATACCASGTCGGGATTG
Lower Primer: SGTTGGTGGCCTCGGCGTG
DNA 250 pM, Salt 50 mM
Product Length: 439 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.17: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 1874)

STRAIN  START  END  FORWARD PRIMER  REVERSE PRIMER
570941  570942  570701..570723  571209..571226
571370  571411  571120..571152  571694..571711
569920  569961  569680..569702  570244..570261

Upper Primer: 23-mer 5' GTCGCCTCAGGAAACACCACCAG
Lower Primer: 5' GTAGGCGCGAGGAGACC
DNA 250 pM, Salt 50 mM
Product Length: 582 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.18: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 1848)

STRAIN  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
162534..162557  162930..162953  NO
162514..162537  162911..162934  YES
162341..162364  162738..162761

Upper Primer: 24-mer 5' CGACGTGCGCCTGTACTTGGACTG 3'
Lower Primer: 5' ATCTTGTGGGGGTTCGTG.A. 3'
DNA 250 pM, Salt 50 mM
Product Length: 421 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.19: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 1840)

STRAIN  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
BCG  138784  138786  138588..138609  139089..139009  YES
CDC  138737  138738  138541..138562  139041..139060  YES
139048..139067

Upper Primer: 22-mer 5' GACTCGGCGGTGGCGGGAAGA 3'
Lower Primer: 5' &GCGGATTCACGATTC 3'
DNA 250 pM, Salt 50 mM
Product Length: 520 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.20: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 2044)

STRAIN  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
2661152  2661153  2660966..2660986  2661378..2661397
2690410  2690420  2690224..2690244  2690639..2690658
2693078  2693088  2692892..2692912  2693307..2693326  YES

Upper Primer: 21-mer 5' GGCCGCCCCGACAACAGAGAT
Lower Primer: 5' CCGCCGATCCGA&GA 3'
DNA 250 pM, Salt 50 mM
Product Length: 435 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.21: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 2296)

STRAIN  START  END  FORWARD PRIMER  REVERSE PRIMER
BCG
CDC  1414273  1414435  1414070..1414093  1414463..1414486
1414785  1414947  1414582..1414605  1414975..1414998

Upper Primer: 24-mer 5' GGCGCCCAACACGAGGACGAGCAC 3'
Lower Primer: 5' AGACCACCGCCGACATCCTG 3'
DNA 250 pM, Salt 50 mM

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.22: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 2318)

STRAIN  START  END  FORWARD PRIMER  REVERSE PRIMER
BCG  3204423  3204457  3204313..320434
CDC  3242170  3242204  3242060..3242081
3247847  3247881  3247737..3247756  3248204..3248224

Upper Primer: 5' CCCACTACCGAGGCGTTGATTG
Lower Primer: 5' AGGGTCCCGAEGGCTGGTGTAT
Product Length: 488 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.23: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 2341)

STRAIN  START  END  FORWARD PRIMER  REVERSE PRIMER  ORF
BCG  1619416  1619437  161914.1.1619106  1619609..1619631  NO
1622970  1622972  1622705..1622725  1623143..1623166
1623086  1623088  1622821..1622840  1623259..1623282

Upper Primer: 21-mer 5' TGCGGGTGACATCGGAAGGTG
Lower Primer: 24-mer 5' CGGGTCATCGAGGAGCTGGAGGTC
DNA 250 pM, Salt 50 mM
Product Length: 462 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.24: PRIMER designed to amplify the polymorphisms (Polymorphism ID: 2297)

STRAIN  START  END  FORWARD PRIMER  REVERSE PRIMER
1466961  1466963  1466693..1466711  1467170..1467189
1468455  1468468  1468187..1468205  1468675..1468694
468912  1468925  1468644..1468662  1469132..1469150

Upper Primer: 19-mer 5' CGTGGTGCGGCGTTCTTAC
Lower Primer: 5' GGCTGGCGGTGGCGTGTTCG
DNA 250 pM, Salt 50 mM
Product Length: 508 bp

[Genome browser view of the region: SNP, indel and alignment tracks for H37Rv (NC_000962) and CDC1551 (NC_002755), with annotated genes.]

Fig 7.25: Primers designed to amplify the polymorphism (Polymorphism ID: 2310)

STRAIN | START | END | FORWARD PRIMER | REVERSE PRIMER
[label not legible] | 2279216 | 2280345 | 2278971..2278994 | 2279525..2279548
[label not legible] | 2297635 | 2298764 | 2297396..2297413 | 2297944..2297967 | YES

Upper Primer: 5' CGCGGCACGAGTCA... (partly legible)
Lower Primer: 5' TATCGGTTCGCATCCTGTC 3' (partly legible)
Product Length 512 bp. Annotated gene: pks12.

[Primer-analysis readout and genome-browser panel (SNPs and indels in H37Rv and in CDC1551, alignment with H37Rv NC_000962, alignment with CDC1551 NC_002755): values not legible.]

CONSTRUCTION OF A COMPARATIVE DATABASE AND IDENTIFICATION OF VIRULENCE FACTORS COMPARISON OF POLYMORPHIC REGIONS IN CLINICAL ISOLATES OF INFECTIOUS ORGANISMS

FIELD OF INVENTION

[0001] The present invention is directed to novel nucleotide sequences to be used for diagnosis, identification of the strain, typing of the strain and giving orientation to its potential degree of virulence, infectivity and/or latency for all infectious diseases including tuberculosis. The present invention also includes a method for the identification and selection of polymorphisms associated with virulence and/or infectivity in infectious diseases by a comparative genomic analysis of the sequences of different clinical isolates/strains of infectious organisms. The regions of polymorphism can also act as potential drug targets and vaccine targets. More particularly, the invention also relates to identifying virulence factors of M. tuberculosis strains and other infectious organisms to be included in a diagnostic DNA chip allowing identification of the strain, typing of the strain and finally giving orientation to its potential degree of virulence.

[0002] Although the present invention has been illustrated with specific reference to the polymorphic region in Mycobacterium tuberculosis, the said invention is not to be understood and construed as being limited to tuberculosis but is applicable to all infectious diseases.

BACKGROUND OF THE INVENTION

[0003] Microbial pathogens use a variety of complex strategies to subvert host cellular functions to ensure their multiplication and survival. Some pathogens that have co-evolved or have had a long-standing association with their hosts utilize finely tuned host-specific strategies to establish a pathogenic relationship.

[0004] During infection, pathogens encounter different conditions, and respond by expressing virulence factors that are appropriate for the particular environment, host, or both.

[0005] Although antibiotics have been effective tools in treating infectious disease, the emergence of drug resistant pathogens is becoming problematic in the clinical setting. New antibiotic or antipathogenic molecules are therefore needed to combat such drug resistant pathogens. Accordingly, there is a need in the art for screening methods aimed not only at identifying and characterizing potential antipathogenic agents, but also at identifying and characterizing the virulence factors that enable pathogens to infect and debilitate their hosts.

[0006] The mycobacteria are rod-shaped, acid-fast, aerobic bacilli that do not form spores. Several species of mycobacteria are pathogenic to humans and/or animals, and factors associated with their virulence. Tuberculosis is a worldwide health problem, which causes approximately 3 million deaths each year, yet little is known about the molecular basis of tuberculosis pathogenesis. The disease is caused by infection with Mycobacterium tuberculosis; tubercle bacilli are inhaled and then ingested by alveolar macrophages. As is the case with most pathogens, infection with M. tuberculosis does not always result in disease. The infection is often arrested by developing cell-mediated immunity (CMI), resulting in the formation of microscopic lesions, or tubercles, in the lung. If CMI does not limit the spread of M. tuberculosis, caseous necrosis, bronchial wall erosion, and pulmonary cavitations may occur. The factors that determine whether infection with M. tuberculosis results in disease are not well understood.

[0007] The tuberculosis complex is a group of four mycobacterial species that are so closely related genetically that it has been proposed that they be combined into a single species. Three important members of the complex are Mycobacterium tuberculosis, the major cause of human tuberculosis; Mycobacterium africanum, a major cause of human tuberculosis in some populations; and Mycobacterium bovis, the cause of bovine tuberculosis. None of these mycobacteria is restricted to being pathogenic for a single host species. For example, M. bovis causes tuberculosis in a wide range of animals including humans, in which it causes a disease that is clinically indistinguishable from that caused by M. tuberculosis. Human tuberculosis is a major cause of mortality throughout the world, particularly in less developed countries. It accounts for approximately eight million new cases of clinical disease and three million deaths each year. Bovine tuberculosis, as well as causing a small percentage of these human cases, is a major cause of animal suffering and large economic costs in the animal industries.

[0008] Antibiotic treatment of tuberculosis is very expensive and requires prolonged administration of a combination of several anti-tuberculosis drugs. Treatment with single antibiotics is not advisable as tuberculosis organisms can develop resistance to the therapeutic levels of all antibiotics that are effective against them. Strains of M. tuberculosis that are resistant to one or more anti-tuberculosis drugs are becoming more frequent, and treatment of patients infected with such strains is expensive and difficult. In a small but increasing percentage of human tuberculosis cases the tuberculosis organisms have become resistant to the two most useful antibiotics, isoniazid and rifampicin. Treatment of these patients presents extreme difficulty and in practice is often unsuccessful. In the current situation there is clearly an urgent need to develop new methods for detecting virulent strains of mycobacteria and to develop tuberculosis therapies.

[0009] There is a recognized vaccine for tuberculosis, which is an attenuated form of M. bovis known as BCG. This is very widely used but it provides incomplete protection. The development of BCG was completed in 1921, but the reason for its avirulence was and has continued to remain unknown. Methods of attenuating tuberculosis strains to produce a vaccine in a more rational way have been investigated but have not been successful for a variety of reasons. However, in view of the evidence that dead M. bovis BCG was less effective in conferring immunity than live BCG, there exists a need for attenuated strains of mycobacteria that can be used in the preparation of vaccines.

[0010] A variety of compounds have been proposed as virulence factors for tuberculosis but, despite numerous investigations, good evidence to support these proposals is lacking. Nevertheless, the discovery of a virulence factor or factors for tuberculosis is very important and is an active area of current research. Such a discovery would not only enable the possible development of a new generation of

tuberculosis vaccines but might also provide a target for the design or discovery of new or improved anti-tuberculosis drugs or therapies.

[0011] Present methods for the identification and characterization of mycobacteria in samples from human and animal diseases are Ziehl-Neelsen staining, in vitro and in vivo culture, biochemical testing and serological typing. These methods are generally slow and do not readily discriminate between closely related mycobacterial strains and species, for example Mycobacterium paratuberculosis and Mycobacterium avium. Mycobacteria are widespread in the environment, and rapid methods do not exist for the identification of specific pathogenic strains from amongst the many environmental strains, which are generally non-pathogenic. Difficulties with existing methods of mycobacterial identification and characterization have increased relevance for the analysis of microbial isolates from Crohn's disease (Regional Ileitis) in humans and Johne's disease in animals (particularly cattle, sheep and goats) as well as for M. avium strains from AIDS patients with mycobacterial superinfections. Although the causative agents of human leprosy and tuberculosis are clearly recognized, clinico-pathological forms of each disease exist, such as the tuberculoid form of leprosy, in which mycobacterial tissue abundance is low and identification correspondingly difficult. Improvements in the specific recognition and characterization of mycobacteria may also increase in relevance if current evidence linking diseases such as rheumatoid arthritis to mycobacterial antigens is substantiated. Emerging drug resistance in mycobacteria, including M. avium isolates from AIDS patients and Mycobacterium tuberculosis from TB patients, is an increasing problem.

[0012] There is no data or technical information in the prior art that permits the specific selection of potential new targets and protective antigens for new drugs and vaccine compositions to treat and prevent infectious diseases, particularly tuberculosis. Furthermore, there is a need for the development of new tools for the selection of genes which encode essential proteins or regulatory nucleotidic sequences in the survival or infection of mycobacterium species and which are useful for the design of anti-tuberculosis drugs and vaccines based on the knowledge of comparative mycobacterial genomics.

[0013] A method of using DNA probes for the precise identification of mycobacteria and discrimination between closely related mycobacterial strains and species by genotype characterization is essential. The method of genotypic analysis is further applicable to the rapid identification of phenotypic properties such as drug resistance and pathogenicity.

[0014] The invention aids in fulfilling these needs in the art. The method according to the invention has the advantage of drastically reducing the number of potential new targets and protective antigens by giving for the first time an exhaustive description of conserved SNPs in tuberculosis. The isolated polynucleotides described in the present invention, which are highly conserved in genomic sequences of both virulent and avirulent strains, are by this characteristic essential for the survival or the virulence of these mycobacteria in the host. The identification of antigens and potentially therapeutic targets has been made by a method of comparative genomic analysis.

PRIOR ART

[0015] Patent application WO 02074903 describes a method of selection of purified nucleotidic sequences or polynucleotides encoding proteins or parts of proteins carrying at least an essential function for the survival or the virulence of mycobacterium species by a comparative genomic analysis of the sequence of the genome of M. tuberculosis aligned on the genome sequence of M. leprae. M. tuberculosis and M. leprae marker polypeptides, nucleotides encoding the polypeptides, and methods for using the nucleotides and the encoded polypeptides are disclosed.

[0016] U.S. Pat. No. 6,228,575 provides oligonucleotide-based arrays and methods for speciating and phenotyping organisms, for example, using oligonucleotide sequences based on the Mycobacterium tuberculosis rpoB gene. The groups or species to which an organism belongs may be determined by comparing hybridization patterns of target nucleic acid from the organism to hybridization patterns in a database.

[0017] Patent application No. WO9954487 and U.S. Pat. No. 6,492,506 describe a method for isolating a polynucleotide of interest that is present or is expressed in a genome of a first mycobacterium strain and that is absent or altered in a genome of a second mycobacterium strain which is different from the first mycobacterium strain, using a bacterial artificial chromosome (BAC) vector. That invention further relates to a polynucleotide isolated by this method and the recombinant BAC vector used in this method. In addition, it comprises a method and kit for detecting the presence of mycobacteria in a biological sample.

[0018] U.S. Pat. No. 5,783,386 describes polynucleotides associated with virulence in mycobacteria, and particularly a fragment of DNA isolated from M. bovis that contains a region encoding a putative sigma factor. Also provided are methods for a DNA sequence or sequences associated with virulence determinants in mycobacteria, and particularly in M. tuberculosis and M. bovis. In addition, the invention provides a method for producing strains with altered virulence or other properties, which can themselves be used to identify and manipulate individual genes.

[0019] U.S. Pat. No. 5,955,077 relates to novel antigens from mycobacteria capable of evoking early (within 4 days) immunological responses from T-helper cells in the form of gamma-interferon release in memory immune animals after rechallenge infection with mycobacteria of the tuberculosis complex. The antigens of the invention are believed useful especially in vaccines, but also in diagnostic compositions, especially for diagnosing infection with virulent mycobacteria. Also disclosed are nucleic acid fragments encoding the antigens as well as methods of immunizing animals/humans and methods of diagnosing tuberculosis.

[0020] U.S. Pat. No. 6,596,281 describes two genes for proteins of M. tuberculosis that have been sequenced. The DNAs and their encoded polypeptides can be used for immunoassays and vaccines. Cocktails of at least three purified recombinant antigens, and cocktails of at least three DNAs encoding them, can be used for improved assays and vaccines for bacterial pathogens and parasites.

[0021] U.S. Pat. No. 5,700,683 provides specific genetic deletions that result in an avirulent phenotype of a mycobacterium. These deletions may be used as phenotypic markers providing a means for distinguishing between disease-producing and non-disease producing mycobacteria.

[0022] U.S. Pat. No. 5,225,324 relates to a family of DNA insertion sequences (ISMY) of mycobacterial origin and other DNA probes which may be used as probes in assay methods for the identification of mycobacteria and the differentiation between closely related mycobacterial strains and species. The use of ISMY, and of proteins and peptides encoded by ISMY, in vaccines, pharmaceutical preparations and diagnostic test kits is also disclosed.

[0023] WO0066157 patent application provides for polypeptides encoded by open reading frames present in the genome of Mycobacterium tuberculosis but absent from the genome of BCG, and diagnostic and prophylactic methodologies using these polypeptides.

[0024] U.S. Pat. No. 6,458,366 discloses compounds and methods for diagnosing tuberculosis. The compounds provided include polypeptides that contain at least one antigenic portion of one or more M. tuberculosis proteins, and DNA sequences encoding such polypeptides. Diagnostic kits containing such polypeptides or DNA sequences and a suitable detection reagent may be used for the detection of M. tuberculosis infection in patients and biological samples. Antibodies directed against such polypeptides are also provided.

[0025] S. T. Cole has sequenced the complete genome of the best-characterized strain of Mycobacterium tuberculosis, H37Rv. The sequence has been analyzed in order to improve our understanding of the biology of this slow-growing pathogen and to help the conception of new prophylactic and therapeutic interventions. Nature 393, 537-544 (1998).

[0026] A multicomponent analysis to determine the association of polymorphism to the degree of virulence and infectivity is in progress. These polymorphisms constitute a set of putative virulence markers that are being validated in 120 clinical isolates of tuberculosis. The study results in a set of virulence markers, which could be used in predicting the degree of virulence and infectivity of Mycobacterium infections.

[0027] There is no data or technical information in the prior art that permits the specific selection of potential new targets and protective antigens for new drugs and vaccine compositions to treat and prevent infectious diseases including mycobacterial diseases, particularly tuberculosis and leprosy.

SUMMARY OF THE INVENTION

[0028] The object of the present invention is to identify genes which encode essential proteins or regulatory nucleotidic sequences in the survival or infection of mycobacterium species, as also all infectious diseases, and which could be useful for the design of drugs and vaccines based on the knowledge of comparative genomics.

[0029] Yet another object of the present invention is to provide for the identification of strains including mycobacterium in disease samples, for the specific recognition of pathogenic strains, for precisely distinguishing closely related strains including mycobacterial strains and for defining virulence and resistance patterns.

[0030] The method according to the invention has the advantage of drastically reducing the number of potential new targets and protective antigens by giving for the first time an exhaustive description of conserved SNPs in different M. tuberculosis strains, which cause tuberculosis. The isolated polynucleotides described in the present invention, which are highly conserved in genomic sequences of virulent strains, are essential for the survival or the virulence of these strains, in particular mycobacteria, in the host. The identification of antigens and potentially therapeutic targets has been made by a method of comparative genomic analysis.

[0031] The invention is directed to identifying virulence factors in M. tuberculosis and other infectious diseases, using both strands of DNA, RNA and/or proteins associated with the virulence factors, allowing identification of the strain, typing of the strain and finally giving orientation to its potential degree of virulence, infectivity and/or latency.

[0032] Accordingly this invention provides nucleotide sequences for diagnosis, identification of the strain, typing of the strain and giving orientation to its potential degree of virulence, infectivity and/or latency of all infectious diseases, having SEQ ID Nos. 1 to 2531.

[0033] The invention is further directed to a method comprising aligning the genomic sequences of different mycobacteria species to

[0034] a. Select a polynucleotide sequence that is highly conserved amongst the virulent strains and corresponds to an essential gene for the survival or the virulence of mycobacterium species

[0035] b. Select polymorphisms between virulent and avirulent strains to identify genes and regions conferring virulence to the former strains

[0036] c. And optionally, test the polynucleotide selected for its capacity of virulence or involvement in the survival of a mycobacterium species, said testing being based on the activation or inactivation of said polynucleotide in a bacterial host, or on the activity of the product of expression of said polynucleotide in vivo or in vitro.

[0037] The invention further comprises identification of the following polymorphisms, having potential to be used as reagents and in diagnostics, drug and vaccine development for infectious diseases:

[0038] i. Identical nucleotide in virulent strains/species, but a different nucleotide in avirulent strains/species at the same position

[0039] ii. Some of the virulent strains differ in the nucleotide sequence at specific positions and share the nucleotide sequence with that of avirulent strains.

[0051] The invention relates to the identification and analysis of non-synonymous SNPs to predict conservative and non-conservative amino acid substitutions. The effect of the substitution on the function of the proteins encoded provides a powerful insight in predicting SNPs correlating with virulence and infectivity in infectious diseases, for example M. tuberculosis.

[0052] The invention further relates to proteins, RNA, DNA and metabolites encoded by the region carrying the polymorphisms in tuberculosis and other infectious disease causing organisms, which can be utilized for developing drugs and vaccines effective against tuberculosis and other infectious diseases, and which play an important role in gene therapy, RNAi technology and imaging.

[0053] The invention is also directed to a process for the production of recombinant polypeptides and chimeric polypeptides comprising them, antibodies generated against these polypeptides, immunogenic or vaccine compositions comprising at least one polypeptide useful as a protective antigen or capable of inducing a protective response in vivo or in vitro against mycobacterium infections, immunotherapeutic compositions comprising at least such a polypeptide according to the invention, and the use of such nucleic acids and polypeptides in diagnostic methods, vaccines, kits, or antimicrobial therapy.

[0054] SEQ ID Nos. 1 to 1829 are single nucleotide polymorphisms.

[0055] SEQ ID Nos. 1830 to 2286 are insertions/deletions (indels).

[0056] SEQ ID Nos. 2287 to 2531 are regions of long polymorphism.

[0057] The present invention also includes primer sequences for amplifying the region around the polymorphisms of SEQ ID Nos. 1 to 2531.

[0058] The nucleotide sequences flanking the polymorphisms of SEQ ID Nos. 1 to 2531 to a length of 35 nucleotides on either side are used in reagents and in diagnostics, drug development, RNAi gene therapy and other such technologies.

[0059] SEQ ID Nos. 1 to 2531 are used as targets for drug design using bioinformatics and other tools, drug development, gene therapy and vaccine development. This invention also includes the use of proteins, RNA, DNA and metabolites encoded by the region carrying the polymorphisms having SEQ ID Nos. 1 to 2531 for RNAi technology and antisense technologies.

[0060] This invention also includes a database for identification and selection of the polymorphisms having SEQ ID Nos. 1 to 2531.

BRIEF DESCRIPTION OF THE FIGURES AND TABLES

[0061] FIG. 1 describes the Entity Relationship Model.

[0062] FIG. 2 illustrates the identification of SNPs in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG. A total of 1829 SNPs have been identified in the three genomes. Of these, 1825 SNPs are identical in H37Rv and CDC1551, with a different nucleotide in BCG. 1579 of these are in ORFs while the rest (246) are in non-coding regions. The SNPs in the ORFs are categorized into synonymous and non-synonymous SNPs. The latter are further categorized on the basis of the change in primary structure of the protein that results: conservative for no change and non-conservative for a changed primary structure of the protein encoded.

[0063] FIG. 3 illustrates the identification of indels in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG. A total of 794 indels have been identified in the three genomes. Of these, 237 are present in both H37Rv and CDC1551 with respect to BCG, 178 in ORFs and 59 outside the ORFs.

[0064] FIG. 4 illustrates the identification of long polymorphisms in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG. 136 polymorphisms are present in the three genomes, 30 of them being identical in CDC1551 and H37Rv. 22 of these polymorphisms are present in the ORFs while 8 are outside the ORFs.

[0065] FIG. 5 shows a region of 10 kb of the BCG genome with three types of annotations: BCG ORFs, SNPs in H37Rv, and SNPs in CDC1551.

[0066] FIG. 6 shows the comparative genomics browser displaying BCG in the upper panel and H37Rv in the bottom panel. The segments labeled MUM-* are the perfect matches generated by the MUMmer tool, and the vertical lines show the alignment of the MUM segments in both genomes. The color coding of the ORFs is used to indicate the length of the ORF. This is very helpful to researchers because if an ORF in H37Rv aligns with an ORF in BCG but they have different colors, then there is a mutation that makes them have different lengths (see for example the genes in the MUM-1280 region).

[0067] FIG. 7.1-7.25 are the primers used for the amplification to encompass the regions of polymorphisms.

[0068] Table 1 gives the list of Single Nucleotide Polymorphisms in Mycobacterium tuberculosis/M. bovis BCG.

[0069] Table 2 gives the list of Insertions/deletions (Indels) in Mycobacterium tuberculosis/M. bovis BCG.

[0070] Table 3 gives the list of long polymorphisms in Mycobacterium tuberculosis/M. bovis BCG.

[0071] Table 4 lists polymorphisms in genes involved in cell wall synthesis.

[0072] Table 5 lists polymorphisms in transcription factors.

[0073] Table 6 lists polymorphisms in genes involved in lipid metabolism.

[0074] Table 7 lists polymorphisms in genes encoding

[0077] The total numbers of sequences retrieved are as follows:

Species name | No. of sequences retrieved
Mycobacterium africanum | 16
Mycobacterium canettii | 3
Mycobacterium microti | 24
Mycobacterium tuberculosis | 1274
Mycobacterium bovis | 183

[0078] The complete genomes of Mycobacterium tuberculosis strains H37Rv (referred to as H37Rv) and CDC1551 (referred to as CDC1551), both of which are virulent and infective to humans, and Mycobacterium bovis BCG (referred to as BCG), which is non-virulent and non-infective in humans, were aligned and a database constructed. The structure of the database is given in FIG. 1.

[0079] Sequences were aligned using the pairwise alignment tool "MUMmer-3.08" (www.tigr.org).

[0080] The use of MUMmer required three distinct steps:

[0081] 1. running MUMmer for each of the target genomes (CDC1551 and H37Rv) against the reference genome (BCG)

[0082] 2. parsing the MUMmer output to produce a list of polymorphisms, and loading these data into a polymorphism database.

[0083] 3. generating feature files for visualization, and loading these features into a feature database.

[0084] BCG was chosen as the reference genome, and the two tuberculosis strains, CDC1551 and H37Rv, were compared against the reference. MUMmer uses fasta files as input and was run using the following command line:

run-mummer1 bovis.fasta cdc1551.fasta BCG-CDC

which takes the format,

program

S perl work mtb'scripts Snp-orf2.pl—query seq=. 0089. This creates three output files: seqs/CDC1551.fasta—user=NAME password= 0090) 1. BCG-CDC.gaps—this is the initial output file PASS that simply lists the location of all exact matches in the two 0.104) To assign the H37Rv ORF's the following com Sequences. mand is run: 0.091 2. BCG-CDC.errorgaps—this is a processed ver S perl scriptSisnp-orf2.pl—query seq=.jseqS. sion of the gaps file. H37RV.fasta—user=NAME password=PASS 0105) To determine whether the CDC1551 SNP's are 0092) 3. BCG-CDC,align this is the fully annotated file synonymous or non-synonymous the following command is that is used to locate all polymorphisms. 0093. Pairwise alignments of BCG-H37RV and BCG S cod work/mtbiscripts CDC1551 was done using the BCG genomic sequence as S perl swork, mtb'scripts/synomous.pl—bcg file=. reference. Results of the alignment identified three types of seqSibovis.fasta—query seq=..seqS. polymorphisms: CDC1551.fasta—user=NAME password=PASS 0106) To determine whether the H37RV SNP's are syn 0094) 1. SNPs—single nucleotide polymorphisms in one onymous or non-synonymous the following command is or more of the sequences aligned. 0.095 2. indels insertion or deletion of one or more S cod ?work/mtbiscripts bases in the sequences aligned. S perli work/mtbiscriptSisynomous.pl—bcg file=. seqSibovis.fasta—bcg file=.jseqSH37RV.fasta— 0.096 3. Long polymorphic regions—regions with user=NAME password=PASS numerous changes in the sequences aligned. 0.107 A set of summary columns are used to coallesce all Inserting the Annotation of the Complete Genomes into the the SNP data in one place. To do this, the following Database command is run: 0097. The gene annotation downloaded from either gen S perl? work/mtbiscriptSfcompare-Smps.pl—user= bank or EMBL is included into the database by running the NAME password=PASS following script 0108). 
$ work/mtb/scripts/annot.pl --seq=filename --db name=NAME -user=NAME password=PASS
filename indicates either the genbank or the EMBL genes annotation file.
Inserting the Data into the DB
[0098] To insert the CDC1551 SNPs into the DB, the following command is run:
$ perl work/mtb/scripts/snp-insert.pl --snp=../mum [...] PASS query acc=NC_002755
[0099] To insert the H37Rv SNPs into the DB, the following command is run:
$ perl work/mtb/scripts/snp-insert.pl --snp=../mummer/BCG-H37.snp -user=NAME password=PASS -query acc=NC_000962
[0100] To determine whether SNPs are synonymous or non-synonymous, it is first determined whether they lie within or outside an open reading frame. All SNPs that lie within an ORF are taken and the amino acid for the codon containing the SNP is determined.
[0101] To determine if the BCG locations lie within ORFs, the following command is run:
$ perl /work/mtb/scripts/snp-orf-ref.pl --ref seq=../seqs/bovis.fasta --user=NAME password=PASS
[0102] All BCG locations within ORFs must have their amino acids determined. To do so, the following command is run:
$ perl work/mtb/scripts/ref-aa.pl --ref seq=../seqs/bovis.fasta --user=NAME password=PASS
[0103] Next, the H37Rv and CDC1551 locations are mapped. To assign the CDC1551 ORFs the following command is run:
To insert data into the SNP analysis table the SNP data from the SNP SEQ SNP and gene ontology tables is fetched and entered into the SNP analysis table. This step also identifies the conservative and non-conservative amino acids.
[0109] To do this, the following program is run:
$ run.sh work/mtb/scripts
[0110] The SNP data in the database is thus complete.
Analysis of SNPs
[0111] The SNPs identified were of two kinds:
[0112] i. Identical nucleotide in CDC1551 and H37Rv, but a different nucleotide in BCG at the same position.
[0113] ii. One of the three sequences is polymorphic; the nucleotides of CDC1551 and H37Rv differ from each other and one of them is identical to the BCG nucleotide at the same position.
[0114] The SNPs thus identified were categorized according to their location in Open Reading Frames. SNPs falling within the ORF of both BCG and H37Rv were identified. The results were validated by determining if the SNPs were present in the ORFs of BCG and CDC1551.
[0115] The SNPs falling in ORFs were further categorized into synonymous and non-synonymous SNPs. A SNP was said to cause a non-synonymous change if:
[0116] 1) It occurs in an ORF
[0117] 2) It occurs in the *same* ORF in the genome it is being compared to.
[0118] In some cases a SNP can be in one ORF in the reference sequence but in another ORF in the comparison sequence, e.g. due to a frame-shift mutation earlier in the sequence.
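The ORF criteria and the codon comparison described above can be sketched as follows. This is a minimal illustration, not the scripts named in the text: the function name `classify_snp`, the label `"excluded"` and the idea that ORFs in both genomes already carry a shared identifier are assumptions made for the example. The labels S, NS and NC follow the values of the Is nsSNP column.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snp(ref_codon, qry_codon, ref_orf, qry_orf):
    """Classify one SNP from the codons that contain it.

    ref_orf / qry_orf are hypothetical shared ORF identifiers,
    or None when the position lies outside any ORF.
    """
    if ref_orf is None and qry_orf is None:
        return "NC"          # non-coding in both genomes
    if ref_orf is None or qry_orf is None or ref_orf != qry_orf:
        return "excluded"    # not in the same ORF: no codon comparison
    ref_aa = CODON_TABLE[ref_codon.upper()]
    qry_aa = CODON_TABLE[qry_codon.upper()]
    return "S" if ref_aa == qry_aa else "NS"
```

For instance, `classify_snp("GCT", "GCC", "orfA", "orfA")` compares two alanine codons and is reported as synonymous, while the same codon pair assigned to two different ORFs is set aside rather than compared.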

[0119] So before SNPs were assigned to the non-synonymous or synonymous groupings, all SNPs which either did not fall in an ORF, or fell into different ORFs on the reference and comparison sequences, were eliminated. The BCG and H37 genomes have been annotated with respect to one another. However, CDC1551 has not been so thoroughly annotated, so it was not possible to immediately assess if an ORF in BCG was the corresponding ORF in CDC. Therefore, a metric was devised to eliminate spurious comparisons.
[0120] The non-synonymous SNPs thus identified were analysed to predict conservative and non-conservative amino acid substitutions. The effect of the substitution on the function of the proteins encoded was predicted. This provides a powerful insight in predicting SNPs correlating with virulence and infectivity in M. tuberculosis.
[0121] Below is an example of the output obtained from the database.

[Screenshot: Internet Explorer view of the SNP analysis table, with columns for gene id, BCG position and ORF, query strain (H37Rv or CDC1551), query position and ORF, and the substitution type (Conservative, Non conservative or null).]

[0122] The above figure describes the SNP details, which are as follows:
[0123] Bovis pos: Bovis position having a SNP.
[0124] Bovis ORF: Yes indicates that the SNP in bovis is in a bovis ORF. No indicates not in ORF.
[0125] Bovis base: Indicates the SNP detail/SNP position in bovis.
[0126] Bovis AA: Displays the bovis amino acid after the codon translation.
[0127] Qry name: Displays the name of a strain, for example H37Rv or microtii.
[0128] Qry pos: Displays the position of a SNP in either CDC1551 or H37Rv with respect to the bovis SNP position.
[0129] Qry ORF: Displays Yes if the SNP falls in the ORF of the query (H37Rv or CDC1551).
[0130] Qry base: Displays the query SNP.
[0131] Qry AA: Displays the amino acid of the query (H37Rv or CDC1551).
[0132] Is nsSNP: Displays whether the SNP is synonymous (S), non-synonymous (NS) or in a non-coding region (NC).
[0133] Conservative subst: Displays homologous substitution in H37Rv and CDC1551.
[0134] Fun annotation: Displays the functional annotation of the query.
[0135] A list of single nucleotide polymorphisms identified in the manner described above is given in Table 1.
[0136] A total of 1829 have been identified in the three genomes. Of these, 1825 SNPs have the same nucleotide in H37Rv and CDC1551, with a different nucleotide in BCG. Of the 1829 SNPs, 1579 are in ORFs while the rest (246) are in non-coding regions. 811 H37Rv SNPs and 810 CDC1551 SNPs are synonymous while 1282 H37Rv and 1219 CDC1551 SNPs are non-synonymous. Out of 1219 CDC1551 nsSNPs, 312 have a conservative amino acid substitution, 888 have a non-conservative substitution and 19 result in truncated proteins. Out of 1282 H37Rv non-synonymous SNPs, 304 have a conservative amino acid substitution, 954 have a non-conservative substitution and 24 result in truncated proteins. (FIG. 2)
Analysis of Indels (Insertions and Deletions):
[0137] Indels are insertions and deletions in the sequence with respect to the BCG sequence. These indels could be of one or more nucleotides. Considering BCG as the reference sequence, the indels in both strains of M. tuberculosis, H37Rv and CDC1551, were identified.
[0138] To insert the indels from the align file of the mummer output into the database, the following java program is run:
$ java /work/mtb/scripts/indel
[0139] To enter functional annotation from the gene ontology database into the indels table, the following program is run:
$ java /work/mtb/scripts/indfunction
[0140] The list of indels identified is given in Table 2.
[0141] A total of 794 indels have been identified in the three genomes. Of these, 237 (H37Rv) and 237 (CDC1551) indels are present in both H37Rv and CDC1551 with respect to BCG. Of these, 178 are in ORF and 59 are outside the ORF. (FIG. 2)
Analysis of Long polymorphs:
[0142] Long polymorphs are insertions or deletions of long stretches of nucleotides with respect to the BCG sequence.
[0143] To insert the long polymorphs from the align file of the mummer output into the database, the following java program is run:
$ java /work/mtb/scripts/indel
[0144] To enter the functional annotation from the gene ontology database into the long polymorph table, the following java program is run:
$ java /work/mtb/scripts/indfunction
[0145] A table listing the long polymorphisms is given in Table 3.
[0146] A total of 136 long polymorphisms have been identified in the three genomes. Of these, 30 (H37Rv) and 30 (CDC1551) indels are present in both H37Rv and CDC1551 with respect to BCG. Of these, 22 are in ORF and 8 are outside the ORF. (FIG. 3)
Functional Annotation of the Polymorphisms Identified
[0147] In order to identify polymorphisms with a putative functional association, a tool was built using the Gene Ontology DB (GO). The EMBL sequence DB has made putative GO assignments to most of the ORFs in the three TB genomes, so a local installation of GO was used together with the EMBL cross reference tables to identify TB polymorphisms based on their putative functional classification.
[0148] The annotation table, consisting of the genbank features of the genes such as coding region, database reference and product information, to name a few, was constructed.
[0149] To insert the gene ontology features such as term definition and name from the gene ontology database into the indels and long polymorph table, the following program is run:
$ java /work/mtb/scripts/indfunctionl
[0150] The following is the list of attributes in the annotation table.
[0151] Accession no: This indicates the accession number of the sequences
[0152] Gene start: This indicates the start of the coding region
[0153] Gene end: This indicates the end of the coding region
[0154] Locus tag:
[0155] db Xref: This indicates the gene indices representation of the gene
[0156] db Xref GOA: This indicates the gene ontology identity of the gene product
[0157] id: This indicates the gene annotation

[0158] type
[0159] strand: This indicates the forward or reverse strand of the sequence that is stored in the genbank
[0160] gene name: This indicates the gene name
[0161] gene link: This provides a hyperlink to the gene features from the genbank
[0162] note: This provides the general information and the protein information of the gene.
[0163] A front-end was constructed as an essential part of the database:
Front End of the Database:
[0164] The front-end displays the results of alignment as follows:
[0165] The annotation table consists of genbank annotation about the genes in bovis, H37Rv and CDC1551. It specifies details including the coding region of a gene and its database reference.
[0166] The annotation id for the SNPs, indels and long polymorphs has been hyperlinked to obtain all the records pertaining to a particular gene.
[0167] The data pertaining to indels and long polymorphs have also been added to the front-end.
Description of the Queries:
[0168] The database is made queryable to retrieve the required features of SNPs, indels and long polymorphs respectively.
[0169] The main options to query the SNP information are:
Select SNPs
[0170] ALL: This displays all the records which satisfy the features below.
[0171] Identical in both queries: This query indicates that SNPs are present in BCG with respect to H37Rv and CDC1551.
[0172] Different bases in both queries: This query indicates different nucleotides in H37Rv and CDC1551.
[0173] Having SNPs in BCG-H37 only: This query specifies SNPs in BCG and H37Rv only and not in CDC1551.
[0174] Having SNPs in BCG-CDC only: This query specifies SNPs in BCG and CDC1551 only and not in H37Rv.
[0175] BCG-H37 SNPs: This query indicates that SNPs are present in H37Rv with respect to the BCG position and may or may not be present in CDC1551 at that particular position.
[0176] BCG-CDC SNPs: This query indicates that SNPs are present in CDC1551 with respect to the BCG position and may or may not be present in H37Rv at that particular position.
[0177] The other options considered are:
[0178] Select BCG ORF: This provides an option to select the presence of BCG SNPs in the BCG ORF or outside the BCG ORF.
[0179] Select query ORF: This provides an option to select the presence of query SNPs in the query ORF or outside the query ORF.
[0180] Select synonymous: This provides an option to select if the SNP is synonymous or non-synonymous.
[0181] Select Conservative: This provides an option to select if the non-synonymous SNP results in a conservative substitution, a non-conservative substitution or a truncated protein.
[0182] Select function: This provides an option to select a required function, which includes cell wall synthesis, Transcription factor, Lipid metabolism, Membrane transport and Surface proteins.
[0183] An example of a query to extract SNP information from the database is shown below.

[Screenshot: query form in Internet Explorer with the options Select SNPs (identical base in both queries), Select BCG ORF (in ORF), Select Query ORF, Select Synonymous, Select Conservative and Select Function.]

[0184] The result obtained from the above query is shown below:

[Screenshot: query form with the selections Select SNPs: identical base in both queries; Select BCG ORF: in ORF; Select Synonymous: Non Synonymous; Select Conservative: Non Conservative; Select Function: cell wall.]
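The query options described above behave like a conjunction of filters over the SNP analysis table. The sketch below illustrates that idea in memory. It is not the front-end's actual implementation: the record type `SnpRecord`, its field spellings, the helper `select_snps` and the sample rows are all made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class SnpRecord:
    # Field names loosely mirror the described columns; spellings are assumed.
    bovis_pos: int
    bovis_orf: bool               # SNP lies in a BCG ORF
    qry_name: str                 # e.g. "H37Rv" or "CDC1551"
    qry_orf: bool                 # SNP lies in a query ORF
    is_ns_snp: str                # "S", "NS" or "NC"
    conservative: Optional[str]   # "Conservative", "Non conservative" or None
    function: Optional[str]       # e.g. "Cell wall synthesis"


def select_snps(records: Iterable[SnpRecord], **criteria) -> List[SnpRecord]:
    """Keep the records matching every criterion whose value is not None."""
    active = {name: value for name, value in criteria.items() if value is not None}
    return [
        rec for rec in records
        if all(getattr(rec, name) == value for name, value in active.items())
    ]


# Made-up rows, for illustration only.
records = [
    SnpRecord(100, True, "H37Rv", True, "NS", "Non conservative", "Cell wall synthesis"),
    SnpRecord(200, True, "CDC1551", True, "S", None, "Lipid metabolism"),
    SnpRecord(300, False, "H37Rv", False, "NC", None, None),
]
hits = select_snps(
    records,
    bovis_orf=True,
    is_ns_snp="NS",
    conservative="Non conservative",
    function="Cell wall synthesis",
)
```

Leaving an option as `None` corresponds to not selecting it in the form, so `select_snps(records)` returns every record, as the ALL option does.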

[0185] The query has been designed in a similar way for both indels and long polymorphs.
[0186] The SNP analysis includes a functional annotation id, which is hyperlinked to the functional annotation of the gene carrying the polymorphism. The functional annotation id consists of one of the Swiss Prot, SPTREMBL or gene ontology ids. Similarly the indels and long polymorphs are also functionally annotated.
[0187] Genes with known involvement in virulence of Mycobacterium tuberculosis can also be accessed from the SNP database query or from the Long polymorphs database query respectively.
[0188] Polymorphisms involved in the following functions have been identified:
[0189] 1. Cell wall synthesis
[0190] 2. Transcription factor
[0191] 3. Lipid metabolism
[0192] 4. Membrane transport
[0193] 5. Surface proteins
[0194] 6. Virulence genes
[0195] One such query for cell wall synthesis function is shown below

Total number of records : 424
[Screenshot: query result table listing BCG and query (H37Rv, CDC1551) SNP positions, ORF status and substitution type; most rows visible are non-synonymous, non-conservative SNPs.]

[0196] The output of the above query is shown below

[Screenshot: query output in Internet Explorer listing non-conservative SNPs in H37Rv and CDC1551 with their positions, ORF status and protein accession identifiers.]

[0197] The polymorphisms detected in genes involved in cell wall synthesis are listed in Table 4.
Visualization Tools
[0198] To increase the utility of the SNP data, two tools to visualize the Tuberculosis SNP data have been created. The first tool was based on the Generic Genome Browser developed at Cold Spring Harbor Lab (CSHL). This visualization tool could show a single TB genome along with any annotations, e.g. SNP locations for all other genomes.
[0199] The details of the browser are as follows:
[0200] The output displays the polymorphs in the region of interest.
[0201] Alternatively the output can be obtained by specifying the region of interest in the text box labeled "landmark or region". In the case of SNPs, the gene start and the gene end have to be specified, and in the case of indels or long polymorphs, the BCG start and BCG end must be specified.
[0202] By clicking the ruler at the region of interest across the genome, the view can be re-centered.
[0203] The display can also be zoomed in or out by selecting the required number of base pairs in the scroll down menu.
[0204] The required features can be displayed by selecting the options in the tracks checkbox as shown in FIG. 4
[0205] The FIG. 4 display shows a region of 10 kb of the BCG genome with three types of annotations: BCG ORFs, SNPs in H37Rv, and SNPs in CDC1551.
[0206] To compare multiple genomes, a second tool based on the Worm Base synteny browser was built. This tool can visualize two TB genomes at one time and was very useful in validating the polymorphisms in the CDC1551 genome as shown in FIG. 5.
[0207] FIG. 5 shows the comparative genomics browser displaying BCG in the upper panel and H37Rv in the bottom panel. The segments labeled MUM-* are the perfect matches generated by the MUMmer tool, and the vertical lines show the alignment of the MUM segments in both genomes. The color coding of the ORFs is used to indicate the length of the ORF. This is very helpful to researchers because if an ORF in H37 aligns with an ORF in BCG but they have different colors, then there is a mutation that makes them have different lengths (see for example the genes in the MUM-1280 region).
[0208] A methodical screening of all the regions of polymorphism identified above in clinical isolates with known disease profiles, to further home in on the polymorphisms associated with virulence and/or infectivity in M. tuberculosis, is in progress.
2. Screening of Regions of Polymorphisms
[0209] A set of five Mycobacterium tuberculosis strains with known virulence is being screened for the polymorphisms identified above.
[0210] Strains chosen: The following strains have been chosen for the study:
[0211] a. H37Rv: a reference laboratory strain known to be infective to mice, but only mildly infective in humans. It has undergone a number of passages in the lab since its isolation. It is the standard used in studies on tuberculosis in different laboratories across the world.
[0212] b. Beijing strain: a clinical isolate with known virulence and infectivity in humans. 70% of the patients with tuberculosis in certain areas of and China are infected with this strain. The strain was isolated from a patient in the Western Indian state of Mumbai.
[0213] c. S.I: a South Indian strain with only mild virulence and infectivity in humans, isolated from a patient residing in the South Indian state of Hyderabad.
[0214] d. N.I.F: fatal North Indian strain isolated from Safderjung hospital, Delhi, where the patient developed pulmonary tuberculosis and died.
[0215] e. N.I.NF: a non-fatal North Indian strain isolated from Safderjung hospital, Delhi. Known clinical progression of disease in the patient.
[0216] Primers have been designed to encompass the regions of polymorphisms. The list of the primers used for the amplification is given in FIG. 6.1-6.25
[0217] Amplification and sequencing of regions around the polymorphisms: DNA from the five strains has been amplified under optimal conditions determined for each primer pair. The amplified fragments have been sequenced and the sequences obtained from different strains compared.
[0218] A few examples are given below:

6 O 70 8O 90 1OO 11O ------BCG ACCGATCTCGCCGCGCAGACAATGGCTGGCTCAGCGGCGATGCTGCTGGAGCGGAT

H37RW ------

CD1551 ------

SI ------

NINF ------

BS ------

NIF ------


- Continued SI ------

8O 90 1OO 110 12O 13 O ...... +------+ ------+ ------+ ------+ ------BCG AGCACACCCCGACGGCGACTGCGACAACTCCAAGCGTGGCCGGTAACGTGATGCCCA

H37Rw ------

BS ------

NTNF ------

SI ------

131 14 O 150 16 O 17O 18O 19 O ------BCG TGATTGTGCGTTCCCTTCCCGCTGCGTTGCGCGCGTGTGCGCGTCTGCAACCCCATGACCCGG

CDC1551 ----G------

H37RW ----G------

BS ----G------

NTNF ----G------

SI ----G------

2OO 210 22 O 23 O 24 O 25 O ...... ------BCG CCTTCACGTTTATGGATTACGAACAGGACTGGGACGGCGTTGCGATAACCCTGACGT

CDC1551 ------

H37Rw ------

BS ------

NTNF ------

SI ------

251 26 O 27 O 28O 29 O 3OO 310 ------BCG GGTCGCAGCTGTATCGGCGAACGCTGAATGTGGCACGGGAGCTGAGCCGTTGTGGTTCCAGGT

CDC1551 ------C-G

H37RW ------C--G

BS ------C--G

NTNF ------C--G

SI ------C--G

32O 33 O 34 O 350 360 37O ...... ------BCG CGCAGCTGTATCGGCGAACGCTGAATG--TGGCACGGGAGCTGAGCCGTTGTGGTTC

CDC1551 - TGAC-CG------T-T-T----CTCCGCA-G-TC------AC-T-C-CCT----

H37RW ---TGAC-CG------T---T-T----CTCCGCA-G-TC------AC-T-C-CCT----

BS ---TGAC-CG------T-T-T----CTCCGCA-G-TC------AC-T-C-CCT----

NTNF ---TGAC-CG------T-T-T----CTCCGCA-G-TC------AC-T-C-CCT----

SI ---TGAC-CG------T-T-T----CTCCGCA-G-TC------AC-T-C-CCT----

[0221] Sequencing of the region from H-3283171 to H-3283585. Two SNPs, one indel and a long polymorphism characterize this region. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NIF: non-lethal North Indian strain. All the polymorphisms occur in fadD28, a virulence gene involved in fatty acid synthesis. They result in a non-conservative substitution and probably have an important role in the degree of virulence imparted to the strain.

BCG TTGGCCCACGTGCTGAACTTGGTGACGTTGGCTGCGGTGACAAACAAGTTCTGATAGGTCGTTGCGCCCGTCGGCCCGAAG

------C------

CDC1551 ------C------

NINF ------A------

SI ------A------

BS ------A------

BCG ATGAGTTGGCCCATGAGTTGGGTGTATTGGGTGCTGAGTGTGGCCAGGCCCTGCAGCAGGGTCGGGATGATGTCGAACG

------

CDC1551 ------

NINF ------

SI ------

BS ------

BCG GAAACTGCGCCGCTGCACTCGAAAGCGCGGTTGTCACCGCATTGGTGCCGCTCGCTAGGGCGGTCGCTTSCCCCGTTGCGG

------G------

CDC1551 ------G------

NINF ------G------

SI ------G------

BS ------G------

[0222] Sequencing of the region from H-2051784 to H-2052209. This region is characterized by a SNP between M. bovis BCG and the tuberculosis strains and a second SNP common to the Asian strains and to BCG, but different from H37Rv and CDC1551. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The SNP common to all the tuberculosis strains results in a conservative substitution in the PPE33b gene and does not affect the function of this gene. However, the A to G substitution results in the truncation of the protein encoded by BCG.

150 16 O 17O 18O 190 2 OO 21 O 22O 23 O 24 O ------BCG CATCGTCGCCGGCGCGGGTCACTGGCGCCGCTCCTCCCCATCGCTTTGCTCTGCATCGTCGCCGGCGCGGGTCACTGGCGCCGCTCCTCCC

H37Rw ------

CDC1551 ------

SI ------CTGGCGCCGCTCCTCCCCATCGCTTTGCTCTGCATCGTCGCCGGCGCGGGTCACTGGCGCCGCTCCTCCC

BS ------CTGGCGCCGCTCCTCCCCATCGCTTTGCTCTGCATCGTCGCCGGCGCGGGTCACTGGCGCCGCTCCTCCC

NINF ------CTGGCGCCGCTCCTCCCCATCGCTTTGCTCTGCATCGTCGCCGGCGCGGGTCACTGGCGCCGCTCCTCCC

241 25 O 26 O 27 O 28O 29 O 3OO ------BCG CATCGCTTTGCTCTCTGCATCGTCGCCGGCGCGGGTCAATCGAAGATGCCCCGTCGCGTGTC

H37RW ------A------

CDC1551 ------H------

- Continued SI CATCGCTTTGCTCTGCATCGTCGCCGGCGCGGGTCA------A-----

BS CATCGCTTTGCTCTGCATCGTCGCCGGCGCGGGTCA------A-----

NINF CATCGCTTTGCTCTGCATCGTCGCCGGCGCGGGTCA------A-----

[0223] Sequencing of the region from H-3006917 to H-3007246. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; M18: non-lethal North Indian strain. This region encloses a long polymorphism of 106 bp inserted into a gene encoding an integral membrane protein in BCG and the Asian strains. This results in a longer integral membrane product in these strains as compared to H37Rv and CDC1551. The SNP also results in the introduction of a stop codon in H37Rv and CDC1551, further reducing the length of the membrane protein encoded by the latter.

4 O SO 60 7 O 8O 9 O 1 OO 11 O 12O ------BCG CTGGGTCAGCAGCGGGTGTGCGCTGATTTCGATGAAGGTGTGGTAGGCGCCGTCGGCGCCGCTACCGGCGGAAGCGATGGC

BS ------

NINF ------

SI ------

H37RW ------

CDC1551 ------C------

121 13 O 14 O 15 O 16 O 17 O 18O 19 O 2OO ...... ------BCG CTGGCTGGAAATGCACGGGGTTGCGCATGTTGGTGGCCCAGTGTTCGGCGTCGAAGACCGGTTGGGTGTGCAAGTCTGCGT

BS ------

NINF ------

SI ------

H37RW ------GC------

CDC1551 ------

2O1 21O 22O 23 O 24 O 25 O 26 O 27 O 28O ...... ------BCG AGGTGGTGGAGATGATTCCGATGGTGGGGGTCCGTGGGGTCAGATCGGCCAGCTCCGAACGCATCGCCGGCTGCAAAGCA

BS ------

NINF ------

SI ------

H37RW ------

CDC1551 ------

281 281 3OO 310 32O 330 34 O 350 360 ...... ------BCG TCCATGGCCGGATTGTGCGGGGCCACTTCGATATTGACCCGGCTGGCGAATCGGTCCCTAGCGCGCACGCGAGTGATCAA

BS ------

NINF ------

SI ------

H37RW ------A-----A-C----TTTGC------C-G-C------

CDC1551 ------A-----A-C----TTTGC------C-G-C------

[0224] Sequencing of the region from H-3247737 to H-3248224. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. All the polymorphisms observed occur in ppsA, the polyketide synthase gene, and are synonymous substitutions. All three Asian strains show identity to BCG in this region.
[0225] Sequencing of the region from H-2052524 to H-2052863. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: lethal North Indian strain. A single nucleotide polymorphism occurring in the proton transport gene PPE33b results in the introduction of a stop codon and hence truncation of the protein in BCG.

1OO 11 O 12O 13 O 14 O 150 16 O ------BCG CGCGGTACACGTGTCGAACGGCGACAAACCCAAGGTTGCCTTGCCCGATACT CAGTTGGGTTCACA

H37RW ------

BS ------

SI ------

NINF ------

CDC1551 ------

NIF ------

17O 18O 19 O 2 OO 21 O 22O 23 O ------BCG CTCAACGTGATTCGAAATCCACACTGATACTGGAGGTGATTACCGGCTGAAGCAAAGCGCATTGG

H37RW ---G------

BS ---G------

SI ---G------

NINF ---G------

CDC1551 ---G------

NIF ---G------

BCG CATCGGCCGAAACGTGAGTAATCTGGGCGGCC ------

CDC1551 ------CGCTCAGCGCCCAGGGCATCGAAGAACA

H37RW ------CGCTCAGCGCCCAGGGCATCGAAGAACA

BS ------CGCTCAGCGCCCAGGGCATCGAAGAACA

NINF ------CGCTCAGCGCCCAGGGCATCGAAGAACA

SI ------CGCTCAGCGCCCAGGGCATCGAAGAACA

250 26 O 27 O 28O 290 ------

BCG ...... GTGGCTCGGGGCGGCCCACACC

CDC1551 AGCCCAGGGTGGCCTTGTC------C------

H37RW AGCCCAGGGTGGCCTTGTC------C------

BS AGCCCAGGGTGGCCTTGTC------C------

NINF AGCCCAGGGTGGCCTTGTC------C------

SI AGCCCAGGGTGGCCTTGTC------C------

[0226] Sequencing of the region from H-1468644 to H-1469150. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. An insertion of 47 bp is seen in all the tuberculosis strains in Mb1346c, a gene with DNA binding activity. A second polymorphism (SNP) is also seen immediately adjacent to the insertion in the same gene. The SNP results in splitting the gene into two genes while there is a single long gene in the M. tuberculosis strains.
[0227] Sequencing of the region from H-455094 to H-455468. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The region is characterized by the occurrence of two indels and two SNPs in a transcription regulator. All the tuberculosis strains appear to be identical in this region while BCG has a different amino-acid sequence in the region.

BCG TGTTGGCTTCATCAGCACCCCGAGGTGTGTATTCAGGCGATCCGGGGCAGCG

CDC1551 ------C----T-----

------C----T-----

NINF ------C----T-----

SI ------C----T-----

BS ------C----T-----

BCG GGGTCGGGGTGACGCGGTTCCGCCCAAAGGTCC - - GTCACCCTGTG

CDC1551 ------AC------

------AC------

NINF ------AC------

SI ------AC------

BS ------AC------

BCG CAGATCGGCTCGGTCCGCTTCGCGATTTACCGCTCGGACTATGTGCAGTCGGTGACGGCTC

CDC1551 ------T------

H37Rw ------T------

BS ------T------

NTNF ------T------

SI ------T------

NIF ------T------

13 O 14 O 15 O 16 O ...... +------+ ------+ ------BCG ------A------

CDC1551 ------A------

H37RW ------A------

BS ------A------

NTNF ------A------

SI ------A------

NIF ------A------

[0228] Sequencing of the region from H-466229 to H-466536. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: lethal North Indian strain. The C to T transition occurs in a gene of unknown function and results in a synonymous substitution. However, the C to A change occurs in a transcription factor (Mb0393) and is a non-conservative substitution resulting in a slightly different protein in BCG.

13 O 14 O 15 O 16 O 17 O 18O 190 2OO ------BCG CCGCCAGGGTTACACCGACGTCGACCAGTTCACACTCGAAAAGTAACCGGACAAAGCGCGCTGGCTACCCA

CDC1551 ------G------

H37RW ------G------

NIF ------G------

NINF ------G------

BS ------G------

SI ------G------

[0229] Sequencing of the region from H-560625 to H-561248. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: lethal North Indian strain. A synonymous SNP occurs in a virulence gene and is identical in all the tuberculosis strains.
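The alignments in these examples list the BCG sequence in full and show each strain as a row of dashes with a letter wherever it differs. Assuming that a dash in a strain row means "same base as BCG at this column", the substituted columns can be extracted with a short sketch like the one below. The function name and the toy rows are hypothetical, and the sketch deliberately ignores indels, where dashes in the BCG row itself mark gaps.

```python
from typing import Dict, List, Tuple


def polymorphic_columns(
    reference: str, rows: Dict[str, str]
) -> List[Tuple[int, str, Dict[str, str]]]:
    """Return (column index, reference base, {strain: base}) for each
    column where at least one strain row shows a letter instead of '-'."""
    found = []
    for i, ref_base in enumerate(reference):
        variants = {
            name: row[i]
            for name, row in rows.items()
            if i < len(row) and row[i] != "-"
        }
        if variants:
            found.append((i, ref_base, variants))
    return found


# Toy alignment, not taken from the sequences shown in this document.
bcg = "ACGTAC"
strains = {"H37Rv": "---G--", "CDC1551": "---G--", "BS": "------"}
snps = polymorphic_columns(bcg, strains)
```

Here `snps` holds a single entry for column 3, where H37Rv and CDC1551 carry G against T in the reference and BS matches the reference.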

15 O 16 O 17 O 18O 19 O 2OO ------BCG GGCCCACGATTTGCAATGGTGACGAGTTGGCTGCCTCGGCGCTGGCGTACTAG

H37RW ------G-

CDC1551 ------G--

BS ------G-

SI ------G-

NINF ------G-

NIF ------G-

21 O 22 O 23 O 24 O 250 ...... ------BCG GCCGCCCCCGCGCTCATGAGCTGGACGAACTGCTCATGGAATGCGACCGC

H37RW ------

CDC1551 ------

BS ------

SI ------

NINF ------

NIF ------

[0230] Sequencing of the region from H-2046394 to H-2046928. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: lethal North Indian strain. The SNP in BCG results in splitting the gene PE-PGRS32 into two parts with the latter being truncated.

4 O SO 6 O 70 8O 90 ------BCG ACGATCATCGGTGGTGGTGGAGCCGGTATGGTAGCTACCGCCACGCGGAAGCTGGT

CDC1551 ------A------

H37RW ------A------

NINF ------A------

SI ------A------

BS ------A------

NIF ------A------

1 OO 11 O 12O 13 O 14 O 150 ------BCG CGGCGGGCGCTTCATGGCGATGACGACCGGACCGGACAGGTCTATGCCGGACGCG

CDC1551 ------

H37RW ------

NINF ------

SI ------

BS ------

NIF ------

151 16 O 17O 18O 190 2OO ------BCG GCGACCGCGGCCACCGGGGTGATAACGGCGTGCACCGGCGCGGTTCTCCCGGGGAA

CDC1551 ------

H37RW ------

NINF ------

SI ------

BS ------

NIF ------

21O 22 O 23 O 24 O 250 26 O ------BCG TACCGGAGCCGCGCCGCCGACCGCACTGGCGAATACCAACGGGGCAATCGCTGC

CDC1551 ------C------

H37RW ------T------

NINF ------C------

SI ------C------

BS ------C------

NIF ------C------

[0231] Sequencing of the region from H-1373629 to H-1374101. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: lethal North Indian strain. The two polymorphisms observed occur in a transcription factor and result in non-conservative substitutions.
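Several of these examples label a substitution as conservative, non-conservative or truncating. The document does not state which rule was used to decide, so the sketch below assumes one common textbook grouping of amino acids by side-chain chemistry: a change within a group counts as conservative, a change between groups as non-conservative, and a change to a stop codon as truncating. The group table and the function name are assumptions for illustration only.

```python
# One possible physico-chemical grouping (an assumption, not the
# rule used by the database described in this document).
GROUPS = {
    "nonpolar": "GAVLIMPFW",
    "polar": "STCYNQ",
    "acidic": "DE",
    "basic": "KRH",
}
AA_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def substitution_type(ref_aa: str, qry_aa: str) -> str:
    """Label an amino acid change using one-letter codes ('*' = stop)."""
    if ref_aa == qry_aa:
        return "Synonymous"
    if qry_aa == "*":
        return "Truncated"
    ref_group = AA_GROUP.get(ref_aa)
    qry_group = AA_GROUP.get(qry_aa)
    if ref_group is not None and ref_group == qry_group:
        return "Conservative"
    return "Non conservative"
```

Under this grouping an aspartate to glutamate change (D to E) is conservative, while aspartate to glycine (D to G) is not. Other groupings or substitution matrices would draw the line differently.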

22 O 23 O 24 O 250 26 O 27 O 28O ------BCG TCTCTCGGTCATTCGTGGTCGCAGGCGCCGCACTCGGTGTCTTCGGGGGGGGGGGGGGGGG

H37RW ------T------

CDC1551 ------T------

SI ------T------

BS ------T------

NINF ------T------

NIF ------T------

29 O 3 OO 310 32O 330 34 O ...... ------BCG GGGGGGGGGGGAAGCGCGACCTCGAAGGCCACTGAAACGCCTTACGGAGACGCGACGAAC

H37RW ------

CDC1551 ------

SI ------

BS ------

NINF ------

NIF ------

[0232] Sequencing of the region from H-1622821 to H-1623282. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: North Indian Fatal. The polymorphisms observed occur in a non-coding region outside the ORF.

15 O 16 O 17 O 18O 19 O 2OO 210 22 O 23 O ------BCG TGTGGCGCGCCTGGCTCAGATAACGCAACGCCGCAGGCGCGCGCCGCACGTCAAAAGTGGTGACCGGCAACGGCCGCAGCA

CDC1551 ------A------

H37RW ------A------

SI ------A------

BS ------A------

NINF ------A------

[0233] Sequencing of the region from H-2295752 to H-2296046. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The polymorphism observed occurs in the pks12 gene and results in a non-conservative substitution.

3 O 4 O SO 60 70 8O 90 ------BCG TGGGCCGCTCTAGATGGGCGCCGCCCCGCGCAGATGCTCGAAGATCAGGGACGTCTGGGTA

H37Rw ------

CDC1551 ---T------

BS ------

SI ------

NINF ------

1 OO 11 O 12O 13 O 14 O 150 ...... ------BCG CCTGCGACGTCGGCGTCGGCATTGAGGTTTTCGACCACGAACGAACGCAGGTCCTCGGTG

H37Rw ------

CDC1551 ------

BS ------

SI ------

NINF ------

        151 160 170 180 190 200
BCG     TCGCGAGCGGCGACGTGCAAGATGAAATCGTCGGCGCC------
H37Rv   ------GGCCAGAAAGTAG
CDC1551 ------GGCCAGAAAGTAG
BS      ------GGCCAGAAAGTAG
SI      ------GGCCAGAAAGTAG
NINF    ------GGCCAGAAAGTAG

        210 220 230 240 250
BCG     ------CTGCCGTTTGCGGCGGATCTGCTGGATGAAGCTGCGGA
H37Rv   ACATCCAT CAC------
CDC1551 ACATCCAT CAC------
BS      ACATCCAT CAC------
SI      ACATCCAT CAC------
NINF    ACATCCAT CAC------

[0234] Sequencing of the region from H-3086111 to H-3086539. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; SI: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The SNP seen in H37Rv occurs in a non-coding region, while the deletion in BCG leads to truncation of the transcription regulator.

        180 190 200 210 220 230 240 250 260 270
BCG     CGGTCGCGGGCGAAGCGTTTGAAGTCCACCGTCGCCAGGCCGCTGGTCATGGCGCTGGCCTGATCCCACAGACCCCAGCCCAGGGAGATGG
H37Rv   ------C------
CDC1551 ------C------
SI      ------C------
NIF     ------C------
NINF    ------C------
BS      ------C------

[0235] Sequencing of the region from H-2295062 to H-2295633. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; SI: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: North Indian Fatal. The SNP observed occurs in the pks12 gene and results in a non-conservative substitution.

        80 90 100 110 120 130 140
BCG     CGGCGAGTACAACGACGCTCGGGTCGATGTCCCGGTCCGATGGCTGCACGGCACCG-AGATC
H37Rv   ------G------
CDC1551 ------G------
BS      ------G------
SI      ------G------
NINF    ------G------

        150 160 170 180 190 200
BCG     CGGTGATCACGCCCGACCTGCTGGACGGCTATGCCGAGCGGGCCAGCGATTTCGAGGTGG
H37Rv   ------
CDC1551 ------
BS      ------
SI      ------
NINF    ------

[0236] Sequencing of the region from H-162341 to H-162761. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; SI: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The deletion in BCG occurs in the region corresponding to a gene with putative enzyme activity and results in a loss of function in BCG.

        120 130 140 150 160 170 180 190 200 210
BCG     CGCCCGCGCCACGACGTCACTACGCACATTCTATTCCGGAGACCCAGGCGAGGCGTCGGGGCGGCACCGTTTGCAGGCCCGGAATCCCTCC
H37Rv   ------C------
CDC1551 ------C------
BS      ------C------
NINF    ------C------
SI      ------C------
NIF     ------C------

        211 220 230 240 250 260 270 280 290 300
BCG     CCCTGAGCGGCCGCCGCAGTCGGCAGGAACCGGACATTGCGCGCGAACGGTGGCCGGACGGGGCAACTCGGCCGGCAGTAGACACCGGTG
H37Rv   ------
CDC1551 ------
BS      ------
NINF    ------
SI      ------
NIF     ------

        301 310 320 330 340 350 360 370 380 390
BCG     GTCAAAACCGCGACGACGAACCAGCCGTCGAACCGGGCGTCTTTGGACTGGACCGCCCGGTAGCAGCGTTCGAAGTCGTCGTGCACCCTT
H37Rv   ------T------
CDC1551 ------T------
BS      ------T------
NINF    ------T------
SI      ------T------
NIF     ------T------

[0237] Sequencing of the region from H-1478664 to H-1479140. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; SI: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: North Indian Fatal. The first T to C transition results in the truncation of the bacterial regulatory protein in BCG.

        170 180 190 200 210 220
BCG     CCACCTCGGTGGTGTTCGCCACCGCCCACTACGCGCTGGTGGATTTGGCCGACGTA
H37Rv   ------CT-CT
CDC1551 ------CT-CT
NINF    ------CT-CT
BS      ------CT-CT
SI      ------CT-CT
NIF     ------CT-CT

        230 240 250 260 270 280
BCG     CAACCGGGCCAGCGCGTGTTGATCCATGCCGGCACCGGCGGGGTGGGCATGGCGG
CDC1551 AGGT------
NINF    AGGT------
BS      AGGT------
SI      AGGT------
NIF     AGGT------

[0238] Sequencing of the region from H-2296260 to H-2296692. Sequences are amplified from different strains. BCG: M. bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; SI: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: North Indian Fatal strain. The long polymorphism observed occurs in the pks12 gene but does not alter the activity of the polyketide synthase enzyme. A total of 2755 polymorphisms including 1779 in ORFs and 313 in regions outside the ORF are being screened for association to virulence and/or infectivity in tuberculosis. A multicomponent analysis to determine the association of polymorphism to the degree of virulence and infectivity is in progress. The polymorphisms which constitute a set of virulence markers are further being validated in 120 clinical isolates of tuberculosis.

[0239] The virulence factors thus identified could be used as:

[0240] i. Diagnostic markers in prediction of disease and its progress in the patient.

[0241] ii. Drug targets for development of new and effective treatments for TB.

[0242] iii. Candidate genes/sequences in DNA vaccine.

[0243] iv. In development of siRNA technology for combating tuberculosis.
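The legends above distinguish substitutions by their effect on the encoded protein, and Table 1 labels each SNP with a type such as "S" or "NS". As a minimal sketch, assuming those labels stand for synonymous and non-synonymous changes (the specification does not expand them here), a single-base change can be classified by translating the reference and variant codons with the standard genetic code. The function name and the example codons are hypothetical; Table 1 gives only bases and amino acids, not codons. The conservative/non-conservative distinction is not reproduced because its criteria are not stated.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(ref_codon: str, alt_codon: str) -> tuple:
    """Return (type, ref_aa, alt_aa); type is 'S' if the amino acid is unchanged, else 'NS'."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "S" if ref_aa == alt_aa else "NS"
    return kind, ref_aa, alt_aa


# Illustrative codons only (assumed, not taken from the specification).
print(classify_snp("CGC", "CAC"))  # G to A change at the second codon position
print(classify_snp("CTC", "CTT"))  # C to T change at the third codon position
```

Applied to a real record, the codon would have to be taken from the annotated reading frame of the ORF that contains the SNP position; SNPs outside an ORF have no codon and cannot be classified this way.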

TABLE 1. List of SNPs in Mycobacterium tuberculosis/M. bovis BCG.

Polymorphism  BCG SNP              H37Rv SNP            CDC SNP              ORF  Description   GO ID   Putative Function
ID            Position  Base  AA   Position  Base  AA   Position  Base  AA        of SNP type
1             467       G     R    467       A     H    467       A     H    Yes  NS, NC        P49993  nucleotide binding activity
2             1057      A     I    1057      G     V    1057      G     V    Yes  NS, C         P49993  nucleotide binding activity
3             2347      G     G    2347      A     D    2347      A     D    Yes  NS, NC        Q50790  DNA binding activity
4             2532      C     L    2532      T     L    2532      T     L    Yes  S, NULL       Null
5             3751      G     V    3751      T     L    3751      T     L    Yes  NS, C         Q59586  DNA binding activity
6             4480      T     L    4480      C     S    4480      C     S    Yes  NS, NC        P71573
7             5752      A     V    5752      G     V    5752      G     V    Yes  S, NULL       Null
8             6406      T     N    6406      C     N    6406      C     N    Yes  S, NULL       Null
9             6446      T     S    6446      G     A    6446      G     A    Yes  NS, NC        P41514  nucleic acid binding activity
10            8285      T     I    8285      C     I    8285      C     I    Yes  S, NULL       Null
11            8741      T     R    8741      C     R    8741      C     R    Yes  S, NULL       Null
12            9143      C     I    9143      T     I    9143      T     I    Yes  S, NULL       Null
13            9217      C     A    9217      A     D    9217      A     D    Yes  NS, NC        Q07702  DNA binding activity
14            10727     G     V    10727     A     I    10727     A     I    Yes  NS, C         P71575  integral to membrane
15            13197     C     Null 13197     G     Null 13197     G     Null Yes  S, NULL       Null

TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly tion O BCG CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base ORF type GO ID Putative Function 16 13459 13460 D 13460 Yes S, NULL 17 14400 144O1 K 144O1 Yes NS, NC P71582 integral to membrane 18 15116 15117 15117 Yes NS, NC P71583 enzyme activity 19 17856 17857 17857 Yes S, NULL 2O 21818 21819 S 21819 Yes NS, NC P71588 enzyme activity 21 22263 22264 A. 22264 Yes S, NULL 22 23173 23174 R 23174 Yes NS, NC P71588 enzyme activity 23 23713 23714 23714 Yes S, NULL 24 24293 24294 R 24294 Yes NS, NC P71590 25 24.533 24534 Y 24534 Yes NS, NC P71590 26 24678 246.79 Y 246.79 Yes NS, C P71590 27 24761 2478O D 24762 Yes NS, NC P71590 28 25287 2S306 G 2S288 Yes NS, NC P71590 29 26O34 26053 A. 26O3S Yes NS, NC P71591 30 27450 27469 Null 27451 No Inc., NULL 31 294,42 29462 C 29.444 Yes NS, NC P71595 32 299.79 29999 Q 2998O Yes NS, NC P71596 33 30736 30756 K 30737 Yes S, NULL 34 31041 31057 R 3.1038 Yes S, NULL 35 32608 32624 H 32568 Yes NS, NC P71599 36 33788 33804 Null 33748 No Inc., NULL 37 36288 36304 A. 36248 Yes S, NULL 38 36522 36538 S 36482 r Yes S, NULL 39 36596 36612 K 36.556 Yes S, NULL 40 39742 39758 H 397O2 Yes S, NULL 41 41228 41244 Null 41.188 No Inc., NULL 42 41437 41453 G 41397 Yes S, NULL 43 42265 42281 C 42225 Yes NS, NC P71696 integral to membrane 44 43929 43943 V 43889 Yes S, NULL 45 45.177 45.191 A. 
45137 Yes S, NULL 46 49989 SOOO3 Null 49949 No Inc., NULL 47 S2012 52026 G 51972 Yes NS, NC P71705 integral to membrane 48 S3663 53677 P S3623 Yes NS, NC P71707 enzyme activity 49 S9861 S9869 Null 59815 No Inc., NULL 50 62758 62766 G 62712 y Yes S, NULL 51 63O29 63O37 Null 62983 null No Inc., NULL 52 63O49 63.057 Null 63003 No Inc., NULL 53 65857 6586S V 65811 Yes NS, C hydrolase activity S4 699.13 69921 69867 Yes NS, NC molecular function unknown 55 70O82 7OO90 C 70O36 Yes S, NULL Nu 56 70257 70265 & V 70211 & Yes NS, NC OS3609 molecular function unknown 57 71758 Null 71729 Null 71712 null No Inc., NULL Nu 58 74119 74O90 74O73 Yes S, NULL Nu 59 74.188 74159 K 74.142 Yes NS, NC OS3611 isocitrate dehydrogenase (NADP+) activity 60 78130 Cg 78.101 E 78O84 E Yes O53615 glycine NS, NC hydroxymethyltransferase activity 61 79388 79359 Null 79342 null No Inc., NULL Null 62 8O169 8O131 C 8O123 Yes S, NULL Null 63 86899 86862 86854 Yes NS, C OS3623 DNA binding activity 64 89235 89198 G 891.90 Yes NS, C O53625 65 89570 89533 Null 89.525 null No Inc., NULL Null 66 90964 90927 A. 90919 Yes NS, C Q10880 oxidative phosphorylation 67 92.357 92320 Null 92312 null Yes Inc., NULL Null 68 943.38 d 94301 M 94.293 r Yes NS, NC Q10883 oxidative phosphorylation 69 961.36 I 96099 V 96.091 null Yes NS, C Q10884 electron transport 70 97731 Null 97.694 Null 97.686 G null No Inc., NULL Null 71 993.36 Null 99.299 Null 99.291 null Yes Inc., NULL Null 72 1OO624 A. 100587 T 100.579 k Yes NS, NC Q10876 magnesium ion binding activity US 2008/0085284 A1 Apr. 10, 2008 33

TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly- tion O- BCG H37Ry CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 73 O363S G R O3598 A. C O3590 A. C Yes NS, NC Q10890 integral to membrane 74 OS903 T W O5865 C R O5857 C R Yes NS, NC Q10892 integral to membrane 75 O6370 A. P O6332 C P O6324 C P Yes S, NULL Null 76 226SO T W 22612 C R 22604 C R Yes NS, NC Q10898 cAMP-dependent protein kinase complex 77 23556 C H 23S18 T Y 23S10 T Y Yes NS, C Q10898 cAMP-dependent protein kinase complex 78 23878 T NII 23840 C Null 23832 C null No inc, NULL Nu 79 266OO A. S 26S 61 C A. 26554 C A. Yes NS, NC Q10900 magnesium ion binding activity 8O 26840 G P 268O1 A. S 26794 A. S Yes NS, NC Q10900 magnesium ion binding activity 81 27.447 G L 27408 C L 274O1 C L Yes S, NULL Nu 82 30172 A. V 301.33 G A. 3O126 G A. Yes NS, C Q10900 magnesium ion binding activity 83 3O237 T P 301.98 C P 301.91 C P Yes S, NULL Nu 84 37223 A. Q 371.83 G Q 37.177 G Q Yes S, NULL Nu 85 38339 A. R 38299 G G 38292 G G Yes NS, NC O53636 86 39796 C Null 39754 T NII 39747 r null No inc, NULL Nu 87 43247 C G 432OS T S 431.98 r S Yes NS, NC O53639 DNA binding activity 88 46OO6 A. A. 45964 G A. 45957 G A Yes S, NULL Nu 89 47495 A. W 47453 C W 47446 C W Yes S, NULL Nu 90 47911 C Null 47871 A Null 47864 A null No inc, NULL Nu 91 49987 G G 49947 C G 49940 C G Yes S, NULL Nu 92 59370 A. F 59.177 G 59350 G Yes S, NULL Nu 93 60535 T K 60342 C E 60515 C E Yes NS, NC P968.09 94 61.144 T F 60951 G V 61124 G V Yes NS, NC P96810 N-acetyltransferase activity 95 62499 A. R 62306 G G 62479 G G Yes NS, NC P96811 enzyme activity 96 62S30 G G 62337 A. D 62510 A. D Yes NS, NC P96811 enzyme activity 97 65799 G G 656O7 A. D 6578O A. D Yes NS, NC P96815 98 66696 A. H 66504 G R 66677 G R Yes NS, NC P96816 99 70273 G L 70O81 A. 70254 A. Yes NS, NC P9682O voltage-gated chloride channel activity OO 71097 C R 70905 A. R 71078 A. R Yes S, NULL Null O1 73091 A. 
R 72899 C R 73072 C R Yes S, NULL Null O2 79424 T R 79.232 C G 79405 C G Yes NS, NC P96828 O3 81862 C G 81670 T D 81843 T D Yes NS, NC P9683O protein phosphatase activity O4 849.17 G C 84725 A. Y 84895 A null Yes NS, NC P96833 05 88.267 T 88075 G C 88.245 G C Yes NS, NC O53642 O6 89999 T 898O7 C A. 89977 C A Yes NS, NC O86360 O7 90284 T 90092 C A. 90262 C A Yes NS, NC O86360 O8 92.177 A. 91985 G 921S6 G Yes S, NULL Null 09 95552 C A. 95.358 T V 95.529 r V Yes NS, C OO7411 enzyme activity 10 95758 G A. 95S64 T S 95735 r S Yes NS, NC O07411 enzyme activity 11 98328 A. 98.134 C 983OS C Yes S, NULL Null 12 99.662 G A. 99.468 T S 99639 r S Yes NS, NC P72013 pathogenesis 13 998OO S 99606 C C 99777 C Yes NS, NC P72013 pathogenesis 14 200622 C 200428 T 200599 r Yes NS, NC O07414 pathogenesis 15 201759 C D 201565 G E 201736 G E Yes NS, C OO7415 pathogenesis 16 206673 G C 206479 C C 2O66SO C C Yes S, NULL Null 17 206676 G 2O6482 G G 2O6653 G G Yes S, NULL Null 18 210634 H 210440 C R 210554 C R Yes NS, NC O07423 19 212446 A Null 212252 G Null 212366 G null No inc, NULL Null 2O 2173.93 C N 2171.99 T N 217313 T N Yes S, NULL Null 21 218055 G R 217861 C C 217975 C C Yes NS, NC O07430 hydrolase activity 22 225861 S 225666 C G 22578O C G Yes NS, NC O07437 23 227215 G V 227020 A. 227134 A. Yes NS, C OS3645 nucleotide binding activity 24 227738 M 227543 C 22 7657 C Yes NS, NC O53645 nucleotide binding activity 25 228OS3 L 227858 C C 227972 C C Yes NS, NC O53645 nucleotide binding activity 26 228924 R 228729 C R 228843 C R Yes S, NULL Null 27 23.1783 C NI 231587 T NII 231.701 T null No inc, NULL Null US 2008/0085284 A1 Apr. 10, 2008 34

TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly tion O BCG CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 28 2321.88 231992 232106 A. Yes S, NULL Null 29 233552 233356 233470 V Yes S, NULL Null 30 233558 233362 233476 R Yes NS, NC OS3648 31 243794 243596 243712 H Yes NS, NC OS3656 integral to membrane 32 244589 244391 244507 V Yes NS, C OS3656 integral to membrane 33 246117 245919 246O3S D Yes NS, C O53657 membrane 34 24636S 246167 246.283 F Yes NS, NC O53657 membrane 35 249718 24952O 249636 K V Yes NS, C P96391 36 251771 251573 2S1689 A. Yes NS, NC P96392 37 2S1865 Null 251667 251783 G null No inc, NULL Null 38 256378 25618O 2S6296 G Yes NS, C P96396 enzyme activity 39 2591.27 Null 2S8900 259.016 null Yes inc, NULL Null 40 260507 26O28O 26O396 P Yes NS, NC P96399 enzyme activity 41 262385 2621.58 262274 D Yes NS, NC P964OO electron transport 42 2651.83 266857 26,6973 D Yes S, NULL Null 43 265653 S 267327 267443 P Yes NS, NC P964OS metabolism 44 2666O1 2682.75 268.391 F Yes NS, NC P964O6 S-adenosylmethionine dependent methyltransferase activity 45 269989 271663 271779 S Yes NS, C P964.09 46 271077 27 2751 272867 S Yes NS, NC P964.09 47 271882 273556 273672 S Yes S, NULL Null 48 273691 275365 275481 D Yes NS, C P96413 Zinc ion binding activity 49 276.1.86 277860 Null 277976 null No inc, NULL Null 50 282208 283882 283998 T Yes NS, NC P96419 cell adhesion 51 283942 285616 285732 M Yes NS, NC P96419 cell adhesion 52 285894 287568 287684 V Yes NS, C OS3660 hydrolase activity 53 287276 28895O 289066 T Yes S, NULL Null S4 287759 289433 289549 A. Yes NS, NC OS3663 55 288778 290452 290S68 L Yes S, NULL Null 56 292523 2941.96 294,313 E Yes NS, NC OS3666 acyl-CoA dehydrogenase activity 57 292.778 C 294451 T K 294568 r K Yes NS, C OS3666 acyl-CoA dehydrogenase activity 58 29418O Nu 295.853 Nu 295.970 No inc, NULL Nu 59 295519 2971.92 297309 A. 
Yes NS, C OS3668 60 3OOO12 Nu 3O1685 Nu 301802 No inc, NULL Nu 61 3O1364 303037 3O3154 G Yes S, NULL Nu 62 3OS428 307101 307218 G Yes S, NULL Nu 63 3O8090 Nu 309763 Nu 30988O Yes Inc, NULL Nu 64 31.1176 312849 312966 L Yes S, NULL Nu 65 3121.94 313867 313984 S Yes S, NULL Nu 66 3.18505 32O178 32O294 Yes S, NULL Nu 67 321009 322682 322798 L Yes S, NULL Nu 68 321631 Nu 323304 Nu 32342O No inc, NULL Nu 69 323830 325503 325619 V Yes S, NULL Nu 70 327543 329216 329332 P Yes NS, NC P95229 71 3299.13 33 1586 331702 S Yes NS, NC OS3681 72 33.1537 Nu 333210 Nu 333326 Yes Inc, NULL Nu 73 33 1617 Nu 333290 Nu 333406 Yes Inc, NULL Nu 74 331719 Nu 333392 Nu 33.3508 No inc, NULL Nu 75 340O88 Nu 339084 Nu 339148 No inc, NULL Nu 76 340090 Nu 339086 Nu 3391SO No inc, NULL Nu 77 340091 Nu 339087 Nu 3391.51 No inc, NULL Nu 78 340092 Nu 339088 Nu 33915.2 No inc, NULL Nu 79 340097 Nu 339093 Nu 339157 No inc, NULL Nu 8O 343148 342144 342208 Yes NS, NC O53687 nuoleotide binding activity 81 344283 343.279 343.343 A. Yes S, NULL Nu 82 3S1491 350487 350551 A. Yes S, NULL Nu 83 355282 3S4278 354342 A. Yes NS, NC O86362 84 3621 63 36.1159 361.223 null No inc, NULL Nu 85 362818 361814 361878 F Yes S, NULL Nu 86 364,560 363S11 363575 null Yes NS, C OO7226 87 364804 363755 363819 V Yes S, NULL Nu


TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly tion O BCG CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base ORF type GO ID Putative Function 527 189705 189243 189291 Yes NS, NC OS341S 528 193177 191815 192055 Yes NS, NC OS3416 529 193221 E 191859 192099 Yes S, NULL Null 530 1995O1 Null 1981.39 Null 198376 No Inc., NULL Null 531 1995O2 Null 198140 Null 198377 No Inc., NULL Null 532 199636 1982.74 198511 Yes S, NULL Null 533 205184 2O3822 g 204059 Yes NS, NC OS3426 534 2O6868 205506 205743 Yes S, NULL Null 535 212729 211367 211604 Yes NS, NC OS3434 metabolism 536 214512 2131.68 213326 Yes S, NULL Null 537 221942 22O568 Null 22O117 No Inc., NULL Null 538 226,570 22S196 224745 Yes NS, NC O53444 carbohydrate metabolism 539 227449 226O75 22S624 Yes NS, NC OS3445 S4O 23O847 229473 229.022 G Yes NS, C O53449 integral to membrane 541 232149 230776 23O325 Yes NS, NC OS3450 DNA binding activity S42 236O28 2346SS 2342OS Yes S, NULL Null 543 236817 235444 234994 Yes S, NULL Null 544 239854 238481 238031 Yes NS, C O53459 GTP binding activity 545 241612 24O239 239789 G Yes NS, NC OO6567 S46 244718 243345 242895 Yes NS, C OO6572 guanylate cyclase activity 547 245764 244391 243941 Yes S, NULL Null S48 249753 248380 247930 Yes NS, NC OO6577 549 250307 248934 248483 Yes S, NULL Null 550 251711 2SO338 249887 Yes S, NULL Null 551 251728 250355 249904 Yes NS, NC OO6579 ATP binding activity 552 255933 2S4560 254109 Yes NS, NC OO6582 553 255953 2S458O 254129 Yes NS, NC OO6582 554 257383 256O10 255.559 Yes S, NULL Null 555 259222 257849 257398 Yes NS, NC OO6583 556 26.1908 260535 26OO84 Yes S, NULL Null 557 282O61 G 280687 28O179 Yes NS, NC OO6SS1 methyltransferase activity 558 282113 T 280739 C 28O231 C A. 
Yes NS, NC methyltransferase activity 559 283143 28.1769 S 281.261 r Yes NS, NC OO6553 S60 288484 287110 Null 286600 r No Inc., NULL Null S61 29OO71 288697 2881.87 Yes NS, C OO6559 electron transport S62 291161 289787 289.277 Yes NS, NC OO6559 electron transport 563 295.376 2940O2 2934.92 Yes NS, C OO6562 electron transport S64 295 770 294396 293886 Yes S, NULL Null 565 3O3656 302281 3O1771 Yes NS, NC OSO428 566 3O4272 302897 3O2387 No Inc., NULL Null 567 307530 306.279 305769 Yes NS, NC OSO431 electron transport 568 3092O7 307956 307446 Yes NS, C OSO431 electron transport 569 311565 310314 309804 Yes NS, C OSO434 transaminase activity 570 318.177 316925 316414 Yes NS, NC OSO437 alcohol dehydrogenase activity 571 325815 324563 324052 Yes Nu 572 335064 333.812 333301 Yes Nu 573 340O81 338829 Null 3383.18 null Yes Nu 574 341302 34OOSO 339539 Yes OOS298 575 341343 340091 33958O Yes OOS298 576 341420 G 34O168 3396.57 Yes Nu 577 341887 34O638 Null 34O127 null No Nu 578 341914 34O665 34O154 Yes Nu 579 342O2S 34O776 34O265 Yes Nu S8O 342O28 34O779 34O268 Yes Nu 581 342031 34O782 34O271 r Yes Nu 582 342O77 340828 340317 Yes OOS299 583 342.287 341038 340527 Yes OOS300 S84 342456 3412O7 34O696 Yes Nu 585 343724 342475 34-1964 Yes Nu S86 345988 344739 344228 Yes Nu 587 34842O 347171 346660 Yes Nu


TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly tion O BCG CDC of

phism SNP SNP SNP SNP SNP ID Position Base Position Base Position Base AA ORF type GO ID Putative Function 766 789681 8043.13 795.222 Y Yes NS, C O53907 inositol phosphatidylinositol phosphatase activity 767 790156 804788 795697 P Yes NS, NC O53907 inositol phosphatidylinositol phosphatase activity 768 79058O 805212 7961.21 Yes NS, NC OS3908 histidine biosynthesis 769 798.626 813258 804167 C Yes S, NULL Nu 770 8O2214 s 816846 807 755 s Yes NS, C O06134 magnesium ion binding activity 771 8O8750 823.382 814291 Yes S, NULL Nu 772 811420 826OS2 81696.1 Yes NS, NC OO6141 773 813171 8278O3 818712 Yes Nu 774 813366 827998 818907 No inc, NULL Nu 775 813755 A. 828.387 819296 Yes Nu 776 815661 83O293 8212O2 Yes Nu 777 82O225 834857 825766 Yes OO6147 RNA binding activity 778 822223 A Nul 8368SS 827764 No inc, NULL Nu 779 824626 8392S8 83O167 Yes Nu 780 825.125 839757 8306.66 Yes OO6151 transporter activity 781 828921 843SS3 834.462 Yes Nu 782 834571 8492O3 84O112 s Yes P94974 magnesium ion binding activity 783 834975 849607 840516 r Yes 784 844924 k s 8595.57 8SO466 C Yes P94984 magnesium ion binding activity 785 845757 86O390 851.299 Yes P94985 tRNA binding activity 786 85O233 864.866 855775 Yes P94986 787 858.345 872957 86386S Yes 788 862697 877.309 868.217 Yes P94996 enzyme activity 789 86362O 878232 86914O Yes 790 864215 878827 869735 Yes P94996 enzyme activity 791 86,7321 8.81933 872841 Yes N O65933 enzyme activity 792 867566 8821.78 873O86 Yes 793 869512 884124 875 032 Yes SN: O65933 enzyme activity 794 869897 884.509 875417 Yes Nu 795 870867 88S479 876387 Yes O65933 enzyme activity 796 871495 886.107 877O15 Yes O65933 enzyme activity 797 873514 888126 879034 Yes OO6586 enzyme activity 798 874.459 889071 879979 Yes s OO6586 enzyme activity 799 878859 893471 884379 Yes Nu 800 8854.17 900019 890823 null Yes i.N. 
L Nu 8O1 886196 900798 891.602 V Yes NS, C OS3922 transcription factor activity 802 88.8569 903171 893.975 Yes Nu 803 890S13 905115 895919 Yes O33182 804 89.1732 906334 897.138 No inc, NULL Nu 805 897364 912O22 902826 Yes O33.188 drug transporter activity 806 897922 912S8O 903384 Yes Nu 807 8999.10 914568 905372 Yes Nu 808 905462 92O118 910922 Yes Nu 809 91O3O1 924957 915761 Yes Nu 810 911052 92.5708 916512 Yes O33199 811 911527 9261.83 91 6987 Yes O33199 812 91S691 93O348 921152 Yes Nu 813 916811 93.1468 922272 ul No inc, NULL Nu 814 917059 93.1716 922S2O Yes O33204 815 921036 935693 926497 Yes O332O6 sulfate porter activity 816 92.1535 936.192 926996 Yes O332O6 sulfate porter activity 817 921866 936.523 927327 null Yes O332O7 818 928.563 943.220 934O24 null Yes Null 819 933291 947949 938753 V Yes P71980 82O 934,421 949079 939883 Null 821 937104 951762 Null 942S6S null Null


TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly- tion O- BCG H37Ry CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 884 2053144 G. N. 2062678 A Null 2060017 A null Yes inc, NULL Null 885 2OS3386 C V 2O62920 T I 2O6O259 T I Yes NS, C Q50614 nucleotide binding activity 886 2O54840 G A. 2O64374 A. V 2O61713 A. V Yes NS, C Q50614 nucleotide binding activity 887 2055511 A. P 206SO45 G P 2O62384 G P Yes S, NULL Null 888 2062654 C A. 2O72188. A. E 2O69527 A. E Yes NS, NC Q50607 glycine cleavage system complex 889 2O64683 T C 2O74217 C R 2O71556 C R Yes NS, NC Q50604 890 2O65441 C Y 2075087 T Y 2O72314 r Y Yes S, NULL Null 891 2O66872 C A. 2O76518 T V 2O73745 r V Yes NS, C Q50601 glycine cleavage system 892 2O67570 G A. 2O77216 A. T 2O74443 A. T Yes NS, NC Q50601 glycine cleavage system 893 2O68670 T V 2078316 C V 2075543 C V Yes S, NULL Null 894 2070996 C A. 2O80642 T V 2O77869 r V Yes NS, C Q50599 enzyme activity 895 2O73217 T H 2O82863 C R 2080091 C R Yes NS, NC Q50597 integral to membrane 896 207458O A. C 2O84226 G R 2081454 G R Yes NS, NC Q50597 integral to membrane 897 2080993 G S 2090774 A. L 2O880O2 A. L Yes NS, NC Q50592 integral to membrane 898 2O8290S A. G 2092686 G G 2O89914 G G Yes S, NULL Null 899 2083932 C K 2093713 T NI 2090941 T null Yes inc, NULL Null 900 2O86536 A. Y 2096324 C D 2093S46 C D Yes NS, NC P95163 901 2087337 T N 2097126 C N 2094,348 C N Yes S, NULL Null 902 2087353 C L 2097142 G V 2094,364 G V Yes NS, C P95162 enzyme activity 903 209 OOO2 T L 2099791 A. M 2097O13 A. M Yes NS, NC P500.50 nickel ion binding activity 904 2092402 A. W 2102191 G R 20994.13 G R Yes NS, NC P95160 electron transport 905 2094.478 T H 2104268 C R 2101490 C R Yes NS, NC P95158 metabolism 906 2094.633 T A. 2104423 C A. 
2101645 C A Yes S, NULL Nu 907 2097719 T M 2107SO9 C T 2104731 C T Yes NS, NC P95155 nucleotide binding activity 908 2099.098 C NI 2108888 A Null 2106110 A null No inc, NULL Nu 909 2099682 T NII 2109472 C Null 2106694 C null No inc, NULL Nu 910 21OOST4 C T 2110363 A. T 2107586 A. T Yes S, NULL Nu 911 2103397 C Q 2113186 G E 21104.09 G E Yes NS, NC P95149 DNA binding activity 912 2110986 T L 212O775 C NI 2117998 C null Yes inc, NULL Nu 913 2111 OOS A. L 212O794 T : 2118017 r : Yes NS, TP P951.45 base-excision repair 914 2112834 G A. 2122623 A. V 21.19846 A. V Yes NS, C P951.43 electron transport 915 211318S G A. 2122974 C G 212O197 C G Yes NS, C P951.43 electron transport 916 2113487 G G 2123276 T G 212O499 r G Yes S, NULL Nu 917 2113686 A. N 2123475 G D 212O698 G D Yes NS, NC O07756 918 2116O72 T NII 212S861 C Null 2123084 C null No inc, NULL Nu 919 2118582 G T 2128370 A. T 2125593 A. T Yes S, NULL Nu 920 2119054 A. T 2128842 G A. 212606S G A Yes NS, NC O07752 glutamate-ammonia ligase activity 921 2119491 A. A. 21292.79 C A. 2126SO2 C A Yes S, NULL Nu 922 212O739 G. N. 2130527 A NII 21277SO A null No inc, NULL Nu 923 21281.38 A. G 2137901 G G 2135149 G G Yes S, NULL Nu 924 2134872 A. D 214463S G D 2141883 G D Yes S, NULL Nu 925 21361.13 G C 2145876 A. 2143124 A. L Yes NS, NC O07733 926 21374OS T Q 21471.23 C R 2144,416 C R Yes NS, NC O07732 guanylate cyclase activity 927 2141458 T D 2151176 C D 2148469 C D Yes S, NULL Nu 928 2141625 T M 2151343 C T 2148,636 C T Yes NS, NC O07728 929 2141958 T T 2151676 G C 214.8969 G P Yes NS, NC OO7727 monooxygenase activity 930 2145004 A. 2154722 C R 2152015 C R Yes NS, NC Q08129 catalase activity 931 2145.783 A. 21555O1 G T 2152794 G T Yes S, NULL Nu 932 21463OS T C 2156O23 G C 2153.316 G P Yes S, NULL Nu 933 2147148 T G 2156866 G G 21541.59 G G Yes S, NULL Nu 934 21472O2 T 2156920 A. 2154213 A. T Yes S, NULL Nu 935 21483.89 C G 215.8107 T D 21 SS400 r D Yes NS, NC O07721 NADPH: quinone reductase activity 936 2153341 A. 
21630.58 G 216O3S1 G L Yes S, NULL Nu 937 2159560 G S 2167924 A. 216S280 A. L Yes NS, NC O53960 938 215964.5 G R 21 68009 T S 21 6536S r S Yes NS, NC O53960 939 2159953 A. I 2168317 G 2165673 G T Yes NS, NC O53960 US 2008/0085284 A1 Apr. 10, 2008 48

TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly- tion O- BCG H37Ry CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 940 2163647 C I 2172010 G M 21 69366 G M Yes NS, NC O53962 metabolism 941 2166221 A. R 2174584 C R 2171940 C R Yes S, NULL Null 942 2167025 A. D 2175388 G G 2172744 G G Yes NS, NC P95290 943 2167348 A. N 2175711 G D 2173067 G D Yes NS, NC P95290 944 21687.08 C NI 2177071 T NII 2174427 T null No inc, NULL Null 945 2180277 T C 2188640 C C 2185973 C C Yes S, NULL Null 946 2188349 C L 2196713 G V 2194O46 G V Yes NS, C P95269 947 21888.10 A. T 21971.74 G 21945O7 G T Yes S, NULL Null 948 21893O3 T T 2197667 C A. 21.9SOOO C A Yes NS, NC P95268 949 2190686 G R 21990SO C G 2196.383 C G Yes NS, NC P95266 950 2191 OSO G L 21994.14 A. 2196747 A. Yes NS, NC P95265 951 2192930 C R 22O1294 T H 2198627 T null Yes NS, NC P95260 952 220O2S1 G G 2221.333 A. D 221866.7 A. D Yes NS, NC O53979 S-adenosylmethionine dependent methyltransferase activity 953 22O1224 C G 2222306 T D 221964O T D Yes NS, NC Q10875 amino acid-polyamine transporter activity 954 2204091 C R 2225173 A. 2222507 A. Yes NS, NC Q10840 ribonucleoside diphosphate reductase activity 955 2205817 A. 2226899 G A. 2224233 G A Yes NS, NC Q10873 integral to membrane 956 22O8717 G C 2229799 C C 2227133 C C Yes S, NULL Nu 957 2210402 G Null 2231484 A Null 2228.818 A null No inc, NULL Nu 958 2220658 G C 2241740 A. C 2239.074 A. C Yes S, NULL Nu 959 222 1950 G R 2243032 T S 2240366 T S Yes NS, NC Q10859 enzyme activity 96.O 2223259 T R 2244341 G R 2241675 G R Yes S, NULL Nu 961 2225876 C V 2246958 G V 2244292 G V Yes S, NULL Nu 962 2226085 A. H 2247167 G R 22445O1 G R Yes NS, NC Q10856 963 223.1155 A. S 225223.7 G G 2249571 G G Yes NS, NC Q10850 carbohydrate metabolism 964 223201S C A. 2253097 A. A. 22SO431 A. A Yes S, NULL Nu 96S 2234-651 C C 2255733 T 2253067 T Yes NS, NC Q10850 carbohydrate metabolism 966 2237132 G D 2258214 A. N 2255548 A. 
N Yes NS, NC Q10848 967 2239.016 T NII 2260098 C NI 2257.432 C null No inc, NULL Nu 968 224O157 G E 2261239 A. E 2258573 A. E Yes S, NULL Nu 969 2240609 C Null 2261691 G. N. 2259.025 G null No inc, NULL Nu 970 2244542 G G 226S624 A. D 2262958 A. D Yes NS, NC O53464 971 2246892 R 2267974 G R 22653O8 G R Yes S, NULL Nu 972 2247313 2268.395 C NI 226S729 C Yes Inc, NULL Nu 973 22S3136 C V 2269218 A. 2271552 A. Yes NS, NC O53470 ATP binding activity 974 22S3292 A. C 2269374 G R 2271708 G R Yes NS, NC O53470 ATP binding activity 975 2255734 G. N. 2271818 A Null 2274152 A null No inc, NULL Nu 976 2258377 C V 2274461 A. 2276795 A. Yes NS, C O53473 ATP binding activity 977 2263994 228OO79 C C 2282413 C C Yes NS, NC O53476 978 2266289 V 2282374 C V 2284708 C V Yes S, NULL Nu 979 226629O V 2282375 C V 22847.09 C V Yes S, NULL Nu 98O 2271395 A. V 2287480 G A. 2289814 G A Yes NS, C OS3485 transporter activity 981 22.72986 C D 2289071 G H 229.1405 G H Yes NS, NC Q50575 enzyme activity 982 2275230 C C 2291315 T S 2293649 r S Yes NS, NC O53489 983 2282106 S 2298.191 C G 2300525 G Yes NS, NC O53489 enzyme activity 984 228SOO2 C 2301087 T 23O3421 r Yes S, NULL Null 985 23O4443 G A. 232O528 A. V 2322862 V Yes NS, C OS3498 biosynthesis 986 2312457 C M 2328541 T 233O875 r Yes NS, NC Q10672 porphyrin biosynthesis 987 2312508 A. G 2328S92 G G 2330926 G G Yes S, NULL Null 988 2312541 T A. 2328625 C A. 2330959 C A Yes S, NULL Null 989 2313380 T P 2329464 C C 233.1798 C C Yes S, NULL Null 990 2317013 T H 2335125 C H 2337459 C H Yes S, NULL Null 991 2317887 T L 2335999 C C 2338.333 C C Yes NS, NC Q10687 992 2319259 C A. 23373.71 T V 2339705 T V Yes NS, C Q10688 membrane 993 2322578 T L 234O687 C 234301S C Yes S, NULL Null 994 2322919 A. T 2341028 G A. 2343356 G A Yes NS, NC Q10691 integral to membrane 995 2329.583 T I 23.47738 C 23SOOO8 C Yes NS, NC Q10699 DNA binding activity 996 2332029 G R 23S0184 A. R 23S24.54 A. R Yes S, NULL Null US 2008/0085284 A1 Apr. 10, 2008 49

TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly tion O BCG H37Ry CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 997 233336S M 2351520 G T 2353790 G T Yes NS, NC Q10701 nucleic acid binding activity 998 2335359 V 23S3514 G A. 2355.783 Yes NS, C Q10704 999 2344034 A S 2362191 G A. 2364459 A Yes NS, NC OS3499 nucleic acid binding activity OOO 23496.68 G 2369184 G R 2370093 Yes NS, NC O33244 endopeptidase activity OO1 23.53673 P 2373246 A. T 2374098 Yes NS, NC O33248 OO2 23.53887 L 2373460 C S 2374.312 Yes NS, NC O33248 OO3 23S4.188 Y 23.73761 C Y 2374613 Yes S, NULL Nu OO4 23S 6850 NI 237 6423 A Null 2377275 null No Inc., NULL Nu 005 236O168 NI 2379741 C Null 2380593 null Yes Inc., NULL Nu OO6 2364.212 V 2383812 T I 23.82390 Yes NS, C O33259 dihydropteroate synthase activity 2365.161 R 238.4761 A. R 2383,339 Yes S, NULL Nu 2369039 D 2388639 G G 2387216 null Yes NS, NC O33261 amino acid-polyamine transporter activity 2369143 S 2388743 G G 238732O Yes NS, NC O33261 amino acid-polyamine transporter activity O10 237O697 NI 2390297 A Null 2388874 null No Inc., NULL Nu O11 2373.913 R 2393513 T R 2392.090 Yes S, NULL Nu O12 2377026 V 2396626 A. L 23952O3 Yes NS, C OO6239 undecaprenol kinase activity 2384213 L 2403697 G L 24O2390 Yes S, NULL Nu 2393.645 P 2413129 A. L 2411822 Yes NS, NC OO6224 cytokinesis 2393760 L 2413244 C V 241.1937 Yes NS, C OO6224 cytokinesis 2395865 V 2415349 C V 2414042 Yes S, NULL Nu 2399656 V 2419140 A. V 2417833 Yes S, NULL Nu 2402330 A. 2421814 A. V 242OSO7 Yes NS, C OO6217 2405256 Null 2424862 A Null 2423555 null No Inc., NULL Nu 2405346 Null 2424952 A Null 2423645 null No Inc., NULL Nu 2405863 R 2425469 T R 2424.162 Yes S, NULL Nu 2408O26 A. 2427 632 A. V 2426325 Yes NS, C OO6213 241.3856 Null 2434820 A Null 24321 SS null No Inc., NULL Nu 2416293 S 24372.57 G A. 2434592 Yes NS, NC O53508 2416871 243783S G P 2435170 Yes NS, NC O53509 24.17128 A. 
2438092 T S 2435427 Yes NS, NC O53510 protein kinase activity 2418.638 K 24396O2 G T 2436937 Yes NS, NC O53511 DNA binding activity 2421.783 2442747 G P 2440O82 Yes NS, NC OS3514 2423986 G 2444950 A. G 2442285 Yes S, NULL Null 2426460 2447424 G I 2444759 Yes S, NULL Null 2426573 NII 244.7537 A Null 2444872 i No Inc., NULL Null 2427491 S 2448456 C S 244.5791 Yes S, NULL Null 2428328 E 2449293 G G 2446628 Yes NS, NC O53521 enzyme activity 2433.96S 2454930 G T 24.52265 Yes S, NULL Null 2433967 2454932 A. T 24.52267 Yes S, NULL Null 2438.267 2459232 G M 2456567 Yes NS, NC O387 electron transport 244O692 S 246.1543 C A. 2458821 Yes NS, NC O389 integral to membrane 24476.68 V 2468519 C V 2465797 Yes S, NULL 2449344 C 24701.95 A. C 2467473 Yes S, NULL 2449738 NII 2470589 A Null 246.7867 Yes S, NULL 24S2103 S 2472954 T L 247O232 Yes NS, NC O397 porphyrin biosynthesis 24S4263 F 247S114 G F 2472392 Yes S, NULL ul 2455035 A. 2475886 T E 2473.164 Yes NS, NC enzyme activity 2458,114 L 2.47896S C F 2476243 Yes NS, NC aminopeptidase activity 2463948 G 2484799 A. D 2482O77 Yes NS, NC enzyme activity 24656O1 L 248.6452 T F 2483730 Yes NS, NC integral to membrane 2468463 Null 2489314 A Null 248.6590 null No Inc., NULL 2474596 NII 249S447 T NII 2492723 r No Inc., NULL 2474647 L 249S498 T L 2492774 Yes S, NULL 2478070 D 24.98921 G G 24961.97 Yes NS, NC OS10 248O295 A. 2SO1146 T A. 2498422 d Yes S, NULL ul 24.80652 L 25O1502 T F 2498778 r Yes NS, NC OS11 2481905 Q 2502755 C R 2SOOO31 Yes NS, NC O513 kinesin complex 2488510 I 2SO9360 G T 2SO6634 Yes NS, NC O518 vitamin B12 biosynthesis 055 2488792 G 2SO9642 G G 2SO6916 Yes S, NULL Nul


TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly- tion O- BCG H37Ry CDC of

phism SNP SNP SNP SNP SNP D Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 12 266 1841 G S 2693 770 A. N 2691102 A. N Yes NS, C P71748 oxygen transporter activity 13 2663.078 T H 2695.007 C R 26.92339 C R Yes NS, NC P71746 transporter activity 14 2668.981 G L 2700910 C V 2698242 C V Yes NS, C P71740 15 2671087 T G 27O3O16 G G 27OO348 G G Yes S, NULL Null 16 2671898 T A. 2703827 C A. 27O1157 C A Yes S, NULL Null 17 2677979 G A. 2709793 A. A. 27O6644 A. A Yes S, NULL Null 18 2681 098 A. L 2712911 C R 2709762 C R Yes NS, NC P71728 DNA binding activity 19 268.3310 C A. 27.15123 T T 2711974 r T Yes NS, NC P71727 20 2.684991 C R 2716804 T R 2713655 r R Yes S, NULL Null 21 2685,491 A. V 2717304 G A. 2714155 G A Yes NS, C P71724 enzyme activity 22, 2686456 T S 2718269 C G 2715120 C G Yes NS, NC O86328 nicotinate-nucleotide adenylyltransferase activity 23 2687242 A NII 27190SS G. N. 2715906 G null No inc, NULL Nu 24, 268908O A. Y 272O893 G H 271 7744 R Yes NS, C P71924 DNA binding activity 25 26891.37 C A. 27.20950 T T 271 78O1 Y Yes NS, NC P71924 DNA binding activity 26 2689139 A. I 27.20952 G T 271 7803 R Yes NS, NC P71924 DNA binding activity 27 2691689 C L 2723504 T L 272O355 T L Yes S, NULL Nu 28 2692O3O G L 2723845 A. F 272O696 A. F Yes NS, NC P71922 nucleotide binding activity 29 2693966 T V 27258O1 C NI 2722652 C V Yes Inc, NULL Nu 3O 2702213 T V 2734O48 C A. 273O899 C A Yes NS, C P71913 ribokinase activity 31 270518O A NII 273701S G. N. 273.3866 G null Yes inc, NULL Nu 32 2708856 C NI 274O691 T Null 273.7539 T null Yes inc, NULL Nu 33 270992.1 T L 274.1756 G L 27386O4 G L Yes S, NULL Nu 34 2719463 A. T 2751298 G T 2748.146 G T Yes S, NULL Nu 35 272O611 G D 275.2446 A. N 2749294 A. N Yes NS, NC O53178 36 2721171 T NII 27S3006 C NI 2749854 C null No inc, NULL Nu 37 2728310 T R 276O145 C R 2756992 C R Yes S, NULL Nu 38 2729181 A. I 276.1016 G M 2757863 G M Yes NS, NC O53186 transporter activity 39 2733102 A. 
L 2764937 G P 2761784 G P Yes NS, NC O53189 cytokinesis 40 2736OOS C Q 2767840 T Q 2764687 r Q Yes S, NULL Nu 41 274O904 A. S 2772739 C A. 2769586 C A Yes NS, NC O531.96 nucleic acid binding activity 42 2741117 C G 2772952 T S 2769799 r S Yes NS, NC O531.96 nucleic acid binding activity 43 2742343 C S 2774178 G W 2771025 G W Yes NS, NC O531.98 alpha-amylase activity 44 2743386 T Null 2775221 C NI 2772O68 C null No inc, NULL Null 45 274SO44 A. V 2776879 G A. 2773,726 G A Yes NS, C OS32O1 46 2751087 C T 2782925 T T 2779772 r T Yes S, NULL Null 47 2754350 C S 2787546 T N 2783035 r N Yes NS, C O532O7 glycerol-3-phosphate O acyltransferase activity 48 27S7260 C G 2790456 A. C 2785945 A. C Yes NS, NC O53208 metabolism 49 275 7900 T D 2791096 C G 2786,585 C G Yes NS, NC O53209 molecular function unknown 50 2758277 G A. 27.91473 A. A. 2786962 A. A Yes S, NULL Null 51 2762360 T T 27955.57 C T 2791046 C T Yes S, NULL Null 52 2763122 A. V 279632O G A. 2791809 G A Yes NS, C OS3212 53 2763158 G T 2796356 C S 2791845 C S Yes NS, C OS3212 S4 2771624 A. H 2804833 G H 280O322 G H Yes S, NULL Null 55 2774.283 A. D 28.07484 C A. 28O2973 C A Yes NS, NC O53217 thymidylate synthase activity 56 2776115 A. W. 28.09316 G R 2804805 G R Yes NS, NC OO6159 protein binding activity 57 2777179 T S 28.1038O C G 280S869 C G Yes NS, NC OO6160 58 2779539 G G 28.12740 A. G 28O8229 A. G Yes S, NULL Null S9 2784243 T L 2817444 G F 28.12933 G F Yes NS, NC OO6165 biotin carboxylase activity 60 2785043 T S 2818244 C G 2813733 C G Yes NS, NC OO6165 biotin carboxylase activity 61 2789771 T P 2822972 C P 28.18461 C P Yes S, NULL Null 62 2792263 A. K 2825464 G K 2820953 G K Yes S, NULL Null 63 2798.279 G G 2831.480 A. G 2826969 A. G Yes S, NULL Null 64 2802.192 T NII 2835393 C NI 283O882 C null No inc, NULL Null 65 28O3O99 C A. 28363OO G A. 2831789 G A Yes S, NULL Null 66 28O3100 C A. 2836301 G A. 2831790 G A Yes S, NULL Null US 2008/0085284 A1 Apr. 10, 2008 52

TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly- tion O- BCG H37Ry CDC of

phism SNP SNP SNP SNP SNP D Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 67 2804844 A. M 2838.045 G V 2833534 G V Yes NS, NC O53226 68 2807819 G R 2841 O20 A. C 2836SO9 A. C Yes NS, NC P95029 enzyme activity 69 2814O78 G D 2847279 A. D 2842768 A. D Yes S, NULL Null 7O 281.8088 A. V 28.51289 G V 284.6777 G V Yes S, NULL Null 71 28.18694 G Q 2851895 C E 2847383 C E Yes NS, NC P95025 72 2819952 A. E 2853153 G G 284.8641 G G Yes NS, NC P95024 nucleic acid binding activity 73 2824.432 A. D 2857633 G D 2853121 G D Yes S, NULL Null 74 282S466 T D 28586.67 G A. 28541S6 G A Yes NS, NC P95020 RNA binding activity 75 283.7839 T NII 287O384 C NI 2866S30 C null No inc, NULL Null 76 2843260 G N 2875806 C K 2871952 C K Yes NS, NC O07438 nucleic acid binding activity 77 2846.432 T D 2878978 C G 2875124 C G Yes NS, NC Q50739 nucleotide binding activity 78 28494.17 C T 288.1964 T I 2878109 T I Yes NS, NC Q50737 79 2853851 A. S 2886,398 G G 2882S43 G G Yes NS, NC Q50732 80 285.4021 G E 2886568 A. E 2882713 A. E Yes S, NULL Null 81 28S.4631 T S 28871.78 G A. 2883323 G A Yes NS, NC Q50732 82 2858117 T 2890666 C L 288681.1 C L Yes S, NULL Null 83 2863708 A. V 28962S8 G A. 2892403 G A Yes NS, C Q50649 nucleic acid binding activity 84 2865290 C NI 2897840 G. NII 2893985 G null No inc, NULL Null 85 2865319 G. NII 2897869 A Null 2894.014 A null No inc, NULL Null 86 2866213 T A. 2.898763 C A. 2894908 C A Yes S, NULL Null 87 2868616 A. : 29O1166 G W 28973.11 G W Yes NS, TP Q50644 hydrolase activity 88 2871372 A. 29O3922 G A. 29OOO67 G A Yes NS, NC Q50642 enzyme activity 89 28.71964 A. Q 2904514 G R 29OO659 G R Yes NS, NC Q50642 enzyme activity 90 2874.457 T 2.907.007 G V 2903152 G V Yes NS, NC Q50639 peptidyl-prolyl cis-trans isomerase activity 91 2879964 A. 2912514 G T 2908659 G T Yes S, NULL Null 92 288O160 G C 2912710 T T 29.088SS r T Yes NS, NC Q50635 protein targeting 93 2880535 T 291.3O85 C A. 
2909230 A Yes NS, NC Q50635 protein targeting 94 2881707 C S 29.14257 T N 291.04O2 r N Yes NS, C Q50634 protein targeting 95 2886244 A. Y 291.8794 T F 2914939 r F Yes NS, C Q50631 enzyme activity 96 2887118 A. 29.19668 G L 2915813 G L Yes S, NULL Nu 97 2888634 A. T 2921184 G A. 29.17329 G A Yes NS, NC Q50631 enzyme activity 98 2890241 A. S 2922791 G G 2.918936 G G Yes NS, NC Q50630 99 2890386 A. D 2922936 G G 29.19081 G G Yes NS, NC Q50630 2OO 2890432 C G 2922982 T G 291.9127 r G Yes S, NULL Nu 2O1 2893419 C R 2925960 T C 2922105 r C Yes NS, NC Q50625 2O2 2894748 A. A. 2927289 G A. 2923434 G A Yes S, NULL Nu 2O3 2894.968 T I 2927509 G S 2923.654 G S Yes NS, NC Q50622 integral to membrane 204 2896.114 A. 29.286SS G L 29248OO G L Yes S, NULL Nu 2OS 290O347 G 2932888 A. F 2929.033 A. F Yes NS, NC OO6209 acyl-CoA metabolism 2O6 2903343 T S 2935.884 C P 2932O29 C P Yes NS, NC OO6206 2O7 2911 OO2 A. H 2943S43 G. N. 2939688 G null Yes inc, NULL Nu 2O8 291.3009 A. I 294S541 G V 2941686 G V Yes NS, C OO6198 209 2913792 C Null 2946324 T Null 2942469 T null No inc, NULL Nu 210 292O364 A. H 2952899 G H 2949044 G H Yes S, NULL Nu 211 292O770 T NII 2953305 G. N. 294.9450 G null No inc, NULL Nu 212 2922.696 T 295.5231 C S 2951376 C S Yes NS, NC OO6184 213 2922723 G R 295.5258 C P 2.951403 C P Yes NS, NC OO6184 214 2926786 G. N. 2959321 A NII 2955467 A null No inc, NULL Nu 21S 2929067 G G 296.16O2 C G 2957748 C G Yes S, NULL Nu 216 2935735 C A. 296827O T V 2964.416 T V Yes NS, C P71942 protein tyrosine phosphatase activity 217 29.35930 C NI 296846S G Null 2964611 G null Yes inc, NULL Nu 218 293.8515 A. R 2982O32 G R 297.682O G R Yes S, NULL Nu 219 2941411 C A. 2984929 G G 2979718 G G Yes NS, C P71965 220 2941695 A. K 2985213 G E 298OOO2 G E Yes NS, NC P71965 221 2945 109 G D 2988627 C H 298.3416 C H Yes NS, NC P71969 222 2948.228 G S 2991691 C S 2986535 C S Yes S, NULL Nu 223 2950721 C L 2994.184 T L 2989028 T L Yes S, NULL Nu 224 29SO791 C G 29942S4 G A. 
2989098 G A Yes NS, C OS3231 uroporphyrinogen decarboxylase activity


TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly- tion O- BCG H37Ry CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 283 3O89537 G Q 3132963 T Q 3127179 T Q Yes S, NULL Nu 284 3O89546 G 3132972 C V 3127188 C V Yes NS, C P71627 28S 3O8962S G C 3133051 C C 3127267 C C Yes S, NULL Nu 286 3O89626 C C 3133052 G C 3127268 G C Yes S, NULL Nu 287 30914-10 A. R 3134836 G R 31290S2 G R Yes S, NULL Nu 288 3092467 G G 3135893 A. G 313.0109 A. G Yes S, NULL Nu 289 3092732 G A. 3136158 T E 313O374 r E Yes NS, NC P71624 290 30938O8 C Null 3137234 T N 31314SO r null No inc, NULL Nu 291 3094520 T C 3137946 C R 3132162 C R Yes NS, NC P71621 enzyme activity 292 3096,227 G C 3139653 C A. 3133869 C A Yes NS, NC P71619 transporter activity 293 3096535 A. 313996.1 G C 31341.75 G Yes NS, NC P71619 transporter activity 294 3096724 A. 314O150 C S 3134364 C Yes NS, NC P71619 transporter activity 295 3098.942 T 3142369 C 3136583 C Yes S, NULL Nu 296 3099150 A. 3.142577 G C 3136791 G C Yes NS, NC P71616 transporter activity 297 31O3523 C E 31469SO T E 3141164 r E Yes S, NULL Nu 298. 3108827 T T 31.52254 C A. 3146468 A Yes NS, NC O05814 tRNA ligase activity 299 3108991 C R 3152418 T H 3146632 r H Yes NS, NC O05814 tRNA ligase activity 3OO 3.111157 C R 3154584 T R 3148798 r R Yes S, NULL Nu 301 3.111158 C R 3154585 G R 3148799 G R Yes S, NULL Nu 3O2 3114301 G C 3157782 C W 3151996 C W Yes NS, NC O05810 ATP binding activity 303 3.115235 T S 3158716 C G 3152930 C G Yes NS, NC OO5809 nucleotide binding activity 3O4 31.15753 T Q 3159234 C R 3153448 C R Yes NS, NC O05809 nucleotide binding activity 305 311932O C N 31628O1 A. K 3157013 A. K Yes NS, NC OO5806 306 3121.306 T D 3.164787 C D 3158.999 C D Yes S, NULL Nu 307 3127009 G P 3170490 C C 31.64702 C C Yes S, NULL Nu 3O8 3131012 G R 3174493 A. C 31 686S1 A. C Yes NS, NC O33344 309 3131107 G P 3174588 C R 3168746 C R Yes NS, NC O33344 31 O 3133059 A. I 3176540 G 317O698 G Yes S, NULL Nu 311 3135930 G A. 
31794.11 D 3173569 r D Yes NS, NC O33350 isoprenoid biosynthesis 31 2 3137SO4 A. F 3.180985 C V 3175143 C V Yes NS, NC O33351 metalloendopeptidase activity 313 3141276 T NII 31847S7 G. N. 3178914 G null Yes inc, NULL Nu 314 31443O8 C R 3.187789 W 3181946 r R Yes NS, NC Q10802 integral to membrane 31S 3145285 G H 3188766 A. Y 3182923 A. Y Yes NS, C Q10803 membrane 316 3145758 G A. 3189239 A. A. 3.183396 A. A Yes S, NULL Nu 317 3146096 C NI 3189577 T NII 3183734 r null No inc, NULL Nu 318 314618O C K 3189661 K 3183818 r K Yes S, NULL Nu 319 3149728 T N 3193210 C NI 3187366 C null No inc, NULL Nu 32O 31SO908 A. I 3.194389 K 3188545 r K Yes NS, NC Q10809 321 3152247 C V 3195728 V 31898.84 r V Yes S, NULL Nu 322 3154848 A. M 3198329 G T 3192485 G T Yes NS, NC Q10788 translation elongation factor activity 323 315682O G V 32OO3O1 A. V 3194457 A. V Yes S, NULL Nu 324 3162781 C G 3206262 E 3200418 r E Yes NS, NC Q10817 DNA mediated transformation 32S 3163766 T L 32O7247 A. 32O1403 A. Yes S, NULL Nu 326 3169156 T H 3.212637 C R 32O6793 C R Yes NS, NC Q10793 RNA binding activity 327 3169605 T N 3213086 C D 32O7242 C D Yes NS, NC Q10789 proteolysis and peptidolysis 328 31798.19 T E 3223.300 C E 3217456 C E Yes S, NULL Nu 329 3185717 A. W 3229198 G R 3223352 G R Yes NS, NC Q10961 enzyme activity 33O 3186529 C R 323OO10 A. 3224164 A. Yes NS, NC Q10961 enzyme activity 331 3.187380 A. D 323O861 G D 3225O15 G D Yes S, NULL Nu 332 3192343 C G 3235712 G R 323OO3S G R Yes NS, NC Q10970 ATP-binding cassette (ABC) transporter activity 333 3193344 C R 3236713 A. 3231036 A. Yes NS, NC Q10970 ATP-binding cassette (ABC) transporter activity 334 3199928 C NI 3243352 T Null 3237675 T null No inc, NULL Null 335 3200987 G V 3244411 A. M 3238734 A. M Yes NS, NC Q10976 enzyme activity 336 3203120 C T 3246544 A. N 324O867 A. N Yes NS, C Q10977 enzyme activity 337 3204152 A. Q 3247576 G R 3241899 G R Yes NS, NC Q10977 enzyme activity 338 3211268 T A. 3254692 C A. 324901S C A Yes S, NULL Null


TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG. De scrip Poly- tion O- BCG H37Ry CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 397 3384993 T C 3428443 G G 3424236 G G Yes NS, NC P95095 cellular response to starvation 398 3385588 C P 3429038 T 3424831 T Yes NS, NC P95095 cellular response to starvation 399 3386152 A Null 342.9602 G. N. 342S395 G null Yes inc, NULL Null 400 3392.209 G. N. 3435659 C N 3431452 C null Yes inc, NULL Null 4O1 3394.933 A. I 3438383 G 34341.76 G Yes S, NULL Null 402 3397089 G G 3440539 A. G 343.6332 A. G Yes S, NULL Null 403 34O1113 A. L 3444563 G 3440356 G Yes S, NULL Null 404 3401886 A. V 34.45336 G A. 3441129 G A Yes NS, C P95078 protein kinase activity 40S 3404010 A. C 3447460 G R 34432S3 G R Yes NS, NC Q06861 DNA binding activity 406 34OS330 A. I 3448780 G V 3444573 G V Yes NS, C OS3300 407 34O9943 G A. 3453,393 A. 344918.6 A. T Yes NS, NC O53304 molecular function unknown 408 3410810 G V 3454260 C 345OOS3 C Yes NS, C OS3304 molecular function unknown 409 3414061 C N 3457511 G. N. 3453304 G null No inc, NULL Null 41 O 34.17571 C M 3461 O21 T 3456814 T Yes NS, NC O05771 411 3421176 G G 34.64626 A. E 346O424 A. E Yes NS, NC O05775 hydrolase activity 412 3423466 G A. 34.6691.6 C G 3462714 C G Yes NS, C P77909 enzyme activity 413 3423862 C E 3467312 G D 3463110 G D Yes NS, C O05776 414 3424012 G A. 3467462 C A. 3463260 C A Yes S, NULL Null 41S 342671S C S 347.016S T N 3465963 T N Yes NS, C P96293 cytokinesis 416 3430975 A. I 3.474424 G V 3470223 G V Yes NS, C O05783 ferredoxin-NADP reductase activity 417 3431707 G D 3475.156 A. N 3470955 A. N Yes NS, NC O05783 ferredoxin-NADP reductase activity 418 343.4037 G V 34774.86 A. 3473.285 A. I Yes NS, C O05785 419 34341.51 T N 34776OO C N 3473.399 C null No inc, NULL Null 42O 343S166 G A. 3478615 A. 3474.414 A. T Yes NS, NC O05786 enzyme activity 421 343.6346 A. 
G 3479795 G G 3475594 G G Yes S, NULL Null 422 3436653 C C 348O103 G W 3475.902 G W Yes NS, NC O05790 metabolism 423 3437.192 G R 34.80642 T 3476441 r Yes NS, NC O05790 metabolism 424 3437336 C P 3480786 T S 3476585 r S Yes NS, NC O05791 Zinc ion binding activity 42S 3438022 A. T 3481472 G A. 3477 271 G A Yes NS, NC P963.54 peroxidase activity 426 3438.540 A. G 348.1990 G G 34.77789 G G Yes S, NULL Null 427 3442328 T S 34.88553 G R 34843S2 G R Yes NS, NC O07033 428 3442459 A. Q 3488.684 G R 3484.483 G R Yes NS, NC O07034 429 3443112 G. N. 3489337 C N 348S 136 C null No inc, NULL Null 430 34.45311 A. V 3491536 G A. 3487.335 G A Yes NS, C O05798 431 34462O2 G C 3492427 T 3488226 T Yes NS, NC OO5800 432 3447457 C G 3493.682 A. V 34.89481 A null Yes NS, C Not annotated 433 34521.90 G T 349841S A. 3494.214 A. Yes NS, NC P95194 ATP binding activity 434 3456796 T NII 3501684 C N 3497171 C null Yes S, NULL Null 43S 3459632 A. N 3504520 G S 3500007 G S Yes NS, C P95188 lyase activity 436 346OO73 C S 3SO4961 T 3SOO448 r Yes NS, NC P95188 lyase activity 437 346O112 T L 3505.000 C C 3SOO487 C C Yes NS, NC P95188 lyase activity 438 346O114 C P 35050O2 A. 3SOO489 A. Yes NS, NC P95188 lyase activity 439 34642OO C NI 3509088 G. NII 3504575 G null No inc, NULL Null 440 3464257 C Q 3509145 A. H 3SO4632 A. H Yes NS, NC P95184 441 3465229 G Q 35101.17 T K 3SOS604 r K Yes NS, NC P95182 442 3466 696 G. N. 3511584 T Null 3507071 r null No inc, NULL Null 443 3467191 C G 3512079 A. G 3507566 A. G Yes S, NULL Null 444 3468833 T N 3513721 C N 3509208 C N Yes S, NULL Null 445 3470134 T G 3515022 C G 3510509 C G Yes S, NULL Null 446 3472676 T N 3517564 C N 3513051 C N Yes S, NULL Null 447 3476153 G A. 3S21041 A. T 3S16528 A. T Yes NS, NC P95173 electron transporter activity 448 3478232 A. Y 352312O T F 35186O7 T F Yes NS, C O863SO oxidative phosphorylation 449 34.80424 A. D 3525312 G G 352O799 G G Yes NS, NC O53307 oxidative phosphorylation 450 3480666 A. 
I 3525.554 G V 3S21041 G V Yes NS, C O53307 oxidative phosphorylation 451 3482095 T A. 3526983 A. A. 3S22470 A. A Yes S, NULL Null


TABLE 1-continued List of SNP's in Mycobacterium tuberculosis/M. bovis BCG.

De scrip Poly- tion O- BCG H37Ry CDC of

phism SNP SNP SNP SNP SNP ID Position Base AA Position Base AA Position Base AA ORF type GO ID Putative Function 787 425.9178 A. I 4322902 G V 4315228 G V Yes NS, C P96229 molecular function unknown 788 4259279 A. A 4323OO3 G A 4315329 G A Yes S, NULL Null 789 4266OSS A NII 4329779 G. N. 4322.105 G null Yes inc, NULL Null 790 4266511 T H 433O23S C R 4322561 C R Yes NS, NC P962.19 monooxygenase activity 791 427O698 C E 4334,422 G Q 4326748 G Q Yes NS, NC P96218 glutamate biosynthesis 792 4.272870 A Null 4336594 C Null 432892O C null No inc, NULL Null 793 4273741 C A 433746S T V 4329791 T V Yes NS, C P96217 794 428O361 T W 43.44038 C A 4336,363 C A Yes NS, C O69733 nucleotide binding activity 795. 42931.45 A. K 4356822 G E 4349146 G E Yes NS, NC O69742 796 4293741 A. G 4357418 G G 4349742 G G Yes S, NULL Nu 797 4294.795 A. P 43584.72 C P 43SO796 C P Yes S, NULL Nu 798 43OO168 A. L 43638OO G L 4356106 G L Yes S, NULL Nu 799 4301.014 C A 4364646 T T 4356952 r T Yes NS, NC OOS461 subtilase activity 800 43O4014 G Y 4367646 A. Y 4359.952 A. Y Yes S, NULL Nu 8O1 43O4489 G P 4368121 A. L 4360427 A. L Yes NS, NC OOS459 802 4307013 C G 437O645 T D 4362949 r D Yes NS, NC O05457 8O3 43O8186 C A 4374225 T T 43.66529 r T Yes NS, NC OOS453 804 4310990 C S 4377030 G S 4369334 G S Yes S, NULL Nu 8OS 4313166 T R 43792OS C R 4371509 C R Yes S, NULL Nu 806 4314838 C G 438O877 T D 4373.181 r D Yes NS, NC OOS449 807 4316253 C V 4382.293 T I 4374597 r I Yes NS, C OOS448 808 4316561 C G 43826O1 T D 4374.905 r D Yes NS, NC OOS448 809 4316925 T NII 43.8296S C NI 4375269 C null No inc, NULL Nu 810 4317617 G Q 43.83652 A. : 4375961 A. : Yes NS, TP OOS447 811 4317969 G Null 4384.004 C NI 4376313 C null No inc, NULL Nu 812 4319148 G P 4385184 A. S 4377493 A. S Yes NS, NC OOS446 813 432O218 C A 4386.254 G P 43785.63 G P Yes NS, NC OOS445 814 432O714 C W 43.86750 A. L 4379059 A. 
L Yes NS, NC OO5444 81S 4324427 C V 4390463 T M 4382772 T M Yes NS, NC OOS441 816 432482O T : 4390856 C W 438316S C W Yes NS, TP OOS440 817 4327799 T L 43938.35 C L 4386144 C L Yes S, NULL Null 818 4328171 C R 43942O7 G G 4386S16 G R Yes NS, NC OOS436 819 4328226 C A 4394262 G G 4386571 G A Yes NS, C OOS436 820 4329348 G S 4395384 A. N 4387692 A. N Yes NS, C OOS436 821. 4329765 T W 43.958O1 C A 4388109 C A Yes NS, C OOS436 822 4331996 C A 4398O32 T V 4390340 r V Yes NS, C OOS435 pathogenesis 823 4334.624 T S 44OO660 C S 4392968 G A Yes S, NULL Null 824 43.35857 G G 44O1894 A. D 43942O1 A. D Yes NS, NC PS2214 thioredoxin reductase (NADPH) activity 82S 433906S C R 4405102 T 43974.09 r R Yes S, NULL Null 826 4341548 C A 4407585 T A 4399892 r A Yes S, NULL Null 827 4342S30 A. W 4408567 G R 4400874 G R Yes NS, NC O53598 nucleic acid binding activity 828 26940 G Null 26959 C Null Null Null Null No Null, NC Null Null 829 34O28 C Null 34044 T NII Null Null Null No Null, NC Null Null Table I: List of single nucleotide polymorphisms in Mycobacaterium tuberculosis/M. 
bovis BCG
Polymorphism ID: The ID by which the polymorphism can be identified.
SNP Position: Position of the SNP in the respective genome.
Base: The nucleotide occurring in the region of the polymorphism in the respective genome.
AA: The amino acid occurring in the region of the polymorphism in the respective genome.
ORF: Indicates whether the polymorphism occurs in an open reading frame (yes) or not (no).
SNP type: Indicates the kind of SNP. S: synonymous SNP, which codes for the same amino acid as the reference sequence; NS: non-synonymous SNP, which codes for an amino acid different from the reference sequence; C: conservative SNP, coding for an amino acid of the same family as the reference sequence; NC: nonconservative SNP, coding for an amino acid from a different family than the reference sequence.
GO ID: The ID for the sequence in the Gene Ontology database.
Putative function: The putative function of the gene in which the SNP occurs.
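The SNP type labels defined in the legend of Table I can be illustrated with a minimal sketch. The function name `classify_snp`, the helper names, and the amino acid family grouping in `FAMILIES` are assumptions made for this example only; the specification does not state which amino acid families were used. Treating a missing amino acid as "Inc, NULL" is likewise an assumed reading of the table rows, since the legend does not define that label.

```python
from typing import Optional

# Hypothetical amino acid families, chosen only for illustration.
# The specification does not list the families it relies on.
FAMILIES = [set("GAVLI"), set("STN"), set("DE"), set("FYH")]


def _is_missing(aa: Optional[str]) -> bool:
    """Return True when no amino acid is recorded (None, empty, or 'Null')."""
    return aa is None or aa.strip().lower() in {"", "null"}


def _family_of(aa: str) -> Optional[int]:
    """Return the index of the family containing aa, or None if ungrouped."""
    for index, family in enumerate(FAMILIES):
        if aa in family:
            return index
    return None


def classify_snp(ref_aa: Optional[str], alt_aa: Optional[str]) -> str:
    """Label a SNP from the reference and variant amino acids.

    S  = synonymous, NS = non-synonymous,
    C  = conservative (same family), NC = nonconservative.
    """
    if _is_missing(ref_aa) or _is_missing(alt_aa):
        return "Inc, NULL"  # assumed label for rows lacking an amino acid
    if ref_aa == alt_aa:
        return "S, NULL"
    ref_family = _family_of(ref_aa)
    alt_family = _family_of(alt_aa)
    if ref_family is not None and ref_family == alt_family:
        return "NS, C"
    return "NS, NC"


if __name__ == "__main__":
    for pair in [("S", "N"), ("H", "R"), ("G", "G"), (None, "V")]:
        print(pair, classify_snp(*pair))
```

Which genome serves as the reference sequence is left to the caller in this sketch, because the legend refers to "the reference sequence" without naming it.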

[0244]

TABLE II

List of insertions/deletions in M. tuberculosis, M. bovis BCG

BCG BCG H37Ry H37Ry CDC CDC Polymorphism ID Start End Start end Start end ORF GO ID Putative Function 830 13233 13234 13233 13235 3233 3235 YES P71580 integral to membrane 831 24719 24.720 2472O 24739 3233 3235 YES P71590 832 28917 28918 28.936 28938 3233 3235 YES P71594 833 30962 30967 30982 30983 3233 3235 YES P71596 834 42578 42588 42594 42595 3233 3235 YES P71697 835 71576 71.614 71584 71585 3233 3235 YES Null 836 79584 79594 795.55 79.556 3233 3235 YES O53616 RNA binding activity 837 82490 82491 824.52 824.54 3233 3235 YES O53618 nucleotide binding activity 838 125870 125872 1258.32 1258.33 3233 3235 YES Q10900 magnesium ion binding activity 839 131213 131215 131174 131175 3233 3235 YES Null 840 138784 138786 138744 138745 3233 3235 YES O53637 peroxidase activity 841 139598 139600 1395.57 139558 3233 3235 YES O53637 peroxidase activity 842 147495 1474.96 147453 1474S5 3233 3235 YES O07170 translation elongation factor activity 843 147853 147854 147812 147814 3233 3235 YES Null 844 150079 1SOO8O 1SOO39 15OO67 3233 3235 YES OO7174 845 1SO906 151077 150893 1SO894 3233 3235 YES OO7174 846 162346 162347 162153 1621SS 3233 3235 YES P96811 enzyme activity 847 162451 162453 162259 162260 3233 3235 YES P96811 enzyme activity 848 162694 162695 1625O1 162SO3 3233 3235 YES P96811 enzyme activity 849 1944.95 194498 1943O3 1943O4 3233 3235 YES OO7410 transcription factor activity 850 208509 2O8510 2O8315 2O8322 3233 3235 YES OO742O 851 223943 223945 223749 223 750 3233 3235 YES OO7436 852 230770 230772 230575 230576 3233 3235 YES Null 853 234690 234693 234,494 234.495 3233 3235 YES O53648 854 257984 2S8O14 257786 257787 3233 3235 YES P96397 acyl-CoA dehydrogenase activity 855 264979 26498O 264.752 266645 3233 3235 YES P96403, P96405 856 26SO66 26SO68 266741 266742 3233 3235 YES P96405 metabolism 857 29.1957 29.1959 293631 293632 3233 3235 YES Null 858 331998 331999 333671 333.673 3233 3235 YES P56877 859 332977 335748 334651 334652 3233 3235 YES P56877 860 336706 336707 
335600 335657 3233 3235 YES P56877 861 336884 33688S 335844 3.35863 3233 3235 YES P56877 862 33818O 3381.81 337158 3371.68 3233 3235 YES O53684 863 339S4O 339S41 338.527 3.38537 3233 3235 YES O53684 864 363810 36.3856 362806 362807 3233 3235 YES OO7224 intracellular 865 369.162 3691.63 3681.13 368129 3233 3235 YES OO7231 tRNA ligase activity 866 370799 370800 369765 369767 3233 3235 YES OO7231 tRNA ligase activity 867 374314 374315 373281 373.283 3233 3235 YES OO7232 868 416214 416215 415182 41.5184 3233 3235 YES OO6296 869 42S351 425.353 42432O 424,321 3233 3235 YES OO6303 870 425821 42S824 424789 424790 3233 323S YES OO63O4 871 428.391 428392 427357 427373 3233 323S YES OO63O4 872 482549 482SSO 481528 481,530 3233 3235 YES P95211 membrane 873 4881.17 488119 487097 487098 3233 3235 YES O86335 enzyme activity 874 570941 570942 S6992O S6996.1 3233 3235 YES Q11146 molecular function unknown 875 578459 578500 577494 577495 3233 3235 YES Null 876 581835 S81956 S8O812 S80813 3233 3235 YES Q11156 two-component response regulator activity 877 612063 612O64 610910 610912 3233 3235 YES Null 878 624447 624522 623295 623296 3233 3235 YES OO6398 879 6246SS 624665 623419 62342O 3233 3235 YES OO6398 880 62.5594 625596 624349 6243SO 3233 3235 YES OO6398 881 641.609 641610 640363 640365 3233 3235 YES OO6415 882 664431 664432 663186 663188 3233 3235 YES O53767 ribonucleoside diphosphate reductase activity 883 66995.0 669952 6687O6 668707 3233 3235 YES O53772 monooxygenase activity US 2008/0085284 A1 Apr. 10, 2008 65

TABLE II-continued

List of insertions/deletions in M. tuberculosis, M. bovis BCG

BCG BCG H37Ry H37Ry CDC CDC Polymorphism ID Start End Start end Start end ORF GO ID Putative Function 884 690O39 690041 688794 688795 3233 3235 YES OO7788 pathogenesis 885 6931.38 693140 691892 691893 3233 3235 YES OO7786 pathogenesis 886 713437 713439 7121.90 712191 3233 3235 YES O07759, O07758 887 72368O 723681 722432 722434 3233 3235 YES P96920 DNA binding activity 888 73.1330 73.1331 730O83 730093 3233 3235 YES P96923 889 743870 744394 742632 742633 3233 3235 YES Null 890 800911 800912 799140 799142 3233 3235 YES P95044 891 804268 804309 802498 802499 3233 3235 YES Null 892 832699 832702 83O875 83O876 3233 3235 YES O538O2 893 83.8696 83.8697 836870 836919 3233 3235 YES O538.09 894 839071 839.072 837.293 837342 3233 3235 YES O538.09 895 839638 839767 837908 837909 3233 3235 YES O538.09 896 841026 841185 839.098 839.099 3233 3235 YES O53810 897 841398 841494 8393O2 839303 3233 3235 YES O53810 898 84.1688 84.1689 8394.87 839497 3233 3235 YES O53810 899 85.6450 85.6451 854258 854260 3233 3235 YES Null 900 877O25 877O28 874834 874.835 3233 3235 YES P71834, P71835 901 88.1931 88.1932 879738 87974O 3233 3235 YES P71838 integral to membrane 902 890O37 890O38 887845 88.7847 3233 3235 YES OO7268 cytoplasm 903 927816 927891 926984 926985 3233 3235 YES O53844 904 92.8822 92.8823 92.7918 92.7928 3233 3235 YES O5384.5 calcium ion binding activity 905 928975 928.976 928O8O 928215 3233 3235 YES O5384.5 calcium ion binding activity 906 936197 936.204 9354.46 93S447 3233 3235 YES O53850 cell wall 907 95.3566 95.3567 952809 952811 3233 3235 YES Null 908 96.1024 961O2S 96O268 96.0309 3233 3235 YES Null 909 963656 963657 962953 962.955 3233 3235 YES O53876 Mo-molybdopterin cofactor biosynthesis 910 96S541 96SS42 964839 965070 3233 3235 YES O53879 911 96.89OO 968910 968438 968439 3233 3235 YES O53884 912 969448 969449 96.8977 968.981 3233 3235 YES O53884 913 977362 977363 976894 976896 3233 3235 YES Q10540 integral to membrane 914 O10671 O10673 O10204 O1 O2OS 3233 3235 YES O05900 
915 O32449 O32450 O31981 O31983 3233 3235 YES O05917 916 O39551 O39553 O39084 O39085 3233 3235 YES OO5871 protein kinase activity 917 O41920 O41922 O41452 O41453 3233 3235 YES P953O2 nucleotide binding activity 918 O64SSO O6455.1 O64081 O64110 3233 3235 YES Null 919 O87886 O87887 O87445 O87447 3233 3235 YES O86319 acyl-CoA dehydrogenase activity 920 O90629 O90631 O901.89 O901.90 3233 3235 YES Null 921 131681 131683 131228 131229 3233 3235 YES OO5597 922 135355 135356 134901 134907 3233 3235 YES P96384 membrane 923 165969 165971 16552O 16SS21 3233 3235 YES Null 924 169165 1691.67 16871S 168716 3233 3235 YES O86321 925 173288 173289 172837 172839 3233 3235 YES Null 926 189124 18912S 188674 188678 3233 3235 YES O53415 927 189603 189622 1891S6 189157 3233 3235 YES O53415 928 189661 189662 1891.96 1892OO 3233 3235 YES O53415 929 191462 191463 191OOO 191010 3233 3235 YES O53416 930 191817 192525 191364 191365 3233 3235 YES O53416 931 192629 192812 191459 191460 3233 3235 YES O53416 932 214392 214393 213O3O 213049 3233 3235 YES O53435 933 214589 214590 213245 213255 3233 3235 YES O53435 934 214840 214844 213505 213506 3233 3235 YES O53435 935 215028 215074 213690 213691 3233 3235 YES O53435 936 2196.17 219618 218234 218244 3233 3235 YES O53439 937 231791 231792 230417 230419 3233 3235 YES O53449 integral to membrane 938 274621 274623 273248 273249 3233 3235 YES OO6545 membrane 939 3OO681 3OO683 299307 2993O8 3233 3235 YES O50424 940 3O6903 306904 305528 305643 3233 3235 YES Null 941 314587 314589 3.13336 313337 3233 3235 YES Null 942 341420 341421 34O168 340182 3233 3235 YES O05298 US 2008/0085284 A1 Apr. 10, 2008 66

TABLE II-continued

List of insertions/deletions in M. tuberculosis, M. bovis BCG

Polymorphism ID; BCG Start; BCG End; H37Rv Start; H37Rv End; CDC Start; CDC End; ORF; GO ID; Putative Function

943 358664 35866S 357415 357421 3233 3235 YES O05315 944 36 7083 36 7086 365839 36S840 3233 3235 YES OO6291 serine-type endopeptidase activity 945 404177 4041 78 402931 405929 3233 3235 YES Q11063, Q11061 946 407255 4O72S6 4O9016 4O9018 3233 3235 YES Q11058 monooxygenase activity 947 439690 439691 44.1542 44.1686 3233 3235 YES Q10614 enzyme activity 948 44.1478 44.1519 443483 443484 3233 3235 YES Q10616 integral to membrane 949 466163 4661.64 468112 46811S 3233 3235 YES Q10620 integral to membrane 950 47SO 63 47SO64 477025 477027 3233 3235 YES Null 951 539986 539.987 S41949 543298 3233 3235 YES Null 952 S4O483 S4O485 543804 5438OS 3233 3235 YES P71799 953 543150 543152 54.6470 S46471 3233 3235 YES P718O1 sulfotransferase activity 954 5691.67 569 168 S72486 S72849 3233 3235 YES P71664 integral to membrane 955 608954 608976 612645 612646 3233 3235 YES OO6823 956 627336 627337 630987 63101S 3233 3235 YES OO6810 957 628863 628891 632S41 632S42 3233 3235 YES OO6810 958 632753 63.2882 636167 63.6168 3233 323S YES OO6808 959 63290S 63.2909 636181 636182 3233 323S YES OO6808 960 633457 633467 63673O 636731 3233 323S YES OO6808 961 689986 68.998.7 693.238 693240 3233 3235 YES P71783 962 737536 737538 753.521 753.522 3233 3235 YES Q10777 enzyme activity 963 738O35 738037 754019 7S4O20 3233 3235 YES Q10777, Q10776 964 74418.6 744.191 760169 76O170 3233 3235 YES Q10761 Succinate dehydrogenase activity 96S 745810 747954 761789 76.1790 3233 3235 YES Q10773 membrane 966 754.245 7S4246 768071 7688.68 3233 3235 YES Q10768 alpha-amylase activity 967 765829 765830 780461 780463 3233 3235 YES OO6615 968 765952 765954 780585 780586 3233 3235 YES OO6615 969 837548 837549 85218O 852182 3233 3235 YES Null 970 850305 850327 864938 864939 3233 3235 YES P94986 971 8796.87 879698 894299 8943OO 3233 3235 YES O53916 nucleotide binding activity 972 89.2915 892916 907517 907558 3233 3235 YES Null 973 90O884 90O887 91SS.42 91SS43 3233 3235 YES O33.192 974 914O68 914O69 928.724 928.726 3233 3235 
YES Null 975 93O724 930725 945381 945383 3233 3235 YES P71976 976 941O12 941 OS3 955670 955671 3233 3235 YES Null 977 953648 953650 968249 9682SO 3233 3235 YES O33271 978 96.7611 967752 982211 982212 3233 3235 YES O65937 979 9684.48 968449 98.2898 98.2967 3233 3235 YES O65937 98O 968664 96.8665 983.192 98.3261 3233 3235 YES O65937 981 98.3171 98.3172 992328 99.2330 3233 3235 YES OO6794 982 985312 985313 994470 994472 3233 3235 YES OO6795 molecular function unknown 983 99.2126 99.2145 2001684 2OO1685 3233 3235 YES OO68O1 984 2O16682 2O16683 2026222 2O26231 3233 3235 YES O86373 985 2051905 2OS191S 2061448 2061449 3233 3235 YES Q50615 integral to membrane 986 2O64977 2O64978. 2074S11 2O74614 3233 3235 YES Null 987 2O791.95 2O791.96 2088841 2O88979 3233 3235 YES Q50594 integral to membrane 988 2080613 2080626 20904O6 20904O7 3233 3235 YES Q50593 integral to membrane 989 2O84.192 2084.193 2093973 2093975 3233 3235 YES P951.65 phosphogluconate dehydrogenase (decarboxylating) activity 990 2O851.36 2O851.37 2094918 209492S 3233 3235 YES P951.65 phosphogluconate dehydrogenase (decarboxylating) activity 991 2O87040 2O87041. 2096828 20968.30 3233 3235 YES Null 992 2093386 2093387 2103175 2103177 3233 3235 YES Null US 2008/0085284 A1 Apr. 10, 2008 67


| 1993 | 2099733 | 2099735 | 2109523 | 2109524 | 3233 | 3235 | YES | Null | |
| 1994 | 2116913 | 2116915 | 2126702 | 2126703 | 3233 | 3235 | YES | O07753 | transporter activity |
| 1995 | 2123684 | 2123700 | 2133472 | 2133473 | 3233 | 3235 | YES | O07748 | |
| 1996 | 2127747 | 2127758 | 2137520 | 2137521 | 3233 | 3235 | YES | O07744 | |
| 1997 | 2133043 | 2133044 | 2142806 | 2142808 | 3233 | 3235 | YES | O07737 | alcohol dehydrogenase activity |
| 1998 | 2133758 | 2133760 | 2143522 | 2143523 | 3233 | 3235 | YES | O07737 | alcohol dehydrogenase activity |
| 1999 | 2136332 | 2136378 | 2146095 | 2146096 | 3233 | 3235 | YES | O07733 | |
| 2000 | 2151627 | 2151629 | 2161345 | 2161346 | 3233 | 3235 | YES | O07718 | enzyme activity |
| 2001 | 2153548 | 2153549 | 2163265 | 2163278 | 3233 | 3235 | YES | O07716 | enzyme activity |
| 2002 | 2153668 | 2154142 | 2163397 | 2163398 | 3233 | 3235 | YES | O07716 | enzyme activity |
| 2003 | 2154541 | 2154542 | 2163787 | 2163847 | 3233 | 3235 | YES | O07716 | enzyme activity |
| 2004 | 2156236 | 2156602 | 2165551 | 2165552 | 3233 | 3235 | YES | O07716 | enzyme activity |
| 2005 | 2160449 | 2160451 | 2168813 | 2168814 | 3233 | 3235 | YES | O53960 | |
| 2006 | 2184230 | 2184231 | 2192593 | 2192595 | 3233 | 3235 | YES | P95275 | electron transport |
| 2007 | 2199225 | 2199227 | 2207589 | 2207590 | 3233 | 3235 | YES | Null | |
| 2008 | 2254439 | 2254440 | 2270521 | 2270531 | 3233 | 3235 | YES | Null | |
| 2009 | 2260638 | 2260639 | 2276722 | 2276724 | 3233 | 3235 | YES | O53475 | nucleoside metabolism |
| 2010 | 2312077 | 2312079 | 2328162 | 2328163 | 3233 | 3235 | YES | Q10680 | vitamin B12 biosynthesis |
| 2011 | 2313772 | 2313774 | 2329856 | 2329857 | 3233 | 3235 | YES | Q10671 | porphyrin biosynthesis |
| 2012 | 2313988 | 2313989 | 2330071 | 2332091 | 3233 | 3235 | YES | Q10671, Q10683 | |
| 2013 | 2320088 | 2320092 | 2338200 | 2338201 | 3233 | 3235 | YES | Q10689 | integral to membrane |
| 2014 | 2324539 | 2324551 | 2342648 | 2342649 | 3233 | 3235 | YES | Q10692 | |
| 2015 | 2329456 | 2329457 | 2347554 | 2347595 | 3233 | 3235 | YES | Q10699 | DNA binding activity |
| 2016 | 2339127 | 2339128 | 2357282 | 2357286 | 3233 | 3235 | YES | Q10707 | |
| 2017 | 2339871 | 2339873 | 2358029 | 2358030 | 3233 | 3235 | YES | Q10707, Q97.AE2 | |
| 2018 | 2347255 | 2347256 | 2365412 | 2366761 | 3233 | 3235 | YES | Null | |
| 2019 | 2349048 | 2349049 | 2368563 | 2368565 | 3233 | 3235 | YES | Null | |
| 2020 | 2352985 | 2352986 | 2372501 | 2372542 | 3233 | 3235 | YES | O33247 | molecular function unknown |
| 2021 | 2361585 | 2361586 | 2381158 | 2381186 | 3233 | 3235 | YES | O33258 | |
| 2022 | 2378768 | 2378769 | 2398368 | 2398377 | 3233 | 3235 | YES | O06237 | |
| 2023 | 2382325 | 2382432 | 2401925 | 2401926 | 3233 | 3235 | YES | Null | |
| 2024 | 2402489 | 2402494 | 2421973 | 2421974 | 3233 | 3235 | YES | O06217 | |
| 2025 | 2404055 | 2404056 | 2423535 | 2423634 | 3233 | 3235 | YES | O06215 | |
| 2026 | 2404228 | 2404229 | 2423816 | 2423835 | 3233 | 3235 | YES | O06215 | |
| 2027 | 2410508 | 2410509 | 2430114 | 2431463 | 3233 | 3235 | YES | Null | |
| 2028 | 2427464 | 2427465 | 2448428 | 2448430 | 3233 | 3235 | YES | O53521 | enzyme activity |
| 2029 | 2440537 | 2440642 | 2461502 | 2461503 | 3233 | 3235 | YES | Q10389 | integral to membrane |
| 2030 | 2480295 | 2480297 | 2501146 | 2501147 | 3233 | 3235 | YES | Q10511 | |
| 2031 | 2502267 | 2502271 | 2523205 | 2523206 | 3233 | 3235 | YES | Null | |
| 2032 | 2504789 | 2504790 | 2525724 | 2525726 | 3233 | 3235 | YES | O53525 | electron transport |
| 2033 | 2511025 | 2511026 | 2531961 | 2532058 | 3233 | 3235 | YES | Null | |
| 2034 | 2513519 | 2513520 | 2534561 | 2534564 | 3233 | 3235 | YES | O53536 | nitrogen metabolism |
| 2035 | 2528967 | 2528968 | 2550011 | 2551360 | 3233 | 3235 | YES | Q50687 | glycerol metabolism |
| 2036 | 2540310 | 2540311 | 2562712 | 2562714 | 3233 | 3235 | YES | Q50675 | membrane |
| 2037 | 2540853 | 2540854 | 2563256 | 2563259 | 3233 | 3235 | YES | Q59570 | thiosulfate sulfurtransferase activity |
| 2038 | 2541962 | 2541964 | 2564367 | 2564368 | 3233 | 3235 | YES | Q50673 | enzyme activity |
| 2039 | 2544362 | 2544364 | 2566766 | 2566767 | 3233 | 3235 | YES | Null | |
| 2040 | 2584392 | 2584410 | 2606795 | 2606796 | 3233 | 3235 | YES | P71879 | transporter activity |
| 2041 | 2592156 | 2592158 | 2614558 | 2614559 | 3233 | 3235 | YES | Null | |
| 2042 | 2607119 | 2607120 | 2639043 | 2639047 | 3233 | 3235 | YES | P95249 | |
| 2043 | 2658273 | 2658275 | 2690200 | 2690201 | 3233 | 3235 | YES | P71749 | |
| 2044 | 2661152 | 2661153 | 2693078 | 2693088 | 3233 | 3235 | YES | P71748 | oxygen transporter activity |
| 2045 | 2672954 | 2672976 | 2704883 | 2704884 | 3233 | 3235 | YES | P71736 | |
| 2046 | 2673694 | 2673779 | 2705602 | 2705603 | 3233 | 3235 | YES | Null | |
| 2047 | 2679611 | 2679613 | 2711425 | 2711426 | 3233 | 3235 | YES | P71729 | |


| 2048 | 2689758 | 2689759 | 2721571 | 2721574 | 3233 | 3235 | YES | P71924 | DNA binding activity |
| 2049 | 2692384 | 2692385 | 2724199 | 2724220 | 3233 | 3235 | | Null | |
| 2050 | 2748934 | 2748935 | 2780769 | 2780773 | 3233 | 3235 | | O53203 | |
| 2051 | 2752776 | 2752777 | 2784614 | 2785963 | 3233 | 3235 | | Null | |
| 2052 | 2762072 | 2762073 | 2795268 | 2795270 | 3233 | 3235 | | Null | |
| 2053 | 2762938 | 2762939 | 2796135 | 2796137 | 3233 | 3235 | | O53212 | |
| 2054 | 2763778 | 2763780 | 2796976 | 2796977 | 3233 | 3235 | | O53212 | |
| 2055 | 2768766 | 2768767 | 2801963 | 2801967 | 3233 | 3235 | | O53215 | RNA-3'-phosphate cyclase activity |
| 2056 | 2769956 | 2769957 | 2803156 | 2803166 | 3233 | 3235 | YES | O53215 | RNA-3'-phosphate cyclase activity |
| 2057 | 2771738 | 2771747 | 2804947 | 2804948 | 3233 | 3235 | YES | O53215 | RNA-3'-phosphate cyclase activity |
| 2058 | 2834003 | 2834650 | 2867204 | 2867205 | 3233 | 3235 | | P95009 | |
| 2059 | 2839452 | 2839453 | 2871997 | 2871999 | 3233 | 3235 | YES | P95001 | shikimate 5-dehydrogenase activity |
| 2060 | 2849054 | 2849055 | 2881600 | 2881602 | 3233 | 3235 | | Q50737 | |
| 2061 | 2855417 | 2855418 | 2887964 | 2887967 | 3233 | 3235 | | Q50732 | |
| 2062 | 2863468 | 2863469 | 2896017 | 2896019 | 3233 | 3235 | | Q50649 | nucleic acid binding activity |
| 2063 | 2890583 | 2890593 | 2923133 | 2923134 | 3233 | 3235 | | Q50630 | |
| 2064 | 2911188 | 2911198 | 2943729 | 2943730 | 3233 | 3235 | | O06199 | |
| 2065 | 2915441 | 2915442 | 2947973 | 2947977 | 3233 | 3235 | | O06191 | |
| 2066 | 2925032 | 2925033 | 2957567 | 2957569 | 3233 | 3235 | | P71930 | molecular function unknown |
| 2067 | 2925801 | 2925803 | 2958337 | 2958338 | 3233 | 3235 | YES | P71930 | molecular function unknown |
| 2068 | 2938901 | 2938902 | 2982418 | 2982420 | 3233 | 3235 | | Null | |
| 2069 | 2947177 | 2947218 | 2990695 | 2990696 | 3233 | 3235 | | Null | |
| 2070 | 2952702 | 2952795 | 2996165 | 2996166 | 3233 | 3235 | | O86317 | |
| 2071 | 3010580 | 3010581 | 3053941 | 3053943 | 3233 | 3235 | | O33284 | |
| 2072 | 3011358 | 3011359 | 3054720 | 3054795 | 3233 | 3235 | | O33284 | |
| 2073 | 3012343 | 3012367 | 3055789 | 3055790 | 3233 | 3235 | | O33285 | |
| 2074 | 3042938 | 3042939 | 3086361 | 3086386 | 3233 | 3235 | | O33321 | DNA binding activity |
| 2075 | 3043634 | | 3087081 | 3087083 | 3233 | 3235 | YES | | alanine dehydrogenase activity |
| 2076 | 3064422 | 3064426 | 3107870 | 3107871 | 3233 | 3235 | | P71652 | |
| 2077 | 3073960 | 3073961 | 3117405 | 3117408 | 3233 | 3235 | YES | P71639 | DNA binding activity |
| 2078 | 3075770 | 3075771 | 3119217 | 3119800 | 3233 | 3235 | | Null | |
| 2079 | 3075914 | 3076356 | 3119953 | 3119954 | 3233 | 3235 | YES | Null | |
| 2080 | 3076439 | 3076501 | 3120027 | 3120028 | 3233 | 3235 | | Null | |
| 2081 | 3078601 | 3078745 | 3122118 | 3122119 | 3233 | 3235 | | Null | |
| 2082 | 3078967 | 3078968 | 3122331 | 3122394 | 3233 | 3235 | | Null | |
| 2083 | 3088034 | 3088044 | 3131469 | 3131470 | 3233 | 3235 | | P71629 | molecular function unknown |
| 2084 | 3098539 | 3098540 | 3141965 | 3141967 | 3233 | 3235 | | P71617 | transporter activity |
| 2085 | 3112605 | 3112606 | 3156032 | 3156073 | 3233 | 3235 | | Null | |
| 2086 | 3146666 | 3146667 | 3190147 | 3190149 | 3233 | 3235 | | Q10806, Q10806 | |
| 2087 | 3150757 | 3150759 | 3194239 | 3194240 | 3233 | 3235 | | Q10809 | |
| 2088 | 3196190 | 3196191 | 3239559 | 3239600 | 3233 | 3235 | | Null | |
| 2089 | 3248018 | 3248123 | 3291442 | 3291443 | 3233 | 3235 | | Null | |
| 2090 | 3253069 | 3253071 | 3296379 | 3296380 | 3233 | 3235 | | P96284 | enzyme activity |
| 2091 | 3267820 | 3267822 | 3311129 | 3311130 | 3233 | 3235 | | P95134 | metabolism |
| 2092 | 3288055 | 3288056 | 3331363 | 3331366 | 3233 | 3235 | | P95120 | aspartic-type endopeptidase activity |
| 2093 | 3293332 | 3293333 | 3336642 | 3336751 | 3233 | 3235 | | Null | |
| 2094 | 3294465 | 3294466 | 3337893 | 3337903 | 3233 | 3235 | | P95114 | cell wall |
| 2095 | 3307774 | 3307815 | 3351211 | 3351212 | 3233 | 3235 | | Null | |
| 2096 | 3313357 | 3313358 | 3356738 | 3356740 | 3233 | 3235 | | Null | |
| 2097 | 3336999 | 3337085 | 3380437 | 3380438 | 3233 | 3235 | | O53268, O53268 | |
| 2098 | 3371757 | 3371758 | 3415194 | 3415209 | 3233 | 3235 | | Null | |
| 2099 | 3381643 | 3381645 | 3425094 | 3425095 | 3233 | 3235 | | P95097 | acyl-CoA dehydrogenase activity |
| 2100 | 3430544 | | 3473994 | 3473995 | 3233 | 3235 | YES | Null | |