
(12) United States Patent (10) Patent No.: US 9,556,430 B2 Polyak et al. (45) Date of Patent: Jan. 31, 2017

(54) GENE METHYLATION AND EXPRESSION

(75) Inventors: Kornelia Polyak, Brookline, MA (US); Min Hu, Brighton, MA (US); Noga Qimron, Brighton, MA (US); Jun Yao, Chestnut Hill, MA (US)

(73) Assignee: Dana-Farber Cancer Institute, Inc.

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 1565 days.

(22) PCT Filed: May 30, 2006

PCT Pub. Date: Nov. 30, 2006

(65) Prior Publication Data: US 2009/0280478 A1, Nov. 12, 2009

Related U.S. Application Data

(60) Provisional application No. 60/685,104, filed on May 27, 2005.

(51) Int. Cl.: C12Q 1/68 (2006.01); C12N 15/10 (2006.01)

(52) U.S. Cl.: CPC C12N 15/1093 (2013.01); C12Q 1/6855 (2013.01); C12Q 1/6881 (2013.01); C12Q 2600/154 (2013.01); C12Q 2600/158 (2013.01)

(58) Field of Classification Search: USPC 546/113; 514/300. See application file for complete search history.

Primary Examiner — Joseph G Dauner

(74) Attorney, Agent, or Firm — Fish & Richardson P.C.

(57) ABSTRACT

The invention provides a method of analyzing the methylation status of all or part of an entire genome. Moreover, the invention features methods of and reagents for characterizing biological cells containing DNA that is susceptible to methylation. Such methods include methods of diagnosing cancer, e.g., breast cancer.

4 Claims, 60 Drawing Sheets



U.S. Patent Jan. 31, 2017 Sheet 3 of 60 US 9,556,430 B2


THX3 catgcagc.cagctttctg.ccc.cttcactittgcacagcacttgttaccCaagaagggc.ca999 Caa agtgttgtgaggctgtgctgggtgcaggtggctgagtgtcagct aag atcaggggt. tgttgttggcaggactggggaaggggggcaggittaaggag

caccCCottcct is, it agtcc cccaaggtgc aga s 59CCaCCa99C toccaagttcC

Ctgggatt cctacccttctggitgcagcCtctggalaggcagtgccCagg tgatcaac

FIG. 6A

LHX3,agtc.c. minus c ccaaggtgcstrand gccc.ggcc tict gCag Cat gccaccagg cagtggy Ctggagggitt sigggctCCC aagttcc c. . . . actgggcctgg ctgtc. is gtgctgaa . CC's c ctcc.gggc

cCCtct, it is citctg.ccc tgti, giccitccCtcc ccc.ccct

FIG. 6B

w E9 ggc siggaggacctgtagaggagaagaaatatgttctgaretc...tgcc. ictgggact 9 ccagcagccaccactcctgggaaagaactgagggagtgtcCagggiac Cagaatcago Caggaggatagg9tCcagcCaaga gaatgtagggtgggaggagagat.ca.gtCaca gaactgctctggctgatctgatttcacttgaagtcaa.ca Ittatgtacttaggccto acco cc.caactgstttctoottctoctg.ccc.ccctcaccc.ccacctacatcCCttgcc.ccaggttittc catcc saatc. actc.ccc.caacctata laaggtggg.ccct igga totctgcaggaa cagotactggggtatattgggtatataaagagtggg taccCtc.cct S acidgtocagg cttctggacticctgg sect gggaggtoctociccagt Egctggcc

FIG. 7A

LMX1A, minus strand agagag aagttgaga gectgctgg aagcttctgc ce. is gaggagc eccotsgct to gagic s tggctctgct cottisgar cc.caggctgg gcigga t ggt cc.gga t cctaccagcc tectc. citcagoggaggagaggg

FIG. 7B

TCF7L1 ctggggggggggggggggggg.ca.g. gaggaggreggctc.ca gc g gagggga is agctgatcccct tccagga. 'sagcag: Ec Ct aga, aggt C gaggtaaggaagca gccacccd.iggggat.cc caggs gagag ccescagcc atc ggacactitt CC agaa.gc

agg tatgttgcC ctgggacagccCcccactict ttcc.ctg

gacCtg

gaggaggi cacilitgagtgcc.t. gcc.cgiggcttggc.catggagtgg

9. S. ggCt9. citggccagtgcc tagatgaaagtaaagttactitta acttittCCCctic ttgiggttgaggttittggagtcCaCCtctgggatctt CCttggCctCcagaat

FIG. 8A

TCF7L1, plus strand

tgagtgcc..tiggs, c. Egg Cttggc.catg gagtgggggatggggcott is gagttgaact actctotggregosag

FIG. 8B


PRDM14 caacci gttcctgcaaaacccattgaagttccctttccttccttcttctoagtc.ctacaaaact settggtagggtotgtgggagc gggCCCtgccaacct gggCt999C ggagc sag Cagg Eigt is gagggcaatggagcaag acagctcCagg, ctcctggg.ccct gctgggagg gaagagc gaggaccctgggte cac Cagatggaga : ct cocagagcc.ccc.gaggcagg

to cagdgaccc.g. acctyctotgg.cccagigtgtgaccc ggtoctaggtactgact

Caataagaaa

FIG. 16A

PRDM14, minus strand

acctg citctgg.ccca gigtgtgac cc sigg to ctggs gttcc tgactgcc caggggaggggg. ccact tttggctgCC ctagatg cocctgaac Ctct titt CCC to ggcag iCCaC attccc.gg titcct is gaa acticcaatca ttcta

FIG. 16B

ZCCHC14

tagg ggcai caccttgttggcag stggcc.cc clic

CEggtggttggaggc.catggtgaacagcagcaggaa CtCa tgcagcc tagttgttggatgatggagt fitgags. V acaCCagCagcttgctg tCaggttggtgaggctgCCCaggit gciggttgttggccttgatcto tgg tagt cot

FIG. 17A

ZCCHC14, minus strand tgg aggetgggt sigggggtegtgcaggc tggaggCttg gagtgcagag ttggggatgc agaCttg999 tacagggCag ag Eggcascac cttgttggcag gctgggcaag toggcage cacci g;99. l, it: ccttgc.c. stggcc.cc clic ttca gqctgctctg gatctgaatg agctcctgg.

FIG. 17B

Hoxd4 gCaggtoaggaccatgtggctggctgct gctgtggg caaaagggggtggggatgggggggt gggggaggactCcattttcagagcagggggaaggctgtggaggagggggattitccaaaatgct tgagggttctgacctggtggtggg.cccagaagaaggagcacatttggggatcc caagcctgg gg tatgtgggtgttgttgaggaggtgggtgggagtgage tytgaggggagagggggagg gaggaa.gcaagiisagCttgggags ...is ggcct's gGG saccaggaagtga gagg ragggg CCtaaCtagtggC !ctdacctdcC totgttttgtc agtgaaccocaactacacasgtggggaaccoaagiisg.tcc CaagtCctagaactggaaaaagaattt cattttalacaggitatctgaCaag gaggatga ctcacaccctgttgttctgttgagiccagatcaagatctggttccagaac agtggaaaaaagatcataagctg.cccaa.cactaaaggcaggtoat cCtCatct tcct CCtca cittgctcct

FIG. 18A

HOXD4, plus strand ag aagaaggagC acatttgggg atcc is caag cctggggitat gtgggtgttgt ttgaggaggt gygtgggagt gagstgtg c sigggaga gq9 ggagg gaggaagcaa gagcttgg gags . g ggagggC gcct gg. E. aagtgag g a gaggraggggcctaact agtggc gg

FIG. 18B

galactagdtgggagtgggCCCtgcagtgaggCagggggtgggCCagggagaacaaggcaa.gagga gcttcattcagggttcctgagcctttgtgagccactCastttttaccactCacttaactcitt tgttgttggggtgaggggtcct agcctggatttggg tatgaaaacccaggcaagaaagacctg cc.caagcctttaaaggaatgcaaagtcatcctictagocaccc.ccagagataaaggctggggat tgagtctoctgcagatggtgg gcctCctggggctggcaagttgggacagaggcccataag.ccc toctgGG's ccttcCCaCCCctct gccCtctccactCccagatggggatttgggttcagag cagcctggCacacacaccCccaccccaccagaatctoactc.ccagottcctatgactattoatta gtatt Cacaacaatgggaaagttctgggtgtgcacagggatttitttacagttagaaagttgtttaa gtcaatgacct cactgggCCtcagcaacCCtgggaggCagatggCagtCagaatgatcCataaat gacctg.cccCaggtoacacagotcCtaala CaggggagctggalacctggctgggagcCttgacitat Ccactgctica

FIG. 19A

SLC9A3R1, plus strand tcal tittitt accact cact taactCtt tdttgttggg gtgaggggtc ct agcctg gatttgggta tigaaaaccCa ggcaagaaag acctg.cccaa gCCtttaaag gaatgcaaag to atcCtcta gccacccc.ca gagatsaaa ggctggggat tdagt.ctCct gCagatggtg gigcctCct ggggctggCa agttgggaCa gaggCCCata agcCCtcCtgggggie'.cctt CCCaCCCCtC t is gcc.ctct Ccactic

FIG. 19B

LOC389333

CtgggCCttg 9 aattatct, ... . site CCC fitgg acce: aagg

99. toagttttctggccaa is E. gCCtgcagggggccCagg acts. C. 9999

c. cccttgggggactctgg ttgggCa Egggcctgg siggs

FIG. 20A

LOC389333, minus strand

cas aagcc. . . . g . gcagggggcc cage act tccagggaac age scegcc ccttggggga citctggggc cts atctga cittgggca Eggg cctgg gg, i. Caccitccaca cact saact gctggggggcaggactt ggcc.ca.gg

FIG. 20B

gCaC cctggccaccct ggctctcttaaaggagccaccc.ccacccCagggcaatcat. gacc gaccaggcctc. g.gtgacacatch egctctcagaggs ccaggaccctatoattcat CCCtttcCatsgcaaagttgaaaagttcagagc.ccggg cacacacottgg tittatgtatacag aagtggggtgciggs ggaaggg sigggaatgagggaacctagaggc

CCaCatcC

ggaged Bigggggg .. s sic Ccccatgcag atcaggc cttcttgg

FIG. 21A

CDC42Ep5, minus strand atttatgta tacagaagtg gggtgc gg gaggcratg a tettca gct aggto ctg. tcc ggggggitt c is list C. Etccatga cccagdac t

FIG. 21B

Cxorf12 P gigteggs aggagg gagciggtgartcacictitcc.ccc

it a CCCCt9ggcc.caagcotcttaaaggacCCCt9 ctgcct's fictgci if: :Ctcagat Caggtg CCCCCCCaC is ccto scoctggg agc.ccatctgccacttcccctcagg caGGriss cc si, is congco to ": ". M glactittcagoaca ctigoccacccccttgggaatgtctotaggttgataagggascaccigaagaaaaacaaagatggs o ggs goatct is treet

FIG. 22A

Cxorf12,ccctg. ictgcct plus strandis giggggtgggggt ge. - ctgggctaaa gct agt ctoagatc in:

cc ccc.ca.gc.ccc.ccca c ctccaricct eggs agc.cCatC ttgccacttic cocts cogcatc gg toaata g, a

FIG. 22B


FIG. 27A

FIG. 27B (panel labels: CD44, MUC1, CD24)

FIG. 29A (panel labels include FOXC1 and PACAP)

FIG. 29B (panel labels include FOXC1 and PACAP)

FNDC1, plus strand TGGGGCAGTGTTACAATTACAGAAAAGGGAGAGGAGGTC cTGAGTCCTTGGCCTGGGCAACAAGGCACACT GAAAACTGGGTTCCTTTT ACC CAT-TG. ccCTAGAAATGACAGCCAGAAGCAGGGTCTAAGGA cTGAAAACCCCTGA regge is , GGTAGGGA, TGGAAGGACTGGGCTAGCCACAGGAACTACA GacTG GAcaggrgAGGGGTcceCCAGTCCCCACTTGGGG scAGAGGTGTTTCTGTAAGGGGACAAA TGGG ACTTC

FIG. 30

FOXC1 plus strand AGGGCAAAGCCTTTCTTCCCACACCCACAGCCAAGG TCTGCAGGGGCACA, CCTTCTGCTCCAGCC CCAGGAAGGscTTTccCTGCAGTCCTC. As GC GCTCC discAccAccoTGGCTC:GCAGAC

TCTGGGGCCTGGGGACT ... CCCACCCTG, G accCCCCACATGAGC: AGGTTGGGAGGCTG. GGGCCT CTGTCCTCCCAGGCTGGAGTGGGdacrcTGAGTCCTGGGGA

FIG. 31

PACAP minus stran C GGCCTGCAGGGGGCCCAG ACT. TCCAGGGAAC GTG. CAGGAGC AGC ascetic eccCCTTGGGGGACTCTGGGGC GGGGCAGCT, ATCTGA CTTGGGC ACTGTC.GGACTTGGCCCAGGGCCAGCTCTAGGTAGGTGGCCCAGagGAGCCACCATGGGACCTGGGACTGGT GGGCCTGG GCCCTCCTCCAGAGCCACCTCCACACACT CTG CTGGGGGCA GGGAGA CTGGCCCGG

FIG. 32

DDN, minus strand TAGGCCCAGCAGCTTGGAGCCCAGGGCAGAGC AGGGG GAAGTGG. GTGGTCCAGAG. c.gc GGGCTGGG GAccCCAGGGCCCTAccGGCCCTG. AGAAGC TGTCC's ATC . GCCCCTGACTGCCCGGCCTCC . TGG GAATGAGAGGACCCTGCCcities AGGTTGGAAACAGTT. s:

FIG. 33

LHX1 plus stran CTCTGGACTCCATCTCTCACTTCTCTCTGGATTCTGGGCTCTCCTGGCT, GCCTGGGTGCCCAAAGTGGCAGTG

TGGGCCTCTGTGGGATGGAGAGG is degGGCCTGACCTGAATGA, CATGTTGAGGC gTCTCCTGC CAGCTGCT, GATGTGGGGTGGGCTGGGTGTAGCAG AAGGGCCTTCAGT

FIG. 34

SOX13 plus strand AATGG is ATC. GCCCACTGCAACCTCCCACTCC GGTTCAAGs. ATTCTCCTGCCTCAGCCTCCAGTA

CAGGATTGTCrcCATCrcCTGACCTTGATCCCCC GCCTCCCAAAGTGCTGGGATTACAGG, TGAG sigCAGAGATCAGGTTCTTAAGGGAAGTCcCACCACCCCC GAGAAATGGGGTTTTTAAAAAACC, AGT

FIG. 35

DTX1 plus strand TGCTGCTCCTGGCTCTCAGCCAGCCCTTTGCACTCCCAACTG TCCCAGACCCTG CCGCTTCAGTGGG

...E.C.CAG TCCCTGCCCCCITGTT CCCTCTCCAgas ccTGGACTG, GTGTCCTGGAAGGCCTGGGTGGC GCTCGAGCTGGGCTGGGGGARAGGGGGAAGGGGGGGGGGCC EC GTGGACCCTGG GGGCA CCCCACCCCACTTGCT

FIG. 36

HOXA10 minus strand

T. GGG codic cTCTTTCTCCTCTGCTGC ccTCTCC GCAGC did deciscTGTC AACTTGAAGTTGC is cc TTGCAGC accide AGGITGsc.gc.cGCCCC GCTCTTGrccc.ggg.TAGTGAGGAGG TTC:GGGTGc AGGCTGrcra AAAAC

AAAGCCTGTGGCAGGAcer TcCTGCCCA GGGGTGGGG TGGTGGTAGATTGGCAGCTCTTGGCCAGCAT

FIG. 37

SLC9A3R1, plus strand TCATTTTTACCACTCACTTAACTCTTTGTTGTTGGGGTGAGGGGTCCAGCCTGGATTTGGGTA TGAAAACCCAGGCAAGAAAGACCTGCCCAAGCCTTTAAAGGAATGCAAAGTCATCCTCTAGCCACCCCCA GAGAT. AAAGGCTGGGGATTGAGTCTCCTGCAGATGGTGGGCCTCCTGGGGCTGGCAAGTTGGGACA GAGGCCCATAAGCCCTCCTGGG ccTTCCCACCCCTCT, GCCCTCTCCACTC

FIG. 38

CDC42Ep5, minus strand TTTATGTATACAGAAGTGGGGTGcGGGGAAGGG GGGAATGAGGGAACCTAGAGGC ATGA

CTTGGFGCAGCCCTCTTCAGCTAGGTC TTGGGGGCAgc. TcCATGACCCAGCAC.." T T : GG

FIG. 39

GENE METHYLATION AND EXPRESSION

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a national phase filing under 35 U.S.C. § 371 of international application number PCT/US2006/020843, filed May 30, 2006, which claims priority to U.S. Provisional Application No. 60/685,104, filed May 27, 2005. The entire content of the prior applications is incorporated herein by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

This invention was made with government support under grant numbers P50CA89393 and CA94074 awarded by The National Institutes of Health and DAMD 17-02-1-0692 and W81XWH-04-1-0452 awarded by the Department of The Army. The government has certain rights in the invention.

TECHNICAL FIELD

This invention relates to epigenetic gene regulation, and more particularly to DNA methylation and its effect on gene expression, and its use as a marker of a particular cell type and/or disease state.

BACKGROUND

Epigenetic changes (e.g., changes in the levels of DNA methylation), as well as genetic changes, can be detected in cancer cells and stromal cells within tumors. In order to develop more discriminatory diagnostic methods and more effective therapeutic methods it is important that these epigenetic effects be defined and characterized.

SUMMARY

The inventors have developed a method of assessing the level of methylation in an entire, or part of a, genome. They call this method Methylation Specific Digital Karyotyping (MSDK). The MSDK method can be adapted to establish a test genomic methylation profile for a test cell of interest. By comparing the test profile to control profiles obtained with defined cell types, the test cell can be identified. The MSDK method can also be used to identify genes in a test cell (e.g., a cancer cell) the methylation of which is altered (increased or decreased) relative to a corresponding control cell (e.g., a normal cell of the same tissue as the cancer cell). This information provides the basis for methods for discriminating whether a test cell of interest (a) is the same as a control cell (e.g., a normal cell) or (b) is different from a control cell but is, for example, a pathologic cell such as a cancer cell. Such methods include, for example, assessing the level of DNA methylation or the level of expression of genes of interest, or the level of DNA methylation in a particular chromosomal area in test cells and comparing the results to those obtained with control cells.

More specifically, the invention features a method of making a methylation specific digital karyotyping (MSDK) library. The method includes:

providing all or part of the genomic DNA of a test cell;

exposing the DNA to a methylation-sensitive mapping restriction enzyme (MMRE) to generate a plurality of first fragments;

conjugating to one terminus or to both termini of each of the first fragments a binding moiety, the binding moiety comprising a first member of an affinity pair, the conjugating resulting in a plurality of second fragments;

exposing the plurality of second fragments to a fragmenting restriction enzyme (FRE) to generate a plurality of third fragments, each third fragment containing at one terminus the first member of the affinity pair and at the other terminus the 5' cut sequence of the FRE or the 3' cut sequence of the FRE;

contacting the plurality of third fragments with an insoluble substrate having bound thereto a plurality of second members of the affinity pair, the contacting resulting in a plurality of bound third fragments, each bound third fragment being a third fragment bound via the first and second members of the affinity pair to the insoluble substrate;

conjugating to free termini of the bound third fragments a releasing moiety, the releasing moiety comprising a releasing restriction enzyme (RRE) recognition sequence and, 3' of the recognition sequence of the RRE, either the 5' cut sequence of the FRE or the 3' cut sequence of the FRE, the conjugating resulting in a plurality of bound fourth fragments, each bound fourth fragment (i) containing at one terminus the recognition sequence of the RRE and (ii) being bound via the first member of the affinity pair at the other terminus and the second member of the affinity pair to the insoluble substrate; and

exposing the bound fourth fragments to the RRE, the exposing resulting in the release from the insoluble substrate of a MSDK library, the library comprising a plurality of fifth fragments, each fifth fragment comprising the releasing moiety and a MSDK tag, the tag consisting of a plurality of base pairs of the genomic DNA. Thus, the method results in the production of a plurality of MSDK tags.

In the method, the MMRE can be, e.g., AscI, the FRE can be, e.g., NlaIII, and the RRE can be, e.g., MmeI. The binding moiety can further include a 5' or 3' cut sequence of the MMRE. The binding moiety can also further include, between the 5' or 3' recognition sequence of the MMRE and the first member of an affinity pair, a linker nucleic acid sequence comprising a plurality of base pairs. The releasing moiety can further include, 5' of the RRE recognition sequence, an extender nucleic acid sequence comprising a plurality of base pairs. The test cell can be a vertebrate cell and the vertebrate test cell can be a mammalian test cell, e.g., a human test cell. Moreover, the test cell can be a normal cell or, for example, a cancer cell, e.g., a breast cancer cell. The first member of the affinity pair can be biotin, iminobiotin, avidin or a functional fragment of avidin, an antigen, a haptenic determinant, a single-stranded nucleotide sequence, a hormone, a ligand for an adhesion receptor, a receptor for an adhesion ligand, a ligand for a lectin, a lectin, a molecule containing all or part of an immunoglobulin Fc region, bacterial protein A, or bacterial protein G. The insoluble substrate can include, or be, magnetic beads.

Also provided by the invention is a method of analyzing a MSDK library. The method includes: providing a MSDK library made by the above-described method; and identifying the nucleotide sequences of one tag, a plurality of tags, or all of the tags. Identifying the nucleotide sequences of a plurality of tags can involve: making a plurality of ditags, each ditag containing two fifth fragments ligated together; forming a concatamer containing a plurality of ditags or ditag fragments, wherein each ditag fragment contains two MSDK tags; determining the nucleotide sequence of the concatamer; and deducing, from the nucleotide sequence of the concatamer, the nucleotide sequences of one or more of the MSDK tags that the concatamer contains. The ditag fragments can be made by exposing the ditags to the FRE. The method can further include, after making a plurality of ditags and prior to forming the concatamers, increasing the number (abundance) of individual ditags by PCR. The method can further include determining the relative frequency of some or all of the tags.

Another aspect of the invention is an additional method of analyzing a MSDK library. The method includes: providing a MSDK library made by the above-described method; and identifying a chromosomal site corresponding to the sequence of a tag selected from the library. The method can further involve determining a chromosomal location, in the genome of the test cell, of an unmethylated full recognition sequence of the MMRE closest to the identified chromosomal site. These two steps can be repeated with a plurality of tags obtained from the library in order to determine the chromosomal location of a plurality of unmethylated recognition sequences of the MMRE. The identification of the chromosomal site and the determination of the chromosomal location can be performed by a process that includes comparing the nucleotide sequence of the selected tag to a virtual tag library generated using the nucleotide sequence of the genome or the part of a genome, the nucleotide sequence of the full recognition sequence of the MMRE, the nucleotide sequence of the full recognition sequence of the FRE, and the number of nucleotides separating the full recognition sequence of the RRE from the RRE cutting site.

In another aspect, the invention provides a method of classifying a biological cell. The method includes: (a) identifying the nucleotide sequences of one tag, a plurality of tags, or all of the tags in an MSDK library made as described above and determining the relative frequency of some or all of the tags, thereby obtaining a test MSDK profile for the test cell; (b) comparing the test MSDK profile to separate control MSDK expression profiles for one or more control cell types; (c) selecting a control MSDK profile that most closely resembles the test MSDK profile; and (d) assigning to the test cell a cell type that matches the cell type of the control MSDK profile selected in step (c). The test and control cells can be vertebrate cells, e.g., mammalian cells such as human cells. The control cell types can include a control normal cell and a control cancer cell of the same tissue as the normal cell. The control normal cell and the control cancer cell can be breast cells or of a tissue selected from colon, lung, prostate, and pancreas. The test cell can be a breast cell or of a tissue selected from colon, lung, prostate, and pancreas. The control cell types can include cells of different categories of a cancer of a single tissue and the different categories of a cancer of a single tissue can include, for example, a breast ductal carcinoma in situ (DCIS) cell and an invasive breast cancer cell. The different categories of a cancer of a single tissue can alternatively include, for example, two or more of a high grade DCIS cell, an intermediate grade DCIS cell, and a low grade DCIS cell. The control cell types can include two or more of a lung cancer cell; a breast cancer cell; a colon cancer cell; a prostate cancer cell; and a pancreatic cancer cell. In addition, the control cell types can include an epithelial cell obtained from non-cancerous tissue and a myoepithelial cell obtained from non-cancerous tissue. Furthermore, the control cells can also include stem cells and differentiated cells derived therefrom (e.g., epithelial cells or myoepithelial cells) of the same tissue type. The control stem and differentiated cells therefrom can be of breast tissue, or of a tissue selected from colon, lung, prostate, and pancreas. The control stem and differentiated cells derived therefrom can be normal or cancer cells (e.g., breast cancer cells) or obtained from a cancerous tissue (e.g., breast cancer).

Another embodiment of the invention is a method of diagnosis. The method includes: (a) providing a test breast epithelial cell; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA (e.g., the gene) is selected from the AscI sites identified by the MSDK tags listed in Table 5, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control epithelial cell obtained from non-cancerous breast tissue, wherein an altered degree of methylation of the one or more C residues in the test epithelial cell compared to the control epithelial cell is an indication that the test epithelial cell is a cancer cell. The altered degree of methylation can be a lower degree of methylation or a higher degree of methylation. The altered degree of methylation can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or a region outside of the gene (e.g., in an intergenic region). The gene can be, for example, PRDM14 or ZCCHC14.

The invention provides another method of diagnosis. The method includes: (a) providing a test colon epithelial cell; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA sequence (e.g., the gene) is selected from those identified by the MSDK tags listed in Table 2, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control epithelial cell obtained from non-cancerous colon tissue, wherein an altered degree of methylation of the one or more C residues in the test epithelial cell compared to the control epithelial cell is an indication that the test epithelial cell is a cancer cell. The altered degree of methylation can be a lower degree of methylation or a higher degree of methylation. In addition, the altered degree of methylation can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or a region outside of the gene (e.g., an intergenic region). The gene can be, for example, LHX3, TCF7L1, or LMX-1A.

Another method of diagnosis featured by the invention involves: (a) providing a test myoepithelial cell obtained from a test breast tissue; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA sequence (e.g., the gene) is selected from those identified by the MSDK tags listed in Table 10, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control myoepithelial cell obtained from non-cancerous breast tissue, wherein an altered degree of methylation of the one or more C residues in the test myoepithelial cell compared to the control myoepithelial cell is an indication that the test breast tissue is cancerous tissue. The altered degree of methylation can be a lower degree of methylation or a higher degree of methylation. In addition, the altered degree of methylation can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or a region outside of the gene (e.g., an intergenic region). The gene can be, for example, HOXD4, SLC9A3R1, or CDC42EP5.

Yet another method of diagnosis embodied by the invention involves: (a) providing a test fibroblast obtained from a test breast tissue; (b) determining the degree of methylation of one or more C residues in a DNA sequence (e.g., in a gene) in the test cell, wherein the DNA sequence (e.g., the gene) is selected from those identified by the MSDK tags listed in Tables 7 and 8, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in a corresponding gene in a control fibroblast obtained from non-cancerous breast tissue, wherein an altered degree of methylation of the one or more C residues in the test fibroblast compared to the control fibroblast is an indication that the test breast tissue is cancerous tissue. The altered degree of methylation can be a lower degree of methylation or a higher degree of methylation. In addition, the altered degree of methylation can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or a region outside of the gene (e.g., an intergenic region). The gene can be, for example, Cxorf12.

In another aspect, the invention includes a method of [...]

[...] resembles the degree of methylation in the control stem cell; (ii) more likely to be a differentiated luminal epithelial cell if the degree of methylation in the test sample more closely resembles the degree of methylation in the control epithelial cell; or (iii) more likely to be a myoepithelial cell if the degree of methylation in the test sample more closely resembles the degree of methylation in the control myoepithelial cell. The C residues can be in the promoter region of the gene, an exon of the gene, an intron of the gene, or in a region outside of the gene (e.g., an intergenic region). The gene can be, for example, SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10.

The invention also features a method of diagnosis that involves: (a) providing a test cell from a test tissue; (b) determining the degree of methylation of one or more C residues in a PRDM14 gene in the test cell, wherein the one or more C residues are C residues in CpG sequences; and (c) comparing the degree of methylation of the one or more residues to the degree of methylation of corresponding one or more C residues in the PRDM14 gene in a control cell obtained from non-cancerous tissue of the same tissue as the test cell, wherein an altered degree of methylation of the one or more C residues in the test cell compared to the control cell is an indication that the test cell is a cancer cell.
The determining the likelihood of a cell being an epithelial cell altered degree of methylation can be a lower degree of or a myoepithelial cell. The method involves: methylation or a higher degree of methylation. In addition, (a) providing a test cell; (b) determining the degree of the altered degree of methylation can be in the promoter methylation of one or more C residues in a DNA sequence region of the gene, an exon of the gene, an intron of the gene, (e.g., in a gene) in the test cell, wherein the DNA sequence 30 or a region outside of the gene (e.g., an intergenic region). (e.g., the gene) is selected from those identified by the The test and control cells can be breast cells or of a tissue MSDK tags listed in Table 12, wherein the one or more C selected from colon, lung, prostate, and pancreas. residues are C residues in CpG sequences; and (c) compar Another embodiment of the invention is a method of ing the degree of methylation of the one or more residues to diagnosis that includes: (a) providing a test sample of breast the degree of methylation of corresponding one or more C 35 tissue comprising a test epithelial cell; (b) determining the residues in a corresponding gene in a control myoepithelial level of expression in the test epithelial cell of a gene cell and to the degree of methylation of corresponding one selected from those listed in Table 5, wherein the gene is one or more C residues in a corresponding gene in a control that is expressed in a breast cancer epithelial cell at a epithelial cell, wherein the test cell is: (i) more likely to be Substantially altered level compared to a compared to a a myoepithelial cell if the degree of methylation in the test 40 normal breast epithelial cell; and (c) classifying the test cell sample more closely resembles the degree of methylation in as: (i) a normal breast epithelial cell if the level of expression the control myoepithelial cell; or (ii) more likely to be an of the gene in the test cell is not 
substantially altered epithelial cell if the degree of methylation in the test sample compared to a control level of expression for a normal breast more closely resembles the degree of methylation in the epithelial cell; or (ii) a breast cancer epithelial cell if the control epithelial cell. The C residues can be in the promoter 45 level of expression of the gene in the test cell is substantially region of the gene, an exon of the gene, an intron of the gene, altered compared to a control level of expression for a or in a region outside of the gene (e.g., an intergenic region). normal breast epithelial cell. The gene is can be, for The gene can be, for example, LOC38.9333 or CDC42EP5. example, PRDM14 or ZCCHC14. The alteration in the level In another aspect, the invention includes a method of of expression can be an increase in the level of expression determining the likelihood of a cell being a stem cell, an 50 or a decrease in the level of expression. differentiated luminal epithelial cell or a myoepithelial cell. 
Another aspect of the invention is a method of diagnosis The method involves: (a) providing a test cell; (b) deter that includes: mining the degree of methylation of one or more C residues (a) providing a test sample of colon tissue comprising a test in a DNA sequence (e.g., in a gene) in the test cell, wherein epithelial cell; the DNA sequence (e.g., the gene) is selected from those 55 (b) determining the level of expression in the test epithelial identified by the MSDK tags listed in Table 15 or 16, cell of a gene selected from those listed in Table 2, wherein wherein the one or more C residues are C residues in CpG the gene is one that is expressed in a colon cancer epithelial sequences; and (c) comparing the degree of methylation of cell at a Substantially altered level compared to a compared the one or more residues to the degree of methylation of to a normal colon epithelial cell; and (c) classifying the test corresponding one or more C residues in a corresponding 60 cell as: (i) a normal colon epithelial cell if the level of gene in a control stem cell, to the degree of methylation of expression of the gene in the test cell is not substantially corresponding one or more C residues in a corresponding altered compared to a control level of expression for a gene in a control differentiated luminal epithelial cell, and to normal colon epithelial cell; or (ii) a colon cancer epithelial the degree of methylation of corresponding one or more C cell if the level of expression of the gene in the test cell is residues in a corresponding gene in a control myoepithelial 65 substantially altered compared to a control level of expres cell, wherein the test cell is: (i) more likely to be a stem cell sion for a normal colon epithelial cell. The gene can be, for if the degree of methylation in the test sample more closely example, LHX3, TCF7L1, or LMX-1A. 
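The expression-based and methylation-based classifications described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of the invention: the function names, the "normal" and "cancer" labels, and the use of a mean absolute difference to decide which control a test cell "more closely resembles" are hypothetical choices made for illustration. The only value taken from this document is the 2-fold minimum that it gives for a "substantially altered" level of expression.

```python
from typing import Mapping

# Minimum fold difference this document gives for a "substantially altered"
# level of expression (an increase or a decrease both count).
SUBSTANTIAL_FOLD = 2.0


def fold_change(test_level: float, control_level: float) -> float:
    """Return the fold difference (always >= 1) between two positive levels."""
    if test_level <= 0 or control_level <= 0:
        raise ValueError("expression levels must be positive")
    return max(test_level, control_level) / min(test_level, control_level)


def classify_by_expression(test_level: float, normal_control_level: float) -> str:
    """Label a test cell from one gene's expression level.

    Returns "cancer" when the level differs from the normal control by at
    least SUBSTANTIAL_FOLD in either direction, otherwise "normal".
    The label strings are hypothetical.
    """
    if fold_change(test_level, normal_control_level) >= SUBSTANTIAL_FOLD:
        return "cancer"
    return "normal"


def closest_control(
    test: Mapping[str, float],
    controls: Mapping[str, Mapping[str, float]],
) -> str:
    """Return the label of the control profile that the test profile most
    closely resembles.

    Each profile maps a CpG site (or tag) identifier to a methylation
    fraction between 0 and 1. Resemblance is measured here as the mean
    absolute difference over shared identifiers; that metric is an
    assumption, since the text does not specify one.
    """
    best_label = None
    best_distance = float("inf")
    for label, profile in controls.items():
        shared = test.keys() & profile.keys()
        if not shared:
            continue  # nothing to compare against this control
        distance = sum(abs(test[s] - profile[s]) for s in shared) / len(shared)
        if distance < best_distance:
            best_label, best_distance = label, distance
    if best_label is None:
        raise ValueError("no control profile shares a site with the test profile")
    return best_label


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    print(classify_by_expression(10.0, 4.0))
    controls = {
        "stem": {"cpg1": 0.1, "cpg2": 0.2},
        "myoepithelial": {"cpg1": 0.85, "cpg2": 0.9},
    }
    print(closest_control({"cpg1": 0.9, "cpg2": 0.8}, controls))
```

Taking the larger level over the smaller one lets a single threshold cover both increases and decreases, which matches the statement that the alteration can go in either direction.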
The alteration in the level of expression can be an increase in the level of expression or a decrease in the level of expression.

Another method of diagnosis included in the invention involves: (a) providing a test sample of breast tissue comprising a test stromal cell; (b) determining the level of expression in the stromal cell of a gene selected from those listed in Tables 7, 8, and 10, wherein the gene is one that is expressed in a cell of the same type as the test stromal cell at a substantially altered level when present in breast cancer tissue compared to when present in normal breast tissue; and (c) classifying the test sample as: (i) normal breast tissue if the level of expression of the gene in the test stromal cell is not substantially altered compared to a control level of expression for a control cell of the same type as the test stromal cell in normal breast tissue; or (ii) breast cancer tissue if the level of expression of the gene in the test stromal cell is substantially altered compared to a control level of expression for a control cell of the same type as the test stromal cell in normal breast tissue. The test and control stromal cells can be myoepithelial cells and the genes can be those listed in Table 10, e.g., HOXD4, SLC9A3R1, or CDC42EP5. Alternatively, the test and control stromal cells can be fibroblasts and the genes can be those listed in Tables 7 and 8, e.g., Cxorf12. The alteration in the level of expression can be an increase in the level of expression or a decrease in the level of expression.

In another aspect, the invention includes a method of determining the likelihood of a cell being an epithelial cell or a myoepithelial cell. The method includes: (a) providing a test cell; (b) determining the level of expression in the test sample of a gene selected from the group consisting of those identified by the MSDK tags listed in Table 12; (c) determining whether the level of expression of the selected gene in the test sample more closely resembles the level of expression of the selected gene in (i) a control myoepithelial cell or (ii) a control epithelial cell; and (d) classifying the test cell as: (i) likely to be a myoepithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control myoepithelial cell; or (ii) likely to be an epithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control epithelial cell. The gene can be, for example, LOC389333 or CDC42EP5.

In another aspect, the invention includes a method of determining the likelihood of a cell being a stem cell, a differentiated luminal epithelial cell, or a myoepithelial cell. The method includes: (a) providing a test cell; (b) determining the level of expression in the test sample of a gene selected from the group consisting of those identified by the MSDK tags listed in Table 15 or 16; (c) determining whether the level of expression of the selected gene in the test sample more closely resembles the level of expression of the selected gene in (i) a control stem cell, (ii) a control differentiated luminal epithelial cell, or (iii) a control myoepithelial cell; and (d) classifying the test cell as: (i) likely to be a stem cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control stem cell; (ii) likely to be a differentiated luminal epithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control differentiated luminal epithelial cell; or (iii) likely to be a myoepithelial cell if the level of expression of the gene in the test cell more closely resembles the level of expression of the gene in a control myoepithelial cell. The gene can be, for example, SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10.

Also embodied by the invention is a method of diagnosis that includes: (a) providing a test cell; (b) determining the level of expression in the test cell of a PRDM14 gene; and (c) classifying the test cell as: (i) a normal cell if the level of expression of the gene in the test cell is not substantially altered compared to a control level of expression for a control normal cell of the same tissue as the test cell; or (ii) a cancer cell if the level of expression of the gene in the test cell is substantially altered compared to a control level of expression for a control normal cell of the same tissue as the test cell. The alteration in the level of expression can be an increase in the level of expression or a decrease in the level of expression. The test and control cells can be breast cells or of a tissue selected from colon, lung, prostate, and pancreas.

The invention also provides a single stranded nucleic acid probe that includes: (a) the nucleotide sequence of a tag selected from those listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16; (b) the complement of the nucleotide sequence; or (c) the AscI sites defined by the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.

In another aspect, there is provided an array containing a substrate having at least 10, 25, 50, 100, 200, 500, or 1,000 addresses, wherein each address has disposed thereon a capture probe that includes: (a) a nucleic acid sequence consisting of a tag nucleotide sequence selected from those listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16; (b) the complement of the nucleic acid sequence; or (c) the AscI sites defined by the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.

The invention also features a kit comprising at least 10, 25, 50, 100, 200, 500, or 1,000 probes, each probe containing: (a) a nucleic acid sequence comprising a tag nucleotide sequence selected from those listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16; (b) the complement of the nucleic acid sequence; or (c) the AscI sites defined by the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.

Another aspect of the invention is a kit containing at least 10, 25, 50, 100, 200, 500, or 1,000 antibodies each of which is specific for a different protein encoded by a gene identified by a tag selected from the group consisting of the tags listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16.

As used herein, an “affinity pair” is any pair of molecules that have an intrinsic ability to bind to each other. Thus, affinity pairs include, without limitation, any receptor/ligand pair, e.g., vitamins (e.g., biotin)/vitamin-binding proteins (e.g., avidin or streptavidin); cytokines (e.g., interleukin-2)/cytokine receptors (e.g., interleukin-2 receptor); hormones (e.g., steroid hormones)/hormone receptors (e.g., steroid hormone receptors); signal transduction ligands/signal transduction receptors; adhesion ligands/adhesion receptors; death domain molecule-binding ligands/death domain molecules; lectins (e.g., pokeweed mitogen, pea lectin, concanavalin A, lentil lectin, phytohemagglutinin (PHA) from Phaseolus vulgaris, peanut agglutinin, soybean agglutinin, Ulex europaeus agglutinin-I, Dolichos biflorus agglutinin, Vicia villosa agglutinin and Sophora japonica agglutinin)/lectin receptors (e.g., carbohydrate lectin receptors); antigens or haptens (e.g., trinitrophenol or biotin)/antibodies (e.g., antibody specific for trinitrophenol or biotin); immunoglobulin Fc fragments/immunoglobulin Fc fragment binding proteins (e.g., bacterial protein A or protein G). Ligands can serve as first or second members of an affinity pair, as can receptors. Where a ligand is used as the first member of the affinity pair, the corresponding receptor is used as the second member of the affinity pair, and where a receptor is used as the first member of the affinity pair, the corresponding ligand is used as the second member of the affinity pair. Functional fragments of polypeptide first and second members of affinity pairs are fragments of the full-length, mature first or second members that are shorter than the full-length, mature first or second members but have at least 25% (e.g., at least: 30%; 40%; 50%; 60%; 70%; 80%; 90%; 95%; 98%; 99%; 99.5%; 100%; or even more) of the ability of the full-length, mature first or second members to bind to corresponding second or first members, respectively.

The nucleotide sequences of all the identified genes in Tables 2, 5, 7, 8, 10, 12, 15 and 16 are available on public genetic databases (e.g., GenBank). These sequences are incorporated herein by reference.

As used herein, a “substantially altered” level of expression of a gene in a first cell (or first tissue) compared to a second cell (or second tissue) is an at least 2-fold (e.g., at least: 2-; 3-; 4-; 5-; 6-; 7-; 8-; 9-; 10-; 15-; 20-; 30-; 40-; 50-; 75-; 100-; 200-; 500-; 1,000-; 2,000-; 5,000-; or 10,000-fold) altered level of expression of the gene. It is understood that the alteration can be an increase or a decrease.

As used herein, breast “stromal cells” are breast cells other than epithelial cells.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. In case of conflict, the present document, including definitions, will control. Preferred methods and materials are described below, although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All publications, patent applications, patents and other references mentioned herein are incorporated by reference in their entirety. The materials, methods, and examples disclosed herein are illustrative only and not intended to be limiting.

Other features and advantages of the invention, e.g., assessing the methylation of an entire genome, will be apparent from the following description, from the drawings and from the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is a diagrammatic representation of the generation of a restriction enzyme 5' cut sequence and 3' cut sequence by the restriction enzyme cutting DNA at the restriction enzyme’s recognition sequence. In the diagram are shown the two strands of a segment of double stranded DNA containing a restriction enzyme recognition sequence in which each of the nucleotides constituting the recognition sequence is shown as an N. The exemplary restriction enzyme recognition sequence in the diagram is a six base pair recognition sequence and cutting by the particular restriction enzyme results in a 3' two nucleotide overhang. The N-containing sequences constituting the restriction enzyme recognition sequence and the restriction enzyme’s 3' and 5' cut sequences are boxed and appropriately labeled. Those skilled in the art will appreciate that 5' and 3' termini generated by the multiple restriction enzymes available differ greatly (in nucleotide content, whether cohesive termini are generated, and, if they are, in the nature and number of nucleotides in the overhang). Nevertheless, in the sense that all termini (5' and 3' cut sequences) produced by the action of restriction enzymes that cut at their recognition sequences consist of nucleotides derived from the relevant restriction enzyme recognition sequence, 5' and 3' restriction enzyme cut sequences share qualitative features and differ only in how these nucleotides are distributed between the 5' and 3' cut sequences.

FIG. 2 is a schematic depiction of the MSDK procedure described in Examples 1 and 2.

FIGS. 3-5 are diagrammatic representations of the results of a methylation-detecting sequence analysis of segments of the LHX3 gene region (FIG. 3; SEQ ID NO:3), the LMX-1A gene region (FIG. 4; SEQ ID NO:5), and the TCF7L1 gene region (FIG. 5; SEQ ID NO:4) shown in FIGS. 6-8, respectively. The circles represent potential methylation sites (CpG) in the analyzed segment of SEQ ID NOS:3, 5, and 4. The order of circles (starting from the left of the rows of circles) is that of the CpG dinucleotides in the analyzed segments of SEQ ID NOS:3, 5 and 4 (starting from the 5' end of the analyzed segment nucleotide sequences). The analyses were performed on DNA from wild-type HCT116 human colon cancer cells (“WT”) and HCT116 cells having both alleles of their DNMT1 and DNMT3b methyltransferase genes “knocked out” (“DKO”). Each circle is a pie chart with the amount of shading indicating the frequency (0%-100%) at which the relevant potential methylation site was found to be methylated. The top lines under the circles are linear depictions of the relevant gene transcripts and include the exons (shaded boxes) and introns (lines between the shaded boxes) and the bottom lines under the circles are linear depictions of the chromosomes on which the genes are located. On the chromosome depictions are shown the locations of the MSDK tag sequences that indicated the locations of the relevant AscI recognition sequences, which locations are also shown. The numbering on the bottom lines indicates the base pair (bp) numbers on the chromosomes and the numbering on the top lines indicates the bp numbers, in the chromosomes, of the transcription start sites and termination sites. The transcription initiation sites and the directions of transcription are also shown.

FIG. 6A is a depiction of the nucleotide sequence (SEQ ID NO:3) of a region of the LHX3 gene containing the MSDK tag sequence (bold and underlined) that identified the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:3 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3' end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp -196 to bp +172 (relative to the LHX3 gene transcription initiation site) and thus the last 23 CpG in the sequenced segment are within the promoter region and the first 26 CpG are in exon 1.

FIG. 6B is a depiction of the nucleotide sequence (SEQ ID NO:1545) of a region of the LHX3 gene within SEQ ID NO:3 containing the relevant AscI site (bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 7A is a depiction of the nucleotide sequence (SEQ ID NO:5) of a region of the LMX-1A gene containing the MSDK tag sequence (bold and underlined) that identified the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:5 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3' end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp -842 to bp -609 (relative to the LMX-1A gene transcription initiation site) and thus the whole of the sequenced segment is within the promoter region.

FIG. 7B is a depiction of the nucleotide sequence (SEQ ID NO:1546) of a region of the LMX-1A gene within SEQ ID NO:5 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 8A is a depiction of the nucleotide sequence (SEQ ID NO:4) of a region of the TCF7L1 gene containing the MSDK tag sequence (bold and underlined) that identified the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:4 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3' end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +782 to bp +1003 (relative to the TCF7L1 gene transcription initiation site) and thus the first six CpG in the sequenced segment are within exon 1 and the last 19 CpG are in intron 3-4.

FIG. 8B is a depiction of the nucleotide sequence (SEQ ID NO:1547) of a region of the TCF7L1 gene within SEQ ID NO:4 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIGS. 9-15 are diagrammatic representations of the results of a methylation-detecting sequence analysis of the segments of, respectively, the PRDM14 gene region (FIG. 9; SEQ ID NO:1), the ZCCHC14 gene region (FIG. 10; SEQ ID NO:2), the HOXD4 gene region (FIG. 11; SEQ ID NO:6), the SLC9A3R1 gene region (FIG. 12; SEQ ID NO:7), the LOC389333 gene region (FIG. 13; SEQ ID NO:10), the CDC42EP5 gene region (FIG. 14; SEQ ID NO:8), and the Cxorf12 gene region (FIG. 15; SEQ ID NO:9) shown in FIGS. 16A-22A, respectively. The circles represent potential methylation sites (CpG) in the analyzed segments. The order of circles (starting from the left of the rows of circles) is that of the CpG dinucleotides in the analyzed segments (starting from the 5' end of the analyzed segment nucleotide sequences). The analyses were performed on DNA from the indicated cell obtained from the indicated samples (see Table 3). Samples used for the generation of MSDK libraries are marked with an asterisk. Each circle is a pie chart with the amount of shading indicating the frequency (0%-100%) at which the relevant potential methylation site was found to be methylated. The top (bold) lines under the circles are linear depictions of the relevant gene transcripts and include the exons (shaded boxes) and introns (lines between the shaded boxes) and the bottom lines under the circles are linear depictions of the chromosomes on which the genes are located. On the chromosome depictions are shown the locations of the MSDK tag sequences that indicated the location of the relevant AscI recognition sequences, which locations are also shown. The numbering on the bottom lines indicates the bp numbers for the chromosomes and the numbering on the top lines indicates the bp numbers, in the chromosomes, of the transcription start sites and termination sites. The transcription initiation sites and the directions of transcription are also shown.

FIG. 15 provides the above-listed information for the HCFC1 gene as well as the Cxorf12 gene. As can be seen from the figure, the two genes are located relatively close together on the X chromosome.

FIG. 16A is a depiction of the nucleotide sequence (SEQ ID NO:1) of a region of the PRDM14 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:1 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3' end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +666 to bp +839 (relative to the PRDM14 gene transcription initiation site) and thus the whole sequenced segment is within intron 1-2.

FIG. 16B is a depiction of the nucleotide sequence (SEQ ID NO:1548) of a region of the PRDM14 gene within SEQ ID NO:1 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 17A is a depiction of the nucleotide sequence (SEQ ID NO:2) of a region of the ZCCHC14 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:2 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3' end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +79 to bp +292 (relative to the ZCCHC14 gene transcription initiation site) and thus the last 14 CpG in the sequenced segment are within exon 1 and the first 7 CpG are in intron 1-2.

FIG. 17B is a depiction of the nucleotide sequence (SEQ ID NO:1549) of a region of the ZCCHC14 gene within SEQ ID NO:2 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 18A is a depiction of the nucleotide sequence (SEQ ID NO:6) of a region of the HOXD4 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:6 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3' end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +986 to bp +1,189 (relative to the HOXD4 gene transcription initiation site) and thus the whole sequenced segment is within intron 1-2.

FIG. 18B is a depiction of the nucleotide sequence (SEQ ID NO:1550) of a region of the HOXD4 gene within SEQ ID NO:6 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 19A is a depiction of the nucleotide sequence (SEQ ID NO:7) of a region of the SLC9A3R1 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:7 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3' end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +11,713 to bp +11,978 (relative to the SLC9A3R1 gene transcription initiation site) and thus the whole sequenced segment is within intron 1-2.

FIG. 19B is a depiction of the nucleotide sequence (SEQ ID NO:1551) of a region of the SLC9A3R1 gene within SEQ ID NO:7 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 20A is a depiction of the nucleotide sequence (SEQ ID NO:10) of a region of the LOC389333 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:10 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3' end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +518 to bp +762 (relative to the LOC389333 gene transcription initiation site) and thus the last 10 CpG in the sequenced segment are within exon 1 and the first 21 CpG are within intron 1-2.

FIG. 20B is a depiction of the nucleotide sequence (SEQ ID NO:1552) of a region of the LOC389333 gene within SEQ ID NO:10 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 21A is a depiction of the nucleotide sequence (SEQ ID NO:8) of a region of the CDC42EP5 gene containing the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:8 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at the nucleotide before the 3' end of the reverse PCR primer target sequence (shown in italics and underlined). The sequenced segment spans bp +7,991 to bp +8,193 (relative to the CDC42EP5 gene transcription initiation site) and thus the whole sequenced segment is within exon 3.

FIG. 21B is a depiction of the nucleotide sequence (SEQ ID NO:1553) of a region of the CDC42EP5 gene within SEQ ID NO:8 containing the relevant AscI recognition sequence (in bold and underlined) and multiple CpG dinucleotides (shaded).

FIG. 22A is a depiction of the nucleotide sequence (SEQ ID NO:9) of a region of the Cxorf12 gene containing the MSDK tag sequence (bold and underlined) that identified the relevant AscI recognition sequence (in capital letters and underlined) and multiple CpG dinucleotides (shaded). The segment of SEQ ID NO:9 subjected to methylation-detecting sequence analysis starts at the nucleotide after the 3' end of the forward PCR primer target sequence (shown in italics and underlined) used for the sequencing analysis and ends at

SLC9A3R1 (FIG. 23C), CDC42EP5 (FIG. 23D), LOC389333 (FIG. 23E), and Cxorf12 (FIG. 23F) genes in epithelial cells (left set of normal and tumor cell bars), myoepithelial cells (middle set of normal and tumor cell bars), and fibroblast-enriched stromal cells (right set of normal and tumor cells) isolated from the indicated normal breast tissue and breast carcinoma samples. The average Ct value for each gene was normalized against the ACTB value (see Example 1). The data (“Relative methylation (%)”) are percentages relative to the ACTB value. Samples used for generation of MSDK libraries are indicated by asterisks. The PRDM14 gene is almost exclusively methylated in tumor epithelial cells and the LOC389333 gene is preferentially methylated in epithelial cells (both tumor and normal) compared to other cell types. The HOXD4, SLC9A3R1, and CDC42EP5 genes, besides being differentially methylated between normal and DCIS myoepithelial cells, are also methylated in other cell types. The HOXD4 gene is differentially methylated between normal and tumor epithelial cells and frequently methylated in stromal fibroblasts, while the SLC9A3R1 and CDC42EP5 genes are frequently methylated in stromal fibroblasts and occasionally in epithelial cells. The Cxorf12 gene is hypermethylated in tumor fibroblast-enriched stromal cells compared to normal cells of the same type and is also methylated in a fraction of epithelial cells.

FIG. 24 is a bar graph showing the results of qMSP analyses of the PRDM14 gene in a panel of normal breast tissues, benign breast tumors (fibroadenomas, papillomas, and fibrocystic disease), and breast carcinomas. The data were computed as described for FIG. 23. 500% was set as the upper limit of relative methylation although a few samples showed a difference above this threshold.

FIGS. 25A-D are a series of bar graphs showing the results of expression analyses of the PRDM14 (FIG. 25A), Cxorf12 (FIG. 25B), CDC42EP5 (FIG. 25C), and HOXD4 (FIG. 25D) genes in normal breast and breast carcinoma (tumor) epithelial cells, fibroblast-enriched stromal cells (stroma), and myoepithelial cells and in invasive breast carcinoma cell myofibroblasts. The average Ct value for each gene was normalized against the RPL39 value (see Example 1). The data (“Relative expression (%)”) are percentages relative to the RPL39 value. Using RPL19 and RPS13 values for normalization gave essentially the same results. The PRDM14 gene was relatively overexpressed in invasive breast carcinoma epithelial cells. The Cxorf12 gene was expressed at a relatively higher level in normal than in tumor fibroblast-enriched stromal cells. The CDC42EP5 and HOXD4 genes showed higher expression in DCIS myoepithelial cells and invasive breast carcinoma myofibroblasts compared to normal myoepithelial cells and also, in the case of the CDC42EP5 gene, to normal epithelial cells.

FIG.
26A is a schematic representation of the procedure the nucleotide before the 3' end of the reverse PCR primer used for tissue fractionation and purification of the various target sequence (shown in italics and underlined). The 55 cell types from normal breast tissue. Cells were captured by sequenced segment spans bp -838 to bp -639 (relative to the antibody-coupled magnetic beads as indicated by the figure. Cxorf12 gene transcription initiation site) and thus the whole FIG. 26B is a series of photographs of ethidium bromide sequenced segment is within the promoter region. stained electrophoretic gels of semi-quantitative RT-PCR FIG. 22B is a depiction of the nucleotide sequence (SEQ analyses of selected genes from the purified cell fractions ID NO:1555) of a region of the Cxorf12 gene within SEQID 60 isolated from normal breast tissue. PPIA was used as a NO:9 containing the MSDK tag sequence (bold and under loading control. The triangles indicate an increasing number lined) that identified the relevant AscI recognition sequence of PCR cycles (25, 30, and 35). (in capital letters and underlined) and multiple CpG dinucle FIG. 26C is a series of graphs showing the ratio and otides (shaded). location of statistically significant (p<0.05) tags, generated FIGS. 23 A-F are a series of bar graphs showing the results 65 by MSDK, that are differentially methylated in different cell of quantitative methylation specific PCR (qMSP) analyses types isolated from normal mammary tissue. Dots corre of the PRDM14 (FIG. 23A), HOXD4 (FIG. 23B), sponding to genes selected for further validation are circled. US 9,556,430 B2 15 16 The X-axis represents the ratio of normalized tags from the relevant AscI recognition sequence (in bold and underlined) indicated libraries in the various comparisons. CD44/All and multiple CpG dinucleotides (shaded). 
The sequenced indicates the comparison of mammary stem cells (CD44+) segment spans bp -285 to bp -614 (relative to the FNDC1 against all differentiated cells (CD10+, CD24+, and gene transcription initiation site) and thus the whole MUC1+). 5 sequenced segment is within the promoter region. FIG. 27A is a series of diagrammatic representations of FIG.31 is a depiction of the nucleotide sequence (SEQID the results of a methylation-detecting sequence analysis of NO:12) of a region of the FOXC1 gene containing the segments of the SLC9A3R1 gene region, the FNDC1 gene relevant AscI recognition sequence (in bold and underlined) region, the FOXC1 gene region, the PACAP gene region, the and multiple CpG dinucleotides (shaded). The sequenced DDN gene region, the CDC42EP5 gene region, the LHX1 10 gene region, the SOX13 gene region, and the DTX gene segment spans bp 5250 to bp 4976 (relative to the FOXC1 region. The circles represent potential methylation sites gene transcription initiation site) and thus the whole (CpG) in the analyzed segment of SEQ ID NOS:7, 8, and sequenced segment is within the promoter region. 11-18. The order of the circles (starting from the left of the FIG.32 is a depiction of the nucleotide sequence (SEQID rows of circles) is that of the CpG dinucleotides in the 15 NO:13) of a region of the PACAP gene containing the analyzed segments of SEQID NOS:7, 8, and 11-18 (starting relevant AscI recognition sequence (in bold and underlined) from the 5' end of the analyzed segment nucleotide and multiple CpG dinucleotides (shaded). The sequenced sequences). The analyses were performed on DNA isolated segment spans bp 4404 to bp 4736 (relative to the PACAP from CD44+, CD24+, MUC1+, and CD10+ cell popula gene transcription initiation site) and thus the whole tions. Each circle is a pie chart with the amount of shading sequenced segment is within the promoter region. 
indicating the frequency (0-100%) at which the relevant FIG.33 is a depiction of the nucleotide sequence (SEQID potential methylation site was found to be methylated. The NO:14) of a region of the DDN gene containing the relevant top lines under the circles are linear depictions of the AscI recognition sequence (in bold and underlined) and relevant gene transcripts and include the exons (shaded multiple CpG dinucleotides (shaded). The sequenced seg boxes) and introns (lines between the shaded boxes) and the 25 ment spans bp 2108 to bp 2290 (relative to the PACAP gene bottom line under the circles are linear depictions of the transcription initiation site) and thus the whole sequenced chromosome on which the genes are located. On the chro segment is within exon 2. mosome depictions are shown the locations of the MSDK FIG.34 is a depiction of the nucleotide sequence (SEQID tag sequences that indicated the locations of the relevant NO:15) of a region of the LHX1 gene containing the AscI recognition sequences, which locations are also shown. 30 relevant AscI recognition sequence (in bold and underlined) The numbering on the bottom lines indicates the base pair and multiple CpG dinucleotides (shaded). The sequenced (bp) numbers on the chromosomes and the numbering on the segment spans bp 3600 to bp 3810 (relative to the LHX1 top lines indicate the bp numbers, in the chromosomes, of gene transcription initiation site) and thus the whole the transcription start sites and termination sites. The tran sequenced segment is within introns 3-4. Scription initiation sites and the directions of transcription 35 FIG.35 is a depiction of the nucleotide sequence (SEQID are also shown. NO:16) of a region of the SOX13 gene containing the FIG. 27B is a series of bar graphs showing the results of relevant AscI recognition sequence (in bold and underlined) quantitative methylation specific PCR (qMSP) analyses of and multiple CpG dinucleotides (shaded). 
The sequenced the SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, segment spans bp 669 to bp 374 (relative to the SOX13 gene CDC42EP5, LHX1, and HOXA10 genes in CD44+, CD10+, 40 transcription initiation site) and thus the whole sequenced MUC1+, and CD24+ cells populations from women of segment is within the promoter area. different ages (18-58 years old) and reproductive history. FIG. 36 is a depiction of the nucleotide sequence (SEQID The average Ct value for each gene was normalized against NO:17) of a region of the DTX gene containing the relevant the ACTB value. The data (“Relative expression (%)") are AscI recognition sequence (in bold and underlined) and percentages relative to the RPL39 value. 45 multiple CpG dinucleotides (shaded). The sequenced seg FIG. 28 is a series of bar graphs showing the results of ment spans bp 228 to bp 551 (relative to the DTX gene expression analyses of the SLC9A3R1, FNDC1, FOXC1, transcription initiation site) and thus the whole sequenced PACAP, DDN, CDC42EP5, LHX1, and HOXA10 genes in segment is within the promoter area. CD44+, CD10--, MUC1+, and CD24+ cells isolated from FIG.37 is a depiction of the nucleotide sequence (SEQID normal breast tissue. The average Ct value for each gene was 50 NO:18) of a region of the HOXA10 gene containing the normalized against the RPL39 value. The data (“Relative relevant AscI recognition sequence (in bold and underlined) expression (%)') are percentages relative to the RPL39 and multiple CpG dinucleotides (shaded). The sequenced value. segment spans bp 4270 to bp 4634 (relative to the HOXA10 FIGS. 29A-29B are a series of bar graphs depicting the gene transcription initiation site) and thus the whole results of quantitative methylation specific PCR (qMSP) 55 sequenced segment is within the promoter area. 
analyses of DNA from (A) the SLC9A3R1, FNDC1, FIG.38 is a depiction of the nucleotide sequence (SEQID FOXC1, PACAP LHX1, and HOXA10 genes in putative NO:1543) of a region of the SLC9A3R1 gene containing the breast cancer stem cells (T-EPCR+) and cells with more relevant AscI recognition sequence (in bold and underlined) differentiated phenotype from the same tumor (T-CD24+), and multiple CpG dinucleotides (shaded). The sequenced and (B) the HOXA10, FOXC1, PACAP and LHX1 genes 60 segment spans bp 11713 to bp 11978 (relative to the from matched primary tumors (indicated by a star) and SLC9A3R1 gene transcription initiation site) and thus the distant metastases (DM) collected from different organs. The whole sequenced segment is within introns 1-2. average Ct value for each gene was normalized against the FIG. 39 is a depiction of the nucleotide sequence (SEQID RPL39 value (see Example 1). The data (“Relative expres NO:11544) of a region of the CDC42Ep5 gene containing sion (%)') are percentages relative to the RPL39 value. 65 the relevant AscI recognition sequence (in bold and under FIG. 30 is a depiction of the nucleotide sequence (SEQID lined) and multiple CpG dinucleotides (shaded). The NO:11) of a region of the FNDC1 gene containing the sequenced segment spans bp 7855 to bp 8058 (relative to the US 9,556,430 B2 17 18 CDC42Ep5 gene transcription initiation site) and thus the denotes the phosphate group that occurs between the C and whole sequenced segment is within exon 3. G nucleoside residues in the CpG dinucleotide sequence. The MMRE recognition sequence can contain one, two, DETAILED DESCRIPTION three, or four C residues that are susceptible to methylation. If one (or more) of the C residues in a MMRE recognition Various aspects of the invention are described below. 
sequence is methylated, the MMRE does not cut the DNA Methylation Specific Digital Karyotyping (MSDK) at the relevant MMRE recognition sequence Examples of MSDK is a method of assessing the relative level of useful MMRE include, without limitation, AscI, Aat, Aci, methylation of an entire genome, or part of a genome, of a Afe, AgeI, Asis Aval, BceAI, BssHI, ClaI, Eagl, Hpy99I, cell of interest. The cell can be any DNA-containing bio 10 Mlul, Narl, NotI, SacII, or ZraAI The AscI recognition logical cell in which the DNA is subject to methylation, e.g., sequence is GGCGCGCC and thus contains two methyl prokaryotic cells (e.g., bacteria) or eukaryotic cells (e.g., ation sites (CpG sequences). If either one or both is meth yeast cells, protozoan cells, invertebrate cells, or vertebrate ylated, the recognition site is not cut by AscI. There are (e.g., mammalian) cells). approximately 5,000 AscI recognition sites per human Vertebrate cells can be from any vertebrate species, e.g., 15 genome. reptiles (e.g., Snakes, alligators, and lizards), amphibians Exposure of the genomic DNA to the MMRE results in a (e.g., frogs and toads), fish (e.g., salmon, sharks, or trout), plurality of first fragments, the absolute number of which birds (e.g., chickens, turkeys, eagles, or ostriches), or mam will depend on the relative number of MMRE recognition mals. Mammals include, for example, humans, non-human sites that are methylated. The more that are methylated, the primates (e.g., monkeys, baboons, or chimpanzees), horses, fewer first fragments will result. Most of the first fragments bovine animals (e.g., cows, oxen, or bulls), whales, dol will have at one terminus the MMRE 5" cut sequence (see phins, porpoises, pigs, sheep, goats, cats, dogs, rabbits, definition below) and at the other terminus the MMRE 3' cut gerbils, guinea pigs, hamsters, rats, or mice. Vertebrate and sequence (see definition below). 
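The blocking of AscI by CpG methylation described above can be illustrated with a short Python sketch. It is an illustration only, not part of the disclosed method: the function names, the toy sequence, and the representation of methylation as a set of top-strand cytosine positions are assumptions made for this example.

```python
ASCI_SITE = "GGCGCGCC"  # AscI recognition sequence, as given in the text


def find_sites(seq, site=ASCI_SITE):
    """Return the 0-based start position of every occurrence of `site` in `seq`."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def cpg_offsets(site=ASCI_SITE):
    """Offsets, within the site, of each C that is immediately followed by G."""
    return [i for i in range(len(site) - 1) if site[i:i + 2] == "CG"]


def cuttable_sites(seq, methylated_c):
    """Sites the enzyme can still cut: no CpG cytosine in the site is methylated.

    `methylated_c` is a set of 0-based positions of methylated cytosines on the
    top strand (a hypothetical input format chosen for this sketch).
    """
    offsets = cpg_offsets()
    return [
        p for p in find_sites(seq)
        if not any(p + o in methylated_c for o in offsets)
    ]


# Toy sequence with two AscI sites; the C at position 16 lies in the second site.
toy = "AAGGCGCGCCTTTTGGCGCGCCAA"
print(find_sites(toy))
print(cuttable_sites(toy, {16}))
```

On a linear molecule, k cuttable sites give k + 1 first fragments, so each methylated site in this sketch lowers the fragment count by one, in line with the statement that more methylation yields fewer first fragments.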
mammalian cells can be any nucleated cell of interest, e.g., epithelial cells (e.g., keratinocytes), myoepithelial cells, endothelial cells, fibroblasts, melanocytes, hematological cells (e.g., macrophages, monocytes, granulocytes, T lymphocytes (e.g., CD4+ and CD8+ lymphocytes), B-lymphocytes, natural killer (NK) cells, interdigitating dendritic cells), nerve cells (e.g., neurons, Schwann cells, glial cells, astrocytes, or oligodendrocytes), muscle cells (smooth and striated muscle cells), chondrocytes, osteocytes. Also of interest are stem cells, progenitor cells, and precursor cells of any of the above-listed cells. Moreover the method can be applied to malignant forms of any of the cells listed herein.

The cells can be of any tissue or organ, e.g., skin, eye, peripheral nervous system (PNS; e.g., vagal nerve), central nervous system (CNS; e.g., brain or spinal cord), skeletal muscle, heart, arteries, veins, lymphatic vessels, breast, lung, spleen, liver, pancreas, lymph node, bone, cartilage, joints, tendons, ligaments, gastrointestinal tissue (e.g., mouth, esophagus, stomach, small intestine, large intestine (e.g., colon or rectum)), genitourinary system (e.g., kidney, bladder, uterus, vagina, ovary, ureter, urethra, prostate, penis, testis, or scrotum). Cancer cells can be of any of these organs and tissues and include, without limitation, breast cancers (any of the types and grades recited herein), colon cancer, prostate cancer, lung cancer, pancreatic cancer, melanoma.

MSDK can be performed on an entire genome of a cell, e.g., whole DNA extracted from an entire cell or the nucleus of a cell. Alternatively, it can be carried out on part of a cell, e.g., by extracting DNA from mutant cells lacking part of a genome, chromosome microdissection, or subtractive/differential hybridization. The method is performed on double-stranded DNA and, unless otherwise stated, in describing MSDK, the term "DNA" refers to double-stranded DNA.

Method of Making a MSDK Library

In the first step of the MSDK, genomic DNA is exposed to a methylation-sensitive mapping restriction enzyme (MMRE) that cuts the DNA at sites having the recognition sequence for the relevant MMRE. The MMRE can be any MMRE. In eukaryotic cells, methylation generally occurs at C nucleotides in CpG dinucleotide sequences in DNA. The term "CpG" refers to dinucleotide sequences that occur in DNA and consist of a C nucleotide and G nucleotide immediately 3' of the C nucleotide. The "p" in "CpG"

For each chromosome, two fragments with MMRE cut sequences at only one terminus will be generated; these first fragments are referred to herein as terminal first fragments. One such terminal first fragment contains the 5' terminus of the chromosome at one end and a MMRE 3' cut sequence at the other end and the other terminal fragment contains the 3' terminus of the chromosome at one end and a MMRE 5' cut sequence at the other end.

As used herein, a "5' cut sequence" of a restriction enzyme that cuts DNA within the restriction enzyme's recognition sequence is the portion of the restriction enzyme's recognition sequence at the 5' end of a fragment containing the 3' end of the restriction enzyme recognition sequence that is generated by cutting of DNA by the restriction enzyme. As used herein, a "3' cut sequence" of a restriction enzyme that cuts DNA within the restriction enzyme's recognition sequence is the portion of the restriction enzyme's recognition sequence at the 3' end of a fragment containing the 5' end of the restriction enzyme recognition sequence that is generated by cutting of DNA by the restriction enzyme. 5' and 3' restriction enzyme cut sequences are illustrated in FIG. 1.

To the termini of the first fragments are conjugated a first member of an affinity pair (see definition in Summary section), e.g., biotin or iminobiotin. This can be achieved by, for example, ligating to the MMRE 5' and 3' cut sequence-containing termini a binding moiety. The binding moiety contains the first member of the affinity pair conjugated (e.g., by a covalent bond or any other stable chemical linkage, e.g., a coordination bond, that can withstand the relatively mild chemical conditions of the MSDK methodology) to either a MMRE 5' cut sequence or a MMRE 3' cut sequence. The majority of the fragments (referred to herein as second fragments) resulting from attachment by this method of the first members of the affinity pair will have first members of an affinity pair bound to both their termini. Second fragments resulting from terminal first fragments will of course have first members of the affinity pair only at one terminus, i.e., the terminus containing the MMRE cut sequence.

The binding moiety can, optionally, also contain a linker (or spacer) nucleotide sequence of any convenient length, e.g., one to 100 base pairs (bp), three to 80 bp, five to 70 bp, seven to 60 bp, nine to 50, or 10 to 40 bp. The linker (or spacer) can be, for example, 30, 31, 32, 33, 34, 35, 36, 37, 38, or 40 bp long. As will be apparent, the linker must not include a fragmenting restriction enzyme (see below) recognition sequence.

Instead of using the above-described binding moiety to attach the first members of an affinity pair to the termini of first fragments, the attachment can be done by any of a variety of chemical means known in the art. In this case, the first member of an affinity pair can optionally contain a functional chemical group that facilitates binding of the first member of the affinity pair to the termini of the first fragments. It will be appreciated that by using this "chemical method", it is possible to attach first members of an affinity pair to both ends of terminal first fragments. Naturally, using the chemical method it is also possible to include the above-described linker (or spacer) nucleotide sequences. Where a functional chemical group is attached to the first member of the affinity pair, the linker (or spacer) nucleotide sequence is located between the first member of the affinity pair and the chemical functional group.

The second fragments are then exposed to fragmenting restriction enzyme (FRE). The FRE can be any restriction enzyme whose recognition sequence occurs relatively frequently in the genomic DNA of interest. Thus, restriction enzymes having four-nucleotide recognition sequences are particularly desirable as FRE. In addition, the FRE should not be sensitive to methylation, i.e., its recognition sequence, at least in eukaryotic DNA, should not contain a CpG dinucleotide sequence. Preferably, the FRE recognition sequence should occur at least 10 (e.g., at least: 20; 50; 100; 500; 1,000; 2,000; 5,000; 10,000; 25,000; 50,000; 100,000; 200,000; 500,000; 10^6; or 10^7) times more frequently in the genome than does the MMRE recognition sequence. Examples of useful FRE whose recognition sequences consist of four nucleotides include, without limitation, AluI, BfaI, CviAII, FatI, HpyCH4V, MseI, NlaIII, or Tsp509I. The recognition sequence for NlaIII is CATG. Exposure of the second fragments to the FRE results in a large number of fragments, the majority of which will have FRE cut sequences at both of their termini and a relatively few with a FRE cut sequence (5' or 3') at one end and the first member of the affinity pair (corresponding to a MMRE cut sequence) at the other end. The latter fragments are referred to herein as third fragments.

The third fragments are then exposed to a solid substrate having bound to it the second member of the affinity pair (e.g., avidin, streptavidin, or a functional fragment of either; see Summary section for examples of other useful second members) corresponding to the first member of the affinity pair in the third fragments. The third fragments bind, via the physical interaction between the first and second members of the affinity pair, to the solid substrate. The solid substrate can be any insoluble substance such as plastic (e.g., plastic microtiter well or petri plate bottoms), metal (e.g., magnetic

(also referred to herein sometimes as a tagging enzyme) recognition sequence. This can be achieved by, for example, ligating to the free termini (containing a FRE 5' or 3' cut sequence) releasing moieties containing the FRE 5' or 3' cut sequence and, 5' of the cut sequence, the RRE recognition sequence. Restriction enzymes useful as RRE are those that cut DNA at specific distances (depending on the particular type IIs restriction enzyme) from the recognition sequence, e.g., without limitation, the type IIs and type IIB. An example of a useful RRE is MmeI that has the following non-palindromic recognition sequence: 5'-TCCPuAC, 3'-AGGPyTG (Pu, purine; Py, pyrimidine) and cuts DNA after the twentieth nucleotide downstream of the TCCPuAC sequence [Boyd et al. (1986) Nucleic Acids Res. 14(13): 5255-5274]. Other useful type IIs restriction enzymes include, without limitation, BsmFI, FokI, and AlwI, and useful type IIB restriction enzymes include, without limitation, BsaXI, CspCI, AloI, PpiI, and others listed in Tengs et al. (2004) Nucleic Acids Research 32(15):e21 (pages 1-9), the disclosure of which is incorporated herein by reference in its entirety.

Releasing moieties can optionally contain, immediately 5' of the RRE recognition sequence, additional nucleotides as an extending sequence. The extending sequence can be of any convenient length, e.g., one to 100 bp, three to 80 bp, five to 70 bp, seven to 60 bp, nine to 50, or 10 to 40 bp. The extending sequence can be, for example, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, or 40 bp long.

Conjugating the RRE recognition sequence to the free termini of the bound third fragments results in bound fourth fragments that (a) have RRE recognition sequences at their free termini, and (b) are bound by the first and second members of the affinity pair to the solid substrate. The bound fourth fragments are then exposed to the RRE which cuts the bound fourth fragments at a position that is characteristic of the relevant RRE. In the case of the MmeI RRE, the bound fourth fragment is cut on the downstream side of the twentieth nucleotide after the terminal C residue of the TCCPuAC recognition sequence. The exposure results in the release from the solid substrates of a library of fifth fragments. Each of the fifth fragments contains the RRE recognition sequence (and extending sequence if used) and a plurality of bp of the test genomic DNA, including the FRE recognition sequence closest to an unmethylated MMRE recognition sequence. The absolute number of these bp of the test genomic DNA in the fifth fragments will vary from one RRE to another and is, in the case of MmeI, 20 nucleotides. The sequence of genomic DNA in the fifth fragment (but without the FRE recognition sequence) is referred to herein as a MSDK tag. Since the MmeI and NlaIII recognition sequences overlap by one nucleotide, the tags generated using MmeI as the RRE and NlaIII as the FRE are 17 nucleotides long.
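The 17-nucleotide tag length is consistent with simple arithmetic: if the final C of TCCPuAC doubles as the first C of CATG, the remaining three bases of CATG use up 3 of MmeI's 20 nucleotides, leaving 17 bases of adjacent genomic sequence. The Python sketch below shows, under stated assumptions, how the two tags flanking one unmethylated AscI site could be read from a known sequence. The function names and toy sequence are hypothetical; returning no tag when fewer than 17 bases separate the CATG from the AscI site, and reporting the downstream tag as a reverse complement, are simplifications chosen for this example rather than details taken from the text.

```python
MMRE_SITE = "GGCGCGCC"  # AscI recognition sequence (from the text)
FRE_SITE = "CATG"       # NlaIII recognition sequence (from the text)
TAG_LEN = 17            # tag length stated in the text for MmeI with NlaIII

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def tags_for_site(genome, site_start):
    """Return (upstream_tag, downstream_tag) for the AscI site at `site_start`.

    Each tag is the TAG_LEN bases lying next to the nearest CATG, on the side
    facing the AscI site. A tag is None when no CATG exists on that side or
    when fewer than TAG_LEN bases separate the CATG from the AscI site.
    """
    site_end = site_start + len(MMRE_SITE)

    up_tag = None
    up = genome.rfind(FRE_SITE, 0, site_start)  # nearest CATG before the site
    if up != -1 and up + len(FRE_SITE) + TAG_LEN <= site_start:
        begin = up + len(FRE_SITE)
        up_tag = genome[begin:begin + TAG_LEN]

    down_tag = None
    down = genome.find(FRE_SITE, site_end)      # nearest CATG after the site
    if down != -1 and down - TAG_LEN >= site_end:
        # Assumption: report this tag on the opposite strand so that it also
        # reads away from its CATG.
        down_tag = revcomp(genome[down - TAG_LEN:down])

    return up_tag, down_tag


# Toy sequence: CATG, 17 bases, spacer, AscI site, spacer, 17 bases, CATG.
toy = "CATG" + "A" * 17 + "TT" + MMRE_SITE + "TT" + "T" * 16 + "G" + "CATG"
print(tags_for_site(toy, toy.find(MMRE_SITE)))
```

Applying such a function to every AscI site of a reference sequence would give the set of tags that could, in principle, be observed from that sequence.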
metallic beads), agarose (e.g., agarose beads), or glass (e.g., glass beads or the bottom of a glass vessel such as a glass beaker, test tube, or flask) to which the third fragments can bind and thus be separated from fragments not containing the first member of the affinity pair.

Fragments not bound to the solid substrate are removed from the mixture and the solid substrate is optionally rinsed or washed free of any non-specifically bound material. The third fragments bound to the solid substrate are referred to as bound third fragments.

The terminus of the bound third fragment not bound to the solid substrate (referred to herein as the free terminus) is then conjugated to a releasing restriction enzyme (RRE)

The greater the number of bp between the RRE recognition sequence and the cutting site of the RRE, the longer the MSDK tags will be. The longer the MSDK tags are, the lower the chances of redundancy due to a plurality of occurrences of the tag sequence in the genome of interest will be. In addition, it will be appreciated that the number of bp between FRE recognition sequences and corresponding MMRE recognition sequences in the genomic DNA of interest will optimally be greater than the number of bp between the RRE recognition sequence and the RRE cut site. However, problems arising due to this criterion not being met can be obviated by using the binding moiety method of attaching a first member of an affinity pair to first fragment termini and including in the binding moiety a linker (or spacer) nucleotide sequence of appropriate length (see above); the shorter the distance between any given FRE recognition sequence and a corresponding MMRE recognition sequence in a genome being analyzed, the longer the linker (or spacer) nucleotide sequence would need to be.

Methods of Using a MSDK Tag Library

MSDK libraries generated as described above can be used for a variety of purposes.

The first step in most of such methods would be to at least identify the nucleotide sequences of as many MSDK tags obtained in making a library as possible. There are many ways in which this could be done which will be apparent to those skilled in the art. For example, array technology or the MPSS (massively parallel signature sequencing) method could be exploited for this purpose. Alternatively, the MSDK tag-containing fifth fragments (see above) can be cloned into sequencing vectors (e.g., plasmids) and sequenced using standard sequencing techniques, preferably automated sequencing techniques.

Since the chance of any one ditag combination occurring more than once as a result of step (a) above would be extremely low, replicate ditags would likely be due to the PCR amplification procedure. Ways to estimate the numbers of individual tag sequences include the same methods described above for identifying the tag sequences.

The relative abundance (number) of a given MSDK tag obtained gives an indication of the relative frequency at which the nearest MMRE recognition sequence to the FRE recognition sequence associated with the given tag is unmethylated. The higher the number of the MSDK tag obtained, the more frequently that MMRE recognition sequence is unmethylated. Because, by the nature of the method, any given MMRE recognition sequence is correlated with a MSDK tag associated with the nearest FRE recognition sequence upstream of it and with the nearest FRE recognition sequence downstream of it, if any two MMRE recognition sites occur without an appropriate FRE recognition site between them, it will always be possible to discriminate the methylation status (methylated or not methylated) of both the MMRE recognition sites.
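The use of relative tag abundance described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the inventors' analysis: the function names, the default scale of 10,000 tags, and the three-letter toy tags (real MSDK tags are longer) are all hypothetical.

```python
from collections import Counter


def tag_profile(tags, scale=10_000):
    """Map each tag to its count normalized to `scale` total tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n * scale / total for tag, n in counts.items()}


def compare_profiles(profile_a, profile_b):
    """Pair the normalized abundance of every tag seen in either profile."""
    tags = sorted(set(profile_a) | set(profile_b))
    return {t: (profile_a.get(t, 0.0), profile_b.get(t, 0.0)) for t in tags}


# Toy libraries: each list entry is one sequenced tag.
library_a = ["AAA", "AAA", "AAA", "CCC"]
library_b = ["AAA", "CCC", "CCC", "CCC"]
for tag, (a, b) in compare_profiles(
    tag_profile(library_a, scale=100), tag_profile(library_b, scale=100)
).items():
    print(tag, a, b)
```

In this toy case the tag "AAA" is more abundant in library A, which under the reasoning above would suggest that the associated recognition site is unmethylated more often in the cells from which library A was made. Whether such a difference is significant needs a statistical test, which this sketch does not attempt.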
On the other hand if three MSDK tag sequences (see Example 1 below) adapted from MMRE recognition sites occur without an FRE recognition the Sequential Analysis of Gene Expression (SAGE) tech sequence between the first and third, it might not be possible nique Porter et al. (2001) Cancer Res.61:5697-5702: Krop to discriminate the methylation status of the middle MMRE et al. (2001) Proc. Natl. Acad. Sci. U.S.A98:9796-9801: Lal 25 recognition sequence. However, the chances of this occur et al. (1999) Cancer Res. 59:5403-5407; and Boon et al. ring can be reduced to essentially Zero by choosing a FRE (2002) Proc. Natl. Acad. Sci. U.S.A. 99:11287-11292). This that has a recognition sequence occurring in the genomic adapted technique involves: DNA of interest much more frequently than the selected (a) adding a DNA ligase enzyme to a library of fifth MMRE. Indeed prior to the analysis, since generally the fragments and thereby ligating pairs of fifth fragments 30 sequence of the genome of interest is known, this potential having cohesive RRE-derived ends together to form fifth resolution-impairing eventuality can be tested for in advance fragment dimers (also referred to herein as “ditags'); and overcome by examining the genomic nucleotide (b) increasing the numbers of individual ditags by PCR sequences and, if necessary, an alternative MMRE-FRE using primers whose sequences correspond to nucleotide combination can be selected or a plurality of analyses can be sequences in extender sequences derived from a releasing 35 performed using a number of different MMRE-FRE com moiety (see above); binations. 
(c) digesting the PCR-amplified ditags with the FRE used MSDK tag profiles composed of all the tag sequences to generate the MSDK library and thereby generating obtained in an MSDK analysis, and preferably (but not digested ditags lacking the RRE site and extender sequences necessarily) the relative numbers of all the MSDK tags, can (if used); 40 be compared to corresponding profiles obtained with other (d) concatamerizing (polymerizing) the ditags using a cell types. Corresponding profiles will of course be those ligase enzyme (e.g., T4 ligase) to create ditag multimers; generated using the same MMRE, FRE, and RRE and in at (e) cloning the ditag multimers into sequencing vectors least an overlapping part, if not an identical portion, of the and sequencing the inserts (e.g., by automatic sequencing relevant genome. Such comparisons can be used, for methods); and 45 example, to identify a test cell of interest. For example, a test (f) deducing from the ditag multimer sequences the cell could be a cell of type X, type y, or type Z. The MSDK sequences of individual MSDK tags. profile obtained with the test cell can be compared to control One of skill in the art will naturally know of ways to corresponding MSDK profiles obtained from control cells of modify and adapt the above tag identification procedure to type X, type y, and type Z. The test cell will likely be of the his or her particular requirements. For example, one or more 50 same type, or at least most closely related, to the control cell of the steps (e.g., step (b), the ditag amplification step or step (type x, y, or Z) whose MSDK profile the test cell’s profile (c), the step that removes the RRE recognition site and any most closely resembles. Alternatively, the MSDK profile of extender sequence used) could be omitted. 
a test cell can be compared to that of a single control cell and, if the test cell's profile is significantly different from that of the control cell's profile, it is likely to be of a different type than the control cell type. Statistical methods for doing the above-described analyses are known to those skilled in the art.
The number of MSDK tag species in any given MSDK tag profile varies greatly depending on how many are available and their relative discriminatory power. Indeed, where a particular MSDK tag can discriminate specifically between two cell types of interest, the MSDK tag profile can contain it alone. Thus MSDK tag profiles can contain as few as one MSDK tag. However, they will generally contain a plurality of different MSDK tags, e.g., at least: 2; 3; 4; 5; 6; 7; 8; 9; 10; 12; 15; 20; 25; 30; 35; 40; 50; 60; 75; 85; 100; 120; 140; 160; 180; 200; 250; 300; 350; 400; 450; 500; 600; 700; 800; 900; 1,000; 2,000; 5,000; 10,000; or even more tag species.
The range of "cell types" that can be compared in the above analyses is of course enormous.
Having obtained the sequences of some or all of the MSDK tags, there are a number of analyses that could be pursued.
Enumeration of MSDK Tags
The numbers of each tag, or a subgroup of tags, in a MSDK library can be computed. Then, for example, optionally having normalized the number of each to the total number of cloned tag sequences obtained, the resulting MSDK profile (consisting of a list of MSDK tags and the abundance (number) of each MSDK tag) can be compared to corresponding MSDK profiles obtained with other cells of interest. In computing the total numbers of individual MSDK tags, where ditags have been amplified by PCR (step (b) above), ditag replicates are deleted from the analysis.
of the genome) of interest. This can be done manually but is preferably done by computer. The relevant genomic sequence information can be loaded into the computer from
a medium (e.g., a computer diskette, a CD ROM, or a DVD) or it can be downloaded from a publicly available internet database.
One method by which the genomic tag sequences can be identified is by first creating a "virtual tag library" using the following information: (a) the nucleotide sequence of the genome (or part of the genome) of interest; (b) the nucleotide sequence of the MMRE recognition sequence; (c) the nucleotide sequence of the FRE recognition sequence; and (d) the number of nucleotides separating the RRE recognition sequence from the RRE cutting site. Optimally, virtual tag sequences that are not unique (i.e., that could arise in a MSDK library from more than one genetic locus) are deleted from the virtual MSDK library. By comparing the sequences of the tags obtained in the test MSDK analysis to the virtual tag library, it is possible to determine the genomic location of MSDK tags of interest, e.g., all the tags obtained by the analysis or one or more of such tags.
Once the genomic location of the genomic tag sequences has been obtained, it is a simple matter to identify genes in which, or close to which, the genomic tag sequences are located. This step can be done manually, but can also be done by a computer. Such genes can be the subject of additional analyses, e.g., those described below.
Methods of Determining Levels of DNA Methylation
The invention features methods of assessing the level of methylation of genomic regions (e.g., genes or subregions of genes) of interest. The methods can be applied to genomic regions identified by the MSDK analyses described above or selected on any other basis, e.g., the observation of differential expression of a gene in two cell types (e.g., a normal cell and a cancer cell of the same tissue as the normal cell) of interest.
The methods are of particular interest in the diagnosis of cancer. In broad terms, it has been claimed that the genomes of cancer cells are hypomethylated relative to corresponding normal cells (Feinberg et al. (1983) Nature 301:89-92). Moreover, gene hypermethylation is frequently associated with decreased expression of the relevant gene. However, at the individual gene level these generalizations do not apply. Thus, for example, some genes can be hypermethylated in cancer cells in comparison to corresponding normal cells, hypermethylation of some genes is associated with increased expression, and hypomethylation of some genes is associated with decreased expression of the relevant genes. Interestingly, in the examples below, it was observed that hypermethylation of the promoter region of one gene (Cxorf12) was associated with decreased expression of the gene, while hypermethylation of the exons and/or introns of three other genes (PRDM14, HOXD4, and CDC42EP5) was associated with increased expression of the genes.
As used herein, the term "gene" refers to a genomic region starting 10 kb (kilobases) 5' of a transcription initiation site and terminating 2 kb 3' of the polyA signal associated with the coding sequence within the genomic region. Where the polyA signal of another gene is located less than 10 kb 5' of the transcription initiation site of a gene of interest, for the purposes of the instant invention, the gene of interest is considered to start at the first nucleotide immediately after the polyA signal of the other gene.
Thus, for example, the MSDK profile of a test bacterium can be compared to control MSDK profiles of bacteria of various species of the same genus as the test bacterium (if its genus is known but its species is to be defined); various strains of the same species as the test bacterium (if its species is known but its strain is to be defined) or even various isolates of the same strain as the test bacterium but from, for example, various ecological niches (if the strain of the test bacterium, but not its ecological origin, is known). The same principle can be applied to any biological cell and to any level of speciation of a biological cell. Similarly the MSDK profiles of eukaryotic (e.g., mammalian) test cells can be compared to corresponding MSDK profiles of control test cells of various tissues, of various stages of development, and of various lineages. In addition, the MSDK profile of a test vertebrate cell can be compared to one or more control MSDK profiles of cells (of, for example, the same tissue as the test cell) that are normal or malignant in order to determine (diagnose) whether the test cell is a malignant cell. Moreover, the MSDK profile of a cancer test cell can be compared to one or more control MSDK profiles of cancers of a variety of tissues in order to define the tissue origin of the test cell. In addition, the MSDK profile of a test cell can be compared to that or those of (a) control test cell(s) that can be identical to, or similar to or even different from, the test cell but has/have been exposed or subjected to any of a large number of experimental or natural influences, e.g., drugs, cytokines, growth factors, hormones, or any other pharmaceutical or biological agents, physical influences (e.g., elevated and/or depressed temperature or pressure), or environmental conditions (e.g., drought or monsoon conditions). It will thus be appreciated that the term "cell type" covers a large variety of cells and that (or those) used or defined in any particular analysis will depend on the nature of analysis being performed. Those skilled in the art will be able to select appropriate control cell types for the analyses of interest.
Examples of MSDK profiles useful as control test profiles are provided herein. Thus, for example, the MSDK profile of a test breast cell (e.g., an epithelial cell, a myoepithelial cell, or a fibroblast) from a human subject could be compared to the MSDK profiles of breast epithelial cells, myoepithelial cells, and fibroblast-enriched stromal cells from both control normal and control breast cancer (e.g., DCIS or invasive breast cancer) subjects in order to establish whether the test breast tissue from which the test breast cell was obtained is cancerous breast tissue. Moreover, the MSDK profile of a test cancer cell can be compared to those of control breast, prostate, colon, lung, and pancreatic cancer cells as part of an analysis to establish the tissue of the test cancer cell. In addition, the MSDK profile of a cell suspected of being either an epithelial or myoepithelial cell can be compared to those of control normal (and/or cancerous, depending on whether the test cell is normal, cancerous, or not yet established to be normal or cancerous) epithelial and myoepithelial cells in order to establish whether the test cell is an epithelial or myoepithelial cell.
Mapping of MMRE Recognition Sequences
Alternatively, or in addition to enumerating MSDK tags, once the tags obtained by the MSDK analysis have been
identified, the locations in the genome of interest corresponding to the tags (referred to herein as "genomic tag sequences") can be established by comparison of the tag sequences to the nucleotide sequence of the genome (or part
regions, all or parts of transcribed regions, exons, introns, and regions 3' of polyA signals) can be analyzed.
Specific genes of interest include, for example, the LMX-1A, COL5A, LHX3, TCF7L1, PRDM14, ZCCHC14, HOXD4, SLC9A3R1, CDC42EP5, Cxorf12, LOC389333, SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10 genes.
Methylation levels of one or more of these DNA sequences (e.g., genes) can be used to determine, for example, whether a test epithelial cell from breast tissue is a normal or cancerous epithelial cell (e.g., a DCIS (high, intermediate, or low grade) or invasive breast cancer cell). Particularly useful for such determinations are the PRDM14 and ZCCHC14 genes.
Moreover, where a transcription initiation site of another gene is less than 2 kb 3' of the polyA signal of the gene of interest, for the purposes of the instant invention, the gene of interest terminates at the nucleotide immediately before the transcription initiation site of the other gene. From these definitions it will be appreciated that, as used herein, promoter regions and regions 3' of polyA signals of adjacent genes can overlap.
As used herein, the "promoter region" of a gene refers to a genomic region starting 10 kb 5' of a transcription initiation site and terminating at the nucleotide immediately 5' of the transcription initiation site. Where a polyA signal of another gene is located less than 10 kb 5' of the transcription initiation site of a gene of interest, for the purposes of the instant invention, the promoter region of the gene of interest starts at the first nucleotide immediately following the polyA signal of the other gene.
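The coordinate rules in the definitions of "gene" and "promoter region" above can be illustrated with a minimal sketch. It is not part of the described methods. It assumes 1-based inclusive coordinates on the forward strand, and it assumes that a polyA signal is represented by the position of its last nucleotide; the function and parameter names are hypothetical.

```python
from typing import Optional, Tuple

UPSTREAM_BP = 10_000    # "10 kb 5' of a transcription initiation site"
DOWNSTREAM_BP = 2_000   # "2 kb 3' of the polyA signal"


def _five_prime_start(tss: int, upstream_polya_end: Optional[int]) -> int:
    """Start of the gene (and of its promoter region) on the 5' side."""
    start = tss - UPSTREAM_BP
    # If another gene's polyA signal lies less than 10 kb 5' of the TSS,
    # start at the first nucleotide after that polyA signal.
    if upstream_polya_end is not None and tss - upstream_polya_end < UPSTREAM_BP:
        start = upstream_polya_end + 1
    return max(start, 1)  # assumption: clamp at the start of the sequence


def gene_region(tss: int, polya_end: int,
                upstream_polya_end: Optional[int] = None,
                downstream_tss: Optional[int] = None) -> Tuple[int, int]:
    """Return (start, end) of the "gene" as defined in the text."""
    end = polya_end + DOWNSTREAM_BP
    # If another gene's TSS lies less than 2 kb 3' of the polyA signal,
    # end at the nucleotide immediately before that TSS.
    if downstream_tss is not None and downstream_tss - polya_end < DOWNSTREAM_BP:
        end = downstream_tss - 1
    return _five_prime_start(tss, upstream_polya_end), end


def promoter_region(tss: int,
                    upstream_polya_end: Optional[int] = None) -> Tuple[int, int]:
    """Return (start, end) of the "promoter region" as defined in the text."""
    return _five_prime_start(tss, upstream_polya_end), tss - 1
```

For a reverse-strand gene the same rules would apply with the directions mirrored; that case is left out here to keep the sketch short.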
For example, with respect to the PRDM14 gene, a gene segment that is or contains all or part of SEQ ID NO:1 (FIG. 6A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 8-17; 341-392; 371-426; or 391-405 of SEQ ID NO:1. Methylation of the PRDM14 can similarly be used to determine whether a test cell from, for example, pancreas, lung, or prostate is a cancer cell or normal cell. In addition, with respect to the ZCCHC14 gene, a gene segment that is or contains all or part of SEQ ID NO:2 (FIG. 17) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 154-236; 154-279; 154-293; or 154-299 of SEQ ID NO:2. Hypermethylation of these genes, and particularly hypermethylation of their coding regions, would indicate that the relevant test cells are cancer cells.
As used herein, the terms "exons" and "introns" refer to amino acid coding and non-coding, respectively, nucleotide sequences occurring between the transcription initiation site and start of the polyA sequence of a gene.
As used herein, a "CpG island" is a sequence of genomic DNA in which the number of CpG dinucleotide sequences is significantly higher than their average frequency in the relevant genome. Generally, CpG islands are not greater than 2,000 (e.g., not greater than: 1,900; 1,800; 1,700; 1,600; 1,500; 1,400; 1,300; 1,200; 1,100; 1,000; 900; 800; 700; 600; 500; 400; 300; 200; 100; 75; 50; 25; or 15) bp long. They will generally contain not less than one CpG sequence to every 100 (e.g., every: 90; 80; 70; 60; 50; 40; 35; 30; 25; 20; 15; 10; or 5) bp in sequence of DNA. CpG islands can be separated by at least 20 (i.e., at least: 20; 35; 50; 60; 80; 100; 150; 200; 250; 300; 350; or 500) bp of genomic DNA.
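The two numeric criteria given above for a "CpG island" (length not greater than 2,000 bp, and not less than one CpG per 100 bp) can be checked with a short sketch. This is illustrative only: the function names are hypothetical, and the comparison against the average CpG frequency of the relevant genome, which the definition also requires, is not implemented here.

```python
def cpg_count(seq: str) -> int:
    """Count CpG dinucleotides (a C immediately followed by a G)."""
    return seq.upper().count("CG")


def meets_cpg_island_criteria(seq: str,
                              max_len_bp: int = 2000,
                              bp_per_cpg: int = 100) -> bool:
    """Check the length and CpG-density criteria stated in the text.

    max_len_bp and bp_per_cpg default to the broadest values in the text;
    the narrower alternatives listed there can be passed in instead.
    """
    n = len(seq)
    if n == 0 or n > max_len_bp:
        return False
    # "not less than one CpG sequence to every 100 bp":
    # count / n >= 1 / bp_per_cpg, written without division.
    return cpg_count(seq) * bp_per_cpg >= n
```

Passing, for example, `bp_per_cpg=10` applies one of the stricter densities listed in the definition.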
In the methods of the invention, the degree of methylation of one or more C residues (in CpG sequences) in a gene of a test cell is determined. This degree of methylation can then be compared to that in one or more (e.g., two, three, four, five, six, seven, eight, nine, ten, 11, 12, 15, 18, 20, 25, 30, 35, 40, 50, 75, 100, 200, or more) control cells.
If the level of methylation in the test cell is altered compared to, for example, that of a control cell, the test cell is likely to be different from the control cell. For example, the test cell can be a cell from any of the vertebrate tissues recited herein, the control cell can be a normal cell of that tissue, and the gene can be any one that is differentially methylated in cells from cancerous versus normal tissue (e.g., any of the genes listed in Tables 2, 5, 7, 8, 10, 12 and 15). If the degree of methylation of the gene in the test cell is different from that in the normal cell, the test cell is likely to be a cancer cell.
Alternatively, the level of methylation in the test cell can be compared to that in two or more (see above) control cells. The cell will be the same as, or most closely related to, the control cell in which the degree of methylation is the same as, or most closely resembles, that of the test cell.
The whole of a gene or parts of a gene (e.g., the promoter region, the transcribed regions, the translated region, exons, introns, and/or CpG islands) can be analyzed.
Test and control cells can be the same as those listed above in the section on MSDK. Genes that can be analyzed can be any gene differently methylated in two or more cell types of interest. In the methods of the invention any number of genes can be analyzed in order to characterize a test cell of interest. Thus, one, two, three, four, five, six, seven, eight, nine, ten, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 25, 28, 30, 35, 40, 45, 50, 60, 70, 80, 100, 200, 500, or even more genes can be analyzed.
In addition, methylation levels of one or more of the above-listed genes can be used to determine, for example, whether a test epithelial cell from colon tissue is a normal or cancerous epithelial cell. Particularly useful for such determinations are the LHX3, TCF7L1, and LMX-1A genes. For example, with respect to the LHX3 gene, a gene segment that is or contains all or part of SEQ ID NO:3 (FIG. 6A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 667-778; 739-788; 918-931; or 885-903 of SEQ ID NO:3. In addition, for example, with respect to the TCF7L1 gene, a gene segment that is or contains all or part of SEQ ID NO:4 (FIG. 8A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 708-737; 761-780; 807-864; or 914-929 of SEQ ID NO:4. Moreover, for example, with respect to the LMX-1A gene, a gene segment that is or contains all or part of SEQ ID NO:5 (FIG. 7A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 849-878; 898-940; 948-999; or 1,020-1,039 of SEQ ID NO:5. Hypermethylation of these genes would indicate that the test cell is a cancerous colon epithelial cell.
Furthermore, methylation levels of the above-listed genes can be analyzed to determine, for example, whether breast tissue from which a test myoepithelial cell is obtained is normal or cancerous breast tissue. Particularly useful for such determinations are the HOXD4, SLC9A3R1, and CDC42EP5 genes. For example, with respect to the HOXD4 gene, a gene segment that is or contains all or part of SEQ ID NO:6 (FIG. 18A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose
are nucleotide sequences that include nucleotides: 185-255; 288-313; 312-362; or 328-362 of SEQ ID NO:6. In addition, for example, with respect to the SLC9A3R1 gene, a gene segment that is or contains all or part of SEQ ID NO:7 (FIG. 19A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 104-126; 104-247; 104-283; or 246-283 of SEQ ID NO:7. Moreover, for example, with respect to the CDC42EP5 gene, a gene segment that is or contains all or part of SEQ ID NO:8 (FIG. 21A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 181-247; 282-328; 336-359; or 336-390 of SEQ ID NO:8. Hypermethylation of these genes, and particularly their coding regions, would indicate that the test myoepithelial cell is from cancerous breast tissue.
The genes can be, for example, any of the DNA sequences (e.g., the genes) listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16. The entire genes or one or more subregions of the genes (e.g., all or parts of promoter
"+/-", and/or "-"; (c) methylated or not methylated (i.e., in a digital fashion); (d) ranges such as "0%-10%", "11%-20%", "21%-30%", "31%-40%", etc. (or any convenient range intervals); (e) graphically, e.g., in pie charts.
Methods of measuring the degree of methylation of C residues in the CpG sequences are known in the art. Such methodologies include sequencing of sodium bisulfite-treated DNA and methylation-specific PCR and are described in the Examples below.
Standardizing methylation assays to discriminate between cell types of interest involves experimentation entirely familiar and routine to those in the art.
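As a rough illustration of the ways of expressing methylation levels listed above, the following sketch converts read counts at a single CpG site into a percentage, a range label such as "11%-20%", and a digital call. The function names are hypothetical. The 50% cutoff for the digital call and the rule that a value above a range's upper bound falls into the next range are assumptions made for the example, not values taken from the text.

```python
import math


def percent_methylated(methylated_reads: int, total_reads: int) -> float:
    """Percentage of reads at a CpG site that report a methylated C."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * methylated_reads / total_reads


def range_label(percent: float) -> str:
    """Map a percentage to "0%-10%", "11%-20%", ..., "91%-100%"."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if percent <= 10:
        return "0%-10%"
    upper = math.ceil(percent / 10) * 10  # upper bound of the containing range
    return f"{upper - 9}%-{upper}%"


def digital_call(percent: float, cutoff: float = 50.0) -> str:
    """Express methylation "in a digital fashion"; cutoff is an assumed value."""
    return "methylated" if percent >= cutoff else "not methylated"
```

For instance, 3 methylated reads out of 12 give 25.0%, which falls in the "21%-30%" range and is called "not methylated" under the assumed cutoff.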
Methylation levels of the above-listed genes can also be analyzed to determine, for example, whether breast tissue from which a test fibroblast is obtained is normal or cancerous breast tissue. Particularly useful for such determinations is the Cxorf12 gene. For example, with respect to this gene, a gene segment that is or contains all or part of SEQ ID NO:9 (FIG. 22A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 120-134; 159-201; 206-247; or 293-313 of SEQ ID NO:9. Hypermethylation of this gene, and particularly its promoter region, would indicate that the test fibroblast is from cancerous breast tissue.
In addition, methylation levels of the above-listed genes can also be analyzed to determine, for example, whether a test cell is an epithelial cell or a myoepithelial cell. Such assays can be applied to both normal and cancerous cells. Particularly useful for such determinations are the LOC389333 and CDC42EP5 genes. For example, with respect to the LOC389333 gene, a gene segment that is or contains all or part of SEQ ID NO:10 (FIG. 20A) can be analyzed in order to discriminate these cell types. Of particular interest for this purpose are nucleotide sequences that include nucleotides: 306-330; 334-361; 373-407; or 415-484 of SEQ ID NO:10. With respect to the CDC42EP5 gene, examples of gene segments that can be analyzed include those described above for discriminating whether tissue from which a test myoepithelial cell was obtained was normal or cancerous. Significantly high levels of methylation of these genes would indicate that the test cell was an epithelial rather than a myoepithelial cell.
In addition, methylation levels of the above-listed genes can also be analyzed to determine, for example, whether a test cell is a stem cell, or a differentiated cell derived therefrom, such as an epithelial cell or a myoepithelial cell. Such assays can be applied to both normal and cancerous cells. Particularly useful for such determinations are the SOX13, SLC9A3R1, FNDC1, FOXC1, PACAP, DDN, CDC42EP5, LHX1, and HOXA10 genes. For example, with respect to the FOXC1 gene, a gene segment that is or contains all or part of SEQ ID NO:12 (FIG. 27A) can be analyzed in order to discriminate these cell types. In some cases, significantly high levels of methylation of some of these genes would indicate that the test cell was a stem cell rather than a differentiated cell derived therefrom (e.g., an epithelial or a myoepithelial cell).
Levels of methylation of C residues of interest can be assessed and expressed in quantitative, semi-quantitative, or qualitative fashions. Thus they can, for example, be measured and expressed as discrete values.
For example, the methylation status of gene Q in a sample of cancer cells of interest obtained from one or more patients and in corresponding normal cells from normal individuals or from the same patients can be assessed. From such experimentation it will be possible to establish a range of "cancer" levels of methylation and a range of "normal" levels of methylation of gene Q. Alternatively, the methylation status of gene Q in cancer cells of each patient can be compared to the methylation status of gene Q in normal cells (corresponding to the cancer cells) obtained from the same patient. In such assays, it is possible that methylation of as few as one cytosine residue could discriminate between cancer and non-cancer cells.
Other methods for quantitating methylation of DNA are known in the art. Such methods are based on: (a) the inability of methylation-sensitive restriction enzymes to cleave sequences that contain one or more methylated CpG sites (Issa et al. (1994) Nat. Genet. 7:536-540; Singer-Sam et al. (1990) Mol. Cell. Biol. 10:4987-4989; Razin et al. (1991) Microbiol. Rev. 55:451-458; Stoger et al. (1993) Cell 73:61-71); and (b) the ability of bisulfite to convert cytosine to uracil and the lack of this ability of bisulfite on methylated cytosine (Frommer et al. (1992) Proc. Natl. Acad. Sci. USA 89:1827-1831; Myöhänen et al. (1994) DNA Sequence 5:1-8; Herman et al. (1996) Proc. Natl. Acad. Sci. USA 93:9821-9826; Gonzalgo et al. (1997) Nucleic Acids Res. 25:2529-2531; Sadri et al. (1996) Nucleic Acids Res. 24:5058-5059; Xiong et al. (1997) Nucleic Acids Res. 25:2532-2534).
Gene Expression Assays
Experiments described in the Examples herein show that in a first cell in which methylation of a gene is altered (increased or decreased) relative to a second cell, expression of the gene in the first cell is also altered relative to the second cell. In addition, previous findings and the data in the Examples indicate that alterations in methylation status, and hence also consequent alterations in expression, of certain genes correlate with phenotypic changes in cells. These findings provide the basis for assays (e.g., diagnostic assays) to discriminate between two or more cell types.
In the methods of the invention, the level of expression of a gene of a test cell is determined. This level of expression can then be compared to that in one or more (e.g., two, three, four, five, six, seven, eight, nine, ten, 11, 12, 15, 18, 20, 25, 30, 35, 40, 50, 75, 100, 200, or more) control cells.
If the level of expression in the test cell is altered compared to, for example, that of a control cell, the test cell is likely to be different from the control cell. For example, the test cell can be a cell from any of the vertebrate tissues recited herein, the control cell can be a normal cell of that
tissue, and the gene can be one shown to be differentially methylated in cells from cancerous and normal tissue (e.g., any of the genes listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16). If the level of expression of the gene in the test cell is different from that in the normal cell, the test cell is likely to be a cancer cell.
Alternatively, the level of expression in the test cell can be compared to that in two or more (see above) control cells. The cell will be the same as, or most closely related to, the control cell in which the level of expression is the same as, or most closely resembles, that of the test cell.
Test and control cells can be any of those listed above in the section on MSDK. Genes whose level of expression can be determined can be any gene differently methylated in two or more cell types of interest. They can be, for example, any of the genes listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16.
Alternatively, they can be assessed and expressed using any of a variety of semi-quantitative/qualitative systems known in the art. Thus, they can be expressed as, for example, (a) one or more of "very high", "high", "average", "moderate", "low", and/or "very low"; (b) one or more of "++++", "+++", "++", "+",
cell at a level that is the same as or similar to that of a control epithelial cell would be an indication that the test cell is an epithelial cell.
Levels of expression of genes of interest can be assessed and expressed in quantitative, semi-quantitative, or qualitative fashions. Thus they can, for example, be measured and expressed as discrete values. Alternatively, they can be assessed and expressed using any of a variety of semi-quantitative/qualitative systems known in the art. Thus, they
can be expressed as, for example, (a) one or more of "very high", "high", "average", "moderate", "low", and/or "very low"; (b) one or more of "++++", "+++", "++", "+", "+/-", and/or "-"; (c) expressed or not expressed (i.e., in a digital fashion); (d) ranges such as "0%-10%", "11%-20%", "21%-30%", "31%-40%", etc. (or any convenient range intervals); or (e) graphically, e.g., in pie charts.
In the description below, a "gene X" represents any of the genes listed in Tables 2, 5, 7, 8, 10, and 12; mRNA transcribed from gene X is referred to as "mRNA X"; protein encoded by gene X is referred to as "protein X"; and cDNA produced from mRNA X is referred to as "cDNA X". It is understood that, unless otherwise stated, descriptions containing these terms are applicable to any of the genes listed in Tables 2, 5, 7, 8, 10, 12, 15 and 16, mRNAs transcribed from such genes, proteins encoded by such genes, or cDNAs produced from the mRNAs.
Specific genes of interest include the LMX-1A, COL5A, LHX3, TCF7L1, PRDM14, ZCCHC14, HOXD4, SOX13, SLC9A3R1, CDC42EP5, Cxorf12, and LOC389333 genes.
Expression levels of one or more of these genes can be analyzed to determine, for example, whether a test epithelial cell from breast tissue is a normal or cancerous epithelial cell (e.g., a DCIS (high, intermediate, or low grade) or invasive breast cancer cell). Particularly useful for such determinations are the PRDM14 and ZCCHC14 genes. Moreover, expression of the PRDM14 can be used to test whether a test cell from prostate, pancreas, or lung tissue is a cancer cell. Thus, for example, enhanced expression of the PRDM14 gene, or altered expression of the ZCCHC14 gene, in the test breast epithelial cell compared to a control normal breast epithelial cell would be an indication that the test epithelial cell is a cancer cell.
In addition, expression levels of one or more of the above-listed genes can be analyzed to determine, for example, whether a test epithelial cell from colon tissue is a normal or cancerous epithelial cell. Particularly useful for such determinations are the LHX3, TCF7L1, and LMX-1A genes. Altered expression of these genes in the test colon epithelial cell compared to a control normal colon epithelial cell would be an indication that the test colon epithelial cell is a cancer cell.
Expression levels of one or more of the above-listed genes in a test myoepithelial cell can be analyzed to determine, for example, whether breast tissue from which the test myoepithelial cell was obtained is normal or cancerous breast tissue. Particularly useful for such determinations are the HOXD4, SLC9A3R1, and CDC42EP5 genes. Enhanced expression of, for example, the HOXD4 and CDC42EP5 genes, or altered expression of the SLC9A3R1 gene, in the test myoepithelial cell compared to a control myoepithelial cell from control normal breast tissue, would indicate that the test breast tissue is cancerous breast tissue.
Expression levels of one or more of the above-listed genes in a test fibroblast can also be analyzed to determine, for example, whether breast tissue from which the test fibroblast was obtained is normal or cancerous breast tissue. Particularly useful for such determinations is the Cxorf12 gene. Expression, for example, of this gene at the same or a greater level than in a control fibroblast from control normal breast tissue would indicate that the breast tissue is not cancerous breast tissue.
In addition, expression levels of one or more of the above-listed genes can also be analyzed to determine, for example, whether a test cell is an epithelial cell or a myoepithelial cell.
In the assays of the invention either: (1) the presence of protein X or mRNA X in cells is tested for or their levels in cells are assessed; or (2) the level of protein X is assessed in a liquid sample such as a body fluid (e.g., urine, saliva, semen, blood, or serum or plasma derived from blood); a lavage such as a breast duct lavage, lung lavage, a gastric lavage, a rectal or colonic lavage, or a vaginal lavage; an aspirate such as a nipple aspirate; or a fluid such as a supernatant from a cell culture. In order to test for the presence, or measure the level, of mRNA X in cells, the cells can be lysed and total RNA can be purified or semi-purified from lysates by any of a variety of methods known in the art. Methods of detecting or measuring levels of particular mRNA transcripts are also familiar to those in the art. Such assays include, without limitation, hybridization assays using detectably labeled mRNA X-specific DNA or RNA probes and quantitative or semi-quantitative RT-PCR methodologies employing appropriate mRNA X and cDNA X-specific oligonucleotide primers. Additional methods for quantitating mRNA in cell lysates include RNA protection assays and serial analysis of gene expression (SAGE). Alternatively, qualitative, quantitative, or semi-quantitative in situ hybridization assays can be carried out using, for example, tissue sections or unlysed cell suspensions, and detectably (e.g., fluorescently or enzyme) labeled DNA or RNA probes.
Methods of detecting or measuring the levels of a protein of interest in cells are known in the art. Many such methods employ antibodies (e.g., polyclonal antibodies or monoclonal antibodies (mAbs)) that bind specifically to the protein. In such assays, the antibody itself or a secondary antibody that binds to it can be detectably labeled. Alternatively, the antibody can be conjugated with biotin, and detectably
labeled avidin (a protein that binds to biotin) can be used to detect the presence of the biotinylated antibody. Combinations of these approaches (including "multi-layer" assays) familiar to those in the art can be used to enhance the sensitivity of assays. Some of these assays (e.g., immunohistological methods or fluorescence flow cytometry) can be applied to histological sections or unlysed cell suspensions. The methods described below for detecting protein X in a liquid sample can also be used to detect protein X in cell lysates.
Methods of detecting protein X in a liquid sample (see above) basically involve contacting a sample of interest with an antibody that binds to protein X and testing for binding of the antibody to a component of the sample. In such assays the antibody need not be detectably labeled and can be used without a second antibody that binds to protein X.
Such assays can be applied to both normal and cancerous cells. Particularly useful for such determinations are the LOC389333 and CDC42EP5 genes. Expression of these genes in the test cell at a level that is the same as or similar to that of a control myoepithelial cell would be an indication that the test cell is a myoepithelial cell. On the other hand, expression of the genes in the test
the capture polyclonal antibody binds; or (b) a polyclonal antibody that binds to epitopes other than or in addition to that to which the capture polyclonal antibody binds. Assays which involve the use of a capture and detection antibody include sandwich ELISA assays, sandwich Western blotting assays, and sandwich immunomagnetic detection assays.
Suitable solid substrates to which the capture antibody can be bound include, without limitation, the plastic bottoms
For and sides of wells of microtiter plates, membranes such as example, by exploiting the phenomenon of Surface plasmon 10 nylon or nitrocellulose membranes, polymeric (e.g., without resonance, an antibody specific for protein X bound to an limitation, agarose, cellulose, or polyacrylamide) beads or appropriate solid Substrate is exposed to the sample. Binding particles. It is noted that protein X-specific antibodies bound of protein X to the antibody on the solid substrate results in to Such beads or particles can also be used for immunoaf a change in the intensity of Surface plasmon resonance that finity purification of protein X. can be detected qualitatively or quantitatively by an appro 15 Methods of detecting or for quantifying a detectable label priate instrument, e.g., a Biacore apparatus (Biacore Inter depend on the nature of the label and are known in the art. national AB, Rapsgatan, Sweden). Appropriate labels include, without limitation, radionuclides Moreover, assays for detection of protein X in a liquid (e.g., I, I, S, H. P. P. or 'C), fluorescent sample can involve the use, for example, of: (a) a single moieties (e.g., fluorescein, rhodamine, or phycoerythrin), protein X-specific antibody that is detectably labeled; (b) an luminescent moieties (e.g., QdotM nanoparticles Supplied unlabeled protein X-specific antibody and a detectably by the Quantum Dot Corporation, Palo Alto, Calif.), com labeled secondary antibody; or (c) a biotinylated protein pounds that absorb light of a defined wavelength, or X-specific antibody and detectably labeled avidin. In addi enzymes (e.g., alkaline phosphatase or horseradish peroxi tion, as described above for detection of proteins in cells, dase). 
The products of reactions catalyzed by appropriate combinations of these approaches (including “multi-layer” 25 enzymes can be, without limitation, fluorescent, lumines assays) familiar to those in the art can be used to enhance the cent, or radioactive or they may absorb visible or ultraviolet sensitivity of assays. In these assays, the sample or an light. Examples of detectors include, without limitation, (aliquot of the sample) Suspected of containing protein X X-ray film, radioactivity counters, Scintillation counters, can be immobilized on a solid Substrate Such as a nylon or spectrophotometers, calorimeters, fluorometers, luminom nitrocellulose membrane by, for example, 'spotting an 30 eters, and densitometers. aliquot of the liquid sample or by blotting of an electropho In assays, for example, to diagnose breast cancer, the level retic gel on which the sample or an aliquot of the sample has of protein X in, for example, serum (or a breast cell) from been Subjected to electrophoretic separation. The presence a patient Suspected of having, or at risk of having, breast or amount of protein X on the Solid Substrate is then assayed cancer is compared to the level of protein X in sera (or breast using any of the above-described forms of the protein 35 cells) from a control Subject (e.g., a Subject not having breast X-specific antibody and, where required, appropriate detect cancer) or the mean level of protein X in sera (or breast cells) ably labeled secondary antibodies or avidin. from a control group of Subjects (e.g., Subjects not having The invention also features “sandwich' assays. In these breast cancer). 
A significantly higher level, or lower level sandwich assays, instead of immobilizing samples on Solid (depending on whether the gene of interest is expressed at substrates by the methods described above, any protein X 40 higher or lower level in breast cancer or associated Stromal that may be present in a sample can be immobilized on the cells), of protein X in the serum (or breast cells) of the solid substrate by, prior to exposing the solid substrate to the patient relative to the mean level in sera (or breast cells) of sample, conjugating a second ("capture') protein X-specific the control group would indicate that the patient has breast antibody (polyclonal or mAb) to the solid substrate by any CaCC. of a variety of methods known in the art. In exposing the 45 Alternatively, if a sample of the subjects serum (or breast sample to the solid substrate with the second protein X-spe cells) that was obtained at a prior date at which the patient cific antibody bound to it, any protein X in the sample (or clearly did not have breast cancer is available, the level of sample aliquot) will bind to the second protein X-specific protein in the test serum (or breast cell) sample can be antibody on the solid substrate. The presence or amount of compared to the level in the prior obtained sample. A higher protein Xbound to the conjugated second protein X-specific 50 level, or lower level (depending on whether the gene of antibody is then assayed using a "detection' protein X-spe interest is expressed at higher or lower level in breast cancer cific antibody by methods essentially the same as those or associated stromal cells) in the test serum (or breast cell) described above using a single protein X-specific antibody. sample would be an indication that the patient has breast It is understood that in these sandwich assays, the capture CaCC. 
antibody should not bind to the same epitope (or range of 55 Moreover, a test expression profile of a gene in a test cell epitopes in the case of a polyclonal antibody) as the detec (or tissue) can be compared to control expression profiles of tion antibody. Thus, if a mAb is used as a capture antibody, control cells (or tissues) previously established to be of the detection antibody can be either: (a) another mAb that defined category (e.g., DCIS grade, breast cancer stage, or binds to an epitope that is either completely physically state of differentiation). The category of the test cell (or separated from or only partially overlaps with the epitope to 60 tissue) will be that of the control cell (or tissue) whose which the capture mab binds; or (b) a polyclonal antibody expression profile the test cells (or tissues) expression that binds to epitopes other than or in addition to that to profile most closely resembles. These expression profile which the capture mab binds. On the other hand, if a comparison assays can be used to compare any of the normal polyclonal antibody is used as a capture antibody, the breast tissue with any stage and/or grade of breast cancer detection antibody can be either (a) a mAb that binds to an 65 recited herein and/or to compare between breast cancer epitope to that is either completely physically separated grades and stages. The genes analyzed can be any of those from or partially overlaps with any of the epitopes to which listed in Tables 2, 5, 7, 8, 10, 12, 15, and 16 and the number US 9,556,430 B2 33 34 of genes analyzed can be any number, i.e., one or more. gene X (e.g., an allelic variant, or all possible hypothetical Generally, at least two (e.g., at least: two; three; four, five; variants). The array can be used, for example, to sequence six; seven; eight; nine; ten; 11; 12:13; 14:15; 17: 18; 20:23; gene X, mRNA X, or cDNA X by hybridization (see, e.g., 25; 30; 35:40, 45; 50; 60, 70; 80; 90; 100: 120; 150: 200; U.S. Pat. No. 
5,695,940) or assess levels of expression of 250; 300; 350: 400; 450; 500; or more) genes will be gene X. analyzed. It is understood that the genes analyzed will In another embodiment, at least one address of the plu include at least one of those listed herein but can also include rality includes a polypeptide capture probe that binds spe others not listed herein. cifically to protein X or fragment thereof. The polypeptide One of skill in the art will appreciate from this description can be a naturally-occurring interaction partner of protein X. how similar “test level' versus “control level comparisons 10 e.g., a ligand for protein X where protein X if a receptor or can be made between other test and control samples a receptor for protein X where protein X is ligand. Prefer described herein. ably, the polypeptide is an antibody, e.g., an antibody It is noted that the patients and control subjects referred to specific for protein X. Such as a polyclonal antibody, a above need not be human patients. They can be for example, monoclonal antibody, or a single-chain antibody. non-human primates (e.g., monkeys), horses, sheep, cattle, 15 Antibodies can be polyclonal or monoclonal antibodies; goats, pigs, dogs, guinea pigs, hamsters, rats, rabbits or methods for producing both types of antibody are known in mice. the art. The antibodies can be of any class (e.g., IgM, IgG, Arrays and Kits and Uses Thereof IgA, Ig), or IgE) and be generated in any of the species The invention features an array that includes a substrate recited herein. They are preferably IgG antibodies. Recom having a plurality of addresses. At least one address of the binant antibodies, such as chimeric and humanized mono plurality includes a capture probe that binds specifically to clonal antibodies comprising both human and non-human any of the MSDK tags listed in Tables 2, 5, 7, 8, 10, 12, 15, portions, can also be used in the methods of the invention. 
and 16, a nucleic acid X (e.g., a DNA sequence (ASCI site) Such chimeric and humanized monoclonal antibodies can be defined by the location of the MSDK tags listed in Tables 2, produced by recombinant DNA techniques known in the art, 5, 7, 8, 10, 12, 15, and 16), or a protein X. The array can have 25 for example, using methods described in Robinson et al., a density of at least, or less than, 10, 2050, 100, 200, 500, International Patent Publication PCT/US86/02269; Akira et 700, 1,000, 2,000, 5,000 or 10,000 or more addresses/cm, al., European Patent Application 184, 187; Taniguchi, Euro and ranges between. In a preferred embodiment, the plural pean Patent Application 171.496; Morrison et al., European ity of addresses includes at least 10, 100, 500, 1,000, 5,000, Patent Application 173,494; Neuberger et al., PCT Applica 10,000, 50,000 addresses. In a preferred embodiment, the 30 tion WO 86/01533: Cabilly et al., U.S. Pat. No. 4,816,567; plurality of addresses includes equal to or less than 10, 100, Cabilly et al., European Patent Application 125,023; Better 500, 1,000, 5,000, 10,000, or 50,000 addresses. The sub et al. (1988) Science 240, 1041-43; Liu et al. (1987) J. strate can be a two-dimensional Substrate such as a glass Immunol. 139, 3521-26; Sun et al. (1987) PNAS 84, 214-18: slide, a wafer (e.g., silica or plastic), a mass spectroscopy Nishimura et al. (1987) Canc. Res. 47, 999-1005: Wood et plate, or a three-dimensional Substrate Such as a gel pad. 35 al. (1985) Nature 314, 446-49; Shaw et al. (1988) J. Natl. Addresses in addition to address of the plurality can be Cancer Inst. 80, 1553-59; Morrison, (1985) Science 229, disposed on the array. 1202-07: Oietal. (1986) BioTechniques 4, 214; Winter, U.S. An array can be generated by any of a variety of methods. Pat. No. 5.225,539; Jones et al. (1986) Nature 321,552-25; Appropriate methods include, e.g., photolithographic meth Veroeyan et al. (1988) Science 239, 1534; and Beidler et al. ods (see, e.g., U.S. 
Pat. Nos. 5,143,854: 5,510,270; and 40 (1988) J. Immunol. 141, 4053-60. 5,527,681), mechanical methods (e.g., directed-flow meth Also useful for the arrays of the invention are antibody ods as described in U.S. Pat. No. 5,384.261), pin-based fragments and derivatives that contain at least the functional methods (e.g., as described in U.S. Pat. No. 5,288,514), and portion of the antigen-binding domain of an antibody. Anti bead-based techniques (e.g., as described in PCT US/93/ body fragments that contain the binding domain of the 04145). 45 molecule can be generated by known techniques. Such In one embodiment, at least one address of the plurality fragments include, but are not limited to: F(ab')2 fragments includes a nucleic acid capture probe that hybridizes spe that can be produced by pepsin digestion of antibody mol cifically to any of the MSDK tags listed in Tables 2, 5, 7, 8, ecules; Fab fragments that can be generated by reducing the 10, 12, 15, and 16, e.g., the sense or anti-sense (complement) disulfide bridges of F(ab'), fragments; and Fab fragments Strand of the tag sequences. Each address of the Subset can 50 that can be generated by treating antibody molecules with include a capture probe that hybridizes to a different region papain and a reducing agent. See, e.g., National Institutes of of the MSDK tag. Such an array can be useful, for example, Health, 1 Current Protocols. In Immunology, Coligan et al., for detecting the presence and, optionally, assessing the ed. 2.8, 2.10 (Wiley Interscience, 1991). Antibody fragments relative numbers of one or more of the MSDK tags (or the also include Fv fragments, i.e., antibody products in which complements thereof) listed in Tables 2, 5, 7, 8, 10, 12, 15, 55 there are few or no constant region amino acid residues. A and 16 in a sample, e.g., a MSDK tag library. 
single chain FV fragment (ScPV) is a single polypeptide chain In another embodiment, at least one address of the plu that includes both the heavy and light chain variable regions rality includes a nucleic acid capture probe that hybridizes of the antibody from which the schv is derived. Such specifically to a nucleic acid X, e.g., the sense or anti-sense fragments can be produced, for example, as described in strand. Nucleic acids of interest include, without limitation, 60 U.S. Pat. No. 4,642.334, which is incorporated herein by all or part of any of the genes identified by the tags listed in reference in its entirety. For a human subject, the antibody Tables 2, 5, 7, 8, 10, 12, 15, and 16, all or part of mRNAs can be a “humanized' version of a monoclonal antibody transcribed from such genes, or all or part of cDNA pro originally generated in a different species. duced from such mRNA. Each address of the subset can In another aspect, the invention features a method of include a capture probe that hybridizes to a different region 65 analyzing the expression of gene X. The method includes of a nucleic acid. Each address of the Subset is unique, providing an array as described above; contacting the array overlapping, and complementary to a different variant of with a sample and detecting binding of a nucleic acid X or US 9,556,430 B2 35 36 protein X to the array. In one embodiment, the array is a for analyzing gene expression. The method includes: pro nucleic acid array. Optionally the method further includes viding a first two dimensional array having a plurality of amplifying nucleic acid from the sample prior or during addresses, each address of the plurality being positionally contact with the array. 
distinguishable from each other address of the plurality In another embodiment, the array can be used to assay 5 having a unique capture probe, contacting the array with a gene expression in a tissue to ascertain tissue specificity of first sample from a cell or subject which express or mis genes in the array, particularly the expression of gene X. If express gene X or from a cell or Subject in which a gene a sufficient number of diverse samples is analyzed, cluster X-mediated response has been elicited, e.g., by contact of ing (e.g., hierarchical clustering, k-means clustering, Bayes the cell with nucleic acid X or protein X, or administration ian clustering and the like) can be used to identify other 10 to the cell or subject of nucleic acid X or protein X; genes which are co-regulated with gene X. For example, the providing a second two dimensional array having a plurality array can be used for the quantitation of the expression of of addresses, each address of the plurality being positionally multiple genes. Thus, not only tissue specificity, but also the distinguishable from each other address of the plurality, and level of expression of a battery of genes in the tissue is each address of the plurality having a unique capture probe, ascertained. Quantitative data can be used to group (e.g., 15 and contacting the array with a second sample from a cell or cluster) genes on the basis of their tissue expression perse Subject which does not express gene X (or does not express and level of expression in that tissue. as highly as in the case of the as in the case of the cell or For example, array analysis of gene expression can be subject described for the first array) or from a cell or subject used to assess gene X expression in one or more cell types which in which a gene X-mediated response has not been (see above). 
2O elicited (or has been elicited to a lesser extent than in the first In another embodiment, the array can be used to monitor sample); and comparing the binding of the first sample with expression of one or more genes in the array with respect to the binding of the second sample. Binding, e.g., in the case time. For example, samples obtained from different time of a nucleic acid, hybridization with a capture probe at an points can be probed with the array. Such analysis can address of the plurality, is detected, e.g., by a signal gener identify and/or characterize the development of a gene 25 ated from a label attached to the nucleic acid, polypeptide, X-associated disease or disorder (e.g., breast cancer Such as or antibody. The same array can be used for both samples or invasive breast cancer); and processes, such as a cellular different arrays can be used. If different arrays are used the transformation associated with a gene X-associated disease same plurality of addresses with capture probes should be or disorder. The method can also evaluate the treatment present on both arrays. and/or progression of a gene X-associated disease or disor- 30 All the above listed capture probes useful for arrays can der also be provided in the form of a kit or article of manufac The array is also useful for ascertaining differential ture, optionally also containing packaging materials. In such expression patterns of one or more genes in normal and kits or articles of manufacture, the capture probes can be abnormal (e.g., malignant) cells. This provides a battery of provided as preformed arrays, i.e., attached to appropriate genes (e.g., including gene X) that could serve as a molecu- 35 substrates as described above. Alternatively they can be lar target for diagnosis or therapeutic intervention. provided in unattached form. In another aspect, the invention features a method of The capture probes can be supplied in unattached form in analyzing a plurality of probes. 
The method is useful, e.g., any number. Moreover, each capture probe in a kit or article for analyzing gene expression. The method includes: pro of manufacture can be provided in a separate vessel (e.g., viding a first two dimensional array having a plurality of 40 bottle, vial, or package), all the capture probes can be addresses, each address (of the plurality) being positionally combined in the same vessel, or a plurality of pools of distinguishable from each other address (of the plurality) capture probes can be provided, with each pool being having a unique capture probe, e.g., wherein the capture provided in a separate vessel. In the kit or article of probes are from a cell or Subject which express gene X or manufacture there can optionally be instructions (e.g., on the from a cell or Subject in which a gene X-mediated response 45 packing materials or in a package insert) on how to use the has been elicited, e.g., by contact of the cell with nucleic arrays or unattached capture probes, e.g., on how to perform acid X or protein X, or administration to the cell or subject any of the methods described herein. of a nucleic acid X or protein X; providing a second two The following examples are intended to illustrate, not dimensional array having a plurality of addresses, each limit, the invention. 
address of the plurality being positionally distinguishable 50 from each other address of the plurality, and each address of EXAMPLES the plurality having a unique capture probe, e.g., wherein the capture probes are from a cell or subject which does not Example 1 express gene X (or does not express as highly as in the case of the cell or subject described above for the first array) or 55 Materials and Methods from a cell or subject which in which a gene X-mediated response has not been elicited (or has been elicited to a lesser Tissue Specimens and Primary Cell Cultures extent than in the first sample); contacting the first and Human breast tumor and fresh, frozen, or formalin fixed, second arrays with one or more inquiry probes (which are paraffin embedded tumor specimens were obtained from the preferably other than a nucleic acid X, protein X, or antibody 60 Brigham and Women’s Hospital (Boston, Mass.), Columbia specific for protein X), and thereby evaluating the plurality University (New York, N.Y.), University of Cambridge of capture probes. Binding, e.g., in the case of a nucleic acid, (Cambridge, UK), Duke University (Durham, N.C.), Uni hybridization with a capture probe at an address of the versity Hospital Zagreb (Zagreb, Croatia), the National plurality, is detected, e.g., by signal generated from a label Disease Research Interchange (Philadelphia, Pa.), and the attached to the nucleic acid, polypeptide, or antibody. 65 Breast Tumor Bank of the University of Liege (Liege, The invention also features a method of analyzing a Belgium). All human tissue was collected without patient plurality of probes or a sample. The method is useful, e.g., identifiers using protocols approved by the Institutional US 9,556,430 B2 37 38 Review Boards of the institutions. In the case of matched Calif.) and sequenced. 
21-bp tags were extracted and dupli tissue samples (i.e., normal and tumor tissue samples cate ditags (arising due to the PCR expansion step) were obtained from the same individuals), the normal tissue removed using SAGE 2002 software. P values were calcu corresponding to the tumor was obtained from the ipsilateral lated based on pair-wise comparisons between libraries breast several centimeters away from the tumor. Fresh tissue using a Poisson-based algorithm Cai et al. (2004) Genome samples were immediately processed for immunomagnetic Biol. 5:R51; Allinen et al. (2004) supra. Raw tag counts purification and cell subsets were purified as previously were used for comparing the libraries and calculating p described Allinen et al. (2004) Cancer Cell 6:17-32 and values, but Subsequently tag numbers were normalized in co-pending U.S. Patent Application Serial No. PCT/ order to control for uneven total tag numbers/library (aver US2004/08866, the disclosures of which are incorporated 10 age total tag number 28,456/library). herein by reference in its entirety. Following the purifica In order to determine their chromosomal location, tags tion procedure, in Some cases the purity of each cell popu that appeared only once in each library were filtered out and lation was confirmed by RT-PCR and primary cultures of the matched to a virtual AscI library derived from a human different cell types were initiated. Primary stromal fibro genome sequence. sequence and mapping blasts were cultured in DMEM medium supplemented with 15 information (July 2003, hg16) were downloaded from 10% iron fortified bovine calf serum (Hyclone, Logan, Utah) UCSC Genome Bioinformatics Site. A virtual AscI tag prior to lysis and DNA and RNA isolation. Human embry library was constructed based on the genome sequence as onic stem cells were cultured on feeder layers using estab follows: predicted AscI sites were located in the genomic lished protocols (for example, see, REF). 
DNA and RNA sequence, the nearest NlaIII sites in both directions to the were isolated from the other cell-types without prior cultur AscI sites were identified, and the corresponding virtual 1ng. MSDK sequence tags were derived. All virtual tags that RNA and Genomic DNA Isolation, and cDNA Synthesis were not unique in the genome were removed in order to RNA (total and polyA) isolation was performed using a ensure unambiguous mapping of the data. Genes neighbor uMACSTM kit (Miltenyi Biotec, Auburn, Calif.) from small ing the AscI sites were also identified in order to determine numbers of cells, while from large tissue samples, primary 25 the effect of methylation on their expression. cultures and cell lines total RNA was isolated using a Alignment of MSDK, SAGE, and CpG Islands Across the guanidium/cesium method Allinen et al. (2004), Supra. Genome Column flow-through fractions (in the uMACSTM method) The frequency of AscI digestion was calculated as per and unprecipitated soluble material (guanidium/cesium centage of samples (N-EPI-17, I-EPI-7, N-MYOEP-4, method) were used for the purification of genomic DNA 30 D-MYOEP-6, N-STR-17, I-STR-7, N-STR-117, I-STR-17) using SDS/proteinase K digestion followed by phenol-chlo having raw tag counts of 2 or more at each predicted AscI roform extraction and isopropanol precipitation. cDNA syn site. SAGE counts from corresponding samples (N-EPI-1 thesis was performed using the OMNI-SCRIPTTM kit form plus N-EPI-2, I-EPI-7, N-MYOEP-1, D-MYOEP-6, Qiagen (Valencia, Calif.) following the manufacturers D-MYOEP-7, N-STR-1, N-STRI-17, I-STR-7) were nor instructions. 35 malized to tags per 200,000. Gene and CpG island position Generation and Analysis of MSDK (Methylation Specific information were downloaded from UCSC Genome Bioin Digital Karyotyping) Libraries formatics Site (Human genome sequence and mapping MSDK libraries were generated by a modification of the information, July 2003, hg16). 
AscI sites were predicted (as digital karyotping protocol Wang et al. (2002) Proc. Natl. mentioned above) from the genome sequence, and AscI site Acad. Sci USA 16156-16161. For each sample, 1-5 lug 40 frequency, SAGE counts, and CpG island positions were genomic DNA was sequentially digested with the methyl drawn together along all chromosomes. ation-sensitive enzyme AscI and the resulting fragments Bisulfite Sequencing, Quantitative Methylation Specific were ligated at their 5' and 3' ends to biotinylated linkers PCR (qMSP), and Quantitative RT-PCR (qRT-PCR) (5'-biotin-TTTGCAGAGGTTCGTAATCGAGT To determine the location of methylated cytosines, TGGGTGG-3',5'-phos-CGCGCCACCCAACTCGATTAC 45 genomic DNA was bisulfite treated, purified, and PCR GAACCTCTGC-3'). The biotinylated fragments were then reactions were performed as previously described Herman digested with Nla as a fragmenting restriction enzyme. et al. (1996) Proc. Natl. Acad. Sci. USA 93:9821-0826. Resulting DNA fragments having biotinylated linkers at PCR products were “blunt-ended, subcloned into their termini were immobilized onto Streptavidin-conjugated pZERO1.0 (Invitrogen), and 4-13 independent colonies magnetic beads (Dynal, Oslo, Norway). 50 were sequenced for each PCR product. The remaining steps were essentially the same as those Based on the above sequence analysis qMSPPCR primers described for LongSAGE with minor modifications Allinen were designed for the amplification of methylated or unm et al. (2004) supra; Saha et al. (2002) Nat. Biotechnol. ethylated DNA. Quantitative MSP and RT-PCR amplifica 20:508-512. Briefly, linkers containing the type IIs restric tions were performed as follows. 
Template (2-5 ng bisulfite tion enzyme Mme.I recognition site were ligated to isolated 55 treated genomic DNA or 1 ul clNA) and primers were DNA fragments and the bead bound fragments were cut by mixed with 2xSYBR Green master mix (ABI, CA) in a 25 the MmeI enzyme 21 base pairs away from the restriction ul volume and the reactions were performed in ABI 7500 enzyme site, resulting in release from the beads into the real time PCR system (50° C., 20 sec; 95°C., 10 min: 95° Surrounding solution of tags containing the Mme.I recogni C., 15 sec, 60° C., 1 min (40 cycles): 95°C., 15 sec; 60° C., tion site, a linker and 21 base pairs of test genomic DNA. 60 20 sec; 95° C., 15 sec). Triplicates were performed and The tags were ligated to form ditags which are formed average Ct values calculated. The Ct (cycle threshold) value between single tags containing 5' and 3' Mme digestion is the PCR cycle number at which the reaction reaches a (cut) sites (depending on whether the relevant fragment fluorescent intensity above the threshold which is set in the bound to a bead was derived by from an NlaIII site 5' or 3' exponential phase of the amplification (based on amplifica of an unmethylated AscI site). The ditags were expanded by 65 tion profile) to allow accurate quantification. In the case of PCR, isolated, and ligated to form concatamers, which were qMSP, methylation of the samples was normalized to meth cloned into the pZero 1.0 vector (Invitrogen, Carlsbad, ylation independent amplification of the B-actin (ACTB) US 9,556,430 B2 39 40 gene:% ACTB=100x2''''8"). For qRT-PCR expres predicted sites/human genome) allowing identification of sion of the samples was normalized to that of the RPL39 tags that are highly statistically significantly differentially (ribosomal protein L39) gene: % RPL39–10x present in the different libraries at reasonable sequencing 2'''''8"). Normalizations to the expression of the depths (20,000-50,000 tags/library). 
Methylation of either or ribosomal protein L19 (RPL19) and ribosomal protein S13 both methylation sites in an AscI recognition sequence (RPS13) genes were also performed and gave essentially the prevents cutting by AscI. The use of AscI and NlaIII as same results. Due to the very high abundance of ribosomal mapping and fragmenting enzymes, respectively, with protein mRNAs, cDNA was diluted ten-fold for these PCR human genomic DNA, respectively, is expected to result in reactions relative to that of specific genes. The frequency of a total of 7,205 virtual tags (defined as possible tags that can methylation of the PRDM14 gene in normal and tumor 10 samples was calculated by setting a threshold of methylation be obtained and uniquely matched to the human genome as the median--2xstandard deviation value of the relative based on the predicted location of AscI and NlaIII sites). methylation of the normal samples (excluding the one Since AscI will cut only unmethylated DNA, the presence of outlier case; see below). Samples above this value (10.66) a tag in the MSDK library indicates that the corresponding were defined as methylated. 15 AscI site is not methylated, while lack of a virtual tag indicates methylation. Example 2 To demonstrate the feasibility of the MSDK method for epigenome profiling, MSDK libraries were generated from Methylation Specific Digital Karyotyping (MSDK) genomic DNA isolated from the wild-type HCT 116 human colon cancer cell line (HCTWT) and its derivative in which The MSDK protocol used in the experiments described both the DNMT1 and DNMT3b DNA methyltransferase below is schematically depicted in FIG. 2. genes have been homozygously deleted (HCT DKO) Rhee MSDK is a modification of the digital karyotyping (DK) et al. (2002) Nature 416, 552-556). 
Due to the deletion of technique recently developed for the analysis of DNA copy these two DNA methyltransferases, methylation of the number in a quantitative manner on a genome-wide scale 25 genomic DNA in the HCT DKO cells is reduced by greater Wang et al. (2002) supra. DK is based on two concepts: (i) than 95% relative to the HCT WT cells. Thus, MSDK short (e.g., 21 base pair) sequence tags can be derived from libraries generated from HCTWT and HCT DKO cells were specific locations in the human genome; and (ii) these expected to depict dramatic differences in DNA methylation. sequence tags can be directly matched to the human genome 21,278 and 24,775 genomic tags were obtained from the WT sequence. The original DK protocol used SacI as a mapping 30 and DKO cells, respectively. These tags were matched to a enzyme and Nla as a fragmenting enzyme. Using this virtual AscI tag library generated as described in Example 1. enzyme combination the tags were obtained from the two Unique tags (7,126 from the WT and 7.964 tags from the (both 5' and 3) NlaIII sites closest to the SacI sites. DKO cells) were compared and 219 were identified as being In the MSDK method, instead of SacI, a mapping enzyme statistically significantly (p<0.05) differentially present in that is sensitive to DNA methylation was used. AscI was 35 the two libraries (Table 1). 137 and 82 of these tags were chosen because its recognition sequence (GGCGCGCC) has more abundant in the DKO and WT libraries, respectively. two CpG (potential methylation) sites, is preferentially Correlating with the overall hypomethylation of the genome found in CpG islands associated with transcribed genes of DKO cells, almost all of the 137 tags were at least 10 fold rather than repetitive elements Dai et al. (2002) Genome more abundant in the DKO library, while nearly all 82 tags Res. 12:1591-1598), and it is a rare cutter enzyme (-5,000 showed only 2-5 fold difference between the two libraries. TABLE 1. 
Chromosomal location and analysis of the frequency of MSDK tags in the HCT116 WT and DKO MSDK libraries.

             Virtual  Observed  WT       WT      DKO      DKO     Tag Variety   Tag Copy      Differential Tag (P < 0.05)
Chr          Tag      Tag       Variety  Copies  Variety  Copies  Ratio DKO/WT  Ratio DKO/WT  DKO > WT   WT > DKO
1            551      119       73       431     89       538     1.219         1.248         10         6
2            473      94        51       383     72       499     1.412         1.303         10         5
3            349      83        48       478     59       473     1.229         0.990         8          5
4            281      62        33       266     49       265     1.485         0.996         3          5
5            334      74        41       437     56       536     1.366         1.227         10         3
6            338      65        36       229     51       315     1.417         1.376         8          4
7            403      90        60       359     66       344     1.100         0.958         4          4
8            334      89        54       460     73       433     1.352         0.941         3          5
9            349      86        50       397     67       468     1.340         1.179         9          5
10           387      84        43       386     71       468     1.651         1.212         10         4
11           379      96        55       408     75       392     1.364         0.961         6          4
12           299      72        42       330     52       329     1.238         0.997         7          4
13           138      25        12       109     19       105     1.583         0.963         1          1
14           228      51        28       234     36       225     1.286         0.962         4          3
15           260      52        38       243     37       163     0.974         0.671         2          4
16           340      82        43       297     65       347     1.512         1.168         4          2
17           400      116       54       401     100      781     1.852         1.948         16         3
18           181      39        19       115     29       199     1.526         1.730         7          0
19           463      99        59       429     70       391     1.186         0.911         9          7
20           236      58        32       213     41       287     1.281         1.347         4          2
21           71       11        7        27      6        43      0.857         1.593         1          0
22           217      51        31       328     38       260     1.226         0.793         1          4
X            185      22        16       166     18       103     1.125         0.620         0          2
Y            9        0         0        0       0        0

Matches      7205     1620      925      7126    1239     7964    1.339         1.118         137        82
No Matches            1353      799      5183    816      5805    1.021         1.120         29         13

Total        7205     2973      1724     12309   2055     13769   1.192         1.119         166        95

Chr, chromosome. Virtual Tag, the number of MSDK tag species predicted for the indicated chromosome. Observed Tag, the number of different unique tag species observed in both MSDK libraries for the indicated chromosome. Variety, the number of different unique tag species for the indicated chromosome and MSDK library. Copies, the abundance (total number) of all the observed unique tags for the indicated chromosome and MSDK library. Tag Variety Ratio, the ratio of the numbers of unique tag species for the indicated chromosome detected in the indicated two libraries. Tag Copy Ratio, the ratio of the abundances (total numbers) of all the unique tags for the indicated chromosome detected in the indicated two libraries. Differential Tag (P < 0.05), the number of unique tag species observed for the indicated chromosome that were present in higher abundance in the one indicated MSDK library than in the other indicated MSDK library (P < 0.050).
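The two ratio columns defined in the legend above can be recomputed from the Variety and Copies columns. The following is a minimal sketch, assuming each ratio is simply the DKO value divided by the WT value and rounded to three decimal places; the function name `tag_ratios` and the dictionary layout are illustrative choices, not part of the described method. The counts are taken from the chromosome 1, chromosome 15, and Matches rows of Table 1.

```python
# Counts transcribed from Table 1, in the order:
# (WT variety, WT copies, DKO variety, DKO copies).
TABLE1_COUNTS = {
    "1": (73, 431, 89, 538),
    "15": (38, 243, 37, 163),
    "Matches": (925, 7126, 1239, 7964),
}


def tag_ratios(wt_variety, wt_copies, dko_variety, dko_copies):
    """Return (Tag Variety Ratio, Tag Copy Ratio), both DKO/WT.

    Assumption: the table values are plain quotients rounded to
    three decimal places.
    """
    variety_ratio = round(dko_variety / wt_variety, 3)
    copy_ratio = round(dko_copies / wt_copies, 3)
    return variety_ratio, copy_ratio


for label, counts in TABLE1_COUNTS.items():
    variety_ratio, copy_ratio = tag_ratios(*counts)
    print(f"{label}: variety ratio {variety_ratio:.3f}, copy ratio {copy_ratio:.3f}")
```

A ratio above 1 means more tag species (or more tag copies) were recovered from the DKO library than from the WT library for that row, which is the direction expected if fewer AscI sites are methylated in the DKO cells.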

Single nucleotide polymorphism (SNP) array analysis of the DNA samples used for the generation of MSDK libraries demonstrated that the two cell lines are indistinguishable using this technique and the observed differences in MSDK tag numbers are unlikely to be due to underlying overt DNA copy number alterations. Mapping of the tags to the genome revealed that many of the differentially methylated AscI sites are located in CpG islands and in promoter areas of genes implicated in development and differentiation including numerous homeogenes (Table 2). Consistent with these results, two of these genes, LMX-1A and COL5A, have previously been found to be differentially methylated between HCT 116 WT and DKO cells, and are also frequently methylated in primary colorectal carcinomas and colon cancer cell lines (Paz et al. (2003) Hum. Mol. Genet. 12:2209-2210). Similarly, SCGB3A1/HIN-1, a gene frequently methylated in multiple cancer types (Shigematsu et al. (2005) Int. J. Cancer 113:600-604; Krop et al. (2004) Mol. Cancer Res. 2:489-494; Krop et al. (2001) Proc. Natl. Acad. Sci. USA 98:9796-9801), was identified as one of the most highly significantly differentially present tags (Table 2).
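The virtual tag library against which observed tags are matched (AscI as mapping enzyme, NlaIII as fragmenting enzyme) can be illustrated with a small in silico sketch. This is a simplified illustration under stated assumptions, not the procedure of Example 1: it assumes the NlaIII recognition sequence is CATG, that a tag is the 17 bases adjacent to the NlaIII site nearest to an AscI site, read from that site toward the AscI site, and it skips the uniqueness filtering against the whole genome. The function names and the toy sequence are hypothetical.

```python
import re

ASCI_SITE = "GGCGCGCC"  # mapping enzyme recognition sequence (from the text)
NLAIII_SITE = "CATG"    # fragmenting enzyme recognition sequence (assumed)
TAG_LENGTH = 17         # assumption: 17 bases next to CATG

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def virtual_tags(sequence):
    """Collect candidate tags flanking every AscI site in one DNA string."""
    tags = set()
    for match in re.finditer(ASCI_SITE, sequence):
        # Nearest NlaIII site upstream of the AscI site: the tag is the
        # 17 bases following CATG, read toward the AscI site.
        up = sequence.rfind(NLAIII_SITE, 0, match.start())
        if up != -1:
            tag = sequence[up + 4:up + 4 + TAG_LENGTH]
            if len(tag) == TAG_LENGTH:
                tags.add(tag)
        # Nearest NlaIII site downstream: the tag is read on the opposite
        # strand, so take the 17 bases preceding CATG and reverse-complement.
        down = sequence.find(NLAIII_SITE, match.end())
        if down != -1 and down - TAG_LENGTH >= 0:
            tags.add(reverse_complement(sequence[down - TAG_LENGTH:down]))
    return tags


def tags_from_unmethylated_sites(observed_tags, virtual_tag_set):
    """Observed tags that match a virtual tag (AscI site was cut)."""
    return set(observed_tags) & set(virtual_tag_set)


# Toy sequence with one AscI site flanked by one NlaIII site on each side.
toy = ("AAAA" + "CATG" + "ACGTACGTACGTACGTA" + "TT" + ASCI_SITE
       + "CC" + "TTTTGGGGAAAACCCCT" + "CATG" + "AAAA")
print(sorted(virtual_tags(toy)))
```

Because AscI cuts only unmethylated DNA, an observed tag that matches an entry of such a virtual set is read as evidence that the corresponding AscI site was unmethylated in the sample.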

In order to further validate the MSDK technique, three highly differentially present tags were selected from the HCT libraries, the corresponding genomic loci (corresponding to the LHX3, LMX-1A, and TCF7L1 genes) were identified, and sequencing of bisulfite treated genomic DNA (the same as that used for the generation of the MSDK libraries) was performed. In all three cases, the relevant AscI site was completely methylated in the WT and unmethylated in the DKO cells (FIGS. 3-5). In addition, almost all other surrounding CpG showed the same methylation/unmethylation pattern. In FIGS. 6-8 are shown the nucleotide sequences of regions of these three gene segments which were subjected to the described methylation-detecting sequencing analysis. These results indicated that the MSDK method is suitable for genome-wide analysis of methylation patterns and the identification of differentially methylated sites.

Example 3

Analysis of MSDK Libraries from Cell Populations Isolated from Normal and Cancerous Breast Tissue

MSDK libraries were generated from epithelial cells, myoepithelial cells, and fibroblast-enriched stroma isolated from normal breast tissue, in situ (DCIS-ductal carcinoma in situ) breast carcinoma tissue, and invasive breast carcinoma tissue. A detailed description of the samples is in Table 3.

TABLE 3
List of breast tissue samples used for methylation analyses.

Name        Organ   Histology  Cell type      Tumor name  Age  Histology                  Grade, LN, ER, PR, Her2
D-MYOEP-6   breast  tumor      myoepithelial  DCIS-6      29   pure extensive DCIS        high
D-EPI-6     breast  tumor      epithelial     DCIS-6      29   pure extensive DCIS        high
D-MYOEP-7   breast  tumor      myoepithelial  DCIS-7      29   ext. DCIS adjacent to IDC  intermediate pos low pos neg
N-EPI-I7    breast  normal     epithelial                 47   normal matched to tumor
I-EPI-7     breast  tumor      epithelial     IDC-7       47   invasive ductal carcinoma  low pos pos pos neg
N-STR-I7    breast  normal     stroma                     47   normal matched to tumor
I-STR-7     breast  normal     stroma         IDC-7       47   invasive ductal carcinoma  low pos pos pos neg
N-STR-I17   breast  normal     stroma                     44   normal matched to tumor
I-STR-17    breast  tumor      stroma         IDC-17      44   invasive ductal carcinoma  intermediate
N-MYOEP-4   breast  normal     myoepithelial              25   normal reduction
N-EPI-4     breast  normal     epithelial                 25   normal reduction
N-MYOEP-6   breast  normal     myoepithelial              19   normal reduction
N-MYOEP-3   breast  normal     myoepithelial              24   normal reduction
N-STR-7     breast  normal     stroma                     26   normal reduction
I-STR-11    breast  tumor      stroma         IDC-11      43   invasive ductal carcinoma  low pos pos pos neg
N-PBS-1     breast  normal     culture                    38   normal reduction
N-EPI-5     breast  normal     epithelial                 58   normal matched to tumor    high neg neg neg neg
I-EPI-9     breast  tumor      epithelial     IDC-9       45   invasive ductal carcinoma  intermediate pos pos neg
HCT WT      colon   tumor      cell line
HCT-DKO     colon   tumor      cell line

The numbers at the ends of the tissue sample names indicate patients from which the tissue samples were obtained. Age is the age of the particular patient. LN indicates whether the carcinoma in the relevant patient had spread to one or more lymph nodes. ER indicates whether the relevant carcinoma cells expressed the estrogen receptor. PR indicates whether the relevant carcinoma cells expressed the progesterone receptor. Her2 indicates whether the relevant carcinoma cells expressed Her2/Neu. Grade is the histologic grade.

Whenever possible, normal and tumor tissue were derived from the same patient in order to control for possible epigenetic variations due to age, and reproductive and disease status. Fibroblast-enriched stroma were the cells remaining after removal of epithelial cells, myoepithelial cells, leukocytes, and endothelial cells and consist of over 80% fibroblasts. DNA samples were also analyzed with SNP arrays in order to rule out the possibility of overt DNA copy number alterations.

Pair-wise comparisons and statistical analyses of the MSDK libraries revealed that the largest fraction of highly (>10 fold difference) differentially present tags occurred between normal and tumor epithelial cells and the majority of these tags were more abundant in tumor cells (Tables 4 and 5), correlating with the known overall hypomethylation of the cancer genome (Feinberg et al. (1983) Nature 301:89-92).

TABLE 4
Chromosomal location and analysis of the frequency of MSDK tags in the I-EPI-7 and N-EPI-I7 MSDK libraries.

             Virtual  Observed  I-EPI-7  I-EPI-7  N-EPI-I7  N-EPI-I7  Tag Variety Ratio  Tag Copy Ratio    Differential Tag (P < 0.05)
Chr          Tags     Tags      Variety  Copies   Variety   Copies    I-EPI-7/N-EPI-I7   I-EPI-7/N-EPI-I7  I-EPI-7 > N-EPI-I7  N-EPI-I7 > I-EPI-7
1            551      273       265      3330     98        496       2.704              6.714             28                  5
2            473      192       183      1979     62        517       2.952              3.828             11                  4
3            349      153       142      1792     58        535       2.448              3.350             8                   2
4            281      122       118      1595     42        244       2.810              6.537             15                  0
5            334      136       126      1296     55        399       2.291              3.248             7                   3
6            338      130       120      994      50        245       2.400              4.057             1                   0
7            403      193       186      1757     61        340       3.049              5.168             7                   3
8            334      141       137      1327     51        300       2.686              4.423             6                   3
9            349      153       145      1370     60        405       2.417              3.383             3                   3
10           387      158       149      1599     59        378       2.525              4.230             7                   1
11           379      169       161      1434     69        327       2.333              4.385             6                   1
12           299      127       121      1060     49        331       2.469              3.202             5                   4
13           138      53        51       474      20        108       2.550              4.389             1                   1
14           228      96        91       838      28        165       3.250              5.079             5                   0
15           260      116       108      936      40        158       2.700              5.924             8                   0
16           340      145       137      1355     55        279       2.491              4.857             15                  3
17           400      196       191      1952     70        496       2.729              3.935             7                   4
18           181      72        69       527      19        125       3.632              4.216             1                   0
19           463      173       165      1711     83        388       1.988              4.410             8                   1
20           236      95        90       1009     38        244       2.368              4.135             4                   0
21           71       24        24       255      8         69        3.000              3.696             2                   0
22           217      88        85       781      31        205       2.742              3.810             3                   0
X            185      55        53       462      19        116       2.789              3.983             1                   0
Y            9

Matches      7205     3060      2917     29833    1125      6870      2.593              4.343             159                 38
No Matches            1510      820      6835     930       4463      0.882              1.531             13                  32

Total        7205     4570      3737     36668    2055      11333     1.818              3.236             172                 70

The column headings are as indicated for Table 1.

TABLE 5
MSDK tags significantly (p < 0.050) differentially present in N-EPI-I7 and I-EPI-7 MSDK libraries and genes associated with the MSDK tags.

Columns, in order: MSDK Tag; SEQ ID NO.; N-EPI-I7; I-EPI-7; Ratio I-EPI-7/N-EPI-I7; P value; Chr; Gene; Description; Position of AscI site in relation to tr. Start; Distance of AscI site from tr. Start (bp)

CAACGGAAACAAAAACA. 277 4 O -13 O. O.294 64 MMP23A matrix metallopro - 5' 6922 teinase 23A

CAACGGAAACAAAAACA. 278 4 O -13 O. O.294 64 HSPC182 HSPC182 protein s' 111089

CCCGCCACGCCGCCCCG 279 O 13 3 O. O158 ENO1 enolase 1 3' 23 O

CTCCAAAAATCCCTTGA, 280 5 O -16 O. O4 6199 NBL1. neuroblastoma, sup - 5 1585.83 pression of tumori genicity 1.

CTCCAAAAATCCCTTGA. 281 5 O -16 O. O4 6199 CAPZB F-actin capping s' 64897 protein beta subunit

GTGCCGCCGCGGGCGCC 28211 61 2 O. O32251 KIAAO478 KIAAO478 gene s' 3O8 OO6 product

GTGCCGCCGCGGGCGCC 283 11 61 2 O. O32251 WNT4 wingless-type MMTV 5" 733 integration site family

CTGCAACTTGGTGCCCC 284 2 22 3 O. O27586 PRDX1 peroxiredoxin 1 3' 150

GCCTCTCTGCGCCTGCC 28, 18 10 -6 O. O.23961 GFI1 growth factor in- 3' 4842 dependent 1

CTCCGTTTTCTTTTGTT 2 g6 4 O -13 O. O.294 64 ALX3 aristalless-like 3' 1631 homeobox 3

AGCGCTTGGCGCTCCCA. 287 is 54 3 O. O.O2O39 NPR1 natriuretic peptide 3' 677 receptor A/ guanylate cyclase

TCTGGGGCCGGGTAGCC 288 9 216 7 7.35 x 10 P66 beta transcription re- s' 1176 OS pressor pe 6 beta component of

CACCCGCGGGGGTGGGG 289 O 17 17 O. O.28576 IL6R interleukin 6 re- 3' 898 ceptor isoform 2 precursor

CGTGTGTATCTGGGGGT 29 O 6 51 3 O. O. Off O2 MUC1 mucin 1, 3' 188528 transmembrane

GCAGCGGCGCTCCGGGC 291 9 12 O 4. 1.75 x 107 MUC1 mucin 1, 3' 1391.19 transmembrane

TGTTCAGAGCCAGCTTG 292 2 25 4. O. O.1729 LMNA lamin A/C isoform 2 3' 236

CCAGGCTGGCTCACCCT 293 O 27 27 O. O.O3867 HAPLN2 brain link protein - 3 4728 1.

CCAGGGCCTGGCACTGC 294 15 89 2 O. O.O3766 IGSF9 immunoglobulin s' 393 superfamily, member 9

TTCGGGCCGGGCCGGGA 295 17 9 O 2 O. O.O936.9 LMX1A. LIM homeobox trans - 5 7s2 cription factor 1, alpha

AGCCCTCGGGTGATGAG 29 7 83 4. 4.14 x 10 LMX1A. LIM homeobox trans - 5 7s2 cription factor 1, alpha

CATTCCAGTTACAGTTG 297 is 4 O 2 O. O.271.43 GPR161 G protein-coupled 3' 198 receptor 161

TCCACAGCGGACGTTCC 298 O 32 32 O. O.O4 O49 TOR3A tors in family 3, 3' 1 OO member A.

ACATTGTCCTTTTTGCC 299 2 25 4. O. O.1729 C1orf24 niban protein 3' 292

CCGAGGGGCCTGGCGCC 3 OO O 12 12 O. O.261.52 BTG2 B-cell transloca- 3' 431 tion gene 2

TCCAGGCAGGGCCTCTG 301 8 91 4. 2. O6 x 10 BTG2 B-cell transloca- 3' 431 tion gene 2

CCCCCGCGACGCGGCGG 34 10 4. -8 O. O399.11 SOX13 SRY-box 13 s' sf1

CCCCCGCGACGCGGCGG 34 10 4. -8 O. O399.11 FLJ4 O343 hypothetical pro- s' 31281 tein FLJ40343

TGGATTTGGTCGTCTCC 3 O4 O 25 25 O. O. Offs PLXNA2 plexin A2 3' 428

GCCCCCGTGGCGCCCCG 3. Os 8 97 4. 6.47 x 10 CENPF centromere protein 5' 513 OO F (350/400 kD) GCCCCCGTGGCGCCCCG 3 O6 8 97 4. 6.47 x 10 PTPN14 protein tyrosine s' 589 phosphatase, non receptor type

TCGGTGGTCGCTCGTGG 3. Of O 19 19 O. O.19333 MGC42493 hypothetical pro- s' 244.931 tein MGC42493

TCGGTGGTCGCTCGTGG 3 OS O 19 19 O. O.19333 CDC42BPA CDC42-binding pro- 5' 486 tein kinase alpha isoform A

GCTAGGGAAAAACAGGC 309 11 59 2 O. O43511 MGC42493 hypothetical pro- s' 244.931 tein MGC42493

GCTAGGGAAAAACAGGC 31011 59 2 O. O43511 CDC42BPA CDC42-binding pro- 5' 486 tein kinase alpha isoform A

GACGCGCTCCCGCGGGC 311 5 42 3 O. O.1897 WNT3A. wingless-type MMTV 5" 59111 integration site family

GACGCGCTCCCGCGGGC 312 5 42 3 O. O.1897 WNT9A wingless-type MMTV 5" 41 integration site family

CAAAGGAGCTGTGGAGC 313 2 23 4. O. O.26376 TAF5L PCAF associated 3' 192 factor 65 beta

GAGCGGCCGCCCAGAGC 314 6 61 3 O. OO1212 TAF5L PCAF associated 3' 192 factor 65 beta

GCCAATGACAGCGGCGG 315 O 17 17 O. O.O9019 EGLN1 egl nine homolog 1 3" 34 49

ATGCGCCCCGCAGCCCC 31610 138 4. 1.24 x 10 MGC13186 hypothetical pro- s' 32.1138 tein MGC13186

ATGCGCCCCGCAGCCCC 3171 O 138 4. 1.24 x 10 SIPA1L2 signal-induced s' 114742 proliferation associated 1 like

CTGGAACCCCGCACACC 318 O 6 16 O. O.10329 FLJ12606 hypothetical pro- s' 82 tein FLJ12606

GTCCCCGCGCCGCGGCC 31928 3 -7 3. O5 x 107 2 UBXD4 UBX domain con- s' 553390 taining 4

GTCCCCGCGCCGCGGCC 32O28 3 -7 3. O5 x 107 2 APOB apolipoprotein B s' 2343O39 precursor

AACTTTTAAAGTTTCCC 321 O 4. 4. O. O.17811 2 UBXD4 UBX domain con- s' 97 taining 4

AACTTTTAAAGTTTCCC 322 O 4. 4. O. O.17811 2 APOB apolipoprotein B s' 289 6332 precursor

GCCACCCAAGCCCGTCG 323 O 8 8 O. OO6642 2 RAR1O ras-related GTP- s' 106 binding protein RAB10

GCCACCCAAGCCCGTCG 324 O 8 8 O. OO6642 2 KIF3C kines in family s' 51464 member 3C

CCTTTGCTTCCCTTTCC 325 O 5 5 O. O.1316.1 2 CRIM1 cysteine-rich s' 1 OO motor neuron 1

CCTTTGCTTCCCTTTCC 326 O 5 5 O. O.1316.1 2 MYADML myeloid-associated 5' 263 OO25 differentiation marker-like

CACACAAGGCGCCCGCG 327 4 37 3 O. O.22534 2 SIX2 Sine oculis homeo - 5 16 O394 box homolog 2

TAAGAGTCCAGCAGGCA 328 4 O -13 O. O.294 64 2 RTN4 reticulon 4 isoform 5' 295 C

TCATTGCATACTGAAGG 329 2 23 4. O. O.26376 2 SLC1A4 solute carrier s' 3353 O2 family 1, member 4

TCATTGCATACTGAAGG 33 O 2 23 4. O. O.26376 2 SERTAD2 SERTA domain con- s' 245 taining 2

GCGCTACACGCCGCTCC 331 3 35 4. O. O.1477 2 SLC1A4 solute carrier s' 111 family 1, member 4

GCGCTACACGCCGCTCC 332 3 35 4. O. O.1477 2 SERTAD2 SERTA domain con- s' 33.5436 taining 2

GACGACAGCGCCGCCGC 333 O 18 18 O. OO6642 2 UXS1 UDP-glucuronate s' 66 decarboxylase 1

AAATTCCATAGACAACC 334 13 7 -6 O. O47343 2 HOXD4 homeo box D4 3' 1141

GGCGTGGGGAGAGGGGG 335 4 35 3 O. O.32525 2 ZNF533 zinc finger pro- s' 114958 tein 533

GCTGCAGGCACTGGGTT 336 4 O -13 O. O.294 64 2 ATIC 5-aminoimidazole - 4 - 5 2O3 carboxamide ribonucleotide

GCTGCAGGCACTGGGTT 337 4 O -13 O. O.294 64 2 ABCA12 ATP-binding cas- s' 173481 sette, sub-family A member 12

ATGGTGTCGCTGGACAG 338 3 37 4. O. O.10O34 2 ARPC2 actin related pro- 5' 94 tein 2/3 complex subunit 2

ATGGTGTCGCTGGACAG 339 3 37 4. O. O.10O34 2 IL8RA interleukin 8 re- s' 5 OO 63 ceptor alpha

GACTTCTGGCAAGGGAG 340 O 17 17 O. O.28576 2 DOCK1 O dedicator of cyto- 5' 2O8215 kinesis 10

ACTGCATCCGGCCTCGG 341 16 89 2 O. OO6496 2 PTMA prothymosin, alpha 5' 93674 (gene sequence 28

CCTAGCATCTCCTCTTG 342 6 O -19 O. O.16381 3 GRM7 glutamate receptor, 5' 70 metabotropic 7 isoform b

GAGGACTGGGGGCTGGG 343 O 14 14 O. O.17811 3 HRH1 histamine receptor 5' 984 O9 H1

CTTTGGCCGAGGCCGAG 344 5 O -16 O. O.10561. 3 FGDS FYVE, RhoGEF and PH 5' 85.78 domain containing 5

CGGCGCGTCCCTGCCGG 34533 146 1. O. OOS894 3 DKFZp313NO621 hypothetical pro- s' 3396.65 tein DKFZp313NO621

GAGAAGCCGCCAGCCGG 346 7 49 2 O. O.217 3 PXK PX domain contain- 3" 346 ing serine/ threonine kinase

CCTGCCTCTGGCAGGGG 347 17 82 1. O. O.291.36 3 PLXNA1 plexin A1 s' 5386

GTTTCTTCTCAATAGCC 348 O 22 22 O. O.114:11 3 FLJ12O57 hypothetical pro- s' 284.32 tein FLJ12O57

TCCTTGATGAAATGCGC 349 O 14 14 O. O.17811 3 SSB4 SPRY domain- s' 434 containing SOCS box protein SSB-4

GCTGGCGATCTGGGGCT 350 O 12 12 O. O.261.52 3 MGC4 O579 hypothetical pro- 3' 4 OS tein MGC4 O579

ACCCTTGGAGGAAGGGG 351 O 12 12 O. O.261.52 3 C3orf21 chromosome 3 open 3' 134 reading frame 21

GGGCGGTGGCGGGGACG 352 O 14 14 O. O.17811 4 RGS12 regulator of G- s' 210. Of protein signalling 12 isoform 2

CCTGCGCCGGGGGAGGC 353 66 24 O 1. O. O.11585 4 ADRA2C alpha-2C-adrenergic 3." 432 receptor

ATTTAGGGGTCTGTACC 354 O 15 15 O. O.1316.1 4 KIAAO232 KIAAO232 gene s' 58 product

GTCCGTGGAATAGAAGG 355 8 69 3 O. OO1269 4. Not Found

GTGGCGCGCTGGCGGGG 356 O 13 13 O. O158 4 RASL1B RAS-like family s' 2O2915 11 member B

GTGGCGCGCTGGCGGGG 357 O 13 13 O. O158 4 USP46 ubiquit in specific 5' 139 protease 46

CTGCCCAGTACCTGAGG 358 O 18 18 O. OO6642 4 SLC4A4 solute carrier s' 151833 family 4, sodium bicarbonate

CCGCGGATCTCGCCGGT 359 2 25 4. O. O.1729 4 ASAHL N-acyl sphingosine 3' 67 amidohydrolase-like protein

AGCCACCTGCGCCTGGC 36014 81 2 O. O. OfS48 4. PAQR3 progestin and s' 1 O1 adipoo receptor family member III

TGCGGAGAAGACCCGGG 361 2 24 4. O. O.19587 4 ELOWI6 ELOVL family member 3" 1583 6, elongation of long chain

GCTGTCCGCACGCGGCC 362 O 15 15 O. O.1316.1 4 SMAD1 Sma - and Mad-re- s' 3 O 1087 lated protein 1

GCTGTCCGCACGCGGCC 363 O 15 15 O. O.1316.1 4 HSHIN1 HIV-1 induced pro- 5' 596.7 tein HIN-1 isoform 1.

TGCACGCACACTCTTCC 364 2 29 4. O. O.19901 4 LOC152485 hypothetical pro- 3' 851 tein LOC152485

GCGTTTGGGGGTGTCGG 365 O 21 21 O. O.O3436 4 LOC152485 hypothetical pro- 3' 851 tein LOC152485

GTGGGGAGGCTGGGGCG 366 O 43 43 OOOO42 4 DCAMKL2 doublecortin and s' 1633,428 CaM kinase-like 2

GTGGGGAGGCTGGGGCG 367 O 43 43 OOOO42 4 NR3 C2 nuclear receptor s' 31.89 subfamily 3, group C member 2

CTGCACTAAAATATTCG 368 3 29 3 O. O4 6121 4 MGC458OO hypothetical pro- s' 3O46 O6 tein LOC90768

CTTAGATCTAGCGTTCC 369 6 58 3 O. O.O2 127 4 DKFZP564J102 DKFZP564J1 O2 s' 4. protein

CCATATTTGCCCAAGCC 37O O 12 12 O. O.261.52 5 EMB embigin homolog 3' 410

TGACAGGCGTGCGAGCC 371. 2 43 7 O. OO11.98 5 MGC33648 hypothetical pro- s' 926.17 tein MGC33648

TGACAGGCGTGCGAGCC 372 2 43 7 O. OO11.98 5 FLJ11795 hypothetical pro- s' 699674 tein FLJ1795

CTAGAAAGACAGATTGG 373 O 12 12 O. O.261.52 5 TIGA1 TIGA1 s' 4O2673

CTAGAAAGACAGATTGG 37.4 O 12 12 O. O.261.52 5 C5orf13 neuronal protein s' 594 3.1

CTGGGTTGCGATTAGCT 37s 23 25 -3 O. O.18417 PPIC peptidylprolyl s' 621.81 isomerase C

CGTGGCTCGGATTCGGG 376 O 13 13 O. O158 5 ARHGAP26 GTPase regulator 3' 8 associated with the focal

CCAGAGGGTCTTAAGTG 377 11 71. 2 O. OO663 5 NR3 C1 nuclear receptor 3' 553 subfamily 3, group C member 1

CTGCGGGAGCTGCGGCC 378 O. 17 17 O. O.28576 5 SGCD delta-sarcoglycan s' 597771 isoform 1

TCCGACAAGAAGCCGCC 379 O 26 26 O. O.O45O2 5 MSX2 mish homeo box 3' 605 homolog 2

CGTCTCCCATCCCGGGC 38O 18 17 -3 O. O16276 5 CPLX2 complexin 2 3' 1498

GCAGAAAAAGCACAAAG 381. 11 4. -9 O. O.266 O9 5. FLT4 fms-related tyro- s' 24508 Sine kinase 4 isoform 1

GTCAGCGCCGGCCCCAG 382 5 44 3 O. O.131.97 6 EGFL9 EGF-like-domain, 3' 134 multiple 9

ATGAGTCCATTTCCTCG 38.331 4 O -3 O. O.2984.1 7 MGC10911 hypothetical pro- s' 96664 tein MGC10911

GCGAGGGCCCAGGGGTC 384 12 7s 2 O. OO6269 7 SLC29A4 solute carrier 3' 67 family 29 (nucleoside

GGGGGGGAACCGGACCG 385 O 18 18 O. OO6642 7 ACTB beta actin 3' 865

AACTTGGGGCTGACCGG 386 O 3 O 3 O O. OO6104 7 AUTS2 autism suscepti- 3' 1095,850 bility candidate 2

CCTTGACTGCCTCCATC 387 is O -16 O. O4 6199 7 WBSCR17 Williams Beuren s' 512 syndrome chromosome region 17

CCCAGGCTTGGAATCCC 388 2 23 4. O. O.26376 7 AP1S1 adaptor-related s' 1. Of protein complex 1, sigma 1.

TACTTTTAACTGCCTGC 389 O 23 23 O. O.O.317 7 FOXP2 forkhead box P2 s' 328728 isoform II

TACTTTTAACTCCCTGC 390 O. 23 23 O. O.O.317 7 PPP1R3A protein phospha- s' 167483 tase 1 glycogen binding

ATTGCATTCTTGAGGGC 391 O 12 12 O. O.261.52 7 SLC4A2 solute carrier 3' 1O family 4, anion exchanger, member

GAGCTGGCAAGCCTGGG 392 O 14 14 O. O.17811 7 ASB10 ankyrin repeat and 3 1148O SOCS box- containing protein

GATGCCACCAGGTTGTG 393 13 7 -6 O. O47343 7 HTRSA 5-hydroxytryptamine 5' sfe (serotonin) recep tor 5A

GATGCCACCAGGTTGTG 394 13 7 -6 O. O47343 7 PAXIP1L PAX transcription s' 673.72 activation domain interacting

TCCCGCCGCGCGTTGCC 395 O 16 16 O. O.10329 8 PCM1 peric entriolar 3' 243 material 1

CCCTGTCCTAGTAACGC 396 2 36 6 O. O. O4927 8 DDHD2 DDHD domain con- 3' 541 taining 2

CGAGGAAGTGACCCTCG. 397 O 14 14 O. O.17811 8 CHD7 chromodomain heli- 5' 156 case DNA binding protein 7

GCGGGGGCAGCAGACGC 398 9 O -29 O. O.O2372 8 PRDM14 PR domain contain- 3" 768 ing 14

TAACTGTCCTTTCCGTA 399 23 5 -15 6.66 x 10' 8 Not Found

TCTGTATTTTCCCGGGG 4 OO O 22 22 O. O.114:11 8 FAM49B family with se- s' 528 quence similarity 49 member B AAGAGGCAGAACGTGCG 4 O134 12 -9 2.68 x 109 8 KCNK9 potassium channel, 3" 360 subfamily K, member 9

GCCTCAGCCCGCACCCG 4 O2 O 21 21 O. O15063 8 DGAT1 diacylglycerol O- s' 84 acyltransferase 1

GACCGGGGCGCAGGGCC 4 O3 O 21 21 O. O15063 8 ZNF517 zinc finger protein 5." 13 O 17

GACCGGGGCGCAGGGCC 4 O4 O 21 21 O. O15063 8 RPL8 ribosomal protein s' 6362 L8

GTGCGGGCGACGGCAGC 405 12 72 2 O. O.101.35 9 KLF9 Kruppel-like factor 3" 995 9

GCCCGCCTGAGCAAGGG 4 O6 44 23 -6 5.46 x 109 9 C9cf 125 chromosome 9 open 3' 738 reading frame 125

GGTGGAGGCAGGCGGGG 4. Of O 15 15 O. O.1316.1 9 TXN thioredoxin 3' 266

GGCGTTAATAGAGAGGC 408 4 O -13 O. O.294 64 9 PRDM12 PR domain contain- 5' 501.7 ing 12

AGGTTGTTGTTCTTGCA 4 O92 O 14 -5 OOOO8O3 9 PRDM12 PR domain contain- 3" 1427 ing 12

AGCCGCGGGCAGCCGCC 410 O 21 21 O. O15063 9 BARHL.1 BarH-like 1 s' 87

AGCCACCGTACAAGGCC 411. 8 49 2 O. O39.937 O PFKP phosphofructo- 3' 1056 kinase, platelet

GCGGGCAGCTCGAGGCG 412 O 19 19 O. O.19333 O BAMBI BMP and activin 3' 2O3 memorane-bound inhibitor

GCGGCCGCGGGCAGGGG 413 O 2O 2O O. O.1441 O TRIM8 tripartite motif- s' 375 containing 8

CCCCGTGGCGGGAGCGG 414 22 119 2 O. OO1632 O NEURL neuralized-like s' 630

CCCCGTGGCGGGAGCGG 41522 119 2 O. OO1632 O FAM26A family with se- s' 1442O quence similarity 26, member A.

GCCTGGCTCTCCTTCGC 416 O 15 15 O. O.1316.1 O KIAA1598 KIAA1598 3' 509

AAAAGTAAACAGGTATT 417 4 O -13 O. O.294 64 O PLEKHA1 pleckstrin homology 5' 162 domain containing, family A

CCGCGCTGAGGGGGGGC 418 O 17 17 O. O.28576 O CTBP2 C-terminal binding 3' 1219 protein 2 isoform 1

TCAGAGGCTGATGGGGC 419 6 52 3 O. OO6.425 O MGMT O- 6-methylguanine - 5 134. Of 65 DNA methyltrans ferase

TCAGAGGCTGATGGGGC 42O 6 52 3 O. OO6.425 O MKI67 antigen identified 5' 232 by monoclonal antibody Ki-67

CGGAGCCGCCCCAGGGG 421 O 28 28 O. O.O91.96 RNH ribonuclease/ 3' 381 angiogenin inhibitor

ATGCCACCCCAGGTTGC 422 O 21 21 O. O15063 OSBPL5 oxysterol-binding 3' 397 protein-like pro tein 5 isoform

GCGCTGCCCTATATTGG 423 11 7s 2 O. O.O341 FLJ11336 hypothetical pro- 3' 375 tein FLJ11336

TCGTCCTGGGTGGAGGG 424 2 22 3 O. O27586 C11CRF4 chromosome 11 hy- s' 458 pothetical protein ORF4

TCGTCCTGGGTGGAGGG 425 2 22 3 O. O27586 BAD BCL2-antagonist s' 7 OS of cell death protein

GCCTCTGCAGCCAGGTG 426 6 O -19 O. O. O5543 DRAP1 DR1-associated 3' 3.68 protein 1

CCACAGACCAGTGGGTG 427 6 42 2 O. O375. Of TPCN2 two pore segment 3' 3. OS channel 2

CCCCGGCAGGCGGCGGC 428 17 89 2 O. O.10843 ROBO3 roundabout axon s' 64774 guidance receptor, homolog 3

CCCCGGCAGGCGGCGGC 429 17 89 2 O. O.10843 FLJ23342 hypothetical pro- s' 2O8 tein FLJ23342

GAACAAACCCAGGGATC 430 18 11 -5 OOOO558 2 KCNA1 potassium voltage- 5' 14 O3 gated channel, shaker-related

TCGGAGTCCCCGTCTCC 431 is 56 3 O. OO1392 2 ANKRD33 ankyrin repeat s' 73619 domain 33

AGAACGGGAACCGTCCA 43229 15 -6 6.88 x 107 2 CENTG1 centaurin, gamma 1 3" 3647

GCCTGGACGGCCTCGGG 433 2 23 4. O. O.26376 2 CSRP2 cysteine and 3' 185 glycine-rich pro tein 2

GTGCGGCGCGGCTCAGC 434 O 18 18 O. O.22346 2 DIP13B DIP13 beta 3' 6

TTGCAAAGAACGGAGCC 435 O 12 12 O. O.261.52 2 CUTL2 cut-like 2 3' 265

TTTCAGCGGGAGCCGCC 436 24 19 - 4 OOOO 698 2 KIAA1853 KIAA1853 protein s' 64

CGAACTTCCCGGTTCCG 437 43 19 -7 4. OO x 10' 2 Not Found

CAGCGGCCAAAGCTGCC 43832 129 1. O. O3O85 2 RAN ras-related nuclear 5" 27 protein

CAGCGGCCAAAGCTGCC 439 32 129 1. O. O3O85 2 EPIM epimorphin isoform 5' 3.2499 2

GTAGGTGGCGGCGAGCG 44 O O 22 22 O. O.114:11 3 USP12 ubiquitin-specific 3' 653 protease 12-like 1

CTGTACATCGGGGCGGC 441 6 O -19 O. O.16381 3 SOX1 SRY (Sex determin- 5' 425 ing region Y) -box. 1

GCTGCTGCCCCCAGCCC 442 O 19 19 O. O.O.525.4 4 KIAAO323 KIAAO323 3' 158

CGCAGTTCGGAAGGACC 443 O 12 12 O. O.261.52 4 MTHFD1 methylenetetra hydrofolate s' 559 dehydrogenase 1

CGCAGTTCGGAAGGACC 444 O 12 12 O. O.261.52 4 ESR2 estrogen receptor 2 5' 93455

CTGAGGCTGCGCCCGCC 445 O 12 12 O. O.261.52 4 GPR68 G protein-coupled s' 164 O3O receptor 68

GGGCGGTGCCGCCAGTC 446 3 49 5 OOOO941 4 EML.1 echinoderm micro- s' 629 Of tubule associated protein like 1

GCCCCACGCCCCCTGGC 447 9 65 2 O. O.O516 4 C14 orf153 chromosome 14 open 5' 681 reading frame 153

GCCCCACGCCCCCTGGC 448 9 65 2 O. O.O516 4 BAG5 BCL2-associated s' 19 athanogene 5

CTCGTGCGAGTCGCGCG 449 O 17 17 O. O.28576 5 NDNL2 necdin-like 2 s' 4 OS2O9

GCCCCGGCCGCCGCGCC 450 4 38 3 O. O.18724 5. Not Found

AGAGCTGAGTCT CACCC 451, 5 45 3 O. O.1099 5 CDAN1 codanin 1 3' 359

GAGCCTCTTATGGCTCG 42 O 12 12 O. O.261.52 5 RORA RAR-related orphan 3' 2O5 receptor A isoform c

TCAGGCTTCCCCTTCGG 453. 1s 81 2 O. O.12835 5 PIAS1 protein inhibitor s' 1904 SO of activated STAT, 1.

GCCGGGCCCCGCCCTGC 454 O 21 21 O. O15063 5 C15orf17 chromosome 15 open 5' 295 reading frame 17

CCTTGAGAGCAGAGAGC 455 6 41 2 O. O44 419 5 LRRN6A leucine-rich repeat 3" 43 neuronal 6A

CTAAGTGGGCAGCACTG 456 O 19 19 O. O.O.525.4 5 ARNT2 aryl-hydrocarbon 3' 128 receptor nuclear translocator

GGCCGGGCTGGCACCGG 457 O 19 19 O. O.O.525.4 6 TMEM8 transmembrane pro- 3" 496 tein 8 (five membrane-spanning

GGTGCAGCTCTGAGGCG 458 O 44 44 OOOO342 6 RHOT2 ras homolog gene s' 119 family, member T2

GAGTGCCCGGCTCGCCC 459 O 18 18 O. O.22346 6 C1OTNF8 C1q and tumor ne- 3' 5691 crosis factor related protein 8

CCCGCGGGAGAGACCGG 46O 5 48 3 O. OO6311 6 E4F1 p12 OE4F s' 8954

CCCGCGGGAGAGACCGG 461 5 48 3 O. OO6311 6 MGC21830 hypothetical pro- s' 36.23 tein MGC21830

CGCAGTGTCCTAGTGCC 462 O 24 24 O. O.O2 455 6 CGI-14 CGI-14 protein s' 89

GAGCTCAGAGCTCCTCC 463 O 2O 2O O. OO615 6 CGI-14 CGI-14 protein s' 89

CCTTCCTGCGAACCCCT 464 O 13 13 O. O158 6 MMP25 matrix metallo- 3' 119 OS proteinase 25

CGGGCCGGGTCGGCCTC 465 O 41 41 OOOO 635 6 NUDT16L1 nudix-type motif s' 11O 1.6-like 1

GTGGCGCTCGGGGTGCG 466 O 13 13 O. O158 6 PPL periplakin s' 283

CCGGGTCCGCGGGCGAG 46714 123 3 5. 66 x 10 6 USP7 ubiquit in specific 3' 72 protease 7 (herpes

ATCCGGCCAAGCCCTAG 468 8 62 2 O. O.O4 442 6 ATFFIP2 activating trans- s' 244.550 cription factor 7 interacting

ATCCGGCCAAGCCCTAG 469 8 62 2 O. O.O4 442 6 GRIN2A N-methyl-D- s' 809 aspartate receptor subunit 2A

GTTAAAAACTTCCAGCC 470 O 12 12 O. O.261.52 6 DNAH3 dynein, axonemal, 3' 895 heavy polypeptide 3

GGGTAGGCACAGCCGTC 471 4 61 5 OOOO219 6 TBX6 T-box. 6 isoform 1 s' 85

TGCGCGCGTCGGTGGCG 472 4 45 3 O. O.O4991 6 LOC51333 mesenchymal stem 3' 9832 cell protein DSC43

CGGTGCCCGGGAGGCCC 473 4 O -13 O. O.294 64 6 CHD9 chromodomain heli- 5' 2004. 6 OO case DNA binding protein 9

CGGTGCCCGGGAGGCCC 474 4 O -13 O. O.294 64 6 SALL1 sal-like 1 s' 654

GTGCAGTCTCGGCCCGG 47s 2 43 7 O. OO11.98 6 FBXL8 F-box and leucine- 3" 39 OS rich repeat protein 8

TCCCGCGCCCAGGCCCC 476 9 O -29 O. O.O2372 6 ZCCHC14 zinc finger, CCHC 3' 143 domain containing 14

GCAGCCCCTTGGTGGAG 477 21 8 -8 2.32 x 10 6 TUBB3 tubulin, beta, 4 3' 843

CCGTGTTGTCCTGGCCG 478 3 4 O 4. O. O. O559 7 MNT MAX binding protein 3' 228

CCACAC CTCTCTCCAGG 479. O 18 18 O. OO6642 7 SENP3 SUMO1/sentrin/SMT3 5' 326 specific protease 3

GGCAACCACTCAGGACG 48O 2 51 8 OOOO235 7 HCMOGT-1 sperm antigen 3' 697 OS HCMOGT-1

CACAGCCAGCCTCCCAG 21323 9 -8 8.64 x 10' 7 LHX1 LIM homeobox pro- 3' 3701 tein 1

CCAAGGAACCTGAAAAC 482 O 14 14 O. O.17811 7 ACLY ATP citrate lyase 3' 446 isoform 1

GCCCAAAAGGAGAATGA 483 6 O -19 O. O.16381 f PHOSPHO1 phosphatase, orphan 3" 5786 1.

CACGCCACCACCCACCC 484 O 16 16 O. O.10329 7 NXPH3 neurexophilin 3 s' 3.18

GAAACCCCTCTGAGCCC 485 O 17 17 O. O.28576 7 ABC1 amplified in breast 3" 235 cancer 1