US 20140065620A1 (19) United States (12) Patent Application Publication (10) Pub. No.: US 2014/0065620 A1 Perez et al. (43) Pub. Date: Mar. 6, 2014

(54) NUCLEIC ACIDS FOR DETECTING BREAST CANCER
(71) Applicant: MAYO FOUNDATION FOR MEDICAL EDUCATION AND RESEARCH, Rochester, MN (US)
(72) Inventors: Edith A. Perez, Rochester, MN (US); E. Aubrey Thompson, JR., Jacksonville, FL (US); Yan Asmann, Rochester, MN (US)
(73) Assignee: MAYO FOUNDATION FOR MEDICAL EDUCATION AND RESEARCH, Rochester, MN (US)
(21) Appl. No.: 13/725,414
(22) Filed: Dec. 21, 2012
Related U.S. Application Data
(60) Provisional application No. 61/581,627, filed on Dec. 29, 2011.
Publication Classification
(51) Int. Cl. C12Q 1/68 (2006.01)
(52) U.S. Cl. CPC ...... C12Q 1/6886 (2013.01); USPC ...... 435/6.12; 536/23.4
(57) ABSTRACT
This document provides methods and materials involved in detecting breast cancer. For example, nucleic acids for detecting rearrangements (e.g., translocations) associated with breast cancer as well as methods and materials for detecting breast cancer are provided.
Patent Application Publication Mar. 6, 2014 Sheet 1 of 50 US 2014/0065620 A1

Figure



Figure 5

Figure 7

Figure 8


Figure 9
PRIVATE FUSION PROTEINS
A. Fusions in untranslated regions
A.1. Fusions in 3'UTR
IGFBP5->AMD1 3'UTR of IGFBP5 fused into AMD1, No AMD1 in the fused protein
APOOL->DCAF8 3'UTR of APOOL fused into DCAF8, No DCAF8 in the fused protein
TMEM119->AR-2 3'UTR of TMEM119 fused into ARH2, No ARH2 in the fused protein
ASAP1->MALAT1 3'UTR of ASAP1 fused into MALAT1, No MALAT1 in the fused protein
C2orf56->SAMD4B 3'UTR of C2orf56 fused into SAMD4B, No SAMD4B in the fused protein
CDK4->UBA1 3'UTR of CDK4 fused into UBA1, No UBA1 in the fused protein
CIRBP->UGP2 3'UTR of CIRBP fused into UGP2, No UGP2 in the fused protein
DNAJA2->COL14A1 3'UTR of DNAJA2 fused into COL14A1, No COL14A1 in the fused protein
HEATR5A->COL1A1 3'UTR of HEATR5A fused into COL1A1, No COL1A1 in the fused protein
POLD3->COL3A1 3'UTR of POLD3 fused into COL3A1, No COL3A1 in the fused protein
COL3A1->ZNF43 3'UTR of COL3A1 fused into ZNF43, No ZNF43 in the fused protein
F27->CPNE3 3'UTR of FI27 fused into CPNE3, No CPNE3 in the fused protein
RHOBTB3->CRNKL1 3'UTR of RHOBTB3 fused into CRNKL1, No CRNKL1 in the fused protein

Figure 9 (continued)
EPHA2->CTSD 3'UTR of EPHA2 fused into CTSD, No CTSD in the fused protein
LTBP4->CTSD 3'UTR of LTBP4 fused into CTSD, No CTSD in the fused protein
PACSIN3->CTSD 3'UTR of PACSIN3 fused into CTSD, No CTSD in the fused protein
TMEM109->CTSD 3'UTR of TMEM109 fused into CTSD, No CTSD in the fused protein
CTTN->NCRNA00201 3'UTR of CTTN fused into NCRNA00201, No NCRNA00201 in the fused protein
YWHAG->CYB561 3'UTR of YWHAG fused into CYB561, No CYB561 in the fused protein
KRT81->EMP2 3'UTR of KRT81 fused into EMP2, No EMP2 in the fused protein
SBF1->FLNA 3'UTR of SBF1 fused into FLNA, No FLNA in the fused protein
GAPDH->KRT13 3'UTR of GAPDH fused into KRT13, No KRT13 in the fused protein
GAPDH->MRPS18B 3'UTR of GAPDH fused into MRPS18B, No MRPS18B in the fused protein
GNB1->TRH 3'UTR of GNB1 fused into TRH, No TRH in the fused protein
PTMA->GNB4 3'UTR of PTMA fused into GNB4, No GNB4 in the fused protein
NTN1->HDLBP 3'UTR of NTN1 fused into HDLBP, No HDLBP in the fused protein
TES->HNRNPU 3'UTR of TES fused into HNRNPU, No HNRNPU in the fused protein
RAB3P->IGFBP5 3'UTR of RAB3P fused into IGFBP5, No IGFBP5 in the fused protein
MAF->GFBP7 3'UTR of MAF fused into GFBP7, No GFBP7 in the fused protein

Figure 9 (continued)
ITGA3->KHK 3'UTR of ITGA3 fused into KHK, No KHK in the fused protein
SERPINA1->KIAA1217 3'UTR of SERPINA1 fused into KIAA1217, No KIAA1217 in the fused protein
MAPK1IP1L->XPO1 3'UTR of MAPK1IP1L fused into XPO1, No XPO1 in the fused protein
MGP->REPS2 3'UTR of MGP fused into REPS2, No REPS2 in the fused protein
MRPL52->USP22 3'UTR of MRPL52 fused into USP22, No USP22 in the fused protein
NAV2->WDFY1 3'UTR of NAV2 fused into WDFY1, No WDFY1 in the fused protein
YWHAG->PDIA3 3'UTR of YWHAG fused into PDIA3, No PDIA3 in the fused protein
PKFYVE->TMEM119 3'UTR of PKFYVE fused into TMEM119, No TMEM119 in the fused protein
POSTN->TM9SF3 3'UTR of POSTN fused into TM9SF3, No TM9SF3 in the fused protein
POSTN->TRIM33 3'UTR of POSTN fused into TRIM33, No TRIM33 in the fused protein

A.2. Fusions in 5'UTR
GEMN7->SLC39A14 5'UTR of GEMN7 fused into the 5' UTR of SLC39A14
RBM6->SLC38A3 5'UTR of RBM6 fused into the 5' UTR of SLC38A3
DIDO1->REPS1 5'UTR of DIDO1 fused into the coding region of REPS1, No ORF that corresponds to any known protein.
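
The 3'UTR entries in section A.1 follow one fixed pattern: a label "A->B", then "3'UTR of A fused into B, No B in the fused protein". A minimal sketch of how such a line could be parsed and checked for agreement between the label and the description is given below. The function names and the regular expression are illustrative assumptions and are not part of the application; the second sample line is a deliberately inconsistent example.

```python
import re

# Assumed pattern for one section A.1 entry, for example:
# "CDK4->UBA1 3'UTR of CDK4 fused into UBA1, No UBA1 in the fused protein"
ENTRY_RE = re.compile(
    r"^(?P<label_a>\S+)->(?P<label_b>\S+)\s+"
    r"3'UTR of (?P<donor>\S+) fused into (?P<acceptor>\S+), "
    r"No (?P<absent>\S+) in the fused protein$"
)


def parse_utr_entry(line):
    """Return the named parts of an entry as a dict, or None if it does not match."""
    match = ENTRY_RE.match(line.strip())
    return match.groupdict() if match else None


def is_consistent(entry):
    """True when the A->B label agrees with the gene names in the description."""
    return (
        entry["label_a"] == entry["donor"]
        and entry["label_b"] == entry["acceptor"] == entry["absent"]
    )


good = parse_utr_entry(
    "CDK4->UBA1 3'UTR of CDK4 fused into UBA1, No UBA1 in the fused protein"
)
bad = parse_utr_entry(
    "SBF1->FNA 3'UTR of SBF1 fused into FLNA, No FLNA in the fused protein"
)
print(is_consistent(good))  # True
print(is_consistent(bad))   # False
```

A check of this kind only flags disagreement between the label and the description; it cannot say which of the two spellings is the correct gene symbol.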

Figure 9 (continued)
B. Fusions in coding regions
B.1. Frame shift fusions that give rise to C-terminal truncation events
TP5313->ABCA10 Coding to coding fusion: C-terminal truncation product containing aa 1-356 of TP5313 (full length TP5313 = 393 aa)
MAPPPPSPOLLLAAARLLGPSEWMAGPAEEAGAHCPESWPPPOVSPRWYTRWSPGQA EDWFYHPCAHPWLKLQAAYACMANPSLTPDFSQDRPVTAWGALEMAWWEPAW AAHWMRRRRRKORKKKAWIYCESSGPAPSEPTPGRGRCRRGCVOAAAFARSWRPP GTEVTSOGPRORPSSSGAKRRRLRAALGPOPTRSALRFPSASPGSKAKOSMAGPGRESNAP SVPTVSLLPGAPGGNASSRTEAOVPNGOGSPGGCVCSSOASPAPRAAAPPRAARGPPRTE EAAWAAMA TFL WLTLATLCTRHRNFRRGESYWGPADSODTWAG


EFAG1->ABCC5 Coding region of EIF4G1 fused to 3' UTR of ABCC5-C-terminal truncation containing aa 1 1317 of EIF4G 1. (full length EIF4G1 = 1599 aa) MNKAPCRSTGPPPAPSPGPOPAFPPGCRTAPWVFSPORATORMNTPSOPROHFYPSRAQPPSS AASRVOSAAPARPGPAAHWYPAGSQWMMPSQSYPASQGAYYPGQGRSTYWVPOOYPWO PGAPGFYPGASPTEFGTYAGAYYPAQGVCROFPTGVAPTPWLMNOPPORAPKRERKTRRDPN OGGKDETEEMSGARASPTPPOGGGEPORANGETPQVAVEVRPDDRSOGAADRPGPGP EHSPSESQPSSPSPTPSPSPWEPGSEPNAVSPG)TMTTOMSWEESTPSRETGEPYRS PEPTPLAEPLEVEVTLSKPWPESEFSSSPORAPPLASHTWE HEPNGMVPSEDEPEVESSPE APPPACPSESPVPAPACRPEELNGAPSPPAVO SPVSEPEECRAKEVASMAPPTPSATPAT APSA SPACREEEMEEEEEEEEGEAGEAGEAESEKGGEELPPESTPPANSONLEAAAACRW AVSWPKRRRKKELNKKEAVGD DAFKEANPAVPEVENOPPAGSNPGPESEGSGVPPRPEEA DETWDSKEDKHNAENORPGEQKYEYKSDQWKPLNEEKKRYDREFL GFCRFIFASMCKPEG PHSDWVLDKANKTPRPLDPTROGNCGPFPSFANLGRTTLSTRGPPRGGPGGE PRGP AGGPRRSCROGPRKEPRKATVLMEDKNKAEKAWKPSSKRAADKDRGEEDADGSKTOD FRRVRSNK TPOMFOOLMKOVTOLADTEERKGWOFEKASEPNFSWAYANMCRCMA KVPTTEKPTVTVNFRKLLNRCOKEFEKOKDDDEVFEKKOKEMDEAAAEERGRLKEELEEA ROARRRSGNKFGELFKKMEAMHDCVWKLKNHDEESECCRGKDD FEKAKP RMOOYFNQMEKKEKKTSSRRFMQ)VDLRGSNWVPRRGOOGPKDOHKEAEMEEHRE HKVOOLMAKGSDKRRGGPPGPPSRGPVODGGWNTVPSKGSRPDTSRTKTKPGSDS NNOLFAPGGRLSWGKGSSGGSGAKPSDAASEAARPATSTNRFSALCOAVPTESTDNRRVW ORSSSRERGEKAGDRGORLERSERGGDRGORLDRARPATKRSFSKEVEERSRERPSOPE GLRKAASTEDRDRGRDAVKREAALPPVSPLKAASEEEEKKSKAEEY HNDMKEAVOCV OELASPSFFVRHGVESTLERSAAREHMGOh-OCAGHLSTAQYYQG*


Figure 9 (continued)

MFSFWDLRLLL LAAALTHGOEEGOVEGODEOPPTCVONGRYHDROVWKPEPCRCVCD NGKVLCDDWCDETKNCPGAEVPEGECCPWCPDGSESPTDOETTGVEGPKGDTGPRGPRGP AGPPGRDGEPGOPGPGPPGPPGPPGPPGGGNFAPOSYGYDEKSGGSVPGPMGPSGP RGPGPPGAPGPOGFOGPPGEPGEPGASGPMGPRGPPGPPGKNGOOGEAGKPGRPGERG PPGPORGARGLPGAGPGMKGHRGFSGLDGAKGDAGPAGPKGEPGSPGENGAPGQMGPR GPGERGRPGAPGPAGARGNOGATGAAGPPGPTGPAGPPGFPGAVGAKGEAGPOGPRGSE GPOGWRGEPGPPGPAGAAGPAGNPGADGOPGAKGANGAPGAGAPGFPGARGPSGPOGPG GPPGPKGNSGEPGAPGSKGOTGAKGEPGPWGVQGPPGPAGEEGKRGARGEPGPGPGPP GERGGPGSRGFPGADGWAGPKGPAGERGSPGPAGPKGSPGEAGRPGEAGLPGAKGGSP GSPGPDGKTGPPGPAGODGRPGPPGPPGARGOAGVMGFPGPKGAAGEPGKAGERGVPGP PGAWGPAGKDGEAGAOGPPGPAGPAGERGEOGPAGSPGFOGPGPAGPPGEAGKPGEQG WPGDLGAPGPSGARGERGFPGERGVCRGPPGPAGPRGANGAPGNOGAKGDAGAPGAPGSO GAPGLOGMPGERGAAGPGPKGDRGDAGPKGADGSPGKDGVRG TGPGPPGPAGAPGDK GESGPSGPAGPGARGAPGDRGEPGPPGPAGFAGPPGADGOPGAKGEPGDAGAKGDAGP PGPAGPAGPPGPGNWGAPGAKGARGSAGPPGATGFPGAAGRWGPPGPSGNAGPPGPPGP AGKEGGKGPRGETGPAGRPGEWGPPGPPGPAGEKGSPGADGPAGAPGTPGPOGAGORGV WGPGORGERGFPGPGPSGEPGKQGPSGASGERGPPGPMGPPGLAGPPGESGREGAPGA EGSPGRDGSPGAKGDRGETGPAGPPGAPGAPGAPGPWGPAGKSGDRGETGPAGPGPWGP WGARGPAGPOGPRGDKGETGEORGDRGKGHRGFSGLOGPPGPPGSPGEQGPSGASGPAGP RGPPGSAGAPGKDG NGPGPGPPGPRGRTGOAGPWGPPGPPGPPGPPGPPSAGFDFSF POPPOEKAHDGGRYYRADDANVVRORDLEVOTT KSLSOOENRSPEGSRKNPARTCROLK MCHSDWKSGEYWOPNCRGCNDAKVFCNMETGETCWYPTCRPSWACKNWYSKNPKDKRHV WFGESMOGFOFEYWGSEHSFQ


GOPH3->CSS coding to coding fusion-C-terminal truncation containing aa 1-143 of GOLPH3L. (full length GOLPH3 = 285 aa) MTTLTHRARRTESKNSEKKMESEEDSNWEKSPONEDSGDSKORTLMEEVLLG KOKEGY TSFWNOCSSGRGGIE AMRGREYEPPTMRKKRLDRKWKSOSPGDW DETKHKAT EPTETWOTWELL AGATTLVKKDFGWOEKEVGLAFPLTOKSRGS FTNGRE*

CWC25->ROBO2 Coding to coding fusion: C-terminal truncation containing aa 1-63 of CWC25 (full length CWC25 = 425 aa)

Figure 9 (continued)

HMGN3->PAQR8 Coding of HMGN3 fused to 5' of PAQR8: 121 aa ORF with no significant homology to any known protein.

MPKRKVAYPVRARHGVHAGRCHDDRHGAPEHPVGORAAAAPPAODPGGWASODALHCP RNGCAPALPGAHPHRPPHGARVAL LOPLSETORGGORLDPFTGSPGRPLAGLCRG

SFTPC->G5 Coding of SFTPC fused to 5' of GLL5: C-terminal truncation containing aa 1-35 of SFTPC (full length SFTPC = 197 aa)

MDVGSKEWLMESPPDYSAAPRGRFGPCCPWHLKRLLEVVVGPPNPPNRATSTRPAA*


JOSD1->RPS19BP1 Coding to Coding fusion-C-terminal truncation containing aa 1-61 of JOSD1 (full length JOSD1 202 aa) MSCVPWKGDKAKSESE PCRAAPPQYHEKCRRELCAHANNVFQDSNAFTRD QEFQSP PGPSRSGCAERGSGETAPEDEGNSGPETAE GOGKGACRWGGRVPEARVSRPPOSKPEWSD ODEKHRG*

KCTD3->TXNDC 16 Coding to Coding fusion-C-terminal truncation Containing aa 1-208 of KCTD3 (full length KCTD3 = 815 aa) MAGGHCGSFPAAAAGSGEVOLNVGGTRFSTSRQTLMWPDSFFSSLLSGRSTLRDETGAF DROPAAFANFRTKE DRGVSNVRHEAEFYGTP WRRLL CEELERSSCGSWFHGYP PPGPSRKNNTVRSADSRN.GNSTEGEARGNGOPWSGGEEWRGFPVDPRKVVAGH HNWWAAYAHFAVCYSKRRF* Patent Application Publication Mar. 6, 2014 Sheet 19 of 50 US 2014/0065620 A1

Figure 9 (continued)


RAGPS2-> AMB3 Coding to coding fusion-C-terminal truncation containing aa 1-477 of RALGPS2. (full length RALGPS2 = 583 aa) MDLMNGOASSVNAATASEKSSSSESSDKGSEKKSFDAVWFOWLKWTPEEYAGOTLMOVPW FKAQPDESSCGWNKKEKYSSAPNAVAFTRRFNHVSFWWVRELHACRTLKRAEWSHYKTAK KYENN HAMAWWSG GRSAPFR KWASRKD KTFEKEYVMSKEDNYKRRDYSSLK MTPCPYLGIYLSDLEYEDSAYPSEGSLENEQRSNLMNNLRHSDLOOSCEYDPMLPHVOKYLN SVOYEEOKFVEDDNYK SKEPGSTPRSAASREDLVGPEVGASPOSGRKSWAAEGALLPO PPSPRNPHGHRKCHSGYNFHKMNTAEFKSATFPNAGPRHODSVMEPHAPSRGOAES STSSGSGSS) GSE SEETSWPAFERNRLYHSLGPWRVARNGYRSHMKASSSAESEDAV HYPGAVTICRGVLRRKKEGKKPTPWTWSGEORARAATRHQAAASAA APGPAVSASEA WATRCAWPATASRPMMRTSGSRPCALVDSAMPPPACGCRGGWRTVAWPPGS*


SC 16A3->MRPL4 Coding of SLC16A3 fused to the 3' UTR of MRPL4-C-terminal truncation containing aa 1-213 of SLC16A3... (full length SLC16A3 - 465 aa) MGGAVWDEGPTGVKAPDGGWGWAV FGCFVTGFSYAFPKAVSVFFKECREFGGYSDTAW SS AMLYGTGPLCSVCVNRFGCRPVM VGGLFASLGVWAASFCRS OVY TTGVG GLAL NFCRPSLMLNRYFSKRRPMANGAAAGSPWF CALSPLGOL Q DRYGWRGGFGG LLNCC WCAA MRPVVTAQPGSRPPAPPPPRPQA

NPLOC4->PDE6G Coding to coding fusion: C-terminal truncation containing aa 1-427 of NPLOC4 (full length NPLOC4 = 608 aa)

Figure 9 (continued)

MAESERVOSPDGVKRTATKRETAATF KKWAKEFGFONNGFSVYNRNKTGEASSNKSNL KKHGDLLFFPSSAGPSSEMETSVPPGFKVFGAPNVVEDEDOYSKQOGKYRSRDPCRLC RHGPLGKCWHCVPLEPFDEDYNHEPPVKHMSFHAYRKLGGAOKGKFWA ENSCKKSGC EGH PWPNGCTKCORPSAT NROKYRHVDNMFEN-TVADRFLDFWRKTGNOHFGY YGRY TEHKDPLGRAEWAAYEPPOGTONSEEDPKAEVVDEAAKGLRKVGWFTDVSEDTRK GTVRYSRNKOTYFSSEECTAGOFONKHPNMCRLSPDGHFGSKFWTAVATGGPDNQVHFEG YQVSNOCMALVRDEC PCKOAPEGYAKESSSECRYWPDWFYKWWGRHPWNGRPGNRHHS HLPGGOPPGAARAGPWHHARGPAEVORTLPPTV"

OLA1->ORMDL3 Coding region of OLA1 fused in front of the 5' UTR of ORMDL3: C-terminal truncation containing aa 1-124 of OLA1 (full length OLA1 = 396 aa)
MPPKKGGOGKPPPGRFGSKGWGPNVGKSFFNVLNSOASAENFPFCTOPNESRVP VPDERFOFCOYHKPASKPAFLNWDAG WKGAHNGOGLGNAFLSHSACDGFHLOOORG AAG

Figure 9 (continued)

SF1->YPEL1

Figure 9 (continued) coding region of SF1 fused in front of the 5' UTR of YPEL1-C-terminal truncation containing aa 1-150 of SF1 (full length SF 1 - 1242 aa) MKNLLEKCSSHNFHQKVKORMEKKVDSRYFKDGAVKKPYSAKTSNKKSSASFGRRELPS TSHLVOYRGTHTCTROGRRE RERCWARKFYLWRMTFGRVFPSKARFYYECRL RKVFEE WKEEWWWFORHEWKLCVRADCHYRLRFSVPAVWTVPHALDNDGSCVHFPWLLLSERPAEP RFFONOP* No putative conserved domains detected, PCNX->MKKS coding region of PCNX fused in front of the 5' UTR of MKKS-C-terminal truncation containing aa 1-814 of PCNX (full length PCNX F 2341 aa) MGSQTLOLROGWWAASGGWYYDPHOATFVNALHLY WFL GPFTYMALPSTMEWAVYC PWAAVFVLKMVNYRLHRADAGEVVDRANEFTDORTKAEOGNCSTRRKDSNGPSDPGGG EMSEFREAPPVGCSSRN.SYAGOPSNOGSGSSRLGTAATKGDTDTAKTSDDSS GOSS SLCKEGSEECDLAADRK-FRLVSNDSFSQPSSSCGODLPRDFSDKVNLPSHNHHHHVDOS LSSACDTEVASVPLHSHSYRKDHRPRGVPFRTSSSAVAFPOTSLNDFP YOORRGDPWSEE SSKPLSGSKESLVENSGSGEFQAGDKNTSOPPTKSGKSKPLKAEKSMDSLRSLSTRSSG STESYCSGDROTNSTWSSYKSEQTSSTHESSEHEESPKAGTKSGRKKECCAGPEEKNSC ASDKRSSEKAWEASNSGVHEAKDPTPSDEMHNORGLSTSASEEANKNPHANEFSOGOR PPGNTAENKEEKSDKSAVSVDSKVRKDWGGKQKEGDVRPKSSSVEHRTASAHKSGRRRGKK RASSFOSSRHRDYWCFRGVSGTKPHSAFCHDEDSSDQSDSRASSVQSAHQFSSDSSSST SHSCCRSPEGRYSALKKHTHKERGTOSEHTHKAH VPEGTSKKRARRSSTNSAKRARVS LDSGTWACLNDSNRMAPESKPTTSKSDEAKEGEVOESLGRASOEVERSRNSPNO WAFPEGEECDAVSGAACASEEAVSFRRERSTFRROAVRRRHNAGSNPTPPTGSPR* No putative conserved domains detected.

SC39A6->RG1 coding to coding fusion-C-terminal truncation containing aa 1-234 of SLC39A6. (full length SLC39A6 = 755 aa) MARKLSWLLFALSVTNPLHELKAAAFPQTEKSPNWESGNVDASTROYHCRQLFYRYGE NNSSVEGFRKLONGOKKRHH-OHD-SDHEHHSDHERHSD-E--SEHEHHSDHDH SHHNHAASGKNKRKALCPDHOSDSSGKDPRNSOGKGAHRPEHASGRRNVKDSWSASEWTST WYNTVSEGHF ETEETPRPGKFPKDWSSSTPPSWSKSRVSRRCPGCPPGRGA No putative Conserved domains detected. G3->FAM3B Coding to coding fusion-C-terminal truncation Containing aa 1-41 of GLI3 (full length GL3 r 1580 aa) MEAOSHSSTTEKKKVENSIVKCSRTDVSEKAVASSTSN)* No putative conserved domains detected


Figure 9 (continued) PSHSQEEKTSTASKTOTREGEVTPNSLSTSYK WSPSSPNKNTSPKRGCKREEG WKEWVRRSKKSVPASVVSRMGRGGCNTAODVTGAHDVDKOKDKNGERMTRGGTESTR YAVOLINALCDPAKEEDLPKNHRTPASTKSHANFSSGWGTTAASSKNAFPGAPTLVSQA TLSFQPANKLNKNVPNVRSSFPWSLPLAYPHPHFALAAQTMORQRHPRPMAQFGGTFS PSPNEWGPFPVRPVNPGNNSSPKHNNSRPNONGTVPSESAGLATASCPTVSSWWAASQ CLCVNTRPSSVRKOLFACVPKTSPPAVSSVTSTCSSLPSVSSAPSGOAPTTFPASSO AOLSSQKMESFSAVPPTKEKVSTODOPMANCPSSTANSCSSSASNTPGAPEHPSSSPTP TSSNTOEEAQPSSVSDSPMSMPFASNSEPAPTLSPRMWAADNQDSNPOAVPAPRVS HRMORPRGSFYSMVPNATHQDPOSFWTNPWTTPPOGPPAAVORSSAVNMNGSCRMHNPAN KSPPTFGPATFNHFSSLFDSSQVPANOGWGDGPLSSRVATDASFTVQSAFLGNSWLGHEN MHPDNSKAPGFRPPSQRVSTSPVGLPSDPSGSSPSSSSAPASFSGPGTRVFLQGPAPVGT PSFNROHFSPHPWTSASNSSTSAPPTLGOPKGVSASODRKPPPGERLARRORGGSVAQAP AGTSFWAPVGHSGWSFGWNAVSEGSGWSQSWMGNHPMHQOSDPSTFSOHOPMERDDS GMWAPSNFHQPMASGFWDFSKGPSMYGGPSHPOAOWPGGPFNGHNPDPAWNPM KVIQNSECTDAQOWPGWAP-GNMHKYWN*

ANP32E->MYST4 Coding to coding fusion: aa 1-246 of ANP32E fused in frame to aa 1377-2073 of MYST4.

MEMKKKNERNRSPEEVTELV)NCCVNGEEGLNDFKELEFSMANVESS ARPSNK LRKLE SDNSGG EVLAEKCPN YLNSGNKKDLSVEALONLKNLKSDFNCETNEDY RESFELLOOTYLDGFDOEDNEAPOSEEEDOEDGDEDDEEEEENEAGPPEGYEEEEEEEEEE DEOEDEDEDEAGSE GEGEEEWGSYMKEEODEEDDDDYVEEGEEEEEEEGEEEEGGGNV EKDPDGAKSOEKEEPESTEKEDSARDDHEEEEEEDEEPSHNEDHDADDEDDSHMESAEVE KEEPRESFKEVENOEFDLNVORPGHSNPEVMDCGVDLTASCNSEPKEAGDPEAVPESO EEPPPGECRAOKODOKNSKEVDTEFKEGNPATMEDSETWORAVOSTORESSEC) OFODCAET CEACRSLONYRADORSPQATTLDDCOOSDHSSPWSSVHSHPGOSVRSVNSPSVPAENSYA CRISPDORSASVPSONMETSPMMDWPSVSDHSQOVVDSGFSDGSEST ENYENPSSYDSTM GGSCGNGSSORNSCSYSNLTSSSTORSSCAVOORMSNSGSCSMLOOR SSSPPCSVKSPORG CVVERPPSSSGRCRAOCSMAANFTPPMQAEPETSNANGLYERMGOSDFGAGHYPORPSATF SAKOOLNTD-SLPYSHSAAVTSYANSAS STPSNTGWOLSOSPHSVPGGPCRACAM PPPNPPP.MNPPPLCRRNMAASNGS-SQROTOASKGHSMRTKSAS SPAAAHOSOY GRSCRWAMOGPARTMORGMNMSVNMPAPAYNVNSWNMNMNTNAMNGYSMSORPMMN SGYHSNHGYMNORPORYPMOMOMGMMGTOPYACRORPMORPPHGNMMY APGHHGYMNGM SKCRSLNGSYMRR


Figure 9 (continued) GWARSAEAEKVLALPEPSPAAPTRSEEGKEQVRSL-SAYLEKKTSLVIRGOGAEEW RAHEEORKEAGRAVPATLPELEATKASKKRACAEAQQPTFDARDERGAQEVGERLOORH GERDWEVERWRERVAQLERWOAVLAQTDWROREECRLGRORYYRESADPGAWLOOAR RRCEQQAMPLADSQAVREQLRQEQALLEEERHGEKVEECQRFAKQYNAKDYELQLVTYKA QLEPVASPAKKPKVOSGSESVQEYVORTHYSELTSOYKFSETRRMEEEER AECORA EERERAEVEAAEKQROLAEAHAOAKACAEREAKELOORMOEEWRREEAAWDACROQKRS QEE COROSSEAECRAKAROAEAAERSRREEERVWRLQEAERORGGAEGELOALRAR AEEAEAOKRCRACREEAERRROWODESQRKROAEVELASRVKAEAEAAREKORALOALEERL QAEEAERRRCRAEVERARQWOVA ETAORSAEAEOSKRASFAEKTAOLERSLOEEHVAWACR REEAERRAOOCRAEAERAREEAERELERWOKANEARRLQAEEVAOOKSAOAEAEKOKE EAEREARRRGKAEECRAVRORRELAEQEEKORO AEGTAGORAAECRERRAETEORGEOOR QEEELAR CRREAAAATORKRCRELEAEAKVRAEMEV LASKARAEEESRSTSEKSKQRLEAE AGRFRELAEEAARRAAEEAKRQRCRAEEDAARORAEAERVLAEK AAGEARKTEAEA KEKEAENERRRLAEDEAFORRRREEQAACRHKADEERAORKASDSE EROKG VEDTROR RROVEEEAKASFEKAAAGKAEELEGRRSNAEDTRSKEQAELEAARORCRAAEEERRR REAEERVQKSLAAEEEAARORKAAEEVERLKAKVEEARRRERAECRESAROOLACEAAOK RLQAEEKAHAFAVOOKEQEQQTLQQEQSVDQRGEGGGGGGGGGGGOGGSAMEPGEV KDRENSLSWKKLOSYFAACEDEPARNH)KVLORLCEHDHALLYGLODSSGYWWWVHF TRREAKCREVOHVATNGRSRAWYLANENSESYR FORENLGLHKY'YVKNAWCSHDH FTLVSGEFRFELDDAPYLDAPYMPOYYKPOYLDFEDRPSSVHGSDSSLNSFNSV TSNEWDDSAAPSSEDYDFGDVFPAVPSVPSTDWEDGDLTDTVSGPRSTASD TSSKASTR SPTORRONPFNEEPAETWSSSDTPVHTTSQEKEEAQALDPPDACEEWRVTKKKKGKKKKS RSDEEASPHPACSOKKCAKOGOGOSRNGSPSLGRDSPDTMLASPQEEGEGPSSTESSER SEPGLLPEMKDSMERGORPLSKVDQLNGORLDPSTWCSRAEPPDQSFRTGSPGDAPERPPL CDFSEGSAPMDFYRFTVESPSTWTSGGGHHDPAGGOPHVPSSPEAAGOEEEGGGGEGOR TPRPLEDTREACRELEACRLSWREGPWSEPEPGCREVLCQLKRDORPSPCSSAEDSGVDEGO GSPSEMVHSSEFRVDNNHLMHVFRENEECFKMRMSTGH MEGNOLYWLTDCYWYL RKGATEKPYLVEEAVSYNEDYWSVG DOCWKLVCTNRRKOFDTADVALAEFFASKSA MKGCREPPYPSLTDAMEKAAKFVAQESKCEASAVTVRFYG VHWEDPDESGPPCH CSPPEGTKEGM-YKAGTSYGKE-WKTCFVVSNGYOYPDRDWPSVNMGGEOCG GCRRANTORPHAFORVLSDRPCLELSAESEAEMAEWMOHCORAVSKGVPQGVAPSPCPCC VLTDDRLFTCHEDCORTSFFRSGTAKLGDSAVSTEPGKEYCWEFSQSQQLPPWVYSC TSEDRSANSGWKYOVDLPHTAOEASNKKKFEDALSHSAWORSDSCRGRASRDP WC


RPS15->PLEC Coding to Coding fusion-aa 1-73 of RPS15 fused in frame to aa 4554-4574 of PLEC with a single aa insertion at the fusion junction.
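
The in-frame fusions in this figure are described by inclusive residue ranges from each partner, so the expected length of a chimaeric protein follows from simple arithmetic. The sketch below works this out for the ranges quoted in the figure; the function name `chimera_length` is a hypothetical helper introduced here for illustration only.

```python
def chimera_length(a_range, b_range, inserted=0):
    """Expected length (in residues) of an in-frame chimaeric protein.

    a_range and b_range are inclusive (first, last) residue positions taken
    from partner A and partner B; `inserted` counts any extra residues added
    at the fusion junction.
    """
    a_first, a_last = a_range
    b_first, b_last = b_range
    if a_first > a_last or b_first > b_last or inserted < 0:
        raise ValueError("ranges must be increasing and inserted must be >= 0")
    return (a_last - a_first + 1) + inserted + (b_last - b_first + 1)


# RPS15->PLEC: aa 1-73 of RPS15, one inserted residue, aa 4554-4574 of PLEC.
print(chimera_length((1, 73), (4554, 4574), inserted=1))  # 95

# ANP32E->MYST4: aa 1-246 of ANP32E fused to aa 1377-2073 of MYST4.
print(chimera_length((1, 246), (1377, 2073)))  # 943
```

For RPS15->PLEC this gives 73 + 1 + 21 = 95 residues, since an inclusive range from 4554 to 4574 covers 21 positions.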


Figure 9 (continued)


TNRC18->SC9A3R1 Coding of TNRC18 fused to 5' of SLC9A3R1-aa 1-2884 of NRC18 fused in frame to aa 20 428 of SLC9A3R1 MDGRDFGPORSVHGPPPPLSGLAMDSHRVGAATAGRLPASGLPGPPPGKYMAGLN HPH PGEAFGSFWASGMGPSASSHGSPVPPSOSFRSPPSNPMVO WAAHAHEGFSHPSG YPSYHLNHEPPSSGSPLSOGOPSFOTOKGOGPGGOGFY PTAGAPGSLHSHAPSARP GGGHSSGAPAKGSSSRDGPAKERAGRGGEPPPLFGKKDPRARGEEASGPRGVVD TO EARA EGRODRGPPRLAER SPFAESKTKNAAOPSVLTMCNGGAGOVG PAVAEAGRGGAKEAA RODEGAR LRRETLPGPRPCPSPPPPPAPPKGPPAPPAATPAGVY VFREOGREHRVWA PTFWPSVEAFOERPGPCRASOARDARAREREAGRPGVOAPPGSPRPDRPEGREKNSWR SKRPPPADAPTWRARASPDPRAYVPAKELKPEADPRPCERAPRGPAGPAAQQAAKFGLE PGRPPPTGPEHKWKPFELGNFAAQMAV-AAQHHHSRAEEEAAVVAASSSKKAYDPGAVP RSAATCGRPVADMHSAAHGSGEASAMOSKYSGSFARDAVAVRPGGCGKKSPFGGGTMK PEPAPTSAGASRAQARPHSGGPAAGGGROKROPERPESAKAFGREGSGACRGEAEVRHPP WGAVAVARQKDSGGSGRGPGVOERSSSNVKGHGRADEDCVDDRARHREER GAR DRDOEKRESKELADARLHPTSCAPNGNPN MVGGPALAGSGRWSADPAAHLAHPW PRSGNASMWAGHPYGGPPSHORGMAPAFPPGLGGSPSAYOFVRDPCRSGOLWPSOH PHFAEMERAVPP WPAYPPGRSPHHAQOLOFSCROHFROQEFLY OOCRAACRALELOR SAOLVCRER KAQEHRAEMEEKGSKRGEAAGKAGATAGPGPRKPPGAAGPAGTYGKAV SPPPSPRASPWAAKAKVOKLEDWSKPPAYAYPATPSSHPTSPPPASPPPTPGTRKEEAPEN WWEKKDEEKEAPSPFORALFSDPPRYPFOALPPHYGRPYPFL CRPTAAADADGLAPDVPPA DGPERASPEDKPRSPSKTEP REGPEEEPLAEREVKAEVEDMDEGPTEPPLESPPLPA AEAMATPSPAGGCGGGLEAGASA GOSCAEPSECPDFVEGPEPRVDSPGRTEPCAA DL GVCR PETWEAKEEPVEVPVAWPVVEAVPEEGAQVAPSESQPTEMSDCOVPAGEGORCPS EPCREAWPWGSTCFLEEASSDOFPS EDPAGMNAAAAAEPOARPLPSPGAAGACRAEK EAAES VEGSFHGLSEAELEERRSOEMGGAERAVARPSESAAGSHMREVOG PWWDP KNIR PREKPNKKYSWMRKKEERMYAMKSS EDMDALEDFRMRAEVOROY KEK QRE WKORRRDSEDRREEPHRSARRGPGRPRKRTHAPSALSPPRKRGKSGHSSGKSSK STSODYELGAGRKR-KGSEEEHDAGMGKARGRNCRTWOEHEASSDFSCRKKKKKMAS DGECASKDKASKCROKKSPFKFSDSAGGKSKTSGGCGRYPYDS GKNRKAAKG GLSLKSSREGKHKRAAKRKMEVGFKARGOPKSAHSPFASEVSSYSYNTOSEEDEEFLKDEW PAOGPSSSKLPS, CSMVAKNSKAAGGPKLTKRGAAPRTLKPKPASRKOPFCL REAEA RSSFSDSSEESFOODESSEEEDEEEEEEEDEASGGGYRLGARERA SPGEESGGLARFA ASALPSPTVGPSLSWVO EAKOKARKKEEROSLLGTEFEYTDSESEVKVRKRSPAGLLRPKKG GEPGPS 
AAPTPGARGPDPSSPDKAKAWEKGRKARKRGPKEPGFEAGPEASDDDWTRR RSERF HDASAAAPAPVSAPAKSRCAKGGP SPRKDAGRAKDRKDPRKKKKGKEAGPG AGPPPRAPALPSEARAPHASSTAAKRSKAKAKGKEWKKENRGKGGAVSKMESMAAEEDF EPNQDSSFSEDEHPRGGAVERPTPAPRSCOKDELKEDGLRVLPMDDKLYAGHWOWHSP DYRVVVEGERGNRPHYCLEOLLOEADVRPASTRFLPQGRAAYWSQQYRCLYPGVWRG DEDDGDTVEFDDGO GRPLSHRPPDYKOCAEPSPALVPSAKRRSRKSKOTGEG KOGGTAGSEEPGAKARGRGRKPSAKAKGDRAATLEEGNPTDEVPSTP AEPSSTPGSKKSP Patent Application Publication Mar. 6, 2014 Sheet 35 of 50 US 2014/0065620 A1

Figure 9 (continued) PEPVDKRAKAPKARPAPPOPSPAPPAFSCPAPEPFAE PAPATS APAPTMPATRPKPKKA RAAEESGAKGPRRPGEEAELVKDHEGWTSPKSKKAKEAL REDPGAGGWOEPKSSLG SYPPAAGSSEPKAPWPKATDGDAQEPGPG TFEDSGNPKSPOKAQAECRDGAEESESSSSS SSGSSSSSSSSSSSGSEEGEEEGDKNGOGGCGTGGRNCSAASSRAASPASSSSSSSSSSS SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSTTDEOSSCSSODEAAP APAGPSAGAAPKAKQAGKARPSAHSPGKKPAPOPOAPPPOPTOPOPKACRAGAKSRP KKREGWHPTKELAKRORPSVENRPKIAAF PARQWKWFGKPTORRGMKGKARKLFYKA WRGKEMERGDCAVFLSAGRPNLPYIGROSMWESWGNNMVVRVKWFYHPEETSPGKQFHCG QHWOOKSSRSPWPGTPACLARPAAOGFSAALPVRWTGRRAGPSRPVPGPSRAADPSQG EMSADAAAGAP PRCC EKGPNGYGFHHGEKGKLGOYER WEPGSPAEKAGLAGDR VEV NGENVEKETHOROVWSRRAANAVRLVVD PETOECRLQKLGVOVREELRAQEAPGCRAEPPAA AEVOGAGNENEPREADKSHPECRRELRPRCTMKKGPSGYGFNHSDKSKPGCFRSVDPDSP AEASGRAQDRIVEVNGWCMEGKQHGOVVSARAGGDETKLWWDRETDEFFKKCRWPSQEH NGPPVPFNGEOKENSREALAEAALESPRPALVRSASSDSEENSORDSPPKORDSTAPSST SSSDP LDFNSAMAKERAHQKRSSKRAPCMDWSKKNELFSN*

UBR2->SRPK1 Coding region of UBR2 fused to 3' UTR of SRPK1: aa 1-26 of UBR2 fused in frame to aa 434-548 of SRPK1
MASEEPEVOADRSECSAEEAGAFEAGDYLFEPHSGEEYRDEDHALEGKVPRKL WAGKYSKEFFTKKGOKHTKKPWGFEWLVEKYEWSQEEAAGFTDFLPMEPEKRATAA ECLRHPWLNS

Figure 9 (continued)

Figure 9 (continued)
E. REDUNDANT NON-ISOTYPE-SPECIFIC FUSION

A. Fusions in untranslated regions
A.1. Fusions in 3'UTR

RHOB->GATA3 3'UTR of RHOB fused into GATA3, No GATA3 in the fused protein
LOC728606->KCTD1 3'UTR of LOC728606 fused into KCTD1, No KCTD1 in the fused protein
OGT->ACTB 3'UTR of OGT fused into ACTB, No ACTB in the fused protein
H1F0->ACTB 3'UTR of H1F0 fused into ACTB, No ACTB in the fused protein
PTP4A2->MAAT1 Not found
COL1A2->LAMP2 3'UTR of COL1A2 fused into LAMP2, No LAMP2 in the fused protein
SPATS2->COL3A1 3'UTR of SPATS2 fused into COL3A1, No COL3A1 in the fused protein
PLXNA1->CTSD 3'UTR of PLXNA1 fused into CTSD, No CTSD in the fused protein
VPS35->DCN 3'UTR of VPS35 fused into DCN, No DCN in the fused protein
FTL->ADD3 3'UTR of FTL fused into ADD3, No ADD3 in the fused protein
COL1A1->BASP1 3'UTR of COL1A1 fused into BASP1, No BASP1 in the fused protein
BAT2L2->COL3A1 3'UTR of BAT2L2 fused into COL3A1, No COL3A1 in the fused protein
CD68->NEAT1 3'UTR of CD68 fused into NEAT1, No NEAT1 in the fused protein

Figure 9 (continued)

ACTG1->PPP1R12C Coding region of ACTG1 fused to 3' UTR of PPP1R12C-C-terminal truncation containing aa 1 343 of ACTG1(full length ACTG1 F 375aa) MEEEAA VDNGSGMCKAGFAGDDAPRAVFPSVGRPRHORGVMVGMGOKDSYWGDEACRSK RGLKYPEHGVTNWOMEKEWH-FYNERVAPEE-PVTEAPNPKANREKMTOMFET FNTPAMYWAOAVS YASGRTTGVMDSGDGV-TVPYEGYALPHARD AGRDLOYMK TERGYSFTAEREVROKEKCYVADFECREMATAASSSSLEKSYE POGOVTIGNERFRC PEAFOPSF GMESCGHETTFNSMKCOVDRKDYANVSGGTTMYPGADRMORKETALAP SMKKAPPERKYSVWGGEARMEVGGRAKCRGOVARGDOGORRDOEGEDOEVORGSVART RWAAWRSPSV


MTF2->ARL3 coding to coding fusion-no significant ORF MRACSQFCAS

EPN1->COL1A1 Coding to coding fusion: C-terminal truncation containing aa 1-538 of EPN1 (full length EPN1 = 576 aa)

MSSSRROMKNVHNYSEAEKVREATSNDPWGPSSSLMSEAD TYNWAFSEMSMWKR NDH GKNWRHWYKAMTLMEYLKTGSERVSQQCKENMYAVORTKDFCYWORDGKDOGVNVRE KAKOLVALRDEDRREERAHALKTKEKAQATASSAAVGSGPPPEAECRAWPOSSGEEEO OLAAMSKEEADOEERRRGOORLOMAEESKREGGKEESSMDLADVFTAPAPAPTOPW GGPAPWAAAVPTAAPSDPWGGPPVPPAADPWGGPAPTPASGDPWRPAAPAGPSVDPWGG TPAPAAGEGPPDPWGSSDGGVPWSGPSASPWTPAPAFSDPWGGSPAKPSTNGAGGF DTEPDEFSDFDRLRTAPSGSSAGELELLAGEVPARSPGAFOMSGWRGSAEAVGSPPPAAT PPPPRKTPESFLGPNAALVDSLVSRPGPPPGAKASNPFLPGGGPATGPSVTNPFOPA PPATTLNORLSPVPPWPGAPPYSPLGGGPGLPPMMPPLAVNKVPLEPLVLVPEVPLA VAKMDSTVSAPLGPLVAVALVMWVPPADLVPWPAVSTSASCPSHKRRM WAATTGLMMPMWFWWTSRWTPPSRAASRSRTSGAORAAARTPPAPAVTSRCATLTGRVES TGTPTKAAWMPSKSSAWRLVRPACPSPWWPRR GTSARTPRRGMSGSARA Patent Application Publication Mar. 6, 2014 Sheet 41 of 50 US 2014/0065620 A1

Figure 9 (continued)

DNM2->PIN1 Coding to coding fusion: C-terminal truncation containing aa 1-474 of DNM2 (full length DNM2 = 870 aa)

MGNRGMEELPVNKOOAFSSGOSCHOPOAWGGOSAGKSSWLENFWGRDFPRGSG VTRRPIOFSKTEHAEFLHCKSKKFTDFDEVRCREEAETDRVTGTNKGSPVPNRVYSPHV LNTLDPGTKWPWGDCPPDEYOKDM OFSRESS LAWTPANMDANSDALKLAKEVDPC) GLRTGVTKOMDEGDAROWLENKLPRRGYGWWNRSOKDEGKKORAAAAERKFFS HPAYRHMAORMGTPHLCRKTLNOOTNHRESPALRSKLQSQL SEKEVEEYKNFRPDOPR KTKAL CRMWOOFGWOFEKREGSGDOWDTESGGARNRFHERFPFE VKMEFDEKOLRRE SYAKNHGVRTGLFTPDAFEAVKKQVVKLKEPCLKCVD VOELINTWROCSKLSSYPRRE ETERIWTYREREGRTKOQAECTSTSTPASGSGPAATAAVVAKTGRGSPGSAARCW*

ELAC1->SMAD4 Coding of ELAC1 fused to 5' of SMAD4: C-terminal truncation containing aa 1-53 of ELAC1 (full length ELAC1 = 363 aa)

MSVDWTF-GTGAAYPSPRGASAVWRCEGECWLFDCGEGTOTO MKSOLKAGYPEYMSNN FPCNVSCCFSLFPKDCRNCFRNWRH

Figure 9 (continued)
SPARC->TRPS1 Coding region of SPARC fused in front of the 5' UTR of TRPS1: C-terminal truncation containing aa 1-120 of SPARC (full length SPARC = 303 aa)

MRAWFFCAGRAAAPCROEAPDETEVVEETWAEWTEVSVGANPVOVEVGEFDDGAEEE EEWVAENPCONHHCKH GKVCEDENNTPMCVCCROPTSCPAPGEFEKWCSNONKFDNSWGV K*


RP23->WUC 1 Coding to Coding fusion-C-terminal truncation containing aa 1-118 of RPL23, (full length RPL23 = 140 aa) MSKRGRGGSSGAKFRSGLPVGAVEN CADNIGAKNYSVKGKGRNRLPAAGVGDMVMA WKKGKPERKKVHPAVWRORKSYRRKOGVFLY FEDNAGVVNNKGEMKGSATWSC

Coding of ITC7A fused to 5' of SOCS5-C-terminal truncation containing aa 1-62 of TC7A (full length TC7A = 858 aa) MAAKGAHGSYLKVESEERCRAEG-WDRMPELVROLOTLSMPGGGGNRRGSPSAAFFPD DFING" No putative conserved domains detected.

Figure 9 (continued) B.2. In frame fusions that give rise to chimaeric proteins

RP 19->RPS16 Coding to Coding fusion-aa 1-27 of RPL19 fused in frame to aa 33-129 of RPS16. MSMROKRASSVRCGKKKWWDPNGKVNGRPEMEPRTLOYKLEPVGKERFAGV DERVRWKGGGHVAQYARQSSKALWAYYQKYWDEASKKEKDELQYDRTVADPRRCESKKF GGPGARARYORKSYR


MSB10->RPS16 Coding to Coding fusion-aa 1-34 of TMSB 1O fused in frame to aa 29-145 of RPS16 with single aa insertion at fusion junction. MADKPDMGEASFOKAKKKTETOEKNTPTKETNGKVNGRPLEMEPRTLOYKLLEPWLLG KERFAGVDRVRVKGGGHWAOYAROSSKALWAYYQKYWDEASKKEKOLOYDRVADPR RCESKKFGGPGARARYORKSYR

Figure 9 (continued)
REDUNDANT ISOTYPE-SPECIFIC FUSION PROTEINS
A. Fusions in untranslated regions
A.1. Fusions in 3'UTR
COL1A1->FMNL3 3'UTR of COL1A1 fused into FMNL3, No FMNL3 in the fused protein
CYB5R3->TXNIP 3'UTR of CYB5R3 fused into TXNIP, No TXNIP in the fused protein
ELF3->SLC39A6 3'UTR of ELF3 fused into SLC39A6, No SLC39A6 in the fused protein
TAX1BP1->MALAT1 3'UTR of TAX1BP1 fused into MALAT1, No MALAT1 in the fused protein
MGP->NCRNA00188 3'UTR of MGP fused into NCRNA00188, No NCRNA00188 in the fused protein
APOL1->ACTB 3'UTR of APOL1 fused into ACTB, No ACTB in the fused protein
ACTB->C20orf112 3'UTR of ACTB fused into C20orf112, No C20orf112 in the fused protein
CD74->MBD6 3'UTR of CD74 fused into MBD6, No MBD6 in the fused protein
GNB2->CTSD 3'UTR of GNB2 fused into CTSD, No CTSD in the fused protein
TSPAN14->HLA-E 3'UTR of TSPAN14 fused into HLA-E, No HLA-E in the fused protein
YWHAZ->ZBTB33 3'UTR of YWHAZ fused into ZBTB33, No ZBTB33 in the fused protein
DCLK1->COL3A1 Not found

A.2. Fusions in 5'UTRs
Not detected

Figure 9 (Continued) HYWTYAKNPNCKWYCYNCSSCKELHPDEDTDSAYLFYEOQGDYACFLPKDGKKMADTSS MDEDFESDYKKYCVO


KR18->PLEC Coding to Coding fusion-aa 1-325 of KRT18 fused in frame to aa 1396-4574 of PLEC MSFTRSTFSTNYRSGSWCAPSYGARPVSSAASVYAGAGGSGSRSVSRSTSFRGGMGSGG AGAGGAGMGGONEKETMOSNORASYDRVRSEENRRESKRE-EKKGPOVRD WSHYFKEDLRAQFANTVDNARIVLCDNARLAADDFRVKYETELAMROSVENDHGLRKVFDD TNTROLETEEALKEELFMKKNHEEEVKG OAQASSGLTVEVOAPKSODAKMADRAQYO ELARKNREELDKYWSCROEESTTWWTOSAEWGAAETTLERRTWOSEDDSMRN KASLE NSLRVEKCRRCAEAHACRAKACAEREAKELORORMOREEVWRREEAAVOAQQQKRSCREELOQL RQSSEAECRAKARCAEAAERSRREEERVWRLQEATERORGGAEGECALRARAEEAEACR KROAGEEAERRROWGRESORRKRCRAEVELASRWKAEAEAAREKORALOALEERLOAEEAER RLRORAEVERARQVCRVALETACRRSAEAELQSKRASFAEKAQLERSLQEEHVAVAQLREEAER RAQQQAEAERAREEAEREERWORKANEARRCAEEVAQQKSLAQAEAEKOKEEAEREAR RRGKAEEGAWRQREAEGREEKQRQAEGTAGORAAECRELERLRAETEORGEQQRQLEEE ARQREAAAAQKROELEAEAKWRAEMEVLASKARAEEESRS SEKSKQREAEAGRFREL AEEAARRAAEEAKRORO AEEDAARCRAEAERVAEK AAGEATRKTEAEALKEKEAEN ERRRAEDEAFORRRREEQAAQHKADEERAORKASDSELERQKG VEDTRORROVEEE LAKASFEKAAAGKAELELEGRRSNAEDTLRSKEQAELEAAROROAAEEERRRREAEERV CKSAAEEEAARORKAAEEVERLKAKVEEARRRERAECESAROQLAQEAAQKRCRAEEK AHAFAWQOKEQELOOTCOECSVLDCLRGEAEAARRAAEEAEEARVCAEREAAQSRROWEE AERLKOSAEECACARAOAOAAAEKRKEAECREAARRACAECRAA ROKORAADAEMEKHKKFA EQTLROKAQVEQELTTLRLQLEETDHQKNLLDEELORLKAEATEAARORSOVEEELFSWRVQM EELSK KAREAENRARDKDNTORFLOEEAEKMKOVAEEAARSVAACEAARLRCRAEEDL AQQRALAEKMKEKMOAVGREATRKAEAE KELACECARROEDKEOMAOCAEET GFORTLEAEROROEMSAEAERKLRVAEMSRAQARAEEDACRFRKOAEEGEKLHRELA CREKVT VORTEORRORCRSDHDAER REAAEEREKEKOCREAKORLKSEEMOTVOCREQLLORE TCRACROSFSEKDSLORERFEOEKAKEO FODEVAKAOOLREEOOR MEOERORW ASMEEARRRCRHEAEEGWRRKOREEOCREQQRRQOEEAEENOR RECROLLEECRHRAALA HSEEVTASOVAATKT PNGRDADGPAAEAEPEHSFDGLRRKVSAQR CREAGSAEE ORLA


Figure 11


Figure 12

Proliferation of VAST KD in MDA-MB-468 (x-axis: Hours)

NUCLEIC ACIDS FOR DETECTING BREAST CANCER

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application Ser. No. 61/581,627, filed Dec. 29, 2011. The disclosure of the prior application is considered part of (and is incorporated by reference in) the disclosure of this application.

BACKGROUND

[0002] 1. Technical Field

[0003] This document relates to methods and materials involved in detecting breast cancer. For example, this document provides nucleic acids for detecting gene rearrangements (e.g., translocations) associated with breast cancer as well as methods and materials for detecting breast cancer.

[0004] 2. Background Information

[0005] Gene fusion events resulting from inversions, interstitial deletion, or translocations represent one of the most common types of genomic rearrangement. So far, the majority of fusion genes have been identified in leukemias, lymphomas, and sarcomas. Recently, the discovery of TMPRSS2-ERG fusions in prostate cancer and EML4-ALK fusion in non-small cell lung tumors suggests that gene fusion events may as well occur with a relatively high frequency in solid tumors, leading to the generation of fusion proteins with unique oncogenic properties. The BCR-ABL1 fusion gene can be used as a diagnostic marker for chronic myelogenous leukemia (CML), and is a drug target of Imatinib (Gleevec) in cells that harbor the BCR-ABL1 fusion gene. The prostate cancer specific TMPRSS2-ERG fusion events place growth regulatory genes under the influence of an androgen-regulated promoter, giving rise to an oncogene that has the potential to amplify normal androgen-dependent growth.

SUMMARY

[0006] This document provides methods and materials involved in detecting breast cancer. For example, this document provides nucleic acids for detecting gene rearrangements (e.g., translocations) associated with breast cancer as well as methods and materials for detecting breast cancer. As described herein, a patient sample (e.g., a breast tissue sample) can be assessed for the presence or absence of one or more of the gene rearrangements set forth in Table 3, 4, 5, 6, 8, or 10. In some cases, the presence of one or more gene rearrangements set forth in Table 3, 4, 5, 6, 8, or 10 can indicate that the patient has breast cancer. Detecting a gene rearrangement set forth in Table 3, 4, 5, 6, 8, or 10 can allow clinicians and patients to diagnose breast cancer in an efficient and effective manner.

[0007] In general, one aspect of this document features a primer pair comprising, or consisting essentially of, first and second primers, wherein an amplification reaction comprising the first and second primers has the ability to amplify a nucleic acid having a fusion partner A sequence and a fusion partner B sequence, wherein the fusion partner A sequence is present in a first human gene set forth in Table 3, 4, 5, 6, 8, or 10 and the fusion partner B sequence is present in a second human gene set forth in Table 3, 4, 5, 6, 8, or 10 as being a fusion partner with the first human gene. The fusion partner A sequence can be at least 10 nucleotides. The fusion partner A sequence can be at least 50 nucleotides. The fusion partner A sequence can be at least 100 nucleotides. The fusion partner B sequence can be at least 10 nucleotides. The fusion partner B sequence can be at least 50 nucleotides. The fusion partner B sequence can be at least 100 nucleotides. The first primer can be between 13 and 100 nucleotides in length. The first primer can be between 15 and 50 nucleotides in length. The second primer can be between 13 and 100 nucleotides in length. The second primer can be between 15 and 50 nucleotides in length. The fusion partner A sequence can be present in a human LIMA1 nucleic acid, and the fusion partner B sequence can be present in a human USP22 nucleic acid. The fusion partner A sequence can be present in a human ACACA nucleic acid, and the fusion partner B sequence can be present in a human STAC2 nucleic acid. The fusion partner A sequence can be present in a human FAM102A nucleic acid, and the fusion partner B sequence can be present in a human CIZ1 nucleic acid. The fusion partner A sequence can be present in a human GLB1 nucleic acid, and the fusion partner B sequence can be present in a human CMTM7 nucleic acid. The fusion partner A sequence can be present in a human MED1 nucleic acid, and the fusion partner B sequence can be present in a human STXBP4 nucleic acid. The fusion partner A sequence can be present in a human PIP4K2B nucleic acid, and the fusion partner B sequence can be present in a human RAD51C nucleic acid. The fusion partner A sequence can be present in a human RAB22A nucleic acid, and the fusion partner B sequence can be present in a human MYO9B nucleic acid. The fusion partner A sequence can be present in a human RPS6KB1 nucleic acid, and the fusion partner B sequence can be present in a human SNF8 nucleic acid. The fusion partner A sequence can be present in a human STARD3 nucleic acid, and the fusion partner B sequence can be present in a human DOK5 nucleic acid. The fusion partner A sequence can be present in a human TRPC4AP nucleic acid, and the fusion partner B sequence can be present in a human MRPL45 nucleic acid. The fusion partner A sequence can be present in a human ZMYND8 nucleic acid, and the fusion partner B sequence can be present in a human CEP250 nucleic acid. The fusion partner A sequence can be present in a human CTAGE5 nucleic acid, and the fusion partner B sequence can be present in a human SIP1 nucleic acid. The fusion partner A sequence can be present in a human MLL5 nucleic acid, and the fusion partner B sequence can be present in a human LHFPL3 nucleic acid. The fusion partner A sequence can be present in a human SEC22B nucleic acid, and the fusion partner B sequence can be present in a human NOTCH2 nucleic acid. The fusion partner A sequence can be present in a human EIF3K nucleic acid, and the fusion partner B sequence can be present in a human CYP39A1 nucleic acid. The fusion partner A sequence can be present in a human RAB7A nucleic acid, and the fusion partner B sequence can be present in a human LRCH3 nucleic acid. The fusion partner A sequence can be present in a human RNF187 nucleic acid, and the fusion partner B sequence can be present in a human OBSCN nucleic acid. The fusion partner A sequence can be present in a human SLC37A1 nucleic acid, and the fusion partner B sequence can be present in a human ABCG1 nucleic acid. The fusion partner A sequence can be present in a human EXOC7 nucleic acid, and the fusion partner B sequence can be present in a human CYTH1 nucleic acid. The fusion partner A sequence can be present in a human BRE nucleic acid, and the fusion partner B sequence can be present in a human DPYSL5 nucleic acid. The fusion partner A sequence can be present in a human CD151 nucleic acid, and the fusion partner B sequence can be present in a human DRD4 nucleic acid. The fusion partner A sequence can be present in a human LDLRAD3 nucleic acid, and the fusion partner B sequence can be present in a human TCP11L1 nucleic acid. The fusion partner A sequence can be present in a human RFT1 nucleic acid, and the fusion partner B sequence can be present in a human UQCRC2 nucleic acid. The fusion partner A sequence can be present in a human GSDMC nucleic acid, and the fusion partner B sequence can be present in a human PVT1 nucleic acid. The fusion partner A sequence can be present in a human INTS1 nucleic acid, and the fusion partner B sequence can be present in a human PRKAR1B nucleic acid. The fusion partner A sequence can be present in a human POLDIP2 nucleic acid, and the fusion partner B sequence can be present in a human BRIP1 nucleic acid. The fusion partner A sequence can be present in a human MYH9 nucleic acid, and the fusion partner B sequence can be present in a human EIF3D nucleic acid. The fusion partner A sequence can be present in a human BRIP1 nucleic acid, and the fusion partner B sequence can be present in a human TMEM49 nucleic acid. The fusion partner A sequence can be present in a human SUPT4H1 nucleic acid, and the fusion partner B sequence can be present in a human CCDC46 nucleic acid. The fusion partner A sequence can be present in a human TMEM104 nucleic acid, and the fusion partner B sequence can be present in a human CDK12 nucleic acid. The fusion partner A sequence can be present in a human RIMS2 nucleic acid, and the fusion partner B sequence can be present in a human ATP6V1C1 nucleic acid. The fusion partner A sequence can be present in a human TIAL1 nucleic acid, and the fusion partner B sequence can be present in a human C10orf119 nucleic acid. The fusion partner A sequence can be present in a human MECP2 nucleic acid, and the fusion partner B sequence can be present in a human TMLHE nucleic acid. The fusion partner A sequence can be present in a human ARID1A nucleic acid, and the fusion partner B sequence can be present in a human MAST2 nucleic acid. The fusion partner A sequence can be present in a human UBR5 nucleic acid, and the fusion partner B sequence can be present in a human SLC25A32 nucleic acid. The fusion partner A sequence can be present in a human KLHDC2 nucleic acid, and the fusion partner B sequence can be present in a human SNTB1 nucleic acid. The fusion partner A sequence can be present in a human ARID1A nucleic acid, and the fusion partner B sequence can be present in a human WDTC1 nucleic acid. The fusion partner A sequence can be present in a human HDGF nucleic acid, and the fusion partner B sequence can be present in a human S100A10 nucleic acid. The fusion partner A sequence can be present in a human PPP1R12B nucleic acid, and the fusion partner B sequence can be present in a human SNX27 nucleic acid. The fusion partner A sequence can be present in a human SRGAP2 nucleic acid, and the fusion partner B sequence can be present in a human PRPF3 nucleic acid. The fusion partner A sequence can be present in a human WIPF2 nucleic acid, and the fusion partner B sequence can be present in a human ERBB2 nucleic acid.

[0008] In another aspect, this document features an isolated nucleic acid comprising, or consisting essentially of, a fusion partner A sequence and a fusion partner B sequence, wherein the fusion partner A sequence is present in a first human gene set forth in Table 3, 4, 5, 6, 8, or 10 and the fusion partner B sequence is present in a second human gene set forth in Table 3, 4, 5, 6, 8, or 10 as being a fusion partner with the first human gene. The fusion partner A sequence can be at least 10 nucleotides. The fusion partner A sequence can be at least 50 nucleotides. The fusion partner A sequence can be at least 100 nucleotides. The fusion partner B sequence can be at least 10 nucleotides. The fusion partner B sequence can be at least 50 nucleotides. The fusion partner B sequence can be at least 100 nucleotides. The fusion partner A sequence can be present in a human LIMA1 nucleic acid, and the fusion partner B sequence can be present in a human USP22 nucleic acid. The fusion partner A sequence can be present in a human ACACA nucleic acid, and the fusion partner B sequence can be present in a human STAC2 nucleic acid. The fusion partner A sequence can be present in a human FAM102A nucleic acid, and the fusion partner B sequence can be present in a human CIZ1 nucleic acid. The fusion partner A sequence can be present in a human GLB1 nucleic acid, and the fusion partner B sequence can be present in a human CMTM7 nucleic acid. The fusion partner A sequence can be present in a human MED1 nucleic acid, and the fusion partner B sequence can be present in a human STXBP4 nucleic acid. The fusion partner A sequence can be present in a human PIP4K2B nucleic acid, and the fusion partner B sequence can be present in a human RAD51C nucleic acid. The fusion partner A sequence can be present in a human RAB22A nucleic acid, and the fusion partner B sequence can be present in a human MYO9B nucleic acid. The fusion partner A sequence can be present in a human RPS6KB1 nucleic acid, and the fusion partner B sequence can be present in a human SNF8 nucleic acid. The fusion partner A sequence can be present in a human STARD3 nucleic acid, and the fusion partner B sequence can be present in a human DOK5 nucleic acid. The fusion partner A sequence can be present in a human TRPC4AP nucleic acid, and the fusion partner B sequence can be present in a human MRPL45 nucleic acid. The fusion partner A sequence can be present in a human ZMYND8 nucleic acid, and the fusion partner B sequence can be present in a human CEP250 nucleic acid. The fusion partner A sequence can be present in a human CTAGE5 nucleic acid, and the fusion partner B sequence can be present in a human SIP1 nucleic acid. The fusion partner A sequence can be present in a human MLL5 nucleic acid, and the fusion partner B sequence can be present in a human LHFPL3 nucleic acid. The fusion partner A sequence can be present in a human SEC22B nucleic acid, and the fusion partner B sequence can be present in a human NOTCH2 nucleic acid. The fusion partner A sequence can be present in a human EIF3K nucleic acid, and the fusion partner B sequence can be present in a human CYP39A1 nucleic acid. The fusion partner A sequence can be present in a human RAB7A nucleic acid, and the fusion partner B sequence can be present in a human LRCH3 nucleic acid. The fusion partner A sequence can be present in a human RNF187 nucleic acid, and the fusion partner B sequence can be present in a human OBSCN nucleic acid. The fusion partner A sequence can be present in a human SLC37A1 nucleic acid, and the fusion partner B sequence can be present in a human ABCG1 nucleic acid. The fusion partner A sequence can be present in a human EXOC7 nucleic acid, and the fusion partner B

sequence can be present in a human CYTH1 nucleic acid. The fusion partner A sequence can be present in a human BRE nucleic acid, and the fusion partner B sequence can be present in a human DPYSL5 nucleic acid. The fusion partner A sequence can be present in a human CD151 nucleic acid, and the fusion partner B sequence can be present in a human DRD4 nucleic acid. The fusion partner A sequence can be present in a human LDLRAD3 nucleic acid, and the fusion partner B sequence can be present in a human TCP11L1 nucleic acid. The fusion partner A sequence can be present in a human RFT1 nucleic acid, and the fusion partner B sequence can be present in a human UQCRC2 nucleic acid. The fusion partner A sequence can be present in a human GSDMC nucleic acid, and the fusion partner B sequence can be present in a human PVT1 nucleic acid. The fusion partner A sequence can be present in a human INTS1 nucleic acid, and the fusion partner B sequence can be present in a human PRKAR1B nucleic acid. The fusion partner A sequence can be present in a human POLDIP2 nucleic acid, and the fusion partner B sequence can be present in a human BRIP1 nucleic acid. The fusion partner A sequence can be present in a human MYH9 nucleic acid, and the fusion partner B sequence can be present in a human EIF3D nucleic acid. The fusion partner A sequence can be present in a human BRIP1 nucleic acid, and the fusion partner B sequence can be present in a human TMEM49 nucleic acid. The fusion partner A sequence can be present in a human SUPT4H1 nucleic acid, and the fusion partner B sequence can be present in a human CCDC46 nucleic acid. The fusion partner A sequence can be present in a human TMEM104 nucleic acid, and the fusion partner B sequence can be present in a human CDK12 nucleic acid. The fusion partner A sequence can be present in a human RIMS2 nucleic acid, and the fusion partner B sequence can be present in a human ATP6V1C1 nucleic acid. The fusion partner A sequence can be present in a human TIAL1 nucleic acid, and the fusion partner B sequence can be present in a human C10orf119 nucleic acid. The fusion partner A sequence can be present in a human MECP2 nucleic acid, and the fusion partner B sequence can be present in a human TMLHE nucleic acid. The fusion partner A sequence can be present in a human ARID1A nucleic acid, and the fusion partner B sequence can be present in a human MAST2 nucleic acid. The fusion partner A sequence can be present in a human UBR5 nucleic acid, and the fusion partner B sequence can be present in a human SLC25A32 nucleic acid. The fusion partner A sequence can be present in a human KLHDC2 nucleic acid, and the fusion partner B sequence can be present in a human SNTB1 nucleic acid. The fusion partner A sequence can be present in a human ARID1A nucleic acid, and the fusion partner B sequence can be present in a human WDTC1 nucleic acid. The fusion partner A sequence can be present in a human HDGF nucleic acid, and the fusion partner B sequence can be present in a human S100A10 nucleic acid. The fusion partner A sequence can be present in a human PPP1R12B nucleic acid, and the fusion partner B sequence can be present in a human SNX27 nucleic acid. The fusion partner A sequence can be present in a human SRGAP2 nucleic acid, and the fusion partner B sequence can be present in a human PRPF3 nucleic acid. The fusion partner A sequence can be present in a human WIPF2 nucleic acid, and the fusion partner B sequence can be present in a human ERBB2 nucleic acid.

[0009] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. Although methods and materials similar or equivalent to those described herein can be used to practice the invention, suitable methods and materials are described below. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.

[0010] The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

[0011] FIG. 1 is a flow chart of the work flow of the fusion detection algorithm implemented in SnowShoes-FTD.

[0012] FIG. 2 contains photographs of PCR validation of candidate fusion products. The PCR primers were designed using the template sequences generated by SnowShoes-FTD. The double stranded cDNA libraries were constructed using total RNAs from each of the cell lines. The primer sequences and the expected PCR product sizes for each of the fusion candidates are detailed in Table 5. (a) The PCR products from 50 fusion candidates with unique isoforms. The fusion candidates were grouped by the cell lines in which the fusion candidates were discovered. (b) The PCR products from 5 fusion candidates with two fusion isoforms each. Note that there are multiple PCR bands in the lanes for CDK12-TMEM104, and the lowest bands were those from the fusion product.

[0013] FIG. 3 contains schematics of in-frame fusion transcripts and their predicted protein sequences. (a) Starting from the fusion junction spanning reads that aligned to both fusion partner genes, the two junction boundary exons from fusion partner genes A and B were identified. (b) Obtaining the IDs and sequences of all exons belonging to the two fusion partner genes A and B based on the curated refFlat file. In this example, gene A has 7 exons with the 3rd exon as the fusion boundary exon, and gene B has 10 exons with the 6th exon as the fusion boundary exon. (c) Obtaining all known transcripts for the two fusion partner genes. Gene A has two known transcripts (A1 and A2), both of which contain the fusion boundary exon. Gene B has 4 known transcripts (B1->B4), three of which (B1, B3, and B4) contain the fusion boundary exon. (d) Generating the exhaustive list of fusion transcripts using the known transcripts containing the fusion boundary exons. There are 6 possible fusion transcripts: A1-B1, A1-B3, A1-B4, A2-B1, A2-B3, and A2-B4. Note that because the differences between the transcripts B1 and B4 are "fused out", the fusion transcript of A1-B1 is identical to that of A1-B4. Similarly, A2-B1 is identical to A2-B4. The fusion transcripts that cause a frame shift in gene B are defined as "out of frame", and the ones that did not cause any frame shift are defined as "in frame" fusions. Each of the in frame fusions is translated into the amino acid sequence of the fusion protein.

[0014] FIG. 4 contains a detailed description of ARID1A-MAST2 (a) and WIPF2-ERBB2 (b) fusion transcripts. Using the process described in FIG. 3, SnowShoes-FTD uses the RNA sequence of all known transcripts of the fusion partners to predict the sequence of all potential in frame and out of frame fusion transcripts. Abundance of individual exons for

each of the fusion partners, normalized to total exon abundance, was extracted from the mRNA-Seq data.

[0015] FIG. 5 is a photograph of RT-PCR results performed using the PCR primers provided by Maher et al. (Proc. Natl. Acad. Sci. USA, 106(30): 12353-8 (2009)) for five indicated fusion transcripts. The PCR validated four of the fusion products (lanes 2-5). However, the fusion product was not observed for ARGAP19-DRG1 (lane 6). The first lane is the 50-bp ladder.

[0016] FIG. 6. Multiple fusion transcripts are expressed in breast tumors of different subtypes. Subtype specific fusion transcripts are identified with oval symbols. All fusion transcripts are given according to orientation 5' fusion partner->3' fusion partner. Transcripts are further identified according to sentinel status in each tumor subtype (S), redundancy in each subtype (R), and fusion transcript isoform detection in each subtype (I). Fusion products are identified as follows: 3'UTR=fusion that changes 3'UTR of 5' fusion partner; 5'UTR=fusion in 5'UTR of 5' fusion partner; CIF=coding in frame fusion to produce a chimaeric protein; CTT=C-terminal truncation of 5' fusion partner resulting from frame shift.

[0017] FIG. 7. Chromosomal distribution of fusion transcripts and fusion partner genes is non-random. Connection between the chromosomal loci of fusion transcripts is shown in Panel A for all sentinel fusions as well as for tumor subtype specific fusion transcripts. The chromosomal heat map (Panel B) shows the top four (red) and bottom four (green) chromosomes, identified by the genomic coordinates of fusion partner genes.

[0018] FIG. 8. Chromosomal mapping of fusion partner genes reveals tumor subtype specific clusters. Chromosomal mapping was carried out using PheGen (NCBI) to assign chromosomal coordinates of all fusion gene partners. Clusters that are uniquely associated with HER2+ tumors are designated by an arrow with a single asterisk (Chrld21.22 21.3), whereas an arrow with two asterisks designates a large ER+ cluster at chr11q13.1-q13.3, and an arrow with three asterisks identifies TN clusters at chr8q24.3, chr12q13.13, and chr17q25.1-25.3.

[0019] FIG. 9 is a listing of predicted chimeric protein products of fusion transcripts. Amino acids pertaining to 5' fusion partners are highlighted with a single underline. Amino acids pertaining to 3' fusion partners fused in frame are highlighted with a double underline. Amino acids that are inserted at fusion junctions are highlighted with a wavy underline.

[0020] FIG. 10 is a listing of the predicted amino acid sequence of the ARID1A->MAST2 fusion protein (SEQ ID NO:1530). This chimeric protein arises from a fusion transcript in which exon 1 of ARID1A (with start codon) is spliced in frame to exon 2 of MAST2. Underlined amino acids are derived from exon 1 of ARID1A, whereas the other amino acids are derived from MAST2.

[0021] FIG. 11 is a photograph demonstrating shRNA knockdown of the ARID1A->MAST2 fusion transcript.

[0022] FIG. 12 is a graph demonstrating that knockdown of the ARID1A->MAST2 fusion transcript by shRNA inhibits growth of MDA-MB-468 cultures.

DETAILED DESCRIPTION

[0023] This document provides methods and materials involved in assessing gene rearrangements (e.g., translocations). For example, this document provides methods and materials for determining whether or not a sample (e.g., breast tissue sample) from a mammal (e.g., a human) contains a gene rearrangement set forth in Table 3, 4, 5, 6, 8, or 10. In some cases, the methods and materials provided herein can be used to detect the presence of a gene rearrangement set forth in Table 3, 4, 5, 6, 8, or 10 within a breast tissue sample, thereby indicating that the breast tissue is likely to be cancerous. Detecting a gene rearrangement set forth in Table 3, 4, 5, 6, 8, or 10 can be used to diagnose breast cancer in a mammal, typically when known clinical symptoms of or known risk factors for breast cancer also are present.

[0024] The term "nucleic acid" as used herein can be RNA or DNA, including cDNA, genomic DNA, and synthetic (e.g., chemically synthesized) DNA. The nucleic acid can be double-stranded or single-stranded. Where single-stranded, the nucleic acid can be the sense strand or the antisense strand. In addition, nucleic acid can be circular or linear.

[0025] The term "isolated" as used herein with reference to nucleic acid refers to a naturally-occurring nucleic acid that is not immediately contiguous with both of the sequences with which it is immediately contiguous (one on the 5' end and one on the 3' end) in the naturally-occurring genome of the organism or cell from which it is derived. For example, an isolated nucleic acid can be, without limitation, a recombinant DNA molecule of any length, provided one of the nucleic acid sequences normally found immediately flanking that recombinant DNA molecule in a naturally-occurring genome is removed or absent. Thus, an isolated nucleic acid includes, without limitation, a recombinant DNA that exists as a separate molecule (e.g., a cDNA or a genomic DNA fragment produced by PCR or restriction endonuclease treatment) independent of other sequences as well as recombinant DNA that is incorporated into a vector, an autonomously replicating plasmid, a virus (e.g., a retrovirus, adenovirus, or herpes virus), or into the genomic DNA of a prokaryote or eukaryote. In addition, an isolated nucleic acid can include a recombinant DNA molecule that is part of a hybrid or fusion nucleic acid sequence.

[0026] The term "isolated" as used herein with reference to nucleic acid also includes any non-naturally-occurring nucleic acid since non-naturally-occurring nucleic acid sequences are not found in nature and do not have immediately contiguous sequences in a naturally-occurring genome. For example, non-naturally-occurring nucleic acid such as an engineered nucleic acid is considered to be isolated nucleic acid. Engineered nucleic acid can be made using common molecular cloning or chemical nucleic acid synthesis techniques. Isolated non-naturally-occurring nucleic acid can be independent of other sequences, or incorporated into a vector, an autonomously replicating plasmid, a virus (e.g., a retrovirus, adenovirus, or herpes virus), or the genomic DNA of a prokaryote or eukaryote. In addition, a non-naturally-occurring nucleic acid can include a nucleic acid molecule that is part of a hybrid or fusion nucleic acid sequence.

[0027] It will be apparent to those of skill in the art that a nucleic acid existing among hundreds to millions of other nucleic acid molecules within, for example, cDNA or genomic libraries, or gel slices containing a genomic DNA restriction digest is not to be considered an isolated nucleic acid.

[0028] In one embodiment, this document provides a primer pair having the ability to amplify a nucleic acid that includes (a) a first nucleic acid sequence from one gene listed in a gene rearrangement set forth in Table 3, 4, 5, 6, 8, or 10 (e.g., a fusion partner A sequence) and (b) a second nucleic

acid sequence from another gene that is listed in Table 3, 4, 5, 6, 8, or 10 as being in combination with that one gene (e.g., a fusion partner B sequence). For example, this document provides primer pairs that have the ability to amplify a nucleic acid that includes a LIMA1 nucleic acid sequence (e.g., a fusion partner A sequence) and a USP22 nucleic acid sequence (e.g., a fusion partner B sequence). The primers of the primer pair can be any appropriate length including, without limitation, lengths ranging from about 10 nucleotides to about 100 nucleotides (e.g., from about 15 nucleotides to about 100 nucleotides, from about 20 nucleotides to about 100 nucleotides, from about 15 nucleotides to about 75 nucleotides, from about 15 nucleotides to about 50 nucleotides, from about 15 nucleotides to about 25 nucleotides, from about 13 nucleotides to about 50 nucleotides, or from about 17 nucleotides to about 50 nucleotides).

[0029] The primers can be designed to amplify any appropriate length of the fusion partner A sequence and the fusion partner B sequence. For example, the fusion partner A sequence of an amplified nucleic acid can be about 5 to about 2500 nucleotides in length (e.g., about 10 to about 2500 nucleotides in length, about 15 to about 2500 nucleotides in length, about 20 to about 2500 nucleotides in length, about 25 to about 2500 nucleotides in length, about 20 to about 1000 nucleotides in length, about 20 to about 500 nucleotides in length, or about 50 to about 100 nucleotides in length), and the fusion partner B sequence of that amplified nucleic acid can be about 5 to about 2500 nucleotides in length (e.g., about 10 to about 2500 nucleotides in length, about 15 to about 2500 nucleotides in length, about 20 to about 2500 nucleotides in length, about 25 to about 2500 nucleotides in length, about 20 to about 1000 nucleotides in length, about 20 to about 500 nucleotides in length, or about 50 to about 100 nucleotides in length). In some cases, the combined length of the fusion partner A and fusion partner B sequences that are amplified can be between about 50 and about 5000 nucleotides (e.g., between about 75 and about 5000 nucleotides, between about 100 and about 5000 nucleotides, between about 250 and about 5000 nucleotides, between about 500 and about 5000 nucleotides, between about 50 and about 2500 nucleotides, between about 500 and about 2500 nucleotides, or between about 50 and about 1000 nucleotides). In some cases, the primer pairs provided herein have the ability to amplify a junction region of a gene rearrangement that involves a two gene fusion set forth in Table 3, 4, 5, 6, 8, or 10. For example, a primer pair provided herein can amplify a junction region between a RAB7A nucleic acid sequence and a LRCH3 nucleic acid sequence.

[0030] Examples of particular primer pairs for amplifying a gene rearrangement provided herein include, without limitation, those primer pairs set forth in Table 5.

[0031] This document also provides isolated nucleic acid molecules having (a) a first nucleic acid sequence from one gene listed in a gene rearrangement set forth in Table 3, 4, 5, 6, 8, or 10 (e.g., a fusion partner A sequence) and (b) a second nucleic acid sequence from another gene that is listed in Table 3, 4, 5, 6, 8, or 10 as being in combination with that one gene (e.g., a fusion partner B sequence). For example, this document provides isolated nucleic acid molecules that include a LIMA1 nucleic acid sequence (e.g., a fusion partner A sequence) and a USP22 nucleic acid sequence (e.g., a fusion partner B sequence). Other examples of isolated nucleic acid molecules provided herein include, without limitation, those having a sequence set forth in the "Fusion Transcript Coding Sequence" column of Table 6 as well as those having a sequence that encodes an amino acid sequence set forth in the "Fusion Protein Sequence" column of Table 6. The isolated nucleic acid molecules provided herein can be any appropriate length including, without limitation, lengths ranging from about 50 to about 5000 nucleotides (e.g., between about 75 and about 5000 nucleotides, between about 100 and about 5000 nucleotides, between about 250 and about 5000 nucleotides, between about 500 and about 5000 nucleotides, between about 50 and about 2500 nucleotides, between about 500 and about 2500 nucleotides, or between about 50 and about 1000 nucleotides).

[0032] As described herein, the primer pairs and isolated nucleic acid molecules provided herein can be used to determine whether or not a patient has breast cancer. For example, a patient sample (e.g., a breast tissue sample) can be assessed for the presence or absence of one or more of the gene rearrangements set forth in Table 3, 4, 5, 6, 8, or 10 using a primer pair provided herein or an isolated nucleic acid that was amplified using an amplification reaction. In some cases, the presence of one or more gene rearrangements set forth in Table 3, 4, 5, 6, 8, or 10 can indicate that the patient has breast cancer.

[0033] This document also provides methods for detecting the presence of breast cancer. Such methods can include detecting the presence of one or more gene rearrangements set forth in Table 3, 4, 5, 6, 8, or 10. Any appropriate method can be used to detect a gene rearrangement set forth in Table 3, 4, 5, 6, 8, or 10. For example, the nucleic acid amplification techniques described herein can be used to detect a gene rearrangement set forth in Table 3, 4, 5, 6, 8, or 10.

[0034] The invention will be further described in the following examples, which do not limit the scope of the invention described in the claims.

EXAMPLES

Example 1

Identification and Characterization of Fusion Transcripts in Breast Cancer and Normal Cell Lines

Breast Cell Lines

[0035] Twenty-two breast cancer cell lines and one non-tumorigenic breast epithelial cell line (MCF 10A) were obtained from the American Type Culture Collection (ATCC) (Table 1). All cell lines were thawed and expanded to allow for isolation of total RNA from low passage cells, which should exhibit minimal deviation from the ATCC type reference cells. Eight primary human mammary epithelial cell (HMEC) cultures were established from biopsies of Mayo Clinic patients undergoing evaluation of suspected breast lesions (Table 1). All of the biopsy samples from which the cell lines were derived were assessed as benign.

RNA Preparation and Sequencing

[0036] Total RNA extraction was performed using Exiqon's miRCURY RNA Isolation Kit. One microgram of total RNA was used for the sequencing library preparation, which was modified from conventional Illumina mRNA-Seq protocols to facilitate paired end RNA sequence analysis (Sun et al., PLoS ONE, 6(2): e17490 (2011)). The cDNA fragments were amplified by PCR and sequenced at both ends for 50 bases (50-base paired-end sequencing) using the Illumina

Genome Analyzer IIx. Sequencing was carried out at the Illumina assay development facility at Hayward, Calif. and at the Mayo Clinic Advanced Genomic Technology Center at Rochester, Minn. The FASTQ read files for each sample were used for further analysis.

Construction of Exhaustive One-Directional Exon Junction Database

[0037] The exon-exon boundary database was generated using the exon and gene definition files downloaded from UCSC Table Browser (table: refflat; track: RefSeq Genes; group: Genes and Gene Prediction Tracks) in reference to genome build 36 (hg18). Among 35,983 total transcripts in the refflat file, 765 transcripts with alternative haplotypes and 1,482 transcripts with multiple/redundant genomic locations were removed. Based on the exon boundaries of all transcripts defined in the curated refflat file, all possible one-directional combinations of exon-exon boundary sequences for the sequencing length of 50 bases were generated, using a developed algorithm, to ensure that no reads will map to more than one junction. The curated refflat file and its future updated versions in reference to both Genome Build 36 and 37, as well as the FASTA files of exon-exon boundary sequences for different sequencing lengths (50-, 75-, and 100-base), can be downloaded from the following website: http://mayoresearch.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm.

Analytic Workflow for Fusion Detection

[0038] With reference to FIG. 1, the SnowShoes-FTD tool consisted of (i) read alignments to both reference genome and exon junction database; (ii) annotation of aligned read pairs to identify potential fusion candidates; (iii) filtering of false positive candidates; (iv) generation of a continuous sequence region spanning fusion junction points for PCR primer design for experimental validation; (v) prediction of fusion mechanism; and (vi) prediction of the in-frame vs. out of frame fusion products and generation of the predicted protein sequences of the in-frame fusion products based on known transcripts of the two partner genes. In addition, the tool filtered out reads mapped with poor quality as described above.

Read Alignment and Filtering for Fusion Detection

[0039] The two ends of RNA-Seq reads were aligned to both the Human Reference Genome Build 36 (hg18) and exon junctions using BWA (Li and Durbin, Bioinformatics, 25(14):1754-60 (2009)) with a seed length of 32 allowing 4% of maximum edit distance. The BWA aligned reads were stored in the Sequence Alignment/Map (SAM) format (Li et al., Bioinformatics, 25(16):2078-9 (2009)). The pairs of SAM files from the alignment of two ends of the same sample were sorted according to read IDs using SAMtools (Li et al., Bioinformatics, 25(16):2078-9 (2009)). The reads with neither end mapped to genome or exon junctions are not informative and were filtered out. If the Phred-scaled Mapping Quality Score (MAPQ) of either end was less than 20, the end pair was considered low quality and was excluded from further analysis. Note that this also filtered out read pairs with either or both ends mapped to multiple locations since BWA assigns a MAPQ of zero to such reads.

Annotation of Aligned Reads

[0040] After filtering, the reads remaining in the SAM files were categorized into 5 groups: (1) reads with both ends mapped to genome locations; (2) reads with both ends mapped to exon junctions; (3) reads with one end mapped to the genome and the other mapped to exons; (4) reads with one end mapped to the genome and the other end not mapped; and (5) reads with one end mapped to exon junctions and the other not mapped. All mapped ends were annotated using the genes and exons defined in the curated refflat file. For a read to be annotated as being mapped to a gene, it was required that either the start or the end of the read be mapped within the boundaries of an exon of that gene. If a read aligned to both genome and an exon junction, the annotation from the exon junction alignment took precedence.

False Positive Filtering

[0041] There were two steps of filtering to minimize the false fusion rate that could plague nomination of fusion gene candidates. The first filtering step was performed on the read pairs that were annotated to two different genes, also known as fusion encompassing reads. This began with the filtering of fusion candidates with significant sequence similarities between the two fusion partners.

[0042] In addition, a gene distance filter was implemented to exclude fusions formed by two genes that were within M kb of each other on the reference genome, in order to eliminate chimeric transcripts that might arise from overlapping genes or transcriptional read through of adjacent genes. Furthermore, the fusion candidates with less than N fusion encompassing reads were filtered out. The second filtering step focused on the fusion candidates with supporting evidence of both fusion encompassing read pairs and fusion junction spanning reads. The mapping orientations of the end pairs were compared to the orientations of the two fusion partner genes on the genome, and the fusion candidates with inconsistent mapping orientations between end pairs were filtered out. Also, the algorithm required at least X unique fusion junction spanning reads and no more than Y fusion junction points per fusion candidate. These thresholds (M, N, X, and Y) were user defined.

Prediction of the Fusion Mechanism

[0043] If a fusion product was formed by two partner genes from two different chromosomes, a translocation was listed as the mechanism of fusion. The translocation event can be accompanied by inversion of the two partner genes that have the opposite strand orientations. When the two partner genes were located on the same chromosome, the mechanism of the fusion could be translocation alone, inversion alone, or inversion and translocation concurrently. These three scenarios were determined based on the strand orientations and the relative chromosomal positions of the two partners. However, when an intra-chromosomal fusion arose without altering the relative orders of the two partners with the same strand orientation, the fusion can be the consequence of a translocation or an interstitial deletion.

Prediction of the Fusion Protein Product

[0044] Prediction of the fusion protein sequences was carried out using all of the known transcripts of the two fusion partner genes as defined in the refflat file. As shown in FIG.
3, the two exons from each of the two fusion partner genes that aligned to the fusion spanning reads (fusion boundary exons) were first identified. Next, among all known transcripts of the two fusion partner genes, the transcripts containing the boundary exons were identified, and a list of putative fusion transcripts was generated. Each of the putative fusion transcripts was then translated into predicted amino acid sequence, and each of the putative fusion proteins was characterized as to whether it is in frame. In addition, the fusion products were categorized as: (1) coding region to coding region fusion which results in an in-frame fusion product, a frame-shift for the 3' gene, or an in-frame fusion with a single amino acid mutation at the fusion junction point. The single amino acid mutation was listed in the SnowShoes-FTD output; (2) 5' UTR to coding region fusion in which the promoter of the 5' gene fused in front of a coding region of the 3' gene; (3) 5' UTR to 3' UTR fusion in which coding regions from both partner genes were fused out; (4) 3' UTR to 3' UTR fusion in which the 5' gene was intact but the coding region of the 3' gene was fused out; (5) 5' UTR to 5' UTR fusion in which the promoter of the 5' gene potentially drives the expression of 3' gene as the consequence of the fusion; (6) 3' UTR to 5' UTR or coding region fusion in which the stop codon of the 5' gene terminates the translation of any coding regions of the 3' gene; (7) coding region to 5' UTR fusion in which the sequence between the coding region of the 5' gene and the start codon of the 3' gene may result in an insertion of single or multiple amino acids that are listed in the output file; (8) the coding region to 3' UTR fusion which may result in the shortening of the 5' gene with or without the addition of foreign amino acids.

Nucleotide Sequences Spanning Fusion Junction Points for PCR Primer Design

[0045] The chromosomal orientations of the two fusion partners, the mapping orientations of the two ends from fusion encompassing read pairs, as well as the sequence and orientation of the fusion junction spanning read(s) were used to report a template region for PCR primer design in order to quickly validate the fusion candidates with RT-PCR. From 5' to 3', the template region consisted of the exon region from partner A from the start of the exon to the fusion junction point, a “|” sign that signified the fusion junction point, and the exon region from partner B from the start of the fusion junction point to the end of the exon. Since the orientation of the primer template region did not necessarily define directionality (5' to 3') of the fusion transcript, it was necessary to use double stranded cDNAs as the template for PCR validation.

PCR and Sanger Sequencing Validations of Fusion Candidates

[0046] Double stranded cDNAs were synthesized using the total RNAs from each of the 31 cell lines. To minimize potential artifacts that might arise during library construction, different cDNA libraries were constructed and used for sequencing and for PCR validation. PCR primers were designed using the template regions recommended by SnowShoes-FTD. The 5' and 3' primers were complementary to the template regions that represent the two fusion partners, respectively. The fusion transcript was considered validated if a PCR product of the predicted size was detected. The PCR bands from randomly selected fusion transcripts were sequenced using Sanger sequencing to further confirm the nucleotide sequence of the predicted fusion junctions.

Quantification of Gene and Exon Expression Levels

[0047] The gene expression levels were calculated as the sum of the individual exon read counts and exon junction read counts. The expression levels of genes and exons were normalized using the total aligned reads from the sample and the length of the exon or gene (Reads per kilo-bases per million, RPKM).

Results

Flexibility of the Choice of Sequence Alignment Tools

[0048] There are several sequencing platforms and multiple sequence alignment algorithms designed for Next Generation sequencing of transcriptome. The SnowShoes-FTD worked with raw or post-alignment files of different platforms. When FASTQ files obtained from Illumina Genome Analyzer or HiSeq sequencers were provided as input, SnowShoes-FTD was designed such that the user can choose BWA or Bowtie (Langmead et al., Genome Biol, 10(3):R25 (2009)) for alignment. SnowShoes-FTD also was designed to accept post-alignment files (BAM) for both genome and exon junction alignments from different sequencing platforms including Life Technologies’ SOLiD sequencer. Since the exon junction database generated by SnowShoes-FTD was preferred over other publicly available junction databases, the user needed to align the reads to the exon junctions provided by SnowShoes-FTD if BAM files were provided as input files. The results reported herein were obtained using FASTQ as input files and BWA as the aligner.

User-Defined Parameters for SnowShoes-FTD

[0049] The following parameters were user-defined for detection of fusion transcripts using SnowShoes-FTD: (i) the minimum number of fusion encompassing reads (default value: 10); (ii) the minimum number of unique fusion junction spanning reads (must be > 1 with a default set to 2); (iii) the minimum distance between the two fusion partner genes if both are located on the same chromosome (default value: 100 kb); (iv) the maximum number of fusion isoforms allowed between two fusion partners (default value: 2); and (v) whether the fusion transcripts feature junction points at exon boundaries (default=Yes). The default values of the parameters were chosen to minimize false positive rate. For example, the minimum number of unique fusion junction spanning reads was set to 2 by default to avoid the false detection of fusion junction spanning reads arising from the PCR artifacts which may give multiple junction spanning reads that are identical in alignment positions. In addition, the limit of the maximum fusion isoforms between two partner genes was based on the hypothesis that if there are too many fusion isoforms between two partners, the fusion event would appear to arise from random fusion events without obvious biological significance.

List of Reference Files Available

[0050] A list of reference files was available for download in preparation for the fusion transcript detection using SnowShoes-FTD: (1) the one-directional exhaustive exon-exon junction database generated for read-lengths 50-, 75-, and 100-bases. This was provided in the FASTA format; and (2)
the curated gene and exon definition files (refflat files) from both genome builds 36 and 37. The gene and exon definition files are updated periodically. All reference files can be obtained from the SnowShoes website: http://mayoresearch.mayo.edu/mayo/research/biostat/stand-alone-packages.cfm.

Detection of Fusion Transcripts in 31 Breast Cell Lines

[0051] The SnowShoes-FTD tool was applied to the 50-base pair-end RNA-Seq data from 22 breast cancer cell lines, one established non-tumorigenic breast cell line (MCF 10A), and 8 primary HMEC cultures (Table 1). The fusion transcript candidates of these 31 breast cell lines were nominated using the default parameter values based on genome build 36 (hg18). As shown in Table 2, read pairs sequenced per sample totaled 18-33 million, among which 45-58% had both ends mapped to the genome, 3-5% had both ends mapped to exon junctions, 11-18% had one end mapped to the genome and the other mapped to exon junctions, 5-15% had one end mapped to the genome and the other not mapped, and 1-2% had one end mapped to exon junctions and the 2nd end not mapped. In addition, 2-9% of the read pairs had neither end mapped to the genome or exon junctions. 11-20% of the reads were filtered out due to low mapping quality and/or redundant mapping.

TABLE 1

Sample information of the 31 breast cell lines.

Sample Number  Sample ID        Sample Description  Lane #  Flow Cell Run Number
1              BT474            Cancer Cell Line    1       Run #1
2              MCF 10A          Non-Tumorigenic     2
3              BT-20            Cancer Cell Line    3
4              MCF7             Cancer Cell Line    4
5              MDA-MB-468       Cancer Cell Line    6
6              T47D             Cancer Cell Line    7
7              ZR-75-1          Cancer Cell Line    8
8              HCC1937          Cancer Cell Line    1       Run #2
9              HCC1954          Cancer Cell Line    2
10             HCC2218          Cancer Cell Line    3
11             HCC1599          Cancer Cell Line    4
12             HCC1395          Cancer Cell Line    5
13             BT549            Cancer Cell Line    6
14             Hs578T           Cancer Cell Line    7
15             MDA-MB-175-VII   Cancer Cell Line    8
16             MDA-MB-361       Cancer Cell Line    1       Run #3
17             MDA-MB-436       Cancer Cell Line    2
18             MDA-MB-453       Cancer Cell Line    3
19             SK-BR-3          Cancer Cell Line    4
20             UACC812          Cancer Cell Line    5
21             HCC1187          Cancer Cell Line    6
22             HCC1428          Cancer Cell Line    7
23             HCC1806          Cancer Cell Line    8
24             DHF168           Normal HMEC*        1       Run #4
25             BSO19B           Normal HMEC         2
26             BSO28            Normal HMEC         3
27             BSO29            Normal HMEC         4
28             BSO30            Normal HMEC         5
29             BSO32N           Normal HMEC         6
30             BSO36            Normal HMEC         7
31             BSO37            Normal HMEC         8

*HMEC: Human Mammary Epithelial Cells. Primary cultures from benign breast biopsy samples.
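The read-pair bookkeeping behind the percentages quoted in paragraph [0051] can be sketched in Python as follows. This is an illustrative sketch only, not the SnowShoes-FTD code. The `End` record, the category labels, and the choice to apply the MAPQ test only to mapped ends are assumptions made for the example; the MAPQ threshold of 20 and the mapping categories themselves come from the description of read filtering and annotation.

```python
from typing import NamedTuple, Optional

MIN_MAPQ = 20  # threshold stated in the text: an end with MAPQ < 20 makes the pair low quality


class End(NamedTuple):
    """One aligned read end (hypothetical record used only for this sketch)."""
    target: str  # "genome" or "junction"
    mapq: int    # Phred-scaled mapping quality


def classify_pair(end1: Optional[End], end2: Optional[End]) -> str:
    """Assign a read pair to a mapping category; None means the end is unmapped."""
    mapped = [e for e in (end1, end2) if e is not None]
    if not mapped:
        return "both_unmapped"        # not informative, filtered out
    # Assumption: the MAPQ test is applied to mapped ends only.
    if any(e.mapq < MIN_MAPQ for e in mapped):
        return "low_quality"          # also removes multi-mapped ends (MAPQ 0 in BWA)
    targets = sorted(e.target for e in mapped)
    if len(targets) == 1:
        return f"{targets[0]}_unmapped"
    if targets[0] == targets[1]:
        return f"both_{targets[0]}"
    return "genome_junction"


if __name__ == "__main__":
    print(classify_pair(End("genome", 60), End("junction", 37)))
    print(classify_pair(End("junction", 37), None))
    print(classify_pair(End("genome", 0), End("genome", 60)))
```

Counting the labels returned for every pair in a sample would give per-category totals of the kind tabulated in Table 2.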

TABLE 2 Row Cell Line: BT474 MCF 10A BT-20 MCF7 MDA-MB-468 T47D

A. Total Read Pairs 33,108,579 29,942,274 33,004.454 29,777,246 32,629,020 27,834,336 B Both ends mapped to 967,599 1,185,024 1465,055 1,287,318 1,325,523 1,233,551 exonjunctions C Both ends mapped to 15,472,214 16,126,686 17,293,975 15,491,395 15,622,395 14,106.192 genome D End 1 map to genome; 1,921,698 2,306,577 2,490,157 2,275,646 1,780,577 1991,704 End 2 map to junction E End 1 map to junctions; 1956,072 2,344,307 2,532,582 2,301,062 1,796,695 2,022,605 End 2 map to genome End 1 map to genome, 3,165,404 1,534,609 1937,424 1434,814 2,527,943 1,531,133 End 2 not mapped G End 1 not mapped, End 2 2,082,290 918,848 1,029,301 1,005,043 1,321,144 956,473 map to genome H End 1 map to exon 451,657 266,065 351,723 262,611 446,131 283,188 junction, End 2 not mapped End 1 not mapped, End 2 288,340 130,259 153,356 157,406 209,534 144,249 map to exonjunction Both Ends Not Mapped 1413,251 987,653 1,083,161 861,804 3,150,078 873,260 K Filtered (MapQ, 5,390,054 4,142,246 4,667,720 4,700,147 4449,000 4,691.981 Mappability) Total Read Pairs 33,108,579 29,942,274 33,004.454 29,777,246 32,629,020 27,834,336 M Both Ends Mapped to 967,599 1,185,024 1465,055 1,287,318 1,325,523 1,233,551 Exon Junctions N Both Ends Mapped to 15,472,214 16,126,686 17,293,975 15,491,395 15,622,395 14,106.192 Genome O One End Mapped to 3,877,770 4,650,884 5,022,739 4,576,708 3,577,272 4,014,309 Genome, One End Mapped to Exon Junction P One End Mapped to 5,247,694 2.453.457 2,966,725 2.439,857 3,849,087 2,487,606 Genome, One End Not Mapped US 2014/0065620 A1 Mar. 6, 2014

TABLE 2-continued Q One End Mapped of Exon 739,997 396,324 505,079 420,017 655,665 427,437 Junction, One End Not Mapped R Both Ends Not Mapped 1413,251 987,653 1,083,161 861,804 3,150,078 873,260 S Filtered Read Pairs 5,390,054 4,142,246 4,667,720 4,700,147 4,449,000 4,691.981 T Total Read Pairs 33,108,579 29,942,274 33,004.454 29,777,246 32,629,020 27,834,336 U Both Ends Mapped to 2.92.25% 3.95779% 4.4390% 4.323.2% 4.0624% 4.431.8% Exon Junctions V Both Ends Mapped to 46.731.7% 53.8593% 52.3989% S2.024.3% 47.8788% SO.6791% Genome W One End Mapped to 11.71.23% 1S.S328% 1521.84% 15.36.98% 10.96.35% 144221% Genome, One End Mapped to Exon Junction X One End Mapped to 15.85.00% 8.1940% 8.988.9% 8.1937% 11.7965% 8.93.72% Genome, One End Not Mapped Y One End Mapped of Exon 2.235.1% 1.323.6% 1.53O3% 1:41.05% 20095% 1.535.6% Junction, One End Not Mapped Z. Both Ends Not Mapped 4.268.5% 3.298.5% 3.28.19% 2.89.42% 9.6542% 3.1373% AA Filtered Read Pairs 16.27999 13.8341% 14.1427% 15.7844% 13.63.51% 16.8568%

Row ZR-75-1 HCC1954 HCC2218 HCC1599 HCC1395 BTS49 HSST8T MDA-MB-17SV-II MDA-MB-361 A. 28,279,001 21,368,082 21,646,565 20,839,210 20,885,816 20,564,387 21,163,489 19,975,881 18,982,847 B 906,388 1,057,290 1,057,372 1,060,908 1,028,507 992,377 1,139,460 790,780 760,538 C 12,865,780 11457,498 10.468,135 11,075,098 10,889,876 11,217,404 11,469,811 10,843,542 11,152,244 D 1,553,829 1,654.404 1,293,163 1,699,329 1,719,327 1,698,529 1902,759 1,364,672 1,217,291 E 1,569,922 1,683,656 1,315,053 1,723,060 1,745,710 1,720,513 1940,243 1,377,012 1,240,587 2,756.407 839,016 771,655 761,735 794,179 754,211 1,002,891 763,913 787,834 G 1,636,306 605,286 562,351 524,216 549,597 488,195 625,354 502,625 402,695 H 426,338 144,308 134,419 128,164 137,034 127,247 210,186 109,477 124,795 244,443 86,449 84,679 69.464 73,921 60,734 97,234 54.231 48,261 1942,352 1,104,408 1505.030 367,688 547,157 250,766 429,522 545,897 380,510 K 4,377,236 2,735,767 4454,708 3,429,548 3,400,508 3,254,411 2,346,029 3,623,732 2,868,092 28,279,001 21,368,082 21,646,565 20,839,210 20,885,816 20,564,387 21,163,489 19,975,881 18,982,847 M 906,388 1,057,290 1,057,372 1,060,908 1,028,507 992,377 1,139,460 790,780 760,538 N 12,865,780 11457,498 10.468,135 11,075,098 10,889,876 11,217,404 11,469,811 10,843,542 11,152,244 O 3,123,751 3,338,060 2,608,216 3,422,389 3,465,037 3.419,042 3,843,002 2,741,684 2,457,878 C 4,392,713 1444,302 1,334,006 1,285,951 1,343,776 1,242,406 1,628.245 1,266,538 1,190,529 Q 670,781 230,757 219,098 197,628 210,955 187,981 307.420 163,708 173,056 R 1942,352 1,104,408 1505.030 367,688 547,157 250,766 429,522 545,897 380,510 S 4,377,236 2,735,767 4454,708 3,429,548 3,400,508 3,254,411 2,346,029 3,623,732 2,868,092 T 28,279,001 21,368,082 21,646,565 20,839,210 20,885,816 20,564,387 21,163,489 19,975,881 18,982,847 U 3.2052% 4.94.80% 4.8847% S.O909% 4.924.4% 4.825.7% 5.3841% 3.9587% 4.OO64% V 45.49.59% 53.61.97% 48.3593% S3.1455% 52.14.01% 54.54779% 54-1962% 54.2832% S8.749.1% W 11.046.2% 
15.621.7% 12.0491% 16.4.228% 16.5904% 16.626.0% 18.1586% 13.725.0% 12.94.79% X 15.533.59% 6.7592% 6.1627% 6.1708% 6.433.9% 6.04.15% 7.693.79% 6.34.03% 6.271.6% Y 2.3720% 1.0799% 1.01.22% O.94.83% 1.01.00% O.91.41% 14526% O.819.5% O.91.16% Z. 6.8685% S.168.5% 6.9527% 1764.4% 2.61.98% 121.94% 2.029.5% 2.73.28% 2.OO45% AA 15.4788% 12.8O31% 20.5793% 16.4572% 16.281.4% 15.825.5% 11.0853% 18.1405% 15.108.9%

MDA-ME- MDA-MB Row 436 453 SK-BR-3 UACC812 HCC1187 HCC1428 HCC1806 HCC1937 BN1 BN2 A. 19,326,929 18,821,975 18,958,559 19,338,997 19,807,859 19,126,250 18,714,788 18,104,523 21,550,821 21,353,151 B 1,013,331 853,132 879,624 872,827 982,195 905,990 969,604 860,993 1,060,260 1,094,922 C 10,245,609 10,758,747 9,956.488 10,852,009 10,622,768 10,149,823 9,707,659 10,205.243 11,809,197 11,606,028 D 1,668,058 1425,541 1,516,716 1480,106 1449,627 1,434,064 1436,302 1496,280 1,906,149 1,657,758 E 1,687,.689 1436,359 1531,754 1,500.490 1467,590 1446,630 1451,970 1507,298 1,918,891 1,680,564 703,619 627,395 700,675 722.458 903,348 845,077 900,149 534,762 654,737 750,144 G 445,262 393.266 430,142 434,050 512,627 486,248 496,522 397,224 443,296 552,817 H 121,839 90.442 114,555 111,533 162,509 148,712 168,810 75,098 98.768 98.206 59.423 44,966 55,628 52,103 74421 69,319 74,394 41,912 51,730 57,704 403,083 225,327 428,803 275,411 524,031 645,819 546,545 338,321 470,350 495,573 K 2,979,016 2,966,800 3,344,174 3,038,010 3,108,743 2,994,568 2,962,833 2,647,392 3,137,443 3,359.435

MDA-ME- MDA-MB- MDA-ME- MDA-MB Row 436 453 SK-BR-3 UACC812 HCC1187 HCC1428 HCC1806 HCC1937 436 453 19,326,929 18,821,975 18,958,559 19,338,997 19,807,859 19,126,250 18,714,788 18,104,523 21,550,821 21,353,151 M 1,013,331 853,132 879,624 872,827 982,195 905,990 969,604 860,993 1,060,260 1,094,922 N 10,245,609 10,758,747 9,956.488 10,852,009 10,622,768 10,149,823 9,707,659 10,205.243 11,809,197 11,606,028 O 3,355,747 2,861,900 3,048,470 2,980,596 2,917,217 2,880,694 2,888,272 3,003,578 3,825,040 3,338,322 C 1,148,881 1,020,661 1,130,817 1,156,508 1415,975 1,331,325 1,396,671 931,986 1,098,033 1,302,961 Q 181,262 135,408 170,183 163,636 236,930 218,031 243.204 117,010 150,498 155,910 R 403,083 225,327 428,803 275,411 524,031 645,819 546,545 338,321 470,350 495,573 US 2014/0065620 A1 Mar. 6, 2014 10

TABLE 2-continued S 2,979,016 2,966,800 3,344,174 3,038,010 3,108,743 2,994,568 2,962,833 2,647,392 3,137,443 3,359.435 T 19,326,929 18,821,975 18,958,559 19,338,997 19,807,859 19,126,250 18,714,788 18,104,523 21,550,821 21,353,151 U 5.2431% 4.532.6% 4.6397% 4.513.3% 4.958.6% 4.7369% S.1810% 4.7557% 4.91.98% 5.12779% V 53.01.21% Sf.1606% 52.51.71% 56.11.46% 53.629.1% 53.06.75% 51.8716% 56.3685% 54.7970% 54.3528% W 17.363.1% 15.2051%. 16.0797% 1541.24%. 14.7276% 15.0615%. 15.433.1% 6.59.02%. 17.7489%. 15.633.9% X S.94.45% S422.7% S.9647% S.98.02% 7.148.6% 6.96O7% 7.4629% S.1478% S.O951% 6.10.20% Y O.93.79% O.71949/o O.89.77% O846.1% 1.1961% 1.1400% 1.299.5% O.646.3% O.6983% O.730.1% Z. 2.085.6% 1.1971% 2.261.8% 1424.1% 2.645.6% 3.376.6% 2.92.04% 1868.7% 2.1825% 2.3208% AA 15.41.38%. 15.7624%. 17.6394%. 15.7092%. 15.69.45% 15.6568%. 15.831.5% 4.6228%. 14.SS83%. 15.732.7%

Row BN3 BN4 BNS BN6 BNT BN8 Min Max A. 20,924,924 22,510,790 21,057,269 24,033,748 21,682,601 20,257,198 18,104,523 33,108,579 B 1,045,586 1,149,385 958,317 1,146,878 1,083,300 945,339 760,538 1465,055 C 11,204.204 12,296,033 11,542,896 13,355,714 12,005,466 11,203,857 9,707,659 17,293,975 D 1,861,254 2,049,317 1,723,515 2,076,865 1,673,894 1,721,057 1,217,291 2,490,157 E 1,868,123 2,062.445 1,736,873 2,089,358 1,689,254 1,732,145 1,240,587 2,532,582 657,416 645,611 639,013 762,606 741,286 646,232 534,762 3,165.404 G 445,708 425,788 417.370 495,086 515,542 425,145 393.266 2,082,290 H 99.404 97,012 91,801 113,551 111449 93,334 75,098 451,657 51,038 44,827 45,871 54,262 65,365 47,633 41912 288,340 432,782 425,987 685,134 428,998 494.662 512,827 225,327 3,150,078 K 3,259,409 3,314,385 3,216.479 3,510,430 3,302,383 2,929,629 2,346,029 5,390,054 Row SK-BR-3 UACC812 HCC1187 HCC1428 HCC1806 HCC1937 Min Max 20,924,924 22,510,790 21,057,269 24,033,748 21,682,601 20,257,198 18,104,523 33,108,579 M 1,045,586 1,149,385 958,317 1,146,878 1,083,300 945,339 760,538 1465,055 N 11,204.204 12,296,033 11,542,896 13,355,714 12,005,466 11,203,857 9,707,659 17,293,975 O 3,729,377 4,111,762 3,460,388 4,166,223 3,363,148 3,453,202 2.457,878 5,022,739 C 1,103,124 1,071,399 1,056,383 1,257,692 1,256,828 1,071,377 931,986 5,247,694 Q 150,442 141,839 137,672 167,813 176,814 140,967 117,010 739,997 R 432,782 425,987 685,134 428,998 494.662 512,827 225,327 3,150,078 S 3,259,409 3,314,385 3,216.479 3,510,430 3,302,383 2,929,629 2,346,029 5,390,054 T 20,924,924 22,510,790 21,057,269 24,033,748 21,682,601 20,257,198 18,104,523 33,108,579 U 4.99.68% S.105.9% 4.SS.10% 4.771.9% 4.9962% 4.66.67% 2.922.5% 5.3841% V 53.544.8% 54.62.28% 54.81.67% 55.57079 SS.369.1% SS.3O80% 45.49.59% S8.749.1% W 17.822.7% 18.265.7% 16.433.2% 17.3349% 15.51.08% 17.0468% 10.96.35% 18.265.7% X S.271.8% 4.75.95% S.O16.7% S.23.30% 5.7965% S.2889% 475.95% 1585.00% Y O.7190% O.63O1% O.653.8% O.6982% O.81.55% O.6959% O.630.1% 
2.3720% Z. 2.068.3% 18924% 3.253.79% 1.78.50% 2281.4% 2.53.16% 1.1971% 9.6542% AA 15.57679, 14.7235% 15.274.9% 14.606.3% 1S.23.06% 14.4622% 11.0853% 20.5793%
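As a worked check of how the count rows of Table 2 relate to its combined and percentage rows, the following Python sketch uses the BT474 column. The counts are as read from the table; the dictionary keys and variable names are hypothetical labels chosen for this example.

```python
# Read-pair counts for BT474 as read from the count rows of Table 2.
total_pairs = 33_108_579
counts = {
    "both_junction": 967_599,
    "both_genome": 15_472_214,
    "genome_then_junction": 1_921_698,
    "junction_then_genome": 1_956_072,
    "genome_then_unmapped": 3_165_404,
    "unmapped_then_genome": 2_082_290,
    "junction_then_unmapped": 451_657,
    "unmapped_then_junction": 288_340,
    "both_unmapped": 1_413_251,
    "filtered": 5_390_054,
}

# Merge the end-1/end-2 mirror categories, as the combined rows of the table do.
combined = {
    "both_junction": counts["both_junction"],
    "both_genome": counts["both_genome"],
    "genome_junction": counts["genome_then_junction"] + counts["junction_then_genome"],
    "genome_unmapped": counts["genome_then_unmapped"] + counts["unmapped_then_genome"],
    "junction_unmapped": counts["junction_then_unmapped"] + counts["unmapped_then_junction"],
    "both_unmapped": counts["both_unmapped"],
    "filtered": counts["filtered"],
}

# Express each combined category as a percentage of all sequenced read pairs.
percent = {name: 100.0 * n / total_pairs for name, n in combined.items()}

for name, n in combined.items():
    print(f"{name:18s} {n:>12,d} {percent[name]:8.4f}%")
```

For this column the categories sum to the total number of read pairs, and the computed percentages (for example about 2.92% for both ends mapped to exon junctions and about 46.73% for both ends mapped to the genome) agree with the percentage rows of the table once the OCR-displaced decimal points are accounted for.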

[0052] 55 fusion transcript candidates were nominated (Tables 3 and 4). Fifty of these had unique isoforms while the rest had 2 isoforms. As shown in FIG. 2A, all 50 fusion transcripts with a single fusion isoform were validated as evidenced by generation of PCR products of the predicted sizes. Several fusion transcripts were randomly selected for further validation using Sanger sequencing of the PCR bands. All PCR products were confirmed by Sanger sequencing with the observation that the predicted DNA sequence conformed to the actual DNA sequence of the PCR product. All isoforms were similarly validated for the 5 fusion candidates with two isoforms (FIG. 2B). The sequences of the primers used in PCR validations are set forth in Table 5, which includes the primers for the alternative isoforms of the 5 fusion candidates with 2 isoforms each.

TABLE 3

List of fusion transcripts identified.

Fusion Transcript     Mechanism  Type       In Frame  Strand  Total Read Pairs  Between Exon Boundaries  # of Fusion Isoforms
LIMA1->USP22                     inter-chr  YES               16                YES                      1
ACACA->STAC2                     intra-chr  YES               72                YES                      1
FAM102A->CIZ1                    intra-chr                    31                YES                      2
GLB1->CMTM7                      intra-chr  YES               13                YES                      1
MED1->STXBP4          I AND T    intra-chr  YES               54                YES                      1
PIP4K2B->RAD51C       I AND T    intra-chr                    15                YES                      1
RAB22A->MYO9B                    inter-chr            --      16                YES                      1
RPS6KB1->SNF8         I AND T    intra-chr  YES       --      162               YES                      1
STARD3->DOK5                     inter-chr            --      21                YES                      1
TRPC4AP->MRPL45       I AND T    inter-chr  YES               27                YES                      1
ZMYND8->CEP250                   intra-chr                    189               YES                      2
CTAGE5->SIP1                     intra-chr            --      64                YES                      1
MLL5->LHFPL3                     intra-chr            --      23                YES                      1

TABLE 3-continued

List of fusion transcripts identified.

Fusion Transcript     Mechanism  Type       In Frame  Strand  Total Read Pairs  Between Exon Boundaries  # of Fusion Isoforms
PUM1->TRERF1          T          inter-chr                    58                YES
SEC22B->NOTCH2        I AND T    intra-chr            --      22                YES
EIF3K->CYP39A1        I AND T    inter-chr  YES       --      91                YES
RAB7A->LRCH3          D OR T     intra-chr            --                        YES
RNF187->OBSCN         T          intra-chr            --                        YES
SLC37A1->ABCG1        T          intra-chr  YES       --                        YES
CYTH1->PRPSAP1        D OR T     intra-chr  YES                                 YES
EXOC7->CYTH1          T          intra-chr  YES                                 YES
BRE->DPYSL5           T          intra-chr  YES       --                        YES
CD151->DRD4           T          intra-chr            --                        YES
LDLRAD3->TCP11L1      T          intra-chr            --                        YES
RFT1->UQCRC2          I AND T    inter-chr  YES                                 YES
TAX1BP1->AHCY         I AND T    inter-chr  YES       --                        YES
NFIA->EHF             T          inter-chr  YES       --                        YES
GSDMC->PVT1           I          intra-chr                                      YES
INTS1->PRKAR1B        D OR T     intra-chr  YES                                 YES
PHF20L1->SAMD12       I AND T    intra-chr  YES       --                        YES
STRADB->NOP58         D OR T     intra-chr  YES       --                        YES
POLDIP2->BRIP1                   intra-chr                                      YES
ADAMTS19->SLC27A6     T          intra-chr            --                        YES
ARFGEF2->SULF2        I AND T    intra-chr  YES       --                        YES
ATXN7L3->FAM171A2     T          intra-chr                                      YES
BCAS4->BCAS3                     inter-chr            --
GCN1L1->MSI1                     intra-chr  YES
MYH9->EIF3D                      intra-chr  YES
RPS6KB1->DIAPH3       I AND T    inter-chr            --
SULF2->PRICKLE2                  inter-chr
ODZ4->NRG1            I AND T    inter-chr  YES
BRIP1->TMEM49         I          intra-chr
SUPT4H1->CCDC46                  intra-chr
TMEM104->CDK12                   intra-chr  YES       --
RIMS2->ATP6V1C1                  intra-chr  YES       --
TIAL1->C10orf119                 intra-chr
MECP2->TMLHE                     intra-chr
ARID1A->MAST2         D OR T     intra-chr  YES       --
UBR5->SLC25A32                   intra-chr
KLHDC2->SNTB1         I AND T    inter-chr  YES       --
ARID1A->WDTC1         D OR T     intra-chr  YES       --
HDGF->S100A10         D OR T     intra-chr  YES               154
PPP1R12B->SNX27                  intra-chr  YES       --      45
SRGAP2->PRPF3                    intra-chr  YES       --      22
WIPF2->ERBB2                     intra-chr  YES       --      66                YES

The fusion transcripts are named as the 5' gene->3' gene. For example, LIMA1->USP22 is a fusion transcript formed between two partner genes, LIMA1 and USP22, in which LIMA1 is the 5' gene and USP22 is the 3' gene. In the fusion mechanism column, T stands for translocation; I stands for inversion; and D stands for interstitial deletion. Intra-chr: intra-chromosomal fusion; Inter-chr: inter-chromosomal fusion.
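The mechanism labels used in Table 3 follow from the rules given under "Prediction of the Fusion Mechanism", and a hedged Python sketch of those rules is shown below. It is not the SnowShoes-FTD implementation. The function name, the use of an exon start coordinate as each partner's position, and the reading of "relative order unchanged" as "the 5' partner lies upstream of the 3' partner in the direction of transcription" are assumptions of this sketch. The text does not state how inversion alone is separated from inversion with translocation for partners on the same chromosome, so that case is returned unresolved. The example coordinates are exon starts as read from the exon mapping rows of Table 4.

```python
def predict_mechanism(chrom5: str, strand5: str, pos5: int,
                      chrom3: str, strand3: str, pos3: int) -> str:
    """Label the fusion mechanism for a 5' partner and a 3' partner (sketch)."""
    same_strand = strand5 == strand3

    # Partners on different chromosomes: translocation, with inversion
    # when the two genes have opposite strand orientations.
    if chrom5 != chrom3:
        return "Translocation" if same_strand else "Inversion AND Translocation"

    # Same chromosome, opposite strands: an inversion is involved, but the
    # rule separating the two inversion cases is not given in the text.
    if not same_strand:
        return "Inversion Alone or Inversion AND Translocation (unresolved)"

    # Same chromosome, same strand. Assumption: the genomic order is kept when
    # the 5' partner is upstream of the 3' partner along the transcribed strand.
    if strand5 == "+":
        order_kept = pos5 < pos3
    else:
        order_kept = pos5 > pos3
    if order_kept:
        return "Interstitial Deletion OR Translocation"
    return "Translocation"


if __name__ == "__main__":
    # CD151->DRD4: both on chr11, forward strand, 5' partner downstream of 3' partner.
    print(predict_mechanism("chr11", "+", 826768, "chr11", "+", 630400))
    # ARID1A->WDTC1: both on chr1, forward strand, genomic order kept.
    print(predict_mechanism("chr1", "+", 26895108, "chr1", "+", 27481316))
```

Under these assumptions the two examples give "Translocation" and "Interstitial Deletion OR Translocation", which match the T and D OR T entries listed for CD151->DRD4 and ARID1A->WDTC1 in Table 3.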

TABLE 4 Row FUSION GENE Potential Fusion Mechanism Type

ranslocation in OOSO8. ranslocation in OOSO8. Inversion AND Translocation in OOSO8. Interstitial Deletion OR Translocation in OOSO8. Interstitial Deletion OR Translocation in OOSO8. in OOSO8. ranslocation in OOSO8. ranslocation in OOSO8. BRIP1->TMEM49 in OOSO8. 1 CD151->DRD4 in OOSO8. CTAGES->SIP1 ranslocation in OOSO8. Interstitial Deletion OR Translocation in OOSO8. Inversion AND Translocation in OOSO8. in OOSO8. ranslocation in OOSO8. ranslocation in OOSO8. ranslocation in OOSO8. US 2014/0065620 A1 Mar. 6, 2014 12

TABLE 4-continued

Row  FUSION GENE         Potential Fusion Mechanism              Type
18   GLB1->CMTM7         Inversion Alone                         intra-chromosomal
19   GSDMC->PVT1         Inversion Alone                         intra-chromosomal
20   HDGF->S100A10       Interstitial Deletion OR Translocation  intra-chromosomal
21   INTS1->PRKAR1B      Interstitial Deletion OR Translocation  intra-chromosomal
22   KLHDC2->SNTB1       Inversion AND Translocation             inter-chromosomal
23   LDLRAD3->TCP11L1    Translocation                           intra-chromosomal
24   LIMA1->USP22        Translocation                           inter-chromosomal
25   MECP2->TMLHE        Translocation                           intra-chromosomal
26   MED1->STXBP4        Inversion AND Translocation             intra-chromosomal
27   MLL5->LHFPL3        Translocation                           intra-chromosomal
28   MYH9->EIF3D         Translocation                           intra-chromosomal
29   NFIA->EHF           Translocation                           inter-chromosomal
30   ODZ4->NRG1          Inversion AND Translocation             inter-chromosomal
31   PHF20L1->SAMD12     Inversion AND Translocation             intra-chromosomal
32   PIP4K2B->RAD51C     Inversion AND Translocation             intra-chromosomal
33   POLDIP2->BRIP1      Translocation                           intra-chromosomal
34   PPP1R12B->SNX27     Translocation                           intra-chromosomal
35   PRPF3->SRGAP2       Interstitial Deletion OR Translocation  intra-chromosomal
36   PUM1->TRERF1        Translocation                           inter-chromosomal
37   RAB22A->MYO9B       Translocation                           inter-chromosomal
38   RAB7A->LRCH3        Interstitial Deletion OR Translocation  intra-chromosomal
39   RFT1->UQCRC2        Inversion AND Translocation             inter-chromosomal
40   RIMS2->ATP6V1C1     Translocation                           intra-chromosomal
41   RNF187->OBSCN       Translocation                           intra-chromosomal
42   RPS6KB1->DIAPH3     Inversion AND Translocation             inter-chromosomal
43   RPS6KB1->SNF8       Inversion AND Translocation             intra-chromosomal
44   SEC22B->NOTCH2      Inversion AND Translocation             intra-chromosomal
45   SLC37A1->ABCG1      Translocation                           intra-chromosomal
46   SRGAP2->PRPF3       Translocation                           intra-chromosomal
47   STARD3->DOK5        Translocation                           inter-chromosomal
48   STRADB->NOP58       Interstitial Deletion OR Translocation  intra-chromosomal
49   SULF2->PRICKLE2     Translocation                           inter-chromosomal
50   SUPT4H1->CCDC46     Translocation                           intra-chromosomal
51   TAX1BP1->AHCY       Inversion AND Translocation             inter-chromosomal
52   TIAL1->C10orf119    Translocation                           intra-chromosomal
53   TMEM104->CDK12      Translocation                           intra-chromosomal
54   TMEM104->CDK12      Translocation                           intra-chromosomal
55   TRPC4AP->MRPL45     Inversion AND Translocation             inter-chromosomal
56   UBR5->SLC25A32      Translocation                           intra-chromosomal
57   WIPF2->ERBB2        Translocation                           intra-chromosomal
58   WIPF2->ERBB2        Translocation                           intra-chromosomal
59   ZMYND8->CEP250      Inversion Alone                         intra-chromosomal
60   ZMYND8->CEP250      Inversion Alone                         intra-chromosomal

Row | Inversion | Exon Mapping Information | Fusion Strand
1 | NO | E2:chr17:STAC2:NM_198993:34627645:34627952:-:285 307||E53:chr17:ACACA:NM_198839:32553565:32553662:-:1 27 | REVERSE Strand
2 | NO | E1:chr5:ADAMTS19:NM_133638:128824001:128824074:+:45 73||E9:chr5:SLC27A6:NM_014031:128391936:128392034:+:1 19 | FORWARD Strand
3 | YES | E3:chr20:SULF2:NM_198596:45798853:45799093:-:211 240||E1:chr20:ARFGEF2:NM_006420:46971681:46971954:+:273 254 | FORWARD Strand
4 | NO | E2:chr1:MAST2:NM_015112:46062691:46062839:+:21 1||E1:chr1:ARID1A:NM_006015:26895108:26896618:+:1510 1482 | FORWARD Strand
5 | NO | E1:chr1:ARID1A:NM_006015:26895108:26896618:+:1487 1510||E4:chr1:WDTC1:NM_015023:27481316:27481363:+: 26 | FORWARD Strand
6 | NO | E1:chr17:ATXN7L3:NM_001098833:39630913:39631055:-:26 1||E4:chr17:FAM171A2:NM_198475:39789323:39789482:-:159 136 | REVERSE Strand
7 | NO | E1:chr20:BCAS4:NM_017843:48844873:48845117:+:221 244||E24:chr17:BCAS3:NM_001099432:56800469:56800637:+:1 23 | FORWARD Strand
8 | NO | E2:chr2:DPYSL5:NM_020134:26974867:26975132:+:19 1||E8:chr2:BRE:NM_199192:28205641:28205751:+:110 80 | FORWARD Strand
9 | YES | E3:chr17:BRIP1:NM_032043:57291938:57292050:-:25 1||E10:chr17:TMEM49:NM_030938:55249854:55249916:+:1 25 | REVERSE Strand
10 | NO | E4:chr11:CD151:NM_139030:826768:826843:+:52 75||E4:chr11:DRD4:NM_000797:630400:630703:+:1 26 | FORWARD Strand
11 | NO | E9:chr14:SIP1:NM_001009182:38675394:38675928:+:24 1||E20:chr14:CTAGE5:NM_203354:38865818:38865977:+:159 134 | FORWARD Strand
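By way of non-limiting illustration, an entry of the Exon Mapping Information column of Table 4 can be split into its two fusion partners with a few lines of Python. The field meanings used here (exon label, chromosome, gene, RefSeq accession, exon start, exon end, strand, and a pair of positions within the exon) are inferred from the entries themselves and are assumptions, as are the names `ExonSide`, `parse_side`, and `parse_exon_mapping`, which are hypothetical and not part of SnowShoes-FTD. The accession is assumed to be written with an underscore (e.g., NM_139030).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ExonSide:
    exon: str                 # exon label, e.g. "E4"
    chrom: str                # chromosome, e.g. "chr11"
    gene: str                 # gene symbol
    refseq: str               # RefSeq accession
    start: int                # assumed meaning: exon start coordinate
    end: int                  # assumed meaning: exon end coordinate
    strand: str               # "+" or "-"
    offsets: Tuple[int, ...]  # assumed meaning: positions within the exon


def parse_side(text: str) -> ExonSide:
    """Split one colon-delimited partner description into typed fields."""
    exon, chrom, gene, refseq, start, end, strand, offsets = text.strip().split(":")
    return ExonSide(exon, chrom, gene, refseq, int(start), int(end), strand,
                    tuple(int(x) for x in offsets.split()))


def parse_exon_mapping(field: str) -> Tuple[ExonSide, ExonSide]:
    """Split a full entry at the '||' separator and parse both partners."""
    first, second = field.split("||")
    return parse_side(first), parse_side(second)


# Row 10 of Table 4 (CD151->DRD4).
first, second = parse_exon_mapping(
    "E4:chr11:CD151:NM_139030:826768:826843:+:52 75"
    "||E4:chr11:DRD4:NM_000797:630400:630703:+:1 26"
)
same_chromosome = first.chrom == second.chrom
```

For this row both partners lie on chr11, and the difference between the two coordinates of the first partner (75) equals the larger of its two within-exon positions, which is consistent with the assumed field meanings.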


TABLE 4-continued

28 r r 10 11 r 11 64 r 12 33 r 13 91 f 14 20 r 15 31 r r r 16 31 r r r 17 25 r r r 18 13 r r r 19 23 r r r 20 154 r 21 r – 22 f 23 r f 24 r r r 25 r r r 26 r r r 27 r f 28 r r r 29 r f 30 r r r 31 f r 32 r r r 33 r r r 34 r f 35 r f 36 r r r 37 r f 38 r f 39 r r r 40 r f 41 r f 42 f r 43 162 f r 22 f r 45 20 r f 46 22 r f 47 21 r f 48 10 r f 49 26 r r r 50 17 r r r 51 54 f r 52 12 r r r 53 10 r f 54 10 r f 55 27 r r r 56 28 r r r 57 66 r f 58 66 r f 59 189 E r r r E 60 189 E r r r E

Row | Number of Fusion Isoforms | Description | Recommended Sequence for Primer Design (SEQ ID NO:)
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
1 Breast Cancer Cell Line
1 Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
2 Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line
Breast Cancer Cell Line

TABLE 4-continued

Row | Description | Recommended Sequence for Primer Design (SEQ ID NO:)
21 | Breast Cancer Cell Line | 21
22 | Breast Cancer Cell Line | 22
23 | Breast Cancer Cell Line | 23
24 | Breast Cancer Cell Line | 24
25 | Breast Cancer Cell Line | 25
26 | Breast Cancer Cell Line | 26
27 | Breast Cancer Cell Line | 27
28 | Breast Cancer Cell Line | 28
29 | Breast Cancer Cell Line | 29
30 | Breast Cancer Cell Line | 30
31 | Breast Cancer Cell Line | 31
32 | Breast Cancer Cell Line | 32
33 | Breast Cancer Cell Line | 33
34 | Breast Cancer Cell Line | 34
35 | Breast Cancer Cell Line | 35
36 | Breast Cancer Cell Line | 36
37 | Breast Cancer Cell Line | 37
38 | Breast Cancer Cell Line | 38
39 | Breast Cancer Cell Line | 39
40 | Breast Cancer Cell Line | 40
41 | Breast Cancer Cell Line | 41
42 | Breast Cancer Cell Line | 42
43 | Breast Cancer Cell Line | 43
44 | Breast Cancer Cell Line | 44
45 | Breast Cancer Cell Line | 45
46 | Breast Cancer Cell Line | 46
47 | Breast Cancer Cell Line | 47
48 | Breast Cancer Cell Line | 48
49 | Breast Cancer Cell Line | 49
50 | Breast Cancer Cell Line | 50
51 | Breast Cancer Cell Line | 51
52 | Breast Cancer Cell Line | 52
53 | Breast Cancer Cell Line | 53
54 | Breast Cancer Cell Line | 54
55 | Breast Cancer Cell Line | 55
56 | Breast Cancer Cell Line | 56
57 | Breast Cancer Cell Line | 57
58 | Breast Cancer Cell Line | 58
59 | Breast Cancer Cell Line | 59
60 | Breast Cancer Cell Line | 60

TABLE 5

Fusion Gene | Primer 1 (SEQ ID NO:) | Primer 2 (SEQ ID NO:) | Product Size | Cell Line
LIMA1->USP22 | 333 | 393 | 86 | BT-20
ACACA->STAC2 | 334 | 394 | 80 | BT474
ZMYND8->CEP250 isoform 1 | 335 | 395 | 83 | BT474
ZMYND8->CEP250 isoform 2 | 336 | 396 | 96 | BT474
FAM102A->CIZ1 isoform 1 | 337 | 397 | 84 | BT474
FAM102A->CIZ1 isoform 2 | 338 | 398 | 99 | BT474
GLB1->CMTM7 | 339 | 399 | 98 | BT474
STARD3->DOK5 | 340 | 400 | 111 | BT474
MED1->STXBP4 | 341 | 401 | 94 | BT474
TRPC4AP->MRPL45 | 342 | 402 | 89 | BT474
RAB22A->MYO9B | 343 | 403 | 98 | BT474
PIP4K2B->RAD51C | 344 | 404 | 81 | BT474
RPS6KB1->SNF8 | 345 | 405 | 82 | BT474
CTAGE5->SIP1 | 346 | 406 | 80 | HCC1187
MLL5->LHFPL3 | 347 | 407 | 91 | HCC1187
SEC22B->NOTCH2 | 348 | 408 | 97 | HCC1187
PUM1->TRERF1 | 349 | 409 | 90 | HCC1187
EIF3K->CYP39A1 | 350 | 410 | 96 | HCC1395
RAB7A->LRCH3 | 351 | 411 | 100 | HCC1395
SLC37A1->ABCG1 | 352 | 412 | 88 | HCC1428
RNF187->OBSCN | 353 | 413 | 92 | HCC1428
EXOC7->CYTH1 | 354 | 414 | 83 | HCC1599
CYTH1->PRPSAP1 | 355 | 415 | 84 | HCC1599
 | 356 | 416 | 91 | HCC1806
 | 357 | 417 | 97 | HCC1806
 | 358 | 418 | 84 | HCC1806
 | 359 | 419 | 100 | HCC1806
 | 360 | 420 | 99 | HCC1806
 | 361 | 421 | 92 | HCC1937
 | 362 | 422 | 95 | HCC1954
 | 363 | 423 | 100 | HCC1954
 | 364 | 424 | 98 | HCC1954
 | 365 | 425 | 92 | HCC1954
 | 366 | 426 | 99 | HCC2218
 | 367 | 427 | 81 | MCF7
 | 368 | 428 | 98 | MCF7
 | 369 | 429 | 100 | MCF7
 | 370 | 430 | 82 | MCF7
 | 371 | 431 | 83 | MCF7
 | 372 | 432 | 97 | MCF7
 | 373 | 433 | 98 | MCF7
 | 374 | 434 | 81 | MCF7
 | 375 | 435 | 98 | MDA-MB-175VII
BRIP1->TMEM49 | 376 | 436 | 91 | MDA-MB-361
 | 377 | 437 | 96 | MDA-MB-361
TMEM104->CDK12 isoform 1 | 378 | 438 | 90 | MDA-MB-361
TMEM104->CDK12 isoform 2 | 379 | 439 | 87 | MDA-MB-361
 | 380 | 440 | 80 | MDA-MB-436
TIAL1->C10orf119 | 381 | 441 | 80 | MDA-MB-436
MECP2->TMLHE | 382 | 442 | 88 | MDA-MB-453
ARID1A->MAST2 | 383 | 443 | 120 | MDA-MB-468
UBR5->SLC25A32 | 384 | 444 | 95 | MDA-MB-468
 | 385 | 445 | 90 | SK-BR-3
 | 386 | 446 | 114 | UACC812
isoform 1 | 387 | 447 | 98 | UACC812
isoform 2 | 388 | 448 | 91 | UACC812
 | 389 | 449 | 88 | UACC812
 | 390 | 450 | 98 | UACC812
isoform 1 | 391 | 451 | 92 | UACC812
SRGAP2->PRPF3 isoform 2 | 392 | 452 | 90 | UACC812

[0053] Among the 55 fusion candidates, 30 were in-frame (Tables 3 and 6). A fusion product was defined as "in-frame" when there was no frame shift in the 3' gene, regardless of whether there was a single amino acid mutation or a single/multiple amino acid insertion at the fusion junction point. The fusion junction point mutations were also listed in Table 6. In addition, the list of fusion transcripts resulting from exhaustive combinations of all transcripts from two partner genes may contain identical fusion products if the differences between the transcripts from the same partner are "fused out." For example, as shown in FIG. 3D, the fusion transcript of A1-B4 was identical to that of A1-B1, and the fusion transcript of A2-B4 was identical to that of A2-B1. These identical fusion proteins were flagged in the SnowShoes output file (Table 6).
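By way of non-limiting illustration, the in-frame definition and the flagging of identical fusion products described above can be sketched in Python. This is a simplified sketch under stated assumptions and is not the SnowShoes implementation: the function names, their arguments, and the short nucleotide strings are hypothetical, and the frame test assumes that the 5' partner contributes coding sequence (it does not cover fusions in which only untranslated sequence is contributed).

```python
from collections import defaultdict


def is_in_frame(upstream_coding_nt: int, downstream_cds_offset: int) -> bool:
    """Return True when the reading frame of the 3' gene is preserved.

    upstream_coding_nt: coding nucleotides contributed by the 5' partner,
        including any nucleotides inserted at the junction (assumed input).
    downstream_cds_offset: coding nucleotides of the 3' partner that lie
        upstream of the junction and are therefore lost (assumed input).
    """
    return upstream_coding_nt % 3 == downstream_cds_offset % 3


def flag_identical_products(fusions: dict) -> list:
    """Group transcript-pair labels whose fused sequences are identical."""
    groups = defaultdict(list)
    for label, sequence in fusions.items():
        groups[sequence].append(label)
    return sorted(sorted(labels) for labels in groups.values() if len(labels) > 1)


# Hypothetical sequences mirroring the A1-B1/A1-B4 and A2-B1/A2-B4 example.
candidates = {
    "A1-B1": "ATGGCCAAGGTT",
    "A1-B4": "ATGGCCAAGGTT",
    "A2-B1": "ATGCCCAAGGTT",
    "A2-B4": "ATGCCCAAGGTT",
    "A1-B2": "ATGGCCTTTGTT",
}
duplicates = flag_identical_products(candidates)
```

Here `duplicates` pairs A1-B1 with A1-B4 and A2-B1 with A2-B4, while A1-B2 remains unflagged. Comparing whole sequences is the simplest choice; a pipeline working with translated products could group on the amino acid sequence instead.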


Fusion Genes Identified in MCF7 Cancer Cell Line

[0054] Fusion gene products in the MCF7 cell line had been previously described using a paired-end sequencing protocol. The list of fusion transcripts identified in the MCF7 cancer cell line using SnowShoes-FTD as described herein was compared to the list of transcripts described elsewhere (Maher et al., Proc. Natl. Acad. Sci. USA, 106(30):12353-8 (2009)). SnowShoes-FTD identified and validated 5 novel fusion transcripts that were not reported by Maher et al.: ADAMTS19-SLC27A6, ATXN7L3-FAM171A2, GCN1L1-MSI1, MYH9-EIF3D, and RPS6KB1-DIAPH3. In addition, there were 5 fusion genes identified by Maher et al. that were not detected by SnowShoes-FTD: ARHGAP19-DRG1, BC017255-TMEM49, PAPOLA-AK7, AHCYL1-RAD51C, and FCHO1-MYO9B. It was found that (i) BC017255 was no longer in the RefSeq RNA database and (ii) the distance between PAPOLA and AK7 is 65 Kb, which is smaller than the default setting of 100 Kb. In addition, no fusion junction spanning reads were observed to support this fusion. Therefore, this fusion transcript would only have been detected with a different distance threshold and by reducing the default for fusion spanning reads to 0. (iii) There are no junction spanning reads in the data set for AHCYL1-RAD51C, although 10 fusion encompassing reads supporting the existence of this fusion transcript were found. (iv) There was only one fusion junction spanning read for FCHO1-MYO9B, and the default setting for SnowShoes-FTD was "at least two unique junction spanning reads." On the other hand, no evidence was found in support of an ARHGAP19-DRG1 fusion, as the alignment file (SAM file) did not contain any read pairs that mapped to both of these genes. When RT-PCR was performed using the PCR primers provided by Maher et al. (FIG. 5), the results also supported the existence of the fusion products BC017255-TMEM49, PAPOLA-AK7, AHCYL1-RAD51C, and FCHO1-MYO9B, while no PCR product was observed for the ARHGAP19-DRG1 fusion. Thus, 4 out of 5 "known" fusion transcripts that were not identified by SnowShoes-FTD were explained by differences in the RefSeq database used for the analyses or by the choice of parameter settings for the various filtering steps. The ARHGAP19-DRG1 fusion transcript reported by Maher et al. did not appear to be expressed in the MCF7 cells that were obtained from ATCC and used in this study.

[0055] Edgren et al. (Genome Biol., 12(1):R6 (2011)) reported on detection of fusion transcripts in four breast cancer cell lines, including MCF7, in which three fusion transcripts were validated. The work described herein detected eight fusion transcripts in MCF7, including two of the three reported by Edgren et al. (BCAS4-BCAS3 and ARFGEF2-SULF2).

Pathway Analysis of Genes Involved in Fusion Transcripts in Breast Cancer Cell Lines

[0056] There were a total of 105 fusion partner genes from the 55 fusion candidates, among which 58 genes formed in-frame fusion transcripts of 30 chimeric RNAs. Pathway and regulatory network analyses of these 58 genes were performed using MetaCore (GeneGo Inc., San Diego, Calif.). There were two pathways that are enriched among these 58 genes: the non-genomic action of androgen receptor and ligand-independent activation of ESR1 and ESR2. Three GeneGo process networks were significantly enriched: androgen receptor signaling cross-talk, ESR1-nuclear pathway, and FGF/ERBB signaling. This observation suggests that fusion transcripts may have functional significance in signal transduction in breast cancer cells.

Structural Analysis of Fusion Transcripts Suggests a Preponderance of Promoter Swap Mutations, One of which May Represent a Novel Mechanism for ERBB2 Overexpression

[0057] The analytical power of the SnowShoes-FTD pipeline lies in part in the very low false detection rate and in very large part in the downstream features that predict the structure of the hypothetical fusion transcripts and the amino acid sequence of the resultant translation products. Such analyses indicated that the nature of the fusion transcripts that were detected in breast cancer cells is strikingly non-random, as evidenced by the fact that 23 of the 60 confirmed chimeric transcripts result from fusion of exon 1 of the 5'/upstream partners to the 3'/downstream partners. The most probable cause of such chimeric RNAs is a genomic rearrangement that results in juxtaposition of a promoter that potentially alters the level of expression and/or the regulation of the downstream partner in response to changes in the cellular environment. In addition, all of the fusion transcripts that were reported and validated herein map precisely to exon/exon junctions between the upstream and downstream fusion partners, suggesting that such transcripts are processed. There were only five additional fusion transcripts in which the fusion junction points were in the middle of exons (detected with different parameter settings for SnowShoes-FTD). About half of the fusion events were in frame and therefore predicted to encode fusion proteins. The preponderance of such events in these samples suggests that some of the fusion transcripts may convey a growth advantage, such that transcript enrichment results from selection. For example, MDA-MB-468 cells express an ARID1A-MAST2 fusion transcript (FIG. 4A) that might result from translocation without inversion of the ARID1A promoter (1p36.11) to the more centromeric MAST2 locus (1p34.1). Alternatively, this fusion transcript might result from interstitial deletion of those portions of chromosome 1 that intervene between exon 1 of ARID1A (coordinates 26896618) and exon 2 of MAST2 (coordinates 46062691). Juxtaposition of the ARID1A promoter would place MAST2 under the control of a promoter that is downstream of the RB1 pathway, as evidenced by the preponderance of E2F sites in the ARID1A promoter and by the observation that ARID1A is regulated in a cell cycle-dependent manner (Nagl et al., Embo J., 26(3):752-63 (2007)). Using SnowShoes-FTD, it was predicted that in-frame fusion between ARID1A exon 1 and MAST2 exon 3 will give rise to a chimeric transcript with a predicted open reading frame of 2118 amino acids. The N-terminal 378 amino acids of this hypothetical fusion protein were derived from ARID1A and appeared to contain no known or predicted functional domain. Conversely, the C-terminal 1740 amino acids were derived from MAST2 and contained the protein kinase, AGC kinase, and PDZ domains of the parental protein. It was likely that this fusion protein has serine/threonine kinase activity. Whether loss of the N-terminal 58 amino acids from MAST2, insertion of the 378 amino acid N-terminus of ARID1A, or aberrant expression of MAST2 driven from the ARID1A promoter conveyed novel oncogenic potential remains to be determined.

[0058] The level of exon expression of the fusion transcript was examined. As shown in FIG. 4A, exon 1 expression of MAST2 was significantly lower than that of the other exons (exons 2-29), which might be due to the fact that exon 1 was fused out. However, there were no obvious expression differences between the exons of the ARID1A gene. The most provocative chimeric transcript that was detected involves fusion of WIPF2 and ERBB2 RNAs. Two isoforms of the fusion were predicted and validated. These chimeric transcripts were expressed in UACC812 cells, which were derived from a HER2+ tumor (Meltzer et al., Br. J. Cancer, 63(5):727-35 (1991)). The WIPF2 locus (also known as WIRE) is located at chr17q21.2 and is transcribed towards the telomere. ERBB2 is located at chr17q11.2, centromeric to WIPF2. Like WIPF2, ERBB2 is transcribed towards the telomere. It was therefore probable that this fusion transcript arose as a result of translocation without inversion of the WIPF2 promoter to give rise to two in-frame transcripts in which the 5' untranslated region of WIPF2 is fused to one of several 5' untranslated exons of ERBB2 (FIG. 4B). The genomic structure of this hypothetical translocation remains to be verified, but the net result of such an event would be to place ERBB2 expression under control of a promoter that appears, from analysis of potential transcription factor binding sites in the WIPF2 5' flanking region, to be susceptible to regulation by NFKB, NOTCH, and MYC signaling. This hypothetical promoter swap may account, at least in part, for the observation that ERBB2 transcripts account for about 12,632 tags per million total tags, as determined from the mRNA-Seq data, which translates to about 1.3% of the total polyA+ mRNA pool in UACC812 cells. The observation that there was a dramatic increase in ERBB2 exon expression at the fusion junction (FIG. 4B) is consistent with this hypothesis.

SnowShoes-FTD Predicted Two WIPF2-ERBB2 Fusion Junctions which were Verified in UACC812 Cells

[0059] WIPF2 chromosomal coordinates 35629270 fused to ERBB2 coordinates 35104766 or 35116768. The latter coordinates fell within the coding sequence of one of the RefSeq variants of ERBB2 mRNA (exon 2 of NM_004448) and would introduce a frame shift mutation in that variant (FIG. 4B). However, two of the three predicted fusion sequences (comprised of exon 1 of WIPF2 NM_133264 fused to exon 4 or 5 of ERBB2 NM_001005862) would produce transcripts that encode full length ERBB2 protein (FIG. 4B). The sequence of full length transcripts from the mRNA-Seq data was not determined at this time. It might be necessary to clone and sequence longer cDNA fragments that correspond to the first few hundred nucleotides of the fusion transcript in order to determine which of the hypothetical transcripts are expressed. When the exon expression levels of ERBB2 were examined, exons 1-4 were found to be substantially less abundant than downstream exons, suggesting that the transcript with the first 4 exons of ERBB2 fused out might be the more plausible fusion product. These results may indicate that a novel mechanism accounts for ERBB2 overexpression in HER2+ breast cancer.

Example 2

Detection of Redundant Fusion Transcripts in Primary Breast Tumors

Paired-End RNA-Seq Analysis

[0060] Total RNA was prepared from 8 each of fresh frozen estrogen receptor positive (ER+), ERBB2 enriched (HER2+), and triple negative (TN) breast tumors. Tumors were macrodissected to remove normal tissue. RNA quality was determined using an Agilent Bioanalyzer (RIN ≥ 7.9 for all samples), and cDNA libraries were prepared and sequenced (50 nt paired-end) on the Illumina GAIIx, as described elsewhere (Sun et al., PLoS ONE, 6:e17490 (2011)), to a depth of 20-50 M end pairs per sample (Table 7). The quantity of the fusion transcripts was calculated as the number of fusion encompassing reads per million aligned reads. Normal tissue mRNA-Seq data (50-base paired-end, 73-80 million read pairs per sample) from the Body Map 2.0 project were obtained from ArrayExpress (http://www.ebi.ac.uk/arrayexpress, query ID: E-MTAB-513). Paired-end sequence data from non-transformed human mammary epithelial cells (Asmann et al., Nucleic Acids Res., 39(15):e100 (2011)) were re-analyzed as described herein.

TABLE 7

Alignment summaries for individual tumors.

Tumor ID | Total Number of Read Pairs | Both Ends Mapped to Exon Junctions | Both Ends Mapped to Genome | One End Mapped to Junction, the Other End Mapped to Genome | One End Mapped to Genome, the Other End Not Mapped | One End Mapped to Junction, the Other End Not Mapped | Sample Description
s_25 | 19,633,880 | 782,526 | 11,349,512 | 2,298,248 | 1,098,701 | 162,327 | HER2+ Breast Tumor
s_26 | 19,510,963 | 742,955 | 11,691,267 | 2,660,440 | 830,003 | 134,309 | HER2+ Breast Tumor
s_27 | 19,965,809 | 862,416 | 11,681,914 | 3,022,891 | 958,518 | 167,122 | HER2+ Breast Tumor
s_28 | 19,326,720 | 729,475 | 11,350,258 | 2,709,067 | 942,473 | 146,291 | HER2+ Breast Tumor
s_29 | 19,287,844 | 668,668 | 10,136,081 | 2,644,347 | 1,427,405 | 181,342 | HER2+ Breast Tumor
s_30 | 19,872,605 | 806,772 | 11,013,118 | 2,943,293 | 1,426,803 | 249,954 | HER2+ Breast Tumor
s_31 | 17,399,880 | 662,680 | 9,975,682 | 2,409,195 | 811,389 | 122,519 | HER2+ Breast Tumor
s_32 | 19,167,067 | 804,355 | 10,062,209 | 2,874,016 | 1,040,725 | 179,568 | HER2+ Breast Tumor
s_33 | 52,989,065 | 1,442,285 | 32,211,986 | 6,370,244 | 2,926,974 | 384,138 | ER+ Breast Tumor
s_34 | 47,666,820 | 1,481,381 | 28,330,271 | 6,340,808 | 3,099,048 | 455,741 | ER+ Breast Tumor
s_35 | 49,814,344 | 1,598,163 | 27,010,487 | 6,074,822 | 5,687,392 | 859,744 | ER+ Breast Tumor
s_36 | 50,734,654 | 1,349,033 | 23,322,513 | 5,046,612 | 8,806,497 | 1,335,350 | ER+ Breast Tumor
s_37 | 52,954,073 | 1,887,348 | 27,674,967 | 6,846,678 | 5,605,759 | 977,738 | ER+ Breast Tumor
s_38 | 51,724,496 | 2,235,084 | 27,914,085 | 8,148,819 | 2,675,758 | 465,731 | ER+ Breast Tumor
s_39 | 51,548,133 | 1,833,333 | 28,399,341 | 6,920,007 | 2,742,863 | 435,097 | ER+ Breast Tumor
s_40 | 44,112,005 | 1,916,730 | 25,100,264 | 6,332,273 | 2,451,822 | 418,968 | ER+ Breast Tumor
s_41 | 21,550,821 | 1,060,261 | 11,731,299 | 4,208,639 | 749,366 | 152,413 | Normal Breast Primary Culture
s_42 | 21,353,151 | 1,094,923 | 11,523,049 | 3,743,587 | 943,898 | 157,786 | Normal Breast Primary Culture
s_43 | 20,924,924 | 1,045,589 | 11,120,605 | 4,111,792 | 771,947 | 152,238 | Normal Breast Primary Culture
s_44 | 22,510,790 | 1,149,387 | 12,209,115 | 4,544,155 | 678,941 | 143,978 | Normal Breast Primary Culture
s_45 | 21,057,269 | 958,317 | 11,470,882 | 3,815,264 | 735,812 | 139,739 | Normal Breast Primary Culture
s_46 | 24,033,748 | 1,146,880 | 13,264,678 | 4,594,952 | 887,587 | 169,999 | Normal Breast Primary Culture
s_47 | 21,682,601 | 1,083,301 | 11,919,091 | 3,769,336 | 907,219 | 178,754 | Normal Breast Primary Culture
s_48 | 20,257,198 | 945,339 | 11,137,380 | 3,802,035 | 754,667 | 143,117 | Normal Breast Primary Culture
s_49 | 27,742,773 | 1,194,950 | 14,219,774 | 3,643,071 | 1,802,090 | 281,820 | Triple Negative Breast Tumor
s_50 | 26,038,478 | 922,741 | 15,502,762 | 3,686,465 | 1,091,731 | 155,837 | Triple Negative Breast Tumor
s_51 | 25,538,716 | 1,110,680 | 13,090,688 | 4,133,936 | 1,939,284 | 357,700 | Triple Negative Breast Tumor
s_52 | 22,224,358 | 773,848 | 12,782,913 | 3,121,694 | 937,044 | 139,107 | Triple Negative Breast Tumor
s_53 | 21,271,234 | 1,123,178 | 10,145,277 | 4,547,699 | 1,580,560 | 343,052 | Triple Negative Breast Tumor
s_54 | 25,238,796 | 724,527 | 12,992,429 | 2,910,508 | 2,842,085 | 329,963 | Triple Negative Breast Tumor
s_55 | 22,588,795 | 733,892 | 12,319,173 | 3,006,438 | 1,913,594 | 263,909 | Triple Negative Breast Tumor
s_56 | 28,685,711 | 966,103 | 15,650,142 | 3,271,160 | 1,834,194 | 228,476 | Triple Negative Breast Tumor
s_25 | 100.00% | 3.99% | 57.81% | 11.71% | 5.60% | 0.83% | HER2+ Breast Tumor
s_26 | 100.00% | 3.81% | 59.92% | 13.64% | 4.25% | 0.69% | HER2+ Breast Tumor
s_27 | 100.00% | 4.32% | 58.51% | 15.14% | 4.80% | 0.84% | HER2+ Breast Tumor
s_28 | 100.00% | 3.77% | 58.73% | 14.02% | 4.88% | 0.76% | HER2+ Breast Tumor
s_29 | 100.00% | 3.47% | 52.55% | 13.71% | 7.40% | 0.94% | HER2+ Breast Tumor
s_30 | 100.00% | 4.06% | 55.42% | 14.81% | 7.18% | 1.26% | HER2+ Breast Tumor
s_31 | 100.00% | 3.81% | 57.33% | 13.85% | 4.66% | 0.70% | HER2+ Breast Tumor
s_32 | 100.00% | 4.20% | 52.50% | 14.99% | 5.43% | 0.94% | HER2+ Breast Tumor
s_33 | 100.00% | 2.72% | 60.79% | 12.02% | 5.52% | 0.72% | ER+ Breast Tumor
s_34 | 100.00% | 3.11% | 59.43% | 13.30% | 6.50% | 0.96% | ER+ Breast Tumor
s_35 | 100.00% | 3.21% | 54.22% | 12.19% | 11.42% | 1.73% | ER+ Breast Tumor
s_36 | 100.00% | 2.66% | 45.97% | 9.95% | 17.36% | 2.63% | ER+ Breast Tumor
s_37 | 100.00% | 3.56% | 52.26% | 12.93% | 10.59% | 1.85% | ER+ Breast Tumor
s_38 | 100.00% | 4.32% | 53.97% | 15.75% | 5.17% | 0.90% | ER+ Breast Tumor
s_39 | 100.00% | 3.56% | 55.09% | 13.42% | 5.32% | 0.84% | ER+ Breast Tumor
s_40 | 100.00% | 4.35% | 56.90% | 14.35% | 5.56% | 0.95% | ER+ Breast Tumor
s_41 | 100.00% | 4.92% | 54.44% | 19.53% | 3.48% | 0.71% | Normal Breast Primary Culture
s_42 | 100.00% | 5.13% | 53.96% | 17.53% | 4.42% | 0.74% | Normal Breast Primary Culture
s_43 | 100.00% | 5.00% | 53.15% | 19.65% | 3.69% | 0.73% | Normal Breast Primary Culture
s_44 | 100.00% | 5.11% | 54.24% | 20.19% | 3.02% | 0.64% | Normal Breast Primary Culture
s_45 | 100.00% | 4.55% | 54.47% | 18.12% | 3.49% | 0.66% | Normal Breast Primary Culture
s_46 | 100.00% | 4.77% | 55.19% | 19.12% | 3.69% | 0.71% | Normal Breast Primary Culture
s_47 | 100.00% | 5.00% | 54.97% | 17.38% | 4.18% | 0.82% | Normal Breast Primary Culture
s_48 | 100.00% | 4.67% | 54.98% | 18.77% | 3.73% | 0.71% | Normal Breast Primary Culture
s_49 | 100.00% | 4.31% | 51.26% | 13.13% | 6.50% | 1.02% | Triple Negative Breast Tumor
s_50 | 100.00% | 3.54% | 59.54% | 14.16% | 4.19% | 0.60% | Triple Negative Breast Tumor
s_51 | 100.00% | 4.35% | 51.26% | 16.19% | 7.59% | 1.40% | Triple Negative Breast Tumor
s_52 | 100.00% | 3.48% | 57.52% | 14.05% | 4.22% | 0.63% | Triple Negative Breast Tumor
s_53 | 100.00% | 5.28% | 47.69% | 21.38% | 7.43% | 1.61% | Triple Negative Breast Tumor
s_54 | 100.00% | 2.87% | 51.48% | 11.53% | 11.26% | 1.31% | Triple Negative Breast Tumor
s_55 | 100.00% | 3.25% | 54.54% | 13.31% | 8.47% | 1.17% | Triple Negative Breast Tumor
s_56 | 100.00% | 3.37% | 54.56% | 11.40% | 6.39% | 0.80% | Triple Negative Breast Tumor
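By way of non-limiting illustration, the two normalizations used in this Example (the percentage rows of Table 7, and the quantity of a fusion transcript expressed as fusion encompassing reads per million aligned reads) can be sketched in Python. The function names are hypothetical, and which reads are counted as "aligned" in the denominator is an assumption that the text does not spell out; the fusion read counts in the second call are invented for the example.

```python
def percent_of_total(count: int, total_read_pairs: int) -> float:
    """Share of all read pairs falling in one alignment category, in percent."""
    return round(100.0 * count / total_read_pairs, 2)


def fusion_reads_per_million(encompassing_reads: int, aligned_reads: int) -> float:
    """Fusion encompassing reads scaled to one million aligned reads."""
    return encompassing_reads * 1_000_000 / aligned_reads


# Tumor 36 from Table 7: total read pairs, and pairs with both ends mapped
# to the genome.
s36_genome_pct = percent_of_total(23_322_513, 50_734_654)

# Hypothetical fusion: 12 encompassing reads among 20,000,000 aligned reads.
example_rpm = fusion_reads_per_million(12, 20_000_000)
```

The first call reproduces the 45.97% listed for that tumor in the percentage rows of Table 7. Scaling to a common denominator of one million reads is what allows fusion abundance to be compared between samples sequenced to different depths.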

Identification of Fusion Transcripts

[0061] End pairs were aligned to human genome build 36 using Burrows-Wheeler Aligner (BWA) (Li and Durbin, Bioinformatics, 25:1754-60 (2009)). The aligned SAM files were sorted according to read IDs using SAMtools (Li et al., Bioinformatics, 25:2078-9 (2009)). The fusion transcripts were identified using SnowShoes-FTD (Asmann et al., Nucleic Acids Res., 39(15):e100 (2011)) version 2.0, which has higher sensitivity without increasing false discovery rate, compared to version 1.0.

Fusion Encompassing Versus Fusion Spanning Reads

[0062] Fusion encompassing reads (Maher et al., Proc. Natl. Acad. Sci. USA, 106(30):12353-8 (2009)) contained 50 nucleotides from each end which map to different fusion partners. Fusion spanning reads included one end that maps within one of the two fusion partners and a second end that spans the junction between the two different fusion partners. Sentinel fusion transcripts were defined as those detected in a single tumor with 3 or more unique, tiling fusion encompassing read pairs plus 2 or more unique, tiling fusion spanning reads. Moreover, alignment of these reads must allow unambiguous assignment of directionality (5' to 3') of the two fusion partners. The initial analysis of fusion transcripts in breast cancer cell lines indicated that sentinel transcripts are predicted with very high accuracy. See, Example 1. A select subset of sentinel transcripts from the breast tumors was validated.

Private Versus Redundant Fusion Transcripts

[0063] A private fusion transcript was detected in only one tumor sample. All private transcripts, by definition, had sentinel properties. Redundant transcripts were detected in two or more tumors. A redundant transcript must exhibit sentinel properties in at least one tumor.

Tumor-Specific Fusion Transcripts

[0064] Fusion transcripts in breast tumors were filtered to remove all candidates that were also detected in either one of the control datasets: the HMEC or Body Map data. This approach was based on the assumption that such candidates represent either annotation or alignment errors or arise from germ line rearrangement polymorphisms (Hillmer et al., Genome Res., 21:665-75 (2011)).

Results

[0065] 131 sentinel fusion transcripts were detected in 24 tumors (Table 8). The majority of the fusion transcripts arose from interchromosomal fusions (104/131). Six fusion transcripts were expressed as multiple isoforms in tumors (labeled with a "+" in Table 8). The majority of the fusion transcripts were private, expressed in only one tumor sample. However, 45 sentinel transcripts were redundant, as evidenced by detection in two or more tumors (labeled with a "S" in Table 8). Redundancy was dependent upon depth of sequence. Therefore, some of the private transcripts could emerge as redundant if greater depth of sequence were obtained.
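By way of non-limiting illustration, the sentinel, private, and redundant definitions, together with the removal of candidates seen in control data, can be sketched in Python. This is a sketch under stated assumptions and is not the SnowShoes-FTD implementation: the record layout, the function names, the gene labels, and the tumor identifiers are hypothetical, the counts are assumed to already be unique, tiling reads, and a fusion is treated as "detected" in a tumor whenever any supporting record exists for it.

```python
from collections import defaultdict

MIN_ENCOMPASSING = 3  # unique, tiling fusion encompassing read pairs
MIN_SPANNING = 2      # unique, tiling fusion spanning reads


def is_sentinel(call: dict) -> bool:
    """Apply the read-support and directionality criteria to one tumor."""
    return (call["encompassing"] >= MIN_ENCOMPASSING
            and call["spanning"] >= MIN_SPANNING
            and call["oriented"])


def classify(calls: list, control_fusions: set) -> dict:
    """Label each tumor-specific fusion as 'private' or 'redundant'."""
    by_fusion = defaultdict(list)
    for call in calls:
        # Drop candidates that were also detected in a control data set.
        if call["fusion"] not in control_fusions:
            by_fusion[call["fusion"]].append(call)

    labels = {}
    for fusion, hits in by_fusion.items():
        # Sentinel properties are required in at least one tumor.
        if not any(is_sentinel(hit) for hit in hits):
            continue
        tumors = {hit["tumor"] for hit in hits}
        labels[fusion] = "private" if len(tumors) == 1 else "redundant"
    return labels


calls = [
    {"fusion": "GENE_A->GENE_B", "tumor": "T1", "encompassing": 5, "spanning": 3, "oriented": True},
    {"fusion": "GENE_A->GENE_B", "tumor": "T2", "encompassing": 1, "spanning": 1, "oriented": True},
    {"fusion": "GENE_C->GENE_D", "tumor": "T3", "encompassing": 4, "spanning": 2, "oriented": True},
    {"fusion": "GENE_E->GENE_F", "tumor": "T4", "encompassing": 2, "spanning": 2, "oriented": True},
    {"fusion": "GENE_G->GENE_H", "tumor": "T5", "encompassing": 9, "spanning": 4, "oriented": True},
]
labels = classify(calls, control_fusions={"GENE_G->GENE_H"})
```

In this example GENE_A->GENE_B is redundant (sentinel in T1 and also seen in T2), GENE_C->GENE_D is private, GENE_E->GENE_F fails the read-support thresholds, and GENE_G->GENE_H is removed because it also appears in the control set. Because weakly supported detections in a second tumor are enough to make a transcript redundant, the private/redundant split in such a scheme depends on sequencing depth.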


Tumor Subtype Distribution of Fusion Transcripts

[0066] Every tumor expressed at least one redundant fusion transcript, with a range of 1-13 redundant transcripts/tumor (Table 9). Among the redundant transcripts, seven were uniquely expressed in ER+ tumors and eight in TN tumors (labeled with oval symbols in FIG. 6), but no redundant transcript was exclusively expressed in HER2+ tumors. Private transcripts were detected at a range of 0-12/tumor (Table 9). ER+ and TN tumors expressed similar numbers of fusion transcripts, whereas HER2+ tumors expressed significantly fewer fusions (Table 9). However, a few HER2+ tumors expressed levels of fusions that were comparable to those observed in ER+ or TN tumors (see, e.g., HER2+ tumor 29 in Table 8). It is possible that the expression of large numbers of fusion transcripts is indicative of a subset of HER2+ tumors that have unusually high genomic instability, with implications for therapeutic response. Fusion transcripts represented a heretofore underappreciated class of genomic features that may have considerable potential as biomarkers or therapeutic targets in breast cancer.

TABLE 9

Distribution of fusion transcripts among tumor subtypes. Tumor subtype specific incidence was abstracted from Table 8. Statistical analysis was performed by ANOVA.

Tumor Subtype | Private Fusions | Range Private Fusions/Tumor | Number of Genes in Private Fusions | Redundant Fusions | Range Redundant Fusions/Tumor | Number of Genes in Redundant Fusions | Subtype Specific Redundant Fusions | Fusions with Multiple Isoforms
All Tumors | 86 | 0 to 12 | 149 | 45 | 1 to 13 | 76 | | 6
HER2+ Tumors | 17(1) | 0 to 5 | 34 | 19(2) | 1 to 9 | 33 | 0 | 1
ER+ Tumors | 30 | 0 to 9 | 51 | 32 | 2 to 12 | 55 | 7 | 2
TN Tumors | 39 | 2 to 12 | 68 | 32 | 3 to 13 | 53 | 8 | 3
(1) p = 0.25 re. ER+, p = 0.036 re. TN
(2) p = 0.006 re. ER+, p = 0.02 re. TN

Chromosomal Distribution of Fusion Transcript Partners

[0067] The chromosomal mapping distribution of the sentinel fusions was clearly non-random (FIG. 7A). A disproportionately large number of fusion transcript partners were located on chromosomes 1, 2, 17, and 19 (FIG. 7B), whereas relatively few fusion transcript partners are located on chromosomes 4, 9, 13, 15, 20, and 21. It was difficult, because of the relatively small numbers, to make any rigorous conclusions with respect to tumor-subtype-specific distribution of fusion transcripts. However, chromosome 19 appeared to be a hotspot for TN tumors. Circos plots of ER+ specific and TN specific redundant fusion gene partners (FIG. 7A) indicated that there is a subtype-specific fusion transcript geography, suggesting a functional link between breast tumor subtype and formation of fusion transcripts. The observation that HER2+ tumors, as a group, express significantly fewer fusion transcripts was consistent with this hypothesis.

[0068] A number of distinct clusters emerged when the fusion partner genes were mapped to genomic loci (FIG. 8). Two major clusters were observed on chromosome 17, mapping to 17q21-q23 and 17q25. Both of these regions are well-known to undergo copy number variation in breast cancer. All of the chromosome 19 fusion partners in TN tumors mapped to clusters located in the vicinity of 19p13 or 19q13. One large cluster of genes at 11q13.1-q13.4 was restricted to ER+ tumors (arrow in FIG. 8 labeled with two asterisks), a small cluster of genes at 1q21.2-q21.3 was restricted to HER2+ tumors (arrow in FIG. 8 labeled with one asterisk), and genes that clustered at 8q24.3, 12q13.13, and 17q25.1-q25.3 were restricted to TN tumors (arrows in FIG. 8 labeled with three asterisks).

[0069] Limited data from genomic analysis of both breast cancer cell lines (Edgren et al., Genome Biol., 12:R6 (2011)) and tumors (Inaki et al., Genome Res., 21:676-87 (2011); and Stephens et al., Nature, 462:1005-10 (2009)) indicate that genomic rearrangement is the primary mechanism whereby most fusion transcripts are generated. Furthermore, review of the array comparative genomic hybridization (aCGH) data on breast cancer revealed that many of the fusion partners that were identified map to regions that are known to undergo copy number gain or loss in breast tumors. This correlation was evident when one considers chromosome 17, which contained 33 genes that contributed to fusion transcripts. Among these genes, six mapped to a cluster at 17q12, 5 to 17q21, and 6 to 17q25. All three of these loci are known to undergo copy number variation in breast cancer (Stephens et al., Nature, 462:1005-10 (2009); Adelaide et al., Cancer Res., 67:11565-75 (2007); Andre et al., Clin. Cancer Res., 15:441-51 (2009); and Bae et al., World J. Surg. Oncol., 8:32 (2010)). The distribution of fusion partners on chromosome 19 was even more striking. All of the genes map to either 19p12-p13 or 19q13. Both aCGH and genome wide association data indicated that these two regions are important in breast cancer, particularly the triple negative subtype (Antoniou et al., Nat. Genet., 42:885-92 (2010); and Yang et al., Genes Chromosomes Cancer, 41:250-6 (2004)). Based on these considerations, most of the fusion transcripts appeared to arise due to chromosomal rearrangements and therefore marked areas of local chromosomal instability.

Structure and Potential Functional Significance of Predicted Fusion Transcript Products

[0070] SnowShoes-FTD assembled the predicted nucleotide sequences of the candidate fusion transcripts and translated that sequence into the predicted amino acid sequences of the putative fusion proteins (Table 10). Fusion transcripts in breast cancer cell lines fall into several broad categories based on the location within the transcription unit wherein the fusion occurs. A small number of fusions occurred in 5' UTR regions (FIG. 6), placing the coding sequence of the 3' fusion partner under the control of the promoter from the 5' fusion partner. A promoter swap event of this sort was associated with ERBB2 overexpression in a breast cancer cell line derived from a HER2+ tumor.
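By way of non-limiting illustration, assigning a fusion to a category according to where the junction falls within the transcription unit can be sketched in Python. The function names, the use of 1-based transcript coordinates, and the numeric positions are all hypothetical; the rule used for a promoter swap (junction upstream of the coding sequence in both partners, so that the full coding sequence of the 3' partner is retained) is one reading of the description above and is an assumption.

```python
def junction_region(junction_pos: int, cds_start: int, cds_end: int) -> str:
    """Locate a junction relative to the coding sequence of one partner.

    All positions are assumed to be 1-based transcript coordinates,
    counted 5' to 3'.
    """
    if junction_pos < cds_start:
        return "5' UTR"
    if junction_pos > cds_end:
        return "3' UTR"
    return "coding"


def is_promoter_swap(upstream_region: str, downstream_region: str) -> bool:
    """True when the 5' partner contributes only 5' UTR and the 3' partner
    is joined upstream of its own start codon."""
    return upstream_region == "5' UTR" and downstream_region == "5' UTR"


# Hypothetical junction positions for the two partners of one fusion.
up = junction_region(120, cds_start=200, cds_end=1700)
down = junction_region(80, cds_start=150, cds_end=3900)
swap = is_promoter_swap(up, down)
```

With these inputs both junctions fall in 5' UTR sequence, so the fusion is classified as a promoter swap. Junctions classified as "coding" in both partners would instead be candidates for the in-frame analysis, and junctions in the 3' UTR of the 5' partner would not be expected to contribute sequence from the 3' partner to the encoded protein.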
