Critical Dynamic Processes in the Pathogenesis of Cancer

Multiclass Analysis of Expression Data Robert Stengel March 27, 2006

! Background ! Small, round, blue- tumor example ! Analysis of colon cancer, metastases, and normal tissue

presented at the Institute for Advanced Study

Putative Paradigm for The Process Microarray Analysis: Expression Level Infers Cell Function & Carcinogenesis

! Up-regulation in tumor cells – Causal input (tumor enhancing gene) – Defensive response (tumor suppressor gene) – Bystander effect – Tissue effect – Artifact ! Down-regulation in tumor cells – Present, but mutated (and, therefore, not detected) – Eliminated or suppressed by tumor growth – Tissue effect – Artifact Models of Gene Microarray Classification Interaction Objectives

! Inhibition and amplification are multigene effects !Class comparison ! Effects of over/underexpression depend on regulatory pathway and sign of interaction – Identify expression profiles (feature sets) for predefined classes !Class prediction – Develop mathematical function/algorithm that predicts class membership for a novel expression profile !Class discovery – Identify new classes, sub-classes, or features related to classification objectives

200

150 Tumor Typical Pairs of Colon Cancer Normal 100 Characteristics of D14657 Microarray Expression Levels 5 0 0 Classification Features -50 (Alon, Notterman, Levine et al, 1999) -20 -10 0 1 0 2 0 3 0 4 0 M94363

! Samples not well differentiated in individual transcript clusters (overlapping)

250

200

150 Tumor Normal ! Strong feature 100 H57136 – Individual feature provides good classification 5 0 – Minimal overlap of feature values in each class 0 – Significant difference in class mean values

-50 – Low variance in class is desirable ! Additional features -100 -80 -60 -40 -20 0 2 0 4 0 6 0 8 0 – Orthogonal feature (low correlation) may add M96839 new information to the set – Co-expressed feature (high correlation) is redundant; averaging may reduce error Two-Sample Student t Test Example of Gene-by-Gene Tumor/Normal Classification by t Test for Gene Selection (data from Alon, Notterman, Levine, et al, 1999)

! 1,151 transcripts are ! t compares mean values of two data sets 2 2 over/under-expressed in # # – |t| is reduced by uncertainty in the data sets ( ) T N ! tumor/normal comparison, p t = (mT " mN )/ + – |t| is increased by number of points in the data sets (n) # 0.003 36 22 ! Genetically dissimilar # 2 # 2 ! m = mean value of data set samples are apparent t = m " m / A + B ! ! = standard deviation of data set ( A B ) ! nA nB ! n = number of points in data set “Cancer-positive probe sets” ! t test applied to mean tumor/normal expression levels of individual

! ! |t| > 3, mA ! mB with "99.7% confidence “Cancer-negative probe sets” (error p # 0.003) [n > 25]

! Dukes Stages: A -> B -> C -> D

Correlation Matrices Ensemble Mean Values (Metagenes) +1 0 –1

! Treat each probe set (row) as a redundant, corrupted ! Probe set correlation measurement of the same tumor/normal indicator

zij = ki y + "ij , i =1, m, j =1, n x = ! Compute column averages for each sample sub-group (i.e., sum each column and divide by n) 1 n ! Sample correlation ˆ z j = "zij ! n i=1 ! Statistics of random variable sums are normal by central x = limit theorem

! Class Prediction and Evaluation Clustering of Sample Averages for Primary (data from Alon, Notterman, Levine, et al 1999) Colon Cancer vs. Normal Mucosa (PPG, 144-sample set)

800.00 ! 144 samples,

200 3,437 probe 700.00 sets analyzed ! Single-feature 180 ! 47 primary prediction 600.00 colon cancer 160 – Cancer-positive Tumors ! 22 normal 140 Normals mucosa genes: 4 errors 500.00 ! Affymetrix – Cancer-negative 120 genes: 2 errors HGU-133A 400.00 100 GeneChip ! Two-feature Primary Colon Cancer Normal Mucosa ! All prediction: 1 error 80 300.00 transcripts (mislabeling?) “Present” in 60 all samples 200.00 40 Average Value for Genes Down-Regulated in Tumors

20 100.00 # Probe sets

0 Average Value of Down-Regulated Genes (Primary Colon Cancer) Up: 1067 0 20 40 60 80 100 120 140 160 180 Down: 290 Average Value for Genes Up-Regulated in Tumors 0.00 0.00 200.00 400.00 600.00 800.00 1000.00 1200.00 1400.00 1600.00 1800.00 2000.00 Constant: 19 Average Value of Up-Regulated Genes (Primary Colon Cancer)

Clustering of Sample Averages for Clustering of Sample Averages for Primary Polyp vs. Normal Mucosa Primary Polyp vs. Primary Colon Cancer (PPG, 144-sample set) (PPG, 144-sample set)

800.00 1000.00

900.00 700.00 ! 21 primary polyp 800.00 ! 21 primary polyp 600.00 ! 22 normal mucosa 700.00 ! 47 primary 500.00 colon cancer 600.00

400.00 500.00 Primary Polyp Normal Mucosa 300.00 # Probe sets 400.00 # Probe sets Primary Polyp Up: 1014 Primary Colon Cancer Up: 273 300.00 200.00 Down: 219 Down: 126 Constant: 40 200.00 Constant: 172

Average Value of Down-Regulated Genes (Primary Polyp) 100.00 Average Value of Down-Regulated Genes (Primary Polyp) 100.00

0.00 0.00 500.00 1000.00 1500.00 2000.00 2500.00 0.00 0.00 500.00 1000.00 1500.00 2000.00 2500.00 3000.00 3500.00 Average Value of Up-Regulated Genes (Primary Polyp) Average Value of Up-Regulated Genes (Primary Polyp) Identified Cancer Links in Genetic Distinctions Between Ascending Selected Gene Training Sets and Descending Colon Tumors (PPG, 144 samples) (PPG, 144-sample set)

Number of Identified Cancer Genes in Set Over-Expressed (10) Under-Expressed (10) Both (20) 1400.00 Primary Colon Cancer vs. Normal Mucosa 7 (70%) 5 (50%) 12 (60%) 1200.00 Primary Polyp vs. 1000.00 Norma Mucosa 7 (70%) 5 (50%) 12 (60%)

Primary Colon Cancer 800.00 vs. Primary Polyp 7 (70%) 6 (60%) 13 (65%)

600.00

Ascending Colon Total 21 (70%) 16 (53%) 37 (62%) Remainder of Colon 400.00 … but a significant number of genes at the bottom of the

Average Value of Down-Regulated Genes (Ascending Colon) 200.00 |t| > 3 lists also have been linked to cancer

0.00 ! Bioalma linkages via GeneCards® 0.00 200.00 400.00 600.00 800.00 1000.00 1200.00 1400.00 (Weizmann Institute of Science) Average Value of Up-Regulated Genes (Ascending Colon)

Genetic Distinctions Between Cecum, Small, Round Blue-Cell Tumor Sigmoid, and Rectosigmoid Colon Tumors (PPG, 144-sample set) Classification Example

! Childhood cancers, including

1000.00 900.00 – Ewing’s sarcoma (EWS)

900.00 – Burkitt’s Lymphoma (BL) 800.00 – Neuroblastoma (NB) 800.00 700.00 – Rhabdomyosarcoma (RMS) 700.00 600.00 ! cDNA microarray analysis presented in 600.00 RectosigmoidCecum Colon – “Classification and Diagnostic Prediction of 500.00 Remainder of Colon 500.00 Sigmoid Colon Cancers Using Profiling and Remainder of Colon 400.00 Artificial Neural Networks,” J. Khan, et al., 400.00 Nature Medicine, 7(6), June 2001, 673-679. 300.00 http://www.medstudents.com.br/patoclin/courses/ 300.00 » 96 transcripts chosen from 2308 probes for training sarcomas/dsrct/dsrct.htm » 63 EWS, BL, NB, and RMS training samples 200.00200.00 Average Value of Down-Regulated Genes (Cecum) Average Value of Down-Regulated Genes (Sigmoid) Desmoplastic small, » Classification with principal component analysis Average Value of Down-Regulated Genes (Rectosigmoid) 100.00100.00 round blue-cell tumors and a linear neural network

0.00 – Source of data for my analysis 0.00 100.00200.00100.00200.00400.00200.00300.00 600.00400.00300.00500.00800.00 400.00600.001000.00 700.00500.001200.00800.00 1400.00600.00900.00 AverageAverageAverage Value Value Value of of Up-Regulatedof Up-Regulated Up-Regulated Genes Genes Genes (Rectosigmoid) (Sigmoid) (Cecum) Overview of Present Ranking by EWS t Values SRBCT Analysis (Top and Bottom 12)

! 24 genes selected from 12 highest and lowest t values for EWS vs. remainder ! Gene selection by t test Sort by EWS t Value EWS BL NB RMS – 96 genes, 12 highest and lowest t values for Image ID Gene Description t Value t Value t Value t Value 770394 Fc fragment of IgG, receptor, transporter, alpha 12.04 -6.67 -6.17 -4.79 each class 1435862 antigen identified by monoclonal antibodies 12E7, F21 and O13 9.09 -6.75 -5.01 -4.03 377461 caveolin 1, caveolae protein, 22kD 8.82 -5.97 -4.91 -4.78 – Overlap with Khan set: 32 genes 814260 follicular lymphoma variant translocation 1 8.17 -4.31 -4.70 -5.48 ! Samples 491565 Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain 7.60 -5.82 -2.62 -3.68 ! Ensemble averaging of highest and 841641 cyclin D1 (PRAD1: parathyroid adenomatosis 1) 6.84 -9.93 0.56 -4.30 – EWS (13 tumor, 10 cell-line 1471841 ATPase, Na+/K+ transporting, alpha 1 polypeptide 6.65 -3.56 -2.72 -4.69 lowest t values for each class 866702 protein tyrosine phosphatase, non-receptor type 13 6.54 -4.99 -4.07 -4.84 samples) vs. remainder 713922 glutathione S-transferase M1 6.17 -5.61 -5.16 -1.97 308497 KIAA0467S proteintrong t-value distinctions 5.99 -6.69 -6.63 -1.11 ! Correlations of training genes and – BL (8 cell-line samples) vs. 770868 NGFI-A binding protein 2 (ERG1 binding protein 2) 5.93 -6.74 -3.88 -1.21 samples remainder 345232 lymphotoxinb ealphatw (TNFe esuperfamily,n cla membersse 1)s 5.61 -8.05 -2.49 -1.19 – NB (12 cell-line samples) vs. 786084 chromobox homolog 1 (Drosophila HP1 beta) -5.04 -1.05 9.65 -0.62 ! Clustering of ensemble averages remainder 796258 sarcoglycan, alpha (50kD dystrophin-associated glycoprotein) -5.04 -3.31 -3.86 6.83 431397 -5.04 2.64 2.19 0.64 ! Classification by sigmoidal neural – RMS (10 tumor, 10 cell-line 825411 N-acetylglucosamine receptor 1 (thyroid) -5.06 -1.45 5.79 0.76 859359 quinone oxidoreductase homolog -5.23 -7.27 0.78 5.40 network samples) vs. remainder 75254 cysteine and glycine-rich protein 2 (LIM domain only, smooth muscle) -5.30 -4.11 2.20 3.68 448386 -5.38 -0.42 3.76 0.14 68950 cyclin E1 -5.80 0.03 -1.58 5.10 ! Validation of neural network 774502 protein tyrosine phosphatase, non-receptor type 12 -5.80 -5.56 3.76 3.66 842820 inducible poly(A)-binding protein -6.14 0.60 0.66 3.80 – Novel set simulation 214572 ESTs -6.39 -0.08 -0.22 4.56 – Leave-one-out simulation 295985 ESTs -9.26 -0.13 3.24 2.95 ! Repeated for BL vs. remainder, NB vs. remainder, and RMS vs. remainder

Comparison of Present Clustering of SRBCT Training SRBCT Set with Khan Top 10 Ensemble Averages

EWS BL NB RMS Most Khan EWS set BL set Student t Student t Student t Student t Significant Gene Image ID Gene Description Value Value Value Value t Value Class insulin-like growth factor 2 296448 (somatomedin A) -4.789 -5.226 -1.185 5.998 RMS RMS Human DNA for insulin-like growth factor II (IGF-2); exon 207274 7 and additional ORF -4.377 -5.424 -1.639 5.708 RMS RMS cyclinK hD1a n(PRAD1: selec tion 841641 parathyroid adenomatosis 1) 6.841 -9.932 0.565 -4.300 BL (-) EWS/NB 365826 growthdis arrest-specifictinguished b1 y 3.551 -8.438 -6.995 1.583 BL (-) EWS/RMS 486787 calponinstro n3,g acidic, consistent -4.335 -6.354 2.446 2.605 BL (-) RMS/NB Fc fragment of IgG, receptor, NB set RMS set 770394 transporter,relatio alphan to t values 12.037 -6.673 -6.173 -4.792 EWS EWS 244618 ESTs -4.174 -4.822 -3.484 5.986 RMS RMS insulin-like growth factor 233721 binding protein 2 (36kD) 0.058 -7.487 -1.599 2.184 BL (-) Not BL 43733 glycogenin 2 4.715 -4.576 -3.834 -3.524 EWS EWS 295985 ESTs -9.260 -0.133 3.237 2.948 EWS (-) Not EWS

! Red: both sets ! Black: Khan set only Neural Networks: Model of a Single Neuron Complex Multiple Neuron Model

! Single layer " T % w1 – Number of inputs = n r = Wu + b $ T ' w » dim(u) = (n x 1) W = $ 2 ' $ ' – Number of nodes = m u s r $ T ' » dim(r) = dim(b) = dim(s) = ( ) #w n & = (m x 1) ! Vector input, u, to a single neuron y u – Sensory input or output from upstream T ! Two layers = 2 neurons r w u b – Number of nodes in each ! = + = s2(r2) = s2 (W2u1 + b2 ) – Linear operation produces scalar, r layer need not be the same – Node functions may be – Add bias, b, for zero adjustment = s2[W2 s1(r1) + b2 ] u = s r different, e.g., ! Scalar output, u, of a single ( ) » Sigmoid hidden layer ! = s2[W2 s1(W1u0 + b1) + b2 ] neuron (or node) » Linear output layer – Scalar linear or nonlinear operation, s(r) = s2[W2 s1(W1x + b1) + b2 ] !

! ! Artificial Neural Single-Layer, Multi-Node Networks Perceptron Discriminants

! Multiple inputs, nodes, and outputs – More inputs lead to more " x1% dimensions in discriminants x = $ ' # x2& – More (sigmoid) nodes lead to “higher bandwidth” in output – More outputs lead to more ! discriminants

" x1% u = s(Wx + b) $ ' x x = $ 2' #$ x3&'

! ! Neural Network SRBCT Neural Network Training Set Training and Application

! Each input row is a “metagene,” i.e., the ensemble ! Neural network average for a 12-gene set, normalized in (–1,+1) – 8 ensemble-average inputs – various # of sigmoidal " Sample1 Sample 2 Sample 3 ... Sample 62 Sample 63 % neurons $ ' – 4 linear neurons $ EWS EWS EWS ... RMS RMS ' $ ' – 4 outputs $ EWS( )Average EWS( )Average EWS( )Average ... EWS( )Average EWS( )Average' $ + + + + + ' – Training algorithm " Identifier % $ EWS(()Average EWS(()Average EWS(()Average ... EWS(()Average EWS(()Average' » Levenberg-Marquardt with $ ' $ ' Target Output = BL(+)Average BL(+)Average BL(+)Average ... BL(+)Average BL(+)Average Bayesian regularization $ ' $ ' #$Gene Training Set&' $ BL(()Average BL(()Average BL(()Average ... BL(()Average BL(()Average ' ! Training accuracy $ NB(+)Average NB(+)Average NB(+)Average ... NB(+)Average NB(+)Average ' $ ' – Train on all 63 samples $ NB(()Average NB(()Average NB(()Average ... NB(()Average NB(()Average ' $ RMS(+)Average RMS(+)Average RMS(+)Average ... RMS(+)Average RMS(+)Average' – Test on all 63 samples $ ' #$ RMS(()Average RMS(()Average RMS(()Average ... RMS(()Average RMS(()Average&' – 100% accuracy

!

Leave-One-Out Validation of Novel-Set Validation of SRBCT Neural Network SRBCT Neural Network

! Network always chooses one of four classes ! Remove a single sample (i.e., “unknown” is not an option) ! Train on remaining samples (125 times) ! Test on 25 novel samples (400 times each) ! Evaluate class of the removed sample – 5 EWS ! Repeat for each of 63 samples – 5 BL – 5 NB ! 6 sigmoids: 99.96% accuracy (3 errors – 5 RMS in 7,875 trials) – 5 samples of unknown class ! 12 sigmoids: 99.99% accuracy (1 error ! 99.96% accuracy on first 20 novel samples in 7,875 trials) (3 errors in 8,000 trials) ! 0% accuracy on unknown classes 260-Sample Colon Follow-up from 6/05 Talk: Effect Cancer Data Set* of Detection Call (PPG, 260-sample set)

! Primary colon cancer (130) ! Affymetrix HGU- ! Samples 133A GeneChip – 130 primary ! Primary polyp (31) colon cancer ! 22,283 probe sets – 24 normal ! Normal mucosa (24) ! 17,455 probe sets mucosa ! Liver metastasis (33) verified ! Detection call based on ! Normal liver (12) ! Selected transcripts are “Present” in relative ! 80% of samples intensity of Lung metastasis (20) PM and MM ! Normal lung (5) ! 7,343 probe sets probes analyzed ! Does ! Microadenoma (4) |t| ! 5 * New Methods for Cancer Detection, detection call ! High grade dysplasia (1) National Cancer Institute Program reflect data Project Grant (PPG) quality?

PPG DNA and RNA PPG Data Scheme Extraction Process (Oracle Database) Available RNA Aliquots All Aliquots by by Patient Tissue Type Patient Tissue Type

9-Class Colon Cancer 9-Class Gene Signatures by t Test Neural Network (260-Sample PPG Colon Cancer Set; 80% “Present”; 12 genes in each sub-class)

! Neural network Up in Class – 18 ensemble-average RNA inputs – 9 linear neurons – 12-48 genes in each Primary Colon Primary Normal Liver Lung Normal High Grade Cancer Polyp Mucosa Metastasis Normal Liver Metastasis Lung Microadenoma Dysplasia – various # of sigmoidal neurons – 9 outputs ensemble average RFC3 TM4SF3 KCTD9 TNPO3 MASP2 FTL FHL1 TPD52 DSPP DNMT1 S100P PPAP2A KBTBD2 IF EGFL6 DKK3 DKFZP434K046 ALS2CR8 MCM2 MAP17 PPAP2A KIAA1068 GSTO1 EEF1A1 PECAM1 APG5L ASB1 NME1 UMP-CMPK DSCR1L1 PLA2G4B DKFZP586A0522 RPS27A PLS3 CLTC PLCL1 TPX2 LGALS4 PRKACB ABCA2 FGL1 RPL27A EPAS1 ENTH UBE2D1 COPS8 MRPS27 ADAMDEC1EIF3S9 CYP4F3 RPS27 TM4SF10 PRKAG1 SP100 CTSK MAP7 UGP2 TAX1BP1 FLJ20605 RPL27 MSN CCL15 C6orf139 UBE2S TPD52 SLC4A4 PDPK1 CYB5 PTMA STOM MGC4170 SMG1 RRM2 SLC12A2 KIAA0089 UPF3B F7 TMSB4X QKI HSPC009 VPS41 CCT4 GMDS KLF4 ZNF325 MTSS1 EEF1A1 CAV1 F11R ICK COL1A2 PCM1 SDHD DDR1 DNASE1L3 RPL37A SOX13 FLJ11017 CYLC2 UBE2C EPLIN PTGER4 TIF1 SEPHS2 HIVEP2 CD81 SERPINB1 TLR1 Down in Class Primary Colon Primary Normal Liver Lung Normal High Grade Cancer Polyp Mucosa Metastasis Normal Liver Metastasis Lung Microadenoma Dysplasia ANK3 FN1 NFE2L3 FBLN1 TACSTD1 FLJ10525 HSPD1 COL1A2 SUI1 ANK3 FN1 MET MMP2 SPINT2 SCP2 CDH17 RNF13 RPL9 FLJ14146 FN1 CBFB CTSK S100A6 FPGT LGALS4 COL4A2 UBB DMXL1 WASF3 ENC1 WNT5A LGALS3 ETHE1 MGC4809 NFAT5 RPS3A ZNF24 FN1 FXYD5 IGJ STK39 ABCD3 MRPS35 COL1A1 ACTG1 ACADVL ASPN NEBL FOXF1 RAB25 HNRPK GPX2 FLJ10618 ACTG1 SYNJ2BP BGN FLJ10439 GREM1 CEACAM5 GALNT7 CEACAM5 EFEMP2 ACTG1 SULT1A1 RBMS1 S100A11 ENPP2 SPINT1 RAB2 TM4SF3 DDX26 EEF1A1 ABCA5 BGN NONO ANGPTL2 AP1M2 C14orf108 MUC13 SF1 ACTG1 MTSS1 RDX EIF3S9 GREM1 CLIC1 NDP52 VIL1 DDX26 RPS13 C9orf95 CTSD DDX56 C5orf14 FLJ10901 ZMPSTE24 PRSS3 ELOVL5 UBC KIAA0256 PLEKHC1 EIF3S9 SPARCL1 MLSTD1 NAT1 PAICS FN1 RPL13A Strong Distinctions Between Normal Tissues Training and Validation Accuracy of from Different Organs (PPG, 260-sample set) 9-Class Colon Cancer Neural Network

! Normal Lung vs. ! Normal Liver vs. ! Normal Lung vs. ! Training set: Ensemble averages of ! Leave-one-out validation Normal Colon Normal Colon Normal Liver genes with highest block correlation – Remove a single sample ! Training set accuracy – Train on remaining samples – Train on all 260 samples – Test the removed sample – Test on all samples – Repeat for each sample

Genes in Training Set Leave-One-Out Ensemble Neurons Trials Correct % Correct Correct % Correct 12 4 260 176 68% 151 58% 12 9 260 188 72% 156 60% 48 4 260 179 69% 173 67% 48 9 260 194 75% 170 65% 48 18 260 234 90% 172 66%

! 9-class PPG set is more heterogeneous that 4-class SRBCT set ! Generalization is better with fewer neurons (to a point)

Training with Principal Classification of Primary Tumors, Components of Gene Ensembles Polyps, and Normal Mucosa

! Principal components identify most significant patterns in gene ensembles ! Primary colon cancer (130), primary polyp (31), normal mucosa (24)

Genes in Training Set Leave-One-Out Genes in Training Set Leave-One-Out Ensemble Neurons Trials Correct % Correct Trials Correct % Correct Ensemble Neurons Trials Correct % Correct Correct % Correct Two Principal Components (18 Inputs) 6 3 370 323 87% 310 84% 12 4 260 168 65% 260 164 63% 6 4 370 328 89% 314 85% 24 4 260 165 63% 260 165 63% 48 4 260 145 56% 260 157 60% 6 6 370 328 89% 312 84% Four Principal Components (36 Inputs) 12 1 370 296 80% 298 81% 24 4 260 188 72% 260 162 62% 12 2 370 336 91% 320 86% 24 9 260 233 90% 260 119 46% 12 3 370 336 91% 319 86% 48 9 260 238 92% 260 136 52% 12 4 370 333 90% 319 86% Six Principal Components (54 Inputs) 12 6 370 334 90% 323 87% 12 4 2600 2000 77% 520 264 51% 24 3 370 311 84% 311 84% 12 9 3120 3058 98% 260 130 50% 24 6 370 330 89% 320 86% 12 18 1040 1034 99% 260 126 48% 24 6 370 334 90% 318 86% ! Orthogonality of principal components leads to overfitting and reduced generalization of neural network ! Significant correlation between tumors and polyps ! Imbalance in numbers of training samples High Overlap and Correlation Between Classification of Primary Tumors, Primary Tumor and Polyp Samples Metastases, and Normal Tissue

! Positive vs. negative metagenes ! Sample correlation ! Primary colon cancer (130), normal mucosa (24), liver metastasis (33), normal liver (12), lung metastasis (20), normal lung (5) ! Separation of organ-specific effects

Genes in Training Set Leave-One-Out Ensemble Neurons Trials Correct % Correct Correct % Correct 12 1 448 260 58% 275 61% 12 3 448 364 81% 343 77% 12 6 448 427 95% 367 82% 12 9 448 427 95% 379 85% 12 12 448 433 97% 377 84%

! Removing polyps, microadenomas, and high-grade dysplasia improves classification accuracy

Correlation of Primary Tumor, Classification of Primary Colon Metastases, and Normal Tissue Samples Cancer According to Location with Remainder

Colon Liver Lung ! Ascending (21), cecum (15), descending (12), rectosigmoid (14), sigmoid (30), transverse (11)

Cancer Genes in Training Set Leave-One-Out Ensemble Neurons Trials Correct % Correct Correct % Correct 6 6 103 61 59% 49 48% 6 9 103 63 61% 50 49% 6 12 103 43 42% 49 48% 12 6 103 42 41% 47 46% 12 9 103 39 38% 45 44% 24 6 103 36 35% 49 48% Normal ! Tumor locations are genetically distinct but correlated Sample Correlations with Remainder Classification of Primary Colon According to Tumor Location Cancer According to Gender

Ascending Cecum Descending ! Male (61), female (44) (incomplete)

Genes in Training Set Leave-One-Out Ensemble Neurons Trials Correct % Correct Correct % Correct 6 4 210 168 80% 164 78% 12 2 210 178 85% 177 84% 12 4 210 178 85% 176 84% 12 8 210 178 85% 176 84% Rectosigmoid Sigmoid Transverse 12 9 210 179 85% 176 84% 24 2 210 184 88% 174 83% 24 4 210 184 88% 174 83%

! Relatively strong distinction between male and female tumors

Overlap and Correlation of Classification of Primary Colon Primary Colon Cancer by Gender Cancer According to Age at Diagnosis

! 50 annotated samples (incomplete) ! Male (yellow dot), female (green square) Genes in Training Set Leave-One-Out ! Significant overlap in Ensemble Neurons Trials Correct % Correct Correct % Correct ensemble averages Discriminant = 50 24 2 200 184 92% 180 90% 24 4 400 368 92% 354 89% 48 2 400 366 92% 358 90% Discriminant = 60 ! Female (upper left), 24 2 400 372 93% 362 91% 48 2 400 360 90% 360 90% male (lower right) Discriminant = 70 ! Sample distinction 24 2 400 312 78% 305 76% not strong 24 2 400 336 84% 312 78% ! Probe distinction is strong ! Implication that genetic distinction occurs in the 60s Overlap and Correlation of Primary Colon Does High Percent “Present” Imply Cancer According to Age at Diagnosis Good Neural Network Classification? (PPG, 260-sample set)

#50 (green) #60 (green) #70 (green)

! Apparently not! 6:44 11:39 28:22

Conclusions Acknowledgments

! Search for many genes ! Gunter Schemmann, Princeton University ! Simple tests for gene selection – Mean (t test) ! Daniel Notterman, University of Medicine and Dentistry of – Covariance (correlation matrices) New Jersey ! Importance of over- and under- ! Francis Barany, Cornell Weill Medical School expressed genes ! Heterogeneity of colon cancer data ! Phillip Paty, Memorial Sloan Kettering Cancer Center ! Classifications according to ! Eytan Domany, Weizmann Institute of Science – Primary tumor, polyp, normal mucosa ! Jürg Ott, Rockefeller University – Primary tumor, metastasis, normal tissue – Primary tumor location ! Arnold Levine, Institute for Advanced Study – Patient gender and age ! Future work ! Hao Liu, University of Medicine and Dentistry of New – Sampling issues Jersey – Analysis of complete discovery set – Alternative targets for classification ! … and members of the Simons Center for Systems Biology, – Identification of functional pathways Institute for Advanced Study