Multiclass Analysis of Gene Expression Data the Protein Process
Total Page:16
File Type:pdf, Size:1020Kb
Critical Dynamic Processes in the Pathogenesis of Cancer Multiclass Analysis of Gene Expression Data Robert Stengel March 27, 2006 ! Background ! Small, round, blue-cell tumor example ! Analysis of colon cancer, metastases, and normal tissue presented at the Institute for Advanced Study Putative Paradigm for The Protein Process Microarray Analysis: Expression Level Infers Cell Function & Carcinogenesis ! Up-regulation in tumor cells – Causal input (tumor enhancing gene) – Defensive response (tumor suppressor gene) – Bystander effect – Tissue effect – Artifact ! Down-regulation in tumor cells – Present, but mutated (and, therefore, not detected) – Eliminated or suppressed by tumor growth – Tissue effect – Artifact Models of Gene Microarray Classification Interaction Objectives ! Inhibition and amplification are multigene effects !Class comparison ! Effects of over/underexpression depend on regulatory pathway and sign of interaction – Identify expression profiles (feature sets) for predefined classes !Class prediction – Develop mathematical function/algorithm that predicts class membership for a novel expression profile !Class discovery – Identify new classes, sub-classes, or features related to classification objectives 200 150 Tumor Typical Pairs of Colon Cancer Normal 100 Characteristics of D14657 Microarray Expression Levels 5 0 0 Classification Features -50 (Alon, Notterman, Levine et al, 1999) -20 -10 0 1 0 2 0 3 0 4 0 M94363 ! Samples not well differentiated in individual transcript clusters (overlapping) 250 200 150 Tumor Normal ! Strong feature 100 H57136 – Individual feature provides good classification 5 0 – Minimal overlap of feature values in each class 0 – Significant difference in class mean values -50 – Low variance in class is desirable ! Additional features -100 -80 -60 -40 -20 0 2 0 4 0 6 0 8 0 – Orthogonal feature (low correlation) may add M96839 new information to the set – Co-expressed feature (high correlation) is redundant; averaging may reduce error Two-Sample Student t Test Example of Gene-by-Gene Tumor/Normal Classification by t Test for Gene Selection (data from Alon, Notterman, Levine, et al, 1999) ! 1,151 transcripts are ! t compares mean values of two data sets 2 2 over/under-expressed in # # – |t| is reduced by uncertainty in the data sets ( ) T N ! tumor/normal comparison, p t = (mT " mN )/ + – |t| is increased by number of points in the data sets (n) # 0.003 36 22 ! Genetically dissimilar # 2 # 2 ! m = mean value of data set samples are apparent t = m " m / A + B ! ! = standard deviation of data set ( A B ) ! nA nB ! n = number of points in data set “Cancer-positive probe sets” ! t test applied to mean tumor/normal expression levels of individual genes ! ! |t| > 3, mA ! mB with "99.7% confidence “Cancer-negative probe sets” (error p # 0.003) [n > 25] ! Dukes Stages: A -> B -> C -> D Correlation Matrices Ensemble Mean Values (Metagenes) +1 0 –1 ! Treat each probe set (row) as a redundant, corrupted ! Probe set correlation measurement of the same tumor/normal indicator zij = ki y + "ij , i =1, m, j =1, n x = ! Compute column averages for each sample sub-group (i.e., sum each column and divide by n) 1 n ! Sample correlation ˆ z j = "zij ! n i=1 ! Statistics of random variable sums are normal by central x = limit theorem ! Class Prediction and Evaluation Clustering of Sample Averages for Primary (data from Alon, Notterman, Levine, et al 1999) Colon Cancer vs. Normal Mucosa (PPG, 144-sample set) 800.00 ! 144 samples, 200 3,437 probe 700.00 sets analyzed ! Single-feature 180 ! 47 primary prediction 600.00 colon cancer 160 – Cancer-positive Tumors ! 22 normal 140 Normals mucosa genes: 4 errors 500.00 ! Affymetrix – Cancer-negative 120 genes: 2 errors HGU-133A 400.00 100 GeneChip ! Two-feature Primary Colon Cancer Normal Mucosa ! All prediction: 1 error 80 300.00 transcripts (mislabeling?) “Present” in 60 all samples 200.00 40 Average Value for Genes Down-Regulated in Tumors 20 100.00 # Probe sets 0 Average Value of Down-Regulated Genes (Primary Colon Cancer) Up: 1067 0 20 40 60 80 100 120 140 160 180 Down: 290 Average Value for Genes Up-Regulated in Tumors 0.00 0.00 200.00 400.00 600.00 800.00 1000.00 1200.00 1400.00 1600.00 1800.00 2000.00 Constant: 19 Average Value of Up-Regulated Genes (Primary Colon Cancer) Clustering of Sample Averages for Clustering of Sample Averages for Primary Polyp vs. Normal Mucosa Primary Polyp vs. Primary Colon Cancer (PPG, 144-sample set) (PPG, 144-sample set) 800.00 1000.00 900.00 700.00 ! 21 primary polyp 800.00 ! 21 primary polyp 600.00 ! 22 normal mucosa 700.00 ! 47 primary 500.00 colon cancer 600.00 400.00 500.00 Primary Polyp Normal Mucosa 300.00 # Probe sets 400.00 # Probe sets Primary Polyp Up: 1014 Primary Colon Cancer Up: 273 300.00 200.00 Down: 219 Down: 126 Constant: 40 200.00 Constant: 172 Average Value of Down-Regulated Genes (Primary Polyp) 100.00 Average Value of Down-Regulated Genes (Primary Polyp) 100.00 0.00 0.00 500.00 1000.00 1500.00 2000.00 2500.00 0.00 0.00 500.00 1000.00 1500.00 2000.00 2500.00 3000.00 3500.00 Average Value of Up-Regulated Genes (Primary Polyp) Average Value of Up-Regulated Genes (Primary Polyp) Identified Cancer Links in Genetic Distinctions Between Ascending Selected Gene Training Sets and Descending Colon Tumors (PPG, 144 samples) (PPG, 144-sample set) Number of Identified Cancer Genes in Set Over-Expressed (10) Under-Expressed (10) Both (20) 1400.00 Primary Colon Cancer vs. Normal Mucosa 7 (70%) 5 (50%) 12 (60%) 1200.00 Primary Polyp vs. 1000.00 Norma Mucosa 7 (70%) 5 (50%) 12 (60%) Primary Colon Cancer 800.00 vs. Primary Polyp 7 (70%) 6 (60%) 13 (65%) 600.00 Ascending Colon Total 21 (70%) 16 (53%) 37 (62%) Remainder of Colon 400.00 … but a significant number of genes at the bottom of the Average Value of Down-Regulated Genes (Ascending Colon) 200.00 |t| > 3 lists also have been linked to cancer 0.00 ! Bioalma linkages via GeneCards® 0.00 200.00 400.00 600.00 800.00 1000.00 1200.00 1400.00 (Weizmann Institute of Science) Average Value of Up-Regulated Genes (Ascending Colon) Genetic Distinctions Between Cecum, Small, Round Blue-Cell Tumor Sigmoid, and Rectosigmoid Colon Tumors (PPG, 144-sample set) Classification Example ! Childhood cancers, including 1000.00 900.00 – Ewing’s sarcoma (EWS) 900.00 – Burkitt’s Lymphoma (BL) 800.00 – Neuroblastoma (NB) 800.00 700.00 – Rhabdomyosarcoma (RMS) 700.00 600.00 ! cDNA microarray analysis presented in 600.00 RectosigmoidCecum Colon – “Classification and Diagnostic Prediction of 500.00 Remainder of Colon 500.00 Sigmoid Colon Cancers Using Gene Expression Profiling and Remainder of Colon 400.00 Artificial Neural Networks,” J. Khan, et al., 400.00 Nature Medicine, 7(6), June 2001, 673-679. 300.00 http://www.medstudents.com.br/patoclin/courses/ 300.00 » 96 transcripts chosen from 2308 probes for training sarcomas/dsrct/dsrct.htm » 63 EWS, BL, NB, and RMS training samples 200.00200.00 Average Value of Down-Regulated Genes (Cecum) Average Value of Down-Regulated Genes (Sigmoid) Desmoplastic small, » Classification with principal component analysis Average Value of Down-Regulated Genes (Rectosigmoid) 100.00100.00 round blue-cell tumors and a linear neural network 0.00 – Source of data for my analysis 0.00 100.00200.00100.00200.00400.00200.00300.00 600.00400.00300.00500.00800.00 400.00600.001000.00 700.00500.001200.00800.00 1400.00600.00900.00 AverageAverageAverage Value Value Value of of Up-Regulatedof Up-Regulated Up-Regulated Genes Genes Genes (Rectosigmoid) (Sigmoid) (Cecum) Overview of Present Ranking by EWS t Values SRBCT Analysis (Top and Bottom 12) ! 24 genes selected from 12 highest and lowest t values for EWS vs. remainder ! Gene selection by t test Sort by EWS t Value EWS BL NB RMS – 96 genes, 12 highest and lowest t values for Image ID Gene Description t Value t Value t Value t Value 770394 Fc fragment of IgG, receptor, transporter, alpha 12.04 -6.67 -6.17 -4.79 each class 1435862 antigen identified by monoclonal antibodies 12E7, F21 and O13 9.09 -6.75 -5.01 -4.03 377461 caveolin 1, caveolae protein, 22kD 8.82 -5.97 -4.91 -4.78 – Overlap with Khan set: 32 genes 814260 follicular lymphoma variant translocation 1 8.17 -4.31 -4.70 -5.48 ! Samples 491565 Cbp/p300-interacting transactivator, with Glu/Asp-rich carboxy-terminal domain 7.60 -5.82 -2.62 -3.68 ! Ensemble averaging of highest and 841641 cyclin D1 (PRAD1: parathyroid adenomatosis 1) 6.84 -9.93 0.56 -4.30 – EWS (13 tumor, 10 cell-line 1471841 ATPase, Na+/K+ transporting, alpha 1 polypeptide 6.65 -3.56 -2.72 -4.69 lowest t values for each class 866702 protein tyrosine phosphatase, non-receptor type 13 6.54 -4.99 -4.07 -4.84 samples) vs. remainder 713922 glutathione S-transferase M1 6.17 -5.61 -5.16 -1.97 308497 KIAA0467S proteintrong t-value distinctions 5.99 -6.69 -6.63 -1.11 ! Correlations of training genes and – BL (8 cell-line samples) vs. 770868 NGFI-A binding protein 2 (ERG1 binding protein 2) 5.93 -6.74 -3.88 -1.21 samples remainder 345232 lymphotoxinb ealphatw (TNFe esuperfamily,n cla membersse 1)s 5.61 -8.05 -2.49 -1.19 – NB (12 cell-line samples) vs. 786084 chromobox homolog 1 (Drosophila HP1 beta) -5.04 -1.05 9.65 -0.62 ! Clustering of ensemble averages remainder 796258 sarcoglycan, alpha (50kD dystrophin-associated glycoprotein) -5.04 -3.31 -3.86 6.83 431397 -5.04 2.64 2.19 0.64 ! Classification by sigmoidal neural – RMS (10 tumor, 10 cell-line 825411 N-acetylglucosamine receptor 1 (thyroid) -5.06 -1.45 5.79 0.76 859359 quinone oxidoreductase homolog -5.23 -7.27 0.78 5.40 network samples) vs.