Characterizing Basal-Like Triple Negative Breast Cancer Using Gene Expression Analysis: a Data Mining Approach
CHARACTERIZING BASAL-LIKE TRIPLE NEGATIVE BREAST CANCER USING GENE EXPRESSION ANALYSIS: A DATA MINING APPROACH

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Biomedical Engineering

By QAMAR ALSABI
B.S.B.E., Wright State University, 2017

2019
Wright State University

WRIGHT STATE UNIVERSITY GRADUATE SCHOOL
November 22, 2019

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY QAMAR ALSABI ENTITLED CHARACTERIZING BASAL-LIKE TRIPLE NEGATIVE BREAST CANCER USING GENE EXPRESSION ANALYSIS: A DATA MINING APPROACH BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Master of Science in Biomedical Engineering.

Jaime E. Ramirez-Vick, Ph.D., Thesis Director
John C. Gallagher, Ph.D., Chair, Biomedical, Industrial, and Human Factor Engineering

Committee on Final Examination:
Nasim Nosoudi, Ph.D.
Amir Zadeh, Ph.D.
Barry Milligan, Ph.D., Interim Dean of the Graduate School

ABSTRACT

Alsabi, Qamar. M.S.B.M.E., Department of Biomedical, Industrial, and Human Factor Engineering, Wright State University, 2019. Characterizing Basal-Like Triple Negative Breast Cancer using Gene Expression Analysis: A Data Mining Approach.

Triple-negative breast cancer (TNBC) is characterized by the absence of expression of the estrogen receptor, the progesterone receptor, and human epidermal growth factor receptor 2 (HER2). TNBC is therefore unresponsive to targeted hormonal therapies, which limits treatment options to nonselective chemotherapeutic agents. Basal-like breast cancers (BLBCs) account for about 70% of TNBCs. They more frequently affect younger patients, are more prevalent in African-American women, and are significantly more aggressive than tumors of other molecular subtypes, with high rates of proliferation and extremely poor clinical outcomes.
Proper classification of BLBCs using current pathological tools has been a major challenge. Although TNBCs have many BLBC characteristics, the relationship between clinically defined TNBC and the gene expression profile of BLBC has not been fully examined. The purpose of this study is to assemble publicly available TNBC gene expression datasets generated by Affymetrix gene chips and to define a set of genes, or gene signature, that can classify TNBC samples into BLBC and non-BLBC subtypes. We compiled over 3,500 breast cancer gene expression profiles from several individual publicly available datasets and extracted Affymetrix gene expression data for 580 TNBC cases. Several popular data mining methods, along with dimensionality reduction and feature selection techniques, were applied to the resulting dataset to build predictive models, both to understand the molecular characteristics and mechanisms associated with BLBCs and to classify them more accurately using important features extracted through microarray data analysis of BLBC and non-BLBC cases. Our results can lead to proper identification and diagnosis of BLBCs, which could in turn guide clinical decisions by indicating the most effective therapy.

Table of Contents

INTRODUCTION
BACKGROUND
    BREAST CANCER (BC)
    TRIPLE NEGATIVE BREAST CANCER (TNBC)
    BASAL-LIKE BREAST CANCER (BLBC)
    DIAGNOSTIC DIFFICULTIES
    BIOMARKERS IN TNBC AND BLBC
    GENE EXPRESSION PROFILING
    DATA MINING IN GENE EXPRESSION
    SAS ENTERPRISE MINER
    CLASSIFICATION
        Artificial Neural Networks
        Logistic Regression
        Decision Tree
        Random Forest
        Support Vector Machine
        Least Angle Regression
        Bayesian Classifier
        Gradient Boosting
    DATA PRE-PROCESSING
    MEASURES FOR PERFORMANCE EVALUATION
    RELATED WORK
METHODOLOGY
    DATA ACQUISITION AND INTEGRATION
    PREDICTIVE MODELS
RESULT
    PERFORMANCE EVALUATION OF THE MODELS
        Neural Network (ANNs)
        Logistic regression
        Decision Tree
        Random Forest
        Least Angle Regression
        Bayesian classifier
        Gradient Boosting
    SUMMARY
DISCUSSION
CONCLUSIONS
REFERENCE

TABLE OF FIGURES

Figure 1. Distribution of the expression of basal-like metagene among TNBC.
Figure 2. Research Methodology.
Figure 3. Neural Network Model (ANNs).
Figure 4. Logistic regression model.
Figure 5. Decision Tree model.
Figure 6. Least angle regression model.

LIST OF TABLES

Table 1. Models' performance Evaluation.
Table 2. Top 40 Genes Associated to BLBC.
Table 3. Gene importance by Decision Tree model.
Table 4. Random Forest with variable selection enabled.
Table 5. Gene Ranking (importance) using Bayesian Classifier.
Table 6. Correlated