Applying Machine Learning to Breast Cancer Gene Expression Data to Predict Survival Likelihood

Pegah Tavangar

Thesis submitted to the University of Ottawa in partial fulfillment of the requirements for the Master of Science in Chemistry and Biomolecular Sciences

Department of Chemistry and Biomolecular Sciences
Faculty of Science
University of Ottawa

© Pegah Tavangar, Ottawa, Canada, 2020

Abstract

Analyzing the expression levels of thousands of genes can provide information useful for improving cancer therapy or developing new drugs. In this project, the expression of 48,807 genes from primary human breast tumor cells was analyzed. A human cannot make sense of such a large volume of gene expression data for each patient, so we used Machine Learning, an automated approach that learns from data and makes predictions from it. This project presents the use of Machine Learning to predict the likelihood of survival in breast cancer patients from gene expression profiles. Machine Learning techniques such as Logistic Regression, Support Vector Machines, and Random Forest, together with several Feature Selection techniques, were used to identify genes that contribute to breast cancer or are associated with longer patient survival. This project describes the evaluation of different Machine Learning algorithms for classifying breast cancer tumors into two groups: high and low survival.

Acknowledgments

I would like to thank Dr. Jonathan Lee for giving me the opportunity to work with him on an exciting project, and for the invaluable counsel he provided during my research. It was my honor to work with other professors in the Faculty of Medicine, including Dr. Mathieu Lavallée-Adam and Dr. Theodore Perkins; their support, advice, friendship, and valuable guidance over the past years significantly shaped my work.
Finally, I would like to thank all my friends and family who have supported me throughout my graduate studies.

Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Abbreviations
List of Figures
List of Tables
1: Introduction
1.1: Cancer
1.1.1: Breast Cancer
1.1.2: Illumina BeadChip
1.1.3: Machine Learning
1.1.4: Supervised and Unsupervised Learning
1.2: Machine Learning Algorithms
1.2.1: Logistic Regression
1.2.2: Loss Function
1.2.3: Random Forest
1.2.4: Support Vector Machines
1.2.5: Linear Kernel
1.2.6: Polynomial Kernel
1.2.7: Gaussian Kernel
1.2.8: Radial Basis Function (RBF)
1.2.9: Sigmoid Kernel
1.2.10: Clustering
1.3: Evaluation Metrics
1.3.1: Accuracy
1.3.2: Area Under the Curve
1.4: Cross-Validation
1.5: Feature Selection
1.5.1: Recursive Feature Elimination
1.5.2: Least Absolute Shrinkage and Selection Operator (Lasso)
2: Materials and Methods
2.1: Dataset Preparation
2.2: Validation Test
2.3: Logistic Regression Performance
2.3.1: Logistic Regression Cost Function
2.4: Scikit-learn
2.5: Random Forest as a Feature Selection Method
2.5.1: Gini Impurity and Entropy
2.5.2: Random Forest Hyperparameter Tuning Using the Scikit-learn Library
2.6: Support Vector Machine Cost Function
2.6.1: Tuning Support Vector Machine Parameters
2.7: Feature Selection
2.7.1: Recursive Feature Elimination
2.7.2: Regularization Parameter in Lasso
2.8: Evaluation Metrics
3: Results
3.1: Logistic Regression
3.2: Random Forest
3.3: SVM
3.4: Recursive Feature Elimination
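The workflow outlined in the abstract, training a classifier such as Logistic Regression on gene-expression profiles to separate high-survival from low-survival tumors and scoring it by accuracy and area under the ROC curve, could be sketched as follows. This is a minimal illustration only: the data are synthetic stand-ins for the real 48,807-gene expression matrix, and the labels, sample counts, and signal strength are invented for the example.

```python
# Minimal sketch of the classification setup described in the abstract,
# using scikit-learn. Synthetic data replace the real expression matrix;
# y marks each hypothetical tumor as high (1) or low (0) survival.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples, n_genes = 200, 500  # far fewer genes than the real 48,807

# Random "expression" values; make the first 10 genes weakly predictive
# of survival so the toy classification task is actually learnable.
X = rng.normal(size=(n_samples, n_genes))
w = np.zeros(n_genes)
w[:10] = 1.0
y = (X @ w + rng.normal(scale=2.0, size=n_samples) > 0).astype(int)

# Logistic Regression with feature scaling, evaluated by 5-fold
# cross-validation on the two metrics discussed in the thesis:
# accuracy and area under the ROC curve.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"mean accuracy: {acc:.2f}")
print(f"mean ROC AUC:  {auc:.2f}")
```

The same pipeline shape accommodates the other classifiers named in the abstract (e.g. swapping in `RandomForestClassifier` or `SVC`), which is how scikit-learn lets one compare algorithms under an identical cross-validation protocol.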