CSE642 Final Version

Eindhoven University of Technology
MASTER THESIS

Dimensionality Reduction of Gene Expression Data

Author: S. (Sako) Arts
Daily Supervisor: dr. V. (Vlado) Menkovski
Graduation Committee: dr. V. (Vlado) Menkovski, dr. D.C. (Decebal) Mocanu, dr. N. (Nikolay) Yakovets
Award date: 2018
May 16, 2018, v1.0

Abstract

The focus of this thesis is dimensionality reduction of gene expression data. I propose and test a framework that deploys linear prediction algorithms, resulting in a reduced set of selected genes relevant to a specified case.

In cancer research there is a great need to automate parts of the diagnostic process, mainly to reduce cost and to make diagnosis faster and more accurate. Gene expression data of tumor samples is known to contain much information about the disease, information that can help with diagnosing and even curing the patient.
Data mining methods are an obvious candidate to aid in this automation and have therefore been applied to gene expression data in a number of research papers. However, all these researchers face the same problem: a limited number of samples and a large number of features. Because the human genome contains many genes and only a few tumor samples have been processed for their gene expression data, the data set has many more features than samples. This makes any type of data analysis hard, because models easily overfit on data with these properties. Thus, there is a need for a way to effectively reduce the dimensionality of the samples, without removing the information of interest, to enable more effective data analysis of the reduced set.

In this thesis, I propose a framework that is capable of reducing the dimensionality of gene expression samples for case-specific purposes. I explore multiple types of dimensionality reduction, from basic statistical ones to novel deep learning algorithms. This research concludes by suggesting a combination of multiple linear prediction algorithms for case-specific feature selection. Because these types of algorithms have problems with the stability and robustness of their selections, I designed a framework aimed at improving these properties of the resulting selection. The framework combines several of these algorithms with cross-folding to end up with a sufficiently stable set of features that can be used for further analysis. Apart from this main result, the framework produces metrics that indicate the quality of the selection, and it performs additional genetic analysis and plotting relevant for field experts. Besides proposing and arguing the setup and validity of this framework, I test the implemented framework on several medically relevant use cases and present and analyze the results of these tests in this thesis.
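The idea of combining several linear scorers over cross-validation folds and keeping only the features that are selected consistently can be illustrated with a minimal sketch. This is not the framework's implementation. The abstract does not name the algorithms used, so the two scorers below (absolute Pearson correlation and the weights of a nearest-centroid classifier) are simple stand-ins, and the function names, the toy data, and the parameters `n_folds`, `top_k`, and `min_votes` are all hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def correlation_scores(X, y):
    """Stand-in scorer 1: |Pearson correlation| of each feature with the 0/1 label."""
    y_mean, y_std = mean(y), pstdev(y)
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        c_mean, c_std = mean(col), pstdev(col)
        if c_std == 0 or y_std == 0:
            scores.append(0.0)  # a constant column carries no signal
            continue
        cov = mean([(c - c_mean) * (t - y_mean) for c, t in zip(col, y)])
        scores.append(abs(cov / (c_std * y_std)))
    return scores


def centroid_scores(X, y):
    """Stand-in scorer 2: |w_j| of a nearest-centroid classifier, w = mu_1 - mu_0."""
    pos = [row for row, t in zip(X, y) if t == 1]
    neg = [row for row, t in zip(X, y) if t == 0]
    return [abs(mean([r[j] for r in pos]) - mean([r[j] for r in neg]))
            for j in range(len(X[0]))]


def stable_selection(X, y, scorers, n_folds=5, top_k=5, min_votes=0.8, seed=0):
    """Keep features ranked in the top_k by at least min_votes of all
    (fold, scorer) runs; each run scores the data with one fold held out."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    votes, runs = Counter(), 0
    for held_out in folds:
        held = set(held_out)
        X_train = [X[i] for i in idx if i not in held]
        y_train = [y[i] for i in idx if i not in held]
        for scorer in scorers:
            scores = scorer(X_train, y_train)
            ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
            votes.update(ranked[:top_k])
            runs += 1
    return sorted(f for f, v in votes.items() if v / runs >= min_votes)


def make_toy_data(n_samples=40, n_features=200, n_informative=3, seed=0):
    """Synthetic data with far more features than samples (hypothetical numbers).
    Only the first n_informative features are shifted for class 1."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n_samples)]
    X = [[rng.gauss(0, 1) + (3.0 * t if j < n_informative else 0.0)
          for j in range(n_features)] for t in y]
    return X, y


X, y = make_toy_data()
selected = stable_selection(X, y, [correlation_scores, centroid_scores])
print(selected)
```

On this toy data the informative features 0, 1, and 2 are selected in every run, while a noise feature survives only if it happens to rank highly in most folds for both scorers. Raising `min_votes` makes the selection stricter at the cost of possibly dropping weakly informative features.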
The test results show the effectiveness of the framework on certain use cases and the limits of the gene expression data. They demonstrate that the framework is a solution to the proposed problem and show that the framework could add value for medical professionals in their daily practice.

Acknowledgments

Before diving into the full scientific extent of this thesis, I would like to reserve some space to thank the people who helped me reach this goal of acquiring my Master of Science title.

First and foremost, I would like to thank my daily supervisor, Vlado Menkovski. Without his support and insights I would not have been able to finish a project of this magnitude. I would like to thank him for sparking my interest in deep learning, a field that I am now pursuing a career in. I would like to thank him for introducing me to the right people, enabling me to finish two successful internships during my master's; during these internships I learned most of my skills in this field. I would like to thank him for providing the atmosphere and infrastructure within the TU/e's deep learning community that helped me thrive in this field. And lastly, I would like to thank him for putting up with me all this time. I have been bothering him for over a year now, and while I know he is a busy man, that didn't stop me from dropping in at his office whenever it suited me.

I would like to thank Decebal Mocanu and Nikolay Yakovets for being part of my graduation committee and plowing through this stack of paper. I'm not much of a reader myself, so I am extra grateful for this effort.

I would like to thank the guys and girls of the deep learning community at the TU/e. We spent many AA meetings together. I learned a lot about the field by listening to their struggles and achievements, and I'm very grateful for their feedback on mine.
I would also like to thank my fellow Data Science pioneers, Jos, Stefan, and Puck, for the many projects and assignments we collaborated on over the course of my master's.

I would also like to extend my gratitude to Nevenka and my colleagues at Philips New York, not only for their expertise in the field of genomics, without which this project would not have been possible, but also for their warmth and kindness during my stay in the US. I really enjoyed myself in the office and at the activities they invited me to outside the office. I really hope our paths will cross once more in the future. A special shout-out to Steven and Andrea for letting me stay at Andrea's place the times I didn't make it to the last train out of NYC.

Many thanks to all the friends I made at GEWIS, especially those of B.O.O.M. and the BAr Committee, with whom I drank many a beer and expect to drink many more in the future. These people made my years as a student the time of my life, and I look back on many great memories of activities partially financed by the many kind companies giving large sums of money to GEWIS. A special thanks to my fellow board members of the 33rd board of GEWIS. It could not have been easy to deal with me as a chairman, especially on some early Friday mornings.

Last, but definitely not least, I would like to thank my girlfriend Jet, my parents, Jan and Annelies, and my sisters, Tika and Nadi, for their continuous support during my studies. They always provided me with a safe haven to return to and a place to take the necessary rest during my studies. Special thanks to my parents for providing me with many freshly ironed dress shirts.

Oh, and let's not forget to thank Daan for being Daan, I guess.
Contents

1 Introduction
2 Domain and Data
  2.1 Domain
  2.2 Data Extraction
  2.3 Data Source
3 Background
  3.1 Dimensionality Reduction
  3.2 Feature Extraction
  3.3 Feature Selection
4 Proposed Methodology
  4.1 Approach to Dimensionality Reduction
  4.2 Robustness
  4.3 Framework
5 Implementation
  5.1 Data Retrieval, Processing and Enrichment
  5.2 Experimental Structure
6 Results
  6.1 Research Results
  6.2 Framework Results
7 Conclusion
  7.1 Method of Reduction
  7.2 Experimental Results
  7.3 Future Work
8 References
Appendices
A Result Interpretations
B Selected Genes
C Large Tables
D Sacred notebook
E Code listings

Chapter 1
Introduction

Within the field of medicine, cancer is clearly one of the most researched diseases. This interest is driven by the number of deaths it causes and the difficulty of preventing or predicting when and where the disease will appear. Data mining is an obvious candidate to help with this prediction and diagnosis, because it can find previously unknown relations with little or no real knowledge about a process's inner workings. With more diagnostic and demographic data of cancer patients becoming available, the number of applicable data mining algorithms and their effectiveness are growing.
The genetic information in this data is especially interesting, since a person's genome is known to give rise to all sorts of phenotypes, which should therefore be derivable from the genetic data. However, there are very few phenotypes for which the direct relation to a set of genes is known, and even fewer for which cause and effect can be explained. One of the main problems that makes genetic data difficult to analyze is its dimensionality. This is caused by two inherent properties of this type of data.
