Disease-Gene Association Using Genetic Programming

Disease-Gene Association Using Genetic Programming Ashkan Entezari Heravi Department of Computer Science Submitted in partial fulfillment of the requirements for the degree of Master of Science Faculty of Mathematics and Science, Brock University St. Catharines, Ontario c A. E. Heravi, 2015 Abstract As a result of mutation in genes, which is a simple change in our DNA, we will have undesirable phenotypes which are known as genetic diseases or disorders. These small changes, which happen frequently, can have extreme results. Understanding and iden- tifying these changes and associating these mutated genes with genetic diseases can play an important role in our health, by making us able to find better diagnosis and therapeutic strategies for these genetic diseases. As a result of years of experiments, there is a vast amount of data regarding human genome and di↵erent genetic diseases that they still need to be processed properly to extract useful information. This work is an e↵ort to analyze some useful datasets and to apply di↵erent techniques to asso- ciate genes with genetic diseases. Two genetic diseases were studied here: Parkinson’s disease and breast cancer. Using genetic programming, we analyzed the complex network around known disease genes of the aforementioned diseases, and based on that we generated a ranking for genes, based on their relevance to these diseases. In order to generate these rankings, centrality measures of all nodes in the complex network surrounding the known disease genes of the given genetic disease were calculated. Using genetic programming, all the nodes were assigned scores based on the simi- larity of their centrality measures to those of the known disease genes. Obtained results showed that this method is successful at finding these patterns in centrality measures and the highly ranked genes are worthy as good candidate disease genes for being studied. Using standard benchmark tests, we tested our approach against ENDEAVOUR and CIPHER - two well known disease gene ranking frameworks - and we obtained comparable results. Acknowledgement Iwouldliketoexpressmygratitudetothefollowingpeopleforalltheirsupportand encouragement during the course of my research: Sheridan Houghten for her superb supervision. Her insightful and inspiring • guidance shed the light on my research path and brought this work to its fullest. Thanks for your kind support and thanks for being a perfect supervisor. Brian Ross for his precious guidance and thought provoking questions. Your • constant assistance will not be forgotten. I am grateful to my dear friend Koosha Tahmasebipour for sharing his brilliant • ideas with me and enlightening me the first glance of research. Many thanks to Cale Fairchild for all his technical support. Your dedication is • much appreciated. Last but not the least, I would like to thank my loving parents Zohreh and Ali • and my beautiful sister Roshanak.Yourconstantcareandattentionmadeit possible for me to stay focused on my research. You are the biggest blessing in the world. A.E Contents 1 Introduction 1 1.1 The Problem of Disease-Gene Association and the Role of Bioinformatics 2 1.2 Thesis Structure . 2 2 Background 3 2.1 Bioinformatics Terms and Concepts . 3 2.1.1 DNA................................ 3 2.1.2 Chromosome . 5 2.1.3 RNA................................ 6 2.1.4 Proteins . 7 2.2 ResearchFieldsinBioinformatics . 9 2.2.1 Sequencing . 10 2.2.2 SequenceAlignment ....................... 11 2.2.3 Gene Prediction . 11 3 Literature Review of Computational Methods 13 3.1 Computational Methods . 13 3.1.1 Text Mining of Biomedical Literature . 14 3.1.2 FunctionalAnnotations. 14 3.1.3 Gene Properties and Sequencing Data . 14 3.1.4 Gene Expression . 14 3.1.5 Protein-Protein Interaction . 15 3.2 Using a Fusion of Evidence Types in Di↵erent Computational Methods 15 4 Complexity 17 4.1 ComplexNetworks ............................ 17 4.2 The Modularity of Genetic Diseases . 18 4.2.1 GuiltbyAssociation ....................... 19 iii 4.3 CentralityMeasures............................ 19 4.3.1 Degree . 20 4.3.2 Diameter . 21 4.3.3 AverageDistance ......................... 21 4.3.4 Eigenvector . 22 4.3.5 Eccentricity . 22 4.3.6 Closeness . 23 4.3.7 Radiality . 23 4.3.8 Centroid Value . 24 4.3.9 Stress . 24 4.3.10 Betweenness . 25 5 Methodology 26 5.1 Genetic Programming . 26 5.1.1 Function and Terminal Sets . 27 5.1.2 Selection Method . 27 5.1.3 Reproduction Operators . 28 5.2 Databases, Toolkits and Platforms . 28 5.2.1 Genotator . 29 5.2.2 Cytoscape . 29 5.2.3 GeneMANIA ........................... 29 5.2.4 CentiScaPe . 31 5.2.5 ECJ . 32 5.3 Experiment Description . 32 5.3.1 GP Language . 33 5.3.2 FitnessEvaluation ........................ 34 5.4 Benchmark Tests . 35 5.4.1 Leave-One-Out Cross Validation . 35 5.4.2 FoldEnrichmentAnalysis . 35 5.4.3 Receiver-Operating Characteristic (ROC) Analysis . 35 6 Case Study: Breast Cancer 37 6.1 Input Data . 37 6.2 GP Design . 38 6.3 UsingGeneMANIAOrderingoftheGenes . 40 6.4 Di↵erent Centrality Measures . 41 6.5 Improving the Results . 44 6.5.1 Leave-One-Out Cross Validation . 45 6.5.2 Comparison with Other Works . 46 6.5.3 Predicting novel disease genes for breast cancer . 46 7 Case Study: Parkinson’s Disease 48 7.1 Input Data . 48 7.2 GP Design . 49 7.3 Di↵erent Sets of Centrality Measures . 51 7.4 Improving the Results . 53 7.4.1 Leave-One-Out Cross Validation Test . 54 7.4.2 Comparison with Other Works . 55 7.4.3 Predicting novel Parkinson’s disease genes . 55 8 Conclusion and Future Work 58 Bibliography 60 Appendices 70 A Leave-One-Out Cross-Validation Results 70 A.1 BreastCancer............................... 70 A.1.1 AR................................. 70 A.1.2 ATM................................ 72 A.1.3 BARD1 .............................. 73 A.1.4 BRCA1 .............................. 74 A.1.5 BRCA2 .............................. 75 A.1.6 CASP8............................... 77 A.1.7 CHEK2 .............................. 78 A.1.8 NCOA3 .............................. 79 A.1.9 PIK3CA.............................. 81 A.1.10PPM1D .............................. 82 A.1.11PTEN ............................... 83 A.1.12RAD51............................... 84 A.1.13RB1CC1.............................. 86 A.1.14STK11............................... 87 A.1.15TP53................................ 88 A.2 Parkinson’sDisease............................ 90 A.2.1 APOE............................... 90 A.2.2 BDNF............................... 91 A.2.3 BST1 ............................... 92 A.2.4 COMT............................... 93 A.2.5 CYP2D6.............................. 95 A.2.6 DRD2 ............................... 96 A.2.7 GAK................................ 97 A.2.8 GBA................................ 99 A.2.9 LRRK2 .............................. 100 A.2.10MAOB............................... 101 A.2.11MAPT............................... 102 A.2.12PARK2 .............................. 104 A.2.13PINK1............................... 105 A.2.14PON1 ............................... 106 A.2.15SNCA ............................... 108 List of Tables 2.1 ProteinFunctions............................. 9 3.1 Using a fusion of evidence types in some well-known disease-gene as- sociationmethods............................. 16 5.1 GP Functions . 34 6.1 Known Disease Genes for Breast Cancer . 38 6.2 GPParametersofFirstExperiment. 39 6.3 Island Model Migration Pattern . 39 6.4 t-test assuming unequal variances over 20 runs of the first two experiments of breast cancer. P-value was set to 0.05 . 41 6.5 Centrality Measures Used in Third Experiment of Breast Cancer . 42 6.6 GP Parameters of Third Experiment, Breast Cancer . 42 6.7 t-test assuming unequal variances to compare di↵erent sets of experiment 3 of breast cancer. P-value was 0.05 for all the tests. 43 6.8 GP Parameters of Last/Best Experiment, Breast Cancer . 44 6.9 Island Model Migration Pattern . 44 6.10 Two sample t-test assuming unequal variances, comparing third and fourth experiments of breast cancer. 45 6.11 Successfully Predicted Genes via GP from the 15 Known Disease Genes for Breast Cancer listed in Table 6.1 . 46 6.12 AnalysisandComparisontoOtherFrameworks . 47 7.1 Known Disease Genes . 49 7.2 GP Parameters of First Experiment, Parkinson’s Disease . 50 7.3 Island Model Migration Pattern . 50 7.4 Unpaired two sample t-test assuming unequal variances, comparing first and second experiments of Parkinson’s disease. P-value is 0.05. 52 7.5 Centrality Measures Used in Third Experiment of Parkinson’s Disease 52 vii 7.6 GP Parameters of Third Experiment, Parkinson’s Disease . 53 7.7 Island Model Migration Pattern . 53 7.8 t-test assuming unequal variances to compare di↵erent sets of experiment 3 of Parkinson’s disease. P-value is 0.05 for all of the tests. 54 7.9 GP Parameters of Last Experiment, Parkinson’s Disease . 54 7.10 An unpaired two sample t-test, assuming unequal variances comparing fitnessvaluesofthirdandfourthexperiments. 56 7.11 Successfully Predicted Genes via GP from the 15 Known Disease Genes for Parkinson’s Disease listed in Table 7.1 . 56 7.12 AnalysisandComparisontoOtherFrameworks . 57 List of Figures 2.1 Structure of Purines and Pyrimidines. Image from [63]. 4 2.2 DNAstructure.Imagefrom[12]. 5 2.3 Structure of Chromosome. Image from [95]. 6 2.4 RNA stem-loop. It happens when there are complementary regions on the same strand. Image from [15]. 7 2.5 Example of internal loop in RNA strand. Image from [14]. 7 2.6 StructureofRNAmolecule. Imagefrom[16].. 8 2.7 Four levels of protein structure. Image from [13]. 10 4.1 HumanInteractome. Imagefrom[18].. 18 4.2 Graph G with 5 vertices and 5 edges. 20 5.1 A fitness function is used to evaluate the individuals. There are two reproduction operators (Crossover and Mutation) that will breed new population based on the fitness of current individuals. Image from [74]. 28 5.2 Asampleexpressiontree. ........................ 33 6.1 Average of the mean fitness and best fitness of 20 runs over 50 gener- ations. Breast cancer, experiment 1. 40 6.2 Average of the mean fitness and best fitness of 20 runs over 50 gener- ations. Breast cancer, experiment 2.

Load more