Computational Approaches for Disease Gene Identification

Computational Approaches for Disease Gene Identification YANG PENG School of Computer Engineering A Thesis Submitted to Nanyang Technological University In Fulfillment of the Requirement For the Degree of Doctor of Philosophy May 2013 Acknowledgements First and foremost, my special gratitude goes to my supervisors, Professor Kwoh Chee-Keong, Professor Li Xiaoli and Professor Ng See-Kiong for their immense patience and invaluable advice they provide me during the essential part of my life. The work that I have done here would not have been possible without them. During working with them, I have gained both the scientific knowledge and the way to do original research work. I am benefit from their thoroughness and diligence in revising every word and sentence in my research write-ups and strict attitude toward scientific research work. I have to thank Assist Professor Zheng Jie, from whom I am impressed by his sharp eyes in finding the key underlying problems and his critical thinking skills to let me see many issues with my own eyes. I have to thank Associate Professor Lin Feng, Assist Professor Manoranjan Dash for their effective comments and support during my Ph.D qualifying examination. Besides professors, I would warmly thank my colleague and friend, Dr. Mei Jianping for her friendly help and kindly sharing with me her MATLAB codes of many machine learning algorithms used in this study. I also extend my special appreciation to Dr. Li Yongjin, from whom I was inspired to learn a lot of knowledge and approaches related with disease gene prediction. I also enjoyed the time that I spent with my other colleagues in Bioinformatics Research Center i including Dr. Liu qian, Dr. Wu min, Dr. Zhao Liang, Dr. Li Zhenhua, Dr. Piyushkumar and with my lab mates including Ouyang Xuchang, Su Tran To Chinh, Liu Wenting, Chen Haifen, Zhang fan, who worked closed with me to discover new problems in Bioinformatics. Especially, I would like to give my special thanks to my brother, girlfriend and parents. Their encouragement and understanding are my motivation to finish my Ph.D study. Finally, I express my sincere gratitude to everyone who has contributed to this thesis. ii Contents ACKNOWLEDGEMENTS ........................................................................................ I ABSTRACT ............................................................................................................ VI IMPORTANT ABBREVIATIONS USED ................................................................ X LIST OF FIGURES ................................................................................................. XII LIST OF TABLES .................................................................................................. XIV CHAPTER 1. INTRODUCTION ........................................................................... 1 1.1 BACKGROUND ................................................................................................... 1 1.1.1 Motivation and Objective of Disease Gene Identification ......................... 1 1.1.2 Challenges of Disease Gene Identification ................................................ 3 1.1.3 Wet-lab Experiments for Disease Gene Identification ............................... 5 1.2 RELATED PRIOR WORKS .................................................................................... 7 1.3 MAJOR CONTRIBUTIONS AND ORGINAZATION ................................................. 11 1.4 OUTLINE ......................................................................................................... 15 CHAPTER 2. LITERATURE REVIEW ............................................................. 16 2.1 PRIORITIZATION OF CANDIDATE GENES BASED ON BIOLGICAL DATA SOURCE 17 2.1.1 Sequence Based Methods ......................................................................... 17 2.1.2 Gene Expression Based Methods ............................................................. 18 2.1.3 Ontology Based Methods ......................................................................... 20 2.1.4 PPI Network Based Methods ................................................................... 22 2.2 INTEGRATION METHODOLOGIES ON CANDIDATE GENES PRIORITIZATION ....... 24 2.3 SUMMARY ....................................................................................................... 33 CHAPTER 3. PREDICTING DISEASE GENE VIA PROTEIN COMPLEX NETWORK PROPAGATION .................................................................................. 36 iii 3.1 INTRODUCTION ............................................................................................... 36 3.2 METHOD ......................................................................................................... 38 3.2.1 Overall Network Structure in RWPCN .................................................... 38 3.2.2 Constructing Phenotype Network ............................................................ 40 3.2.3 Constructing Protein Complex Network .................................................. 41 3.2.4 Random walk with restart on protein complexes network (RWPCN) ..... 43 3.3 EXPERIMENT RESULTS .................................................................................... 48 3.3.1 Experimental settings and evaluation metrics .......................................... 48 3.3.2 Experimental Results ................................................................................ 50 3.4 SUMMARY ........................................................................................................ 67 CHAPTER 4. POSITIVE UNLABELED LEARNING FOR DISEASE GENE IDENTIFICATION ....................................................................................... 69 4.1 INTRODUCTION ............................................................................................... 70 4.2 METHOD ......................................................................................................... 73 4.2.1 Gene characterization ............................................................................... 73 4.2.2 Feature Selection ...................................................................................... 76 4.2.3 PU learning to identify the disease genes from U .................................... 80 4.3 RESULT ........................................................................................................... 85 4.3.1 Experimental data, settings and evaluation metrics ................................. 85 4.3.2 Experimental Result ................................................................................. 90 4.4 SUMMARY ..................................................................................................... 103 CHAPTER 5. ENSEMBLE BASED POSITIVE UNLABELED LEARNING FOR DISEASE GENE IDENTIFICATION ................................... 105 5.1 INTRODUCTORY............................................................................................. 106 5.2 Material and Method .................................................................................... 109 iv 5.2.1 Experimental data and gene network modeling ..................................... 110 5.2.2 The proposed Technique EPU ................................................................ 113 5.2.3 Ensemble positive unlabeled learning EPU ........................................... 118 5.3 EXPERIMENTAL RESULTS .............................................................................. 125 5.3.1 Experimental setting ............................................................................... 125 5.3.2 Evaluation metrics .................................................................................. 126 5.3.3 Experimental result ................................................................................. 126 5.4 SUMMARY ..................................................................................................... 137 CHAPTER 6. CONCLUSIONS AND FUTURE DIRECTIONS .................... 138 6.1 CONCLUSIONS AND DISCUSSION .................................................................... 138 6.2 FUTURE DIRECTIONS ..................................................................................... 141 6.2.1 Integration of More Data Sources for Disease Gene classification ........ 141 6.2.2 Phenotype Entities Similarity Calculation ............................................. 141 6.2.3 Improving the network propagation based method (RWPCN) using machine learning classification approaches ........................................................ 142 6.2.4 Prioritization of loci using GWAS data .................................................. 143 6.3 FINAL REMARKS ............................................................................................ 143 REFERENCES ......................................................................................................... 145 AUTHOR’S PUBLICATIONS ............................................................................... 163 v Abstract Identifying disease genes from human genome is an important and fundamental problem in biomedical research. Despite many publications of machine learning methods applied to discover new disease genes, it still remains a challenge because of the pleiotropy of genes, the limited number of confirmed disease genes among whole genome and the genetic heterogeneity of diseases. Recent approaches have applied the concept of ‘guilty by association’ to investigate the association between a disease phenotype and its causative genes, which means that candidate genes with

Computational Approaches for Disease Gene Identification

Inherited Neuropathies

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Targeting PH Domain Proteins for Cancer Therapy

The Roma Paradigm

Expression of Odontogenic Ameloblast-Associated Protein (ODAM) in Dental and Other Epithelial Neoplasms

Whole Exome Sequencing in Families at High Risk for Hodgkin Lymphoma: Identification of a Predisposing Mutation in the KDR Gene

Appendix 2. Significantly Differentially Regulated Genes in Term Compared with Second Trimester Amniotic Fluid Supernatant

Integrative Epigenomic and Genomic Analysis of Malignant Pheochromocytoma

Open Data for Differential Network Analysis in Glioma

WO 2012/174282 A2 20 December 2012 (20.12.2012) P O P C T

"The Genecards Suite: from Gene Data Mining to Disease Genome Sequence Analyses". In: Current Protocols in Bioinformat

The Spotted Gar Genome Illuminates Vertebrate Evolution and Facilitates Human-To-Teleost Comparisons