Computational Approaches for Disease Gene Identification

YANG PENG

School of Computer Engineering

A Thesis Submitted to Nanyang Technological University in Fulfillment of the Requirement for the Degree of Doctor of Philosophy

May 2013

Acknowledgements

First and foremost, my special gratitude goes to my supervisors, Professor Kwoh Chee-Keong, Professor Li Xiaoli and Professor Ng See-Kiong, for the immense patience and invaluable advice they gave me during this essential part of my life. The work that I have done here would not have been possible without them. While working with them, I gained both scientific knowledge and an understanding of how to do original research. I have benefited from their thoroughness and diligence in revising every word and sentence of my research write-ups, and from their strict attitude toward scientific research.

I thank Assistant Professor Zheng Jie, whose sharp eye for the key underlying problems and whose critical thinking skills helped me see many issues for myself. I thank Associate Professor Lin Feng and Assistant Professor Manoranjan Dash for their effective comments and support during my Ph.D. qualifying examination.

Besides the professors, I warmly thank my colleague and friend, Dr. Mei Jianping, for her friendly help and for kindly sharing her MATLAB code for many of the machine learning algorithms used in this study. I also extend my special appreciation to Dr. Li Yongjin, who inspired me to learn a great deal about the approaches used in disease gene prediction. I also enjoyed the time that I spent with my other colleagues in the Bioinformatics Research Center, including Dr. Liu Qian, Dr. Wu Min, Dr. Zhao Liang, Dr. Li Zhenhua and Dr. Piyushkumar, and with my lab mates, including Ouyang Xuchang, Su Tran To Chinh, Liu Wenting, Chen Haifen and Zhang Fan, who worked closely with me to discover new problems in bioinformatics.
I would especially like to thank my brother, girlfriend and parents. Their encouragement and understanding were my motivation to finish my Ph.D. study. Finally, I express my sincere gratitude to everyone who has contributed to this thesis.

Contents

ACKNOWLEDGEMENTS
ABSTRACT
IMPORTANT ABBREVIATIONS USED
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1. INTRODUCTION
  1.1 BACKGROUND
    1.1.1 Motivation and Objective of Disease Gene Identification
    1.1.2 Challenges of Disease Gene Identification
    1.1.3 Wet-lab Experiments for Disease Gene Identification
  1.2 RELATED PRIOR WORKS
  1.3 MAJOR CONTRIBUTIONS AND ORGANIZATION
  1.4 OUTLINE
CHAPTER 2. LITERATURE REVIEW
  2.1 PRIORITIZATION OF CANDIDATE GENES BASED ON BIOLOGICAL DATA SOURCE
    2.1.1 Sequence Based Methods
    2.1.2 Gene Expression Based Methods
    2.1.3 Ontology Based Methods
    2.1.4 PPI Network Based Methods
  2.2 INTEGRATION METHODOLOGIES ON CANDIDATE GENES PRIORITIZATION
  2.3 SUMMARY
CHAPTER 3. PREDICTING DISEASE GENE VIA PROTEIN COMPLEX NETWORK PROPAGATION
  3.1 INTRODUCTION
  3.2 METHOD
    3.2.1 Overall Network Structure in RWPCN
    3.2.2 Constructing Phenotype Network
    3.2.3 Constructing Protein Complex Network
    3.2.4 Random walk with restart on protein complexes network (RWPCN)
  3.3 EXPERIMENT RESULTS
    3.3.1 Experimental settings and evaluation metrics
    3.3.2 Experimental Results
  3.4 SUMMARY
CHAPTER 4. POSITIVE UNLABELED LEARNING FOR DISEASE GENE IDENTIFICATION
  4.1 INTRODUCTION
  4.2 METHOD
    4.2.1 Gene characterization
    4.2.2 Feature Selection
    4.2.3 PU learning to identify the disease genes from U
  4.3 RESULT
    4.3.1 Experimental data, settings and evaluation metrics
    4.3.2 Experimental Result
  4.4 SUMMARY
CHAPTER 5. ENSEMBLE BASED POSITIVE UNLABELED LEARNING FOR DISEASE GENE IDENTIFICATION
  5.1 INTRODUCTORY
  5.2 MATERIAL AND METHOD
    5.2.1 Experimental data and gene network modeling
    5.2.2 The proposed Technique EPU
    5.2.3 Ensemble positive unlabeled learning EPU
  5.3 EXPERIMENTAL RESULTS
    5.3.1 Experimental setting
    5.3.2 Evaluation metrics
    5.3.3 Experimental result
  5.4 SUMMARY
CHAPTER 6. CONCLUSIONS AND FUTURE DIRECTIONS
  6.1 CONCLUSIONS AND DISCUSSION
  6.2 FUTURE DIRECTIONS
    6.2.1 Integration of More Data Sources for Disease Gene classification
    6.2.2 Phenotype Entities Similarity Calculation
    6.2.3 Improving the network propagation based method (RWPCN) using machine learning classification approaches
    6.2.4 Prioritization of loci using GWAS data
  6.3 FINAL REMARKS
REFERENCES
AUTHOR’S PUBLICATIONS

Abstract

Identifying disease genes in the human genome is an important and fundamental problem in biomedical research. Although many machine learning methods have been published for discovering new disease genes, the problem remains challenging because of the pleiotropy of genes, the limited number of confirmed disease genes in the whole genome and the genetic heterogeneity of diseases. Recent approaches have applied the concept of ‘guilt by association’ to investigate the association between a disease phenotype and its causative genes, which means that candidate genes with
