Novel Frameworks for Mining Heterogeneous and Dynamic Networks

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Electrical and Computer Engineering and Computer Science of the College of Engineering

by Chunsheng Fang
B.E., Electrical Engineering & Information Science, June 2006
University of Science & Technology of China, Hefei, P.R. China

Advisor and Committee Chair: Prof. Anca L. Ralescu

November 3, 2011

Abstract

Graphs serve as an important tool for discrete data representation. Recently, graph representations have made possible very powerful machine learning algorithms, such as manifold learning, kernel methods, and semi-supervised learning. With the advent of large-scale real-world networks, such as biological networks (disease networks, drug-target networks, etc.) and social networks (the DBLP co-authorship network, Facebook friendships, etc.), machine learning and data mining algorithms have found new application areas and have advanced our understanding of the properties and phenomena governing real-world networks.

When dealing with real-world data represented as networks, two problems arise quite naturally:

I) How to integrate and align the knowledge encoded in multiple heterogeneous networks? For instance, how can similar genes be found in co-disease and protein-protein interaction networks?

II) How to model and predict the evolution of a dynamic network? A real-world example: given N years of snapshots of an evolving social network, how can we build a model that captures the temporal evolution and makes reliable predictions?

In this dissertation, we present an innovative graph embedding framework, which identifies the key components of modeling the evolution in time of a dynamic graph.
Unlike many state-of-the-art graph link prediction and modeling algorithms, it formulates the link prediction problem from a geometric perspective that can capture the dynamics of the intrinsic continuous graph manifold evolution. It is attractive due to its simplicity and its potential to relax the mining problem into a feasible domain, which enables standard machine learning and regression models to utilize historical graph time series data.

To address the first problem, we first propose a novel probability-based similarity measure, which led to promising applications in content-based image retrieval and image annotation, followed by a manifold alignment framework for aligning multiple heterogeneous networks, which demonstrates its power in mining biological networks.

Finally, the dynamic graph mining framework generalizes most of the current graph embedding dynamic link prediction algorithms. Comprehensive experimental results on both synthesized and real-world datasets demonstrate that our proposed algorithmic framework for multiple heterogeneous networks and dynamic networks can lead to a better and more insightful understanding of real-world networks. The scalability of our algorithms is also addressed by employing the MapReduce cloud computing architecture.

Acknowledgements

The doctoral study is a long and winding journey. I could not have earned the lifelong prefix “Dr.” in front of my name without all kinds of help.

My first appreciation is for my academic adviser, Professor Anca Ralescu, who possesses all the qualities of a great adviser one can imagine.
She supported my research explorations over a wide spectrum of problems during the four-plus years I spent at UC; she had the patience to advise me on everything from high-level guidance to every single detail of my research; her warm encouragement helps her students broaden their horizons and stay current with the state of the art in machine learning research; and she willingly shares her experience, from research to everyday life, such as careful planning and sustained effort to reach a goal. These will be my lifelong treasures.

My next appreciation goes to my dissertation committee. I have taken almost all of the advanced-level CS courses taught by Professors Kenneth Berman, Yizong Cheng, and Fred Annexstein. These courses strengthened my foundation in computer science. Without their distinguished lectures, it would have been much more difficult for an Electrical Engineering graduate to step into the shrine of Computer Science. Professor Anil Jegga helped me with biological network research; Prof. Dan Ralescu always had a way to illustrate abstract mathematical concepts as vividly as his cello melodies.

During my time as a research assistant at the BioMedical Informatics Division, Cincinnati Children’s Hospital Medical Center, Prof. Jason Lu and Prof. Aaron Zorn supported my research and guided me into the field of computational bioinformatics. Their talents in exploring cutting-edge interdisciplinary research in machine learning, bioinformatics, and medical imaging mining, together with their resource management and academic writing skills, have significantly helped shape my technical problem-solving abilities.

Thanks to the world-leading co-op/internship programs offered by the University of Cincinnati, I was fortunate enough to have two precious internship experiences. In April 2011, I began my Software Design Engineer internship at one of the world’s best and most competitive IT companies, Amazon.com.
I spent three wonderful months of learning at Amazon’s Seattle headquarters, and eventually delivered my project to production. Special thanks are owed to my mentor Catalin Constatin and the whole Payment Platforms team, and also to the Amazon TRMS research team, who gave me many valuable comments after my invited seminar talk.

My next internship was as a research scientist at another outstanding company, Riverain Medical Group. Working together with the great R&D team and supervised by Jason F. Knapp, we were able to push the medical imaging products to the next milestone in industry.

I thank the CS department interim director Prof. Prabir Bhattachaya, the acting head Prof. Raj Bhatnagar, CSGSA, GSGA, and all related administrative personnel, such as Julie Muenchen, for supporting me in presenting our research at world-class research conferences such as ACM SIGKDD in San Diego, NIPS in Vancouver, and ICPR in Tampa.

The list of UC friends I wish to thank is long enough to fill another volume of a dissertation. To name just a few: Ravikumar Sugandharaju, who organized the afternoon coffee discussions for hacking the world’s most challenging technical coding problems; friends in the Machine Learning and Computational Intelligence (MLCI) research group led by my adviser, including Mojtaba Kohram and Mohammad Rawashdeh, with whom many ideas were generated during brainstorming in the MLCI weekly meetings; Minlu Zhang, Jingyuan Deng, Xiao Zhang, Xiangxiang Meng, Chen Lu, Yingying Wang, and others, with whom I had great discussions about bioinformatics and statistics; and the DQE study group in 2008, Vaibhav Pandit, and Aravind Ranganathan, who made our summer-long preparation for the comprehensive exams so much fun. Friends I met at research conferences have also given me much inspiration for solving research problems.
Also, I’d like to express my appreciation to the professors and friends in China who supported my application for graduate study in the USA: Prof. Stan Z. Li (CAS), Prof. Zhongfu Ye (USTC), Prof. Peijiang Yuan (CAS), Prof. Jirong Chen (USTC), Prof. Shoumei Li (BJUT), and my friends in NLPR, CASIA (all of whom now hold Ph.D.s): Meng Ao, Zhen Lei, Shengcai Liao, Ran He, Xiaotong Yuan, Dong Yi, Weishi Zheng, Rui Wang, and others.

Finally, I wish to thank my family: my beloved wife Junjun Yu, for her warm love, her support, and the culinary delicacies that have suited my appetite for years; and my father, mother, grandma, and all the other family members on the other side of this planet, who have been my driving force. My life journey started under their love, guidance, and inspiration, which prepared me for everything that came my way.

In retrospect, I would like to summarize my Ph.D. journey with this quote:

Far and away the best prize that life offers is the chance to work hard at work worth doing.
-Theodore Roosevelt, Labor Day address, 1903

To my family, and
my late grandfather

Contents

Chapter 1 Problem Statement and Introduction
1.1 Mining Heterogeneous Domain Knowledge Problem
1.2 Mining Dynamic Network Problem
1.3 Spectral graph theory and manifold
1.4 Graph embedding
1.5 Similarity measure for Heterogeneous data
1.6 Predictive models
1.7 Roadmap of the Dissertation
Chapter 2 Related work
2.1 Combining similarity in image retrieval
2.2 Graph-based knowledge transfer