Towards Large-Scale Network Analytics

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By
Xintian Yang, M.S.
Graduate Program in Computer Science and Engineering

The Ohio State University
2012

Dissertation Committee:
Srinivasan Parthasarathy, Advisor
P. Sadayappan
Gagan Agrawal

© Copyright by Xintian Yang 2012

Abstract

In this thesis, we present a framework for efficient analysis of large-scale network datasets. There are four important components in our framework:

a) a high performance computing platform with Graphics Processing Units (GPUs) and efficient implementations of mining algorithms on top of the GPU platform;
b) an efficient summarization method to compress the storage space of large-scale streaming and heterogeneous network data with textual content and network topology;
c) a complex query engine that depends on the summarized input data and can help to discover new knowledge from network content and topology;
d) a visual front-end to present mining results to users.

First, the key challenge we address in this work is that of scalability: handling large datasets in terms of the efficiency of the back-end mining algorithms. Several of the mining algorithms that we have investigated share a common Sparse Matrix-Vector Multiplication (SpMV) kernel. We present a novel approach to compute this kernel on GPUs, particularly targeting sparse matrices representing graphs with power-law attributes. Using real web graph data, we show how our representation scheme, coupled with a novel tiling algorithm, can yield significant benefits over current state-of-the-art GPU efforts on a number of core data mining algorithms such as PageRank, HITS and Random Walk with Restart. We also extend this efficient single-GPU kernel to a cluster environment with multiple GPUs.
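As a rough illustration of why such algorithms reduce to SpMV, the sketch below runs PageRank as a power iteration whose only heavy step is one SpMV per iteration over a transition matrix stored in compressed sparse row (CSR) form. It is plain Python on the CPU and does not reproduce the GPU representation scheme or tiling algorithm of this thesis. The function names (build_csr, spmv, pagerank), the damping factor of 0.85, the fixed iteration count, and the assumption that every node has at least one outgoing edge are illustrative choices, not taken from the text.

```python
from typing import List, Tuple


def build_csr(n: int, edges: List[Tuple[int, int]]):
    """Build CSR arrays for the PageRank transition matrix M.

    M[dst][src] = 1 / out_degree(src) for every edge src -> dst.
    Assumption: every node has at least one outgoing edge.
    """
    out_deg = [0] * n
    for src, _ in edges:
        out_deg[src] += 1
    rows = [[] for _ in range(n)]
    for src, dst in edges:
        rows[dst].append((src, 1.0 / out_deg[src]))
    row_ptr, col_idx, vals = [0], [], []
    for row in rows:
        for col, val in sorted(row):
            col_idx.append(col)
            vals.append(val)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, vals


def spmv(row_ptr: List[int], col_idx: List[int],
         vals: List[float], x: List[float]) -> List[float]:
    """Compute y = A * x for a CSR matrix A, one dot product per row."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def pagerank(n: int, edges: List[Tuple[int, int]],
             damping: float = 0.85, iters: int = 50) -> List[float]:
    """Power iteration: each step is one SpMV plus a cheap vector update."""
    row_ptr, col_idx, vals = build_csr(n, edges)
    rank = [1.0 / n] * n
    for _ in range(iters):
        y = spmv(row_ptr, col_idx, vals, rank)
        rank = [(1.0 - damping) / n + damping * v for v in y]
    return rank


if __name__ == "__main__":
    # Hypothetical 3-node example: node 0 links to 1 and 2, both link back.
    print(pagerank(3, [(0, 1), (0, 2), (1, 0), (2, 0)]))
```

HITS and Random Walk with Restart follow the same pattern with a different matrix and update rule, so the cost of all three is dominated by the spmv loop. On power-law graphs the row lengths in that loop vary widely, which is one reason a straightforward one-thread-per-row GPU mapping can be unbalanced.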
The multi-GPU kernel enables our framework to handle out-of-core datasets such as the Web graph. Additionally, the high performance of the GPU kernel relies on programmer expertise and careful tuning of its parameters. We proposed an online parameter auto-tuning method with an offline benchmarking component to accurately predict the parameters from the characteristics of the input data.

Second, we proposed an efficient summarization method to build an in-memory summary of high-speed streaming network input data, which can contain both user-generated content and topological information about user connections. The summary can be used as input to our framework to perform analytical tasks. Experimental results show that our method can efficiently and incrementally summarize the stream data. The memory footprint of the summarization algorithm grows logarithmically with time instead of linearly. The raw data can be approximately reconstructed by querying the summary so as to support the analytical applications over the original data.

Third, we proposed new complex queries on the summarized network content and topology. The queries about network content not only can detect popular topics discussed by users within the network during a period of time, but also can capture the evolution of such topics over time. The queries about the network topology project the entire network topology onto a subgraph conditioned on a network content keyword and a time interval. Graph mining queries are performed on such subgraphs to find relevant users of a topic keyword, communities of users for the keyword, or dynamic community events among a set of users.

Finally, we developed a visual-analytic toolkit for the interrogation of graph data such as those found in social, bibliometric, WWW and biological applications.
The tool we have developed incorporates common visualization paradigms such as zooming, coarsening and filtering while naturally integrating information extracted by data mining algorithms. The visual front-end provides features that are specifically useful in the analysis of graphs, capturing the static and dynamic nature of both individual entities and the interactions among them. The tool provides the user with the option of selecting multiple views, designed to capture different aspects of the underlying graph data from the perspective of a node, a community or a subset of nodes of interest. Standard visual templates and cues are used to highlight critical changes that have occurred in dynamic graphs. Two case studies based on bibliometric and Wikipedia data are presented to demonstrate the utility of the toolkit for visual knowledge discovery.

In each of the above components, we propose new methods to either speed up mining tasks or reduce the data storage size in those tasks. We compare our methods with existing approaches on real datasets drawn from various domains.

To my family.

Acknowledgments

First of all, I sincerely thank my advisor, Dr. Srinivasan Parthasarathy, for his patience and guidance in my PhD study. Without his support, I could not have made it to the end of this journey.

I would also like to thank Dr. Gagan Agrawal and Dr. P. (Saday) Sadayappan for serving on my candidacy and dissertation committee. I am grateful for Dr. Sadayappan's insightful advice on the GPU work we published together.

Former and present members of the Data Mining Research Lab have been very supportive and helpful both inside and outside school. They are: Matthew Otey, Sameep Mehta, Keith Marsolo, Amol Ghoting, Chao Wang, Gregory Buehrer, Sitaram Asur, Duygu Ucar, Shirish Tatikonda, Venu Satuluri, Matt Goyder, Ye Wang, S.M. Faisal, Tyler Clemons, Yu-keng Shih, Yiye Ruan, Dave Fuhry and Yang Zhang.
I would like to thank them, along with my advisor, for the research discussions, project and paper collaborations, presentation rehearsals, paper proofreading and many other things.

I would like to give my special thanks to Sitaram Asur, Amol Ghoting and Matt Otey. Sitaram gave me a lot of help on my first paper submission. Amol Ghoting mentored me during my internship at IBM Research. Matt Otey was instrumental in referring me to the summer internship position at Google, which eventually led to my full-time job.

I am grateful to the National Science Foundation for supporting my research through grants RI-CNS-0403342, CCF-0702587, CAREER-IIS-0347662, IIS-0917070 and SoCS-1111118. Any opinions, findings, and conclusions or recommendations expressed here are those of the author and, if applicable, his adviser and collaborators, and do not necessarily reflect the views of the National Science Foundation.

Finally, and most importantly, I would like to thank my wife, my mom and dad, and my parents-in-law. In the long journey of pursuing a PhD degree, various good and bad things can happen and threaten to make the final goal crumble. The unconditional support of my family has always strengthened my determination to conquer all the difficulties.

Vita

November 27, 1983 .......... Born - Harbin, China
July, 2006 .......... B.E. Computer Science and Technology, Harbin Institute of Technology, China
Sept, 2004 - Jan, 2005 .......... Exchange Student, University of Hong Kong, Hong Kong
June, 2010 - Sept, 2010 .......... Research Intern, IBM T.J. Watson Research Center, Yorktown Heights, NY
Nov, 2010 .......... M.S. Computer Science and Engineering, Ohio State University, Columbus, OH
June, 2011 - Sept, 2011 .......... Software Engineering Intern, Google, Kirkland, WA
Sept, 2010 - June, 2012 .......... Graduate Teaching Associate, Ohio State University, Columbus, OH
Sept, 2007 - present .......... Graduate Research Associate, Ohio State University, Columbus, OH

Publications

Research Publications

Xintian Yang, Amol Ghoting, Yiye Ruan, Srinivasan Parthasarathy. A Framework for Summarizing and Analyzing Twitter Feeds. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12.

Xintian Yang, Srinivasan Parthasarathy, P. Sadayappan. Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining. In Proceedings of the VLDB Endowment, 4(4):231–242, 2011.

Xintian Yang, Srinivasan Parthasarathy, P. Sadayappan. Fast Mining Algorithms of Graph Data on GPUs. In 2nd Workshop on Large-scale Data Mining: Theory and Applications, LDMTA, 2010.

Xintian Yang, Sitaram Asur, Srinivasan Parthasarathy, Sameep Mehta. A Visual-analytic Toolkit for Dynamic Interaction Graphs. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '08, pages 1016–1024, New York, NY, USA, 2008. ACM.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Challenges in Analyzing Large Graphs
   1.2 Proposed Framework
   1.3 Our Contributions
   1.4 Organization
2. Background and Related Work
   2.1 Graph Mining Algorithms
       2.1.1 Link Analysis
       2.1.2 Graph Clustering
       2.1.3 Dynamic Graph Analysis
   2.2 GPU Background
   2.3 Sparse Matrix and Vector Multiplication
   2.4 Summarization of network content and topology
   2.5 Visual Analytics
3. High Performance Mining Kernels
   3.1 Methodology
