A Map-Reduce-Like System for Programming and Optimizing Data-Intensive Computations on Emerging Parallel Architectures

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Wei Jiang, B.S., M.S.

Graduate Program in Department of Computer Science and Engineering

The Ohio State University

2012

Dissertation Committee:

Gagan Agrawal, Advisor
P. Sadayappan
Radu Teodorescu

Copyright by Wei Jiang, 2012

Abstract

Parallel computing environments are ubiquitous nowadays, ranging from traditional CPU clusters to emerging GPU clusters and CPU-GPU clusters, which are attractive for their performance, cost, and energy efficiency. With this trend, an important research issue is how to effectively utilize the massive computing power of these architectures to accelerate data-intensive applications arising from commercial and scientific domains. Map-reduce and its Hadoop implementation have become popular for their high programming productivity, but they exhibit non-trivial performance losses for many classes of data-intensive applications. Also, there is no general map-reduce-like support to date for programming heterogeneous systems like a CPU-GPU cluster. Moreover, it is widely accepted that the existing fault tolerance techniques for high-end systems will not be feasible in the exascale era, and novel solutions are clearly needed.

Our overall goal is to address these programmability and performance issues by providing a map-reduce-like API with better performance efficiency as well as efficient fault tolerance support, targeting data-intensive applications and various emerging parallel architectures. We believe that a map-reduce-like API can ease the difficulty of programming these parallel architectures and, more importantly, improve the performance efficiency of parallelizing data-intensive applications. Also, the use of a high-level programming model can greatly simplify fault-tolerance support, resulting in low overhead checkpointing and recovery.

We performed a comparative study showing that the map-reduce processing style can cause significant overheads for a set of data mining applications. Based on this observation, we developed a map-reduce system with an alternate API (MATE) that uses a user-declared reduction object to further improve the performance of map-reduce programs in multi-core environments. To address the limitation in MATE that the reduction object must fit in memory, we extended the MATE system to support reduction objects of arbitrary sizes in distributed environments and applied it to a set of graph mining applications, obtaining better performance than the original map-reduce-based graph mining.

We then supported the generalized reduction API on a CPU-GPU cluster with automatic data distribution among CPUs and GPUs to achieve the best possible heterogeneous execution of iterative applications. Finally, in our recent work, we extended the generalized reduction model to support low overhead fault tolerance for MPI programs in our FT-MATE system. In particular, we are able to deal with CPU/GPU failures in a cluster with low overhead checkpointing, and to restart the computations on a different number of nodes. Through this work, we would like to provide useful insights for designing and implementing efficient fault tolerance solutions for future exascale systems.

This is dedicated to the ones I love: my parents and my younger sister.

Acknowledgments

It would not have been possible to write this doctoral dissertation without the help and support of the kind people around me, to only some of whom it is possible to give particular mention here.

First of all, I would like to thank my advisor, Prof. Gagan Agrawal. This dissertation would not have been possible without his help, support, and patience, not to mention his advice and unsurpassed knowledge of parallel and distributed computing. He has taught me, both consciously and unconsciously, how good work in computer systems and software should be done. I appreciate all his contributions of time, ideas, and funding to make my Ph.D. life productive and stimulating. The joy, passion, and enthusiasm he has for his research were contagious and motivational for me, even during tough times in the Ph.D. pursuit. I am also indebted to him for the excellent example he has provided as a successful computer scientist and professor.

Second, I would like to acknowledge the financial, academic, and technical support of The Ohio State University, particularly the award of a University Fellowship that provided the necessary financial support for my first year. The library and computer facilities of the University, as well as those of the Computer Science and Engineering Department, have been indispensable. I also thank the Department of Computer Science and Engineering for its support and assistance since the start of my graduate associate work in 2008, especially the head of the department, Prof. Xiaodong Zhang. My work was also supported by the National Science Foundation.

Also, I would like to thank my colleagues in my lab. The members of the Data-Intensive and High Performance Computing Research Group have contributed immensely to my personal and professional time at Ohio State. The group has been a source of friendships as well as good advice and collaboration. I am especially grateful for the fun group of original members who stuck it out in graduate school with me all the way.

My time at Ohio State was made enjoyable in large part due to the many friends and groups that became part of my life. I am thankful for the pleasant days spent with my roommates and hangout buddies with our memorable time in the city of Columbus, for my academic friends who have helped me a lot in improving my programming and paper writing skills, and for many other people and memories of trips into mountains, rivers and lakes.

Special thanks are given to my family for their personal support and great patience at all times. My parents and my younger sister have given me their unequivocal support throughout, as always, for which my mere expression of thanks likewise does not suffice.

Last, but by no means least, I thank my friends in America, China and elsewhere for their support and encouragement throughout, some of whom have already been named.

vi Vita

October 29th, 1985 . . . . . . Born – Tianmen, China

2007 . . . . . . B.S. Software Engineering, Nankai University, Tianjin, China

2011 . . . . . . M.S. Computer Science and Engineering, The Ohio State University

Publications

Research Publications

Wei Jiang and Gagan Agrawal. “MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters”. In Proceedings of the 26th IEEE International Symposium on Parallel and Distributed Processing, May 2012.

Yan Liu, Wei Jiang, Shuangshuang Jin, Mark Rice, and Yousu Chen. “Distributing Power Grid State Estimation on HPC Clusters: A System Architecture Prototype”. In Proceedings of the 13th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing, May 2012.

Yi Wang, Wei Jiang and Gagan Agrawal. “SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats”. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 443–450, May 2012.

Vignesh Ravi, Michela Becchi, Wei Jiang, Gagan Agrawal and Srimat Chakradhar. “Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes”. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 140–147, May 2012.

Wei Jiang and Gagan Agrawal. “Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining”. In Proceedings of the 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 475–484, May 2011.

Wei Jiang, Vignesh T. Ravi, and Gagan Agrawal. “A Map-Reduce System with an Alternate API for Multi-core Environments”. In Proceedings of the 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 84–93, May 2010.

Tekin Bicer, Wei Jiang, and Gagan Agrawal. “Supporting Fault Tolerance in a Data-Intensive Computing Middleware”. In Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, pages 1–12, Apr. 2010.

Wei Jiang, Vignesh T. Ravi, and Gagan Agrawal. “Comparing Map-Reduce and FREERIDE for Data-Intensive Applications”. In Proceedings of the 2009 IEEE International Conference on Cluster Computing, pages 1–10, Aug. 2009.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 High-Level APIs: MapReduce and Generalized Reduction
   1.2 Parallel Architectures: CPUs and GPUs
   1.3 Dissertation Contributions
   1.4 Related Work
   1.5 Outline

2. Comparing Map-Reduce and FREERIDE for Data-Intensive Applications
   2.1 Hadoop and FREERIDE
       2.1.1 Hadoop Implementation
       2.1.2 FREERIDE API and Runtime Techniques
   2.2 Comparing FREERIDE and Hadoop: Case Studies
       2.2.1 K-means Clustering
       2.2.2 Word Count
   2.3 Experimental Evaluation
       2.3.1 Tuning Parameters in Hadoop
       2.3.2 Performance Comparison for Four Applications
       2.3.3 Detailed Analysis of k-means
   2.4 Summary

3. MATE: A MapReduce System with an Alternate API for Multi-core Environments
   3.1 A Case Study
   3.2 System Design
       3.2.1 The System API
       3.2.2 Runtime System
   3.3 Experimental Results
   3.4 Summary

4. Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining
   4.1 Ex-MATE Design
   4.2 Parallel Graph Mining Using Ex-MATE
       4.2.1 GIM-V and PageRank
       4.2.2 Parallelization of GIM-V
       4.2.3 Other Algorithms
   4.3 Experimental Results
   4.4 Summary

5. MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters
   5.1 System Design and Implementation
       5.1.1 Detailed System API
       5.1.2 Runtime System
   5.2 System Tuning for Optimizing Runtime Execution
       5.2.1 Auto-Tuning for Optimizing Heterogeneous Execution
       5.2.2 Determining the Settings for CPU/GPU Chunk Sizes
   5.3 Experimental Results
       5.3.1 Scalability Studies
       5.3.2 Auto-Tuning and Performance with Heterogeneous Computing
       5.3.3 System Tuning with Setting CPU/GPU Chunk Sizes
   5.4 Summary

6. FT-MATE: Supporting Efficient Fault Tolerance for MPI Programs in MATE Systems
   6.1 Fault Tolerance Approaches
       6.1.1 Checkpointing
       6.1.2 Our Approach: High-level Ideas
       6.1.3 The Extended Generalized Reduction (EGR) Model
       6.1.4 Application Examples
       6.1.5 Fault Tolerance Approach Based on EGR Model
   6.2 System Design and Implementation
       6.2.1 EGR Model Implementation in MATE System
       6.2.2 Fault Tolerant Runtime System
   6.3 Experimental Results
       6.3.1 Evaluation Methodology
       6.3.2 Scalability Studies
       6.3.3 Overheads of Supporting Fault Tolerance
       6.3.4 Overheads of Fault Recovery
   6.4 Summary

7. Dealing with GPU Failures in the MATE-CG System
   7.1 Our GPU Fault Tolerance Technique
       7.1.1 GPU Checkpointing
       7.1.2 CPU-GPU Work Distribution upon GPU Failure
   7.2 Experimental Results
       7.2.1 Evaluation Methodology
       7.2.2 Scalability of Using CPU and GPU Simultaneously
       7.2.3 Benefits of Checkpoint and Work Re-distribution upon GPU Failure
   7.3 Summary

8. Future Work
   8.1 Higher-Level APIs for MapReduce
   8.2 Higher-Level APIs for Generalized Reduction

9. Conclusions
   9.1 Contributions

Bibliography

List of Tables

3.1 The scheduler_args_t Data Structure Type

3.2 Descriptions of the Functions in the System API

5.1 Descriptions of Important Data Types and Functions in the System API

7.1 Comparison of Size of Checkpoint Information

List of Figures

1.1 Processing Structures

2.1 Varying Input Split Size

2.2 Varying No. of Concurrent Maps

2.3 K-means: Comparison between FREERIDE and Hadoop

2.4 Apriori: Comparison between FREERIDE and Hadoop

2.5 KNN: Comparison between FREERIDE and Hadoop

2.6 Wordcount: Comparison between FREERIDE and Hadoop

2.7 Scalability of FREERIDE and Hadoop: k-means

2.8 Scalability of FREERIDE and Hadoop: wordcount

2.9 K-means: Varying No. of k

2.10 K-means: Varying No. of Dimensions

2.11 Pseudo-code for k-means using FREERIDE API

2.12 Pseudo-code for k-means using Hadoop API

2.13 Pseudo-code for wordcount using FREERIDE API

2.14 Pseudo-code for wordcount using Hadoop API

3.1 One-Stage Execution Overview in MATE

3.2 K-means: Comparison between MATE, Phoenix and Hadoop on 8 cores

3.3 K-means: Comparison between MATE, Phoenix and Hadoop on 16 cores

3.4 PCA: Comparison between MATE and Phoenix on 8 cores

3.5 PCA: Comparison between MATE and Phoenix on 16 cores

3.6 Apriori: Comparison between MATE, Phoenix and Hadoop on 8 cores

3.7 Apriori: Comparison between MATE, Phoenix and Hadoop on 16 cores

3.8 Pseudo-code for apriori using Generalized Reduction API

3.9 Pseudo-code for apriori using MapReduce API

4.1 Execution Overview of the Extended MATE

4.2 Matrix-Vector Multiplication using checkerboard partitioning. B(i,j) represents a matrix block, I_V(j) represents an input vector split, and O_V(i) represents an output vector split. The matrix/vector multiplies are done block-wise, not element-wise.

4.3 PageRank: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 16G matrix

4.4 HADI: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 16G matrix

4.5 HCC: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 16G matrix

4.6 HCC: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for an 8G matrix

4.7 HCC: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 32G matrix

4.8 HCC: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 64G matrix

4.9 MapReduce Stage-1 Algorithm for GIM-V

4.10 MapReduce Stage-2 Algorithm for GIM-V

4.11 Generalized Reduction Algorithm for GIM-V

5.1 Execution Work-Flow Overview in MATE-CG

5.2 An Illustration of Expression (5.1): Tcg is a function of p

5.3 The Avg. Load Imbalance Ratio with Varying the CPU Chunk Size: n is the number of CPU threads and b is the CPU block size (bcpu)

5.4 PageRank: Comparison between CPU-1, CPU-8, and GPU-only on 2, 4, 8, and 16 nodes for a 64GB dataset

5.5 Gridding Kernel: Comparison between CPU-1, CPU-8, and GPU-only on 2, 4, 8, and 16 nodes for a 32GB dataset

5.6 EM: Comparison between CPU-1, CPU-8, and GPU-only on 2, 4, 8, and 16 nodes for a 32GB dataset

5.7 PageRank: Auto-Tuning CPU Processing Fraction (Results from 16 nodes)

5.8 EM: Auto-Tuning CPU Processing Fraction (Results from 16 nodes)

5.9 Gridding Kernel: Varying Percentage of Data to CPUs in the CPU-8-n-GPU mode (16 nodes)

5.10 Gridding Kernel: Varying CPU Chunk Size for CPU-8 and GPU Chunk Size for GPU-only on 16 nodes

6.1 Pseudo Code: Dense Grid Computations

6.2 Pseudo Code: Irregular Reductions

6.3 Partitioning on Reduction Space for Irregular Reductions

6.4 Pseudo Code: Processing with Fault Tolerance

6.5 The Fault Recovery for an Irregular Application

6.6 Scalability with CPU-1 versions on 2, 4, 8, and 16 nodes for each of the four applications

6.7 Scalability with CPU-8 versions on 2, 4, 8, and 16 nodes for each of the four applications

6.8 Jacobi and Sobel: Relative Checkpointing Overheads with Varying Checkpoint Intervals (on 8 nodes)

6.9 Euler and Moldyn: Relative Checkpointing Overheads with Varying Checkpoint Intervals (on 8 nodes)

6.10 Jacobi: Comparison between MPICH2-BLCR and FT-MATE with Varying Failure Points (8GB dataset on 32 nodes and checkpoint interval 100/1000)

6.11 Sobel: Comparison between MPICH2-BLCR and FT-MATE with Varying Failure Points (8GB dataset on 32 nodes and checkpoint interval 100/1000)

6.12 Euler: Comparison between MPICH2-BLCR and FT-MATE with Varying Failure Points (5.8GB dataset on 32 nodes and checkpoint interval 100/1000)

6.13 Moldyn: Comparison between MPICH2-BLCR and FT-MATE with Varying Failure Points (7.4GB dataset on 32 nodes and checkpoint interval 100/1000)

7.1 Structured Grid Pattern

7.2 Sobel: Scalability of CPU+GPU Version Without Failure

7.3 Jacobi: Scalability of CPU+GPU Version Without Failure

7.4 PageRank: Scalability of CPU+GPU Version Without Failure

7.5 Gridding Kernel: Scalability of CPU+GPU Version Without Failure

7.6 Sobel: Benefit of Checkpoint and Work Re-distribution During GPU Failure

7.7 Jacobi: Benefit of Checkpoint and Work Re-distribution During GPU Failure

7.8 PageRank: Benefit of Checkpoint and Work Re-distribution During GPU Failure

7.9 Gridding Kernel: Benefit of Checkpoint and Work Re-distribution During GPU Failure

Chapter 1: Introduction

Our overall research work in this dissertation has been motivated by four recent trends.

First, the past decade has seen unprecedented data growth, as information is being continuously generated in digital form. This has sparked a new class of high-end applications in which there is a need to perform efficient data analysis on massive datasets. Such applications, with their associated data management and efficiency requirements, define the term Data-Intensive SuperComputing (DISC) [30]. Examples of these data-intensive applications arise from domains like science, engineering, medicine, and e-commerce.

The second trend is the widespread use of parallel computing environments, dominated by CPU clusters, GPU clusters, and CPU-GPU clusters. While CPU clusters remain the traditional HPC platform, GPU clusters and CPU-GPU clusters have recently become popular due to their cost-effectiveness and energy efficiency. Two of the important considerations with the use of any parallel system are programmer productivity and performance efficiency. Both GPU clusters and CPU-GPU clusters are lacking with respect to one or both of these factors, as compared to the traditional HPC platforms.

Third, the emergence of the map-reduce paradigm [50] and its variants has coincided with the growing prominence of data-intensive applications. As a result, map-reduce has been used for implementing this class of applications on different parallel architectures. In particular, map-reduce was originally developed for large-scale clusters and has since been ported to and extensively evaluated on multi-core CPUs [44, 111, 141], a single GPU [34, 70, 72], and GPU clusters [57, 127]. Also, there are many efforts supporting and/or translating other parallel primitives and programming models to GPUs [31, 73, 57, 89, 91, 121], but there is no general support to date for data-intensive computing on a heterogeneous CPU-GPU cluster.

Fourth, fault tolerance has become a major challenge for large-scale HPC systems [32, 120, 87]. It is widely accepted that the existing MPI-based fault tolerant solutions with checkpointing will not be feasible in the exascale era [32]: with the growing size of high-end systems and relatively lower I/O bandwidths, the time required to complete a checkpoint can exceed the Mean-Time To Failure (MTTF). Moreover, with energy consumption being the dominant consideration for exascale systems, and with data movement (including data movement for checkpoints) being one of the major energy consumers, novel solutions are clearly needed.

Therefore, our overall work focuses on the design of a map-reduce-like model with compiler and runtime support for different parallel architectures. In other words, we would like to provide better programming productivity by exposing a simple, high-level API, achieve better performance efficiency by utilizing the massive computing power delivered by different parallel platforms, and provide efficient fault tolerance support on top of a high-level programming model.

In this chapter, we first introduce background on the parallel programming models and parallel architectures we have been working with. Then, we give an overview of our current work, including a comparative study and the three parallel systems we have developed for different platforms. Finally, we introduce our work on model-based fault tolerance support in our parallel systems and discuss related efforts in all these areas.

1.1 High-Level APIs: MapReduce and Generalized Reduction

Figure 1.1: Processing Structures. (a) Map-Reduce; (b) Generalized Reduction.

This section describes the existing popular API for data-intensive computing, map-reduce, and a variant of it that we implemented in our systems and refer to as the generalized reduction model.

The pseudo-code for the map-reduce paradigm is shown in Figure 1.1(a) and can be summarized as follows [50]. Map-reduce expresses the computation as two user-defined functions: map and reduce. The map function takes a set of input instances and generates a set of corresponding intermediate output (key, value) pairs. The map-reduce library groups together all of the intermediate values associated with the same key and shuffles them to the reduce function. The reduce function, also written by the user, accepts a key and a set of values associated with that key. It merges together these values to form a possibly smaller set of values. Typically, just zero or one output value is produced per reduce invocation.
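As a concrete illustration of this processing structure, the following is a minimal word-count sketch in C++. The emit_intermediate(), emit(), map_fn(), and reduce_fn() names, and the in-memory grouping step, are illustrative stand-ins for the runtime of a real map-reduce library such as Hadoop, not its actual interface.

#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the map-reduce runtime: intermediate pairs are grouped by key here.
static std::map<std::string, std::vector<int>> groups;

void emit_intermediate(const std::string& key, int value) { groups[key].push_back(value); }
void emit(const std::string& key, int value) { std::cout << key << " " << value << "\n"; }

// map: one input instance (a line of text) produces a set of (word, 1) pairs.
void map_fn(const std::string& line) {
  std::istringstream words(line);
  for (std::string w; words >> w; ) emit_intermediate(w, 1);
}

// reduce: all values sharing a key are merged, typically into one output value.
void reduce_fn(const std::string& word, const std::vector<int>& counts) {
  int total = 0;
  for (int c : counts) total += c;
  emit(word, total);
}

int main() {
  for (std::string line : {"the quick fox", "the lazy dog"}) map_fn(line);
  for (const auto& kv : groups) reduce_fn(kv.first, kv.second);  // group by key, then reduce
  return 0;
}

Note that the grouping (and, in a distributed setting, the sorting and shuffling) between the two steps is performed entirely by the runtime; the programmer only supplies the two functions.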

The generalized reduction API reduces memory requirements for intermediate data and appears to be better suited for a heterogeneous environment. This API is shown in Figure 1.1(b). The differences between this API and map-reduce are as follows. One difference is that our API allows developers to explicitly declare a reduction object and perform updates to its elements directly, whereas in map-reduce, the reduction object is implicit and not exposed to the programmer. Another important distinction is that, in map-reduce, all the data elements are processed in the Map step and the intermediate results are only later combined in the Reduce step. With the generalized reduction API, the Map and Reduce steps are combined into a single step in which each data element is processed and reduced before the next data element is processed. This design choice avoids the overheads due to sorting, grouping, and shuffling, which can degrade performance in map-reduce implementations. The function reduce() is an associative and commutative function, so the iterations of the for-each loop can be performed in any order. The data structure RObj, which is the reduction object, is updated after each iteration and later combined with all other reduction objects to obtain the final result.

The following are the required functions that must be provided by the application developer:

• Reduction: The reduction function specifies how, after processing one data instance, a reduction object (initially declared by the programmer) is updated. The result of this processing must be independent of the order in which data instances are processed on each processor. The order in which data instances are processed is determined by the runtime system.

• Combination: In this function, the final results from multiple copies of a reduction object are combined into a single reduction object. A user can choose from one of the several common combination functions already implemented in the system library (such as aggregation, concatenation, etc.), or they can provide one of their own.

It may appear that the generalized reduction API simply makes the combine function, which is an option available to developers in map-reduce, a requirement. However, this is not the case. The combine function of map-reduce can only reduce communication: the (key, value) pairs are still generated on each node and can result in high memory requirements, causing application slowdowns [80, 77]. With the generalized reduction API, map, reduce, and a possible combine are integrated together while processing each element, thus avoiding these memory overheads.

1.2 Parallel Architectures: CPUs and GPUs

Many high-performance computing clusters are composed of multi-core CPUs and/or many-core GPUs. In particular, the emergence of CPU-GPU clusters has made heterogeneous clusters a dominant HPC platform, with three of the four fastest supercomputers in the world falling in this category. These environments have become popular due to their cost-effectiveness and energy efficiency. As a result, the need for exploiting both the CPU and the GPU on each node of such platforms has created a renewed interest in heterogeneous computing [139, 104, 58, 94, 115]. Here, we introduce the parallel architecture of the modern CPUs and GPUs we have been working with.

The architecture of a modern CPU or GPU consists of two major components: a multi-threaded processing component and a multi-layer memory hierarchy.

Modern CPU Architecture: Multi-core CPUs have been widely used to achieve high performance for many years. A modern multi-core CPU typically consists of a number of CPU cores as the multi-threaded processing component, each featuring an out-of-order super-scalar micro-architecture. The CPU memory hierarchy has a memory controller that connects the CPU cores to the channels of DDR memory, and it also provides high-speed L1/L2/L3 data caches of different sizes at each level. For example, the Intel Core i7-960 CPU offers four cores on the same die running at a frequency of 3.2GHz with newly added 2-way hyper-threading. Each core has separate 32KB L1 caches for instructions and data, and a 256KB unified L2 cache. All four cores share an 8MB L3 data cache.

Modern GPU Architecture: The processing component in a typical GPU is composed of an array of streaming multiprocessors (SMs). Each SM has a set of simple cores that perform in-order processing of the instructions. To improve performance and hide memory latency, the multi-threading support on GPUs allows a few tens of thousands of threads to be launched and hundreds to be active simultaneously. The hardware SIMD structure is exposed to programmers through warps. Each warp of threads is co-scheduled on the streaming multiprocessor and executes the same instruction in a given clock cycle (SIMD execution).

To alleviate memory bandwidth pressure, the GPU memory hierarchy typically contains several layers. One is the host memory, which is the CPU main memory; it is essential because any general purpose GPU computation can only be launched from the CPU. The second layer is the device memory, which resides on the GPU card and represents the global memory on a GPU. The device memory is accessible across all streaming multiprocessors and is interconnected with the host through a PCI-Express link (the version can vary depending upon the card). This interconnect enables DMA transfer of data between the host and device memory. The on-chip memories also include a local shared buffer, which is private to each streaming multiprocessor, programmable, and supports high-speed access. This local shared buffer is termed shared memory on NVIDIA cards. Typically, the size of this shared memory was 16 KB.

Tesla 2050 (a.k.a. Fermi) is the latest series of graphics card products from NVIDIA. The Fermi architecture features 14 or 16 streaming multiprocessors. Each SM has 32 scalar processing units, each running at 1.215GHz, and the memory clock is 3348MT/s. As compared to earlier products, Fermi has a much larger shared memory/L1 cache, which can be configured as 48KB shared memory and 16KB L1 cache, or 16KB shared memory and 48KB L1 cache. Unlike previous cards, an L2 cache is also available.

1.3 Dissertation Contributions

We will now give a summary of our previous work starting from a comparative study.

Comparing FREERIDE and Hadoop for Data-Intensive Applications: Map-reduce has been famous for its simplicity and robustness since it was proposed by Google in 2004; however, the performance aspects of map-reduce were less well understood at that time. So, in this work, we compared the programming APIs and performance of the Hadoop implementation of map-reduce with a parallel data-mining system named FREERIDE [81, 82, 83] that was based on the generalized reduction model and developed earlier by Agrawal's group at Ohio State.

The main observations from our results are as follows. First, the API and the functionality offered by FREERIDE have many similarities with the map-reduce API. However, there are some differences in the API. Moreover, while FREERIDE was motivated by data mining computations, map-reduce was motivated by searching, sorting, and related applications in a data-center.

Second, we studied three data mining applications: k-means clustering, apriori association mining, and k-nearest neighbor search. We also included a simple data scanning application, word-count. For the three data mining applications we considered, FREERIDE outperformed Hadoop by a factor of 5 or more. For word-count, Hadoop is better by a factor of up to 2. With increasing dataset sizes, the relative performance of Hadoop becomes better. Overall, it appears that Hadoop has significant overheads related to initialization, I/O, and sorting of (key, value) pairs. Also, Hadoop does not fully utilize the overall resources in the cluster. Thus, despite an easy-to-program API, Hadoop's map-reduce does not appear very suitable for data mining computations on modest-sized datasets.

A MapReduce System with an Alternate API: Although we found that FREERIDE could outperform Hadoop for the three data mining algorithms, often by a factor of 5-10, this could be attributed to various reasons, like the difference between processing in Java (as in Hadoop) and C++ (as in FREERIDE), the inefficiency of resource utilization in Hadoop, or the overheads of the Hadoop distributed file system. This led to the implementation of MATE, which was created by introducing the generalized reduction API as part of the Phoenix system (a shared-memory, C-based implementation of map-reduce [111]). MATE thus allows a comparison of the two APIs that is not clouded by these other factors. The distinct API in MATE still offers a high-level interface, leading to much simpler implementations than alternate approaches like Pthreads or OpenMP. At the same time, it reduces the memory requirements associated with the large number of (key, value) pairs that can be generated with the original map-reduce API for several applications, even as compared to map-reduce implementations that use combine functions.

Our implementation on top of the Phoenix system allows us to carefully evaluate the performance advantages of the new API. We extensively evaluated our system using three data mining applications and compared it with both Phoenix (with the original map-reduce API) and Hadoop on two shared-memory platforms. The three data mining applications are k-means clustering, apriori association mining, and principal components analysis (PCA). The experimental results show that our system outperforms Phoenix for all three applications by an average of 50% and also leads to reasonable speedups in both environments. Hadoop is much slower than both systems.

Overall, we observe that while map-reduce eases the burden on programmers, its inherent API structure may cause performance losses for some applications due to the rigid two-stage computation style. Our approach, which is based on the generalized reduction, offers an alternate API, potentially useful for several sub-classes of data-intensive applications.

The Extended MATE with Large Reduction Object Support: The motivation for this work arises from two observations. First, many challenging real-world problems can be modeled as graphs [29, 88, 110, 124], and graph analysis or mining can help solve these problems [38, 112, 42, 138, 132, 85, 71]. Processing such large-scale graphs has become inevitable due to the increasing ubiquity and growing size of these graphs. Examples include the World Wide Web and social networks, such as those created by friendship links on Facebook. Because of the size of the graphs, it is natural to use parallel implementations to solve these problems. There has been much interest in the use of map-reduce for scalable graph mining; for example, work at CMU has developed a graph mining package based on the Hadoop implementation of map-reduce [86].

Second, a limitation of the explicit reduction-object-based approach in MATE has been that it only works when the reduction object can be maintained in the main memory of each node. In comparison, map-reduce implementations do not have a similar restriction and can maintain (key, value) pairs on disk. Many graph mining algorithms could not have been implemented in the original MATE system, because the reduction objects can be very large for typical problem sizes.

In this work, we present the design and evaluation of a system we refer to as Extended MATE (Ex-MATE), which addresses the above limitation and supports the parallelization of graph mining applications. We have compared our Ex-MATE implementations with the PEGASUS implementations, using three algorithms originally implemented in PEGASUS: PageRank, Diameter Estimation, and finding Connected Components. For each of the three applications, PEGASUS provides a naive implementation as well as an optimized block version. Our experimental results show that our system outperforms PEGASUS for all three applications by factors of 9 to 35.

Overall, on a multi-core cluster of 16 nodes (128 cores), the results were as follows. For PageRank, our system is 35 times faster than the naive version on PEGASUS and 10 times faster than the block version. For Diameter Estimation, we achieve a speedup of 27 over the naive version and a speedup of 11 over the block version. For finding Connected Components, our system is 23 times as fast as the naive version on PEGASUS and 9 times as fast as the block version. Thus, we have demonstrated that the MATE approach can be used for applications involving very large reduction objects and can continue to outperform implementations based on the original map-reduce API.

A MapReduce-Like Framework for CPU-GPU Clusters: This work is driven by the growing prominence of Data-Intensive SuperComputing (DISC) [30] and the emergence of CPU-GPU clusters as a dominant HPC platform, with three of the four fastest supercomputers in the world falling in this category. The reasons for the popularity of these environments include their cost-effectiveness and energy efficiency. However, heterogeneous platforms are currently lacking with respect to both programming productivity and performance efficiency, as compared to the traditional HPC platforms.

Due to the popularity of map-reduce, there have been some efforts on supporting map-reduce on a single GPU [34, 70, 72] and on GPU clusters [57, 127], as well as many efforts supporting and/or translating other parallel primitives and programming models to GPUs [31, 73, 57, 89, 91, 121]. However, there is no support to date for data-intensive computing on a heterogeneous CPU-GPU cluster.

To accelerate data-intensive computing on parallel heterogeneous environments, specifically hybrid CPU-GPU clusters, this work presents the MATE-CG (Map-reduce with an AlternaTE for Cpu-Gpu clusters) system, a map-reduce-like framework that leverages the power of both CPUs and GPUs in a cluster. The MATE-CG system ports the generalized reduction model, which is similar to map-reduce. Because of the significant differences in the processing abilities of the CPU cores and the GPU, different applications have very different relative performance on GPUs; the MATE-CG runtime therefore supports three schemes. The CPU-only scheme allows the use of one or more cores on each node, the GPU-only scheme uses the many-core GPUs only, and the CPU-n-GPU scheme can use the aggregate computing power of the multi-core CPU and the GPU(s) on each node. The performance of an application executing in such an environment is impacted by a partitioning parameter, which controls the amount of data that is processed by the CPU and the GPU, respectively. The optimal choice of this parameter is highly application and dataset dependent. Our middleware includes a novel auto-tuning approach, in which we exploit the iterative nature of many data-intensive applications to find the appropriate value of the parameter at runtime, with very low tuning overheads.
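One simple way to picture this runtime tuning is sketched below: across the iterations of an iterative application, the fraction of data assigned to the CPU is adjusted toward the point where the CPU and the GPU finish at the same time. This is only an illustrative heuristic under the assumption that each device processes data at a roughly fixed rate; the analytical model and the tuning procedure actually used by MATE-CG are developed in Chapter 5.

#include <algorithm>
#include <cstdio>

// Adjust the CPU fraction p after an iteration so that the measured CPU and GPU
// processing times converge; the result is clamped to [0, 1].
double retune_cpu_fraction(double p, double t_cpu, double t_gpu) {
  double rate_cpu = t_cpu / p;            // extrapolated time for the whole data on the CPU
  double rate_gpu = t_gpu / (1.0 - p);    // extrapolated time for the whole data on the GPU
  double p_new = rate_gpu / (rate_cpu + rate_gpu);   // balance p*rate_cpu == (1-p)*rate_gpu
  return std::min(1.0, std::max(0.0, p_new));
}

int main() {
  double p = 0.5;                          // start by splitting the data evenly
  // Fake per-iteration timings standing in for values measured by the runtime.
  double cpu_times[] = {8.0, 5.1, 4.3, 4.1};
  double gpu_times[] = {2.0, 3.4, 3.9, 4.0};
  for (int it = 0; it < 4; ++it) {
    p = retune_cpu_fraction(p, cpu_times[it], gpu_times[it]);
    std::printf("iteration %d: cpu fraction = %.3f\n", it, p);
  }
  return 0;
}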

Overall, we have used our MATE-CG system to implement several data-intensive applications arising from various domains, including data mining, graph mining, and scientific computing. We show how the generalized reduction model, with support for programming clusters of multi-core CPUs and/or many-core GPUs, enables efficient and scalable implementations of data-intensive algorithms. We also highlight the importance of tuning certain system parameters.

Supporting Efficient Fault Tolerance for MPI Programs in the MATE System: Fault tolerance has become a major challenge for large-scale HPC systems [32, 120, 87]. The most popular fault tolerance approach in HPC is checkpointing, where snapshots of the system are taken periodically. Checkpointing MPI programs and facilitating recovery after failure has been a topic of much investigation [15, 28, 52, 125, 74, 27, 47, 103, 67], consistent with the fact that MPI has been the most popular parallel programming model over the last two decades.

It is widely accepted that the existing MPI-based solutions will not be feasible in the exascale era [32]: with the growing size of high-end systems and relatively lower I/O bandwidths, the time required to complete a checkpoint can exceed the Mean-Time To Failure (MTTF). Moreover, with energy consumption being the dominant consideration for exascale systems, and with data movement (including data movement for checkpoints) being one of the major energy consumers, novel solutions are clearly needed.

Besides fault tolerance, another important problem in high performance computing is programmability. With programmer productivity being a critical factor, and with increasing intra-node concurrency and diversity, there is a need for alternate programming models. Given the dual emphasis on programmability and fault tolerance, it is natural to ask whether the design of a programming model can help facilitate lower-cost fault tolerance and recovery.

The reduction object model has been shown to achieve fault tolerance with lower overheads than Hadoop for data mining and graph mining applications [26]. Therefore, in this work, we extend these ideas to support more efficient reduction-object-based fault tolerance for MPI programs. In particular, we show that designing a programming model (based on the MATE systems) around an application's underlying communication pattern can greatly simplify fault-tolerance support, resulting in at least an order of magnitude reduction in checkpointing overheads. The communication patterns we consider are similar to the notion of dwarfs in the Berkeley view on parallel processing (see http://view.eecs.berkeley.edu/wiki/Dwarf Mine), where applications needing high performance are categorized into 13 dwarfs. While we believe that our approach can be applied to most or all of the dwarfs, we have so far focused on three patterns.

We show how simple APIs can support the development of applications involving these patterns. Furthermore, we automatically save important state and can efficiently recover from failures, potentially using a different number of nodes than the original execution. The main underlying idea in our approach is that when we focus on communication routines in a low-level library like MPI, the high-level information about the application is lost. Thus, implementing fault tolerance through a higher layer, where the properties of the underlying computation are preserved, can be very beneficial. Moreover, by knowing an application's pattern, we can choose the points for checkpointing and the data that needs to be checkpointed, without involving the user or a compiler. Furthermore, a higher-level API can enable re-distribution of work and data, potentially allowing recovery using a different number of nodes.
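A rough sketch of what this looks like at the level of the processing loop is given below. Because the reduction object (together with the iteration index) captures the computation's state at an iteration boundary, that is all that needs to be written out, and a restarted run, possibly on a different number of nodes, simply reloads it and repartitions the remaining data. The structure, function names, and checkpoint format here are illustrative assumptions, not the FT-MATE implementation described in Chapter 6.

#include <cstdio>
#include <vector>

// Illustrative checkpoint state: the declared reduction object plus the iteration counter.
struct Checkpoint {
  int next_iteration;
  std::vector<double> reduction_object;
};

// Write the (small) reduction object to stable storage at an iteration boundary.
void save_checkpoint(const Checkpoint& cp, const char* path) {
  if (FILE* f = std::fopen(path, "wb")) {
    std::fwrite(&cp.next_iteration, sizeof(int), 1, f);
    std::fwrite(cp.reduction_object.data(), sizeof(double),
                cp.reduction_object.size(), f);
    std::fclose(f);
  }
}

int main() {
  const int total_iterations = 100, checkpoint_interval = 10;
  Checkpoint cp{0, std::vector<double>(1024, 0.0)};   // e.g., cluster centroids or a grid slice
  for (int it = cp.next_iteration; it < total_iterations; ++it) {
    // ... process the local data splits, updating cp.reduction_object ...
    if ((it + 1) % checkpoint_interval == 0) {
      cp.next_iteration = it + 1;
      save_checkpoint(cp, "robj.ckpt");   // only the reduction object is saved,
    }                                     // not the full process image
  }
  return 0;
}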

Our model-based approach has been implemented in the context of a precursor map-reduce-like system, MATE [80, 77, 78]. We have evaluated our approach using four applications, two each from the classes of dense grid computations and irregular reductions. Our results show that the checkpointing overheads of our model-based approach in the absence of failures are much lower than those of a fault tolerant MPI library (MPICH2), especially with a high frequency of checkpointing. Furthermore, when a failure occurs, failure recovery is very efficient, and the only significant source of slowdown is the extra work re-distributed onto other nodes. Finally, our approach can tolerate both permanent node failures, where the failed node is excluded for the rest of the processing, and temporary node failures, where the failed node can recover and rejoin the processing.

Dealing with GPU Failures in the MATE-CG System: It is also increasingly important to provide fault tolerance on GPUs, since GPU pipelines are becoming more popular and programmable, which has made GPUs attractive to a wider audience [99]. However, pre-Fermi NVIDIA GPUs do not provide fault tolerance. As GPUs have become more general purpose, more high performance computing applications that require fault tolerance are being ported to GPUs. Thus, providing fault tolerance on GPUs is becoming increasingly important.

Therefore, in this work, we deal with GPU failures by incorporating fault tolerance support for GPU computations into the MATE-CG system. The MATE-CG runtime is able to periodically checkpoint the work done on the GPUs into the host CPUs. In our implementation, we assume that only the GPU is unreliable (i.e., the CPU is reliable). If a fault is detected on a GPU, the runtime system can roll back to the last checkpoint on the host CPU-GPU node and continue executing from there during recovery. Using four applications from stencil computations, data mining, and scientific computing, our results on 8 CPU-GPU nodes show that the use of checkpointing for GPU results, together with an even re-distribution of the remaining work across the remaining resources in the presence of GPU failures, incurs very low overheads in application runs.
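A minimal sketch of the host-side control logic is shown below, under the stated assumption that only the GPU can fail: at each checkpoint interval the GPU's copy of the reduction object is copied back to host memory, and if a GPU failure is detected the loop falls back to reprocessing, on the CPU, all chunks since the last checkpoint, starting from that host-side copy. The chunk-processing functions are hypothetical placeholders standing in for real kernels, not the MATE-CG API.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-chunk processing routines (placeholders for the real GPU/CPU code paths).
bool process_chunk_on_gpu(int chunk, float* d_robj, size_t n) {
  (void)d_robj; (void)n;
  return chunk != 40;                       // simulate a GPU failure at chunk 40
}
void process_chunk_on_cpu(int chunk, std::vector<float>& robj) {
  (void)chunk; (void)robj;                  // a real version would redo this chunk's reduction
}

int main() {
  const size_t n = 1 << 16;
  const int num_chunks = 64, checkpoint_interval = 8;
  std::vector<float> host_robj(n, 0.0f);    // last known-good copy of the reduction object
  float* d_robj = nullptr;
  cudaMalloc(&d_robj, n * sizeof(float));
  cudaMemcpy(d_robj, host_robj.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  int last_ckpt = 0;                        // first chunk not yet covered by a checkpoint
  for (int chunk = 0; chunk < num_chunks; ++chunk) {
    if (!process_chunk_on_gpu(chunk, d_robj, n)) {
      // GPU failure: roll back to the last host-side checkpoint and redo all chunks
      // since then (their GPU results were lost) on the reliable CPU.
      for (int c = last_ckpt; c < num_chunks; ++c) process_chunk_on_cpu(c, host_robj);
      break;
    }
    if ((chunk + 1) % checkpoint_interval == 0) {   // checkpoint GPU results to the host
      cudaMemcpy(host_robj.data(), d_robj, n * sizeof(float), cudaMemcpyDeviceToHost);
      last_ckpt = chunk + 1;
    }
  }
  cudaFree(d_robj);
  return 0;
}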

1.4 Related Work

This section compares our overall work with related research efforts in data-intensive computing, computing using different parallel environments, especially heterogeneous systems, and fault tolerance in high-end systems.

Data-Intensive Computing: Recent years have seen a large number of efforts on map-reduce, its variants, and other abstractions. Efforts underway include projects that have used and evaluated map-reduce implementations for a variety of applications, as well as those that are trying to improve its performance and programmability on multi-core CPUs, GPUs, and heterogeneous systems.

Gillick et al. [64] evaluate Hadoop in a distributed environment with clusters of nodes using both a set of benchmark applications and machine learning algorithms. Chu et al. [44] have used Google's map-reduce for a variety of learning algorithms. GraphLab [93] was developed by Hellerstein's group for machine learning algorithms and improves upon abstractions like map-reduce using a graph-based data model. Aluru et al. implemented a map-reduce-like framework to support searches and computations on tree structures [116]. CGL-MapReduce [54] uses streaming for all the communications, and thus improves the performance to some extent. Zaharia et al. [143] improved Hadoop response times by designing a new scheduling algorithm for a virtualized data center. Seo et al. [122] proposed two optimization schemes, prefetching and pre-shuffling, to improve Hadoop's overall performance in a shared map-reduce environment. MapReduce Online [45] extended Hadoop by supporting online aggregation and continuous queries for processing pipelined data. Verma et al. [135] and Kambatla et al. [84] reduced the job completion time by enforcing less-restrictive synchronization semantics between the Map and Reduce phases for certain classes of applications, in different ways. Each of these techniques results in substantial performance improvements and would likely benefit the graph mining implementation in PEGASUS as well. However, the reported improvements are much smaller (less than a factor of 2) than the improvements reported from the Ex-MATE system for these applications.

Ranger et al. [111] have implemented a shared-memory map-reduce library named Phoenix for multi-core systems, and Yoo et al. [141] optimized Phoenix specifically for large-scale multi-core systems. Mars [70] was considered the first attempt to harness the GPU's power for map-reduce applications. MapCG [72] presents a model based on MapReduce to support source code level portability between CPU and GPU. Catanzaro et al. [34] also built a framework around the map-reduce abstraction for support vector machine training as well as classification on GPUs.

None of these efforts have, however, considered a cluster of heterogeneous nodes as the target. The limited amount of work on map-reduce-type processing on GPU clusters is from the MITHRA [57] and IDAV [127] projects. MITHRA [57] was introduced by Farivar et al. as an architecture to integrate Hadoop map-reduce with the power of GPGPUs in heterogeneous environments, specifically for Monte-Carlo simulations. IDAV [127] is a recent project to leverage the power of GPU clusters for large-scale computing by modifying the map-reduce paradigm. The key differences in the approach used in our MATE-CG system are: 1) support for exploiting both multi-core CPUs and GPUs simultaneously, and 2) processing of applications involving a very large reduction object.

Other programming abstractions for data-intensive computing have also been developed, either on top of map-reduce or as alternatives to it. Disco [5] is an open source map-reduce runtime which is similar to Google's and Hadoop's architectures, but does not support a distributed file system. Microsoft has built Dryad [75], which is more flexible than map-reduce, since it allows execution of computations that can be expressed as DAGs. Yahoo has proposed Pig Latin [105] and Map-Reduce-Merge [43], both of which are extensions to Hadoop aiming to support more high-level primitives and improve performance. Facebook uses Hive [131] as its warehousing solution to support data summarization and ad-hoc querying on top of Hadoop. Google itself developed Sawzall [107] on top of map-reduce to provide a higher-level API for data processing. For graphs, Google first proposed several graph operations using map-reduce but concluded that map-reduce was not well suited to graph operations due to the high overheads of map-reduce iterations and communication [46]. Pregel [96] was then developed as a new programming model and an optimized infrastructure for mining relationships from graphs.

Programming Heterogeneous CPU-GPU Clusters: With the emergence of this type of system, there has been a lot of interest in programming models for heterogeneous systems. The Merge framework [92] is a general purpose programming model for heterogeneous multi-core systems. StarPU [21] provides a unified execution model to parallelize numerical kernels on a multi-core CPU equipped with one GPU. Ravi et al. [115] developed a compiler and runtime framework that can map a class of applications, also using generalized reductions, to a system with only one multi-core CPU and one GPU. Shirahata et al. proposed a hybrid scheduling prototype for running Hadoop tasks on both CPU cores and GPU devices [123], but they only considered distributing part of the computation, i.e., the Map tasks, across CPU cores and GPUs, while the Reduce tasks are always executed on CPUs. Our system, in contrast, can distribute the overall computation onto both CPUs and GPUs based on the generalized reduction model.

EXOCHI [139] creates a unified view by providing dynamic management of both CPU and GPU resources. Harmony [51] involves an execution model and a runtime framework to schedule computations either on a CPU or on a GPU, based on the estimated kernel performance. Teodoro et al. [130] describe a runtime framework that selects either a CPU core or the GPU for a particular task. In comparison to [51] and [130], we exploit both the multi-core CPU and the GPU simultaneously in the MATE-CG system. The Qilin system [94] has been developed with an adaptable scheme for mapping the computation between CPU and GPU simultaneously. The adaptivity of the system is based on extensive offline training, and it does not provide mechanisms to minimize GPU device overheads.

CUBA [62] is an attempt to avoid data management between CPU and GPU by letting the GPU directly access data present in CPU memory, and it allows overlapping data communication and computation. NVIDIA recently released the CUDA 4.0 toolkit [10], integrated with the Thrust library [14], to provide a unified memory space and to use unified virtual addressing to support faster multi-GPU programming.

CUDA [9] has been the most popular language for programming a single GPU, and there is a very large volume of work on programming a single GPU using CUDA [129]. The CUDASA project [126] exploits parallelism in multi-GPU systems and GPU-cluster environments by presenting an extension to the CUDA programming language. Lee et al. are trying to extend OpenMP to GPUs by providing a compiler framework for automatic translation and code optimization [91, 90], though targeting a single GPU. There have been other efforts at translating OpenMP to accelerators as well [37]. OpenCL [11] is a relatively low-level API for heterogeneous computing that runs on the CUDA architecture. Grewe et al. proposed a purely static approach to partition and map OpenCL programs onto heterogeneous CPU-GPU systems based on predictive modeling and program features [66]. Our systems target a restricted class of applications, but offer a much higher-level API. We have currently targeted a reduction-style API, though we could consider an alternative interface like OpenMP with user-defined reductions [53].

Also, there is a large volume of work on auto-tuning [140, 142, 48]. We have focused on a new application of auto-tuning, using a novel performance model, in the MATE-CG system.

Checkpointing and Fault Tolerance: Fault-tolerance support for MPI applications has been investigated by several research groups. Coordinated checkpointing and uncoordinated checkpointing coupled with message logging have been used most often for supporting fault tolerance [125, 52, 15, 47, 28, 74, 103, 67]. In addition, researchers have developed fault-tolerant algorithms for specific problems [109, 23, 20, 40, 49], and others have proposed approaches based on replication [22, 13, 147].

Currently, coordinated checkpointing is widely used in almost all the popular MPI stacks [28, 12, 7, 8, 1]. In this work, we have compared our approach against the coordinated checkpointing protocol used in MPICH2. OpenMPI [12] and MVAPICH2 [8] also support a similar coordinated checkpointing protocol, and we can expect their checkpointing times to be similar to those of MPICH2. Compared with coordinated checkpointing, uncoordinated checkpointing with message logging has similar or higher checkpointing time but lower absolute recovery time. Process migration, which is supported in MVAPICH2, is less expensive than traditional checkpointing, but it requires fault prediction and redundant nodes. In comparison, our approach does not rely on these assumptions.

OpenMPI [12] plans to support a variety of checkpointing protocols in addition to coordinated checkpointing in the future. MPICH-V [28], built on top of older versions of MPICH2, supported a different protocol based on uncoordinated checkpointing with message logging. A recent effort [67] implemented an uncoordinated checkpointing protocol for send-deterministic MPI applications [33] on top of MPICH2. Compared to coordinated checkpointing, they show that this method reduces the checkpoint time as well as the average number of processes to be rolled back. The structured grid computations we studied in this work are send-deterministic, but the irregular reductions are not. While uncoordinated checkpointing with message logging has not been made available in the latest MPICH2 version, our approach only needs to restart the same number of MPI processes as the number of failed nodes, without the need to roll back other processes. The Scalable Checkpoint/Restart (SCR) library [13] tolerates a single node failure following a memory-distributed checkpointing scheme and, like our approach, can exploit both reliable disk storage and the distributed memory.

Zheng et al. [146] have proposed an in-memory double checkpointing protocol for fault tolerance. Without relying on any reliable storage, the checkpoint data, which is encapsulated by the programmer, is stored at two different processors. Also, the checkpoints are taken at a time when the application memory footprint is small. Another approach, proposed by Marques et al. [97], dynamically partitions objects of the program into subheaps in memory. By specifying whether the checkpoint mechanism treats objects in different subheaps as always save, never save, or once save, they reduce the checkpoint size at runtime. Our work has some similarities, but the key difference is our ability to use the reduction property to checkpoint the necessary computational state into the distributed memory and/or reliable storage and also to re-distribute the remaining work across other nodes.

Our approach can be viewed as being somewhat similar to the work from Cornell on application-level checkpointing for parallel programs [60, 61, 59]. Their approach investigates the use of compiler technology to instrument codes to enable self-checkpointing and self-restarting. Our approach does not require any compiler analysis, and can enable restart with a different number of nodes.

AMPI [1], which is based on Charm++ [3], also supports the use of both disk and memory to checkpoint MPI processes [35, 146, 144, 145]. AMPI also provides additional fault tolerance support via message logging [101, 102] and proactive fault avoidance [117, 36]. One advantage of this approach is that it can support restart using a different number of processors. However, because high-level information about the application and the underlying communication patterns is not taken into account, we expect its checkpointing times to be similar to those of the existing MPI approaches.

Partitioned Global Address Space (PGAS) models have recently started becoming pop- ular for parallel programming, with being one of the prominent examples.

Similar to MPI, various approaches have been proposed and investigated for Global Ar- rays, including algorithm-based fault tolerance [134, 18], redundant communication ap- proaches [17], hybrid hardware-software solutions [136], and data-driven techniques for task-based work stealing computations [95]. Similar to AMPI approaches, this work can handle restart with a different number of nodes. Finally, there have been some recent efforts towards providing high-performance fault tolerance interface for GPU and GPU clusters [128, 99, 98, 24]. Their focus here is on general checkpointing of GPU programs.

Our work, instead, is derived from a high-level programming model and aims to deal with CPU or GPU failures in a cluster of nodes by checkpointing only the reduction objects.

1.5 Outline

The rest of the dissertation is organized as follows. In Chapter 2, we introduce Hadoop and FREERIDE and compare the performance of these two systems. In Chapter 3, we give an overview of the MATE system designed for shared-memory environments. The extended MATE system, which supports large-sized reduction objects in distributed environments, is introduced in Chapter 4. In Chapter 5, we present the design and implementation of the MATE-CG system with the support of programming the CPU-GPU clusters

and a novel auto-tuning framework. Then, we discuss our recent work on providing efficient fault tolerance in the MATE systems in Chapter 6 and Chapter 7. Finally, we discuss possible directions for future work in Chapter 8 and conclude in Chapter 9.

Chapter 2: Comparing Map-Reduce and FREERIDE for Data-Intensive Applications

Map-reduce has been a topic of much interest in recent years. While it is well accepted that the map-reduce APIs enable significantly easier programming, the performance aspects of the use of map-reduce are less well understood and studied, especially in comparison with other paradigms. To better understand the performance issues of map-reduce, our first work is a comparative study that focuses on comparing Hadoop (an open-source implementation of map-reduce) with a system that was developed earlier at Ohio

State, FREERIDE (FRamework for Rapid Implementation of Datamining Engines). The

API and the functionality offered by FREERIDE have many similarities with the map-reduce

API. However, there are some differences between the two APIs. Moreover, while FREERIDE was motivated by data mining computations, map-reduce was motivated by searching, sorting, and related applications in a data-center. We compare the programming APIs and performance of the Hadoop implementation of map-reduce with FREERIDE. For our study, we have taken three data mining algorithms, which are k-means clustering, apriori association mining, and k-nearest neighbor search. We have also included a simple data scanning application, word-count.

The main observations from our results are as follows. For the three data mining applications we have considered, FREERIDE outperformed Hadoop by a factor of 5 or more.

For word-count, Hadoop is better by a factor of up to 2. With increasing dataset sizes, the relative performance of Hadoop becomes better. Overall, it seems that Hadoop has significant overheads related to initialization, I/O, sorting of (key, value) pairs, and inefficient resource utilization. Thus, despite an easy-to-program API, Hadoop's map-reduce does not appear very suitable for data mining computations on modest-sized datasets.

2.1 Hadoop and FREERIDE

2.1.1 Hadoop Implementation

Map-reduce is a general idea for which there are different implementations. Hadoop is an open-source project attempting to reproduce Google's implementation, and is hosted as a sub-project of the Apache Software Foundation's Lucene search engine library [6].

Google's implementation is based on a distributed file system called Google File System (GFS) [63]. Hadoop also uses a similar distributed file system, named the Hadoop Distributed File System (HDFS)2. Hadoop is implemented in Java, and HDFS is a pure-Java

file system. It stores input datasets as well as the intermediate results and outputs across multiple machines. HDFS replicates the data on multiple nodes so that a failure of nodes containing a part of the data will likely not impact the entire computation process. Also, data locality is exploited in the Hadoop implementation.

One limitation of HDFS, compared with GFS, is that it cannot be directly mounted on an existing operating system. Data needs to be copied in and out of HDFS before and after executing a job. This can be time-consuming for large datasets. A Filesystem in Userspace (FUSE)3 has been developed to address this problem, at least for Linux and Unix systems. In the Hadoop implementation, the intermediate pairs generated by the Maps

2http://hadoop.apache.org/core/docs/current/hdfs design.html 3http://wiki.apache.org/hadoop/MountableHDFS

are first written to the local disks and then retrieved by the appropriate Reduce tasks. In this intermediate stage, the results are sorted and grouped based on the key, and the resulting groups are shuffled to the corresponding Reduce tasks by the scheduler. This strategy makes

Hadoop more robust, but brings extra steps of sorting, grouping, and shuffling. This in turn introduces a communication overhead, which can be a limiting factor in performance of certain applications.

Fault-tolerance is also an important aspect of the Hadoop implementation. There are three components that support fault tolerance. They are the (i) Name Node, (ii) Job Tracker, and (iii) Task Tracker. The Name Node functions as the server for HDFS and a Secondary

Name Node is created along with it to avoid a single point of failure of the Name Node.

Once HDFS is started and a client submits a job, the Job Tracker schedules the entire

workload based on a number of available Task Trackers in the cluster nodes, to execute the

map/reduce tasks. If a Task Tracker is aborted for some reason, this part of the job can be

rescheduled. However, the failure of the Job Tracker will cause the entire job to be lost, since

it is the scheduler for all tasks.

2.1.2 FREERIDE API and Runtime techniques

Agrawal’s group at Ohio State has previously developed a middleware system for cluster-

based data-intensive processing, which shared many similarities with the map-reduce frame-

work. However, there are some subtle but important differences in the API offered by these

two systems. First, FREERIDE allows developers to explicitly declare a reduction object,

and perform updates to its elements directly, while in Hadoop/map-reduce, the reduction object is implicit and not exposed to the application programmer. While an explicit reduction object

can help with performance gains, there is also a possibility of incorrect usage by the application

developer. The other important distinction is that, in Hadoop/map-reduce, all the data elements are processed in the map step and the intermediate results are then combined in the reduce step, whereas, in FREERIDE, both map and reduce steps are combined into a single step where each data element is processed and reduced before the next data element is processed.

This choice of design avoids the overhead due to sorting, grouping, and shuffling, which

can be significant costs in a map-reduce implementation.

FREERIDE Interface

The FREERIDE system was motivated by the difficulties in implementing and performance tuning parallel versions of data mining algorithms. FREERIDE is based upon the observation that parallel versions of several well-known data mining techniques share a relatively similar structure, which is that of a generalized reduction. Though the map-reduce paradigm is built on a similar observation [50], it should be noted that the first work on FREERIDE was published in 2001 [81], prior to the map-reduce paper by Dean and

Ghemawat in 2004.

The interface exploits the similarity among parallel versions of several data mining algorithms. The following functions need to be written by the application developer using the middleware.

Reduction: The data instances owned by a processor and belonging to the subset specified are read. A reduction function specifies how, after processing one data instance, a reduction object (initially declared by the programmer), is updated. The result of this processing must be independent of the order in which data instances are processed on each processor. The order in which data instances are read from the disks is determined by the runtime system.

ProcessNextIteration: This is an optional function that can be implemented by an application programmer. Typically, this function should include program logic specific to an application that would control the number of iterations for which the application should

run. This adds more flexibility to the programmer.

Finalize: After final results from multiple nodes are combined into a single reduction ob- ject, the application programmer can read and perform a final manipulation on the reduction object to summarize the results specific to an application.

Throughout the execution of the application, the reduction object is maintained in main memory. After every iteration of processing all data instances, the results from multiple threads in a single node are combined locally depending on the shared memory technique chosen by the application developer. After local combination, the results produced by all nodes in a cluster are combined again to form the final result, which is the global combination phase. The global combination phase can be achieved by a simple all-to-one reduce algorithm. If the size of the reduction object is large, both the local and global combination phases perform a parallel merge to speed up the process. The local combination and the communication involved in the global combination phase are handled internally by the middleware and are transparent to the application programmer.
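As a concrete illustration, when the reduction object is a flat numeric array (as in k-means), the global combination amounts to a single all-to-one reduction across nodes. The sketch below is only illustrative and assumes an MPI setting with element-wise addition as the merge operator; it is not FREERIDE's internal code.

#include <mpi.h>

/* Illustrative global combination of a numeric reduction object:
 * rank 0 receives the element-wise sum of every node's local copy. */
void global_combine(const double *local_obj, double *global_obj, int len)
{
    MPI_Reduce(local_obj, global_obj, len, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);
}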

Shared Memory Parallelization

The processing structure of FREERIDE enables us to develop general but efficient shared memory parallelization schemes. Consider again the loop in Figure 1.1(b). The main correctness challenge in parallelizing a loop like this on a shared memory machine arises because of possible race conditions (or access conflicts) when multiple processors update the same element of the reduction object. Two obvious approaches for avoiding race conditions are: full replication and full locking. In the full replication approach, each processor can update its own copy of the reduction object and these copies are then merged

together later. In the full locking approach, one lock or latch is associated with each aggregated value. However, in our experience with mining algorithms, the memory hierarchy impact of using locking when the reduction object is large was significant [113].

To overcome these overheads, we have designed several other optimized schemes. One such technique is optimized full locking. This approach overcomes the large number of cache misses associated with the full locking scheme by allocating a reduction element and the corresponding lock in consecutive memory locations. By appropriate alignment and padding, it can be ensured that the element and the lock are in the same cache block. Each update operation now results in at most one cold or capacity cache miss. The possibility of false sharing is also reduced.
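A minimal sketch of this layout in C is given below; the structure name, the use of a Pthreads mutex, and the 64-byte cache-line size are assumptions made for illustration, not the actual FREERIDE declarations.

#include <pthread.h>

/* Illustrative optimized full locking: each reduction element sits in
 * the same cache line as its lock, so an update touches at most one
 * cache block and false sharing is reduced. Assumes the lock and the
 * value together fit within one 64-byte line. */
#define CACHE_LINE 64

typedef struct {
    pthread_mutex_t lock;    /* lock guarding this element     */
    double value;            /* the aggregated reduction value */
    char pad[CACHE_LINE - sizeof(pthread_mutex_t) - sizeof(double)];
} __attribute__((aligned(CACHE_LINE))) locked_elem_t;

/* Update one aggregated value under its co-located lock. */
static void update_elem(locked_elem_t *e, double delta)
{
    pthread_mutex_lock(&e->lock);
    e->value += delta;
    pthread_mutex_unlock(&e->lock);
}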

2.2 Comparing FREERIDE and Hadoop: Case Studies

In the previous section, we described the Hadoop and FREERIDE systems. This section focuses on the API differences between these two systems. We study k-means clustering and word-count to compare the two systems’ APIs.

2.2.1 K-means Clustering

The first application we describe is k-means clustering, which is one of the commonly used data mining algorithms [76]. This was one of the earliest algorithms implemented in FREERIDE. The clustering problem is as follows. We consider transactions or data instances as representing points in a high-dimensional space. Proximity within this space is used as the criterion for classifying the points into clusters. The four steps in the sequential version of the k-means clustering algorithm are as follows: 1) start with k given centers for clusters; 2) scan the data instances, and for each data instance (point), find the center closest to it and assign this point to the corresponding cluster; 3) determine the k centroids from the points assigned to the corresponding centers; and 4) repeat this process until the

assignment of points to clusters does not change.

Figure 2.11 gives the pseudo-code of k-means using FREERIDE API. In FREERIDE,

first, the programmer is responsible for creating and initializing the reduction object. The

k cluster centroids are initialized as well. For k-means, the reduction object is an array

with k × (ndim + 2) elements, where ndim is the number of dimensions for the coordinate space. Each cluster center occupies (ndim + 2) consecutive elements in the reduction object array. Among the (ndim + 2) elements for each cluster, the first ndim elements are used to store the sum of the data points assigned to that cluster. The other two elements represent the number of data points belonging to the cluster and the accumulated overall distances.
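For concreteness, the layout and the per-point update can be expressed as simple index arithmetic; the following C fragment is purely illustrative (the names are not FREERIDE's) but mirrors the description above.

#include <stdlib.h>

/* Illustrative k-means reduction object: k groups of (ndim + 2) doubles,
 * one group per cluster (ndim coordinate sums, a point count, and an
 * accumulated distance). */
double *alloc_kmeans_object(int k, int ndim)
{
    return calloc((size_t)k * (ndim + 2), sizeof(double));
}

/* Update cluster c with point p (at distance dis from its center). */
void update_cluster(double *robj, int ndim, int c, const double *p, double dis)
{
    int base = c * (ndim + 2);
    for (int j = 0; j < ndim; j++)
        robj[base + j] += p[j];          /* coordinate sums          */
    robj[base + ndim] += 1.0;            /* count of assigned points */
    robj[base + ndim + 1] += dis;        /* accumulated distance     */
}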

In each iteration, the reduction operation reads a block of data points, and for each

point, it will compute the distance between the point and the k cluster centers and find the

closest one. Then, we can obtain the object ID in the reduction object for the closest cluster and update the corresponding elements of the reduction object. A global reduction phase is also required to merge the results from all nodes after the reduction operation is

applied on all data points.

Figure 2.12 gives the pseudo-code of k-means implemented using Hadoop API. The input datasets are partitioned into a number of splits first. Then, the Map tasks will be

launched. In each iteration, the Map task reads the existing cluster centroids from files and

then processes a split of the input data points using the map function. For each point, map

finds the closest cluster center, assigns the point to it, and then emits the cluster center id

as the key and the point as the value. The Reduce task collects all the points with the

same cluster center id, which is the key, and computes their mean as the new centroid of the cluster. Finally, the output of the Reduce tasks will be written into files and taken as the new cluster centers by the next iteration.

The combine function is optional in the Hadoop API, but to improve the performance, we implemented this function. It is similar to the reduce function. It just accumulates the coordinates of data points belonging to the same cluster without computing the mean. The k cluster centers are initialized and written into files and then copied to the HDFS before we run the program. Since k-means is iterative, a sequence of iterations is represented by the same number of map/reduce jobs. But due to the constraints of the Hadoop implementation, the configuration and initialization of map/reduce tasks for each job needs to occur every time a job is submitted. Hadoop also provides the run function, where the programmer can control the number of iterations to run k-means.

We can compare the codes shown and understand the similarities and differences be- tween the two systems. In FREERIDE, the reduction object is used to store the sum of the data points belonging to the same cluster center. It has to be created and initialized first.

Then, for each point, the reduction object is updated in the reduction operation, after we know which cluster it belongs to. When all data points are processed, the finalize operation will compute the new center vector for each cluster. In Hadoop, map computes the closest cluster center id for each point and emits it with the point value. The reduce function will gather the data points with the same cluster center id. So, to summarize, FREERIDE adds the coordinates of points belonging to each cluster point-by-point, and computes the mean after all points are read. Whereas, Hadoop computes the cluster center id point-by-point, without performing the sum. It performs the accumulation and computes the mean in the

Reduce Phase. Besides, the intermediate pairs produced by the Map tasks are written into

files and read by the Reduce tasks through HTTP requests.

One observation we can make is as follows. In FREERIDE, the reduction object, which

is allocated in memory, is used to communicate between the reduction and the finalize functions. In comparison, HDFS is used to connect the Map and the Reduce functions in

Hadoop. So, Hadoop requires extra sorting, grouping, and shuffling for this purpose.

2.2.2 Word Count

The second application we use is the word-count program, which is a simple applica-

tion, often used to describe the map-reduce API 4.

Figure 2.13 shows the pseudo-code for this application using FREERIDE API. This

application does not naturally fit into the FREERIDE framework. This is because we do

not know the exact size of the reduction object, as it depends on the number of distinct words in the dataset. So, the reduction object for this application is allocated dynamically

to record the frequency of each unique word and its reference id. We used a word map

to store the string value of each unique word and its id. The reduction operation takes a

block of words, and for each word, searches for it first in the word map. If we can find

the word, then retrieve its id from the word map, and update its frequency. Otherwise,

this is a new word, and we need to insert one entry before updating the reduction object. A

global reduction phase is also needed to merge the results from all nodes.

Figure 2.14 gives the pseudo-code for this application using Hadoop API, which is

smaller in code size. The map function takes a line of words and for each word, emits the

word and its associated frequency which is 1 in all cases. The reduce function gathers all

the counts for each unique word, and computes the sum before the final output.

Figure 2.1: Varying Input Split Size (average time per iteration in seconds vs. split size in MB, for the 1.6 GB and 6.4 GB datasets)

2.3 Experimental Evaluation

We now present a detailed experimental study evaluating and analyzing the perfor- mance of Hadoop and FREERIDE.

Since our main focus is on data mining applications, we choose three popular data mining algorithms. They are k-means clustering, apriori association mining, and k-nearest neighbor search. We had described k-means clustering earlier in Section 2.2. Apriori is a well-accepted algorithm for association mining, which also forms the basis for many newer

algorithms [16]. Association rule mining is the process of analyzing a set of transactions

to extract association rules and is a commonly used and well-studied data mining problem.

Given a set of transactions, each of them being a set of items, the problem involves finding subsets of items that appear frequently in these transactions.

4http://wiki.apache.org/hadoop/WordCount

Figure 2.2: Varying No. of Concurrent Maps (average time per iteration in seconds vs. maximum number of concurrent map tasks per node, for the 1.6 GB and 6.4 GB datasets)

The k-nearest neighbor classifier is based on learning by analogy [68]. The training samples are described as points in an n-dimensional numeric space. Given an unknown sample, the k-nearest neighbor classifier searches the pattern space for the k training samples that are closest, using the Euclidean distance, to the unknown sample. Among these three algorithms, k-means and apriori both require multiple passes over the data, whereas KNN is a single-pass algorithm.
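A minimal sketch of the core k-nearest-neighbor scan is given below; it keeps the k smallest squared Euclidean distances seen so far and is purely illustrative, with function and variable names that are assumptions rather than FREERIDE's actual code.

#include <float.h>

/* Illustrative k-NN scan: for a query point q, keep the k smallest
 * squared Euclidean distances over the training points. The caller
 * initializes best[0..k-1] to DBL_MAX. */
void knn_scan(const double *train, int npoints, int ndim,
              const double *q, double *best, int k)
{
    for (int i = 0; i < npoints; i++) {
        double d = 0.0;
        for (int j = 0; j < ndim; j++) {
            double diff = train[i * ndim + j] - q[j];
            d += diff * diff;
        }
        /* Replace the current worst of the k candidates if this point is closer. */
        int worst = 0;
        for (int m = 1; m < k; m++)
            if (best[m] > best[worst])
                worst = m;
        if (d < best[worst])
            best[worst] = d;
    }
}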

To complete our experimental study, we also included a fourth application, which is

quite different from the other three applications. Word-count is a simple search appli-

cation, and can be viewed as a representative of the class of applications that motivated

the development of the map-reduce framework. In comparison, as we had stated earlier,

FREERIDE was motivated by data mining algorithms, including the three applications we

are using in this study.

Our experiments were conducted on a cluster of multi-core machines. Each node in

the cluster is an Intel Xeon CPU E5345, comprising two quad-core CPUs. Each core has

Figure 2.3: K-means: Comparison between FREERIDE and Hadoop (average time per iteration in seconds on 4, 8, and 16 nodes)

a clock frequency of 2.33GHz and each node has a 6 GB main memory. The nodes in the

cluster are connected by Infiniband.

2.3.1 Tuning Parameters in Hadoop

The performance of an application on Hadoop can vary substantially based on the choice of various parameters. We carefully tuned these parameters, and the comparisons with FREERIDE reported are based on the best performance we could achieve with Hadoop.

Using k-means clustering as an example, this section describes the performance tuning we performed. Such tuning was performed for each of our applications, and the results were quite similar, but not identical. For this study, k-means was executed with 2 datasets, which are of 1.6 GB and 6.4GB, respectively. The number of dimensions is 3 and the number of clusters (k) is 1000.

Figure 2.4: Apriori: Comparison between FREERIDE and Hadoop (average time per iteration in seconds on 4, 8, and 16 nodes)

Figure 2.1 shows the average execution time per iteration of k-means when we vary the split-size. Split-size refers to the size of the portion of a dataset that each map task deals with. Results from both the datasets show that the split-size plays an important role in the performance. The size of an input split determines the number of map tasks. With a large split-size or a smaller number of map tasks, the initialization cost can be minimized, but the parallelism may be limited, and all cores in a multi-core node may not be utilized. At the same time, a very small split-size and a large number of map tasks also introduce a large overhead of spawning threads and merging their results. Thus, as we can see from

Figure 2.1, there is an optimal split size for each dataset. The results also show that when the dataset size is increased, the split size should also be increased correspondingly to avoid unnecessary startup overheads.

Figure 2.5: KNN: Comparison between FREERIDE and Hadoop (total time in seconds on 4, 8, and 16 nodes)

The second factor we consider is the number of concurrent maps, which refers to the maximum number of map tasks that can be executed concurrently per node. The results are shown in Figure 2.2. For both datasets, the best performance is achieved with 8 concurrent maps per node, which is also the number of cores available on each node. Clearly, we cannot exploit adequate parallelism with a smaller number of concurrent maps, whereas more tasks than the number of cores only increases scheduling overhead.

Another factor we tuned was the number of reduce tasks. For k-means, our tuning showed that the best performance is achieved with 4 tasks per node. We do not show the details here.

Figure 2.6: Wordcount: Comparison between FREERIDE and Hadoop (total time in seconds on 4, 8, and 16 nodes)

2.3.2 Performance Comparison for Four Applications

We now focus on comparing the two systems for the four applications we had described earlier. We executed these applications with FREERIDE and Hadoop, on 4, 8, and 16 nodes of our cluster. Thus, besides comparing the performance of the two systems, we also wanted to study the scalability of the applications with these frameworks.

For FREERIDE, we create 8 threads on each node. The shared memory technique we used for k-means, kNN, and apriori was full replication, since it gave the best performance. For word-count, we used optimized full locking. This was because we could not use replication on word-count (due to an implementation limitation), and among all locking schemes, optimized full locking seemed the best choice. The datasets were disk-resident for FREERIDE, and were read from disks for each application. For Hadoop, we tuned the parameters as before, and in most cases, we used 8 concurrent maps per node, 4 reduce

Figure 2.7: Scalability of FREERIDE and Hadoop: k-means (average time per iteration in seconds for dataset sizes from 800 MB to 12.8 GB)

tasks, and a 64 MB input split-size. We used 6.4 GB datasets for k-means, kNN, and word-

count, and a 900MB dataset for apriori. Later in this section, we compare performance of

the two systems while varying the size of the dataset.

For k-means, the dataset is 3-dimensional, and the number of clusters, k, was set to be

1000. With apriori, the support and confidence levels were 3% and 9%, respectively, and the application ran for 2 passes. In kNN, the value of k used in our experiments was 1000.

Figures 2.3 through 2.6 show the results for these 4 applications. For the three data mining applications, FREERIDE outperforms Hadoop significantly. The differences in performance ranged from a factor of 2.5 to about 20. We can also see that the scalability with an increasing number of nodes is almost linear with FREERIDE. With Hadoop, the speedup in going from 4 nodes to 16 is about 2.2 for k-means and word-count, and only about 1.1 for apriori. Overall, we can see that Hadoop has high overhead for data mining applications

Figure 2.8: Scalability of FREERIDE and Hadoop: wordcount (total time in seconds for dataset sizes from 800 MB to 12.8 GB)

with a significant amount of computations. Reasons for such overheads are analyzed later in the section.

The results are very different for word-count. On 4 nodes, Hadoop is faster by a factor of about 2. With an increasing number of nodes, FREERIDE is more competitive, though still slower by about 10% on 16 nodes. We see two possible reasons for this performance limitation. First, for such an algorithm, FREERIDE's shared memory techniques are not the most appropriate. Second, the specialized file system in Hadoop may be improving the performance for this application, where there is very little computation.

The experiments we have reported so far considered only one dataset size for each application. To further understand how the relative performance between the two systems is impacted by changing dataset sizes, we further experimented with k-means and word-count. We executed k-means and word-count on both systems with 8 nodes, while considering

Figure 2.9: K-means: Varying No. of k (average time per iteration in seconds for k = 50, 200, and 1000)

a number of different dataset sizes. The results are shown in Figure 2.7 and Figure 2.8. We can see that for both applications, the execution time with FREERIDE is proportional to the size of the dataset. However, for Hadoop, the processing time increases sub-linearly. For example, when we increase the dataset size 4 times, i.e., from 800 MB to 3.2 GB, the k-means execution time with Hadoop only increases by 50%. This shows that the initial overheads are quite high for Hadoop. For k-means, even with the 12.8 GB dataset, FREERIDE is about 9 times faster than Hadoop. Consistent with what we saw earlier, the results are quite different for word-count. With a 12.8 GB dataset, Hadoop is faster by a factor of about 2. With the two smallest datasets, FREERIDE still outperforms Hadoop. Again, this confirms that the initial overheads of Hadoop are quite significant.

Figure 2.10: K-means: Varying No. of Dimensions (average time per iteration in seconds for 3, 6, 48, 96, and 192 dimensions)

2.3.3 Detailed Analysis of k-means

To further understand the reasons for the slow-down in Hadoop, we did a detailed analysis with k-means. We can see that there are four primary components in the execution time of data-intensive applications. The first is the initialization cost. Hadoop can be slower because of the overheads associated with dynamic scheduling, data partitioning, and task spawning. The second is the I/O time, which should be mainly dependent on the size of the dataset. Hadoop can be slower because of the HDFS, and also due to the fault tolerance techniques this system uses. The third is the time associated with sorting, grouping, and shuffling between the map phase and the reduce phase. Because of the differences in API, this cost only exists in Hadoop. The fourth is the computation time to process the input data.

Hadoop can be relatively slower because it uses Java, whereas FREERIDE uses C++.

However, it is not easy to determine how significant each of these factors is. We did two sets of experiments on k-means to try and isolate the impact of these four factors. In the first experiment, we chose a dataset of 6.4 GB containing 3-dimensional coordinates. We

first varied the value of k to be 50, 200, and 1000. Since the dataset size is constant, the I/O

time does not change. The computing time in k-means is linearly proportional to k, since

computing the distances between every point in the dataset and the k cluster centers is the

dominant computation in every iteration. Figure 2.9 shows that the application execution

time for k-means in FREERIDE is linear in terms of the value of k. This shows that the

computing time is the dominant cost for k-means in FREERIDE. The total execution time

for Hadoop also increases with increasing k, but at a slower rate. Also, for all values of k,

Hadoop is much slower. This clearly shows that one or more of the three factors, i.e., the initialization cost, I/O time, and sorting, are important limiting factors in Hadoop's

performance.

The second experiment we conducted was as follows. We fixed the value of k and

varied the dimensionality of the data points, considering 3, 6, 48, 96, and 192 dimensions.

When the dataset size is fixed and the dimensionality is increased, the number of data

points to be processed decreases. This implies that fewer (key, value) pairs are generated.

This in turn implies that the sorting overheads should be lower. At the same time, the total

I/O volume does not change. Furthermore, the impact of varying dimensionality on the

computation is as follows. The distance computation time does not change, as the product

of the dimensions and number of points does not change. However, for each point that is

processed, k-means involves finding the closest center. This part should decrease as we

process fewer points. However, with increasing dimensionality, this component becomes

less significant.
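To make this reasoning explicit, let D denote the total number of stored coordinate values (fixed, since the dataset size is fixed); then the dominant distance-computation work is
\[
T_{\mathrm{dist}} \;\propto\; n_{\mathrm{points}} \times k \times ndim \;=\; \frac{D}{ndim} \times k \times ndim \;=\; k\,D,
\]
which is independent of ndim, while the closest-center selection work, proportional to n_points × k = kD/ndim, shrinks as the dimensionality grows.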

Figure 2.10 shows that in FREERIDE, the execution time first drops and then remains stable when we increase the dimensionality. This seems consistent with the earlier observation that the processing time for k-means is proportional to the amount of computation. In Hadoop, in contrast, we can observe that the time does not change much when we vary the dimensionality. Since the sorting time should be reduced when we increase the dimensionality, we can conclude that sorting is also not a dominant cost in Hadoop.

Coupled with earlier observations that with increasing dataset sizes, the relative dif-

ference between Hadoop and FREERIDE is lowered, we believe that the most significant

difference between the two systems is the high initialization costs in Hadoop, and the lim-

ited I/O bandwidth of HDFS.

2.4 Summary

In the last few years, the growing importance of data-intensive computing has been closely coupled with the emergence and popularity of the map-reduce paradigm. While there is a strong focus on the easy programmability aspect of the map-reduce model, the performance aspects of the model have not yet been well understood. This work focuses on comparing Hadoop, which is an implementation of the map-reduce paradigm, with a system that was developed earlier at Ohio State, FREERIDE (FRamework for Rapid Implementation of Datamining Engines). For our study, we have taken three data mining algorithms, which are k-means clustering, apriori association mining, and k-nearest neighbor search, and a simple data scanning application, word-count.

Some important observations from our results are as follows. FREERIDE outperforms Hadoop for all the three data mining applications we have considered, by a factor of 5 or more. Hadoop outperforms FREERIDE for the word-count application by a factor of up

FREERIDE (k-means)

void Kmeans::reduction(void *block) {
    for each point ∈ block {
        for (i = 0; i < k; i++) {
            dis = distance(point, i);
            if (dis < min) {
                min = dis;
                min_index = i;
            }
        }
        objectID = clusterID[min_index];
        for (j = 0; j < ndim; j++)
            reductionobject->Accumulate(objectID, j, point[j]);
        reductionobject->Accumulate(objectID, ndim, 1);
        reductionobject->Accumulate(objectID, ndim + 1, dis);
    }
}

int Kmeans::finalize() {
    for (i = 0; i < k; i++) {
        objectID = clusterID[i];
        count = (*reductionobject)(objectID, ndim);
        for (j = 0; j < ndim; j++)
            clusters[i][j] += (*reductionobject)(objectID, j) / (count + 1);
        totaldistance += (*reductionobject)(objectID, ndim + 1);
    }
}

Figure 2.11: Pseudo-code for k-means using FREERIDE API

to 2. Our results show that Hadoop can be tuned to achieve peak performance based on factors like the split size, the maximum number of concurrent maps per node, and the number of reduce tasks. Also, the relative performance of Hadoop becomes better with the increasing size of the dataset. Overall, Hadoop has significant overheads due to initialization, I/O (during storage of intermediate results and shuffling), and sorting of (key, value) pairs. Thus, despite

HADOOP (k-means)

public void map(LongWritable key, Text point) {
    minDistance = Double.MAX_DISTANCE;
    for (i = 0; i < k; i++) {
        if (distance(point, clusters[i]) < minDistance) {
            minDistance = distance(point, clusters[i]);
            currentCluster = i;
        }
    }
    EmitIntermediate(currentCluster, point);
}

public void reduce(IntWritable key, Iterator<PointWritable> points) {
    num = 0;
    while (points.hasNext()) {
        PointWritable currentPoint = points.next();
        num += currentPoint.getNum();
        for (i = 0; i < dim; i++)
            sum[i] += currentPoint.point[i];
    }
    for (i = 0; i < dim; i++)
        mean[i] = sum[i] / num;
    Emit(key, mean);
}

Figure 2.12: Pseudo-code for k-means using Hadoop API

an easy-to-program API, Hadoop's map-reduce does not appear very suitable for data mining computations on modest-sized datasets.

FREERIDE (wordcount)

void Wordcount::reduction(void *block) {
    for each word ∈ block {
        map<string, int>::iterator iter = wordmap.begin();
        iter = wordmap.find(word);
        if (iter != wordmap.end()) {
            wordID = iter->second;
            reductionobject->Accumulate(wordID, 1, 1);
        } else {
            newID = reductionobject->alloc(2);
            wordmap.insert(pair<string, int>(word, newID));
            reductionobject->Accumulate(newID, 0, newID);
            reductionobject->Accumulate(newID, 1, 1);
        }
    }
}

int Wordcount::finalize() {
    size = wordmap.size();
    for (i = 0; i < size; i++) {
        id = (*reductionobject)(i, 0);
        count = (*reductionobject)(i, 1);
        output(get_word(id), count);
    }
    return 0;
}

Figure 2.13: Pseudo-code for wordcount using FREERIDE API

HADOOP (wordcount)

public void map(LongWritable key, Text words) {
    for each word ∈ words
        EmitIntermediate(word, "1");
}

public void reduce(Text key, Iterator<IntWritable> values) {
    sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    Emit(key, new IntWritable(sum));
}

Figure 2.14: Pseudo-code for wordcount using Hadoop API

Chapter 3: MATE: A MapReduce System with an Alternate API for Multi-core Environments

Although the comparative study we conducted before demonstrated that FREERIDE outperformed Hadoop by a factor of 5-10 for a set of data mining applications, this could be attributed to various reasons, like the difference in processing using Java (as in Hadoop) and C++ (as in FREERIDE), the inefficient resource utilization in Hadoop, or the overheads of the Hadoop distributed file system. To allow further comparison of the two APIs while eliminating other factors, in this work we present a system, MATE (Map-reduce with an AlternaTE API), which provides a high-level but distinct API. In particular, our API includes a programmer-managed reduction object, which results in lower memory requirements at runtime for many data-intensive applications. MATE implements this API on top of the

Phoenix system, a multi-core map-reduce implementation from Stanford.

We evaluate our system using three data mining applications and compare its performance to that of both Phoenix and Hadoop. Our results show that for all the three applications, MATE outperforms Phoenix and Hadoop. In addition to achieving good scalability, MATE also maintains the easy-to-use API of map-reduce. Overall, we argue that our approach, which is based on the generalized reduction structure, provides an alternate high-level API, leading to more efficient and scalable implementations.

3.1 A Case Study

Since our API is based on the generalized reduction model, as an example, we now

first use the apriori association mining algorithm to show the similarities and differences between map-reduce and the generalized reduction APIs. Apriori is a well-known algorithm for association mining, which also forms the basis for many newer algorithms [16].

Association rule mining is a process of analyzing a set of transactions to extract association

rules and is a commonly-used and well-studied data mining problem. Given a set of trans-

actions, each of them being a set of items, the problem involves finding subsets of items

that appear frequently in these transactions. Formally, let L_i represent the set consisting of frequent itemsets of length i and C_i denote the set of candidate itemsets of length i. In iteration i, C_i is generated at the beginning of the stage and then used to compute L_i. After that, L_i is used to generate C_{i+1} for iteration (i + 1). This process iterates until the candidate itemsets become empty.
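The overall iteration structure can be summarized by a small driver loop; the sketch below is illustrative only, and the types and helper functions it declares (generate_candidates, count_support, filter_frequent, and so on) are assumptions rather than parts of the actual implementation.

/* Illustrative apriori driver: generate C_i, count support in one pass,
 * filter to obtain L_i, and derive C_{i+1} until no candidates remain. */
typedef struct transaction_db transaction_db;
typedef struct itemset_list itemset_list;

itemset_list *generate_candidates(itemset_list *L, int length);
void count_support(transaction_db *db, itemset_list *C);
itemset_list *filter_frequent(itemset_list *C, double support_level,
                              long num_transactions);
int is_empty(itemset_list *C);
long db_size(transaction_db *db);

void apriori_driver(transaction_db *db, double support_level)
{
    itemset_list *C = generate_candidates(0, 1);   /* C_1: single items */
    for (int i = 1; !is_empty(C); i++) {
        count_support(db, C);                      /* one pass over the data */
        itemset_list *L = filter_frequent(C, support_level, db_size(db));
        C = generate_candidates(L, i + 1);         /* L_i drives C_{i+1}     */
    }
}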

Figure 3.8 shows the pseudo-code of apriori using the generalized reduction API. Using this API, in iteration i, first, the programmer is responsible for creating and initializing the reduction object. The reduction object is allocated to store the object ids for the frequent-i itemset candidates and their associated counts of support. Then, the reduction operation takes a block of input transactions, and for each transaction, it will scan through the frequent-i itemset candidates, which are generated from the frequent-(i − 1) itemsets of the last iteration. During the scan, if some frequent-i itemset candidate is found in the transaction, the object id for this itemset candidate is retrieved and its corresponding count of support is incremented by one. After the reduction operation is applied on all the transactions, the update_frequent_candidates operation is invoked to remove the itemset candidates whose

count of support is below the support level that is defined by the programmer.

Figure 3.9 gives the pseudo-code of apriori using the map-reduce API. The map function is quite similar to the reduction given in Figure 3.8, and the only difference is that the emit_intermediate function is used instead of the accumulate operation for updating the reduction object. In iteration i, the map function will produce the itemset candidate as the key and the count one as the value, if this itemset can be found in the transaction. After the

Map phase is done, for each distinct itemset candidate, the reduce function will sum up all

the one’s associated with this itemset. If its total count of occurrences is not less than the

support level, this itemset and the total count will be emitted as the reduce output pair. The

reduce output will be used in the update_frequent_candidates function to compute L_i for the current iteration and use it to generate C_{i+1} for the next iteration, the same as in the generalized reduction version.

By comparing the implementations, we can make the following observations. In Gen-

eralized Reduction, the reduction object is explicitly declared to store the number of trans- actions that own each itemset candidate. During each iteration, the reduction object has to be created and initialized first as the candidate itemsets are dynamically growing. For each itemset candidate, the reduction object is updated in the reduction operation, if it exists in a transaction. When all transactions are processed, the update frequent candidates opera-

tion will compute the frequent itemsets and use it as the new candidate itemsets for next

iteration.

In map-reduce, the map function checks the availability of each itemset candidate in a

transaction and emits it with the count one if applicable. The reduce function will gather

the counts of occurrences with the same itemset candidate and compare the total count

with the support level. To summarize, for each distinct itemset candidate, the generalized

reduction sums up the number of transactions it belongs to and checks whether the total

number is not less than the support level. In comparison, map-reduce examines whether

the itemset candidate is in the transaction in the Map phase, without performing the sum.

It then completes the accumulation and compares the total count with the support level in the Reduce phase. Therefore, the reduction can be seen as a combination of map and

reduce connected by the reduction object. Besides, the intermediate pairs produced by the

Map tasks will be stored in the file system or main memory and accessed by the Reduce tasks. The number of such intermediate pairs can be very large. In apriori, processing each transaction can produce several such pairs. In comparison, the reduction object only has a single count for each distinct itemset. Therefore, it requires much less memory. In addition, map-reduce requires extra sorting, grouping, and shuffling of the intermediate pairs, since each reduce task accumulates pairs with the same key value. In comparison, the generalized reduction API does not have any of these costs.

3.2 System Design

In this section, we discuss the design and implementation of the MATE system. This system has been implemented on top of Phoenix, a shared memory implementation of map- reduce developed at Stanford [111].

3.2.1 The System API

The current implementation of our system is written in C. The system includes one set of system functions that are transparent to the programmers, in addition to the two sets of APIs, which are summarized in Table 3.2. The first API set are the functions that are simply invoked in the application code to initialize the system, perform the computation, and produce the final output. The second API set includes the functions that have to be defined by the programmers, specific to an application.

Table 3.1: The scheduler_args_t Data Structure

Basic Fields:
  Input_data - Data pointer to the input dataset
  Data_size - Size of the input dataset
  Data_type - Type of the data instance
  Stage_num - Computation-stage number
  Splitter - Pointer to the Splitter function
  Reduction - Pointer to the Reduction function
  Finalize - Pointer to the Finalize function

Optional Fields:
  Unit_size - # of bytes of one data instance
  L1_cache_size - # of bytes of the L1 data cache
  Model - Shared-memory parallelization model
  Num_threads - Max # of threads for Reduction workers
  Num_procs - Max # of processor cores available

The most important operation is the reduction function, which processes one split of the data and updates the reduction object. Besides, the programmers can also define an application-specific splitter function to split the data before each stage. Note that the splitter, reduction, and combination functions can be different for each iteration, if necessary.

In other words, multiple such functions may be defined. Also, to maintain the computa- tion state, the reduction object after each iteration’s processing is stored as intermediate results and can be reused by calling the get intermediate result function in future stages.

The descriptions of other functions are listed in Table 3.2. Note that C pointers can provide

flexibility in handling different data types. Thus, the function arguments are declared as void pointers at several places to utilize this feature.

Apart from the API functions, the data structure used to communicate between the application code and the runtime system is of type scheduler args t. Table 3.1 shows the basic and optional fields of this structure. The basic fields should be appropriately set in the

application code since they provide the data pointers to both the input/output data buffers

and the user-defined functions. The optional fields are used for scheduling and performance

tuning.
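As an illustration of how these pieces fit together, a typical application would fill in a scheduler_args_t and then drive the runtime calls from Table 3.2 roughly as follows; the field names, helper functions, and exact signatures here are simplified assumptions based on Tables 3.1 and 3.2, not the verbatim MATE headers.

/* Illustrative wiring of a MATE application (simplified). The helpers
 * load_points, kmeans_splitter, kmeans_reduction, and kmeans_finalize
 * are assumed application-provided functions. */
int main(int argc, char **argv)
{
    scheduler_args_t args;

    /* Basic fields: data pointers and the user-defined functions. */
    args.input_data = load_points("points.bin", &args.data_size);
    args.splitter   = kmeans_splitter;    /* splits the input per task    */
    args.reduction  = kmeans_reduction;   /* updates the reduction object */
    args.finalize   = kmeans_finalize;    /* summarizes the final result  */

    mate_init(&args);                     /* initialize the runtime            */
    reduction_object_pre_init();          /* allocate the reduction object     */
    mate_scheduler(&args);                /* run the Reduction and Combination */
    mate_finalize(&args);                 /* produce the final output          */
    return 0;
}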

3.2.2 Runtime System

Figure 3.1: One-Stage Execution Overview in MATE (input splits are processed by Reduction tasks that update per-thread reduction objects, which are then merged in the Combination step to produce the final output)

As stated above, our system is built on the Phoenix runtime, which in turn is based on Pthreads [111]. We use the same scheduling strategy, where the data is partitioned and the splits are assigned to threads dynamically.

Execution Overview: Figure 3.1 gives an overview of the execution of an application.

Some of the key aspects are as follows:

Scheduler Initialization: To initialize the scheduler, the programmer needs to specify all required data pointers and functions. Particularly, the programmer should pass the input data pointers and implement the splitter, reduction, and combination functions. Also, the programmer has to declare a reduction object.

Reduction: After initialization, the scheduler checks the availability of processor cores, and for each core, it creates one worker thread. Before the Reduction phase starts, the splitter

function is used to partition the data into equal-sized splits. Each worker thread will then

invoke the splitter to retrieve one split and process it with the reduction function.

To avoid load imbalance, we try to assign Reduction tasks to worker threads dynam-

ically instead of statically partitioning the data. Using this approach, each worker thread

will repeatedly retrieve one split of the data and then process it as specified in the reduction

function until no more splits are available. The split size, however, must be set appro-

priately to balance lower overheads (fewer, larger splits) against better load balancing (more, smaller splits). To make use of temporal locality, by default, the runtime adjusts the split size such that the input data for a Reduction task can fit in the L1 cache. The programmer can also vary this parameter to achieve better performance given the knowledge of a specific application.
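A minimal sketch of this dynamic assignment is shown below, assuming a thread-safe splitter that fills in the next split and returns 0 when the input is exhausted; the names follow the signatures in Table 3.2 only loosely and are not the actual MATE internals.

/* Illustrative worker loop for dynamic split assignment: each thread
 * keeps pulling splits until the splitter reports no more work. */
void *reduction_worker(void *arg)
{
    scheduler_args_t *s = (scheduler_args_t *)arg;
    reduction_args_t split;

    /* The splitter is assumed to be thread-safe and to return 0 once
     * all splits have been handed out. */
    while (s->splitter(s->input_data, s->unit_size, &split))
        s->reduction(&split);   /* update this thread's reduction object */
    return 0;
}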

The reduction object is updated correspondingly in the reduction function as specified by the user code. Each thread will repeat this step until all data instances have been pro- cessed. At this point, the Reduction stage is over. The scheduler must wait for all reduction workers to finish before initializing the Combination stage.

Combination: The processing structure of Generalized Reduction enables us to develop general but efficient shared-memory parallelization schemes. Consider again the loop in

Figure 1.1(b). The main correctness challenge in parallelizing a loop like this on a shared- memory machine arises because of possible race conditions (or access conflicts) when mul- tiple processors update the same element of the reduction object. Two obvious approaches for avoiding race conditions are: full replication and full locking. In the full replication approach, each processor can update its own reduction object and these copies are merged

together later. In the full locking approach, one lock or latch is associated with each aggre-

gated value.

Currently, we use the full replication approach to implement the parallelization. So a

Combination phase is required to merge the reduction-objects of multiple threads when all

data is processed. The full locking and its optimized schemes are being considered and will

be supported in the future.

In the Combination phase, the scheduler will spawn a combination-worker thread to

merge the reduction object copies of all Reduction threads. The result is a single copy of the

final reduction object after every iteration. If the application involves multiple iterations,

this object is stored by the scheduler and can be accessed by future stages.
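For a numeric reduction object, this merge is essentially an element-wise accumulation of the per-thread copies into a single buffer; the sketch below is illustrative and assumes flat double arrays with addition as the merge operator, which need not match how MATE actually stores its reduction objects.

/* Illustrative combination step: fold the per-thread reduction object
 * copies into copies[0], element by element. */
void combine_reduction_objects(double **copies, int num_threads, int len)
{
    for (int t = 1; t < num_threads; t++)
        for (int i = 0; i < len; i++)
            copies[0][i] += copies[t][i];   /* accumulate into thread 0's copy */
}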

Finalize: This phase occurs after all other stages of the computation have been performed,

i.e., each Combination stage has its own copy of the reduction object, which is combined

from multiple threads. The programmer is then able to perform a manipulation on the

reduction-objects to summarize the results specific to an application. The scheduler also

needs to free the allocated space in this phase, but this is transparent to the programmer.

Buffer Management: The runtime system handles two types of temporary buffers. One is

the reduction-object buffer allocated for each thread to do its own computation over differ-

ent splits. The other is the combination buffer created for storing the intermediate output results of each stage. The combination buffer is of type reduction object in our implementation. The buffers are initially sized to a default value and then grown dynamically if needed.

Fault Tolerance: Fault detection and recovery have been an important aspect of map- reduce implementations, as data-intensive applications can be very long running, and/or may need to be executed on a large number of nodes or cores. Our current support for

fault-tolerance is based on the Phoenix system. It supports fault-recovery for the Reduction tasks. If a Reduction worker does not finish within a reasonable time limit, the scheduler assumes that a failure of this task has occurred. The failed task is then re-executed by the runtime. In our implementation, separate buffers are allocated for the new task to avoid data access conflicts.

The alternate API implemented in the MATE system also allows a different approach for fault-tolerance. This will be a low-cost checkpointing approach, where a copy of the reduction object can be cached at another location after the processing of each split. If a thread fails while processing a particular split, we can restart the computation using the cached reduction object. We only need to process the current split assigned to this thread, and not the splits that had already been processed. We plan to support this approach in future versions of the MATE system.
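The proposed scheme could be sketched as follows; this is purely illustrative of the idea described above (it is not an existing MATE feature), and the buffer and helper names are assumptions.

#include <string.h>

/* Illustrative low-cost checkpoint: after a thread finishes a split,
 * snapshot its reduction object so that, on failure, only the
 * in-flight split needs to be reprocessed. */
void checkpoint_after_split(const double *robj, double *checkpoint_buf,
                            int len, int split_id, int *last_done)
{
    memcpy(checkpoint_buf, robj, (size_t)len * sizeof(double));
    *last_done = split_id;   /* restart point: the first unprocessed split */
}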

3.3 Experimental Results

In this section, we evaluate the MATE system on multi-core machines by comparing its performance against Phoenix and Hadoop.

Since our main focus is on data mining applications, we choose three popular data mining algorithms. They are k-means clustering, apriori association mining, and principal components analysis (PCA). Among these, we described apriori association mining earlier in

Section 3.1. K-means clustering is one of the commonly used data mining algorithms [76].

The clustering problem is as follows. We consider data instances as representing points in a high-dimensional space. Proximity within this space is used as the criterion for classifying the points into clusters. Four steps in the sequential version of k-means clustering algorithm are as follows: 1) start with k given centers for clusters; 2) scan the data instances, for each

Figure 3.2: K-means: Comparison between MATE, Phoenix and Hadoop on 8 cores (average time per iteration in seconds for 1, 2, 4, and 8 threads)

data instance (point), find the center closest to it and assign this point to the corresponding

cluster, 3) determine the k centroids from the points assigned to the corresponding centers,

and 4) repeat this process until the assignment of points to clusters does not change. PCA

is a popular dimensionality reduction method that was developed by Pearson in 1901. Its

goal is to compute the mean vector and the covariance matrix for a set of data points that

are represented by a matrix.

The datasets and the application parameters we used are as follows. For k-means, the

dataset size is 1.2 GB, and the points are 3-dimensional. The number of clusters, k, was set to be 100. With apriori, we used a dataset that has 1,000,000 transactions, with each one having at most 100 items, and the average number of items in each transaction is 10. The support and confidence levels were 3% and 9%, respectively, and the application ran for 2

Figure 3.3: K-means: Comparison between MATE, Phoenix and Hadoop on 16 cores (average time per iteration in seconds for 1, 2, 4, 8, and 16 threads)

passes. In PCA, the number of rows and columns used in our experiments were 8,000 and

1,024, respectively.

Our experiments were conducted on two distinct multi-core machines. One system uses

Intel Xeon CPU E5345, comprising two quad-core CPUs (8 cores in all). Each core has a clock frequency of 2.33GHz and the system has a 6 GB main memory. The other system uses AMD Opteron Processor 8350 with 4 quad-core CPUs (16 cores in all). Each core has a clock frequency of 1.00GHz and the system has a 16 GB main memory.

We executed the aforementioned three applications with both MATE and Phoenix sys- tems on the two multi-core machines. For two of the applications, k-means and apriori, we also do a comparison with Hadoop. We could not compare the performance of PCA on

Hadoop, as we did not have an implementation of PCA on Hadoop. Hadoop implementa- tions of k-means and apriori were carefully tuned by optimizing various parameters [79].

Figure 3.4: PCA: Comparison between MATE and Phoenix on 8 cores (total time in seconds for 1, 2, 4, and 8 threads)

Figures 3.2 through 3.7 show the comparison results for these three applications as we scale the number of cores used. Since our system uses the same scheduler as Phoenix, the main performance difference is likely due to the different processing structure, as we had discussed in earlier sections. Note that for Hadoop, we only report results from the maximum number of cores available on each machine (8 and 16). This is because the number of tasks in Hadoop cannot be completely controlled, as there was no mechanism available to make it use only a subset of the available cores.

Figures 3.2 and 3.3 show the comparison for k-means. Our system outperforms both

Phoenix and Hadoop on both the machines. MATE is almost twice as fast as Phoenix with 8 threads on the 8-core machine, and almost thrice as fast with 16 threads on the 16- core machine. Hadoop is much slower than both MATE and Phoenix, which we believe is due to the high overheads in data initialization, file system implementation, and use

Figure 3.5: PCA: Comparison between MATE and Phoenix on 16 cores (average time per iteration in seconds for 1, 2, 4, 8, and 16 threads)

of Java for computations. As compared to MATE, it also has the overhead of grouping,

sorting, and shuffling of intermediate pairs [79]. We also see a very good scalability with

increasing number of threads or cores with MATE. On the 8-core machine, the speedup

with MATE in going from 1 to 8 threads is about 7.8, whereas, Phoenix can only achieve

a speedup of 5.4. The result is similar on the 16-core machine: our system can get a

speedup of about 15.0 between 1 and 16 threads, whereas, it is 5.1 for Phoenix. With more

than 100 million points in the k-means dataset, Phoenix was slower because of the high

memory requirements associated with creating, storing, and accessing a large number of

intermediate pairs. MATE, in comparison, directly accumulates values into the reduction object, which is of much smaller size.

The results from PCA are shown in Figures 3.4 and 3.5. PCA does not scale as well with increasing number of cores. This is because some segments of the program are either

Figure 3.6: Apriori: Comparison between MATE, Phoenix and Hadoop on 8 cores (average time per iteration in seconds for 1, 2, 4, and 8 threads)

not parallelized, or do not have regular parallelism. However, we can still see that, with

8 threads on the 8-core machine and 16 threads on the 16-core machine, our system was twice as fast as Phoenix. Besides the overheads associated with the intermediate pairs, the execution-time breakdown analysis for Phoenix shows that the Reduce and Merge phases account for a non-trivial fraction of the total time. This overhead is non-existent for the

MATE system, because of the use of the reduction object.

Finally, the results for apriori are shown in Figures 3.6 and 3.7. For 1 and 2 threads, MATE is around 1.5 times as fast as Phoenix on both machines. With 4 or more threads, the performance of Phoenix was relatively close to that of MATE. One of the reasons is that, compared to k-means and PCA, the Reduce and Merge phases take only a small fraction of the time with the apriori dataset. Thus, the dominant amount of time is spent on the Map phase, leading to good scalability for Phoenix. For the MATE system, the reduction object

[Plot omitted: Avg. Time Per Iteration (sec) vs. # of Threads for Phoenix, MATE, and Hadoop.]

Figure 3.7: Apriori: Comparison between MATE, Phoenix and Hadoop on 16 cores

can be of large size and needs to be grown and reallocated for all threads as the new candidate itemsets are generated during the processing. This, we believe, introduced significant overheads for our system. Hadoop was much slower than our system, by a factor of more than 25 on both machines.

Overall, our experiments show that the MATE system achieves reasonable to significant speedups for all three data mining applications. In most cases, it outperforms Phoenix and Hadoop significantly, and it is only slightly better in one case.

3.4 Summary

This work has described MATE, a system that provides an alternate API to the original map-reduce model for developing data-intensive applications. Our API takes advantage of a user-declared reduction object, which substantially reduces the overheads arising

GENERALIZED REDUCTION (Apriori)

void reduction(void *reduction_data) {
    for each transaction ∈ reduction_data {
        for (i = 0; i < candidates_size; i++) {
            match = false;
            itemset = candidates[i];
            match = itemset_exists(transaction, itemset);
            if (match == true) {
                object_id = itemset.object_id;
                accumulate(object_id, 0, 1);
            }
        }
    }
}

void update_frequent_candidates(int stage_num) {
    j = 0;
    for (i = 0; i < candidates_size; i++) {
        object_id = candidates[i].object_id;
        count = get_intermediate_result(stage_num, object_id, 0);
        if (count >= (support_level * num_transactions) / 100.0) {
            temp_candidates[j++] = candidates[i];
        }
    }
    candidates = temp_candidates;
}

Figure 3.8: Pseudo-code for apriori using Generalized Reduction API

because of the original map-reduce API. These overheads are caused by high memory re- quirements of intermediate results and the need for data communication between Map and

Reduce stages. MATE supports this alternate API on top of Phoenix, a multi-core map- reduce implementation from Stanford.

We have evaluated our system with three popular data mining algorithms, which are k-means clustering, apriori association mining, and principal component analysis. We also compared our system with Phoenix and Hadoop on two distinct multi-core platforms. Our

results show that MATE achieves good scalability and outperforms Phoenix and Hadoop for all three applications. Overall, our work suggests that, despite its easy-to-program API, map-reduce may not fit naturally with some data-intensive applications and results in non-trivial performance losses. An alternate API, which is based on the generalized reduction, offers better performance and scalability while still maintaining easy programmability.

MAPREDUCE (Apriori)

void map(void *map_data) {
    for each transaction ∈ map_data {
        for (i = 0; i < candidates_size; i++) {
            match = false;
            itemset = candidates[i];
            match = itemset_exists(transaction, itemset);
            if (match == true) {
                emit_intermediate(itemset, one);
            }
        }
    }
}

void reduce(void *key, void **vals, int vals_length) {
    count = 0;
    for (i = 0; i < vals_length; i++) {
        count += *vals[i];
    }
    if (count >= (support_level * num_transactions) / 100.0) {
        emit(key, count);
    }
}

void update_frequent_candidates(void *reduce_data_out) {
    j = 0;
    length = reduce_data_out->length;
    for (i = 0; i < length; i++) {
        temp_candidates[j++] = reduce_data_out->key;
    }
    candidates = temp_candidates;
}

Figure 3.9: Pseudo-code for apriori using MapReduce API

Table 3.2: Descriptions of the Functions in the System API.

Functions Provided by the Runtime:

int mate_init(scheduler_args_t*): Initializes the runtime system. The arguments provide the needed functions and data pointers.
int mate_scheduler(void*): Schedules the Reduction tasks on the input dataset.
int mate_finalize(void*): Executes the Finalizing task to generate a final output if needed.
void reduction_object_pre_init(): Allocates a default size of space for the reduction object.
int reduction_object_alloc(): Assigns a unique object_id representing a group of consecutive elements in the reduction object.
void reduction_object_post_init(): Clones the reduction object for all threads in the full replication technique.
void accumulate(int, int, void *value): Updates a particular element of the reduction object by addition. The arguments are the object_id, the offset, and the associated value.
void reuse_reduction_object(): Reuses the reduction object for future stages. Memsets all fields of the reduction object to zero.
void process_next_iteration(): Notifies the runtime that the computation is proceeding to the next stage.
void *get_reduction_object(int): Retrieves the reduction object copy belonging to one thread. The argument is the thread number.
void *get_intermediate_result(int, int, int): Retrieves a particular element of the reduction object in the combination buffer. The arguments are the stage number, the object_id, and the offset.

Functions Defined by the User:

int (*splitter_t)(void*, int, reduction_args_t*): Splits the input dataset across Reduction tasks.
void (*reduction_t)(reduction_args_t*): The Reduction function. Each Reduction task executes this function on its input split. This function must be implemented by the user.
void (*combination_t)(void*): The Combination function. In the full replication model, this function is used to combine results from all threads.
void (*finalize_t)(void*): The Finalize function. The runtime will execute this function when all computation is done.
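To make the role of the reduction object concrete, the following is a minimal, self-contained C sketch of the idea behind this API, not the MATE runtime itself: each thread updates its own clone of a small reduction array through an accumulate-style helper, and a combination step merges the clones, so no intermediate (key, value) pairs are ever materialized. All names and sizes here (reduction_worker, OBJ_SIZE, and so on) are illustrative.

/* toy_reduction_object.c: a standalone illustration of the reduction-object
 * idea; it is not the MATE runtime. Each thread owns a clone of the reduction
 * array (full replication) and updates it by addition, and a combination step
 * merges the clones at the end. Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS  4
#define OBJ_SIZE     8        /* elements in the reduction object */
#define N_PER_THREAD 100000   /* synthetic input elements per thread */

static double red_obj[NUM_THREADS][OBJ_SIZE];  /* one clone per thread */

/* plays the role of accumulate(object_id, offset, value): update by addition */
static void accumulate_local(int tid, int offset, double value) {
    red_obj[tid][offset] += value;
}

static void *reduction_worker(void *arg) {
    int tid = (int)(long)arg;
    /* stand-in for the user-defined reduction(): hash each synthetic input
     * element to an offset and add its contribution */
    for (long i = 0; i < N_PER_THREAD; i++)
        accumulate_local(tid, (int)(i % OBJ_SIZE), 1.0);
    return NULL;
}

int main(void) {
    pthread_t tids[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&tids[t], NULL, reduction_worker, (void *)t);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tids[t], NULL);

    /* combination phase: merge the per-thread clones into clone 0 */
    for (int t = 1; t < NUM_THREADS; t++)
        for (int k = 0; k < OBJ_SIZE; k++)
            red_obj[0][k] += red_obj[t][k];

    for (int k = 0; k < OBJ_SIZE; k++)
        printf("element %d: %.0f\n", k, red_obj[0][k]);
    return 0;
}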

67 Chapter 4: Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining

The API comparison between MATE and Phoenix has shown that an alternate API, in which a reduction object is explicitly maintained and updated, reduces memory requirements and can significantly improve performance for many applications. However, unlike the original API, support for the alternate API has been restricted to the cases where the reduction object can fit in memory. This limits the applicability of the MATE approach. In particular, one emerging class of applications that requires support for large reduction objects is the graph mining applications.

To address this limitation, in this work we describe a system, Extended MATE or Ex-MATE, which supports the alternate API with reduction objects of arbitrary sizes.

We develop support for managing disk-resident reduction objects and updating them ef-

ficiently. We evaluate our system using three graph mining applications and compare its performance to that of PEGASUS, a graph mining system implemented based on the orig- inal map-reduce API and its Hadoop implementation. Our results on a cluster with 128 cores show that for all three applications, our system outperforms PEGASUS, by factors ranging between 9 and 35.

68 4.1 Ex-MATE Design

As we discussed in the previous section, MATE captures the computational state of the algorithms in the reduction object. For many applications, the reduction object has a much smaller size than the input dataset. For example, in the k-means clustering algorithm, the reduction object size depends upon the number of clusters being computed and is independent of the input dataset size. Thus, even if the input is in terabytes, the reduction object may only be in kilobytes. For such applications, by maintaining the small-sized reduction object in memory, the performance is greatly improved. However, there are also several other data-intensive applications where this is not true. Many graph mining algorithms involve Sparse Matrix-Vector Multiplication (SpMV) or its variants as the main computational step. Implementing SpMV in MATE treats the output vector as the reduction object, which can actually be very large in size. The situation is similar for some data pre-processing algorithms, where the output is not much smaller than the input.

This section describes the implementation of the Extended MATE or Ex-MATE system, which provides support for management of a large reduction-object in distributed environ- ments.

Overview of Extended MATE: Figure 4.1 gives an overview of the execution flow of a typical application using Ex-MATE in a distributed environment. The middleware comprises two components: a runtime system and a user program. The runtime is responsible for partitioning the input datasets, assigning splits to computing nodes, and scheduling the computation for all stages. The user program needs to specify important functions like reduction, combination, and/or finalize. The runtime will invoke the corresponding functions in different stages and these functions act as the interaction between the runtime and the

[Diagram omitted: the runtime (partitioning and scheduling of input splits, per-node reduction objects on local disks, and a combination step) interacting with the user code (Reduction, Combination, Finalize).]

Figure 4.1: Execution Overview of the Extended MATE

user code. The most important stage is the Reduction, where all the data instances are pro- cessed and the reduction objects are updated. After the Reduction, the global combination function will merge results from multiple nodes and form a final copy of reduction object.

The user can choose to operate on the final reduction object in the finalize function.

While Ex-MATE allows large-sized reduction objects, the fact that they can be disk- resident is transparent to application developers. Thus, while the large-size reduction ob- jects are allocated on local disks, the user can write applications manipulating them as if they are in memory. Updates to the disk-resident locations are handled by the runtime system.

70 Another interesting aspect of Ex-MATE is the support for processing of very large

datasets. First, given the input datasets, the runtime system will invoke a partitioning func-

tion to partition the inputs into a number of smaller splits. A default partitioning function

is provided by the runtime, but the user can also specify a customized partitioning func- tion. Then, as one of the scheduling tasks, the input splits are distributed and assigned to different nodes. Data locality is considered so that each node can read most of the input splits assigned to it from its own local disk. Customized partitioning functions enable better locality while updating a disk-resident reduction object.

Management of A Large Reduction-Object: To support applications that involve a large reduction object, we need to keep the entire reduction object on disks. Moreover, so that each node can update its own copy of the reduction object, a copy is maintained in each node’s local disk. The global combination phase later merges the disk-resident copies from all nodes and forms a final copy that is eventually distributed to all nodes.

Even with the reduction object on a local disk, the performance will be poor if there are frequent disk I/O operations. Thus, one of the obvious optimizations is to divide the entire reduction object into smaller parts or splits and only load one or more needed splits into memory based buffers. This can be done in a demand-driven fashion. Thus, if we

find that the reduction function applied on an input element results in updates to element(s) from a particular split of the reduction object, this split is loaded into the memory. Such write operations are applied in the memory based buffers first and then written back to disk, before loading other splits of the reduction object into the same buffers. Another split of the reduction object is loaded when any element(s) from that split need to be updated. At this time, the previous split(s) could be written back to the disk.
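The following is a minimal sketch of this demand-driven strategy, written as a self-contained C program rather than as Ex-MATE code: the reduction object lives in a file, a single split is kept in a memory buffer, and an update that falls outside the resident split writes the buffer back and loads the split that is actually needed. The file name, split size, and single-buffer policy are all illustrative simplifications.

/* split_cache.c: a toy sketch of demand-driven loading of reduction-object
 * splits; not the Ex-MATE implementation. */
#include <stdio.h>

#define SPLIT_ELEMS 1024          /* elements per reduction-object split */
#define NUM_SPLITS  16

static FILE  *obj_file;                    /* disk-resident reduction object */
static double buffer[SPLIT_ELEMS];         /* the one resident split */
static int    resident = -1;               /* index of the resident split */

static void load_split(int s) {
    if (s == resident) return;
    if (resident >= 0) {                   /* write the old split back first */
        fseek(obj_file, (long)resident * SPLIT_ELEMS * sizeof(double), SEEK_SET);
        fwrite(buffer, sizeof(double), SPLIT_ELEMS, obj_file);
    }
    fseek(obj_file, (long)s * SPLIT_ELEMS * sizeof(double), SEEK_SET);
    fread(buffer, sizeof(double), SPLIT_ELEMS, obj_file);
    resident = s;
}

/* update one element of the (logically huge) reduction object by addition */
static void accumulate(long global_offset, double value) {
    load_split((int)(global_offset / SPLIT_ELEMS));
    buffer[global_offset % SPLIT_ELEMS] += value;
}

int main(void) {
    /* create a zero-filled, disk-resident reduction object */
    obj_file = fopen("reduction_object.bin", "w+b");
    double zero[SPLIT_ELEMS] = {0};
    for (int s = 0; s < NUM_SPLITS; s++)
        fwrite(zero, sizeof(double), SPLIT_ELEMS, obj_file);

    /* updates within one split are cheap; crossing splits triggers disk I/O */
    for (long i = 0; i < SPLIT_ELEMS; i++) accumulate(i, 1.0);                    /* split 0 */
    for (long i = 0; i < SPLIT_ELEMS; i++) accumulate(5 * SPLIT_ELEMS + i, 2.0);  /* split 5 */

    load_split(0);                          /* flushes split 5, reloads split 0 */
    printf("element 0 = %.1f\n", buffer[0]);
    fclose(obj_file);
    return 0;
}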

[Diagram omitted: a 4 x 4 grid of matrix blocks B(i,j), with input vector splits I_V(j) along the top and output vector splits O_V(i) along the side.]

Figure 4.2: Matrix-Vector Multiplication using checkerboard partitioning. B(i,j) represents a matrix block, I_V(j) represents an input vector split, and O_V(i) represents an output vector split. The matrix/vector multiplies are done block-wise, not element-wise.

The key issue with this approach is finding input elements to process that involve up-

dating a certain portion of the reduction object and processing them together. In the worst

case, each pair of consecutive input elements may result in updates to different splits and

the overhead of reading and writing the elements can be extremely high. Another option

could be to load a split of the reduction object and then perform all processing associated

with it. However, this may require that each portion of the input file is read and processed

multiple times.

A much better solution can be developed for many algorithms, where the input can be partitioned in a way that processing each partition of the input results in updates to only one portion of the reduction object. The user-provided input partitioning method is used for this purpose. We will now use matrix-vector multiplication as an example to show how we

72 can make use of the data access patterns to improve performance. In the extended MATE implementation, the matrix as well as the initial vector would be used as the input, and the user would declare a reduction object containing the partial sums for the output vector.

Both the inputs and the reduction object are stored on disk due to their large size. Let m_{i,j} denote the (i,j)-th element of matrix M and v_i denote the i-th element of vector V. In each iteration, for m_{i,j}, the computation involves the multiplication of m_{i,j} and v_j of the input vector, and the result will be accumulated for v_i in the output vector by updating the reduction object. Generally, every multiplication needs to update an offset of the reduction object. After all computation is done, the reduction object will contain the final sums and can be used as the input vector for the next iteration. Since the reduction object resides on disk, it is easy to see the high costs if the accumulations are done one by one on the disk.

Considering the data access pattern of this multiplication, however, we can process the computation in a block fashion to reduce disk I/O.

As Figure 4.2 illustrates, using what is called the checkerboard partitioning, the input vector and the reduction object (output vector) would be partitioned into n splits and the matrix would be partitioned into n^2 blocks. In this block partitioning scheme, one or more blocks of the matrix can be loaded into memory at the same time, as well as the corresponding input/output vector splits. For example, when we process the matrix block B(0,0), we only need the input vector split I_V(0) and the output vector split O_V(0). So, we check whether O_V(0) has been loaded into memory first and read it if it is not the case. Then, the subsequent updates can be performed in memory.

As we stated above, this portion of the output continues to be allocated in the memory till it needs to be replaced by the next reduction object split. Thus, our goal is to reuse the reduction object split that already resides in memory, for which we can re-order the

73 computation sequence of matrix blocks to maximize the number of hits for reduction object

splits. In this way, the disk I/O is significantly reduced, by updating the reduction object

block by block, instead of updating it element by element.

One issue with this approach is setting the block/split size appropriately. With a smaller block/split size, we can make sure that at least one matrix block and two vector splits can

fit into memory. With a larger one, the spatial locality could be better utilized by reading more data at one time from the disk. The block/split size can either be empirically tuned or could be pre-defined according to the memory constraints and the input dataset size.

Overall, we can see that this block-partition scheme can take advantage of the data access patterns of the matrix-vector multiplications and reduces the number of times needed to access the disk significantly.
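As a small self-contained illustration of this ordering (not the Ex-MATE implementation), the sketch below iterates the matrix blocks row-block by row-block, so each output-vector split is brought into memory exactly once; the toy matrix, the split count, and the load counter are only there to make the access pattern visible.

/* blocked_spmv_order.c: iterate checkerboard blocks grouped by output split so
 * every O_V(i) is "loaded" once; a column-major block order could load each
 * output split up to N_SPLITS times. */
#include <stdio.h>

#define N        8            /* toy matrix dimension */
#define N_SPLITS 4            /* vector splits, so blocks are (N/N_SPLITS)^2 */
#define BS       (N / N_SPLITS)

int main(void) {
    double M[N][N], v[N], out[N] = {0};
    int loads = 0;

    for (int i = 0; i < N; i++) {          /* toy data: M is all ones, v = 1..N */
        v[i] = i + 1;
        for (int j = 0; j < N; j++) M[i][j] = 1.0;
    }

    /* outer loop over row-blocks == over output-vector splits */
    for (int bi = 0; bi < N_SPLITS; bi++) {
        loads++;                           /* O_V(bi) is brought into memory once */
        for (int bj = 0; bj < N_SPLITS; bj++) {
            /* process matrix block B(bi, bj) against input split I_V(bj) */
            for (int i = bi * BS; i < (bi + 1) * BS; i++)
                for (int j = bj * BS; j < (bj + 1) * BS; j++)
                    out[i] += M[i][j] * v[j];
        }
        /* O_V(bi) would be written back to disk here */
    }

    printf("output-split loads: %d (vs. up to %d with a column-major order)\n",
           loads, N_SPLITS * N_SPLITS);
    printf("out[0] = %.1f\n", out[0]);     /* 36.0 for the toy data */
    return 0;
}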

4.2 Parallel Graph Mining Using Ex-MATE

We now describe how the large reduction object support in Ex-MATE can be used for graph mining. Our work is based on the recent work from CMU [86], which views the main computations in several graph mining algorithms as Generalized Iterative Matrix-Vector

Multiplication or GIM-V. In this section, we will briefly describe the GIM-V concept, show how a popular graph mining algorithm fits into the structure of GIM-V, and then focus on how to parallelize GIM-V using both map-reduce and the Ex-MATE API.

First, suppose we have an n × n matrix M and a vector V of size n. Let m_{i,j} denote the (i,j)-th element of M and v_i denote the i-th element of V. Then, the normal matrix-vector multiplication is V' = M × V, where

    v'_i = \sum_{j=1}^{n} m_{i,j} v_j

It is easy to see that there are three operations involved in this computation:

• combine: multiply m_{i,j} and v_j

• combineAll_i: sum the n partial multiplication results for element i

• assign: the previous value of v_i is overwritten.

GIM-V keeps the overall structure, but the expression is denoted as V' = M ×_G V, where ×_G is defined by the user. Particularly, we have

    v'_i = assign(v_i, combineAll_i({x_j | j = 1 ... n}))
    x_j  = combine(m_{i,j}, v_j)

The functions combine, combineAll, and assign have the following generalizations:

• combine(m_{i,j}, v_j): combine m_{i,j} and v_j (i.e., it does not have to be a multiplication.)

• combineAll_i(x_1, ..., x_n): combine the n partial results for element i (i.e., it does not have to be the sum.)

• assign(v_i, v_new): the previous value of v_i is updated by a new value.
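The following self-contained C sketch captures this generalization for a small dense matrix: combine, combineAll, and assign are passed in as function pointers, and instantiating them with multiply, sum, and overwrite recovers the ordinary matrix-vector multiplication above. This is only an illustration of the GIM-V abstraction, not the PEGASUS or Ex-MATE code.

/* gimv_sketch.c: one GIM-V step over a dense matrix, with the three operations
 * supplied by the caller. */
#include <stdio.h>
#include <stdlib.h>

typedef double (*combine_fn)(double m_ij, double v_j);
typedef double (*combine_all_fn)(const double *x, int n);
typedef double (*assign_fn)(double v_old, double v_new);

/* v_out = M x_G v */
static void gimv(int n, const double *M, const double *v, double *v_out,
                 combine_fn combine, combine_all_fn combine_all, assign_fn assign) {
    double *x = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            x[j] = combine(M[i * n + j], v[j]);
        v_out[i] = assign(v[i], combine_all(x, n));
    }
    free(x);
}

/* the plain matrix-vector instantiation */
static double mul(double m, double v) { return m * v; }
static double sum(const double *x, int n) {
    double s = 0;
    for (int j = 0; j < n; j++) s += x[j];
    return s;
}
static double overwrite(double v_old, double v_new) { (void)v_old; return v_new; }

int main(void) {
    double M[4] = {1, 2, 3, 4};            /* 2 x 2 matrix, row-major */
    double v[2] = {1, 1}, out[2];
    gimv(2, M, v, out, mul, sum, overwrite);
    printf("out = (%.0f, %.0f)\n", out[0], out[1]);   /* (3, 7) */
    return 0;
}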

Using GIM-V, and particularly by providing different definitions of the two combine functions, several key graph mining algorithms can be developed, including PageRank,

Random Walk with Restart, Connected Components, and Diameter Estimation. Now, we will use PageRank as an example to show how it is an instantiation of GIM-V. Later in this section, we also briefly discuss algorithms for Diameter Estimation and finding Connected

Components. More details of these algorithms can be seen from the publications of the

PEGASUS project [85, 86].

75 4.2.1 GIM-V and PageRank

PageRank [29] is a well-known algorithm used by Google to calculate the relative importance of web pages. The PageRank vector p of n web pages satisfies the following eigenvector equation:

    p = (c E^T + (1 - c) U) p

where c is a damping factor, E is a row-normalized adjacency matrix, and U is a matrix

with all elements initialized to 1/n. To calculate p, the power method is used to multiply

an initial vector with the matrix a number of times. Initially, all elements in the current

PageRank vector p^{cur} are set to be 1/n. Then the next PageRank vector p^{next} is calculated as

    p^{next} = (c E^T + (1 - c) U) p^{cur}

This multiplication is repeated until p converges.

In the representation of GIM-V, a matrix M is constructed from the column-normalized E^T such that the elements in every column of M sum up to 1. Then, the next PageRank vector is calculated as p^{next} = M ×_G p^{cur}, where the three operations are defined as follows:

    combine(m_{i,j}, v_j)       = c × m_{i,j} × v_j
    combineAll_i(x_1, ..., x_n) = (1 - c)/n + \sum_{j=1}^{n} x_j
    assign(v_i, v_new)          = v_new
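Plugged into the gimv() sketch shown earlier, the PageRank instantiation amounts to the three small functions below; C_DAMP stands in for the damping factor c, M is assumed to be the column-normalized E^T, and the power-iteration driver is sketched only in the trailing comment. Again, this is an illustration rather than the actual system code.

/* pagerank_ops.c (fragment): the three PageRank operations for the gimv()
 * sketch above. */
#define C_DAMP 0.85

static double pr_combine(double m_ij, double v_j) {
    return C_DAMP * m_ij * v_j;
}
static double pr_combine_all(const double *x, int n) {
    double s = 0;
    for (int j = 0; j < n; j++) s += x[j];
    return (1.0 - C_DAMP) / n + s;         /* (1 - c)/n + sum of partial results */
}
static double pr_assign(double v_old, double v_new) {
    (void)v_old;                           /* PageRank simply overwrites v_i */
    return v_new;
}

/* power iteration: start from p = 1/n and repeat
 *     gimv(n, M, p, p_next, pr_combine, pr_combine_all, pr_assign);
 * swapping p and p_next after each step, until p converges. */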

4.2.2 Parallelization of GIM-V

GIM-V and its different instantiations have been implemented using map-reduce in PE-

GASUS [86], which is a graph mining package based on Hadoop. In this section, we

76 describe how GIM-V can be parallelized using the Ex-MATE API and compare the imple-

mentation with that of map-reduce.

Figure 4.9 and Figure 4.10 show the map-reduce code for the two stages of GIM-V. The inputs are an edge file and a vector file. Each line of the edge file contains one (id_src, id_dst, m_val) tuple, which corresponds to a non-zero cell in the adjacency matrix M. Similarly, each line of the vector file contains one (id, v_val) pair, which corresponds to an element in the vector V. Stage 1 performs the combine operation by combining columns of the matrix (id_dst of M) with the corresponding values of the vector (id of V). The output of the first stage is a set of (key, value) pairs, where the key is the source node id of the matrix and the value is the partially combined result combine(m_val, v_val). The output of the first stage becomes the input of the second stage, which now combines all partial results from the

first stage and creates the new vector.

In contrast, the Ex-MATE API based implementation is more straightforward and only one stage is needed. Figure 4.11 shows the outline of the implementation. The inputs are the same as in map-reduce. The user now needs to declare a reduction object containing the partial sums for the output vector. The reduction function would simply combine (id_src, id_dst, m_val) with (id_dst, v_val) from the given inputs. The partially combined results from combine(m_val, v_val) are used to update the node (id_src) of the output vector.

By comparing the two implementations, we can make the following observations. With the Ex-MATE API, the reduction object is explicitly declared to store the partial results of the output vector. For each partially combined result computed from (mval, vval) pair, the corresponding element of the reduction object is updated in the reduction function. When all reduction tasks are finished, the reduction object will contain the final results for the output vector. In map-reduce, we need two stages. In the first stage, the map function

outputs the (key, value) pairs with id_dst as the key. The reduce function would simply reduce these partially combined results for the corresponding element of the output vector.

In the second stage, the map function will simply emit the input (key, value) pairs without

any computation and these pairs will be grouped by id_src. In the reduce function that

follows, all partial vectors could be combined together as the final results for each row.

4.2.3 Other Algorithms

We now briefly describe the other two applications. Diameter Estimation is used to

estimate the diameter and the set of radii for all nodes in a given graph. The diameter of

a graph is defined as the maximum length among the shortest paths between all pairs of nodes. The radius of a node v_i refers to the number of hops for the node v_i to reach its farthest peer. HADI [85] is an algorithm to calculate such parameters. In particular, the number of neighbors reachable from node v_i within h hops is encoded in a probabilistic

bit-string b_i^h. Thus, the main idea of HADI is to update b_i^h as h increases in each iteration. Like PageRank, HADI can also be mapped to GIM-V using the expression b^{h+1} = M ×_G b^h, where M is an adjacency matrix. b^{h+1} is a vector of length n, which is updated by

    b_i^{h+1} = assign(b_i^h, combineAll_i({x_j | j = 1 ... n}))
    x_j       = combine(m_{i,j}, b_j^h)

And its three customized operations are:

    combine(m_{i,j}, v_j)       = m_{i,j} × v_j
    combineAll_i(x_1, ..., x_n) = BITWISE-OR{x_j | j = 1 ... n}
    assign(v_i, v_new)          = BITWISE-OR(v_i, v_new)

Similarly, HCC [86] is a new algorithm to find the connected components of large graphs. For each node v_i, we use c_i^h to represent the minimum node id within h hops from v_i. We start from c_i^0 with its initial value being its own node id i. In each iteration, c_i^h is updated by finding the minimum value among its current component id and the component ids received from its neighbors. The important observation is that this process can also be applied to GIM-V with the expression c^{h+1} = M ×_G c^h, where M is an adjacency matrix. c^{h+1} is a vector of length n, which is updated as

    c_i^{h+1} = assign(c_i^h, combineAll_i({x_j | j = 1 ... n}))
    x_j       = combine(m_{i,j}, c_j^h)

And its three customized operations are:

    combine(m_{i,j}, v_j)       = m_{i,j} × v_j
    combineAll_i(x_1, ..., x_n) = MIN{x_j | j = 1 ... n}
    assign(v_i, v_new)          = MIN(v_i, v_new)

4.3 Experimental Results

In this section, we present results from an evaluation study comparing the Ex-MATE

system with the Hadoop implementation of map-reduce on a cluster of multi-core machines. Particularly, we have compared graph mining implementations on Ex-MATE against the implementations that were part of the PEGASUS system, focusing on PageRank, Diameter

Estimation, and finding Connected Components, which were all described in the previous

section.

[Plot omitted: Avg. Time Per Iteration (min) vs. # of Nodes for P-Naive, P-Block, and Ex-MATE.]

Figure 4.3: PageRank: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 16G matrix

Our experiments were conducted on a cluster of multi-core machines. Each node uses two quad-core Intel Xeon E5345 CPUs (8 cores in all). Each core has a clock frequency of 2.33 GHz and the main memory is of size 6 GB. We have used up to

128 cores for our study.

PEGASUS provides a naive implementation of GIM-V in map-reduce as well as an optimized block version of the multiplication. We have included the results for both the naive and the block implementations, denoting these versions as P-Naive and P-Block. We executed each application for a number of iterations and take the average time per iteration as the comparison criterion. The P-Block versions involve a non-trivial amount of time for pre-processing the input datasets. Because this additional pre-processing step needs to be done only once, and its output can be reused by every iteration, we do

[Plot omitted: Avg. Time Per Iteration (min) vs. # of Nodes for P-Naive, P-Block, and Ex-MATE.]

Figure 4.4: HADI: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 16G matrix

not include the pre-processing times while reporting our results. We also carefully tuned

Hadoop to achieve the best possible performance.

Our initial study used a graph with 256 million nodes and 1 billion edges, represented by an adjacency matrix of 16 GB, for all three algorithms. Along with the matrix, we have an input vector with its length being the number of nodes and a size of 2 GB.

The initial value of the input vector is algorithm-specific.

The output vector is represented by the reduction object, which has the same size as the input vector. The 6 GB main memory on each node needs to be shared by input buffers, reduction object(s), and other application and kernel memory requirements. Thus, it is clearly not possible to have a private allocated copy of the 2 GB reduction object for each worker thread. The split size of the reduction object is set to 128 MB for our experiments.

[Plot omitted: Avg. Time Per Iteration (min) vs. # of Nodes for P-Naive, P-Block, and Ex-MATE.]

Figure 4.5: HCC: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 16G matrix

Figures 4.3 through 4.5 show the comparison results for these three applications on the

16 GB matrix as we scale the number of nodes. As we can see in Figure 4.3, for PageRank, our system outperforms both P-Naive and P-Block on 4, 8, and 16 nodes. In particular, Ex-MATE is 35 times faster than P-Naive and 10 times faster than P-Block on 16 nodes. Ex-MATE and PEGASUS show similar scalability as we increase the number of nodes. The speedup is around 2.2 for both systems when the number of nodes is increased from 4 to 16. This is due to some fixed overheads in both systems, related to the communication of a large reduction object (or of the (key, value) pairs produced by the combine function). P-Block is faster than P-Naive because the block version can reduce the size of the intermediate results and also compress the input dataset. But it is still slower than Ex-MATE in all cases.

[Plot omitted: Avg. Time Per Iteration (min) vs. # of Nodes for P-Naive, P-Block, and Ex-MATE.]

Figure 4.6: HCC: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 8G matrix

We have similar observations from Figure 4.4 (for HADI) and Figure 4.5 (for HCC).

For HADI on 16 nodes, Ex-MATE is 27 times as fast as P-Naive and 11 times as fast as

P-Block. The speedup from 4 nodes to 16 nodes is 1.9 on Ex-MATE, 1.8 on P-Naive, and 1.7 on P-Block respectively. For HCC, Ex-MATE is 23 times faster than P-Naive and 9 times faster than P-Block with 16 nodes. Ex-MATE has a speedup of 1.8 from 4 nodes to 16 nodes, P-Naive has around 1.9 and P-Block is around 1.7. The similar trends in performance difference are consistent with the fact that all the three applications are implemented using the same underlying GIM-V method.

We also vary the dataset size to evaluate the scalability of two systems. We run the HCC algorithm on three additional graphs, which are represented by three matrices of the size 8

GB, 32 GB, and 64 GB, respectively. The size of matrices is tuned by varying the number of edges while keeping the same number of nodes in the graph, so the size of the input

[Plot omitted: Avg. Time Per Iteration (min) vs. # of Nodes for P-Naive, P-Block, and Ex-MATE.]

Figure 4.7: HCC: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 32G matrix

vector as well as that of the reduction object remain 2 GB. From Figures 4.6 through 4.8,

we can observe a similar trend - Ex-MATE is still much faster than P-Naive and P-Block for

the three additional dataset sizes. As we increase the dataset size, we can see that the Ex-

MATE has better speedups from 4 to 16 nodes due to an increased amount of computation

and also better utilization of the file system throughput. For a smaller dataset (8 GB), we

have 1.9 speedup from 4 to 16 nodes, whereas the corresponding speedup is 2.8 with the

64 GB dataset. For PEGASUS, the speedup from 4 to 16 nodes is around 2 for the three

additional datasets. As the trends with the other 2 algorithms were very similar, detailed

results from other datasets are not presented here.

Overall, our experiments show that the Ex-MATE system can reduce the completion time significantly compared with PEGASUS. The Ex-MATE also has reasonable speedups

[Plot omitted: Avg. Time Per Iteration (min) vs. # of Nodes for P-Naive, P-Block, and Ex-MATE.]

Figure 4.8: HCC: Comparison between Ex-MATE and PEGASUS on 4, 8 and 16 nodes for a 64G matrix

for all the three applications on four different datasets and relative speedups seem to im-

prove for larger datasets.

4.4 Summary

This work has described the implementation and evaluation of Ex-MATE, which is based on an alternate high-level API for developing data-intensive applications. Unlike our previous work on the MATE system, Ex-MATE has support for managing disk-resident reduction objects, which are required for many data-intensive applications. We have evalu- ated the Ex-MATE system with three popular graph mining algorithms, which are PageR- ank, Diameter Estimation, and finding Connected Components. We also compared our system with PEGASUS in an environment with up to 128 cores. Our experimental results

85 show that the extended MATE achieves much better performance than Hadoop based PE-

GASUS. The scalability of our system is better with larger-sized datasets. Our API also results in simpler code than the use of the original map-reduce API.

In summary, our approach offers a promising alternative for developing efficient data- intensive applications. While our approach is based on a distinct API, it is still a very high-level API and should be easy to learn for those familiar with map-reduce.

MAPREDUCE STAGE-1 (GIM-V)

INPUT:  Matrix M = {(id_src, (id_dst, m_val))}, Vector V = {(id, v_val)}
OUTPUT: Partial Vector V = {(id_src, combine2(m_val, v_val))}

void map(Key k, Value v) {
    if (k, v) is of type V {
        emit_intermediate(id, v_val);
    }
    if (k, v) is of type M {
        emit_intermediate(id_dst, (id_src, m_val));
    }
}

void reduce(Key k, Values vals) {
    matrix_column = {};
    vector_val = 0;
    for each v ∈ vals {
        if (k, v) is of type V {
            vector_val = v;
            emit(k, ("v", vector_val));
        }
        if (k, v) is of type M {
            matrix_column.add(id_src, m_val);
        }
    }
    for each (id_src, m_val) ∈ matrix_column {
        result = combine2(m_val, vector_val);
        emit(id_src, ("m", result));
    }
}

Figure 4.9: MapReduce Stage-1 Algorithm for GIM-V

MAPREDUCE STAGE-2 (GIM-V)

INPUT:  Partial Vector PV = {(id_src, p_v_val)}
OUTPUT: Final Vector V = {(id_src, new_v_val)}

void map(Key k, Value v) {
    emit_intermediate(k, v);
}

void reduce(Key k, Values vals) {
    matrix_row = {};
    vector_val = 0;
    for each v ∈ vals {
        (tag, result) = v;
        if tag == "v" {
            vector_val = result;
        }
        if tag == "m" {
            matrix_row.add(result);
        }
    }
    final_result = combineAll(matrix_row);
    emit(k, assign(vector_val, final_result));
}

Figure 4.10: MapReduce Stage-2 Algorithm for GIM-V

GENERALIZED REDUCTION (GIM-V)

INPUT:  Matrix M = {(id_src, (id_dst, m_val))}, Vector V = {(id, v_val)}
OUTPUT: Final Vector V = {(id_src, new_v_val)}

void reduction(void *reduction_data) {
    partial_result = 0;
    vector_val = 0;
    matrix_split = reduction_data->matrix;
    vector_split = reduction_data->vector;
    for each (id_src, id_dst, m_val) ∈ matrix_split {
        vector_val = vector_split.get(id_dst);
        partial_result = combine2(m_val, vector_val);
        // combineAll() is done inside accumulate()
        accumulate(0, id_src, partial_result);
    }
}

void update_output_vector(void*) {
    final_result = 0;
    for each vector_id ∈ V {
        final_result = reduction_object.get(0, vector_id);
        assign(vector.get(vector_id), final_result);
    }
}

Figure 4.11: Generalized Reduction Algorithm for GIM-V

89 Chapter 5: MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters

Both the MATE and Ex-MATE systems target programming of traditional CPU-only platforms. Nowadays, clusters of GPUs have rapidly emerged as the means for achieving extreme-scale, cost-effective, and power-efficient high performance computing. While people have been trying to port map-reduce to a single GPU or to GPU clusters, developing map-reduce-like models for programming a heterogeneous CPU-GPU cluster remains an open challenge.

So, in this work, we present the MATE-CG system, which is a map-reduce-like framework based on the generalized reduction API. We develop support for enabling scalable and efficient implementation of data-intensive applications in a heterogeneous cluster of multi-core CPUs and many-core GPUs. Our contributions are three-fold: 1) we port the generalized reduction model to clusters of modern GPUs with a map-reduce-like API, dealing with very large datasets; 2) we further propose three schemes to better utilize the computing power of CPUs and/or GPUs and develop an auto-tuning strategy to achieve the best-possible heterogeneous configuration for iterative applications; 3) we provide several useful insights into the programmability and performance of CPU-GPU clusters by highlighting application-specific optimizations.

90 We evaluate our system using three representative data-intensive applications and report

results on a heterogeneous cluster of 128 CPU cores and 16 GPUs (7168 GPU cores).

We show an average speedup of 87× on this cluster over execution with 2 CPU-cores.

Our applications also achieve an improvement of 25% by using CPU cores and GPUs

simultaneously, over the best performance achieved from using only one of the types of

resources in the cluster.

5.1 System Design and Implementation

In this section, we discuss the design and implementation of the MATE-CG system.

5.1.1 Detailed System API

The goal of the API is to enable the specification of the reduction functions and to support efficient execution on a heterogeneous cluster. To this end, some additional information is required from the application implementor. Particularly, we need to be able to partition the data for processing among the nodes, further divide it among the CPU cores and the GPU, and then generate communication between the host and the GPU, as well as from the GPU to the host.

The API for the programmers is summarized in Table 5.1. R means that the function must be defined by the user and O refers to the optional functions for which default implementations are provided. Other functionalities like system initialization, data loading, and task scheduling are taken care of by the runtime. The runtime declares two new data types, named input_space_t and reduction_object_t. The input_space_t contains the input data, organized in blocks or chunks, while the reduction_object_t represents the output data, which

Table 5.1: Descriptions of Important Data Types and Functions in the System API.

Data Type Declarations:

typedef struct { size_t length; void *data; } input_space_t;
typedef struct { size_t length; void **reduction_array; } reduction_object_t;

Functions Defined by the User (R: required, O: optional):

void (*cpu_reduction_t)(input_space_t*, reduction_object_t*) [R]: The cpu_reduction function. CPU workers execute this function on the input and update the host reduction object.
void (*gpu_reduction_t)(input_space_t*) [R]: The gpu_reduction function. The GPU executes the CUDA kernel functions on the input in the device memory.
void (*gpu_update_host_t)(reduction_object_t*) [R]: The gpu_update_host function. Updates the host reduction object based on results copied back from the GPU.
void (*combination_t)(reduction_object_t**) [O]: The combination function. Combines partial results from all computing nodes to form a final copy of the reduction object.
void (*process_next_iteration)(void*) [O]: Notifies the runtime that the computation is proceeding to the next iteration.
int (*partitioner_t)(char *input_data, char *meta_data) [O]: The data partitioning function. Partitions the input datasets among computing nodes.
int (*splitter_t)(input_space_t *blocks, int req_units, input_space_t *chunks) [O]: The data splitting function. Splits data blocks in memory into data chunks among CPU/GPU reduction workers.

is a two-dimensional array in the form of (key, set of values). The specification of the re- duction functions builds on top of these data types. Moreover, these data type declarations help generate data movement functions.

Two of the most important user-defined functions are cpu_reduction and gpu_reduction. The cpu_reduction processes data in the input space assigned to the CPU and updates the reduction object in the CPU host memory. The gpu_reduction contains the CUDA kernels to perform similar computation on GPUs. Since the GPU card has its own device memory, we need to generate a set of data-copying functions to move data in the input space and the reduction object between the CPU host memory and the GPU device memory. After the GPU computation and data movement are done, gpu_update_host is invoked to perform updates of the reduction object in the CPU host memory based on the results copied back from the GPU.
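As an illustration of how these pieces fit together, the fragment below sketches a cpu_reduction_t callback for a trivial global sum, using the data types from Table 5.1. It assumes, purely for illustration, that length counts the double elements in the chunk and that reduction_array[0] points to a single double accumulator; the actual layout of the reduction object is application-defined.

/* cpu_sum_reduction.c (fragment): a hedged sketch of a cpu_reduction_t
 * callback; the element type and reduction-object layout are assumptions. */
#include <stddef.h>

typedef struct { size_t length; void *data; } input_space_t;
typedef struct { size_t length; void **reduction_array; } reduction_object_t;

/* matches the cpu_reduction_t signature: process one chunk of the input space
 * and fold its contribution into the host reduction object */
static void cpu_sum_reduction(input_space_t *in, reduction_object_t *obj) {
    const double *vals = (const double *)in->data;
    size_t n = in->length;                            /* assumed: element count */
    double *acc = (double *)obj->reduction_array[0];  /* assumed: one accumulator */
    for (size_t i = 0; i < n; i++)
        *acc += vals[i];
}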

Besides, the programmers can also define the application-specific partitioner function

and splitter function to perform their own data partitioning. The partitioner function is

used to partition the disk-resident input datasets among all computing nodes. The splitter function is used to split the data blocks into smaller chunks and provide them for processing to the CPUs and the GPUs, respectively. Note that, to maintain the computation state, a

final copy of the reduction object will be stored after each iteration and can be retrieved in the future.

To communicate between the user code and the runtime, the MATE-CG system defines a data structure named scheduler_args_t. The basic fields in this structure should be set

appropriately in the user code since they pass necessary and specific information to the

runtime, such as the paths to store the input data and the function pointers to user-defined

operations. The optional fields are used for scheduling computation on CPUs and GPUs

and also tuning the performance. For example, the partitioning parameter controls the

percentage of data to be processed on CPUs and GPUs respectively.

5.1.2 Runtime System

Figure 5.1 gives an overview of the execution work-flow for a typical iterative applica-

tion using MATE-CG in a heterogeneous computing environment. The runtime is respon-

sible for partitioning the input datasets among computing nodes, further assigning data

[Diagram omitted: the job scheduler and distributed file system feed data blocks to per-node reduction workers, which run Partition(), Split(), CPU_Reduction(), GPU_Copy_Host2Dev(), GPU_Reduction(), GPU_Copy_Dev2Host(), and GPU_Update_Host(); combination workers then run Combination() and Finalize() over the reduction objects and intermediate data in the Combination phase.]

Figure 5.1: Execution Work-Flow Overview in MATE-CG

chunks to CPUs and GPUs, and scheduling the tasks for all stages. Note that the advan- tage of GPU over CPU is primarily computation, so the CPU handles the data partitioning, task scheduling and other data management tasks, and the GPU is mainly responsible for accelerating computation.

The runtime system invokes the user-defined functions in different phases and these functions act as the interaction between the runtime and the user code. The most important phase is the Reduction phase, where all the data instances are processed and the reduction

94 objects are updated. After the Reduction, the combination function will merge results from multiple nodes and form a final copy of the reduction object. The interprocess communica- tion is automatically invoked by the runtime system. The user can choose to operate on the

final reduction object in the finalize function to produce outputs on disk, which could pos- sibly be utilized for next iteration. This processing style can continue for multiple passes until the application finishes.

One interesting aspect of MATE-CG is the support of multi-level data partitioning for the workload distribution. This enables processing of very large datasets in a distributed heterogeneous environment. First, given the disk-resident input datasets, the runtime sys- tem will invoke a partitioning function to partition the inputs into a number of data blocks.

A default partitioner function is provided by the runtime, but the user can also specify a customized partitioner function. Then, as one of the scheduling tasks, the input blocks are distributed and assigned to different nodes. Data locality is considered so that each node can read most of the input blocks assigned to it from its own local disk. Customized parti- tioning functions enable better locality while updating a disk-resident reduction object. It could also help to reduce network communication of possible large-sized intermediate data generated in some applications.

After each node has its own set of data blocks, it can load one or more blocks from disk into memory. According to the partitioning parameter, each data block in the main memory is cut into two parts, one for the CPU and the other for the GPU. Furthermore, for the CPU or GPU, its part of each data block is split into smaller data chunks and these data chunks are dynamically retrieved by the CPU or GPU. In this way, the CPU and GPU can work on different data chunks concurrently to support the heterogeneous computing. The further splitting is needed to support better load balancing for the CPU and address the device

95 memory limitation on the GPU. Also, studies have shown that the CPU and GPU should

have different settings for the chunk size [115, 114]. A small chunk size for the CPU could

help to achieve better load balancing for its multi-core threading while a potentially large

chunk size for the GPU could lower latency in data movement between the host memory

and the device memory.

To fully utilize the potential in heterogeneous systems, the work-load should be dis- tributed to different processing elements carefully and this data mapping can be done ei- ther dynamically [94, 115, 114], or statically [139, 92, 66]. The dynamic way automates the mapping process and saves coding labor but introduces extra overheads and the static way achieves better performance but requires searching efforts for the best-possible con-

figuration. The work distribution method used in our system involves a new auto-tuning approach that helps to achieve the best-possible configuration automatically for iterative applications. This is discussed in the next section.

Management of reduction objects, including the possibility that they can be disk-resident, is transparent to application developers. Thus, while the large-sized reduction objects are allocated on local disks, the user can write applications manipulating them as if they are in memory. Updates to the disk-resident locations are handled by the runtime system. Especially, the data access patterns observed in many applications are used to improve the disk I/O performance for updating the reduction objects. Note that these optimizations are done in the runtime, and thus invisible to the users.

5.2 System-Tuning for Optimizing Runtime Execution

In this section, we show how applications are optimized to help improve runtime ex- ecution. In particular, we studied three important parameters to be tuned. One setting is

96 the CPU-GPU data distribution fraction, which is auto-tuned based on an analytical model.

Two others are the chunk sizes for the CPUs and the GPUs, respectively. While the values

of these two parameters can be supplied by the application developer, our system provides

default values based on analytical models and off-line training.

5.2.1 Auto-Tuning for Optimizing Heterogeneous Execution

One of the features of our runtime support is auto-tuning for optimizing a data-intensive application for heterogeneous execution. The auto-tuning parameter which needs to be tuned for performance is the fraction of the data that has to be processed on the CPU’s cores (with the remaining data to be processed by the GPU). Because of the difference in the capabilities of CPU cores and GPUs, the optimal value of this parameter is dependent upon the nature of the computations performed by an application.

Our overall approach is based on developing an analytical model of the performance of the application with the use of heterogeneous resources. We focus on iterative data intensive applications, which perform very similar computation over a number of iterations.

For such applications, the values of the key terms needed by our model can be learnt over the first few iterations. Then, the optimal value of the parameter based on this model can be obtained. The advantage of our approach is that no compile-time search or tuning is needed. Moreover, the only runtime overhead is the non-optimal execution of the first few iterations.

The analytical model we use is as follows. Let T_{cg} represent the overall execution time on each node with the use of heterogeneous resources. The parameter to be tuned, p, is the fraction of the input data (for this node) that has to be processed by the CPU cores.

We formulate an equation to relate T_{cg} and p. As Figure 5.1 shows, on each node, the

97 overall running time for each iteration consists of two main components. The first is the

data processing time incurred on the CPU and/or the GPU. Its value varies with the choice

of the value of the parameter p. The second term is the overhead incurred on the CPU,

such as input data loading time and the combination/finalize time, and is independent of

the value of p.

Let T_o represent these fixed times on the CPU. If we use T_c to denote the overall running time on each node when only CPUs are used for data processing, we have the following expression:

    T_c = T_{c_p} + T_o

where T_{c_p} is the data processing time on the CPU.

Similarly, if we use T_g to denote the overall running time on each node when only GPUs are used for data processing, we have:

    T_g = T_{g_p} + T_o

where T_{g_p} is the data copying and processing time on the GPU.

Therefore, when both CPUs and GPUs are enabled to process different portions of data chunks, T_{cg} for each node can be represented as:

    T_{cg} = MAX(T'_{c_p}, T'_{g_p}) + T_o                      (5.1)

where T'_{c_p} is the CPU's time for processing the fraction p of all data, and T'_{g_p} is the GPU's data movement and processing time for the remaining fraction of the data.

Then, for a particular value of p, we can have the following two expressions to calculate T'_{c_p} and T'_{g_p}, respectively, based on T_{c_p} and T_{g_p} for each node:

    T'_{c_p} = p × T_{c_p}                                      (5.2)

[Plot omitted: overall execution time per iteration vs. the percentage of data assigned to the CPUs; the CPU cost grows from T_o toward (1, T_c), the GPU cost falls from T_g toward (1, T_o + T_{g_o}), and T_{cg} follows the larger of the two.]

Figure 5.2: An Illustration of Expression (5.1): T_{cg} is a function of p

and

    T'_{g_p} = (1 − p) × (T_{g_p} − T_{g_o}) + T_{g_o}          (5.3)

where T_{g_o} represents the time for moving data structures between the CPU and the GPU.

Note that Expressions (5.2) and (5.3) can be applied here if the applications exhibit good

data-parallelism and there is no data-skew. For other applications, either more sophisticated

modeling or off-line training may be needed.

Given Expressions (5.1), (5.2), and (5.3), we can establish a relationship between T_{cg} and p, as illustrated through Figure 5.2. Here, the dashed-dotted-solid line represents

Expression (5.2) and the solid-dotted line represents Expression (5.3). Expression (5.1) is

shown as the solid curve.

We can see that the minimal value of T_{cg} is reached at the intersection of the two lines. In other words, the optimal p can be calculated as follows:

    p × T_{c_p} + T_o = (1 − p) × (T_{g_p} − T_{g_o}) + T_{g_o} + T_o

    p = T_{g_p} / (T_{g_p} + T_{c_p} − T_{g_o})                  (4)

As we stated above, for applications that need a number of iterations of very similar

computation to finish, we can calculate the optimal predicted p in the first few iterations. In

practice, we use the following simple heuristic. We run the first iteration using CPUs only

(i.e., p is set to be 1 for each node) and obtain the value for T_{c_p}. In the second iteration, we only use the GPUs instead (i.e., p is set to be 0 for each node), to obtain the values for both T_{g_p} and T_{g_o}. Then, in the third iteration, we can calculate an initial p based on

Expression (4). We execute the third iteration with this value of p. For later iterations, p could be further adjusted based on the results attained from the previous iteration, since in practice there can be some variation in the measured values. For applications that do not have such iterative structure, off-line training is needed.
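The heuristic reduces to a few lines of arithmetic. The sketch below computes p from Expression (4) using the values measured in the first two iterations, clamping the result to [0, 1]; the timing numbers in main() are illustrative, not measurements from our cluster.

/* tune_fraction.c: compute the CPU data fraction p per Expression (4) from the
 * CPU-only time T_c_p and the GPU-only times T_g_p and T_g_o. */
#include <stdio.h>

static double tune_p(double t_c_p, double t_g_p, double t_g_o) {
    double p = t_g_p / (t_g_p + t_c_p - t_g_o);   /* Expression (4) */
    if (p < 0.0) p = 0.0;                         /* clamp to a valid fraction */
    if (p > 1.0) p = 1.0;
    return p;
}

int main(void) {
    /* illustrative timings (seconds): CPU-only processing 120 s; GPU-only
     * processing 40 s, of which 10 s is host/device data movement */
    double t_c_p = 120.0, t_g_p = 40.0, t_g_o = 10.0;
    double p = tune_p(t_c_p, t_g_p, t_g_o);
    printf("send %.1f%% of the data to the CPU cores, %.1f%% to the GPU\n",
           100.0 * p, 100.0 * (1.0 - p));
    return 0;
}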

5.2.2 Determining the Settings for CPU/GPU Chunk Sizes

As discussed in Section 5.1.2, in the runtime, after the CPU and the GPU obtain their part of a data block in the host memory based on the data distribution, a further splitting is needed to support the load balancing for the CPU cores and lower the latency for the data transfer for the GPU. More importantly, the CPU and the GPU should have different settings for their chunk sizes. Intuitively, a small chunk size for the CPU could achieve better load balancing among the multiple CPU threads, while a large chunk size for the

GPU could minimize the impact of the data transfer latency on the execution time.

100 We now describe how we can use analytical models and/or off-line training to study

how to set a proper chunk size to achieve the best-possible execution time for the CPU and

the GPU respectively. First, on the CPU, the data chunks in a data block are assigned to

the CPU threads in a dynamic way. To process a data chunk, each CPU thread will need

to invoke an internal splitter function where the next available chunk is computed and as- signed to the CPU thread. Note that the splitter function is shared by all CPU threads and protected by a mutex. At runtime, each CPU thread needs to lock the splitter mutex first, then invoke the splitter function, and finally unlock the mutex before moving on to the com- putations. In this way, all CPU threads can work on the retrieved chunks simultaneously.

This locking/unlocking process is required before processing a data chunk. Thus, besides computations on the data chunk, there are overheads associated with the locking/unlocking process. These overheads, in turn, are related to the number of chunks to be processed by each CPU thread. Clearly, an extremely small chunk size will imply better load balancing but will result in more locking/unlocking overheads. Therefore, we need to achieve a trade-off between load balancing and locking/unlocking overheads.

Due to the dynamic splitting, the data chunks cannot be fully equally distributed to all

the CPU threads in an ideal way. Since each CPU core in a multi-core CPU has the same

processing power, the load imbalance l in a given dynamic splitting for a CPU data block

can be represented by the ratio of the largest number of data chunks for a CPU thread over

the average number of chunks among all CPU threads. Formally,

    l = (n × max_{i=1}^{n} C_i) / (\sum_{i=1}^{n} C_i)    (l ≥ 1)

where n is the number of CPU threads and C_i refers to the number of chunks assigned to the i-th CPU thread.

Let b_{cpu} be the CPU block size (the number of elements in a CPU data block) and c_{cpu} be the CPU chunk size (the number of elements in a CPU data chunk). The overall data processing time T_{cpu} for a data block on the CPU can be defined as follows:

    T_{cpu} = (max_{i=1}^{n} C_i) × (c_{cpu} × T_{c_comp} + T_{locking})

where T_{c_comp} represents the average data computation time per element in a data chunk for a CPU thread, and T_{locking} represents the average overhead associated with the locking/unlocking process while retrieving a data chunk. Thus, we have:

    \sum_{i=1}^{n} C_i = ⌈ b_{cpu} / c_{cpu} ⌉

Based on the above three expressions, we can have the following expression:

    T_{cpu} = (l × ⌈ b_{cpu} / c_{cpu} ⌉ / n) × (c_{cpu} × T_{c_comp} + T_{locking})
            = (b_{cpu} / n) × (l × T_{c_comp} + (l / c_{cpu}) × T_{locking})        (5)

Clearly, in Expression (5), given a block size and a number of CPU threads, the execution time T_{cpu} is mainly impacted by the load imbalance l and the CPU chunk size c_{cpu}. To further understand how the CPU chunk size affects the load imbalance, we have obtained a set of empirical results using off-line training.

Figure 5.3 illustrates the overall trend for the average load imbalance ratio with re-

spect to the ratio between the CPU chunk size and the CPU block size. As comparison,

we include the load imbalance ratio for an ideal splitting where each CPU thread will be

assigned exactly the same number of data chunks. From this, we can make the following

observations: given a number of CPU threads n and a block size b, when the chunk size

is relatively large, e.g., the same as the block size, the load imbalance ratio is quite high.

When the chunk size becomes smaller and reaches b/n, the average load imbalance ratio

[Plot omitted: average load imbalance ratio (from 1 up to n) vs. the ratio of CPU chunk size to CPU block size, for the dynamic splitting and for an ideal splitting.]

Figure 5.3: The Avg. Load Imbalance Ratio with Varying the CPU Chunk Size: n is the number of CPU threads and b is the CPU block size (b_{cpu})

drops significantly close to 1. Overall, Figure 5.3 shows that the off-line training can be used to construct the relationship between the load imbalance and the CPU chunk size.

Therefore, based on Expression (5) and the off-line training, we can summarize an approach to set the CPU chunk size. To find the optimal CPU chunk size so as to minimize T_{cpu} in Expression (5), we could apply a heuristic search over all possible CPU chunk sizes. This search process, together with the required off-line training, could introduce extra overheads to our runtime system. Therefore, currently, the CPU chunk size in our system is set to be b_{cpu}/(16 × n) by default. A more aggressive optimization module could still perform a search, and alternatively, an experienced application developer may supply a different value based on their experience.
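The sketch below evaluates Expression (5) for a few candidate chunk sizes, including the default b_cpu/(16 × n); the per-element compute cost, the locking cost, and the load-imbalance ratios are illustrative stand-ins for values that would come from off-line training.

/* cpu_chunk_model.c: evaluate Expression (5) for a few CPU chunk sizes. */
#include <stdio.h>

static double t_cpu(double b, double c, double n, double l,
                    double t_comp, double t_lock) {
    /* Expression (5): T_cpu = (b/n) * (l * T_c_comp + (l / c_cpu) * T_locking) */
    return (b / n) * (l * t_comp + (l / c) * t_lock);
}

int main(void) {
    double b = 1 << 24;          /* elements per CPU data block */
    double n = 8;                /* CPU threads */
    double t_comp = 1e-8;        /* assumed seconds of compute per element */
    double t_lock = 5e-6;        /* assumed seconds per splitter lock/unlock */

    /* assumed imbalance ratios: large chunks balance poorly, small ones well */
    double chunk[3] = { b, b / n, b / (16 * n) };
    double l[3]     = { n, 1.3, 1.05 };

    for (int k = 0; k < 3; k++)
        printf("chunk = %.0f elements, l = %.2f -> T_cpu = %.3f s\n",
               chunk[k], l[k], t_cpu(b, chunk[k], n, l[k], t_comp, t_lock));
    return 0;
}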

As we stated earlier, the GPU chunk size is another factor that can be important for the system performance. On the GPU, since the data movement overheads are the dominant consideration, we can use the following expression to calculate the data processing time

T_{gpu} for a data block on the GPU.

    T_{gpu} = ⌈ b_{gpu} / c_{gpu} ⌉ × (T_s + T_w × c_{gpu} + T_{g_comp} × c_{gpu})
            = ⌈ b_{gpu} / c_{gpu} ⌉ × T_s + b_{gpu} × (T_w + T_{g_comp})            (6)

where b_{gpu} is the GPU block size (the number of elements in a GPU data block), c_{gpu} is the GPU chunk size (the number of elements in a GPU data chunk), T_s is the average start-up time for the data transfer, T_w is the average data copying time per element in a GPU data chunk, and T_{g_comp} is the average GPU computation time per element in a GPU data chunk.

From Expression (6), it is easy to see that the GPU chunk size should be set to be as large as the GPU block size if the GPU device memory has enough space to store a data block. In this case, the start-up costs for data movement can be minimized. In our system, the GPU chunk size is set to be the same as the GPU block size by default. In some cases during the runtime, when the GPU device is short of memory, the GPU chunk size would be gradually decreased to fit into the available GPU device memory.
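Analogously, the following sketch evaluates Expression (6) for a few GPU chunk sizes, showing why larger chunks (fewer transfer start-ups) are preferred until the device memory limit is hit; the timing constants are illustrative.

/* gpu_chunk_model.c: evaluate Expression (6) for a few GPU chunk sizes.
 * Compile with -lm for ceil(). */
#include <stdio.h>
#include <math.h>

static double t_gpu(double b, double c, double t_s, double t_w, double t_comp) {
    /* Expression (6): T_gpu = ceil(b/c) * T_s + b * (T_w + T_g_comp) */
    return ceil(b / c) * t_s + b * (t_w + t_comp);
}

int main(void) {
    double b = 1 << 26;          /* elements per GPU data block */
    double t_s = 1e-3;           /* assumed transfer start-up cost (s) */
    double t_w = 2e-9;           /* assumed copy time per element (s) */
    double t_comp = 1e-9;        /* assumed GPU compute time per element (s) */

    for (double c = b; c >= b / 16; c /= 4)
        printf("chunk = %.0f elements -> T_gpu = %.4f s\n",
               c, t_gpu(b, c, t_s, t_w, t_comp));
    return 0;
}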

5.3 Experimental Results

In this section, we first introduce the three applications we focus on and then report results from a detailed evaluation of the MATE-CG system. Every node on the cluster we used comprises two Intel Xeon E5520 quad-core CPUs (8 cores in all), connected with an NVIDIA Tesla C2050 (Fermi) GPU with 448 cores in all (14 streaming multiprocessors each with 32 cores). Each CPU core has a clock frequency of 2.27 GHz and the machine

has a 48 GB main memory. The GPU has a clock frequency of 1.15 GHz per core and a

2.68 GB device memory. We have used up to 128 CPU cores and 7168 GPU cores (on 16

nodes) for our study.


Figure 5.4: PageRank: Comparison between CPU-1, CPU-8, and GPU-only on 2, 4, 8, and 16 nodes for a 64 GB dataset

Our evaluation was based on the three data-intensive applications we described in the previous section. We run each application in four modes in the cluster:

1. the CPU-1 mode (i.e., use only 1 CPU core per node as the baseline)

2. the CPU-8 mode (i.e., use all 8 CPU cores per node)

3. the GPU-only mode (i.e., use only the GPU per node)

4. the CPU-8-n-GPU mode (i.e., use both 8 CPU cores and GPU per node)

In our experiments, we focused on the following aspects:

• We first evaluate the performance of the three applications by scaling each applica-

tion with the number of GPUs (i.e., the number of nodes) used in the cluster. As


Figure 5.5: Gridding Kernel: Comparison between CPU-1, CPU-8, and GPU-only on 2, 4, 8, and 16 nodes for a 32 GB dataset

performance comparison, we run each application in the first three configurations

(CPU-1, CPU-8, and GPU-only) as we scale the number of nodes.

• We study the performance impact of varying p in CPU-8-n-GPU and show its per-

formance improvement over CPU-8 and GPU-only. We also evaluate the auto-tuning

strategy we proposed in Section 5.2.1 and show its effectiveness for iterative appli-

cations.

• We further show the performance impact of setting different CPU/GPU chunk sizes

for particular applications, to validate the analytical models in Section 5.2.2 and also

use them as examples of tuning important system parameters towards automatic op-

timizations in the future.


Figure 5.6: EM: Comparison between CPU-1, CPU-8, and GPU-only on 2, 4, 8, and 16 nodes for a 32 GB dataset

The datasets used in our performance studies are as follows: For PageRank, we chose a graph with 1 billion nodes and 4 billion edges, represented by an adjacency matrix of 64 GB and an input vector of 8 GB. The output vector represented by the reduction object is also of 8 GB. For the gridding kernel, we used a collection of around 800 million visibilities, represented by a dataset of 32 GB, and the output grid was set to be 6.4 GB. For EM, we used a dataset of 32 GB, containing 1 billion points to be clustered. Note that the 48 GB main memory on each node needs to be shared by input/output buffers, reduction object(s), and other application and kernel memory requirements. Thus, for PageRank and the gridding kernel, we had to put the reduction objects on disk since it is clearly not possible to have a private allocated copy of the reduction object for each worker thread. For EM, since the reduction object is relatively small, it could be maintained in the main memory.


Figure 5.7: PageRank: Auto-Tuning CPU Processing Fraction (Results from 16 nodes)

Also, for the use of GPUs, we did not apply data pipelining to try and overlap data transfers and computations on the GPUs. The reason is that pipelining works best (≤50% improvement) in the scenario where the data transfer and computational times are close to each other. For all three applications in this work, however, either the data transfer times are much more than the computations times (10-20 times for PageRank), or the computations are much more dominant than the data movements (70-100 times for EM and Gridding

Kernel). Thus, pipelining was not used for these applications.

5.3.1 Scalability Studies

We first run the applications in the first three modes: CPU-1, CPU-8, and GPU-only.

Figures 5.4 through 5.6 show the comparison results for the three applications as we scale the number of nodes.


Figure 5.8: EM: Auto-Tuning CPU Processing Fraction (Results from 16 nodes)

As we can see in Figure 5.4, for PageRank, GPU-only version outperforms both CPU-1 version and CPU-8 version on 2, 4, 8, and 16 nodes. For example, on 16 nodes, GPU-only version is 5 times faster than CPU-1 version and further improves CPU-8 version by 16%.

This is because the nature of processing in this application is well suited for acceleration using a GPU. Also, we can see that the MATE-CG system shows good scalability as we increase the number of nodes. The relative speedup of the GPU-only version is 6.3 when the number of GPUs is increased from 2 to 16. The system also has relative speedups of

7.0 and 6.8 for CPU-1 and CPU-8 versions, respectively, from 2 nodes to 16 nodes.

We have similar observations from Figure 5.5 (for the gridding kernel) and Figure 5.6

(for EM). For the gridding kernel, on 16 nodes, GPU-only version is around 6.5 times as fast as CPU-1 version and outperforms CPU-8 version by 25%. The relative speedups are

7.5, 7.2, and 6.9 for CPU-1, CPU-8, and GPU-only versions, respectively, as we increase the number of nodes from 2 to 16. For EM, GPU-only version is 15 times as fast as CPU-1


Figure 5.9: Gridding Kernel: Varying Percentage of Data to CPUs in the CPU-8-n-GPU mode (16 nodes)

version and 3 times faster than CPU-8 version with 16 nodes. The improvements using

GPUs in EM are larger than those in other two applications because we could put some of small-sized reusable data structures, such as the vector of parameters, into the shared memory on GPUs in the EM kernels. Also, EM achieves relative speedups of 7.6, 6.8, and

5.0 for CPU-1, CPU-8, and GPU-only versions, respectively, from 2 nodes to 16 nodes.

5.3.2 Auto-Tuning and Performance with Heterogeneous Computing

We now focus on the CPU-8-n-GPU mode, which involves acceleration using a hetero- geneous configuration, i.e., using both the multi-core CPUs and the GPUs. For PageRank and EM, which are both iterative applications, we show both how our auto-tuning method


Figure 5.10: Gridding Kernel: Varying CPU Chunk Size for CPU-8 and GPU Chunk Size for GPU-only on 16 nodes

is effective and the final speedups obtained from other (homogeneous) modes of execu-

tion. For the gridding kernel, which is a single pass algorithm, we demonstrate just the

performance obtained with different values of the CPU processing parameter.

In our experiments, we executed PageRank and EM for a number of iterations while enabling the auto-tuning framework. As we can see from Figures 5.7 and 5.8, the best- possible choice of the CPU processing fraction could be obtained through the auto-tuning method. Figure 5.7 shows that the running time of PageRank in one iteration decreased and became quite stable starting from the fifth iteration. For EM in Figure 5.8, the running time for both the E-step and the M-step became stable after three iterations. PageRank took more iterations to become stable because the first calculated initial p was not close enough to its best-possible value and needed further adjustments. This is due to some other

overheads in PageRank that are not considered in Expression (4). Also, the best-possible

CPU-8-n-GPU version for PageRank further improves the better of its CPU-8 and GPU-

only versions by around 7% with the p tuned to be 0.30 and the improvements for EM are around 29% with a p of 0.31 for the E-step and around 24% with a p of 0.27 for the M-step.

Moreover, we can see that if the application is executed for a large number of iterations, the runtime overhead of auto-tuning will be quite small.
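The exact auto-tuning formulation is given by Expression (4) in Section 5.2.1 and is not reproduced here; the sketch below only illustrates the general flavor of such per-iteration rebalancing, assuming execution times scale linearly with the assigned data fractions. All names and numbers are illustrative, not the MATE-CG method.

#include <iostream>

// After an iteration, pick the CPU fraction p' so that the (linearly extrapolated)
// CPU and GPU times for the next iteration would be equal.
double rebalance_cpu_fraction(double p, double cpu_time, double gpu_time) {
    double cpu_rate = cpu_time / p;            // time per unit fraction on the CPU cores
    double gpu_rate = gpu_time / (1.0 - p);    // time per unit fraction on the GPU
    return gpu_rate / (cpu_rate + gpu_rate);   // solves p' * cpu_rate == (1 - p') * gpu_rate
}

int main() {
    double p = 0.5;                            // initial CPU fraction
    // Pretend observations: the GPU is about twice as fast as the CPU cores here.
    double cpu_time = 40.0, gpu_time = 20.0;
    for (int iter = 0; iter < 5; ++iter) {
        p = rebalance_cpu_fraction(p, cpu_time, gpu_time);
        // Under the linear-scaling assumption, observed times track the fractions.
        cpu_time = 80.0 * p;                   // 80 s if the CPUs processed everything
        gpu_time = 40.0 * (1.0 - p);           // 40 s if the GPU processed everything
        std::cout << "iter " << iter << ": p = " << p << "\n";
    }
}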

Finally, for the gridding kernel, Figure 5.9 shows the performance curve as we vary p

(the percentage of data assigned to CPUs) in the CPU-8-n-GPU mode, when we run the application on 16 nodes. Compared with CPU-8 version and GPU-only version, we observe that the CPU-8-n-GPU version could further reduce the total running time by appropriately choosing p. As in Figure 5.9, the best-possible p should be able to improve CPU-8 version by more than 56% and GPU-only version by more than 42%. It is easy to see that the performance curve in Figure 5.9 matches the trends shown in Figure 5.2.

5.3.3 System Tuning with Setting CPU/GPU Chunk Sizes

The last set of experiments concerns the performance impact of varying the CPU chunk size and the GPU chunk size. First, we would like to demonstrate that the experimental results are

consistent with the rule-based settings about how to set the CPU/GPU chunk sizes as we

discussed in Section 5.2.2. Second, these can be viewed as examples of system tuning that

could be fully automated in the future. It would be quite promising to provide automatic

optimizations in terms of a set of system configurations that are important to the overall

performance.

Typically, we observe that a smaller chunk size for the CPU achieves better load bal- ancing among multiple threads and a larger chunk size for the GPU improves data copying

efficiency. So, we first executed the gridding kernel with the CPU-8 and GPU-only modes,

and varied the CPU/GPU chunk size between 4 MB and 512 MB, while keeping a block

size of 512MB. As Figure 5.10 shows, the trends are consistent with the above observation.

In particular, the 16 MB CPU chunk size obtains the best trade-off between data splitting

(locking/unlocking) overheads and task load balancing. Note that since the block size is

512 MB, a chunk size of 512 MB for the CPU would actually cause only one thread to perform the computation and thus degrade the performance significantly. For GPUs, the largest

possible chunk size of 512 MB achieves the best running time as long as it can fit in the

available device memory. Note that smaller GPU chunk sizes do not increase the total

running time by a significant margin, because the gridding kernel is compute-intensive.

We had similar observations for the two steps in the EM algorithm; and for PageRank,

where the ratio of data movement times is higher, we found out that larger chunk sizes

appeared to provide much better results than smaller chunk sizes. The detailed results are

not shown here because of space constraints.

5.4 Summary

This work has described the MATE-CG system, which supports a map-reduce-like API on a heterogeneous environment of CPUs and GPUs. One of the key contributions of this work is a novel auto-tuning approach for choosing the fraction of the data that should be processed by CPU cores.

We have evaluated our system using three representative data-intensive applications in three domains, which are PageRank from graph mining, EM from data mining, and the gridding kernel in Convolutional Resampling from scientific computing. We used up to

128 CPU cores and 7168 GPU cores for our performance studies in a heterogeneous cluster.

We have shown scalability with increasing number of nodes and cores, as well as with combined

use of CPU cores and GPUs. Our auto-tuning approach has been effective in choosing the

appropriate fraction within a few iterations. We also studied how the CPU/GPU chunk sizes

should be set appropriately to achieve the best-possible performance. We derived rules for

choosing these important parameters and demonstrated that the CPU chunk size we choose

leads to good performance.

In the future, we would like to introduce a more diverse set of applications including

I/O-bound applications, data-skew work-flows, and applications suitable for data-pipelining on GPUs, and explore new methods for improving their performance. The current auto-tuning model will also be enhanced to address the challenges associated with applications that have a data skew, as well as those arising from the use of the data-pipelining technique on

GPUs.

Chapter 6: FT-MATE: Supporting Efficient Fault Tolerance for MPI Programs in MATE Systems

It is widely accepted that the existing MPI-based fault-tolerance solutions will not be applicable in the exascale era, as with growing levels of concurrency and relatively lower

I/O bandwidths, the time required to complete a checkpoint can exceed the Mean-Time

To Failure (MTTF). Moreover, with energy consumption being the dominant consideration for exascale systems, and with data movement (including data movement for checkpoints) being one of the major energy consumers, novel solutions are clearly needed.

In one of our earlier works [26], we have shown that the reduction object model can achieve fault tolerance with lower overheads than Hadoop for data mining and graph min- ing applications. Therefore, in this work, we are trying to extend our ideas to support more efficient fault tolerance for MPI programs. Specifically, on top of the MATE systems developed in earlier chapters, we show that designing a programming model that explic- itly and succinctly exposes an application’s underlying communication pattern can greatly simplify fault-tolerance support, resulting in at least an order of magnitude reduction in checkpointing overheads over the current MPI-based solutions.

The communication patterns we consider are similar to the notion of dwarfs in the

Berkeley view on parallel processing. We show the design of a programming system that explicitly exposes these patterns, and based on this information, is able to checkpoint the

state. Our approach can tolerate both permanent node failures, where the failed node is excluded for the rest of the processing, and temporary node failures, where the failed node can recover and join the execution again. We have evaluated our approach using four applications, two each from the classes of dense grid computations and irregular reductions.

Our results show that the checkpointing overheads of our approach are very low, and when a failure occurs, the failure recovery is very efficient.

6.1 Fault Tolerance Approaches

In this section, we first introduce the popular fault-tolerance methods that are currently used. We then show how an alternate fault tolerance approach can be designed based on a high-level (BSP-like) programming model, and how applications of different types can be implemented using this model.

6.1.1 Checkpointing

The de facto practice of fault tolerance involves checkpoint and restart (C/R) [108,

55, 56, 67]. Checkpointing supports a rollback-based recovery. A program periodically outputs process and system state to persistent storage, and if there is a failure, a program can restart from the most recent point where the checkpoint was taken. If redundant nodes and fault prediction abilities are available, job/process migration [117, 36, 137], which is a pro-active fault tolerance mechanism, has also been proposed as a complement to C/R.

For message-passing distributed environments, there are two main families of check- pointing protocols, which are coordinated and uncoordinated checkpointing [55, 67]. Co- ordinated checkpointing is the most popular approach in HPC systems. Here, processes must coordinate to produce a consistent global system state, which consists of states of all processes and channels among the processes. If a failure occurs, all processes just rollback

to their last checkpoint and restart from there. It has the advantage of supporting sim-

ple recovery and efficient garbage collection, but the synchronization time for creating a

consistent global system state and writing their checkpoints to the persistent storage could

be expensive. It also causes bursty writes on the I/O system, which adversely impact the

application performance.

Uncoordinated checkpointing can be used to address the synchronization problem by allowing asynchronous checkpointing. Uncoordinated checkpointing is usually combined with message logging to limit the number of processes to rollback in the event of a failure.

Various message logging protocols have been proposed with different associated overheads of logging application messages [19].

Most MPI-based scientific programs are suitable for synchronous checkpointing be- cause it is simpler and it fits the MPI’s SPMD (Single-Program Multiple-Data) execution mode. As a result, almost all popular MPI implementations now support fault tolerance us- ing the coordinated checkpointing protocol in their library stacks, including OpenMPI [12],

MPICH2 [7], MVAPICH [8], AMPI [1], and MPICH-V [28], among others. Most of them are based on OS-level checkpoint libraries such as BLCR [2] and Condor [4], except for

AMPI, which is based on the Charm++ runtime system [3].

6.1.2 Our Approach: High-level Ideas

The approach we are proposing is based on the following observation. MPI is a very general programming model, and when checkpointing is performed at the level of MPI, the properties of the application are not taken into account. Most applications, on the other hand, have an iterative structure and loop with certain types of communication. If the

programming model explicitly captures the nature of the application, checkpointing can be

much more efficient and recovery can be more flexible.

The communication patterns we explicitly capture are along the lines of dwarfs in the

Berkeley view on parallel processing. Thus, our premise is that if the dwarf an application belongs to is exposed explicitly, both programmer productivity and fault-tolerance support can be improved. While we believe that this approach can be used for all (or almost all) dwarfs, our current work is limited to handling only three dwarfs. They are the applications involving generalized reductions, the applications involving a dense grid (i.e., stencil computations), and applications involving irregular meshes. Our earlier work has already shown that specialized and efficient fault-tolerance support can be enabled in a system that only supports generalized reductions [26]. Thus, the contributions of this work are in applying this idea to stencil computations and irregular reductions. We show how a high-level programming construct, which we refer to as the extended generalized reduction model, can support development of these classes of applications in a scalable fashion, and allow much more efficient support for resilience.

6.1.3 The Extended Generalized Reduction (EGR) Model

The original generalized reduction model exposes two functions that must be provided by the application developer. 1) Reduction: The reduction function specifies how, after processing one data instance, a reduction object (initially declared by the programmer) is updated. The result of this processing must be independent of the order in which data instances are processed on each processor. 2) Combination: In this function, the final results from multiple copies of a reduction object are combined into a single reduction

object. A user can choose from one of the several common combination functions already

118 implemented in the system library (such as aggregation, concatenation, etc.), or they can

provide one of their own.
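As a concrete illustration of these two functions, the sketch below uses a word-count-style aggregation as a stand-in application; the reduction-object type and the function signatures are our own simplification, not the actual MATE API.

#include <iostream>
#include <map>
#include <string>
#include <vector>

using ReductionObject = std::map<std::string, long>;  // (key, accumulated value)

// Reduction(): fold one data instance into a (per-thread) reduction object.
// The result must be independent of the order in which instances are processed.
void reduction(ReductionObject& robj, const std::string& element) {
    robj[element] += 1;
}

// Combination(): merge several copies of the reduction object into one.
// Here we use simple aggregation, one of the common library-style choices.
ReductionObject combination(const std::vector<ReductionObject>& copies) {
    ReductionObject result;
    for (const auto& copy : copies)
        for (const auto& [key, value] : copy)
            result[key] += value;
    return result;
}

int main() {
    std::vector<ReductionObject> per_thread(2);
    reduction(per_thread[0], "a");
    reduction(per_thread[0], "b");
    reduction(per_thread[1], "a");
    ReductionObject final_obj = combination(per_thread);
    std::cout << "a -> " << final_obj["a"] << ", b -> " << final_obj["b"] << "\n";
}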

Note that the function, Reduction(), is an associative and commutative function since the data instances can be processed in any order. Overall, the reduction function can be viewed as an integration of the map/combine/reduce steps in the MapReduce model; it eliminates the sorting/grouping/shuffling overheads and significantly reduces memory requirements for the intermediate data [79, 80].

As we can see, in the generalized reduction model, the reduction objects act as the output space, and can capture the state of the computation. This observation can be used to support very efficient checkpointing [26].

The notion of output space and an explicit communication phase is seen in many other programming models, such as the Bulk-Synchronous Parallel (BSP) model implementations [133, 65]. Thus, the natural question is whether we can use it for applications that involve other types of communication patterns. It turns out that with a small extension, other applications can also be supported.

Consider grid computations. The limitation of the simple reduction-based model is that it will require a large-sized reduction object. Thus, to reduce a large-sized reduction object, the combination phase will certainly become a bottleneck for scalable implementations. To address this issue for grid applications, we extend the original generalized reduction model by further dividing the reduction object into two disjoint parts. The first part represents the part of the output space that could be accessed/updated by more than one node, and is thus referred to as the global reduction object. The second part represents the remaining part of the output space that only needs to be accessed/updated by its host node, referred to as the local reduction objects. Typically, all the input blocks assigned to one node share a

DENSEGRIDCOMPUTATIONS(A, C0, B)
    Input: Matrix A and coefficient C0
    Output: Matrix B
    for each non-boundary position (i, j) ∈ Matrix B do
        center = A(i, j)
        north = A(i − 1, j)
        south = A(i + 1, j)
        west = A(i, j − 1)
        east = A(i, j + 1)
        B(i, j) = C0 ∗ (center + north + south + west + east)

Figure 6.1: PseudoCode: Dense Grid Computations

global reduction object and each input block has a local reduction object. In other words,

the global reduction object must participate in the global combination phase scheduled by

the runtime, while the local reduction objects do not need to. For a specific application,

one could involve only the global reduction object, or only a set of local reduction objects,

or both of them.

6.1.4 Application Examples

Now, we will show how this two-level reduction object concept can result in efficient

and scalable implementations for both dense and sparse grid applications.

Dense Grid Computations: Stencil kernels such as Jacobi and Sobel Filter are instances

that represent structured or dense grid computations. This class of applications is mostly

implemented using an iterative finite-difference technique, over a spatially dense grid, and

involves performing some pattern of nearest neighbor computations. The input problem and the output are represented in the form of matrices, where each point in the matrix

is updated with weighted contributions from its neighbors. A simple example of a two-

dimensional, 5-point stencil kernel with uniform weight contribution from its neighbors is

shown as Pseudo-Code 6.1.

Using the extended generalized reduction API, there are two ways to implement stencil computations. A simple mechanism for parallelizing this application is based on the output matrix partitioning. Suppose the input and output matrices are partitioned along the rows

first. Now, the elements that reside on the border of two neighboring input partitions need to be communicated/shared among a pair of output partitions. In this case, output partition i − 1 and output partition i, output partition i and output partition i +1, and so on are the

communicating/sharing pairs of output partitions, respectively.

Alternatively, the stencil computations can be rewritten in a data-driven processing

structure, i.e., for each point in the input matrix, we determine the corresponding points to

be updated in the output matrix. Based on the input space partitioning, two neighboring

input partitions need to update two rows shared by two corresponding output matrix parti-

tions. We can denote all boundary rows of this type in the output matrix as ghost output

rows. Then the output matrix can be represented using the reduction objects containing a

global reduction object (ghost output rows) and a set of local reduction objects (the other

output rows). Therefore, for each input space partition, the computations involve both lo-

cal updates and global updates. The local updates are applied to the local reduction object

within the input partition while the global updates are performed on the global reduction

object. Although the size of the output matrix is the same as that of the input matrix, the

global reduction object can be very small. For example, for 8 partitions, the global reduc-

tion object has only 14 ((8 − 1) × 2) rows, irrespective of the total number of rows in the

IRREGULARREDUCTIONS(M, IA, P)
    Input: Mesh M and indirection array IA
    Output: Mesh P
    for each edge e ∈ mesh P do
        Let p_x and p_y be the indices of the two end points of e in P
        Let m_x and m_y be the indices of the two end points of e in M
        m_x = IA[p_x]
        m_y = IA[p_y]
        Update e based on M[m_x] and M[m_y]

Figure 6.2: PseudoCode: Irregular Reductions

input matrix. At the end of the computation, each input partition updates its elements based

on the global reduction object and its local reduction object using memmove.

Our extended generalized reduction model supports both implementations. For the first method, each output partition is a local reduction object and there is no use of a global reduction object. For the second method, the ghost output rows are the global reduction object and the other output rows are the local reduction objects.
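A minimal sketch of the second (data-driven) formulation is shown below: each input element scatters its contribution to the 5-point stencil, updates that land on the shared ghost rows go into a small global reduction object, and all other updates go into the partition's local reduction object. The data layout and the ghost-row bookkeeping are illustrative assumptions of ours, not the MATE data structures.

#include <cstddef>
#include <map>
#include <vector>

using Row = std::vector<double>;

struct OutputPartition {
    std::size_t first_row = 0, last_row = 0;   // output rows owned by this partition
    std::map<std::size_t, Row> local;          // local reduction object: interior rows
};

// Global reduction object: the shared boundary ("ghost") rows, keyed by row index.
using GhostRows = std::map<std::size_t, Row>;

static Row& row_in(std::map<std::size_t, Row>& obj, std::size_t r, std::size_t cols) {
    Row& row = obj[r];
    if (row.empty()) row.assign(cols, 0.0);    // allocate the row on first touch
    return row;
}

// Process one input partition: scatter each element's contribution to the 5-point
// stencil, splitting updates between the local and the global reduction object.
void process_input_partition(const std::vector<Row>& A, double c0,
                             OutputPartition& out, GhostRows& ghost) {
    const long rows = static_cast<long>(A.size());
    const long cols = static_cast<long>(A.front().size());
    const long dr[5] = {0, -1, 1, 0, 0}, dc[5] = {0, 0, 0, -1, 1};
    for (std::size_t i = out.first_row; i <= out.last_row; ++i) {
        for (long j = 0; j < cols; ++j) {
            const double contrib = c0 * A[i][static_cast<std::size_t>(j)];
            for (int k = 0; k < 5; ++k) {
                const long r = static_cast<long>(i) + dr[k], c = j + dc[k];
                if (r <= 0 || r + 1 >= rows || c <= 0 || c + 1 >= cols)
                    continue;                  // the matrix boundary is not computed
                const bool interior = r > static_cast<long>(out.first_row) &&
                                      r < static_cast<long>(out.last_row);
                auto& target = interior ? out.local : ghost;   // local vs. global update
                row_in(target, static_cast<std::size_t>(r),
                       static_cast<std::size_t>(cols))[static_cast<std::size_t>(c)] += contrib;
            }
        }
    }
}

After each node's partitions are processed, only the small ghost-row object has to go through the global combination phase.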

Irregular Reductions: The representative application we use for the irregular reductions dwarf is Euler Solver [100], which is based on Computational Fluid Dynamics (CFD). It takes a description of the connectivity of a mesh and calculates quantities like velocities at each mesh point. A canonical irregular reduction loop is shown as Pseudo-Code 6.2. Each edge in the mesh is updated through an indirection array IA, which stores indices instead of concrete quantities.

Now, let us consider how to parallelize such a loop using the extended generalized reduction model. To choose the partitioning strategy, we first consider the impact of choosing

(a) Reduction Space Partitioning and Input Partition Organization
(b) Communication between Partitions in a Distributed Environment

Figure 6.3: Partitioning on Reduction Space for Irregular Reductions

input space partitioning. In these applications, the contents of the indirection arrays are not known at compile time, and thus, any element in the output space could be updated by two or more input partitions. As a result, we cannot really distinguish the global reduction object from the local reduction objects. Thus, to maintain correctness, the input space partitioning would need to treat the entire output space as the global reduction object. Our preliminary results show that compared with the computation time, the global combination time in such an implementation would be dominant and it will result in poor scalability.

Therefore, the partitioning we perform is the reduction space (output space) partitioning as shown in Figure 6.3(a). In this case, the reduction space is partitioned into a number of disjoint parts. For each part, we organize the corresponding input partition that also contains neighboring elements from other output partitions (we refer to them as external cross-edge elements). Then we can perform reduction on each input partition to produce the corresponding update on its portion of output space. The biggest advantage is that this partitioning scheme does not need the use of a potentially large-sized global reduction object, and thus, eliminates the global combination time. Each reduction space partition can be treated as a local reduction object and updated independently.

In a distributed environment, after each reduction space partition is computed, each input partition needs to update its external cross-edge elements based on the local reduction objects from its neighboring input partitions since these external cross-edge elements were updated on other output partitions. Particularly, as in Figure 6.3(b), if two neighboring input partitions are assigned to the same node, this data exchange can be done using memmove

(e.g., Partitions 1 and 2 on Node 1). Otherwise, two partitions computed on two nodes need to communicate with each other using a set of non-blocking operations (e.g., Partition 2 on

Node 1 and Partition 3 on Node 2).
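The sketch below illustrates this exchange step for one pair of neighboring partitions: a plain memory copy when both partitions are co-located, and a pair of non-blocking MPI operations otherwise. The data layout, and the simplifying assumption that the number of elements sent equals the number received, are ours and not the actual implementation.

#include <mpi.h>
#include <cstring>
#include <vector>

struct ReductionPartition {
    std::vector<double> owned;      // local reduction object: owned output elements
    std::vector<double> external;   // copies of the cross-edge elements of one neighbor
    const double* neighbor_local;   // non-null if the neighbor partition is co-located
    int neighbor_rank;              // otherwise: the MPI rank hosting the neighbor
};

// Refresh the external cross-edge copies of one partition after an iteration.
void exchange_cross_edges(ReductionPartition& part, MPI_Comm comm) {
    const int n = static_cast<int>(part.external.size());
    if (part.neighbor_local != nullptr) {
        // Same node: a plain memory move suffices (Partitions 1 and 2 on Node 1).
        std::memmove(part.external.data(), part.neighbor_local, n * sizeof(double));
        return;
    }
    // Different nodes: a pair of non-blocking operations, as in Figure 6.3(b).
    // For simplicity we assume the elements the neighbor needs are the first n we own.
    MPI_Request reqs[2];
    MPI_Isend(part.owned.data(), n, MPI_DOUBLE, part.neighbor_rank, 0, comm, &reqs[0]);
    MPI_Irecv(part.external.data(), n, MPI_DOUBLE, part.neighbor_rank, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}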

6.1.5 Fault Tolerance Approach Based on EGR Model

We now describe the fault-tolerance approach that has been implemented for the applications developed using the EGR model described earlier in Section 6.1.3. Like the existing fault-tolerance solutions for MPI programs, the approach described here is automatic, i.e., it requires no user code modification or compiler analysis.

Our fault tolerance approach exploits the properties of the reduction processing struc- ture we target. Based on EGR model, an efficient and scalable implementation for typical grid computations should involve a small-sized global reduction object and/or a set of pos- sibly large-sized local reduction objects. Let us consider again the processing structure we support. In this model, the output values are stored in the reduction object, which could have two parts now. Each node maintains a global reduction object that is used to store global updates to this object, as well as local reduction objects that contain local updates.

Thus, we specifically target applications written with the extended generalized reduc- tion model. Each super-step, or more commonly called, an iteration, is composed of Re- duction, during which the application is doing useful work, Combination, Checkpoint, and possible Restarts. Also, though there are many types of failures in terms of both proces- sors and applications [118, 119], in this work, we focus on tolerating fail-stop node failures only. In these failures, the failed node stops working and does not induce correlated failures of others. This type of failures is most common in today’s large-scale computing systems such as high-end clusters with thousands of nodes and computational grids with dynamic computing resources [41]. Finally, we also assume that fail-stop failures can be detected and located. This may be through the aid of a parallel programming environment such as

MPI, which will be used as an underlying library for implementation of our high-level API.

To checkpoint the overall reduction object, we can deal with the global and local reduction objects in different ways. From each node, after processing a certain number of elements, the global reduction object is copied into the memory of another node. The local reduction objects, on the other hand, are copied onto the persistent storage. We choose this design for several reasons. First, the global reduction object is typically small in size and only stores partial results during the computation. Thus, saving it on the persistent storage would introduce extra overheads for global combination as well as during a possible recovery phase. Caching it into the memory of other nodes can be quite efficient, and moreover, we are able to overlap this step with computations. Second, for the local reduction objects, because of their possibly large size, caching them into other nodes could cause significant communication costs and also consume a lot of memory of other nodes. Therefore, instead of using the distributed memory, we can save them on a persistent storage with large capacities, such as a parallel file system usually supported in an HPC environment. Since each node has its own local reduction objects, this checkpointing process can be done in parallel for all nodes to make use of the parallel I/O capacities provided in the persistent storage. In this way, we are trying to utilize both the aggregate memory and the parallel file system to save the overall computational state.

Explicit declaration of reduction objects and their checkpointing has an important implication, as we elaborate below. Suppose we have n nodes executing an application and a node i fails. Let Ei be the set of elements that were supposed to be processed by node i, and let Ei′ be the set of elements processed before the global reduction object was last copied. Now, the set of elements Ei − Ei′ can be distributed and processed by the remaining set of nodes. Thus, another node j will process a subset of the elements Ei − Ei′, along with the set of elements Ej that it was originally supposed to process. The global reduction objects after such processing on the remaining n − 1 nodes can be combined together with the cached global reduction object from the failed node to produce the final results of the global reduction object.

Since the reduction operation is associative and commutative, by the argument above,

we know that this will create the same final results as the original processing for the global

reduction object. Also, the local reduction objects updated by the failed node before failing

have been checkpointed onto the persistent storage and are visible to all remaining nodes.

In this way, the overall computational states on each node can be saved in the memory of

other nodes and/or on the reliable storage.

Although our system design can be viewed as a checkpoint-based approach, it has significant differences. First, unlike the snapshots taken by checkpoint-based systems, the reduction object captures only the necessary computational state, which is small for most of the reduction-based applications. Therefore, the overhead of storing the reduction object at another location is smaller than storing snapshots of the entire system. Furthermore, we do not have to wait for the failed process to restart, or for a replacement node to be found. Instead, by using the properties of the processing structure, the work from the failed node can be re-distributed to other nodes, and the checkpointed results of the finished work on the failed node can be recovered.

Now, we comment on the relationship of our approach to algorithm-based solutions and application-level checkpointing. First, while our approach may not be as general as MPI checkpointing, it is clearly more general than a solution for a particular algorithm. Moreover, as long as the application is developed using our programming model, no additional effort is needed from the programmer to achieve fault-tolerance. In comparison to application-level checkpointing [60, 61, 59], the following observations can be made. First, because

application-level checkpointing needs to handle all possible MPI features, the protocols

become complicated. Unlike our approach, it is also non-trivial to apply this approach to

other programming models, including intra-node multi-core and many-core programming

models. Second, application-level checkpointing requires that either the programmer or a

compiler chooses the points for recovery. Similarly, compiler analysis or the user needs to

determine what state would need to be checkpointed. In comparison, this information is

obvious with our approach, once our programming model is used. Finally, when the failed

node does not recover and a replacement node may not be available, our approach allows

recovery using fewer processes than the original number.

6.2 System Design and Implementation

In this section, we discuss the design and implementation details of our fault tolerance approach, and how it is integrated with a runtime system that supports the EGR model. This runtime system has been implemented as an extension to the MATE system [80, 77, 78].

6.2.1 EGR Model Implementation in MATE System

We first describe the MATE runtime system, which has been used to implement the

EGR model and our fault tolerance approach. The runtime is responsible for partitioning the input datasets among computing nodes, further assigning data blocks/chunks to avail- able processing units (e.g., CPUs, and for one particular implementation [78], also the

GPUs), and scheduling the tasks for all stages. At runtime, the system invokes the above user-defined functions in different phases. The most important phase is the Reduction phase, where all the data instances are processed and the reduction objects are updated.

After the Reduction, a combination function merges results from multiple nodes and forms

a final copy of the reduction object. The interprocess communication is automatically in-

voked by the runtime system. The user can choose to operate on the final reduction object

in the finalize function to produce outputs on disk, which could possibly be utilized for the next iteration. This processing style can continue for multiple passes until the application

finishes. Our system is based on two new data types, named input_space_t and reduction_object_t. The input_space_t contains the input data, organized in blocks or chunks, while the reduction_object_t represents the output space, which is a two-dimensional array

in the form of (key, set of values). Note that the output space of one input block could have two parts, one global reduction object and one local reduction object. But only the global reduction object needs to be combined in the combination phase. Typically, the global re- duction object should be of small size to provide scalable implementations of applications due to the combination overheads among all computing nodes. For a particular application, if the global reduction object is not used or cannot represent all the necessary computational states, the local reduction objects can be used together to store the overall computational states.
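The actual definitions of these two types are not shown in the text; the following rough sketch only indicates what they might look like, with all field names being assumptions chosen to match the description above.

#include <cstddef>
#include <map>
#include <vector>

// Input space: the input data of one block, organized in chunks (assumed layout).
struct input_space_t {
    std::size_t block_id = 0;
    std::vector<std::vector<char>> chunks;      // raw data chunks of this block
};

// Reduction object: a two-dimensional output space of (key, set of values).
struct reduction_object_t {
    std::map<long, std::vector<double>> cells;  // key -> set of values
    bool is_global = false;  // only global objects join the combination phase
};

// Accumulate one (key, value) update into a reduction object.
void accumulate(reduction_object_t& robj, long key, double value) {
    robj.cells[key].push_back(value);
}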

6.2.2 Fault Tolerant Runtime System

Pseudo-Code 6.4 gives an overview of the fault tolerance algorithm used in the MATE system in a parallel computing environment. The original MATE runtime is responsible for distributing the input datasets among computing nodes and parallelizing the computa- tions. The fault tolerance component deals with tolerating and recovering failures, includ- ing checkpointing the computational states, detecting possible node failures, re-distributing input data blocks on failing nodes, and scheduling the recovery tasks on remaining nodes

(and the failing nodes in case of a self-repair).

PROCESSINGWITHFAULTTOLERANCE(Input)
    Input: data blocks
    Output: reduction objects
    while not finished do
        for each data block ∈ Input do
            Process the data block in reduction
            Update reduction objects (global and/or local)
            *** Upon Checkpoint ***
            MATE FTSetup(...)
            MATE MemCheckpoint(global reduction object)
            MATE DiskCheckpoint(local reduction object)
        if MATE DetectFailure()
            then MATE RecoverFailure(...)
        Perform global combination
        Finalize the output

Figure 6.4: PseudoCode: Processing with Fault Tolerance

As in Pseudo-Code 6.4, we modified the processing structure of MATE. Overall, the fault tolerance runtime system can be summarized as follows:

Configuration: Before computing nodes start their data blocks processing and checkpoint- ing, certain parameters should be provided by the programmer to the fault tolerance system

(though default values can be used by the system). This includes the checkpoint interval of the global and local reduction objects, the destination directory where the local reduction objects will be saved, the number of connection trials before the node is assigned as failed in the system, the exchange method for the global reduction objects, and the output method for the local reduction objects. The checkpoint interval determines how frequently the reduction object is checkpointed in the system. This frequency is controlled by two system parameters, the number of iterations to run and the number of input data blocks to process in the checkpointing iteration before a checkpoint. A larger value reduces the overheads of

130 copying, communicating, and/or storing the reduction object at another location. However,

it can also lead to more repeated work in the case of a failure.
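An illustrative way to collect these parameters is sketched below; the field names, defaults, and the directory path are hypothetical and do not reflect the actual FT-MATE interface.

#include <string>

struct FaultToleranceConfig {
    // Checkpoint interval, expressed with the two parameters mentioned in the text.
    int iterations_per_checkpoint = 100;   // iterations between checkpoints
    int blocks_before_checkpoint  = 0;     // blocks to process in the checkpointing iteration
    std::string local_ckpt_dir = "/scratch/ft-mate/ckpt";  // hypothetical destination directory
    int connection_trials = 3;             // trials before a peer is declared failed
    bool async_global_exchange = true;     // asynchronous exchange is the default
    bool multithread_local_output = true;  // multi-thread data output is the default
};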

Checkpointing: On each node, upon checkpoint, the global reduction object is replicated

into the memory of other nodes in the same group. A group means a disjoint subset of all

nodes. The local reduction objects are saved on the persistent storage.

Detecting Failures: The fault tolerance runtime invokes peer-to-peer communication among

the nodes in which the global reduction object will be stored. These connections are kept

alive during the execution. Whenever a node fails, connections to the failed nodes become

corrupt. If the node is still not reachable by any of its peers after a specific number of con-

nection trials, it is considered as a failed node in the system. With the aid of MPI libraries,

this failure detection process is accurate and instant.

Recovering Failures: The computing nodes query if there is a failed node in the system

before they enter the global combination phase. If a failed node is determined, then the

recovery process begins. The failure recovery process has three parts: data re-distribution,

output space recovery for finished data blocks on failing nodes, and processing of unfin-

ished data blocks on failing nodes. Data locality is considered during re-assigning the data

blocks in order to minimize the data retrieval times. Note that, in the following iterations

after recovery, if the failed node comes back, it can join the computing nodes again and the

runtime will switch to the original data distribution as if no failures occurred.

Now we further explain how our approach is integrated with the processing structure

in MATE (Figure 6.4). Specifically, each retrieved data block from the input is processed

by the local reduction function, and then it is accumulated into the global reduction object

and/or the local reduction objects. The function MATE FTSetup resets the state values in the fault tolerant system. These values include the numbers representing the processed data

blocks and the current iteration number. The MATE MemCheckpoint function starts trans-

ferring the global reduction object to other nodes’ memory and the MATE DiskCheckpoint

function invokes the programmer’s API and starts checkpointing the local reduction objects

on persistent storage.

Currently, our system supports two types of exchange method for the global reduc- tion object. The first is synchronous data exchange, which blocks the application until the fault tolerant system finishes the global reduction object exchange. The second method is asynchronous data exchange. If the programmer enables this method in the program, the communication protocol of our fault tolerant system does not block the system’s execu- tion. More specifically, whenever MATE invokes the MATE MemCheckpoint function, the fault tolerant system copies the reduction object into its own context and creates a back- ground thread. This background thread starts the reduction object exchange. However, if a thread is in progress when the MATE MemCheckpoint function is called, then the system is blocked until it finishes its data exchange. Since the data processing times are usually longer than the small-sized global reduction object exchange time, the overhead of caching the global reduction object becomes minimal. Our preliminary experiments demonstrated that the asynchronous method clearly outperformed the synchronous method. Thus, the asynchronous method is set to be the default method.
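A minimal sketch of such an asynchronous exchange is given below: the global reduction object is copied into the checkpointing component's own buffer and shipped by a background thread, and a new checkpoint blocks only if the previous exchange is still in flight. This is a simplification of the described behavior, not the code behind MATE MemCheckpoint.

#include <thread>
#include <vector>

class AsyncGlobalCheckpoint {
public:
    void checkpoint(const std::vector<char>& global_reduction_object) {
        if (worker_.joinable()) worker_.join();   // block only if a previous exchange is in flight
        buffer_ = global_reduction_object;        // copy the object into our own context
        worker_ = std::thread([this] { send_to_peer(buffer_); });
    }
    ~AsyncGlobalCheckpoint() { if (worker_.joinable()) worker_.join(); }

private:
    static void send_to_peer(const std::vector<char>& data) {
        // Placeholder for the actual transfer of `data` to another node's memory
        // (e.g., a send to a peer in the same group).
        (void)data;
    }
    std::vector<char> buffer_;
    std::thread worker_;
};

int main() {
    AsyncGlobalCheckpoint ckpt;
    std::vector<char> global_obj(1024, 0);   // stand-in for the global reduction object
    ckpt.checkpoint(global_obj);             // returns quickly; the exchange runs in background
}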

Also, our system supports two types of data output method for the local reduction ob- jects. The first is single-thread data output, which invokes only one I/O thread to output the entire local reduction object of a finished input data block into a single checkpoint file on the disk. The second method is named as multi-thread data output, which invokes multiple

I/O threads and each thread outputs a split of the entire finished local reduction object into

a separate sub-file. Multiple I/O threads can make the best use of I/O bandwidths for the

persistent storage, but they introduce some extra overheads of opening/closing more files. This could also require an additional merge process in the MATE RecoverCheckpoint function when loading the checkpointed sub-files and could increase the recovery overheads.

Our preliminary experiments demonstrated that the multi-thread data output had better performance than the single-thread data output for medium to large data sizes. Since the local reduction object usually represents a large-sized output space for grid computations, the multi-thread data output is set to be the default. Note that in the multi-thread data output method, all processing units on one node will be used, and the I/O threads are blocking operations that must finish before the node moves on to processing the next data blocks.

Moreover, in practice, when a program checkpoints, it cannot simply overwrite the previous checkpoint file because a failure can occur during checkpointing. So in our implementation, we write to a different file and do not delete the previous checkpoint until the current checkpoint is completed.
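The following small sketch captures this file-rotation rule: the new checkpoint is written to a separate file, and the previous one is removed only after the new file has been completely written and closed. The naming scheme is an assumption for illustration.

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Write the new checkpoint, then drop the previous one; returns the current path.
std::string rotate_checkpoint(const std::string& base, int version,
                              const std::vector<char>& local_reduction_object) {
    const std::string new_file = base + ".ckpt." + std::to_string(version);
    const std::string old_file = base + ".ckpt." + std::to_string(version - 1);
    {
        std::ofstream out(new_file, std::ios::binary | std::ios::trunc);
        out.write(local_reduction_object.data(),
                  static_cast<std::streamsize>(local_reduction_object.size()));
    }   // the stream is closed before the old checkpoint is touched
    if (version > 0) std::remove(old_file.c_str());  // previous checkpoint no longer needed
    return new_file;
}

int main() {
    std::vector<char> local_obj(4096, 1);    // stand-in for one local reduction object
    rotate_checkpoint("partition13", 1, local_obj);
}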

If a node finishes processing all of its input data blocks, then it is ready to enter the global combination phase. Before it enters this phase, it calls the MATE DetectFailure function, which is responsible for detecting the failed nodes. If there is no failure, all the nodes enter the global combination phase without any interruption. Otherwise, if a node is assigned as failed in the system, the MATE RecoverFailure function is invoked. This function initiates the recovery process. All the data blocks originally assigned to the failed node are re-distributed to the alive nodes in the system. For each such data block re-assigned to the remaining nodes, the input space is first created and the next step is to check whether the data block has been processed by the failed node. If a data block has been processed and the corresponding reduction object has been cached in other nodes and/or saved to the persistent storage, there is no need to re-process such a finished data block. Instead, the


Figure 6.5: The Fault Recovery Process for an Irregular Application

alive node can simply retrieve the checkpointed global and/or local reduction object to re-

construct the output space of that data block. Otherwise, the input space of an unprocessed

data block will be simply passed to the user-defined reduction function to obtain the cor-

responding output space, like processing other data blocks previously assigned to the alive

node.

This process continues until all the data blocks are processed. Finally, a global combi- nation is performed using the global reduction objects from the remaining nodes, and the cached global reduction objects from failed nodes. So after the global combination phase, each node has a complete copy of the global reduction object as well as the saved local reduction objects for all the input data blocks on the persistent storage.

Figure 6.5 further illustrates how the fault recovery is performed for a particular irregular application that requires only local reduction objects. Similar to Figure 6.3, the input mesh in Figure 6.5 is partitioned based on the output space and each output partition corresponds to an input partition. All the input partitions are then distributed onto a set of computing nodes. In Figure 6.5, we give an example where the input mesh is partitioned into 16 disjoint parts and we have four computing nodes. Initially, on the left of Figure 6.5, each node has four partitions and it will process each partition in its work queue and take checkpoints after computations correspondingly. Now, suppose node 4 fails after it processes partition 13 and saves its checkpoint to the persistent storage. The system detects the failure of node 4, and starts the data re-distribution among the other three nodes in a round-robin fashion. In this case, the scheduler could assign the partitions 13 and 16 to node 1. Partitions 14 and 15 are then re-distributed to nodes 2 and 3, respectively. Each node will then update its work queue with extra partitions and process all of them one after another. Note that partition 13 has been processed on node 4 and its checkpoint has been saved, so node 1 will simply load the checkpoint file for partition 13 and re-construct the input/output partition without the need for re-computation. After all output partitions are updated, the communication phase will be started to exchange the updates needed for the next iteration among all the partitions. Overall, the input/output partitions of the failed node are recovered by either loading the checkpoints or computing unprocessed input partitions as normal and the correctness can be ensured in this way.
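A toy sketch of the round-robin re-distribution step in this walk-through is given below; the data structures are illustrative, and the example reproduces the assignment described above (partitions 13 and 16 to node 1, 14 to node 2, 15 to node 3, with partition 13 re-constructed from its checkpoint rather than recomputed).

#include <iostream>
#include <set>
#include <vector>

struct Task { int partition; bool load_from_checkpoint; };

// Append the failed node's partitions, one by one, to the surviving nodes' work
// queues; partitions with an existing checkpoint are marked for loading instead
// of recomputation.
void redistribute(const std::vector<int>& failed_partitions,
                  const std::set<int>& checkpointed,
                  std::vector<std::vector<Task>>& work_queues /* one per alive node */) {
    for (std::size_t k = 0; k < failed_partitions.size(); ++k) {
        int p = failed_partitions[k];
        Task t{p, checkpointed.count(p) > 0};
        work_queues[k % work_queues.size()].push_back(t);  // round-robin assignment
    }
}

int main() {
    std::vector<std::vector<Task>> queues(3);       // nodes 1-3 survive
    redistribute({13, 14, 15, 16}, {13}, queues);   // node 4 held 13-16; 13 was checkpointed
    for (std::size_t n = 0; n < queues.size(); ++n)
        for (const Task& t : queues[n])
            std::cout << "node " << n + 1 << " <- partition " << t.partition
                      << (t.load_from_checkpoint ? " (load checkpoint)\n" : " (recompute)\n");
}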

6.3 Experimental Results

In this section, we report results from a detailed evaluation of the fault tolerance ap- proach we have presented and implemented as part of the MATE system.

6.3.1 Evaluation Methodology

Our evaluation is based on four applications: Jacobi, Sobel Filter (Sobel), Euler Solver

(Euler), and Molecular Dynamics (Moldyn). The first two applications involve stencil computations and the last two involve irregular reductions. To evaluate our fault tolerance approach, we also implemented the four applications using MPI. We studied the check- point/restart costs of using the MPICH2 library (version 1.4.1p), which currently supports fault tolerance based on the BLCR library. For convenience, we use MPICH2-BLCR to refer to the MPICH2 library with fault tolerance, and FT-MATE is used to refer to the

MATE-based fault tolerance.

Overall, in our experiments, we focused on the ensuing aspects: 1) Evaluating the scal- ability of the four applications using FT-MATE system, i.e., we determine the suitability of the extended generalized reduction (EGR) model for efficient parallel programming, 2)

Studying the overheads for checkpointing the reduction objects for the four applications, and comparing them against the MPICH2-BLCR implementations, and 3) Understanding the fault recovery costs for our system in case of a node failure in a cluster. We support the recovery of the four applications considering two cases, with and without the rebirth of the failed node. We also study the recovery costs from MPICH2-BLCR.

The experiments were conducted on a cluster, where every node comprises two Intel Xeon E5520 quad-core CPUs (8 cores in all). Each CPU core has a clock frequency of 2.27 GHz and the machine has a 12 GB main memory. We have used up to 256 CPU cores (on 32 nodes) for our study. The datasets used in our performance studies are as follows: For Jacobi and Sobel, we chose a grid structure containing nearly a billion elements, represented by a matrix of 8 GB. For Euler Solver, we used a mesh of around 12 million nodes and 77 million edges, stored in text files with a total size of 5.8 GB. For Molecular


Figure 6.6: Scalability with CPU-1 versions on 2, 4, 8, and 16 nodes for each of the four applications

Dynamics, the mesh contains 4 million nodes and 500 million edges, with the size of 7.4

GB. We run each application for up to 1000 iterations.

6.3.2 Scalability Studies

First, assuming there are no node failures, we run the applications implemented using the FT-MATE system without enabling the fault tolerance support. Figures 6.6 and 6.7 show the average running time per iteration for the four applications as we scale the number of nodes. Specifically, we run each application in two modes: CPU-1 (use only 1 CPU core per node) and CPU-8 (use all 8 CPU cores per node).

As we can see in Figure 6.7, the relative speedup for Jacobi, with the CPU-8 version, scaled from 2 nodes to 16 nodes, is around 5.5. The relative speedup is 7.6 for the CPU-1 version as in Figure 6.6, showing good scalability when the total number of cores is smaller.

From Figures 6.6 and 6.7, we can see that Sobel achieves relative speedups of 7.8 and 6.8


Figure 6.7: Scalability with CPU-8 versions on 2, 4, 8, and 16 nodes for each of the four applications

for CPU-1 and CPU-8 versions, respectively, from 2 nodes to 16 nodes. The scalability

results for Sobel are better than Jacobi because Sobel involves more computations (9-point

stencil computations) than Jacobi (5-point stencil computations) while the combination

time remains the same for the same sized dataset.

For the two irregular applications, when we increase the number of nodes from 2 to

16, for Euler, the system achieves relative speedups of 7.5 and 4.2 for CPU-1 and CPU-8 versions, as also shown in Figures 6.6 and 6.7. The relative speedups for Moldyn are 7.1 and 6.4 for CPU-1 and CPU-8 versions, respectively. Overall, we can see good scalability up to a certain number of cores.

We can also observe good speedups from CPU-1 versions in Figure 6.6 to CPU-8 ver- sions in Figure 6.7. In particular, on 2 nodes, the system achieves a relative speedup of

7.6 from CPU-1 to CPU-8 Jacobi and 7.2 for Sobel. For Euler and Moldyn, the relative


Figure 6.8: Jacobi and Sobel: Relative Checkpointing Overheads with Varying Checkpoint Intervals (on 8 nodes)

speedups are 5.2 and 4.5, respectively, which is due to the necessary second-level partition-

ing overheads for parallelizing irregular reductions with an increasing number of threads

per node.

6.3.3 Overheads of Supporting Fault Tolerance

In this set of experiments, we evaluate the checkpointing overheads of our approach.

We consider execution of each application for 1000 iterations, enable fault tolerance with different checkpoint frequencies, and measure the overheads when there are no failures. We normalized the checkpointing costs based on the absolute execution times. In our following experiments, we run each application in the CPU-8 mode to achieve the best-possible performance for FT-MATE. For comparison, we also include the results from the MPICH2-

BLCR library for which we launch 8 MPI processes per node.


Figure 6.9: Euler and Moldyn: Relative Checkpointing Overheads with Varying Check- point Intervals (on 8 nodes)

Figure 6.8 shows the results for Jacobi and Sobel for an 8 GB dataset on 8 nodes as we vary the checkpoint interval to be 1, 10, 100, and 200 iterations. (Checkpoint-Interval 1 implies that checkpointing is performed in each iteration, and so on). In MPICH2-BLCR, each checkpointing requires all processes to synchronize and save images to disk simultane- ously, while our system allows each process to save its reduction objects into the distributed memory and/or the disk. From Figure 6.8, we can clearly see that FT-MATE outperforms

MPICH2-BLCR by almost two orders of magnitude for Jacobi. If we take checkpoints in each iteration, our system only incurs around 15% overhead, whereas the MPICH2 fault tolerance library had a slowdown of more than 30 times. Both systems have less checkpointing overhead as we reduce the checkpointing frequency. Our system has close to zero normalized overheads when we set the checkpoint interval to be 100 and 200, whereas

MPICH2-BLCR still has nearly 30% and 20% normalized overheads, respectively, in these two cases.

Figure 6.8 also demonstrates similar results for the Sobel application. Since Sobel in-

volves more computations than Jacobi and the reduction object size remains the same, the

normalized checkpointing overheads are lower. However, two orders of magnitude perfor-

mance improvement is still seen. The results for the two irregular applications are shown in

Figure 6.9. With the checkpoint interval set to 1, the normalized checkpointing overhead for Euler is around 17.9% in our system, while MPICH2-BLCR incurs a 15 times slowdown. For Moldyn, our system only has 3.0% normalized

checkpointing costs and the MPICH2-BLCR incurs 765% normalized checkpointing costs.

There are two main reasons why our approach outperforms MPICH2-BLCR. First, our

programming model clearly exposes the iterative structure of the application as well as the

dependence between processes. Thus, we only checkpoint the necessary computational

states, which are captured in the reduction objects. So the snapshot size we take is much

smaller than the size of the checkpoint images saved in MPICH2-BLCR, which may con-

tain all the input data, output data, temporary buffers, communication channels, and even

shared libraries. For example, when a checkpoint is taken while running Moldyn with the 7.4GB input data, the system snapshot for MPICH2-BLCR occupies up to 9.3GB of disk space, while the checkpoint files in our system are only around 48MB. Second, we

make use of both the memory of the distributed system and the persistent storage, while

MPICH2-BLCR only uses the persistent storage.
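As a concrete illustration of the first point, the following is a minimal sketch, under assumptions of our own (it is not FT-MATE's actual code; the struct layout and file handling are hypothetical), of checkpointing only the reduction object: a small header plus the object's bytes are written to stable storage, and a copy could analogously be placed in another node's memory.

    // Minimal sketch: persist only the reduction object, not the full process image.
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct ReductionObjectCheckpoint {
        uint64_t iteration;          // iteration at which the snapshot was taken
        std::vector<char> payload;   // serialized reduction object
    };

    bool write_checkpoint(const ReductionObjectCheckpoint& ck, const char* path) {
        std::FILE* f = std::fopen(path, "wb");
        if (!f) return false;
        uint64_t size = ck.payload.size();
        bool ok = std::fwrite(&ck.iteration, sizeof(ck.iteration), 1, f) == 1 &&
                  std::fwrite(&size, sizeof(size), 1, f) == 1 &&
                  (size == 0 || std::fwrite(ck.payload.data(), 1, size, f) == size);
        std::fclose(f);
        return ok;
    }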

6.3.4 Overheads of Fault Recovery

In the last set of experiments, we evaluate the recovery overheads of the two approaches.

We consider the failure of a single node, occurring at different failure points, with the application running on 32 nodes. We run each application in the CPU-1 mode for FT-MATE and launch 1 MPI process per node for MPICH2-BLCR. In the evaluation, we consider two cases: one in which the failed node never recovers and no spare node is available, and a second in which either the failed node recovers quickly or a spare node is available. We use W/O Rebirth and W/ Rebirth to denote these two scenarios. In the latest stable MPICH2-BLCR library (version 1.4.1p), failure recovery is feasible only with the same number of nodes as in the original execution. Therefore, the results we are able to report for MPICH2-BLCR are for the W/ Rebirth mode.

Figure 6.10: Jacobi: Comparison between MPICH2-BLCR and FT-MATE with Varying Failure Points (8GB dataset on 32 nodes and checkpoint interval 100/1000)

We run each application for 1000 iterations and checkpoint every 100 iterations. Three failure points are considered: 25%, 50%, and 75% (a failure point of 25% means that one node fails after executing 250 iterations, and so on). Since the checkpoint interval is 100, for the failure points 25% and 75%, the recovery process introduces redundant computations in both systems: at the 25% failure point, for example, the node fails at iteration 250 while the most recent checkpoint was taken at iteration 200, so 50 iterations must be re-executed. In comparison, the failure point 50% coincides with a checkpoint and thus only incurs the absolute recovery overheads. For our system, these include data re-distribution, input space creation, and output space computation/recovery in the failed iteration, whereas for MPICH2-BLCR, they are dominated by loading all the checkpoint files and restarting all processes. Note that, to focus on the recovery overheads, we disabled checkpointing after recovery in both systems.

Figure 6.11: Sobel: Comparison between MPICH2-BLCR and FT-MATE with Varying Failure Points (8GB dataset on 32 nodes and checkpoint interval 100/1000)

Figure 6.10 shows the comparison results between FT-MATE and MPICH2-BLCR for

Jacobi. In the W/ Rebirth mode and with the failure point 50%, our system has a very small

absolute recovery overhead, i.e., 0.02% of the execution time. MPICH2-BLCR library has

9.94% absolute recovery overheads. This is because the main cost is loading the checkpoint

files, and they tend to be very small for our system. With the failure points 25% and 75% in

the W/ Rebirth mode, the recovery times are instead dominated by the redundant computa-

tions in both the systems. Our system, however, is still much better than MPICH2-BLCR

with redundant computations. Specifically, for the failure point 25%, our system results in

5.02% recovery costs, while MPICH2-BLCR has around 14.94% overheads. Similarly, for the failure point 75%, our system causes 5.15% recovery costs and MPICH2-BLCR incurs 15.12% overheads.

Figure 6.12: Euler: Comparison between MPICH2-BLCR and FT-MATE with Varying Failure Points (5.8GB dataset on 32 nodes and checkpoint interval 100/1000)

In the W/O Rebirth mode, the normalized overall recovery overheads are relatively higher, because fewer nodes are available to perform the rest of the execution.

Similar results are observed for Sobel, as shown in Figure 6.11. Our system has close to zero (around 0.03%) absolute recovery overheads in the W/ Rebirth mode with a failure point of 50%, while MPICH2-BLCR has around 3.11% absolute recovery overheads. For the two irregular applications, we again have similar observations, as shown in Figures 6.12 and 6.13. In particular, the normalized absolute recovery overheads are 0.19% and 0.25% for Euler and Moldyn in our system, much less than the corresponding values of 4.76% and

2.27% in MPICH2-BLCR.

Figure 6.13: Moldyn: Comparison between MPICH2-BLCR and FT-MATE with Varying Failure Points (7.4GB dataset on 32 nodes and checkpoint interval 100/1000)

6.4 Summary

In this work, we have described a programming model that can support many different application classes, while enabling fault tolerance with lower overheads and greater flexibility in recovery. Using four applications, we compared our approach with the MPICH2 library on a cluster of multi-core CPUs. Overall, our approach has at least one order of magnitude lower checkpointing overheads than MPICH2 for all four applications. Besides, our approach can support continued execution without rebirth of the failed nodes and has close to zero absolute fault recovery costs in case of a node failure.

The approach we have described in this work requires development of applications using a new programming model. In the future, we would like to develop compiler analysis that can automatically extract the underlying communication patterns and data distribution from existing MPI applications, and apply the approach described in this work to support fault tolerance.

Chapter 7: Dealing with GPU Failures in the MATE-CG System

In FT-MATE, we only considered CPU failures in a cluster of multi-core CPUs. However, it is also increasingly important to provide fault tolerance for GPUs, since GPU pipelines are becoming more popular and programmable, which has made GPUs attractive to a wider audience; yet, pre-Fermi Nvidia GPUs provide no fault tolerance support. As GPUs are now often used in high performance computing and other general-purpose application domains where data integrity is important, providing fault tolerance on GPUs is becoming a pressing concern.

Therefore, in this work, we deal with GPU failures in a heterogeneous CPU-GPU cluster. Specifically, on top of our MATE-CG system, we support fault tolerance and recovery for GPU computations through checkpointing and work re-distribution. Based on the generalized reduction API provided by the MATE-CG system, the CPU-GPU algorithms periodically checkpoint the work done on the GPUs to the host CPUs. In our problem, we assume that only the GPU is unreliable (i.e., the CPU is reliable), so we particularly address the checkpointing of GPU

computations. If a fault has been detected on a GPU, then the runtime system can roll back

to the last checkpoint on the host CPU-GPU node and continue executing.

7.1 Our GPU Fault Tolerance Technique

In this section, we first describe the traditional method for checkpointing and then dis- cuss our GPU fault tolerance approach based on MATE-CG.

7.1.1 GPU Checkpointing

One common method to checkpoint a process (or application) is to take a complete snapshot of the entire process (or of all application variables and data structures). While this is a conservative and general approach that is applicable to any process (or application), it can be very expensive for certain classes of applications: the amount (size) of checkpoint information can be extremely large, which may eventually slow down the application significantly.

Thus, our first goal is to take advantage of our knowledge of the underlying programming model and reduce the size of checkpoint information.

Using MATE-CG, we focus on the class of applications that can be implemented using the generalized reduction model. In particular, we choose applications that involve generalized reductions and stencil computations. For generalized reductions, similar to FT-MATE, we simply exploit the reduction objects exposed by the generalized reduction model. As described in Chapter 5, the input data blocks on each node are partitioned into two parts, with one part going to the CPU and the other to the GPU. On the GPU, the data blocks are further split into GPU chunks and then processed one by one. The corresponding reduction objects are updated in the GPU device memory. Since the reduction objects act as the output space and capture the necessary computational state from the GPU, we only need to save them on the host CPU via checkpointing.

for (int i = 1; i <= num_rows_chunklet; i++) {
    for (int j = 1; j < y - 1; j++) {
        B[i][j] = C0 * (A[i][j] + A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
    }
}

Figure 7.1: Structured Grid Pattern

For stencil kernels, the computations are performed on matrices, which are the primary data structure. Stencil kernels are implemented using an iterative finite-difference technique over a spatially dense grid, and involve performing some pattern of nearest-neighbor computations. The input problem and the output are represented in the form of matrices, where

each point in the matrix is updated with weighted contributions from its neighbors. A

simple example of a 2-D, 5-point stencil kernel with uniform weight contribution from its

neighbors is shown in Figure 7.1. Typically, there can be a set of input data matrices and

output data matrices. Commonly, the input and output matrices are different from each other; in some cases, however, they can be the same. In this work, we focus on

the common case, where input and output matrices are different.

Based on the generalized reduction API, a simple mechanism for parallelizing this ap-

plication is based on the output matrix partitioning. Suppose the input and output matrices

are partitioned along the rows first. Now, the elements that reside on the border of two

neighboring input partitions need to be communicated/shared among a pair of output par-

titions. In this case, output partitions i − 1 and i, output partitions i and i + 1, and so on form the communicating/sharing pairs of output partitions, respectively. In this way, each output partition is treated as a reduction object that will be

updated locally on each CPU/GPU worker.

In order to deal with failures of stencil computations, it is enough to checkpoint only the rows of the output matrix. However, the computation of every row in the output matrix is bound to the corresponding set of row(s) in the input matrix. Thus, in case of failure and restart, it is important to know which rows of the input/output matrix were computed by which processor. This can be obtained by simply maintaining an index of the work assignment to the available processors. Thus, the entire input matrix need not be stored as checkpoint information. Note that the index information is much smaller than the entire input matrix. Also, stencil computations can be easily implemented using the generalized reduction model, as explained in Section 6.1.4.
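As an illustration of this bookkeeping, the following is a minimal sketch (the names and fields are assumptions, not the system's actual data structures): for each output chunk we record which worker computed it and which rows it covers, so that on restart the corresponding input rows can simply be re-read rather than checkpointed.

    #include <vector>

    struct ChunkAssignment {
        int chunk_id;     // index of the output-matrix chunk
        int worker_id;    // CPU or GPU worker that computed it
        int first_row;    // first output row in the chunk (inclusive)
        int last_row;     // last output row in the chunk (inclusive)
    };

    // The checkpoint then consists of the computed output rows plus a list of
    // these records; the index is tiny compared with the input matrix itself.
    std::vector<ChunkAssignment> work_assignment_index;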

Specifically, we focus on checkpointing with the CUDA runtime APIs. In a typical case (when there is no failure), it is preferable to transfer all the data that needs to be computed on the GPU at once, finish the computation, and then transfer the results back to the CPU. This is done to reduce the latency component of the overall data transfer cost. However, this method of computation is not suitable for checkpointing, because once the computation starts, the output data cannot be checkpointed back to the CPU until the computation is over. Thus, we divide the entire input/output data into a fixed number of chunks. Each input/output chunk is transferred to the GPU, computed, and checkpointed. For checkpointing, we use CUDA streams and cudaMemcpyAsync() to improve performance. For instance, the computation of the (i+1)-th chunk is overlapped with the checkpointing of the i-th chunk that the GPU has computed.
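The following is a minimal sketch of this overlap under our own assumptions (it is not MATE-CG's actual code; the kernel, buffer names, and launch configuration are placeholders, and the host buffers are assumed to be pinned with cudaHostAlloc so the copies are truly asynchronous):

    #include <cuda_runtime.h>

    // Placeholder kernel standing in for the real per-chunk computation.
    __global__ void chunk_kernel(const float* in, float* out, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) out[idx] = in[idx];
    }

    // d_in and d_out each hold two chunk-sized slots (double buffering);
    // h_in is the input and h_ckpt receives the checkpointed output chunks.
    void process_and_checkpoint(const float* h_in, float* h_ckpt,
                                float* d_in, float* d_out,
                                int chunk_elems, int num_chunks) {
        cudaStream_t streams[2];
        for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

        size_t bytes = (size_t)chunk_elems * sizeof(float);
        int threads = 256;
        int blocks = (chunk_elems + threads - 1) / threads;

        for (int i = 0; i < num_chunks; ++i) {
            cudaStream_t st = streams[i % 2];
            int slot = (i % 2) * chunk_elems;
            // Stage the i-th input chunk on the device.
            cudaMemcpyAsync(d_in + slot, h_in + (size_t)i * chunk_elems, bytes,
                            cudaMemcpyHostToDevice, st);
            // Compute the i-th output chunk.
            chunk_kernel<<<blocks, threads, 0, st>>>(d_in + slot, d_out + slot,
                                                     chunk_elems);
            // Checkpoint the i-th output chunk back to the host; this copy can
            // overlap with the work queued for chunk i+1 on the other stream.
            cudaMemcpyAsync(h_ckpt + (size_t)i * chunk_elems, d_out + slot, bytes,
                            cudaMemcpyDeviceToHost, st);
        }
        for (int s = 0; s < 2; ++s) {
            cudaStreamSynchronize(streams[s]);
            cudaStreamDestroy(streams[s]);
        }
    }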

7.1.2 CPU-GPU Work Distribution upon GPU Failure

We now describe our method for dynamically distributing the input/output chunks be- tween CPU and GPU. We use a classical Master-Worker model. We divide the input/output

data into a fixed number of chunks. The master takes care of distributing the work (chunks)

to worker (CPU or GPU) processors. For stencil computations as in Figure 7.1, there is

dependency in the input matrix as there are ghost rows (border rows). Thus, we do the

following. Whenever a chunk from output matrix (B matrix) is given to a worker, a corre-

sponding chunk from the input matrix (A matrix) is given to the processor. Note that, in

addition we also give the ghost rows from the input matrix (A matrix), so that all the rows

of the chunk in the output matrix (B matrix) can be computed correctly. For generalized

reductions, we just divide the input data into chunks and distribute chunks of input data and

the entire output data (reduction objects) to CPU/GPU workers.
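The ghost-row handling can be illustrated with a small sketch (an illustration under our own assumptions, not the system's code): given the output rows assigned to a worker, the input rows it must receive are the same rows extended by the halo on each side.

    #include <algorithm>

    struct RowRange { int first; int last; };  // inclusive row indices

    // Output rows [out.first, out.last] of the B matrix need input rows
    // [out.first - halo, out.last + halo] of the A matrix, clipped to the
    // matrix bounds; halo = 1 corresponds to the 5-point stencil of Figure 7.1.
    RowRange input_rows_for_output_chunk(RowRange out, int num_rows, int halo) {
        RowRange in;
        in.first = std::max(0, out.first - halo);
        in.last  = std::min(num_rows - 1, out.last + halo);
        return in;
    }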

Now, whenever there is a GPU failure, the runtime will distribute the chunks evenly

across other alive processors. The remaining chunks can be distributed to any of the other

processors in the cluster. The rationale is that when a GPU fails, the other processors end

up computing additional chunks. Since the data is divided into chunks, the load balancing

will be more fine-grained.

Specifically, upon a GPU failure, the runtime will roll back to the last checkpoint for the

GPU. Then the recovery process begins. In the recovery process, we support three schemes

of continuing execution. The first is Restart w/o Checkpoint, which will simply redo the

work assigned to the GPU before failure using the CPU on the same node. In this simple

scheme, the work is not distributed to other nodes in the cluster. Another scheme is called

Checkpoint w/o Re-distribution. In this scheme, we use the latest checkpoint from the GPU

to avoid redoing the already computed work on the corresponding CPU. However, it does

151 not re-distribute the remaining work after failure evenly across all nodes and only performs

the recovery on the CPU within the same node. The third scheme is Checkpoint with Re-distribution. In this scheme, we allow both checkpointing and an even re-distribution of the remaining work across the other resources in the entire cluster of nodes. By default, we enable the third scheme in our system.
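A minimal sketch of the default scheme's re-distribution step is shown below (our own assumptions; this is not MATE-CG's actual scheduler): the chunks that the failed GPU had not yet checkpointed are handed out round-robin to the workers that remain alive.

    #include <cstddef>
    #include <vector>

    struct Chunk { int id; };

    // Chunks beyond the failed GPU's last checkpoint are assigned evenly,
    // in round-robin fashion, to the alive CPU/GPU workers in the cluster.
    std::vector<std::vector<Chunk>> redistribute_after_gpu_failure(
            const std::vector<Chunk>& remaining_chunks, int num_alive_workers) {
        std::vector<std::vector<Chunk>> plan(num_alive_workers);
        for (std::size_t i = 0; i < remaining_chunks.size(); ++i) {
            plan[i % num_alive_workers].push_back(remaining_chunks[i]);
        }
        return plan;
    }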

7.2 Experimental Results

In this section, we report results from a detailed evaluation of the fault tolerance ap- proach we have implemented and incorporated into the MATE-CG system.

7.2.1 Evaluation Methodology

Our evaluation is based on four applications implemented in the MATE-CG system:

Jacobi, Sobel Filter (Sobel), PageRank, and Gridding Kernel. The first two applications involve stencil computations, and the last two are chosen from data mining and scientific computing and involve generalized reductions. The algorithmic details of the four applications, and how to implement them using the generalized reduction model, can be found in

Chapters 5 and 6.

Overall, our experiments focused on the following aspects: 1) evaluating the scalability of the four applications using the MATE-CG system, i.e., determining the suitability of the generalized reduction model for efficient parallel programming using both CPUs and GPUs; and 2) studying the checkpointing overheads and fault recovery costs of our system in case of a GPU failure in a cluster. We support the recovery of the four applications with three different schemes.

The experiments were conducted on a heterogeneous CPU-GPU cluster, where every node comprises two Intel Xeon E5520 quad-core CPUs (8 cores in all) and an Nvidia Tesla C2050 (Fermi) GPU card. Each CPU core has a clock frequency of 2.27 GHz, and each node has 48 GB of main memory. The GPU has a clock frequency of 1.15 GHz per core and 2.68 GB of device memory. We have used up to 64 CPU cores and 3584 GPU cores (on 8 nodes) for our study.

Figure 7.2: Sobel: Scalability of CPU+GPU Version Without Failure

7.2.2 Scalability of Using CPU and GPU Simultaneously

In the first set of experiments, we evaluate the performance benefits that can be obtained from simultaneously utilizing the multi-core CPU and the GPU on each node of the cluster. To understand this benefit, we create three versions of execution: CPU-only, GPU-only, and CPU+GPU. In the CPU-only execution, we use all 8 cores of the multi-core CPU and the entire computation is performed on the CPU. In the GPU-only version, the entire computation is performed on the GPU, while a host thread running on a single CPU core communicates with the GPU. In the CPU+GPU version, the computations are distributed evenly between the multi-core CPU and the GPU using our work distribution scheme. The results for Sobel Filter and Jacobi are shown in Figures 7.2 and 7.3, and the results for PageRank and Gridding Kernel are presented in Figures 7.4 and 7.5. For both stencil applications, we used an input data size of 800 MB. For PageRank, we used a dataset of 3.6 GB, and the data size for the Gridding Kernel is around 512 MB.

Figure 7.3: Jacobi: Scalability of CPU+GPU Version Without Failure

On the x-axis, we vary the number of nodes from 1 to 8 to understand the scalability across nodes. All presented results are relative to the single-core CPU performance. From the results, it can be seen that for Sobel and Jacobi, we achieve speedups of about 40.4 and 41, respectively, from the use of CPU+GPU on 8 nodes. The corresponding numbers are 28.8 and 96.5 for PageRank and Gridding Kernel. Moreover, for Sobel, the CPU+GPU version achieves between 15.3% and 22.8% improvement over the best of CPU-only and GPU-only across different numbers of nodes. For Jacobi, we can achieve up to a 21.2% improvement from the simultaneous use of CPU and GPU with up to 8 nodes. Similarly, for PageRank and Gridding Kernel, the further improvements of the CPU+GPU versions over the best of CPU-only and GPU-only are 8.2% and 32.2% on 8 nodes, respectively. This efficiency is attributed to the work distribution strategy within the framework, which makes use of CPUs and GPUs at the same time.

Figure 7.4: PageRank: Scalability of CPU+GPU Version Without Failure

It is interesting to note that the relative speedups obtained from CPU-only and GPU-only differ across the four applications. For instance, in Sobel and PageRank, the CPU and the GPU perform comparably, and hence contribute almost equally during the CPU+GPU execution. In Jacobi, on the other hand, CPU-only is almost twice as fast as GPU-only, and hence the GPU contributes relatively less in the CPU+GPU execution. In Gridding Kernel, GPU-only is much faster than CPU-only, and the GPU contributes relatively more in the hybrid execution. This observation has an impact on the performance during GPU failures, which we study later.

Figure 7.5: Gridding Kernel: Scalability of CPU+GPU Version Without Failure

Overall, in this experiment, we show that significant performance improvements can

be achieved from the aggregate use of CPU and GPU, and thus, it is necessary to provide

fault-tolerance mechanisms for GPU failures.

7.2.3 Benefits of Checkpoint and Work Re-distribution Upon GPU Failure

In the second set of experiments, we evaluate the benefits of checkpointing and work re-distribution in the case of GPU failures. The results are shown in Figures 7.6 through 7.9. All the experiments were run on 8 nodes; both stencil applications use an input data size of 3.2GB, while the input data sizes for the other two applications are 3.6GB and 512MB, respectively. We present the execution overheads due to GPU failures relative to the execution where there is no failure. To analyze further, we use the three schemes of execution described in Section 7.1.2, denoted as Restart w/o Chkpt,

Chkpt w/o Re-Dist, and Chkpt with Re-Dist, respectively.

Figure 7.6: Sobel: Benefit of Checkpoint and Work Re-distribution During GPU Failure

From Figures 7.6 through 7.9, we can analyze the performance degradation due to GPU

failures for all four applications. To understand this better, we consider three failure points (shown on the x-axis). For instance, 50% indicates that a GPU failure within a node happened after completion of 50% of the entire work. A common observation is that Restart w/o Chkpt has the worst degradation for all four applications at all the failure points.

This is because this scheme simply redoes all the GPU work on the corresponding CPU, without using any checkpoint or re-distribution. For instance, the performance degradation can be up to 230% with this simple scheme for Gridding Kernel, since the GPU receives more data to process. Next, it can be seen that using Chkpt w/o Re-Dist, the degradation is reduced, ranging between 187% and 10% at the different failure points. Thus, checkpointing is beneficial and clearly takes advantage of the failure points. Finally, using Chkpt with Re-Dist, we can further reduce the performance degradation significantly. This scheme makes use of the failure point information (through checkpoints) and also evenly re-distributes the remaining work across all the other alive resources in the cluster. In fact, the degradation ranges only between 11.4% and 1.91% at the different failure points for all four applications. We also made

Figure 7.7: Jacobi: Benefit of Checkpoint and Work Re-distribution During GPU Failure

some interesting observations for Jacobi and Gridding Kernel. For Jacobi, as shown in Figure 7.7, the performance degradation with the simple Restart w/o Chkpt scheme is not very large. This is attributed to the fact that the relative performance between CPU and

GPU is skewed towards CPU and hence only a small portion of work is computed by the

GPU. Thus, upon GPU failures, significant degradation is not observed. But for Gridding

Kernel as in Figure 7.9, the simple scheme has significant performance slowdowns (around

2.5 times slower), which is because a large part of work is assigned to GPUs in Gridding

Kernel and a GPU failure causes severe slowdown for the overall running time on the node.

We now compare the effectiveness of our GPU checkpointing method against existing methods. In the existing checkpointing method used in CheCUDA [128], all the memory variables and data structures are checkpointed. In our checkpointing method, we take advantage of the knowledge of the computation pattern and thus reduce the size of the checkpoint data. Table 7.1 compares the size of the checkpoint data in our method and in CheCUDA; the results are presented for the worst case. Even in the worst case, an existing checkpointing method like CheCUDA uses at least twice as much memory for checkpointing. The situation can be worse for stencil kernels that use more than one input matrix in their computation. This is because our checkpoint method does not need to checkpoint the input data, whereas CheCUDA checkpoints every variable and all input data in the computation.

Figure 7.8: PageRank: Benefit of Checkpoint and Work Re-distribution During GPU Failure

Applications       Input Size   Output Size   Our Checkpoint Method   CheCUDA
Sobel              3.2GB        3.2GB         3.2GB                   6.4GB
Jacobi             3.2GB        3.2GB         3.2GB                   6.4GB
PageRank           3.6GB        0.5GB         0.5GB                   4.1GB
Gridding Kernel    512MB        30MB          30MB                    543MB

Table 7.1: Comparison of Size of Checkpoint Information

Figure 7.9: Gridding Kernel: Benefit of Checkpoint and Work Re-distribution During GPU Failure

7.3 Summary

In this work, we supported fault tolerance and recovery in the MATE-CG system, specifically dealing with GPU failures in a heterogeneous CPU-GPU cluster. Using four applications from stencil computations, data mining, and scientific computing, we observed good scalability on up to 8 nodes when using CPUs and GPUs at the same time. We also evaluated our fault tolerance approach by comparing it against two other failure recovery schemes. Overall, the combined use of checkpointing and work re-distribution incurred very low recovery overheads in case of a GPU failure in a cluster. Also, our checkpoint size is much smaller than that of the CheCUDA method.

Chapter 8: Future Work

While our MATE systems have achieved better performance than the original MapReduce, we have not really improved programming productivity. In order to use our MATE systems, programmers need to modify their existing MapReduce programs so that they can run with our MATE runtime. This restricts the use of our systems, since people need to spend extra time and effort rewriting code to obtain better performance. It is even worse for researchers in other areas such as statistics, since they tend to be familiar with SQL or scripting languages such as Matlab and R, not C or C++. In short, our MATE systems lack higher-level language support.

Therefore, in the future, we plan to improve our systems by providing a higher-level

API than the current generalized reduction API. In this way, with the MATE systems as the backend and a higher-level API as the front end, we can achieve better programming productivity and also make our systems easier to use for researchers inside and outside computer science.

8.1 Higher-Level APIs for MapReduce

Similar research has been conducted on top of the original MapReduce to provide SQL-like support.

This work has its origins in a contentious debate in the database research community going back to the 1970s between the relational advocates and the Codasyl advocates [25]. According to [106], the key issue of the discussion was whether a program to access data in a parallel database management system (DBMS) should be written either by:

• Expressing what you want, rather than how to get it, using declarative languages (Relational)

• Designing an algorithm for data access (Codasyl)

In the end, the relational database systems prevailed, and the last several decades have been a testament to their value. As mentioned in [106], “Programs in higher-level languages, such as SQL, are easier to write, easier to modify, and easier for a new person to understand. Codasyl was criticized for being ‘the assembly language of DBMS access’. We argue that MapReduce programming is somewhat analogous to Codasyl programming.” Accordingly, research from the MapReduce and Hadoop community suggests that there could be widespread sharing of MapReduce code snippets for common tasks, such as the join of different data sets. To further ease the burden of re-implementing such repetitive tasks, the MapReduce/Hadoop community is building higher-level languages on top of the current MapReduce interface to move such functionality into the runtime. Projects such as Pig Latin [105] from Yahoo!, Hive [131] from

Facebook, and Sawzall [107] from Google are among the notable software systems in this direction.

8.2 Higher-Level APIs for Generalized Reduction

Since much work has been done to provide higher-level language support for MapReduce, a natural question is whether we can make similar improvements for our generalized reduction model by providing higher-level language support. In this way, our systems

could be more useful and attractive for more people in different research communities.

Therefore, toward this goal, we could provide a higher-level API for our generalized reductions in the future. Specifically, we could pursue this goal in the following two directions.

First, a simple approach would be to keep the original MapReduce API and move the optimizations into the MapReduce runtime through the implicit use of the reduction objects. This idea has been demonstrated to be effective and efficient in a recent research

project from our research group [39]. Based on this work, we would be able to easily reuse

the existing higher-level API support for MapReduce since we can have the same MapRe-

duce API while we are still able to improve the MapReduce performance in the runtime at

the same time. This idea is attractive because it allows us to reuse the existing work from

MapReduce.

Second, as an alternative, we could directly port SQL or scripting language support on top of our programming model, so that people can use their familiar languages to program the MATE systems on different architectures. This idea is also promising because our MATE systems have been developed for programming different parallel architectures, such as multi-core CPUs, many-core GPUs, and heterogeneous CPU-GPU clusters, while there is still limited or no general MapReduce-like support for these parallel architectures. Thus, it would be quite useful and interesting to have such features on top of the MATE systems, so that users can easily use a higher-level API to exploit a diverse set of parallel environments for accelerating their programs.

Overall, both ideas are possible directions for future work to further improve the programming productivity of our systems and to benefit more people from different communities.

Chapter 9: Conclusions

This is a summary of the contributions of this dissertation.

9.1 Contributions

Map-reduce has achieved high programming productivity in large-scale data centers, but its performance efficiency is not satisfactory for high performance computing on emerging parallel architectures. To provide a map-reduce-like API with better performance that supports different parallel architectures, and starting from a comparative study showing that the original map-reduce API was not suitable for a set of data-intensive applications, we have developed several systems with the following contributions.

• We developed a map-reduce system with an alternate API (MATE) for multi-core en-

vironments. In the alternate API, the programmer takes advantage of a user-declared

reduction object, which significantly reduces the overheads arising because of the

original map-reduce API. These overheads are caused by high memory requirements

of intermediate results and the need for data communication between Map and Re-

duce stages. In the experiments, our results show that this alternate API achieves

good scalability and outperforms the original map-reduce for three data mining ap-

plications in two distinct multi-core platforms.

• We extended the MATE system to provide support for managing disk-resident re-

duction objects, which can be arbitrary-sized and are also required for many data-

intensive applications. We have evaluated the Ex-MATE system with three popular

graph mining algorithms. In the experiments, we were able to achieve better per-

formance than Hadoop-based PEGASUS in an environment with up to 128 cores

(16 nodes). Our API also results in simpler code than the use of the original

map-reduce API for these graph mining applications.

• We further designed a map-reduce-like framework named MATE-CG for program-

ming a heterogeneous environment of CPUs and GPUs. The MATE-CG system is

built on top of MATE and Ex-MATE, and one of the key contributions of this

new system is a novel auto-tuning approach for identifying the suitable data distribu-

tion between CPUs and GPUs. In the experiments, we used up to 128 CPU cores and

7168 GPU cores for our performance studies in a heterogeneous cluster. We showed

scalability with increasing number of nodes and cores, as well as with combined use

of CPU cores and GPUs. Our auto-tuning approach has been effective in choosing

the appropriate fraction with low runtime overheads for iterative applications.

• It is widely accepted that the existing solutions for providing fault tolerance in MPI

programs will not be feasible for the exascale high-end systems in the future. The

reduction object model used in our MATE systems has been demonstrated to have

lower overheads for fault tolerance support than Hadoop for data mining and graph

mining applications. Therefore, to provide more efficient fault tolerance for MPI

programs in the exascale era, we extended the generalized reduction programming

model in the MATE systems so that it can support more MPI-based applications

classes, while enabling fault-tolerance with lower overheads and greater flexibility

in recovery. Using four applications, we compared our approach with the MPICH2

library on a cluster of multi-core CPUs. Overall, our approach has at least one order

of magnitude lower checkpointing overheads than MPICH2 for all the four appli-

cations. Besides, our approach can support continued execution without rebirth of

the failed nodes and has close to zero absolute fault recovery costs in case of a node

failure.

• As GPUs became popular for HPC and other general purpose programs, we also

added fault tolerance support to deal with GPU failures based on the MATE-CG

system. Using four applications from different domains, we evaluated the benefits of

checkpointing and the work re-distribution in case of GPU failures in a CPU-GPU

cluster. Our results show that, based on our programming model in the MATE-CG

system, checkpointing GPU results with re-distributing the remaining work has very

low overheads in the failure recovery process.

According to an article by Derrick Harris [69], “The great thing about big data is that there is still plenty of room for new blood, especially for companies that want to leave infrastructure in the rear-view mirror. The current data-infrastructure space, including

Hadoop, is well-funded and nearly saturated, but it also needs help.”

Therefore, by focusing on these ideas and providing higher-level API support in the future, we would like to enhance the programming productivity and improve the performance efficiency of the current systems and, more importantly, provide useful methods and insights for achieving high performance computing with similar systems.

Bibliography

[1] AMPI. www.charm.cs.uiuc.edu/research/ampi.

[2] BLCR. https://ftg.lbl.gov/projects/CheckpointRestart.

[3] Charm++. www.charm.cs.uiuc.edu.

[4] CONDOR. www.research.cs.wisc.edu/condor.

[5] Disco Project. http://discoproject.org/.

[6] Hadoop Core. http://hadoop.apache.org/core/.

[7] MPICH2. www.mcs.anl.gov/mpi/mpich2.

[8] MVAPICH. www.mvapich.cse.ohio-state.edu.

[9] NVIDIA CUDA. http://developer.nvidia.com/page/home.html.

[10] NVIDIA CUDA4.0 Toolkit. http://blogs.nvidia.com/2011/03/announcing-cuda-4-0/.

[11] OpenCL Project. http://www.khronos.org/opencl/.

[12] OpenMPI. http://www.open-mpi.org.

[13] Scalable Checkpoint/Restart. http://sourceforge.net/projects/scalablecr.

[14] Thrust Library. http://code.google.com/p/thrust/.

[15] Adnan Agbaria and Roy Friedman. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations. Cluster Computing, 6(3):227–236, 2003.

[16] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. 1994 Int. Conf. Very Large DataBases (VLDB’94), pages 487–499, 1994.

[17] Nawab Ali, Sriram Krishnamoorthy, Niranjan Govind, and Bruce Palmer. A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models. In PDP, pages 24–31, 2011.

[18] Nawab Ali, Sriram Krishnamoorthy, Mahantesh Halappanavar, and Jeff Daily. Tolerating Correlated Failures for Generalized Cartesian Distributions via Bipartite Matching. In Conf. Computing Frontiers, page 36, 2011.

[19] Lorenzo Alvisi and Keith Marzullo. Message Logging: Pessimistic, Optimistic, Causal, and Optimal. IEEE Trans. Software Eng., 24(2):149–159, 1998.

[20] Dorian C. Arnold and Barton P. Miller. Scalable Failure Recovery for High-Performance Data Aggregation. In Proceedings of IPDPS, pages 1–11, 2010.

[21] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. In Proceedings of Euro-Par, pages 863–874, 2009.

[22] Rajanikanth Batchu, Anthony Skjellum, Zhenqian Cui, Murali Beddhu, Jothi P. Nee- lamegam, Yoginder S. Dandass, and Manoj Apte. MPI/FTTM: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Comput- ing. In Proceedings of CCGRID, pages 26–33, 2001.

[23] Leonardo Arturo Bautista-Gomez, Naoya Maruyama, Franck Cappello, and Satoshi Mat- suoka. Distributed Diskless Checkpoint for Large Scale Systems. In Proceedings of CC- GRID, pages 63–72, 2010.

[24] Leonardo Arturo Bautista-Gomez, Seiji Tsuboi, Dimitri Komatitsch, Franck Cappello, Naoya Maruyama, and Satoshi Matsuoka. FTI: High Performance Fault Tolerance Interface for Hybrid Systems. In Proceedings of SC, page 32, 2011.

[25] P. Bruce Berra. Associative Processing in Data Base Management. In Randall Rustin, editor, Proceedings of 1974 ACM-SIGMOD Workshop on Data Description, Access and Control, Ann Arbor, Michigan, May 1-3, 1974, 2 Volumes, pages 463–476. ACM, 1974.

[26] Tekin Bicer, Wei Jiang, and Gagan Agrawal. Supporting Fault Tolerance in a Data-Intensive Computing Middleware. In Proceedings of IPDPS, pages 1–12, 2010.

[27] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In Supercomputing, ACM/IEEE 2002 Con- ference, page 29, nov. 2002.

[28] Aurelien Bouteiller, Thomas Hérault, Géraud Krawezik, Pierre Lemarinier, and Franck Cappello. MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI. IJHPCA, 20(3):319–333, 2006.

[29] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks, 30(1-7):107–117, 1998.

[30] Randal E. Bryant. Data-Intensive Supercomputing: The Case for DISC. Technical Report CMU-CS-07-128, School of Computer Science, Carnegie Mellon University, 2007.

[31] Ian Buck, Tim Foley, Daniel Reiter Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Hous- ton, and Pat Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. ACM Trans. Graph., 23(3):777–786, 2004.

[32] Franck Cappello. Fault Tolerance in Petascale/ Exascale Systems: Current Knowledge, Chal- lenges and Research Opportunities. IJHPCA, 23(3):212–226, 2009.

[33] Franck Cappello, Amina Guermouche, and Marc Snir. On Communication Determinism in Parallel HPC Applications. In Proceedings of ICCCN, pages 1–8, 2010.

[34] Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer. A Map Reduce Framework for Programming Graphics Processors. In Third Workshop on Software Tools for MultiCore Systems (STMCS), 2008.

[35] Sayantan Chakravorty and L. V. Kale. A Fault Tolerant Protocol for Massively Parallel Machines. In FTPDS Workshop at IPDPS 2004. IEEE Press, 2004.

[36] Sayantan Chakravorty, Celso L. Mendes, and Laxmikant V. Kalé. Proactive Fault Tolerance in MPI Applications Via Task Migration. In Proceedings of HiPC, pages 485–496, 2006.

[37] Barbara M. Chapman, Lei Huang, Eric Biscondi, Eric Stotzer, Ashish Shrivastava, and Alan Gatherer. Implementing OpenMP on a High Performance Embedded Multicore MPSoC. In IPDPS, pages 1–8, 2009.

[38] Jiyang Chen, Osmar R. Zaïane, and Randy Goebel. Detecting Communities in Social Networks Using Max-Min Modularity. In SDM, pages 978–989, 2009.

[39] Linchuan Chen and Gagan Agrawal. Optimizing MapReduce for GPUs with Effective Shared Memory Usage. In Proceedings of HPDC, pages 199–210, 2012.

[40] Zizhong Chen. Algorithm-Based Recovery for Iterative Methods Without Checkpointing. In Proceedings of HPDC, pages 73–84, 2011.

[41] Zizhong Chen and Jack Dongarra. Algorithm-Based Fault Tolerance for Fail-Stop Failures. IEEE Trans. Parallel Distrib. Syst., 19(12):1628–1641, 2008.

[42] Jiefeng Cheng, Jeffrey Xu Yu, Bolin Ding, Philip S. Yu, and Haixun Wang. Fast Graph Pattern Matching. In ICDE, pages 913–922, 2008.

[43] Hung chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and Douglas Stott Parker Jr. Map-Reduce- Merge: Simplified Relational Data Processing on Large Clusters. In Proceedings of SIGMOD Conference, pages 1029–1040, 2007.

[44] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. In Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems (NIPS), pages 281–288, 2006.

[45] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. MapReduce Online. In NSDI, pages 313–328, 2010.

[46] Jonathan Conhen. Graph Twiddling in a MapReduce World. Computing in Science Engi- neering, 11(4):29 –41, jul. 2009.

[47] Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Ro- driguez, and Franck Cappello. Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI. In Proceedings of the 2006 ACM/IEEE conference on Su- percomputing, SC ’06. ACM, 2006.

[48] Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures. In SC ’08, pages 4:1–4:12, 2008.

[49] Teresa Davies, Christer Karlsson, Hui Liu, Chong Ding, and Zizhong Chen. High Perfor- mance Linpack Benchmark: A Fault Tolerant Implementation Without Checkpointing. In Proceedings of ICS, pages 162–171, 2011.

[50] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI, pages 137–150, 2004.

[51] Gregory F. Diamos and Sudhakar Yalamanchili. Harmony: An Execution Model and Run- time for Heterogeneous Many Core Systems. In HPDC ’08, pages 197–200. ACM, 2008.

[52] Jack Dongarra, Péter Kacsuk, and Norbert Podhorszki, editors. Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 1908 of Lecture Notes in Computer Science. Springer, 2000.

[53] Alejandro Duran, Roger Ferrer, Michael Klemm, Bronis R. de Supinski, and Eduard Ayguad´e. A Proposal for User-Defined Reductions in OpenMP. In IWOMP, pages 43–55, 2010.

[54] J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for Data Intensive Scientific Analyses. In IEEE Fourth International Conference on e-Science, pages 277–284, Dec 2008.

[55] E. N. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. A Survey of Rollback- Recovery Protocols in Message-Passing Systems. ACM Computing Surveys, 34(3):375–408, 2002.

[56] E. N. Elnozahy and James S. Plank. Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery. IEEE Trans. Dependable Sec. Comput., 1(2):97–108, 2004.

[57] Reza Farivar, Abhishek Verma, Ellick Chan, and Roy Campbell. MITHRA: Multiple Data Independent Tasks on a Heterogeneous Resource Architecture. In Proceedings of the 2009 IEEE Cluster, pages 1–10. IEEE, 2009.

[58] Brian Flachs, Godfried Goldrian, Peter Hofstee, and Jorg-Stephan Vogt. Bringing Heteroge- neous Multiprocessors Into the Mainstream. In 2009 Symposium on Application Accelerators in High-Performance Computing (SAAHPC’09), 2009.

[59] G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill. Automated Application-Level Checkpointing of MPI Programs. In Proceedings of PPoPP, pages 84–94, Oct. 2003.

[60] G. Bronevetsky, D. Marques, M. Schulz, P. Szwed, and K. Pingali. Application-level Checkpointing for Shared Memory Programs. In Proceedings of ASPLOS, pages 235–247, Oct. 2004.

[61] G. Bronevetsky, K. Pingali, and P. Stodghill. Experimental Evaluation of Application-Level Checkpointing for OpenMP Programs. In Proceedings of SC, pages 2–13, Dec. 2006.

[62] Isaac Gelado, John H. Kelm, Shane Ryoo, Steven S. Lumetta, Nacho Navarro, and Wen mei W. Hwu. CUBA: An Architecture for Efficient CPU/Co-processor Data Communication. In ICS, pages 299–308, 2008.

[63] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of SOSP, pages 29–43, 2003.

[64] Dan Gillick, Arlo Faria, and John DeNero. MapReduce: Distributed Computing for Machine Learning. http://www.scientificcommons.org/40573811, 2008.

[65] M.W. Goudreau, K. Lang, S.B. Rao, T. Suel, and T. Tsantilas. Portable and Efficient Parallel Computing Using the BSP Model. Computers, IEEE Transactions on, 48(7):670 –689, jul 1999.

[66] Dominik Grewe and Michael F. P. O’Boyle. A Static Task Partitioning Approach for Hetero- geneous Systems Using OpenCL. In Compiler Construction, pages 286–305, 2011.

[67] Amina Guermouche, Thomas Ropars, Elisabeth Brunet, Marc Snir, and Franck Cappello. Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applica- tions. In Proceedings of IPDPS, pages 989–1000, 2011.

[68] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kauf- mann Publishers Inc., 2000.

[69] Derrick Harris. 5 Low-Profile Startups that Could Change the Face of Big Data, June 2012. http://gigaom.com/cloud/5-low-profile-startups-that-could-change-the-face-of-big-data/.

[70] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. Mars: A MapReduce Framework on Graphics Processors. In Proceedings of PACT 2008, pages 260–269, 2008.

[71] Daniel S. Hirschberg, Ashok K. Chandra, and Dilip V. Sarwate. Computing Connected Com- ponents on Parallel Computers. Commun. ACM, 22(8):461–464, 1979.

[72] Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, and Haibo Lin. MapCG: Writing Parallel Program Portable between CPU and GPU. In PACT, pages 217–226, 2010.

[73] Qiming Hou, Kun Zhou, and Baining Guo. BSGP: Bulk-Synchronous GPU Programming. ACM Trans. Graph., 27(3), 2008.

[74] Joshua Hursey, Jeffrey M. Squyres, Timothy Mattox, and Andrew Lumsdaine. The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI. In Pro- ceedings of IPDPS, pages 1–8, 2007.

[75] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Dis- tributed Data-Parallel Programs from Sequential Building Blocks. In Proceedings of the 2007 EuroSys Conference, pages 59–72, 2007.

[76] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[77] Wei Jiang and Gagan Agrawal. Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining. In CCGRID, 2011.

[78] Wei Jiang and Gagan Agrawal. MATE-CG: A MapReduce-Like Framework for Accelerating Data-Intensive Computations on Heterogeneous Clusters. In Proceedings of IPDPS, 2012.

[79] Wei Jiang, Vignesh T. Ravi, and Gagan Agrawal. Comparing Map-Reduce and FREERIDE for Data-Intensive Applications. In Proceedings of the 2009 IEEE Cluster, pages 1–10. IEEE, 2009.

[80] Wei Jiang, Vignesh T. Ravi, and Gagan Agrawal. A Map-Reduce System with an Alternate API for Multi-core Environments. In CCGRID, pages 84–93, 2010.

[81] Ruoming Jin and Gagan Agrawal. A Middleware for Developing Parallel Data Mining Implementations. In Proceedings of the First SIAM Conference on Data Mining, 2001.

[82] Ruoming Jin and Gagan Agrawal. Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance. In Proceedings of the Second SIAM Conference on Data Mining, 2002.

[83] Ruoming Jin, Ge Yang, and Gagan Agrawal. Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. IEEE Trans. Knowl. Data Eng., 17(1):71–89, 2005.

[84] Karthik Kambatla, Naresh Rapolu, Suresh Jagannathan, and Ananth Grama. Relaxed Syn- chronization and Eager Scheduling in MapReduce. Technical Report CSD-TR-09-010, De- partment of Computer Science, Purdue University, 2009.

[85] U. Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. Hadi: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop. Technical Report CMU-ML-08-117, School of Computer Science, Carnegie Mellon Univer- sity, 2008.

[86] U. Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. PEGASUS: A Peta-Scale Graph Mining System. In ICDM, pages 229–238, 2009.

[87] Darren J. Kerbyson, Abhinav Vishnu, Kevin J. Barker, and Adolfy Hoisie. Codesign Challenges for Exascale Systems: Performance, Power, and Reliability. IEEE Computer, 44(11):37–43, 2011.

[88] Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. In SODA, pages 668–677, 1998.

[89] Jens Krüger and Rüdiger Westermann. Linear Algebra Operators for GPU Implementation of Numerical Algorithms. ACM Trans. Graph., 22(3):908–916, 2003.

[90] Seyong Lee and Rudolf Eigenmann. OpenMPC: Extended OpenMP Programming and Tun- ing for GPUs. In SC, pages 1–11, 2010.

[91] Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization. In PPOPP, pages 101–110, 2009.

[92] Michael D. Linderman, Jamison D. Collins, Hong Wang, and Teresa H. Y. Meng. Merge: A Programming Model for Heterogeneous Multi-core Systems. In ASPLOS, pages 287–296, 2008.

[93] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A New Parallel Framework for Machine Learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.

[94] Chi-Keung Luk, Sunpyo Hong, and Hyesoon Kim. Qilin: Exploiting Parallelism on Hetero- geneous Multiprocessors with Adaptive Mapping. In MICRO, pages 45–55, 2009.

[95] Wenjing Ma and Sriram Krishnamoorthy. Data-driven Fault Tolerance for Work Stealing Computations. In Proceedings of ICS, 2012.

[96] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, pages 135–146, 2010.

[97] Daniel Marques, Greg Bronevetsky, Rohit Fernandes, Keshav Pingali, and Paul Stodghill. Optimizing Checkpoint Sizes in the C3 System. In Proceedings of IPDPS, 2005.

[98] Naoya Maruyama, Akira Nukada, and Satoshi Matsuoka. A High-Performance Fault- Tolerant Software Framework for Memory on Commodity GPUs. In IPDPS, pages 1–12, 2010.

[99] Felix Loh and Matt Sinclair. G-CP: Providing Fault Tolerance on the GPU through Software Checkpointing. 2010.

[100] Dimitri J. Mavriplis, Raja Das, R. E. Vermeland, and Joel H. Saltz. Implementation of a Parallel Unstructured Euler Solver on Shared and Distributed Memory Architectures. In Proceedings of SC, pages 132–141, 1992.

[101] Esteban Meneses, Celso L. Mendes, and Laxmikant V. Kale. Team-based Message Log- ging: Preliminary Results. In 3rd Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids (CCGRID 2010)., May 2010.

[102] Esteban Meneses, Xiang Ni, and L. V. Kale. A Message-Logging Protocol for Multicore Systems. In Proceedings of the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS), Boston, USA, June 2012.

[103] Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R. de Supinski. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System. In Proceedings of SC, pages 1–11, 2010.

[104] Kevin O’Brien, Kathryn M. O’Brien, Zehra Sura, Tong Chen, and Tao Zhang. Supporting OpenMP on Cell. International Journal of Parallel Programming, 36(3):289–311, 2008.

[105] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In Proceedings of SIGMOD Conference, pages 1099–1110, 2008.

[106] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of SIGMOD Conference, pages 165–178, 2009.

[107] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the Data: Paral- lel Analysis with Sawzall. Scientific Programming, 13(4):277–298, 2005.

[108] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent Checkpointing under Unix. In USENIX Winter, pages 213–224, 1995.

[109] James S. Plank, Youngbae Kim, and Jack Dongarra. Algorithm-Based Diskless Checkpointing for Fault Tolerant Matrix Operations. In Proceedings of FTCS, pages 351–360, 1995.

[110] Tieyun Qian, Jaideep Srivastava, Zhiyong Peng, and Phillip C.-Y. Sheu. Simultaneously Finding Fundamental Articles and New Topics Using a Community Tracking Method. In PAKDD, pages 796–803, 2009.

[111] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary R. Bradski, and Christos Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceed- ings of 13th International Conference on High-Performance Computer Architecture (HPCA), pages 13–24, 2007.

[112] Sayan Ranu and Ambuj K. Singh. GraphSig: A Scalable Approach to Mining Significant Subgraphs in Large Graph Databases. In ICDE, pages 844–855, 2009.

[113] Vignesh T. Ravi and Gagan Agrawal. Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster. In CCGRID, pages 308–315, 2009.

[114] Vignesh T. Ravi and Gagan Agrawal. A Dynamic Scheduling Framework for Emerging Heterogeneous Systems. In Proceedings of HiPC, 2011.

[115] Vignesh T. Ravi, Wenjing Ma, David Chiu, and Gagan Agrawal. Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Con- figurations. In ICS, pages 137–146, 2010.

[116] Abhinav Sarje and Srinivas Aluru. A MapReduce Style Framework for Computations on Trees. In ICPP, 2010.

[117] Sayantan Chakravorty, Celso Mendes and L. V. Kale. Proactive Fault Tolerance in Large Systems. In HPCRI Workshop in conjunction with HPCA 2005, 2005.

[118] Fred B. Schneider. Byzantine Generals in Action: Implementing Fail-Stop Processors. ACM Trans. Comput. Syst., 2(2):145–154, 1984.

[119] Fred B. Schneider. Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. ACM Comput. Surv., 22(4):299–319, 1990.

[120] Bianca Schroeder and Garth A. Gibson. A Large-Scale Study of Failures in High-Performance Computing Systems. IEEE Trans. Dependable Sec. Comput., 7(4):337–351, 2010.

[121] Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan Primitives for GPU Computing. In Graphics Hardware, pages 97–106, 2007.

[122] Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, Jin-Soo Kim, and Seungryoul Maeng. HPMR: Prefetching and Pre-shuffling in Shared MapReduce Computation Environment. In Proceedings of the 2009 IEEE Cluster, pages 1–8. IEEE, 2009.

[123] Koichi Shirahata, Hitoshi Sato, and Satoshi Matsuoka. Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters. In Proceedings of CloudCom, pages 733–740, 2010.

[124] Nisheeth Shrivastava, Anirban Majumder, and Rajeev Rastogi. Mining (Social) Network Graphs to Detect Random Link Attacks. In ICDE, pages 486–495, 2008.

[125] Georg Stellner. CoCheck: Checkpointing and Process Migration for MPI. In IPPS, pages 526–531, 1996.

[126] Magnus Strengert, Christoph Müller, Carsten Dachsbacher, and Thomas Ertl. CUDASA: Compute Unified Device and Systems Architecture. In Proceedings of Eurographics Symposium on Parallel Graphics and Visualization (EGPGV), pages 49–56, 2008.

[127] Jeff A. Stuart and John D. Owens. Multi-GPU MapReduce on GPU Clusters. In IPDPS, 2011.

[128] Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications. In PDCAT, pages 408–413, 2009.

[129] Michela Taufer, Omar Padron, Philip Saponaro, and Sandeep Patel. Improving Numerical Reproducibility and Stability in Large-Scale Numerical Simulations on GPUs. In IPDPS, pages 1–9, 2010.

[130] George Teodoro, Timothy D. R. Hartley, Ümit V. Çatalyürek, and Renato Ferreira. Run-time Optimizations for Replicated Dataflows on Heterogeneous Environments. In HPDC, pages 13–24, 2010.

[131] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.

[132] Charalampos E. Tsourakakis, U. Kang, Gary L. Miller, and Christos Faloutsos. DOULION: Counting Triangles in Massive Graphs with a Coin. In KDD, pages 837–846, 2009.

[133] Leslie G. Valiant. A Bridging Model for Parallel Computation. Commun. ACM, 33(8), August 1990.

[134] Hubertus J. J. van Dam, Abhinav Vishnu, and Wibe A. de Jong. Designing a Scalable Fault Tolerance Model for High Performance Computational Chemistry: A Case Study with Coupled Cluster Perturbative Triples. Journal of Chemical Theory and Computation, 7(1):66–75, 2011.

[135] Abhishek Verma, Nicolas Zea, Brian Cho, Indranil Gupta, and Roy H. Campbell. Breaking the MapReduce Stage Barrier. In CLUSTER. IEEE, 2010.

[136] Abhinav Vishnu, Manojkumar Krishnan, and Dhabaleswar K. Panda. An Efficient Hardware-Software Approach to Network Fault Tolerance with InfiniBand. In CLUSTER, pages 1–9, 2009.

[137] Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of SC, page 43, 2008.

[138] Nan Wang, Srinivasan Parthasarathy, Kian-Lee Tan, and Anthony K. H. Tung. CSV: Visualizing and Mining Cohesive Subgraphs. In SIGMOD, pages 445–458, 2008.

[139] Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick Y. Yang, Guei-Yuan Lueh, and Hong Wang. EXOCHI: Architecture and Programming Environment for a Heterogeneous Multi-core Multithreaded System. In PLDI, pages 156–166, 2007.

[140] Clint Whaley, Antoine Petitet, and Jack Dongarra. Automated Empirical Optimization of Software and the ATLAS Project. Parallel Computing, 27(1–2):3–35, 2001.

[141] Richard M. Yoo, Anthony Romano, and Christos Kozyrakis. Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System. In IISWC, pages 198–207, 2009.

[142] Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali, and Paul Stodghill. Is Search Really Necessary to Generate High-Performance BLAS? Proceedings of the IEEE, 93(2):358–386, 2005.

[143] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy H. Katz, and Ion Stoica. Improving MapReduce Performance in Heterogeneous Environments. In Proceedings of OSDI, pages 29–42, 2008.

[144] Gengbin Zheng, Chao Huang, and Laxmikant V. Kalé. Performance Evaluation of Automatic Checkpoint-based Fault Tolerance for AMPI and Charm++. ACM SIGOPS Operating Systems Review: Operating and Runtime Systems for High-end Computing Systems, 40(2), April 2006.

[145] Gengbin Zheng, Xiang Ni, Esteban Meneses, and Laxmikant Kale. A Scalable Double In-memory Checkpoint and Restart Scheme towards Exascale. Technical Report 12-04, Parallel Programming Laboratory, February 2012.

[146] Gengbin Zheng, Lixia Shi, and Laxmikant V. Kalé. FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI. In Proceedings of CLUSTER, pages 93–103, 2004.

[147] Xiaomin Zhu, Chuan He, Rong Ge, and Peizhong Lu. Boosting Adaptivity of Fault-Tolerant Scheduling for Real-Time Tasks with Service Requirements on Clusters. Journal of Systems and Software, 84(10):1708–1716, 2011.
