Supporting Fault Tolerance and Dynamic Load Balancing in FREERIDE-G
Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By
Tekin Bicer, B.S.

Graduate Program in Computer Science and Engineering
The Ohio State University
2010

Thesis Committee:
Dr. Gagan Agrawal, Advisor
Dr. P. Sadayappan

ABSTRACT

Over the last 2-3 years, the importance of data-intensive computing has increasingly been recognized, closely coupled with the emergence and popularity of map-reduce for developing this class of applications. Besides programmability and ease of parallelization, fault tolerance and load balancing are clearly important for data-intensive applications, because of their long-running nature and because of the potential for using a large number of nodes to process massive amounts of data.

Two important goals in supporting fault tolerance are low overhead and efficient recovery. With these goals, our work proposes a different approach for enabling data-intensive computing with fault tolerance. Our approach is based on an API for developing data-intensive computations that is a variation of map-reduce and involves an explicit, programmer-declared reduction object. We show how more efficient fault-tolerance support can be developed using this API. In particular, since the reduction object represents the state of the computation on a node, we can periodically cache the reduction object from every node at another location and use it to support failure recovery.

Supporting dynamic load balancing in heterogeneous environments, such as grids, is another crucial problem. Distributing tasks among the processing units according to their processing capabilities, while keeping the overhead of doing so low, is critical to overall execution.
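To make the reduction-object idea concrete, here is a minimal Python sketch (hypothetical names; FREERIDE-G itself is a C++ middleware, and none of these functions are from its actual API) of a k-means-style local reduction that accumulates into an explicit reduction object, which can then be copied to another location to support failure recovery:

```python
import copy

class ReductionObject:
    """Explicit, programmer-declared state of the computation on one node.
    Here: per-cluster sums and counts for a 1-D k-means step."""
    def __init__(self, k):
        self.sums = [0.0] * k
        self.counts = [0] * k

    def accumulate(self, centroids, point):
        # Local reduction: fold one data element into the object.
        c = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
        self.sums[c] += point
        self.counts[c] += 1

    def merge(self, other):
        # Global reduction: combine objects from different nodes.
        for i in range(len(self.sums)):
            self.sums[i] += other.sums[i]
            self.counts[i] += other.counts[i]

def process_chunk(robj, centroids, chunk):
    for point in chunk:
        robj.accumulate(centroids, point)

def checkpoint(robj):
    # Because robj *is* the computation's state, checkpointing is just
    # caching a copy elsewhere; a replacement node can resume from the
    # cached copy plus the IDs of chunks processed since the last cache.
    return copy.deepcopy(robj)
```

Because each node's entire progress is condensed into one small object, periodically caching it is far cheaper than checkpointing full process state, which is the intuition behind the low overheads claimed above.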
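The load-balancing idea that follows from independently processable data elements can be sketched as a demand-driven (pull-based) scheduler. This is an illustrative simulation with made-up names, not FREERIDE-G's actual interface: faster nodes become idle sooner, request chunks more often, and thus receive proportionally more work without any explicit capability model.

```python
from collections import deque

def pull_schedule(chunks, node_speeds, granularity=1):
    """Demand-driven assignment: each node repeatedly requests the next
    `granularity` unprocessed chunks. Simulated time-to-idle per node is
    chunks_processed / speed, so faster nodes pull more often."""
    pending = deque(chunks)
    assignment = {node: [] for node in node_speeds}
    clock = {node: 0.0 for node in node_speeds}
    while pending:
        node = min(clock, key=clock.get)   # next node to become idle
        batch = [pending.popleft()
                 for _ in range(min(granularity, len(pending)))]
        assignment[node].extend(batch)
        clock[node] += len(batch) / node_speeds[node]
    return assignment
```

With two nodes of speeds 2.0 and 1.0 and twelve chunks, the faster node ends up with roughly twice as many chunks, which mirrors the slowdown-ratio experiments described later in the thesis. The `granularity` parameter corresponds to the assignment granularity (chunks per request) studied in Section 3.3.4.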
Our approach is based on independent data elements, which can be processed by any processing resource in the system. We evaluated how effectively our dynamic load-balancing system performs with our data-intensive computing API. Specifically, the problem space our API targets allows data elements to be processed independently. This property can be exploited, so that any data element in the system can be assigned to any computational resource during execution.

We have extensively evaluated our approaches using two data-intensive applications. Our results show that the overheads of our schemes are extremely low. Furthermore, we compared our fault-tolerance approach with Hadoop, one of the map-reduce implementations. Our fault-tolerance system outperforms Hadoop both in the absence and in the presence of failures.

I dedicate this work to my brother Mehmet and to my parents Cihangir and Ismet.

ACKNOWLEDGMENTS

I would like to thank my advisor, Prof. Gagan Agrawal, for his guidance throughout the course of my Master's. He has always been there when I needed him. I also thank Prof. P. Sadayappan for agreeing to serve on my Master's examination committee. He has always been a huge inspiration. I would especially like to thank my friends Vignesh Ravi and David Chiu for their help throughout the duration of my Master's. I also acknowledge all my colleagues in Data Grid Labs. Last, but not least, I would like to thank my advisors, instructors, and friends back home.

VITA

2006 .......... B.E., Computer Engineering, Izmir Institute of Technology, Izmir, Turkey
2006-2007 ..... Software Engineer, Kentkart A.S., Izmir, Turkey
2008-2010 ..... Master's Student, Department of Computer Science and Engineering, The Ohio State University
PUBLICATIONS

Research Publications

Tekin Bicer, Wei Jiang, Gagan Agrawal, "Supporting Fault Tolerance in a Data-Intensive Computing Middleware". 24th IEEE International Parallel & Distributed Processing Symposium, April 2010.

FIELDS OF STUDY

Major Field: Computer Science and Engineering
Studies in: High Performance Computing, Grid Technologies (Prof. Gagan Agrawal)

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures

Chapters:

1. Introduction
   1.1 API for Parallel Data-Intensive Computing
   1.2 Remote Data Analysis and FREERIDE-G
2. Supporting Fault Tolerance in FREERIDE-G
   2.1 Alternatives
   2.2 Our Approach
   2.3 Fault Tolerance Implementation
   2.4 Applications and Experimental Results
       2.4.1 Overheads of Supporting Fault-Tolerance
       2.4.2 Overheads of Recovery: Different Failure Points
       2.4.3 Comparison of FREERIDE-G and Hadoop
   2.5 Related Work
   2.6 Summary
3. Supporting Dynamic Load Balancing in FREERIDE-G
   3.1 Our Approach
   3.2 Detailed Design and Implementation
   3.3 Experimental Results
       3.3.1 Effectiveness and Overheads
       3.3.2 Overheads with Different Slowdown Ratios
       3.3.3 Distribution of Data Elements with Varying Slowdown Ratios
       3.3.4 Overheads with Different Assignment Granularity
   3.4 Related Work
   3.5 Summary
4. Conclusions

Bibliography
LIST OF FIGURES

1.1 Processing Structure: FREERIDE (left) and Map-Reduce (right)
1.2 FREERIDE-G System Architecture
2.1 Application Execution in FREERIDE-G
2.2 Compute Node Processing with Fault Tolerance Support
2.3 Failure Recovery Procedure
2.4 Evaluating Overheads using k-means clustering (6.4 GB dataset)
2.5 Evaluating Overheads using k-means clustering (25.6 GB dataset)
2.6 Evaluating Overheads using PCA clustering (4 GB dataset)
2.7 Evaluating Overheads using PCA clustering (17 GB dataset)
2.8 K-means Clustering with Different Failure Points (25.6 GB dataset, 8 compute nodes)
2.9 PCA with Different Failure Points (17 GB dataset, 8 compute nodes)
2.10 Comparison of Hadoop and FREERIDE-G: Different No. of Nodes - k-means clustering (6.4 GB dataset)
2.11 Performance with Different Failure Points (k-means, 6.4 GB dataset, 8 compute nodes)
3.1 Load Balancing System's Workflow
3.2 Evaluating Overheads using K-means clustering (6.4 GB dataset)
3.3 Evaluating Overheads using K-means clustering (25.6 GB dataset)
3.4 Evaluating Overheads using PCA (4 GB dataset)
3.5 Evaluating Overheads using PCA (17 GB dataset)
3.6 K-means clustering with Different Slowdown Ratios (6.4 GB dataset, 8 comp. nodes)
3.7 PCA with Different Slowdown Ratios (4 GB dataset, 8 comp. nodes)
3.8 K-means clustering with Different Slowdown Distribution (6.4 GB dataset, 8 comp. nodes)
3.9 PCA with Different Slowdown Distribution (4 GB dataset, 8 comp. nodes)
3.10 Performance with Different Number of Chunks per Request (k-means, 6.4 GB dataset, 4 comp. nodes)

CHAPTER 1

INTRODUCTION

The availability of large datasets and the increasing importance of data analysis in commercial and scientific domains are creating a new class of high-end applications.
Recently, the term Data-Intensive SuperComputing (DISC) has been gaining popularity [4], reflecting the growing importance of applications that perform large-scale computations over massive datasets.

The increasing interest in data-intensive computing has been closely coupled with the emergence and popularity of the map-reduce approach for developing this class of applications [6]. Implementations of the map-reduce model provide high-level APIs and runtime support for developing and executing applications that process large-scale datasets. Map-reduce has been a topic of growing interest in the last 2-3 years. On one hand, multiple projects have focused on improving the API or its implementations [8, 18, 27, 28]. On the other hand, many projects are underway that focus on using map-reduce implementations for data-intensive computations in a variety of domains. For example, early in 2009, NSF funded several projects for using the Google-IBM cluster and the Hadoop implementation of map-reduce for a variety of applications, including graph mining, genome sequencing, machine translation, analysis of mesh data, text mining, image analysis, and astronomy.

Even prior to the popularity of data-intensive computing, an important trend in high-performance computing has been the use of commodity or off-the-shelf components for building large-scale clusters. While such components enable better cost-effectiveness, a critical challenge in these environments is that one must be able to tolerate the failure of individual processing nodes. As a result, fault tolerance has become an important topic in high-end computing [12, 32]. Fault tolerance is clearly important for data-intensive applications because of their long-running nature and because of the potential for using a large number of nodes to process massive amounts of data. Fault tolerance has also been an important attribute of map-reduce in its Hadoop implementation
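For readers unfamiliar with the model being discussed, the map-reduce programming structure can be illustrated with a self-contained Python sketch (this is the conceptual model only, not Hadoop's actual Java API): the user supplies a map function emitting key-value pairs and a reduce function combining all values for a key, while the runtime handles grouping.

```python
from collections import defaultdict

def map_fn(line):
    # User-supplied map: emit one (word, 1) pair per word.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # User-supplied reduce: combine all values for one key.
    return (key, sum(values))

def mapreduce(records, map_fn, reduce_fn):
    # Runtime's job: apply map, shuffle (group values by key), apply reduce.
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())
```

A sequential toy like this hides what real implementations add: partitioning the intermediate keys across nodes, and the fault tolerance (re-executing failed map or reduce tasks) that the following chapters compare against the reduction-object approach.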