
Parallelization with Standard ML
A Standard ML API for Hadoop and comparison with existing large-scale solutions

Master’s Thesis in Computer Science

Ngoc Tuan Nguyen

May 14, 2015 Halden, Norway

www.hiof.no

Abstract

In recent years, the world has witnessed an exponential growth in the volume and availability of data. The term "big data" has become one of the hottest topics, attracting investment from many people and organizations, from research to business and from small to large. People have data and want to get insights from it. This leads to a demand for parallel processing models that are scalable and stable. There are several choices, such as Hadoop and its variants and the Message Passing Interface. Standard ML is a functional language which is mainly used in teaching and research. However, there is not much support for this language, especially for parallel programming. Therefore, in this thesis, we develop a Standard ML API for Hadoop called MLoop to provide SML developers with a framework for programming in the MapReduce paradigm on Hadoop. This library is an extension of Hadoop Pipes that supports SML instead of C++. The thesis also conducts experiments to evaluate and compare the proposed library with other notable large-scale parallel solutions. The results show that MLoop achieves better performance than Hadoop Streaming, the extension provided by Hadoop to support programming languages other than Java. Although its performance is not as good as that of native Hadoop, MLoop often reaches at least 80% of native Hadoop's performance. In some cases (the summation problem, for example), when the strengths of SML are utilized, MLoop even outperforms native Hadoop. Besides that, MLoop inherits characteristics of Hadoop such as scalability and fault tolerance. However, the current implementation of MLoop suffers from several shortcomings; for example, it does not support job chaining or global counters. Finally, the thesis provides several useful guidelines to make it easier to choose a suitable solution for actual large-scale problems.

Keywords: Standard ML, Hadoop, MPI, MapReduce, Evaluation, large-scale, big data, parallel processing.


Acknowledgments

I have relied on many people, directly and indirectly, to finish this thesis. First of all, I would like to thank my supervisor Roland Olsson at Østfold University College, who guided me in the work on this thesis. Thank you for your great support. I would also like to thank Professor Øystein Haugen for his invaluable effort, proof-reading and very useful discussions at the end of my thesis. I am particularly grateful to my family and my friends, especially my parents, for their continued support during my studies in Norway. Your encouragement, both mental and material, has been a big motivation for me to finish this thesis.


Prerequisites

This thesis covers many technologies in large-scale processing and relates to different programming languages. It is impossible to go into detail on every aspect mentioned in this thesis. Therefore, it is assumed that the reader has background knowledge of computer programming. The background chapters on related technologies introduce only the most important aspects.


Contents

Abstract i

Acknowledgments iii

List of Figures ix

List of Tables xi

Listings xiv

1 Introduction 1
1.1 Related Work ...... 2
1.2 Motivation ...... 4
1.3 Methodology ...... 5
1.4 Report Outline ...... 5

2 MapReduce and Its Open Implementation Hadoop 7
2.1 MapReduce ...... 7
2.2 Hadoop: An Open Implementation of MapReduce ...... 16
2.3 Developing a MapReduce Application ...... 22
2.4 Hadoop Extensions for Non-Java Languages ...... 31

3 Message Passing Interface - MPI 35
3.1 What is MPI ...... 35
3.2 Basic Concepts ...... 35
3.3 Sample Programs in MPI ...... 39

4 Standard ML API for Hadoop 49
4.1 Programming in Standard ML ...... 49
4.2 Parallelization for SML ...... 53
4.3 Extend Hadoop Pipes for Standard ML ...... 53
4.4 Architecture of MLoop ...... 54
4.5 MLoop Implementation ...... 56
4.6 Writing MapReduce programs in MLoop ...... 64
4.7 Sample Programs in MLoop ...... 67
4.8 MLoop Limitations ...... 71


5 Evaluation and Results 73
5.1 Evaluation Metrics ...... 73
5.2 Experiments ...... 74
5.3 Discussion ...... 84
5.4 Guidelines ...... 86

6 Conclusions and Future Work 89
6.1 Summary ...... 89
6.2 Future Work ...... 90

Bibliography 93

A Working with MLoop 95
A.1 Prerequisites ...... 95
A.2 Installation ...... 95
A.3 MLoop Documentation ...... 99
A.4 Running Sample Programs in Hadoop ...... 101

B Program Source Code 103
B.1 Hadoop Java Programs ...... 103
B.2 MLoop Programs ...... 111
B.3 MPI Programs ...... 114

List of Figures

2.1 Execution Overview [16] ...... 10
2.2 Complete view of MapReduce on Word Count problem ...... 12
2.3 The architecture of HDFS [25] ...... 13
2.4 Hadoop Ecosystem ...... 17
2.5 Hadoop YARN ...... 18
2.6 HDFS Federation ...... 21
2.7 HDFS High Availability ...... 22
2.8 A complex dependencies between jobs ...... 25
2.9 Hadoop Pipes data flows ...... 32

3.1 A tree-structured global sum ...... 38

4.1 Integration of Pydoop with C++ ...... 54
4.2 SML API for Hadoop ...... 55
4.3 Sequence diagram of map phase in MLoop ...... 58
4.4 Sequence diagram of reduce phase in MLoop ...... 59

5.1 Performance on Word Count problem ...... 77
5.2 Performance on Graph problem ...... 78
5.3 Performance on 17-Queen problem ...... 79
5.4 Performance on Summation problem ...... 79
5.5 Scalability on Word Count with different input sizes ...... 81
5.6 Scalability on Word Count with different cluster sizes ...... 82


List of Tables

3.1 Predefined Operators in MPI ...... 39

4.1 A C++ class ...... 56
4.2 C Wrapper for MyClass class ...... 57

5.1 Local cluster specification ...... 74
5.2 Google cluster specification ...... 75
5.3 Hadoop configuration ...... 75
5.4 Word Count with different sizes of data ...... 80
5.5 Word Count with different cluster sizes ...... 81
5.6 The number of code lines ...... 83


Listings

2.1 Map and Fold high-order functions ...... 8
2.2 Word Count with MapReduce ...... 9
2.3 MapReduce WordCount program ...... 22
2.4 Chaining jobs to run sequentially ...... 25
2.5 Chaining jobs with JobControl ...... 25
2.6 Mapper and Reducer for Summation example ...... 26
2.7 Mapper and Reducer for Graph example ...... 28
2.8 Iterative jobs for Graph example ...... 29
2.9 Mapper and Reducer for first five jobs of N-Queens ...... 29
2.10 Mapper for the last job of N-Queens ...... 30
2.11 mapper.py ...... 31
2.12 reducer.py ...... 31
3.1 A simple MPI send/recv program ...... 35
3.2 Read and distribute data ...... 40
3.3 Compute the word count on each part of data ...... 41
3.4 Aggregate the result ...... 41
3.5 Global sum with MPI ...... 42
3.6 The worker of N-Queen with MPI ...... 43
3.7 The master of N-Queen with MPI ...... 44
3.8 MPI - Graph Search (step 1): Generate new nodes from GRAY node ...... 45
3.9 MPI - Graph Search (step 2,3): Gathering data and sorting ...... 46
3.10 MPI - Graph Search (step 4): Aggregate the data ...... 46
4.1 C++ Mapper implementation ...... 61
4.2 MLoop setup for map task ...... 61
4.3 C++ Reducer and Combiner implementation ...... 62
4.4 MLoop setup for reduce and combiner task ...... 62
4.5 C++ Record Reader implementation ...... 63
4.6 SML functions for reader ...... 63
4.7 C++ Record Writer implementation ...... 64
4.8 SML functions for writer ...... 64
4.9 Counting words in MLoop ...... 65
4.10 A custom record reader to read two lines at a time ...... 66
4.11 A custom record writer in MLoop ...... 67
4.12 Summation in MLoop ...... 67
4.13 MapReduce job to put one more queen into the board ...... 68
4.14 MapReduce job to find the complete board with 17 queens ...... 69
4.15 A partial view of Graph Search in MLoop ...... 70


B.1 WordCount.java ...... 103
B.2 GraphSearch.java ...... 104
B.3 NQueens.java ...... 106
B.4 Sum.java ...... 109
B.5 wordcount.sml ...... 111
B.6 graph.sml ...... 111
B.7 queen.sml ...... 112
B.8 queen2.sml ...... 113
B.9 sum.sml ...... 114
B.10 wordcount.cpp ...... 114
B.11 bfs.cpp ...... 118
B.12 nQueen.cpp ...... 124
B.13 sum.cpp ...... 127

Chapter 1

Introduction

We are living in the Digital Era: information is instantly and readily available to more people than ever before. People take pictures on their phones, upload videos, update their social network status, surf the web and much more. Data are being generated at an incredible speed. According to a study in 2014 [5], every minute of the day we send 138 million emails and generate 4.7 million posts on Tumblr; YouTube users view over 5 million videos; and Google processes 2.66 million searches. Therefore, "big data" has become a fact of the world and certainly an issue that real-world systems must cope with. Furthermore, there is a claim that "more data lead to better algorithms and systems for solving real-world problems" [25]. Indeed, Banko and Brill [12] published a classic paper discussing the effects of training data size on classification accuracy in natural language processing. They concluded that several classification algorithms achieved better accuracy with more data. This leads to a trade-off between working on algorithms and gathering data: with increasing amounts of data, a "good-enough" algorithm can provide very good accuracy.

In recent years, there has been a growing demand for the analysis of user behavior data. Electronic commerce websites need to record user activities and analyze them to provide users with useful and personalized information. However, gathering user behavior generates a massive amount of data, which many organizations cannot handle in terms of both storage and processing capacity. A simple approach is to discard some data or to store data only for a certain period. This can lead to lost opportunities, because useful value can be derived from mining such data. As a result, there is a very high demand for storing and processing very large amounts of data. This task has become a challenge for many companies such as Google, Facebook, Yahoo and Microsoft. As a leader in many aspects of technology, Google introduced MapReduce [15] in 2004 - a programming model for processing and generating large datasets. This paradigm has drawn a lot of attention from open-source communities and even big organizations. An open source implementation of MapReduce, called Hadoop, was developed. It was created by Doug Cutting and Mike Cafarella, then received support from Yahoo and has since become an Apache project. For years, MapReduce and its implementations (Hadoop, for example) have been a great solution for many companies and organizations coping with big data.


1.1 Related Work

The urgent demand for processing massive datasets has inspired a lot of effort in this field. Hadoop is the solution that many people turn to for their problems. Therefore, much work has been done by researchers to improve Hadoop's performance in different aspects.

1.1.1 Improving Hadoop

Zaharia et al. suggested a new scheduler for the Hadoop system to improve its performance [34]. In their work, they listed the assumptions made by Hadoop's scheduler and pointed out how these assumptions break down in reality. For example, the assumption that "starting a speculative task on an idle node costs nothing" breaks down when resources are shared; in this case, network bottlenecks and disk I/O competition can occur. Based on this analysis, they proposed a new scheduler for Hadoop called LATE. The main difference between the two schedulers is that LATE estimates the time left for a task to finish and uses this information for launching backup tasks, instead of using the progress of the task as the original scheduler does. They also carried out experiments to compare the performance of the two schedulers. The results showed that LATE provided better performance in most cases.

From another point of view, Chu and colleagues [13] pointed out a characteristic that helps to recognize a class of machine learning algorithms which can be easily implemented with MapReduce: algorithms that can be written in a summation form can easily be adapted to run in a multi-core environment. The authors also proposed a multi-core MapReduce framework based on the original architecture from Google. To prove that their model is efficient, they conducted experiments comparing the performance of algorithms in two versions: one with MapReduce and the other with a normal implementation. The results showed that performance speeds up considerably; in some cases, the speed-up is linear in the number of processing cores.

In heterogeneous or shared environments, the data locality of Hadoop's scheduling may not be as good as desired. Therefore, data transmission may occur in some tasks and cause performance degradation. To address this problem, Tao Gu et al. proposed a data prefetching mechanism [31]. The idea is to minimize the overhead of data transmission. In Hadoop, the input data of a map task is not transferred all at once, but in many small parts. When the mapper gets the first part of the input, it processes that part of the data; after finishing, it starts to transfer the next part and processes it. This process is repeated until the complete input is processed. This obviously wastes time, because data processing and transmission could be carried out in parallel. In the data prefetching mechanism, a data fetching thread is created for requesting non-local input data and a prefetching buffer is allocated at the node which runs the task to store temporary data. The data prefetching thread retrieves input data through the network and stores it in the prefetching buffer, so the map task only needs to process data in that buffer. Experimental results show that this method can reduce data transmission time by up to 95% and improve job performance by 15%.

A heterogeneous environment can also lead to an unbalanced data processing load. In more detail, a more powerful node can complete its tasks on local data before a less powerful node does; it then starts to handle tasks whose data reside on the remote, slower node. Again, data transmission becomes an overhead for the system. Jiong Xie et al. proposed a different approach [33]. They developed a data placement method in HDFS with two algorithms. The first algorithm deals with initial data placement: distribute the data so that all nodes can complete processing their local data at the same time, which means a higher-performance node will contain more data. The second algorithm addresses the dynamic data load-balancing problem when new data or nodes are added or data blocks are deleted; this algorithm re-organizes the input data. Experiments show the improvement of their methods on two programs: Grep and WordCount.

In another approach to improving MapReduce performance, Zhenhua Guo and Geoffrey Fox tried to increase resource utilization [18]. In a Hadoop cluster, each slave node hosts a number of task slots where tasks can run. When the slots are not fully used, the resources of the idle slots are wasted. The authors introduced a resource stealing method, in which running tasks steal idle resources by creating sub-tasks to share the workload. When the master node assigns new tasks to a node, the stolen resources on that node are returned. The authors also introduced the Benefit Aware Speculative Execution (BASE) method to schedule speculative tasks instead of Hadoop's default strategy. BASE estimates the remaining time of a task and predicts the execution time of the speculative task based on historical information; the speculative task is only executed if it can finish earlier than the current task.

Improving Hadoop performance by data placement is a topic which attracts many researchers. Seo et al. proposed a plug-in component for Hadoop called HPMR [29]. In HPMR, a prefetching scheme is used: map or reduce tasks are carried out from the beginning of the input split while, at the same time, data is prefetched from the end of the input split. This bi-directional technique simplifies the implementation of prefetching. However, with this approach, the task still has to fetch data by itself before it reaches the point where data are prefetched. HPMR also introduces a so-called pre-shuffling scheme. The idea behind this technique is simple: the pre-shuffling module examines the input split for a map task and predicts the reducer which will process the output key-value pairs. Input data is then assigned to a map task which is near the predicted reducer, to reduce network transmission. To evaluate HPMR, the authors conducted experiments on the Yahoo Grid, which consists of 1670 nodes. According to the reported results, performance is improved by up to 73% with HPMR.

In the same vein of data placement, Rajashekhar and Daanish proposed two models to estimate a machine's power [11]. The mathematical model is built from hardware specifications like CPU speed, internal memory size, cache size and network speed, while the history-based model relies on historical information from previous tasks, such as total input bytes, total output bytes, start time and end time. However, their evaluation does not show much about the improvement brought by their solution; it mainly focuses on a comparison between the two suggested models.

1.1.2 Evaluating Parallel Solutions

Another branch of research focuses on evaluating existing parallel solutions to understand their actual performance in practical circumstances. This gives us a basis for choosing suitable solutions for our problems. In [17], Ding et al. conducted experiments to compare Hadoop, Hadoop Streaming and MPI on 6 benchmarks. The results show that the performance of Hadoop Streaming is often worse than that of native Hadoop. The authors also pointed out that the overhead of pipe operations is one of the factors causing the performance degradation. In [27], Pavlo et al. compared Hadoop and two parallel database management systems (DBMS), DBMS-X and Vertica. Their results show that the performance of these DBMSs was better than Hadoop's; in return, their data loading processes took much longer than Hadoop's. In [22], Kaur and colleagues presented a comparison of two open source frameworks for parallel computation, Hadoop and Nephele-PACT. The work focused on comparing the execution engines because both frameworks use HDFS. The results revealed that Nephele-PACT has better scalability while Hadoop shows its strength in fault tolerance.

1.1.3 Extending Hadoop for non-Java Languages

Hadoop requires developers to write MapReduce programs in Java. Although Hadoop provides a utility, called Hadoop Streaming, to run applications written in other languages, it suffers from a noteworthy performance degradation. As a result, several libraries have been introduced to support writing programs without Java.

Abuin et al. introduced Perldoop [10]. This is a tool that automatically translates Hadoop Streaming scripts written in Perl into equivalent Hadoop Java code. Their performance comparison showed that Perldoop decreases the processing time with respect to Hadoop Streaming: Perldoop runs 12 times faster than Hadoop Streaming for Perl. However, some strengths of the Perl language (for example, Perl is very good at text processing and regular expressions) cannot be taken advantage of when the original Perl programs are translated into Java code.

Paper [23] presented Pydoop, a Python package that provides a Python API for Hadoop MapReduce and HDFS. The package is built on top of Hadoop Pipes (C++). Pydoop supports access to HDFS as well as control of different components of the MapReduce model such as the combiner, partitioner, job counters, and record reader and writer. The experimental results showed that Pydoop performs better than Hadoop Streaming but worse than Hadoop Pipes and native Hadoop.

RHadoop [7, 28] is an open source project developed by Revolution Analytics1. RHadoop contains four packages which allow users to define map and reduce functions in R. Two of them are especially important: the rmr package offers Hadoop MapReduce functionality in R, and the rhdfs package provides functions for managing HDFS files from within R. RHadoop is also an extension of Hadoop Streaming; the user-defined R functions are called from Hadoop Streaming.

1.2 Motivation

Standard ML is a language which is mainly used in teaching and research. Although it is not widely used, Standard ML has been enormously influential: many popular programming languages have adopted concepts from SML such as garbage collection and dynamically-scoped exceptions. SML is used to develop Automatic Design of Algorithms through Evolution (ADATE) [1], a system for automatic programming. ADATE can automatically generate non-trivial algorithms. However, ADATE requires a very large number of operations to run its combinatorial search over program transformations. Therefore, ADATE consumes a lot of time, from hours to days, to generate the desired programs. Reducing the execution time is the key to bringing ADATE to a wider community. Therefore, there is an urgent demand for distributed and parallel processing for Standard ML. This would not only support ADATE but also other systems which use SML as their development language.

1http://www.revolutionanalytics.com/

The first attempt to satisfy that demand is to take advantage of the Message Passing Interface (MPI). For years, MPI has been a standard for writing parallel programs in the research community. However, it suffers from a big problem: the effort that developers need to spend to write MPI programs is extremely large. With the introduction of MapReduce and its open implementation Hadoop, there is a second approach: integrating Standard ML with Hadoop.

Much research in the field of parallel processing focuses on improving Hadoop's performance. However, there are only a few efforts to evaluate the different parallel approaches or to develop new libraries (modules) that support programming languages other than Java for writing Hadoop MapReduce programs. The problem is even more serious for developers who use SML, because the language is almost forgotten. Therefore, this thesis makes an effort to address these problems for SML developers. First, we develop a Standard ML API (library) for Hadoop. It supports writing MapReduce programs in SML that can run on a Hadoop cluster. One requirement for this library is that it must inherit the good features of Hadoop MapReduce. In the second part of this thesis, we conduct experiments to compare the new library with existing large-scale solutions. The results are summarized and generalized into advice (guidelines) that helps developers find the most appropriate approach for their problems in more general situations. In other words, the thesis addresses the following research questions:

1. How can we provide a new library that allows writing MapReduce programs in Standard ML?

2. How is the performance of the new library in comparison with existing large-scale approaches?

3. What aspects should developers consider in order to find the best solution for their actual problems?

1.3 Methodology

Existing efforts to provide a library for writing MapReduce programs in languages other than Java mainly follow two approaches: extending Hadoop Streaming [10] or extending Hadoop Pipes [23]. In the first approach, the library only allows developers to customize three main components of Hadoop: the mapper, reducer and combiner. The second approach, on the other hand, allows the library to support more useful features of Hadoop, as mentioned in section 1.1.3. Therefore, this thesis uses the second approach to develop a Standard ML library for Hadoop. In the evaluation, the new library is compared with existing solutions, which include MPI, native Hadoop and Hadoop Streaming. All of these solutions are tested on different metrics, with different problems and under different cluster configurations.

1.4 Report Outline

The remainder of this thesis is organized as follows. Chapter 2 gives an overview of the MapReduce framework and a short comparison with other systems; this chapter also describes Hadoop, an implementation of MapReduce, in detail. In chapter 3, the concepts of the Message Passing Interface are explained and some sample programs are presented. The Standard ML language is introduced in chapter 4, followed by the design of the Standard ML library (MLoop); its implementation is also presented in this chapter, which concludes by listing several limitations of MLoop. Chapter 5 compares the new library with other large-scale systems and discusses the results. Several guidelines are also given at the end of that chapter. The final chapter gives a conclusion and suggests future work.

Chapter 2

MapReduce and Its Open Implementation Hadoop

The explosion of data in the digital era has brought a lot of opportunities to us. There is no doubt that big data has become a significant force for innovation and growth. In order to take advantage of big data, people must face several fundamental challenges: how to deal with the size of big data, and how to analyze it and create new insights as a competitive advantage. Since its introduction in 2004, MapReduce and its open implementation Hadoop have become the de facto standard for storing, processing and analyzing huge volumes of data. This chapter provides an understanding of MapReduce as well as the components of Hadoop. Through this chapter, we will see how to use Hadoop to solve different problems.

2.1 MapReduce

In 2004, Google published a paper that introduced MapReduce to the world. It is not a library or a programming framework, but rather a programming model for processing problems with huge datasets using a large number of computers. The model is inspired by the map and reduce functions which are commonly used in functional programming, although they are used in the MapReduce framework with a different purpose than the original.

2.1.1 Ideas Behind MapReduce

A lot of research has focused on tackling large-scale problems. The ideas behind MapReduce are not quite new; they have been introduced and discussed in computer science theory for years. The point is that MapReduce combines these ideas in a way that brings out their full power. The following are the ideas that MapReduce obeys [25].

Scale "out", not "up". For processing massive amounts of data, a large number of low-end computers (the scaling "out" approach) is preferred to a small number of high-end servers (the scaling "up" approach). The main reason is that investing in such a powerful system with high-end servers is not cost effective. Although there are issues with network-based clusters, such as communication between nodes, operational costs and data center efficiency, scaling out remains more attractive than scaling up.


Fault-tolerance. Failures are common on large clusters of low-end machines. For example, disk failures, errors in RAM and connectivity loss often happen. MapReduce is built on such clusters. Therefore, it is designed to be robust enough to handle failures; that is, it needs the capability to operate well without impacting the quality of service when failures happen. When a node goes down, another node takes over its responsibility to handle the load, and after the broken node is repaired, it should be able to re-join the cluster easily without manual configuration.

Move processing to the data. Instead of having "processing nodes" and "storage nodes" linked together via a high-speed interconnect, as in high-performance computing applications, MapReduce moves the processing around. The idea is to try to run the code on the processor that is on the same computer as the block of data we need. In that way, MapReduce can take advantage of data locality, which greatly reduces bottlenecks in the network.

Hide system-level details. System-level details (locking data structures, data starvation, parallelizing the process) are hidden from developers. MapReduce separates the what and the how: what computations are to be performed and how they actually run on a cluster of machines. The programmers only need to care about the first part, the what, and are freed from the latter. Therefore, they can focus on designing algorithms or applications.

Scalability. For an ideal algorithm, scalability is defined in at least two aspects. In terms of data, given twice the amount of data, the same algorithm should take no more than twice the amount of time to finish. In terms of resources, when the cluster doubles in size, the same algorithm should run in no more than half the time. MapReduce is designed to approach this property, although it is impossible to achieve it exactly. MapReduce can be considered a small step toward such ideal algorithms.
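Stated slightly more formally (this is just one way to formalize the informal definition above, not a formula from [25]): if T(n, m) denotes the running time on an input of size n with a cluster of m machines, the two ideals read T(2n, m) ≤ 2·T(n, m) for data scalability and T(n, 2m) ≤ T(n, m)/2 for resource scalability; MapReduce aims to come close to these bounds rather than meet them exactly.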

2.1.2 MapReduce Functional Programming Root

MapReduce is inspired by the map and reduce higher-order functions in Lisp, or map and fold in other functional languages, as illustrated in Listing 2.1. The map function applies a given function f to each element of a list and returns the list of results. The fold function takes a list, a function g of two arguments and an initial value. Fold can be left-associative (foldl) or right-associative (foldr). Left fold returns the initial value if the list is empty; otherwise, it applies g to the initial value and the first element of the list, then recursively applies left fold with the result of the previous step as the new initial value and the rest of the original list as the input list.

Listing 2.1: Map and Fold high-order functions

map f [a; b; c] = [f a; f b; f c]

foldl g initial [] = initial
foldl g initial [x1, x2, ...] = foldl g (g initial x1) [x2, ...]

(* Example *)
fun f x = x * x
fun g (x, y) = x + y

map f [1,2,3,4,5]  => [1,4,9,16,25]
foldl g 0 [1,2,3]  => 6

The map phase in MapReduce corresponds to the map operation in functional programming, and the reduce phase corresponds to the reduce or fold operation. Each phase has a data processing function which is defined by the user; these functions are called the mapper and the reducer respectively.

Programming Model

Key-value pairs are the basic data structure in MapReduce. Keys and values may be primitives such as integers, floating-point values, strings and raw bytes, or they may be user-defined types. The basic forms of the mapper and reducer are as follows:

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

The mapper takes an input key-value pair and produces a set of intermediate key-value pairs. The MapReduce framework then groups all intermediate values associated with the same intermediate key and passes them to the reducer. Intermediate data arrive at each reducer in order, sorted by key. The reducer accepts an intermediate key and a set of values for that key; it processes these values and generates output key-value pairs. Consider the problem of counting the number of occurrences of each word in a large collection of documents. The pseudo-code for the MapReduce program is shown in Listing 2.2.

Listing 2.2: Word Count with MapReduce

map(document_name n, document d):
    for each word w in document d:
        Emit(word w, count 1);

reduce(word w, counts [c1, c2, ...]):
    sum = 0
    for each count c in [c1, c2, ...]:
        sum = sum + c
    Emit(word w, count sum);

The map function takes an input key-value pair in the form (document name, document content), parses the document into tokens (words) and produces an intermediate key-value pair for each word: the word itself as the key and an associated count of occurrences (just 1 in this case) as the value. All values for the same key are grouped together and sent to the reducer, so we only need to sum up all the counts associated with a word; that is what is done in the reducer.

MapReduce Execution

As described in the original paper [16], the execution of the MapReduce framework consists of six steps. When the user program calls the MapReduce function, the following sequence of actions occurs:

1. Input files are split into M pieces, typically 16-64 MB per piece (this size can be controlled by the user via a parameter). The MapReduce library then launches many copies of the user program on the cluster.

Figure 2.1: Execution Overview [16]

2. One of the copies of the program is the master, which assigns the work to the workers (the remaining copies). There are in total M map tasks corresponding to the M pieces, and R reduce tasks (R is determined automatically by the MapReduce library or can be set by the user). The master assigns a map task or a reduce task to each idle worker.

3. A worker who is assigned a map task reads the corresponding input split. It translates the input data into key-value pairs and passes each pair to the user-defined map function. The output key-value pairs are buffered in memory.

4. A background thread periodically spills the buffered data to local disk and splits them into R parts by the partitioning function, with each part corresponding to a reduce worker. The locations of these data are reported to the master so that the reduce workers can learn where the data reside.

5. The reduce worker contacts the master to get information about its input data, then uses remote procedure calls to read the intermediate data from the local disks of the map workers. After reading all the necessary intermediate data, the reducer sorts the data by the intermediate keys in order to group the data by key. An external sort may be used if the intermediate data cannot fit in main memory.

6. The reduce worker passes each key and the corresponding set of intermediate values to the user-defined reduce function. The output of this function is written directly to the output file system.

When the developer submits a MapReduce program (referred to as a job) to the framework, the framework must take care of all aspects of distributed code execution on the cluster. Its responsibilities include [16, 25]:

Scheduling. Each MapReduce job is divided into smaller parts called tasks. A map task is responsible for processing a set of input key-value pairs; a reduce task processes a set of the intermediate data. When the total number of tasks is larger than the number of tasks that can run in parallel on the cluster, the scheduler has to maintain some kind of task queue and track the state of running tasks. Another aspect involves multiple jobs from different users. One more thing to care about is the so-called straggler: a node that takes an unusually long time to complete one of the last few map or reduce tasks, which can increase the total time of the MapReduce operation. Therefore, the scheduler may need to schedule backup executions of the remaining in-progress tasks.

Data/Code Distribution. In MapReduce, the scheduler starts a task on the node which contains the data for that task; that is "moving code to the data". If this is impossible (for example, the node already has too many running tasks), the new task is started on another node and the necessary data are streamed over the network.

Synchronization. In MapReduce, the reducers cannot start until all mappers have finished emitting key-value pairs and all intermediate key-value pairs have been shuffled and sorted.

Error and fault handling. The MapReduce execution framework must ensure smooth execution in a fragile environment where errors and faults are common. The master pings every worker periodically; if no response is received from a worker within a certain amount of time, the master marks it as failed. Any tasks completed or in progress on the failed node are re-scheduled on other workers.

Partitioners and Combiners

Partitioners are responsible for splitting the intermediate data and assigning them to reducers. The default partitioner uses a hashing technique: it computes the hash value of the key and then takes this value modulo the number of reducers. In MapReduce, there is a guarantee that within a partition, the intermediate key-value pairs are processed in increasing key order, which makes it easy to generate a sorted output file per partition.

Combiners are an optimization of MapReduce that allows local aggregation before the shuffle and sort phase. Word count is an example. Since the distribution of word occurrences is imbalanced, each map task may produce thousands of key-value pairs of the form (word, 1). All of these intermediate data will be sent over the network to a reducer, which is obviously inefficient. One solution is to perform local aggregation on the output of each mapper. The combiner function is executed on each machine that performs a map task. Combiners can be considered as "mini-reducers" that take the output of the mappers. A combiner can emit any number of key-value pairs, but the keys and values must have the same type as the mapper output. Another difference between a reducer and a combiner is how the output is handled: the output of a reducer is written to the final output file, whereas the output of a combiner is written to an intermediate file.

Figure 2.2: Complete view of MapReduce on Word Count problem

Figure 2.2 shows the complete MapReduce model. Input splits are consumed, translated to key-value pairs and processed by the mappers. The outputs of the mappers are handled by the combiners, which perform local aggregation to reduce the number of intermediate output key-value pairs. The partitioner then splits the intermediate data, creating one partition for each reduce task. Based on this information, each reducer fetches its corresponding input data from the mappers. When all the map outputs have been copied, the reducer starts the sort phase, which merges and sorts the map outputs. Finally, the reduce function is called to process the data and create the final output. In summary, a complete MapReduce job consists of the mapper, combiner, partitioner and reducer, which are the aspects that developers can control; the other aspects are handled by the execution framework.
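As an illustration of these two components, the following sketch is written in the style of Hadoop's Java MapReduce API; it is not one of the thesis listings, and the class names are hypothetical. It shows a hash-based partitioner equivalent to the default behaviour described above, together with a word-count combiner that pre-sums counts on the map side.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// A hash partitioner equivalent to the default behaviour described above:
// the hash of the key, taken modulo the number of reducers, selects the partition.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit so the result of the modulo operation is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// A word-count combiner: same signature as a reducer, and the same key/value
// types as the mapper output. It pre-sums the counts produced by one map task.
class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

In a driver program these would be registered with job.setPartitionerClass(WordPartitioner.class) and job.setCombinerClass(WordCountCombiner.class). Hadoop's built-in HashPartitioner already implements exactly this partitioning scheme, so a custom class is only needed when a different assignment of keys to reducers is desired.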

2.1.3 MapReduce Distributed File System

As described before, MapReduce is a programming model to store and process big amounts of data. So far, we have focused on the processing aspect of the model. To deal with the storage aspect, MapReduce uses a distributed file system (DFS), which builds on previous work and is customized for large-data processing workloads. The MapReduce DFS splits files into chunks, replicates them and stores them across the cluster. There are different implementations of a DFS for MapReduce: the Google File System (GFS) supports Google's proprietary implementation of MapReduce; HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop; and CloudStore is an open-source DFS originally developed by Kosmix.

Figure 2.3: The architecture of HDFS [25]

The MapReduce DFS uses a master-slave architecture (Figure 2.3), in which the master maintains the file namespace (metadata, directory structure, file-block mapping, access permissions, etc.) and the slaves store the actual data blocks. In GFS, the master is called the GFS master and the slaves are called GFS chunkservers [25]; in Hadoop, they are the namenode and datanodes respectively. In HDFS, to read a file, the client must contact the namenode to get information about the actual data. The namenode returns the block ids and the locations of the blocks. The client then contacts the datanodes to retrieve the actual data blocks. Note that all data are transferred directly between the client and the datanodes; data are never moved through the namenode. To ensure reliability, availability and performance, HDFS stores three copies of each data block by default. Datanodes constantly report to the namenode to provide information about local changes, and they receive instructions to create, move or delete blocks on their local disks. The namenode can therefore ensure proper replication of all the blocks: if there are not enough replicas, additional copies are created; if there are too many, extra copies are removed.
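To make the read path concrete, here is a minimal client-side sketch using Hadoop's standard FileSystem API (an illustration only, not one of the thesis listings); the namenode host and the file path are placeholders. The client only asks the namenode for metadata, and the actual bytes are streamed from the datanodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The default filesystem URI; "namenode-host" is a placeholder for the real namenode.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // The FileSystem client asks the namenode for metadata (block ids and
        // locations) and then streams the blocks directly from the datanodes.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/input.txt"); // hypothetical path

        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}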

2.1.4 MapReduce and Other Systems

The approach taken by MapReduce is not the only approach to parallel and distributed computation. There are other approaches, each of which is suitable for a specific class of problems. In this part, we will look at some of the well-known approaches.

Relational Database Management System

A Database Management System (DBMS) is a system that manages data, organizes it, and provides suitable ways to retrieve and process it. A relational database management system (RDBMS) is a DBMS based on the relational model. We can use a DBMS with a lot of disks to analyze large-scale data. However, this solution has several limitations.

The first limitation comes from disk drive technology: seek time improves more slowly than transfer time. Seek time refers to the time required to position the disk's head on the correct track of the disk to read or write data, while transfer time refers to the time needed to transfer the data. If an operation on the data needs more seek time than transfer time, reading or writing a large portion of the dataset will take longer than simply streaming the data. In this case, MapReduce will be more efficient than a B-Tree [30] (the data structure commonly used in relational databases), because MapReduce uses Sort-Merge to rebuild the database; on the other hand, a B-Tree is better for updating a small portion of the dataset. From this analysis, MapReduce is suitable for problems that need to work with the whole dataset, whereas an RDBMS is good for problems which need low latency or need to update only a small fraction of the data.

Another limitation of an RDBMS is the type of data it can manage. An RDBMS works well with structured data, that is, data that conform to a defined format. In contrast, semi-structured data do not have a predefined schema and have a structure that may change unpredictably, while unstructured data are not organized in a predefined manner (plain text or image data are examples). An RDBMS shows drawbacks when coping with semi-structured or unstructured data. Fortunately, MapReduce works well on these kinds of data because it interprets the data at processing time.

High Performance Computing and Grid Computing

Both of these systems aim at solving complex computational problems through parallel processing. High Performance Computing (HPC) relates to techniques to build and manage a supercomputer or parallel processing machine. There are two types of HPC architectures: shared-memory systems and distributed-memory systems. In a shared-memory system, multiple CPUs share a global memory. This architecture lacks scalability between memory and CPUs: adding more CPUs or transmitting a large amount of data will cause a bottleneck in data traffic. A distributed-memory system, on the other hand, has multiple nodes, each with its own local memory. This architecture requires the programmer to be responsible for mapping data structures across the nodes. The programmer also has to manage communication between nodes when remote data is required in a local computation (message passing). To perform message passing between nodes, a low-level programming API (such as MPI) is needed.

Grid computing can be considered a more economical way of achieving the processing capabilities of HPC. It connects computers and resources to form a cluster with higher performance. A widely used approach for communication between cluster nodes is MPI. MPI, the Message Passing Interface, is a language-independent protocol for programming parallel computers. MPI gives programmers a lot of control. However, it is a low-level programming API; it is therefore hard to develop programs, and it requires programmers to handle many aspects themselves: checkpointing, recovery, data flow, and so on. Another difficulty which programmers may encounter is coordinating the processes in a distributed environment. One of the hardest aspects is handling failures, especially partial failures, where you do not know exactly where the problem happened.

Volunteer Computing

Volunteer computing is a type of computing in which people (volunteers) provide computing resources to one or more projects. SETI@home [8] (analyzing radio signals to find signs of life outside the Earth) and Folding@home [6] (analyzing protein folding and its effects) are two well-known projects.

Volunteer computing projects work by breaking the problem into independent chunks called work units, which are sent to volunteers around the world to be computed [32]. Because volunteers are anonymous, volunteer computing systems must cope with problems related to the correctness of the results. Besides, volunteer computing systems need time to solve problems and collect the results from the volunteers, because volunteers donate CPU resources, not network bandwidth.

Memcached

Memcached1 is a free and open source, high-performance, distributed memory object caching system [14]. It is used to speed up dynamic database-driven websites by caching data in RAM. Many famous sites, such as Facebook, Twitter and YouTube, have used Memcached to increase their performance. Memcached is in fact very simple. It uses a client-server architecture: the server maintains a key-value store and the client requests values from the server. Keys and values are allowed to be of any type. The server keeps these key-value pairs in RAM; if the RAM is full, it removes old pairs in LRU (Least Recently Used) order.

One could consider using this technique to improve the performance of machine learning algorithms, and it is possible. But for very large amounts of data (big data) and for long-term use, Memcached is not a good choice, for several reasons. First, Memcached is designed as a caching system, not a cluster solution: Memcached nodes are independent and unaware of other nodes. Secondly, it lacks scale-out flexibility [14]: adding a new node to, or removing one from, an existing Memcached tier is a very complicated task. The next problem is the nature of RAM: when the system goes offline, it forgets all the data. And finally, the cost of RAM is much higher than that of hard disks, so it is not cost-effective to invest in a Memcached system with a large amount of memory.

Apache Spark

Apache Spark2 was originally developed at UC Berkeley. It is an open-source engine for processing large-scale data. Spark is written in Scala, and APIs were later added for Java and Python. The current version of Spark can easily integrate with a Hadoop 2 cluster without any installation. According to the Spark homepage, it is very efficient for iterative machine learning computations, and indeed, Spark outperforms Hadoop in this area. However, when the size of the data increases and exceeds the total size of the RAM in the cluster by an order of magnitude, Hadoop in turn outperforms Spark.

2.1.5 MapReduce Limitations

Although MapReduce provides a lot of convenience for developing parallel programs, there are several limitations of the MapReduce paradigm that need to be considered before applying it to practical problems.

Skew. In many cases, the distribution of intermediate key-value pairs is not balanced. As a result, some reducers must do more work than others. These reducers become stragglers, which slow the MapReduce job down significantly.

1http://en.wikipedia.org/wiki/Memcached
2https://spark.apache.org/

Two phases. Some problems cannot be broken down into two phases, for example complex SQL-like queries.

Iterative and Recursive Algorithms. Hadoop is built on cheap hardware, so failures are common. To deal with this problem, MapReduce restricts the units of computation: both map tasks and reduce tasks have the blocking property: "a task does not deliver output to any other task until it has completely finished its work." Unfortunately, recursive tasks cannot satisfy this property: they need to deliver some output before finishing, because the output of a recursive task must be fed back into its input. Therefore, to implement a recursive task with MapReduce, we need to translate it into an iterative form, typically as a sequence of MapReduce jobs.
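As a minimal sketch of this translation (illustrative only; the mapper and reducer bodies below are placeholders, and the thesis' own iterative examples appear later, e.g. Listing 2.8), a driver can run one MapReduce job per iteration and feed each job's output directory back in as the next job's input:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {

    // Placeholder mapper for one step of the iterative algorithm.
    public static class StepMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("key"), line);   // per-record step logic goes here
        }
    }

    // Placeholder reducer for one step of the iterative algorithm.
    public static class StepReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);              // aggregation logic goes here
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = "/data/iter0";               // hypothetical HDFS paths
        int maxIterations = 10;

        for (int i = 1; i <= maxIterations; i++) {
            String output = "/data/iter" + i;

            Job job = Job.getInstance(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            job.setMapperClass(StepMapper.class);
            job.setReducerClass(StepReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(input));
            FileOutputFormat.setOutputPath(job, new Path(output));

            if (!job.waitForCompletion(true)) {
                throw new RuntimeException("Iteration " + i + " failed");
            }

            // The output of one iteration becomes the input of the next, which is how
            // the "feedback" of a recursive computation is expressed in MapReduce.
            input = output;

            // A convergence test (e.g. a user-defined counter updated by the reducers)
            // would normally decide whether to stop before maxIterations is reached.
        }
    }
}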

2.2 Hadoop: An Open Implementation of MapReduce

In 2002, the Nutch project was started as a web search engine by Doug Cutting and Mike Cafarella. However, the architecture of Nutch at that time did not scale to the billions of pages on the web. MapReduce, introduced in 2004, became the solution: in 2005, the Nutch developers finished a MapReduce implementation and ported the Nutch algorithms to MapReduce. In 2006, the storage and processing parts of Nutch were spun out to form Hadoop as an Apache project. In short, Apache Hadoop is an open source implementation of MapReduce written in Java. It provides both distributed storage and computational capabilities. In this section, we will look at Hadoop in two respects: its architecture and how to write programs for it. Hadoop uses the master-slave architecture for both its processing and its storage model. At the time of this writing, Hadoop has two versions. The first version has several limitations in scalability; the second version introduces new features that improve the scalability and reliability of Hadoop. In the remainder of this thesis, only Hadoop 2 is considered. It consists of the following components:

• Yet Another Resource Negotiator (YARN), a general scheduler and resource manager. YARN enables applications other than MapReduce to run on a Hadoop cluster.

• Hadoop Distributed File System for storing data.

• MapReduce, a computational engine for parallel and distributed applications. In Hadoop 2, MapReduce is a YARN application.

Hadoop provides a practical solution for large-scale data applications. A lot of projects are built on top of Hadoop to customize or optimize it for specific requirements. Together they form an ecosystem which is diverse and grows by the day. Figure 2.4 shows the Hadoop ecosystem.

Figure 2.4: Hadoop Ecosystem

The following is a brief description of several well-known projects:

Pig A high-level data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

Hive A SQL-like data warehouse infrastructure. It allows analysts with strong SQL skills to run queries on huge volumes of data without writing any Java code.

HBase A distributed, column-oriented database modeled after Google's Bigtable. HBase is built on top of HDFS and is suitable for applications that require real-time read/write random access to very large datasets.

ZooKeeper A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. ZooKeeper steps up the performance and availability of a distributed system such as a Hadoop cluster and protects the system from coordination errors such as race conditions and deadlock.

Mahout A scalable machine learning library implemented on top of Hadoop. Mahout focuses on three key areas of machine learning: collaborative filtering (recommender engines), clustering and classification.

Storm A distributed real-time computation system for processing large volumes of high-velocity data. Storm running on YARN is powerful for scenarios requiring real-time analytics, machine learning and continuous monitoring of operations.

2.2.1 Hadoop YARN

In Hadoop MapReduce v1, one JobTracker and multiple TaskTrackers are the entities which manage the life-cycle of a MapReduce job. This, however, became a weakness of Hadoop because it limits the scalability of the overall system. To overcome this limit, Hadoop v2 introduces a new architecture called YARN (Yet Another Resource Negotiator). YARN breaks the main functionalities of the JobTracker into two independent daemons: a global ResourceManager that manages the resources in the cluster, and one ApplicationMaster per application that manages the life-cycle of that application running in the cluster.

Figure 2.5: Hadoop YARN

The YARN framework has one primary duty, which is to schedule resources in a cluster. An application which wants to run in the cluster is managed by YARN and is allocated containers (resources) which are specific to that application. While the application is running, the containers are monitored by YARN to ensure that resources are used efficiently. The YARN framework has two primary components:

• Resource Manager. A Hadoop cluster has only one Resource Manager. It is the master process of YARN and runs on the master node of the cluster. Its only duty is to manage the resources of the cluster. It serves client requests for resources (which are defined as containers). If the resources are available, the Resource Manager schedules the containers on the cluster and delegates their creation to the Node Managers.

• Node Manager. The Node Manager is the slave process of YARN that runs on every slave node in the cluster. It is responsible for creating, monitoring and killing containers. It serves requests from the Resource Manager and from Application Master containers to create containers. The Node Manager also reports the status of the containers to the Resource Manager.

The architecture of Hadoop YARN is described in Figure 2.5. The central agent is the ResourceManager, which manages and allocates cluster resources. A NodeManager runs on each node in the cluster and manages and monitors the resource allocation on that node. The ApplicationMaster runs under the schedule of the ResourceManager and is managed by a NodeManager. It is responsible for negotiating resources (defined as containers) from the ResourceManager, as well as for monitoring the execution and resource consumption of the containers.

Running a Job in YARN

Running a MapReduce job in YARN involves more entities than classic MapReduce. They consist of the client, who submits the job; the Resource Manager, which schedules the resources on the cluster; the Node Managers, which launch and monitor the containers in the cluster; the MapReduce Application Master, which manages the job and is responsible for the fault tolerance of the MapReduce application; and HDFS, which stores the job resources.

The client sends a request to the ResourceManager, which negotiates and allocates resources for a container and launches an ApplicationMaster there. The Application Master is responsible for managing the submitted job, which means managing the application-specific containers, and it keeps track of the job's progress. It may request containers for all the map and reduce tasks in the job from the Resource Manager; each map or reduce task is handled in a container. All requests contain information about the resources that each container requires, such as which host should launch the container and the memory and CPU requirements of the container. When a task is running in a container under YARN, it reports its progress and status back to its Application Master, so that the Application Master has an overall view of the job.

The Application Master is also responsible for the fault tolerance of the application. The failure of a container is monitored by the Node Manager and reported to the Resource Manager. When a container fails, the Resource Manager sends a status message to the Application Master so that it can respond to that event. On job completion, the Application Master and the task containers clean up their working state and the resources are released.
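The client-side view of this sequence can be sketched as follows (an illustrative fragment, not from the thesis; the configuration keys are the usual Hadoop 2 client settings, and the host names and paths are placeholders). With mapreduce.framework.name set to yarn, submit() hands the job to the ResourceManager, which launches the MapReduce ApplicationMaster; no mapper or reducer is set here, so Hadoop's identity classes are used.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class YarnSubmitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed standard Hadoop 2 client settings; the host names are placeholders.
        conf.set("mapreduce.framework.name", "yarn");          // run via the ResourceManager, not locally
        conf.set("yarn.resourcemanager.hostname", "rm-host");
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // No mapper or reducer is set, so Hadoop's identity classes are used;
        // a real driver would configure them as in any other MapReduce program.
        Job job = Job.getInstance(conf, "yarn-submit-example");
        job.setJarByClass(YarnSubmitExample.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));     // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // submit() returns as soon as the ResourceManager has accepted the application;
        // the MapReduce ApplicationMaster then requests containers for the tasks.
        job.submit();
        System.out.println("Submitted job " + job.getJobID());

        // Block until the ApplicationMaster reports completion.
        System.exit(job.waitForCompletion(false) ? 0 : 1);
    }
}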

Failures in YARN

There are many entities that can fail in YARN: the task, the application master, the node manager and the resource manager. The task status is monitored by the application master. If a task fails (the task exits due to a runtime exception, or the task hangs), it is attempted again; the task is marked as failed only after several attempts.

An application master sends periodic heartbeats to the resource manager. If the application master fails, the resource manager will detect the failure and start a new instance of the master in a new container.

If a node manager fails, the resource manager no longer receives its heartbeats and therefore detects the failure. The failed node manager is removed from the pool of available nodes, and any task or application master running on it is recovered as described above.

A failure of the resource manager is serious. While running, the resource manager uses a check-pointing mechanism to store its state on persistent storage, so it can be recovered from the last saved state when a crash happens.

2.2.2 Hadoop Distributed File System

The Design of HDFS
HDFS is a filesystem designed for storing very large files, which may be hundreds of gigabytes or terabytes in size. The idea behind HDFS is to support highly efficient data processing; in particular, it is designed around a write-once, read-many-times pattern. Moreover, Hadoop does not require expensive, highly reliable hardware; it can run on clusters of commodity hardware. HDFS is not a good fit for the following requirements:

Low-latency data access. HDFS is designed to optimize data transfer (throughput), which may come at the cost of high latency for individual accesses.

Lots of small files. The number of files in a filesystem managed by a namenode is limited by the amount of its memory, because the namenode holds the filesystem metadata in memory. Assume that each file, folder or block takes about 150 bytes of metadata. One million files, each stored in a single block, then correspond to roughly two million objects, which require at least 2 x 10^6 x 150 bytes = 300 MB of memory.

Multiple writers, arbitrary file modifications. Files in HDFS are always written at the end of the file. Hadoop does not support multiple writers or modifications at arbitrary offsets in the file.

HDFS Concepts

Blocks
A disk block is the minimum amount of data on a disk that can be read or written. Filesystem blocks are often several kilobytes in size, while disk blocks are about 512 bytes. HDFS also uses the concept of a block; however, its blocks are much larger: 64 MB by default. The reason for this is to minimize the cost of seeking.

Namenode and Datanodes
HDFS adopts a master-slave architecture with the namenode (the master) and lots of datanodes (the slaves). The namenode manages the filesystem namespace, while the actual data are stored on the datanodes. A client that wants to access data first communicates with the namenode to get information about the actual data, then contacts the datanodes to retrieve the desired data blocks. Therefore, without the namenode, the filesystem cannot be used. As a result, it is very important to keep the namenode robust against failures. Hadoop provides two ways to ensure this. The first is to back up the files that make up the filesystem metadata. The second is to provide a secondary namenode. This node differs from the namenode in that it doesn't record any real-time changes to HDFS; instead, it communicates with the namenode to take snapshots of the HDFS metadata at predefined intervals. When the namenode is down, the secondary namenode can replace it and take over the responsibility of the namenode.

HDFS Federation
The namenode manages the entire filesystem namespace in memory. Therefore, in very large clusters, the memory of the namenode becomes a barrier to scalability. To deal with this, from Hadoop version 2.x, HDFS Federation [19] has been introduced, as shown in Figure 2.6.

Figure 2.6: HDFS Federation

HDFS Federation allows the namespace to be expanded by adding more namenodes, each of which manages a portion of the filesystem. For example, one namenode manages all the files under /user, and another namenode controls the files under /etc. The namenodes are federated, which means that they are independent and don't require coordination with one another. Therefore, the failure of one namenode does not affect the other namenodes. The datanodes are now used as common storage for all namenodes; each datanode registers with each namenode in the cluster. A block pool contains all the blocks for the files in a single namespace and is managed independently of the other block pools, so a namespace can generate block IDs for new blocks without coordinating with the other namespaces.

HDFS High Availability
The namenode is the single point of failure of Hadoop. To deal with this problem, Hadoop 2.x introduces HDFS High Availability, as shown in Figure 2.7. In this mechanism, two separate machines are used as namenodes in an active-standby configuration. This means that at any time there is only one namenode in the Active state, while the other is in Standby. The role of the Standby namenode is to take over the duties of the Active namenode quickly, so that the cluster can continue responding to clients' requests. To achieve this, there are several requirements that the new Hadoop system must fulfill: the two namenodes have to share an edit log so that the Standby namenode can keep up with the current state of the Active namenode; the datanodes need to report their state to both namenodes, because the filesystem namespace is stored in main memory, not on disk; and when a failure occurs, the clients must be able to handle it without any intervention from users. Hadoop introduces a new entity, called the failover controller, to manage the transition from the Active namenode to the Standby namenode. One of its responsibilities is to ensure that only one namenode is active at any time. The first implementation of the failover controller uses ZooKeeper to do that. The failover process can be activated automatically by the failover controller or manually by an administrator.

Figure 2.7: HDFS High Availability

One notable point about the failover process is that it is impossible to be sure that the supposedly failed namenode has actually stopped working. Take a slow network as an example: it can trigger a failover transition even though the Active namenode is in fact still running. Therefore, to ensure that only one namenode is Active, a mechanism called fencing is applied: the failed namenode is forced to stop working, for example by killing the namenode's process or by revoking its privilege to access the shared storage directory.

2.3 Developing a MapReduce Application

In this section, we will take a closer look at writing MapReduce programs with Hadoop. A complete MapReduce program is often referred to as a MapReduce job. A MapReduce job consists of a mapper and a reducer (and possibly a combiner and a partitioner) as well as configuration parameters. Writing a MapReduce program mainly involves writing the functions for the map and reduce steps. The complete code can be found in Appendix B.

2.3.1 Counting words with Hadoop
The pseudo-code for the Word Count problem is listed in section 2.1.2. In this part, we illustrate how to express it in Hadoop MapReduce. Our program contains three main parts: a map function, a reduce function and some code to configure and run the job.

Listing 2.3: MapReduce WordCount program

1  public class WordCount {
2    public static class WordCountMapper extends
3        Mapper<Object, Text, Text, IntWritable> {
4      private final IntWritable ONE = new IntWritable(1);
5      private Text word = new Text();
6
7      public void map(Object key, Text value, Context context)
8          throws IOException, InterruptedException {
9        StringTokenizer itr = new StringTokenizer(value.toString());
10       while (itr.hasMoreTokens()) {
11         word.set(itr.nextToken());
12         context.write(word, ONE);
13       }
14     }
15   }
16
17   public static class WordCountReducer extends
18       Reducer<Text, IntWritable, Text, IntWritable> {
19
20     public void reduce(Text text, Iterable<IntWritable> values,
21         Context context) throws IOException, InterruptedException {
22       int sum = 0;
23       for (IntWritable value : values) {
24         sum += value.get();
25       }
26       context.write(text, new IntWritable(sum));
27     }
28   }
29
30   public static void main(String[] args) throws IOException,
31       InterruptedException, ClassNotFoundException {
32     Configuration conf = new Configuration(true);
33     Job job = Job.getInstance(conf, "wordcount");
34     job.setJarByClass(WordCount.class);
35
36     Path inputPath = new Path(args[0]);
37     Path outputDir = new Path(args[1]);
38     FileInputFormat.addInputPath(job, inputPath);
39     FileOutputFormat.setOutputPath(job, outputDir);
40
41     job.setMapperClass(WordCountMapper.class);
42     job.setReducerClass(WordCountReducer.class);
43
44     job.setOutputKeyClass(Text.class);
45     job.setOutputValueClass(IntWritable.class);
46     job.setInputFormatClass(TextInputFormat.class);
47     job.setOutputFormatClass(TextOutputFormat.class);
48
49     int code = job.waitForCompletion(true) ? 0 : 1;
50     System.exit(code);
51   }
52 }

Listing 2.3 shows a partial view of the WordCount program. The map and reduce functions are defined in inner classes of WordCount. The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key and output value types of the map function. In this example, the input key is a Java Object, the input value is a line (a Text object), the output key is a word (a Text object), and the output value is a number (an IntWritable object). The map() method is passed a key and a value; the Text value contains the line of input. In line 9, we use the Java StringTokenizer to tokenize the line based on whitespace. The set of tokens is then looped over. Each token is extracted and wrapped in a Text object (line 11). To emit the key-value pair, we use the instance of Context that is passed to the map() method. In this case, we emit each token as the key and the value 1 as the value (line 12).

Four formal type parameters are also used to specify the input and output types of the reduce function. The input types must match the output types of the map function; therefore, in this example, the input types of the reduce function are Text and IntWritable. The output types of the reduce function are also defined as Text and IntWritable. The reduce() function receives a key (the word) and an associated list of counts. It loops through the list and calculates the sum (lines 22-25). Finally, it generates the final output key-value pair (line 26). This finishes the definition of the map and reduce functions. To make the job run, we also need to configure it: what the job's name is, which mapper and reducer the job uses, where it should get the input files for the job, where it should store the output, and so on. The configuration is set up via a Job object. The job's name is "wordcount" (line 33). We pass a class to the Job in the setJarByClass() method; the job uses this information to locate the JAR file containing our code. The input and output paths of the job are specified by calling the static methods addInputPath() on FileInputFormat and setOutputPath() on FileOutputFormat (lines 36-39). We then specify the map and reduce classes to use via setMapperClass() and setReducerClass(). The setOutputKeyClass() and setOutputValueClass() methods specify the output types for the map and reduce functions, which are often the same; if they differ, we use setMapOutputKeyClass() and setMapOutputValueClass() to control the map output types. The input and output formats are controlled through the setInputFormatClass() and setOutputFormatClass() methods. These classes specify how to parse the input files into key-value pairs for the map function and how to write the output key-value pairs from the reduce function to the output files. The method waitForCompletion() returns a boolean value indicating success (true) or failure (false), which we translate into an exit code (0 or 1).

2.3.2 Chaining MapReduce jobs
The above example illustrates a simple data processing task that a single MapReduce job can accomplish. However, there are more complex tasks that a single job cannot handle or has trouble processing. Such tasks may be split into simpler sub-tasks, each of which can be handled by a single MapReduce job. For example, we may want to know the most frequent words in a large document. A sequence of two MapReduce jobs can give us the answer: the first job calculates the number of occurrences of each word in the document, and the second job then filters out all but the words with a high number of occurrences.

Chaining jobs to run in sequence
We can chain MapReduce jobs to run in sequence, where the output of one job becomes the input of the next job. There are several approaches to do that, but the simplest is to use JobConf to configure each job and then pass the JobConf objects to JobClient.runJob() in the order of the sequence. JobClient.runJob() blocks until its job finishes; therefore, the jobs will be executed in the same order as the JobClient.runJob() calls. Assume that we want to chain three jobs A, B and C to run sequentially. First, we create JobConf objects for A, B and C. Then we configure all the parameters for each job. Finally, we submit them in sequence. Listing 2.4 shows the template for this.

Listing 2.4: Chaining jobs to run sequentially

// Create a JobConf for jobs A, B, C, which are implemented by the
// JobA, JobB and JobC classes respectively
JobConf jobA = new JobConf(new Configuration(), JobA.class);
JobConf jobB = new JobConf(new Configuration(), JobB.class);
JobConf jobC = new JobConf(new Configuration(), JobC.class);

// Configure parameters for jobA
jobA.setMapperClass(...);
jobA.setReducerClass(...);
...

// Configure parameters for jobB, jobC
...

// Submit the jobs in order
JobClient.runJob(jobA);
JobClient.runJob(jobB);
JobClient.runJob(jobC);

Chaining jobs with complex dependencies
When the jobs do not simply depend on each other sequentially, we have to use a more complex approach. Hadoop provides a mechanism for this via the Job and JobControl classes. A Job object represents a MapReduce job; we already know how to initialize it from the Word Count example. To specify a dependency between jobs, we use the addDependingJob() method on a Job object. For Job objects x and y, x.addDependingJob(y) indicates that job x only starts when job y finishes. Once all the dependencies between jobs are specified, their execution is managed and monitored by a JobControl object. Hence, we need to add the jobs to the JobControl object via its addJob() method; we then call run() on the JobControl object to submit the jobs for execution. Assume that we have complex dependencies between jobs, as shown in Figure 2.8. An arrow from node x to node y means that job y only runs after job x finishes. Listing 2.5 shows the template to specify these dependencies in Hadoop.

Figure 2.8: Complex dependencies between jobs

Listing 2.5: Chaining jobs with JobControl

// Create a JobControl
JobControl control = new JobControl("example");

// Initialize jobs
...

// Specify the dependencies between jobs
job2.addDependingJob(job1);
job3.addDependingJob(job2);
job4.addDependingJob(job2);
job3.addDependingJob(job4);

// Add jobs to the JobControl object
control.addJob(job1);
control.addJob(job2);
control.addJob(job3);
control.addJob(job4);
...

// Submit jobs
control.run();

2.3.3 Example Applications with Hadoop MapReduce
This section introduces several programs written with Hadoop MapReduce. The full source code can be found in Appendix section B.1.

Distributed Summation
The purpose of this application is to calculate the sum 1 + 2 + 3 + ... + 10^10. The distributed calculation implemented with Hadoop divides the task into ten sub-tasks, one for each map task. By the formula n(n + 1)/2 with n = 10^10, the result of the summation is 50000000005000000000 = 5 * 10^9 + 5 * 10^19. The implementation in Hadoop is pretty simple. In the map() function, the value that is passed in contains two integers (start and end) in text format, separated by a blank. These values are parsed into BigInteger objects (line 7-9). The mapper then calculates the sum from start to end and emits an empty string as the key and the sum as the value. The reducer thus gets the empty key and all the sum values associated with that key. It simply adds all these values to get the final result (line 21-24). The Java code is shown in Listing 2.6.

Listing 2.6: Mapper and Reducer for Summation example

1  class SumMap extends Mapper<LongWritable, Text, Text, BigIntegerWritable> {
2    Text word = new Text("");
3    protected void map(LongWritable key, Text value, Context context)
       throws IOException, InterruptedException {
4      String val = value.toString().trim();
5      if ("".equals(val))
6        return;
7      String[] temp = val.split(" ");
8      BigInteger start = new BigInteger(temp[0].trim());
9      BigInteger end = new BigInteger(temp[1].trim());
10     BigInteger sum = BigInteger.ZERO;
11     while (start.compareTo(end) <= 0) {
12       sum = sum.add(start);
13       start = start.add(BigInteger.ONE);
14     }
15     context.write(word, new BigIntegerWritable(sum));
16   }
17 }
18
19 class SumReduce extends Reducer<Text, BigIntegerWritable, Text, Text> {
20   protected void reduce(Text key, Iterable<BigIntegerWritable> values, Context context)
       throws IOException, InterruptedException {
21     BigInteger sum = BigInteger.ZERO;
22     for (BigIntegerWritable value : values) {
23       sum = sum.add(value.getValue());
24     }
25     context.write(new Text("Total sum: "), new Text(sum.toString()));
26   }
27 }

Graph Search
Graph theory is widely used today because of its many applications. There are several data structures for representing a graph; one way is to store it as an adjacency list, where each node (vertex of the graph) stores a list of adjacent nodes. Breadth-first search (BFS) is a common strategy for searching a graph. With this algorithm, many problems in graph theory can be solved, for example finding the shortest path in a graph or testing a graph for bipartiteness.
Given a graph represented as an adjacency list, BFS colors each node white, gray or black. All nodes start as white, meaning "not visited yet", then later become gray ("visiting") and finally black ("visited"). Breadth-first search constructs a tree starting from a source node s as follows. It colors s gray, initializes the distance d(s) = 0 and puts s into an empty queue q. Then it repeats until q is empty: take one node n from q, get all white adjacent nodes of n, color them gray, set their distance to d(n) + 1, put them into the queue q, and color n black.
A single-threaded approach to implementing BFS is good in many cases, but when working with a huge graph it shows its limitations; one obvious problem is the lack of memory. Therefore, a solution that runs this algorithm in parallel is essential. An implementation of BFS with MapReduce can be found at [21]. This site provides an interesting approach to implementing BFS with MapReduce. In this implementation, a node is represented as:

ID  ADJACENT_NODES | DISTANCE_FROM_SOURCE | COLOR

At the beginning, all nodes have distance Integer.MAX_VALUE, except the source node, which has distance 0. An example of a graph is:

1  2   | 0 | GRAY
2  1,3 | Integer.MAX_VALUE | WHITE
3  2   | Integer.MAX_VALUE | WHITE

The mapper gets all the gray nodes. For each gray node, it emits new gray nodes whose distance is the current distance + 1 and whose ID is an adjacent ID. The mapper also converts the current gray node into a black node and emits it. All non-gray nodes are emitted with no change (line 3-11, Listing 2.7):

1  2    | 0 | BLACK
2  NULL | 1 | GRAY
2  1,3  | Integer.MAX_VALUE | WHITE
3  2    | Integer.MAX_VALUE | WHITE

Because the map function only receives a single node each time it is called, it does not know the adjacent nodes of the new gray node, so the mapper leaves them as NULL. The reducer receives all the data for a given key, that is, all the copies of a given node. For example, for key = 2, the corresponding reducer gets:

2  NULL | 1 | GRAY
2  1,3  | Integer.MAX_VALUE | WHITE

The reducer merges this information to create a final node: same ID, non-null adjacent nodes, minimum distance, darkest color (line 22-36, Listing 2.7). So the output of the first iteration is:

1  2   | 0 | BLACK
2  1,3 | 1 | GRAY
3  2   | Integer.MAX_VALUE | WHITE

With this approach, if we view the graph as a tree with the start node as the root, one level of the tree is processed in each iteration. After several iterations, all the nodes in the tree have been traversed and the final output is produced. This approach requires some way to store information from the previous iteration, because the program needs to know whether the last iteration has reached the final result; if it has, the program should stop. Fortunately, Hadoop allows developers to create their own custom counters to keep information about jobs. We take advantage of a counter to store the number of gray nodes left after each iteration (line 43-44, Listing 2.7). In the main driver of the class, we check this counter and create a new job to process the last output if this value is greater than 0. Listing 2.8 shows the actual code in Java. The full program for this problem can be found in the Appendix.

Listing 2.7: Mapper and Reducer for Graph example

1  protected void map(LongWritable key, Text value, Context context)
     throws IOException, InterruptedException {
2    Node node = new Node(value.toString());
3    if (node.getColor() == Node.Color.GRAY) {
4      String edges = node.getEdges();
5      if (edges != null && !"NULL".equals(edges)) {
6        for (String v : edges.split(",")) {
7          context.write(new Text(v), new Text("NULL|" + (node.getDistance() + 1) + "|GRAY"));
8        }
9      }
10     node.setColor(Node.Color.BLACK);
11   }
12
13   context.write(new Text(node.getId()), node.getLine());
14 }
15
16 protected void reduce(Text key, Iterable<Text> values,
17     Context context) throws IOException, InterruptedException {
18   String edges = "NULL";
19   int distance = Integer.MAX_VALUE;
20   Node.Color color = Node.Color.WHITE;
21
22   for (Text value : values) {
23     Node u = new Node();
24     u.updateInfo(value.toString());
25     if (!"NULL".equals(u.getEdges())) {
26       edges = u.getEdges();
27     }
28
29     if (u.getDistance() < distance) {
30       distance = u.getDistance();
31     }
32
33     if (u.getColor().ordinal() > color.ordinal()) {
34       color = u.getColor();
35     }
36   }
37
38   Node n = new Node();
39   n.setDistance(distance);
40   n.setEdges(edges);
41   n.setColor(color);
42   context.write(key, new Text(n.getLine()));
43   if (color == Node.Color.GRAY)
44     context.getCounter(counter_enum.ITERATION).increment(1L);
45 }

Listing 2.8: Iterative jobs for Graph example

1  int iterationCount = 0;
2  long terminateValue = 1;
3  while (terminateValue > 0) {
4    String input;
5    if (iterationCount == 0)
6      input = "/input-graph";
7    else
8      input = "/output/output-graph-" + iterationCount;
9
10   String output = "/output/output-graph-" + (iterationCount + 1);
11
12   Job job = getJobConf(args);
13   FileInputFormat.setInputPaths(job, new Path(input));
14   FileOutputFormat.setOutputPath(job, new Path(output));
15
16   job.waitForCompletion(true);
17
18   Counters jobCnt = job.getCounters();
19   terminateValue = jobCnt.findCounter(counter_enum.ITERATION).getValue();
20   iterationCount++;
21 }

N-Queens
The N-Queens problem can also be solved with iterative MapReduce. Here, we combine breadth-first search and depth-first search: the first five jobs find all possible ways of placing the first five queens on the board (breadth-first search, jobs 1-5), and the last job then finds all complete solutions for each board with the first five queens placed (depth-first search, job 6). The code snippet can be found in the Appendix.

Listing 2.9: Mapper and Reducer for first five jobs of N-Queens

1  protected void map(LongWritable key, Text value, Context context)
     throws IOException, InterruptedException {
2    String val = value.toString().trim();
3    if ("START".equals(val)) {
4      for (int i = 0; i < n; i++) {
5        context.write(new IntWritable(1), new Text(String.valueOf(i)));
6      }
7      return;
8    }
9
10   String[] map = val.split("-");
11   int[] board = new int[map.length];
12   for (int i = 0; i < map.length; i++) {
13     board[i] = Integer.parseInt(map[i]);
14   }
15   Random r = new Random();
16
17   for (int row = 0; row < n; row++) {
18     if (safe(board, row)) {
19       context.write(new IntWritable(r.nextInt(10)), new Text(val + "-" + row));
20     }
21   }
22 }

23 protected void reduce(IntWritable key, Iterable<Text> values, Context context)
     throws IOException, InterruptedException {
24   for (Text val : values) {
25     context.write(new Text(""), val);
26   }
27 }

Listing 2.9 shows the code for the map() and reduce() methods of the first five jobs. The input of the first job just contains a "START" string; this job simply generates all possible rows for the queen in the first column (line 3-8). The current board is stored as a string such as "1-3-5", meaning that the queens in the first, second and third columns are placed in rows 1, 3 and 5 respectively. Therefore, the next four jobs need to parse this value back into a board array in the map step (line 10-14). The mapper then checks every row to find all valid rows for a queen in the next column (line 17-21). The reducer just emits all the values it receives with an empty key.

Listing 2.10: Mapper for the last job of N-Queens

1  // parse the board
2  ...
3
4  Random r = new Random();
5  // 'len' is the number of queens already placed on the board
6  int column = len; // column 'len' + 1
7  board[column] = -1;
8  while (column >= len) {
9    int row = -1;
10   do {
11     row = board[column];
12     board[column] = row = row + 1;
13   } while (row < n && !safe(board, column));
14   if (row < n) {
15     if (column < n - 1) {
16       board[++column] = -1;
17     } else { // found a complete board
18       String s = String.valueOf(board[0]);
19       for (int i = 1; i < n; i++) {
20         s += "-" + board[i];
21       }
22       context.write(new IntWritable(r.nextInt(10)), new Text(s));
23     }
24   } else {
25     column--;
26   }
27 }

The mapper for the last job is shown in Listing 2.10. From the current board, the mapper finds all valid ways to complete the board using depth-first search (line 8-27). Each time it finds a complete board, it emits the result (line 22). The last job uses the same reducer as the previous jobs.

2.4 Hadoop Extensions for Non-Java Languages

Hadoop is built on Java. Hence, writing Hadoop MapReduce programs normally requires Java to implement the algorithm. To support developers who do not want to use Java, Hadoop provides two extensions: Hadoop Streaming, for any language that can read from standard input and write to standard output, and Hadoop Pipes, for C++.

Hadoop Streaming
Hadoop Streaming uses Unix pipes as the channel of communication between Hadoop and the user program. Therefore, it allows users to write MapReduce programs that work with a Hadoop cluster in any programming language that can read standard input and write to standard output. Users provide the map and reduce implementations through executables compiled from their language of choice. These executables read key-value pairs from standard input and write them to standard output. Data communicated between the Hadoop framework and the executables follows a text protocol: records are serialized as strings of bytes. As a result, Hadoop Streaming is best suited for text processing. The Word Count program in Python with Hadoop Streaming is shown in Listings 2.11 and 2.12.

Listing 2.11: mapper.py

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)

Listing 2.12: reducer.py

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

if current_word == word:
    print '%s\t%s' % (current_word, current_count)

Hadoop Pipes

Figure 2.9: Hadoop Pipes data flows

Hadoop Pipes provides a C++ interface to Hadoop MapReduce. Hadoop Pipes uses sockets as the channel of communication between the Hadoop framework and the processes running the C++ map or reduce functions. The data flows are shown in Figure 2.9. Hadoop uses Pipes tasks to communicate with the user-provided executables via persistent sockets.

The C++ application provides a factory class used by the framework to create the MapReduce components (Mapper, Reducer, RecordReader, Partitioner, ...). When starting a Pipes application, the framework sets up the PipeMapRunner and PipeReducer and locates the user-provided C++ executable. When the map phase starts, the framework automatically calls the PipeMapRunner and passes the input split to it. PipeMapRunner then creates a socket server to listen for C++ clients. When a socket connection is established, PipeMapRunner directs the client to process the map task and handles the data returned from the client via the socket. The process is similar for PipeReducer and reduce tasks. With Hadoop Pipes, programmers can access and customize various MapReduce components of Hadoop. Furthermore, the application is able to process any type of input data.
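To make the Pipes programming model concrete, the sketch below outlines a word count written against the C++ Pipes API, in the style of the well-known Pipes word count example. It is illustrative only: the exact header names and build settings may differ between Hadoop versions, and the thesis's own C++ code is not shown here.

#include <string>
#include <vector>
#include "Pipes.hh"
#include "TemplateFactory.hh"
#include "StringUtils.hh"

// Emits (word, "1") for every whitespace-separated token in the input value.
class WordCountMapper : public HadoopPipes::Mapper {
public:
  WordCountMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> words =
        HadoopUtils::splitString(context.getInputValue(), " ");
    for (size_t i = 0; i < words.size(); i++) {
      context.emit(words[i], "1");
    }
  }
};

// Sums the counts received for each word and emits the total.
class WordCountReducer : public HadoopPipes::Reducer {
public:
  WordCountReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char* argv[]) {
  // runTask connects back to the Java Pipes task over a socket and drives
  // the mapper and reducer created by the factory.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}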

Chapter 3

Message Passing Interface - MPI

3.1 What is MPI

MPI is a library that makes the work of programming parallel computers easier. It is an attempt to collect the best features of the message-passing systems that have been developed over the years. There are several fundamentals of MPI that should be noted:

• MPI is a library, not a language. It specifies the names, calling sequences and results of the routines to be called. The programs can be written in Fortran, C or C++ and are compiled with ordinary compilers and linked with the MPI library.

• MPI is a specification, not an implementation. Different parallel computer vendors offer their own MPI implementations for their machines. Well-known implementations are MPICH, Open MPI and LAM/MPI.

3.2 Basic Concepts

This section introduces the basic concepts that need to be understood to begin with MPI. The points presented here are mainly taken from the book "An Introduction to Parallel Programming" [26]. The rank is the common way to identify a process in parallel programming. It is a non-negative number: if there are p processes, they will have ranks 0, 1, 2, ..., p-1, respectively.

3.2.1 MPI Program
Listing 3.1 shows a simple MPI program that sends and receives messages.

Listing 3.1: A simple MPI send/recv program

1  #include <mpi.h>    /* For MPI functions, etc. */
2  #include <stdio.h>
3
4  int main(int argc, char** argv) {
5    MPI_Init(NULL, NULL);
6    int my_rank;    /* My process rank     */
7    int comm_size;  /* Number of processes */
8    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
9    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
10
11   int number;
12   if (my_rank != 0) {
13     number = my_rank + 5;
14     MPI_Send(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
15   } else {
16     for (int i = 1; i < comm_size; i++) {
17       MPI_Recv(&number, 1, MPI_INT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
18       printf("Process 0 received number %d from process %d.\n", number, i);
19     }
20   }
21   MPI_Finalize();
22   return 0;
23 }

If we compile this program and run it with two processes, the output would be

Process 0 received number 6 from process 1.

With five processes, the output would be

Process 0 received number 6 from process 1.
Process 0 received number 7 from process 2.
Process 0 received number 8 from process 3.
Process 0 received number 9 from process 4.

Let's take a closer look at the program. It starts with the include declarations. To use the MPI library in a C program, it is required to include the mpi.h header file. This header contains the macro definitions, type definitions and procedures of MPI. All of these start with the string MPI_.

3.2.2 MPI Init and MPI Finalize
MPI_Init tells the system to set up all the necessary state, such as allocating storage for message buffers (line 5). No other MPI functions should be called before MPI_Init. Its syntax is

int MPI_Init(int* argc_pointer, char*** argv_pointer);

The arguments argc_pointer and argv_pointer are pointers to the arguments of the main() function, argc and argv. Our program doesn't use these arguments, so we pass NULL for both. MPI_Finalize tells the system that the program is done using MPI, so that the system can free its resources (line 21). No other MPI functions should be called after MPI_Finalize.

3.2.3 Communicators
A communicator in MPI is a collection of processes that can send messages to each other. The communicator called MPI_COMM_WORLD contains all of the processes started by the user. Information about a communicator can be retrieved through two functions:

int MPI_Comm_size(MPI_Comm communication, int* communication_size_pointer);
int MPI_Comm_rank(MPI_Comm communication, int* my_rank_pointer);

The first argument of both functions is a communicator, which has the MPI type MPI_Comm. The number of processes in the communicator is returned in the second argument, communication_size_pointer, of the first function. The rank of the calling process in the communicator is returned in my_rank_pointer by the second function. Lines 8 and 9 get information about MPI_COMM_WORLD: its size and the rank of the calling process.

3.2.4 Communication
In the message-passing model, processes executing in parallel have separate address spaces. Communication happens when a portion of one process's address space is copied into another process's address space. This occurs only when one process executes a send operation and the other executes a receive operation. To send a message, MPI_Send is called. The syntax is

int MPI_Send(
    void*        msg_buf_p    /* in */,
    int          msg_size     /* in */,
    MPI_Datatype msg_type     /* in */,
    int          dest         /* in */,
    int          tag          /* in */,
    MPI_Comm     communicator /* in */);

The first three arguments define the content of the message; the remaining arguments define its destination. The first argument is a pointer to the block of memory containing the contents of the message. The second and third define the amount and type of data to be sent. The fourth argument is the rank of the process that receives this message. The fifth argument is a non-negative number used to distinguish messages. The final argument is the communicator; it specifies the communication universe: a message sent by one process can only be received by a process in the same communicator. To receive a message, MPI_Recv is called. The syntax is:

int MPI_Recv(
    void*        msg_buf_p    /* out */,
    int          buf_size     /* in  */,
    MPI_Datatype buf_type     /* in  */,
    int          source       /* in  */,
    int          tag          /* in  */,
    MPI_Comm     communicator /* in  */,
    MPI_Status*  status_p     /* out */);

The first three arguments specify the memory available for receiving the message: msg_buf_p points to the block of memory, buf_size is the available size of the block and buf_type indicates the type of the objects that the message contains. The next three arguments identify the message: source specifies the process that sends the wanted message, and the tag and the communicator must match the tag and communicator specified by the sending process. The last argument, status_p, contains information about the sending process, the tag of the message and the amount of data in the message. In our program, each process other than process 0 creates a number (line 13) and sends it to process 0 (line 14). Process 0, on the other hand, receives the messages from the other processes and prints them to the console (line 16-19).

3.2.5 Collective Communication
MPI_Send and MPI_Recv allow processes to communicate with each other. These communications are often called point-to-point communications because we need to specify the sending and receiving processes explicitly. There are situations where it is difficult to develop efficient programs with only these send and receive operations. For example, suppose we want to calculate a global sum after each process has computed its partial value. The least efficient way is for each process with rank greater than 0 to send its value to process 0, which then calculates the sum of these values. We can instead be more efficient by using a "tree structure" to calculate the sum, as illustrated in Figure 3.1. In the first step, processes 1, 3 and 5 send their values to processes 0, 2 and 4, respectively. Processes 0, 2 and 4 then compute the sum of their own value and the received value. The subsequent steps are quite similar: processes 2 and 6 send their values to processes 0 and 4, respectively, and processes 0 and 4 add the received values to their own. Finally, process 4 sends its value to process 0, which computes the final sum. This solution distributes the task of computing the sum among processes 0, 2 and 4. Therefore, it reduces the work of process 0 from 6 additions to 3 additions. Moreover, it increases the degree of parallelization; for example, in the first step, the additions in processes 0, 2 and 4 can be performed simultaneously. A sketch of such a tree-structured reduction with explicit point-to-point calls is shown after Figure 3.1.

Figure 3.1: A tree-structured global sum
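The following sketch implements the tree-structured global sum of Figure 3.1 with explicit MPI_Send/MPI_Recv calls. It is an illustrative example written for this discussion, not code from the thesis; each process contributes the value my_rank + 1, and the final sum ends up on process 0.

/* A tree-structured global sum using only point-to-point messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);
    int my_rank, comm_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

    int partial = my_rank + 1;   /* each process's local value (example) */

    /* At step s, processes whose rank is an odd multiple of s send their
       partial sum to the partner s ranks below and drop out; the others
       receive from the partner s ranks above and accumulate. */
    for (int step = 1; step < comm_size; step *= 2) {
        if (my_rank % (2 * step) != 0) {
            MPI_Send(&partial, 1, MPI_INT, my_rank - step, 0, MPI_COMM_WORLD);
            break;
        } else if (my_rank + step < comm_size) {
            int received;
            MPI_Recv(&received, 1, MPI_INT, my_rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            partial += received;
        }
    }

    if (my_rank == 0)
        printf("Global sum = %d\n", partial);  /* process 0 holds the total */
    MPI_Finalize();
    return 0;
}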

There are other possibilities for computing the sum with a "tree structure", and it is hard to decide which option is best suited for a specific case (for example, computing the sum of one hundred numbers versus one million numbers). MPI therefore provides functions that simplify tasks requiring the communication of more than two processes. In MPI, a communication that involves all the processes in a communicator is called a collective communication. The global sum is just one case of collective communication; other cases include finding the maximum value, the minimum value, or the product of values. All of these tasks can be achieved with a single function in MPI, MPI_Reduce:

int MPI_Reduce(
    void*        input_data_pointer  /* in  */,
    void*        output_data_pointer /* out */,
    int          count               /* in  */,
    MPI_Datatype datatype            /* in  */,
    MPI_Op       operator            /* in  */,
    int          dest_process        /* in  */,
    MPI_Comm     communicator        /* in  */);

The most important argument of this function is the fifth argument, operator. It defines the operation that is applied to the input values to get the final output value. There are several pre-defined operators in MPI; Table 3.1 lists them. MPI also allows developers to define their own operators. For the global sum, the operator is MPI_SUM. If each process keeps its value in the variable double local and we want to store the global sum in the variable double total, the sum can be calculated by calling the following function in each process:

MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

Operator      Meaning
MPI_MAX       Maximum
MPI_MIN       Minimum
MPI_SUM       Sum
MPI_PROD      Product
MPI_LAND      Logical and
MPI_BAND      Bitwise and
MPI_LOR       Logical or
MPI_BOR       Bitwise or
MPI_LXOR      Logical exclusive or
MPI_BXOR      Bitwise exclusive or
MPI_MAXLOC    Maximum and location of maximum
MPI_MINLOC    Minimum and location of minimum

Table 3.1: Predefined Operators in MPI

MPI also provides a variant of MPI_Reduce that supports collective communication but allows all the processes in the communicator to receive the final result: MPI_Allreduce.

int MPI_Allreduce(
    void*        input_data_pointer  /* in  */,
    void*        output_data_pointer /* out */,
    int          count               /* in  */,
    MPI_Datatype datatype            /* in  */,
    MPI_Op       operator            /* in  */,
    MPI_Comm     communicator        /* in  */);

The arguments are identical except that no destination process is required.
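For comparison with the MPI_Reduce call above, the corresponding call for the global sum would look as follows (illustrative, reusing the same local and total variables):

/* every process, not just process 0, receives the global sum in 'total' */
MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);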

3.3 Sample Programs in MPI

To understand MPI, we write several programs in C++ using the Open MPI library, an implementation of MPI, to solve different problems. These problems are the same as the ones described in section 2.3; they are already implemented with Hadoop MapReduce, and now we will see how to implement them in MPI. These MPI programs will be used in the experiments reported in Chapter 5. The complete programs can be found in Appendix section B.3.

3.3.1 Counting words
The first step in writing this program is to define the parallel paradigm. In my implementation, one process reads the data and distributes it to the other processes. Each process computes the word count on its part of the data and reports the result to the master process, where the final output is computed. The next step is to implement this paradigm in actual code.

Listing 3.2: Read and distribute data

1  if (rank == 0) {
2    // assume the number of lines is equal to the number of characters / 10
3    pLineStartIndex = new long long int[buff_size / 10];
4    pLineStartIndex[0] = 0;
5    pszFileBuffer = readFile(file, total_size);
6  }
7
8  nTotalLines = 0;
9  if (rank == 0) {
10   // calculate the number of lines in this chunk, store it in nTotalLines
11   // calculate the start index of each line in pszFileBuffer, store it in
12   // the pLineStartIndex array
13   // pLineStartIndex[1] stores the start index of the second line.
14 }
15
16 if (rank == 0) {
17   // get its part of data
18   ...
19
20   for (int i = 1; i < nTasks; i++) {
21     // calculate the data for each process.
22     ...
23
24     // we need to send each process the number of characters it will be receiving.
25     // curLength: the number of bytes of data.
26     MPI_Send(&curLength, 1, MPI_INT, i, 1, MPI_COMM_WORLD);
27     if (curLength > 0)
28       MPI_Send(pszFileBuffer + pLineStartIndex[curStartNum], curLength,
                MPI_CHAR, i, 2, MPI_COMM_WORLD);
29   }
30
31 } else {
32   MPI_Status status;
33   MPI_Recv(&totalChars, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
34   if (totalChars > 0) {
35     buffer = new char[totalChars + 1];
36     MPI_Recv(buffer, totalChars, MPI_CHAR, 0, 2, MPI_COMM_WORLD, &status);
37     buffer[totalChars] = '\0';
38   }
39 }

Listing 3.2 shows a part of the code that reads and distributes the input data. Process 0 (the master process) reads the file from disk, chunk by chunk (line 1-6). It then computes the appropriate part of the data for each other process and sends it by calling two MPI_Send functions per target process: one specifies the length of the data, the other contains the actual data (line 17-29). Lines 32-37 show the code for receiving data in the processes with rank greater than 0.

Listing 3.3: Compute the word count on each part of data

1  map<string, int> wordcount;
2  int size = 0;
3  WordStruct* words = NULL;
4  if (buffer != NULL) {
5    char* word = strtok(buffer, " ,.\r\n");
6    while (word != NULL) {
7      if (wordcount.find(word) == wordcount.end())
8        wordcount[word] = 1;
9      else
10       wordcount[word]++;
11     word = strtok(NULL, " ,.\r\n");
12   }
13   delete []buffer;
14   size = wordcount.size();
15   if (size > 0) {
16     words = new WordStruct[size];
17     int i = 0;
18     for (map<string, int>::iterator it = wordcount.begin(); it != wordcount.end(); it++) {
19       strcpy(words[i].word, (it->first).c_str());
20       words[i].count = it->second;
21       i++;
22     }
23   }
24 }

After all the processes have received their data, the actual work starts: parsing the words from the data and counting the occurrences of each word, as shown in Listing 3.3. The result is then sent back to the master process by calling MPI_Send functions. The final task of the master process is to aggregate the results and store them in the output file, as shown in Listing 3.4.

Listing 3.4: Aggregate the result

1  // aggregate the results from the other processes
2  for (int i = 1; i < nTasks; i++) {
3    int sz;
4    MPI_Recv(&sz, 1, MPI_INT, i, 3, MPI_COMM_WORLD, &status);
5    if (sz > 0) {
6      WordStruct* local_words = new WordStruct[sz];
7      MPI_Recv(local_words, sz, obj_type, i, 4, MPI_COMM_WORLD, &status);
8
9      for (int j = 0; j < sz; j++) {
10       totalwordcount[local_words[j].word] += local_words[j].count;
11     }
12     delete []local_words;
13   }
14 }
15
16 // aggregate with the local result
17 for (map<string, int>::iterator it = wordcount.begin(); it != wordcount.end(); it++) {
18   totalwordcount[it->first] += it->second;
19 }
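The MPI_Recv on line 7 of Listing 3.4 uses a custom MPI datatype, obj_type, for the WordStruct records. The thesis excerpt does not show how this datatype is built; a minimal sketch, assuming WordStruct holds a fixed-size character buffer and an integer count, could look like this:

#include <mpi.h>
#include <cstddef>   // offsetof

// Assumed layout of the record exchanged between processes.
struct WordStruct {
    char word[64];
    int  count;
};

// Builds an MPI datatype describing WordStruct so that arrays of it can be
// sent and received directly with MPI_Send/MPI_Recv.
MPI_Datatype make_wordstruct_type() {
    MPI_Datatype obj_type;
    int          block_lengths[2] = {64, 1};
    MPI_Aint     displacements[2] = {offsetof(WordStruct, word),
                                     offsetof(WordStruct, count)};
    MPI_Datatype types[2]         = {MPI_CHAR, MPI_INT};
    MPI_Type_create_struct(2, block_lengths, displacements, types, &obj_type);
    MPI_Type_commit(&obj_type);
    return obj_type;
}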

3.3.2 Summation

The first aspect that needs to be handled in this problem is that the primitive types of C++ cannot handle very big numbers. Therefore, we must use another library for this. In this program, the GMP library (the GNU Multiple Precision Arithmetic Library1) is used. This is a free library which provides very good performance for arbitrary precision arithmetic. The approach is that each process calculates the sum of a group of numbers. For example, assuming that there are 10 processes in total, process 0 calculates the sum of the numbers from 1 to 10^9 - 1, process 1 calculates the sum of the numbers from 10^9 to 2 * 10^9 - 1, and so on. Each process then sends its result back to process 0, which sums the partial results to get the final result. Collective communication is not used because MPI_SUM does not work with the special types of GMP.

Listing 3.5: Global sum with MPI

1  // initialize the parameters
2  mpz_init(sum);
3  mpz_t i;
4  mpz_init_set(i, start);
5  while (mpz_cmp(i, end) <= 0) {
6    mpz_add(sum, sum, i);    // sum = sum + i
7    mpz_add_ui(i, i, 1);
8  }
9
10 if (rank == 0) {
11   // The master process receives the computed sums from all other processes.
12   MPI_Status status;
13   for (int i = 1; i < size; i++) {
14     int sz;
15     MPI_Recv(&sz, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &status);
16     char* value = new char[sz + 1];
17     MPI_Recv(value, sz + 1, MPI_CHAR, i, 2, MPI_COMM_WORLD, &status);
18     mpz_t pSum;
19     mpz_init_set_str(pSum, value, 10);
20
21     mpz_add(sum, sum, pSum);    // sum += pSum
22   }
23 } else {
24   // The destination is process 0.
25   char* t = NULL;
26   t = mpz_get_str(NULL, 10, sum);   // GMP allocates the result string
27   int len = strlen(t);
28   MPI_Send(&len, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
29   MPI_Send(t, len + 1, MPI_CHAR, 0, 2, MPI_COMM_WORLD);
30 }

Listing 3.5 is a partial view of the global sum in MPI. Lines 5-8 compute the sum of a group of numbers. The sum is then converted to a string and sent to process 0 (line 25-29). Finally, process 0 receives the values, converts them back to GMP numbers and calculates the final sum (line 12-22).

1 https://gmplib.org/

3.3.3 N-Queens
The N-Queens problem is solved in MPI with a master-slave architecture. The master process manages all the boards that need to be solved. The worker (slave) processes actively request boards to solve from the master process whenever they are free (have nothing to do). A worker stops when there is nothing left to solve.

Listing 3.6: The worker of N-Queen with MPI

1  while (true) {
2    MPI_Send((void*) &REQUEST, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD);
3    MPI_Recv(&msg, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
4    if (msg == NO_MORE_WORK) break;
5
6    int newSize;
7    MPI_Recv(&size, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
8    if (size < 5)
9      newSize = size + 1;
10   else newSize = board_size;
11
12   board = new int[newSize];
13   if (size > 0)
14     MPI_Recv(board, size, MPI_INT, 0, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
15
16   // add a new queen, breadth-first search
17   if (size < 5) {
18     for (int row = 0; row < board_size; row++) {
19       if (safe(size, board, row)) {
20         board[size] = row;
21         if (newSize == board_size) { // found a complete solution
22           // Store the board to a local file
23           ...
24         } else {
25           MPI_Send((void*) &NEW, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD);
26           MPI_Send(&newSize, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD);
27           MPI_Send(board, newSize, MPI_INT, 0, DATA, MPI_COMM_WORLD);
28         }
29       }
30     }
31   } else { // depth-first search: find the complete board
32     int column = size; // column 'size' + 1
33     board[column] = -1;
34     while (column >= size) {
35       int row = -1;
36       do {
37         row = board[column] = board[column] + 1;
38       } while (row < board_size && !safeAtColumn(board, column));
39       if (row < board_size) {
40         if (column < board_size - 1) {
41           board[++column] = -1;
42         } else { // found the board
43           // store the board to a local file
44           ...
45         }
46       } else {
47         column--;
48       }
49     }
50
51   }
52   delete []board;
53 }

Listing 3.6 shows a partial view of the worker processes. A worker requests a board from the master process (line 2-14). If the board has fewer than five queens, it finds all possible positions for a queen in the next column (line 17-31); each resulting board is reported to the master process through a message of type NEW (line 25).
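The listings call helper functions safe() and safeAtColumn() that are not shown in this excerpt. Purely as an illustration (an assumption about how such a check might look, not the thesis's actual helper), a column-based safety test could be written as:

#include <cstdlib>   // abs

// board[i] holds the row of the queen placed in column i.
// Returns true if the queen in 'column' does not attack any earlier queen.
bool safeAtColumn(const int* board, int column) {
    for (int i = 0; i < column; i++) {
        if (board[i] == board[column])                    // same row
            return false;
        if (abs(board[i] - board[column]) == column - i)  // same diagonal
            return false;
    }
    return true;
}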

Listing 3.7: The master of N-Queen with MPI

1  while (listen) {
2    MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, DATA, MPI_COMM_WORLD, &status);
3    workerid = status.MPI_SOURCE;
4    if (msg == REQUEST) {
5      remaining = workQueue.size();
6      if (remaining > 0) {
7        iboard b = workQueue.front();
8        MPI_Send((void*) &NEW, 1, MPI_INT, workerid, DATA, MPI_COMM_WORLD);
9        MPI_Send(&b.size, 1, MPI_INT, workerid, DATA, MPI_COMM_WORLD);
10       if (b.size > 0)
11         MPI_Send(b.board, b.size, MPI_INT, workerid, DATA, MPI_COMM_WORLD);
12       workQueue.pop();
13       delete []b.board;
14     } else {
15       // store the free worker, then assign work to it later
16       freeWorker.push(workerid);
17       if (freeWorker.size() == nWorker) // all workers are free
18         listen = false;
19     }
20   } else if (msg == NEW) {
21     MPI_Recv(&size, 1, MPI_INT, workerid, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
22     board = new int[size];
23     MPI_Recv(board, size, MPI_INT, workerid, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
24     if (size < board_size) {
25       // send this new work to a free worker
26       if (freeWorker.size() > 0) {
27         int id = freeWorker.front();
28         MPI_Send((void*) &NEW, 1, MPI_INT, id, DATA, MPI_COMM_WORLD);
29         MPI_Send(&size, 1, MPI_INT, id, DATA, MPI_COMM_WORLD);
30         MPI_Send(board, size, MPI_INT, id, DATA, MPI_COMM_WORLD);
31         delete []board;
32         freeWorker.pop();
33       } else { // store the work and assign it when requested
34         iboard b;
35         b.size = size;
36         b.board = board;
37         workQueue.push(b);
38       }
39     }
40   }
41 }
42 for (int i = 1; i <= nWorker; i++)
43   // stop signal to worker
44   MPI_Send((void *) &NO_MORE_WORK, 1, MPI_INT, i, DATA, MPI_COMM_WORLD);

The master process keeps a queue, workQueue, to store all the boards that need to be solved (Listing 3.7). Each time a process contacts it to get a new job, it pops a board and sends it to that process (line 2-14). If there is no board left, it marks that process as free so that work can be assigned to it later, as soon as new boards arrive (line 15-18). If all other processes are marked as free, the master knows that there is no chance of getting new boards; hence, the process should stop (listen = false), and it sends the stop signal to the other processes (line 42-44). The new boards received from the worker processes (via NEW messages) are pushed onto the workQueue (line 34-37). Before doing so, the master process checks whether there is a free worker process; if so, the new board is sent immediately to a free process (line 26-33).

3.3.4 Graph Search
The approach to solving this problem is the same as described in the previous chapter, section 2.3.3. In the MPI implementation, process 0 (the master process) reads the input file and distributes it to the other processes; each process gets a part of the file containing a subset of the nodes in the graph. Each iteration is separated into four steps. In the first step, the adjacency list of each node is parsed and new nodes are generated for each GRAY node (line 4-23). A hash function is used to determine which process each node should be sent to; the function storeNewNode adds the new node to the list for that process. If the total number of nodes (the new nodes and the nodes with a color other than GRAY) held by the running process exceeds 1.5 million, a small intermediate step is conducted to reduce the number of nodes with the reduce function (line 25-33). This function simply aggregates the values of all the nodes with the same ID into one node with the minimal distance, the darkest color and the known list of adjacent nodes. Listing 3.8 shows the code.

Listing 3.8: MPI - Graph Search (step 1): Generate new nodes from GRAY node

1  while (newLoop) {
2    newLoop = false;
3    // Step 1: map data
4    for (int i = 0; i < size - 1; i++)
5      other_data[i].clear();
6    int pack = 0;
7    for (int i = 0; i < input.size(); i++) {
8      vector<string> node = parseNode(input[i].c_str());
9      if (node[3].compare("GRAY") == 0) {
10       vector<string> edges = split(node[1].c_str(), ',');
11       for (int i = 0; i < edges.size(); i++) {
12         string new_node;
13         string newDist;
14         increaseDist(newDist, node[2]);
15         createNode(new_node, edges[i], "NULL", newDist, "GRAY");
16         storeNewNode(size, edges[i], new_node, other_data_tmp);
17       }
18       string oldNode;
19       createNode(oldNode, node[0], node[1], node[2], "BLACK");
20       storeNewNode(size, node[0], oldNode, other_data_tmp);
21       pack += edges.size();
22     } else
23       storeNewNode(size, node[0], input[i], other_data_tmp);
24     pack++;
25     if (pack >= 1500000) {
26       pack = 0;
27       for (int i = 0; i < size; i++) {
28         other_data_tmp[i].insert(other_data_tmp[i].end(), other_data[i].begin(), other_data[i].end());
29         other_data[i].clear();
30         reduce(other_data_tmp[i], other_data[i]);
31         other_data_tmp[i].clear();
32       }
33     }
34   }
35   ...
36 }

In the second step, data is re-distributed between the processes, as shown in Listing 3.9. The purpose of this is to have all the nodes with the same ID sent to the same process. We then sort the data by node ID, grouping the nodes by ID so that the aggregation step can be done correctly.

Listing 3.9: MPI - Graph Search (step 2,3): Gathering data and sorting

1  while (newLoop) {
2    ...
3    // Step 2: send data to the corresponding processes
4    for (int i = 0; i < size; i++) {
5      reduce(other_data_tmp[i], other_data[i]);
6      other_data_tmp[i].clear();
7    }
8    // in each phase, one process sends, the others receive.
9    for (int phase = 0; phase < size; phase++) {
10     if (phase == rank) { // send
11       for (int i = 0; i < size; i++) {
12         if (i == rank)
13           continue;
14         sendData(i, other_data[i], other_data[i].size());
15         other_data[i].clear();
16       }
17     } else { // receive
18       receive_resp(phase, other_data[rank]);
19     }
20   }
21
22   // Step 3: Sort data
23   sort(other_data[rank]);
24 }

In the last step, we aggregate all the nodes with the same ID to create the final node of this iteration (line 4-44). We then update the loop condition: the loop continues if any worker process has generated new GRAY nodes (line 45); otherwise, it terminates. If the loop continues, a new iteration is started in the same manner. A partial view of this step is shown in Listing 3.10.

Listing 3.10: MPI - Graph Search (step 4): Aggregate the data

1  while (newLoop) {
2    ...
3    // Step 4: find the smallest distance
4    input.clear();
5    string previous("NULL");
6    string edges("NULL");
7    string distance("Integer.MAX_VALUE");
8    string color("WHITE");
9    bool hasGray = false;
10   for (int k = 0; k < other_data[rank].size(); k++) {
11     vector<string> node = parseNode(other_data[rank][k].c_str());
12     string node_id = node[0];
13
14     if (node_id.compare(previous) != 0) {
15       if (k > 0) {
16         string nnode;
17         createNode(nnode, previous, edges, distance, color);
18         input.push_back(nnode);
19
20         if (color.compare("GRAY") == 0)
21           hasGray = true;
22       }
23       previous = node_id;
24       edges = "NULL";
25       distance = "Integer.MAX_VALUE";
26       color = "WHITE";
27     }
28
29     if (node[1].compare("NULL") != 0) {
30       edges = node[1];
31     }
32     minDist(distance, distance, node[2]);
33     maxColor(color, color, node[3]);
34   }
35   if (other_data[rank].size() > 0) {
36     string nnode;
37     createNode(nnode, previous, edges, distance, color);
38     input.push_back(nnode);
39   }
40
41   other_data[rank].clear();
42
43   if (color.compare("GRAY") == 0)
44     hasGray = true;
45   MPI_Allreduce(&hasGray, &newLoop, 1, MPI_BYTE, MPI_BOR, MPI_COMM_WORLD);
46 }

Chapter 4

Standard ML API for Hadoop

Standard ML (SML) is a functional programming language with compile-time type checking. It supports polymorphism and type inference, which free the developer from specifying the types of variables. Like in all functional programming languages, the function is the main feature of Standard ML. Programming with recursive functions and symbolic data structures is common in SML thanks to the pattern-matching feature of function definitions. SML also provides techniques to structure large programs: module hierarchies, enforced abstraction and generic interfaces. There are different compilers for compiling a program written in SML. Standard ML of New Jersey (SML/NJ) is a full compiler with associated tools and libraries; it supports an interactive shell. MLton is a whole-program optimizing SML compiler that produces very fast executables compared to other SML implementations; the executables also have smaller sizes. Furthermore, MLton supports communication between C and SML via a special feature. Moscow ML is a light-weight implementation which implements the full SML language. This chapter introduces basic knowledge about SML. It also analyzes the advantages and disadvantages of some current large-scale solutions for SML. The approach of providing a new library for SML for large-scale problems is then described. This library is named MLoop. Its architecture and implementation are reported in Section 4.4. Finally, the last section illustrates how to solve several problems with MLoop.

4.1 Programming in Standard ML

4.1.1 Types, Values and Expressions
A type in Standard ML is defined by specifying three things: its name, its possible values and the possible operations on those values. Take the integer type as an example. Its name is int. Its possible values are the whole numbers 0, 1, ~1, 2, ~2 and so on (~ is used as the minus sign in SML). And its possible operations include + (addition), - (subtraction), * (multiplication), div (quotient) and mod (remainder). A value is an atomic expression which cannot be evaluated any further. Compound expressions include atomic expressions and expressions built by applying an operator to other compound expressions. Every expression has to have a type; it is then said to be well-typed. Otherwise, it is said to be ill-typed and is considered ineligible for evaluation. The evaluation of expressions is governed by a set of rules, called evaluation rules. These rules define how to determine the value of a compound expression as a function of its constituent expressions. For example, 2 + 3 is evaluated to the value 5 because applying the operation + to the two atomic expressions 2 and 3 yields 5.

4.1.2 Variables and Declarations

Like many programming languages, values can be given names. However, unlike in other languages, variables in ML do not vary. A value is bound to a variable via a construct called a value binding. This can be seen as an assignment in other languages, but the value is bound to the variable forever, so variables in SML are similar to constants in C or Java. Therefore, variables in ML are closer to the definition of variables in mathematics than to variables in languages such as C or Java. A type may also be bound to a type constructor via a type binding. A binding introduces a new variable that is effective within its scope. For example, type float = real introduces a type variable float which is synonymous with real. Similarly, val m:int = 3 + 3 introduces the variable m with type int and value 6. One important thing to remember is that bindings are not assignments: once a variable is bound to a value, the binding of that variable never changes. A binding is an atomic declaration; multiple declarations can be written one after another.
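To make the distinction concrete, a small sketch is given below; the last line introduces a new binding of m that shadows the old one within its scope, it does not update it.

val m : int = 3 + 3      (* m is bound to 6 *)
type float = real        (* float is a synonym for real *)
val m = m + 1            (* a new binding of m; not an assignment *)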

4.1.3 Mutable Data Structures

Coding with immutable variables is comfortable because we are free from side effects and it is therefore easier to reason about the behavior of the code. However, in some situations we need mutable data structures to accomplish our goals more effectively, for example when counting the number of words that have already appeared: when a new word comes, we need to update the current count. Like most languages, SML supports mutable data structures. There are two built-in ones in SML: references and arrays. The type typ ref is the type of reference cells containing values of type typ. A value of type int ref is a pointer to a location in memory, where that location contains an integer; it is similar to int* in C/C++. Like all values, reference cells may be bound to variables, passed as arguments to functions, returned as results of functions, and so on. A reference cell is created by the function ref of type typ -> typ ref. The content of a reference cell is retrieved using the function ! of type typ ref -> typ. By using the := operator, the value in the cell can be changed as a side effect. Here is an example:

val x : int ref = ref 3
val y : int = !x
val _ = x := (!x) + 1
val z = y + (!x)

The first line creates a reference cell with content 3. The second line reads the content of the cell referenced by x and binds it to y, so y has value 3. The third line evaluates !x to get 3, adds one to it to get 4, and then updates the content of the cell referenced by x to this value. The final line gets the content of the cell referenced by x and adds it to y; it therefore yields 3 + 4 = 7, which is bound to z. Array is the other mutable data structure that SML provides. Arrays generalize the reference concept in that they are a sequence of memory locations. Basic operations on arrays are provided via the Array structure.
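A short illustration of the basic Array operations from the Basis Library follows.

val a = Array.array (5, 0)        (* an array of five cells, all initialized to 0 *)
val _ = Array.update (a, 2, 42)   (* set index 2 to 42, as a side effect          *)
val x = Array.sub (a, 2)          (* read it back: x = 42                          *)
val n = Array.length a            (* n = 5                                         *)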

4.1.4 Type Inference

In most programming languages, we must explicitly give a variable a type when declaring it. ML allows us to omit this type information whenever the compiler can infer it from the context. This process is called type inference. For example, the value binding val v:int = 5 can just be written as val v = 5 because the type of the value 5 is int. This feature enables functions in SML to be polymorphic. The identity function fn x => x is an example. The body of the function places no constraints on the type of x. The expression is not rejected, because it works perfectly for any choice of the type of x. This is in contrast to the function fn x => x + 1, for which there are only two possible types of x (int or real). The choice of int or real affects the behavior of the function: it performs either an integer addition or a floating-point addition. The behavior of the identity function is uniform for any type of its argument, which is why it is said to be polymorphic.
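A small sketch of how these inferences look in practice:

val id = fn x => x        (* inferred type: 'a -> 'a, polymorphic      *)
val inc = fn x => x + 1   (* + defaults to int here, so: int -> int    *)
val v = 5                 (* inferred type: int, no annotation needed  *)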

4.1.5 Higher-Order Functions

Functions which take functions as arguments or yield functions as results are known as higher-order functions. The map' function below is an example of a function which receives another function as an argument:

fun map' (f, nil) = nil
  | map' (f, h::t) = (f h) :: map' (f, t)

For example, map' (fn x => x + 1, [1,2,3,4]) returns the list [2,3,4,5]. The add3 function below is an example of a function which returns a function:

fun add3 a = fn () => a + 3

It has the type int -> (unit -> int). The value of add3 7 is the function which receives no arguments and yields 7 + 3, so add3 7 () yields 10.
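For comparison, a curried variant of map' is sketched below (map'' is not part of the original code); partial application then gives a reusable "increment all" function without mentioning the list argument.

fun map'' f nil = nil
  | map'' f (h::t) = f h :: map'' f t

val incAll = map'' (fn x => x + 1)
val r = incAll [1, 2, 3, 4]        (* [2, 3, 4, 5] *)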

4.1.6 Signature and Structure

Signatures and structures are employed to construct modules in ML. A signature and a structure can be seen as an interface and a class, respectively, in languages such as C++ or Java. A structure is a unit of program which consists of a sequence of declarations of types, exceptions, variables and functions. Below is an example of a structure:

structure IntLT = struct
  type t = int
  val lt = (op <)
  val eq = (op =)
end

This structure has three components: one type and two functions. The components of a structure can be accessed using paths. For example, IntLT.lt refers to the function lt of structure IntLT: IntLT.lt (3, 5) yields true. A signature is the type (or specification) of a structure. A signature specifies requirements on a structure which implements it, such as the type components and value components that the structure must have. Here is an example:

signature ORDERED = sig
  type t
  val lt : t * t -> bool
  val eq : t * t -> bool
end

This signature specifies a structure that provides a type component named t and two functions of type t * t -> bool. A signature may be built from another using signature inclusion or signature specialization; these are forms of signature inheritance. Signature inclusion is used to add more components to an existing signature:

signature ORDERED_CLONE = sig
  include ORDERED
end

This signature just includes signature ORDERED with no additional components. Signature specialization is used to refine an existing signature with additional type definitions:

signature INT_ORDERED = ORDERED where type t = int

4.1.7 Signature Ascription

Signature ascription expresses the requirement that a structure implements a signature. There are two forms of signature ascription: transparent and opaque.

structure IntDiv : ORDERED = struct
  type t = int
  fun lt (m, n) = (n mod m = 0)
  val eq = (op =)
end

The ascription above is called transparent because the synonym between IntDiv.t and int remains valid in the scope of this structure. Here is an example of opaque ascription. We may use opaque ascription to specify that a structure implements queues and, at the same time, that only the operations in the signature may be used to manipulate values of that type.

signature QUEUE = sig
  type 'a queue
  val empty : 'a queue
  val insert : 'a * 'a queue -> 'a queue
  exception Empty
  val remove : 'a queue -> 'a * 'a queue
end

structure Queue :> QUEUE = struct
  type 'a queue = 'a list * 'a list
  val empty = (nil, nil)
  fun insert (x, (bs, fs)) = (x::bs, fs)
  exception Empty
  fun remove (nil, nil) = raise Empty
    | remove (bs, f::fs) = (f, (bs, fs))
    | remove (bs, nil) = remove (nil, rev bs)
end

In this structure, the definition of 'a Queue.queue is hidden by the binding; the equivalence of the types 'a Queue.queue and 'a list * 'a list is not propagated into the scope of the binding.
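A small usage sketch: only the operations listed in QUEUE are available to clients, and the list-pair representation stays hidden.

val q0 = Queue.empty
val q1 = Queue.insert (1, q0)
val q2 = Queue.insert (2, q1)
val (front, rest) = Queue.remove q2   (* front = 1, the oldest element *)
(* Pattern-matching q2 against a pair of lists outside the structure
   would be a type error, because the ascription is opaque. *)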

4.2 Parallelization for SML

Standard ML is mainly used for teaching and research; it is not a widely used language. Therefore, there has not been much effort to provide extensions or solutions for SML to adapt to new technological demands, and parallel and distributed processing is an example. The explosion of large-scale applications which cannot be handled by a single server requires a solution for Standard ML that can take advantage of parallel processing on a cluster. Currently, there are two well-known approaches to parallelization for SML that can work with a cluster.

The first choice is to use MPI. In this approach, we manage the parallel architecture with MPI in C/C++ and then call the actual code in SML to do the task. However, this approach suffers from several issues. First, MPI provides no fault tolerance: when one process fails, the overall task also fails, and the burden of handling the fault is placed on the developers. Second, MPI only provides a way to communicate between processes; developers have to manage the parallel paradigm by themselves, which is difficult and costs a lot of time.

The second approach is to use Hadoop. As introduced in Chapter 2, Hadoop provides a lot of good features. It allows developers to focus on the application algorithm while the system-level details of parallel processing are hidden and managed by Hadoop. It also offers scalability and robustness. However, Hadoop is written in Java and provides Java APIs to interact with both MapReduce and HDFS. Therefore, to use all of the features that Hadoop provides, a program usually needs to be written in Java. Fortunately, developers who want to use Hadoop without Java can take advantage of the Hadoop Streaming library, which allows MapReduce programs to be written in any language that supports standard input and output. However, Hadoop Streaming suffers from several limitations. First, developers can only customize the mapper, combiner and reducer via executable scripts; they cannot provide their own record reader, record writer or partitioner to take more control of Hadoop. Furthermore, HDFS is not accessible from the scripts. Finally, Hadoop Streaming is only appropriate for problems that just need to process text data streams.

4.3 Extend Hadoop Pipes for Standard ML

Hadoop provides a C++ interface called Hadoop Pipes for writing MapReduce programs in C++. With Hadoop Pipes, programmers can access and customize various MapReduce components in Hadoop. Furthermore, the application is able to process various types of input data. The C++ interface in Hadoop Pipes is written in a way that makes it a SWIG-compatible C++ API. This means that an interface for other languages such as Python or Perl can be generated with the SWIG tool [9]. Currently, SWIG supports scripting languages including Javascript, Perl, PHP, Python, Tcl and Ruby, and non-scripting languages including C#, Common Lisp (CLISP, Allegro CL, CFFI, UFFI), D, Go, Java, Lua, Modula-3, OCAML, Octave and R. Paper [23] presented Pydoop, a Python package that provides a Python API for Hadoop MapReduce and HDFS. The package is built on top of Pipes (C++). Figure 4.1 shows the integration of Pydoop with the C++ API. Method calls from the MapReduce framework flow through the C++ Pipes and Pydoop API and finally reach the user-defined methods. The results, which are Python objects, are then wrapped and returned to the framework. To reach HDFS operations, function calls are initiated by Pydoop and translated into C calls to the libhdfs library. The results are then wrapped back into Python objects and returned to the application.

Figure 4.1: Integration of Pydoop with C++

The approach taken in Pydoop suggests a way to create a Hadoop interface for programming languages other than Java. First, we need to provide APIs for the target language to access the MapReduce components in Hadoop. Second, the target language and C++ Pipes must be given an interface to communicate with each other; in that way, C++ Pipes can delegate tasks to functions written in the target language. This requirement can be met with SWIG. If the target language already supports an interface to C++, that interface may also be a solution. The above analysis provides an approach to extend Hadoop Pipes for Standard ML. If we use MLton as the compiler for SML, we already have an interface for communication between SML and C/C++: MLton's Foreign Function Interface (MLton's FFI) [4]. It extends Standard ML and allows SML code to take the address of C global objects and access C global variables. In particular, it allows calls from SML to C and from C to SML.

4.4 Architecture of MLoop

The library developed in this thesis is named MLoop, which stands for Standard ML API for Hadoop. As described in the previous section, developing MLoop is the task of providing a Standard ML API for writing the mapper, combiner and reducer of a Hadoop job. Moreover, MLoop also aims to allow customization of the record reader and writer. Figure 4.2 shows the architecture of MLoop.

Figure 4.2: SML API for Hadoop

The MapReduce tasks are delegated from the Hadoop Java framework to the corresponding C++ components in Hadoop Pipes. These components, in turn, call the user implementations provided by MLoop to actually do the tasks. The provided implementations simply delegate the tasks to user code in Standard ML through the MLoop library. In detail, the C++ classes which are responsible for the tasks call corresponding functions in MLoop to accomplish them. The Mapper, Combiner and Reducer objects call the mloop_map_, mloop_combine_ and mloop_reduce_ functions in MLoop, respectively. The key-value outputs of these functions are reported back to the Hadoop Java framework through the TaskContext object in Hadoop Pipes. On the other hand, the RecordReader object communicates with mloop_read_ to get the input key-value pair and itself reports the result to the Hadoop framework. The RecordWriter object works in the same way as the RecordReader object: it calls mloop_write_ to get the customized final key-value pair and reports it back to the Hadoop framework. We use the MLton compiler for Standard ML because it generates efficient executables. Hence, the communication between Hadoop Pipes objects and MLoop functions is carried out with MLton's FFI. Note that the functions in MLoop which are called by Hadoop Pipes objects do no real work themselves; they instead call the user implementations in SML to accomplish the tasks.

4.5 MLoop Implementation

In the implementation, we have to deal with two main issues. The first is communication between different languages: Hadoop Pipes is written in C++, while Standard ML can only communicate with C. The second is that the data flow of MLoop has to conform to the operations in Hadoop Pipes so that the overall process runs as smoothly as possible.

4.5.1 Communication between SML, C and C++

Hadoop Pipes is written in C++. Therefore, to make MLoop work with Hadoop Pipes, we need communication between C and C++ and between SML and C. Fortunately, all of these communications are already supported.

Communicate between C and C++

The situation is that we have written a program in C and we need to integrate an existing third-party C++ library into our program. Suppose we have the C++ class shown in Table 4.1.

MyClass.hpp:

#ifndef __MYCLASS_H
#define __MYCLASS_H

class MyClass {
private:
    int m_i;
public:
    void int_set(int i);
    int int_get();
};
#endif

MyClass.cpp:

#include "MyClass.hpp"

void MyClass::int_set(int i) {
    m_i = i;
}

int MyClass::int_get() {
    return m_i;
}

Table 4.1: A C++ class

To call the C++ functions above, we have to solve two problems: the name mangling of C is different from that of C++, and C does not know about classes. To solve these problems, we write a C wrapper for the C++ class, as shown in Table 4.2. Now we can call the C++ functions from our C code:

#include "MyWrapper.h"
#include <stdio.h>

int main(int argc, char* argv[]) {
    struct MyClass* c = newMyClass();
    MyClass_int_set(c, 3);
    printf("%i\n", MyClass_int_get(c));
    deleteMyClass(c);
    return 0;
}

Communicate between C and SML

As mentioned before, we use the MLton compiler for Standard ML. MLton provides an interface called the Foreign Function Interface (FFI) which allows an SML program to import C functions.

MyWrapper.h:

#ifndef __MYWRAPPER_H
#define __MYWRAPPER_H

#ifdef __cplusplus
extern "C" {
#endif

typedef struct MyClass MyClass;

MyClass* newMyClass();
void MyClass_int_set(MyClass* v, int i);
int MyClass_int_get(MyClass* v);
void deleteMyClass(MyClass* v);

#ifdef __cplusplus
}
#endif
#endif

MyWrapper.c:

#include "MyClass.h"
#include "MyWrapper.h"

extern "C" {
    MyClass* newMyClass() {
        return new MyClass();
    }
    void MyClass_int_set(MyClass* v, int i) {
        v->int_set(i);
    }
    int MyClass_int_get(MyClass* v) {
        return v->int_get();
    }
    void deleteMyClass(MyClass* v) {
        delete v;
    }
}

Table 4.2: C Wrapper for MyClass class

Suppose a C function with prototype int foo(double d, char c);. MLton extends the syntax of SML to allow expressions like _import "foo": real * char -> int;. Therefore, in an SML program, we can call the C function as follows:

val foo_ = _import "foo": real * char -> int;
val c = foo_ (2.0, #"c")
val _ = print (Int.toString c)

To compile, we use

mlton -default-ann 'allowFFI true' -export-header export.h \
    import.sml foo.c

where foo.c contains the definition of the foo function in C and import.sml contains the SML code above. MLton's FFI also allows programs to export SML functions to be called from C. A function of type real * char -> int can be exported via the following declarations:

val e = _export "foo": (real * char -> int) -> unit;
val _ = e (fn (x, c) => 13 + Real.floor x + Char.ord c)

Then the following command will generate a C header file (named export.h) that can be used to call the function from a C program:

mlton -default-ann 'allowFFI true' -export-header export.h \
    -stop tc export.sml

4.5.2 MLoop Data-flow

The data flow of Hadoop Pipes and the architecture of MLoop described above determine the requirements of the implementation. It requires C++ implementation classes, which are instantiated by the Pipes C++ library, to handle the map and reduce tasks. Besides that, it also requires C wrappers for calling C functions from SML.

Figure 4.3: Sequence diagram of map phase in MLoop

At least two implementation classes are required to make Hadoop Pipes work: Mapper and Reducer. The other classes provide developers with additional control over Hadoop jobs: the Combiner helps to reduce the intermediate key-value pairs of the map phase, the Record Reader controls the way input key-value pairs are generated and fed to the mapper, and the Record Writer deals with storing the output pairs of the reduce phase. These classes are implemented in MLoop as delegates which call MLoop functions to actually do the job. Figures 4.3 and 4.4 show the flows inside MLoop. In the map phase, the Hadoop Pipes object sends a message to the next function of the C++ RecordReader implementation (step 1) to read the input file and get an input key-value pair to feed to the map function. This only happens if the RecordReader object is provided; it simply forwards the request to the mloop_read_ function of MLoop to get the resulting key-value pair. If the RecordReader object is not provided, the system uses the provided Java Record Reader to parse key-value pairs from the files and feed them to the map function. In the next step, the Hadoop Pipes object sends a MapContext object containing the input key-value pair to the map function of the C++ Mapper implementation (step 2). The request is then forwarded to mloop_map_ of MLoop to invoke the user code in SML which contains the business logic of the map step (step 2.1). The output key-value pairs are emitted back to Hadoop Pipes via calls to the TaskContext object provided by Hadoop Pipes (step 2.1.1). If a combiner is provided, the intermediate key-value pairs are buffered. At the end of the map phase, these pairs are fed to the Combiner object (step 3). It, in turn, calls mloop_combine_ of MLoop to do the job (step 3.1); again, this function is only responsible for calling the user code that implements the business logic. The new output key-value pairs are then reported to the Hadoop Java framework via the TaskContext object (step 3.1.1). If there is no combiner, the intermediate key-value pairs are reported directly to the Hadoop Java framework each time a new pair arrives at the TaskContext object.

Figure 4.4: Sequence diagram of reduce phase in MLoop

In the reduce phase (Figure 4.4), the reduce function of the C++ Reducer is passed a ReduceContext object from which the key and all the values associated with it can be retrieved (step 1). The input key is sent to the mloop_reduce_ function of MLoop (step 1.1) and then to the user-implemented function mloop_reduce. This key, together with the functions provided by MLoop to get the associated values, is enough to process the actual business logic. The output key-value pairs are returned to the TaskContext object to be written back to the file system (step 1.1.1). At this point, two cases can occur. If a writer is provided, the key-value pairs are sent to the writer, which delegates these pairs to mloop_write_ (step 1.1.1.1.1) to allow the user to customize the final key-value pairs.

These final pairs are written directly to HDFS by the writer. On the other hand, if there is no writer, the output key-value pairs from the reducer are sent back to the Java framework and written to the file system using the Java Record Writer.

4.5.3 MLoop Implementation Details

This part reports the technical implementation of MLoop based on the architecture and data flow described in the previous sections. MLoop can be seen as a channel between the user code in SML and the Hadoop Pipes framework in C++. Therefore, MLoop must provide two counterparts: an SML module that communicates with the user code in SML and provides the structures and functions supporting it, and a C++ module that retrieves tasks from the Hadoop Pipes framework, forwards them to the SML module, receives the results from the user code and reports them back to the Pipes framework.

Standard ML API for Hadoop

MLoop aims to provide a very simple API for users to write MapReduce programs for Hadoop. Indeed, to control the different aspects of a MapReduce job, the user only needs to implement the following functions:

fun mloop_map (key:string, value:string):unit
fun mloop_reduce (key:string):unit
fun mloop_combine (key:string):unit
fun mloop_read ():bool
fun mloop_write (key:string, value:string):unit

mloop_map allows the user to define the behavior of the map function of the MapReduce model. It takes two strings as the key-value pair and emits intermediate key-value pair(s). MLoop provides the MapContext structure to support manipulating the map task; it provides the function emit: string * string -> unit to emit an intermediate pair.

mloop_reduce and mloop_combine allow the user to define the behavior of the reduce and combine functions of the MapReduce model. The input key is given as a parameter of the function. MLoop contains ReduceContext, which provides the functions needed to retrieve the necessary information: getValueSet: unit -> string list gets all values associated with the key at once, nextValue: unit -> bool checks whether there is another value associated with the current key, and getInputValue: unit -> string actually gets the associated value. Finally, the function emit: string * string -> unit is used to emit the output key-value pair.

mloop_read and mloop_write allow the user to customize the record reader and writer. mloop_write takes a key-value pair from the reduce step as input. It may generate a new key-value pair to actually be written to the file system; the function emit: string * string -> unit in the Writer structure is defined for that purpose. mloop_read uses functions provided by the Reader structure to parse the input split into key-value pairs which are fed to the map function. mloop_read returns true to indicate that there are still more pairs; otherwise, it returns false. The supporting functions in the Reader structure are listed below:

getBytesRead () returns the number of bytes consumed by the reader for the last read operation.
updateOffsetBytesConsumed (offset, bytesConsumed) updates the current offset of the reader in the file and the number of bytes consumed for the last read operation.
getOffset () returns the offset of the input split which corresponds to this map task.
getOffsetNow () returns the current offset of the reader in the file.
get () returns the reader.
seekHdfs offset seeks to the given offset in the file.

The detailed description of all structures which are provided by MLoop is shown in Appendix A.3.
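To make the shape of a complete MLoop program concrete, the following minimal sketch uses only the functions listed above. It is a pass-through job (identity map, reduce that re-emits every value); the behavior is purely illustrative, not part of MLoop itself.

fun mloop_map (key: string, value: string): unit =
  MapContext.emit (key, value)

fun mloop_reduce (key: string): unit =
  while (ReduceContext.nextValue ()) do
    ReduceContext.emit (key, ReduceContext.getInputValue ())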

The C++ Mapper

MLoop needs to provide a C++ implementation of the Hadoop Pipes Mapper class to delegate the task to the user-implemented function mloop_map. Listing 4.1 shows a partial view of the C++ Mapper provided by MLoop. When the object is created, it sends the address of the MapContext C++ object to the MapContext structure (line 13), which stores it because the emit function of the MapContext structure needs this information to retrieve the original MapContext C++ object and actually emit the output (key, value) pair. When the map function of MloopMapper is called from the Hadoop Pipes framework, it retrieves the key and value from the MapContext object and forwards them to mloop_map_ (line 18). The responsibility of mloop_map_ is to convert the C++ string pointers to SML strings and feed them to mloop_map. Listing 4.2 shows the definitions of these functions in the SML module of MLoop.

Listing 4.1: C++ Mapper implementation

 1  class MloopMapper : public hp::Mapper {
 2  private:
 3      char* addr;
 4  public:
 5      MloopMapper(hp::MapContext& ctx);
 6      void map(hp::MapContext& ctx);
 7      void close(){}
 8      virtual ~MloopMapper(){}
 9  };
10
11  MloopMapper::MloopMapper(hp::MapContext& ctx) {
12      hp::MapContext* ptr = &ctx;
13      init_map((CPointer) ptr);
14  }
15  void MloopMapper::map(hp::MapContext& ctx) {
16      string key = ctx.getInputKey();
17      string value = ctx.getInputValue();
18      mloop_map_((CPointer) key.c_str(), (CPointer) value.c_str());
19  }

Listing 4.2: MLoop setup for map task

fun init_map addr = MapContext.setAddress addr

fun mloop_map_ (key:MLton.Pointer.t, value:MLton.Pointer.t) =
  mloop_map (fetchCString key, fetchCString value)
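Listing 4.2 relies on a helper fetchCString that copies a NUL-terminated C string out of an MLton.Pointer.t; its definition is not shown here. A minimal sketch of such a helper, using MLton's Pointer structure, might look like the following.

(* Hypothetical sketch: read bytes until the terminating NUL. *)
fun fetchCString (p: MLton.Pointer.t): string =
  let
    fun loop (i, acc) =
      let
        val w = MLton.Pointer.getWord8 (p, i)
      in
        if w = 0w0
        then String.implode (List.rev acc)
        else loop (i + 1, Char.chr (Word8.toInt w) :: acc)
      end
  in
    loop (0, [])
  end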

The C++ Reducer and Combiner

C++ Reducer and Combiner implementations (MloopReducer and MloopCombiner) are also provided by MLoop (Listing 4.3). The address of the C++ ReduceContext object is stored in the ReduceContext structure each time a MloopReducer object is created (line 18). The reduce function of MloopReducer gets the input key from the C++ ReduceContext object and sends it to mloop_reduce via mloop_reduce_ (lines 22-23). The reduce function of MloopCombiner works in a slightly different manner: it also sends the address of the C++ ReduceContext object, because many ReduceContext objects are created to combine different chunks of intermediate (key, value) pairs. Listing 4.4 shows the counterparts of MloopReducer and MloopCombiner in the SML module of MLoop.

Listing 4.3: C++ Reducer and Combiner implementation

 1  class MloopReducer : public hp::Reducer {
 2  public:
 3      MloopReducer(hp::ReduceContext& ctx);
 4      virtual void reduce(hp::ReduceContext& ctx);
 5      virtual void close();
 6      virtual ~MloopReducer();
 7  };
 8
 9  class MloopCombiner : public MloopReducer {
10  public:
11      MloopCombiner(hp::MapContext& ctx);
12      virtual void reduce(hp::ReduceContext& ctx);
13      virtual void close();
14  };
15
16  MloopReducer::MloopReducer(hp::ReduceContext& ctx) {
17      hp::ReduceContext* ptr = &ctx;
18      init_reduce((CPointer) ptr);
19  }
20
21  void MloopReducer::reduce(hp::ReduceContext& ctx) {
22      std::string key = ctx.getInputKey();
23      mloop_reduce_((CPointer) key.c_str());
24  }
25
26  void MloopCombiner::reduce(hp::ReduceContext& ctx) {
27      hp::ReduceContext* ptr = &ctx;
28      std::string key = ctx.getInputKey();
29      mloop_combine_((CPointer) ptr, (CPointer) key.c_str());
30  }

Listing 4.4: MLoop setup for reduce and combiner task

fun init_reduce (addr:MLton.Pointer.t) = ReduceContext.setAddress addr

fun mloop_combine_ (address:MLton.Pointer.t, key:MLton.Pointer.t) =
  let
    val _ = ReduceContext.setAddress address
  in
    mloop_combine (fetchCString key)
  end

fun mloop_reduce_ (key:MLton.Pointer.t) = mloop_reduce (fetchCString key)

The C++ Record Reader

Listings 4.5 and 4.6 show the main view of the Reader in MLoop. MloopRecordReader parses information about its corresponding input split (lines 2-11). This information is stored in the Reader structure through a call to the reader_init function (line 20). It is also used to connect directly to HDFS (lines 13-15) and to create a LineReader object to read each line of the input split (line 18). The Hadoop Pipes framework calls the next function of MloopRecordReader to get an input key-value pair for the map function. The next function, in turn, delegates to the user-implemented function of the SML API, reader_read, to respond to the call from Hadoop Pipes (line 25). If the result is positive, the key-value pair has already been stored in the two class variables key and value; they are returned to Hadoop Pipes via the two reference parameters key_ and value_ (lines 26-30).

Listing 4.5: C++ Record Reader implementation

 1  MloopRecordReader::MloopRecordReader(hp::MapContext& ctx) {
 2      char* home = getenv("HADOOP_HOME");
 3      setenv("CLASSPATH", get_hadoop_classpath(home), 1);
 4      key = value = NULL;
 5      string split = ctx.getInputSplit();
 6      _StringInStream is(split);
 7      string filename;
 8      hu::deserializeString(filename, is);
 9      offset = is.readLong();
10      length = is.readLong();
11      int pos = filename.find('/', 7);
12
13      fs = hdfsConnect("default", 0);
14      file = hdfsOpenFile(fs, filename.substr(pos).c_str(), O_RDONLY,
15                          0, 0, 0);
16      bytes_read = 0;
17      start = offset;
18      in = new LineReader(fs, file);
19      newLine = NULL;
20      reader_init((CPointer) this, (Int64) offset, (Int64) length);
21      reader_setup();
22  }
23
24  bool MloopRecordReader::next(std::string& key_, std::string& value_) {
25      int32_t res = reader_nextVal();
26      if (res) {
27          key_ = *key;
28          value_ = *value;
29          return true;
30      }
31      return false;
32  }

Listing 4.6: SML functions for reader

fun reader_init (file, offset, length) = Reader.init (file, offset, length)

fun reader_setup () =
  let
    val offset = Reader.getOffset ()
    val reader = Reader.get ()
  in
    if offset = 0 then ()
    else (seekHdfs (reader, offset - 1); Reader.readLine (); ())
  end

fun reader_nextVal () = reader_read ()

The C++ Record Writer

The Record Writer controls the way output key-value pairs are written to the file in HDFS. The current implementation of the Writer in MLoop allows users to customize the output key-value pair as well as to define the separator between key and value. MloopRecordWriter works directly with HDFS to create the output file (lines 2-17). The address of the current writer object is stored in the Writer structure so that it can be accessed later. Whenever a key-value pair needs to be written to the file system, the writer object invokes the user-implemented function mloop_write to process it (line 22). Listings 4.7 and 4.8 show the implementation.

Listing 4.7: C++ Record Writer implementation

 1  MloopRecordWriter::MloopRecordWriter(hp::ReduceContext& ctx) {
 2      char* home = getenv("HADOOP_HOME");
 3      setenv("CLASSPATH", get_hadoop_classpath(home), 1);
 4      const JobConf* conf = ctx.getJobConf();
 5      string out_dir = conf->get("mapreduce.output.fileoutputformat.outputdir");
 6      int pos = out_dir.find('/', 7);
 7      out_dir = out_dir.substr(pos);
 8      int part = conf->getInt("mapred.task.partition");
 9      char s[200];
10      sprintf(s, "%s/part-%.5d", out_dir.c_str(), part);
11      fs = hdfsConnect("default", 0);
12      file = hdfsOpenFile(fs, s, O_WRONLY | O_CREAT,
13                          0, 0, 0);
14      string sep("\t");
15      if (conf->hasKey("mapred.textoutputformat.separator"))
16          sep = conf->get("mapred.textoutputformat.separator");
17      writer = new LineWriter(fs, file, sep);
18      writer_init((CPointer) this);
19  }
20
21  void MloopRecordWriter::emit(const std::string& key, const std::string& value) {
22      mloop_write_((CPointer) key.c_str(), (CPointer) value.c_str());
23  }

Listing 4.8: SML functions for writer

fun writer_init address = Writer.store address
fun mloop_write_ (key, value) = mloop_write (key, value)

4.6 Writing MapReduce programs in MLoop

4.6.1 MLoop Basics

Writing MapReduce programs in MLoop is similar to writing them in Hadoop Java. We just need to define the two functions mloop_map and mloop_reduce. We are also able to provide a combiner via the function mloop_combine.

Listing 4.9: Counting words in MLoop

 1  fun mloop_map (key:string, value:string) =
 2    let
 3      val splitter = String.tokens (fn c => Char.isSpace c)
 4      val words = splitter value
 5
 6    in
 7      map (fn word => MapContext.emit (word, "1")) words
 8    end
 9
10  fun mloop_reduce (key:string) =
11    let
12      fun fromString str = let
13        val x = Int.fromString str
14      in
15        Option.getOpt (x, 0)
16      end
17      val sum = ref 0
18    in
19      while (ReduceContext.nextValue ()) do
20        sum := (!sum) + (fromString (ReduceContext.getInputValue ()));
21      ReduceContext.emit (key, Int.toString (!sum))
22    end
23
24  fun mloop_combine (key:string) =
25    mloop_reduce (key)
26
27  val useCombiner = true

Listing 4.9 shows the Word Count program in MLoop. In the map function, the value argument contains a line of input. In line 3, we use the function tokens of the structure String, which takes two arguments (f, s) and returns the list of tokens derived from the string s, using as delimiter any character that satisfies the predicate f. In our code, f is the function Char.isSpace, which takes a character c and returns true if c is whitespace (space, newline, tab, carriage return, vertical tab, form feed). We only pass the predicate f to tokens, so it returns another function of type string -> string list. We then apply that function, bound to the variable splitter, to the string value to tokenize the input line with a whitespace delimiter (line 4). In line 7, we use the map function of Standard ML, which applies a function f to every element of a list, to emit each token as a key with the associated count "1".

In the reduce function (mloop_reduce), we first define a function fromString to convert the input string into a number (lines 12-16). We also initialize a reference cell, pointed to by sum, to keep track of the total occurrences of a word. We then loop over all the values associated with the input key to sum all the counts (lines 19-20). Finally, we emit the output key-value pair to the Hadoop Pipes framework (line 21). We also provide a combiner for this job, which is the same as the reduce function (lines 24-25). Finally, we need to notify the system that we want to use the combiner feature, which is done by setting the variable useCombiner to true.
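For readers unfamiliar with String.tokens, a tiny sketch of the tokenizing step in isolation (the input string is only an example):

val splitter = String.tokens Char.isSpace
val words = splitter "to be or not"     (* ["to", "be", "or", "not"] *)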

4.6.2 Record Reader

By default, Hadoop splits input data into text lines, and the key-value pairs are generated from these lines. The key is the byte offset of the beginning of the line within the file; the value is the content of the line. If you want to process multiple lines at a time, you need to write a custom Record Reader. The Record Reader works at the HDFS file level: it reads data from the file and generates key-value pairs to pass to the Mapper. To write a Record Reader in MLoop, two functions must be defined: setUpReader and mloop_read.

Listing 4.10: A custom record reader to read two lines at a time

 1  (* define map and reduce here *)
 2
 3  fun setUpReader () = let
 4    val offset = Reader.getOffset ()
 5  in
 6    if offset = 0 then ()
 7    else (Reader.seekHdfs (offset - 1); Reader.readLine (); ())
 8  end
 9
10  fun mloop_read () = let
11    val start = Reader.getOffsetNow ()
12    val endOffset = (Reader.length ()) + (Reader.getOffset ())
13    val (value, isSuccess) = if start < endOffset then (Reader.readLine (), true) else ("", false)
14    val (value2, isNext) = if isSuccess andalso (Reader.getOffsetNow ()) < endOffset then (value ^ (Reader.readLine ()), true)
15                           else (value, false)
16  in
17    if isSuccess orelse isNext then setKeyValue (Reader.get (), cstring (Int64.toString start), cstring value2)
18    else ();
19    isNext
20  end
21  val useReader = true

The reader specified in Listing 4.10 reads the input split line by line and combines two lines into one. In the setup phase, we get the offset of the input split in the file (line 4). If the input split starts in the middle of the file, it may start in the middle of a line; therefore, we skip the line which starts right before the offset of the current input split (lines 6-7). This makes sure that we do not read the last line of the previous input split again. The actual task of reading a key-value pair takes place in the mloop_read function. First, the current offset of the reader and the last offset of the input split are stored in start and endOffset, respectively (lines 11-12). The reader tries to read a line from the current offset. If it succeeds, value stores the line and isSuccess is set to true; otherwise, value is an empty string and isSuccess is set to false (line 13). In line 14, the reader tries to read a second line if the previous operation was successful and the current offset is still within the input split. If this operation succeeds, the new line is appended to the old value. Finally, we need to send the key (the start offset) and the value (the lines which were read) to the MloopRecordReader object via setKeyValue, which updates the class variables key and value of the MloopRecordReader object (lines 17-18). The function returns the boolean value isNext, which indicates whether there may be more values to read in the current input split. In line 21, we declare that we want to use the Record Reader to read the input split.

4.6.3 Record Writer

The Record Writer controls the way output key-value pairs are written to the file in HDFS.

Listing 4.11: A custom record writer in MLoop

(* define map and reduce here *)
fun mloop_write (key, value) = Writer.emit (key, value ^ "|")
val useWriter = true

The writer specified in Listing 4.11 changes the value by appending a vertical bar "|" to it. The key and the new value are sent to MloopRecordWriter, which writes them directly to the output file in HDFS, using Writer.emit.

4.7 Sample Programs in MLoop

In Chapter 5, we want to compare the performance of different large-scale parallel solutions on different problems. Therefore, we again solve the same problems described in Section 2.3, but using MLoop instead. The complete source code can be found in Appendix B.2.

4.7.1 Summation

Calculating the sum of the whole numbers from one to ten billion requires dealing with big integers. Fortunately, Standard ML supports them with the IntInf structure. Arithmetic with big numbers in Standard ML is based on the GMP library and is therefore very efficient. Listing 4.12 shows the code for the map and reduce functions written in MLoop.

Listing 4.12: Summation in MLoop

 1  fun mloop_map (key:string, value:string) =
 2    let
 3      val splitter = String.tokens (fn c => Char.isSpace c)
 4      val words = splitter value
 5      fun toInt number = let
 6        val x = IntInf.fromString number
 7      in
 8        Option.getOpt (x, 0)
 9      end
10      fun sum (from:IntInf.int, to:IntInf.int, result:IntInf.int) = if from > to then result
11        else sum (from + 1, to, result + from);
12      val v = sum (toInt (List.nth (words, 0)), toInt (List.nth (words, 1)), 0);
13    in
14      MapContext.emit ("", IntInf.toString v)
15    end
16
17  fun mloop_reduce (key:string) =
18    let
19      fun fromString str = let
20        val x = IntInf.fromString str
21      in
22        Option.getOpt (x, 0)
23      end
24      val sum = ref 0 : IntInf.int ref
25    in
26      while (ReduceContext.nextValue ()) do
27        sum := (!sum) + (fromString (ReduceContext.getInputValue ()));
28      ReduceContext.emit (key, IntInf.toString (!sum))
29    end

Each map function calculates the sum of a sequence of numbers x, x + 1, ..., x + n. This sequence is represented by its two endpoints x and x + n. The value argument of mloop_map contains these two numbers in text format, separated by whitespace. Therefore, the map function first tokenizes the value to get the two numbers (lines 3-4). The function toInt is used to convert a string into a number. The two numbers are then passed to the function sum to calculate the sum (line 12). All map functions emit the same key "" together with the sum of their sequence as value, so that all the partial sums arrive at a single reducer (line 14). The reduce function of this problem is very similar to the reduce function described for the Word Count problem: it calculates the sum of all the values associated with the input key (lines 17-29).

4.7.2 17-Queens

The N-Queens problem can be solved with iterative MapReduce. In this example, we use MLoop to find the solutions. The approach starts by finding all possible ways of putting the first n queens on the board, and then finds the complete solutions for each of these partial placements.

Listing 4.13: MapReduce job to put one more queen into the board

 1  fun addQueen (qs, n) = let
 2    val col = 1 + List.length qs
 3    fun put row = if row > n then ()
 4      else if conflict (col, row) qs then put (row + 1)
 5      else (MapContext.emit (randInt (), codeBoard ((col, row)::qs, "")); put (row + 1))
 6  in
 7    put 1
 8  end
 9
10  fun append (row, nil) = [(1, row)]
11    | append (row, (col, x)::t) = (col + 1, row) :: (col, x)::t
12
13  fun parseBoard value = let
14    val splitter = String.tokens (fn c => c = #"-")
15    fun toInt s = Option.getOpt (Int.fromString s, 0)
16    val values = map toInt (splitter value)
17  in foldl append nil values
18  end
19
20  fun mloop_map (key, value) = if value = "START" then addQueen ([], 17)
21    else addQueen (parseBoard value, 17)
22
23  fun mloop_reduce (key:string) =
24    while (ReduceContext.nextValue ()) do
25      ReduceContext.emit ("", ReduceContext.getInputValue ())

The code snippet to find all possible ways to add one more queen to the next column is shown in Listing 4.13. The board is represented as a string. For instance, "1-3-5" means that queens were put at the first, third and fifth row of the first, second and third column, respectively; the string "START" means that no queen has been put on the board yet. The string is parsed into a list of coordinates of queens on the board using the function parseBoard (lines 13-18). Each coordinate has the form (column, row); for example, the string "1-3-5" is parsed as [(1,1), (2,3), (3,5)]. For a specific board of known size whose first columns are already filled with queens, the function addQueen tries to put a queen on every row of the next column (lines 1-8). Each legal row is reported as a key-value pair: the key is a random number from 1 to 10 and the value is the string representing the board (line 5). Remember that all values associated with a key are sent to only one reducer, so the purpose of randomizing the key is to spread the output values (the boards) over different reducers.

The mloop_map function receives a key-value pair whose value contains the current state of the board. If no queen has been put on the board yet (the value equals "START"), the program finds all possible ways to put one queen on the first column: addQueen ([], 17) (line 20). Here 17 is the size of the board. Otherwise, all possible ways to put one more queen on the next column are searched (line 21).

After filling the first few columns of the board, we use depth-first search to find all complete solutions corresponding to each partially filled board, as shown in Listing 4.14. The function fillQueen uses a depth-first-search approach to find all complete solutions for a given board and size (lines 1-8). Each time it finds a complete board, it emits it to the Hadoop Pipes framework (line 4). The mloop_map function just parses the board (line 16) and calls fillQueen to find and emit all complete solutions (line 19). The reduce function does nothing other than emitting all the associated values, which represent the boards (lines 10-12).
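Listings 4.13 and 4.14 rely on helpers such as conflict, codeBoard and randInt, whose definitions are not repeated here (the complete source is in Appendix B.2). As a rough, hypothetical sketch of what the conflict test could look like: since exactly one queen is placed per column, only rows and diagonals need to be checked.

(* Hypothetical sketch: a queen at (col, row) conflicts with an existing
   queen (c, r) if they share a row or a diagonal. *)
fun conflict (col, row) qs =
  List.exists
    (fn (c, r) => r = row orelse Int.abs (col - c) = Int.abs (row - r))
    qs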

Listing 4.14: MapReduce job to find the complete board with 17 queens

 1  fun fillQueen (qs, n, col, fc) = let
 2    fun put row = if row > n then fc ()
 3      else if conflict (col, row) qs then put (row + 1)
 4      else if col = n then (MapContext.emit (randInt (), codeBoard ((col, row)::qs, "")); put (row + 1))
 5      else fillQueen ((col, row)::qs, n, col + 1, fn () => put (row + 1))
 6  in
 7    put 1
 8  end
 9
10  fun mloop_reduce (key:string) =
11    while (ReduceContext.nextValue ()) do
12      ReduceContext.emit ("", ReduceContext.getInputValue ())
13
14
15  fun mloop_map (key, value) = let
16    val board = parseBoard value
17    val col = 1 + List.length board
18  in
19    fillQueen (board, 17, col, fn () => ())
20  end

We define two kinds of jobs to solve the 17-Queens problem. However, MLoop does not support a mechanism to control the job sequence, so we have to resort to a manual solution. We write a script that runs the first five jobs in sequence, one by one, each using the output of the previous one. In these first five jobs, we use the MapReduce job defined in Listing 4.13. Finally, we run the job defined in Listing 4.14 on the output of the last of them. In this way, we obtain the result we need.

4.7.3 Graph Search

Graph search was already described in previous chapters and can be solved with iterative MapReduce. Using MLoop, we can implement the solution in SML. In the map phase, when a node comes in, its color turns into "BLACK" (meaning it has been visited) if its current color is "GRAY" (meaning it is about to be visited) (lines 6-9). At the same time, we also generate a new node for each adjacent node, with the incoming node's distance plus 1, gray color and adjacency list NULL (line 7). If the node is not gray, we just pass it on unchanged (line 11). In the reduce phase, the reducer gets a node id as key and all the associated pieces of information as the value set. In lines 44-45 we loop over this set. At each step of the loop, the associated information is extracted using the function unpackInfo to obtain the adjacency list, the distance and the color. At the end of the loop, the valid adjacency list, the smallest distance and the maximum color have been found. The reducer then pushes out a new node with the old key and the newly found values (line 46). The code snippet is shown in Listing 4.15.

Listing 4.15: A partial view of Graph Search in MLoop

 1  fun mloop_map (key, value) = let
 2    val (id, (edges, dist, color)) = unpack value
 3    fun increaseDist dist = case (Int.fromString dist) of SOME t => Int.toString (t + 1) | NONE => "Integer.MAX_VALUE"
 4    fun parseEdges e = if e = "NULL" then [] else split (e, #",")
 5  in
 6    if color = "GRAY" then (
 7      map (fn k => printNode (k, "NULL", (increaseDist dist), color)) (parseEdges edges);
 8      printNode (id, edges, dist, "BLACK")
 9    )
10    else
11      printNode (id, edges, dist, color)
12  end
13
14  (* info = EDGES|DISTANCE_FROM_SOURCE|COLOR| *)
15  fun unpackInfo info = let
16    val tmp = split (info, #"|")
17    val valid2 = if List.length tmp = 3 then true else raise Fail ("Invalid info.")
18    val e::d::c::_ = tmp
19  in
20    (e, d, c)
21  end
22
23  fun mloop_reduce (key:string) = let
24    val dist = ref "Integer.MAX_VALUE"
25    val colour = ref "WHITE"
26    val edge = ref "NULL"
27    fun colorInt color = case color of "WHITE" => 0 | "GRAY" => 1 | "BLACK" => 2 | _ => ~1
28    fun maxColor (c1, c2) = let
29      val v1 = colorInt c1
30      val v2 = colorInt c2
31    in
32      if v1 < v2 then c2 else c1
33    end
34    fun minDist (d1, d2) = let
35      val v1 = case (Int.fromString d1) of SOME t => t | NONE => maxInt
36      val v2 = case (Int.fromString d2) of SOME t => t | NONE => maxInt
37    in
38      if v1 < v2 then d1 else d2
39    end
40    fun update (edges:string, distance:string, color:string) =
        (if not (edges = "NULL") then edge := edges else ();
41      dist := minDist (!dist, distance);
42      colour := maxColor (!colour, color))
43  in
44    while (ReduceContext.nextValue ()) do
45      update (unpackInfo (ReduceContext.getInputValue ()));
46    reduceNode (key, !edge, !dist, !colour)
47  end

In this problem, we can use the same approach as for the 17-Queens problem to run the iterative MapReduce job. However, as described in Section 2.3.3, to stop the iteration we need a counter to keep track of the status of the graph, and such a counter is not provided by MLoop. Therefore, there is no direct way to know whether all the nodes in the graph have been visited. A workaround is to run the job several times on the new output and then use another job to check whether there is any "GRAY" node left in the graph. If not, we stop; otherwise, we run the job again on the latest output several times and check again. We repeat that process until we are sure to have the final graph.
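Listing 4.15 uses a few helpers (split, printNode, reduceNode) whose definitions are in Appendix B.2 and not shown here. Purely as hypothetical sketches consistent with the EDGES|DISTANCE|COLOR| format noted in the listing, they could look like the following.

(* Hypothetical sketches of the helpers referenced in Listing 4.15. *)
fun split (s, delim) = String.tokens (fn c => c = delim) s

fun printNode (id, edges, dist, color) =
  MapContext.emit (id, edges ^ "|" ^ dist ^ "|" ^ color ^ "|")

fun reduceNode (id, edges, dist, color) =
  ReduceContext.emit (id, edges ^ "|" ^ dist ^ "|" ^ color ^ "|")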

4.8 MLoop Limitations

From actual use cases, we find that MLoop suffers from several shortcomings. They come from the implementation as well as from the nature of the approach used to develop MLoop. In the current implementation, MLoop does not support partitioners or Hadoop job counters, because the amount of time available for implementation was limited. The partitioner decides which reducer an intermediate key-value pair belongs to; the lack of partitioner support makes it impossible for developers to customize this function. In some situations, the performance of MapReduce jobs increases a lot when job counters are used; this cannot be achieved in MLoop because it does not support counters. Finally, access to HDFS is not fully supported in the current implementation of MLoop; it is restricted to the pre-defined actions of the record reader and writer.

The current approach of MLoop also has some inherent drawbacks. Because it extends Hadoop Pipes, which does not provide any mechanism to chain multiple jobs, job chaining is also impossible in MLoop. Therefore, the class of problems which need iterative jobs cannot be implemented directly in MLoop. Furthermore, communication between MLoop and the Java framework goes through Hadoop Pipes, which introduces overhead on data communication and can reduce the performance of the job.

Chapter 5

Evaluation and Results

In order to verify the efficiency and usefulness of MLoop, we conduct various experiments to compare it with other existing large-scale solutions. The considered solutions include Hadoop, Hadoop Streaming and MPI. Two clusters were set up to run these experiments: a small heterogeneous cluster of high-end machines and a bigger homogeneous one of low-end commodity machines. Several noteworthy points are then drawn from the results. Based on them, we suggest several guidelines which should be considered when choosing a parallel processing solution for a problem.

5.1 Evaluation Metrics

There are many aspects which need to be considered when choosing a solution for parallel processing. Considering all of them is impossible in this thesis. Instead, we choose several notable aspects as metrics to evaluate the chosen approaches. They include:

Performance In this evaluation, the performance is measured by the execution time - the amount of time which the program needs to solve the problem.

Scalability This describes how the program scales with different sizes of the same problem. When looking at scalability, two aspects are under consideration. In terms of data, the execution time of the same algorithm should increase linearly with the amount of data. We use the ratio per unit (constant of proportionality) to compare how effective each solution is in this respect. This value is calculated as

    ratio per unit = (y_t / y_0) / (x_t / x_0) = (y_t / y_0) · (x_0 / x_t),

where y_t is the execution time when the amount of data is x_t, and y_0 is the execution time for the smallest amount of data, x_0. In the ideal case the ratio per unit is 1, meaning that the execution time increases at the same rate as the amount of data. In practice this value is greater than 1, meaning that the execution time increases faster than the amount of data. A smaller ratio per unit means better scalability.

In terms of resources, with the same amount of data, the execution time of the same algorithm should be inversely proportional to the cluster size. In order to compare this property, we calculate the constant of proportionality

    a = (y_t / y_0) · (x_t / x_0),

where x_t is the cluster size and y_t is the corresponding execution time. In the ideal case, a equals 1, meaning that the execution time decreases at the same rate as the cluster size increases. In general, however, a is greater than 1, meaning that the execution time decreases more slowly than the cluster size increases; this is because there is extra communication overhead when the cluster size grows. For this metric, a smaller value also means better scalability.

Specification     | Machine 1                    | Machine 2                    | Machine 3
Processor (CPU)   | Core i7-3610QM 2.2GHz        | Core 2 Quad Q9450 2.66GHz    | Core 2 Quad Q9450 2.66GHz
                  | (4 cores, 2 threads/core)    | (4 cores, 1 thread/core)     | (4 cores, 1 thread/core)
Operating System  | Ubuntu 15.04                 | Ubuntu 14.04                 | Ubuntu 14.04
Memory            | 4GB RAM                      | 8GB RAM                      | 8GB RAM
Hard Disk         | Read 60 MB/s, Write 61 MB/s  | Read 77 MB/s, Write 80 MB/s  | Read 98 MB/s, Write 111 MB/s
Network Bandwidth | 936 megabits per second

Table 5.1: Local cluster specification

Fault Tolerance This shows the robustness of the program against errors and faults, which can come from hardware failures or software bugs. In this evaluation, we only record it as "yes" or "no".

Development Effort This is the effort the developer needs to spend to develop a program that solves the problem. It is hard to measure; in this evaluation, it is measured simply by the number of lines of code in the program and by how hard or easy that program is to maintain.

5.2 Experiments

In our experiments, we evaluate the performance of the system under two input parameters. The first is the cluster configuration: we compare the performance of two different clusters, one set up in a local environment and the other set up on Google Cloud. The second input parameter is the parallel approach used to solve the problem. Within the scope of this experiment, we compare different choices for developing programs for large-scale problems: MPI (Chapter 3), MLoop (Chapter 4), Hadoop and Hadoop Streaming (Chapter 2).

5.2.1 Experiment Setup

Two clusters were set up to carry out the experiments; the Hadoop and MPI clusters were then set up on top of them. The first cluster includes three machines with powerful hardware. These machines differ in CPU, RAM, disks and so on; the details are shown in Table 5.1. In total, this cluster has 12 physical cores (16 logical cores) and 20 GB of memory to handle the tasks. The second cluster is based on the Google Compute Engine service. Eight virtual machine instances form a homogeneous cluster. Each instance is an n1-standard-1, which has one virtual CPU and 3.75 GB of memory, and is attached to a 50 GB "standard persistent disk".

Specification     | Machines 1 - 8
Processor (CPU)   | A single hyperthread on an Intel(R) Xeon(R) E5-2670 @ 2.60GHz
Operating System  | Ubuntu 14.04
Memory            | 3.75GB RAM
Storage           | 50 GB HDD
Hard Disk         | Read 127 MB/s, Write 71 MB/s
Network Bandwidth | 1875 megabits per second

Table 5.2: Google cluster specification

Configuration            | Cloud Cluster         | Local Cluster
Map container memory     | 1.5 GB                | 1.5 GB
Reduce container memory  | 2 GB                  | 2 GB
Maximum memory           | 3 GB on each node     | 3 GB on node 1, 7 GB on nodes 2 and 3
Total memory             | 24 GB                 | 17 GB
# Parallel mappers       | ⌊3/1.5⌋ × 8 = 16      | ⌊3/1.5⌋ + ⌊7/1.5⌋ × 2 = 10
# Parallel reducers      | ⌊3/2⌋ × 8 = 8         | ⌊3/2⌋ + ⌊7/2⌋ × 2 = 7

Table 5.3: Hadoop configuration

The detailed specification is listed in Table 5.2. In total, this cluster has 8 physical (logical) cores and 30 GB of memory to process the workload.

In terms of software, all the machines in both clusters have JDK 1.7.0_76, OpenMPI 1.8.1 and Apache Hadoop 2.2.0 installed. Hadoop is set up on each cluster so that a mapper container uses 1.5 GB of memory while a reducer container uses 2 GB of memory. The maximum memory that Hadoop YARN can utilize on each node is set according to the memory of that node. Table 5.3 shows the details.

To evaluate the performance of the clusters, two kinds of applications were tested. We use one data-intensive application, Word Count, and three processor-intensive applications: Graph Search, Summation and the 17-Queens problem. However, Graph Search can also be considered data-intensive because its input is not small at all. For each problem, the input files are stored in the Hadoop Distributed File System (HDFS). They are also stored on the master node of the MPI cluster so that the master process can read and then distribute the data to the worker processes. Each MPI task is tested with different numbers of processes which are multiples of the number of logical cores in the cluster.

The input data for the Word Count problem is extracted from the Amazon movie review dataset [24]. This dataset consists of movie reviews from Amazon collected over a period of more than ten years, comprising about eight million reviews. From this dataset, 2GB, 4GB and 8GB files are extracted as input for the Word Count problem. With the Hadoop family, we use 6 reducers to generate the result.

The Graph Search problem finds the smallest distance from a given node to every node in the graph. The input for this problem is a huge graph which is generated randomly by a Java program. The graph contains one hundred thousand nodes, each with 1000 adjacent nodes, and is stored as an adjacency list in a 592 MB text file. The detailed input format and the solution algorithm were already presented in previous chapters. In this experiment, 6 reducers are also used to generate the output on the Hadoop cluster.

In the Summation problem, the sum from zero to 10^10 is calculated in a deliberately plain and inefficient fashion: the sum is aggregated one number at a time from the first to the last. The sequence is split into ten sub-sequences of equal length to be calculated in parallel. The purpose here is to measure the efficiency of parallel processing with Hadoop and MPI.

Another processor-intensive problem considered here is 17-Queens: the problem of placing 17 chess queens on a 17x17 chessboard so that no two queens threaten each other. There are various optimized solutions for this, but since the purpose of this experiment is to evaluate the performance of the parallel solutions, the brute-force solution is applied. The detailed algorithm is also described in previous chapters. In this experiment, the Hadoop cluster is set up to run with 8 reducers.

5.2.2 Experimental Results

Experiments were conducted to compare the different large-scale parallel processing solutions based on several pre-defined evaluation metrics. The results drawn from these experiments are therefore discussed separately for each chosen metric. To obtain the performance results, each evaluation was conducted three times; the reported figures are the average values.

Performance

In this metric, the performance differences between the solutions and between the two cluster setups are analyzed. From the related work we already know that MPI provides the best results and that, among the Hadoop family, the native Hadoop is the best. However, we want to know how much better MPI is, and how large the difference in performance is between our MLoop library and the native Hadoop; these two concerns therefore receive special attention. Furthermore, in some experiments we will see MLoop with the combiner enabled (programs written in MLoop that use a combiner to optimize performance) but not Hadoop Streaming with a combiner enabled. That is because we only want to know the effect of using the combiner in the MLoop library.

Figure 5.1 shows the execution time on Word Count when the input file size is 2.1 GB. It is easy to see that MPI outperforms the other solutions. On the local cluster, the MPI job only needs 50.6% of the time that the best of the remaining solutions - the native Hadoop - needs. It means that MPI performs at least 1/0.506 = 1.98 times faster. On the Google cloud cluster, the MPI job spends more time; however, it is still at least 1.21 times faster than the others. This can be explained: the MPI job on the cloud cluster has fewer cores to handle the tasks than the one on the local cluster (8 cores versus 16 logical cores), so the performance of MPI on the cloud cluster is worse than on the local one. Among the Hadoop family, the native Hadoop - in which programs are written in Java - provides the best performance, while Hadoop Streaming is the worst. We can also see that the performance of the Hadoop family on the Google cloud cluster is better than on the local cluster: 235.7 seconds compared with 273.3 seconds. The reason is simple. Hadoop divides the job into smaller tasks, and the number of parallel tasks that can run at the same time depends on the amount of available memory from which the YARN resource manager can allocate containers. Table 5.3 shows that the cloud cluster has the bigger amount of available memory: 24 GB in comparison with 17 GB.

[Bar chart: execution time in seconds on the Word Count problem. Cloud cluster: Hadoop 237.5, Hadoop Streaming 315, MLoop (combiner enabled) 307, MLoop 292, MPI 195.5. Local cluster: Hadoop 273.3, Hadoop Streaming 332.7, MLoop (combiner enabled) 289.7, MLoop 288, MPI 138.3.]

Figure 5.1: Performance on Word Count problem

However, we could see that the judgment we made about MPI above does not hold for MLoop. We already know that increasing the degree of parallelization also increases communication costs. One of the disadvantages of the MLoop architecture is that it uses a lot of communication between modules. This can cause a performance overhead, as we observe: MLoop performs worse on the cloud cluster. Two versions of MLoop are tested on this problem: with and without the combiner. For Word Count, the combiner should help to improve performance considerably, at least in theory. However, MLoop with the combiner enabled does not work better than the version without it: enabling the combiner increases the execution time from 292 seconds to 307 seconds and from 288 seconds to 289.7 seconds on the cloud and local cluster, respectively. The overhead of enabling the combiner in MLoop may simply be too big in this case. An interesting point in the results is that MLoop works better than Hadoop Streaming: the execution time is 6.3% lower on the cloud cluster (292 seconds versus 315 seconds) and 13.4% lower on the local cluster (288 seconds compared with 332.7 seconds). In comparison with the native Hadoop, MLoop provides promising results. On the cloud cluster, the performance of MLoop is 81.3% (237.5/292) of the performance of the native Hadoop. This ratio is even better on the local cluster, where it is very close to 1, about 94.5% (273.3/288).

The results of the Graph Search problem are shown in Figure 5.2. Note that in this experiment, the reported execution time of MLoop and Hadoop Streaming is just the amount of time needed to produce the answer when we know the height of the tree. Otherwise, the execution time would be greater than the reported figures, because we would need more time to check whether that answer is final. The purpose of this experiment is to evaluate the performance when chaining multiple jobs, so it is reasonable to assume that we know the height of the tree. The best solution is still MPI. It saves about 28% of the time that the best of the remaining solutions has to spend to solve the problem (211.4 seconds compared with 292.3 seconds on the local cluster, and 303.9 seconds compared with 422.3 seconds on the cloud cluster). The native Hadoop is still the best choice in the Hadoop family. The notable point in this problem is that MLoop really does offer better results than Hadoop Streaming. On the cloud cluster, it saves the developer 4% of the time, needing only 490 seconds instead of 508.7 seconds to finish the task. On the local cluster, the amount of time saved is even larger, an 8% reduction from 343.3 seconds to 314.3 seconds. This gain comes from the use of the combiner, which reduces the execution time from 333.3 seconds to 314.3 seconds, a 6% reduction. However, the efficiency of using the combiner in this problem is hard to judge, because on the cloud cluster it only saves one second (491 seconds and 490 seconds with and without the combiner, respectively). The results also show that MLoop is worth considering as an alternative to the native Hadoop: its performance is about 86% of the native one on both cluster configurations (333.3 seconds versus 292.3 seconds and 491 seconds versus 422.3 seconds on the local and cloud cluster, respectively).

[Bar chart: execution time in seconds on the Graph Search problem. Cloud cluster: Hadoop 422.3, Hadoop Streaming 508.7, MLoop (combiner enabled) 491, MLoop 490, MPI 303.9. Local cluster: Hadoop 292.3, Hadoop Streaming 343.3, MLoop (combiner enabled) 314.3, MLoop 333.3, MPI 211.4.]

Figure 5.2: Performance on Graph problem

The result from Figure 5.2 also shows a remarkable point: the local cluster only needs around two thirds of the time that the cloud cluster spends to solve the problem. This is a significant difference. The phenomenon is not hard to explain when we remember that this is a processor-intensive problem. The cloud cluster is able to run 8 truly parallel threads at the same time (8 machines * 1 core * 1 thread/core), while the local cluster has 16 truly parallel threads (4 cores * 2 threads/core + 4 cores * 1 thread/core + 4 cores * 1 thread/core). Moreover, each machine in the local cluster is more powerful, as we can see in the specification given in the previous section. Therefore, solving the problem with MPI on the local cluster achieves a better result. Indeed, the result of this problem is also consistent with the Word Count result, where MPI also showed better performance on the local cluster. The processing power of each machine in the cluster also explains the performance of the Hadoop cluster in this experiment: the local cluster is able to process the tasks faster in each phase of the job (map and reduce), so the overall performance is better.

Figure 5.3 depicts the performance of the clusters on the 17-Queens problem. Because the operation in the reduce phase of the MapReduce algorithm does not really do any aggregation, the combiner will not help to optimize anything; hence, MLoop with the combiner enabled is not evaluated in this experiment. As described before, this is a processor-intensive problem, so we can observe the familiar pattern in the results. The local cluster performs better for every parallel architecture, MPI as well as the Hadoop family. The amount of time that MPI and the native Hadoop spend on the local cluster is about 60% of the time spent on the cloud cluster. On the other hand, MLoop and Hadoop Streaming on the local cluster only save about 20% of the execution time compared with the cloud cluster. Among the tested solutions, MPI continues to show its strength in performance. The native Hadoop, the best in the Hadoop family, is about 32% and 23% slower than MPI on the local and cloud cluster, respectively.

[Bar chart: execution time in seconds on the 17-Queens problem. Cloud cluster: Hadoop 1377.7, Hadoop Streaming 1886.3, MLoop 1715.7, MPI 1118.9. Local cluster: Hadoop 831.7, Hadoop Streaming 1519, MLoop 1382.3, MPI 632.2.]

Figure 5.3: Performance on 17-Queen problem

One thing that we can observe from the result is that MLoop provides an improvement of about 9% in performance compared with Hadoop Streaming on both the local and the cloud cluster: it reduces the execution time from 1519 seconds to 1382.3 seconds and from 1886.3 seconds to 1715.7 seconds, respectively. This improvement saves a lot of time in long-running tasks. This experiment also confirms the promise of MLoop in comparison with the native Hadoop: MLoop achieves 1377.7/1715.7 = 80% of the performance of the native Hadoop on the cloud cluster. However, MLoop only obtains about 60% of the native Hadoop's performance on the local cluster, where it needs 1382.3 seconds while the native Hadoop only spends 831.7 seconds.

[Bar chart: execution time in seconds on the Summation problem. Cloud cluster: Hadoop 181, Hadoop Streaming 46.5, MLoop 62.5, MPI 29. Local cluster: Hadoop 169, Hadoop Streaming 37.3, MLoop 50.3, MPI 30.]

Figure 5.4: Performance on Summation problem

The result of the last experiment is shown in Figure 5.4. In this experiment, the native Hadoop performs rather badly in comparison with the others: its execution time is 3.4 to 4 times the execution time of the worst of the remaining solutions. This comes from the nature of the programming language. With MPI, the Summation algorithm is implemented in C++ using the GMP library for arbitrary-precision arithmetic. With MLoop and Hadoop Streaming for Standard ML, the algorithm is implemented in SML; however, SML also uses GMP as its multiple-precision arithmetic library. GMP offers very fast operations on huge operands compared with the Java implementation used in the native Hadoop. Therefore, it is reasonable that MPI, MLoop and Hadoop Streaming for SML perform very well in comparison with the native Hadoop. Moreover, MPI is still the best option, completing the task in only about 30 seconds. MLoop also shows a very good result with about 50 seconds. Hadoop Streaming for SML works better than MLoop this time, with 37.3 and 46.5 seconds on the local and cloud cluster, respectively. The operations executed by MLoop and Hadoop Streaming for SML are the same because they share the same code; the only difference between them in this experiment is their communication with the MapReduce framework. Therefore, the difference in performance between MLoop and Hadoop Streaming comes from the communication cost. Hadoop Streaming uses system pipes, while MLoop communicates with the MapReduce framework via sockets. Hadoop Streaming therefore pays the cost of converting between the (key, value) pairs and the lines of bytes used as input and output of the pipes. However, Summation is a processor-intensive task with very little communication between processes/workers: each process/worker only receives two inputs, the starting and ending numbers of the sub-sequence whose sum it needs to calculate, and when they finish, a reducer receives the partial sums from the mappers and calculates the final result. Hence, the conversion cost in Hadoop Streaming is small. Furthermore, using a socket is not as cost-effective as using a pipe. Therefore, MLoop is slower than Hadoop Streaming in this case.
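To make the role of arbitrary-precision arithmetic concrete, the following is a minimal sketch of the per-range summation in Standard ML. It is illustrative only - not the sum.sml shipped with MLoop, and the function name is ours - but it shows how the inner loop runs entirely on IntInf values, which MLton implements on top of GMP, just as the C++/GMP version does.

(* Sum the integers in [lo, hi] one by one with arbitrary-precision
   IntInf arithmetic; in MLton, IntInf is backed by GMP. *)
fun sumRange (lo : IntInf.int, hi : IntInf.int) : IntInf.int =
    let
        fun loop (i, acc) =
            if IntInf.> (i, hi) then acc
            else loop (IntInf.+ (i, IntInf.fromInt 1), IntInf.+ (acc, i))
    in
        loop (lo, IntInf.fromInt 0)
    end

(* A map task would evaluate sumRange on one of the ten sub-ranges of
   [0, 10^10] and emit the partial sum as a string; the single reducer
   then adds the ten partial sums. *)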

Scalability

For this metric we conduct two tests: Word Count with different sizes of data and with different cluster sizes. The tests only provide reasonable results if the cluster is homogeneous, so that we can eliminate the effect of unexpected factors; therefore, these experiments are conducted on the Google cloud cluster. However, we only have limited resources, with a maximum cluster size of 8 nodes. As a result, the analysis of scalability with respect to resources is only appropriate for small clusters.

Execution Time (s) / Ratio per Unit
Input Size   MPI            MLoop       Hadoop Streaming   Hadoop
2 GB (1x)    184.882/1      276/1       276.667/1          235.667/1
4 GB (2x)    445.782/1.21   749/1.36    820/1.48           590/1.25
6 GB (3x)    620.133/1.12   1154/1.39   1194.5/1.44        961.5/1.36
8 GB (4x)    892.111/1.21   1559/1.41   1628.5/1.47        1225/1.30

Table 5.4: Word Count with different sizes of data
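For reference, the ratio-per-unit values in Table 5.4 are consistent with the following reading (the metric itself is defined formally with the evaluation metrics earlier in the thesis): if T(D) is the execution time on the baseline input of size D and T(kD) the time on an input k times larger, then

\[ r_k = \frac{T(k \cdot D)}{k \cdot T(D)} . \]

For example, MPI at 4 GB gives r_2 = 445.782 / (2 x 184.882) ≈ 1.21, and r_k = 1 corresponds to perfectly linear scaling with the data size.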

We conduct the first test with various sizes of input data: 2 GB, 4 GB, 6 GB and 8 GB. The results are reported in Table 5.4. Figure 5.5 visualizes the ratio per unit of each solution at the different input sizes: 2x (4 GB), 3x (6 GB) and 4x (8 GB). The value is expected to be greater than one, and the smaller it is, the better the scalability. From this visualization, we can say that MPI provides the best scalability on this dataset.

[Line chart: ratio per unit (smaller is better) at 2x, 3x and 4x input size for MPI, MLoop, Hadoop Streaming and Hadoop; the plotted values are those listed in Table 5.4.]

Figure 5.5: Scalability on Word Count with different input sizes

When increasing the size of the data 2, 3 and 4 times, the execution time of the MPI solution increases at nearly the same rate: its ratio per unit, between 1.1 and 1.2, is close to 1. The native Hadoop is the next candidate. It offers a ratio per unit between 1.25 and 1.36, which is not too far from the ideal solution. MLoop is a bit worse than the native Hadoop, with a ratio per unit from 1.36 to 1.41. Hadoop Streaming shows the least scalability: its ratio per unit is very close to 1.5, which means that the execution time increases almost 1.5 times faster than the amount of data does.

Execution Time (s) / Constant of Proportionality
Cluster Size   MPI            MLoop       Hadoop Streaming   Hadoop
2 nodes (1x)   578.4085/1     1356.5/1    1441.5/1           1104/1
4 nodes (2x)   308.766/1.07   702/1.04    729.5/1.01         619.5/1.12
6 nodes (3x)   229.925/1.19   464/1.03    480/1.00           369.5/1.00
8 nodes (4x)   195.502/1.35   292/0.86    315/0.87           237.5/0.86

Table 5.5: Word Count with different cluster sizes
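The constant of proportionality a reported in Table 5.5 is, likewise, consistent with the following reading: if T_1 is the execution time on the baseline 2-node cluster and T_k the time on a cluster k times larger, then

\[ a_k = \frac{k \cdot T_k}{T_1} . \]

For example, MPI on 8 nodes gives a_4 = 4 x 195.502 / 578.4085 ≈ 1.35, and a_k = 1 corresponds to ideal linear speed-up.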

In the second test, the Word Count problem with a 2 GB input is solved with different cluster sizes ranging from 2 to 8 nodes. Table 5.5 gives the detailed results. The constant of proportionality a is illustrated in Figure 5.6, where the red line is the value of a in the ideal case. All the values are expected to be greater than the ideal value, which is one. MPI provides very good scalability when the cluster size is doubled, with a = 1.07 ≈ 1. However, when the cluster size keeps increasing, a also increases, from 1.07 to 1.19 and then 1.35. This means that MPI is less efficient with bigger cluster sizes, at least up to 8 nodes. This result is reasonable, because increasing the number of nodes in the cluster also increases the communication cost. On the contrary, the Hadoop family achieves a very good value of a which is very close to 1 or even below 1. It is extraordinary that the Hadoop family has a less than 1 (a ≈ 0.86) on the 8-node cluster: the execution time decreases even faster than the cluster size increases. This would be impossible if the cluster utilized as much of its strength as possible at all times. Consider the native Hadoop cluster.

[Line chart: constant of proportionality (smaller is better) at 2x, 3x and 4x cluster size for MPI, MLoop, Hadoop Streaming and Hadoop, with the ideal case a = 1 marked; the plotted values are those listed in Table 5.5.]

Figure 5.6: Scalability on Word Count with different cluster sizes

When changing the cluster size from 2 to 4 nodes (2 times), from 4 to 6 nodes (1.5 times) and from 6 to 8 nodes (1.33 times), the execution time reduces 1.78, 1.67 and 1.55 times, respectively. This pattern shows that the Hadoop cluster achieves better performance with bigger cluster sizes. However, because there are not enough resources to test cluster sizes beyond 8 nodes, we do not have enough data to analyze this pattern fully. From this test we only know that Hadoop works better with a cluster size of 6 or 8 nodes than with 2 or 4. The inefficiency of a very small cluster (2 or 4 nodes) makes Hadoop perform badly and spend too much time to complete, so there is a big improvement when we compare the performance of the small cluster with the big one. This explains the extraordinary result above.

Fault Tolerance

This feature is easy to test: one of the nodes is shut down, and if the program is still able to complete, the solution provides good fault tolerance; otherwise, it does not. We should remember that MPI is a specification, not an implementation. For this criterion, we therefore consider fault tolerance as a property of an MPI program coupled with a specific MPI implementation. The actual result confirms that MPI (OpenMPI 1.8.1) does not provide fault tolerance by default: when one of the nodes fails, the entire program also fails. Although OpenMPI supports the BLCR checkpoint/restart system [2], it requires configuration and is not transparent to users. On the other hand, Hadoop supports fault tolerance by default, as it is designed for that. When a node fails, all the tasks running on the failed node are re-scheduled on another node of the cluster. The actual result shows that Word Count on Hadoop finishes after about 15 minutes when one node fails, whereas in the normal case Hadoop solves this problem in about 5 minutes. This is because, by default, Hadoop considers a node as failed if it has sent no heartbeat during a period of ten minutes; this interval can be configured by users. However, when we shut down the master node, the Hadoop job fails to complete. This is a serious issue, but in a Hadoop cluster this failure is unlikely, because the chance of one particular machine failing is low compared with the chance of an arbitrary machine in the cluster failing. Furthermore, Hadoop provides mechanisms to recover from this failure. Therefore, it is not a big problem.

In summary, fault tolerance in MPI is poorly supported. Hadoop, on the other hand, provides well-designed fault tolerance services, and a Hadoop cluster is robust against failures.

Development Effort

The development effort of parallel programs could be evaluated in many aspects, including the amount of time needed to design and write the program, how readable the code is, and the effort required to maintain or debug the program. These aspects are hard to evaluate properly: the time needed to develop a program depends on the experience of the developer and the complexity of the algorithms to be implemented, while the two other aspects are hard to quantify. Therefore, within the scope of this evaluation, we use two simple features to represent development effort. First, the number of lines of code can be used to indicate the complexity of a solution, so it makes sense to use that number to measure development effort. Second, instead of measuring the maintenance effort quantitatively, we classify it into three levels of difficulty: easy, medium and hard.

The number of code lines
Problem        MPI   MLoop   Hadoop Streaming   Hadoop
Word Count     316   27      41                 96
Graph Search   433   83      109                244
17-Queens      257   104     135                191
Summation      91    33      59                 90

Table 5.6: The number of code lines

The results shown in Table 5.6 confirm that writing programs in Standard ML yields the most compact code. This is because Standard ML is a functional programming language: writing programs in SML means describing what the solution is rather than how to compute it. Therefore, a program written in SML is shorter than in other, non-functional programming languages. Indeed, MLoop and Hadoop Streaming provide the shortest programs, as seen in the results. The reason that a Hadoop Streaming program is longer than the corresponding MLoop program is that extra operations are needed in the Hadoop Streaming program with SML to parse standard input into (key, value) input pairs. From the previous experiments, we know that MPI provides the best performance. However, in return, MPI programs are much more complicated than programs written for the Hadoop family. For complicated algorithms such as Graph Search and 17-Queens, and even for a simple program like Word Count, the program written in MPI is 3 to 10 times as long as in MLoop. That is because developers must handle aspects which they do not have to deal with in Hadoop, such as reading and distributing the data and managing the parallel architecture. The native Hadoop also gives compact code: programs written with Java MapReduce are two or three times as long as programs written in MLoop.

The level of difficulty of developing parallel programs is judged from my experience of implementing the different algorithms with these parallel solutions. The impression is that the parallel model of Hadoop is easy to develop with because of its clear and simple ideas: I could write programs that are simple, easy to understand, and easy to debug and maintain without a lot of effort. Therefore, the native Hadoop (Java MapReduce), MLoop and Hadoop Streaming for SML get the difficulty level ”easy”. On the contrary, I spent a lot of time writing the MPI programs. Most of the time went into developing the parallel architecture - the way to split the task and the data - as well as managing the synchronization between processes. The effort needed to analyze or discover bugs is also a major problem. Finally, the complexity of the language is a concern: writing MPI in C++ requires great care with pointers and memory management, tasks which are simplified in languages such as Java and Standard ML but not in C/C++. Hence, the difficulty level of writing programs in MPI is considered ”hard”.
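To give a feeling for where this compactness comes from, the core of a Word Count job in MLoop can be sketched as below. This is an illustration written against the MLoop structures documented in Appendix A.3, not the wordcount.sml shipped with the library; the mapper's (key, value) argument convention and the way the two functions are registered with the framework are assumptions here.

(* Map: emit (token, "1") for every whitespace-separated token of the line. *)
fun wcMap (_ : string, line : string) : unit =
    List.app (fn tok => MapContext.emit (tok, "1"))
             (String.tokens Char.isSpace line)

(* Reduce: sum all counts received for one word and emit the total.
   getValueSet loads every value for the key into memory, which is
   acceptable for a sketch; see the warning in Appendix A.3.1. *)
fun wcReduce (word : string) : unit =
    let
        val counts = List.mapPartial Int.fromString (ReduceContext.getValueSet ())
    in
        ReduceContext.emit (word, Int.toString (List.foldl op+ 0 counts))
    end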

5.3 Discussion

Before MLoop was introduced, Hadoop Streaming was the first solution to consider for developing distributed and parallel programs with Standard ML. The results from the experiments show that the performance of Hadoop Streaming is not so far from the native Hadoop. In most of our experiments, Hadoop Streaming achieves about 73% to 85% of the performance of the native Hadoop, except for the experiment on the local cluster with the 17-Queens problem, where Hadoop Streaming only reaches about 55% of the native Hadoop's performance. Hadoop Streaming only provides better performance in the Summation problem, where SML's strength in operations on huge operands is exploited. From [17], we also know that Hadoop Streaming jobs programmed in C++ are slower than the native Hadoop in data-intensive jobs, but not in processor-intensive jobs. Therefore, we may conclude that Hadoop Streaming (with any language) suffers performance losses in data-intensive jobs, while its performance in processor-intensive jobs depends on the programming language used with Hadoop Streaming. On the other hand, although Hadoop Streaming (with SML) inherits the fault tolerance of Hadoop, it does not keep the original scalability, as analyzed in Section 5.2.2. The biggest advantage of using Hadoop Streaming is that MapReduce programs can be written in Standard ML. This helps to reduce the development effort, because developers are allowed to use their favorite programming language. Nevertheless, developers can only define the map, reduce and combiner functions of a MapReduce job running on the Hadoop cluster. We cannot control other aspects of the Hadoop framework such as the input reader, the output writer, the counters, etc. Hadoop Streaming also does not provide any mechanism to chain multiple MapReduce jobs, which limits its applicability to problems that need iterative jobs.

The results from the experiments confirm that MLoop is a good replacement for Hadoop Streaming for developing parallel programs with SML. For most metrics, MLoop beats Hadoop Streaming by providing better features. Except for the Summation problem, MLoop provides a 7% to 10% performance improvement in most cases. As a result, MLoop achieves about 80% - 95% of the performance of the native Hadoop in most cases; in the worst case - 17-Queens on the local cluster - MLoop gets 60% of the native Hadoop's performance. MLoop is only left behind by Hadoop Streaming in problems which require very little data communication, like Summation. MLoop is also better in terms of scalability and development effort. Moreover, MLoop gives developers more control over the operation of MapReduce jobs: it allows them to control the way the Hadoop framework parses input splits and writes output (key, value) pairs. The good performance that MLoop delivers shows that our approach is sound. Actually, our approach is inspired by the work in [23], where Pydoop showed a big improvement compared with Streaming (Python scripts). However, in terms of performance, the current approach is not the best possible choice. The approach used in Perldoop [10] brings much larger improvements over Streaming (Perl scripts): at least five times faster. As mentioned in Section 1.1.3, that approach converts the Hadoop Streaming scripts in Perl into Java code and therefore allows every part of the job to run inside the Hadoop framework.

MLoop still has several weaknesses. The experimental results force us to reconsider the efficiency of the combiner implementation in MLoop.
It is expected that the combiner helps to reduce the execution time of MapReduce jobs. But the combiner only reduces the time by a small amount in the Graph Search problem on the local cluster, while it provides almost no improvement on the cloud cluster. Using the combiner in MLoop even makes the performance worse, as we saw in the Word Count problem. This shows that the current implementation is not efficient in the communication between MLoop and Hadoop Pipes when calling the combiner function. Although MLoop achieves better scalability than Hadoop Streaming, it is still worse than the native Hadoop. Furthermore, like Hadoop Streaming, MLoop cannot take advantage of the Hadoop counters or access HDFS freely. We also cannot chain MapReduce jobs in MLoop.

If the performance of the application is the key requirement, the native Hadoop and MPI are the best candidates. Choosing them means that developers have to change their programming language: they have to use Java with the native Hadoop or C/C++ with MPI. From the experimental results, MPI provides the best performance. MPI only spends about 70% - 82% of the execution time of the native Hadoop, and in the best case it only uses half the execution time. This result is consistent with the results reported in [17]. MPI also provides better scalability because it can exploit the cluster's resources more efficiently. But this scalability decreases as the cluster size increases, since that also increases the communication costs; it requires developers to maintain a more complicated parallel model to communicate more efficiently between processes. Furthermore, MPI's fault tolerance is really poor compared with Hadoop's. The effort that developers need to spend to develop MPI programs is also much greater than for MapReduce programs. Take the Word Count program as an example: in MapReduce, we only need to define two simple steps, tokenizing the text in the map phase and aggregating the result in the reduce phase. In contrast, with MPI developers have to write their own code to distribute the data, manage the communication between processes, gather the results, and so on (Section 3.3.1). In short, MPI puts a heavy burden on developers. This is worth considering, because in many applications, such as trial-and-error analysis, we need an approach which is quick and easy to develop rather than one that requires spending too much time just implementing the algorithms. In these cases MPI is not a wise choice; Hadoop and MLoop are much better solutions. Therefore, if we care about the robustness and simplicity of our programs, the native Hadoop and MLoop are the promising choices.

One question which may arise, seeing the performance of MPI compared with the others, is why we do not provide a Standard ML API for MPI instead of for Hadoop if MPI performs so well. The answer is that MPI has to accept many of the weaknesses mentioned above in order to provide that performance, and these weaknesses are very hard to overcome. Meanwhile, MLoop is slower, but we can easily compensate for that with a larger cluster.

Another point worth highlighting from the experiments is the performance of the cluster under different configurations. We have two configurations: a smaller cluster of stronger nodes (the local cluster) and a bigger cluster of weaker nodes (the cloud cluster). We can see that the local cluster performs better on processor-intensive problems such as 17-Queens, Graph Search and Summation; on the Graph Search problem, the local cluster only spends two thirds of the execution time of the cloud cluster to complete the task. These results allow us to confirm that the processing power of the nodes in the cluster is an important factor which determines the performance on processor-intensive problems. On the other hand, we could see that the Hadoop cluster performs better on the cloud cluster with Word Count (a data-intensive problem). In this case, the cloud cluster, which has more memory available for YARN containers, provides a higher degree of parallelization, so the result is easy to understand. However, MLoop is an exception to this judgment: as analyzed before, the communication overhead of MLoop when increasing the degree of parallelization makes it perform worse on the cloud cluster. We therefore come to a conclusion: for data-intensive problems, MLoop prefers a smaller cluster of stronger nodes to a big cluster of low-end commodity nodes. Finally, we see from the experimental results that MPI always performs better on a cluster of stronger machines. MPI, which requires developers to manage the parallel model of the program themselves, can utilize the cluster's processing resources efficiently and in a way that is optimal for a particular problem.

The above comments may make us want to come back to the ”scale up” approach to building a cluster: purchasing high-end machines to build a small cluster, giving an efficient cluster with high performance. This may be true. However, the main challenge is that it is not cost effective, since the cost of high-end machines does not scale linearly: a machine with twice as many processors is often more than twice as expensive. In 2009, Barroso and Hölzle [20] used a TPC-C benchmark [3] to compare the cost-efficiency of a cluster based on a high-end server with one based on a low-end server. The results showed a difference in cost-efficiency of over a factor of four in favor of the low-end server. Thus, ”scale out” is preferred to ”scale up”, which is exactly what the MapReduce model aims at: to achieve a desired performance, it is recommended to increase the cluster size rather than the processing power of the machines. However, we can use a hybrid approach to reduce the disadvantages of ”scale up”: instead of using high-end machines to build a small cluster, we can have a bigger cluster of mid-end commodity machines.

5.4 Guidelines

Experience drawn from the experiments can be summarized into more general guidelines which are worth considering when choosing a distributed and parallel solution for an actual problem. The first guideline is to define the ultimate goal of the desired solution. We may want our solution to have good performance, high scalability, ease of development, and so on. Understanding this clearly is the first step towards finding the most appropriate solution: we cannot find something if we do not know what we are looking for. In particular, if we want a solution with the best possible performance, MPI is the answer. Otherwise, if we also require other good features such as robustness and ease of development and maintenance, the Hadoop family provides promising choices.

The strengths of each solution for the particular problem are the next thing we need to care about. Each parallel approach has its own strengths and weaknesses depending on the application. The native Hadoop is the most reasonable choice for data-intensive problems. For processor-intensive problems, MPI, Hadoop Streaming (C++ scripts) and Hadoop Pipes may be recommended because of the efficiency of the C++ language. However, there is no common formula for every problem. For example, we could say that in general the native Hadoop is the best among Hadoop, Hadoop Streaming and MLoop; yet in some situations this is no longer true. The Summation problem gives us an example: the strength of Standard ML in operations on very big operands is exploited there, and MLoop and Hadoop Streaming (SML scripts) are 3 or 4 times faster than the native Hadoop.

There are other aspects that we need to look at to decide on the most suitable approach for our problem. A detailed analysis of all of them is unnecessary; instead, we should focus on the few critical aspects which influence the operation of the solution the most. These are the aspects that many researchers focus on, such as scalability, development effort, fault tolerance, operating and maintenance cost, and energy cost.

The fourth guideline is that we may develop our own solution to meet our requirements. The promising results offered by MLoop are the motivation for doing so: although MLoop still suffers from several drawbacks, it shows that it is possible to provide an acceptable solution by extending Hadoop. In this way, we still take advantage of the great features of Hadoop while also making use of the good aspects of our favorite programming language. Hence, if we cannot change programming language but want performance beyond Hadoop Streaming, we can implement an extension of Hadoop Pipes; the architecture and the necessary technologies are discussed in Section 4.3.

The last guideline concerns the cluster configuration. Although ”scale out” is preferred to ”scale up”, we should think about the hybrid approach: instead of using a big cluster of low-end machines, we can use a slightly smaller cluster of mid-end machines. This can increase the cost a bit, but the performance might improve a lot. Therefore, we recommend a cluster of more powerful machines. Besides that, configuring a cluster also depends on the type of cluster: for an MPI cluster, the more processing cores in total, the better the performance; for a Hadoop cluster, we should provide more available memory in order to increase the degree of parallelization.

Chapter 6

Conclusions and Future Work

This chapter sums up the work and concludes this thesis. Section 6.1 gives a summary of what has been done and the contributions of this work. Section 6.2 suggests possible future work to improve the current solution, as well as recommendations for future research.

6.1 Summary

Big Data has become a fact of the world. It is rapidly becoming one of the driving forces behind the global economy, and it leads to a strong demand for efficient processing and storage models. This thesis aims to provide a distributed and parallel approach for programs in Standard ML. Instead of inventing everything from scratch, we extend the Hadoop Pipes framework to make it support MapReduce programs defined in SML. This thesis responds to three research objectives:

1. Develop a Standard ML API for Hadoop which allows MapReduce programs to be written in SML and run on a Hadoop cluster.

2. Evaluate the new API (library) to show its performance in comparison with existing large-scale solutions.

3. Provide guidelines to find the most suitable solution for an actual problem.

The library that we have developed is called MLoop. It allows developers to use Standard ML to define MapReduce programs. In the current version of MLoop, developers can define the map, reduce and combiner functions of MapReduce jobs. It also allows developers to customize the way input splits are parsed into input (key, value) pairs to feed into the map function. Moreover, the output of the reduce function can be modified before it is actually written to the Hadoop Distributed File System. MLoop inherits great features from Hadoop - fault tolerance, scalability, an easy programming model and freedom from system-level details - while it still keeps the strengths of Standard ML. The experimental results confirm that MLoop provides better performance than the existing approach provided by Hadoop, namely Hadoop Streaming. In general, MLoop performs worse than the native Hadoop; however, in most experiments it achieves about 80% to 95% of the performance of the native Hadoop in terms of execution time. This is a very promising result. In some special cases, MLoop even beats the native Hadoop in terms of execution time. The Summation problem is an example: in this case, Standard ML is much faster than Java in processing operations on big numbers, so the Summation program in MLoop requires much less time than its Hadoop Java counterpart.

However, the results also show that the combiner in MLoop is not efficient and does not work as expected. It provides no significant improvement in cases where it should, and it even reduces performance on the Word Count problem. This reduction comes from the communication overhead of using the combiner in MLoop. Furthermore, the approach taken to develop MLoop makes it suffer from some inevitable shortcomings. MLoop cannot take advantage of the Hadoop counters, which prevents developers from using some useful Hadoop programming patterns. MLoop also does not provide any mechanism to chain multiple MapReduce jobs to solve complicated problems. This makes it difficult or even impossible to implement the class of problems which need iterative MapReduce jobs.

The experiences from the experiments are summarized and generalized into guidelines. These guidelines, together with MLoop, are the main contributions of this thesis. The guidelines remind developers of the aspects to consider in order to find the most suitable approach for their actual big-data or large-scale computation problems.

6.2 Future Work

The current version of MLoop suffers from several drawbacks. Some of them are inevitable, while others can be improved. The first improvement to MLoop is to reduce the communication overhead of its combiner. The intermediate output pairs of the map function are sent back to the Hadoop Pipes framework and buffered there before being sent back to the combiner function of MLoop. A suggestion is therefore to do all of this buffering inside MLoop; in that way, we can save a lot of communication cost between MLoop and the Pipes framework.

In the current version, developers work with HDFS only indirectly, through the Record Reader and Writer, which limits the operations to several pre-defined actions. Providing an HDFS API in MLoop which allows developers to read and write files and to get information about files and directories would be a big improvement. This would give developers more control over file operations, and several kinds of business logic could then be implemented more easily.

The Hadoop framework allows developers to customize the partitioner functions, which specify the reducer to which an intermediate key-value pair belongs. Furthermore, Hadoop provides application-level counters that can be updated and retrieved by developers. Supporting the partitioner and the counters in MLoop is also considered for the future.

Because of the limited available resources, our experiments were conducted on small clusters. It would be interesting to know the performance of the different large-scale parallel solutions on bigger clusters. Therefore, one item of future work, after improving MLoop, is to evaluate it against the other solutions on bigger clusters. We would then get results from clusters whose sizes are closer to real cases, and consequently better analyses.

Finally, the promising results of this thesis motivate us to go further and provide a complete solution for Standard ML to write programs which can run on a Hadoop cluster. Remember that YARN supports different kinds of jobs, not only MapReduce. Therefore, one suggestion for future research is to study how to develop applications running directly on YARN. Then we can build a YARN application based on the MapReduce model but for Standard ML. This requires a lot of effort; however, we would be able to achieve a really good framework for Standard ML which can utilize the full features of the MapReduce model.

Bibliography

[1] Automatic Design of Algorithms Through Evolution. http://www-ia.hiof.no/~rolando/. [Online; accessed 09-May-2015].

[2] Berkeley Lab Checkpoint/Restart (BLCR) for LINUX. http://crd.lbl.gov/departments/computer-science/CLaSS/research/BLCR/. [Online; accessed 09-May-2015].

[3] Transaction Processing Performance Council. http://www.tpc.org/. [Online; accessed 09-May-2015].

[4] MLton's Foreign Function Interface. http://mlton.org/ForeignFunctionInterface, 2014. [Online; accessed 09-May-2015].

[5] ”Online in 60 seconds – A Year Later”. http://blog.qmee.com/online-in-60-seconds-infographic-a-year-later/, July 2014. [Online; accessed 09-May-2015].

[6] Folding@home. https://folding.stanford.edu/, 2015. [Online; accessed 09-May-2015].

[7] RHadoop Wiki. https://github.com/RevolutionAnalytics/RHadoop/wiki, Feb 2015. [Online; accessed 09-May-2015].

[8] SETI@home. http://setiathome.ssl.berkeley.edu/, 2015. [Online; accessed 09-May-2015].

[9] Swig-3.0 documentation. http://www.swig.org/Doc3.0/SWIGDocumentation.html, 2015. [Online; accessed 09-May-2015].

[10] José M. Abuín, Juan C. Pichel, Tomás F. Pena, Pablo Gamallo, and Marcos Garcia. Perldoop: Efficient execution of perl scripts on hadoop clusters. In Big Data (Big Data), 2014 IEEE International Conference on, pages 766–771. IEEE, 2014.

[11] Rajashekhar M. Arasanal and Daanish U. Rumani. Improving mapreduce performance through complexity and performance based data placement in heterogeneous hadoop clusters. In Chittaranjan Hota and Pradip K. Srimani, editors, Distributed Computing and Internet Technology, volume 7753 of Lecture Notes in Computer Science, pages 115–125. Springer Berlin Heidelberg, 2013.

[12] Michele Banko and Eric Brill. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL '01, pages 26–33, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics.

[13] Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. In Bernhard Schölkopf, John C. Platt, and Thomas Hoffman, editors, NIPS, pages 281–288. MIT Press, 2006.

[14] CouchBase. Dealing with Memcached Challenges. http://info.couchbase.com/rs/northscale/images/Couchbase_WP_Dealing_with_Memcached_Challenges.pdf, 2012. [Online; accessed 09-May-2015].

[15] Jeffrey Dean and Sanjay Ghemawat. MapReduce : Simplified Data Processing on Large Clusters, volume 51, pages 1–13. ACM, 2004.

[16] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[17] Mengwei Ding, Long Zheng, Yanchao Lu, Li Li, Song Guo, and Minyi Guo. More convenient more overhead: The performance evaluation of hadoop streaming. In Proceedings of the 2011 ACM Symposium on Research in Applied Computation, RACS ’11, pages 307–313, New York, NY, USA, 2011. ACM.

[18] Zhenhua Guo and Geoffrey Fox. Improving mapreduce performance in heterogeneous network environments and resource utilization. In Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (Ccgrid 2012), CCGRID '12, pages 714–716, Washington, DC, USA, 2012. IEEE Computer Society.

[19] Apache Hadoop. HDFS Federation. http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/Federation.html, 2015. [Online; accessed 09-May-2015].

[20] Urs Hoelzle and Luiz Andre Barroso. The Datacenter As a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 1st edition, 2009.

[21] John and Cailin. Breadth-first graph search using an iterative map-reduce algorithm. http://www.johnandcailin.com/blog/cailin/breadth-first-graph-search-using-iterative-map-reduce-algorithm, 2009. [Online; accessed 09-May-2015].

[22] Sharanjit Kaur, Rakhi Saxena, Dhriti Khanna, and Vasudha Bhatnagar. Comparing data processing frameworks for scalable clustering. In The Twenty-Seventh International Flairs Conference, 2014.

[23] Simone Leo and Gianluigi Zanetti. Pydoop: A python mapreduce and hdfs api for hadoop. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 819–825, New York, NY, USA, 2010. ACM.

[24] Jure Leskovec. Web data: Amazon movie reviews. https://snap.stanford.edu/data/web-Movies.html, April 2015. [Online; accessed 09-May-2015].

[25] Jimmy Lin. Data-intensive text processing with MapReduce. Morgan, Claypool, San Rafael, 2010.

[26] Peter Pacheco. An Introduction To Parallel Programming, chapter Distributed-Memory Programming with MPI, pages 83–140. Morgan Kaufmann, 2011.

[27] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD ’09, pages 165–178, New York, NY, USA, 2009. ACM.

[28] Vignesh Prajapati. Big data analytics with R and Hadoop. Packt Publishing Ltd, 2013.

[29] Sangwon Seo, Ingook Jang, Kyungchang Woo, Inkyo Kim, Jin-Soo Kim, and Seungryoul Maeng. Hpmr: Prefetching and pre-shuffling in shared mapreduce computation environment. In Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on, pages 1–8, Aug 2009.

[30] Daisy Tang. Lecture Notes: B-Trees. https://www.cpp.edu/~ftang/courses/CS241/notes/b-tree.htm, 2015. [Online; accessed 09-May-2015].

[31] Tao Gu, Chuang Zuo, Qun Liao, Yulu Yang, and Tao Li. Improving mapreduce performance by data prefetching in heterogeneous or shared environments. International Journal of Grid and Distributed Computing, 6, 2013.

[32] Tom White. Hadoop: The Definitive Guide, 3rd Edition, chapter Meet Hadoop, page 8. O’Reilly, 2012.

[33] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, and Xiao Qin. Improving mapreduce performance through data placement in heterogeneous hadoop clusters. In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1–9. IEEE, 2010.

[34] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving mapreduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, pages 29–42, Berkeley, CA, USA, 2008. USENIX Association.

Appendix A

Working with MLoop

In this appendix, we cover how to use our library MLoop to build MapReduce programs in Standard ML on a Hadoop cluster.

A.1 Prerequisites

MLoop is written in C++ and Standard ML. MLton is used to compile the Standard ML parts of MLoop, so you need to install MLton on your machine. Besides, MLoop executables run on a Hadoop cluster, so you also need to install Java, version 6 or later.

A.2 Installation

A.2.1 Installing Hadoop

We cover here the main steps to configure a Hadoop cluster on Ubuntu. You can find a full version here: http://n0where.net/multi-node-hadoop-cluster/.

Adding a dedicated Hadoop system user

We will use a dedicated Hadoop user account for running Hadoop. While that is not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.

$ sudo addgroup Hadoop
$ sudo adduser -ingroup Hadoop hduser

We created a user named hduser and assigned it to the group Hadoop.

Configuring SSH access

Because Hadoop works with a lot of nodes, it needs some mechanism to perform its operations across them; it uses pass-phrase-less SSH for this purpose. Generating a public/private key pair and placing it locally on every node in the cluster is the simplest way. At that point, the master node can use the private key to access the slave nodes. To generate an SSH key for the hduser account, we need to:

1. Login as hduser with sudo


2. Run this command to generate the key:

   $ ssh-keygen -t rsa -P ""

   It will ask for the file name in which to save the key; press ”Enter” so that it generates the key at /home/hduser/.ssh.

To enable SSH access to your local machine with this newly created key, run:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Disabling IPv6

Ubuntu uses the 0.0.0.0 IP for different Hadoop configurations, so we need to disable IPv6. First, open the configuration file:

$ sudo gedit /etc/sysctl.conf

Then add the following lines to the end of the file:

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Installing Hadoop

Run the following commands to download Hadoop version 2.2.0, unpack it and change the owner:

$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.2.0/hadoop-2.2.0.tar.gz
$ tar -xvzf hadoop-2.2.0.tar.gz
$ sudo chown -R hduser:Hadoop hadoop-2.2.0

Add the following lines at the end of the .bashrc file:

export HADOOP_HOME=/home/hduser/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

Add JAVA_HOME to libexec/hadoop-config.sh at the beginning of the file:

$ vi /home/hduser/hadoop-2.2.0/libexec/hadoop-config.sh
...
export JAVA_HOME='/usr/local/Java/jdk1.7.0_45'
...

Add JAVA_HOME to hadoop/hadoop-env.sh at the beginning of the file:

$ vi /home/hduser/hadoop-2.2.0/etc/hadoop/hadoop-env.sh
...
export JAVA_HOME='/usr/local/Java/jdk1.7.0_45'
...

Check the Hadoop installation:

$ cd /home/hduser/hadoop-2.2.0/bin
$ ./hadoop version
Hadoop 2.2.0
...

At this point, Hadoop is installed on your node. Create the tmp folder:

$ mkdir -p $HADOOP_HOME/tmp

Configuring the multi-node cluster

Add the IP addresses of the master and all slaves to /etc/hosts - on both the master and all the slave nodes. Set up password-less ssh from host master to host slave1:

$ ssh-copy-id -i /home/hduser/.ssh/id_dsa.pub hduser@slave
$ ssh slave1

Only on the master node, add all slaves to the slaves file:

$ vi /home/hduser/hadoop-2.2.0/etc/hadoop/slaves
slave1
slave2
...

Add the properties below to the following Hadoop configuration files, which are available under $HADOOP_CONF_DIR.

core-site.xml

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/hadoop-2.2.0/tmp</value>
</property>

hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

Note: Here, the replication value is 2 [one master and one slave]. If you have more slaves, set the replication value based on that; it is recommended that the replication value be 3 for a small cluster.

mapred-site.xml

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

yarn-site.xml

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce.shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>master:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>master:8040</value>
</property>

Only on the master node, format the namenode. This is needed just the first time the cluster is set up.

/home/hduser/hadoop-2.2.0/bin$ ./hadoop namenode -format

Starting Hadoop

We only need to start the processes on the master node; the processes on the slave nodes will start automatically.

/home/hduser/hadoop-2.2.0/sbin$ ./start-dfs.sh
/home/hduser/hadoop-2.2.0/sbin$ ./start-yarn.sh

We use start-dfs.sh to start the namenode and datanodes, and start-yarn.sh to start the resourcemanager and nodemanagers.

A.2.2 Installing MLoop

At first, we need to install MLton:

$ sudo apt-get install mlton-compiler

Install some packages required to compile MLoop:

$ sudo apt-get install uuid-dev
$ sudo apt-get install libssl-dev

Download and extract the MLoop package from https://github.com/mywish07/mloop/archive/master.zip. This zip file contains the MLoop library in the folder mloop-master. The MLoop library is also provided on the CD, in the folder source code/MLoop.

$ wget https://github.com/mywish07/mloop/archive/master.zip
$ unzip master.zip

This package contains some example programs written in MLoop, such as Word Count (wordcount.sml), Graph Search (graph.sml), N-Queens (queen.sml and queen2.sml) and Summation (sum.sml). To compile wordcount.sml into an executable program for Hadoop, use the following build utility. This assumes that you have already set the environment variable $HADOOP_HOME.

$ cd mloop-master
$ ./build main.sml

In the same folder, the output executable file is mloop. To execute this file in the Hadoop cluster:

1. Start Hadoop

2. Copy this file into HDFS.

   $ cd /home/hduser/hadoop-2.2.0/bin
   $ ./hdfs dfs -copyFromLocal /home/hduser/mloop-master/mloop /

3. Use the following command to start a job in the Hadoop cluster:

   $ ./mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input <input> -output <output> -program /mloop -reduces 3

This command will execute the job defined by the file mloop in the folder / of HDFS. Because in this case we do not use the Record Reader and Writer of MLoop, the command has the two options with value true. When using the Record Reader, we set hadoop.pipes.java.recordreader=false. The option -reduces is optional; it is used to set the number of reducers, and here we use three reducers.

A.3 MLoop Documentation

MLoop is a project which aims to create a library for Standard ML to run in parallel with Hadoop. MLoop provides a simple way to write parallel programs in Standard ML with the Map and Reduce concepts of Hadoop. Running on the Hadoop cluster, parallel programs written with MLoop inherit great features of Hadoop: scalability and fault tolerance. MLoop is developed as an extension of Hadoop Pipes which delegates tasks to SML functions. The SML executable file interacts with the Hadoop framework to process the task: the Hadoop framework initializes the job and divides it into smaller tasks which can run in parallel, and each task is then handled by the SML functions.

A.3.1 MLoop Structures

MLoop provides several structures for the interaction between the SML functions and the Hadoop Pipes framework. These structures play an important role in the library.

MapContext

The MapContext structure provides two utility functions for manipulating the map task: storing the address of the MLoopMapper C++ class (an implementation of the Mapper C++ abstract class) and emitting intermediate key-value pairs to the Hadoop Pipes framework.

Interface

val setAddress: MLton.Pointer.t -> unit
val emit: string * string -> unit

Description

setAddress address
  Store the address of the MLoopMapper C++ class. This function is reserved for internal use; the developer does not need to deal with it.

emit (key, value)
  Emit the intermediate key-value pair to the Hadoop Pipes framework.
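As a small illustration of how MapContext.emit is used (this map function is ours, not part of the MLoop distribution, and its (key, value) argument convention is assumed), a ”grep”-like mapper could forward only the records whose value contains a given substring:

(* Emit only the records whose value contains the substring "ERROR",
   keyed by the original key; everything else is dropped. *)
fun grepMap (key : string, value : string) : unit =
    if String.isSubstring "ERROR" value
    then MapContext.emit (key, value)
    else ()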

ReduceContext

The ReduceContext structure provides utility functions for manipulating the reduce task: retrieving the values associated with the current key and emitting the final key-value pairs.

Interface

val setAddress: MLton.Pointer.t -> unit
val emit: string * string -> unit
val getInputValue: unit -> string
val nextValue: unit -> bool
val getValueSet: unit -> string list

Description

setAddress address
  Store the address of the MLoopReducer C++ class. This function is reserved for internal use; the developer does not need to deal with it.

emit (key, value)
  Emit the output (key, value) pair to the Hadoop Pipes framework.

getInputValue ()
  Return the next value associated with the current key.

nextValue ()
  Check whether or not a next value associated with the current key exists.

getValueSet ()
  Return all the values associated with the current key. The developer must be careful: the resulting list may consume a lot of memory if there are too many values.
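The following is an illustrative reduce function (ours, not part of the MLoop distribution) that streams through the values of the current key to find their maximum. The iteration protocol assumed here - call nextValue to advance, then getInputValue to read the current value - mirrors the Hadoop Pipes C++ ReduceContext and may differ in detail from MLoop's actual behaviour.

(* Emit the maximum integer value seen for the current key; values that
   do not parse as integers are ignored. *)
fun maxReduce (key : string) : unit =
    let
        fun loop best =
            if ReduceContext.nextValue ()
            then
                let val v = getOpt (Int.fromString (ReduceContext.getInputValue ()), best)
                in loop (Int.max (best, v)) end
            else best
    in
        ReduceContext.emit (key, Int.toString (loop 0))
    end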

Reader

The Reader structure provides utility functions for working with a custom Record Reader.

Interface

val getOffset: unit -> Int64.int
val getBytes_read: unit -> Int64.int
val updateOffset_bytesConsumed: Int64.int * Int64.int -> unit
val getOffsetNow: unit -> Int64.int
val get: unit -> Reader
val seekHdfs: Int64.int -> unit

Description

getOffset ()
    Returns the offset of the current input split which corresponds to this map task.

getBytes_read ()
    Returns the number of bytes consumed by the reader in the last read operation.

updateOffset_bytesConsumed (offset, bytesConsumed)
    Updates the current offset of the reader in the file and the number of bytes consumed by the last read operation.

getOffsetNow ()
    Returns the current offset of the reader in the file.

get ()
    Returns the reader.

seekHdfs offset
    Seeks to the given offset in the file.
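MLoop's exact wiring of a custom Record Reader is not reproduced here, but as a rough, hypothetical sketch of how the bookkeeping functions above fit together, a reader might seek to the start of its split, consume a record, and then report its new position. The helper names (consumedRecord, gotoSplitStart) and the record-length argument are illustrative assumptions only:

(* Hypothetical bookkeeping after reading one record of length len bytes. *)
fun consumedRecord (len : Int64.int) =
    let
      val pos = Reader.getOffsetNow ()   (* current position of the reader in the file *)
    in
      (* advance the offset and record how many bytes the last read consumed *)
      Reader.updateOffset_bytesConsumed (Int64.+ (pos, len), len)
    end

(* Hypothetical positioning of the reader at the start of its input split. *)
fun gotoSplitStart () = Reader.seekHdfs (Reader.getOffset ())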

Writer

The Writer structure provides utility functions for working with a custom Record Writer.

Interface

val emit : string * string -> unit

Description

emit (key, value)
    Writes the key-value pair to the output file in HDFS.
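As a small hypothetical example, a custom writer could normalize the key before handing the pair to Writer.emit; only emit itself is part of the MLoop interface, the helper below is an illustrative assumption:

(* Hypothetical helper: upper-case the key, then write the pair to HDFS. *)
fun writeUpper (key:string, value:string) =
    Writer.emit (String.map Char.toUpper key, value)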

A.4 Running Sample Programs in Hadoop

Use the steps above to build the executables for the MLoop programs and rename them. For example, assume mloop-wc, mloop-graph, mloop-sum, mloop-queen and mloop-queen2 are the executables of wordcount.sml, graph.sml, sum.sml, queen.sml and queen2.sml, respectively. We also upload all of these executables to the root directory of HDFS. This is not required; you can upload them to any folder in HDFS.

A.4.1 Word Count

Run the following commands to run the Word Count program in the Hadoop cluster with n reducers.

$ cd /home/hduser/hadoop-2.2.0/bin
$ ./mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input <input dir> -output <output dir> -program /mloop-wc -reduces n

A.4.2 Summation

Run the following commands to run the Summation program in the Hadoop cluster.

$ cd /home/hduser/hadoop-2.2.0/bin
$ ./mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input <input dir> -output <output dir> -program /mloop-sum -reduces 1

A.4.3 17-Queens

For 17-Queens, we need to run several jobs in sequence to find the complete solution. First we run five jobs built from queen.sml (the mloop-queen executable), then one final job built from queen2.sml (mloop-queen2). Because MLoop does not support job chaining, we have to do it manually. Create a bash file with the following content:

#!/usr/bin/env bash
./mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /input-nqueen -output /output-mloop/queen1 -program /mloop-queen -reduces 8
i=1
for j in 2 3 4 5
do
    ./mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /output-mloop/queen$((i++)) -output /output-mloop/queen$j -program /mloop-queen -reduces 8
done
./mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /output-mloop/queen5 -output /output-mloop/queen -program /mloop-queen2 -reduces 8

Then execute the bash file to run the jobs.

A.4.4 Graph Search

We have the same situation as in the previous problem. Therefore, we also create a bash file with the following content:

#!/usr/bin/env bash
./mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /input-graph -output /output-mloop/graph1 -program /mloop-graph -reduces 6
i=1
for j in 2 3 4 5
do
    ./mapred pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /output-mloop/graph$((i++)) -output /output-mloop/graph$j -program /mloop-graph -reduces 6
done

Execute the bash file to run the jobs.

Appendix B

Program Source Code

In this appendix, we include the full source code of the MapReduce programs.

B.1 Hadoop Java Programs

B.1.1 Word Count

Listing B.1: WordCount.java 1 import org.apache.hadoop.conf.Configuration; 2 import org.apache.hadoop.fs.FileSystem; 3 import org.apache.hadoop.fs.Path; 4 import org.apache.hadoop.io.IntWritable; 5 import org.apache.hadoop.io.Text; 6 import org.apache.hadoop.mapreduce.Job; 7 import org.apache.hadoop.mapreduce.Mapper; 8 import org.apache.hadoop.mapreduce.Reducer; 9 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 10 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 11 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 12 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 13 14 import java.io.IOException; 15 import java.util.StringTokenizer; 16 17 public class WordCount { 18 public static void main(String[] args) throws IOException, 19 InterruptedException, ClassNotFoundException { 20 long startTime = System.currentTimeMillis(); 21 Path inputPath = new Path(args[0]); 22 Path outputDir = new Path(args[1]); 23 24 Configuration conf = new Configuration(true); 25 Job job = Job.getInstance(conf, "wordcount"); 26 job.setJarByClass(WordCount.class); 27 28 job.setMapperClass(WordCountMapper.class); 29 job.setReducerClass(WordCountReducer.class); 30 if (args.length > 2) 31 job.setNumReduceTasks(Integer.parseInt(args[2])); 32 else 33 job.setNumReduceTasks(4); 34 35 // Specify key / value 36 job.setOutputKeyClass(Text.class); 37 job.setOutputValueClass(IntWritable.class); 38 39 // Input 40 FileInputFormat.addInputPath(job, inputPath);


41 job.setInputFormatClass(TextInputFormat.class); 42 43 // Output 44 FileOutputFormat.setOutputPath(job, outputDir); 45 job.setOutputFormatClass(TextOutputFormat.class); 46 47 // Delete output if exists 48 FileSystem hdfs = FileSystem.get(conf); 49 if (hdfs.exists(outputDir)) 50 hdfs.delete(outputDir, true); 51 52 // Execute job 53 int code = job.waitForCompletion(true) ? 0 : 1; 54 long endTime = System.currentTimeMillis(); 55 System.out.println("Total time: " + (endTime - startTime) + "ms"); 56 System.exit(code); 57 58 } 59 60 public static class WordCountMapper extends 61 Mapper { 62 private final IntWritable ONE = new IntWritable(1); 63 private Text word = new Text(); 64 65 public WordCountMapper() { 66 } 67 public void map(Object key, Text value, Context context) 68 throws IOException, InterruptedException { 69 StringTokenizer itr = new StringTokenizer(value.toString()); 70 while (itr.hasMoreTokens()) { 71 word.set(itr.nextToken()); 72 context.write(word, ONE); 73 } 74 } 75 } 76 77 public static class WordCountReducer extends 78 Reducer { 79 public void reduce(Text text, Iterable values, 80 Context context) throws IOException, InterruptedException { 81 int sum = 0; 82 for (IntWritable value : values) { 83 sum += value.get(); 84 } 85 context.write(text, new IntWritable(sum)); 86 } 87 } 88 89 } B.1.2 Graph Search

Listing B.2: GraphSearch.java 1 import org.apache.commons.logging.Log; 2 import org.apache.commons.logging.LogFactory; 3 import org.apache.hadoop.conf.Configuration; 4 import org.apache.hadoop.conf.Configured; 5 import org.apache.hadoop.fs.Path; 6 import org.apache.hadoop.io.IntWritable; 7 import org.apache.hadoop.io.LongWritable; 8 import org.apache.hadoop.io.Text; 9 import org.apache.hadoop.mapreduce.Counters; 10 import org.apache.hadoop.mapreduce.Job; 11 import org.apache.hadoop.mapreduce.Mapper; 12 import org.apache.hadoop.mapreduce.Reducer; 13 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 14 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 15 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 16 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 17 import org.apache.hadoop.util.Tool; B.1. Hadoop Java Programs 105

18 import org.apache.hadoop.util.ToolRunner; 19 20 import java.io.IOException; 21 import java.util.ArrayList; 22 import java.util.List; 23 24 public class GraphSearch extends Configured implements Tool { 25 public static final Log LOG = LogFactory.getLog(GraphSearch.class); 26 static int printUsage() { 27 System.out.println("graphsearch [-m ] [-r ]"); 28 ToolRunner.printGenericCommandUsage(System.out); 29 return -1; 30 } 31 32 public static void main(String[] args) throws Exception { 33 int res = ToolRunner.run(new Configuration(), new GraphSearch(), args); 34 System.exit(res); 35 } 36 37 public int run(String[] args) throws Exception { 38 int iterationCount = 0; 39 long terminateValue = 1; 40 while (terminateValue > 0) { 41 String input; 42 if (iterationCount == 0) 43 input = "/input-graph"; 44 else 45 input = "/output/output-graph-" + iterationCount; 46 String output = "/output/output-graph-" + (iterationCount + 1); 47 48 Job job = getJobConf(args); 49 FileInputFormat.setInputPaths(job, new Path(input)); 50 FileOutputFormat.setOutputPath(job, new Path(output)); 51 52 job.waitForCompletion(true); 53 Counters jobCnt = job.getCounters(); 54 terminateValue = jobCnt.findCounter(counter_enum.ITERATION).getValue(); 55 iterationCount++; 56 } 57 return 0; 58 } 59 60 private Job getJobConf(String[] args) throws IOException { 61 Configuration conf = getConf(); 62 Job job = Job.getInstance(conf, "graphsearch"); 63 job.setJarByClass(GraphSearch.class); 64 65 job.setInputFormatClass(TextInputFormat.class); 66 job.setOutputFormatClass(TextOutputFormat.class); 67 job.setOutputKeyClass(Text.class); 68 job.setOutputValueClass(Text.class); 69 70 job.setMapperClass(MapClass.class); 71 job.setReducerClass(Reduce.class); 72 73 for (int i = 0; i < args.length; ++i) { 74 if ("-r".equals(args[i])) { 75 job.setNumReduceTasks(Integer.parseInt(args[++i])); 76 } 77 } 78 return job; 79 } 80 81 public static enum counter_enum {ITERATION} 82 83 public static class MapClass extends Mapper 84 { 85 protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 86 Node node = new Node(value.toString()); 87 // For each GRAY node, emit each of the edges as a new node (also GRAY) 106 Chapter B. Program Source Code

88 if (node.getColor() == Node.Color.GRAY) { 89 String edges = node.getEdges(); 90 if (edges != null && !"NULL".equals(edges)){ 91 for (String v : edges.split(",")) { 92 context.write(new Text(v), new Text("NULL|" + (node.getDistance() + 1) + "|GRAY")); 93 } 94 } 95 // We’re done with this node now, color it BLACK 96 node.setColor(Node.Color.BLACK); 97 } 98 99 // If the node came into this method GRAY, it will be output as BLACK 100 context.write(new Text(node.getId()), node.getLine()); 101 } 102 } 103 104 public static class Reduce extends Reducer 105 { 106 protected void reduce(Text key, Iterable values, 107 Context context) throws IOException, InterruptedException { 108 String edges = "NULL"; 109 int distance = Integer.MAX_VALUE; 110 Node.Color color = Node.Color.WHITE; 111 112 for (Text value : values) { 113 Node u = new Node(); 114 u.updateInfo(value.toString()); 115 if (!"NULL".equals(u.getEdges())) { 116 edges = u.getEdges(); 117 } 118 119 // Save the minimum distance 120 if (u.getDistance() < distance) { 121 distance = u.getDistance(); 122 } 123 124 // Save the darkest color 125 if (u.getColor().ordinal() > color.ordinal()) { 126 color = u.getColor(); 127 } 128 } 129 130 Node n = new Node(); 131 n.setDistance(distance); 132 n.setEdges(edges); 133 n.setColor(color); 134 context.write(key, new Text(n.getLine())); 135 if (color == Node.Color.GRAY) 136 context.getCounter(counter_enum.ITERATION).increment(1L); 137 } 138 } 139 } B.1.3 N-Queens In this program, first five job will use the map function which is defined by MapClass class. These job will do breadth-first search to find all possible ways of putting first five queens on the board. The last job will use the map function which is defined by MapClass2 class. This job finds all the solutions for each given board with five queens placed.

Listing B.3: NQueens.java 1 import org.apache.hadoop.conf.Configuration; 2 import org.apache.hadoop.conf.Configured; 3 import org.apache.hadoop.fs.Path; 4 import org.apache.hadoop.io.IntWritable; 5 import org.apache.hadoop.io.LongWritable; 6 import org.apache.hadoop.io.Text; 7 import org.apache.hadoop.mapreduce.Counters; B.1. Hadoop Java Programs 107

8 import org.apache.hadoop.mapreduce.Job; 9 import org.apache.hadoop.mapreduce.Mapper; 10 import org.apache.hadoop.mapreduce.Reducer; 11 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 12 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 13 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 14 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 15 import org.apache.hadoop.util.Tool; 16 import org.apache.hadoop.util.ToolRunner; 17 18 import java.io.IOException; 19 import java.util.Random; 20 21 public class NQueens extends Configured implements Tool { 22 static int n = 17; 23 public static void main(String[] args) throws Exception { 24 if (args.length > 0) { 25 n = Integer.parseInt(args[0]); 26 System.out.println(n); 27 } 28 int res = ToolRunner.run(new Configuration(), new NQueens(), args); 29 System.exit(res); 30 } 31 32 @Override 33 public int run(String[] args) throws Exception { 34 int iterationCount = 0; 35 36 String input; 37 String output; 38 while (iterationCount < 5) { 39 if (iterationCount == 0) { 40 input = "/input-nqueen"; 41 } else { 42 input = "/output/output-nqueen-" + iterationCount; 43 } 44 45 output = "/output/output-nqueen-" + (iterationCount + 1); 46 Job job = getJobConf(1, args); 47 FileInputFormat.setInputPaths(job, new Path(input)); 48 FileOutputFormat.setOutputPath(job, new Path(output)); 49 50 job.waitForCompletion(true); 51 Counters jobCnt = job.getCounters(); 52 iterationCount++; 53 } 54 input = "/output/output-nqueen-" + iterationCount; 55 output = "/output/output-nqueen-" + (iterationCount + 1); 56 Job job = getJobConf(2, args); 57 FileInputFormat.setInputPaths(job, new Path(input)); 58 FileOutputFormat.setOutputPath(job, new Path(output)); 59 job.waitForCompletion(true); 60 return 0; 61 } 62 63 private Job getJobConf(int n, String[] args) throws IOException { 64 Configuration conf = getConf(); 65 Job job = Job.getInstance(conf, "nQueens"); 66 job.setJarByClass(NQueens.class); 67 68 job.setInputFormatClass(TextInputFormat.class); 69 job.setOutputFormatClass(TextOutputFormat.class); 70 job.setOutputKeyClass(IntWritable.class); 71 job.setOutputValueClass(Text.class); 72 73 if (n == 1) 74 job.setMapperClass(MapClass.class); 75 else 76 job.setMapperClass(MapClass2.class); 77 job.setReducerClass(Reduce.class); 78 if (args.length > 0) 108 Chapter B. Program Source Code

79 job.setNumReduceTasks(Integer.parseInt(args[0])); 80 else 81 job.setNumReduceTasks(6); 82 83 return job; 84 } 85 86 public static enum counter_enum {ITERATION} 87 88 public static class MapClass extends Mapper { 89 public MapClass() { 90 } 91 92 @Override 93 protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 94 String val = value.toString().trim(); 95 if ("START".equals(val)) { 96 for (int i = 0; i < n; i++) { 97 context.write(new IntWritable(1), new Text(String.valueOf(i))); 98 } 99 return; 100 } 101 102 String[] map = val.split("-"); 103 int[] board = new int[map.length]; 104 for (int i = 0; i < map.length; i++) { 105 board[i] = Integer.parseInt(map[i]); 106 } 107 Random r = new Random(); 108 // Put new queen on new column 109 for (int row = 0; row < n; row++) { 110 if (safe(board, row)) { 111 context.write(new IntWritable(r.nextInt(10)), new Text(val + "-" + row)); 112 } 113 } 114 } 115 116 private boolean safe(int[] board, int row) { 117 int currentColumn = board.length; 118 for (int i = 1; i <= currentColumn; i++) { 119 int preRow = board[currentColumn - i]; 120 if (preRow == row || preRow == row - i || preRow == row + i) 121 return false; 122 } 123 return true; 124 } 125 } 126 127 public static class MapClass2 extends Mapper { 128 public MapClass2() { 129 } 130 131 @Override 132 protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 133 String val = value.toString().trim(); 134 String[] map = val.split("-"); 135 int[] board = new int[n]; 136 int len = map.length; 137 for (int i = 0; i < len; i++) { 138 board[i] = Integer.parseInt(map[i]); 139 } 140 141 Random r = new Random(); 142 // Currently, board is filled with first ’len’ columns 143 // Find the solution for this branch 144 int column = len; // column ’len’ + 1 145 board[column] = -1; 146 while (column >= len) { 147 int row = -1; B.1. Hadoop Java Programs 109

148 do { 149 row = board[column]; 150 board[column] = row = row + 1; 151 } while (row < n && !safe(board, column)); 152 if (row < n) { 153 if (column < n - 1) { 154 board[++column] = -1; 155 } else { // found the board 156 String s = String.valueOf(board[0]); 157 for (int i = 1; i < n; i++) { 158 s += "-" + board[i]; 159 } 160 context.write(new IntWritable(r.nextInt(10)), new Text(s)); 161 } 162 } else { 163 column--; 164 } 165 } 166 } 167 168 private boolean safe(int[] board, int column) { 169 int row = board[column]; 170 for (int i = 1; i <= column; i++) { 171 int preRow = board[column - i]; 172 if (preRow == row || preRow == row - i || preRow == row + i) 173 return false; 174 } 175 return true; 176 } 177 } 178 179 public static class Reduce extends Reducer { 180 @Override 181 protected void reduce(IntWritable key, Iterable values, Context context) throws IOException, InterruptedException { 182 for (Text val : values) { 183 context.write(new Text(""), val); 184 } 185 } 186 } 187 }

B.1.4 Summation

Listing B.4: Sum.java 1 import hadoop.in.action.template.MapClass; 2 import hadoop.in.action.template.Reduce; 3 import org.apache.hadoop.conf.Configuration; 4 import org.apache.hadoop.conf.Configured; 5 import org.apache.hadoop.fs.FSDataOutputStream; 6 import org.apache.hadoop.fs.FileSystem; 7 import org.apache.hadoop.fs.Path; 8 import org.apache.hadoop.io.LongWritable; 9 import org.apache.hadoop.io.Text; 10 import org.apache.hadoop.mapreduce.Job; 11 import org.apache.hadoop.mapreduce.Mapper; 12 import org.apache.hadoop.mapreduce.Reducer; 13 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 14 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 15 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 16 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 17 import org.apache.hadoop.util.Tool; 18 import org.apache.hadoop.util.ToolRunner; 19 20 import java.io.IOException; 21 import java.math.BigInteger; 22 23 public class Sum extends Configured implements Tool { 110 Chapter B. Program Source Code

24 @Override 25 public int run(String[] args) throws Exception { 26 Configuration conf = getConf(); 27 Job job = Job.getInstance(conf,"sum"); 28 job.setJarByClass(Sum.class); 29 30 Path in = new Path(args[0]); 31 Path out = new Path(args[1]); 32 FileInputFormat.setInputPaths(job, in); 33 FileOutputFormat.setOutputPath(job, out); 34 35 job.setInputFormatClass(TextInputFormat.class); 36 job.setOutputFormatClass(TextOutputFormat.class); 37 job.setOutputKeyClass(Text.class); 38 job.setMapOutputValueClass(BigIntegerWritable.class); 39 job.setOutputValueClass(Text.class); 40 41 FileSystem hdfs = FileSystem.get(conf); 42 hdfs.setWriteChecksum(false); 43 for (int i = 0; i < 20; i++){ 44 Path file = new Path(in, "part_" + i); 45 FSDataOutputStream outStream = hdfs.create(file); 46 BigInteger base = new BigInteger("500000000"); 47 String start = base.multiply(BigInteger.valueOf((long)i)).add(BigInteger.ONE). toString(); 48 String end = base.multiply(BigInteger.valueOf((long) (i + 1))).toString(); 49 outStream.writeUTF(start + " " + end); 50 outStream.close(); 51 } 52 53 job.setMapperClass(SumMap.class); 54 job.setReducerClass(SumReduce.class); 55 int code = job.waitForCompletion(true) ? 0 : 1; 56 System.exit(code); 57 return 0; 58 } 59 60 public static void main(String[] args) throws Exception { 61 int res = ToolRunner.run(new Configuration(), new Sum(), args); 62 System.exit(res); 63 } 64 65 private static class SumMap extends Mapper{ 66 Text word = new Text(""); 67 @Override 68 protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { 69 String val = value.toString().trim(); 70 if ("".equals(val)) 71 return; 72 String[] temp = val.split(" "); 73 74 BigInteger start = new BigInteger(temp[0].trim()); 75 BigInteger end = new BigInteger(temp[1].trim()); 76 BigInteger sum = BigInteger.ZERO; 77 while(start.compareTo(end) <= 0){ 78 sum = sum.add(start); 79 start = start.add(BigInteger.ONE); 80 } 81 context.write(word, new BigIntegerWritable(sum)); 82 } 83 } 84 85 private static class SumReduce extends Reducer{ 86 @Override 87 protected void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { 88 BigInteger sum = BigInteger.ZERO; 89 for (BigIntegerWritable value : values){ 90 sum = sum.add(value.getValue()); 91 } B.2. MLoop Programs 111

92 context.write(new Text("Total sum: "), new Text(sum.toString())); 93 } 94 } 95 } B.2 MLoop Programs

B.2.1 Word Count

Listing B.5: wordcount.sml

fun mloop_map (key:string, value:string) =
    let
      val splitter = String.tokens (fn c => Char.isSpace c)
      val words = splitter value
    in
      map (fn word => MapContext.emit (word, "1")) words;
      ()
    end

fun mloop_reduce (key:string) =
    let
      fun fromString str = let
            val x = Int.fromString str
          in
            Option.getOpt (x, 0)
          end
      val sum = ref 0
    in
      while (ReduceContext.nextValue ()) do
        sum := (!sum) + (fromString (ReduceContext.getInputValue ()));
      ReduceContext.emit (key, Int.toString (!sum))
    end

fun mloop_combine (key:string) =
    mloop_reduce (key)

val useCombiner = true

B.2.2 Graph Search

Listing B.6: graph.sml 1 fun split (str,c) = String.tokens (fn ch => ch = c) str 2 3 fun tokens str = String.tokens (fn c => Char.isSpace c) str 4 5 fun join (l,separator) = let 6 fun j ([],sep,result) = result 7 | j ([h],sep,result) = result ˆ h 8 | j (h::t, sep, result) = j (t,sep,result ˆ h ˆ sep) 9 in 10 j(l,separator,"") 11 end 12 13 val maxInt = Option.getOpt(Int.maxInt,1000000000) 14 15 (* info = EDGES|DISTANCE_FROM_SOURCE|COLOR| *) 16 fun unpackInfo info = let 17 val tmp = split (info,#"|") 18 val valid2 = if List.length tmp = 3 then true else raise Fail("Invalid info.") 19 val e::d::c::_ = tmp 20 in 21 (e,d,c) 22 end 23 24 (* str = ID EDGES|DISTANCE_FROM_SOURCE|COLOR| *) 112 Chapter B. Program Source Code

25 fun unpack str = let 26 val tmp = tokens str 27 val valid = if List.length tmp = 2 then true else raise Fail("Invalid key-value.") 28 val k::v::_ = tmp 29 in 30 (k, unpackInfo v) 31 end 32 33 fun pack {id=k:string,edges=e:string list,distance=d:int,color=c:string} = (join (e,",")) ˆ "|" ˆ (Int.toString d) ˆ "|" ˆ c 34 fun newNode (identity,e, dist,c) = {id=identity,edges=e,distance=dist,color=c} 35 36 fun printNode (id, edges, dist, color) = MapContext.emit (id, edges ˆ "|" ˆ dist ˆ "|" ˆ color) 37 fun reduceNode (id, edges, dist, color) = ReduceContext.emit (id, edges ˆ "|" ˆ dist ˆ "|" ˆ color) 38 39 fun mloop_map (key,value) = let 40 val (id,(edges,dist,color)) = unpack value 41 fun increaseDist dist = case (Int.fromString dist) of SOME t => Int.toString(t+1) | NONE => " Integer.MAX_VALUE" 42 fun parseEdges e = if e = "NULL" then [] else split (e,#",") 43 in 44 if color = "GRAY" then ( 45 map (fn k => printNode (k,"NULL",(increaseDist dist),color)) (parseEdges edges); 46 printNode (id, edges,dist,"BLACK") 47 ) 48 else 49 printNode (id,edges,dist,color) 50 end 51 52 fun mloop_reduce (key:string) = let 53 val dist = ref "Integer.MAX_VALUE" 54 val colour = ref "WHITE" 55 val edge = ref "NULL" 56 fun colorInt color = case color of "WHITE" => 0 | "GRAY" => 1 | "BLACK" => 2 | _ => ˜1 57 fun maxColor (c1, c2) = let 58 val v1 = colorInt c1 59 val v2 = colorInt c2 60 in 61 if v1 < v2 then c2 else c1 62 end 63 fun minDist (d1, d2) = let 64 val v1 = case (Int.fromString d1) of SOME t => t | NONE => maxInt 65 val v2 = case (Int.fromString d2) of SOME t => t | NONE => maxInt 66 in 67 if v1 < v2 then d1 else d2 68 end 69 fun update (edges:string,distance:string,color:string) = (if not (edges = "NULL") then edge:= edges else (); 70 dist:= minDist (!dist, distance); 71 colour := maxColor(!colour,color)) 72 in 73 while (ReduceContext.nextValue()) do 74 update (unpackInfo (ReduceContext.getInputValue())); 75 reduceNode (key,!edge,!dist,!colour) 76 end B.2.3 N-Queens

Listing B.7: queen.sml 1 fun threat (x,y) (x’,y’) = let 2 val distance = x’ - x 3 in 4 y = y’ orelse y = y’ - distance orelse y = y’ + distance 5 end 6 7 fun conflict pos = List.exists (threat pos) 8 9 fun randInt ()= let 10 val t = Time.now() B.2. MLoop Programs 113

11 val b = Time.toMilliseconds t 12 val bucket = b mod 10 13 in 14 IntInf.toString bucket 15 end 16 17 fun codeBoard ([],str) = str 18 | codeBoard ((_,row)::t, "") = codeBoard(t, Int.toString row) 19 | codeBoard ((_,row)::t, str) = codeBoard(t, (Int.toString row) ˆ "-" ˆ str) 20 21 fun addQueen (qs,n) = let 22 val col = 1 + List.length qs 23 fun put row = if row > n then () 24 else if conflict (col,row) qs then put (row + 1) 25 else (MapContext.emit(randInt (), codeBoard((col,row)::qs, "")); put (row + 1)) 26 in 27 put 1 28 end 29 30 fun append (row,nil) = [(1,row)] 31 | append (row, (col,x)::t) = (col + 1, row) :: (col,x)::t 32 33 fun parseBoard value = let 34 val splitter = String.tokens(fn c => c = #"-") 35 fun toInt s = Option.getOpt (Int.fromString s, 0) 36 val values = map toInt (splitter value) 37 in foldl append nil values 38 end 39 40 fun mloop_map (key,value) = if value = "START" then addQueen ([],17) 41 else addQueen(parseBoard value,17) 42 43 fun mloop_reduce (key:string)= 44 while (ReduceContext.nextValue()) do 45 ReduceContext.emit("", ReduceContext.getInputValue())

Listing B.8: queen2.sml 1 fun threat (x,y) (x’,y’) = let 2 val distance = x’ - x 3 in 4 y = y’ orelse y = y’ - distance orelse y = y’ + distance 5 end 6 7 fun conflict pos = List.exists (threat pos) 8 9 fun randInt ()= let 10 val t = Time.now() 11 val b = Time.toMilliseconds t 12 val bucket = b mod 10 13 in 14 IntInf.toString bucket 15 end 16 17 fun codeBoard ([],str) = str 18 | codeBoard ((_,row)::t, "") = codeBoard(t, Int.toString row) 19 | codeBoard ((_,row)::t, str) = codeBoard(t, (Int.toString row) ˆ "-" ˆ str) 20 21 22 fun fillQueen (qs,n,col,fc) = let 23 fun put row = if row > n then fc () 24 else if conflict (col,row) qs then put (row + 1) 25 else if col = n then (MapContext.emit(randInt (), codeBoard((col,row)::qs, "")); put (row + 1)) 26 else fillQueen((col,row)::qs,n,col+1, fn () => put (row+1)) 27 in 28 put 1 29 end 30 31 fun append (row,nil) = [(1,row)] 32 | append (row, (col,x)::t) = (col + 1, row) :: (col,x)::t 33 114 Chapter B. Program Source Code

34 fun parseBoard value = let 35 val splitter = String.tokens(fn c => c = #"-") 36 fun toInt s = Option.getOpt (Int.fromString s, 0) 37 val values = map toInt (splitter value) 38 in foldl append nil values 39 end 40 41 fun mloop_reduce (key:string)= 42 while (ReduceContext.nextValue()) do 43 ReduceContext.emit("", ReduceContext.getInputValue()) 44 45 46 fun mloop_map (key,value) = let 47 val board = parseBoard value 48 val col = 1 + List.length board 49 in 50 fillQueen(board, 17, col, fn () => ()) 51 end B.2.4 Summation

Listing B.9: sum.sml

fun mloop_map (key:string, value:string) =
    let
      val splitter = String.tokens (fn c => Char.isSpace c)
      val words = splitter value
      fun toInt number = let
            val x = IntInf.fromString number
          in
            Option.getOpt (x, 0)
          end
      fun sum (from:IntInf.int, to:IntInf.int, result:IntInf.int) =
          if from > to then result
          else sum (from + 1, to, result + from);
      val v = sum (toInt (List.nth (words, 0)), toInt (List.nth (words, 1)), 0);
    in
      MapContext.emit ("", IntInf.toString v)
    end

fun mloop_reduce (key:string) =
    let
      fun fromString str = let
            val x = IntInf.fromString str
          in
            Option.getOpt (x, 0)
          end
      val sum = ref 0 : IntInf.int ref
    in
      while (ReduceContext.nextValue ()) do
        sum := (!sum) + (fromString (ReduceContext.getInputValue ()));
      ReduceContext.emit (key, IntInf.toString (!sum))
    end

fun mloop_combine (key:string) =
    mloop_reduce (key)

val useCombiner = true

B.3 MPI Programs

Here we only include the primary class for each problem. The full source code can be found on the attached CD.

B.3.1 Word Count

Listing B.10: wordcount.cpp

1 #include "wordcount.h" 2 #include 3 #include 4 #include "mpi.h" 5 #include 6 #include 7 #include 8 9 using namespace std; 10 11 #define MIN(A,B) ((A) < (B)) ? (A) : (B) 12 #define BUFF_SIZE 256000000 // 256MB of data 13 14 int WordCount::count(int argc, char* argv[]) { 15 typedef struct { 16 int count; 17 char word[30]; 18 } WordStruct; 19 int blocks[2] = {1, 30}; 20 MPI_Datatype types[2] = {MPI_INT, MPI_CHAR}; 21 MPI_Aint displacements[2]; 22 MPI_Datatype obj_type; 23 MPI_Aint charex, intex; 24 25 int nTasks, rank; 26 MPI::Init(argc, argv); 27 nTasks = MPI::COMM_WORLD.Get_size(); 28 rank = MPI::COMM_WORLD.Get_rank(); 29 30 MPI_Type_extent(MPI_INT, &intex); 31 MPI_Type_extent(MPI_CHAR, &charex); 32 displacements[0] = static_cast (0); 33 displacements[1] = intex; 34 MPI_Type_struct(2, blocks, displacements, types, &obj_type); 35 MPI_Type_commit(&obj_type); 36 37 // When creating out buffer, we use the new operator to avoid stack overflow. 38 // If you are programming in C instead of C++, you’ll want to use malloc. 39 // Please note that this is not the most efficient way to use memory. 40 // We are sacrificing memory now in order to gain more performance later. 41 long long int *pLineStartIndex; // keep track of which lines start where. Mainly for debugging 42 char *pszFileBuffer; 43 int nTotalLines = 0; 44 if (argc >= 3) 45 buff_size = atoll(argv[2]); 46 else 47 buff_size = BUFF_SIZE; 48 49 FILE* file; 50 size_t total_size; 51 if (rank == 0) { 52 file = fopen("input.txt", "r+"); 53 struct stat filestatus; 54 stat("input.txt", &filestatus); 55 total_size = filestatus.st_size; 56 } 57 map totalwordcount; 58 while (1) { 59 int cont = 1; 60 if (rank == 0) { 61 pLineStartIndex = new long long int[buff_size / 10]; // assume number of line is equal to number of character / 10 62 pLineStartIndex[0] = 0; 63 pszFileBuffer = readFile(file, total_size); 64 if (pszFileBuffer == NULL) 65 cont = 0; 66 } 67 nTotalLines = 0; 68 MPI::COMM_WORLD.Bcast(&cont, 1, MPI::INT, 0); 116 Chapter B. Program Source Code

69 if (cont == 0) 70 break; 71 if (rank == 0) { 72 char* ptr = strchr(pszFileBuffer, 10); // find a new line character 73 while (ptr != NULL) { 74 nTotalLines++; 75 pLineStartIndex[nTotalLines] = (ptr - pszFileBuffer + 1); 76 ptr = strchr(ptr + 1, 10); 77 } 78 pLineStartIndex[nTotalLines + 1] = strlen(pszFileBuffer); 79 } 80 81 // Thread zero needs to distribute data to other threads. 82 // Because this is a relatively large amount of data, we SHOULD NOT send the entire dataset to all threads. 83 // Instead, it’s best to intelligently break up the data, and only send relevant portions to each thread. 84 // Data communication is an expensive resource, and we have to minimize it at all costs. 85 char *buffer = NULL; 86 int totalChars = 0; 87 int portion = 0; 88 int startNum = 0; 89 int endNum = 0; 90 91 92 if (rank == 0) { 93 portion = nTotalLines / nTasks; 94 startNum = 0; 95 endNum = portion; 96 buffer = new char[pLineStartIndex[endNum] + 1]; 97 strncpy(buffer, pszFileBuffer, pLineStartIndex[endNum]); 98 buffer[pLineStartIndex[endNum]] = ’\0’; 99 for (int i = 1; i < nTasks; i++) { 100 // calculate the data for each thread. 101 int curStartNum = i * portion; 102 int curEndNum = (i + 1) * portion - 1; 103 if (i == nTasks - 1) { 104 curEndNum = nTotalLines; 105 } 106 if (curStartNum < 0) { 107 curStartNum = 0; 108 } 109 110 // we need to send a thread the number of characters it will be receiving. 111 int curLength = pLineStartIndex[curEndNum + 1] - pLineStartIndex[curStartNum]; 112 MPI_Send(&curLength, 1, MPI_INT, i, 1, MPI_COMM_WORLD); 113 if (curLength > 0) 114 MPI_Send(pszFileBuffer + pLineStartIndex[curStartNum], curLength, MPI_CHAR, i , 2, MPI_COMM_WORLD); 115 } 116 delete []pszFileBuffer; 117 delete []pLineStartIndex; 118 } else { 119 // We are not the thread that read the file. 120 // We need to receive data from whichever thread 121 MPI_Status status; 122 MPI_Recv(&totalChars, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status); 123 if (totalChars > 0) { 124 buffer = new char[totalChars + 1]; 125 MPI_Recv(buffer, totalChars, MPI_CHAR, 0, 2, MPI_COMM_WORLD, &status); 126 buffer[totalChars] = ’\0’; 127 } 128 } 129 130 // Do the search 131 map wordcount; 132 int size = 0; 133 WordStruct* words = NULL; 134 if (buffer != NULL) { 135 char* word = strtok(buffer, " ,.\r\n"); 136 while (word != NULL) { B.3. MPI Programs 117

137 if (wordcount.find(word) == wordcount.end()) 138 wordcount[word] = 1; 139 else 140 wordcount[word]++; 141 word = strtok(NULL, " ,.\r\n"); 142 } 143 delete []buffer; 144 145 size = wordcount.size(); 146 if (size > 0) { 147 words = new WordStruct[size]; 148 int i = 0; 149 for (map::iterator it = wordcount.begin(); it != wordcount.end(); it ++) { 150 151 strcpy(words[i].word, (it->first).c_str()); 152 words[i].count = it->second; 153 i++; 154 } 155 } 156 } 157 158 // At this point, all threads need to communicate their results to thread 0. 159 if (rank == 0) { 160 // The master thread will need to receive all computations from all other threads. 161 MPI_Status status; 162 163 // We need to go and receive the data from all other threads. 164 // aggregate the result for current chunk of data 165 for (int i = 1; i < nTasks; i++) { 166 int sz; 167 MPI_Recv(&sz, 1, MPI_INT, i, 3, MPI_COMM_WORLD, &status); 168 if (sz > 0) { 169 WordStruct* local_words = new WordStruct[sz]; 170 MPI_Recv(local_words, sz, obj_type, i, 4, MPI_COMM_WORLD, &status); 171 172 for (int j = 0; j < sz; j++) { 173 totalwordcount[local_words[j].word] += local_words[j].count; 174 } 175 delete []local_words; 176 } 177 } 178 179 for (map::iterator it = wordcount.begin(); it != wordcount.end(); it++) { 180 totalwordcount[it->first] += it->second; 181 } 182 } else { 183 // We are finished with the results in this thread, and need to send the data to thread 1. 184 // The destination is thread 0 185 MPI_Send(&size, 1, MPI_INT, 0, 3, MPI_COMM_WORLD); 186 if (size > 0) { 187 MPI_Send(words, size, obj_type, 0, 4, MPI_COMM_WORLD); 188 } 189 } 190 wordcount.clear(); 191 if (words != NULL && size > 0) 192 delete []words; 193 } 194 if (rank == 0) { 195 fclose(file); 196 // Display the final calculated value 197 ofstream out("output.txt"); 198 for (map::iterator it = totalwordcount.begin(); it != totalwordcount.end(); it++) { 199 out << it->first << ": " << it->second << "\n"; 200 } 201 totalwordcount.clear(); 202 out.close(); 203 } 118 Chapter B. Program Source Code

204 MPI_Finalize(); 205 return 0; 206 } 207 208 void WordCount::print(char* str, int start, int len) { 209 cout << "@"; 210 for (int i = 0; i < len; i++) { 211 if (str[start + i] == ’\r’) 212 cout << "EOL"; 213 else if (str[start + i] == ’\n’) 214 cout << "NL"; 215 else 216 cout << str[start + i]; 217 } 218 cout << "@" << endl; 219 } 220 221 char* WordCount::readFile(FILE* file, size_t fileSize) { 222 long long readsize = MIN(buff_size, fileSize - lastPosition); 223 if (readsize <= 0) 224 return NULL; 225 char* str = new char[readsize + 1]; 226 fseek(file, lastPosition, SEEK_SET); // read file from last position 227 fread(str, 1, readsize, file); 228 229 // trim the end of string: remove part of string after the last space 230 long start = 0; 231 if (readsize > 50) 232 start = readsize - 50; 233 // check from last 50 characters 234 char* ptr = strchr(&str[start], ’ ’); 235 236 if (ptr == NULL || readsize < buff_size) { // return entire string 237 lastPosition = fileSize; 238 str[readsize] = ’\0’; 239 return str; 240 } 241 char* pre = NULL; 242 while (ptr != NULL) { 243 pre = ptr; 244 ptr = strchr(ptr + 1, ’ ’); 245 } 246 247 int rSize = pre - str + 1; 248 if (pre != NULL) { 249 *pre = ’\0’; 250 } 251 252 // update current read position 253 lastPosition += rSize; 254 return str; 255 } B.3.2 Graph Search The Graph Search problem is solved in class BFS in bfs.cpp.

Listing B.11: bfs.cpp 1 #include 2 #include "bfs.hpp" 3 #include "LineReader.hpp" 4 #include 5 #include 6 #include 7 #include 8 #include 9 #include 10 #include 11 12 #include B.3. MPI Programs 119

13 using namespace std; 14 15 #define LENG 0 16 #define DATA 1 17 #define RESPONSE 2 18 19 const int LOW = 1; 20 const int HIGH = 2; 21 const int NO_DATA = 3; 22 const int HAVE_DATA = 4; 23 const int MORE_DATA = 5; 24 25 vector split(const char* strChar, char delimiter) { 26 vector internal; 27 stringstream ss(strChar); // Turn the string into a stream. 28 string tok; 29 30 if (delimiter == ’ ’ || delimiter == ’\t’) { 31 while (ss) { 32 ss >> tok; 33 internal.push_back(tok); 34 } 35 } else { 36 while (getline(ss, tok, delimiter)) { 37 internal.push_back(tok); 38 } 39 } 40 return internal; 41 } 42 43 int hashKey(string data, int n) { 44 hash str_hash; 45 size_t value = str_hash(data); 46 return value % n; 47 } 48 49 void storeNewNode(int size, string key, string data, vector* &otherdata) { 50 int id = hashKey(key, size); 51 otherdata[id].push_back(data); 52 } 53 54 vector parseNode(const char* str) { 55 vector result; 56 vector internal = split(str, ’ ’); 57 result.push_back(internal[0]); 58 vector edges = split(internal[1].c_str(), ’|’); 59 result.insert(result.end(), edges.begin(), edges.end()); 60 return result; 61 } 62 63 void createNode(string& result, string id, string nodes, string dist, string color) { 64 result.clear(); 65 result.append(id); 66 result.append("\t"); 67 result.append(nodes); 68 result.append("|"); 69 result.append(dist); 70 result.append("|"); 71 result.append(color); 72 result.append("|"); 73 } 74 75 void increaseDist(string& output, string distance) { 76 if (distance.compare("Integer.MAX_VALUE") == 0) { 77 output = distance; 78 return; 79 } 80 81 int value = stoi(distance); 82 output = to_string(value + 1); 83 } 120 Chapter B. Program Source Code

84 85 void minDist(string& output, string dist1, string dist2) { 86 if (dist1 == "Integer.MAX_VALUE") { 87 output = dist2; 88 return; 89 } else if (dist2 == "Integer.MAX_VALUE") { 90 output = dist1; 91 return; 92 } 93 94 int value1 = stoi(dist1); 95 int value2 = stoi(dist2); 96 output = value1 < value2 ? dist1 : dist2; 97 } 98 99 int encodeColor(string color) { 100 if (color == "WHITE") 101 return 0; 102 if (color == "GRAY") 103 return 1; 104 if (color == "BLACK") 105 return 2; 106 return -1; 107 } 108 109 void maxColor(string& output, string color1, string color2) { 110 int c1 = encodeColor(color1); 111 int c2 = encodeColor(color2); 112 output = c1 < c2 ? color2 : color1; 113 } 114 115 void swap(string& s1, string& s2) { 116 string temp; 117 temp = s1; 118 s1 = s2; 119 s2 = temp; 120 } 121 122 void siftDown(vector& data, int start, int n) { 123 while (start * 2 + 1 < n) { 124 // children are 2*i + 1 and 2*i + 2 125 int child = start * 2 + 1; 126 // get bigger child 127 if (child + 1 < n && data[child] < data[child + 1]) child++; 128 129 if (data[start] < data[child]) { 130 swap(data[start], data[child]); 131 start = child; 132 } else 133 return; 134 } 135 } 136 137 void sort(vector &data) { // heap-sort 138 int size = data.size(); 139 if (size == 0) 140 return; 141 for (int k = size / 2; k >= 0; k--) { 142 siftDown(data, k, size); 143 } 144 145 while (size > 1) { 146 swap(data[size - 1], data[0]); 147 siftDown(data, 0, size - 1); 148 size--; 149 } 150 } 151 152 void sendData(int target, vector &data, int count) { 153 MPI_Send(&count, 1, MPI_INT, target, LENG, MPI_COMM_WORLD); 154 for (int i = 0; i < count; i++) { B.3. MPI Programs 121

155 int data_len = data[i].length(); 156 MPI_Send(&data_len, 1, MPI_INT, target, LENG, MPI_COMM_WORLD); 157 MPI_Send(data[i].c_str(), data_len, MPI_CHAR, target, DATA, MPI_COMM_WORLD); 158 } 159 } 160 161 void receive_resp(int from, vector &output) { 162 int len; 163 MPI_Recv(&len, 1, MPI_INT, from, LENG, MPI_COMM_WORLD, MPI_STATUS_IGNORE); 164 165 for (int i = 0; i < len; i++) { 166 int data_len; 167 MPI_Recv(&data_len, 1, MPI_INT, from, LENG, MPI_COMM_WORLD, MPI_STATUS_IGNORE); 168 char* rdata = new char[data_len + 1]; 169 rdata[data_len] = ’\0’; 170 MPI_Recv(rdata, data_len, MPI_CHAR, from, DATA, MPI_COMM_WORLD, MPI_STATUS_IGNORE); 171 string d(rdata); 172 output.push_back(d); 173 174 delete []rdata; 175 } 176 } 177 178 void reduce(vector &input, vector &output) { 179 string previous("NULL"); 180 string edges("NULL"); 181 string distance("Integer.MAX_VALUE"); 182 string color("WHITE"); 183 184 sort(input); 185 for (int k = 0; k < input.size(); k++) { 186 vector node = parseNode(input[k].c_str()); 187 string node_id = node[0]; 188 if (node_id.compare(previous) != 0) { 189 if (k > 0) { 190 string nnode; 191 createNode(nnode, previous, edges, distance, color); 192 output.push_back(nnode); 193 } 194 previous = node_id; 195 edges = "NULL"; 196 distance = "Integer.MAX_VALUE"; 197 color = "WHITE"; 198 } 199 200 if (node[1].compare("NULL") != 0) { 201 edges = node[1]; 202 } 203 minDist(distance, distance, node[2]); 204 maxColor(color, color, node[3]); 205 } 206 207 if (input.size() > 0) { 208 string nnode; 209 createNode(nnode, previous, edges, distance, color); 210 output.push_back(nnode); 211 } 212 } 213 214 int BFS::run(int argc, char* argv[]) { 215 int size, rank; 216 MPI::Init(argc, argv); 217 size = MPI::COMM_WORLD.Get_size(); 218 rank = MPI::COMM_WORLD.Get_rank(); 219 220 const int FINISH = -1; 221 222 // Read data and distribute 223 vector input; 224 if (rank == 0) { 225 FILE* file; 122 Chapter B. Program Source Code

226 size_t total_size; 227 const char* name = "graph.txt"; 228 LineReader* reader; 229 230 if (argc >= 3) 231 name = argv[2]; 232 233 file = fopen(name, "r+"); 234 235 struct stat filestatus; 236 stat(name, &filestatus); 237 total_size = filestatus.st_size; 238 239 reader = new LineReader(file, total_size); 240 char* line = NULL; 241 reader->readLine(line); 242 243 int id = 0; 244 while (line != NULL) { 245 if (id == 0) { 246 string node(line); 247 input.push_back(node); 248 } 249 else { // send this data to worker process 250 int len = strlen(line); 251 MPI_Send(&len, 1, MPI_INT, id, LENG, MPI_COMM_WORLD); 252 MPI_Send(line, strlen(line), MPI_CHAR, id, DATA, MPI_COMM_WORLD); 253 } 254 255 delete []line; 256 reader->readLine(line); 257 258 id = (id + 1) % size; 259 } 260 for (int i = 1; i < size; i++) { 261 MPI_Send((void *) &FINISH, 1, MPI_INT, i, LENG, MPI_COMM_WORLD); 262 } 263 264 fclose(file); 265 reader->close(); 266 delete reader; 267 } else { // receive data from server 268 int len; 269 while (true){ 270 MPI_Recv(&len, 1, MPI_INT, 0, LENG, MPI_COMM_WORLD, MPI_STATUSES_IGNORE); 271 if (len == FINISH) break; 272 273 char* line = new char[len + 1]; 274 line[len] = ’\0’; 275 MPI_Recv(line, len, MPI_CHAR, 0, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE); 276 string node(line); 277 input.push_back(node); 278 delete []line; 279 } 280 } 281 282 // Process: map data, sort, exchange, merge 283 bool newLoop = true; 284 int m = 0; 285 vector* other_data_tmp = new vector[size]; 286 vector* other_data = new vector[size]; 287 while (newLoop) { 288 m++; 289 if (rank == 0) 290 printf("Iteration %d\n", m); 291 newLoop = false; 292 // Step 1: map data 293 for (int i = 0; i < size - 1; i++) 294 other_data[i].clear(); 295 int pack = 0; 296 for (int i = 0; i < input.size(); i++) { B.3. MPI Programs 123

297 vector node = parseNode(input[i].c_str()); 298 if (node[3].compare("GRAY") == 0) { 299 vector edges = split(node[1].c_str(), ’,’); 300 for (int i = 0; i < edges.size(); i++) { 301 string new_node; 302 string newDist; 303 increaseDist(newDist, node[2]); 304 createNode(new_node, edges[i], "NULL", newDist, "GRAY"); 305 storeNewNode(size, edges[i], new_node, other_data_tmp); 306 } 307 string oldNode; 308 createNode(oldNode, node[0], node[1], node[2], "BLACK"); 309 storeNewNode(size, node[0], oldNode, other_data_tmp); 310 pack += edges.size(); 311 } else 312 storeNewNode(size, node[0], input[i], other_data_tmp); 313 pack++; 314 if (pack >= 1500000) { 315 pack = 0; 316 for (int i = 0; i < size; i++) { 317 other_data_tmp[i].insert(other_data_tmp[i].end(), other_data[i].begin(), other_data[i].end()); 318 other_data[i].clear(); 319 reduce(other_data_tmp[i], other_data[i]); 320 other_data_tmp[i].clear(); 321 } 322 } 323 } 324 325 for (int i = 0; i < size; i++) { 326 reduce(other_data_tmp[i], other_data[i]); 327 other_data_tmp[i].clear(); 328 } 329 330 // for each phase, 1 process sends, others receive. 331 for (int phase = 0; phase < size; phase++) { 332 if (phase == rank) {// send 333 for (int i = 0; i < size; i++) { 334 if (i == rank) 335 continue; 336 sendData(i, other_data[i], other_data[i].size()); 337 other_data[i].clear(); 338 } 339 } else {// receive 340 receive_resp(phase, other_data[rank]); 341 } 342 } 343 344 // Step 3: Sort data 345 sort(other_data[rank]); 346 347 // reduce phase: find the smallest distance 348 input.clear(); 349 350 string previous("NULL"); 351 string edges("NULL"); 352 string distance("Integer.MAX_VALUE"); 353 string color("WHITE"); 354 bool hasGray = false; 355 for (int k = 0; k < other_data[rank].size(); k++) { 356 vector node = parseNode(other_data[rank][k].c_str()); 357 string node_id = node[0]; 358 359 if (node_id.compare(previous) != 0) { 360 if (k > 0) { 361 string nnode; 362 createNode(nnode, previous, edges, distance, color); 363 input.push_back(nnode); 364 365 if (color.compare("GRAY") == 0) 366 hasGray = true; 124 Chapter B. Program Source Code

367 } 368 previous = node_id; 369 edges = "NULL"; 370 distance = "Integer.MAX_VALUE"; 371 color = "WHITE"; 372 } 373 374 if (node[1].compare("NULL") != 0) { 375 edges = node[1]; 376 } 377 minDist(distance, distance, node[2]); 378 maxColor(color, color, node[3]); 379 } 380 if (other_data[rank].size() > 0) { 381 string nnode; 382 createNode(nnode, previous, edges, distance, color); 383 input.push_back(nnode); 384 } 385 386 other_data[rank].clear(); 387 388 if (color.compare("GRAY") == 0) 389 hasGray = true; 390 391 MPI_Allreduce(&hasGray, &newLoop, 1, MPI_BYTE, MPI_BOR, MPI_COMM_WORLD); 392 } 393 394 string out_name("graph"); 395 out_name.append(to_string(rank)); 396 out_name.append(".txt"); 397 ofstream out(out_name); 398 399 400 for (int k = 0; k < input.size(); k++) { 401 out << input[k] << "\n"; 402 } 403 404 out.close(); 405 input.clear(); 406 delete []other_data_tmp; 407 delete []other_data; 408 MPI::Finalize(); 409 return 0; 410 }

B.3.3 N-Queens

Listing B.12: nQueen.cpp 1 #include 2 #include "nQueen.hpp" 3 #include 4 #include 5 #include 6 7 #define DATA 1 8 #define BOARD_SIZE 14 9 10 const int REQUEST = -1, NO_MORE_WORK = -1; 11 const int NEW = -2; 12 13 using namespace std; 14 15 typedef struct { 16 int size; 17 int* board; 18 } iboard; 19 20 void push(queue& q, int len, int* board) { B.3. MPI Programs 125

21 iboard b; 22 b.size = len; 23 b.board = board; 24 q.push(b); 25 } 26 27 bool NQueens::safe(int len, int board[], int row) { 28 int currentColumn = len; 29 for (int i = 1; i <= currentColumn; i++) { 30 int preRow = board[currentColumn - i]; 31 if (preRow == row || preRow == row - i || preRow == row + i) 32 return false; 33 } 34 return true; 35 } 36 37 bool NQueens::safeAtColumn(int* board, int column) { 38 int row = board[column]; 39 for (int i = 1; i <= column; i++) { 40 int preRow = board[column - i]; 41 if (preRow == row || preRow == row - i || preRow == row + i) 42 return false; 43 } 44 return true; 45 46 } 47 /** 48 * Master piece 49 */ 50 void NQueens::nqueen_master(int nWorker, long& total) { 51 int msg, workerid, size; 52 int* board; 53 MPI_Status status; 54 queue workQueue; 55 queue freeWorker; 56 push(workQueue, 0, NULL); 57 long unsigned int remaining; 58 bool listen = true; 59 60 //breadth first search for first 5 columns 61 while (listen) { 62 MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, DATA, MPI_COMM_WORLD, &status); 63 workerid = status.MPI_SOURCE; 64 65 if (msg == REQUEST) { 66 remaining = workQueue.size(); 67 if (remaining > 0) { 68 iboard b = workQueue.front(); 69 MPI_Send((void*) &NEW, 1, MPI_INT, workerid, DATA, MPI_COMM_WORLD); 70 MPI_Send(&b.size, 1, MPI_INT, workerid, DATA, MPI_COMM_WORLD); 71 if (b.size > 0) 72 MPI_Send(b.board, b.size, MPI_INT, workerid, DATA, MPI_COMM_WORLD); 73 workQueue.pop(); 74 delete []b.board; 75 } else { 76 // store free worker, then assign work for it later 77 freeWorker.push(workerid); 78 if (freeWorker.size() == nWorker) // all workers are free 79 { 80 listen = false; 81 } 82 } 83 84 } else if (msg == NEW) { 85 MPI_Recv(&size, 1, MPI_INT, workerid, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE); 86 board = new int[size]; 87 MPI_Recv(board, size, MPI_INT, workerid, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE); 88 if (size < board_size) { 89 // send this new work to free worker 90 if (freeWorker.size() > 0) { 91 int id = freeWorker.front(); 126 Chapter B. Program Source Code

92 MPI_Send((void*) &NEW, 1, MPI_INT, id, DATA, MPI_COMM_WORLD); 93 MPI_Send(&size, 1, MPI_INT, id, DATA, MPI_COMM_WORLD); 94 MPI_Send(board, size, MPI_INT, id, DATA, MPI_COMM_WORLD); 95 delete []board; 96 freeWorker.pop(); 97 } else { // store work and assign when requested 98 iboard b; 99 b.size = size; 100 b.board = board; 101 workQueue.push(b); 102 } 103 } 104 } 105 } 106 107 for (int i = 1; i <= nWorker; i++) { 108 // stop signal to worker 109 MPI_Send((void *) &NO_MORE_WORK, 1, MPI_INT, i, DATA, MPI_COMM_WORLD); 110 } 111 112 long count = 0; 113 MPI_Reduce(&count, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD); 114 } 115 116 void NQueens::nqueen_worker(int rank, long& total) { 117 int msg, size; 118 long count = 0; 119 int* board; 120 string name("nqueen"); 121 name.append(to_string(rank)); 122 name.append(".txt"); 123 ofstream out(name); 124 double t = 0.0; 125 while (true){ 126 // send work request 127 MPI_Send((void*) &REQUEST, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD); 128 // receive work 129 MPI_Recv(&msg, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE); 130 if (msg == NO_MORE_WORK) { 131 break; 132 } 133 134 int newSize; 135 MPI_Recv(&size, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE); 136 if (size < 5) 137 newSize = size + 1; 138 else newSize = board_size; 139 board = new int[newSize]; 140 if (size > 0) 141 MPI_Recv(board, size, MPI_INT, 0, DATA, MPI_COMM_WORLD, MPI_STATUSES_IGNORE); 142 143 //add a new queen, breadth first search 144 if (size < 5) { 145 for (int row = 0; row < board_size; row++) { 146 if (safe(size, board, row)) { 147 board[size] = row; 148 if (newSize == board_size) { // found complete solution 149 if (store > 0) { 150 for (int i = 0; i < newSize - 1; i++) { 151 out << board[i] << "-"; 152 } 153 out << board[size] << "\n"; 154 } 155 count++; 156 } else { 157 MPI_Send((void*) &NEW, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD); 158 MPI_Send(&newSize, 1, MPI_INT, 0, DATA, MPI_COMM_WORLD); 159 MPI_Send(board, newSize, MPI_INT, 0, DATA, MPI_COMM_WORLD); 160 } 161 } 162 } B.3. MPI Programs 127

163 } else {// depth first search: find the complete board 164 int column = size; // column ’size’ + 1 165 board[column] = -1; 166 while (column >= size) { 167 int row = -1; 168 do { 169 row = board[column] = board[column] + 1; 170 } while (row < board_size && !safeAtColumn(board, column)); 171 if (row < board_size) { 172 if (column < board_size - 1) { 173 board[++column] = -1; 174 } else { // found the board 175 if (store > 0) { 176 for (int i = 0; i < newSize - 1; i++) { 177 out << board[i] << "-"; 178 } 179 out << board[newSize - 1] << "\n"; 180 } 181 count++; 182 } 183 } else { 184 column--; 185 } 186 } 187 188 } 189 delete []board; 190 } 191 out.close(); 192 } 193 194 int NQueens::run(int argc, char* argv[]) { 195 board_size = BOARD_SIZE; 196 store = 1; // store data 197 int size, rank; 198 long total; 199 MPI::Init(argc, argv); 200 size = MPI::COMM_WORLD.Get_size(); 201 rank = MPI::COMM_WORLD.Get_rank(); 202 203 if (argc >= 3) 204 board_size = atol(argv[2]); 205 if (argc == 4) 206 store = atol(argv[3]); 207 208 if (rank == 0) { 209 nqueen_master(size - 1, total); 210 } else { 211 nqueen_worker(rank, total); 212 } 213 214 MPI::Finalize(); 215 return 0; 216 } B.3.4 Summation

Listing B.13: sum.cpp 1 #include "sum.h" 2 #include 3 #include "mpi.h" 4 #include 5 #include 6 7 using namespace std; 8 9 10 int Sum::sum(int argc, char* argv[]) { 11 int rank, size; 12 128 Chapter B. Program Source Code

13 MPI::Init(argc, argv); 14 size = MPI::COMM_WORLD.Get_size(); 15 rank = MPI::COMM_WORLD.Get_rank(); 16 17 int startTime = MPI::Wtime(); 18 19 mpz_t max; 20 if (argc == 3) { 21 mpz_init_set_str(max, argv[2], 10); 22 } else { 23 mpz_init_set_str(max, "1000", 10); 24 } 25 26 mpz_t nTasks, pSize, start, end, sum, m_rank; 27 mpz_init_set_ui(nTasks, size); // nTasks = size; 28 mpz_init(end); 29 30 mpz_init(pSize); 31 mpz_cdiv_q(pSize, max, nTasks); // pSize = max / nTasks 32 33 mpz_init(start); 34 mpz_mul_si(start, pSize, rank); // start = pSize * rank 35 36 if (rank == size - 1) { 37 mpz_set(end, max); 38 } else { 39 mpz_mul_si(end, pSize, rank + 1); // end = pSize * (rank + 1) 40 mpz_sub_ui(end, end, 1); // end = end + 1; 41 } 42 43 44 mpz_init(sum); 45 mpz_t i; 46 mpz_init_set(i, start); 47 while (mpz_cmp(i, end) <= 0) { 48 mpz_add(sum, sum, i); // sum = sum + i 49 mpz_add_ui(i, i, 1); 50 } 51 52 53 // At this point, all threads need to communicate their results to thread 0. 54 if (rank == 0) { 55 // The master thread will need to receive all computations from all other threads. 56 MPI_Status status; 57 58 for (int i = 1; i < size; i++) { 59 int sz; 60 MPI_Recv(&sz, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &status); 61 char* value = new char[sz]; 62 MPI_Recv(value, sz + 1, MPI_CHAR, i, 2, MPI_COMM_WORLD, &status); 63 mpz_t pSum; 64 mpz_init_set_str(pSum, value, 10); 65 66 mpz_add(sum, sum, pSum); // sum += pSum 67 } 68 69 int endTime = MPI::Wtime(); 70 char* num = new char[mpz_sizeinbase(sum, 10) + 1]; 71 mpz_get_str(num, 10, sum); 72 cout << num << endl; 73 cout << "Time: " << endTime - startTime << " seconds"; 74 } else { 75 // The destination is thread 0. 76 char* t = new char[mpz_sizeinbase(sum, 10) + 1]; 77 mpz_get_str(t, 10, sum); 78 int len = strlen(t); 79 MPI_Send(&len, 1, MPI_INT, 0, 1, MPI_COMM_WORLD); 80 MPI_Send(t, len + 1, MPI_CHAR, 0, 2, MPI_COMM_WORLD); 81 } 82 83 MPI::Finalize(); B.3. MPI Programs 129

84 return 0; 85 }