Development of Application Programs Oriented to Agent-Based Data Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Development of Application Programs Oriented to Agent-Based Data Analysis Chang Liu A whitepaper submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science & Software Engineering University of Washington 2020 Project Committee: Munehiro Fukuda, Committee Chair Robert Dimpsey, Committee Member Michael Stiber, Committee Member Program Authorized to Offer Degree: Computer Science & Software Engineering ©Copyright 2020 Chang Liu Contents 1. Introduction ............................................................................................................................ 1 1.1 Background ....................................................................................................................... 1 1.2 Motivation ......................................................................................................................... 1 1.3 Project Goals ..................................................................................................................... 2 2. Related Work ......................................................................................................................... 3 2.1 Data-streaming Tools in Data Analysis ............................................................................ 3 2.1.1 Hadoop ................................................................................................................... 3 2.1.2 Spark ...................................................................................................................... 5 2.1.3 Limitations ............................................................................................................. 6 2.2 Agent-based Modelling in Data Analysis: MASS ............................................................ 7 2.2.1 MASS Library ........................................................................................................ 7 2.2.2 Previous Work on MASS....................................................................................... 9 2.3 Summary ......................................................................................................................... 10 3. Selection of Applications ..................................................................................................... 10 3.1 Data-streaming-based Applications ................................................................................ 11 3.1.1 Top K ................................................................................................................... 11 3.1.2 Markov-Chain ...................................................................................................... 12 3.1.3 Summary .............................................................................................................. 13 3.2 Analysis of Structured Data ............................................................................................ 13 3.2.1 Connected Components ....................................................................................... 13 3.2.2 Matrix Multiplication ........................................................................................... 14 3.2.3 Summary .............................................................................................................. 15 4. Application Development with MapReduce, Spark, and MASS ..................................... 16 4.1 Top K: Implementation and Issues ................................................................................. 16 4.1.1 Top K Hadoop Implementation ........................................................................... 16 4.1.2 Top K Spark Implementation............................................................................... 17 4.1.3 Top K MASS Implementation ............................................................................. 18 4.2 Markov: Implementation and Issues ............................................................................... 20 4.2.1 Markov Chain Hadoop Implementation .............................................................. 20 4.2.2 Markov Chain Spark Implementation .................................................................. 22 4.2.3 Markov Chain MASS Implementation ................................................................ 23 4.3 Connected Components Implementation and Issues ...................................................... 24 4.3.1 Connected Components Hadoop Implementation ............................................... 24 4.3.2 Connected Components Spark Implementation................................................... 25 4.3.3 Connected Components MASS Implementation ................................................. 26 4.4 Matrix Multiplication: Implementation and Issues ......................................................... 27 4.4.1 Matrix Multiplication Hadoop Implementation ................................................... 27 4.4.2 Matrix Multiplication Spark Implementation ...................................................... 28 4.4.3 Matrix Multiplication MASS Implementation..................................................... 29 i 5. Comparison of MapReduce, Spark, and MASS ............................................................... 31 5.1 Comparison of Programmability..................................................................................... 31 5.1.1 Intuitive Programming ......................................................................................... 31 5.1.2 Boilerplate Code Ratio ......................................................................................... 36 5.2 Performance .................................................................................................................... 37 5.2.1 Top K Application and Performance ................................................................... 38 5.2.2 Markov Chain Application and Performance ...................................................... 39 5.2.3 Connected Components & Matrix Multiplication Application and Performance 41 5.3 Summary ......................................................................................................................... 44 6. Conclusion ............................................................................................................................ 46 ii University of Washington Abstract Development of Application Programs Oriented to Agent-Based Data Analysis Chang Liu Chair of the Supervisory Committee: Professor Munehiro Fukuda Computing and Software Systems Different software has been developed in the last two decades to accelerate big data parallelism processes, such as MapReduce, Spark, and Storm. Besides these data-streaming tools, agent-based modelling (ABM) provides an alternative approach for data discovery and gain recognition on its programmability and intuition of coding. As a tool of ABM, the Multi-Agent Spatial Simulation library (MASS) is developed by Distributed Systems Laboratory (DSL) at the University of Washington, Bothell and strives to parallelize agent-based models over a cluster system. This project is to extend the initial achievement in agent-based data analysis by exploring more data science applications and identifying more data science applications that would be the better fit to agent-based parallelization. This project presents the algorithm designs and implementation process for four applications of Top K, Markov Chain, Connected Components, and Matrix Multiplication on MapReduce, Spark, and MASS. In addition, this project compares the programmability and execution performance of four applications on three frameworks. This study further discussed the previous research on the MASS and the strengths and weaknesses of MASS when it is used to implement data applications. iii 1. Introduction 1.1 Background Traditional software applications normally could store and process data with the help of a centralized server. However, when it comes to the large dataset, the data analysis is not just about scale and volume, but also involves velocity, variety, and complexity. Different software has been developed in the last two decades to accelerate big data computing, such as MapReduce, Spark, and Storm. These software tools provide their standard program paradigm to facilitate parallelization on cluster computing. The concept behind these tools is data streaming through batch processing. Batch processing is an automated job that “can run without end-user interaction, or can be scheduled to run as resources permit” [1]. These tools perform data transformation and computation on batches of data in parallel to maximize performance (Spark by mini-batches, and Storm process event by event). However, these applications are not the only solutions for large dataset. There are still alternative approaches being developed such as agent-based data discovery. Researchers and engineers have been developing and testing these frameworks for data processes. This project tends to join previous effort by exploring tools and algorithms for agent-based data discovery. 1.2 Motivation MapReduce and Spark are two major software tools to facilitate big data analysis with a cluster system or even in the cloud. Nevertheless, not all data science applications fit into their computation models. For instance, consider one of the graph problems: identifying triangles in a 1 social network, which shows the friendship degree of the network. The corresponding MapReduce and Spark programs need to separately enumerate all combinations