
Hadoop in the browser
Simulating MapReduce in the browser

by

H. Bilen & M. Zwart

in partial fulfillment of the requirements for the degree of

Bachelor of Science in Computer Science

at the Delft University of Technology,

Supervisor: dr. T. Abeel
Client: dr. C. Hauff
Bachelor coordinator: O. Visser

Preface

This report details the development and research done during the 'Hadoop in the browser' project for the course TI3806 Bachelorproject. It was commissioned by Assistant Professor Claudia Hauff from the Web Information Systems (WIS) group at the faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS) of the Delft University of Technology. The report documents the 10 weeks spent researching an optimal solution and developing it: Trifle, a web-based platform for practicing with toy MapReduce problems, built to improve the quality of teaching in the Big Data Processing course of the Computer Science curriculum. It allows for dynamic lectures, is highly accessible, and user testing indicated that it does in fact improve the understanding of MapReduce. The goal of the report is to show the reader how the application works from both a student and a lecturer perspective, which design choices were made, and how we recommend the system to be deployed. We would like to thank dr. Thomas Abeel for his regular and constructive feedback on our report, and dr. Claudia Hauff for offering us this unique opportunity and attending our user test session. Lastly, special thanks go out to our user test group, who offered their time to help us test the platform and gave us valuable feedback.

H. Bilen & M. Zwart Delft, June 2016


Abstract

With the current increase of user-generated data, the need for tools to process large quantities of data is growing. One of the courses of the Computer Science BSc curriculum, the Big Data Processing course, teaches students ways of doing so. A popular method taught there is MapReduce, a programming model for processing large quantities of data. Big Data Processing students currently write their implementations for the MapReduce-related lab assignments in the Cloudera Virtual Machine (VM). This VM is slow, cannot be used interactively, and cannot be used to teach all MapReduce principles, such as memory separation. Since no existing solution solves these problems, we decided to write our own. Doing so required diverse knowledge, including but not limited to JavaScript, ECMAScript 6, WebWorkers, HTML5 (& CSS), PHP, MySQL, API development, Wordpress, Linux, QUnit and Selenium. Our solution, Trifle, overcomes the previously mentioned and other shortcomings. Trifle is a web-based solution that simulates MapReduce within the browser. The framework uses JavaScript together with WebWorkers, and our front-end is built on Wordpress. By simulating MapReduce within the browser we created a solution that requires no cluster, is easy to use, works on multiple platforms and, most importantly, enables lecturers to teach MapReduce principles that could not be taught before. Furthermore, we integrated a submission system that can be used for interactive lectures, in which the lecturer can see problems in real time and address obstacles as they arise. A user test we conducted verifies that Trifle helps students better grasp the idea behind MapReduce.


Contents

1 Context
  1.1 About the Big Data Processing course
  1.2 MapReduce
  1.3 Hadoop
  1.4 Current MapReduce teaching methods
2 Problem definition
  2.1 Shortcomings of the Cloudera VM
  2.2 Effective teaching model
  2.3 Requirements
  2.4 State of the art
    2.4.1 Website with backend that runs Hadoop jobs
    2.4.2 Multi VM approach
    2.4.3 P2P VM approach
    2.4.4 Client-side simulation
  2.5 Choice motivation
3 Technical choices & Implementation
  3.1 Solution specific MoSCoW overview
  3.2 Use cases
  3.3 High level system design
    3.3.1 Server
    3.3.2 Browser (Lecturer)
    3.3.3 Browser (Student)
    3.3.4 MapReduce framework
    3.3.5 Custom MapReduce framework
  3.4 Implementation
  3.5 Software methodology
4 Research & results
  4.1 Benchmarks
    4.1.1 Setup
    4.1.2 Benchmarked use cases
    4.1.3 Graphical results
  4.2 User test
    4.2.1 General results
    4.2.2 Questionnaire results
    4.2.3 Suggestions from participants
    4.2.4 Remarks client
5 Discussion
  5.1 The resulting implementation
    5.1.1 Recommendations based on benchmarks
    5.1.2 Feedback from testers
  5.2 Recommendations for the BDP course
  5.3 Future work


A Defined custom fields
B Use cases code examples
  B.1 Use case 1
  B.2 Use case 2
  B.3 Use case 3
  B.4 Use case 4
  B.5 Use case 5
  B.6 Use case 6
  B.7 Use case 7
  B.8 Use case 8
  B.9 Use case 9
  B.10 Use case 10
C Implementation
  C.1 MapReduce Framework
    C.1.1 Languages
    C.1.2 Components
    C.1.3 Data flow
    C.1.4 Simulating the required scenarios
  C.2 Student testing framework
  C.3 Front-end
    C.3.1 Design
    C.3.2 Ace
    C.3.3 Bootstrap
    C.3.4 jQuery
  C.4 Back-end
    C.4.1 Wordpress theme 'Trifle'
    C.4.2 Wordpress plugins
    C.4.3 Custom
    C.4.4 Anonymous login
  C.5 Testing
D Using the system
  D.1 Installing the website
  D.2 Using Wordpress
    D.2.1 Page hierarchy
    D.2.2 Setting up the first pages
  D.3 Using the website
E Project description
F Raw benchmark results
G User test questionnaire
H Infosheet
I SIG Feedback
  I.1 SIG Feedback on code quality
  I.2 Taking advantage of the SIG Feedback
Bibliography

1 Context

This section provides context relevant to our research. We give an introduction to the Big Data Processing course (BDP) and its main teaching goals. Furthermore, we give an overview of the tools currently used to teach the MapReduce paradigm.

1.1. About the Big Data Processing course

The course focuses on teaching the students the principles of processing Big Data and discovering patterns in (semi-)unstructured data. The goals of this course include: explaining the difference between OO-programming and functional programming, explaining the major components of the Hadoop framework, demonstrating the interplay between the different Hadoop components, creating Hadoop-based algorithms for novel (unseen) practical problems and analyzing MapReduce algorithms for their feasibility in practice [1]. The BDP course introduces methods of processing such as streaming and the MapReduce paradigm. A big part of this course is teaching the students the principles of MapReduce in the Hadoop framework [1, 2], which we elaborate on in the next subsection.

1.2. MapReduce

MapReduce is a programming model developed by Google that provides a simple API for the distributed processing of massive amounts of data in a cluster [3]. MapReduce handles the distribution of the data that needs to be processed, so developers can focus on the processing itself. This way the same code can run on a single machine or on a cluster consisting of 20,000 machines. For a simple wordcount of a dataset, just two methods have to be written: map and reduce. The developer implements a mapper that splits a given dataset into key-value pairs. The MapReduce framework then splits the tasks such that the values of identical keys are given to a single reducer. The reducers output the final result back to the framework, which in turn writes it to the filesystem.


1.3. Hadoop

Hadoop is an open-source implementation of the MapReduce programming model used by many big corporations for the distributed processing of large datasets in a cluster [4].

We provide a typical wordcount example in the MapReduce model below:

• The framework executes the mapper for each line, with the document id as key and the line as value.

• The mapper emits each word in the line as key with value 1 (ex. {dog, 1}).

• The MapReduce framework combines the values of each key and distributes the keys over reduc- ers (one reducer might receive {dog, [1, 1, 1]}).

• The reducer sums up the values of each key and outputs the key together with the sum of the values (one reducer might output {dog, 3}).

• The output of the framework are file(s) with each word and their frequency (the output of a reducer might be [{dog, 3}, {cat, 2}]).
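In the JavaScript API of our platform (the function signatures match the editor defaults listed in Appendix A), this wordcount looks roughly as follows. This is a minimal sketch; the whitespace-splitting regex is our own choice:

    // Mapper: called once per line; key is the document id, value is the line text.
    function map(key, value, context) {
        value.split(/\s+/).forEach(function(word) {
            if (word.length > 0) {
                context.write(word, 1); // emit e.g. {dog, 1}
            }
        });
    }

    // Reducer: called once per key with an iterator over all values emitted for it.
    function reduce(key, values, context) {
        var sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        context.write(key, sum); // emit e.g. {dog, 3}
    }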

1.4. Current MapReduce teaching methods

During the BDP course, the professor explains concepts and gives examples of MapReduce algorithms for simple problems such as the earlier mentioned WordCount example. Given this knowledge, students implement more advanced problems in a Cloudera VM that is provided. The Cloudera VM is a Virtual Machine image based on CentOS that has Hadoop pre-installed. Students of the BDP course install this VM and can start writing the required code after some setup (like adding the input for the lab assignment to the machine) [5]. After implementation, solutions are sent to the teacher. The only way students actually interact with the principles of MapReduce is by working in this provided environment. However, this environment has a few flaws as a learning environment, which we elaborate on in the next chapter.

2 Problem definition

Our client, the BDP-course lecturer, would like to have a tool to teach the principles of MapReduce during a lecture, but the only option at hand would be the Cloudera VM that students already use for the lab assignments. In order to provide her with an optimal solution, we will research the following question and subquestions during the pre-implementation phase:

• Can an interactive and easy to use tool help a lecturer teach the principles of MapReduce?
  – What are the shortcomings of the current course setup?
  – What is an effective model for teaching programming paradigms?
  – What is the most suitable tool that has these properties and overcomes these shortcomings?

2.1. Shortcomings of the Cloudera VM

The Cloudera VM has several shortcomings that make it unsuitable as a tool to use during lectures. Cloudera runs Hadoop jobs as a single-node cluster, which means that the mapper and reducer are executed on the same machine. For this reason memory between the mapper and reducer is not separated, which is not the case in a real cluster. Bad solutions of students might work in the VM but would not in a real cluster, and it is hard to teach students the principle of separated memory without the behaviour of a cluster. Moreover, specifying the number of mappers and reducers is not possible in the VM. This means the lecturer cannot let students experience the difference between a single-node cluster and a cluster with multiple mappers and reducers. Furthermore, Cloudera takes time to download, install, configure and learn to use: precious time that could be used for a more in-depth look at important aspects of MapReduce, such as separate memory for nodes. Finally, professors are unable to track the progress of implementations written by students in the Cloudera VM, making the VM unsuitable for interactively teaching students MapReduce during lectures.

2.2. Effective teaching model

Research shows that students involved in an active learning environment understand and remember the material in a course better and drop out of a course less frequently [6]. Another study shows that active learning results in up to 72% improved long-term retention of knowledge compared to passive learning methods [7]. Active learning environments are thus an effective style of teaching. Research suggests that active learning is especially effective for Computer Science students, who tend to be visual/intuitive learners [8]. Given the opportunity to interact with the material, students of most disciplines demonstrate better understanding of the material, and research suggests Computer Science students especially demonstrate this improvement [9]. Furthermore, collecting and then grading assignments is a tedious process that takes away time from teaching; these issues could be resolved by using an online system [10]. These results allow us to conclude that an active teaching environment is the best approach to help students improve their understanding of MapReduce. A possible active environment is, as our client also suggested, one that allows students to work on exercises during lectures, where the lecturer can also gain insight into the solutions of the students, so he/she can explain topics that students find difficult or have a hard time getting their heads around.

2.3. Requirements

There is a need for a tool that helps students get a better understanding by having "activity breaks" during a lecture. When doing an assignment about the material discussed in the lecture, students tend to be more engaged and thus learn more [11]. The usefulness of the tool would greatly improve if it were to have two-way communication and educator-student interaction, as these aspects are deemed characteristics of effective lectures [12]. Therefore a highly accessible tool that can be used during the lectures to teach the students the principles of MapReduce would be a suitable solution. The description of this project can be found in Appendix E.

The solution that we introduce will be used to improve teaching of the BDP course. However, the solution also has certain requirements it has to comply with. The BDP students have no access to a real cluster, and for this reason all lab assignments are currently conducted using the Cloudera VM. Therefore the solution must work without a cluster, just like the VM. A solution is sought that enables important cluster-like conditions that the VM lacks, like a scalable amount of nodes and separate runtime memory between nodes. It would enable students to learn how issues like separate runtime memory influence the way one programs in Hadoop, in addition to teaching students the MapReduce paradigm in general. Furthermore, it would be desirable to configure the amount of mappers and reducers dynamically. As implied in this paragraph, the goal of this research is not to implement a simulation that does everything that (real) Hadoop offers, but rather to offer what the Cloudera VM does not offer. The goal is to create a tool that complements the current teaching methods and preferably enables the lecturer to use the tool for in-class exercises.

In order to create a tool that suits the use case, it will be made accessible and interactive. The teacher must be able to see submissions in real time, and the student must be able to anonymously log in and submit solutions. Besides this, it should be easy to configure for each lecture. Finally, it has to work on each popular OS (Windows, OSX and Linux), and there may be no requirement to install anything besides a modern browser like Firefox or Chrome. Below we include an overview of the requirements we will take into account while weighing the possible solutions in the next section.

Must have:
1. must not depend on a cluster
2. students must be writing code in a language they learned in one of their courses

Should have:
3. memory separation of mappers and reducers
4. ability to define the amount of mappers/reducers
5. ability to see submissions in real time
6. multi-OS support
7. high accessibility

Could have:
8. practical unit tests for students (defined by lecturer)
9. memory monitoring/limitation

Won't have:
10. the requirement to install Hadoop or a VM

Table 2.1: Generic MoSCoW overview

2.4. State of the art

There are many possible solutions that could be applied to help students grasp the principles of MapReduce faster and better. We discuss the most promising ideas that we found or came up with, and argue why we have or have not chosen them as our approach to solve our challenge. For a quick overview of this section, please see Table 2.2 at the end of this section.

2.4.1. Website with backend that runs Hadoop jobs

One possible path some programming courses take is to offer a web-based solution in which jobs are run on a (clustered) backend. One of the TU Delft Bachelor Computer Science courses that does this is 'Concepts of Programming Languages', using WebLab [13]: "WebLab, an e-learning platform that lets students write, execute and test Scala programs entirely in their browser" [10]. Students are prompted with a simple web-based IDE that they can use to write solutions in. When they press run, the code is sent to the server where it is compiled and executed, and the output of the run is shown to the student. A similar approach for MapReduce would be practical in a technical sense: "The proofs of concept serve as early evidence for the usefulness of web IDEs as a teaching aid and the feasibility executing advanced editor services in the browser." [10] The implementations of students would be tested under real-world conditions: a real cluster in combination with Hadoop. Additionally, students don't have to install anything and solutions are kept online. However, this solution is impossible without the proper financial resources. Given that limited resources are available to the BDP course, the financial aspect is a big disadvantage of this solution. An outdated but similar solution has been created before, called WebMapReduce [14].

2.4.2. Multi VM approach

Another possibility would be to have each student install multiple instances of the Cloudera VM and cluster these. This way each student would have his/her own 'cluster' to run MapReduce problems on. This solves the shared memory problem and can teach students how the system behaves when multiple mappers and reducers are present. However, it means that if a professor wants the students to run a job with, for instance, 10 reducers and 1 mapper, the student will have to install 11 instances of the Cloudera VM. Having this many active VMs is a big strain on the machine of the student; a slightly older laptop might not even be able to deal with this many VMs. It is also very time consuming to install all these VMs and add them to the simulated cluster. Sadly, this approach still does not allow dynamic lectures in which the professor gets insight into how the students approach given problems, simply because the students' solutions are kept on their local machines. Regardless of the fact that this would be a very realistic simulation for the students, these last three drawbacks are simply too big from a teaching perspective for this approach to be practical.

2.4.3. P2P VM approach

A very interesting approach would be to incorporate a Peer-to-Peer (P2P) network into the Cloudera VM, where each student would represent a node in the cluster, allowing them to use their peers' machines as extra mappers/reducers. This way every student would only have to install one instance of the Cloudera VM, in which we would incorporate the software that sets up the P2P network. Another advantage is that even though the student only installs one VM, they can still run very realistic MapReduce jobs that do not have the shared memory problem, and the amount of required mappers or reducers can be scaled up to the size of the network. However, this approach also has its drawbacks. The amount of mappers/reducers cannot exceed the size of the network: say there is an example which uses 100 reducers while only 30 students are present, then this cannot be simulated. If a peer disconnects while another user is waiting for a mapper/reducer that was running on this peer's machine, a part of the job will have to be re-run, which causes extra load on the rest of the cluster. If multiple users are running jobs, the machines in the cluster might become overloaded, resulting in a very slow cluster that prevents all students from continuing their work properly. Another edge case is a student who does not have an active internet connection or just happens to be working alone: he still has the same situation as before, with shared memory between the mapper and reducer and no way to increase the amount of mappers/reducers. Last but not least, it still does not allow for dynamic lectures in which the professor gets insight into how the students approach given problems, which from a teaching perspective is a major drawback.

2.4.4. Client-side simulation

Rather than running jobs in Hadoop in a cluster, one could write and use a framework or tool that runs jobs in a simulated environment, possibly with cluster-like behaviour. There are multiple ways of doing this, and different programming languages can be used, but the solutions fall into two very different categories: program based and website based.

Program based

A program based solution has to be installed before the student can take advantage of it, for example a Java Swing based application or a plugin that students install in their web browser. Either way, this solution might be platform or browser dependent and cannot be used instantly. However, this is the best performing client-side simulation option regarding execution speed and maximum input size. It would also allow the students to write Java (just like in Hadoop).

Website based

As opposed to program based solutions, a website based solution does not need any installation (apart from the browser itself). Such a solution would typically use JavaScript. Using JavaScript along with WebWorkers would allow us to configure the amount of mappers & reducers and also separate the memory of the mappers and reducers. JavaScript and most WebWorkers features are known to be supported by all modern browsers [15, 16] (a solution might support a limited set of browsers due to time constraints or less than full browser compatibility). Those characteristics alone are a big plus. By contrast, mapper and reducer code could practically only be written in JavaScript to be evaluated within the browser. Moreover, since everything runs within the browser, the input size of the dataset is limited. As mentioned in subsection 2.4.1, web IDEs are useful as a teaching aid [10]. The most significant disadvantage of a client-side simulation solution is that a simulation does not have the same conditions as a real Hadoop cluster. For that reason, a website based solution would probably only be useful for educational purposes, and only for use cases whose conditions the solution can accurately simulate.

The current Cloudera VM
  Advantages: Runs the actual Hadoop framework; Cheap to maintain (does not cost anything at all); Well documented
  Disadvantages: Shared memory between mapper/reducer; Cannot simulate behaviour of framework with multiple mappers/reducers; Takes a lot of time to install; Does not allow for dynamic lectures

Website with clustered backend
  Advantages: Very realistic simulations; No shared memory issue; Students do not have to install any software besides a browser; Allows for dynamic lectures
  Disadvantages: Hosting the cluster required for the backend is very expensive

Multi VM approach
  Advantages: Very realistic simulations; No shared memory issue; Can set the amount of mappers/reducers to be simulated
  Disadvantages: Does not allow for dynamic lectures; Takes a lot of time to set up; Limits amount of mappers/reducers to the amount of VMs the host machine can handle

P2P VM approach
  Advantages: Very realistic simulations; No shared memory issue
  Disadvantages: Does not allow for dynamic lectures; Still takes at least as much time to set up as the original VM; Limits amount of mappers/reducers to the size of the P2P network (could be 1); Network could get overloaded, resulting in all students being affected by slowness

Client-side simulation using a program
  Advantages: Fixes the shared memory issue; Can set the amount of mappers/reducers; Allows for dynamic lectures
  Disadvantages: Requires some sort of installation before it can be used; Tends to become platform dependent; Simulations can not be completely realistic

Client-side simulation using a website
  Advantages: Fixes the shared memory issue; Requires no installation whatsoever; Platform independent; Can set the amount of mappers/reducers; Cheap hosting costs
  Disadvantages: Simulations can not be completely realistic; Input size is limited

Table 2.2: An overview of the advantages and disadvantages of the discussed solutions

2.5. Choice motivation

In the table below we show which requirements are met by which possible solution. The requirements can be found in Table 2.1; the requirement numbers match the numbers in this table. A checkmark means the requirement is satisfied, a cross means that it is not.

Solution                                 R.1  R.2  R.3  R.4  R.5  R.6  R.7  R.8  R.9  R.10
The current Cloudera VM                   ✓    ✓    ×    ×    ×    ×    ×    ×    ✓    ×
Website with clustered backend            ×    ✓    ✓    ✓    ✓    ✓    ×    ✓    ✓    ×
Multi VM approach                         ×    ✓    ✓    ✓    ×    ×    ×    ×    ✓    ×
P2P VM approach                           ×    ✓    ✓    ✓    ×    ✓    ×    ✓    ✓    ×
Client-side simulation using a program    ✓    ✓    ✓    ✓    ✓    ✓    ×    ✓    ✓    ✓
Client-side simulation using a website    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓    ×    ✓

Table 2.3: An overview of which requirements are met by which possible solution

Weighing all aspects of each potential solution, we quickly arrived at the conclusion that a client-side simulation would be the most feasible solution. We were left with the choice between a client-side simulation using a program or using a website: a tool with lower accessibility and memory management, or a tool with high accessibility and no memory management. Since we feel accessibility is an important aspect in this challenge, we have chosen the client-side simulation using a website. A website with a cluster in the backend would be perfect, but is unachievable due to the lack of financial resources. The multi VM approach could work, but puts a big strain on the students' machines, putting those with older machines at a disadvantage. Besides that, it still does not allow very interactive lectures, and the user experience would probably suffer a lot as well. The P2P VM approach sounds rather promising, but comes with the risk of the network getting overloaded. Furthermore, one cannot simulate more mappers/reducers than there are available machines. Finally, this method still does not allow very interactive lectures. These facts combined made us decide against this solution as well. Developing a program that students could install on their machines would have all the advantages of the client-side browser solution, except that students would have to install the program. This finalised our decision to implement the client-side browser solution: it requires very few resources, provides all features requested by our client and has the best accessibility of all possible solutions.

3 Technical choices & Implementation

This chapter introduces the technical choices which were made before and during the implementation phase. It elaborates on the requirements, provides an overview of the system's design and finally briefly discusses the implementation.

3.1. Solution specific MoSCoW overview

We start this chapter by defining the exact requirements for the website-based simulation solution. These requirements are based on meetings we had with both our client and supervisor. We will use this overview as a reference to motivate our (technical) design choices. The overview of the requirements can be found in Table 3.1.

3.2. Use cases

Together with our client, dr. Claudia Hauff, we have defined the following cases the system should certainly be able to simulate:

• WordCount to count the total amount of words in a document.

• WordCount (count the frequencies of each word).

– Same scenario with 2 mappers and 2 reducers.

• Inlink count (output the number of times each webpage is referred to)

• List documents and their categories occurring 2+ times

• Simple WordCount with a combiner summing the total term frequencies in documents (here the result with and without combiner are the same)

• Simple WordCount with a combiner that computes the average term frequencies in documents (here the result with and without combiner are different)

• WordCount example that initializes a dictionary in the setup of the mapper. The mapper would then only map words that occur in the dictionary.

• WordCount example with a setup and cleanup. WordCount would emit the 10 most occurring words.

• WordCount example with a setup and cleanup. This example would emit how many words start with the same letter.

• WordCount example that uses Counters to determine the total amount of words (or amount of unique words), essentially different from case 1 where no counter is used.


Must have:
• not depend on a cluster
• run completely in a browser
• students must write JavaScript code
• has to be platform independent
• students should be able to write mapper and reducer code

Should have:
• an interface for the lecturer to see submissions in real-time
• an interface for the students to code in (in JavaScript)
• back-end for lecturer to define exercises in
• separate memory for mapper and reducer
• amount of mappers/reducers should be configurable
• ability to simulate setup for both mapper and reducer
• ability to simulate combiner and cleanup
• simulation of counters
• ability to simulate partitioner

Could have:
• an interface for the students to see an overview of all exercises
• unit testing (definable in back-end)
• memory management/limitation

Won't have:
• required installation of plugins/tools

Table 3.1: Requirements defined according to the MoSCoW method

3.3. High level system design

In this section we define a high level system overview and then proceed to explain in more detail how each component should work. Our system consists of four general components: the server, the students' browser, the MapReduce framework and the lecturer's browser (where exercises can be defined and real-time submissions of students can be seen). Below you will find a graphical representation of this system; we then proceed to elaborate on each component.

3.3.1. Server

The server serves partially static data (such as code for running the MapReduce jobs), manages (anonymous) sessions, saves submissions in a database, keeps a collection of exercises and renders pages for the lecturer and students. The system we have chosen to serve this content and handle web accessibility is Wordpress.

Wordpress

This research mainly focuses on the framework rather than the website hosting it. Therefore we have chosen not to write a Content Management System (CMS) from scratch, something that has already been done several times by others. A well known and well documented CMS is Wordpress [17]; its good documentation and large community are our primary reasons for choosing this solution.

Figure 3.1: A high level overview of the system

3.3.2. Browser (Lecturer)

The lecturer should be able to log into the admin area, create exercises (and possibly add unit tests for them) and see an overview of the submissions for those exercises. The admin area and exercise creation will be integrated in Wordpress, since this is easily configured in this CMS. The overview of submissions will be a separate page on the website which only the lecturer can see. When the lecturer clicks on a submission, a student view is opened with the requested submission so the lecturer can directly run it, making it easy to show the students what needs to be changed to fix an error in a given submission.

3.3.3. Browser (Student)

Students should be able to use the tool anonymously and view the overview of exercises. They should be able to write implementations of mapper/reducer code, and run and submit their implementations. The anonymity of the students is taken care of by the server. Each assignment will have a page with several fields in which the students can write mapper/reducer code in JavaScript, buttons to run/submit the solution, a console to see the logs of the simulation and, finally, the final output of the reducers.

3.3.4. MapReduce framework

The framework should accept input and user-defined mapper/reducer (and possibly setup/combine/cleanup) functions. It should then use them to run a MapReduce job and return its output. The framework should use separate memory for different nodes and between mapper and reducer.

State of the art

MapReduce is a popular algorithm that many people have implemented in several languages, including JavaScript. We have looked at multiple open source projects written in JavaScript. Below we give a small list of the most noteworthy ones and explain what they do well and what they lack.

• https://github.com/f1ames/maprereduce
A very simple implementation that works in the browser and is able to take an array as input. However, it does not use WebWorkers (so memory separation is not possible). Furthermore it is not very advanced: it lacks functions such as setup, combine and cleanup. Thus it would not be able to handle several of our use cases defined in section 3.2, and therefore also cannot meet our should-have requirements in Table 3.1.

• https://github.com/wenkesj/mapreducejs
Runs in the browser but depends on a server-side master; furthermore it only offers map and reduce functions. Because of this limitation, this framework is unable to meet one of our must-have requirements defined in Table 3.1, nor is it able to handle all of our use cases defined in section 3.2.

• https://github.com/eterna2/MRCluster
Able to configure the amount of mappers & reducers and able to define map/reduce/partition/combine/cleanup functions. However, it is written for Node.js (a server-side JavaScript runtime), does not use WebWorkers (so memory separation is not possible) and only accepts files as input. This framework comes closest to our functional requirements, but we would have to modify it to work in native browser JavaScript. Because modifying this solution had a high potential to take more time than writing a custom solution ourselves, we have chosen not to use this framework.

3.3.5. Custom MapReduce framework

We have chosen a custom solution because we expected it would require a rather limited amount of time to develop, and we would get extra features that the state-of-the-art frameworks do not offer us.

WebWorkers

In order to achieve true memory separation between mapper & reducer and between different (simulated) nodes, one can only use WebWorkers [15]. It would be impossible to truly separate the mapper and reducer memory with a native approach or any framework approach (like jQuery), because global variables could be declared in the mapper which would then be available in the reducer. Furthermore, by moving the computation to WebWorkers, the webpage stays responsive during code execution.
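A minimal sketch (not our exact implementation) of why this works: every worker is constructed with its own global scope, so state declared inside one simulated mapper can never leak into another worker or into a reducer.

    // Build two workers from the same (user-supplied) source string.
    var mapperSource =
        'var count = 0;' +                 // global inside the worker, invisible outside
        'onmessage = function(e) {' +
        '    count++;' +
        '    postMessage(count);' +
        '};';
    var blobUrl = URL.createObjectURL(
        new Blob([mapperSource], {type: 'application/javascript'}));

    var mapper1 = new Worker(blobUrl);
    var mapper2 = new Worker(blobUrl);     // same code, completely separate memory

    mapper1.onmessage = function(e) { console.log('mapper 1: ' + e.data); };
    mapper2.onmessage = function(e) { console.log('mapper 2: ' + e.data); };

    mapper1.postMessage('line');           // logs "mapper 1: 1"
    mapper1.postMessage('line');           // logs "mapper 1: 2"
    mapper2.postMessage('line');           // logs "mapper 2: 1" - it never saw mapper1's count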

MapReduce framework sequence diagram

In order to simulate MapReduce jobs, one has to follow the MapReduce paradigm. First the input is partitioned and split over the available mappers. The mappers map the input into key-value pairs. Thereafter, the pairs are aggregated per key and sent to the available reducers. Finally, the reducers send their output back, and it is displayed to the user. The following sequence diagram shows this process. Each mapper and reducer is a WebWorker instance running in the background; this method allows us to create multiple WebWorkers if we want to simulate more than one mapper/reducer. The master instantiates, commands and eventually terminates the mapper/reducer workers, and provides the user with the output of the workers.
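The routing decision in the aggregation step can be made concrete with the default partitioner our framework ships with (its full definition is listed in Appendix A): a deterministic string hash ensures that all pairs with the same key are sent to the same reducer.

    // Default partitioner: spreads keys evenly over the available reducers.
    function getPartition(key, value, numPartitions) {
        var hashCode = function(s) {
            return s.split("").reduce(function(x, y) {
                x = ((x << 5) - x) + y.charCodeAt(0);
                return x & x; // keep the intermediate hash a 32-bit integer
            }, 0);
        };
        return Math.abs(hashCode(key)) % numPartitions;
    }

    // Deterministic: every {dog, 1} pair is routed to the same reducer index.
    console.log(getPartition("dog", 1, 2) === getPartition("dog", 1, 2)); // true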

Memory management

We have looked into possibilities for memory management and/or memory limitation. Such features would allow lecturers to teach the difference between memory efficient and inefficient algorithms. However, JavaScript and WebWorkers do not offer memory management or limitation out of the box. In Google Chrome there are ways to request the used JS heap size, but this requires a flag that is not enabled by default. Furthermore, it is not possible to specifically request or limit the memory usage of a single WebWorker.
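For reference, this is the extent of what the browser exposes; performance.memory is a non-standard, Chrome-only API, and precise (rather than coarse-grained) values additionally require starting Chrome with the --enable-precise-memory-info flag:

    // Chrome-only heap introspection; there is no per-WebWorker breakdown.
    if (window.performance && performance.memory) {
        console.log('used JS heap:  ' + performance.memory.usedJSHeapSize + ' bytes');
        console.log('JS heap limit: ' + performance.memory.jsHeapSizeLimit + ' bytes');
    } else {
        console.log('heap introspection is not supported in this browser');
    }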

Figure 3.2: MapReduce framework sequence diagram

3.4. Implementation

Our solution, Trifle, was built from scratch and consists of a JavaScript frontend with a PHP backend. The MapReduce code a user writes in this system has to be JavaScript and resembles the core of the Hadoop 2.6.3 API as closely as possible [18], so the user does not have to become familiar with two different APIs while learning MapReduce. Trifle's most important features are as follows:

• Intuitive interface with easy to use JavaScript code editors.

• Simulation of all functions from the Mapper class [19].

• Simulation of all functions from the Partitioner class [20].

• Simulation of all functions from the Reducer class [21].

• Shows a preview of the input on the page (full input can be downloaded).

• Shows a preview of the output on the page after a job is done (full output can be downloaded).

• Allows the lecturer to see students' submissions in (near) real time.

• Students' submissions are tested against tests predefined by the lecturer. The results are shown to both lecturer and student.

• Accessible, since it requires no installation or registration.

The frontend is responsible for storing (anonymous) authentication details, simulating MapReduce jobs and fetching job data such as input and tests from the server. Communication with the server is done through HTTP requests to our REST API [22], where most of the data is represented in JSON format [23]. The server-side implementation was kept relatively simple: a Wordpress installation, which incorporates our custom theme and REST API plugin, forms the backbone of the backend. Besides the template and styling definitions, the theme includes the MapReduce framework and the student testing framework. New exercises can be made by creating new exercise pages; for more details on using this system, please see Appendix D. For further implementation details, please see Appendix C.
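As an illustration of this frontend-backend exchange (the route and field name below are invented for the sketch; they are not the actual paths of our API), fetching the data of an exercise could look like this:

    // Hypothetical request for exercise data; the real routes differ.
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/wp-json/trifle/exercises/1'); // illustrative URL only
    xhr.onload = function() {
        var exercise = JSON.parse(xhr.responseText); // data travels as JSON
        console.log(exercise);
    };
    xhr.send();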

3.5. Software methodology

During our implementation phase we adopted Scrum in order to increase our productivity [24]. We had daily meetings in which we discussed the previous sprint, whether we were having any problems and how to tackle them, and finally planned our next sprint. This way we avoided doing duplicate work, did not stand in each other's way and solved problems more quickly. Furthermore, we used an online scrum board on trello.com. With a scrum board we could prioritize tasks and see our progress at any time.

4 Research & results

This chapter discusses the conducted research and its results. We first discuss the benchmarking of the system and its results, after which we discuss the conducted user test and its results.

4.1. Benchmarks

We conducted benchmarks using our framework in order to determine performance and the (practical) input size limit. Benchmarks were done using Chrome and Firefox on three different machines using four different cases. Raw results can be found in Appendix F.

4.1.1. Setup

The benchmarks were conducted on three machines: two laptops and a custom-built desktop. Below we include the relevant specifications of these machines; we will refer to them as M1, M2 and M3.

• Machine 1 (M1): Thinkpad Yoga 15 (8 GB RAM, i7 @ 2.4 GHz)
• Machine 2 (M2): custom desktop computer (8 GB RAM, i5 @ 3.8 GHz)
• Machine 3 (M3): Acer Aspire V Nitro (8 GB RAM, i7 @ 2.5 GHz)

4.1.2. Benchmarked use cases

The benchmarked cases were picked from the list of use cases defined in section 3.2. We chose the following four:

• Case 1: a simple WordCount example, to test the performance of a simple implementation
• Case 2: a more complex WordCount example, to test performance
• Case 8: the 10 most occurring words example, in which workers temporarily keep data in memory, to test memory usage and performance
• Case 10: a counter example, in which the performance of internal communication in the framework is tested

4.1.3. Graphical results

The following graphical results show the correlation between input size and runtime/memory usage. For runtime there is a clear linear correlation. In Figure 4.1 you can see that in one benchmark the partitioner threw an error due to running out of memory (100MB input, case 1, Firefox, machine 1). For every runtime result we ran the benchmark three times and averaged how long it took to finish, measured by outputting data when the run is done. For every memory usage result we also ran the benchmark three times and averaged the peak memory usage as reported by the native Windows Task Manager. We looked at peak browser memory usage in order to also take the browser's overhead into consideration, compared to executing the script separately.


Benchmark results case 1

The benchmark results measuring input size in MB versus runtime in ms showed a clear linear trend. Generally speaking, Firefox had less overhead than Chrome, resulting in faster runtimes in most cases. However, in one case Firefox did run out of memory on machine 1 (M1). The browser clearly has a big overhead regarding memory usage. Since the browser's memory usage exceeds 1GB with an input size of 100MB, we strongly discourage anyone who deploys this platform from simulating cases with this input size; a 50MB input also tends to result in memory usage close to 1GB. We recommend input sizes smaller than 10MB, since these result in simulations with runtimes below 6-7 seconds.

Figure 4.1: Graphical representation of runtime in use case 1

Figure 4.2: Graphical representation of memory usage in use case 1

Benchmark results case 2

Case 2 had considerably lower running times: simulations processing about 50MB would take anywhere from 4 to 6.5 seconds. In Firefox, however, this input size resulted in a peak browser memory usage of over 900MB, so we still recommend inputs of 10MB or less when simulating this case.

Figure 4.3: Graphical representation of runtime in use case 2

Figure 4.4: Graphical representation of memory usage in use case 2

Benchmark results case 8

Case 8 had quite low runtimes and memory usage, peaking at a runtime of 12 seconds and a memory usage of just over 1GB for the largest input size. This case includes some more complex computations, which makes the results even more interesting. As with the previous two benchmarked cases, we recommend not exceeding a 10MB input size, in order to keep the simulations fast and avoid using excessive amounts of memory.

Figure 4.5: Graphical representation of runtime in use case 8

Figure 4.6: Graphical representation of memory usage in use case 8

Benchmark results case 10

The last case we benchmarked, case 10, was primarily meant to test internal communication within the framework; more specifically, communication between the WebWorkers and the webpage's JavaScript. This appears to be very efficient: processing 100MB of input takes at most 13 seconds. The browser does, however, still use close to 1GB of memory while processing this data, so as for the other cases, we recommend not exceeding 10MB of input for a simulation.

Figure 4.7: Graphical representation of runtime in use case 10

Figure 4.8: Graphical representation of memory usage in use case 10

4.2. User test

In order to streamline our solution, we conducted a user test with fellow students, some of whom did and some of whom did not follow the Big Data Processing course. Because some participants might lack knowledge about MapReduce, we gave them a short introductory lecture. The lecture explained the following topics:

• What Trifle is.

• What MapReduce is.

• How MapReduce works.

• Explanation of assignment 1: WordCount (participants are asked to implement WordCount such that it outputs each word with its frequency).

• Explanation of assignment 2: Inlink count (participants are asked to implement Inlink count such that it outputs the amount of references to each referenced document).

• Explanation of assignment 3: Counters (participants are asked to implement a counter that outputs the total amount of words using the context.getCounter(name).increment(amount) API).
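One possible solution to assignment 3, sketched against the counter API named above (the counter name and the word-splitting logic are our own choices, not the reference solution):

    // Mapper: count words through a counter instead of emitting key-value pairs.
    function map(key, value, context) {
        var words = value.split(/\s+/).filter(function(w) { return w.length > 0; });
        context.getCounter('totalWords').increment(words.length);
    }

    // Reducer: nothing to aggregate; the counter already holds the total.
    function reduce(key, values, context) {
    }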

After the introduction we explained some simple assignments that are also part of the BDP course and asked the participants to implement them in Trifle. After each assignment we would explain the next one, until all three assignments were completed. After completing the test, the participants were asked to fill out a questionnaire; the full questionnaire can be found in Appendix G. The questions concerned the usability of the system and whether the participants felt it had helped them understand the MapReduce programming model.

4.2.1. General results

The most interesting result we found during the test is that JavaScript knowledge is the biggest obstacle to getting started with the system. Most testers quickly grasped the idea, but made mostly syntax errors or did not know how to implement certain functionalities (like regex matching) in JavaScript. The first assignment took most users the longest; after that, more complicated assignments were completed rather quickly. So if the system gets deployed and used in an actual lecture of the BDP course, it is to be expected that the first (and probably the second) exercise will take an above-average amount of time to implement, mainly because students will have to refresh their JavaScript knowledge. Furthermore, the 'context' variable was initially unclear. We concluded this was mainly due to the fact that we explained MapReduce based on an old Hadoop API, which did not have the 'context' variable; explaining MapReduce using the Hadoop 2.6.3 API [18] should eliminate this confusion.

4.2.2. Questionnaire results

The results of the questionnaire are discussed in the following sections. Each question's results are visualised using a pie chart. The most interesting results from the questionnaire were found in the questions about error displaying.

Experience with MapReduce/Hadoop

Our test group was evenly partitioned: half of the group was somewhat familiar with MapReduce while the other half was not. This gave us the opportunity to filter down the other questions' results, to see how users who were familiar with MapReduce experienced our system differently from users who were not.

Figure 4.9: User test questionnaire answers as a pie chart

Navigation and intuitiveness of the interface

We started off with some more superficial, but nevertheless important, questions. No users experienced any problems navigating the website, and the users voted unanimously yes to the question whether the interface was intuitive.

Figure 4.10: Answers to the question whether the website was easy to navigate through

Figure 4.11: Answers to the question whether the interface was intuitive

Tooltips and tab structure

The following two questions were also about the interface; more specifically, about the tooltips and tab structure on the exercise pages. The tooltips were found to be very useful. One user would have preferred the tabs to be displayed next to each other; however, in the test we had only used two of the 7 editors, the others were hidden. So if only two editors are present, it would be a good idea to display the editors beside or below each other, but when all 7 editors are shown, this is not a very effective way to display them.

Figure 4.12: Answers to the question whether the tooltips were helpful

Figure 4.13: Answers to the question whether the user liked the tab structure

Error messages

The question whether the error messages were displayed clearly showed some interesting results: half of the participants thought the error messages were not displayed clearly. Even though the question was focused on the visual aspect, most users answered no because the messages themselves were not very clear. Our system in its current setup is only able to determine whether a function written by the user is valid or not; it is unable to tell the user which specific line is syntactically wrong. Even though the Ace editors come with some built-in syntax validation, some edge cases are overlooked. For instance, in Java one could write:

for ([Type] element : list) { //do something }

while in JavaScript this is syntactically wrong. Ace fails to recognise this syntax error, so the user tends to think it is a valid expression. When running a job with a syntax error like the one described above, our system shows the user an exception with the message "this is not a function", leaving the user confused about what is wrong with their code.

Figure 4.14: Answers to the question whether the error messages were displayed clearly

We wanted to see whether having some experience with MapReduce beforehand made any difference in the answers to this question. It turned out it did not: half the users of each group answered 'no'.

Figure 4.15: Answers from users who did know MapReduce beforehand

Figure 4.16: Answers from users who did not know MapReduce beforehand

Currently we cannot change the way errors are shown to the user: if a syntax error passes through Ace's syntax validation, there is nothing we can do about it. We can only show which function contains an error, not which line in that function is wrong. For this we would have to write our own custom syntax validator, which is far outside the scope of this project. It might, however, be a good addition to implement in future work.
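For reference, the JavaScript equivalents of the Java-style loop shown above, which is what students would need to write instead:

    var list = ['dog', 'cat', 'dog'];

    // Classic JavaScript iteration:
    for (var i = 0; i < list.length; i++) {
        var element = list[i];
        // do something with element
    }

    // Or, using ECMAScript 6 (which our platform is built with):
    for (const element of list) {
        // do something with element
    }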

Performance

Our test group was very satisfied with the performance of the system; no user had any remarks about long running times. Performance was one of our main concerns, especially across browsers, so it is good to have verification that the system also runs efficiently on other machines.

Figure 4.17: Answers to the question whether the performance was good

Trifling around

The users felt that working with the platform helped them grasp the idea behind MapReduce: being able to quickly dive into the material and run simulations in rapid succession helped them better understand what was going on. One user felt it would not improve the speed of understanding how MapReduce works; since the answers were anonymous and the person who answered no did not reveal himself, we do not know why this user thought this way about the platform.

Figure 4.18: Answers to the question whether trifling around helped understanding MapReduce

Figure 4.19: Answers to the question whether the user thought the platform improves the speed of understanding MapReduce

Trifle during the BDP course

All users found the platform a good addition to the material and would like to use it during the lectures. These questions were asked under the assumption that the participants follow the BDP course.

Figure 4.20: Answers to the question whether the platform is a good addition to the material

Figure 4.21: Answers to the question whether the user would like to use the platform during lectures

4.2.3. Suggestions from participants

During the user test, some participants gave us suggestions that could improve the platform. We noted these suggestions down and aggregated them into the following list:

• Rather than only giving the name of the test that fails, also explain why it fails.

• Show the user that an assignment has already been submitted.

• The run button looks disabled when grey.

• Explain what the context parameter does and when it should be used.

• Give better feedback to the user when something has been submitted, for example by going to the next assignment (if any).

• Explain that students should write JavaScript, so they do not confuse it with the syntax of other languages.

4.2.4. Remarks client

Our client, who attended our user test, had some remarks:

• The majority of the participants completed the first two assignments, and all participants completed the first assignment.

• It is confusing how the input translates to key/value pairs and whether the value parameter of map is a single line or a whole document.

• The context parameter lacks documentation.

• It is unclear what the output of the Mapper looks like.

• Some participants were not overly familiar with JavaScript syntax

• The submission confirmation should be more obvious.

In order to remedy these issues we have done the following things:

• Made the assignment description describe how the input gets split and fed to the mapper.

• Explained what the context parameter does and when it should be used in the assignment description.

• Added a link to the JavaScript documentation, together with an explanation that the student code should be written entirely in JavaScript.

5 Discussion

In the following sections we discuss the results of the project. The target was to implement a highly accessible tool that can be used during the lectures to teach students the principles of MapReduce.

5.1. The resulting implementation

The platform is highly accessible, as it only requires you to go to a website; no installation is needed whatsoever. From the user tests we concluded that it does help in understanding the principles of MapReduce. Furthermore, it overcomes the shortcomings of the Cloudera VM, such as the lack of memory separation. Therefore we can answer our research question from chapter 2: an interactive and easy to use tool can help a lecturer teach the principles of MapReduce. The platform was implemented using HTML5, CSS and JavaScript (ECMAScript 6), and uses WebWorkers to do computations in the background without the website becoming unresponsive. As a backend we chose Wordpress, for its good documentation and to provide a clean interface for our client to define the exercises in. Every single requirement defined in section 3.1 is met in the current implementation.

5.1.1. Recommendations based on benchmarks

From the benchmarks we can clearly see that the runtime scales linearly with the input size. Furthermore, Firefox computes the result much faster, in some cases twice as fast, but that comes with a bigger memory usage compared to Chrome. Due to the browser-set maximum JavaScript heap size, the maximum input size is about 100MB, but in practice the input size is lower for several reasons. First of all, since the solution will be used in a lecture environment, solutions must run fast so the lecturer can teach the subject interactively. Computing jobs with inputs of 100MB can take several minutes client-side, and serving a 100MB input file to a hundred students is far from ideal when a low-end server is used. Benchmarking with different input sizes has led us to conclude that the practical limit for the input size is somewhere around 1MB, never exceeding 10MB. Only having input files of 1MB or less makes the platform cheap to host, fast to compute and interactive to use. Small input sizes, used solely to teach students the concept of MapReduce, are best suited for our solution, which has minimal memory and CPU usage for those inputs. Our assumption that, performance wise, processing large quantities of data would be impractical in a browser is verified by our benchmarks. However, in section 2.3 we noted that "the goal of this research is not to implement a simulation that does everything that (real) Hadoop offers but rather offer what the Cloudera VM does not offer". In other words, the goal was to create something that performs well with small assignments that teach concepts like memory separation: something we have proven to work, and therefore, benchmark-wise, our requirements are met.


5.1.2. Feedback from testers

The user test gave us insights from a different perspective. Students thought the interface was easy to navigate and intuitive to use. Students also mentioned that error messages were confusing and that they did not know what was wrong with their code; as discussed in subsection 4.2.2, this was mainly due to the fact that some syntax errors were not picked up by Ace's syntax validation. Furthermore, there were some good suggestions regarding user feedback. Some felt it was unclear whether their submission had succeeded; another noted that showing multiple notifications at once was not possible. There was hardly any negative feedback on the interface itself, besides that the run button, which is grey, gave the impression of being disabled. We have therefore decided to change this button's colour.

5.2. Recommendations for the BDP course
From the results of the user test we have several recommendations for the BDP course. First of all, we suggest that students refresh their JavaScript knowledge at home before they work with the system, in order to start implementing the MapReduce problems introduced in the course more quickly. Another recommendation is to explain MapReduce using code examples that follow the Hadoop 2.6.3 API, since our platform also follows this API. During the user test we received quite a few questions caused by this API differing from the code examples we showed on the slides. Keeping the slides in line with this API would therefore make it even easier to start working with Trifle.

5.3. Future work
A nice addition to the system would be a custom syntax validator that does not miss edge cases like those described in subsection 4.2.2. This would allow more detailed error reports to be given to the users. Another next step is to look at possible improvements for other topics of the Big Data Processing course, for instance incorporating Streaming Algorithms, another part of the course, into Trifle. Another nice feature would be to show students that increasing the cluster size also comes with a speedup, instead of just separated output. The current simulation would have to be modified to show this speedup, but it would be a nice addition to the platform. Currently users log in anonymously, but a possible extension to the platform is to have users create an account on Trifle and implement actual lab exercises in Trifle instead of just toy examples during the lectures.

A Defined custom fields

Below we have included an overview of the relevant information regarding the custom fields defined in Wordpress.

The text area fields will be used during the exercise; their default values can be changed to provide help or a different kind of exercise to the student.

• Week number (field name: week_no, field type: Select, default: 'Week 1', required). Provides the choices 'Week 1' to 'Week 10'.
• Amount of mappers (field name: mapper_no, field type: Number, default: 1, required). Cannot be smaller than 0.
• Amount of reducers (field name: reducer_no, field type: Number, default: 1, required). Cannot be smaller than 0.
• Input (field name: input, field type: File, required). Should be a .zip file containing the .txt files which will be used as input for the exercise.
• Description (field name: description, field type: Wysiwyg Editor, optional).
• Show editors (field name: show_editors, field type: Checkbox, default: all options checked, optional). Unchecking one of the editor checkboxes will hide that editor from view on the exercise page, but the code provided in the matching editor's text area is still used.
• Mapper.setup (field name: mapper_setup, field type: Text Area, default: function setup(context){}, required).
• Mapper.map (field name: mapper_mapper, field type: Text Area, default: function map(key, value, context){}, required).
• Mapper.combine (field name: mapper_combiner, field type: Text Area, required). Defaults to the elementary combiner (outputs each key-value pair again).
• Partitioner.getPartition (field name: partitioner_partitioner, field type: Text Area, required). Defaults to a partitioner that splits keys evenly over the reducers (the hashCode-based getPartition function shown in section C.1.2).
• Reducer.setup (field name: reducer_setup, field type: Text Area, default: function setup(context){}, required).
• Reducer.reduce (field name: reducer_reducer, field type: Text Area, default: function reduce(key, values, context){}, required).
• Reducer.cleanup (field name: reducer_cleanup, field type: Text Area, default: function cleanup(context){}, required).

Table A.1: An overview of the custom fields defined in Wordpress

B Use cases code examples

This appendix provides code examples for our framework that solve the use cases defined in section 3.2.

B.1. Use case 1

function setup(context){
}

function map(key, value, context){
    const words = value.replace(/[^a-zA-Z ]/g, '').toLowerCase().split(' ');
    for(let word in words){
        context.write(key, 1);
    }
    context.getCounter('example1').increment(1);
}

function combine(key, values, context){
    while(values.hasNext()){
        context.write(key, values.next());
    }
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
}

function reduce(key, values, context){
    let count = 0;
    while(values.hasNext()){
        count += values.next();
    }
    context.write(key, count);
}

function cleanup(context){
}

B.2. Use case 2

function setup(context){
}

function map(key, value, context){
    const words = value.replace(/[^a-zA-Z ]/g, '').toLowerCase().split(' ');
    for(let word of words){
        context.write(word, 1);
    }
}

function combine(key, values, context){
    var emitValue = 0;
    while(values.hasNext()){
        emitValue += values.next();
    }
    context.write(key, emitValue);
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
}

function reduce(key, values, context){
    let count = 0;
    while(values.hasNext()){
        count += values.next();
    }
    context.write(key, count);
}

function cleanup(context){
}

B.3. Use case 3

function setup(context){
}

function map(key, value, context){
    const words = value.replace(/[^a-zA-Z0-9.: ]/g, '').split(' ');
    for(let word of words){
        let index = word.indexOf(":");
        if (index > -1){
            context.write(word.substring(0, index), key);
        }
    }
}

function combine(key, values, context){
    while(values.hasNext()){
        context.write(key, values.next());
    }
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
}

function reduce(key, values, context){
    let count = 0;
    while(values.hasNext()){
        values.next();
        count += 1;
    }
    context.write(key, count);
}

function cleanup(context){
}

B.4. Use case 4

function setup(context){
}

function map(key, value, context){
    const words = value.replace(/[^a-zA-Z \:0-9]/g, '').split(' ');
    for(let i in words){
        if(words[i].startsWith('category')){
            let category = words[i].split(':')[1];
            context.write(category, key);
        }
    }
}

function combine(key, values, context){
    while(values.hasNext()){
        context.write(key, values.next());
    }
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
}

function reduce(key, values, context){
    if (values.hasNext()){
        let first = values.next();
        if (values.hasNext()){
            context.write(first, key);
            while(values.hasNext()){
                context.write(values.next(), key);
            }
        }
    }
}

function cleanup(context){
}

B.5. Use case 5

function setup(context){
}

function map(key, value, context){
    const words = value.split(' ');
    let termFreq = {};
    for(let i in words){
        let word = words[i];
        if (!termFreq[word]){
            termFreq[word] = 1;
        } else {
            termFreq[word]++;
        }
    }

    for(let word in termFreq){
        context.write(word, termFreq[word]);
    }
}

function combine(key, values, context){
    var emitValue = 0;
    while(values.hasNext()){
        emitValue += values.next();
    }
    context.write(key, emitValue);
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
}

function reduce(key, values, context){
    let total = 0;
    while(values.hasNext()){
        total += values.next();
    }
    context.write(key, total);
}

function cleanup(context){
}

B.6. Use case 6

function setup(context){
}

function map(key, value, context){
    const words = value.split(' ');
    let termFreq = {};
    for(let i in words){
        let word = words[i];
        if (!termFreq[word]){
            termFreq[word] = 1;
        } else {
            termFreq[word]++;
        }
    }

    for(let word in termFreq){
        context.write(word, termFreq[word]);
    }
}

function combine(key, values, context){
    let emitValue = 0;
    let count = 0;
    while(values.hasNext()){
        count++;
        emitValue += values.next();
    }
    context.write(key, emitValue / count);
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
}

function reduce(key, values, context){
    let total = 0;
    let count = 0;
    while(values.hasNext()){
        count++;
        total += values.next();
    }
    context.write(key, total / count);
}

function cleanup(context){
}

B.7. Use case 7

function setup(context){
    this.animals = ['cat', 'dog', 'mouse'];
}

function map(key, value, context){
    const words = value.replace(/[^a-zA-Z ]/g, '').toLowerCase().split(' ');
    for(let i in words){
        let word = words[i];
        if (this.animals.indexOf(word) >= 0){
            context.write(word, 1);
        }
    }
}

function combine(key, values, context){
    var emitValue = 0;
    while(values.hasNext()){
        emitValue += values.next();
    }
    context.write(key, emitValue);
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
}

function reduce(key, values, context){
    let count = 0;
    while(values.hasNext()){
        count += values.next();
    }
    context.write(key, count);
}

function cleanup(context){
}

B.8. Use case 8

function setup(context){
}

function map(key, value, context){
    const words = value.replace(/[^a-zA-Z ]/g, '').toLowerCase().split(' ');
    for(let i in words){
        context.write(words[i], 1);
    }
}

function combine(key, values, context){
    var emitValue = 0;
    while(values.hasNext()){
        emitValue += values.next();
    }
    context.write(key, emitValue);
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
    this.tenMostOccurring = {};
}

function reduce(key, values, context){
    let count = 0;
    while(values.hasNext()){
        count += values.next();
    }
    this.tenMostOccurring[key] = count;

    let _self = this;
    let sortedWords = Object.keys(this.tenMostOccurring).sort(function(a, b){
        return _self.tenMostOccurring[b] - _self.tenMostOccurring[a];
    });

    if (Object.keys(this.tenMostOccurring).length > 10){
        delete this.tenMostOccurring[sortedWords.pop()];
    }
}

function cleanup(context){
    for(let word in this.tenMostOccurring){
        context.write(word, this.tenMostOccurring[word]);
    }
}

B.9. Use case 9

function setup(context){
}

function map(key, value, context){
    const words = value.replace(/[^a-zA-Z ]/g, '').split(' ');
    for(let i in words){
        context.write(words[i], 1);
    }
}

function combine(key, values, context){
    while(values.hasNext()){
        context.write(key, values.next());
    }
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
    this.characters = {};
}

function reduce(key, values, context){
    while(values.hasNext()){
        let value = values.next();
        let character = key.charAt(0);
        if (!this.characters[character]){
            this.characters[character] = value;
        } else {
            this.characters[character] += value;
        }
    }
}

function cleanup(context){
    for(let character in this.characters){
        context.write(character, this.characters[character]);
    }
}

B.10. Use case 10

function setup(context){
}

function map(key, value, context){
    const words = value.replace(/[^a-zA-Z ]/g, '').toLowerCase().split(' ');
    context.getCounter('total').increment(words.length);
}

function combine(key, values, context){
    while(values.hasNext()){
        context.write(key, values.next());
    }
}

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x,y){x = ((x<<5)-x)+y.charCodeAt(0); return x&x;}, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

function setup(context){
}

function reduce(key, values, context){
    let count = 0;
    while(values.hasNext()){
        count += values.next();
    }
    context.write(key, count);
}

function cleanup(context){
}

C Implementation

C.1. MapReduce Framework
In this chapter we discuss how our implementation works. The API we have written for our MapReduce framework is kept as similar as possible to the Hadoop (v2.6.3) [18] API so students don't need to get familiar with two completely different APIs. We start this section by describing the languages and technologies used to implement the MapReduce framework; after that we describe the components of the framework and the signatures of the functions that are part of the input. Finally we explain the data flow in the framework and point to some examples that show that our framework can simulate at least the required scenarios.

C.1.1. Languages
Our framework is written purely in JavaScript and uses WebWorkers to simulate a mapper or reducer machine. By using WebWorkers we keep the memory of all mappers and reducers separated. Mapper and reducer implementations are defined in separate files, and a WebWorker is run with only one of those, meaning a single simulated machine can only be a mapper or a reducer and never both. Communication between master and WebWorker happens through standard WebWorker messages [15].
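To illustrate, below is a minimal sketch of this master/worker pattern using the standard WebWorker API. The file name and message shapes are illustrative assumptions, not the actual Trifle protocol.

// Sketch: how a master could drive one simulated mapper machine.
// 'mapper-worker.js' (hypothetical) would hold an onmessage handler
// that runs the user-defined map function on incoming pairs.
const worker = new Worker('mapper-worker.js');

worker.onmessage = function(event){
    const msg = event.data;  // e.g. {type: 'result' | 'log' | 'error', payload: ...}
    if (msg.type === 'result'){
        // collect the mapper's key-value pairs here
    }
};

// Memory stays separated: only serialisable messages cross this boundary.
worker.postMessage({type: 'run', key: 'doc1.txt', value: 'The dog is in the house'});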

C.1.2. Components

Framework context
All user-defined functions except the partitioner come with a context parameter. This can be used to call framework functions. The API is as follows:

context.getCounter('exampleCounterName').increment(1);
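For intuition, a minimal sketch of what such a context object could look like is given below. This is an assumption for illustration, not the actual Trifle internals.

// Sketch (not the shipped implementation): a context exposing write and counters.
function makeContext(emit){
    const counters = {};
    return {
        // forward emitted pairs to the framework
        write: function(key, value){ emit(key, value); },
        // lazily create named counters, each with an increment method
        getCounter: function(name){
            if (!counters[name]){
                counters[name] = {value: 0, increment: function(n){ this.value += n; }};
            }
            return counters[name];
        }
    };
}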

Master
The master component accepts input, configuration and callback functions. The master initializes WebWorkers, configures them with the user-defined functions and distributes the workload. Once all reducers are done it returns the final result, but it also reports logs and errors during the processing itself. The data is returned by calling the specified callback functions.

Mapper
The mapper accepts three user-defined functions: setup, map and combine. First setup is called, thereafter map for each key-value pair and finally combine. The combine function has a 'values' parameter which is an iterator: once a value is read it cannot be read again. When the mapper is done, the result is returned to the master through a message. The mapper API is as follows:

function setup(context){
    // define variables that can be used later in map or combine
}

function map(key, value, context){
    // context.write(a_key, a_value) to emit pairs
}

function combine(key, values, context){
    // context.write(a_key, a_value) to emit pairs
}

Partitioner
The partitioner is called for each key-value pair emitted by the mappers to determine to which reducer the pair should be sent. The partitioner API is as follows:

function getPartition(key, value, numPartitions){
    // return a reducer id within the interval [0, numPartitions)
}

The standard Hadoop partitioner is the HashPartitioner [25], which distributes keys uniformly over the available reducers. We have implemented a similar partitioner to test whether our framework works as expected [26].

function getPartition(key, value, numPartitions){
    var hashCode = function(s){
        return s.split("").reduce(function(x, y){
            x = ((x << 5) - x) + y.charCodeAt(0);
            return x & x;
        }, 0);
    };
    return Math.abs(hashCode(key)) % numPartitions;
}

Reducer
The reducer also accepts three user-defined functions: setup, reduce and cleanup. The setup function gets called first, secondly the reduce function for each pair and finally the cleanup function. Just like in the mapper, 'values' in reduce is an iterator: once a value is read it cannot be read again. The reducer API is as follows:

function setup(context){
    // define variables that can be used later in reduce or cleanup
}

function reduce(key, values, context){
    // context.write(key, value) to emit pairs
}

function cleanup(context){
    // context.write(key, value) to emit pairs
}

C.1.3. Data flow
A quick overview of the data flow of the framework (a usage sketch follows the list):

• Define a configuration with input, user-defined partitioner/mapper/reducer code and the number of mappers/reducers to use.

• Create a MapReduce object with the configuration as argument

• The MapReduce framework will verify the configuration and set internal variables if the configuration is valid

• Mapper and reducer WebWorkers are created and configured with the user defined functions. Furthermore, an event listener is attached to each WebWorker in order to retrieve result, log, error and counter data.

• Define and set log, error, counter and result callback functions.

• Call run function on MapReduce object

• The framework will split the input of each document (default line split)

• Mappers are fed chunks of the input as key-value pairs, with the document id as key and the line data as value

• Mappers process the data and while doing so they will message logs, errors, counter increments and results back to the master. The master will in turn call its callback functions.

• The framework creates a partitioner WebWorker and provides this worker with the number of reducers, key value pairs from the mappers and the user defined getPartition function.

• The partitioner assigns a reducer id to the key value pairs that mappers created and returns this to the MapReduce object.

• The key value pairs now get assigned to their respective reducers based on the id returned from the partitioner.

• Reducers process the data and while doing so they will message logs, errors, counter increments and results back to the master. The master will in turn call its callback functions.
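Putting the steps above together, a usage sketch is shown below. The MapReduce object is named in the list above, but the exact configuration keys and callback names are assumptions for illustration, not the verbatim Trifle API.

// Sketch of driving the framework end to end (key and callback names are illustrative).
const config = {
    input: [{name: 'doc1.txt', content: 'the dog is in the house'}],
    numMappers: 2,
    numReducers: 2,
    // user-defined functions, supplied as source text from the editors
    mapper: {
        setup: 'function setup(context){}',
        map: 'function map(key, value, context){ context.write(key, 1); }',
        combine: 'function combine(key, values, context){ while(values.hasNext()){ context.write(key, values.next()); } }'
    },
    partitioner: 'function getPartition(key, value, numPartitions){ return 0; }',
    reducer: {
        setup: 'function setup(context){}',
        reduce: 'function reduce(key, values, context){ let c = 0; while(values.hasNext()){ c += values.next(); } context.write(key, c); }',
        cleanup: 'function cleanup(context){}'
    }
};

const job = new MapReduce(config);            // validates the configuration, spawns WebWorkers
job.onLog(function(line){ console.log(line); });
job.onError(function(err){ console.error(err); });
job.onCounter(function(name, value){ /* update a counter display */ });
job.onResult(function(outputPerReducer){ /* one output list per reducer */ });
job.run();                                    // split input, map, partition, reduce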

C.1.4. Simulating the required scenarios
In section 3.2 we discussed certain use cases that should certainly work using our solution. We have written code for our framework that simulates each of these use cases; this code is tested and gives the expected results. The implementations can be found in Appendix B.

C.2. Student testing framework
Grading student implementations is a tedious task, and our client requested that we implement automatic validation of the output of jobs against a predefined expected result. To this end we have defined a test configuration in JSON for each assignment. This JSON consists of the name of the test, the input, the number of mappers/reducers, and the expected counters/result. An example of such a JSON is as follows:

{
    ...
    "tests": [
        {
            "name": "One mapper, one reducer, single document test",
            "input": [
                {
                    "name": "text.txt",
                    "content": "The dog is in the house"
                }
            ],
            "numMappers": 1,
            "numReducers": 1,
            "counters": {},
            "output": [
                {"text.txt": 6}
            ]
        }
    ]
}
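To sketch how such a configuration could be checked, a minimal validation function is shown below. This is an illustration under the assumption that job output and counters arrive in the same shape as the test configuration; it is not the shipped test runner.

// Sketch: compare a job's output and counters against one test entry.
function validateTest(test, actualOutput, actualCounters){
    // naive structural comparison; assumes matching key order, which is a simplification
    const outputOk = JSON.stringify(actualOutput) === JSON.stringify(test.output);
    const countersOk = Object.keys(test.counters).every(function(name){
        return actualCounters[name] === test.counters[name];
    });
    return {outputOk: outputOk, countersOk: countersOk};
}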

C.3. Front-end
For the front-end of the website we have decided to use Bootstrap, Ace, jQuery, custom CSS and of course HTML. In this section we elaborate on these frameworks and languages. The design of the website will also be introduced and discussed shortly.

C.3.1. Design
The website we have designed ended up having four types of pages: the homepage, exercise pages, an overview page that shows the available exercise pages, and the submission page where the lecturer can see the submitted solutions. We will shortly introduce each page type. The design explicitly does not take mobile devices into account, since this is a coding environment and we do not think students will be writing code on their phones during a lecture.

The homepage
We wanted to keep the homepage simple. Since each visitor gets assigned a random identifier, we added the easy-to-use Google ReCaptcha [27]: we did not want to risk our system being flooded with random IDs in case a bot spams the login form.

Figure C.1: The design of the homepage

The overviewpage
The overviewpage shows an accordion [28] in which each week can expand to show the available exercises for that week, each linking to the corresponding exercisepage.

Figure C.2: The design of the overviewpage

The exercisepage
In contrast to the simple home and overview pages, the exercise page has a lot of content. At the top of the page is the exercise description (which can be folded in to save space on the page). On the left side the MapReduce code can be written; we have split the classes mapper [19], partitioner [20] and reducer [21] as seen in Hadoop over three tabs with code editors, in which the methods that need to be implemented can be written. On the right side a similar tab structure is used to display the input and output of the job. Both the input and output are truncated on the page to a maximum of about 10 lines; however, a download is provided to see the full in- or output.

The lecturer page
When an admin panel session is active, the submission button is present on the overview page and the lecturer can click it to view the submissions. When one or more students have submitted their code, the submissions appear on the submissions page with an id and test score. The page groups submissions by assignment and also tells you how many submissions there are and how many active participants (at least one heartbeat in the last 60 seconds) there are. Furthermore, when the lecturer clicks on a submission, a new tab is opened with the exercise page and the submission of the student filled in. This way the lecturer sees practically exactly what the student sees and can interactively help with problems students encounter.

C.3.2. Ace
We have decided to use the Ace code editor plugin [29] for the students to write their code in. Since we wanted students to be able to write code in the browser, we wanted to give them a smooth coding experience. Ace is highly configurable, meaning you can include only what you need. For us this meant including the JavaScript mode, the Textmate theme (to match the design discussed earlier) and the so-called language tools, which allowed us to add custom autocompletion to the editors. We chose Ace over other options like ICEcoder and CodeMirror because it is free, simple to implement and customisable (only include what you need), and packaged versions of Ace can be loaded for free over Content Delivery Networks (CDNs) like JsDelivr and cdnjs.

C.3.3. Bootstrap
To build our front-end we did not want to start completely from scratch; during design we explicitly kept in mind a clear row/column structure so it would be very easy to build using the Bootstrap framework. Bootstrap also has built-in support for tabs and accordions, which made building the overview- and exercisepages a lot easier.

Figure C.3: The design of the exercisepage

Figure C.4: The submissions button being displayed

Figure C.5: The lecturer page with one submission

Figure C.6: The ace editor with the textmate theme

C.3.4. jQuery
Bootstrap builds upon the jQuery framework, therefore we decided to use it in our webpages' JavaScript as well. jQuery makes DOM selections very easy, and there was no need for any other framework as all the functionality we needed was available in either jQuery or native JavaScript.

C.4. Back-end
The backend of our website is written in PHP in combination with WordPress. The hosting machine will most likely be CentOS-based, as this is typically the setup of a server that our client receives from the TU Delft. As defined in our system requirements in Table 3.1, our system has to be platform independent. Therefore we have chosen a WordPress backend, since this runs on both Windows and Unix-based operating systems. In this section we elaborate on how we have set up the backend of the website. As described in Subsubsection 3.3.1 we have chosen to use Wordpress as our CMS. Creating a custom Wordpress website requires a few things: a theme, optionally one or more plugins for extra functionality, and some custom database tables in addition to the Wordpress tables.

C.4.1. Wordpress theme 'Trifle'
A Wordpress theme purely defines the 'skin' of the website, which for us means our custom design. We transformed the HTML we had built into templates, with each pagetype getting its own template. For a description of all pagetypes see Section C.3. In the backend a template can then be chosen to define what type of page the administrator is creating (i.e. an exercisepage or perhaps a new overview page). The theme also includes all CSS files, images and the JavaScript page controller to style and operate the pages correctly. It additionally includes the JavaScript files of the MapReduce framework, as this is technically part of the front-end and there was no cleaner way of incorporating the framework into the website. The theme also includes a special functions file, in which we have defined a function that is executed on the 'publish_page' hook. This function extracts the uploaded input zip, transforms it into a JSON format in combination with the other fields filled in for the exercise, and stores it in the database. This function simultaneously extracts the tests zip and incorporates it into the assignment JSON before storing it in the database; the format of this zip is described in D.2.2. Where this is stored will be elaborated on in Subsection C.4.3.

C.4.2. Wordpress plugins
In our Wordpress installation we have included two plugins: the popular Advanced Custom Fields (ACF) plugin [30] and our own newly created plugin 'Custom API', which builds upon the Wordpress REST API. We will elaborate on each plugin.

Advanced Custom Fields
The ACF plugin allows its users to define conditional logic to display certain input fields depending on which template has been selected. In this system this mainly affects the exercisepages, as there is nothing to configure regarding the home, overview and submissions pages. We have used ACF to define all the fields we needed for the exercise pages (e.g. the exercise description or the input of the job). For an overview of the fields we have defined, see Appendix A.

Custom API plugin
We have also implemented our own REST API in Wordpress; this API can be used for example to get exercise data or to submit an assignment. The defined endpoints are:

GET /wp-json/api/v1/register
This endpoint takes a querystring parameter 'g-recaptcha-response' as input and first validates this with Google. If it is validated correctly (i.e. the user is not a bot), a cookie is generated in the database and returned to the user. The cookie is required to get access to the rest of the endpoints.

GET /wp-json/api/v1/submissions/
This endpoint fetches a submission with a given identifier from the database. It first checks the authentication cookie and will return an HTTP status 401 Unauthorized in case the cookie is invalid.

POST /wp-json/api/v1/submissions/create
This endpoint takes a 'submission' parameter, which should be properly formatted JSON containing the user's functions. The JSON should be structured as seen below. It also checks the authentication cookie and will return an HTTP status 401 Unauthorized in case the cookie is invalid.

{
    mapper_setup: ,
    mapper_mapper: ,
    mapper_combine: ,
    partitioner_partitioner: ,
    reducer_setup: ,
    reducer_reducer: ,
    reducer_cleanup:
}

GET /wp-json/api/v1/assignment/
This endpoint fetches an assignment with a given identifier from the database. As described in Subsection C.4.1 the assignments are stored in a JSON format in the database. It also checks the authentication cookie and will return an HTTP status 401 Unauthorized in case the cookie is invalid.
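As an illustration of how the front-end could call these endpoints from the browser, a sketch using the standard fetch API is shown below. The endpoint paths are the ones listed above; the identifier format and response shapes are assumptions.

// Sketch: fetching an assignment (the '?id=42' identifier format is hypothetical).
fetch('/wp-json/api/v1/assignment/?id=42', {credentials: 'include'})  // send the auth cookie
    .then(function(response){
        if (response.status === 401){ throw new Error('invalid or missing cookie'); }
        return response.json();
    })
    .then(function(assignment){
        // render the exercise page from the assignment JSON
    });

// Sketch: submitting code as the 'submission' parameter.
const submission = {
    mapper_setup: 'function setup(context){}',
    mapper_mapper: 'function map(key, value, context){ context.write(key, 1); }'
    // ... remaining fields as in the JSON structure above
};
fetch('/wp-json/api/v1/submissions/create', {
    method: 'POST',
    credentials: 'include',
    headers: {'Content-Type': 'application/x-www-form-urlencoded'},
    body: 'submission=' + encodeURIComponent(JSON.stringify(submission))
});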

C.4.3. Custom Database Tables
• Table wp_assignments for storing assignment data by id of wp_posts: id: INT(11), assignment: LONGTEXT

• wp_custom_users for keeping track of cookies and active participants: id: INT(11), cookie: VARCHAR(45), lastheartbeat: INT(11)
• wp_submissions for storing submissions of users by userid of table wp_custom_users and assignmentid of table wp_assignments: userid: INT(11), assignmentid: INT(11), submission: LONGTEXT, timestamp: INT(11)

C.4.4. Anonymous login
The anonymous login works relatively simply with the Google ReCaptcha implemented on the homepage. The user checks the "I am not a robot" box, perhaps has to solve one of the ReCaptcha puzzles and then presses 'login'. A GET request is done to the login form backend, where the g-recaptcha-response parameter is passed on to the registration endpoint, which in turn validates the response and returns the cookie. The form backend then sets this cookie and redirects the user to the overview page. From this moment on the user has a valid cookie and can use it to get assignments and submit solutions.
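A simplified sketch of this exchange is shown below. It is an assumption for illustration: it calls the register endpoint directly from the browser and assumes the cookie is set on the response, whereas in the deployed system the login form backend mediates this step; the redirect target is also hypothetical.

// Sketch: exchanging a ReCaptcha response for an anonymous session cookie.
const recaptchaResponse = grecaptcha.getResponse();  // provided by Google's reCAPTCHA widget
fetch('/wp-json/api/v1/register?g-recaptcha-response=' + encodeURIComponent(recaptchaResponse), {
    credentials: 'include'  // let the browser store the returned cookie
}).then(function(response){
    if (response.ok){
        window.location.href = '/overview';  // hypothetical overview page URL
    }
});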

C.5. Testing
In order to guarantee that our framework works as expected, we have employed two types of testing: QUnit [31] and Selenium [32]. QUnit is used to test that our framework runs MapReduce jobs with the correct output, whereas Selenium is used to test whether our site works as expected. In other words, we have separated website and framework testing. With QUnit we have individually tested the framework components: mapper, reducer and MapReduce as a whole. For each component we have tested individually whether running a job leads to the expected job output, counters and errors. Furthermore, for MapReduce as a whole we have tested whether jobs with a single mapper/reducer work, whether jobs with multiple mappers/reducers work and, last but not least, whether all our 10 use cases run and give the expected result. With Selenium we have tested whether the navigation and functionality of the website work as expected. For example, we have tested whether clicking on an assignment actually redirects you to the exercise page. Other tests include, for example, whether you can fill in the MapReduce code, press run and get job output.
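To give an impression, a sketch of such a QUnit test is shown below. It reuses the hypothetical configuration and callback names from the sketch in section C.1.3; the asserted output shape is an assumption modelled on the test configuration in section C.2.

// Sketch of an asynchronous QUnit test for a full MapReduce job.
QUnit.test('word count per document: one mapper, one reducer', function(assert){
    const done = assert.async();  // the job completes asynchronously
    const job = new MapReduce({
        input: [{name: 'text.txt', content: 'The dog is in the house'}],
        numMappers: 1,
        numReducers: 1
        // ... user-defined functions as in section C.1.2
    });
    job.onResult(function(outputPerReducer){
        // expected: the single reducer emitted ('text.txt', 6), as in the C.2 example
        assert.deepEqual(outputPerReducer[0], [{'text.txt': 6}]);
        done();
    });
    job.run();
});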

D Using the system

D.1. Installing the website
In this quick step-by-step guide we give a general outline of how Trifle can be set up by a developer. Steps may differ depending on the operating system that is used; the steps provided here were performed on a CentOS machine.

• Install Apache

• Install MySQL

• Clone Trifle repository in /var/www/html folder

• Load the database using the dump file

• Visit /wp-admin to log in to the admin panel using credentials: user: admin, password: OeDvfACs2gm (change this after the first login)

• Visit / to log in to the student/lecturer view (the lecturer view can only be used when you have an active admin panel session)

D.2. Using Wordpress
In this section we discuss how to use Wordpress in combination with our custom theme 'Trifle'. Trifle uses the two plugins described in Subsection C.4.2, which do not require the user to configure anything.

D.2.1. Page hierarchy
Before we go in depth on how to set up the pages, we discuss the hierarchy of the pages. This is actually very simple: in Wordpress the homepage, overview page and lecturer page are defined on the same level (root level). Each exercisepage then has to be defined with the overview page as its parent. To visualise the hierarchy we have included a small diagram below.

D.2.2. Setting up the first pages
Now that the global hierarchy is clear, we will explain how to set up the pages in the website. The setup consists of the following steps:

• Creating the login page (and setting it as the homepage)

• Creating the overview page

• Creating the lecturer page

• Creating an exercise page


Figure D.1: A graphic representation of the hierarchy of page types within the website

Now we will continue to go in depth on each of these steps. Before starting, please go to the URL of your instance of the website (which you have defined during the installation) and append /wp-admin to this URL. Here you can log into the Wordpress administrator area with your admin account. This page will be the starting point for the rest of the guide.

Creating the login page
Once logged into the wp-admin area you can create a new page by hovering over the 'pages' menu item on the left and selecting 'new page'. There really is nothing to configure about the login page; all you have to do is set the template to 'LoginPage' on the bottom right of the page.

Figure D.2: Creating a new page in Wordpress
Figure D.3: Setting a template in Wordpress

Creating the overview page
The overview page is created in a similar fashion, the only difference being that the template to select is the OverviewPage template.

Creating the Lecturer page
The lecturer page is created just like the overview and login pages and needs no configuration, except that the template has to be set to 'LecturerPage'. The only difference from the other pages is that the parent should be the previously created overview page.

Creating an exercise page
Creating an exercise page requires some preparation before you can create it in Wordpress. Start off by collecting the input in separate .txt files and zipping these documents into a single zip archive. Make sure the file names are somewhat descriptive, as these will be shown on the exercisepage and used as docIds in the MapReduce framework, just like in Hadoop. Once the zip is ready, create a new page, select the 'ExercisePage' template and make sure to also set the parent to the earlier created overview page. Several new fields will now appear which need to be filled in.

Figure D.4: Selecting the overview page as parent of the exercise page

In the 'General' tab the week number and the amounts of mappers and reducers can be filled in, and the input zip can be uploaded. A description of the exercise can be provided in the wysiwyg editor on the page. In the 'Editor settings' tab there are options to hide editors from view. Pay close attention: if an editor is hidden, the default code set in the matching editor's textarea will be executed when a student works on the assignment. The default code for each editor can be modified; these fields default to an empty function, except for the combine and getPartition functions. The combine function defaults to the elementary combiner (which simply emits each key/value pair again); this function can be found in Appendix A. Finally, the getPartition function defaults to the HashPartitioner as described in section C.1.2. In the 'Student tests' tab, tests can be added as a zip file. The zip should have the following properties: on root level in the zip there should be a folder for each test. The name of this folder will be used as the name of the test. In each folder there should be 3 items: a .txt file named 'config', and two folders, one named 'input' and one named 'output'.

• test (Folder) - Mandatory
– 'config.txt' (File) - Mandatory. This text file should contain only two numbers: the amount of mappers on the first line and the amount of reducers on the second line. An example with 2 mappers and 3 reducers would mean the file only contains:
2
3
– input (Folder) - Mandatory. This folder contains all files that will be used as input for the test.
– output (Folder) - Mandatory. This folder contains the files that contain the output exactly as seen on the website. An important aspect is that each reducer outputs its own file, so for each reducer there should be a file named after the index of that reducer. In the example with 3 reducers there would be 3 files in this folder: '0.txt', '1.txt' and '2.txt'.
⋄ '0.txt' - Mandatory. This file (and likewise the files for higher-index reducers, as explained earlier) contains the output of the reducer with this index. The file contains key-value pairs on each line, key and value separated by a comma.
⋄ 'counters.txt' - Optional. Finally, there is one optional file with the same structure as the other files in this folder. It should only be present if counters have to be tested as well. Each line should contain a counter name (ask the students to use this name in their exercises) and the value this counter should return. Like the key-value pairs, this should be comma separated. If for example a counter called 'total' should have the value 420, then the file should contain:
total,420

When an assignment page is submitted, the zip gets parsed and the tests are added to the assignment data in the database. The student test data is downloaded once a student opens an assignment page.

Figure D.5: Adding tests to an assignment

When students run or submit code for an assignment that has student tests, the student test framework runs these tests and presents the result to the student. When submit is pressed, the results for the tests are uploaded together with the student code to be viewed by the lecturer. The student code is tested for job output and counter output, and students can see test results for both separately.

Figure D.6: Viewing test score, student view

Figure D.7: Viewing test score, lecturer view

Finally you can also publish this page; it will then show up on the overview page under the selected week. From this point on, more exercise pages can be added to one's liking.

D.3. Using the website
When a student visits Trifle for the first time, he or she is presented with a login page on which only a captcha has to be filled in to sign in. Once that is done, a cookie is set that remains valid for a maximum of 24 hours of inactivity. This way, students can use the site anonymously.

After signing in, students are redirected to the overview page, on which two things can be done: signing out and selecting an assignment. Assignments are grouped by week number to keep the overview organised. When a student clicks on an assignment, the student is redirected to the exercise page of the assignment.

Figure D.8: Exercise page

Our user test has shown that the navigation is easy and Trifle is intuitive to use. The assignment description, test score and tooltips help students understand what they are asked to do and complete an assignment even faster. Tooltips are shown when you hover over the '(i)' icon. Code for different parts of the system, input, logs and output can be viewed by clicking on its designated tab. Jobs can be run by clicking the run button, and code (together with the test score) can be submitted by clicking submit. Input and output data can be downloaded by clicking the download button, for students who want to see the data in a text editor of their preference. Implementations must be written in JavaScript and there are only four non-standard JavaScript methods that have to be used (a combined usage sketch follows the list).

• context.getCounter('test').increment(1); in order to increment the counter 'test' by one.
• context.write('abc', 5) in order to emit the key-value pair 'abc',5.

• values.hasNext() in order to check whether an iterator has any more values.
• values.next() to retrieve the next value of an iterator.
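As a sketch, the four methods combined in a single reduce function (the counter name and aggregation are illustrative):

function reduce(key, values, context){
    let sum = 0;
    while(values.hasNext()){  // the iterator yields each value exactly once
        sum += values.next();
    }
    context.getCounter('test').increment(1);  // count the number of keys reduced
    context.write(key, sum);                  // emit the aggregated key-value pair
}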

E Project description

Hadoop is currently taught in the second year of the Computer Science Bachelor at TU Delft. Hadoop is one of the most popular frameworks for distributed processing on large clusters. Due to our setting (a university course) we have no access to a "real" cluster and instead all lab assignments of the course are conducted through a virtual machine containing a single-machine Hadoop setup. This setup has one main issue: some important aspects of Hadoop (e.g. no sharing of memory between Mappers and Reducers, or the non-default partitioning of keys) cannot be taught in the lab. While the students learn - in theory - how those issues influence programming in Hadoop, they often cannot apply these concepts as they do not experience these issues first-hand. The project proposed here should SIMULATE the workings of Hadoop on a cluster. The goal is not to develop a fully-fledged Hadoop simulation, but to focus on those specific aspects of Hadoop that cannot be taught in the current VM setup. In addition, the solution should enable the lecturer to use the tool during in-class exercises for a more interactive teaching than is currently possible.


F Raw benchmark results


G User test questionnaire


H Infosheet

Title: Hadoop in the browser
Organisation: Web Information Systems
Date of presentation: 24 June 2016 at 10:00

Description
The lecturer of the Big Data Processing (BDP) course approached us with the request to implement a tool that helps students understand the principles of MapReduce. At the moment some principles cannot be taught in currently used environments like the Cloudera Virtual Machine (VM). For practical assignments a Cloudera VM is used to start Hadoop jobs. This VM is slow, cannot be used interactively and cannot be used to teach all MapReduce principles, like the true memory separation in a real cluster. In order to overcome these and other shortcomings we have looked at several options and decided to build a web-based simulation of MapReduce. This solution has several advantages over other options such as Cloudera. It runs on all popular operating systems without installation, enables the lecturer to give interactive lectures by getting real-time insights into students' approaches to problems, and allows teaching of some MapReduce principles that cannot be taught with currently available tools.

Members of the Project Team

Hasan Bilen
Hasan is a movie lover and a passionate Computer Science student building real-time back-end systems at a Dutch IT company. Hasan mainly focussed on the database, API and student testing.

Marc Zwart
Marc is a Computer Science bachelor student at the TU Delft, working as a web developer at Tam Tam, a Dutch digital agency, alongside his studies. Marc was responsible for the user interface and setting up the Wordpress website.

Both team members have contributed to the MapReduce simulation framework, unit testing, preparing the report and final project presentation.

Coach: Dr. Thomas Abeel, TU Delft
Client: Dr. Claudia Hauff, TU Delft
Contact person: Dr. Thomas Abeel, [email protected]
Contact person: Dr. Claudia Hauff, [email protected]
Contact person: Marc Zwart, [email protected]
Contact person: Hasan Bilen, hasanbilen38@.com

The final report for this project can be found at: http://repository.tudelft.nl


I SIG Feedback

In this appendix we briefly discuss the feedback received from SIG and what we did with it.

I.1. SIG Feedback on code quality
This is the feedback we received from SIG (translated from Dutch); it was not written by us. The system's code scores almost 3.4 stars on our maintainability model, which means the code is of average maintainability. The highest score was not achieved due to a lower score for Unit Size. For Unit Size we look at the percentage of code that is longer than average. Splitting this kind of method into smaller pieces makes each part easier to understand, easier to test and therefore easier to maintain. Within the longer methods in this system, such as the '$(document).ready' method, separate pieces of functionality can be found which could be refactored into separate methods. Comment lines such as '//make sure ace is loaded' and '//surpress warning in console' are a good indication that an autonomous piece of functionality can be identified. We recommend critically reviewing the longer methods in this system and splitting them up where possible. For Duplication we look at the percentage of code that is redundant, i.e. code that occurs multiple times in the system and could in principle be removed. From a maintainability point of view it is desirable to have a low percentage of redundancy, because changes to these pieces of code usually have to be made in multiple places. In this system duplication can be found in HTML files, for example between overviewpage.html, lines 1-19, and lecturerpage.html, lines 1-19. This kind of duplication can be detected through the use of partials. In this particular example you could create a header partial that is called by the two files. Overall the code scores average; hopefully you will manage to improve this level during the rest of the development phase. Finally, we note that no (unit) test code was found in the uploaded code. It is strongly recommended to define automated tests for at least the most important parts of the functionality, to ensure that future changes do not cause unwanted behaviour.

I.2. Taking advantage of the SIG Feedback
Some of the feedback we received from SIG was constructive and some was not. As the feedback was originally written in Dutch, we include a short list in English that summarizes it:
• Unit size was too large (especially the document.ready() method)
• There is redundancy in the HTML files, like the header which is present on each page
• No unit testing was found
To start off, we had a critical look at our unit size and drastically decreased it by splitting functionality wherever possible. This made our code a lot easier to read. The remark about redundancy in the HTML files was a bit non-constructive, as the HTML files are stored in the repository purely as a demo.

This demo is also a good backup in case something happens to the Wordpress templates. The redundancy SIG spoke of is not present in these templates, so we left the demo HTML as it is. The fact that no unit tests were found is quite flabbergasting, since in the root folder of the repository we sent to them there was a folder 'tests' which included two test suites: a QUnit suite which tested the framework and a Selenium suite which tested the interface's behaviour. So there was little to nothing to improve on this part. We did update the repository's README.md to instruct the reader to set up the Wordpress website and go to the '[domainname]/tests' URL to run the QUnit test suite, and added instructions on how to run the Selenium test suite.

Bibliography

[1] Big Data Processing study guide, http://www.studiegids.tudelft.nl/a101_displayCourse.do?course_id=40230, accessed: 2016-05-01.
[2] TI2736-B: Big Data Processing, http://www.st.ewi.tudelft.nl/~hauff/TI2736-B.html, accessed: 2016-05-01.
[3] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Commun. ACM 51, 107 (2008).
[4] Apache Hadoop, https://hadoop.apache.org/, accessed: 2016-05-25.
[5] Cloudera VM, https://www.cloudera.com/documentation/enterprise/5-3-x/topics/cloudera_quickstart_vm.html, accessed: 2016-05-25.
[6] S. Grissom and M. J. Van Gorp, A practical approach to integrating active and collaborative learning into the introductory computer science curriculum, in Journal of Computing Sciences in Colleges, Vol. 16 (Consortium for Computing Sciences in Colleges, 2000) pp. 95–100.
[7] J. J. McConnell, Active learning and its use in computer science, ACM SIGCSE Bulletin 28, 52 (1996).
[8] T. Briggs, Techniques for active learning in CS courses, Journal of Computing Sciences in Colleges 21, 156 (2005).
[9] L. Thomas, M. Ratcliffe, J. Woodbury, and E. Jarman, Learning styles and performance in the introductory programming sequence, ACM SIGCSE Bulletin 34, 33 (2002).
[10] L. C. Kats, R. G. Vogelij, K. T. Kalleberg, and E. Visser, Software Development Environments on the Web: A Research Agenda, in Proceedings of the ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! 2012 (ACM, New York, NY, USA, 2012) pp. 99–116.
[11] R. Reis, Pierre Volschenk-TP Msg.# 1146 Designing and Delivering Effective Lectures, (2012).
[12] N. McIntosh, Delivering effective lectures, JHPIEGO Strategy Papers (1996).
[13] WebLab: Learning Management System, https://weblab.tudelft.nl/, accessed: 2016-05-26.
[14] P. Garrity, T. Yates, R. Brown, and E. Shoop, WebMapReduce: An Accessible and Adaptable Tool for Teaching Map-reduce Computing, in Proceedings of the 42nd ACM Technical Symposium on Computer Science Education, SIGCSE '11 (ACM, New York, NY, USA, 2011) pp. 183–188.
[15] WebWorker API and browser compatibility, https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers, accessed: 2016-06-10.
[16] JavaScript browser compatibility, https://kangax.github.io/compat-table/es5/, accessed: 2016-05-21.
[17] WordPress CMS, http://wordpress.org/, accessed: 2016-06-16.
[18] Overview (Apache Hadoop Main 2.6.3 API), https://hadoop.apache.org/docs/r2.6.3/api/, accessed: 2016-05-30.
[19] Mapper (Hadoop 2.4.1 API), https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/Mapper.html, accessed: 2016-05-21.
[20] Partitioner (Hadoop 2.4.1 API), https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/Partitioner.html, accessed: 2016-05-21.
[21] Reducer (Hadoop 2.4.1 API), https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/Reducer.html, accessed: 2016-05-21.
[22] L. Li and W. Chou, Design and describe REST API without violating REST: A Petri net based approach, in Web Services (ICWS), 2011 IEEE International Conference on (IEEE, 2011) pp. 508–515.
[23] D. Crockford, The application/json media type for JavaScript Object Notation (JSON), (2006).
[24] J. Sutherland, N. Harrison, and J. Riddle, Teams That Finish Early Accelerate Faster: A Pattern Language for High Performing Scrum Teams, 2014 47th Hawaii International Conference on System Sciences 0, 4722 (2014).
[25] HashPartitioner (Hadoop 2.6.3 API), https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html, accessed: 2016-05-22.
[26] JavaScript hashCode implementation, http://werxltd.com/wp/2010/05/13/javascript-implementation-of-javas-string-hashcode-method/, accessed: 2016-05-22.
[27] reCAPTCHA: Easy on Humans, Hard on Bots, https://www.google.com/recaptcha/intro/index.html, accessed: 2016-05-15.
[28] Accordion (GUI), https://en.wikipedia.org/wiki/Accordion_(GUI), accessed: 2016-05-21.
[29] Ace - The high performance code editor for the web, https://ace.c9.io/, accessed: 2016-05-10.
[30] ACF | Advanced Custom Fields Plugin for WordPress, https://www.advancedcustomfields.com/, accessed: 2016-04-29.
[31] QUnit: A JavaScript Unit Testing framework, https://qunitjs.com/, accessed: 2016-06-16.
[32] Selenium - Web Browser Automation, http://www.seleniumhq.org/, accessed: 2016-06-16.