Big Data Software
Spring 2017
Bloomington, Indiana

Editor: Gregor von Laszewski
Department of Intelligent Systems Engineering
Indiana University
[email protected]

Contents

1 S17-ER-1001 Globus Toolkit
    Saber Sheybani
2 S17-IO-3000 CDAP Cask Data Application Platform
    Avadhoot Agasti
3 S17-IO-3004 MySQL
    Cory Coulter
4 S17-IO-3005 RabbitMQ
    Abhishek Gupta
5 S17-IO-3008 Docker Container
    Vishwanath Kodre
6 S17-IO-3010 Couchbase Server: A Usable Overview
    Matthew Lawson
7 S17-IO-3011 Apache Airavata
    Scott McClary
8 S17-IO-3012 Google Bigtable
    Mark McCombe
9 S17-IO-3013 Apache Beam (Google Cloud Dataflow)
    Leonard Mwangi
10 S17-IO-3014 Xen: A bare metal hypervisor
    Piyush Rai
11 S17-IO-3015 Apache Lucene
    Roy Choudhury, Sabyasachi
12 S17-IO-3016 CoreOS
    Ribka Rufael
13 S17-IO-3017 MongoDB
    Nandita Sathe
14 S17-IO-3019 vCloud and vSphere
    Michael Smith
15 S17-IO-3020 Google Fusion Table
    Milind Suryawanshi
16 S17-IO-3021 CUBRID RDBMS
    Abhijit Thakre
17 S17-IO-3022 Netty vs ZeroMQ in Realtime Analytics
    Sunanda Unni
18 S17-IO-3023 H-Store
    Karthick Venkatesan
19 S17-IO-3024 Not Submitted
    Ashok Vuppada
20 S17-IR-2001 HTCondor: Distributed Workflow Management System
    Niteesh Kumar Akurati
21 S17-IR-2002 Google Dremel: SQL-Like Query for Big Data
    Jimmy Ardiansyah
22 S17-IR-2004 Apache Samza
    Ajit Balaga, S17-IR-2004
23 S17-IR-2006 An Overview of Apache Spark
    Snehal Chemburkar, Rahul Raghatate
24 S17-IR-2008 An overview of HadoopDB and its Architecture
    Karthik Anbazhagan
25 S17-IR-2011 Ansible
    Anurag Kumar Jain, Gregor von Laszewski
26 S17-IR-2012 Lustre File System
    Pratik Jain
27 S17-IR-2013 An overview of Flume and its Applications in BigData
    Sahiti Korrapati
28 S17-IR-2014 An Overview of Apache Sqoop
    Harshit Krishnakumar
29 S17-IR-2016 Apache Spark's Machine Learning Library (MLlib)
    Anvesh Nayan Lingampalli
30 S17-IR-2017 An Overview of OpenNebula Project and its Applications
    Author Missing
31 S17-IR-2018 Analysis of Pentaho
    Bhavesh Reddy Merugureddy
32 S17-IR-2019 Twister: A new approach to MapReduce Programming
    Vasanth Methkupalli
33 S17-IR-2021 Docker Machine and Swarm
    Shree Govind Mishra
34 S17-IR-2022 Triana
    Abhishek Naik
35 S17-IR-2024 LDAP
    Ronak Parekh, Gregor von Laszewski
36 S17-IR-2026 Ceph - Distributed Storage System
    Rahul Raghatate, Snehal Chemburkar
37 S17-IR-2027 Twitter Heron
    Shahidhya Ramachandran
38 S17-IR-2028 A Report on Kubernetes
    Srikanth Ramanam
39 S17-IR-2029 An overview of Azure Machine Learning and its Applications
    Naveenkumar Ramaraju
40 S17-IR-2030 Microsoft Azure Data Factory
    Sowmya Ravi
41 S17-IR-2031 Google Cloud storage: A journey towards Cloud storage
    Kumar Satyam
42 S17-IR-2034 Apache Drill
    Yatin Sharma
43 S17-IR-2035 Oracle PGX
    Piyush Shinde
44 S17-IR-2036 An Overview of the Java Message Service (JMS)
    Rahul Singh
45 S17-IR-2037 File Transfer Protocol - An Overview
    Sriram Sitharaman
46 S17-IR-2038 Introduction to H2O
    Sushmita Sivaprasad
47 S17-IR-2041 Tajo: A Distributed Warehouse System for Large Datasets
    Sagar Vora
48 S17-IR-2044 Allegro Graph
    Diksha Yadav
49 S17-TS-0003 TBD
    Tony Liu, Vibhatha Abeykoon, Gregor von Laszewski
50 S17-TS-0006 Not Submitted
    Author Missing

Review Article, Spring 2017 - I524

Globus Toolkit

SABER SHEYBANI 1,*
1 School of Informatics and Computing, Bloomington, IN 47408, U.S.A.
* Corresponding authors: [email protected]
paper-001, September 21, 2017

The Globus Toolkit is an open source software stack, developed to serve as the middleware for grid computing. It is organized as a collection of loosely coupled components consisting of services, programming libraries and development tools designed for building grid-based applications. GT components fall into five broad domain areas: Security, Data Management, Execution Management, Information Services, and Common Runtime.
© 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Globus Toolkit, grid computing, WSRF, OGSA, cluster resource management

https://github.com/ssheybani/sp17-i524/tree/master/paper1/S17-ER-1001/report.pdf

1. INTRODUCTION

Initially, work on Globus was motivated by the demands of "virtual organizations" in science. These organizations need access to resources and services which are not easily replicable locally. Examples include regular access to, and management of, equipment and large amounts of data held in various remote databases. Although each application may have different specific requirements, the demand for a few basic functions is shared between most applications. "They often need to discover available resources, configure a computing resource to run an application, move data reliably from one site to another, monitor system components, control who can do what, and manage user credentials". These functions can be well addressed in a grid computing framework [1].

According to [2], “a grid is a system that coordinates resources that are not subject to centralized control, using standard, open, general-purpose protocols and interfaces to deliver nontrivial qualities of service”. The major difference between grids and other distributed systems is that grids have a significantly greater degree of resource heterogeneity. They also enable resource sharing among virtual organizations, whereas other distributed computing technologies only support a single organization [3]. The grid architecture consists of a number of layers that together serve the functions and purposes of the system (see Figure 1). EGEE: Enabling Grids for E-Science in Europe [4], NEESit: Network for Earthquake Engineering Simulation [5], and TeraGrid [6] are examples of grid systems.

Fig. 1. Grid Architecture [2]

A grid system often comprises various services, such as a Job Management Service and a Resource Discovery and Management Service, which constantly need to interact with each other. However, these services may have been implemented by different vendors, so they are not necessarily compatible with one another. The Open Grid Services Architecture (OGSA) aims to standardize practically all the services that can commonly be found in a grid system (job management services, resource management services, security services, etc.) by specifying a set of standard interfaces for these services. In order to realize that goal, OGSA needs to choose some sort of distributed middleware on which to base the architecture. In other words, if OGSA (for example) defines that the JobSubmissionInterface has a submitJob operation, there has to be a common and standard way to invoke that operation. Web services were chosen as the underlying technology. Web services are platform-independent and language-independent since they are message-oriented and rely on language-neutral XML dialects to send messages, to specify interfaces, etc. Hence, Web services are well-suited for building loosely coupled systems such as grid systems.

However, although Web services can, in theory, be either stateless or stateful, they are usually stateless. This means that the Web service can’t "remember" information, or keep state, from one invocation to another. OGSA, on the other hand, requires stateful services. The Web Services Resource Framework (WSRF) specifies how we can make our Web services stateful, along with adding other features. In other words, while OGSA is the architecture, WSRF is the infrastructure on which that architecture is built.

With this background, we can define Globus Toolkit as an open source toolkit that provides an implementation of OGSA, WSRF and a number of other standards for grid computing. Figure 2 displays the relationship among OGSA, Web services, WSRF, and the Globus Toolkit.

Fig. 2. Relationship between OGSA, GT4, WSRF, and Web Services [2]

2. COMPONENTS

The libraries and services in Globus Toolkit can be classified into five broad domain areas, namely security, data management, execution management, information services, and common runtime.

[...] Allocation and Management (the heart of GT Execution Management, providing services to deploy and monitor jobs), Community Scheduler Framework (provides a single interface to different resource schedulers), Workspace Management (allows users to dynamically create and manage workspaces on remote hosts), and Grid Telecontrol Protocol (provides a WSRF-enabled service interface for control of remote instruments).

Information Services : Commonly referred to as the Monitoring and Discovery System (MDS), these services deal with monitoring and discovery of resources in a virtual organization (VO). They include Index Service (for aggregation of resources of interest to a VO), Trigger Service (same as Index Service, but configured to perform certain actions based on the data collected from resources), and WebMDS (provides a web browser-based view of data collected by GT4 aggregator services).

Common Runtime : These components provide a set of fundamental libraries and tools for hosting existing services as well as developing new services in C, Java, and Python (C Runtime, Java Runtime, and Python Runtime).

3. APPLICATIONS

Globus Toolkit has been used in many scientific and industrial applications. Examples of large-scale e-science projects relying on the Globus Toolkit include the Network for Earthquake Engineering and Simulation (NEES), FusionGrid, the Earth System Grid (ESG), the NSF Middleware Initiative and its GRIDS Center, and the National Virtual Observatory. In the design of the Large Hadron Collider at CERN, Globus-based technologies have been developed through the European Data Grid and U.S. efforts such as the Grid Physics Network (GriPhyN) and the Particle Physics Data Grid [7].

Globus Toolkit is helping to bridge the gap for commercial applications of Grid computing.
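
The contrast drawn in the introduction between stateless Web services and the stateful services that OGSA requires can be sketched in a few lines of Python. This is only an illustration of the general idea behind WSRF (the service logic stays stateless, while state is kept in separately stored resources that each invocation addresses by key); it is not Globus Toolkit or WSRF code, and every class and method name below is hypothetical.

```python
import itertools
from typing import Dict


class StatelessCounterService:
    """Stateless service: nothing survives from one invocation to the next."""

    def add(self, value: int) -> int:
        total = 0  # state is recreated on every call, so it is never remembered
        total += value
        return total


class ResourceHome:
    """Hypothetical store that keeps the state, one resource per unique key."""

    def __init__(self) -> None:
        self._resources: Dict[int, Dict[str, int]] = {}
        self._keys = itertools.count(1)

    def create(self) -> int:
        # Allocate a fresh resource and hand back the key that identifies it.
        key = next(self._keys)
        self._resources[key] = {"total": 0}
        return key

    def find(self, key: int) -> Dict[str, int]:
        return self._resources[key]


class ResourceBackedCounterService:
    """Service code holds no state itself; each call names the resource to use."""

    def __init__(self, home: ResourceHome) -> None:
        self._home = home

    def create_resource(self) -> int:
        return self._home.create()

    def add(self, key: int, value: int) -> int:
        resource = self._home.find(key)
        resource["total"] += value
        return resource["total"]


if __name__ == "__main__":
    stateless = StatelessCounterService()
    print(stateless.add(5), stateless.add(5))

    service = ResourceBackedCounterService(ResourceHome())
    key_a = service.create_resource()
    key_b = service.create_resource()
    print(service.add(key_a, 5), service.add(key_a, 5))
    print(service.add(key_b, 1))
```

In this sketch the stateless service returns the same result for repeated calls, whereas the resource-backed service accumulates a total per key, and two keys never share state. Keeping the state outside the service object is the design choice being illustrated: the service itself can remain a plain, stateless Web service.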