<<

Big Data Software

Spring 2017 Bloomington, Indiana

Editor: Gregor von Laszewski, Department of Intelligent Systems Engineering, Indiana University, laszewski@.com

Contents

1 S17-ER-1001 Berkeley DB Saber Sheybani 5

2 S17-IO-3000 Apache Ranger Avadhoot Agasti 8

3 S17-IO-3005 Amazon Kinesis Abhishek Gupta 11

4 S17-IO-3008 Google Cloud DNS Vishwanath Kodre 14

5 S17-IO-3010 Robot Operating System (ROS) Matthew Lawson 17

6 S17-IO-3011 Apache Crunch Scott McClary 22

7 S17-IO-3012 Apache MRQL - MapReduce Query Language Mark McCombe 25

8 S17-IO-3013 Lightning Memory-Mapped Database (LMDB) Leonard Mwangi 29

9 S17-IO-3014 SciDB: An Array Database Piyush Rai 32

10 S17-IO-3015 Sabyasachi Roy Choudhury 34

11 S17-IO-3016 Ribka Rufael 37

12 S17-IO-3017 Tao Nandita Sathe 40

13 S17-IO-3019 InCommon Michael Smith, 43

14 S17-IO-3020 Hadoop YARN Milind Suryawanshi, Gregor von Laszewski 46

15 S17-IO-3021 Apache - Application Data processing Framework Abhijt Thakre 49

16 S17-IO-3022 Deployment Model of Juju Sunanda Unni 53

17 S17-IO-3023 AWS Lambda Karthick Venkatesan 56

18 S17-IO-3024 Not Submitted Ashok Vuppada 60

19 S17-IR-2001 HUBzero: A Platform For Scientific Collaboration Niteesh Kumar Akurati 62

20 S17-IR-2002 : Stream and Batch Processing Jimmy Ardiansyah 65

21 S17-IR-2004 Ajit Balaga, S17-IR-2004 68

22 S17-IR-2006 An Overview of Apache Snehal Chemburkar, Rahul Raghatate 71

23 S17-IR-2008 An overview of and its architecture Karthik Anbazhagan 76

24 S17-IR-2011 Hyper-V Anurag Kumar Jain 79

25 S17-IR-2012 Retainable Evaluator Execution Framework Pratik Jain 82

26 S17-IR-2013 A brief introduction to OpenCV Sahiti Korrapati 85

27 S17-IR-2014 An Overview of Pivotal Web Services Harshit Krishnakumar 88

28 S17-IR-2016 An Overview of Author Missing 91

29 S17-IR-2017 An Overview of Pivotal HD/HAWQ and its Applications Author Missing 93

30 S17-IR-2018 An overview of Cisco Intelligent Automation for Cloud Bhavesh Reddy Merugureddy 96

31 S17-IR-2019 KeystoneML Vasanth Methkupalli 99

32 S17-IR-2021 Amazon Elastic Beanstalk Shree Govind Mishra 102

33 S17-IR-2022 ASKALON Abhishek Naik 105

34 S17-IR-2024 Ronak Parekh, Gregor von Laszewski 108

35 S17-IR-2026 Naiad Rahul Raghatate, Snehal Chemburkar 111

36 S17-IR-2027 : Distributed Execution Engine Shahidhya Ramachandran 117

37 S17-IR-2028 A Report on Srikanth Ramanam 121

38 S17-IR-2029 Naveenkumar Ramaraju 124

39 S17-IR-2030 Neo4J Sowmya Ravi 127

40 S17-IR-2031 OpenStack Nova: Compute Service of OpenStack Cloud Kumar Satyam 130

41 S17-IR-2034 Yatin Sharma 133

42 S17-IR-2035 D3 Piyush Shinde 136

43 S17-IR-2036 An overview of the open source log management tool - Graylog Rahul Singh 140

44 S17-IR-2037 Jupyter Notebook vs Apache Zeppelin - A comparative study Sriram Sitharaman 143

45 S17-IR-2038 Introduction to Terraform Sushmita Sivaprasad 147

46 S17-IR-2041 Google BigQuery - A data warehouse for large-scale data analytics Sagar Vora 150

47 S17-IR-2044 Hive Diksha Yadav 155


Berkeley DB

SABER SHEYBANI1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

Paper 2, April 30, 2017

Berkeley DB is a family of open source, NoSQL key-value database libraries. It provides a simple function-call API for data access and management over a number of programming languages, including C, C++, Java, Perl, Tcl, Python, and PHP. Berkeley DB is embedded because it links directly into the application and runs in the same address space as the application. As a result, no inter-process communication, either over the network or between processes on the same machine, is required for database operations. It is also extremely portable and scalable: it can manage databases up to 256 terabytes in size. For data management, Berkeley DB offers advanced services, such as concurrency for many users, ACID transactions, and recovery. Berkeley DB is used in a wide variety of products and a large number of projects, including gateways from Cisco, Web applications at Amazon.com, and open-source projects such as Apache. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: NoSQL, Oracle, open source database management system

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-ER-1001/report.pdf

1. INTRODUCTION

Data management has always been a fundamental issue in programming. Since the 1960s, countless database management systems have been developed to fulfill different sorts of demands. The question for every user is choosing the system that best fits the requirements of their application.

Database management systems can be categorized, based on their data models, into a number of groups: hierarchical databases, network databases, relational databases, object-based databases, and semistructured databases. Hierarchical databases use the oldest type of data model, a tree-like structure in which records are connected to each other with hard-coded links. In a network model, records are also connected with links, but there is no hierarchy; instead, the structure is graph-like and all of the nodes can connect to each other. In relational databases, there are no physical links, but the data is structured in tables (relations). Each row represents a record and each column represents an attribute. The tables are connected through common attributes, which makes querying much easier than in the two former models. For this reason, database management systems using relational models are the most widely used ones. Object-based data models extend concepts of object-oriented programming into database systems, in order to provide persistent storage of objects and other capabilities of databases for object-oriented programming. Semistructured databases, which include NoSQL databases, are the type of database model that enables storage of heterogeneous data by allowing records with different attributes. This, however, is achieved by sacrificing the database system's knowledge of the structure of the data. The data in this case must be self-describing, meaning that the description (schema) of the data must be contained within the data itself. The XML (Extensible Markup Language) schema language is a widely used language for providing schemas for these database systems [1].

Berkeley DB fits into the last category, as a NoSQL database system. The records are stored as key-value pairs and a few logical operations can be executed on them, namely: insertion, deletion, finding a record by its key, and updating an already found record. "Berkeley DB never operates on the value part of a record. Values are simply payload, to be stored with keys and reliably delivered back to the application on demand." There is no notion of schema and no support for SQL queries. "The application must understand the keys and values that it uses. On the other hand, there is literally no limit to the data types that can be stored in a Berkeley DB database. The application never needs to convert its own program data into the data types that Berkeley DB supports. Berkeley DB is able to operate on any data type the application uses, no matter how complex" [2].
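To make these four record-level operations concrete, the following is a minimal sketch using the bsddb3 Python bindings for Berkeley DB; the file name and key/value contents are illustrative only, and the snippet assumes the bsddb3 package and the Berkeley DB library are installed.

```python
import bsddb3  # Python bindings for Oracle Berkeley DB

# Open (and create, "c") a hash-based Berkeley DB file; the legacy dict-like
# interface is the simplest way to exercise the four record operations.
store = bsddb3.hashopen("example.db", "c")

store[b"user:42"] = b"saber"            # insertion
print(store[b"user:42"])                # find a record by its key -> b'saber'
store[b"user:42"] = b"saber sheybani"   # update an already found record
del store[b"user:42"]                   # deletion

store.close()
```

Because Berkeley DB treats values as opaque payload, the application decides how to serialize them; nothing in the library inspects or interprets the bytes.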

2. ARCHITECTURE

Berkeley DB's architecture can be explained by five major subsystems. Access Methods: providing general-purpose support for creating and accessing database files. Memory Pool: the general-purpose buffer pool; multiple processes, and threads within processes, access databases using this subsystem. Transaction: implementing the transaction model and realizing the ACID properties. Locking: the general-purpose lock manager for processes. Logging: the write-ahead logging that supports the Berkeley DB transaction model.

Fig. 1. Berkeley DB Subsystems [3]

Figure 1 displays a diagram of the Berkeley DB library architecture. The arrows are calls that invoke the destination. Each subsystem can also be used independently from the other ones, but this usage is not common.

3. SERVICES AND OTHER FEATURES

The two fundamental services that every database management system provides are data access and data management services. Data access services include the low-level operations on the records, which were already mentioned in the introduction for the case of Berkeley DB. In terms of storage structure, Berkeley DB supports hash tables, Btrees, simple record-number-based storage, and persistent queues [4].

Data management services are the higher-level services (and features), such as concurrency, that ensure specific qualities for operation of the system. These services include allowing simultaneous access to the records by multiple users (concurrency), changing multiple records at the same time (transactions), and complete recovery of the data from crashes (recovery) [4]. For concurrency, Berkeley DB is able to handle low-level services such as locking and shared buffer management transparently, while multiple processes and threads use the data. For recovery, every application can ask Berkeley DB for recovery at startup time. An ACID transaction ensures the following properties at the end of its operation [5]: atomicity (either all or none of the records change), consistency (the system goes from one valid state to another), isolation (concurrent execution of multiple transactions yields the same result as their sequential execution), and durability (the result remains steady, even in case of a crash of the system). Berkeley DB "libraries provide strict ACID transaction semantics, by default. However, applications are allowed to relax the isolation guarantees the database system makes" [4].

Berkeley DB runs in the same address space as the application. As a result, there is no need for communication between processes and threads. On the other hand, as an embedded database management system, it does not provide a standalone server. However, server applications can be built over Berkeley DB, and many examples of Lightweight Directory Access Protocol (LDAP) servers have been built using it [2]. The database library for Berkeley DB consumes less than 300 kilobytes of text space on common architectures, which makes it a feasible solution for embedded systems with small capacities. Nonetheless, it can manage databases of up to 256 terabytes.
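As an illustration of the transactional services described above, the sketch below opens a Berkeley DB environment with the locking, logging, and transaction subsystems enabled and wraps two writes in one ACID transaction. The directory, file, and key names are made up, and the bsddb3 calls reflect my reading of its API rather than anything prescribed by the article.

```python
from bsddb3 import db

# Environment home directory (assumed to exist) holds the shared region files,
# write-ahead logs, and lock tables used by the subsystems described above.
env = db.DBEnv()
env.open("/tmp/bdb-env",
         db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_INIT_LOCK |
         db.DB_INIT_LOG | db.DB_INIT_TXN)

store = db.DB(env)
store.open("accounts.db", dbtype=db.DB_BTREE,
           flags=db.DB_CREATE | db.DB_AUTO_COMMIT)

txn = env.txn_begin()
try:
    store.put(b"alice", b"90", txn=txn)   # both writes commit together...
    store.put(b"bob", b"110", txn=txn)
    txn.commit()                          # ...or, on abort/crash, neither does
except db.DBError:
    txn.abort()

store.close()
env.close()
```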

3.1. Supported Operating Systems and Languages
Berkeley DB supports nearly all modern operating systems. They include Windows, Linux, Mac OS X, Android, iPhone, Solaris, BSD, HP-UX, AIX, and RTOS such as VxWorks and QNX. The supported programming languages include "C, C++, Java, C#, Perl, Python, PHP, Tcl, Ruby and many others" [6].

3.2. Required Infrastructure As infrastructure, Berkeley DB requires "underlying IEEE/ANSI Std 1003.1 (POSIX) system calls and can be ported easily to new architectures by adding stub routines to connect the native system interfaces to the Berkeley DB POSIX-style system calls" [7].

4. PRODUCTS AND LICENSING

The products include three implementations, in C, C++, and Java (Oracle Berkeley DB, Oracle Berkeley DB XML, and Oracle Berkeley DB JE, respectively) [8]. Berkeley DB is an open source library and is free for use and redistribution in other open source products. The distribution includes complete source code for all three implementations, their supporting utilities, as well as complete documentation in HTML format [7].

For redistribution in commercial products, Sleepycat Software licenses four products, with prices ranging from US$900 to 13,800 per processor [9] as of March 2017. The products, in order of ascending price and capabilities, are: Berkeley DB Data Store, Berkeley DB Concurrent Data Store, Berkeley DB Transactional Data Store, and Berkeley DB High Availability. The Sleepycat software also includes prebuilt libraries and binaries as part of support services, which are not provided in the free distribution. There is no additional license payment for embedded usage within the Oracle Retail Predictive Application Server (RPAS).

5. USE CASES

A notable number of open source and commercial products in different areas of technology use Berkeley DB. Open source use cases include Linux, BSD, Apache, Solaris, MySQL, Sendmail, OpenLDAP, and MemcacheDB. Proprietary applications "include directory servers from Sun and Hitachi; messaging servers from Openwave and LogicaCMG; switches, routers and gateways from Cisco, Motorola, Lucent, and Alcatel; storage products from EMC and HP; security products from RSA Security and Symantec; and Web applications at Amazon.com, LinkedIn and AOL" [6].


6. ADVANTAGES AND LIMITATIONS

Berkeley DB has two advantages over relational and object-oriented database systems when it comes to embedded applications. One is running in the same address space as the application and thus not requiring any inter-process communication, which can have a high cost in embedded applications. The other is the simplicity of its interface for operations, which does not require query language parsing. These two features, along with its small size, make Berkeley DB lightweight enough for many applications where there are tight constraints on resources.

However, with simplicity comes the lack of SQL features. If the user of the application needs to perform complicated searches (potentially using SQL queries), the application developer would need to write the code for those cases. In general, Berkeley DB is aimed at providing fast, reliable, transaction-protected record storage in a minimalist way [10].

7. EDUCATIONAL MATERIAL

As mentioned in the Products section, the free distribution comes with complete documentation in HTML format. The documentation has two parts: a UNIX-style reference manual and a reference guide which can serve as a tutorial [11]. In addition, the Berkeley DB Tutorial and Reference Guide, Version 4.1.24 [7] and The Berkeley DB Book [12] are useful resources for learning more about Berkeley DB and getting started with it.

8. CONCLUSION

Berkeley DB is a minimal, lightweight database management system, focused on providing performance, especially in embedded systems. It offers a small, simple set of data access services, and a rich, powerful set of data management services. It is freely available for use by non-commercial distributions and has been successfully used in many projects.

REFERENCES

[1] I. Limited, Introduction to Database Systems. Pearson, 2010. [Online]. Available: https://books.google.com/books?id=-YY-BAAAQBAJ
[2] "What berkeley db is not," Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/intro/dbisnot.html
[3] "The big picture," Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/arch/bigpic.html
[4] "What is berkeley db?" Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/intro/dbis.html
[5] T. Haerder and A. Reuter, "Principles of transaction-oriented database recovery," ACM Computing Surveys (CSUR), vol. 15, no. 4, pp. 287–317, 1983.
[6] "Oracle berkeley database products," Web Page, accessed: 2017-4-6. [Online]. Available: http://www.oracle.com/technetwork/products/berkeleydb/learnmore/berkeley-db-family-datasheet-132751.pdf
[7] "Berkeley db tutorial and reference guide, version 4.1.24," Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/reftoc.html
[8] "Oracle berkeley db 12c," Web Page, accessed: 2017-4-6. [Online]. Available: http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/%20overview/index.html
[9] "Oracle technology global price list," Web Page, accessed: 2017-4-6. [Online]. Available: http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf
[10] "Do you need berkeley db?" Web Page, accessed: 2017-4-6. [Online]. Available: https://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/intro/need.html
[11] M. A. Olson, K. Bostic, and M. I. Seltzer, "Berkeley db," in USENIX Annual Technical Conference, FREENIX Track, 1999, pp. 183–191.
[12] H. Yadava, The Berkeley DB Book, ser. Books for Professionals by Professionals. Apress, 2007. [Online]. Available: https://books.google.com/books?id=2wEkW7pQ0KwC

AUTHOR BIOGRAPHY

Saber Sheybani received his B.S. (Electrical Engineering - Minor in Control Engineering) from University of Tehran. He is currently a PhD student of Intelligent Systems Engineering - Neuroengineering at Indiana University Bloomington.


Apache Ranger

AVADHOOT AGASTI1,*, +

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] +HID - SL-IO-3000 paper2, April 30, 2017

Apache Hadoop provides various data storage, data access and data processing services. Apache Ranger is part of the Hadoop ecosystem. Apache Ranger provides the capability to perform security administration tasks for storage, access and processing of data in Hadoop. Using Ranger, a Hadoop administrator can perform security administration tasks using a central user interface or RESTful web services. The Hadoop administrator can define policies which enable users or user-groups to perform specific actions using Hadoop components and tools. Ranger provides role based access control for datasets on Hadoop at column and row level. Ranger also provides centralized auditing of user access and security related administrative actions. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Apache Ranger, LDAP, Active Directory, Apache Knox, Apache Atlas, Yarn, Apache HBase, Apache Sentry, Hive Server2, Java

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IO-3000/report.pdf

1. INTRODUCTION

Apache Ranger is an open source software project designed to provide centralized security services to various components of Apache Hadoop. Apache Hadoop provides various mechanisms to store, process and access data, and each Apache tool has its own security mechanism. This increases administrative overhead and is also error prone. Apache Ranger fills this gap by providing a central security and auditing mechanism for various Hadoop components. Using Ranger, a Hadoop administrator can perform security administration tasks using a central user interface or RESTful web services. The administrator can define policies which enable users or user-groups to perform specific actions using Hadoop components and tools. Ranger provides role based access control for datasets on Hadoop at column and row level. Ranger also provides centralized auditing of user access and security related administrative actions.

2. ARCHITECTURE OVERVIEW

[1] describes the important components of Ranger as explained below.

2.1. Ranger Admin Portal

The Ranger admin portal is the main interaction point for the user. A user can define policies using the Ranger admin portal. These policies are stored in a policy database and are polled by the various plugins. The admin portal also collects the audit data from plugins and stores it in HDFS or in a relational database.

2.2. Ranger Plugins

Plugins are Java programs which are invoked as part of the cluster component. For example, the ranger-hive plugin is embedded as part of Hive Server2. The plugins pull the policies, intercept the user request and evaluate it against the policies. Plugins also collect the audit data for that specific component and send it to the admin portal.

2.3. User group sync

While Ranger provides the authorization or access control mechanism, it needs to know the users and the groups. Ranger integrates with the Unix user management system, LDAP or Active Directory to fetch the users and the groups information. The user group sync component is responsible for this integration.

3. HADOOP COMPONENTS SUPPORTED BY RANGER

Ranger supports auditing and authorization for the following Hadoop components [2].

3.1. Apache Hadoop and HDFS

Apache Ranger provides a plugin for Hadoop which helps in enforcing data access policies. The HDFS plugin works with the name node to check whether the user's access request to a file on HDFS is valid or not.
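Since the admin portal exposes policy management through RESTful web services, an HDFS path policy like the one just described can also be created programmatically. The following is a hedged sketch using Python and the requests library; the host, credentials, endpoint path, service name, and JSON field names are assumptions based on Ranger's public REST API v2 and should be checked against the documentation for the deployed version.

```python
import requests

# Assumed Ranger admin endpoint and demo credentials (not from the article).
RANGER_ADMIN = "http://ranger-admin.example.com:6080"
AUTH = ("admin", "admin")

# Allow the 'analysts' group read access to /data/sales in an HDFS service.
policy = {
    "service": "hadoopdev",            # assumed HDFS service name in Ranger
    "name": "sales-read-only",
    "resources": {
        "path": {"values": ["/data/sales"], "isRecursive": True}
    },
    "policyItems": [{
        "groups": ["analysts"],
        "accesses": [{"type": "read", "isAllowed": True},
                     {"type": "execute", "isAllowed": True}],
    }],
}

resp = requests.post(f"{RANGER_ADMIN}/service/public/v2/api/policy",
                     json=policy, auth=AUTH)
resp.raise_for_status()
print(resp.json().get("id"))  # id of the newly created policy
```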

3.2. Apache Hive

Apache Hive provides a SQL interface on top of the data stored in HDFS. Apache Hive supports two types of authorization: storage based authorization and SQL standard authorization. Ranger provides a centralized authorization interface for Hive, which provides granular access control at table and column level. Ranger's Hive plugin is part of Hive Server2.

3.3. Apache HBase

Apache HBase is a NoSQL database implemented on top of Hadoop and HDFS. Ranger provides a coprocessor plugin for HBase, which performs authorization checks and audit log collection.

3.4. Apache Storm

Ranger provides a plugin which helps in performing security authorization on Apache Storm.

3.5. Apache Knox

Apache Knox provides service level authorization for users and groups. Ranger provides a plugin for Knox using which administration of policies can be supported. The audit over Knox data enables the user to perform detailed analysis of who accessed Knox and when.

3.6. Apache Solr

Solr provides free text search capabilities on top of Hadoop. Ranger is useful to protect Solr collections from unauthorized usage.

3.7. Apache Kafka

Ranger can manage access control on Kafka topics. Policies can be implemented to control which users can write to a Kafka topic and which users can read from a Kafka topic.

3.8. Yarn

Yarn is the resource management layer for Hadoop. Administrators can set up queues in Yarn and then allocate users and resources on a per queue basis. Policies can be defined in Ranger to define who can write to various Yarn queues.

4. IMPORTANT FEATURES OF RANGER

The blog article [3] explains two important features of Apache Ranger.

4.1. Dynamic Column Masking

Dynamic data masking at column level is an important feature of Apache Ranger. Using this feature, the administrator can set up data masking policies. Data masking makes sure that only authorized users can see the actual data, while other users will see masked data. Since the masked data is format preserving, they can continue their work without getting access to the actual sensitive data. For example, application developers can use masked data to develop the application, whereas when the application is actually deployed, it will show actual data to the authorized user. Similarly, a security administrator may choose to mask credit card numbers when they are displayed to a service agent.

4.2. Row Level Filtering

Data authorization is typically required at column level as well as at row level. For example, in an organization which is geographically distributed across many locations, the security administrator may want to give access to data from a specific location only to specific users. In another example, a hospital data security administrator may want to allow doctors to see only their own patients. Using Ranger, such row level access control can be specified and implemented.

5. HADOOP DISTRIBUTION SUPPORT

Ranger can be deployed on top of Apache Hadoop. [4] provides detailed steps for building and deploying Ranger on top of Apache Hadoop. The Hortonworks Distribution of Hadoop (HDP) supports Ranger deployment using Ambari; [5] provides installation, deployment and configuration steps for Ranger as part of an HDP deployment. The Cloudera Hadoop Distribution (CDH) does not support Ranger. According to [6], Ranger is not recommended on CDH and instead Apache Sentry should be used as the central security and audit tool on top of CDH.

6. USE CASES

Apache Ranger provides a centralized security framework which can be useful in many use cases, as explained below.

6.1. Data Lake

[7] explains that storing many types of data in the same repository is one of the most important features of a data lake. With multiple datasets, the ownership, security and access control of the data become a primary concern. Using Apache Ranger, the security administrator can define fine grained control on the data access.

6.2. Multi-tenant Deployment of Hadoop

Hadoop provides the ability to store and process data from multiple tenants. The security framework provided by Apache Ranger can be utilized to protect the data and resources from unauthorized access.

7. APACHE RANGER AND APACHE SENTRY

According to [8], Apache Sentry and Apache Ranger have many features in common. Apache Sentry [9] provides role based authorization to data and metadata stored in Hadoop.

8. EDUCATIONAL MATERIAL

[10] provides tutorials on topics like a) security resources, b) auditing, c) securing HDFS, Hive and HBase with Knox and Ranger, and d) using Apache Atlas' tag based policies with Ranger. [11] provides step by step guidance on getting the latest code base of Apache Ranger, building it and deploying it.

9. LICENSING

Apache Ranger is available under the Apache 2.0 License.

10. CONCLUSION

Apache Ranger is useful to Hadoop security administrators since it enables granular authorization and access control. It also provides a central security framework for different data storage and access mechanisms like Hive, HBase and Storm. Apache Ranger also provides an audit mechanism. With Apache Ranger, security can be enhanced for complex Hadoop use cases like the Data Lake.


ACKNOWLEDGEMENTS The authors thank Prof. Gregor von Laszewski for his technical guidance.

REFERENCES

[1] Hortonworks, "Apache ranger - overview," Web Page, online; accessed 9-Mar-2017. [Online]. Available: https://hortonworks.com/apache/ranger/#section_2
[2] A. S. Foundation, "Apache ranger - frequently asked questions," Web Page, online; accessed 9-Mar-2017. [Online]. Available: http://ranger.apache.org/faq.html#How_does_it_work_over_Hadoop_and_related_components
[3] S. Mahmood and S. Venkat, "For your eyes only: Dynamic column masking & row-level filtering in hdp2.5," Web Page, Sep. 2016, online; accessed 9-Mar-2017. [Online]. Available: https://hortonworks.com/blog/eyes-dynamic-column-masking-row-level-filtering-hdp2-5/
[4] A. S. Foundation, "Apache ranger 0.5.0 installation," Web Page, online; accessed 9-Mar-2017. [Online]. Available: https://cwiki.apache.org/confluence/display/RANGER/Apache+Ranger+0.5.0+Installation
[5] Hortonworks, "Installing apache ranger," Web Page, online; accessed 9-Mar-2017. [Online]. Available: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_installing_manually_book/content/ch_installing_ranger_chapter.html
[6] Cloudera, "Configuring authorization," Web Page, online; accessed 9-Mar-2017. [Online]. Available: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_authorization.html
[7] Teradata and Hortonworks, "Putting the data lake to work - a guide to best practices," Web Page, Apr. 2014, online; accessed 9-Mar-2017. [Online]. Available: https://hortonworks.com/wp-content/uploads/2014/05/TeradataHortonworks_Datalake_White-Paper_20140410.pdf
[8] S. Neumann, "5 hadoop security projects," Web Page, Nov. 2014, online; accessed 9-Mar-2017. [Online]. Available: https://www.xplenty.com/blog/2014/11/5-hadoop-security-projects/
[9] A. S. Foundation, "Apache sentry," Web Page, online; accessed 9-Mar-2017. [Online]. Available: https://sentry.apache.org/
[10] Hortonworks, "Apache ranger overview," Web, online; accessed 9-Mar-2017. [Online]. Available: https://hortonworks.com/apache/ranger/#tutorials
[11] A. S. Foundation, "Apache ranger - quick start guide," Web Page, online; accessed 9-Mar-2017. [Online]. Available: http://ranger.apache.org/quick_start_guide.html


Amazon Kinesis

ABHISHEK GUPTA1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] project-001, April 30, 2017

Amazon Kinesis [1] provides a software-as-a-service (SaaS) platform for application developers working on the Amazon Web Services (AWS) [2] platform. Kinesis is capable of processing streaming data in real time, which is a key challenge application developers face when they have to process huge amounts of data in real time. It can scale up or scale down based on the data needs of the system. As the volume of data grows with the advent of IoT [3] devices and sensors, Kinesis will play a key role in developing applications which require insights in real time from this growing volume of data. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IO-3005/report.pdf

1. INTRODUCTION

Amazon Kinesis [1] helps application developers collect and analyze streaming data in real time. The streaming data can come from a variety of sources like social media, sensors, mobile devices, syslogs, logs, web server logs, network data etc. Kinesis can scale on demand as application needs change; for example, during a peak load situation Kinesis adds more worker nodes and can reduce the nodes when the application runs at low load. It also provides durability: if one of the nodes goes down, the data is persisted on disk and gets replicated when new nodes come up. Multiple applications can consume data from one or more streams for a variety of use cases, for example, one application computes a moving average and another application counts the number of user clicks. These applications can work in parallel and independently. Kinesis provides streaming in real time with sub-second delays between producer and consumer. Kinesis has two types of processing engines: Kinesis streams, which reads data from producers, and Kinesis firehose, which pushes data to consumers. Kinesis streams can be used to process incoming data from multiple sources. Kinesis firehose is used to load streaming data into AWS services like Kinesis analytics, S3 [4], Redshift [5], Elasticsearch [6] etc.

2. ARCHITECTURE

2.1. Introduction

Amazon Kinesis reads data from a variety of sources. The data coming into streams is in a record format. Each record is composed of a partition key, a sequence number and a data blob, which is a raw serialized byte array. The data is further partitioned into multiple shards (or workers) using the partition key.

Fig. 1. Kinesis streams building blocks [7]

2.2. Building blocks

Following are the key components in the streams architecture [7]:

2.2.1. Data Record

A data record is one unit of data that flows through a Kinesis stream. A data record is made up of a sequence number, a partition key, and a blob of actual data. The maximum size of a data blob is 1 MB. During aggregation, one or more records are aggregated into a single aggregated record; these aggregated records are then emitted as an aggregation collection.

2.2.2. Producer

Producers write the data to a Kinesis stream. A producer can be any system producing data, for example an ad server, a social media stream, or a log server.

2.2.3. Consumer

Consumers subscribe to one or more streams. A consumer can be one of the applications running on AWS or hosted on EC2 [8] instances (virtual machines).
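To illustrate the producer and consumer roles just described, here is a minimal sketch using the boto3 Python SDK (the article itself discusses the Java SDK); the stream name, region, and record contents are made up, and the stream is assumed to already exist.

```python
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM = "clickstream-demo"  # assumed, pre-created stream

# Producer: each record needs the stream name, a data blob, and a partition key.
kinesis.put_record(StreamName=STREAM,
                   Data=b'{"user": 42, "action": "click"}',
                   PartitionKey="user-42")

# Consumer: obtain a shard iterator, then read records from that position.
shard_id = kinesis.describe_stream(StreamName=STREAM)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]

for rec in kinesis.get_records(ShardIterator=iterator)["Records"]:
    print(rec["SequenceNumber"], rec["Data"])
```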


Fig. 2. Aggregation of records

2.2.4. Shard

A shard is an instance of the Kinesis stream engine. A stream can have one or more shards. Records are processed by each shard based on the partition key. Each shard can process up to 2 MB/s of data for reads and up to 1 MB/s for writes. The total capacity of a stream is the sum of the capacities of its shards.

2.2.5. Partition Key

A partition key can be up to 256 bytes long. An MD5 [9] hash function is used to map partition keys to a 128 bit integer value, which is further used to map the record to the appropriate shard.

2.2.6. Sequence Number

A sequence number is assigned to a record when the record gets written to the stream.

2.2.7. Amazon Kinesis Client Library

The Amazon Kinesis Client Library is bundled into your application built on AWS. It makes sure that for each record there is a shard available to process that record. The client library uses DynamoDB to store control data about the records being processed by shards.

2.2.8. Application Name

The name of the application is stored in the control table in DynamoDB [10], where Kinesis streams will write the data to. This name is unique.

3. KINESIS DEVELOPMENT

AWS provides a Java SDK [11] which can be used to complete all workflows on a stream, such as creating a stream, listing streams, retrieving shards from a stream, deleting a stream, re-sharding a stream and changing the data retention period. The SDK provides rich documentation and developer blogs to support development on streams. You can create and deploy Kinesis components using the following: the Kinesis console, the Streams API, and the AWS CLI. Before creating a stream you should determine the initial size of the stream [12] and the number of shards required to create your stream. The number of shards can be calculated using the formula

NumberOfShards = max(A / 1000, B / 2000)

where A is the incoming write bandwidth in KB and B is the outgoing read bandwidth in KB.
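As a worked example of this sizing formula, the helper below (an illustrative sketch, not part of any AWS SDK) rounds each term up and keeps at least one shard:

```python
import math

def number_of_shards(incoming_write_kb: float, outgoing_read_kb: float) -> int:
    """Estimate shards from the formula max(A/1000, B/2000).

    A shard accepts up to 1 MB/s (1000 KB/s) of writes and serves up to
    2 MB/s (2000 KB/s) of reads, hence the two divisors.
    """
    return max(math.ceil(incoming_write_kb / 1000),
               math.ceil(outgoing_read_kb / 2000),
               1)

print(number_of_shards(4500, 6000))  # -> 5 shards for 4.5 MB/s in, 6 MB/s out
```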

Here, the attributes used in the calculation are self explanatory. A producer writes data records into a Kinesis stream. This data is available for 24 hours within the stream; the default retention interval can be changed. To write records to a stream, you must specify the partition key, the name of the stream and the data blob. The consumer, on the other hand, reads data from the stream using a shard iterator. A shard iterator provides the consumer a position on the stream from where the consumer can start reading the data.

4. STREAM LIMITS

4.1. Shard

Kinesis streams have certain limits [13]: by default there can be 25 shards in a region, except US East, EU and US West, which have a limit of 50 shards. Each shard can support up to 5 transactions per second for reads, at a maximum data rate of 2 MB per second. Each shard can support 1000 records per second for writes, at a maximum data rate of 1 MB per second.

4.2. Data retention

By default the data is available for 24 hours, which can be configured up to 168 hours in 1 hour increments.

4.3. Data Blob

The maximum size of a data blob is 1 MB before encoding [13].

5. MANAGEMENT

Kinesis provides all management using the AWS console, or you can build a custom management application using the Java SDK [11] provided by AWS. AWS provides a web console to manage all AWS services including Kinesis. Using the console web user interface, a user can perform all operations to manage a stream.

6. MONITORING

AWS provides several ways to monitor its services. Kinesis can use these services for monitoring purposes: CloudWatch metrics, Kinesis Agent, API logging, the Client Library, and the Producer Library. CloudWatch metrics allow you to monitor data and usage at the shard level. It can collect metrics like latency, incoming bytes, incoming records, success count etc.

Fig. 3. Kinesis Metrics [14]

7. LICENSING

Kinesis is software as a service (SaaS) on the Amazon AWS infrastructure. Hence it can only run as a service within AWS. It comes with pay-as-you-go pricing.


8. USE CASES

Kinesis streams [1] and firehose can be useful in a variety of use cases: log data processing, log mining, realtime metrics, reporting, realtime analytics, and complex stream processing. For example, Kinesis can be used in serving advertisements based on user click events, where clients send the clickstream data to Kinesis streams. The stream further generates records which are processed by Spark streaming, and the algorithms running on Spark streaming can be used to generate insights based on the user's interest.

We send clickstream data containing content and audience information from 250+ digital properties to Kinesis Streams to feed our real-time content recommendations engine so we can maximize audience engagement on our sites - Hearst Corporation [1]

Kinesis solves a variety of these business problems by doing real time analysis and aggregation. This aggregated data can further be stored or made available to query. Since it runs on Amazon, it becomes easy for users to integrate and use other AWS components.

9. CONCLUSION

Kinesis can process huge amounts of data in realtime; application developers can then focus on their business logic. Kinesis can help build realtime dashboards, capture anomalies, generate alerts, and provide recommendations which can help take business and operational decisions in real time. It can also send data to other AWS services. You can scale up or scale down as application demand increases or decreases, and only pay based on your usage. The only downside of Kinesis is that it cannot run on a private or hybrid cloud; rather it can only run on the AWS public cloud or Amazon VPC (Virtual Private Cloud) [15]. Customers who want to use Kinesis but don't want to be on the Amazon platform cannot use it.

ACKNOWLEDGEMENTS Special thanks to Professor Gregor von Laszewski, Dimitar Nikolov and all associate instructors for all help and guidance related to latex and bibtex, scripts for building the project, quick and timely resolution to any technical issues faced. The paper is written during the course I524: and Open Source Soft- ware Projects, Spring 2017 at Indiana University Bloomington.

REFERENCES

[1] "Kinesis - real-time streaming data in the aws cloud," Web Page, accessed: 2017-01-17. [Online]. Available: https://aws.amazon.com/kinesis/
[2] "AWS - amazon web services," Web Page, accessed: 2017-04-01. [Online]. Available: https://aws.amazon.com/
[3] K. Hwang, J. Dongarra, and G. C. Fox, Distributed and Cloud Computing: from parallel processing to the internet of things, T. Green and R. Day, Eds. Morgan Kaufmann, 2012. [Online]. Available: https://www.amazon.com/Distributed-Cloud-Computing-Parallel-Processing-ebook/dp/B00GNBLGE4?SubscriptionId=0JYN1NVW651KCA56C102&tag=techkie-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B00GNBLGE4
[4] "Amazon S3 - simple, durable, massively scalable object storage," Web Page, accessed: 2017-04-01. [Online]. Available: https://aws.amazon.com/s3
[5] "Amazon Redshift - fast, simple, cost-effective data warehousing," Web Page, accessed: 2017-04-01. [Online]. Available: https://aws.amazon.com/redshift/
[6] "Amazon Elasticsearch - fully managed, reliable, and scalable elasticsearch service," Web Page, accessed: 2017-04-01. [Online]. Available: https://aws.amazon.com/elasticsearch-service
[7] "Amazon Kinesis streams key concepts," Web Page, accessed: 2017-03-15. [Online]. Available: http://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
[8] "Amazon EC2 - secure and resizable compute capacity in the cloud," Web Page, accessed: 2017-04-01. [Online]. Available: https://aws.amazon.com/ec2/
[9] R. Rivest, "Rfc 1321," The MD-5 Message Digest Algorithm, SRI Network Information Center, no. 1321, Apr. 1992. [Online]. Available: https://tools.ietf.org/html/rfc1321
[10] "Amazon DynamoDB - fast and flexible NoSQL database," Web Page, accessed: 2017-04-01. [Online]. Available: https://aws.amazon.com/dynamodb/
[11] "Kinesis - aws sdk for java," Web Page, accessed: 2017-03-15. [Online]. Available: https://aws.amazon.com/sdk-for-java/
[12] "Kinesis - real-time streaming data in the aws cloud," Web Page, accessed: 2017-03-15. [Online]. Available: http://docs.aws.amazon.com/streams/latest/dev/amazon-kinesis-streams.html
[13] "Amazon Kinesis streams limits," Web Page, accessed: 2017-04-01. [Online]. Available: http://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
[14] B. Liston, "Serverless cross account stream replication using , , and amazon kinesis firehose," Web Page, Aug. 2016, accessed: 2017-04-01. [Online]. Available: https://aws.amazon.com/blogs/compute/tag/amazon-kinesis/
[15] "Amazon Virtual Private Cloud (VPC)," Web Page, accessed: 2017-04-01. [Online]. Available: https://aws.amazon.com/vpc/


Google Cloud DNS

VISHWANATH KODRE1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] paper2, April 25, 2017

Google Cloud Platform is one of the key players in providing cloud based services (IaaS, PaaS and SaaS) and solutions to customers. Users of the Google Cloud Platform have the flexibility to customize services as per their requirements, fitting their application architecture or design and budget. Google has established its Cloud Platform infrastructure worldwide. With its high availability, the Google Cloud Platform offers users little or no downtime, low latency and high throughput. Google Cloud DNS is a fairly new service added by Google with an aim to lower the latency of application or website loading time: most complicated websites nowadays have resources referencing multiple/different DNS addresses, and resolving such DNS names per user of the website takes a lot of time and slows down the rendering of the website to the end user. With the introduction of Cloud DNS, Google's customers can improve the loading speed of their sites, as it has multiple services to offer, such as zonal distribution of the DNS, caching and programmable features/customization, which give Google Cloud DNS users an opportunity to enhance the experience of their web apps. Though Google Cloud DNS is a new entry in the already established cloud DNS market, Google is giving tough competition to other providers such as Amazon Route 53. Google Cloud DNS has a lot to offer to users: it is not only a very reliable DNS service, it leverages the cloud infrastructure which Google has established, and it comes with very affordable pricing. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Google Cloud Platform and Cloud DNS, I524

https://github.com/cloudmesh/sp17-i524/tree/master/paper1/S17-IO-3008/report.pdf

1. IMPORTANCE OF DNS

DNS maintains the mapping between the name and the actual IP address of a website. Many websites have complex pages which need to resolve names from multiple IP addresses. Resolving these names and their corresponding IP addresses, along with the infrastructure the web site is deployed on and other factors, determines the uptime and response time of the website; DNS resolution is one of the key factors deciding the response time of a website. Google's Cloud DNS is a pure cloud-based DNS service: it does not handle domain registration, but it offers higher control and more features on the service itself.

2. GOOGLE PUBLIC DNS AND CLOUD DNS

Google offers services for caching resolved DNS entries to overcome the performance challenges of resolving public DNS names: complex web pages include resources from multiple origin domains, which leads to a performance hit due to multiple lookups. "Google Public DNS is a recursive DNS resolver, similar to other publicly available services" [1]. In comparison, Google Cloud DNS is designed as a very high volume, programmable, authoritative name server which leverages the power of the Google cloud infrastructure. It adds performance, security and correctness on top of what public DNS already offers, and it guarantees high availability and distribution of the DNS service per zone, which improves performance when resolving multiple domains.

3. TECHNOLOGIES PROVIDED

The concept of Google Cloud DNS is built around "Projects, Managed Zones, Record sets and changes to record sets" [2].


3.1. Project

A project is the platform for managing projects, resources and domain access control, and the place where billing is configured. Every Cloud DNS resource lives within a project and every Cloud DNS operation must specify the project to work with.

3.2. Managed Zones

A managed zone is a collection of metadata about DNS zones which holds the records for the same DNS suffix. Inside a project there can be multiple managed zones, identified by unique names. "They are automatically assigned named server when they are created to handle responding to DNS queries for that zone" [3]. To create a managed zone and enable Cloud DNS, the user has to create and enable the Cloud DNS service on their account, followed by choosing a compute engine or creating a new project. After completing the prerequisites, the user can create new zones or delete existing ones.

To create a zone, the user has to provide the DNS zone name, a description, and a name to identify the zone. A newly created zone is connected to the Google Cloud DNS project automatically; however, it is not in use until the user updates their domain registration, explicitly points some resolver at it, or directly queries the zone's name servers. For example, to create a new zone: "gcloud dns managed-zones create --dns-name="example.com." --description="A zone" myzonename" [2]

3.3. Resource record sets collection

"The resource record sets collection holds the current state of the DNS records that make up a managed zone" [2]. It is a read only resource collection which can be modified from the managed zone by creating a change request in the changes collection, and changes are reflected immediately. "User can easily send the desired changes to the API using the import, export, and transaction commands" [2]. When importing record sets into a managed zone, the format can be either BIND zone file format or YAML record format. To import, the user needs to run "dns record-sets import", e.g. "gcloud dns record-sets import -z=examplezonename --zone-file-format path-to-example-zone-file" [2]. To export, the user needs to run "dns record-sets export", e.g. "gcloud dns record-sets export -z=examplezonename --zone-file-format example.zone" [2].

To modify records, transactions are used. A transaction is a group of one or more record changes that should be propagated together, which ensures the data is never saved partially. To modify DNS records the user has to first start a transaction, e.g. "gcloud dns record-sets transaction start -z=examplezonename" [2]. As the transaction starts, Cloud DNS creates a local transaction file in YAML format; each specified operation gets added to this file, e.g. gcloud dns record-sets transaction add -z=examplezonename --name="mail.example.com." --type=A --ttl=300 "7.5.7.8" [2]

3.4. Supported DNS record types

Table 1 shows the record types supported by Cloud DNS [2].

Table 1. Cloud DNS supports the following types of records:

A: Address record, which is used to map host names to their IPv4 address.
AAAA: IPv6 Address record, which is used to map host names to their IPv6 address.
CAA: Certificate Authority (CA) Authorization, which is used to specify which CAs are allowed to create certificates for a domain.
CNAME: Canonical name record, which is used to specify alias names.
MX: Mail exchange record, which is used in routing requests to mail servers.
NAPTR: Naming authority pointer record, defined by RFC3403.
NS: Name server record, which delegates a DNS zone to an authoritative server.
PTR: Pointer record, which is often used for reverse DNS lookups.
SOA: Start of authority record, which specifies authoritative information about a DNS zone. An SOA resource record is created for you when you create your managed zone. You can modify the record as needed.
SPF: Sender Policy Framework record, a deprecated record type formerly used in e-mail validation systems (use a TXT record instead).
SRV: Service locator record, which is used by some voice over IP, instant messaging protocols, and other applications.
TXT: Text record, which can contain arbitrary text and can also be used to define machine-readable data, such as security or abuse prevention information.
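For completeness, record changes can also be submitted through the Cloud DNS REST API rather than gcloud. The sketch below uses the google-api-python-client discovery interface to add the same A record as in the gcloud example above; the project and zone names are placeholders, application default credentials are assumed to be configured, and the resource and field names follow my reading of the Cloud DNS v1 API rather than anything stated in this article.

```python
from googleapiclient import discovery

# Build a client for the Cloud DNS v1 REST API using application
# default credentials (assumed to be set up in the environment).
dns = discovery.build("dns", "v1")

change_body = {
    "additions": [{
        "name": "mail.example.com.",   # trailing dot, as in the gcloud example
        "type": "A",
        "ttl": 300,
        "rrdatas": ["7.5.7.8"],
    }]
}

request = dns.changes().create(project="my-project",        # placeholder
                               managedZone="examplezonename",
                               body=change_body)
response = request.execute()
print(response["status"])  # changes start as "pending", then become "done"
```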

4. SERVICES PROVIDED BY GOOGLE CLOUD DNS

Google Cloud DNS provides full control over DNS management and services. With Google's gcloud command line tool and the exposed REST API, the user can manage zones and records, and migrate DNS from a non-cloud provider to Cloud DNS.

Based on the SLA defined in the service plan the user purchases, a guaranty of service is provided. Apart from the basic plans, the user has 24x7 support from Google's expert team to resolve queries or any issue the user is facing. Apart from the technical support team, users can carry out maintenance on their own by referring to the technical documents and guidelines.

4.1. Migrating to Cloud DNS

Cloud DNS supports the migration of an existing DNS domain from another DNS provider to Cloud DNS by creating a managed zone for the user's domain, importing the existing DNS configuration, verifying DNS propagation, and updating the registrar's name service records.

4.2. Monitoring Changes

DNS changes can be made via the command line tool or the REST API. They are initially marked as pending, and the user can verify that the changes have been reflected by referring to the change history or looking for a status change. Listing changes and verifying DNS propagation is done by using the watch and dig commands to monitor whether changes have been picked up by the DNS servers. The syntax for looking up a zone's name servers is: "gcloud dns managed-zones describe <zonename>" [3]. This command lists all the name servers within that zone, and an individual server can then be monitored with the syntax "watch dig example.com in MX @<yourzonesnameserver>" [3]. The watch command will run the dig command every 2 seconds by default. Migration to Cloud DNS and monitoring are activities the users can easily carry out by themselves.

5. SCALABILITY

Google Cloud DNS is inherently scaled for larger data sets, as every solution provided by Google Cloud is backed by Google's widespread, highly scalable infrastructure of Google Compute Engine. With Google's auto scaling and load balancing techniques, compute resources can be distributed over multiple regions. "With Google's cloud load balancing user can put resources behind single anycast IP and scale resources up and down with intelligent autoscaling. It provides cross-region load balancing including automatic multi-region failover which gently moves traffic infractions if backend become unhealthy" [4].

6. GOOGLE CLOUD DNS WITH BIG DATA

With one Google account, a user gets to use the multiple cloud services that Google is offering. Once the user creates an account and starts using Google's Compute Engine and App Engine, they are closer to harnessing the power of the Big Data solutions offered on the cloud platform, and with Cloud DNS's programmable features it is up to the user to leverage the power of Big Data analytics and enhance the caching of the DNS resolver, which is inherently implemented by Google itself.

7. CLOUD DNS SERVICES IN COMPARISON WITH GOOGLE

In comparison with other cloud based DNS service offerings, Google is pretty new, while the cloud DNS market is already well established and customers get to choose the appropriate service for their requirements. A few of the well-known cloud DNS providers, with key features and services offered to the end customer, are as follows. Cloudflare Global DNS is free and fast in comparison and stands in the number one position in terms of market share; with its highly available DNS infrastructure and global network it addresses latency issues with local and global load balancers, and health checks with fast failover reroute visitors so the website is always available. Amazon Route 53 is Amazon's cloud DNS service and gives customers the option of using it alongside their other AWS cloud services; however, it is comparatively expensive compared with other providers. Azure DNS is said to be the cheapest cloud DNS; however, like Google Cloud DNS it is newly launched in the DNS market and has about a similar market share compared with Google.

8. KEY POINTS TO TAKE AWAY

Google Cloud DNS is comparatively new in the market; however, with its competitive pricing and highly available, widespread infrastructure it is challenging the market, especially Amazon's Route 53 service. With Google's technical support, documentation and online help, users can easily migrate their DNS and choose the cloud option. With one Google account it is easy to maintain multiple services under one roof. Google is eyeing becoming the market leader in all cloud services; with the Cloud DNS resolver it ensures that every web site using its DNS services is always up and running 24x7, which wins customers' confidence. Though Google Cloud DNS is new in the market, with its competitive pricing and services it is moving towards becoming the leader in cloud DNS services.

REFERENCES

[1] "Google public dns," Web Page, 2017. [Online]. Available: https://developers.google.com/speed/public-dns/docs/intro
[2] "Cloud dns overview," Web Page, 2016. [Online]. Available: https://cloud.google.com/dns/overview
[3] "Cloud dns managing zones," Web Page, 2017. [Online]. Available: https://cloud.google.com/dns/zones/
[4] "Google cloud load balancing," 2017. [Online]. Available: https://cloud.google.com/load-balancing/

AUTHOR BIOGRAPHY

Vishwanath Kodre received his Masters Degree in Computer Science from Pune University. He is currently studying Data Science at Indiana University Bloomington.


Robot Operating System (ROS)

MATTHEW LAWSON1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] paper2, April 30, 2017

The Open Source Robotics Foundation (OSRF) oversees the maintenance and development of the Robot Operating System (ROS). ROS provides an open-source, extensible framework upon which roboticists can build simple or highly complex operating programs for robots. Features to highlight include: a) ROS' well-developed, standardized intra-robot communication system; b) its sufficiently-large set of programming tools; c) its C++ and Python APIs; and, d) its extensive library of third-party packages to address a large proportion of roboticists' software needs. The OSRF distributes ROS under the BSD-3 license. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524, robot, ros, ROS

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IO-3010/report.pdf

1. INTRODUCTION

The Open Source Robotics Foundation's middleware product Robot Operating System, or ROS, provides a framework for writing operating systems for robots. ROS offers "a collection of tools, libraries, and conventions [meant to] simplify the task of creating complex and robust robot behavior across a wide variety of robotic platforms" [2]. The Open Source Robotics Foundation, hereinafter OSRF or the Foundation, attempts to meet the aforementioned objective by implementing ROS as a modular system. That is, ROS offers a core set of features, such as inter-process communication, that work with or without pre-existing, self-contained components for other tasks.

2. ARCHITECTURE

The OSRF designed ROS as a distributed, modular system. The OSRF maintains a subset of essential features for ROS, i.e., the core functions upon which higher-level packages build, to provide an extensible platform for other roboticists. The Foundation also coordinates the maintenance and distribution of a vast array of ROS add-ons, referred to as modules. Figure 1 illustrates the ROS universe in three parts: a) the plumbing, ROS' communications infrastructure; b) the tools, such as ROS' visualization capabilities or its hardware drivers; and c) ROS' ecosystem, which represents ROS' core developers and maintainers, its contributors and its user base.

The modules or packages, which are analogous to packages in Linux repositories or libraries in other software distributions such as R, provide solutions for numerous robot-related challenges. General categories include a) drivers, such as sensor and actuator interfaces; b) platforms, for steering and image processing, etc.; c) algorithms, for task planning and obstacle avoidance; and, d) user interfaces, such as tele-operation and sensor data display [3].

2.1. Communications Infrastructure

2.1.1. General

OSRF maintains three distinct communication methods for ROS: a) message passing; b) services; and, c) actions. Each method utilizes ROS' standard communication type, the message [4]. Messages, in turn, adhere to ROS' interface description language, or IDL. The IDL dictates that messages should be in the form of a data structure comprised of typed fields [5]. Finally, .msg files store the structure of messages published by various nodes so that ROS' internal systems can generate source code automatically.

2.1.2. Message Passing

ROS implements a publish-subscribe anonymous message passing system for inter-process communication, hereinafter pubsub, as its most-basic solution for roboticists. A pubsub system consists of two complementary pieces: a) a device, node or process, hereinafter node, publishing messages, i.e., information, to a topic; and b) another node listening to and ingesting the information from the associated topic. Designating topics to which a node should subscribe and topics to which a node should publish falls to the roboticist. ROS' rosnode command line tool conveniently "display[s] a list of active topics, the publishers and subscribers of a specific topic, the publishing rate of a topic, the bandwidth of a topic, and messages published to a topic" [6].
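To make the pubsub pattern concrete, below is a minimal sketch of a talker and listener pair using the rospy client library and the std_msgs package (assuming a ROS 1 installation with a roscore running); the node and topic names are made up for illustration.

```python
#!/usr/bin/env python
import rospy
from std_msgs.msg import String

def talker():
    # Publish String messages on the 'chatter' topic (illustrative name).
    pub = rospy.Publisher("chatter", String, queue_size=10)
    rospy.init_node("talker")
    rate = rospy.Rate(1)  # 1 Hz
    while not rospy.is_shutdown():
        pub.publish(String(data="hello from the talker node"))
        rate.sleep()

def listener():
    # Subscribe to the same topic; the callback fires for every message,
    # regardless of which node published it.
    rospy.init_node("listener")
    rospy.Subscriber("chatter", String, lambda msg: rospy.loginfo(msg.data))
    rospy.spin()

if __name__ == "__main__":
    talker()  # run listener() in a second process to receive the messages
```

Neither node names the other; they only agree on the topic name and message type, which is what allows nodes to be swapped out without code changes, as described below.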


Fig. 1. A Conceptualization of What ROS, the Robot Operating System, Offers to Roboticists [1] sents the topic. Unlike terrestrial radio, though, ROS provides a 2.1.4. Actions lookup mechanism versus "flipping through the dial." ROS actions offer a more-advanced communication paradigm The OSRF touts the pubsub communications paradigm as than either message passing or services [8]. Actions, which the ideal method primarily due to its anonymity and its require- use the basic message structure from message passing, allow ment to communicate using its message format. With respect roboticists to create a request to accomplish some task, receive to the first point, the nodes involved in bilateral or multilateral progress reports about the task completion process, receive task conversations need only know the topic on which to publish completion notifications and / or cancel the task request. For or subscribe in order to communicate. As a result, nodes can example, the roboticist may create a task, or equivalently, initiate be replaced, substituted or upgraded without changing a single an action, for the robot to conduct a laser scan of the area. The line of code or reconfiguring the software in any manner. The request would include the scan parameters, such as minimum subscriber node can even be deleted entirely without affecting scan angle, maximum scan angle and scan speed. During the any aspect of the robot except those nodes that depend on the process, the node conducting the scan will regularly report back deleted node. its progress, perhaps as a value representing the percent of the In addition, ROS’ pubsub requires well-defined interfaces scan completed, before returning the results of the scan, which between nodes in order to succeed. For instance, if a node should be a point cloud. Roboticists can program a ROS-driven publishes a message without a crucial piece information a sub- robot to attempt to accomplish any task imaginable, subject to scribing node requires or in an unexpected format, the message physical realities. Sample actions: a) move X meters left; b) would be useless. Alternatively, it would be pointless for an detect the handle of a door; and / or c) locate an empty soda audio processing node to subscribe to a node publishing lidar can on a table, pick it up to determine if it is empty enough to data. Therefore, a message’s structure must be well-defined and recycle, crush it if it is empty enough, identify the recyle bin and available for reference as needed in order to ensure compatibility then place it in the recylce bin [9]. between publisher and subscriber nodes. As a result, ROS has a modular communication system. That is, a subscriber node may 2.2. Tools use all or only parts of a publishing node’s message. Further, the 2.2.1. Message Standards subscribing node can combine the data with information from ROS’ extensive use in the robotics realm has allowed it to create another node before publishing the combined information to message standards for various robot components [4]. In this case, a different topic altogether for a third node’s use. At the same message standards refers to expectations regarding information time, a fourth and fifth node could subscribe to the original topic and information types robot components will provide to sub- for each node’s respective purpose. scribing nodes. 
For instance, standard message definitions exist Finally, ROS’ pubsub can natively replay messages by saving "for geometric concepts like poses, transforms, and vectors; for them as files. Since a subscriber node processes messages re- sensors like cameras, IMUs and lasers; and for navigation data ceived irrespective of the message’s source, publishing a saved like odometry, paths, and maps; among many others." These message from a subscriber node at a later time works just as well standards facilitate interoperability amongst robot components as an actual topic feed. One use of asynchronous messaging: as well as easing development efforts by roboticists. postmortem analysis and debugging. 2.2.2. Robot Geometry Library 2.1.3. Services Robots with independently movable components, such as ap- ROS also provides a synchronous, real-time communication tool pendages (with joints) or movable sensors, must be able to coor- under the moniker services [7]. Services allow a subscribing dinate such movements in order to be usable. Maintaining an node to request information from a publishing node instead of accurate record of where a movable component is in relation to passively receiving whatever the publishing node broadcasts the rest of the robot presents a significant challenge in robotics whenever it broadcasts it. A service consists of two messages, the [4]. request and the reply. It otherwise mirrors ROS’ message passing ROS addresses this issue with its transform library. The tf function. Finally, users can establish a continuous connection library tracks components of a robot using three-dimensional between nodes at the expense of service provider flexibility. coordinate frames [10]. It records the relationship between co-
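To make the publish-subscribe mechanism described above concrete, the following minimal Python sketch, written against the rospy client library, shows one node publishing std_msgs/String messages to a topic and a second node subscribing to it. The topic name chatter, the node names and the 10 Hz publishing rate are illustrative assumptions rather than anything prescribed by ROS.

# Minimal rospy publisher and subscriber sketch (assumes a running roscore).
# Run one copy of this file with talker() and another copy with listener().
import rospy
from std_msgs.msg import String

def talker():
    rospy.init_node('talker')
    pub = rospy.Publisher('chatter', String, queue_size=10)
    rate = rospy.Rate(10)  # publish at 10 Hz
    while not rospy.is_shutdown():
        pub.publish(String(data='hello from the talker node'))
        rate.sleep()

def callback(msg):
    rospy.loginfo('heard: %s', msg.data)

def listener():
    rospy.init_node('listener')
    rospy.Subscriber('chatter', String, callback)
    rospy.spin()  # process incoming messages until shutdown

if __name__ == '__main__':
    talker()  # replace with listener() for the subscribing node

Because neither node names the other, either one can be restarted, replaced or duplicated without touching the other, which is the anonymity property the text emphasizes.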


operations on .bag files, i.e., saved node publications; and, c) rosbash , which extends bash, a Linux shell program, with ROS- related commands.
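Alongside the rosbag command line tool listed above, saved .bag files can also be read programmatically. The short sketch below uses the rosbag Python API; the file name session.bag and the topic chatter are assumptions chosen only for illustration.

# Replay the contents of a saved bag file by iterating over its messages.
import rosbag

bag = rosbag.Bag('session.bag')
for topic, msg, stamp in bag.read_messages(topics=['chatter']):
    print(stamp.to_sec(), topic, msg)
bag.close()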

2.2.6. Graphical User Interfaces (GUI) OSRF includes two commonly-used GUIs, rviz and rqt [4], in the core ROS distribution. rviz creates 3D visualizations of the robot, as well as the sensors and sensor data specified by the user. This component renders the robot in 3D based on a user-supplied URDF document. If -user wants or needs a different GUI, s/he can use rqt, a Qt-based GUI development framework. It offers plug-ins for items such as: a) viewing layouts, like tabbed or split-screens; b) network graphing capabilities to visualize the robot’s nodes; c) charting capabilities for numeric values; d) data logging displays; and, e) topic (communication) monitoring.
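The transform library described in Section 2.2.2 is usually queried from code rather than from the GUI tools. A minimal, hedged Python sketch using the tf2_ros API follows; the frame names base_link and camera_link are assumptions standing in for whatever coordinate frames a particular robot defines.

# Look up the pose of one coordinate frame relative to another, once per second.
import rospy
import tf2_ros

rospy.init_node('tf_probe')
tf_buffer = tf2_ros.Buffer()
listener = tf2_ros.TransformListener(tf_buffer)  # fills the buffer from the /tf topics

rate = rospy.Rate(1.0)
while not rospy.is_shutdown():
    try:
        t = tf_buffer.lookup_transform('base_link', 'camera_link', rospy.Time(0))
        rospy.loginfo('camera_link in base_link: x=%.3f y=%.3f z=%.3f',
                      t.transform.translation.x,
                      t.transform.translation.y,
                      t.transform.translation.z)
    except (tf2_ros.LookupException,
            tf2_ros.ConnectivityException,
            tf2_ros.ExtrapolationException):
        pass  # the requested transform is not available yet
    rate.sleep()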

2.3. Ecosystem ROS benefits from a wide-ranging network of interested par- ties, including core developers, package contributors, hobbyists, researchers and for-profit ventures. Although quantifiable use metrics for ROS remain scarce, ROS does have more than 3,000 software packages available from its community of users [13], ranging from proof-of-concept algorithms to industrial-quality Fig. 2. A Simulated Robot with Many Coordinate Frames [10] software drivers. Corporate users include large organizations such as Bosch (Robert Bosch GmbH) and BMW AG, as well as smaller companies such as ClearPath Robotics, Inc. and Stanley ordinate frame positional values at sequential points in time in Innovation. University users include the Georgia Institute of a tree structure. tf’s built-in functions allow the roboticist to Technology and the University of Arizona, among others [14]. transform a particular coordinate frame’s values to same basis as a different coordinate frame’s values. As a result, the user, or 3. API the user’s program, can always calculate any coordinate frame’s relative position to any or all of the other coordinate frame posi- ROS supports robust application program interfaces, APIs, tions at any point in time. Although the first-generation library, through libraries for C++ and Python. It provides more-limited, tf, has been deprecated in favor of the second-generation one, and experimental, support for nodejs, Haskell and Mono / .NET tf2, the Foundation and ROS users still refer to the library as tf. programming languages, among others. The latter library opens up use with C# and Iron Python [15]. 2.2.3. Robot Description Language ROS describes robots in a machine-readable format using its 4. LICENSING Unified Robot Description Format, or URDF [4]. The file delineates the physical properties of the robot in XML format. URDF files The OSRF distributes the core of ROS under the standard, three- enable use of the tf library, useful visualizations of the robot and clause BSD license, hereinafter BSD-3 license. The BSD-3 license the use of the robot in simulations. belongs to a broader class of copyright licenses referred to as permissive licenses because it imposes zero restrictions on the 2.2.4. Diagnostics software’s redistribution as long as the redistribution maintains ROS’ diagnostics meta-package, i.e., a package of related pack- the license’s copyright notices and warranty disclaimers [16]. ages, "contains tools for collecting, publishing, analyzing and Other names for BSD-3 include: a) BSD-new; b) New BSD; c) viewing diagnostics data [11]." ROS’ diagnostics take advan- revised BSD; d) The BSD License, the official name used by the tage of the aforementioned message system to allow nodes to Open Source Initiative; and, e) Modified BSD License, used by publish diagnostic information to the standard diagnostic topic. the Free Software Foundation. The nodes use the diagnostic_updater and self_test packages Although the OSRF distributes the main ROS elements under to publish diagnostic information, while users can access the the BSD-3 license, it does not require package contributors or information using the rqt_robot_monitor package. ROS does end-users to adopt the same license. As a result, full-fledged not require nodes to include certain information in their respec- ROS programs may include other types of Free and Open-Source tive publications, but diagnostic publications generally provide Software [17], or FOSS, licenses. 
In addition, programs may some standard, basic information. That information may include depend on proprietary or unpublished drivers unavailable to serial numbers, software versions, unique incident IDs, etc. the broader community.

2.2.5. Command Line Interfaces (CLI) 5. USE CASES ROS provides at least 45 command line tools to the roboticist [12]. Therefore, ROS can be setup and run entirely from the command ROS’ end-markets, its use cases, include manipulator robots, line. However, the GUI interfaces remain more popular among i.e., robotic arms with grasping units; mobile robots, such as the user-base. Examples of ROS CLI tools include: a) rosmsg , autonomous, mobile platforms; autonomous cars; social robots; which allows the user to examine messages, including the data humanoid robots, unmanned / autonomous vehicles; and an structure of .msg files [5]; b) rosbag , a tool to perform various assortment of other robots [14].


Fig. 3. Estimated Amount of Data Produced by an Autonomous Automobile [18]

5.1. Automated Data Collection Platform techniques. Fetch Robotics, Inc. offers its Automated Data Collection Platform robot to the market so corporations can "[g]ather environmental 6. CONCLUSION data more frequently and more accurately for [its] Internet of Things...and Big Data Applications [19]." Fetch’s system auto- ROS offers a number of attractive features to its users, including matically collects data, such as RFID tracking (inventory man- a well-developed and standardized intra-robot communication agement) or in-store shelf surveys. The latter service began in system, modular design, vetted third-party additions and le- January 2017 when Fetch partnered with Trax Image Recogni- gitimacy via real-world applications. Although its status as tion, which makes image recognition software [20]. open source software precludes direct support from its parent organization, OSRF, for-profit organizations and the software’s 5.2. Autonomous Robots of Various Natures active community of users provide reassurances to any roboti- The applications of autonomous robots abound, especially cist worried about encountering a seemingly-insurmountable for monotonous or dangerous work. For instance, ClearPath problem. Robotics, a company that sells autonomous robots using ROS, currently highlights the following use cases: a) "geotechnical 7. ACKNOWLEDGMENT mapping of rock masses, a crucial step in predicting potentially lethal rock falls and rock bursts in and around mines [21];" I would like to thank my employer, Indiana Farm Bureau, for its minesweeping [22]; and vibration control of bridges [23]. In support of my continuing education. order to accomplish any of the aforementioned tasks, or any of the other tasks, the robots need sensors, such as lidar, high- REFERENCES definition cameras and / or sonar. One estimate places lidar and camera data production at 10MB to 70MB per second and 20MB [1] H. Boyer, “Open Source Robotics Foundation And The Robotics Fast to 40MB per second [18]. Given the multitude of addressable Track,” web page, nov 2015, accessed 19-mar-2017. [Online]. Avail- end markets, as well as the probability of multiple robots with able: https://www.osrfoundation.org/wordpress2/wp-content/uploads/ 2015/11/rft-boyer.pdf multiple sensors associated with one effort in any particular [2] Open Source Robotics Foundation, “About ROS,” Web page, end market, the data produced by a ROS robot vaults it into big mar 2017, accessed 16-mar-2017. [Online]. Available: http: data territory. A single ROS instance may never create 4,000GB //www.ros.org/about-ros/ of data per day like an autonomous consumer automobile, but [3] National Instruments, “A Layered Approach to Designing Robot a small swarm of moderately complex robots run by a three- Software,” Web page, mar 2017, accessed 18-mar-2017. [Online]. person team will likely prompt the team to avail itself of big data Available: http://www.ni.com/white-paper/13929/en/


[4] Open Source Robotics Foundation, “Why ROS?: Features: Core AUTHOR BIOGRAPHIES components,” Web page, mar 2017, accessed 17-mar-2017. [Online]. Available: http://www.ros.org/core-components/ Matthew Lawson received his BSBA, Finance in 1999 from the [5] Open Source Robotics Foundation, “ROS graph concepts: Messages,” University of Tennessee, Knoxville. His research interests in- Web page, aug 2016, accessed 18-mar-2017. [Online]. Available: clude data analysis, visualization and behavioral finance. http://wiki.ros.org/Messages [6] Open Source Robotics Foundation, “rostopic: Package Summary,” Web page, jun 2016, accessed 09-apr-2017. [Online]. Available: http://wiki.ros.org/rostopic [7] Open Source Robotics Foundation, “Services,” web page, feb 2012, accessed 18-mar-2017. [Online]. Available: http://wiki.ros.org/Services [8] Open Source Robotics Foundation, “actionlib: Package summary,” Web page, feb 2017, accessed 18-mar-2017. [Online]. Available: A. WORK BREAKDOWN http://wiki.ros.org/actionlib [9] MathWorks, Inc. The, “Control PR2 Arm Movements Using ROS Actions The work on this project was distributed as follows between the and Inverse Kinematics,” Web page, apr 2017, accessed 09-apr-2017. authors: [Online]. Available: https://goo.gl/3tZqow [10] Open Source Robotics Foundation, “tf2: Package summary,” Matthew Lawson. Matthew researched and wrote all of the ma- Web page, mar 2016, accessed 18-mar-2017. [Online]. Available: terial for this paper. http://wiki.ros.org/tf2 [11] Open Source Robotics Foundation, “diagnostics: Package Summary,” Web page, aug 2015, accessed 19-mar-2017. [Online]. Available: http://wiki.ros.org/diagnostics [12] Open Source Robotics Foundation, “ROS command-line Tools,” Web page, aug 2015, accessed 19-mar-2017. [Online]. Available: http://wiki.ros.org/ROS/CommandLineTools [13] Open Source Robotics Foundation, “Why ROS?: Is ROS for me?” Web page, mar 2017, accessed 16-mar-2017. [Online]. Available: www.ros.org/is-ros-for-me/ [14] Open Source Robotics Foundation, “Robots Using ROS,” Web page, mar 2017, accessed 19-mar-2017. [Online]. Available: http://wiki.ros.org/Robots [15] Open Source Research Foundation, “APIs,” Web page, mar 2016, accessed 19-mar-2017. [Online]. Available: http://wiki.ros.org/APIs [16] Wikipedia, “BSD licenses — Wikpedia, the free encyclopedia,” Web page, mar 2017, accessed 16-mar-2017. [Online]. Available: https://en.wikipedia.org/wiki/BSD_licenses [17] Wikipedia, “Free and open-source software – Wikipedia, the free encyclopedia,” Web page, mar 2017, accessed 18-mar-2017. [Online]. Available: https://en.wikipedia.org/wiki/Free_and_open-source_software [18] P. Nelson, “Just one autonomous car will use 4,000 GB of data/day,” Web page, dec 2016, accessed 09-apr-2017. [Online]. Available: https://goo.gl/HBL9GM [19] Fetch Robotics, Inc., “Automated Data Collection Platform,” Web page, mar 2017, accessed 19-mar-2017. [Online]. Available: http://fetchrobotics.com/automated-data-collection-platform/ [20] Retail Touch Points, “Trax and Fetch Robotics Unite for Bet- ter Retail,” Web page, jan 2017, accessed 19-jan-2017. [On- line]. Available: http://www.retailtouchpoints.com/features/news-briefs/ trax-and-fetch-robotics-unite-for-better-retail-insights [21] Clearpath Robotics Inc., “Husky performs slam on stereonets to help predict rock falls and rock bursts,” Web page, apr 2017, accessed 09-apr-2017. [Online]. 
Available: https://www.clearpathrobotics.com/ husky-queens-mining-slam-stereonets/ [22] Clearpath Robotics Inc., “Bomb-detecting husky to remove killer landmines,” Web page, apr 2017, accessed 09- apr-2017. [Online]. Available: https://www.clearpathrobotics.com/ coimbra-autonomous-demining-husky/ [23] Clearpath Robotics Inc., “Deployable, autonomous vibration control of bridges using husky ugv,” Web page, apr 2017, accessed 09-apr-2017. [Online]. Available: https://www.clearpathrobotics.com/ autonomous-vibration-control-bridges-using-husky-ugv/


Apache Crunch

SCOTT MCCLARY1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] paper-002, April 30, 2017

Apache Crunch is an Application Programming Interface (i.e. API) designed for the Java programming language. This software is built on top of Apache Hadoop as well as Apache Spark and simplifies the process of developing MapReduce pipelines. Apache Crunch abstracts away the explicit need to manage MapReduce jobs. This defining characteristic alleviates much of the steep learning curve inherent in developing scalable applications that utilize a MapReduce type approach. Therefore, developers using Apache Crunch are able to streamline the process of converting Big Data solutions into runnable code. As a result, this Java API is leveraged in industry and academia to develop efficient, scalable and maintainable codebases for Big Data solutions. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Big-Data, Cloud, Hadoop, MapReduce

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IO-3011/report.pdf

1. INTRODUCTION such as Spotify and Cerner [8, 9]. Apache Crunch is an open source Java API that is “built for 1.2. Advantages pipelining MapReduce programs which are simple and efficient;” more specifically, Crunch allows developers to write, test and As Hadoop continues to grow in popularity, the variation of run MapReduce pipelines with minimal upfront investment data (i.e. satellite images, data, audio files, and seis- [1, 2]. The minimal upfront investment of this API lowers the mograms) that is stored in HDFS grows as well [3, 10]. Many barrier for entry for Big Data developers. of these data “formats are not a natural fit for the data schemas imposed by Pig and Hive;” therefore, “large, custom libraries Apache Crunch’s purpose is to make “writing, testing, and of user-defined functions in Pig or Hive” or “MapReduces in running MapReduce pipelines easy, efficient, and even fun” [2, Java” have to be written, which significantly “drain on devel- 3]. This open source Java API provides a “small set of simple oper productivity” [2, 3, 11, 12]. The Crunch API provides an primitive operations and lightweight user-defined functions that alternative solution, which does not inhibit developer productiv- can be combined to create complex, multi-stage pipelines” [3]. ity. Apache crunch integrates seamlessly into Java and therefore, Apache Crunch abstracts away much of the complexity from the allows developers full access to Java to write functions. Thus, user by compiling “the into a sequence of MapReduce Apache Crunch is “especially useful when processing data that jobs and manages their execution” [2, 3]. does not fit naturally into relational model, such as time series, serialized object formats like or Avro records, 1.1. History and HBase rows and columns” [13–15]. Josh Wills at Cloudera was the major contributor/developer to the Crunch project in 2011 [4, 5]. The original version of this soft- 1.3. API ware was based on Google’s FlumeJava library [4–6]. From 2011 Apache Crunch is a Java API that is used “for tasks like joining until May 2012 (i.e. version 0.2.4), the Apache Crunch project and data aggregation that are tedious to implement on plain was open sourced at GitHub [4, 7]. After May 2012, the origi- MapReduce” [13]. The Apache Software Foundation provides nal Crunch source code was donated to Apache by Cloudera thorough documentation of the API for Apache Crunch and and shortly after “the Apache Board of Directors established even provides useful examples of how to explicitly leverage this the Apache Crunch project in February 2013 as a new top level API from a Java application [13]. project” [4]. Since February 2013, the Apache Crunch project con- tinues to be used, maintained and improved in an open source 1.3.1. Shell Access fashion by the software’s user and developer community. The For users of the Scala programming language, there is the user community has grown to include large reputable companies “Scrunch API, which is built on top of the Java APIs and in-

22 Review Article Spring 2017 - I524 2 cludes a REPL (read-eval-print loop) for creating MapReduce a result of using Apache Crunch allows Cerner to focus their pipelines” [13]. time, effort and money on performance tuning and/or algorithm adjustments rather than wasting a significant amount the devel- 2. LICENSING opers time simply translating a Big Data problem into runnable and efficient MapReduce code [9]. The Apache Software Foundation, which includes the software tool named Apache Crunch, is licensed under the Apache Li- 4.3. Spotify cense, Version 2.0 [16]. Spotify, the popular “music, podcast, and video streaming ser- vice” [20], leverages Apache Crunch to process the many ter- 2.1. Source Code abytes of data generated every day by their large user commu- Apache Crunch leverages Git for version control, which allows nity [8]. Spotify has been using Apache Hadoop since 2009 and the user and developer communities to contribute freely to this have spent significant effort since then to develop tools that open source project [7, 17]. make it simple for the Spotify “developers and analysts to write data processing jobs using the MapReduce approach in Python” 3. ARCHITECTURE & ECOSYSTEM [2, 8]. However, in 2013 Spotify came to the realization that this In the simplest of terms, Apache Crunch runs on top of Hadoop approach wasn’t performing well enough so they decided to MapReduce and Apache Spark [13]. Therefore, Apache Crunch start using Java and Apache Crunch to solve their Big Data prob- abstracts away the need for the programmer to explicitly manage lems [8]. This transition to Apache Crunch resulted in higher the MapReduce jobs through a Java API. However, the Apache performance, higher-level abstractions (e.g. filters, joins and Crunch’s place within the software stack (i.e. on top of Hadoop aggregations), pluggable execution engines (e.g. MapReduce MapReduce and Apache Spark) indicates its reliance on the and Apache Spark) and added simple powerful testing (e.g. fast MapReduce software subsystem. Given Apache Crunch’s de- in-memory unit tests) [8]. Apache Crunch has given Spotify a pendence on Hadoop MapReduce and Apache Spark, this API significant enhancement for both their “developer productivity provides the ability for developers use the Java programming and execution performance on Hadoop” [8]. language to efficiently and effectively leverage MapReduce style processing to solve their complicated and complex Big Data problems. 5. EDUCATIONAL MATERIAL Apache Crunch makes the process of developing applications 4. USE CASES that leverage MapReduce and Apache Spark easier; therefore, Apache Crunch has its applicability in the Cloud Computing the learning curve is much less significant in relation to devel- and Big Data industry, as shown in the following section. The oping applications that directly interact with MapReduce and widespread usage of Java, Apache Hadoop and Apache Spark Apache Spark. The Apache Software Foundation provides a in Cloud Computing help promote Apache Crunch in industry lot of useful documentation. For instance, there is API docu- and academia alike. mentation [21] as well as getting started information [22], a user guide [23] and even source code installation information [17]. If this is not enough, complete and extensive third-party code 4.1. Use Cases for Big Data examples explain how to develop “hello world” applications The Apache Hadoop ecosystem indicates that Apache Crunch is that use Apache Crunch [24]. 
built on top of Hadoop MapReduce and Apache Spark, which both go hand in hand in solving many complicated and chal- lenging Big Data problems. The following sections demonstrate 6. CONCLUSION Apache Crunch’s applicability in Big Data problems. Further- In general, Apache Crunch simplifies the process of writing and more, these use cases explains that the software facilitates the maintaining large-scale parallel codes by abstracting away the rapid and clean development of the respective Big Data solu- need to manage MapReduce jobs. This abstraction diminishes tions. The benefits realized at companies such as Cerner and the inherent learning curve in solving Big Data problems and Spotify are due in part to Apache Crunch’s well-defined appli- therefore allows developers to focus their time and effort in cability in the Big Data space. developing the general concept of their solution rather than in the detailed process of writing their code. The aforementioned 4.2. Cerner benefits of Apache Crunch are proven by its widespread use Cerner, “an American supplier of health information technol- in industry (e.g. Spotify and Cerner) and in academia. This ogy (HIT) solutions, services, devices and hardware” [18], em- software tool helps diminish the gap between domain scien- ploys Apache Crunch to solve many of their Big Data problems tists solving Big Data problems and the potentially complicated [9]. Cerner decided to use Apache Crunch since it interestingly Computer Science tools/mechanisms provided to the Cloud solves what they refer to as “a people problem” [9]. As a com- Computing/Big Data community. pany, they have noticed that Apache Crunch diminishes a po- tential steep learning curve for new employees and/or teams to ACKNOWLEDGEMENTS leverage Big Data technologies in their projects. Cerner definitively believes that Apache Crunch stands above The authors would like to thank the School of Informatics and the other “options available for processing pipelines including Computing for providing the Big Data Software and Projects Hive, Pig, and Cascading” since the Apache Crunch API al- (INFO-I524) course [25]. This paper would not have been possi- lows their employees to straightforwardly code solutions to Big ble without the technical support & edification from Gregor von Data problems [9, 11, 12, 19]. The diminished learning curve as Laszewski and his distinguished colleagues.


AUTHOR BIOGRAPHIES [15] The Apache Software Foundation, “Apache HBase – Apache HBase Home,” Web Page, apr 2017, accessed: 2017-4-9. [Online]. Available: Scott McClary received his BSc (Com- https://hbase.apache.org puter Science) and Minor (Mathematics) [16] The Apache Software Foundation, “, version 2.0,” Web Page, jan 2004, accessed: 2017-3-26. [Online]. Available: in May 2016 from Indiana University and http://apache.org/licenses/LICENSE-2.0.html will receive his MSc (Computer Science) in [17] The Apache Software Foundation, “Getting the source code,” May 2017 from Indiana University. His re- Web Page, 2013, accessed: 2017-3-26. [Online]. Available: search interests are within scientific appli- https://crunch.apache.org/source-repository.html cation performance analysis on large-scale [18] I. Wikimedia Foundation, “Cerner - Wikipedia,” Web Page, HPC systems. He will begin working as a mar 2017, accessed: 2017-3-26. [Online]. Available: https: Software Engineer with General Electric Digital in San Ramon, //en.wikipedia.org/wiki/Cerner CA in July 2017. [19] Xplenty Ltd, “Cascading | Application Platform for Enterprise Big Data,” Web Page, 2016, accessed: 2017-4-9. [Online]. Available: http://www.cascading.org WORK BREAKDOWN [20] I. Wikimedia Foundation, “Spotify - Wikipedia,” Web Page, mar 2017, accessed: 2017-3-26. [Online]. Available: https: The work on this project was distributed as follows between the //en.wikipedia.org/wiki/Spotify authors: [21] The Apache Software Foundation, “Apache crunch 0.15.0 ,” Web Page, 2017, accessed: 2017-3-26. [Online]. Available: Scott McClary. He completed all of the work for this paper in- https://crunch.apache.org/apidocs/0.15.0/ cluding researching and testing as well as [22] The Apache Software Foundation, “Apache Crunch - Getting composing this technology paper. Started,” Web Page, 2013, accessed: 2017-3-26. [Online]. Available: https://crunch.apache.org/getting-started.html REFERENCES [23] The Apache Software Foundation, “Apache Crunch - Apache Crunch User Guide,” Web Page, 2013, accessed: 2017-3-26. [Online]. Available: [1] Edupristine, “Hadoop ecosystem and its components,” Web Page, apr https://crunch.apache.org/user-guide.html 2015, accessed: 2017-3-26. [Online]. Available: http://www.edupristine. [24] N. Asokan, “Learn Apache Crunch,” Blog, Mar 2015, accessed: com/blog/hadoop-ecosystem-and-components 2017-3-26. [Online]. Available: http://crunch-tutor.blogspot.com [2] I. Wikimedia Foundation, “MapReduce - Wikipedia,” Web Page, apr [25] Gregor von Laszewski and Badi Abdul-Wahid, “Big Data Classes,” 2017, accessed: 2017-4-9. [Online]. Available: https://en.wikipedia.org/ Web Page, Indiana University, Jan. 2017. [Online]. Available: wiki/MapReduce https://cloudmesh.github.io/classes/ [3] J. Wills, “Introducing crunch: Easy pipelines for apache hadoop,” Blog, oct 2011, accessed: 2017-3-26. [Online]. Available: http://blog.cloudera.com/blog/2011/10/introducing-crunch/ [4] The Apache Software Foundation, “Apache Crunch - About,” Web Page, 2013, accessed: 2017-3-26. [Online]. Available: https://crunch.apache.org/about.html [5] Cloudera, Inc., “Big data | | Analytics | Cloudera,” Web Page, 2017, accessed: 2017-4-09. [Online]. Available: https://www.cloudera.com [6] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Robert R. Henry, R. Bradshaw, and N. Weizenbaum, “FlumeJava: Easy, Efficient Data-Parallel Pipelines,” in 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’10. 
Toronto, Ontario, Canada: ACM, 2010, pp. 363–375. [Online]. Available: http://doi.acm.org/10.1145/2609441.2609638 [7] GitHub, Inc., “GitHub,” Web Page, 2017, accessed: 2017-4-09. [Online]. Available: https://github.com [8] J. Kestelyn, “Data processing with apache crunch at spotify,” Blog, feb 2015, accessed: 2017-3-26. [Online]. Available: http://blog.cloudera. com/blog/2015/02/data-processing-with-apache-crunch-at-spotify/ [9] M. Whitacre, “Scaling people with apache crunch,” Blog, may 2014, accessed: 2017-3-26. [Online]. Available: http://engineering.cerner.com/ blog/scaling-people-with-apache-crunch/ [10] The Apache Software Foundation, “Welcome to Apache Hadoop!” Web Page, mar 2017, accessed: 2017-4-9. [Online]. Available: http://hadoop.apache.org [11] The Apache Software Foundation, “Welcome to !” Web Page, jun 2016, accessed: 2017-4-9. [Online]. Available: https://pig.apache.org [12] The Apache Software Foundation, “Apache Hive TM,” Web Page, 2014, accessed: 2017-4-9. [Online]. Available: https://hive.apache.org [13] The Apache Software Foundation, “Apache Crunch Simple and Efficient MapReduce Pipelines,” Web Page, 2013, accessed: 2017-3-26. [Online]. Available: https://crunch.apache.org [14] The Apache Software Foundation, “Welcome to Apache Avro!” Web Page, may 2016, accessed: 2017-4-9. [Online]. Available: https://avro.apache.org


Apache MRQL - MapReduce Query Language

MARK MCCOMBE1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] project-000, April 25, 2017

Apache Map Reduce Query Language (MRQL) is a project currently in the Apache Incubator. MRQL runs on Apache Hadoop, Hama, Spark, and Flink. An overview of each dependent technology is provided along with an outline of its architecture. MRQL and MapReduce are introduced, along with a look at the MRQL language. Alternative technologies are discussed, with Apache Hive and Pig highlighted. Finally, use cases for MRQL and resources for learning more about the project are presented. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: MRQL, MapReduce, Apache, Hadoop, Hama, Spark, Flink, I524

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-IO-3012/report.pdf

INTRODUCTION otherwise referred to as nodes. Storage can be in the HDFS or an alternative storage medium such as Amazon Web Service’s Apache MapReduce Query Language (MRQL) "is a query pro- Simple Storage Service (S3). HDFS federation is the framework cessing and optimization system for large-scale, distributed data responsible for this storage layer. YARN Infrastructure pro- analysis" [1]. MRQL provides a SQL like language for use on vides computational resources such as CPU and memory. The Apache Hadoop, Hama, Spark, and Flink. MapReduce Query MapReduce layer is responsible for implementing MapReduce Language allows users to perform complex data analysis using [5]. Additionally, Hadoop includes the Hadoop Common Pack- only SQL like queries, which are translated by MRQL into ef- age (not shown in Figure 1), which includes operating and file ficient Java code, removing the burden of writing MapReduce system abstractions and JAR files needed to start Hadoop [4]. code directly. MRQL can evaluate queries in Map-Reduce (using Hadoop), Bulk Synchronous Parallel (using Hama), Spark, and Flink modes [1]. MRQL was created in 2011 by Leaonidas Fegaras [2] and is currently in the Apache Incubator. All projects accepted by the Apache Software Foundation (ASF) undergo an incubation pe- riod until a review indicates that the project meets the standards of other ASF projects [3]. MRQL is pronounced "miracle" [1].

ARCHITECTURE MRQL can run on top of Apache Hadoop, , Apache Spark, and Apache Flink clusters. The architecture of each technology is discussed in the following sections along with the architecture of MRQL itself.

Apache Hadoop Apache Hadoop is an open source framework written in Java Fig. 1. Hadoop Components [5] that utilizes distributed storage and the MapReduce program- ming model for processing of big data. Hadoop utilizes com- modity hardware to build fault tolerant clusters [4]. Figure 2 depicts a simple multi-node Hadoop cluster. The As show in Figure 1, Hadoop consists of several building cluster contains master and slave nodes. The master node con- blocks: the Cluster, Storage, Hadoop Distributed File System tains a DataNode, a NameNode, a Job Tracker, and a Task (HDFS) Federation, Yarn Infrastructure, and the MapReduce Tracker. The slave node functions as both a Task Tracker and a Framework. The Cluster is comprised of multiple machines, Data Node [5].


framework providing parallelism and [8]. Figure 4 shows the various components of Apache Spark. Spark Core is central to Spark’s architecture and provides APIs, memory management, fault recovery, storage capabilities, and other features. On top of Spark core are several packages. Spark SQL allows the use of SQL for working with structured data. Spark Streaming provides real time streaming capabilities. ML- lib and GraphX allow the use of machine learning and graph algorithms [9].

Fig. 2. Hadoop Cluster [4]

Apache Hama Apache Hama is a top level project in the Apache Software stack Fig. 4. Spark Architecture [9] developed by Edward J Yoon. Hama utilizes Bulk Synchronous Parallel (BSP) to allow massive scientific computation [6]. Figure 3 details the architecture of Apache Hama. Three key components are BSPMaster, ZooKeeper, and GroomServer. Apache Flink BSPMaster has several responsibilities including maintaining Apache Flink is open source software developed by the Apache and scheduling jobs and communicating with the groom servers. Software Foundation. Flink features a "distributed streaming Groom Servers are processes that perform tasks assigned by dataflow engine written in Java and Scala" [10]. Both batch BCPMaster. Zookeeper efficiently manages synchronization of and streaming processing programs can be executed on Flink. BSPPeers, instances started by groom servers [6]. Programs in Flink can be written in Java, Scala, Python, and SQL which are compiled and optimized and then executed on a Flink cluster. Flink does not have an internal storage system and instead provides connectors to use with external sources like Amazon S3, Apache Kafka, HDFS, or [10]. Flink’s full architechure is shown in Figure 5.

Fig. 5. Flink Architecture [11]

Fig. 3. Hama Architecture [7] MRQL MRQL itself is made up of a core module, which includes data formats and structures used by MRQL. Supporting modules, Apache Spark which sit on top of the core module, extend its functionalily [12]. Apache Spark is open source software, originally developed at MRQL uses a multi-step process to convert SQL statements to Jar the University of California at Berkeley and later donated to files that are then deployed into the cluster. SQL statements are the Apache Software Foundation. Spark is a cluster computing first changed into MRQL algebraic form. Next is a type interface

26 Review Article Spring 2017 - I524 3 step before the query is translated and normalized. A plan is while MRQL does not utilize metadata. MRQL allows the use of generated, simplified, normalized, and compiled into java btye Group By on arbitrary queries. Hive does not allow the use of code in the final jar file [12]. Group By on subqueries. MRQL runs on Hadoop, Hama, Spark, and Flink while Hive works on Hadoop, Tez, and Spark. MRQL MAPREDUCE is compatible with text, sequence, XML, and JSON file formats. Hive is compatible with test, sequence, ORC, and RCFile formats. The MapReduce programing model was introduced by Google MRQL allows iteration; Hive does not. MRQL allows streaming; in 2004. The MapReduce model involves a map and reduce func- Hive does not [18]. tion. The map function processes key/value pairs to generate a new list of key/value pairs. The reduce function merges these Pig Latin - Apache Pig new values by key [13]. MapReduce Query Language is built Apache Pig was developed at Yahoo in 2006. Like MRQL and upon the MapReduce model. Hive, Pig abstracts programming from Java MapReduce to a simpler format. Pig’s programming language is called Pig Latin. LANGUAGE FEATURES As opposed to declarative languages where the programmer The MRQL language supports data types, functions, aggrega- specifies what will be done like SQL, Hive, and MRQL, Pig Latin tion, and a SQL like syntax. is a procedural language where the programmer specifies how the task will be accomplished [19]. Data Model MRQL is a typed language with several data types. Basic types Ecosystem (bool, short, int, long, float, double, string), tuples, records, lists, MRQL is part of the vast Apache ecosystem often used when bags, user-defined types, a data type T, and persistent collections working with Big Data. Other technologies in the Apache Big are all supported by MRQL [14]. Data stack closely related to MRQL are Hadoop, Hama, Spark, Flink, and HDFS. Data Sources MRQL can access flat files (such as CSV) and XML and JSON USE CASES documents [14]. Due to MRQL’s relatively recent development and current status in the Apache incubator, real world use cases are difficult to find. Syntax Since MRQL can be utilized with Hadoop, Hama, Spark, and MRQL supports a SQL like syntax. MRQL includes group by Flink, it can be utilized in a wide variety of situations. MRQL can and order by clauses, nested queries, and three types of repeti- be used for complex data analysis including PageRank, matrix tion [14]. factorization, and k-means clustering [1]. select [ distinct ] e from p1 in e1, ..., pn in en EDUCATIONAL MATERIAL [ where ec ] [ group by p’: e’ [ having eh ] ] Several excellent resources exist for learning more about MRQL. [ order by e0 [ limit el ] ] The Apache Wiki contains several research papers and presen- MRQL Select Query Syntax [14] tations on MRQL by its creator Leonidas Fegaras and others [20] which provide a theoretical bases for understanding MRQL. Functions and Aggregations Key resources are Apache MRQL (incubating): Advanced Query In addition to predefined system functions, MRQL supports user Processing for Complex, Large-Scale Data Analysis by Leonidas Fe- defined functions and aggregations [14]. garas [18], Supporting Bulk Synchronous Parallelism in Map-Reduce Queries by Leonidas Fegaras [21], and An Optimization Framework for Map-Reduce Queries by Leonidas Fegaras, Chengkai Li, and LICENSING Upa Gupta [22]. 
MRQL is open source software licensed under the Apache 2.0 The Apache Wiki also contains a Getting Started page [23] [15]. which describes steps such as downloading and installing MRQL, a detailed Language Description [14], and a listing of ALTERNATIVE TECHNOLOGIES System Functions [24]. These are the best resources for hands on use of MRQL. There are several existing MapReduce Query Languages that As MRQL is an open source technology, the source code it are alternatives to MRQL. Apache Hive (HiveQL), Apache Pig freely available. It is stored in github [25]. (Pig Latin), and JAQL, a JSON based query language originally developed by Google [16], are three examples. Hive and Pig, popular Apache technologies, are explored in more detail. CONCLUSION MapReduce Query Language simplifies the popular MapReduce HiveQL - Apache Hive programing model frequently utilized with Big Data. MRQL is Apache Hive allows the use of the use of SQL (via HiveQL) to powerful enough to express complex data analysis including access and modify datasets in distributed storage that integrates PageRank, matrix factorization, and k-means clustering, yet due with Hadoop. Hive also provides a procedural language, HPL- to its familiar SQL like syntax can be utilized by a wider variety SQL [17]. of users without strong programming skills. Comparing MRQL to Hive, we find several differences. Hive MRQL can be used with Apache Hadoop, Hama, Spark, and stores metadata in a Relational Database Management System Flink. With Hadoop now a mainstream technology and Hama,


Spark, and Flink important pieces of the Apache Big Data stack, accessed 2017-03-14. [Online]. Available: https://github.com/apache/ potential applications of MRQL are wide ranging. incubator-mrql/blob/master/LICENSE The declarative MRQL language is easy to use, yet powerful, [19] Wikipedia, “Apache pig (programming tool),” Web Page, Aug. 2016, featuring multiple data types, data sources, functions, aggre- accessed 2017-03-22. [Online]. Available: https://en.wikipedia.org/wiki/ Pig_(programming_tool) gations, and a SQL like syntax, including order by, group by, [20] L. Fegaras, “Publications,” Web Page, Apr. 2015, accessed 2017-03-24. nested queries, and repetition. [Online]. Available: https://wiki.apache.org/mrql/Publications While currently still an incubator project at Apache, MRQL [21] L. Fegaras, “Supporting bulk synchronous parallelism in map- shows promise as a rich, easy to use language with the flexibility reduce queries,” in Proceedings of the 2012 SC Companion: High of working with Hadoop, Hama, Spark, and Flink. Due to these Performance Computing, Networking Storage and Analysis, ser. SCC strengths, MRQL may emerge as a viable alternative to currently ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. more popular high level MapReduce abstractions such as Hive 1068–1077. [Online]. Available: http://lambda.uta.edu/mrql-bsp.pdf and Pig. [22] L. Fegaras, C. Li, and U. Gupta, “An optimization framework for map-reduce queries,” in Proceedings of the 15th International Conference on Extending Database Technology, ser. EDBT ’12. New REFERENCES York, NY, USA: ACM, 2012, pp. 26–37. [Online]. Available: http://lambda.uta.edu/mrql.pdf [1] Apache Software Foundation, “Apache incubator,” Web Page, accessed [23] L. Fegaras, “Getting started with mrql,” Web Page, Aug. 2016, 2017-01-29. [Online]. Available: http://incubator.apache.org/ accessed 2017-03-24. [Online]. Available: https://wiki.apache.org/mrql/ [2] E. J. Yoon, “Mrql - a on hadoop miracle,” Web Page, Jan. 2017, GettingStarted accessed 2017-01-29. [Online]. Available: http://www.hadoopsphere. [24] L. Fegaras, “The mrql system functions,” Web Page, Oct. 2016, com/2013/04/mrql-sql-on-hadoop-miracle.html accessed 2017-03-24. [Online]. Available: https://wiki.apache.org/mrql/ [3] Apache Software Foundation, “Apache mrql,” Web Page, Apr. 2016, SystemFunctions accessed 2017-01-29. [Online]. Available: https://mrql.incubator.apache. [25] Apache Software Foundation, “Apache mrql,” Code Repository, org/ accessed 2017-03-25. [Online]. Available: https://git-wip-us.apache.org/ [4] Wikipedia, “Apache hadoop,” Web Page, Mar. 2017, accessed 2017-03- repos/asf?p=incubator-mrql.git 18. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Hadoop [5] E. Coppa, “Hadoop architecture overview,” Code Repository, accessed 2017-03-23. [Online]. Available: http://ercoppa.github.io/ AUTHOR BIOGRAPHY HadoopInternals/HadoopArchitectureOverview.html [6] Wikipedia, “Apache hama,” Web Page, Dec. 2014, accessed 2017-03-20. Mark McCombe received his B.S. (Business Administra- [Online]. Available: https://en.wikipedia.org/wiki/Apache_Hama tion/Finance) and M.S. (Computer Information Systems) from [7] K. Siddique, Y. Kim, and Z. Akhtar, “Researching apache hama: A Boston University. He is currently studying Data Science at pure bsp computing framework,” vol. 2, no. 2. World Academy of Indiana University Bloomington. Science, Engineering and Technology, 2015, p. 1857. [Online]. 
Available: http://waset.org/abstracts/Computer-and-Information-Engineering [8] Wikipedia, “Apache spark,” Web Page, Feb. 2017, accessed 2017-03-20. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Spark [9] P. Pandy, “Spark programming - rise of spark,” Web Page, Aug. 2015, accessed 2017-03-22. [Online]. Available: http://www.teckstory.com/ hadoop-ecosystem/rise-of-spark/ [10] Wikipedia, “Apache flink,” Web Page, Mar. 2017, accessed 2017-03-20. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Flink [11] Apache Software Foundation, “General architecture and process mode,” Web Page, accessed 2017-03-18. [Online]. Available: http://spark.apache.org/docs/1.3.0/cluster-overview.html [12] A. PAUDEL, “Integration of apache mrql query language with apache storm realtime computational system,” Master’s thesis, The University of Texas at Arlington, Dec. 2016. [Online]. Available: https://uta-ir.tdl.org/uta-ir/bitstream/handle/10106/26371/ PAUDEL-THESIS-2016.pdf?sequence=1 [13] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” in Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6, ser. OSDI’04. Berkeley, CA, USA: USENIX Association, 2004, pp. 10–10. [Online]. Available: http://dl.acm.org/citation.cfm?id=1251254.1251264 [14] L. Fegaras, “Mrql: A mapreduce query language,” Web Page, Jul. 2014, accessed 2017-03-24. [Online]. Available: https://wiki.apache.org/ mrql/LanguageDescription [15] Apache Software Foundation, “License,” Code Repository, Apr. 2013, accessed 2017-03-21. [Online]. Available: https://github.com/apache/ incubator-mrql/blob/master/LICENSE [16] Wikipedia, “Jaql,” Web Page, May 2016, accessed 2017-03-20. [Online]. Available: https://en.wikipedia.org/wiki/Jaql [17] Apache Software Foundation, “Apache hive home,” Web Page, Mar. 2017, accessed 2017-03-20. [Online]. Available: https: //cwiki.apache.org/confluence/display/Hive/Home [18] L. Fegaras, “Apache mrql (incubating): Advanced query processing for comples, large-scale data analysis,” Web Page, Apr. 2015,


Lighting Memory-Mapped Database (LMDB)

LEONARD MWANGI1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] paper 2 April 30, 2017

LMDB is one of the most recent in-memory key-value databases; it has shown performance not matched by many other key-value databases and a unique capability to manage and reclaim unused space. We examine the database, its architecture, features and use cases that make it ideal for self-contained applications in need of a local small-footprint database. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524, LMDB, BDB, B+tree

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-IO-3013/report.pdf

1. INTRODUCTION

Lightning Memory-Mapped Database (LMDB) is a high-performance transactional non-relational database that stores data in the form of a key-value store [1] while using memory-mapped file capabilities to increase I/O performance. Designed to solve the multiple layers of configuration and caching issues inherent in the Berkeley DB (BDB) design [2], LMDB fixes caching problems with a single, automatically managed cache which is controlled by the host operating system [3]. The database footprint is small enough to fit in an L1 cache [4] and it is designed to always append data at the end of the database (MDB_APPEND), so there is never a need to overwrite data or do garbage collection. This design protects the database from corruption in case of a system crash and avoids the need for an intensive recovery process.

LMDB is built on a combination of technologies dating back to the 1960s. The memory-mapped file, or persistent object, concept was first introduced by Multics in the mid 1960s as single-level storage (SLS) [5], which removed the distinction between files and process memory; this technology is incorporated in the LMDB architecture. LMDB also utilizes a B+tree architecture, which provides a self-balancing tree data structure that makes it easy to search, insert and delete data due to its sorting algorithm. LMDB specifically utilizes the append-only B+tree code written by Martin Hedenfalk [6] for the OpenBSD ldapd implementation. The architecture is also modeled after the BDB API and perceived as its replacement.

2. ARCHITECTURE

The following section provides an overview of the LMDB architecture and the technologies involved.

2.1. Language

LMDB is written in C, but the API supports multiple programming languages: C and C++ are supported directly, while other languages like Python, Ruby and Erlang, amongst others, are supported using wrappers [7].

2.2. B+tree

The append-only B+tree architecture ensures that the data is never overwritten; every time a modification is required, a copy is made and the changes are made to the copy, which is in turn written to new disk space. To reclaim the freed space, LMDB uses a second B+tree which keeps track of pages that have been freed, thus keeping the database size fairly constant. This is a major architectural advantage compared to other B+tree based databases. Figure 1 depicts a regular B+tree architecture, while Figure 2 depicts LMDB's B+tree architecture.

Fig. 1. For every new leaf-node created, a root-node is created

2.3. Transaction

The database is architected to support a single writer and many readers per transaction [8], ensuring no lock contention within the database. No memory allocation (malloc) or memory-to-memory copies (memcpy) are required by the database, ensuring that locks do not occur during database operations.
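As a rough illustration of the single-writer, many-readers design, the following sketch uses the Python lmdb binding whose installation is covered in Section 3; the database path and the key are made up for the example. A read transaction keeps seeing the snapshot it started with, even while a write transaction commits a new value, and the writer is never blocked by the open reader.

# Demonstrate snapshot isolation with the py-lmdb binding.
import lmdb

env = lmdb.open('/tmp/lmdb-mvcc-demo', map_size=10 * 1024 * 1024)

with env.begin(write=True) as txn:
    txn.put(b'counter', b'1')

reader = env.begin()                 # read-only snapshot taken here

with env.begin(write=True) as txn:   # the single writer proceeds despite the reader
    txn.put(b'counter', b'2')

print(reader.get(b'counter'))        # b'1', the reader's snapshot value
reader.abort()

with env.begin() as txn:             # a fresh read transaction sees the new value
    print(txn.get(b'counter'))       # b'2'

env.close()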


Fig. 2. LMDB uses two root-nodes for valid leaf-nodes as shown in the diagram above. Unused leaf-nodes do not have root-nodes tied to them, which signifies unused pages that can be reclaimed.

3. IMPLEMENTATION

LMDB implementation is supported by both Windows and Unix operating systems. It is provided in two variants, which are automatically selected during installation: the CFFI variant supports PyPy and CPython version 2.6 or greater, and the C extension supports CPython versions 2.5 through 2.7 and version 3.3 and greater. For ease of adoption, both variants have the same interface. This installation guide focuses on the Python based implementation for both operating systems. At the moment 32-bit and 64-bit binaries are provided for Python 2.7; future binaries are expected to support all versions of Python.

3.1. Installation

Installation and configuration process for LMDB:

#Unix based systems
pip install lmdb
#or
easy_install lmdb
#Create and open an environment
mdb_env_create()
mdb_env_open() #defines the directory path, which must exist before being used.
#The MDB_NOSUBDIR option can be used if no directory path needs to be passed.
#Create transactions after opening the environment
mdb_txn_begin()
#Open a database for the transaction
mdb_dbi_open() #a NULL value can be passed if only a single database will be used in the environment.
#The MDB_CREATE flag must be specified to create a named database.
#mdb_env_set_maxdbs() defines the maximum number of databases the environment will support.
#mdb_get() and mdb_put() can be used to retrieve and store single key/value pairs; if more operations are needed then cursors are required.
#mdb_txn_commit() is required to commit the data in the environment.
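Once the package is installed, equivalent steps are available from Python through the lmdb binding. The sketch below is a minimal example; the database path, the named database and the keys are chosen only for illustration.

# Open an environment, write a few key/value pairs in one transaction,
# then read them back and iterate over the keys with a cursor.
import lmdb

env = lmdb.open('/tmp/lmdb-quickstart', map_size=1 << 20, max_dbs=1)
db = env.open_db(b'example')                 # named database within the environment

with env.begin(db=db, write=True) as txn:
    txn.put(b'apple', b'1')
    txn.put(b'banana', b'2')

with env.begin(db=db) as txn:
    print(txn.get(b'apple'))                 # b'1'
    cur = txn.cursor()
    for key, value in cur:                   # keys come back in sorted order
        print(key, value)

env.close()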
4. FEATURES

LMDB has the following features that are not found in regular key/value store databases:

1. Explicit Key Types: key comparisons in reverse byte order as well as native binary integer comparisons are supported; this goes beyond the regular key/value string comparison and memory-to-memory copy.

2. Append Mode: useful when bulk loading data that is already in sorted format; otherwise it would lead to data corruption.

3. Fixed Mapping: data can be stored at a fixed memory address, where applications connecting to the database will see all the objects at the same address.

4. Hot Backups: since LMDB uses MVCC, a hot backup can be carried out while the database is live, because transactions are self-consistent from beginning to end.

5. LICENSING

LMDB is licensed under the OpenLDAP public license, a BSD style license.

6. ALTERNATIVES

There are several alternatives to LMDB; focusing on key-value databases, some of the leading alternatives include the following:

6.1. SQLite3

One of the alternatives to LMDB, an in-process database designed to be embedded in applications rather than having its own server. The main point of comparison with LMDB is that both are in-memory databases, but SQLite3 is also a relational database with structured data in the form of tables, whereas LMDB is not a relational database [9].

6.2. Oracle Berkeley DB (BDB)

A key-value database provided in the form of software libraries that offers an alternative to LMDB. LMDB is actually molded from BDB, thus the structure and architecture are almost the same. BDB is considered to be a fast, flexible, reliable and scalable database and has the ability to access both keyed and sequential data. BDB does not require a stand-alone server and is directly integrated into the application [2].

6.3. Google LevelDB

LevelDB is an on-disk key-value open source database developed by Google fellows and inspired by the memtable/SSTable architecture. One of the main differences from LMDB is that LevelDB utilizes disk for storage rather than memory, which can have a negative impact on performance due to disk seeks [10]. Also, being an on-disk database, it is susceptible to corruption, especially when there is a system crash. LevelDB is licensed under the BSD licensing model [11].


6.4. Kyoto Cabinet

Provided in the form of libraries, Kyoto Cabinet is an on-disk key-value database with a B+tree and hash table architecture. Kyoto Cabinet is written in C++ and has APIs for other languages [12]. Disk management has been identified as an issue for Kyoto Cabinet [13]. It is also licensed under the General Public License.

7. USE CASE

7.1. NoSQL Data Stores

Due to its small footprint and great performance, LMDB is an ideal candidate for NoSQL data stores. Such data stores are widely used for caching, queueing, task distribution and pub/sub to enhance performance. Being an in-memory database, LMDB is a great candidate for these stores and is currently being used as a data store [13].

7.2. Mobile Phone Database

In the wake of smart phones, the need for mobile applications to collect and store data has grown exponentially. Most mobile applications store this data in the cloud, requiring them to connect back and forth in order to serve users' requests. Having a local, lightweight, self-contained database on the device has become an attractive solution to enhance user experience with performance and effectiveness. LMDB fills this gap due to its small footprint, performance and ability to reuse storage, making it a favorable mobile phone database.

7.3. Postfix Mail Transfer Agent (MTA) Database
Postfix is an SMTP server that provides a first layer of defense against spambots and malware for end users. Having the ability to track and keep a history of the scans and identified threats at a fast rate is paramount for the success of Postfix. Postfix uses an LMDB adapter to provide access to lookup tables and make data available to the application reliably [14].

8. CONCLUSION
A self-managing, small-footprint and well-performing database is critical to many client-side applications. These applications are better served when they can offload storage management to the database library and still gain great performance. LMDB has shown these capabilities by managing how data is written and by reclaiming unused storage. Its great performance and small footprint position it well for many of these applications. In comparison to other databases in its class, LMDB benchmarks [15] show its capability to be a leading self-contained application database. Also, being available under the OpenLDAP public license and having wrappers for different languages makes it easy to port to existing applications.

ACKNOWLEDGEMENTS This research was done as part of course “I524: Big Data and Open Source Software Projects” at Indiana University. I thank Professor Gregor von Laszewski and associate instructors for their support throughout the course.

REFERENCES
[1] Aerospike, "What is a key-value store?" WebPage. [Online]. Available: http://www.aerospike.com/what-is-a-key-value-store
[2] K. B. Margo Seltzer, "Berkeley db," WebPage, Aug. 2012. [Online]. Available: http://www.aosabook.org/en/bdb.html
[3] H. Chu, "Lightning memory-mapped database," WebPage, Nov. 2011. [Online]. Available: https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database
[4] WhatIs, "L1 and l2," WebPage, Apr. 2005. [Online]. Available: http://whatis.techtarget.com/definition/L1-and-L2
[5] Wikipedia, "Multics," WebPage, Feb. 2017. [Online]. Available: https://en.wikipedia.org/wiki/Multics
[6] J. Mayo, "btest.c," WebPage, Jan. 2012, originally written by Martin Hedenfalk. [Online]. Available: https://github.com/OrangeTide/btree/blob/master/btest.c
[7] Symas Corporation, "Wrappers for other languages," WebPage. [Online]. Available: https://symas.com/offerings/lightning-memory-mapped-database/wrappers/
[8] H. Chu, "Lightning memory-mapped database manager (lmdb)," Dec. 2015. [Online]. Available: http://www.lmdb.tech/doc/
[9] D. Hellmann, "sqlite3 – embedded relational database," WebPage, Jan. 2017. [Online]. Available: https://pymotw.com/2/sqlite3/
[10] Basho Technologies, Inc., "Leveldb," WebPage. [Online]. Available: http://docs.basho.com/riak/kv/2.2.1/setup/planning/backend//
[11] Wikipedia, "Leveldb," WebPage, Mar. 2017. [Online]. Available: https://en.wikipedia.org/wiki/LevelDB
[12] fallabs, "Kyoto cabinet: a straightforward implementation of dbm," WebPage, Mar. 2011. [Online]. Available: http://fallabs.com/kyotocabinet/
[13] B. Desmond, "Second strike with lightning!" WebPage, May 2013. [Online]. Available: http://www.anchor.com.au/blog/2013/05/second-strike-with-lightning/
[14] H. Chu, "Postfix lmdb adapter," WebPage. [Online]. Available: http://www.postfix.org/lmdb_table.5.html
[15] Symas Corp, "Database microbenchmarks," WebPage, Sep. 2012. [Online]. Available: http://www.lmdb.tech/bench/microbench/


SciDB: An Array Database

PIYUSH RAI1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] +HID - S17-IO-3014

April 30, 2017

SciDB is an open-source array database developed and maintained by Paradigm4. This paper explores its architectural overview and the features designed to optimize array data processing. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Array Database, SciDB

https://github.com/piyurai/sp17-i524/tree/master/paper2/S17-IO-3014/report.pdf

1. INTRODUCTION
SciDB is an open source DBMS based on a multi-dimensional array data model that runs on the Linux platform. The data store is optimized for mathematical operations such as linear algebra and statistical analysis. The data is organized into arrays which can be stored in different forms, including vectors and matrices, the general form being n-dimensional arrays. It can be used to analyze scientific data from various data acquisition devices like sensors and wearables [1]. Array databases are optimized for the homogeneous data models found in fields such as IoT and earth and space observation.

2. ARCHITECTURAL OVERVIEW
Arrays and their dimensions are named. The data tuple stored in each cell contains values for one or more attributes, which are named and typed. Operations such as filtering and joining arrays, and aggregation over the cell values, are supported. Arrays can be partitioned and per-partition values such as sums can be computed. Operations on large arrays can be performed without requiring them to fit in memory. SciDB also has 'missing data' facilities to reduce noise and impute missing values in the data [1].
The arrays are divided into chunks and partitioned across the nodes in the cluster, with provision for caching some of them in main memory. SciDB uses a shared-nothing architecture. Each node in the cluster has its own resources such as memory, storage and CPU. Nodes can be added to or removed from the cluster without stopping the system. SciDB is ACID compliant. It uses Multiversion Concurrency Control (MVCC) to retain old values of data and maintains a log of the changes in database state over time. In case the data history is not required, a facility for purging the redundant data is also available.
Scalability and parallelism are the core requirements behind the design philosophy. Multiple machines connected over a network cluster run server software instances. Disjoint subsets of data are stored locally on each machine. The input query is divided into smaller components which each instance applies to its locally stored data. The information regarding which data is stored at which node is maintained using Multidimensional Array Clustering (MAC) [2]. Data values close to each other are stored in the same chunk of storage media for faster access in range selects, which helps to minimize data movement among nodes. An optional user-settable chunk overlap can also be employed, which enhances the performance of window queries that straddle two chunks. SciDB maintains a liveness list of functioning instances which is updated as nodes are added to or removed from the cluster.
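The chunked, shared-nothing layout described above can be pictured with a small conceptual sketch. This is not SciDB code; it only mimics the idea that an n-dimensional array is cut into chunks, neighbouring cells land in the same chunk, and chunks are spread across instances so that a range select touches few of them. The chunk size and instance names are made up.

import numpy as np

CHUNK = 4                                        # hypothetical chunk edge length
array = np.arange(16 * 16).reshape(16, 16)       # a small 2-D array
instances = ["instance-0", "instance-1", "instance-2", "instance-3"]

# Assign each chunk (identified by its chunk coordinates) to an instance.
placement = {}
for i in range(0, array.shape[0], CHUNK):
    for j in range(0, array.shape[1], CHUNK):
        cid = (i // CHUNK, j // CHUNK)
        placement[cid] = instances[(cid[0] * 4 + cid[1]) % len(instances)]

# A range select over rows 0-3 and columns 0-7 only touches two chunks,
# so only two instances need to scan local data.
touched = {placement[(0, 0)], placement[(0, 1)]}
print(touched)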
3. EXTENSIBILITY
To cover different applications for scientific computation, the existing computing functionality can be extended by embedding user-defined functions and algorithms into the system. SciDB uses registries to maintain information about the data types and functions available in the system. This information is used to break down a query into a sequence of operations. A number of client APIs are provided so that the system can be interfaced with tools like R and Python. With the SciDB-R and SciDB-Py libraries, data in SciDB can be referenced as data frame objects. Bulk and parallel loading is supported for various formats such as binary and text.

4. ALTERNATIVES
SciDB tries to overcome the problem of scalability with traditional RDBMSs and the lack of ACID support in NoSQL databases [3]. There are a few other array databases, such as the Rasdaman database. Rasdaman is one of the oldest array databases to be rolled out [4]. It is open source and its source code is easily available. It is also available in the form of a VM image. However, its source code is divided into two communities, one open source and the other enterprise. SciDB provides Amazon Machine Images to try out the database. The SciDB project also started as an open source project; however, instructions and source code for installing SciDB are available at [5], which can be accessed only after registration. SciDB has in-depth documentation available explaining its core features and design.

5. CONCLUSION
SciDB is designed for scalability and performance while at the same time complying with ACID properties. It also employs multiple methodologies to enhance the performance of queries by dividing the data across multiple nodes and by reducing the data flow among nodes. It also provides the facility to embed user-defined functions and algorithms into the system, so that a large variety of big data applications requiring scientific computation on homogeneous collections of data can benefit from it. SciDB is developed and supported by Paradigm4.

REFERENCES [1] Paradigm4, “Architectural summary and motivation for paradigm4’s scidb,” Paradigm4, Whitepaper. [Online]. Available: http://discover. paradigm4.com/scidb-database-for-21st-century.html [2] “Multidimensional array clustering,” Web Page, 2017. [3] J. Vaughan, “High programming overhead for ,” blog, dec 2011. [Online]. Available: http: //searchmicroservices.techtarget.com/blog/Microservices-Matters/ Stonebraker-sees-high-programming-overhead-for-NoSQL [4] “Rasdaman - impact and uses,” Web Page. [5] A. Poliakov, “Index of scidb releases,” Web Page, mar 2015. [Online]. Available: http://forum.paradigm4.com/t/index-of-scidb-releases/773


Cassandra

SABYASACHI ROY CHOUDHURY1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] project-000, April 30, 2017

Apache Cassandra is a 'NoSQL' database meant to handle a large volume of data through the use of commodity hardware. In this paper we examine Cassandra by understanding its architecture and internal data flows. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IO-3015/report.pdf

1. INTRODUCTION
Apache Cassandra is an open source column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. It was developed by Facebook to handle a large volume of writes and to provide fault tolerance. The choice of database rests solely on the requirements. Cassandra is meant for scalability: if the need is to support thousands of write operations over millions of records, Cassandra or any other column-oriented database is much more suitable. But if the need is to support transactions with considerably lower read/write volume, an RDBMS (Relational Database Management System) is best suited.

1.1. Column Oriented Database
A column-oriented database uses columns to store data tables rather than rows as in a traditional RDBMS. The main difference between the column and row approaches is schema definition. Column-oriented databases have flexibility in columns: it is not necessary for all rows to have the same column structure, whereas in an RDBMS the columns are fixed and the same for all rows.

2. TERMINOLOGIES
Partitioning [1]: Cassandra is a distributed system where data is distributed across multiple nodes. Each node is responsible for a part of the data.
Replication [1]: In a distributed architecture, if one node is down, one of the data sources is also sacrificed along with it. To avoid this, the data on each node is copied to multiple nodes, ensuring fault tolerance and resulting in no single point of failure.
Gossip Protocol [2]: Since Cassandra is a distributed system, it is important for individual nodes to know the existence and state of each other. To do so, Cassandra uses the Gossip protocol.
Memtable [1]: An in-memory table, or write-back cache, in which data has not yet been flushed to disk.
Column Family [3]: A NoSQL object to store key-value pairs. Each column family has one key mapped to a set of columns. Each column contains a column name, its value and a timestamp.
Bloom Filters [1]: This algorithm helps to determine whether a key is not present in a specific location, which helps in reducing I/O operations.

3. ARCHITECTURE

3.1. Cassandra Cluster/Ring
We have already covered that Cassandra is a distributed system. Each node of the system is assigned an id or name to uniquely identify it. The set of nodes which helps Cassandra start up is known as the seeds. Cassandra uses these seeds to retrieve information about the other existing nodes. It uses the gossip protocol for intra-node communication and failure detection. A node exchanges state information with only three other nodes. This state information contains data about itself and about the other three known members, reducing I/O operations.

3.2. Data Distribution and Replication
Each node of Cassandra is responsible for a specific range or set of data. During start-up, every node is assigned a range of tokens, ensuring an even distribution of data. In figure 1, the 0-255 range of data is distributed across four nodes. The hashing technique reviewed earlier is used to create the token of the row key. A row key falling under any of the shown ranges will be assigned to its corresponding node. Say, for example, the hash value of a row key comes to 38; it will then go to node N2.

Fig. 1. Cassandra Cluster Ring
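The token-range assignment just described can be sketched as follows. This is a toy illustration, not Cassandra's actual partitioner: the hash function and range boundaries are invented, and the ranges are simply chosen so that a token of 38 lands on node N2, matching the example above.

import hashlib

# Hypothetical token ranges for a four-node ring over the 0-255 token space.
RANGES = [("N2", 0, 63), ("N3", 64, 127), ("N4", 128, 191), ("N1", 192, 255)]

def token(row_key: str) -> int:
    """Map a row key into the 0-255 token space (toy hash)."""
    return hashlib.md5(row_key.encode()).digest()[0]

def owner(row_key: str) -> str:
    """Return the node whose token range contains the row key's token."""
    t = token(row_key)
    return next(node for node, lo, hi in RANGES if lo <= t <= hi)

print(owner("user:42"))   # the node responsible for this row key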


In a distributed system, one cannot rely on a single node to store a given set of data: if that node is down, the set will not be available for reads, writes and updates. To achieve better reliability and fault tolerance, Cassandra replicates data across multiple nodes. It has two basic replication strategies:
1. Simple - data is copied onto the next node in a clockwise manner.
2. Network-Topology - Cassandra is aware of each node's placement in racks and data centers and places replicas accordingly.

3.3. Read and Write Paths

Fig. 2. Cassandra Read Write Data Flow

In figure 1, the client is connected to node 3. Since Cassandra is masterless, N3 acts as the coordinator and serves all client requests. It is the responsibility of N3 to communicate with its fellow nodes and fetch the desired results. As already discussed, not all nodes communicate with all others for data retrieval; a node communicates only with a handful of nodes to reduce I/O operations. The number of nodes to communicate with can be configured using QUORUM, also known as the consistency level. The read and write flows are described in figure 2. Each node processes write requests separately. It writes the data to the commit log first and then to the Memtable. In case the node crashes, data can be restored from the commit log. The data from the Memtable is only flushed to disk (the SSTable) if:
1. It reaches its maximum allocated size in memory,
2. The number of minutes a memtable can stay in memory elapses, or
3. It is manually flushed by a user.
The read operation is similar to the write operation. Every read operation must come with a row key. As discussed earlier, the row key is used to determine the right node and the request is then passed to that particular node. The read is then served via the bloom filter and proceeds to the desired result set.

4. CHOOSING THE RIGHT DATABASE

4.1. Cap Theorem
The CAP theorem [4] states that it is impossible for a distributed computer system to simultaneously provide more than two out of three of the following guarantees:
1. Consistency - Every read receives the most recent write or an error.
2. Availability - Every request receives a (non-error) response, but without a guarantee that the data is recent.
3. Partition tolerance - The system continues to operate even after losing (dropped or delayed) an arbitrary number of messages between the nodes.

4.2. Cassandra and Cap Theorem
All big data databases are partition tolerant, so considering the CAP theorem one has to choose between consistency and availability as per the need. Cassandra is highly available, trading off consistency. As we have seen so far, Cassandra's masterless architecture allows a client to connect to any of the nodes for reads and writes. This increases availability: in case a node is down, the client can quickly jump to the nearest available one. The disadvantage is that while reading data, the data might not be the most recent. If the node with recent data is down, the data will not have been replicated and reads will return the older state. Cassandra is used in write-once, read-many scenarios, for example historical data analysis where updates are very few. The consistency level can also be tuned per request, as the sketch below illustrates.
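As a rough sketch of this trade-off in practice, the Python cassandra-driver lets the consistency level be chosen per statement, so a write can demand QUORUM while a read settles for ONE. The keyspace, table and contact point below are invented for illustration and assume a locally reachable cluster.

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])      # any node can act as the coordinator
session = cluster.connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                "{'class': 'SimpleStrategy', 'replication_factor': 3}")
session.set_keyspace('demo')
session.execute("CREATE TABLE IF NOT EXISTS events (id int PRIMARY KEY, payload text)")

# Write with QUORUM: a majority of replicas must acknowledge the write.
write = SimpleStatement("INSERT INTO events (id, payload) VALUES (%s, %s)",
                        consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, (1, 'login event'))

# Read with ONE: fast and highly available, but possibly not the latest value.
read = SimpleStatement("SELECT payload FROM events WHERE id = %s",
                       consistency_level=ConsistencyLevel.ONE)
print(session.execute(read, (1,)).one())

cluster.shutdown()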

4.3. Brief Comparison
Apart from Cassandra, HBase and MongoDB are fairly popular choices of database for big data solutions. HBase is built on HDFS (the Hadoop Distributed File System) and is a column-oriented database similar to Cassandra. A major advantage of HBase is its querying functionality, which is an edge over Cassandra. HBase can also be installed on an already clustered Hadoop environment. Accumulo is another available option which is based upon HDFS as well. The advantage of Accumulo is its cell-level security, whereas all other column-oriented databases have only column-level security. Having security at the cell level allows users to see different values as appropriate based on the row.

MongoDB stands apart from all of the aforementioned databases, as it is a document-oriented database. When the data fields to be stored vary between different elements, column-oriented storage leads to a lot of empty column values. MongoDB provides a way to store only the necessary fields.

5. CONCLUSION
Cassandra is a column-oriented database with a focus on high availability. The architecture allows for fast write operations. Cassandra can handle bulk writes, which makes it an ideal candidate for storing logs, as logs are generally high in volume and velocity. Cassandra is poor at table joins and hence not suited for data mining.

ACKNOWLEDGMENTS
I would like to thank Akhil Mehra for his vibrant description of the Cassandra architecture.

REFERENCES [1] Akhil Mehera / Dzone, “Introduction to cassandra,” Web Page, Apr. 2015, accessed: 2015-04-06. [Online]. Available: https: //dzone.com/articles/introduction-apache-cassandras [2] Wikipedia, “Gossip protocol,” Web Page, Jan. 2017, accessed: 2017-01- 11. [Online]. Available: https://en.wikipedia.org/wiki/Gossip_protocol [3] Wikipedia, “Column family,” Web Page, Jan. 2017, accessed: 2017-01- 11. [Online]. Available: https://en.wikipedia.org/wiki/Column_family [4] Wikipedia, “Cap theorem,” Web Page, Jan. 2017, accessed: 2017-04-17. [Online]. Available: https://en.wikipedia.org/wiki/CAP_theorem


Apache Derby

RIBKA RUFAEL1,*, +

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] +HID: S17-IO-3016 paper2, April 30, 2017

Apache Derby, part of the Apache DB subproject, is a Java based relational database management system. The Apache Derby database provides storage, access and secure management of data for Java based applications. Apache Derby is open source software and is licensed under Apache License version 2.0. Apache Derby is written in Java and runs on any certified JVM (Java Virtual Machine). A JDBC driver in the embedded or network server framework allows applications to access the Apache Derby database. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Apache Derby, relational database management system, JDBC

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IO-3016/report.pdf

1. INTRODUCTION
Apache Derby is an open source relational database management system (RDBMS). Applications interact with the Apache Derby database through a JDBC (Java Database Connectivity) driver. Apache Derby is built in the Java programming language, which makes it platform independent. Apache Derby can be embedded into a Java based application or it can be set up to run in a client/server environment. Apache Derby complies with the JDBC and ANSI SQL standards. The Apache Derby database allows applications to create, update, read, delete and manage data [1, 2]. Apache Derby is part of the Apache DB subproject of the Apache Software Foundation. It was in August 2004 that IBM submitted the Derby code to the Apache Software Foundation; Apache Derby then became part of the Apache DB project in July 2005 [2].

2. ARCHITECTURE
There are two views that describe Apache Derby's architecture: the Module view and the Layer/Box view [3].

2.1. Module View
Modules and the monitor are the components in the Apache Derby database system. A collection of distinct functionality is a module. Examples of modules in Apache Derby are lock management, error logging and the JDBC driver. Lock management is responsible for controlling concurrent transactions on data objects. Once the Apache Derby database is up and running, any informational or error messages are logged; this message logging is handled by an error logging module. Applications use the JDBC driver to interact with the Apache Derby database. A number of classes are used for the implementation of each module [3].
The monitor is responsible for managing the Apache Derby database. Whenever a request to a module comes in, the monitor is responsible for selecting the appropriate module implementation depending on what the module request was and from which environment the request came [3].

2.2. Layer/Box View
JDBC, SQL, store and services are the four layers in Apache Derby. The Java Database Connectivity layer, abbreviated JDBC, is an API (Application Programming Interface) which allows applications to connect to the Apache Derby database. The JDBC interfaces that Apache Derby implements allow applications to connect to Apache Derby. Some of these interfaces are Driver, DataSource, ConnectionPoolDataSource, XADataSource and PreparedStatement. Apache Derby's JDBC layer has implementations of the java.sql and javax.sql classes [3].
The other layer, which sits below JDBC, is the SQL layer. Compilation and execution are the two logical parts of the SQL layer. An SQL statement submitted by an application to the Apache Derby database passes through a five step compilation process. First the statement is parsed with a parser created by JavaCC (Java Compiler Compiler) and a tree of query nodes is created. The second step is binding, to resolve objects. The third step is optimizing, to identify the access path. In the fourth step a Java class is created for the statement, which is then cached to be used by other connections. Finally, in the fifth step, the class is loaded and an instance of the generated statement is created. During execution, the execute methods on the class instance created during compilation are called and a Derby result set is returned. The JDBC layer is responsible for converting the Derby result set into a JDBC result set for the applications [3].

Access and raw are the two parts of the store layer. Raw data storage for data in files and pages, as well as transaction logging and management, is handled by the raw store. The access store interfaces with the SQL layer and takes care of scanning tables and indexes, indexing, sorting, locking, etc. [3].
Lock management, error logging and cache management are part of the services layer. A clock based algorithm is used by cache management. It is mainly used for buffer caching, caching compiled Java classes for SQL statement implementation plans and caching table descriptors [3].

2.3. Shell Access
Shell access to the Apache Derby database is achieved with ij. The ij tool is one of the Java utility tools that allows performing SQL scripts on a Derby database in the embedded or network server frameworks. The ij tool can be used for creating a database, connecting to a database, and running and executing SQL scripts. Commands in ij are case sensitive and a semicolon is used to mark the end of a command. Other utility tools that come with Apache Derby are sysinfo, dblook and SignatureChecker. The sysinfo utility tool is used to get the version and other information about Apache Derby and the Java environment. The dblook utility is used to generate the Data Definition Language (DDL) for a Derby database. Checks, functions, indexes, jar files, primary keys, foreign keys and schemas are the objects generated by dblook. SignatureChecker is used to check whether functions and procedures used in a Derby database comply with the rules and standards [4].

2.4. API Applications can interact with Apache Derby database through two JDBC drivers org.apache.derby.jdbc.EmbeddedDriver and org.apache.derby.jdbc.ClientDriver for embedded and network server frameworks respectively [4].
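As a rough illustration of the embedded driver named above, the sketch below reaches Derby from Python through the third-party jaydebeapi bridge; this is an assumption of mine, not something the Derby documentation prescribes. The database name, table and jar path are hypothetical, and derby.jar plus a JVM must be available for it to run.

import jaydebeapi

# Connect through the Embedded Derby JDBC driver (see Section 2.4);
# ";create=true" creates the database on first use.
conn = jaydebeapi.connect(
    "org.apache.derby.jdbc.EmbeddedDriver",
    "jdbc:derby:demoDB;create=true",
    [],                               # no credentials in this sketch
    "/path/to/derby.jar")             # hypothetical location of derby.jar

cur = conn.cursor()
cur.execute("CREATE TABLE notes (id INT PRIMARY KEY, body VARCHAR(200))")
cur.execute("INSERT INTO notes VALUES (1, 'stored in Derby')")
cur.execute("SELECT body FROM notes WHERE id = 1")
print(cur.fetchall())
cur.close()
conn.close()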

3. INSTALLATION
Java 2 Standard Edition (J2SE) 6 or higher is needed for installing Apache Derby, and for running Derby the Java Runtime Environment (JRE) is needed. Apache Derby can be downloaded from the Apache DB download page [5]. Apache Derby can be installed on Windows, MAC, UNIX and Linux operating systems. After downloading the zipped installation file, the file should be extracted into a directory and then the DERBY_INSTALL variable should be set to the directory where Derby was installed. Apache Derby provides two frameworks, Embedded Derby and Derby Network Server [6].

3.1. Embedded Derby
In the Embedded Derby framework, the Apache Derby engine and the application which accesses it run in the same JVM (Java Virtual Machine). The Embedded Derby JDBC driver is used by the application to interact with the Apache Derby database. To set up Embedded Derby mode, derby.jar and derbytools.jar must be included in the CLASSPATH after installation. The Derby engine and the Embedded Derby JDBC driver are included in derby.jar; derbytools.jar includes the ij tool (a utility which can be used as a scripting tool to interact with the Derby database). In the Embedded Derby framework, multiple users running in the same JVM can access the same database. Figure 1 shows the Embedded Derby framework [6].

Fig. 1. Embedded Derby Framework [6].

3.2. Derby Network Server
The Derby Network Server framework allows multiple applications running in the same JVM or different JVMs to access a Derby database over the network in a typical client/server architecture. In this framework the application accesses the Derby database through the Derby Network Client JDBC driver. To set up Derby Network Server, derbynet.jar and derbytools.jar must be included in the CLASSPATH on the server side. The program for the Network Server and a reference to the Derby engine are included in the derbynet.jar file. On the client side, derbyclient.jar and derbytools.jar must be included in the CLASSPATH. The Derby Network Client JDBC driver is included in the derbyclient.jar file. The Derby Network Server listens for and accepts requests on port 1527 by default, but this can be changed to a different port if needed. Figure 2 shows the Derby Network Server framework [6].

Fig. 2. Derby Network Server Framework [6].

Another variation for setting up the Derby Network Server is the embedded server. With an embedded server, the application has both the Embedded Derby JDBC driver and the Network Server. The application uses the Embedded Derby JDBC driver, which runs in the same JVM, and extends access to other applications running in different JVMs through the Network Server [6].

4. SECURITY
Apache Derby provides a number of security options. Some of these features are authentication, authorization and disk encryption. Before users are granted access to the Derby database, Derby can be set up to perform user authentication. There may be a need for certain user groups to have read-only access to the Derby database and for other user groups to have both read and write access, which can be achieved with Derby's user authorization feature. Data which is saved on disk can be encrypted with Derby's disk encryption feature [7].

5. USE CASES
Apache Hive, a data warehouse querying and analysis software, uses the Apache Derby database by default for its metadata store. Apache Derby can be configured in the embedded or network server mode for Hive metadata storage [8, 9].


My Money, a management and analysis software package for personal and business finances built by MTH Software, Inc, uses Apache Derby as its relational database management system [10, 11].

6. LICENSING
Apache Derby is an open source technology and hence it is available for free. Apache Derby is licensed under the Apache License, Version 2.0 [1].

7. EDUCATIONAL MATERIAL
The Apache Derby tutorial page is one of the resources which can be used by users who are new to Apache Derby. This tutorial contains step by step information about Apache Derby installation, embedded framework configuration and network server framework configuration [6]. The Derby Engine Architecture Overview page provides information about the Apache Derby architecture [3]. The Apache Derby documentation page has a list of documents that can be used as references by new users, developers and administrators [12].

8. CONCLUSION
The Apache Derby database offers storage, access and secure management of data for Java based applications. Apache Derby provides a complete relational database management package that complies with the JDBC and ANSI SQL standards. Apache Derby's small size makes it suitable for embedding into applications which run on smaller devices with less physical memory. As the use case of Hive using Apache Derby for metadata storage indicates, Apache Derby can be incorporated into the software stack of big data projects for metadata storage.

ACKNOWLEDGEMENTS The author would like to thank Professor Gregor von Laszewski and associate instructors for their help and guidance.

REFERENCES
[1] Apache Software Foundation, "Apache Derby," Web Page, Oct. 2016, accessed: 2017-03-14. [Online]. Available: https://db.apache.org/derby/
[2] The Apache Software Foundation, "Apache Derby Project Charter," Web page, Sep. 2016, accessed: 2017-02-25. [Online]. Available: https://db.apache.org/derby/derby_charter.html
[3] Apache Software Foundation, "Derby Engine Architecture Overview," Web Page, Sep. 2016, accessed: 2017-03-14. [Online]. Available: http://db.apache.org/derby/papers/derby_arch.html#Module+View
[4] Apache Software Foundation, "Derby Tools and Utilities Guide," Web Page, Oct. 2016, accessed: 2017-03-24. [Online]. Available: https://db.apache.org/derby/docs/10.13/tools/derbytools.pdf
[5] Apache Software Foundation, "Apache Derby: Downloads," Web Page, Sep. 2016, accessed: 2017-03-24. [Online]. Available: http://db.apache.org/derby/derby_downloads.html
[6] Apache Software Foundation, "Apache Derby Tutorial," Web Page, Sep. 2016, accessed: 2017-03-14. [Online]. Available: https://db.apache.org/derby/papers/DerbyTut/index.html
[7] Apache Software Foundation, "Derby and Security," Web Page, Mar. 2017, accessed: 2017-03-20. [Online]. Available: http://db.apache.org/derby/docs/10.8/devguide/cdevcsecuree.html
[8] Apache Software Foundation, "Apache Hive," Web Page, Mar. 2017, accessed: 2017-03-23. [Online]. Available: https://cwiki.apache.org/confluence/display/Hive/Home
[9] Apache Software Foundation, "GettingStarted-Metadata Store," Web Page, Mar. 2017, accessed: 2017-03-23. [Online]. Available: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MetadataStore
[10] MTH Software, Inc, "MTH built products," Web Page, Mar. 2017, accessed: 2017-03-23. [Online]. Available: http://www.mthbuilt.com/products.php
[11] MTH Software, Inc, "How to use SQL in My Money," Web Page, Mar. 2017, accessed: 2017-03-23. [Online]. Available: http://wiki.mthbuilt.com/How_to_use_SQL_in_My_Money
[12] Apache Software Foundation, "Apache Derby: Documentation," Web Page, Oct. 2016, accessed: 2017-03-24. [Online]. Available: https://db.apache.org/derby/manuals/index.html


Facebook Tao

NANDITA SATHE1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding author: [email protected] paper-2, April 30, 2017

Facebook Tao is a graph database currently serving Facebook Inc. to manage the data of its billions of users. A graph database stores data in a graph structure and establishes relationships between data using nodes and edges. Graph databases are ideal for systems that require data to be represented as a graph or hierarchical structure and need to establish connections between data points. Storing relationships between data points as first class entities helps to draw insights from the data. For example, a real time recommendation engine is possible because of the ability of a graph database to connect the masses of buyers to product data, thereby enabling insights on customer needs and product trends. A graph database system like Tao is apt for the relationship driven data requirements of Facebook. The main aims of the Tao system are to achieve the lowest read latency, timeliness in writes and efficient scaling of data. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Facebook Tao, Graph Database, memcache

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IO-3017/report.pdf

1. INTRODUCTION

Graph databases like Facebook Tao are a response to data needs that traditional RDBMSs like MySQL do not meet. For example, Facebook realized that however efficient a relational database it used, it would not be sufficient to manage the enormous data challenge Facebook had. Another mismatch, which a relational database or block cache had, was that most of the data that would be read into cache did not belong to any relation. For example, "if user likes that picture": in most records the answer would be 'No' or 'False', and storing and reading this unwanted data was a burden. Meanwhile Facebook's user base was increasing daily. Ultimately Facebook came up with Facebook Tao, a distributed social graph data store.
Facebook TAO (The Associations and Objects) is a geographically distributed data store that provides timely access to the social graph for Facebook's demanding workload using a fixed set of queries [1]. It is deployed at Facebook for many data types that fit its model. The system runs on thousands of machines, is widely distributed, and provides access to many petabytes of data. TAO represents social data items as Objects (user) and the relationships between them as Associations (liked by, friend of). TAO cleanly separates the caching tiers from the persistent data store, allowing each of them to be scaled independently. To any user of the system it presents a single unified API that makes the entire system appear like one giant graph database [2]. Key advantages of the system include [2]:
• It provides a clean separation of application/product logic from data access by providing a graph API and data model to store and fetch data.
• By implementing a write-through cache, TAO allows Facebook to provide a better user experience and preserve the all-important read-what-you-write consistency semantics even when the architecture spans multiple geographical regions.
• By implementing a read-through write-through cache, TAO also protects the underlying persistent stores better by avoiding issues like thundering herds without compromising data consistency.

2. TAO'S GOAL
The main goal of implementing Tao is efficiently scaling the data. Facebook handles approximately a billion requests per second [3], so obviously the data store has to be scalable. More than that, scalability should be efficient, otherwise scaling data across machines would be extremely costly.
The second goal is to achieve the lowest possible read latency, so that if a user has commented on a post, the original post writer can read it immediately. Efficiency in scaling and low read latency are achieved by (i) separating cache and data storage, (ii) graph-specific caching and (iii) sub-dividing data centers [3].
The third goal is to achieve timeliness of writes. If a web server has written something and it sends a read request, it should be able to read the post. Write timeliness is achieved by (i) a write-through cache and (ii) asynchronous replication [3].


Lastly, the goal is also to have high read availability, which is achieved by using alternate data sources.

3. TAO DATA MODEL AND API
Facebook Inc. explains the TAO data model and API using an example [4]. Figure 1 depicts the TAO data model.

Fig. 1. TAO Data Model [4].

This example shows a subgraph of objects and associations that is created in TAO after Alice checks in at the Golden Gate Bridge and tags Bob there, while Cathy comments on the check-in and David likes it. Every data item, such as a user, check-in, or comment, is represented by a typed object containing a dictionary of named fields. Relationships between objects, such as "liked by" or "friend of," are represented by typed edges (associations) grouped in association lists by their origin. Multiple associations may connect the same pair of objects as long as the types of all those associations are distinct. Together, objects and associations form a labeled directed multigraph.
For every association type a so-called inverse type can be specified. Whenever an edge of the direct type is created or deleted between objects with unique IDs id1 and id2, TAO will automatically create or delete an edge of the corresponding inverse type in the opposite direction (id2 to id1). The intent is to help the application programmer maintain referential integrity for relationships that are naturally mutual, like friendship, or where support for graph traversal in both directions is performance critical, as for example in "likes" and "liked by."

3.1. Objects and Associations
TAO objects are typed nodes, and TAO associations are typed directed edges between objects. Objects are identified by a 64-bit integer (id) that is unique across all objects, regardless of the object type (otype). Associations are identified by the source object (id1), association type (atype) and destination object (id2). At most one association of a given type can exist between any two objects. Both objects and associations may contain data as key-value pairs. A per-type schema lists the possible keys, the value type, and a default value. Each association has a 32-bit time field, which plays a central role in queries [1].
Object: (id) -> (otype, (key -> value)*)
Assoc.: (id1, atype, id2) -> (time, (key -> value)*)
Figure 1 shows how TAO objects and associations might encode the example, with some data and times omitted for clarity. The example's users are represented by objects, as are the check-in, the landmark, and Cathy's comment. Associations capture the users' friendships, authorship of the check-in and comment, and the binding between the check-in and its location and comments. The set of operations on objects is of the fairly common create/set-fields/get/delete variety. All objects of a given type have the same set of fields. New fields can be registered for an object type at any time and existing fields can be marked deprecated by editing that type's schema. Associations are created and deleted as individual edges. If the association type has an inverse type defined, an inverse edge is created automatically. The API helps the data store exploit the creation-time locality of its workload by requiring every association to have a special time attribute that is commonly used to represent the creation time of the association. TAO uses the association time value to optimize the working set in cache and to improve the hit rate [1].
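The object/association model can be made concrete with a toy in-memory sketch. This is illustrative Python only, not Facebook's API or implementation: the association types, inverse table and operations are invented to mirror the notation above, and the newest-first ordering merely mimics the creation-time locality that TAO exploits.

import time
from collections import defaultdict

objects = {}                   # id -> (otype, {key: value})
assocs = defaultdict(list)     # (id1, atype) -> [(time, id2, data)], newest first
INVERSE = {"AUTHORED": "AUTHORED_BY", "LIKES": "LIKED_BY"}   # invented types

def object_add(oid, otype, **fields):
    objects[oid] = (otype, fields)

def assoc_add(id1, atype, id2, **data):
    entry = (time.time(), id2, data)
    assocs[(id1, atype)].insert(0, entry)
    inv = INVERSE.get(atype)
    if inv:  # the inverse edge is created automatically, as described above
        assocs[(id2, inv)].insert(0, (entry[0], id1, data))

def assoc_range(id1, atype, limit=10):
    """Return the newest `limit` associations of one type originating at id1."""
    return assocs[(id1, atype)][:limit]

object_add(1, "USER", name="Alice")
object_add(2, "CHECKIN", place="Golden Gate Bridge")
assoc_add(1, "AUTHORED", 2)
print(assoc_range(2, "AUTHORED_BY"))   # Alice authored the check-in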

4. TAO ARCHITECTURE
TAO is separated into layers: two caching layers and a storage layer.

4.1. Storage Layer
The data is persisted using MySQL. The API is mapped to a small number of SQL queries. TAO needs to handle a far larger volume of data than can be stored on a single MySQL server, so the data is divided into logical shards. Each shard is contained in a logical database. Database servers are responsible for one or more shards. The number of shards far exceeds the number of servers, and the shard-to-server mapping is tuned to balance load across different hosts. By default all object types are stored in one table, and all association types in another. Every object id has a corresponding shard id. Objects are bound to a single shard throughout their lifetime. An association is stored on the shard of its id1, so that every association query can be served from a single server [3].

4.2. Caching Layer
TAO's cache implements the API for clients, handling all communication with the databases. A region/tier is made up of multiple closely located data centers. Multiple cache servers make up a tier (a set of databases in a region is also called a tier) that is collectively capable of answering any TAO request. Each cache request maps to a server based on sharding. Write operations on an association with an inverse may involve two shards, since the forward edge is stored on the shard for id1 and the inverse edge on the shard for id2. Handling writes with multiple shards involves issuing an RPC (Remote Procedure Call) to the member hosting id2, which will contact the database to create the inverse association. Once the inverse write is complete, the caching server issues a write to the database for id1. TAO does not provide atomicity between the two updates: if a failure occurs, the forward edge may exist without an inverse; these hanging associations are scheduled for repair by an asynchronous job [3].
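The id-to-shard-to-server mapping described for the storage layer can be pictured with a small sketch. The shard count, server names and modulo placement below are invented; the point is only that many logical shards map onto few servers and that an association lives on the shard of its id1.

NUM_SHARDS = 16                                   # hypothetical shard count
servers = ["db-0", "db-1", "db-2", "db-3"]
# Many shards per server; this table can be re-tuned to rebalance load.
shard_to_server = {s: servers[s % len(servers)] for s in range(NUM_SHARDS)}

def shard_of(object_id: int) -> int:
    return object_id % NUM_SHARDS

def server_for(object_id: int) -> str:
    return shard_to_server[shard_of(object_id)]

# An association (id1, atype, id2) is stored on shard_of(id1), so a query for
# all associations leaving id1 is answered by a single server.
print(server_for(1), server_for(2))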


4.3. Leaders and Followers
There are two tiers of caching clusters in each geographical region. Clients talk to the first tier, called followers. If a cache miss occurs on a follower, the follower attempts to fill its cache from a second tier, called a leader. Leaders talk directly to a MySQL cluster in that region. All TAO writes go through followers to leaders. Caches are updated as the reply to a successful write propagates back down the chain of clusters. Leaders are responsible for maintaining cache consistency within a region. They also act as secondary caches [4].

4.4. Scaling Geographically
The high read workload scales with the total number of follower servers. The assumption is that latency between followers and leaders is low. Followers behave identically in all regions, forwarding read misses and writes to the local region's leader tier. Leaders query the local region's database regardless of whether it is the master or a slave. This means that read latency is independent of inter-region latency. Writes are forwarded by the local leader to the leader in the region with the master database. Read misses by followers are 25x as frequent as writes in the workload, thus read misses are served locally [3]. Facebook chooses data center locations that are clustered into only a few regions, where the intra-region latency is small (typically less than 1 millisecond) [3]. It is then sufficient to store one complete copy of the social graph per region.
Since each cache hosts multiple shards, a server may be both a master and a slave at the same time. It is preferred to locate all of the master databases in a single region. When an inverse association is mastered in a different region, TAO must traverse an extra inter-region link to forward the inverse write. TAO embeds invalidation and refill messages in the database replication stream. These messages are delivered in a region immediately after a transaction has been replicated to a slave database. Delivering such messages earlier would create cache inconsistencies, as reading from the local database would provide stale data. If a forwarded write is successful then the local leader will update its cache with the fresh value, even though the local slave database probably has not yet been updated by the asynchronous replication stream. In this case followers will receive two invalidates or refills from the write, one sent when the write succeeds and one sent when the write's transaction is replicated to the local slave database [3].

5. EDUCATIONAL MATERIAL
To get started on learning Facebook TAO, the following resources can prove helpful.
• The technical paper on Facebook TAO [1].
• Background, architecture and implementation notes from Facebook itself [4].
• A TAO summary video on the USENIX website [5].

6. RELATED WORK
TAO is a geographically distributed, eventually consistent graph store optimized for reads, thus combining all three techniques into one system. The eventual consistency model is based on BASE (Basically Available, Soft state, Eventual consistency) semantics. These systems typically provide weaker guarantees than the traditional ACID (Atomicity, Consistency, Isolation, Durability) semantics. Google's BigTable, Yahoo!'s PNUTS, Amazon's SimpleDB, and Apache's HBase are examples of this more scalable approach. These systems all provide consistency and transactions at the per-record or row level, similar to TAO's semantics for objects and associations, but do not provide TAO's read efficiency or graph semantics [1].
The Coda file system uses data replication to improve performance and availability in the face of unreliable networks. Unlike Coda, TAO does not allow writes in portions of the system that are disconnected. Google Megastore is a storage system that uses Paxos (a protocol for distributed consensus) across geographically distributed data centers to provide strong consistency guarantees and high availability. TAO provides no such consistency guarantees but handles comparatively many more requests [1].
Since TAO was designed specifically to serve the social graph, its features are inherited from graph databases. Neo4j is an open-source graph database that provides ACID semantics and the ability to shard data across several machines. Twitter uses its FlockDB to store parts of its social graph as well [1].
Instead of using existing graph systems, Facebook has customised Tao to fulfill its specific needs and workload requirements.

7. CONCLUSION
Overall, this paper explains the characteristics and challenges of Facebook's workload and the objects and associations data model and, lastly, it details TAO, the geographically distributed system that implements the API to work with the social graph. TAO is deployed at scale inside Facebook. Its separation of cache and persistent store has allowed those layers to be independently designed, scaled, and operated, and maximizes the reuse of components across the organization [1].

8. ACKNOWLEDGEMENTS
The author would like to thank Prof. Gregor von Laszewski and his associates from the School of Informatics and Computing for providing all the technical support and assistance.

REFERENCES
[1] N. Bronson et al., "Tao: Facebook's distributed data store for the social graph," in 2013 USENIX Annual Technical Conference, 2013. [Online]. Available: http://ai2-s2-pdfs.s3.amazonaws.com/39ac/2e0fc4ec63753306f99e71e0f38133e58ead.pdf
[2] V. Venkataramani, "What is the tao cache used for at facebook," Web Page, June 2013. [Online]. Available: https://www.quora.com/What-is-the-TAO-cache-used-for-at-Facebook
[3] N. Upreti, "Facebook's tao and unicorn data storage and search platforms," Slides, April 2015. [Online]. Available: https://www.slideshare.net/nitishupreti/faceboko-tao-unicorn
[4] M. Marchukov, "Tao: The power of the graph," Web Page, June 2013. [Online]. Available: https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920/
[5] N. Bronson, "Tao: Facebook's distributed data store for the social graph," Slides, June 2013. [Online]. Available: https://www.usenix.org/node/174510


InCommon

MICHAEL SMITH1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] paper2, April 30, 2017

InCommon is a federated security service that is responsible for the management of identity verification solutions serving U.S. education and research. All users within this federation allow partners to share identity information in order to easily recognize the user. This federation provides numerous benefits for users and service providers through the convenience of single sign-on capabilities for the user. Privacy is enhanced by limiting the distribution of personal information among numerous service providers. Scalability is easily facilitated due to the unified policies and management procedures. Programs such as InCommon Assurance and university case studies are examined. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: InCommon, User authentication, identity management, I524

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IO-3019/report.pdf

1. INTRODUCTION
Electronic credentialing of individuals requires an effective implementation of a set of policies and procedures. In order to be successful, identity management requires an organization to keep user information up to date, provide the trust needed for secure transactions, and determine user access to online applications. The major issue with identity management is the increasing number of cloud services or web hosted applications, all of which have different policies for implementing identity verification. The solution is to establish a federation, which is defined as "an association of organizations that come together to exchange information, as appropriate, about their users and resources in order to enable collaborations and transactions" [1]. Within this federation the parties come to an accordance on the policies associated with identity management. A great example of a federation that encompasses this definition is InCommon.

2. INCOMMON
InCommon was founded by the advanced technology organization Internet2. Its mission is to create an environment that facilitates the ability of educators and researchers to collaborate regardless of their location. The network encompasses over 90,000 institutions, 305 universities and 70 government agencies, with a network operations center powered by Indiana University [2].
Through the InCommon service, users do not have to remember a plethora of usernames and passwords for each web service. Instead, they have single sign-on (SSO) convenience, giving time back to faculty, staff and students for education, research and other contributions to the university. Any service provider within this federation no longer needs to manage databases of usernames and passwords; users are verified and then administered security tokens to engage with service providers within the federation. By limiting the amount of identity information required of the service provider, the user's privacy is safer in the event of a security breach of the service provider.

3. ARCHITECTURE
The architecture of the InCommon framework is comprised of several key components such as SAML, Shibboleth, certificate service offerings and Duo. While different from one another, these components all help assist the federation with user authentication. SAML is the language used in the transfer of data; Shibboleth is a service that assists in the implementation of SAML; the certificate service, mainly comprised of SSL, assists in the privacy of the data passed; and Duo provides an extra layer of security with two factor authentication. Each component will be discussed in the following sections.

4. SAML
The language used by InCommon is referred to as Security Assertion Markup Language, or SAML. This language is based on XML, which allows for the exchange of authentication information between a user and a provider [3]. It is the industry accepted standard language for identity verification used by numerous governments, businesses and service providers [4]. General user verification is done by an identity provider (IdP), which is responsible for user authentication through the use of security tokens with SAML 2.0 [5].

Service providers (SPs) are defined as entities that provide web services, internet access, web storage, etc. They rely on the IdPs for the verification process. A significant number of the major web service providers, such as Google, Facebook, Yahoo, Microsoft, and PayPal, play a dual role and also exist as identity providers.

5. SHIBBOLETH
Shibboleth is a service with a suite of products that assist the InCommon federation in utilizing SAML, implemented in programming languages such as C++ and Java [6]. The normal authentication process for Shibboleth is to intercept access to a service and determine the identity provider for the user. Once the identity provider has been discovered, a SAML authentication request is sent to the identity provider. The identity provider's SAML response will have the relevant user information for verification. The extracted user information is then passed to the service provider or resource, determining user accessibility. While the process sounds complex, it occurs almost instantaneously after the user has entered their single sign-on.

6. CERTIFICATE SERVICE
The types of certificates that InCommon has available for issue are SSL/TLS, extended validation, client, code signing, IGTF server, and elliptic curve cryptography (ECC) certificates. SSL (Secure Sockets Layer) is "the standard security technology for establishing an encrypted link between a web server and a browser. The link ensures all data passed between the web server and browsers remain private and integral" [7]. The details of an SSL certificate issued by InCommon will contain user information and the expiration date. For all educational institutions, InCommon offers unlimited server and client certificates for the annual fee.

7. DUO
In collaboration with the trusted access company Duo, InCommon offers two factor authentication through the utilization of the user's smart phone [8]. The Duo Mobile app supports the following platforms: Apple iOS, Google Android, Windows Mobile, Palm WebOS, Symbian OS, RIM BlackBerry, and Java J2ME. The application generates a random one-time password that the user types into the web application for a more secure identity verification. Two factor authentication does not require a smartphone; other methods such as automated voice calls or SMS messages can be used. In addition to Duo Mobile, a service called Duo Push is available which does not require the user to type in the password; authentication occurs directly from the mobile app. It is up to the university to determine how Duo is deployed, whether it will occur with the identity provider or the service provider. If it is deployed at the service provider destination, Duo Web supports the following client libraries: Python, Ruby, classic ASP, ASP.NET, Java, PHP, Node.js, ColdFusion, and Perl.

8. ASSURANCE PROGRAM
Many organizations and government agencies, such as the National Institutes of Health and public universities, are requiring identity providers to become certified in this program. InCommon offers an assurance program that will examine the practices of an organization and rank it based on a number of criteria. Areas of examination include "...identity proofing (such as checking government issued ID before accepting that people are who they say they are), password handling (including making sure that passwords are not sent or stored in the clear), and authentication (such as ensuring the resistance of an authentication method to session hijacking)" [9].
There are two levels of assurance in the InCommon program, Bronze and Silver. Bronze is comparable to NIST Level of Assurance 1, which is for common usage of internet identity management. Silver is comparable to NIST Level of Assurance 2, which defines the institution as having sufficient requirements to provide security at the level of a financial transaction [10]. NIST levels of security are set by the National Institute of Standards and Technology, a government body within the U.S. Department of Commerce [11]. Compliance with the Bronze level only requires self-certification of the requirements, whereas the Silver level is more difficult to achieve: a third party or evaluator that has been verified by InCommon is required to perform an audit ensuring the identity provider is meeting all the rules and requirements.

9. UNIVERSITY OF MINNESOTA
One example where InCommon was successfully deployed is the University of Minnesota, which joined the InCommon federation in September 2010 [12]. The university contains 51,000 students and over 300 institutes. Their previous identity management vendor charged on a per-certificate basis. This differs from InCommon, which offers an annual fee with unlimited certificates. This simplifies the ability of IT departments within universities to budget properly. Additionally, the university saw a cost savings of 38,000 dollars. This model also encourages enhancing security because cost does not influence which servers to secure.

10. STUDENTS ONLY
The bottom line of a university is not the only place where the cost benefits of InCommon are seen [13]. The student verification provider known as Students Only is a way for students to enroll to verify their status as a student. This verification is then passed to businesses that would like to offer discounts to students. To prevent non-students from taking advantage of company offerings, it can be cumbersome for a student to properly verify their status. With the help of InCommon, Students Only helped streamline the process for students to verify their identity in a single sign-on. This reassured the companies, and students were able to save money without the difficulties of personally handling identity verification.

11. CONCLUSION
As the number of services on the web continues to grow, it can be quite challenging for both universities and service providers to properly manage accessibility and identities. InCommon hopes to address this issue by bringing U.S. educational institutes into the same federation. This creates a common groundwork of policies and procedures related to identity management. Through this, users such as faculty, staff, and students alike can benefit from the obvious conveniences of single sign-on. However, they will also benefit from enhanced security and privacy. Institutions that have entered into InCommon have seen benefits such as cost savings over competitors in this market as well as simplification of the billing process for university IT.

The unlimited certificate model as well as the diverse types of certificates allow IT the flexibility to issue the appropriate certificate without the worry of budgeting constraints. Partners such as Duo further improve security through two factor authentication, dramatically improving the protection of the user.

REFERENCES [1] InCommon, “Incommon overview,” Webpage. [Online]. Available: https: //www.incommon.org/docs/presentations/InCommon_Overview.ppt [2] InCommon, “What is the incommon federation?” Webpage. [Online]. Available: https://spaces.internet2.edu/download/attachments/2764/ final_InCommon.pdf [3] Wikipedia, “Security assertion markup language,” Webpage. [Online]. Available: https://en.wikipedia.org/wiki/Security_Assertion_Markup_ Language [4] Pingidentity, “Saml: How it works,” webpage. [Online]. Available: https://www.pingidentity.com/en/resources/articles/saml.html [5] empowerID, “Service providers, identity providers & security token services,” Webpage. [Online]. Available: https://www2.empowerid.com/ learningcenter/technologies/service-identity-providers [6] Shibboleth, “Shibboleth,” Webpage. [Online]. Available: https: //shibboleth.net/ [7] SSL, “What is ssl?” Webpage. [Online]. Available: http://info.ssl.com/ article.aspx?id=10241 [8] InCommon, “Incommon multifactor,” Webpage. [Online]. Available: https://www.incommon.org/duo/ [9] M. Erdos, “An introduction to assurance,” Webpage. [Online]. Available: http://iam.harvard.edu/resources/introduction-assurance [10] InCommon, “The incommon assurance program,” Webpage. [Online]. Available: https://www.incommon.org/assurance/ [11] Nist, “Nist,” Webpage. [Online]. Available: www.nist.gov [12] InCommon, “The university of minnesota enables security at scale with incommon,” Webpage, July 2016. [Online]. Available: https: //www.incommon.org/docs/eg/InC-Cert_CaseStudy_Minnesota.pdf [13] InCommon, “Incommon flies high with students only,” Webpage. [Online]. Available: https://www.incommon.org/docs/eg/InC_CaseStudy_ StudentsOnly_2009.pdf

45 Review Article Spring 2017 - I524 1

Hadoop YARN

MILIND SURYAWANSHI1,2 AND GREGOR VON LASZEWSKI1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. 2Savitribai Phule Pune University 2010, Pune, Maharashtra 411007 India *Corresponding authors: [email protected] project-paper2, April 30, 2017

Apache Hadoop 2.0 introduced the Yarn architecture, which is capable of managing resources and the processing of tasks separately. This is the most important improvement over MapReduce, as it increases the scalability of Hadoop beyond 1.0. YARN is the next generation of Hadoop's compute platform. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Hadoop, Yarn, MapReduce, Cluster

https://github.com/MilindSuryawanshi/sp17-i524/tree/master/paper2/S17-IO-3020/report.pdf

1. INTRODUCTION

Apache Hadoop [1] is an open source software framework that can be installed on a cluster of computers which communicate with each other to store and process large amounts of data in a highly distributed manner. The Hadoop 2.0 project was developed to resolve the limitations of MapReduce (we will see the limitations as we move ahead). YARN (Yet Another Resource Negotiator) is Apache Hadoop's cluster resource management system [2]. It is a resource management technology which mediates between the way applications use Hadoop system resources and the node manager agents. To understand Yarn, it is important to know the limitations of Hadoop 1.0 MapReduce, which forced the creation of Hadoop 2.0. The Hadoop 2.0 alpha version was introduced in August 2013 [3].

Fig. 1. Hadoop 1, MapReduce daemon diagram [5].

2. LIMITATION OF MAPREDUCE

Hadoop 1.0 was designed for big data processing and MapReduce was the only option supported. MapReduce is the Hadoop framework which maps multi-terabyte data sets into chunks that are executed in parallel. The sorted output of the Map is then given to the Reduce tasks [4]. In the MapReduce framework the Node holds the metadata information, and the Job Tracker daemon ensures that the Task Trackers are processing the data. A Task Tracker runs the Map and Reduce tasks and reports the status to the Job Tracker. The Job Tracker tracks the tasks, schedules the jobs and monitors the Task Trackers. As shown in figure 1, the Job Tracker performs multiple operations, which becomes a bottleneck as we keep increasing the number of Task Trackers. A maximum of 4,000 Task Trackers and 40,000 concurrent tasks can be handled by a single Job Tracker. This limited the number of Task Trackers that could be used. In MapReduce, there was also a hard partition of resource utilization, which caused inefficient use of resources [5].

3. DESIGN

To overcome the limitations of MapReduce, Hadoop 2.0 was introduced with Yarn. Yarn is the architectural center of the Hadoop 2.0 design [6]. It allows multiple data processing engines, such as batch streaming [5], interactive SQL [5], data science and real-time streaming, to store and retrieve data in the Hadoop Distributed File System (HDFS). Figure 2 shows the Hadoop 2.0 architectural design. Yarn is a prerequisite for Enterprise Hadoop. It acts as the middle layer in the Hadoop cluster which extends the Hadoop capability: it provides resource management and delivers consistent operations, security and data governance tools across the Hadoop cluster. It provides ISV engines with a consistent framework which allows developers to write data access applications that run in Hadoop. ISVs (in the context of Hadoop Yarn) create software products to support and run on the Hadoop Yarn architecture.
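To make this resource-management role concrete, the small sketch below queries a running ResourceManager for its cluster metrics (live nodes, free memory and vcores), the bookkeeping described here and detailed in the architecture section that follows. This is an illustrative sketch, not part of the original paper: it assumes Python with the requests library, a ResourceManager reachable at the hypothetical host rm-host on the default web port 8088, and the standard /ws/v1/cluster/metrics REST endpoint; field names follow the ResourceManager REST API and may vary slightly between Hadoop versions.

    # Sketch: query the YARN ResourceManager REST API for cluster resource information.
    # Assumes a ResourceManager web endpoint at rm-host:8088 (hypothetical host, default port).
    import requests

    RM_METRICS_URL = "http://rm-host:8088/ws/v1/cluster/metrics"

    def cluster_summary():
        """Return a small summary of live nodes and available resources."""
        metrics = requests.get(RM_METRICS_URL, timeout=10).json()["clusterMetrics"]
        return {
            "active_nodes": metrics["activeNodes"],
            "available_memory_mb": metrics["availableMB"],
            "available_vcores": metrics["availableVirtualCores"],
            "running_apps": metrics["appsRunning"],
        }

    if __name__ == "__main__":
        print(cluster_summary())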

46 Review Article Spring 2017 - I524 2

Fig. 2. Architecture diagram for Hadoop 2.0 Yarn[6].

4. ARCHITECTURE

The Apache Hadoop YARN architecture has the ResourceManager, the ApplicationMaster, the NodeManager and the distributed application as its important parts. Figure 3 shows the architectural work flow of Yarn. YARN's global ResourceManager runs in a master daemon. Usually the ResourceManager is placed on a dedicated machine; it allocates the requested resources based on the available cluster resources. It keeps track of how many live nodes and resources are available on the cluster for allocation. It coordinates with the NodeManagers for resource allocation and release functionality. The ResourceManager consists of two major components, the Scheduler and the ApplicationManager [7]. The Scheduler is a pure scheduler which only allocates resources; it does not perform any monitoring or tracking of status. The Scheduler schedules resources (like CPU, disk, network etc.) based on the applications' requirements. The Scheduler can divide the cluster resources among the application tasks and queues; this is called a pluggable policy [7]. Once the user submits an application to the ResourceManager, the ResourceManager's Scheduler allocates resources to it. Once resources are allocated, the ApplicationMaster comes into the picture and performs the coordination of all the tasks running for that application, such as monitoring of tasks, restarting failed tasks, speculatively running slow tasks and computing the total values of the application counters. It also asks for the appropriate resources to run tasks [6]. This work flow is shown in figure 3.

The NodeManager creates the containers that perform the given tasks. The container size depends on the tasks it needs to perform. A container can hold any resource type, like CPU, disk, network, and storage. A NodeManager can have a number of containers depending on the configuration parameters and the node's resource capacity. Each application has its own ApplicationMaster, which performs the required tasks inside a container. For example, in MapReduce the ApplicationMaster performs the map and reduce tasks; along similar lines, Giraph's ApplicationMaster performs Giraph-specific tasks in a container. The ResourceManager, the NodeManager and a container work regardless of the type of application. This Yarn feature makes it more popular, as it allows different types of applications to run in a single cluster. This generic approach allows a Hadoop Yarn cluster to run various applications like MapReduce, Giraph, Storm, Spark, Tez/Impala, MPI etc. To understand the popularity of Hadoop Yarn, we need to know its advantages.

Fig. 3. Hadoop 2.0 Yarn architectural workflow diagram [2].

4.1. Advantages

• Resources can be shared among different clusters, which maximizes the efficiency of resource utilization.

• Running everything in one cluster helps to run the maximum number of applications and reduces the operation cost.

• It reduces data transfer operations, as there is no need to transfer data between Hadoop Yarn and systems running on different clusters of machines [6].

5. FEATURE

Yarn came with lots of features; we are not covering each of the features here. However, this article covers the most important features, compared to MapReduce, which make Yarn popular [6].

• Uberization: To reduce the overhead on the ResourceManager, a small task container can ask the NodeManager to start its tasks. So all the small tasks of MapReduce are run by the ApplicationMaster's JVM. This improves the performance.

• Multi-tenancy: Yarn allows multiple engines to access Hadoop as a common platform for batch processing, interactive and real-time data access. This feature returns the most enterprise investment for the business.

• Cluster Utilization: This feature overcomes the limitation of not using hardware resources efficiently for jobs in MapReduce. Yarn dynamically allocates resources, which improves resource utilization.

• Scalability: It introduces a cluster manager so the ResourceManager can solely focus on scheduling and lets the cluster track the live nodes and available resources in the cluster and assign the tasks to them.

• Compatibility: Existing MapReduce applications can run on the latest Hadoop 2 Yarn without failing.

6. LIMITATION

Any new offering takes time to mature and align with market expectations. Yarn has the following limitations. As the product grows, it comes with a better ecosystem to overcome its limitations [8]:

• Complexity

47 Review Article Spring 2017 - I524 3

– Yarn is complex and the level of abstraction is low from the perspective of developers.

– Developers need to perform very low-level tasks; for example, a Hello World program is about 1,500 lines of code.

– Clients need to be prepared with all the dependencies (mostly library files which help the current program to run with additional functionality).

– To get the application up and running, the client is required to set up the environment, such as the class path.

– To see the log, the developer needs to wait till the completion of the job; currently there is no mechanism to show the console log while the job is running.

• The application cannot handle a crash of its master, which causes a single point of failure for the application.

• There is no built-in communication layer available between the master and the containers.

• It is not helpful for long-running jobs.

• It is hard to debug.

7. CONCLUSION

Apache Hadoop Yarn is an important part of the Hadoop 2.0 architecture, which separates the resource management and processing components. With this functionality it achieves scalability, efficiency and flexibility compared to the MapReduce engine (from the first version of Hadoop). In addition, the Yarn-based architecture does not limit existing MapReduce applications; it supports such existing applications so they run in Hadoop 2.0 without failure. Most Hadoop 1.0 users are migrating to Hadoop 2.0. Yarn runs at large scale: Yahoo deployed 35,000 nodes for 6 months without failure. Yarn allows multiple access engines like Pig, Hive, Java, Scala, NoSQL, Spark and also engines from independent vendors to use HDFS through Hadoop Yarn. This increases Hadoop Yarn's compatibility to a large extent. YAHOO!, eBay, Spotify, Xing, Allegro and more are already using Hadoop Yarn.

ACKNOWLEDGEMENTS

Special thanks to Professor Gregor von Laszewski for giving me an opportunity to write this paper and for the most valuable suggestions. Thanks also to the entire class and the TAs for their suggestions and support.

REFERENCES

[1] Apache Hadoop Developer, "Hadoop Documentation," Web Page, 2017, page last accessed on 12 April 2017. [Online]. Available: http://hadoop.apache.org/
[2] Arun Murthy, "Apache Hadoop Yarn Background An Overview," Web Page, 2012, page last accessed on 7 April 2017. [Online]. Available: https://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
[3] Apache Hadoop Developers, "Apache Hadoop Release," Web Page, 2017, page last accessed on 12 April 2017. [Online]. Available: http://hadoop.apache.org/releases.html
[4] MapReduce Hadoop Developers, "MapReduce Tutorial," Web Page, 2013, page last accessed on 12 April 2017. [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[5] Kawa Adam, "Introduction to YARN," Web Page, 2014, page last accessed on 7 April 2017. [Online]. Available: https://www.ibm.com/developerworks/library/bd-yarn-intro/
[6] Hortonworks, "APACHE HADOOP YARN," Web Page, page last accessed on 7 April 2017. [Online]. Available: https://hortonworks.com/apache/yarn/
[7] The Apache Software Foundation, "Apache Hadoop Yarn," Web Page, 2016, page last accessed on 7 April 2017. [Online]. Available: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html
[8] BigDSol, "Understand Hadoop YARN," YouTube, 2013, page last accessed on 7 April 2017. [Online]. Available: https://www.youtube.com/watch?v=1vg_W-MMZpA

48 Review Article Spring 2017 - I524 1

Apache Tez- Application Data processing Framework

ABHIJIT THAKRE1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. 2Mechanical Engineer,Nagpur University, 2003 *Corresponding authors: [email protected] project-000, April 30, 2017

There have been many advancements in the Hadoop framework from Hadoop 1.0 to Hadoop 2.0. Hadoop 2.0 is a layered architecture that provides the YARN layer, responsible for resource management, which opens the gate for developing different application engines on top of it. Apache Tez is one such open source framework built on top of YARN, designed to build data-flow driven runtimes. This paper focuses mostly on an introduction to the Apache Tez framework. It provides insight into the architecture used in building Tez. It also covers the technologies using Apache Tez as a unifying framework and the performance improvements achieved as a result. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524

https://github.com/cloudmesh/classes/blob/master/docs/source/format/report/report.pdf

1. INTRODUCTION

In order to understand Tez as a framework, we need to first dig into the history of Hadoop. Hadoop 1.0 had MapReduce as the central execution engine of its applications. Any type of problem statement to be analyzed needed to be restructured to fit the map-reduce paradigm. MapReduce was also responsible for resource management and resource allocation. With Hadoop 2.0, these responsibilities were divided, with YARN taking on the responsibility of general-purpose resource management.

Apache Hive, Apache Pig and Cascading, which initially used Hadoop 1, now run on Hadoop 2 on YARN. These data processing applications, including MapReduce, have a certain set of requirements from the Hadoop cluster in order to run efficiently. This is where Tez comes into the picture. Tez takes care of the data processing application's efficiency and performance, leaving the end user to concentrate only on the business logic.

Fig. 1. [1]

2. TEZ TERMINOLOGY

DAG – Directed Acyclic Graph; it represents the overall job. Vertex – a logical step in processing; it contains the details of the user logic and the dependent environment. Task – a unit of work that a vertex performs; there can be multiple tasks in a vertex. Edge – represents the connection between producer and consumer vertices.

3. ARCHITECTURE AND IMPLEMENTATION

Tez was designed to address the problems which were not resolved by Hadoop. It was not built from scratch but on top of the YARN layer, to leverage the advantages and the work that were done on Hadoop over the years. So Tez leverages the discrete task-based compute model, the concept of data shuffling in MapReduce, the resource tenancy and multi-tenancy model, and the built-in security from Hadoop. In addition, Tez focuses mainly on the problems below.

• Without Tez, all the algorithms that need to be executed on clusters have to somehow be translated to the map-reduce API. This was impacting efficiency and performance. With Tez, however, an algorithm can be mapped naturally to the execution engine in the cluster.

• Tez provides additional interfaces for various applications and technologies for data sourcing and syncing.

• Performance.

49 Review Article Spring 2017 - I524 2

4. TEZ-API

Tez provides the two APIs below for defining the data processing.

4.1. DAG API

This API lets the user define the structure of the computation. It lets the user define the producers and consumers and how they talk to each other. This class of data processing application is represented as directed acyclic graphs.
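The Tez DAG API itself is a Java API, but the structure it expresses is easy to picture. The toy Python sketch below is only a conceptual illustration of the terminology from Section 2 (vertices holding user logic and parallel tasks, edges connecting producers to consumers); it is not the actual Tez interface, and every name in it is made up for illustration.

    # Toy model of the DAG vocabulary used by Tez: vertices (processing steps),
    # edges (producer -> consumer links) and tasks (parallel units of work per vertex).
    # Illustrative only; the real Tez DAG API is Java.
    from collections import defaultdict

    class Vertex:
        def __init__(self, name, logic, parallelism=1):
            self.name = name                 # logical step in the processing
            self.logic = logic               # user code each task applies
            self.parallelism = parallelism   # number of tasks for this vertex

    class DAG:
        def __init__(self, name):
            self.name = name
            self.vertices = {}
            self.edges = defaultdict(list)   # producer name -> [consumer names]

        def add_vertex(self, vertex):
            self.vertices[vertex.name] = vertex
            return self

        def add_edge(self, producer, consumer):
            self.edges[producer.name].append(consumer.name)
            return self

    # Example: a two-step job, "tokenize" feeding "count".
    dag = DAG("word-count")
    tokenize = Vertex("tokenize", logic=lambda line: line.split(), parallelism=4)
    count = Vertex("count", logic=lambda words: len(words), parallelism=2)
    dag.add_vertex(tokenize).add_vertex(count).add_edge(tokenize, count)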

4.2. RUNTIME API

Using this API layer, Tez invokes the user code. This is where the actual code of the task to be executed is defined.

Fig. 2. [1]

5. APPLICATIONS USING TEZ

Below are the products that have been updated to use the Tez framework and run on YARN.

5.1. Apache MapReduce

MapReduce is a simple but powerful way of data processing. The Tez product comes with built-in map-reduce support; the configuration needs to be updated on the YARN cluster for map-reduce. Tez has a built-in map processor and reduce processor which provide the respective map and reduce functionality.

5.2. Apache Pig

Pig Latin is the scripting language provided by Apache Pig, used to write complex ETL. The Tez API can handle the multiple outputs which are usually the case for a procedural language like Pig Latin, which helps in keeping the code clean and maintainable.

5.3. Apache Hive

Apache Hive is used to convert queries written in HiveQL to map-reduce and execute them on a Hadoop cluster. The HiveQL translation into map-reduce format is often inefficient. Hive 0.13 was developed with Tez integration [2], and the query trees translate directly into DAGs. Hive 0.14 has additional improvements like dynamic partition pruning.

5.4. Apache Spark

Apache Spark provides a Scala API for distributed data processing. The output of Spark is a DAG of tasks that performs distributed computation. The end output of Apache Spark, i.e. the Spark DAG, is successfully converted and executed using the Tez API.

Fig. 3. [1]

6. COMPLEX HIVE/PIG JOBS RUNNING ON MR VS TEZ

The API library, the YARN application master and the runtime library are the three major pillars used by the Apache Tez project. A Tez application DAG represents the flow of the application with input and output as key-value pairs for ease of use. Tez leverages all the best practices used in distributed data processing applications for improving efficiency and performance. To quote a few mechanisms, Tez runs tasks close to their data, monitors straggler tasks and runs them in parallel to take them to the finish, and reuses containers and sessions. The following figure depicts that running complex scripts and queries on MapReduce used to take multiple jobs; running those jobs on Tez has reduced the number of steps.

7. PERFORMANCE RESULTS

Various tests have been conducted to measure the performance of Hive and Pig implemented on top of Tez versus their original implementations on YARN. The sections below detail the configuration and test results for both of them.

7.1. HIVE

Tez provides features like broadcast edges, runtime reconfiguration and a custom vertex manager which improve the overall performance of Hive. The figure below depicts the comparison for Hive running on Tez versus the old map-reduce implementation for a workload of 30 TB scale on a 20-node cluster with 16 cores. Each node was assigned 256 GB RAM and 6 x 4 TB drives.

One of the test results published at the San Jose Hadoop Summit shows that Hive with Tez outperforms Hive with MapReduce for the TPC-H workload [3].

7.2. PIG

Yahoo has conducted production tests for complex ETL jobs. The load used for testing encompasses: 1. 100K tasks, 2. input in TBs, 3. complex DAGs, and 4. a combination of complex queries. The configuration of the environment/clusters used for the test: 1. the cluster has 4,200 servers, 2. 46 PB of HDFS storage, 3. 90 TB aggregate memory, 4. 24 GB RAM, 5. 2x Xeon 2.40 GHz, 6 x 2 TB SATA, on Hadoop 2.5, RHEL 6.5.

50 Review Article Spring 2017 - I524 3

8. CONCLUSION

With the growing trend of big data and its processing, Hadoop has been in demand. Tez is an open source architecture built by modeling data processing as directed acyclic graphs. It leverages all the strong points of Hadoop and addresses the issues concerning performance. The various examples discussed in this paper show how the implementations of systems like Hive, Pig and Spark on Tez have shown substantial performance improvements compared to the traditional MapReduce API. Tez is an open source and highly customizable framework. It allows researchers and open source developers to integrate and test their code.

Fig. 4. [1]

A performance improvement of 150 to 200% is observed with Tez.

7.3. Spark Multi-Tenancy Test

Tez allocates resources per task, whereas YARN allocates them based on the job life cycle. The Spark-on-Tez implementation has shown considerable speed-up, as idle resources in the Tez-based implementation get allocated to the next job, compared to the service-based approach where resources wait until the life cycle ends. This has been demonstrated with a 4-user concurrency test of partitioning a TPC-H lineitem data set. The allocation graphs are shown in figures 5 and 6.

Fig. 5. [4]

51 Review Article Spring 2017 - I524 4

REFERENCES [1] “TEZ,” Web Page, Indiana University, Mar. 2017. [Online]. Available: http://www.tez.apache.org [2] “TEZ,” Web Page, Indiana University, Apr. 2017. [Online]. Available: http://issues.apache.org/jira/browse/HIVE-4660 [3] “TEZ,” Web Page, Indiana University, Apr. 2017. [On- line]. Available: http://yahoodevelopers.tumblr.com/post/85930551108/ yahoo-betting-on-apache-hive-tez-and-yarn. [4] “TEZ,” Web Page, Indiana University, Apr. 2017. [Online]. Available: http://sungsoo.github.io/2015/06/06/spark-on-yarn.html

AUTHOR BIOGRAPHIES

Abhijit Thakre received his BE (Mechanical) in 2003 from The University of Nagpur.

52 Review Article Spring 2017 - I524 1

Deployment Model of Juju

SUNANDA UNNI1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] paper2, April 30, 2017

Developing an application for the cloud is accomplished by relying on Infrastructure as a Service (IaaS) or Platform as a Service (PaaS). Juju is software from Canonical that provides open source service orchestration using an IaaS model. Juju charms can be deployed for IaaS on cloud services such as Amazon Web Services (AWS) and OpenStack. Physical (bare-metal) server provisioning is provided by Juju using MAAS, Metal as a Service. We explore the deployment model of Juju. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524, Juju, IaaS

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IO-3022/report.pdf

1. INTRODUCTION Deploying software component systems [1] is becoming a critical challenge, especially due to the advent of Cloud Computing tech- nologies that make it possible to quickly run complex distributed software systems on-demand on a virtualized infrastructure at a fraction of the cost compared to a few years ago. When the number of software components needed to run the application grows, and their interdependencies become too complex to be manually managed, it is important for the system administrator to use high-level languages for specifying the expected minimal system.

2. IAAS, PAAS AND MAAS

The IaaS provides a set of low-level resources forming a bare computing environment. Developers pack the whole software stack into virtual machines containing the application and its dependencies and run them on physical machines of the provider's cloud. Exploiting the IaaS directly allows great flexibility but also requires great expertise and knowledge of the cloud and application entities involved in the process. IaaS describes the provision of processing, storage, networking and (potentially) other basic computing resources, over a network and in an on-demand fashion. Examples of IaaS are AWS [2] and Juju. IaaS customers do not host or manage the dedicated or virtual servers. Figure 1 shows the Juju scheme of work as IaaS.

Fig. 1. Basic Deployment of Juju scheme of work.

In PaaS (e.g., Google App Engine [3], Azure [4]) a full development environment is provided. Applications are directly written in a programming language supported by the framework offered by the provider, and then automatically deployed to the cloud. The high level of automation comes, however, at the price of flexibility: the choice of the programming language to use is restricted to the ones supported by the PaaS provider, and the application code must conform to specific APIs.

3. JUJU

In the IaaS, two deployment approaches [1] are gaining more and more momentum: the holistic one and the DevOps one. In the former, also known as the model-driven approach, one derives a complete model for the entire application and the deployment plan is then derived in a top-down manner. In the latter, put forward by the DevOps community [5], an application is deployed by assembling available components that serve as the basic building blocks. This emerging approach works in a bottom-up direction: from individual component descriptions and recipes for installing them, an application is built as a composition of these recipes.

One of the representatives of the DevOps approach is Juju [6], by Canonical. It is based on the concept of a charm: the atomic

53 Review Article Spring 2017 - I524 2

Fig. 2. Juju installation steps[6]

$ sudo apt-get update && sudo apt-get install juju

Fig. 3. Juju bootstrap command with yaml configuration set [6].

$ sudo juju generate-config $ edit the environments.yaml file $ sudo juju sync-tools $ sudo juju bootstrap

unit containing a description of a component.

In 2.0, Juju follows a Controller-Model type of configuration. A Juju controller is the management node of a Juju cloud environment. In particular, it houses the database and keeps track of all the models in that environment. Although it is a special node, it is a machine that gets created by Juju (during the "bootstrap" stage) and, in that sense, is similar to other Juju machines.

A Juju model is an environment associated with a controller. During controller creation two models are also created, the 'controller' model and the 'default' model. The primary purpose of the 'controller' model is to run and manage the Juju API server and the underlying database. Additional models may be created by the user.

Since a controller can host multiple models, the destruction of a controller must be done with ample consideration, since all its models will be destroyed along with it.

4. JUJU INSTALLATION

Installing Juju is straightforward; it is described in detail in the Getting Started document [7] and in the paper on deploying Juju on MAAS [8]. The command to install Juju is shown in Fig 2. We do not cover the setup of LXD or MAAS for the physical containers or cluster on which Juju is to be installed; we refer to [7] for LXD and [8] for MAAS installation.

To create a controller, the juju bootstrap command is used. There are two ways to bootstrap:

1. Generate a config with the environment specified in yaml, as shown in Fig 3.

2. By specifying the details on the command line, as shown in Fig 4.

5. INSTALLATION OF CHARMS

Fig 5 shows the deployment model of charms. When deploying charms you can do the following:

1. Download them from external repositories as you deploy. For direct installation from an external repository, we can use Fig 6.

2. Download them first to a local repository and then point to the repository as you deploy. For deploying from a local repository you can install bazaar as shown in Fig 7.

The status of the installation can be checked by the command shown in Fig 8. On successful installation of MySQL and MediaWiki, we can see the status as shown in Fig 9.

Fig. 5. Charms and Bundles [6].

Fig. 6. Juju Charm deploy from external repository [6].

$ sudo apt-get install charm-tools
$ charm get
$ juju deploy local:trusty/

6. BUNDLING

A Bundle is an encapsulation of complex deployments including many different applications and connections. Bundles are deployed as follows:

1. From the Juju Charm Store, as shown in Fig 10.

Fig. 4. Juju bootstrap command with command line settings [6].

$ sudo juju bootstrap localhost lxd-test

lxd-test is the controller on the local machine.

Fig. 7. Juju Charm deploy from local repository [8].

$ sudo apt-get install bzr
$ mkdir -p /opt/charms/trusty; cd /opt/charms/trusty
$ bzr branch lp:charms/
$ juju deploy repository=/opt/charms local:trusty/

54 Review Article Spring 2017 - I524 3

Fig. 8. Juju Charm installation status [6].

$ juju status

Fig. 9. Juju status post installation of Bundle [7].

2. By exporting the current configuration into a bundle. The Juju GUI can be used for exporting the current configuration.

7. ADVANTAGES OF JUJU

Scaling the setup is possible. No prior knowledge of the application stack is required. The Charm Store provides charms for a lot of common open source applications. Bundling helps in easy re-deployment to a new environment. Juju works with existing configuration management tools.

8. DRAWBACKS OF JUJU

In order to use Juju, some advanced knowledge of the application to install is mandatory. This is due to the fact that the metadata does not specify the required functionalities needed by a component. Currently Juju can be used for deployments on a limited set of operating systems: Windows, CentOS, and Ubuntu. Juju still does not support the handling of circular dependencies.

9. CONCLUSION

For application provisioning, Juju provides orchestration similar to Puppet, Chef, Salt, etc. Its advantage is that Juju comes with a GUI in the open source version and is interoperable with most of the major cloud providers.

REFERENCES

[1] T. A. Lascu, J. Mauro, and G. Zavattaro, "Automatic deployment of component-based applications," Science of Computer Programming, vol. 113, pp. 261–284, 2015.
[2] Amazon Web Services, Inc., "Amazon web services," Web Page, 2017. [Online]. Available: https://aws.amazon.com/
[3] Google Inc., "Google app engine documentation," Web Page. [Online]. Available: https://developers.google.com/appengine
[4] Microsoft, "Microsoft azure," Web Page, 2017. [Online]. Available: http://azure.microsoft.com
[5] Mediaops, LLC, "Devops - where the world meets devops," Web Page, 2017. [Online]. Available: https://devops.com
[6] Canonical Ltd, "Ubuntu cloud documentation," Web Page. [Online]. Available: https://www.ubuntu.com/cloud/juju
[7] Canonical Ltd, "Getting started with juju," Web Page. [Online]. Available: https://jujucharms.com/docs/stable/getting-started
[8] K. Baxley, J. la Rosa, and M. Wenning, "Deploying workloads with juju and maas in ubuntu 14.04 lts," Dell Inc., May 2014. [Online]. Available: http://en.community.dell.com/techcenter/extras/m/white_papers/20439303

Fig. 10. Deploy bundle from Juju Charm Store [7].

juju deploy cs:bundle/wiki-simple-0

55 Review Article Spring 2017 - I524 1

AWS Lambda

KARTHICK VENKATESAN1,*,+

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] +HID - S17-IO-3023 paper2, April 30, 2017

The rapid pace of innovation in data centers and the software platforms within them is set to transform how we build, deploy, and manage online applications and services. Common to both hardware-based and container-based virtualization is the central notion of a server. Servers have been used for the past several years to back online applications, but new cloud-computing platforms foreshadow the end of the traditional backend server. Servers are notoriously difficult to configure and manage, and server startup time severely limits an application's ability to scale up and down quickly. As a result, a new model, called serverless computation, is poised to transform the construction of modern, scalable applications. This paper is on AWS Lambda, a serverless computing technology. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: AWS Lambda, Serverless Computing, I524

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-IO-3023/report.pdf

1. INTRODUCTION

Serverless applications are applications where some amount of server-side logic, unlike in traditional architectures, is run in stateless compute containers that are event-triggered, ephemeral, and fully managed by a third party. One way to think of this is 'Functions as a Service' (FaaS). AWS Lambda is one of the most popular implementations of FaaS [1].

AWS Lambda is a FaaS (Function as a Service) offering from Amazon Web Services. It runs the backend code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance and capacity provisioning [2]. In AWS Lambda one needs to pay only for the compute time consumed; there is no charge when the code is not running. AWS Lambda can run code for virtually any type of application or backend service with zero administration. AWS Lambda requires only the code to be uploaded, and Lambda takes care of everything required to run and scale the code with high availability. AWS Lambda can be set up to automatically trigger the code from other AWS services or to call it directly from any web or mobile app [3].

2. FEATURES

The key features of AWS Lambda are [3]:

• No Servers to Manage: AWS Lambda automatically runs the code without requiring you to provision or manage servers. Just write the code and upload it to Lambda.

• Continuous Scaling: AWS Lambda automatically scales the application by running code in response to each trigger. The code runs in parallel and processes each trigger individually, scaling precisely with the size of the workload.

• Subsecond Metering: With AWS Lambda, the institution using the service is charged for every 100ms the code executes and for the number of times the code is triggered. They don't pay anything when the code isn't running.

3. ARCHITECTURE

Fig. 1. AWS [3]

Lambda functions and event sources are the core components in AWS Lambda. An event source is the entity that publishes events, and a Lambda function is the custom code that processes the events. Several AWS cloud services can be preconfigured to work with AWS Lambda. The configuration is referred to as event source mapping, which maps an event source to a Lambda function. It enables automatic invocation of the Lambda function when events occur.
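As a concrete illustration of the function side of this mapping, the sketch below shows a minimal Python Lambda handler that processes an Amazon S3 event of the kind such an event source mapping would deliver. It is an illustrative sketch rather than code from the paper; the handler name, the choice of an S3 object-created trigger, and the returned payload are our assumptions.

    # Minimal AWS Lambda handler sketch (Python runtime).
    # Lambda calls the handler with the event published by the mapped event source
    # (here assumed to be an S3 object-created notification) and a context object.
    import json

    def lambda_handler(event, context):
        processed = []
        # An S3 event delivers one or more records describing the changed objects.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            processed.append("s3://" + bucket + "/" + key)

        # Whatever is returned here is reported back to the invoking service.
        return {
            "statusCode": 200,
            "body": json.dumps({"processed_objects": processed}),
        }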

56 Review Article Spring 2017 - I524 2

Each event source mapping identifies the type of events to publish and the Lambda function to invoke when events occur. The specific Lambda function then receives the event information as a parameter, and as shown in Figure 1 the Lambda function code can then process the event.

The event sources can be any of the following:

• AWS services – These are the supported AWS services that can be preconfigured to work with AWS Lambda. These services can be grouped into regular AWS services and stream-based services. Amazon Kinesis Streams [4] and Amazon DynamoDB Streams [5] are stream-based event sources; all other AWS services do not use stream-based event sources.

• Custom applications – Custom applications that are built can also publish events and invoke a Lambda function.

4. SCALABILITY

Fig. 2. Response Time. This CDF shows measured response times from a simulated load burst to an Elastic BS application and to an AWS Lambda application. [6]

A primary advantage of the Lambda model is its ability to quickly and automatically scale the number of workers when load suddenly increases. The graph in Figure 2 demonstrates this by comparing AWS Lambda to a container-based server platform, AWS Elastic Beanstalk [7] (hereafter Elastic BS). On both platforms, the same benchmark was run for one minute: the workload maintains 100 outstanding RPC (Remote Procedure Call) requests and each RPC handler spins for 200ms. Figure 2 shows the result: an RPC using AWS Lambda has a median response time of only 1.6s, whereas an RPC in Elastic BS often takes 20s. Investigating the cause of this difference, it was found that while AWS Lambda was able to start 100 unique worker instances within 1.6s to serve the requests, all Elastic BS requests were served by the same instance; as a result, each request in Elastic BS had to wait behind 99 other 200ms requests. AWS Lambda also has the advantage of not requiring configuration for scaling. In contrast, Elastic BS configuration is complex, involving 20 different settings for scaling alone. Even though Elastic BS was tuned to scale as fast as possible, it still failed to spin up new workers for several minutes [6].

5. DOCUMENTATION

• Detailed documentation on AWS Lambda deployment, configuration, debugging and development is available at [8].

• Use cases based on reference architectures with AWS Lambda are available at [9].

6. COMPETITORS

The main competitors for AWS Lambda are:

• Microsoft Azure Functions

• Google Cloud Functions

A detailed comparison of features between AWS Lambda, Google Cloud Functions and Microsoft Azure Functions is available in Table 1.

7. PRICING

With AWS Lambda, users pay only for what they use. They are charged based on the number of requests for their functions and the duration for which the code executes. Requests: Users are charged for the total number of requests across all their functions. Lambda counts a request each time it starts executing in response to an event notification or invoke call, including test invokes from the console. The first 1 million requests per month are free; $0.20 per 1 million requests thereafter ($0.0000002 per request). Duration: Duration is calculated from the time the code begins executing until it returns or otherwise terminates, rounded up to the nearest 100ms. The price depends on the amount of memory the user allocates to the function. Users are charged $0.00001667 for every GB-second used [11].

8. USE CASE

Fig. 3. Mobile Backend [3]

Mobile Backends: Mobile backends are an excellent use case for AWS Lambda. As shown in Figure 3, the backend of a mobile application can be built using AWS Lambda and Amazon API Gateway to authenticate and process API requests. Lambda makes it easy to create rich, personalised app experiences. Bustle.com is a news, entertainment, lifestyle, and fashion website catering to women. Bustle also operates Romper.com, a website focused on motherhood. Bustle is based in Brooklyn, NY and is read by 50 million people each month. Bustle uses AWS Lambda to process high volumes of site metric data from Amazon Kinesis Streams [4] in real time. This allows the Bustle team to get data more quickly so they can understand how new site features affect usage. They can also measure user engagement, allowing better data-driven decisions. The serverless back end supports the Romper website and iOS app as well as the Bustle iOS app. With AWS Lambda, Bustle was able to eliminate the need to worry about operations. The engineering team at Bustle focus on just writing code, deploy it, and it

57 Review Article Spring 2017 - I524 3

Table 1. AWS Lambda vs. Google Cloud Functions vs. Microsoft Azure Functions [10] FEATURE AWS LAMBDA GOOGLE CLOUD AZURE FUNCTIONS Scalability Metered scaling (App Service Plan) Automatic scaling (transparently) Automatic scaling & availability Automatic scaling (Consumption Plan) Max # of functions Unlimited functions 20 functions per project (alpha) Unlimited functions Concurrent execu- 100 parallel executions (soft limit) No limit No limit tions Max execution 300 sec (5 min) No limit 300 sec (5 min) C# and JavaScript (preview of F#, Supported languages JavaScript, Java Python, C# Only JavaScript Python, Batch, PHP, PowerShell) Dependencies Deployment Packages npm package. Npm, NuGet ZIP upload, or Visual Studio Team Services, OneDrive, Deployments Only ZIP upload (to Lambda or S3) Cloud Source Repositories GitHub, Bitbucket, App Settings and ConnectionStrings Environment vari- Yes Not yet ables from App Services Versioning Versions and aliases Cloud Source branch/tag Cloud Source branch/tag S3, SNS, SES, DynamoDB, Kinesis, Cloud Pub/Sub or Cloud Storage Blob, EventHub, Generic WebHook Event-driven CloudWatch, Cognito, API Gateway Object Change Notifications Queue, Http, Timer triggers HTTP(S) invocation API Gateway HTTP trigger HTTP trigger Orchestration AWS Step Functions Not yet Azure Logic Apps Logs management CloudWatch Cloud Logging App Services monitoring Functions environment, AppServices In-browser code edi- Yes Only with Cloud Source Reposi- tor tories editor Granular IAM roles Not yet IAM roles IAM 1M requests for free (Free Tier), then Pricing Unknown until open beta 1 million requests and 400,000 GB-s $0.20/1M requests scales infinitely, and they don’t have to deal with infrastructure a subset of mobile app metrics in realtime and report it to the management. Bustle was able to achieve the same level of op- buisness users several times during the day.The solution was to erational scale with half the size of team of what is normally be delivered in 3 weeks.Leveraging AWS Lambda and Amazon needed to build and operate the site [12]. Kinesis Zillow was able to seamlessly scale 56 lines of code to over 16 million posts a day and achieve its goal in 2 weeks [13].

9. CONCLUSION

Serverless computing allows developers to build and run applications and services without thinking about servers. At the core of serverless computing is AWS Lambda, which lets developers build auto-scaling, pay-per-execution, event-driven apps quickly.

ACKNOWLEDGEMENTS

The authors thank Prof. Gregor von Laszewski for his technical guidance.

Fig. 4. Lambda ETL [3]

Extract, Transform, Load: As shown in figure 4, AWS Lambda can be used to perform data validation, filtering, sorting, or other transformations for every data change in a DynamoDB table and load the transformed data to another data store. Zillow is the leading real estate and rental marketplace dedicated to empowering consumers with data, inspiration and knowledge around the place they call home, and connecting them with the best local professionals who can help. Zillow needed to collect

REFERENCES

[1] M. Roberts, "Serverless," Web Page, Aug. 2016, accessed 2017-03-26. [Online]. Available: https://martinfowler.com/articles/serverless.html
[2] A. Gupta, "Serverless faas with aws lambda and java," Web Page, Jan. 2017, accessed 2017-03-26. [Online]. Available: https://blog.couchbase.com/serverless-faas-aws-lambda-java/
[3] Amazon Web Services, Inc, "AWS Lambda," Web Page, accessed 2017-03-26. [Online]. Available: https://aws.amazon.com/lambda/

58 Review Article Spring 2017 - I524 4

[4] Amazon Web Services, Inc, “AWS Kinesis Streams,” Web Page, accessed 2017-03-26. [Online]. Available: https://aws.amazon.com/ kinesis/streams/ [5] Amazon Web Services, Inc, “AWS Dynamo DB Streams,” Web Page, accessed 2017-03-26. [Online]. Available: http://docs.aws.amazon.com/amazondynamodb/latest/ APIReference/API_Types_Amazon_DynamoDB_Streams.html [6] S. Hendrickson, S. Sturdevant, T. Harter, V. Venkataramani, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, “Serverless computation with openlambda,” in 8th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud 2016, Denver, CO, USA, June 20-21, 2016. Berkeley, CA, USA: USENIX Association, 2016. [Online]. Available: https://www.usenix.org/conference/hotcloud16/ workshop-program/presentation/hendrickson [7] Amazon Web Services, Inc, “AWS Elastic Beanstalk,” Web Page, accessed 2017-03-26. [Online]. Available: https://aws.amazon.com/ elasticbeanstalk/ [8] Amazon Web Services, Inc, “AWS Lambda Doc,” Web Page, accessed 2017-03-26. [Online]. Available: https://aws.amazon.com/ documentation/lambda/ [9] Amazon Web Services, Inc, “AWS Lambda Use Case,” Web Page, accessed 2017-03-26. [Online]. Available: http://docs.aws.amazon.com/ lambda/latest/dg/use-cases.html [10] M. Parenzan, “Microsoft azure functions vs. google cloud functions vs. aws lambda,” Web Page, Feb. 2017, ac- cessed 2017-03-26. [Online]. Available: http://cloudacademy.com/blog/ microsoft-azure-functions-vs-google-cloud-functions-fight-for-serverless-cloud-domination-continues/ [11] Amazon Web Services, Inc, “AWS Lambda Pricing,” Web Page, accessed 2017-03-26. [Online]. Available: https://aws.amazon.com/ lambda/pricing/ [12] Amazon Web Services, Inc, “Bustle Case Study,” Web Page, accessed 2017-03-26. [Online]. Available: https://aws.amazon.com/solutions/ case-studies/bustle/ [13] Amazon Web Services, Inc, “AWS re:Invent 2015 | (BDT307) Zero Infrastructure, Real-Time Data Collection, and Analytics,” https://www. .com/watch?v=ygHGPnAd0Uo, Youtube, Oct. 2015, accessed 2017-03-26.

59 Review Article Spring 2017 - I524 1

Puppet - Automatic Configuration Management

ASHOK VUPPADA1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] project-000, April 30, 2017

Cloud providers today need to deliver pre-installed infrastructure, operating systems or programming languages as per the service level agreements. There is a need to maintain multiple instances with various dependencies installed. It would become nearly impossible to maintain the configurations manually; hence configuration management tools become a necessity in the cloud era. Puppet is one of the configuration management tools available today. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524

https://github.com/justbbusy/sp17-i524/tree/master/paper2/S17-IO-3024/report.pdf

1. INTRODUCTION

Puppet can mean the programming language in which the required end state is defined. Unlike with procedural steps, there is no need to know the required steps to achieve the end state; Puppet internally manages the required steps and ensures the end state is achieved. The same Puppet language code works across various operating systems [1].

The Puppet language alone cannot achieve the configuration management required; Puppet code needs to be maintained on a regular basis to ensure the required dependencies are properly installed on the client servers. The Puppet platform provides the required framework to maintain the Puppet code [1].

Puppet is developed by Puppet Labs using the Ruby language and was released under the GNU General Public License (GPL) until version 2.7.0 and the Apache License 2.0 after that [4].
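The declarative idea described above can be pictured with a small toy example. The Python sketch below is not Puppet and does not use Puppet's language; it only mimics the model the introduction describes, in which the user declares a desired end state (here, a file with given content) and the tool checks the current state and changes it only if needed, so applying it repeatedly is safe. All paths and names in it are hypothetical.

    # Toy illustration of declarative, idempotent configuration management:
    # declare the desired end state and let the tool converge the system to it.
    # This mimics the model Puppet implements; it is not Puppet itself.
    from pathlib import Path

    def ensure_file(path, content):
        """Make sure `path` exists with exactly `content`; do nothing if it already does."""
        target = Path(path)
        if target.exists() and target.read_text() == content:
            return "unchanged"          # already in the desired state
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)      # converge to the desired state
        return "changed"

    # Desired end state, declared as data rather than as a sequence of steps.
    desired_state = [
        ("/tmp/demo/motd", "Managed by a configuration tool.\n"),  # hypothetical path
    ]

    for path, content in desired_state:
        print(path, ensure_file(path, content))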

2. ARCHITECTURE

The catalog is the file which contains the end state required at the client; it is defined in the Puppet language. This file is maintained at the Puppet master and is downloaded by the Puppet client from the Puppet master when it runs. The changes are applied by compiling and running the catalog [2].

The Puppet agent is a daemon process that runs on the client machine where the configuration is required to be managed. The Puppet master is a daemon process that runs on the host which manages the configuration across the various clients. The Puppet agent and master communicate through a secure SSL connection. The Puppet master keeps checking with the client whether the required installations are done or not; if there is any change in the end state required at a given client, the Puppet agent runs and ensures the end state is changed as per the configuration. It is also possible to define the time interval at which the Puppet master checks with the client [3].

Fig. 1. Puppet Master Puppet Client Communication

3. CONFIGURATION

The screenshots below explain how to configure the Puppet master and client, and also how to make the Puppet client pick up the changes immediately.

60 Review Article Spring 2017 - I524 2

3.1. Puppet Master Configuration

5. DISADVANTAGES

1. For more advanced tasks, one needs to use the CLI, which is Ruby-based and therefore makes it necessary to have Ruby knowledge.

2. Support for pure-Ruby versions is being scaled back.

3. DSL and a design that does not focus on simplicity, the Puppet code base can grow large, unwieldy, and hard to pick up for new people in your organization at higher scale.

4. Model-driven approach means less control compared to code-driven approaches.[6]

6. CONCLUSION

The need for configuration management is essential with cloud offerings. Various tools in the market, like Chef, Ansible, Puppet, Salt etc., give the configuration management team the freedom to use the applicable tool based on the case-to-case need.

Fig. 2. Puppet Master Configuration

[5]

REFERENCES

[1] "Introduction to puppet," web page. [Online]. Available: https://www.infoq.com/articles/introduction-puppet
[2] "Puppet documentation," web page. [Online]. Available: https://docs.puppet.com/puppet/4.9/architecture.html
[3] "How puppet works," web page. [Online]. Available: http://www.slashroot.in/puppet-tutorial-how-does-puppet-work
[4] "Puppet software," web page. [Online]. Available: https://en.wikipedia.org/wiki/Puppet_(software)
[5] "Installing and configuring puppet," web page. [Online]. Available: https://techarena51.com/index.php/a-simple-way-to-install-and-configure-a-puppet-server-on-linux/
[6] "Puppet vs ansible vs chef," web page. [Online]. Available: http://www.intigua.com/blog/puppet-vs.-chef-vs.-ansible-vs.-saltstack

Fig. 3. Puppet Client Configuration

[5]

3.3. If the Client Needs the Changes Immediately

Fig. 4. If the client needs the changes immediately

[5]

4. ADVANTAGES 1. Puppet Labs provides very good support for this tool.

2. Good interface and runs on almost all operating systems.

3. Installation and setup is simple

4. Strong reporting capabilities

[6]

61 Review Article Spring 2017 - I524 1

HUBzero: A Platform For Scientific Collaboration

NITEESH KUMAR AKURATI1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] paper2, April 30, 2017

HUBzero is a cyberinfrastructure in a box that is used to create dynamic websites for educational activities as well as scientific research. HUBzero provides a platform where researchers can publish and share their research software and related material for educational purposes on the web. Other researchers are able to access the tools as well as the material using a web browser and can also launch simulation runs on the national Grid infrastructure without having to download any code. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: cyberinfrastructure, BigData, Simulation Tools, Collaboration

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2001/report.pdf

1. INTRODUCTION will be accessed by user’s browser via VNC(Virtual Network Connection)[1]. HUBzero is a platform created to support nanoHUB.org, an Each user has a home directory which is unique to that partic- online community for the Network for Computational Nanat- ular user with access control, quota limitations and conventional echnology(NCN), which is funded by the U.S.National Sci- ownership privileges. Each tool runs on a lightweight restricted ence Foundation since 2002 to connect the theorists across the virtual environment, that controls access to the networking, file academia who develop simulation tools with the educators and systems and server processes. Tools are run with the privileges the experimentalists who might use them. HUBzero has been of a particular user[1]. expanded to more than 30hubs since 2002 and the hubs are grow- Tools running on the hub will have a XII Window System ing every year into various fields ranging from development of environment. Each container runs a special X server which also assistive technology to volcanlogy[1]. acts as a VNC server is used to create the graphical session. The ”HUBzero is now supported by a consortium including Pur- middleware of HUBzero controls the tool container’s network due, Indiana, Clemson and Wisconsin. Researchers at Rice, the operations. The middleware also monitors the start time as well State University of New York system, the University of Connecti- as the duration of each connection. The connections that are not cut and Notre Dame use hubs. Purdue offers a hub-building being used for a considerable period after starting is terminated and -hosting service and the consortium also supports an open and marked as inactive[3]. source release, allowing people to build and host their own”[? ]. A hub web 2.0 functionality with an unique middleware, 3. FEATURES thereby providing a platform that is more powerful than an ordinary website. In a hub users will able to create, publish, HUBzero unique blend of social networking and simulation share information, network and also have access to interactive power made it popular among research and scientific commu- visualization tools. The users of the hub ultimately collaborate nities. HUBzero’s platforms are sophisticated because of the to develop the educational community and make it a powerful various features it offers. HUBzero sites provides a lot of fea- resource. A hub can direct jobs to national resources such as tures for collaboration, development and deployment of tools, TeraGrid, Purdue’s DiaGrid as well as other cloud systems[2]. interactive user interfaces for simulation tools, online commu- nity groups, Wikis, Blogs, Provision for Feedback Mechanisms, Metrics for Usage, content management resources and every- 2. MIDDLEWARE thing that would be needed for a collaborative platform HUBzero mix of social networking and simulation is only possi- ble with the help of unique middleware. HUBzero’s middleware 3.1. Development and Deployment of Tools hosts the live session tools and makes it easy to connect the cloud Each hub comes with a companion site for developing source computing infrastructure and supercomputing clusters to solve code based on open source package called Trac for project man- the large computational problems. The simulation tools runs agement. Uploading a tool is a little complicated. Tools must be on clusters of executions hosts hosted near the Web server and uploaded, compiled, tested, fixed, compiled again, and tested

62 Review Article Spring 2017 - I524 2 again many times before being published. Each tool will have 3.5. Content Tagging, Wikis abd Blogs its own project area within the site, with a repository for source The resources available on a hub are categorized by a series code control, a ticketing system for bug tracking as well as a of tags. Each tag has an associated page on the hub where its wiki area for project documentation. resources are listed and meaning is defined. Tags can be defined A team member from development get access to workspace by anyone like the contributor, users of the page or the hub which is a Linux desktop running in a secured execution environ- adminstrator. ment, accessed via a Web browser. The tools that are available Hubs support topic pages. Each topic is created using a on the hub dont come from the core development team rather standard wiki syntax by a specified list of authors. Users can they are developed by various collaborators across the world. comment on the content of the wiki or even suggest changes. Developers can use HUBzero’s rapture toolkit to create a GUI The original moderators of the topic are notified about the sug- with little effort[4]. gestions which they will consider if apt. The page can also be given a ownership. Topic pages act as lightweight articles that 3.2. Online Presentations help to describe various resources on the hub[4]. HUBzero facilitates Online Presentations along with the tools. The PowerPoint slides are combined with voice and animation. The online presentations provide user a standard seminar expe- 4. APPLICATIONS rience. HUBzero uses MacroMedia Breeze to deliver the online presentations in a compact format using Flash, which is present HUBzero platform has been very popular in science and engi- on 98 percent of the world desktops. The online presentations neering fields. Many hubs have come into existence in various provided by HUBzero can also be distributed as podcasts which fields ranging from cancer to nanotechnology. More than 30 be accessed by the users on-the-go via their personal device hubs have been spawned since HUBzero platform had come assistants[4]. into existence. All these hubs serve more than 8,50,000 visitors from 172 countries worldwide during year 2012 alone. 3.3. Interactive Simulation Tools The underlying premise of HUBzero is its signature service to 4.1. NanoHUB deliver live graphical simulation tools that are interactive and nanoHUB is the reason behind the HUBzero platform. HUBzero can be accessed using a normal web browser. The tools in a hub was initially created to support nanoHUB.org, an online commu- are interactive; you can zoom a molecule, analyze 3D volume nity for the Network for Computational Nanotechnology, which interactively and need not even wait for the webpage to be is funded by US National Science Foundation in 2002 to connect refreshed. A user can visualize results in real-time and need for theorists who develop simulation tools with the experimenters wait for processing. One can deploy new tools without having and engineers. to write any special code for deploying on the web[4]. HUBzero’s Rappture Toolkit helps to create GUI with little nanoHUB is an online portal for nanotechnology where re- effort. The HUBzero infrastructure provides tool execution as searchers, students and instructors collaborate to share scientific well as delivery mechanism based on Virtual Network Comput- tools, simulation tools, educational material, research etc., It uses ing(VNC). 
cyberinfrastructure to provide access to these tools and also the instructional materials. The users can run experiments, download lectures, as well as review research [5]. Using the architecture provided by HUBzero, nanoHUB, the first hub ever developed, has brought about 50 simulation tools live in a span of 2 years.

3.4. Collaboration and Support 4.2. Big Data Analysis in Social Science Using HUBzero HUBzero is a popular because it not only provides simulation Datasets from social media are infact large and can grow beyond tools, wikis etc., but also provides sophisticated mechanism for a individual or institution analytical capability of common soft- colleagues to collaborate among themselves and work together. ware tools. Institutions such as Non-STEM or undergraduate HUBzero middleware hosts tool sessions, where a single session schools lack the compute infrastructure and personnel needed can be shared with among many number of people. Each session to allows researchers, students, instructors to create and analyze provides a textbox entry beneath it to allow the users to enter large datasets. the colleagues names they wish to collaborate with. A group of The social science students are expected to be competent people can see the same session at the same time and discuss users. Therefore, in order to provide infrastructure necessary ideas over the phone or instant messaging [4]. to expose undergraduates social science students to data inten- HUBzero also provides community forum modeled after sive computing, State University of New York (SUNY) Oneonta Amazon.com’s where users can post questions. Users teamed with University at Buffalo’s Center for Computational can also tag the posts with the tool names and concepts so that Research (CCR) to establish a collaborative virtual community users can find questions matching their interests and post an- focusing on data intensive computing education[6]. swers accordingly. Users also can file trouble reports if some- thing goes wrong which are sent directly to tool development team. Developers view the help requests as tickets and investi- 5. LICENSE gate and resolve the issues. Users also can have access to wishlist where they can request new features for a particular tool. The HUBzero has been released as Open Source under the LGPL-3.0 developers can determine a wish’s relative priority and judges license. HUBzero is community code unlike the GNU General which are important and requires little effort are ordered ac- Public License it doesn’t “infect” your code. Under HUBzero’s cordingly. There is points and bounty system which enhances license, one can treat HUBzero as a “library,” create your own involvement of users in the hub by providing premium access unique components within our framework, and license your to the resources based on the levels of bounties and points. derivative works any way you like.


6. CONCLUSION
HUBzero is a unique platform whose blend of social networking and simulation power has resonated with science and engineering communities. As a hub grows, its capabilities continue to grow, and more tools and related content are added. A hub lets researchers and engineers collaborate to solve larger problems by connecting a series of models from independent authors.

REFERENCES [1] M. McLennan and R. Kennell, “Hubzero: a platform for dissemination and collaboration in computational science and engineering,” Computing in Science & Engineering, vol. 12, no. 2, 2010. [2] M. McLennan and G. Kline, “Hubzero paving the way for the third pillar of science,” webpage, 2011. [Online]. Available: https://www.hpcwire.com/ 2011/02/28/hubzero_paving_the_way_for_the_third_pillar_of_science/ [3] “Information technology at purdue research compting(rcac),” webpage. [Online]. Available: https://www.rcac.purdue.edu/services/hubzero/ [4] “hubzero:platform for scientific collaborations,” webpage. [Online]. Available: https://hubzero.org/tour/features [5] G. Klimeck, M. McLennan, S. P. Brophy, G. B. Adams III, and M. S. Lundstrom, “nanohub. org: Advancing education and research in nan- otechnology,” Computing in Science & Engineering, vol. 10, no. 5, pp. 17–23, 2008. [6] J. M. Sperhac, S. Gallo, J. B. Greenberg, B. Lowe, B. Wilkerson, G. Fulkerson, and B. Heindl, “Teaching big data analysis in the social sciences using a hubzero-based platform,” Oct 2014. [Online]. Available: https://hubzero.org/resources/1235


Apache Flink: Stream and Batch Processing

JIMMY ARDIANSYAH1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *[email protected] - S17-IR-2002

Research Article-02, April 30, 2017

Apache Flink is an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continuous data pipelines, historic data processing (batch), and iterative algorithms can be expressed and executed as pipelined fault-tolerant dataflows.

© 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Apache Flink, Batch Data Processing, Stream Data Processing

https://github.com/jardians/sp17-i524/blob/master/paper2/S17-IR-2002/report.pdf

1. INTRODUCTION
Data stream and batch data processing were traditionally considered as two very different types of applications. They were programmed using different programming models and APIs, and were executed by different systems. Normally, batch data analysis made up the biggest share of the use cases, data sizes, and market, while streaming data analysis mostly served specialized applications.

These continuous streams of data come, for example, from web log files, application log files, and database log files. Rather than treating the streams as streams, today's setups ignore the continuous and timely nature of data production. Instead, data records are batched into static data sets (hourly, daily, or monthly) and then processed in a time-based fashion. Data collection tools, workflow managers, and schedulers orchestrate the creation and processing of batches, in what is actually a continuous data processing pipeline. Apache Flink follows a paradigm that embraces data stream processing as the unifying model for real-time analysis, continuous streams, and batch processing, both in the programming model and in the execution engine.

Flink programs can compute both data stream and batch data accurately, avoiding the need to combine different systems for the two use cases. Flink also supports different notions of time (event time, ingestion time, processing time) in order to give programmers high flexibility in defining how events should be correlated [1].

2. HISTORY DEVELOPMENT
Flink has its origins in the Stratosphere project, a research project conducted by three Berlin-based universities as well as other European universities between 2010 and 2014. The project had already attracted a broader community base, such as the NoSQL and Big Data Developers Groups. This strong community base is one reason the project was appropriate for incubation under the Apache Software Foundation (ASF).

A fork of the Stratosphere code was donated in April 2014 to the Apache Software Foundation (ASF) as an incubating project, with an initial set of committers consisting of the core developers of the system. Shortly thereafter, many of the founding committers left university to start a company to commercialize Flink, such as Data Artisans.

During incubation, the project name had to be changed from Stratosphere because of potential confusion with an unrelated project. The name Flink was selected to honor the style of this stream and batch processor: in German, the word "Flink" means fast or agile. A logo showing a colorful squirrel was chosen because squirrels are fast and agile. The project completed incubation quickly, and in December 2014, Flink graduated to become a top-level project of the Apache Software Foundation (ASF). Flink is one of the 5 largest big data projects of the Apache Software Foundation (ASF), with a community of more than 200 developers across the globe and several production installations in Fortune Global 500 companies. In October 2015, the Flink project held its first annual conference in Berlin, called Flink Forward [2] [3].

3. DESIGN
The core computational fabric of Flink (labeled as "Flink runtime" in Figure 1) is a distributed system that accepts streaming dataflow programs and executes them in a fault-tolerant manner on one or more machines. This runtime can run in a cluster as an application of YARN (Yet Another Resource Negotiator) or within a single machine, which is very useful for debugging Flink applications.


The programs accepted by the runtime are very powerful, but are verbose and difficult to program directly. For that reason, Flink offers developer-friendly APIs that layer on top of the runtime and generate these streaming dataflow programs. Apache Flink includes two core APIs: a DataStream API for bounded or unbounded streams of data and a DataSet API for bounded data sets. Flink also offers a Table API, which is a SQL-like expression language for relational stream and batch processing that can be easily embedded in Flink's DataStream and DataSet APIs. The highest-level language supported by Flink is SQL, which is semantically similar to the Table API and represents programs as SQL query expressions [4].

Fig. 1. Key concept of Flink Stack [4]

3.1. Data Stream API
DataStream programs in Flink are regular programs that implement transformations on data streams (e.g., filtering, updating state, defining windows, aggregating). The data streams are initially created from various sources (e.g., message queues, socket streams, files). Results are returned via sinks, which may for example write the data to files, or to standard output (for example the command line terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs. The execution can happen in a local JVM, or on clusters of many machines. The DataStream API includes more than 20 different types of transformations and is available in the Java and Scala languages. A simple example of a stateful stream processing program is an application that emits a word count from a continuous input stream and groups the data in 5-second windows.

Fig. 2. Scala Example Program: Counts the words coming from a web socket in 5 second windows [4]

3.2. DataSet API
DataSet programs in Flink are regular programs that implement transformations on data sets (e.g., filtering, mapping, joining, grouping). The data sets are initially created from certain sources (e.g., by reading files, or from local collections). Results are returned via sinks, which may for example write the data to (distributed) files, or to standard output (for example the command line terminal). Flink programs run in a variety of contexts, standalone, or embedded in other programs. The execution can happen in a local JVM, or on clusters of many machines.

Fig. 3. Scala Example Program: WordCount [4]

3.3. Table API
The Table API is a declarative DSL centered around tables, which may be dynamically changing tables (when representing streams). The Table API follows the (extended) relational model: tables have a schema attached (similar to tables in relational databases) and the API offers comparable operations, such as select, project, join, group-by, aggregate, etc. Table API programs declaratively define what logical operation should be done rather than specifying exactly how the code for the operation looks. Though the Table API is extensible by various types of user-defined functions, it is less expressive than the core APIs, but more concise to use (less code to write). In addition, Table API programs also go through an optimizer that applies optimization rules before execution [5]. One can seamlessly convert between tables and DataStream/DataSet, allowing programs to mix the Table API with the DataStream and DataSet APIs. The highest level abstraction offered by Flink is SQL. This abstraction is similar to the Table API both in semantics and expressiveness, but represents programs as SQL query expressions. The SQL abstraction closely interacts with the Table API, and SQL queries can be executed over tables defined in the Table API.
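The Scala programs referenced by the captions of Fig. 2 and Fig. 3 are not reproduced in this text. The two sketches below are comparable examples written against Flink's 1.x Scala APIs; the socket host/port and the input file path are illustrative placeholders rather than values taken from the original figures.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

// Streaming word count over 5-second windows, in the spirit of Fig. 2.
object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Source: a text socket (host and port are illustrative).
    val text = env.socketTextStream("localhost", 9999)

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty) // transformations
      .map(word => (word, 1))
      .keyBy(0)                          // group by the word
      .timeWindow(Time.seconds(5))       // 5-second tumbling windows
      .sum(1)                            // count per word and window

    counts.print()                       // sink: standard output
    env.execute("Window WordCount")
  }
}
```

A batch counterpart using the DataSet API looks almost identical, which illustrates the unified programming model described above.

```scala
import org.apache.flink.api.scala._

// Batch word count with the DataSet API, in the spirit of Fig. 3.
object BatchWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Source: a text file (path is an illustrative placeholder).
    val text = env.readTextFile("input/words.txt")

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty)
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)

    counts.print()                       // sink: standard output (triggers execution)
  }
}
```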

4. IMPLEMENTATION OF APACHE FLINK

4.1. Alibaba
[3] This huge e-commerce group works with buyers and suppliers via its web portal. The company's online recommendations are produced by Flink. One of the attractions of working with true streaming engines such as Flink is that purchases being made during the day can be taken into account when recommending products to users. This is particularly important on special days (holidays) when activity is unusually high. This is an example of a use case where efficient stream processing is a big advantage over batch processing.


4.2. Otto Group
[3] The Otto Group is the world's second-largest online retailer in fashion and lifestyle, based in Germany. Otto had resorted to developing its own streaming engine because, when it first evaluated the open source options, it could not find one that fit its requirements. After testing Flink, Otto found it fit their needs for stream processing, which include crowd-sourced user-agent identification and a search session identifier.

5. CONCLUSION
Flink is not the only technology available to work with streaming and batch processing. There are a number of emerging technologies being developed and improved to address these needs. People choose to work with a particular technology for a variety of reasons, but the strengths of Flink, the ease of working with it, and the wide range of ways it can be used to advantage make it an attractive option.

6. ACKNOWLEDGEMENTS This work was done as part of the course "I524: Big Data and Open Source Software Projects" at Indiana University during Spring 2017

REFERENCES [1] A. Katsifodimos and S. Schelter, “Apache flink: Stream analytics at scale,” in IEEE International Conference on Cloud Engineering Workshop (IC2EW), 2016, pp. 193–193. [Online]. Available: http: //ieeexplore.ieee.org/document/7527842 [2] wikipedia, “Apache flink,” Web Page, Mar. 2017, accessed: 2017-03-20. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Flink [3] E. Friedman and K. Tzoumas, Introduction to Apache Flink: Stresam Processing for Real Time and Beyond. Sebastopol California: OReilly Media Inc, 2016. [4] Apache Software Foundation, “Introducing flink streaming,” Web Page, Mar. 2017, accessed: 2017-03-22. [Online]. Available: https://flink.apache.org/news/2015/02/09/streaming-example.html [5] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, “Apache flink: Stream and batch processing in a single engine,” IEEE Data Eng. Bull., vol. 38, pp. 28–38, 2015. [Online]. Avail- able: https://www.researchgate.net/publication/308993790_Apache_ Flink_Stream_and_Batch_Processing_in_a_Single_Engine


Jelastic

AJIT BALAGA, S17-IR-20041

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected], [email protected] project-000, April 30, 2017

Jelastic (acronym for Java Elastic) is an unlimited PaaS and Container based IaaS within a single platform that provides high availability of applications, automatic vertical and horizontal scaling via containeriza- tion to software development clients, enterprise businesses, DevOps, System Admins, Developers, OEMs and web hosting providers.[1] © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524

https://github.com/argetlam115/sp17-i524/blob/master/paper2/S17-IR-2004/report.pdf

1. INTRODUCTION
Enterprises are moving away from traditional virtualization solutions and transitioning into the cloud. As software development in the enterprise becomes more agile, there is an equivalent demand on IT to provide an infrastructure that is responsive, scalable, highly available and secure. Enterprise IT departments are responding with private cloud and hybrid cloud solutions that provide IT-as-a-Service, a utility approach that delivers an agile infrastructure to the user community, with ultimate control for administrators but self-management for developers and users where appropriate. The promise of the cloud was simplicity, scalability, availability and reduced operating cost. However, enterprises are quickly finding that current large-scale cloud implementations are often complicated and expensive, often requiring the help of third-party integrators. Jelastic is a cloud service that addresses these problems, delivers on these promises, and allows enterprises to speed up the development process. This paper introduces the architecture behind Jelastic [2].

2. JELASTIC
Jelastic is a cloud platform solution that combines the benefits of both the Platform as a Service (PaaS) and Container as a Service cloud models, using an approach that unleashes the full potential of a cloud for enterprises, ISVs, hosting service providers and developers. Jelastic software encompasses PaaS functionality, a complete infrastructure, smart orchestration, and containers support, all together [3].

Some of the Jelastic key benefits are [3]:
• a turnkey platform for public, private, hybrid and multi-cloud deployments with automated continuous integration, delivery and upgrade processes
• support of numerous software stacks, extended with a cartridges packaging model and custom containers [1]
• automated replication and true automated scaling, both vertical and horizontal; all applications scale up and down on demand
• various development environments for the most comfortable work experience: intuitive UI, open API and SSH access to containers
• intelligent workload distribution with multi-cloud and multi-region management [4]
• smart pricing integration: alongside support for multiple billing systems, Jelastic provides a comprehensive billing engine, quotas and access control policies
• embedded troubleshooting tools for metering, monitoring, logging, etc.

Jelastic solutions bring benefits for all kinds of clients [3]:
• enterprises
• hosting providers
• developers

3. ARCHITECTURE
The following is a consistent outline of the underlying Jelastic components, with pointers to the corresponding documentation [5].


3.1. Cloudlet
Cloudlet is a special infrastructure component that equals 128 MiB of RAM and 400 MHz of CPU power simultaneously. Such high granularity of resources allows the system to allocate exactly the required capacity for each instance in the environment; for example, an instance scaled to eight cloudlets is allotted 1 GiB of RAM and 3.2 GHz of CPU. There are two types of cloudlets:
• Reserved Cloudlets are a fixed amount of resources reserved in advance. Reserved cloudlets are used when the application load is permanent.
• Dynamic Cloudlets are added and removed automatically according to the amount of resources required. Dynamic cloudlets are used for applications with variable load or when the load cannot be predicted in advance. [3]

3.2. Container
Container (node) is an isolated virtualized instance, provisioned for software stack handling and placed on a particular hardware node. Each container can be automatically scaled, both vertically and horizontally, making hosting of applications truly flexible. The platform provides certified containers for many commonly used languages and the ability to deploy custom Docker containers. Each container has its own private IP and unique DNS record. [6]

3.3. Layer
Layer (node group) is a set of similar containers in a single environment. There is a set of predefined layers within the Jelastic topology wizard for certified containers, such as [7]:
• load balancer (LB)
• compute (CP)
• database (DB)
• data storage (DS)
• cache
• VPS
• build node
• extra (custom layer) [3]
The layers are designed to perform different actions with the same type of containers at once. The nodes can be simultaneously restarted or redeployed, horizontally scaled manually or automatically based on load triggers, checked for errors in the common logs and stats, and given the required configurations via the file manager for all containers in a layer. The containers of one layer are distributed across different hardware servers. [8]

3.4. Environment
Environment is a collection of isolated containers for running particular application services. Jelastic provides built-in tools for convenient environment management. There are a number of actions that can be performed for the whole environment, such as stop, start, clone, migrate to another region, share with team members for collaborative work, track resource consumption, etc. Each environment has its own internal 3rd-level domain name by default. A custom external domain can be easily bound or even further swapped with another environment for traffic redirection. [1]

3.5. Application
Application is a combination of environments for running one project. A simple application with one or two stacks can be run inside a single environment. Applications with more complex topology usually require more flexibility during deploy or update processes; they may be distributed across different types of servers and several environments, to be maintained independently. Application source code can be deployed from [4]:
• a GIT/SVN repository
• a local archive
• a custom Docker template [3]

3.6. Hardware Node
Hardware node is a physical server or a big virtual machine that is virtualized via KVM, ESXi, Hyper-V, etc. Hardware nodes are sliced into small isolated containers that are used to build environments. Such partitioning provides industry-leading multitenancy, as well as high density and smart resource utilization, with the help of container distribution according to the load across hardware nodes. [5][7]

3.7. Environment Region
Environment region is a set of hardware nodes orchestrated within a single isolated network. Each environment region has its own capacity in a specific data centre, a predefined pool of private and public IP addresses, and corresponding resource pricing [4]. Moreover, the initially chosen location can be effortlessly changed by migrating the project between available regions. [6]

3.8. Jelastic Platform
Jelastic Platform is a group of environment regions and a cluster orchestrator that controls them so that they act like a single system. This provides versatile possibilities to develop, deploy, test, run, debug and maintain applications due to the multiple options while selecting hardware - different capacity, pricing, location, etc. The platform provides a multi-data center or even multi-cloud solution for running your applications within a single panel, where each Platform is maintained by a separate hosting service provider with its local support team. [3]

4. CONCLUSION
For enterprises, moving from traditional virtualization to the cloud using PaaS and IaaS can be a daunting proposition. However, the cloud market is growing rapidly and enterprises are recognizing that PaaS allows them to develop and deploy scalable, highly available cloud-based applications in a rapid and agile fashion. Enterprises can capitalize on this new and sticky revenue stream by quickly implementing PaaS and establishing a brand-defining presence in the market. Jelastic provides the only integrated private cloud solution that integrates PaaS/IaaS and is specifically built for enterprises. [2]

REFERENCES
[1] "Jelastic," Web page, Dec. 2016, page Version ID: 754931676. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Jelastic&oldid=754931676


[2] “5 Key Reasons Why Jelastic PaaS and IaaS for Private Cloud is Better,” Web page, Jul. 2014. [Online]. Available: http://blog.jelastic.com/2014/07/10/ why-you-should-use-jelastic-platform-as-infrastructure-for-private-cloud/ [3] “Jelastic Multi-Cloud PaaS and CaaS for Business,” Web page. [Online]. Available: https://jelastic.com/ [4] S. Yangui and S. Tata, “Cloudserv: Paas resources provisioning for service-based applications,” in Advanced Information Networking and Applications (AINA), 2013 IEEE 27th International Conference on. IEEE, 2013, pp. 522–529. [Online]. Available: http://ieeexplore.ieee.org/ abstract/document/6531799/ [5] “Jelastic Cloud Union Catalog: Choose Your Service Provider,” Web page. [Online]. Available: https://jelastic.cloud/ [6] “Auto scaling Jelastic PaaS in UK, USA Southwest, and Singapore,” Web page. [Online]. Available: https://www.layershift.com/jelastic [7] J. Chavarriaga, C. A. Noguera, R. Casallas, and V. Jonckers, “Architectural tactics support in cloud computing providers: the jelastic case,” in Proceedings of the 10th international ACM Sigsoft conference on Quality of software architectures. ACM, 2014, pp. 13–22. [Online]. Available: http://dl.acm.org/citation.cfm?id=2602580 [8] “Jelastic PaaS Cloud Application Hosting,” Web page. [Online]. Available: https://www.lunacloud.com/cloud-jelastic


An Overview of Apache Spark

SNEHAL CHEMBURKAR1 AND RAHUL RAGHATATE1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected], [email protected] paper-1, April 30, 2017

Apache Spark, developed at UC Berkeley AMPLAB, is a high performance framework for analyzing large datasets [1]. The main idea behind the development of Spark was to create a generalized framework that could process diverse and distributed data as opposed to MapReduce which only support batch process- ing of data. Spark has multiple libraries built on top of its core computational engine which help process diverse data. This paper will discuss the Spark runtime architecture, its core and libraries. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Spark, RDDs, DAG, Spark SQL, MLlib, GraphX, I524

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IR-2006/report.pdf

1. INTRODUCTION
Spark is an open source, easy-to-use distributed cluster computing engine for processing the different types of data available these days. It was built with a view to overcome the shortcomings of MapReduce. Spark provides a generalized framework which can efficiently process MapReduce jobs (batch processing) as well as iterative algorithms, interactive data mining and streaming analytics. Iterative algorithms include many machine learning algorithms, which iterate over the same dataset to optimize a parameter. Interactive data mining refers to executing ad-hoc queries to explore the dataset.

MapReduce can process iterative algorithms by splitting each iteration into a separate MapReduce job. Each job must then read data from a stable storage and write it back to a stable storage at each intermediate step. This repeated access to stable storage systems like physical disks or HDFS increases the processing time while reducing the efficiency of the system. For interactive data mining in Hadoop, the data is loaded in memory across a cluster and queried repeatedly. Each query is executed as a separate MapReduce job which reads data from the HDFS or hard drives, thus incurring significant latency (tens of seconds) [2]. Spark is specialized to make data analysis faster in terms of data write speed as well as program execution. It supports in-memory computations which enable faster data querying compared to disk-based systems like Hadoop.

Spark is implemented in Scala, a high level programming language that runs on the JVM. It makes programming easier by providing a clean and concise API for Scala, Java and Python [2]. Spark also provides libraries that allow for iterative, interactive, streaming and graph processing. The Spark SQL library provides for interactive data mining in Spark. MLlib provides Spark with the machine learning algorithms required for iterative computations. Similarly, the Spark Streaming and GraphX libraries enable Spark to process real-time and graph data respectively. These high level components required for processing the diverse workloads, such as structured or streaming data, are powered by the Spark Core. The distribution, scheduling and monitoring of these clusters is done by the Spark Core.

The Spark Core and its higher level libraries are tightly integrated, meaning that updates or improvements implemented in the Spark Core help improve the Spark libraries as well. This makes it easier to write applications combining different workloads. This is explained nicely in the following example: one can build an application using machine learning libraries to process real time data from streaming sources, and analysts can simultaneously access the data using SQL in real time. In this example three different workloads, namely SQL, streaming data and machine learning algorithms, can be implemented in a single system, which is a requirement in today's age of big data.

2. SPARK COMPONENTS
Figure 1 depicts the various building blocks of the Spark stack. Spark Core is the foundation framework that provides basic I/O functionality, distributed task scheduling and dispatching [1]. It is the core computational engine of the system. Resilient Distributed Datasets (RDD) and Directed Acyclic Graphs (DAG) are two important concepts in Spark. RDDs are a collection of read-only Java or Python objects parallelized across a cluster. DAGs, as the name suggests, are directed graphs with no cycles. The libraries or packages supporting the diverse workloads, built on top of the Spark Core, include Spark SQL, Spark Streaming, MLlib (machine learning library) and GraphX. The Spark Core runs atop cluster managers, which are covered in section 4.


Fig. 1. Basic Components of Spark [3].

2.1. Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets (RDD) [4] are Spark's primary abstraction: a fault-tolerant collection of elements that can be operated on in parallel. RDDs are immutable once they are created, but they can be transformed or actions can be performed on them [1]. Users can create RDDs from external sources or by transforming another RDD. Transformations and Actions are the two types of operations supported by RDDs.

• Transformations: Since RDDs are immutable, transformations return a new RDD and not a single value. RDDs are lazily evaluated, i.e., they are not computed immediately when a transformation command is given. They wait till an action is received to execute the commands. This is called lazy evaluation. Examples of transformation functions are map, filter, reduceByKey, flatMap and groupByKey [1].

• Actions are operations that result in a return value after computation or trigger a task in response to some operation. Some action operations are first, take, reduce, collect, count, foreach and countByKey [1].

RDDs are ephemeral, which means they do not persist data to disk. However, users can explicitly persist RDDs to ease data reuse. Traditional systems provide fault tolerance through checkpointing or data replication. RDDs provide fault tolerance through lineage: the transformations used to build a data set are logged and can be used to rebuild the original data set through its lineage [4]. If one of the RDDs fails, it has enough information about its lineage to recreate the dataset from other RDDs, thus saving cost and time.

2.2. Directed Acyclic Graphs
Directed Acyclic Graph (DAG), which supports acyclic data flow, "consists of finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex v and follow a consistently-directed sequence of edges that eventually loops back to v again [5]." When we run any application in Spark, the driver program converts the transformations and actions to logical directed acyclic graphs (DAG). The DAGs are then converted to physical execution plans with a set of stages which are distributed and bundled into tasks. These tasks are distributed among the different worker nodes for execution.

2.3. Spark SQL
Spark SQL [3] is a library built on top of the Spark Core to support querying structured data using SQL or Hive Query Language. It allows users to perform ETL (Extract, Transform and Load) operations on data from various sources such as JSON, Hive tables and Parquet. Spark SQL provides developers with a seamless intermix of relational and procedural APIs, rather than having to choose between the two. It provides a DataFrame API that enables relational operations on both built-in collections as well as external data sources. Spark SQL also provides a novel optimizer called Catalyst to support the different data sources and algorithms found in big data [6].

2.4. Spark Streaming
The Spark Streaming [3] library enables Spark to process real time data. Examples of streaming data are messages being published to a queue for real time flight status updates or the log files for a production server. Spark's API for manipulating data streams is very similar to the Spark Core's RDD API. This similarity makes it easier for users to move between projects with stored and real-time data, as the learning curve is short. Spark Streaming is designed to provide the same level of fault tolerance, throughput and scalability as the Spark Core.

2.5. MLlib
MLlib [3] is a rich library of machine learning algorithms which can be accessed from Java, Scala as well as Python. It provides Spark with various machine learning algorithms such as classification, regression, clustering, and collaborative filtering. It also provides machine learning functionality such as model evaluation and data import. The common machine learning algorithms include K-means, naive Bayes, principal component analysis and so on. It also provides basic utilities for feature extraction, optimization and statistical analysis, to name a few [7].

2.6. GraphX
GraphX is a graph processing framework built on top of Spark. ETL, exploratory data analysis and iterative graph computations are unified within a single system using GraphX [8]. It introduces the Resilient Distributed Property Graph, which is a directed multi-graph having properties attached to each edge and vertex [9]. GraphX includes a set of operators like aggregateMessages, subgraph and joinVertices, and an optimized variant of the Pregel API [8]. It also includes builders and graph algorithms to simplify graph analytics tasks [1].
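The two sketches below are illustrative only and are not taken from the paper's figures; the input paths are placeholders. The first shows the transformation/action distinction of Section 2.1 with the classic word count: the transformations only build the lineage, and nothing executes until the action at the end.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal RDD word count: transformations build the lineage lazily,
// and execution is triggered only by an action.
object RddWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("data/input.txt")        // placeholder path
    val counts = lines
      .flatMap(_.split("\\s+"))                      // transformation
      .map(word => (word, 1))                        // transformation
      .reduceByKey(_ + _)                            // transformation

    counts.take(10).foreach(println)                 // action: triggers execution
    sc.stop()
  }
}
```

The second is a hedged sketch of the relational/procedural intermix described in Section 2.3, using the SparkSession entry point of Spark 2.x; the JSON input file is assumed to exist.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SparkSqlExample").master("local[*]").getOrCreate()

    // DataFrame API over an external data source (placeholder path).
    val people = spark.read.json("data/people.json")
    people.createOrReplaceTempView("people")

    // The same data queried declaratively with SQL.
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```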


3. RUNTIME ARCHITECTURE
The runtime architecture of Spark is illustrated in Figure 2. It consists of a driver program, a cluster manager, workers and the HDFS (Hadoop Distributed File System) [1]. Spark uses a master/slave architecture in which the driver program is the master whereas the worker nodes are the slaves. The driver runs the main() method of the user program, which creates the SparkContext and the RDDs and performs transformations and actions [3].

When we launch an application using the Spark shell, it creates a driver program which in turn initializes the SparkContext. Each Spark application has its own SparkContext object, which is responsible for the entire execution of the job. The SparkContext object then connects to the cluster manager to request resources for its workers. The cluster manager provides executors to worker nodes, which are used to run the logic and also store the application data. The driver will send the tasks to the executors based on the data placement. The executors register themselves with the driver, which helps the driver keep tabs on the executors. The driver can also schedule future tasks by caching or persisting the data.

Fig. 2. Runtime Architecture of Spark [10].

4. DEPLOYMENT MODES
Spark can be deployed in local and clustering modes. The local mode of Spark runs on a single node. In clustering mode, Spark can connect to any of the following cluster managers - the standalone cluster manager, YARN or Mesos - explained in the following sections.

4.1. Standalone
The standalone cluster manager is the built-in cluster manager provided by Spark in its default distribution. The Standalone Master acts as the resource manager and allocates resources to the Spark application based on the number of cores. Spark standalone mode is not very popular in production environments due to reliability issues [11]. To run your Spark application in a standalone clustered environment, make sure that Spark is installed on all nodes in the cluster. Once Spark is available on all nodes, follow the steps given in the Spark documentation [12] to start the master and workers. The master server has a web UI which is located at http://localhost:8080/ by default. This UI will give information regarding the number of CPUs and memory allocated to each node [12].

4.2. YARN
YARN (Yet Another Resource Negotiator) is the resource manager in the Hadoop ecosystem. Like the Standalone cluster master, the YARN ResourceManager decides which applications get to run executor processes, where they run them and when they run them. The YARN NodeManager is the slave service that runs on every node and runs the executor processes. This service also helps monitor the resource consumption at each node. YARN is the only cluster manager for Spark that provides security support [11]. To run Spark on YARN, a Spark distribution with YARN support must be downloaded from the Apache Spark download page.

4.3. Apache Mesos
Apache Mesos is an open source distributed systems kernel, using principles similar to the Linux kernel but at a different level of abstraction [13]. This kernel provides Spark with APIs for resource management and scheduling across the cluster and runs on each node in the cluster. While scheduling tasks, Mesos considers the other frameworks that may coexist on the same cluster. The advantages of deploying Spark with Apache Mesos include dynamic partitioning between Spark and other frameworks and scalable partitioning between multiple instances of Spark. Installation of Mesos for Spark is similar to its installation for use by any other framework. You can either download a Mesos release from its source or use the binaries provided by third party projects like Mesosphere [11].

5. EASY INSTALLATION USING PRE-BUILT PACKAGES
To run Spark on your Windows, Linux or Mac system, you need to have Java 7+ installed on your system. To verify if you have Java installed on your Linux machine, type the java -version command in the terminal. If you do not have Java installed on your system, you can download it from the Oracle website [14]. The environment variable PATH or JAVA_HOME must be set to point to the Java installation. Now that the Java prerequisite is satisfied, go to the download page of Apache Spark, select the 2.1.0 (latest version as of 26th March 2017) version of Spark, select the pre-built Hadoop 2.7 package and download the spark-2.1.0-bin-hadoop2.7.tgz file. Then, go to the terminal, change to the directory where the file is located and execute the following command to unzip the file
tar -xvf spark-2.1.0-bin-hadoop2.7.tgz -C ~/
This will create a folder spark-2.1.0-bin-hadoop2.7 in that directory. Move this folder to /usr/local/spark using the command mv spark-2.1.0-bin-hadoop2.7 /usr/local/spark. To set the environment variable for Spark, open the .bashrc file using the command sudo nano ~/.bashrc and add the following lines at the end of this file.
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
Go back to your home directory and execute .bashrc using the command . .bashrc for the changes to take effect. This change can be verified by executing the command echo $PATH. The PATH variable should now reflect the path to the Spark installation. To verify that Spark is installed correctly, execute the command spark-shell in the terminal. It will display the Spark version and then enter the Scala prompt.

6. BUILDING SPARK BASED APPLICATIONS
The first step to start building Spark based applications is exploring the data in Spark local mode and developing a prototype [15]. Spark local mode runs on a single node and can be used by developers to learn Spark by building a sample application that leverages the functionality of the Spark API. The developer can use Spark shells like the Scala REPL or PySpark to develop a quick prototype. It can then be packaged as a Spark application using Maven or the Scala Build Tool (SBT) [15]. The second step involves deploying the Spark application to production. To achieve this, the developer will fine tune the prototype by running it against a larger dataset. This involves running Spark in cluster mode on YARN or Mesos. Thus the prototype application created in the local mode of Spark will now be submitted as a Spark job to the production cluster [15].

7. PERFORMANCE MONITORING TOOLS
Spark provides a web interface to monitor its applications. By default, each SparkContext launches a web UI at port 4040 [16]. This UI displays the memory usage statistics, the list of scheduler stages and tasks, environmental information and information

73 Review Article Spring 2017 - I524 4 about the executors. This interface can be access by opening The authentic transactions are stored to the Hadoop file system http://:4040 in the web browser [16]. If multi- where they are used to continuously update the model using ple instances of SparkContext are running on the same machine, deep machine learning techniques [18]. then they will bind to successive ports beginning with 4040 Network Security: Spark can be used to examine network data (4041, 4042, 4043, ...)[16]. This information is only available for packets for traces of malicious activity. Spark streaming checks the life of the application. To view this information after the life the data packets against known threats and then forwards the of the application, set spark.eventLog.enabled to true before unmatched data packets to the storage devices where it is further starting the application. This will configure Spark to store the analyzed using the GraphX and MLlib libraries [18]. event log to persistent storage [16]. Genomic Sequencing: Genomic companies are leveraging the A REST API enables the metrics to be extracted in JSON power of Spark to align chemical compounds with genes. Spark format, making it easier for developers to create visualiza- has reduced the genome data processing time from a few weeks tions and monitoring tools for Spark [16]. These metrics can to a couple of hours [18]. also be extracted as HTTP, JMX, and CSV files by config- These are few of the real-world use cases of Spark. Real-world uring the metric system in the configuration file present at applications of Spark that incorporate MongoDB are Content $SPARK_HOME/conf/metrics.properties. In addition to these, Recommendations, Predictive Modeling, Targeted Ads and Cus- external tools like Ganglia, dstat, iostat, iotop, jstack, jmap, jmap, tomer Service [19]. and jconsole can also be monitor Spark performance [16]. 9. WHEN NOT TO USE SPARK 8. USE CASES Apache Spark is not the most suitable data analysis engine when In its early days, Spark was adopted in production systems by it comes to processing (1) data streams where latency is the most companies like Yahoo, Conviva, and ClearStory for personliza- crucial aspect and (2) when the available memory for processing tion, analytics, streaming and interactive processing. These use is restricted. In cases where latency is the most crucial aspect cases are explained in further paragraphs. we can get better results using Apache Storm. Since Spark main- Yahoo News Personalization: This project implements machine tains it’s operations in memory, Hadoop MapReduce should be learning algorithms on Spark to improve news personalization preferred, when available memory is restricted [20]. for their visitors. Spark runs on Hadoop Yarn to use existing data and clusters. In order to achieve personalization, the system 10. EDUCATIONAL RESOURCES will learn about users’ interests from their clicks on the web page. The Apache Spark website has a detailed documentation on the It also needs to learn about each news and categorize it. The how to get started with Spark [21]. It explains the concepts and SparkML algorithm written for this project was 120 lines of shows examples to help us familiarize with Spark. Scala code as compared to the 15,000 lines of C++ code used previously [17]. 11. 
LICENSING Yahoo Advertisement Analytics: In this project Yahoo leverages Hive on Spark to query and visualize the existing BI analytic data Apache Spark is an open-source software licensed under the that was stored in Hadoop. Since Hive on Spark (Shark) uses Apache License 2.0 [22]. Under this license, it is free to down- the standard Hive server API, any tools that can be plugged into load and use this software for personal or commercial purposes. Hive, will automatically work with Shark. Thus visualization It forbids the use of marks owned by the Apache Software Foun- tools like Tableau that are compatible with Hive can be used dation in a way that might imply that you are the creator of the with Shark to interactively query and view their ad visit data Apache Software. It requires that you copy the license in any [17]. redistribution made by you which includes the Apache Software. Monitor Network Conditions in Real-time: Conviva is a video You need to provide acknowledgement for any distributions that streaming company with a huge video feed database. To en- include the Apache Software [22]. sure quality service, it requires pretty sophisticated technology to be applied behind the scenes to ensure high quality service. 12. CONCLUSION With the increase in internet speeds, people’s tolerance towards Apache Spark is an open source cluster computing framework, buffering or delays has plummeted. To keep up with the rising which has emerged as the next generation big data process- expectations of high quality and speed for streaming videos, ing engine surpassing Hadoop MapReduce. Spark facilitates Conviva implemented Spark Streaming to learn about the net- in-memory computations which help execute the diverse work- work conditions in real-time. This information is then feed to loads efficiently. It’s ability to join datasets across various diverse the video player running on the user’s laptop to optimize the data sources is one of it’s major attributes. As mentioned in the video speeds [17]. previous section, Apache Spark is suitable for almost any kind : ClearStory develops data analyt- Merge Diverse Data Sources of big data analysis except for the following scenarios: (1) where ics software with speciality in data harmonization. To latency is the most crucial aspect and (2) when the available data from internal and external sources for its business users, memory for processing is restricted. Spark finds place in al- they turned to Spark, which is one of the core components of most all types of big data analysis projects, as seen from the their interactive and real-time product [17]. wide range of use cases, due to it’s core features (RDDs and Credit Card Fraud Detection: Using Spark Streaming on in-memory computation) and different libraries. Hadoop, banks can detect fraudulent transactions in real-time. The incoming transactions are verified in real-time against a ACKNOWLEDGEMENTS known database of fraudulent transactions. Thus a match against the known database will alert the call center person- This paper is written as part of the I524: Big Data and Open nel to instantly verify the transaction with the credit card owner. Source Software Projects coursework at Indiana University. We

74 Review Article Spring 2017 - I524 5 would like to thank our Prof. Gregor von Laszewski, Prof. Gre- [18] P. Kumar, “Apache spark use cases,” Article, Sep. 2016, accessed: gory Fox and the AIs for their help and support 20-Mar-2017. [Online]. Available: https://www.linkedin.com/pulse/ apache-spark-use-cases-prateek-kumar [19] mongoDB, “Apache spark use cases,” Web Page, accessed: REFERENCES 20-Mar-2017. [Online]. Available: https://www.mongodb.com/scale/ [1] A. Bansod, “Efficient big data analysis with apache spark in hdfs,” In- apache-spark-use-cases ternational Journal of Engineering and Advanced Technology (IJEAT), [20] A. G. Shoro and T. R. Soomro, “Big data analysis: Apache spark vol. 4, no. 6, pp. 313–316, Aug. 2015. perspective,” Global Journal of Computer Science and Technology, [2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, vol. 15, no. 1, p. 9, 2015. [Online]. Available: https://globaljournals.org/ “Spark: Cluster computing with working sets,” in Proceedings of GJCST_Volume15/2-Big-Data-Analysis.pdf the 2nd USENIX Conference on Hot Topics in Cloud Computing, ser. [21] Apache Software Foundation, “Spark programming guide - spark 2.1.0 HotCloud’10. Berkeley, CA, USA: USENIX Association, 2010, documentation,” Web Page, accessed: 02-22-2017. [Online]. Available: pp. 10–10. [Online]. Available: https://www.usenix.org/legacy/event/ http://spark.apache.org/docs/latest/programming-guide.html hotcloud10/tech/full_papers/Zaharia.pdf [22] Apache Software Foundation, “Apache license and distribution [3] H. Karau, A. Konwinski, P. Wendell, and M. Zaharia, Learning Spark: faq,” Web Page, 2016, accessed: 24-Mar-2017. [Online]. Available: Lightning-Fast Big Data Analytics, 1st ed. O’Reilly Media, Inc., Feb. http://www.apache.org/foundation/license-faq.html 2015. [Online]. Available: https://www.safaribooksonline.com/library/ view/learning-spark/9781449359034/ [4] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2012, San Jose, CA, USA, April 25-27, 2012. Berkeley, CA, USA: USENIX Association, 2012, pp. 15–28. [Online]. Available: https://www.usenix.org/conference/nsdi12/ technical-sessions/presentation/zaharia [5] Wikipedia, “Directed acyclic graph - wikipedia,” Web Page, Dec. 2016, accessed: 02-26-2017. [Online]. Available: https://en.wikipedia.org/wiki/ Directed_acyclic_graph [6] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia, “Spark sql: Relational data processing in spark,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD ’15. New York, NY, USA: ACM, 2015, pp. 1383–1394. [Online]. Available: http://doi.acm.org/10.1145/2723372.2742797 [7] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar, “Mllib: Machine learning in apache spark,” Journal of Machine Learning Research, vol. 17, no. 1, pp. 1235–1241, Jan. 2016. [Online]. Available: http://www.jmlr.org/papers/volume17/15-237/15-237.pdf [8] Apache Spark Foundation, “Graphx | apache spark,” Web Page, accessed: 04-09-2017. [Online]. Available: http://spark.apache.org/ graphx/ [9] S. 
Gupta, Learning Real-time Processing with Spark Streaming. Brim- ingham, UK: Packt Publishing Ltd, Sep. 2015. [10] Apache Software Foundation, “Cluster mode overview - spark 2.1.0 documentation,” Web Page, accessed: 02-22-2017. [Online]. Available: http://spark.apache.org/docs/latest/cluster-overview.html [11] E. Chan, “Configuring and deploying apache spark,” Blog, Jul. 2015, accessed: 03-25-2017. [Online]. Available: https: //velvia.github.io/Configuring-Deploying-Spark/ [12] Apache Spark Foundation, “Spark standalone mode - spark 2.1.0 documentation,” Web Page, accessed: 04-09-2017. [Online]. Available: http://spark.apache.org/docs/latest/spark-standalone.html [13] C. Ltd., “Mesos slave | juju,” Web Page, accessed: 04-09-2017. [Online]. Available: https://jujucharms.com/u/dataart.telco/mesos-slave/ [14] Oracle, “Java se development kit 8 - downloads,” Web Page, accessed: 03-25-2017. [Online]. Available: http://www.oracle.com/technetwork/ java/javase/downloads/jdk8-downloads-2133151.html [15] Hortonworks, “What is apache spark,” Web Page, accessed: 03-25-2017. [Online]. Available: https://hortonworks.com/apache/spark/ #section_8 [16] Apache Software Foundation, “Monitoring and instrumentation - spark 2.1.0 documentation,” Web Page, accessed: 03-22-2017. [Online]. Available: http://spark.apache.org/docs/latest/monitoring.html [17] A. Woodie, “Apache spark: 3 real-world use cases,” Article, Mar. 2014, accessed: 20-Mar-2017. [Online]. Available: https://www.datanami.com/ 2014/03/06/apache_spark_3_real-world_use_cases/


An overview of Apache THRIFT and its architecture

KARTHIK ANBAZHAGAN1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

April 30, 2017

Thrift is a software framework developed at Facebook to accelerate the development and implementation of efficient and scalable cross-language services. Its primary goal is to enable efficient and reliable communication across programming languages by abstracting the portions of each language that tend to require the most customization into a common library that is implemented in each language. This paper summarizes how Thrift provides flexibility in use by choosing different layers of the architecture separately. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, Apache Thrift, cross-language, I524

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-IR-2008/report.pdf

1. INTRODUCTION
Apache Thrift is an Interface Definition Language [1] (IDL) used to define and create services between numerous languages as a Remote Procedure Call (RPC). Thrift's lightweight framework and its support for cross-language communication make it more robust and efficient compared to other RPC frameworks like SOA [2] (REST/SOAP). It allows you to create services that are usable by numerous languages through a simple and straightforward IDL. Thrift combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, and OCaml. In addition to interoperability, Thrift can be very efficient because of its serialization mechanism [3], which can save both space and time. In other words, Apache Thrift lets you create a service to send and receive data between two or more pieces of software that are written in completely different languages/platforms.

Thrift was originally developed at Facebook and is one of the core parts of their infrastructure. The choice of programming language at Facebook [4] was based on what language was best suited for the task at hand. This flexibility resulted in difficulties when these applications needed to call one another, and Facebook needed an approach that could meet their needs of interoperability, transport efficiency, and simplicity. Out of this need, they developed efficient protocols and a service infrastructure which became Thrift. Facebook decided to make Thrift open source and finally contributed it to the Apache Software Foundation (ASF) in April 2007 in order to increase usage and development. Thrift was later released under the Apache 2.0 license.

2. ARCHITECTURE
Figure 1 shows the architecture of a model for using the Thrift stack. It is essential to understand every component of the architecture to understand how Apache Thrift works. It includes a complete stack for creating clients and servers. The top portion of the stack is the user generated code from the Thrift client-server definition file. The next layer of the framework is the Thrift-generated client and processor code, which also comprises data structures. The next two important layers are the protocol and transport layers, which are part of the Thrift run-time libraries. This gives Thrift the freedom to define a service and change the protocol and transport without regenerating any code. Thrift includes a server infrastructure to tie the protocols and transports together; there are blocking, non-blocking, single- and multi-threaded servers. The 'Physical' portion of the stack varies from stack to stack based on the language. For example, for Java and Python network I/O, the built-in libraries are leveraged by the Thrift library, while the C++ implementation uses its own custom implementation. Thrift allows users to choose independently between protocol, transport and server. With Thrift having originally been developed in C++, Thrift has the greatest variation among these in the C++ implementation [5].

Fig. 1. Architecture of Apache Thrift [6]

2.1. Transport Layer
The transport layer provides simple freedom for read/write to/from the network. Each language must have a common interface to transport bidirectional raw data. The transport layer describes how the data is transmitted. This layer separates the underlying transport from the rest of the system, exposing only the following interface: open, close, isOpen, read, write, and flush.
Fig. 1. Architecture of Apache Thrift [6]

There are multiple transports supported by Thrift:

1. TSocket: The TSocket class is implemented across all target languages. It provides a common, simple interface to a TCP/IP stream socket and uses blocking socket I/O for transport.

2. TFramedTransport: The TFramedTransport class transmits data with frame size headers for chunking optimization or non-blocking operation.

3. TFileTransport: The TFileTransport is an abstraction of an on-disk file to a data stream. It can be used to write out a set of incoming Thrift requests to a file on disk.

4. TMemoryTransport: Uses memory for I/O operations. For example, the Java implementation uses a simple ByteArrayOutput stream.

5. TZlibTransport: Performs compression using zlib. It should be used in conjunction with another transport.

2.2. Protocol Layer

The second major abstraction in Thrift is the separation of data structure from transport representation. While transporting the data, Thrift enforces a certain messaging structure. That is, it does not matter what method the data encoding is in, as long as the data supports a fixed set of operations that allows it to be read and written by generated code. The Thrift Protocol interface is very straightforward; it supports two things: bidirectional sequenced messaging, and encoding of base types, containers, and structs.

Thrift supports both text and binary protocols. The binary protocols almost always outperform text protocols, but sometimes text protocols may prove to be useful in cases of debugging. The protocols available for the majority of the Thrift-supported languages are:

1. TBinaryProtocol: A straightforward binary format encoding that takes numeric values as binary, rather than converting to text.

2. TCompactProtocol: Very efficient and dense encoding of data. This protocol writes numeric tags for each piece of data. The recipient is expected to properly match these tags with the data.

3. TDenseProtocol: Similar to TCompactProtocol, but strips off the meta information from what is transmitted and adds it back at the receiver side.

4. TJSONProtocol: Uses JSON for data encoding.

5. TSimpleJSONProtocol: A write-only protocol using JSON. Suitable for parsing by scripting languages.

6. TDebugProtocol: Sends data in the form of human-readable text format. It can be well used in debugging applications involving Thrift.

2.3. Processor Layer

A processor encapsulates the ability to read data from input streams and write to output streams. The processor layer is the simplest layer. The input and output streams are represented by protocol objects. Service-specific processor implementations are generated by the Thrift compiler, and this generated code makes up the processor layer of the architecture stack. The processor essentially reads data from the wire (using the input protocol), delegates processing to the handler (implemented by the user), and writes the response over the wire (using the output protocol).

2.4. Server Layer

A server pulls together all the various functionalities to complete the Thrift server layer. First, it creates a transport, then specifies input/output protocols for the transport. It then creates a processor based on the I/O protocols and waits for incoming connections. When a connection is made, it hands them off to the processor to handle the processing. Thrift provides a number of servers:

1. TSimpleServer: A single-threaded server using standard blocking I/O sockets. Mainly used for testing purposes.

2. TThreadPoolServer: A multi-threaded server with N worker threads using standard blocking I/O. It generally creates a minimum of five threads in the pool if not specified otherwise.

3. TNonBlockingServer: A multi-threaded server using non-blocking I/O.

4. THttpServer: An HTTP server (for JS clients).

5. TForkingServer: Forks a process for each request to the server.

3. ADVANTAGES AND LIMITATIONS OF THRIFT

A few reasons why Thrift is robust and efficient compared to other RPC frameworks are that Thrift leverages cross-language serialization with lower overhead than alternatives such as SOAP due to its use of a binary format. Since Thrift generates the client and server code completely [7], it leaves the user with the only task of writing the handlers and invoking the client. Everything including the parameters and returns is automatically validated and analysed. Thrift is more compact than HTTP and can easily be extended to support things like encryption, compression, non-blocking IO, etc. Since Protocol Buffers [8] are implemented in a variety of languages, they make interoperability between multiple programming languages simpler.

While there are numerous advantages of Thrift over other RPC frameworks, there are a few limitations. Thrift [9] is limited to only one service per server. There can be no cyclic structs. Structs can only contain structs that have been declared before them. Also, a struct cannot contain itself. Important OOP concepts like inheritance and polymorphism are not supported, and neither can Null be returned by a server; instead a wrapper struct or value is expected. No out-of-the-box authentication service is available between server and client, and no bi-directional messaging is available in Thrift.
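To make the layering described in Section 2 concrete, the following is a minimal sketch of a Thrift client in Python using the thrift runtime library. The service and module names (Calculator, calculator) are illustrative assumptions standing in for code generated from a user-defined IDL file; they are not part of the original text.

# Assumes an IDL such as: service Calculator { i32 add(1: i32 a, 2: i32 b) }
# compiled with: thrift --gen py calculator.thrift (names are hypothetical)
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from calculator import Calculator  # hypothetical generated module

# Transport layer: a blocking TCP socket wrapped in a buffered transport
transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
# Protocol layer: binary encoding on top of the chosen transport
protocol = TBinaryProtocol.TBinaryProtocol(transport)
# Generated client code sits on top of the protocol layer
client = Calculator.Client(protocol)

transport.open()
result = client.add(2, 3)  # the remote procedure call itself
transport.close()

The server side is assembled the same way, by handing a generated processor, a server transport, and the chosen protocol/transport factories to one of the server classes listed above (for example TSimpleServer).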

4. CONCLUSION

Thrift provides flexibility in use by allowing the different layers of the architecture to be chosen separately. As mentioned in the advantages section, Thrift's use of cross-language serialization with lower overhead makes it more efficient compared to other similar technologies. Thrift avoids duplicated work by writing buffering and I/O logic in one place. Thrift has enabled Facebook to build scalable back-end services efficiently. It has been employed in a wide variety of applications at Facebook, including search, logging, mobile, ads, and the developer platform. Application developers can focus on application code without worrying about the sockets layer.

REFERENCES [1] Apache, “Thrift interface description language,” Web Page, 2016. [Online]. Available: https://thrift.apache.org/docs/idl [2] K. Sandoval, “Microservice showdown – rest vs soap vs apache thrift,” May 2015, accessed: 2015-05-19. [Online]. Available: http://nordicapis.com/ microservice-showdown-rest-vs-soap-vs-apache-thrift-and-why-it-matters/ [3] P.Kloe, “Benchmarks serializers,” Web Page, July 2016, accessed: 2016- 07-09. [Online]. Available: https://github.com/eishay/jvm-serializers/wiki [4] M. Slee, A. Agarwal, and M. Kwiatkowski, “Thrift: Scalable cross-language services implementation,” 2013. [Online]. Available: https://thrift.apache.org/static/files/thrift-20070401.pdf [5] A. Prunicki, “Apache thrift,” Web Page, June 2009, accessed: 2009-06-01. [Online]. Available: https://objectcomputing.com/resources/ publications/sett/june-2009-apache-thrift/ [6] C. Sun, “Understanding how thrift rpc works,” Web Page, March 2015, accessed: 2015-09-22. [Online]. Available: http://sunchao.github.io/ posts/2015-09-22-understanding-how-thrift-works.html [7] S. Dimopoulos, “Generating code with thrift,” Web Page, 2013, accessed: 2013-01-01. [Online]. Available: http://thrift-tutorial.readthedocs.io/en/ latest/usage-example.html [8] Google, “Protocol buffers - google’s data interchange format,” Web Page, 2015, accessed: 2017-03-01. [Online]. Available: https://developers.google.com/protocol-buffers/ [9] C. Maheshwari, “Apache thrift: A much needed tutorial,” Web Page, Au- gust 2013, accessed: 2013-08-01. [Online]. Available: digital-madness. in/blog/wp-content/uploads/2012/11/BSD_08_2013.8-18.pdf

78 Review Article Spring 2017 - I524 1

Hyper-V

ANURAG KUMAR JAIN1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

Paper-2, April 30, 2017

A hypervisor or virtual machine monitor (VMM) is computer software, firmware, or hardware that creates and runs virtual machines. Microsoft Hyper-V Server is the hypervisor-based server virtualization product that allows users to consolidate workloads onto a single physical server [1]. Hyper-V has the advantages of being scalable, secure and flexible. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: cloud, big data, hypervisors, hyper-v, virtualization

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/report.pdf

1. INTRODUCTION

Cloud computing is a booming field and with that, the need for virtualization is also growing. Microsoft stepped into the field of virtualization long back and is improving in the sector. It brought Microsoft Hyper-V, also codenamed Viridian and formerly Windows Server Virtualization, to compete with VMware vSphere [2][3]. It is a native hypervisor which can be used to create virtual machines on x86-64 systems running Windows.

With the release of Windows 8, Hyper-V overtook Windows Virtual-PC as the hardware virtualization component of the client editions of Windows NT. Hyper-V is also available on the Xbox One, in which it would launch both Xbox OS and Windows 10 [2]. Hyper-V supports Windows XP, Vista, Windows 7, Windows 8-8.1, Windows 10, Windows Server 2003-2016, CentOS 5.5-7.0, Red Hat Enterprise Linux 5.5-7.0, and Ubuntu 12.04-14.04, among others [3].

2. ARCHITECTURE

Hyper-V maintains isolation of virtual machines in terms of a partition [2]. A partition is a logical unit of isolation in which each guest OS executes. A Hyper-V instance needs to have at least one parent partition, running a supported version of Windows Server (2008 and later). The virtualization stack runs in the parent partition and has direct access to the hardware devices. The child partitions, which host the guest operating systems, are created on parent partitions. A child partition is created by the parent partition using the hypercall API, which is the application programming interface exposed by Hyper-V [4]. A child partition does not have direct access to the physical processor. A child partition doesn't even handle its real interrupts. It has a virtual view of the processor and runs in a guest virtual address. Depending on virtual machine configuration, Hyper-V may allow access to a subset of the processors to each partition. The hypervisor handles the interrupts to the processor and redirects them to the respective partition. Hyper-V can hardware accelerate the address translation of guest virtual address-spaces by using second level address translation provided by the CPU [1].

Direct access to hardware resources is not allowed to the child partitions, but they are allowed to have a virtual view of the resources, in terms of virtual devices [4]. Any request to the virtual devices is redirected to the devices in the parent partition, which then manages the requests [4]. The VMBus is a logical channel which enables inter-partition communication. The request and response are redirected via the VMBus [4]. If the devices in the parent partition are also virtual devices, the request will be redirected further until it reaches the parent partition, where it will gain access to the physical devices.

3. PREREQUISITES

To install Hyper-V we need to have an x64-based processor with a minimum of 1.4 GHz clock speed. We also need to enable hardware-assisted virtualization; this feature is available in processors that include an inbuilt virtualization option, specifically Intel Virtualization Technology or AMD Virtualization. It also requires hardware-enforced Data Execution Prevention (DEP). Specifically, the Intel XD bit (execute disable bit) or AMD NX bit (no execute bit) must be enabled [3]. The processor should also support second level address translation. A minimum of 2 GB memory with error correcting code or similar technology is required; realistically, much more memory is required as each virtual machine requires its own memory. The installation also requires a minimum of 32GB disk space. Apart from the above mentioned requirements, there are other requirements which are not mandatory but are required to enable certain features [2].


Fig. 1. Hyper-V architecture [2].

4. INSTALLATION

Installation of Hyper-V needs you to install Windows Server 2012 R2, for which you need a bootable device having the same, and to boot the device using it. Select the operating system you need to install, i.e. standard/datacenter. Accept the terms, select install windows, and select the drive you want to install the operating system on; we need a minimum of 32GB space for the installation. Set the username and password and then login. After logging on, change the host name under my computer system properties. Open the server manager and configure it according to the requirements, which include selecting the roles and features as required. Select the server name, click next, and wait for the installation to complete [3]. Once Hyper-V is installed you can create virtual networks and create virtual machines on which you can then install the operating system.

5. ADVANTAGES AND FEATURES

With Windows Server 2012, Hyper-V supports network virtualization, multi-tenancy, the vhdx disk format supporting virtual hard disks as large as 64TB, offloaded data transfer, cross-premises connectivity and Hyper-V replica. And with Windows Server 2012 R2 it also supports shared virtual hard disks, storage quality of service, and enhanced session mode [2]. Multiple physical servers can be easily consolidated into comparatively fewer servers by implementing virtualization with Hyper-V. Consolidation accommodates the full use of deployed hardware resources. Hyper-V also helps in ease of administration, as consolidation and centralization of resources simplifies administration, and scale-up or scale-out can be accommodated with much greater ease. With Hyper-V, and with virtualization in general, there are significant cost savings, as separate physical machines are not required for every host and multiple virtual machines can be easily set up on a single physical machine. Hyper-V can be easily managed and a comprehensive Hyper-V management solution is available with System Center. Additional processing power, network bandwidth, and storage capacity can be added quickly and easily by assigning additional resources from the host computer to the guest virtual machines [5].

6. LIMITATIONS

Hyper-V does not virtualize audio hardware. It does not support pass-through of the host/root operating system's optical drives to guest VMs; as a result, burning to disks is not supported. In Windows Server 2008, Hyper-V does not support live migration of guest VMs, where live migration means maintaining network connections and uninterrupted services during VM migration between physical hosts. Although Hyper-V doesn't provide live migration, it tries to eliminate the limitation through its quick migration feature. Also, when Hyper-V is installed it uses the VT-x x86 virtualization feature, making it unavailable for other solutions, due to which software which requires VT-x support can't be installed in parallel [2]. Major operating systems that are still unsupported include Fedora 8 and 9 [2].

7. MANAGEMENT

Hyper-V servers can be managed using Windows PowerShell either locally or remotely [6]. By running server manager on a remote computer, a server running in server core mode can be connected. It can also be connected to using the Microsoft Management Console (MMC) snap-in or, by using another computer running Windows, the user can use Remote Desktop Services to run scripts and tools on a server [6]. The server can be switched to graphical user interface mode to use the usual user interface tools to accomplish the tasks and then switched back to server core mode [6].

Hyper-V hosts can be managed using Hyper-V Manager, which lets you manage a small number of Hyper-V hosts, both remote and local. It gets installed with the installation of Hyper-V Management Tools, which can be installed through a full Hyper-V installation or a tools-only installation [6].

8. COMPARISON BETWEEN HYPER-V AND VSPHERE

Windows Server 2012 R2, VMware vSphere Hypervisor and VMware vSphere 5.5 Enterprise Plus all support 320 logical processors, 4TB of physical memory, and 1TB of memory per virtual machine. While Hyper-V and the vSphere Enterprise edition both support 64 virtual CPUs per virtual machine, vSphere Hypervisor only supports 8. In Hyper-V there can be 1,024 active VMs per host, while this is limited to 512 in vSphere. It is interesting to know that Hyper-V supports up to 64 nodes and up to 8,000 virtual machines in a cluster, while vSphere Enterprise Plus supports 32 and 4,000 respectively [1].

Both VMware vSphere Hypervisor and Hyper-V are free standalone hypervisors; however, the enterprise edition of vSphere is not. The above information shows that Hyper-V has a number of advantages from a scalability perspective, especially when it comes to comparison with the vSphere Hypervisor [1]. VMware vSphere 5.5 brought a number of scalability increases for vSphere environments, doubling the number of host logical processors supported from 160 to 320, and doubling the host physical memory from 2TB to 4TB, but this still only brings vSphere up to the level that Hyper-V has been offering since September 2012 [1]. Hyper-V also supports double the number of active virtual machines per host than both the vSphere Hypervisor and vSphere 5.5 Enterprise Plus.

9. CONCLUSION

Hyper-V is a powerful hypervisor introduced by Microsoft with features such as high availability, scalability, reliability, and flexibility.


It also supports resource monitoring that helps users track historical data on the use of virtual machines and gain insight into the resource use of specific servers. Hyper-V sees competition from many other hypervisors such as vSphere, Qemu, KVM and VirtualBox, and provides tough competition to them. Hyper-V requires hardware-assisted virtualization support from processors and it can only be used with x86-64 processors. In spite of its limitations it is a popular choice because of the features it provides. The customization options available in Hyper-V provide the user with a lot of options to manage the virtual machines as desired. Hyper-V and Hyper-V hosts can be easily managed using Windows PowerShell and the Hyper-V Manager tool, locally or remotely.

REFERENCES [1] Microsoft, “Why Hyper-V?” Web Page, Mar. 2016, accessed: 2017-03-22. [Online]. Available: https://download.microsoft. com/download/E/8/E/E8ECBD78-F07A-4A6F-9401-AA1760ED6985/ Competitive-Advantages-of-Windows-Server-Hyper-V-over-VMware-vSphere. pdf [2] Wikipedia, “Hyper-V,” Web Page, Mar. 2016, accessed: 2017-03-21. [Online]. Available: https://en.wikipedia.org/wiki/Hyper-V [3] Microsoft, “Hyper-V Configuration Guide,” Web Page, Mar. 2017, accessed: 2017-03-20. [Online]. Available: https://gallery.technet.microsoft.com/Hyper-v-Step-by-step-84632942/ file/122329/1/Hyper-vConfigGuide.pdf [4] Microsoft, “Hyper-V Architecture,” Web Page, Mar. 2016, accessed: 2017-03-21. [Online]. Available: https://msdn.microsoft.com/en-us/ library/cc768520(v=bts.10).aspx [5] Microsoft, “Hyper-V Feature Overview,” Web Page, Mar. 2017, accessed: 2017-03-24. [Online]. Available: https://msdn.microsoft.com/ en-us/library/cc768521(v=bts.10).aspx [6] Microsoft, “Remotely manage Hyper-V hosts with Hyper-V Man- ager,” Web Page, Mar. 2017, accessed: 2017-03-23. [Online]. Available: https://technet.microsoft.com/en-us/windows-server-docs/ compute/hyper-v/manage/remotely-manage-hyper-v-hosts

81 Review Article Spring 2017 - I524 1

Retainable Evaluator Execution Framework

PRATIK JAIN

School of Informatics and Computing, Bloomington, IN 47408, U.S.A.

Corresponding authors: [email protected]

Paper-2, April 30, 2017

Apache REEF is a Big Data system that makes it easy to implement scalable, fault-tolerant runtime environments for a range of data processing models on top of resource managers such as Apache YARN and Mesos. The key features and abstractions of REEF are discussed. Two libraries of independent value are introduced: Wake is an event-based programming framework and Tang is a dependency injection framework designed specifically for configuring distributed systems. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: REEF, Tang, Wake

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2012/report.pdf

1. INTRODUCTION

With the continuous growth of Hadoop [1], the range of computational primitives expected by its users has also broadened. A number of performance and workload studies have shown that Hadoop MapReduce is a poor fit for iterative computations, such as machine learning and graph processing. Also, for extremely small computations like ad-hoc queries that compose the vast majority of jobs on production clusters, Hadoop MapReduce is not a good fit. Hadoop 2 addresses this problem by factoring MapReduce into two components: an application master that schedules computations for a single job at a time, and YARN, a cluster resource manager that coordinates between multiple jobs and tenants. In spite of the fact that this resource manager, YARN, allows a wide range of computational frameworks to coexist in one cluster, many challenges remain [2].

From the perspective of the scheduler, a number of issues arise that must be appropriately handled in order to scale out to massive datasets. First, each map task should be scheduled close to where the input block resides, ideally on the same machine or rack. Second, failures can occur at the task level at any step, requiring backup tasks to be scheduled or the job to be aborted. Third, performance bottlenecks can cause an imbalance in the task-level progress. The scheduler must react to these stragglers by scheduling clones and incorporating the logical task that crosses the finish line first.

Apache REEF (Retainable Evaluator Execution Framework), a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos, addresses these challenges. It provides a reusable control-plane for scheduling and coordinating task-level work on cluster resource managers. The REEF design enables sophisticated optimizations, such as container re-use and data caching, and facilitates workflows that span multiple frameworks. Examples include pipelining data between different operators in a relational system, retaining state across iterations in iterative or recursive data flow, and passing the result of a MapReduce job to a Machine Learning computation.

2. FEATURES

Due to the following critical features, Apache REEF drastically simplifies development of resource managers [3].

2.1. Centralized Control Flow

Apache REEF turns the chaos of a distributed application into various events in a single machine. These events include container allocation, Task launch, completion, and failure.

2.2. Task runtime

Apache REEF provides a Task runtime which is instantiated in every container of a REEF application and can keep data in memory in between Tasks. This enables efficient pipelines on REEF.

2.3. Support for multiple resource managers

Apache REEF applications are portable to any supported resource manager with minimal effort. In addition to this, new resource managers are easy to support in REEF.

2.4. .NET and Java API

Apache REEF is the only API to write YARN or Mesos applications in .NET. Additionally, a single REEF application is free to mix and match tasks written for .NET or Java.


2.5. Plugins

Apache REEF allows for plugins to augment its feature set without hindering the core. REEF includes many plugins, such as name-based communication between Tasks, MPI-inspired group communications, and data ingress.

As a result of such features, Apache REEF shows properties like retainability of hardware resources across tasks and jobs, composability of operators written for multiple computational frameworks and storage backends, cost modeling for data movement and single machine parallelism, fault handling [4] and elasticity.

3. KEY ABSTRACTIONS

REEF is structured around the following key abstractions [5]:

Driver: This is user-supplied control logic that implements the resource allocation and Task scheduling logic. There is exactly one Driver for each Job. The duration and characteristics of the Job are determined by this module.

Task: This encapsulates the task-level client code to be executed in an Evaluator.

Evaluator: This is a runtime environment on a container that can retain state within Contexts and execute Tasks (one at a time). A single Evaluator may run many activities throughout its lifetime. This enables sharing among Activities and reduces scheduling costs.

Context: It is a state management environment within an Evaluator that is accessible to any Task hosted on that Evaluator.

Services: Objects and daemon threads that are retained across Tasks that run within an Evaluator [2]. Examples include caches of parsed data, intermediate state, and network connection pools.

4. WAKE AND TANG

The lower levels of REEF can be decoupled from the data models and semantics of systems built atop it. This results in two standalone systems, Tang and Wake, which are both language independent and allow REEF to bridge the JVM and .NET.

Tang is a configuration management and checking framework [6]. It emphasizes explicit documentation and the ability of configurations and applications to be automatically checkable, instead of ad-hoc, application-specific configuration and bootstrapping logic. It not only supports distributed, multi-language applications but also gracefully handles simpler use cases. It makes use of dependency injection to automatically instantiate applications. Given a request for some type of object, and information that explains how dependencies between objects should be resolved, dependency injectors automatically instantiate the requested object and all of the objects it depends upon. Tang makes use of a few simple wire formats to support remote and even cross-language dependency injection.

Wake is an event-driven framework based on ideas from SEDA, Click, and Rx [7]. It is general purpose in the sense that it is designed to support computationally intensive applications as well as high-performance networking, storage, and legacy I/O systems. Wake is implemented to support high-performance, scalable analytical processing systems, i.e. big data applications. It can be used to achieve high fanout and low latency as well as high-throughput processing, and it can thus aid in implementing both control plane logic and the data plane.

Wake is designed to work with Tang. This makes it extremely easy to wire up complicated graphs of event handling logic. In addition to making it easy to build up event-driven applications, Tang provides a range of static analysis tools and provides a simple aspect-style programming facility that supports Wake's latency and throughput profilers.

5. RELATIONSHIPS WITH OTHER APACHE PRODUCTS

Given REEF's position in the big data stack, there are three relationships to consider: projects that fit below, on top of, or alongside REEF in the stack.

5.1. Below REEF

REEF is designed to facilitate application development on top of resource managers like Mesos and YARN. Hence, its relationship with the resource managers is symbiotic by design.

5.2. On Top of REEF

Apache Spark, Giraph, MapReduce and Flink are some of the projects that logically belong at a higher layer of the big data stack than REEF. Each of these had to individually solve some of the issues REEF addresses.

5.3. Alongside REEF

Apache builds library layers on top of a resource management platform. Twill, Slider, and Tez are notable examples in the incubator [8]. These projects share many objectives with REEF. Twill simplifies programming by exposing a programming model. Apache Slider is a framework to make it easy to deploy and manage long-running static applications in a YARN cluster. Apache Tez is a project to develop a generic Directed Acyclic Graph (DAG) processing framework with a reusable set of data processing primitives. Apache Helix automates application-wide management operations which require global knowledge and coordination, such as repartitioning of resources and scheduling of maintenance tasks. Helix separates global coordination concerns from the functional tasks of the application with a state machine abstraction.

6. CONCLUSION

REEF is a flexible framework for developing distributed applications on resource manager services. It is a standard library of reusable system components that can be easily composed into application logic. It possesses properties of retainability, composability, cost modeling, fault handling and elasticity. Its key components are the driver, task, evaluator, context and services. Its relationship with projects above it, below it and alongside it in the big data stack is discussed. Thus, a brief overview of REEF is given here.

ACKNOWLEDGEMENTS

The author thanks Prof. Gregor von Laszewski and all the course AIs for their continuous technical support.

REFERENCES

[1] The Apache Software Foundation, "Welcome to apache™ hadoop®!" Web Page, Mar. 2017, accessed 2017-03-28. [Online]. Available: http://hadoop.apache.org/


[2] Byung-Gon Chun, Chris Douglas, Shravan Narayanamurthy, Josh Rosen, Tyson Condie, Sergiy Matusevych, Raghu Ramakrishnan, Russell Sears, Carlo Curino, Brandon Myers, Sriram Rao, Markus Weimer, “Reef: Retainable evaluator execution framework,” in VLDB Endowment, Vol. 6, No. 12, Aug. 2013. [Online]. Available: http://db.disi.unitn.eu/pages/VLDBProgram/pdf/demo/p841-sears.pdf [3] The Apache Software Foundation, “Apache reef™ - a stdlib for big data,” Web Page, Nov. 2016, accessed 2017-03-15. [Online]. Available: http://reef.apache.org/ [4] Techopedia Inc., “Retainable evaluator execution framework (reef),” Web Page, Jan. 2017, accessed 2017-03-17. [Online]. Available: https://www.techopedia.com/definition/29891/ retainable-evaluator-execution-framework-reef [5] Markus Weimer, Yingda Chen, Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Yunseong Lee, Tony Majestro, Dahlia Malkhi, Sergiy Matusevych, Brandon Myers, Shravan Narayanamurthy, Raghu Ramakrishnan, Sriram Rao, Russell Sears, Beysim Sezgin, Julia Wang, “Reef: Retainable evaluator execution framework,” in Proc ACM SIGMOD Int Conf Manag Data. Author manuscript, Jan. 2016, pp. 1343–1355. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4724804/ [6] The Apache Software Foundation, “Tang,” Web Page, Nov. 2016, accessed 2017-03-15. [Online]. Available: http://reef.apache.org/tang. html [7] The Apache Software Foundation, “Wake,” Web Page, Dec. 2016, accessed 2017-03-15. [Online]. Available: http://reef.apache.org/wake. html [8] The Apache Software Foundation, “Reefproposal - incubator,” Web Page, Aug. 2014, accessed 2017-03-15. [Online]. Available: https://wiki.apache.org/incubator/ReefProposal

84 Review Article Spring 2017 - I524 1

A brief introduction to OpenCV

SAHITI KORRAPATI1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

S17-IR-2013 techpaper-2, May 7, 2017

This paper provides a brief introduction to OpenCV. OpenCV is an open source computer vision and machine learning software library, which was originally introduced more than a decade ago by Intel. The library has more than 2500 optimized algorithms, which includes a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms [1]. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: OpenCV, Computer Vision, Machine Learning

https://github.com/sakorrap/sp17-i524/blob/master/paper2/S17-IR-2013/report.pdf

INTRODUCTION

Computer Vision is the science of programming a computer to process and understand images and videos so that machines can detect and recognize faces, identify objects, classify human actions in videos, track moving objects, etc. [2]. In order to advance vision research and disseminate vision knowledge, it is highly critical to have a library of programming functions with optimized and portable code [3]. OpenCV was built to provide a common infrastructure for computer vision applications and to accelerate the use of machine perception [2].

In the early days of OpenCV, the goals of the project were described as [4]:

1. Advance vision research by providing not only open but also optimized code for basic vision infrastructure. No more reinventing the wheel.

2. Disseminate vision knowledge by providing a common infrastructure that developers could build on, so that code would be more readily readable and transferable.

3. Advance vision-based commercial applications by making portable, performance-optimized code available for free with a license that did not require code to be open or free itself.

It is an open-source BSD-licensed library that now includes several hundred computer vision and machine learning algorithms. Being a BSD-licensed product, OpenCV makes it easy for businesses to utilize and modify the code. Since the official launch of OpenCV in 1999, a number of programmers have contributed to the most recent library developments. It has C++, C, Python, Java and MATLAB interfaces and supports Windows, Linux, Android and Mac OS. Though machine learning algorithms are added to OpenCV to support computer vision development, they can be used in other applications like speech recognition and anomaly detection. CUDA and OpenCL interfaces are being actively developed right now [2].

PLATFORMS FOR OPENCV

OpenCV was designed to be a cross-platform tool, due to which it is completely written in C. This makes it portable to any commercial system, right from PCs and Macs to robotic on-board computers, as the C compilers were stable during the initial days of development. Moreover, low-level optimization is possible through C. When it comes to Computer Vision, such optimizations may easily lead to significant speedups [5]. Since the OpenCV 2.0 version, there is a C and C++ interface also, and all new packages are written in C++. However, to encourage widespread use, wrappers for popular programming languages like Python and Java have been developed, and OpenCV can be used both on mobile and desktop. We look at the latest additions to OpenCV platforms [5]:

CUDA

CUDA is a platform and application programming interface (API) model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing [6].

In 2010 a new module that provides GPU acceleration was added to OpenCV. The 'gpu' module covers a significant part of the library's functionality and is still in active development. It is implemented using CUDA and therefore benefits from the CUDA ecosystem, including libraries such as NPP (NVIDIA Performance Primitives). With the addition of CUDA acceleration to OpenCV, developers can run more accurate and sophisticated OpenCV algorithms in real-time on higher-resolution images while consuming less power.


Android

Since 2010 OpenCV has been ported to the Android environment, which allows the use of the library in mobile application development.

iOS

In 2012 the OpenCV development team actively worked on adding extended support for iOS. Full integration is available since version 2.4.2 (2012).

OpenCL

In 2011 a new module providing OpenCL accelerations of OpenCV algorithms was added to the library. This enabled OpenCV-based code to take advantage of heterogeneous hardware, in particular to utilize the potential of discrete and integrated GPUs. Since version 2.4.6 (2013) the official OpenCV WinMegaPack includes the OpenCL module.

In the 2.4 branch, OpenCL-accelerated versions of functions and classes were located in a separate ocl module and in a separate namespace (cv::ocl), and often had different names (e.g. cv::resize() vs cv::ocl::resize() and cv::CascadeClassifier vs cv::ocl::OclCascadeClassifier) that required a separate code branch in user application code. Since OpenCV 3.0 (master branch as of 2013) the OpenCL accelerated branches are transparently added to the original API functions and are used automatically when possible/sensible.

MODULES IN OPENCV

OpenCV was built as a modular program, which means there are several shared or static libraries to pick from. An overview of the modules present in the latest version of OpenCV (3.2.0) [7]:

1. Core functionality: a compact module defining basic data structures, including the dense multi-dimensional array Mat and basic functions used by all other modules.

2. Image processing: an image processing module that includes linear and non-linear image filtering, geometrical image transformations (resize, affine and perspective warping, generic table-based remapping), color space conversion, histograms, and so on.

3. video: a video analysis module that includes motion estimation, background subtraction, and object tracking algorithms.

4. calib3d: basic multiple-view geometry algorithms, single and stereo camera calibration, object pose estimation, stereo correspondence algorithms, and elements of 3D reconstruction.

5. features2d: salient feature detectors, descriptors, and descriptor matchers.

6. objdetect: detection of objects and instances of the predefined classes (for example, faces, eyes, mugs, people, cars, and so on).

7. highgui: an easy-to-use interface to simple UI capabilities.

8. Video I/O: an easy-to-use interface to video capturing and video codecs.

9. gpu: GPU-accelerated algorithms from different OpenCV modules.

10. There are other helper modules, such as FLANN and Google test wrappers, Python bindings, and others.

SETUP AND CONFIGURATION

Since Python is the most popular language for Machine Learning and Computer Vision, set up and configuration of OpenCV for Python on Windows may be done as shown below.

Setting up OpenCV Python on Windows

OpenCV requires the numpy and matplotlib (optional but recommended) packages to be installed in a Python 2.7 environment to be able to run successfully. After making sure that the dependencies are installed, the following steps will help to set up OpenCV for Python [8]:

1. Download the latest OpenCV release from the sourceforge site [9] and double-click to extract it.

2. Go to the opencv/build/python/2.7 folder.

3. Copy cv2.pyd to C:/Python27/lib/site-packages.

Now OpenCV will work as a Python library named cv2. For checking if it works, type the following code in the Python environment.

import cv2

LICENSING

OpenCV is open source software, and OpenCV allows redistribution in source or binary forms, with or without modifications. All redistributions should bear the copyright information provided at the OpenCV licensing page [10]. The source code is available on Github [11].

USE CASE

The OpenCV library can be used for image recognition technology. There are numerous applications of image recognition, like facial recognition for security purposes, facial tagging in cameras, and image and video processing in self-driving cars. For instance, take a data set containing pictures taken from forest surveillance cameras; we can process the images to find out the number of animals at any given point of time. This will be helpful in keeping track of endangered animals and keeping a check on animal poaching.

A sample code of loading an image of a watch for the purpose of image recognition using OpenCV in Python is demonstrated below. Here, we use the numpy and matplotlib libraries which we installed earlier [12].

import cv2
import numpy as np
from matplotlib import pyplot as plt

img = cv2.imread('watch.jpg', cv2.IMREAD_GRAYSCALE)
cv2.imshow('image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Images can also be displayed using matplotlib, for which sample code is shown below [12]:

plt.imshow(img, cmap='gray', interpolation='bicubic')
plt.xticks([]), plt.yticks([])  # to hide tick values on X and Y axis
plt.plot([200,300,400], [100,200,300], 'c', linewidth=5)
plt.show()
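As a further illustration of how the objdetect module could support detection-style use cases such as the surveillance example above, the following is a minimal sketch using a pre-trained Haar cascade. The image name and the cascade file path are assumptions for illustration; the cascade XML files ship with OpenCV and their exact location depends on the installation.

import cv2

# Hypothetical paths: the Haar cascade ships with OpenCV, the image is an example
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
img = cv2.imread('people.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# detectMultiScale returns bounding boxes (x, y, w, h) for each detection
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)

cv2.imshow('detections', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

The same pattern applies to other pre-trained cascades, and is one way a detection task like the animal-counting scenario could be prototyped before training a task-specific detector.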


USEFUL RESOURCES

The OpenCV website has detailed and structured documentation for its modules, for all operating systems and platforms. Further reading is suggested based on the requirements of OS, platform and language by clicking on the version of OpenCV that is currently in use [13].

AUTHOR BIOGRAPHIES

Sahiti Korrapati is pursuing her MSc in Data Science from Indiana University Bloomington.

CONCLUSION

OpenCV is one of the leading Computer Vision software libraries, offering various modules for image and video processing and a library of various Machine Learning algorithms. OpenCV is an ever-growing platform owing to its open source nature. OpenCV is versatile in terms of its operating systems, platforms and programming languages, which makes it an even more popular tool. The open source Robot Operating System (ROS) uses OpenCV as its primary image processing software. This expands the usage of OpenCV even further.

ACKNOWLEDGEMENTS

The author thanks Professor Gregor Von Laszewski and all the AIs of big data class for the guidance and technical support.

REFERENCES

[1] OpenCV team, “Opencv library,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: http://opencv.org/ [2] OpenCV team, “About - opencv library,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: http://opencv.org/about.html [3] I. Culjak, D. Abram, T. Pribanic, H. Dzapo, and M. Cifrek, “A brief introduction to opencv,” in 2012 Proceedings of the 35th International Convention MIPRO, May 2012, pp. 1725–1730. [4] Wikipedia, “History,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: https://en.wikipedia.org/wiki/OpenCV [5] OpenCV team, “Platforms - opencv library,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: http://opencv.org/platforms/ [6] Wikipedia, “Cuda,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: https://en.wikipedia.org/wiki/CUDA [7] OpenCV team, “Opencv: Introduction,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: http://docs.opencv.org/3.2.0/d1/dfb/ intro.html [8] OpenCV team, “Install opencv-python in windows,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: http://docs.opencv.org/3.2.0/ d5/de5/tutorial_py_setup_in_windows.html [9] Slashdot Media, “Opencv,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: https://sourceforge.net/projects/opencvlibrary/files/ [10] OpenCV team, “License - opencv,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: http://opencv.org/license.html [11] A. Pavlenko, “Opencv,” Github Repository, 2017, accessed mar-30- 2017. [Online]. Available: https://github.com/opencv [12] PythonProgramming, “Opencv with python intro and load- ing images tutorial,” Web Page, 2017, accessed apr- 16-2017. [Online]. Available: https://pythonprogramming.net/ loading-images-python-opencv-tutorial/ [13] OpenCV team, “Opencv documentation index,” Web Page, 2017, accessed mar-30-2017. [Online]. Available: http://docs.opencv.org/

87 Review Article Spring 2017 - I524 1

An Overview of Pivotal Web Services

HARSHIT KRISHNAKUMAR1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected], S17-IR-2014

Project-02, April 30, 2017

Pivotal Web Services is a platform as a service (PAAS) provider which allows developers to deploy applications written in six programming languages. PWS provides the infrastructure to host applications on the cloud, and allows vertical scaling for each instance and horizontal scaling for the application. PWS is built on CloudFoundry, an open source software for hosting applications on the cloud. This paper presents the different features of Pivotal Web Services and a basic hands-on overview of application deployment in Pivotal Web Services. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524, Web Services

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IR-2014/report.pdf

1. INTRODUCTION

The current scenario for software product based companies is such that coming up with ground breaking ideas to add extra functionality for an existing application is simply not enough. They need to be able to get it out to the users as quickly as possible, else they lose ground to competitors who might have already implemented it. To make the software development and deployment process quicker, software companies follow a few methods and concepts. Pivotal Web Services comes in this line of thought, where it allows the application developer to focus on just the application development and getting the business requirements right, without worrying about platform compatibility, dependencies and differences between production/development/testing environments. PWS is built based on Cloud Foundry, which is one of the leading open source PAAS services [1]. In order to comprehend the need for a service like PWS, one would require a basic knowledge of the entire process of agile, devops and PAAS.

1.1. Agile Development

With the widespread use of the Internet to push quick updates and the emergence of automation of methods used in software testing and deployment, software companies are moving away from the traditional waterfall methodology to agile development practices, which emphasize iterative development where there is high collaboration between self-organizing and cross-functional teams to evolve requirements and solutions. Agile methods encourage deployment of high quality and goal oriented software in quick successions, and any feedback and changes will be handled in the next update version.

1.2. DevOps

Devops is a set of practices for software testing and deployment which enables Agile development. Typically there is latency in the development process because there are many manual tasks involved. Devops sets standards to automate testing and to ensure that production, testing and development environments are in sync. It gives greater responsibility and access to developers for easy testing and development. With automated processes for testing, developers would get their feedback within minutes, and they can work on fixes. The final aspect of devops is to automate deployment; there are software programs to automatically deploy the software on a host of servers with the right configurations and connections, thus reducing manual effort and latency.

1.3. Platform as a service

The concept of containers started gaining popularity considering the advantages of modularity in software development. Containers in software development serve the purpose of building modular software. A container will have the actual software along with all its dependencies and methods.

Platform as a service (PaaS) or application platform as a service (aPaaS) is a category of cloud computing services that provides a platform allowing customers to develop, run, and manage applications without the complexity of building and maintaining the infrastructure typically associated with developing and launching an app [2]. PaaS providers generally provide a cloud environment to deploy the application on, as well as the networks, servers, OS, storage, databases and other services to run applications. This removes the hassles of maintaining and running the servers and systems for an application from the developers, and also minimises the risk of server failures.


1.4. Cloud Foundry

Cloud Foundry is an open source PAAS software provider. It provides all the software and tools required to host applications on multiple clouds. Cloud Foundry does not offer the hardware for hosting clouds; there are many commercial options which provide the platform hardware along with hosted Cloud Foundry software, which takes the responsibility of handling and maintaining the cloud hardware away from application developers.

1.5. Pivotal Web Services

PWS is built based on the open source PaaS Cloud Foundry, along with some proprietary additions such as Pivotal's Developer Console, Billing Service and Marketplace [3]. PWS offers hosted cloud systems with a web interface for managing the environment, and a number of pre-provisioned services like relational databases and messaging queues [4]. Pivotal Cloud Foundry enables developers to provision and bind web and mobile apps with platform and data services such as Jenkins, MongoDB, Hadoop, etc. on a unified platform.

2. FEATURES OF PWS

PWS offers many different options to deploy and manage software [5].

2.1. Upload

There is a single-command way to upload software developed locally to the cloud. The code is transformed into a running application on the cloud. The steps to follow for uploading an application with a given name are given in [6].

2.2. Stage

Behind the scenes, the deployed application goes through staging scripts called buildpacks to create a ready-to-run package. Buildpacks are software packets that provide framework and runtime support for applications, and they are provided along with the PWS cloud. Buildpacks typically examine user-provided artifacts to determine what dependencies to download and how to configure applications to communicate with bound services. Cloud Foundry automatically detects which buildpack is required and installs it on the Diego cell where the application needs to run [7].

For example, if a particular application requires d3.js to run and needs to connect to a database, buildpacks will determine that the application needs these dependencies in order to run, attach the d3.js packet with the application, and provide connectors to connect to the database.

2.3. Distribute

Diego is the container management system for Cloud Foundry, which handles application scheduling and management. Each application VM has a Diego Cell that executes application start and stop actions locally, manages the VM's containers, and reports app status and other data [8].

2.4. Run

Applications receive entry in a dynamic routing tier, which load balances traffic across all app instances.

3. LICENSING

Though Cloud Foundry is open source, it is not easy for a developer to maintain a cloud and set up the architecture. PWS charges for the use of its services, with a monthly cost depending upon the memory of the application instance and the number of instances.

4. USE CASES

PWS can be used for a range of applications, from running websites to maintaining mobile applications. For example, if we need to host a website which accesses data, we can write the base code and deploy it to the PWS cloud.

For instance, if there is a web page that has to be hosted on the cloud, we need to create an account in Pivotal and set up the command line interface. Normally, deploying a web page requires web servers like Apache or Nginx, but with Pivotal it will automatically take care of the web server. We need to copy the web page HTML files from our local machine to the cloud where the application needs to be hosted. Next we login to the Pivotal Cloud instance by giving username and password, and create a staticfile. The last step is to push the application.

cf login -a https://api.run.pivotal.io
touch Staticfile
cf push <>

We can verify the deployed webpage using the link which we will get after the above steps.

5. CONCLUSION

PWS is a hosted cloud platform service, which uses the Cloud Foundry open source platform. It has options for scaling and updating the cloud with no downtime. As given in Section 2 (Features of PWS), there are a few basic commands to upload an application, and PWS automatically binds applications with the dependencies and configurations required. PWS allows developers to concentrate on their business requirements and developing applications, rather than hosting and hardware requirements. PWS also makes up-scaling and downscaling easy [9].

6. FURTHER EDUCATION

Further learning about Pivotal is encouraged and informative materials can be found at the Pivotal homepage [10].

ACKNOWLEDGEMENTS

The author thanks Professor Gregor von Laszewski for providing us with the guidance and topics for the paper. The author also thanks the AIs of the Big Data Class for providing the technical support.

REFERENCES

[1] Pivotal Software, Inc, "Pivotal web services | home," Web Page, 2017. [Online]. Available: https://run.pivotal.io/
[2] Wikipedia, "Platform as a service," Web Page, Mar. 2017. [Online]. Available: https://en.wikipedia.org/wiki/Platform_as_a_service
[3] J. Clark, "Pivotal fluffs up *sigh* cloud foundry *sigh* cloud for battle in the *sigh* cloud," Web Page, May 2014. [Online]. Available: https://www.theregister.co.uk/2014/05/08/pivotal_web_services_launch/
[4] C. Page, "Difference between cloud foundry and pivotal web services," Web Page, Jun. 2015. [Online]. Available: http://stackoverflow.com/a/30899157


[5] Pivotal Software, Inc., “Pivotal web services | features,” Web Page, 2017. [Online]. Available: https://run.pivotal.io/features/ [6] Pivotal Software, Inc., “Deploy an application,” Web Page, 2017. [Online]. Available: http://docs.run.pivotal.io/devguide/deploy-apps/deploy-app. html#push [7] Pivotal Software, Inc., “Buildpacks,” Web Page, 2017. [Online]. Available: http://docs.run.pivotal.io/buildpacks/ [8] Pivotal Software, Inc., “Diego architecture,” Web Page, 2017. [Online]. Available: https://docs.run.pivotal.io/concepts/diego/diego-architecture. html [9] Pivotal Software, Inc., “Pivotal cloud foundry: The lead- ing enterprise platform powered by cloud foundry,” Web Page, 2017. [Online]. Available: https://content.pivotal.io/datasheets/ pivotal-cloud-foundry-the-leading-enterprise-platform-powered-by-cloud-foundry [10] Pivotal Software, Inc., “Agile | pivotal,” Web Page, 2017. [Online]. Available: https://pivotal.io/agile

AUTHOR BIOGRAPHIES

Harshit Krishnakumar is pursuing his MSc in Data Science from Indiana University Bloomington.

90 Review Article Spring 2017 - I524 1

An Overview of Apache Avro

ANVESH NAYAN LINGAMPALLI1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

April 30, 2017

Apache Avro is a data serialization system, which uses JSON based schemas and RPC calls to send the data. Hadoop based tools natively support Avro for serialization and deserialization of the data. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: apache, avro, serialization

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2016/report.pdf

1. INTRODUCTION

Apache Avro [1] is a data serialization system. Data serialization is a mechanism to translate data in a computer environment, such as a memory buffer, data structures, or object state, into binary or textual form. Java and Hadoop provide serialization APIs, which are Java based. Apache Avro is a language independent system that can be processed by multiple languages [2].

Avro is heavily dependent on schemas. These schemas are defined with JSON, which simplifies its implementation in its libraries. Different schemas can be used for serialization and deserialization, and Avro will handle the missing fields or extra fields [3]. When Avro data is stored in a file, its schema is also stored with it. Avro stores the data definition in JSON format, which makes it easier to interpret.

Apache Avro provides rich data structures, a compact binary data format, a container file, a facility for remote procedure calls (RPC) and integration with dynamic languages [1]. In remote procedure calls, the client and server exchange schemas during the connection. This exchange helps when there are missing fields or extra fields. Avro works well with Hadoop MapReduce, which provides large scale processing across many processors to do the calculations simultaneously and efficiently.

2. COMPONENTS OF AVRO

Avro has two main components, data serialization and remote procedure call (RPC) support [4].

2.1. Data Serialization

Java provides a mechanism called object serialization, where an object can be represented as a sequence of bytes that includes the object's data as well as information about the object's type and the types of data stored in the object. To serialize data using Avro: 1) define an Avro schema; 2) compile the schema using the Avro utility; 3) the Java code corresponding to that schema is obtained; 4) populate the schema with data; 5) serialize the data using the Avro library.

2.2. Remote Procedure Call

In a Remote Procedure Call, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server have the other's full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved [5].

3. SPECIFICATIONS

3.1. Encoding

Avro specifies two serialization encodings, binary and JSON. Binary encoding is smaller and faster, but not efficient in the case of debugging and web-based applications, where JSON encoding is appropriate [6].

3.2. Object Container

Avro includes an object container file format. Objects are stored in blocks that may be compressed, and synchronization markers are used to permit splitting of the files for MapReduce processing. A file consists of a header followed by one or more data blocks. The header consists of four bytes, 'O', 'b', 'j', 1, followed by the file metadata, including the schema, and a 16-byte, randomly generated sync marker [7]. Currently used file metadata properties are avro.schema, which contains the schema of the objects stored in the file, and avro.codec, which contains the name of the compression codec used to compress the blocks.

A file data block consists of a count of the objects in the block, the size in bytes of the serialized objects in the current block, the serialized objects, and a 16-byte sync marker [7].

Each block's binary data can be efficiently extracted without deserializing the contents. The combination of the block size, object counts, and sync markers enables detection of corrupt blocks.
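The serialization steps in Section 2.1 and the container file layout in Section 3.2 can be illustrated with a short sketch using the Avro Python library (the avro package). The schema and file names below are illustrative assumptions, not taken from the original text, and the lowercase avro.schema.parse call assumes the Python 2 era of the library that was current at the time of writing.

import json
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# A hypothetical record schema, defined in JSON as described above
schema_json = json.dumps({
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]}
    ]
})
schema = avro.schema.parse(schema_json)

# Write an object container file; the schema is stored in the file header
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": None})
writer.close()

# Read it back; the reader picks the schema up from the file itself
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    print(user)
reader.close()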


4. KEY FEATURES OF APACHE AVRO
Apache Thrift and Google's Protocol Buffers are serialization libraries that compete with Apache Avro, but Avro is fundamentally different from these frameworks. The key features of Apache Avro are dynamic typing, untagged data, and no manually-assigned field IDs.

4.1. Dynamic Typing
Serialization and deserialization without code generation is possible in Apache Avro. Data is always accompanied by a schema that permits full processing of the data. This feature of Avro makes it possible to construct data-processing systems and languages. Avro creates a binary structure format that is both compressible and splittable [3].

4.2. Untagged Data
Keeping binary data together with a schema allows the data to be written without any overhead. This feature provides faster data processing and compact data encoding.

4.3. Schema Evolution
Avro cleanly handles schema changes such as missing fields, added fields and changed fields. By using this feature, new programs can read old data and old programs can read new data.
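The schema evolution behaviour described in Section 4.3 can be illustrated with a short, hedged sketch using the same Avro Python library: a record written with an old schema is read with a newer schema that adds a field carrying a default value. The schemas and field names here are invented for illustration.

import io
import json
import avro.schema
from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

writer_schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "Event",
    "fields": [{"name": "id", "type": "int"}]
}))
# The reader's schema adds a field with a default, so old data remains readable.
reader_schema = avro.schema.parse(json.dumps({
    "type": "record", "name": "Event",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "source", "type": "string", "default": "unknown"}
    ]
}))

# Write one record with the old schema using the binary encoding (Section 3.1).
buf = io.BytesIO()
DatumWriter(writer_schema).write({"id": 42}, BinaryEncoder(buf))

# Read it back with the new schema; the missing field takes its default value.
buf.seek(0)
record = DatumReader(writer_schema, reader_schema).read(BinaryDecoder(buf))
print(record)  # expected: {'id': 42, 'source': 'unknown'}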

5. APPLICATION
The primary use of Apache Avro is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes. Avro was designed for Hadoop to make it interoperable across different languages. With Avro serialization, Pig utilizes AvroStorage().

5.1. Using Avro with Eventlet
Eventlet [4] is a concurrent networking library for Python that allows changing how the code is run rather than how it is written. RPC support from Avro combined with Eventlet is used for building highly concurrent network-based services. Avro is present in the transport layer on top of HTTP for RPC calls. It POSTs binary data to the server and processes the response. Eventlet.wsgi (Web Server Gateway Interface) is used to build the RPC server.

6. DISADVANTAGES
Avro is not the fastest in the serialization process, but it is one of the faster frameworks. The syntax of Avro can be error prone, and error handling is complex when compared with Protocol Buffers and Thrift.

7. EDUCATIONAL MATERIAL
The Tutorialspoint.com tutorial for Apache Avro [3] provides the resources to use Avro for serialization and deserialization of data. Apache Avro's website [1] provides documentation for understanding, deploying and managing the framework.

8. CONCLUSION
Apache Avro is a framework which allows the serialization of data that has the schema built into it. The serialization of data results in a compact binary format that does not require proxy objects. Instead of using generated proxy object libraries and strong typing, Avro relies heavily on the schemas that are sent along with the serialized data.

REFERENCES
[1] "Apache avro," Web Page, accessed: 2017-03-21. [Online]. Available: https://avro.apache.org/
[2] "Apache Avro documentation," Web Page, accessed: 2017-03-21. [Online]. Available: https://avro.apache.org/docs/1.7.7/index.html
[3] "Avro tutorial," Web Page, accessed: 2017-03-22. [Online]. Available: https://www.tutorialspoint.com/avro/avro_overview.html
[4] "Using Avro with Eventlet," Web Page, accessed: 2017-03-22. [Online]. Available: http://unethicalblogger.com/2010/05/07/how-to-using-avro-with-eventlet.html
[5] "Rpc by avro," Web Page, accessed: 2017-03-22. [Online]. Available: https://www.igvita.com/2010/02/16/data-serialization-rpc-with-avro-ruby/
[6] "Encoding in avro," Web Page, accessed: 2017-03-22. [Online]. Available: https://avro.apache.org/docs/1.8.1/spec.html
[7] "Apache avro wiki," Web Page, accessed: 2017-03-22. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Avro

92 Review Article Spring 2017 - I524 1

An Overview of Pivotal HD/HAWQ and its Applications

VEERA MARNI1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

April 30, 2017

Pivotal HDB is the Apache Hadoop native SQL database powered by Apache HAWQ for data science and machine learning workloads. It can be used to gain deeper and actionable insights into data without the need to move data to another platform to perform advanced analytics. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: pivotal, hd, hawq, data science

https://github.com/narayana1043/sp17-i524/blob/master/paper2/S17-IR-2017/report.pdf

1. INTRODUCTION
Pivotal-HAWQ [1] is a Hadoop native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop. HAWQ reads data from and writes data to HDFS natively. HAWQ delivers industry-leading performance and linear scalability. It provides users the tools to confidently and successfully interact with petabyte range data sets. HAWQ provides users with a complete, standards compliant SQL interface.
An MPP database is a database that is optimized to be processed in parallel, with many operations performed by many processing units at a time [2]. HAWQ breaks complex queries into small tasks and distributes them to MPP query processing units for execution. HAWQ's basic unit of parallelism is the segment instance. Multiple segment instances on commodity servers work together to form a single parallel query processing system. A query submitted to HAWQ is optimized, broken into smaller components, and dispatched to segments that work together to deliver a single result set. All relational operations - such as table scans, joins, aggregations, and sorts - simultaneously execute in parallel across the segments. Data from upstream components in the dynamic pipeline are transmitted to downstream components through the scalable User Datagram Protocol (UDP) [3] interconnect.
Based on Hadoop's distributed storage, HAWQ has no single point of failure and supports fully-automatic online recovery. System states are continuously monitored, therefore if a segment fails, it is automatically removed from the cluster. During this process, the system continues serving customer queries, and the segments can be added back to the system when necessary.

2. ARCHITECTURE OF HAWQ
In a typical HAWQ deployment, each slave node has one physical HAWQ segment, an HDFS DataNode and a NodeManager [4] installed. Masters for HAWQ, HDFS and YARN are hosted on separate nodes [5]. HAWQ is tightly integrated with YARN, the Hadoop resource management framework, for query resource management. HAWQ caches containers from YARN in a resource pool and then manages those resources locally by leveraging HAWQ's own finer-grained resource management for users and groups.

Fig. 1. The following diagram provides a high-level architectural view of a typical HAWQ deployment [6].

2.1. HAWQ Master
The HAWQ master is the entry point to the system. It is the database process that accepts client connections and processes the SQL commands issued. The HAWQ master parses queries, optimizes queries, dispatches queries to segments and coordinates the query execution. End-users interact with HAWQ through the master and can connect to the database using client programs such as psql or application programming interfaces (APIs) such as JDBC [7] or ODBC [8]. The master is where the global system catalog resides. The global system catalog is the set of system tables that contain metadata about the HAWQ system itself. The master does not contain any user data; data resides only on HDFS.
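Because the master accepts standard PostgreSQL-style client connections (the article lists psql, JDBC and ODBC), a generic PostgreSQL driver can typically be used from Python as well. The sketch below is an assumption-laden illustration rather than HAWQ-specific code: the host, port, database, credentials and table are placeholders for a hypothetical deployment.

import psycopg2  # generic PostgreSQL driver, assumed here to reach the HAWQ master

# All connection details below are placeholders.
conn = psycopg2.connect(
    host="hawq-master.example.com",
    port=5432,
    dbname="analytics",
    user="gpadmin",
    password="secret",
)

with conn.cursor() as cur:
    # The master plans the query and dispatches it to the segments (Section 2.1).
    cur.execute("SELECT region, count(*) FROM sales GROUP BY region")
    for region, cnt in cur.fetchall():
        print(region, cnt)

conn.close()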

2.2. HAWQ Segment
In HAWQ, the segments are the units that process data simultaneously. There is only one physical segment on each host. Each segment can start many Query Executors (QEs) for each query slice. This makes a single segment act like multiple virtual segments, which enables HAWQ to better utilize all available resources.

2.3. HAWQ Interconnect
The interconnect is the networking layer of HAWQ. When a user connects to a database and issues a query, processes are created on each segment to handle the query. The interconnect refers to the inter-process communication between the segments, as well as the network infrastructure on which this communication relies.

2.4. HAWQ Resource Manager
The HAWQ resource manager obtains resources from YARN and responds to resource requests. Resources are buffered by the HAWQ resource manager to support low latency queries. The HAWQ resource manager can also run in standalone mode. In these deployments, HAWQ manages resources by itself without YARN.

2.5. HAWQ Catalog Service
The HAWQ catalog service stores all metadata, such as UDF/UDT [9] information, relation information, security information and data file locations.

2.6. HAWQ Fault Tolerance Service
The HAWQ fault tolerance service (FTS) is responsible for detecting segment failures and accepting heartbeats from segments.

2.7. HAWQ Dispatcher
The HAWQ dispatcher dispatches query plans to a selected subset of segments and coordinates the execution of the query. The dispatcher and the HAWQ resource manager are the main components responsible for the dynamic scheduling of queries and the resources required to execute them.

3. KEY FEATURES OF PIVOTAL HDB/HAWQ
3.1. High-Performance Architecture
Pivotal HDB's parallel processing architecture delivers high performance throughput and low latency (potentially, near-real-time) query responses that can scale to petabyte-sized datasets. Pivotal HDB also features a cutting-edge, cost-based SQL query optimizer and dynamic pipelining technology for efficient performance operation.

3.2. Robust ANSI SQL Compliance
Pivotal HDB complies with ANSI SQL-92, -99, and -2003 standards, plus OLAP [10] extensions. Users can leverage existing SQL expertise and existing SQL-based applications and BI/data visualization tools, and execute complex queries and joins, including roll-ups and nested queries.

3.3. Deep Analytics and Machine Learning
Pivotal HDB integrates statistical and machine learning capabilities that can be natively invoked from SQL and applied natively to large data sets across a Hadoop cluster. Pivotal HDB supports the PL/Python, PL/Java and PL/R programming languages.

3.4. Flexible Data Format Support
HDB supports multiple data file formats including Apache Parquet and HDB binary data files, plus HBase and Avro via HDB's Pivotal Extension Framework (PXF) services. HDB interfaces with HCatalog, which enables querying an even broader range of data formats.

3.5. Tight Integration with Hadoop Ecosystem
Pivotal HDB plugs into the Apache Ambari [11] installation, management and configuration framework. This provides a Hadoop-native mechanism for installation and deployment of Pivotal HDB and for monitoring cluster resources across Pivotal HDB and the rest of the Hadoop ecosystem.

4. ECOSYSTEM
HAWQ uses Hadoop ecosystem [12] integration and manageability and flexible data-store format support. HAWQ is natively in Hadoop and requires no connectors. Hadoop Common contains libraries and utilities needed by other Hadoop modules. HDFS is a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. Hadoop YARN is a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications. Hadoop MapReduce is an implementation of the MapReduce programming model for large scale data processing.

5. APPLICATIONS OF PIVOTAL HD/HAWQ
The Pivotal HD Enterprise product enables you to take advantage of big data analytics without the overhead and complexity of a project built from scratch. Pivotal HD Enterprise is Apache Hadoop that allows users to write distributed processing applications for large data sets across a cluster of commodity servers using a simple programming model.

5.1. Content-Based Image Retrieval using Pivotal HD with HAWQ
Manual tagging is infeasible for large image databases and is prone to errors due to users' subjective opinions. Given a query image, a CBIR [13] system can be potentially used to auto-tag (label) similar images in the collection, with the assigned label being the object category or scene description label. This technology also has an important role to play within a number of non-consumer domains. CBIR systems can be used in health care contexts for case-based diagnoses. A common example is image retrieval on large image databases such as Flickr [14].

5.2. Transition of Hulu from MySQL to Pivotal HD/HAWQ
Hulu is a leading video company that offers TV shows, clips, movies and more on the free, ad-supported Hulu.com service and the subscription service Hulu Plus. It serves 4 billion videos. It has used HAWQ to gain performance improvements to handle queries from users. Its main challenge was the inability to scale MySQL and Memcached to improve performance, which was handled by Pivotal HAWQ [15].


6. DISADVANTAGES
There are also some drawbacks that need attention before one chooses to use Pivotal HD/HAWQ. Since these deployments are often run with the help of public cloud providers, there is a greater dependency on service providers, a risk of being locked into proprietary or vendor-recommended systems, and potential privacy and security risks of putting valuable data on someone else's system. Another important problem is what happens if the supplier suddenly stops its services. Even with these disadvantages, the technology is still used widely in various industries and many more are looking forward to moving into the cloud.

7. EDUCATIONAL MATERIAL
Pivotal offers a portfolio of role-based courses and certifications to build product expertise [16]. These courses can get someone with basic knowledge of the Hadoop ecosystem to understand how to deploy, manage, build, integrate and analyze Pivotal HD/HAWQ applications on clouds.

8. CONCLUSION
Pivotal HD/HAWQ is the Apache Hadoop native SQL database for data science and machine learning workloads. With Pivotal HDB we can ask more questions on data in Hadoop to gain insights into the data by using all the available cloud resources, without sampling data or moving it into another platform for advanced analytics. Its fault tolerant architecture can handle node failures and move the workload around clusters.

REFERENCES
[1] "About hawq," webpage. [Online]. Available: http://hdb.docs.pivotal.io/212/hawq/overview/HAWQOverview.html
[2] M. Rouse and M. Haughn, "About mpp," webpage. [Online]. Available: http://searchdatamanagement.techtarget.com/definition/MPP-database-massively-parallel-processing-database
[3] "About udp," webpage. [Online]. Available: https://en.wikipedia.org/wiki/User_Datagram_Protocol
[4] "About nodemanager overview," webpage. [Online]. Available: https://docs.oracle.com/cd/E13222_01/wls/docs81/adminguide/nodemgr.html
[5] "Architecture of pivotal hawq," webpage. [Online]. Available: http://hdb.docs.pivotal.io/212/hawq/overview/HAWQArchitecture.html
[6] "Pivotal architecture online image," webpage. [Online]. Available: https://hdb.docs.pivotal.io/211/hawq/mdimages/hawq_high_level_architecture.png
[7] "About jdbc," webpage. [Online]. Available: https://en.wikipedia.org/wiki/Java_Database_Connectivity
[8] "About odbc," webpage. [Online]. Available: https://en.wikipedia.org/wiki/Open_Database_Connectivity
[9] "About udt," webpage. [Online]. Available: https://en.wikipedia.org/wiki/UDP-based_Data_Transfer_Protocol
[10] "About olap," webpage. [Online]. Available: https://en.wikipedia.org/wiki/Online_analytical_processing
[11] "About apache ambari," webpage. [Online]. Available: https://ambari.apache.org/
[12] "About hadoop," webpage. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Hadoop
[13] "About cbir," webpage. [Online]. Available: https://content.pivotal.io/blog/content-based-image-retrieval-using-pivotal-hd-with-hawq
[14] E. Hörster, R. Lienhart, and M. Slaney, "Image retrieval on large-scale image databases," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval, ser. CIVR '07. New York, NY, USA: ACM, 2007, pp. 17-24. [Online]. Available: http://doi.acm.org/10.1145/1282280.1282283
[15] "Pivotal hulu use case," webpage. [Online]. Available: http://hdb.docs.pivotal.io/212/hawq/overview/HAWQArchitecture.html
[16] "Educational courses," webpage. [Online]. Available: https://pivotal.io/training/courses

95 Review Article Spring 2017 - I524 1

An overview of Cisco Intelligent Automation for Cloud

BHAVESH REDDY MERUGUREDDY1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

Paper2, April 30, 2017

Cisco Intelligent Automation for Cloud is a cloud platform that can deliver services across mixed environments. Its services range from underlying infrastructure to anything-as-a-service and allow the users to evaluate, transform and deploy various IT services. It provides a foundation for organizational transformation by expanding the uses of cloud technology beyond its infrastructure. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Automation, Multitenancy, Integration, Orchestration, Provisioner

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2018/report.pdf

1. INTRODUCTION
Cisco Intelligent Automation for Cloud (IAC) provides a framework that allows users to make use of the cloud services effectively and manage the cloud beyond Infrastructure-as-a-Service (IaaS) [1]. It can be considered as a unified cloud platform that can deliver any type of service across mixed environments [2]. This leads to an increase in cloud penetration across different businesses. Its services range from underlying infrastructure to anything-as-a-service by allowing its users to evaluate, transform and deploy the IT and business services in a way they desire. Cisco Intelligent Automation for Cloud automates sophisticated data centers from a single self-service portal, which is beyond the provision of virtual machines. It creates a catalog of standardized service offerings, thereby implementing policy-based controls. It also provides resource management across different aspects of IT infrastructure such as network, virtualization, storage, compute and applications.

2. COMPONENTS
Cisco Intelligent Automation for Cloud consists of major interface and automation components. In other words, it provides an integrated stack of core elements namely Cisco Cloud Portal, Cisco Process Orchestrator, Cisco Process Orchestrator Integration Framework, Cisco Server Provisioner, and Cloud Automation Packs [3].

2.1. Cisco Cloud Portal
Cisco Cloud Portal is a web-based service portal that helps users to order and manage services. It provides a configurable interface for different roles and departments which include managers, administrators and consumers [3]. This provides access to a catalog of on-demand IT resources. Service requests for resources running on cross-platform infrastructure can be managed effectively by Cisco Cloud Portal. It tracks and manages the lifecycle of each service from the initial request to withdrawal. The modules which constitute Cisco Cloud Portal are Cisco Portal Manager, Cisco Service Catalog, Cisco Request Center and Cisco Service Connector.
Cisco Portal Manager combines data from multiple sources providing a highly flexible portal interface. It manages the interface and increases user satisfaction. It provides a drag-and-drop facility on the interface making it easier for the users to create their own portal views which ensures flexibility [4]. Depending on the user requests, Cisco Service Catalog provides a controlled access to IT resources through standardized service options. It makes use of reusable components and tools for publishing services in the portal. Cisco Request Center provides lifecycle management and request management for the infrastructure services. To ensure data center management, it maintains policy-based controls and reduces the cycle time for the ordering, approval and provisioning process. It uses advanced methods for simplifying the ordering process. It also helps in streamlining end-to-end service delivery cycle time. Cisco Service Connector is a platform that can be used for integrating with third-party systems. This supports a heterogeneous data center environment by providing adaptors for integrating with automated provisioning systems.

2.2. Cisco Process Orchestrator
Cisco Process Orchestrator is a component of Cisco IAC responsible for automation of service management and assurance instantiation. It is an orchestration engine that provides an automation design studio and a reporting and analytics module. It is useful in automating and standardizing IT processes in heterogeneous environments [5]. It considers IT automation as a service-oriented approach and focuses more on services. It allows users to deploy services by defining new instances in real time.

Cisco Process Orchestrator acts as a foundation for building application, network and data center oriented solutions. It includes auditing and extensive reporting and provides workspaces for developers and administrators, making it easier for the stakeholders to manage services.
The primary function of Cisco Process Orchestrator is to provide automation through integration with domain managers and tools in the environment [6]. Automation Packs and Adapters are the features used in the integration. Some of the services placed include event correlation, application provisioning and event application provisioning. It usually receives events requiring further analysis. It also includes security tools, configuration tools, change management systems, visualization management tools, provisioning systems and service desks.

2.3. Cisco Process Orchestrator Integration Framework
Cisco Process Orchestrator Integration Framework is responsible for integrating Cisco IAC with any data center element in the environment. It connects IT service management tools into streamlined and automated processes by making use of field-built integration. VMware, SAP, Oracle DB, Remedy and Windows are some of the available integrations. Field integration and automation are carried out by the design studio which provides web services, database access and a command-line interface.

2.4. Cisco Server Provisioner
Cisco Server Provisioner is a software application used for deploying systems with Linux, Windows and Hypervisors from bare metal in IT organizations and data centers [7]. It acts as an application for native and remote installations on physical and virtual servers. It provides an imaging component for OS and Hypervisor. The Cisco Server Provisioner can be considered as a server suite that consists of Bare Metal Provisioning Payloads and Bare Metal Imaging Payloads which provide GUI and API interfaces including remote file copying and remote troubleshooting. Provisioner runs on a dedicated system which does not run any other application. Some of the benefits of the Provisioner include reduced deployment time and increased utilization of systems.

2.5. Cloud Automation Packs
Cloud Automation Packs are the workflows useful for complex computing tasks. The tasks include automation of core activities that cover various domains, Cisco Server Provisioner task automation, VMware task automation and Cisco UCS Manager task automation. Automation Packs provide a set of target groups, variables, configurations and process definitions required for defining automated IT processes. Automation Packs combine with Adaptors to enable integrations. Integration with an IT element is carried out through a combination of automation content from an Automation Pack and a set of Adaptors. Some of the integration scenarios include Command Line Interface invocations, Web Service integration, Messaging integration and Scripting support. Automation content can leverage Adaptors to enable these scenarios. Automation Packs can be used to build, pack, update and ship the integrations.

3. FUNCTIONS
Cisco Intelligent Automation for Cloud provides a platform for designing, deploying and operating a cloud infrastructure in a public, private or a hybrid model. It supports various cloud management activities including setup and design, system operations, reporting and analytics.

3.1. Self-Service Interface
Self-Service Interface is a web-based interface which allows the users to view the service catalog according to their roles and other access controls. It provides dynamic forms by which users can provide configuration details and order services. It also allows the users to track order status, manage and modify placed orders and view the usage and consumption.

3.2. Service Delivery Automation
After the approval of the placed orders, Service Delivery Automation takes place to orchestrate the configuration and provision of resources like compute, network, storage and supporting services such as firewall, disaster recovery and load balancing [3]. The automated provisioning provides consumption tracking and integration into metering and billing systems. The automated processes then orchestrate the configuration updates and allow the service information updates to be sent back to the system management tools and web-based portal.

3.3. Operational Process Automation
Operational Process Automation coordinates the operational tasks for cloud management which include service-level management, alerting, reporting, capacity planning, performance management, user management and maintenance checks. It provides user administration capability to control user roles and identity, placing the users securely isolated from each other. System incidents can be managed by the users through the alert management feature, and all the processes and results can be tracked and reported by the reporting functionality. Operational Process Automation consists of an Automation Control Center which is a console for viewing and controlling automated processes.

3.4. Network Automation
Cisco IAC allows a manual pre-provisioning of the network layer with the increasing amount of data being placed in the cloud. This is carried out by Network Automation. Network Automation enables deployment of network services through the Self-Service Portal with a single order [8]. It facilitates dynamic installation of virtual network devices with onboarding users and creates network topologies for the users based on the applications they use.

3.5. Cloud Governance
Through Cloud Governance, Cisco IAC delivers a set of measures for tracking business-oriented metrics. Cloud Governance provides portfolio management across different cloud environments. It allows organizations to establish limits ahead of time. Financial granularity is achieved by these consumption limits.

3.6. Advanced Multitenancy
Advanced Multitenancy allows multiple users to securely reside in a shared environment by providing isolated containers. User management work is offloaded from the cloud provider as the users are controlled by the cloud administrators. Multitenancy sets service pricing per user and accordingly, the services can be enabled or disabled per organization or user. It provides onboarding, modification and offboarding of users and enables secure containers by instantiating network devices.

97 Review Article Spring 2017 - I524 3

3.7. Multicloud Management REFERENCES By Multicloud Management, service providers can tailor their [1] T. Hagay, “Build a better cloud with cisco intelligent automation for services to specific project needs and specific functions of the cloud,” Webpage, 2014. [Online]. Available: https://www.ciscolive.com/ hypervisor platforms. For example, Cisco IAC supports two online/connect/sessionDetail.ww?SESSION_ID=76497&tclass=popup infrastructure layers namely Cisco UCS Director and OpenStack [2] S. Miniman, “Cisco moves up the cloud stack with intelligent automation,” [8]. It allows administrators to manage network services and Webpage. [Online]. Available: http://wikibon.org/wiki/v/Cisco_Moves_ virtual machines by integrating with Havana and Icehouse. This Up_the_Cloud_Stack_with_Intelligent_Automation allows the Cisco UCS Director users to manage Microsoft System [3] Cisco Intelligent Automation for Cloud, Cisco Systems, 2011. [Online]. Center Virtual Machine Manager. Available: http://www.cisco.com/c/dam/en_us/training-events/le21/ le34/downloads/689/vmworld/CIAC_Cisco_Intelligent_Automation_ Cloud.pdf 3.8. Resource Management [4] Cisco Cloud Portal, Cisco Systems, 2011. [Online]. Cisco IAC Resource Management orchestrates resource-level Available: http://www.cisco.com/c/dam/en/us/products/collateral/ operations across different hypervisors such as Hyper-V, Xen cloud-systems-management/cloud-portal/cisco_cloud_portal.pdf or VMware, compute resources such as Cisco UCS, network [5] “Cisco process orchestrator,” Webpage. [On- resources and storage resources such as NetApp. This is done line]. Available: http://www.cisco.com/c/en/us/products/ by orchestrating the requests to domain resource managers. It cloud-systems-management/process-orchestrator/index.html provides maintenance and replacement of units. Capacity man- [6] Cisco Process Orchestrator 3.0 Integrations and Au- agement is provided by automated capacity utilization checks, tomation Packs, Cisco Systems, 2013. [Online]. Avail- able: http://www.cisco.com/en/US/docs/net_mgmt/datacenter_ trending reports and alerts. The usage and user quota are man- mgmt/Process_Orchestrator/3.0/Integration_Automation/CPO_3.0_ aged by automated monitoring and metering of user accounts. Integrations_and_Automation_Packs.pdf [7] Cisco Server Provisioner User’s Guide, LinMin Corp. and Cisco 3.9. Lifecycle Management Systems. [Online]. Available: https://linmin.com/cisco/help/index.html? Lifecycle Management is responsible for creation of service defi- introduction.html nitions which include selection parameters, business and techni- [8] F. Mondora, “Cisco intelligent automation for cloud 4.0,” cal processing flows, pricing options and design descriptions. It Webpage. [Online]. Available: https://www.mondora.com/#!/post/ also involves management of a service model and underlying 12d838f9aa9a6a3c144d29cc6ff56a7c automation design for managing a service. Automation design is managed by creating workflows to automate service provi- sioning, modification, decommissioning and upgrades. The AUTHOR BIOGRAPHIES designs are modified by point-and-click-tools instead of cus- Bhavesh Reddy Merugureddy is pursuing his MSc in Com- tom programming. Lifecycle Management tracks all aspects of puter Science from Indiana University Bloomington services that are running which includes business and project information.

4. DEPLOYMENT
Cisco Intelligent Automation for Cloud is deployed as a software solution. For planning, preparation, design, implementation and optimization of cloud services, it is deployed along with a services engagement. The services engagement involves creation of automation workflows and development of a cloud strategy. It focuses on service capabilities for the software deployment.

5. CONCLUSION
Cisco Intelligent Automation for Cloud is a framework useful in expanding cloud responsiveness and flexibility. It provides services beyond provisioning virtual machines. It delivers various services in a self-service manner across mixed environments. The cloud management in Cisco Intelligent Automation for Cloud allows users to evaluate, procure and deploy IT services according to their needs. Self-Service Portal, Service Delivery Automation, Network Automation, Resource Management, Provisioning, Advanced Multitenancy, Portfolio Management and Lifecycle Management are the elements responsible for the cloud management in Cisco Intelligent Automation for Cloud. Cisco IAC integrates with the Cisco Unified Computing System, one of the Cisco solutions, to provide the required environment for big data and analytics.

6. ACKNOWLEDGEMENTS
The author thanks Professor Gregor von Laszewski and all the AIs of the Big Data class for the guidance and technical support.

REFERENCES
[1] T. Hagay, "Build a better cloud with cisco intelligent automation for cloud," Webpage, 2014. [Online]. Available: https://www.ciscolive.com/online/connect/sessionDetail.ww?SESSION_ID=76497&tclass=popup
[2] S. Miniman, "Cisco moves up the cloud stack with intelligent automation," Webpage. [Online]. Available: http://wikibon.org/wiki/v/Cisco_Moves_Up_the_Cloud_Stack_with_Intelligent_Automation
[3] Cisco Intelligent Automation for Cloud, Cisco Systems, 2011. [Online]. Available: http://www.cisco.com/c/dam/en_us/training-events/le21/le34/downloads/689/vmworld/CIAC_Cisco_Intelligent_Automation_Cloud.pdf
[4] Cisco Cloud Portal, Cisco Systems, 2011. [Online]. Available: http://www.cisco.com/c/dam/en/us/products/collateral/cloud-systems-management/cloud-portal/cisco_cloud_portal.pdf
[5] "Cisco process orchestrator," Webpage. [Online]. Available: http://www.cisco.com/c/en/us/products/cloud-systems-management/process-orchestrator/index.html
[6] Cisco Process Orchestrator 3.0 Integrations and Automation Packs, Cisco Systems, 2013. [Online]. Available: http://www.cisco.com/en/US/docs/net_mgmt/datacenter_mgmt/Process_Orchestrator/3.0/Integration_Automation/CPO_3.0_Integrations_and_Automation_Packs.pdf
[7] Cisco Server Provisioner User's Guide, LinMin Corp. and Cisco Systems. [Online]. Available: https://linmin.com/cisco/help/index.html?introduction.html
[8] F. Mondora, "Cisco intelligent automation for cloud 4.0," Webpage. [Online]. Available: https://www.mondora.com/#!/post/12d838f9aa9a6a3c144d29cc6ff56a7c

AUTHOR BIOGRAPHIES
Bhavesh Reddy Merugureddy is pursuing his MSc in Computer Science from Indiana University Bloomington.

98 Review Article Spring 2017 - I524 1

KeystoneML

VASANTH METHKUPALLI1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

Paper-2, April 30, 2017

KeystoneML is a software framework, written in Scala, from the UC Berkeley AMPLab designed to simplify the construction of large scale, end-to-end, machine learning pipelines with Apache Spark. KeystoneML and spark.ml share many features, but there are a few important differences, particularly around type safety and chaining, which lead to pipelines that are easier to construct and more robust. KeystoneML also presents a richer set of operators than those present in spark.ml, including featurizers for images, text, and speech, and provides several example pipelines that reproduce state-of-the-art academic results on public data sets. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524, KeystoneML, MLlib, API, Spark

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2019/report.pdf

1. INTRODUCTION
MLlib's goal is to make practical machine learning (ML) scalable and easy. Besides the new algorithms and performance improvements that are seen in each release, a great deal of time and effort has been spent on making MLlib easy. Similar to Spark Core, MLlib provides APIs in three languages: Python, Java, and Scala, along with a user guide and example code, to ease the learning curve for users coming from different backgrounds. In Apache Spark 1.2, Databricks, jointly with AMPLab, UC Berkeley, continues this effort by introducing a pipeline API to MLlib for easy creation and tuning of practical ML pipelines [1].
A practical ML pipeline often involves a sequence of data pre-processing, feature extraction, model fitting, and validation stages. For example, classifying text documents might involve text segmentation and cleaning, extracting features, and training a classification model with cross-validation. Though there are many libraries we can use for each stage, working with large-scale datasets is not as easy as it looks. Most ML libraries are not designed for distributed computation or they do not provide native support for pipeline creation and tuning. Unfortunately, this problem is often ignored in academia, and it has received largely ad-hoc treatment in industry, where development tends to occur in manual one-off pipeline implementations [1].

2. DESIGN PRINCIPLES
KeystoneML is built on several design principles: supporting end-to-end workflows, type safety, horizontal scalability, and composability. By focusing on these principles, KeystoneML allows for the construction of complete, robust, large scale pipelines that are constructed from reusable, understandable parts [2].

3. KEY API CONCEPTS
At the center of KeystoneML are a handful of core API concepts that allow us to build complex machine learning pipelines out of simple parts:
• Pipelines
• Nodes
• Transformers
• Estimators

4. PIPELINES
A Pipeline is a dataflow that takes some input data and maps it to some output data through a series of nodes. By design, these nodes can operate on one data item (for point lookup) or many data items: for batch model evaluation [2].
In a sense, a pipeline is just a function that is composed of simpler functions. Here's part of the Pipeline definition:

Fig. 1. Pipeline Definition

From this we can see that a Pipeline has two type parameters: its input and output types. We can also see that it has methods to operate on just a single input data item, or on a batch RDD of data items [2].
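KeystoneML itself is written in Scala, and Figure 1 (not reproduced here) shows its actual Pipeline trait. Purely as a language-neutral illustration of the two ideas stressed above - a pipeline has fixed input and output types, and composing two pipelines yields a new pipeline that applies them in sequence - the toy Python sketch below mimics that behaviour. None of these class or method names belong to the KeystoneML API.

class Pipeline:
    # Toy stand-in for a typed dataflow: wraps a function from A to B.
    def __init__(self, fn):
        self.fn = fn

    def apply(self, item):
        # Operate on a single data item (point lookup).
        return self.fn(item)

    def apply_batch(self, items):
        # Operate on a batch of items (in KeystoneML this would be a Spark RDD).
        return [self.fn(x) for x in items]

    def and_then(self, other):
        # Compose A => B with B => C into a new A => C pipeline.
        return Pipeline(lambda x: other.fn(self.fn(x)))


# Example: tokenize text, then count tokens.
tokenize = Pipeline(lambda text: text.lower().split())
count = Pipeline(len)
token_count = tokenize.and_then(count)

print(token_count.apply("KeystoneML chains simple nodes"))   # 4
print(token_count.apply_batch(["one two", "a b c"]))         # [2, 3]

In the real framework the composition is checked at compile time, which is what the type-safety discussion in Section 6 refers to.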


5. NODES
Nodes come in two flavors:
• Transformers
• Estimators
Transformers are nodes which provide a unary function interface for both single items and RDDs of the same type of item, while an Estimator produces a Transformer based on some training data.

5.1. Transformers
As already mentioned, a Transformer is the simplest type of node; it takes an input and deterministically transforms it into an output. Here's an abridged definition of the Transformer class.

Fig. 2. Transformer Class

While transformers are unary functions, they themselves may be parameterized by more than just their input. To handle this case, transformers can take additional state as constructor parameters. Here's a simple transformer which will add a fixed vector to any vector it is fed as input [2].

Fig. 3. Transformer Class - Additional states

5.2. Estimators
Estimators are what puts the ML in KeystoneML. An abridged Estimator interface looks like this:

Fig. 4. Estimator Interface

An Estimator takes in training data as an RDD to its fit() method, and outputs a Transformer. Suppose you have a big list of vectors and you want to subtract off the mean of each coordinate across all the vectors (and new ones that come from the same distribution). We could create an Estimator to do this.

6. CHAINING NODES AND BUILDING PIPELINES
Pipelines are created by chaining transformers and estimators with the andThen methods. Going back to a different part of the Transformer interface: ignoring the implementation, andThen allows you to take a pipeline and add another onto it, yielding a new Pipeline[A,C] which works by first applying the first pipeline (A => B) and then applying the next pipeline (B => C).
This is where type safety comes in to ensure robustness. As your pipelines get more complicated, you may end up trying to chain together nodes that are incompatible, but the compiler won't let you. This is powerful, because it means that if your pipeline compiles, it is more likely to work when you go to run it at scale [3].
Estimators can be chained onto transformers via the andThen (estimator, data) or andThen (labelEstimator, data, labels) methods. The latter makes sense if you're training a supervised learning model which needs ground truth training labels. Suppose you want to chain together a pipeline which takes a raw image, converts it to grayscale, and then fits a linear model on the pixel space, and returns the most likely class according to the linear model [3].

7. WHY KEYSTONEML?
KeystoneML makes constructing even complicated machine learning pipelines easy. Here's an example text categorization pipeline which creates bigram features and creates a Naive Bayes model based on the 100,000 most common features [2].

Fig. 5. Code for Naive Bayes model

Parallelization of the pipeline fitting process is handled automatically and pipeline nodes are designed to scale horizontally. Once the pipeline has been defined you can apply it to test data and evaluate its effectiveness [3].

Fig. 6. Code for testing the data for effectiveness

The result of this code is as follows:

Fig. 7. Output for the above code

This relatively simple pipeline predicts the right document category over 80 percent of the time on the test set. KeystoneML works with much more than just text. KeystoneML is alpha software, in a very early public release (v0.2). The project is still very young, but it has reached a point where it is viable for general use.

8. LINKING
KeystoneML is available from Maven Central. It can be used in our applications by adding the following lines to the SBT project definition:
libraryDependencies += "edu.berkeley.cs.amplab"


9. BUILDING
KeystoneML is available on GitHub.
$ git clone https://github.com/amplab/keystone.git
Once downloaded, KeystoneML can be built using the following commands:
$ cd keystone
$ git checkout branch-v0.3
$ sbt/sbt assembly
$ make
You can then run example pipelines with the included bin/run-pipeline.sh script, or pass it as an argument to spark-submit.

10. RUNNING AN EXAMPLE
Once you've built KeystoneML, you can run many of the example pipelines locally. However, to run the larger examples, you'll want access to a Spark cluster. Here's an example of running a handwriting recognition pipeline on the popular MNIST dataset. This should be able to run on a single machine in under a minute.

Fig. 8. Example for running on a MNIST dataset

To run on a cluster, it is recommended to use spark-ec2 to launch a cluster and provision it with the correct versions of BLAS and native C libraries used by KeystoneML. More scripts have been provided to set up a well-configured cluster automatically in bin/pipelines-ec2.sh.

11. CONCLUSION
One of the main features of KeystoneML is the example pipelines and nodes it provides out of the box. These are designed to illustrate end-to-end real world pipelines in computer vision, speech recognition, and natural language processing. KeystoneML also provides several utilities for evaluating models once they've been trained, computing metrics like precision, recall, and accuracy on a test set. Metrics are currently calculated for Binary Classification, Multiclass Classification, and Multilabel Classification, with more on the way [3].

REFERENCES
[1] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen et al., "Mllib: Machine learning in apache spark," Journal of Machine Learning Research, vol. 17, no. 34, pp. 1-7, 2016.
[2] "KeystoneML." [Online]. Available: http://keystone-ml.org/programming_guide.html
[3] "KeystoneML." [Online]. Available: http://keystone-ml.org/

101 Review Article Spring 2017 - I524 1

Amazon Elastic Beanstalk

SHREE GOVIND MISHRA1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] project-000, April 30, 2017

Amazon Elastic Beanstalk is a service provided by Amazon Web Services which allows developers and engineers to deploy and run web applications in the cloud, in such a way that these applications are highly available and scalable. Elastic Beanstalk manages the deployed application by reducing the management complexities as it automatically handles the capacity provisioning, load balancing, scaling, and application health monitoring. Elastic Beanstalk also provisions one or more AWS resources, such as Amazon EC2 instances, when an application is deployed. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524, Amazon Web Services, AWS Elastic Beanstalk, Load Balancing, Cloud Watch, AWS Management Console, Auto Scaling

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IR-2021/report.pdf

1. INTRODUCTION
Cloud Computing can be defined as an abstraction of services from infrastructure (i.e. hardware), platforms, and applications (i.e. software) by virtualization of resources [1]. The different forms of cloud computing services include IaaS, PaaS, and SaaS, which stand for Infrastructure-as-a-Service, Platform-as-a-Service, and Software-as-a-Service respectively. Elastic Beanstalk is a PaaS offered by AWS that allows users to create applications and push them over the cloud, wherein it creates an environment where the features and services of AWS, such as Amazon EC2, Amazon RDS, etc., are used to maintain and scale the application without the need for continuous monitoring.
AWS Elastic Beanstalk is one of the many services provided by AWS, with its functionality to manage the infrastructure. Elastic Beanstalk provides quick deployment and management of the applications on the cloud as it automatically handles the details of capacity provisioning, load balancing, scaling, and application health monitoring. Elastic Beanstalk uses highly reliable and scalable services [2].

2. NEED FOR AWS ELASTIC BEANSTALK
Cloud Computing shifts the location of resources to the cloud to reduce the cost associated with over-provisioning, under-provisioning, and under-utilization [3]. When more than the required resources are available it is called over-provisioning. When the resources are not used adequately it is known as under-utilization. When the resources available are not enough it is called under-provisioning. With Elastic Beanstalk, users can prevent the under-provisioning, under-utilization and over-provisioning of computing resources, as it keeps a close check on how the workload for the application is varying with time. The information thus obtained helps in automatically scaling the application up and down by provisioning the adequate required resources to the application. Elastic Beanstalk also reduces the average response time for the application as it automatically provides the needed resources without manual monitoring and decision making. A user uses one or more of these three scaling techniques to manage and scale the application:

2.1. Manual Scaling in Cloud Environments
In traditional applications, scalability is achieved by predicting the peak loads, then purchasing, setting up and configuring the infrastructure that could handle this peak load [4]. With Manual Scaling, the resources are provisioned at deployment time and the application servers are added to the infrastructure manually, thus there is high latency.

2.2. Semi-automatic Scaling in Cloud Environments
In Semi-automatic Scaling the resources are provisioned dynamically (i.e. at runtime) and automatically (i.e. without user intervention) [1]. Thus, the mean time until the resources are provisioned is short. However, manual monitoring of resources is still necessary. Since resources are provisioned by request, the problem of the unavailability of an application at peak loads is not completely eliminated.

2.3. Automatic Scaling in Cloud Environments
Elastic Beanstalk uses Automatic Scaling, which overcomes the drawbacks of both Manual Scaling and Semi-automatic Scaling by allowing the users to closely follow the workload curve of their application, by provisioning the resources on demand. The user owns the choice that the number of resources these applications are using increases automatically during the time when the demand for resources is high to handle the peak load, and also automatically decreases when minimal resources are needed, to minimize the cost so that the user only pays for what they used. Automatic Scaling also predicts the peak load that the applications may require in the future and provisions these required resources in advance, proving the elasticity of the cloud [1].
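Sections 3.3 and 3.4 below describe how scaling is driven by CloudWatch metrics, thresholds and triggers. As a hedged sketch of that idea using the AWS SDK for Python (boto3), the snippet defines a CPU-utilization alarm; the alarm name, Auto Scaling group, threshold and policy ARN are placeholders, and Elastic Beanstalk normally creates equivalent triggers on the user's behalf.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder ARN of an Auto Scaling policy that adds instances when triggered.
scale_out_policy_arn = "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:example"

cloudwatch.put_metric_alarm(
    AlarmName="high-cpu-example",              # hypothetical name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-eb-asg"}],
    Statistic="Average",
    Period=300,                                 # evaluate 5-minute averages
    EvaluationPeriods=2,
    Threshold=70.0,                             # fire when average CPU exceeds 70%
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[scale_out_policy_arn],
)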

3. ELASTIC BEANSTALK SERVICES
Elastic Beanstalk supports applications developed in Java, PHP, .NET, Node.js, Python, and Ruby, and also in different container types for each language. A container defines the infrastructure and software stack to be used for a given environment. When an application is deployed, Elastic Beanstalk provisions one or more AWS resources, such as Amazon EC2 instances [5]. The software stack that runs on the instances depends on the container type; for example, two container types are supported for AWS Elastic Beanstalk Node.js: a 32-bit Amazon Linux Image and a 64-bit Amazon Linux Image, where each of them runs a software stack tailored to the hosted Node.js application. Amazon Elastic Beanstalk can be interacted with using the AWS Management Console, the AWS Command Line Interface (AWS CLI), or a high-level CLI designed for Elastic Beanstalk [5].
The following web services and features of AWS are used by Amazon Elastic Beanstalk for the Automatic Scaling of applications:

3.1. AWS Management Console
The AWS Management Console is a browser-based graphical user interface (GUI) for Amazon Web Services. It allows users to configure an automatic scaling mechanism of AWS Elastic Beanstalk as well as other services of AWS. From the Management Console, the user can decide how many instances the application requires and when the application must be scaled up and down [1].

3.2. Elastic Load Balancing
Amazon Elastic Beanstalk uses the Elastic Load Balancing service, which enables the load balancer to automatically distribute the incoming application traffic across all running instances in the auto-scaling group based on metrics like request count and latency tracked by Amazon CloudWatch. If an instance is terminated, the load balancer will not route requests to this instance anymore. Rather, it will distribute the requests across the remaining instances [6].

3.3. Auto Scaling
Auto Scaling automatically launches and terminates instances based on metrics like CPU and RAM utilization of the application that are tracked by Amazon CloudWatch, and thresholds called triggers. Whenever a metric crosses a threshold, a trigger is fired to initiate automatic scaling. For example, a new instance will be launched and registered at the load balancer if the average CPU utilization of all running instances exceeds an upper threshold.

3.4. Amazon CloudWatch
Amazon CloudWatch enables the application to monitor, manage and publish various metrics. It also allows configuring alarms based on the data obtained from the metrics to make operational and business decisions. Elastic Beanstalk automatically monitors and scales the application using CloudWatch.

4. AMAZON ELASTIC BEANSTALK DESCRIPTION
An Elastic Beanstalk Application is a collection of Elastic Beanstalk components which include the application versions, environments, and the environment configurations. An application version is a deployable code for the web application, and it also implements the deployable code via Amazon Simple Storage Service (Amazon S3), which contains the code. An application may have many different application versions [7].
An environment is a version that is deployed onto AWS resources. At the time of environment creation, Elastic Beanstalk provisions the resources needed to run the application version specified, where the environments include an environment tier, platform, and environment type. The environment tier chosen determines whether Elastic Beanstalk provisions resources to support a web application that handles HTTP(S) requests or a web application that handles background processing tasks. AWS resources created for an environment include one elastic load balancer, an Auto Scaling group, and one or more Amazon EC2 instances [8].
The software stack that runs on the Amazon EC2 instances is dependent on the container type, where a container type defines the infrastructure topology and software stack to be used for that environment. For example, an Elastic Beanstalk environment with an Apache Tomcat container uses the Amazon Linux operating system, Apache web server, and Apache Tomcat software. In addition, a software component called the host manager (HM) runs on each Amazon EC2 server instance. The host manager reports metrics, errors and events, and server instance status, which are available via the AWS Management Console, APIs, and CLIs [8]. The important aspects of Elastic Beanstalk such as security, management version updates, and database and storage are discussed below:

4.1. Security
The application on the Elastic Beanstalk cloud is available publicly at myapp.elasticbeanstalk.com for anyone to access. The user can control what other incoming traffic, such as SSH, is delivered or not to the application servers by changing the EC2 security group settings [9]. The IAM (Identity and Access Management) allows the user to manage users and groups in a centralized manner, such that the user can control which IAM users have access to AWS Elastic Beanstalk, and limit permissions to read-only access to Elastic Beanstalk for operators who should not be able to perform actions against Elastic Beanstalk resources [9].

4.2. Management Version Updates
The user can opt to update the AWS Elastic Beanstalk environment automatically to the latest version of the underlying platform running the application during a specified maintenance window [9]. The managed platform updates use an immutable deployment mechanism to perform these updates, where the applications will be available during the maintenance window and consumers will not be impacted by the update.

4.3. Database and Storage
AWS Elastic Beanstalk stores the application files, deployable codes, and server log files in Amazon S3. If the user is using the AWS Management Console, the AWS Toolkit for Visual Studio, or the AWS Toolkit for Eclipse, an Amazon S3 bucket will be created in the user's account, and the files uploaded by the user will be automatically copied from the local client to Amazon S3 [9]. AWS Elastic Beanstalk does not restrict the user to any particular data persistence technology. The user can choose to use Amazon Relational Database Service (Amazon RDS) or Amazon DynamoDB, or use Microsoft SQL Server, Oracle, or other relational databases running on Amazon EC2. The user can configure AWS Elastic Beanstalk environments to use different databases by specifying the connection information in the environment configuration. The connection string will be extracted from the application code and thus Elastic Beanstalk can configure different databases to work together [9].
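To make the application, application version and environment concepts of Section 4 concrete, here is a hedged sketch using the AWS SDK for Python (boto3). The application name, S3 location of the deployable code bundle and solution stack string are placeholders; the stack names available to an account can be listed with list_available_solution_stacks().

import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")

app = "my-sample-app"            # hypothetical application name
eb.create_application(ApplicationName=app)

# Register a code bundle already uploaded to S3 as an application version.
eb.create_application_version(
    ApplicationName=app,
    VersionLabel="v1",
    SourceBundle={"S3Bucket": "my-bucket", "S3Key": "app-v1.zip"},
)

# Create an environment; Elastic Beanstalk then provisions the load balancer,
# Auto Scaling group and EC2 instances described above.
eb.create_environment(
    ApplicationName=app,
    EnvironmentName="my-sample-env",
    VersionLabel="v1",
    SolutionStackName="64bit Amazon Linux ... running Python",  # placeholder stack name
)

print(eb.describe_environments(ApplicationName=app)["Environments"])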

5. CONCLUSION
"There is an observation that in many companies the average utilization of application servers ranges from 5 to 20 percent, meaning that many resources like CPU and RAM are idle at peak times" [10]. Thus, Amazon Elastic Beanstalk helps to provide automatic scaling of the application, which efficiently utilizes the expensive computing resources and makes the application able to manage the peak load at all times [9].
Other platforms like Google App Engine also let users have their applications automatically scale both up and down according to demand, but with even more restrictions on how users should develop their applications [11]. Whereas, with AWS Elastic Beanstalk there is more control with the user, as the user can manage all the elements of the infrastructure or choose to go with Automatic Scaling.

REFERENCES
[1] D. Bellenger, J. Bertram, A. Budina, A. Koschel, B. Pfänder, C. Serowy, I. Astrova, S. G. Grivas, and M. Schaaf, "Scaling in cloud environments," ACM Digital Library, pp. 145-150, 2011. [Online]. Available: http://www.wseas.us/e-library/conferences/2011/Corfu/COMPUTERS/COMPUTERS-23.pdf
[2] Amazon Web Services, "What is AWS Elastic Beanstalk," Web page, 2017, online; accessed 19-Mar-2017. [Online]. Available: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html
[3] S. Tai, J. Nimis, C. Baun, and M. Kunze, Cloud Computing: Web-Based Dynamic IT Services, 1st ed. Springer Publishing Company, Incorporated, 2011. [Online]. Available: http://www.springer.com/us/book/9783642209161
[4] J. Yang, J. Qiu, and Y. Li, "A profile-based approach to just-in-time scalability for cloud applications," IEEE Computer Society, Washington, DC, USA, 2009, pp. 9-16.
[5] J. Vlie, F. Paganelli, S. van Wel, and D. Dowd, Elastic Beanstalk, 1st ed. O'Reilly Media, Inc., 2011, online; accessed 18-Mar-2017.
[6] Amazon Web Services, "Elastic load balancing," Web page, online; accessed 20-Mar-2017. [Online]. Available: https://aws.amazon.com/elasticloadbalancing/
[7] Amazon Web Services, "Elastic beanstalk components," Web page, online; accessed 2-Apr-2017. [Online]. Available: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/concepts.components.html
[8] Amazon Web Services, "Elastic beanstalk architecture," Web page, online; accessed 30-Mar-2017. [Online]. Available: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/concepts.concepts.architecture.html
[9] Amazon Web Services, "Frequently asked questions," Web page, online; accessed 2-Apr-2017. [Online]. Available: https://aws.amazon.com/elasticbeanstalk/faqs/
[10] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A view of cloud computing," Communications of the ACM, vol. 53, no. 4, pp. 50-58, 2010. [Online]. Available: http://cacm.acm.org/magazines/2010/4/81493-a-view-of-cloud-computing/fulltext
[11] Google Inc., "Google Cloud Platform: Google App Engine," Web page, online; accessed 18-Mar-2017. [Online]. Available: https://cloud.google.com/appengine/

104 Review Article Spring 2017 - I524 1

ASKALON

ABHISHEK NAIK1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] project-002, April 30, 2017

ASKALON is a Grid application development and computing environment which aims to provide a Grid to the application developers in an invisible format. This will simplify the development and execution of various workflow applications on the Grid. This will not only allow a transparent Grid access but also will allow the high-level composition of workflow applications. ASKALON basically makes use of five services: Resource Manager, Scheduler, Execution Engine, Performance Analysis and Performance Prediction. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524, ASKALON

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2022/report.pdf

1. INTRODUCTION
A Grid is basically a combination of hardware and software that provides seamless, reliable, pervasive, as well as high-end computation abilities [1]. Grid-based development methodologies are used for application development, and they emphasize providing the application developer with a transparent (i.e., invisible) Grid; ASKALON provides such an invisible Grid to application developers. When using ASKALON, Grid workflow applications are composed using Unified Modeling Language (UML) based services [1]; similarly, XML can also be used. Besides this, ASKALON enables the integration of workflows that have been created programmatically using languages such as XML, and it typically requires some middleware for this.

Fig. 1. ASKALON Architecture [1]

2. DESCRIPTION

When using ASKALON, users create Grid workflow applications at a higher level of abstraction using the Abstract Grid Workflow Language (AGWL), which is based on XML. This representation of the workflow is later passed on to the middleware layer for scheduling and execution. Figure 1 shows the typical architecture of ASKALON. As shown in the figure, the major service components are the Resource Manager, which is used for the management of the various resources; the Scheduler, which schedules the various workflow applications onto the Grid; the Execution Engine, which provides reliable and fault-tolerant execution; the Performance Analyser, which analyses performance and detects bottlenecks; and the Performance Predictor, which estimates execution times. The following paragraphs provide a brief description of these services.

2.1. Resource Manager
The Resource Manager service is responsible for the management of resources, such as procurement, allocation, reservation and automatic deployment. It usually works hand in hand with the AGWL. In addition to this management of resources, the main task of the Resource Manager is to abstract users from low-level Grid middleware technology.

2.2. Scheduler
The Scheduler service is responsible for the scheduling (or mapping) of the various workflow applications onto the Grid using optimization algorithms and graph-based heuristics. In addition, it monitors the dynamism of the Grid infrastructure through an execution contract and adjusts the optimised static schedules accordingly; the Scheduler thus provides Quality of Service (QoS) dynamically. It usually uses one of three algorithms: Heterogeneous Earliest Finish Time (HEFT), Genetic or Myopic. The HEFT algorithm schedules the various workflows by creating an ordered list of tasks and then mapping the tasks onto the resources in the most efficient way. Genetic algorithms are inspired by Darwin's theory of evolution and work using its principles, while the Myopic algorithm is basically a just-in-time approach, since it makes greedy decisions that are best in the current scenario.

2.3. Execution Engine
The Execution Engine service, also known as the Enactment Engine service, performs tasks such as checkpointing, retry, restart, migration and replication. It thus aims to provide reliable and fault-tolerant execution of the workflows.

2.4. Performance Analyser
The Performance Analyser service provides support for automatic instrumentation and bottleneck detection within Grid workflow executions. For example, excessive synchronization, load imbalance or non-scalability of the resources might result in a bottleneck, and it is the responsibility of the Performance Analyser to detect and report it.

2.5. Performance Prediction
The Performance Prediction service predicts performance, i.e., it estimates the execution time of the workflow activities. It uses a training phase and statistical methods for this.

3. WORKFLOW GENERATION
ASKALON basically offers two interfaces: a graphical model based on UML and a programmatic XML-based model [1]. The main aim of both interfaces is to generate large-scale workflows in a compact as well as intuitive form.

3.1. UML-modeling based
ASKALON allows the creation of workflows via a modeling service similar to UML diagrams. This service combines Activity Diagrams and works in a hierarchical fashion. It can be implemented in a platform-independent way using the Model-View-Controller (MVC) paradigm and can be shown to contain three parts: a Graphical User Interface (GUI), a model traverser and a model checker. The GUI in turn comprises the menu, toolbar, drawing space, model tree and the properties of the elements; the drawing space can contain several diagrams. The model traverser, as the name suggests, provides a way to move throughout the model, visiting each element and accessing its properties. The model checker, on the other hand, is responsible for the correctness of the model.

3.2. XML-based Abstract Grid Workflow Language
The Abstract Grid Workflow Language enables the combination of various atomic units of work called activities. These activities are interconnected through control-flow and data-flow dependencies. Activities can in turn be of two levels: activity types and activity deployments. An activity type describes the semantics or function of an activity, whereas the activity deployment points to a deployed Web Service or executable. AGWL is not bound to any implementation technologies such as Web Services. The AGWL representation of a typical workflow can be generated in two ways: automatically via the UML representation or manually by the user. In both cases, AGWL serves as the input to the ASKALON runtime middleware services.

4. COMPARISON WITH OTHER COUNTERPARTS
A few experiments have been carried out using the seven Grid clusters of the Austrian Grid [2] and a group of 116 CPUs. Figure 2 represents the performance of the individual clusters, wherein each cluster aggregates the execution time of all the workflows executed on a single CPU. As can be inferred from the figure, the fastest cluster is around three times faster than the slowest one.

Fig. 2. Performance of Austrian Grid clusters [2]

DAGMan [3] is a scheduler used for Condor jobs that have been organized in the form of a Directed Acyclic Graph (DAG). Its scheduling does not use any specialized optimization techniques and is done simply using matchmaking. Fault tolerance is handled using a rescue DAG that is automatically generated whenever some activity fails. As against this, ASKALON checkpointing also covers the case in which the Execution Engine itself fails. Thus, the checkpointing provided by ASKALON is more robust than that of counterparts like DAGMan.
ASKALON also differs in many respects from other projects like Gridbus [4] and UNICORE [5]. In ASKALON, the AGWL allows a scalable specification of many parallel activities by using compact parallel loops. The Enactment Engine also enables the handling of very large data collections that are generated by large-scale control and data flows. The HEFT and genetic search algorithms that the ASKALON scheduler implements are not used by the other projects mentioned above. The Enactment Engine also provides checkpointing of workflows at two levels for restoring and resuming, in case the engine itself fails. Finally, in ASKALON a lot of emphasis is laid on the clear separation between the Scheduler and the Resource Manager: it thus proposes a novel architecture in terms of physical and logical resources and provides brokerage, reservations and activity-type-to-deployment mappings.

5. ASKALON IN BIG DATA PROJECTS
ASKALON has been used in the design of meteorological simulations in the Cloud [6]. In order to deploy the application on a distributed Grid infrastructure, the simulation code was split into a workflow called RainCloud, which was represented using ASKALON. Figure 3 shows the graphical representation of this workflow.

Fig. 3. Graphical representation of the Meteorological workflow [6]

This workflow was flexible in the sense that it could be easily extended to suit the needs of other projects. For example, the workflow setup was extended for an investigation of precipitation/snow accumulation on the Kongsvegen glacier on Svalbard, Norway, as part of the SvalClac project [6].
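To make the HEFT-based scheduling described in Section 2.2 more concrete, the following Python sketch illustrates the general idea of HEFT-style list scheduling: the activities of a small workflow DAG are ranked by the length of their downstream path and then mapped onto the resource that yields the earliest finish time. The task names, runtimes and resource speeds are invented for illustration; this is a conceptual sketch, not ASKALON's implementation.

# Hypothetical HEFT-style list scheduling sketch (illustrative only).
workflow = {            # activity -> activities it depends on
    "simulate": [],
    "postprocess": ["simulate"],
    "visualize": ["postprocess"],
    "archive": ["postprocess"],
}
runtime = {"simulate": 30, "postprocess": 10, "visualize": 5, "archive": 3}
resources = {"cluster-a": 1.0, "cluster-b": 0.5}   # invented relative speeds

def upward_rank(task):
    """Length of the longest path from this task to an exit task."""
    children = [t for t, deps in workflow.items() if task in deps]
    return runtime[task] + max((upward_rank(c) for c in children), default=0)

def schedule():
    ready_time = {r: 0.0 for r in resources}   # when each resource is free
    finish = {}                                # task -> finish time
    plan = []
    for task in sorted(workflow, key=upward_rank, reverse=True):
        deps_done = max((finish[d] for d in workflow[task]), default=0.0)
        # map the task onto the resource giving the earliest finish time
        best = min(resources, key=lambda r: max(ready_time[r], deps_done)
                   + runtime[task] / resources[r])
        start = max(ready_time[best], deps_done)
        finish[task] = start + runtime[task] / resources[best]
        ready_time[best] = finish[task]
        plan.append((task, best, start, finish[task]))
    return plan

if __name__ == "__main__":
    for task, res, start, end in schedule():
        print(f"{task:12s} on {res:9s} {start:6.1f} -> {end:6.1f}")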

6. CONCLUSION
ASKALON supports workflow integration using UML and also provides an XML-based programming interface. This effectively abstracts the end user from the low-level middleware technologies, which are the ones actually in charge of scheduling and executing ASKALON applications. The Resource Manager handles the logical and physical resources and the workflow activities to provide features such as authorization, Grid resource discovery, selection, allocation and interaction. The Scheduler makes use of algorithms such as HEFT or genetic algorithms, which deliver high performance, and it benefits greatly from the Performance Prediction service, which in turn depends on a training phase and statistical methods. The Execution Engine handles data dependencies and also supports working on high-volume data, something that is highly useful in Big Data related applications and projects. The Performance Analyser analyses the performance benchmarks. Typically, the overheads of the ASKALON middleware services, which consist of the Resource Manager, Scheduler and Performance Prediction, are roughly constant, thereby requiring less execution time overall.
Thus, to conclude, we focused on ASKALON as a Grid application development environment. We also saw the various architectural components of ASKALON as well as the comparisons amongst the different Grid clusters that were used. Finally, we saw some Big Data use cases wherein ASKALON was used, and the flexibility with which a workflow setup using ASKALON was extended to support different projects.

REFERENCES
[1] I. Taylor, E. Deelman, D. Gannon, and M. Shields, Workflows for e-Science. Springer, 2006, pp. 462–471.
[2] “Austrian-grid-project,” Web Page, accessed: 2017-3-20. [Online]. Available: http://austriangrid.at
[3] “Dagman-manager,” Web Page, accessed: 2017-3-19. [Online]. Available: http://www.cs.wisc.edu/condor/dagman
[4] R. Buyya and S. Venugopal, “Gridbus technologies,” in The Gridbus toolkit for Service Oriented Grid and Utility Computing, 2004. [Online]. Available: http://www.cloudbus.org/papers/gridbus2004.pdf
[5] “Unicore technologies,” Web Page, accessed: 2017-3-20. [Online]. Available: http://www.springer.com/gp/computer-science/lncs
[6] G. Morar, F. Schuller, S. Ostermann, R. Prodan, and G. Mayr, “Meteorological simulations in the cloud with the askalon environment,” in Euro-Par 2012 Parallel Processing Workshops, vol. 7640, 2012. [Online]. Available: https://link.springer.com/chapter/10.1007/978-3-642-36949-0_9?CFID=916124568&CFTOKEN=29254584

107 Review Article Spring 2017 - I524 1

Memcached

RONAK PAREKH1 AND GREGOR VON LASZEWSKI2

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. project-002, April 30, 2017

Memcached is a free and open source, high-performance, object caching system. It allows a system to make better use of its memory: it uses data caching to speed up dynamic database-driven websites, which reduces the number of times an external data source is accessed by the application. The Memcached service is mainly offered through an API. It allows memory to be taken from parts of the system that have more than they need and allocated to parts of the system that have less. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2024/report.pdf

1. INTRODUCTION
Memcached is a high-performance, distributed memory object caching system, generic in nature, but originally intended for use in speeding up dynamic web applications by alleviating database load [1]. Memcached allows you to take memory from parts of your system where you have more than you need and make it accessible to areas where you have less than you need. It is free and open-source software, licensed under the Revised BSD license [2]. Memcached runs on Unix-like operating systems (at least Linux and OS X) and on Microsoft Windows. It depends on the libevent library.

2. COMPARISON OF MEMCACHED AND REDIS
• Server: The Memcached server is multi-threaded whereas Redis is single-threaded. As a result, Redis [3] can't effectively harness multiple cores.
• Distributability: Memcached is very easy to distribute since it doesn't support any rich data structures. Redis, on the other hand, is much harder to distribute without changing the semantics and performance guarantees of some of its commands.
• Ease of Installation: Comparing ease of installation, Redis is much easier, with no dependencies required.
• Memory Usage: For simple key-value pairs Memcached is more memory efficient than Redis. If you use Redis hashes, then Redis is more memory efficient.
• Persistence: If you are using Memcached then data is lost with a restart, and rebuilding the cache is a costly process. On the other hand, Redis can handle persistent data. By default, Redis syncs data to the disk at least every 2 seconds.
• Replication: Memcached does not support replication, whereas Redis supports master-slave replication. It allows slave Redis servers to be exact copies of master servers; data from any Redis server can replicate to any number of slaves [4].
• Read/Write Speed: Memcached is very good at handling high-traffic websites. It can read lots of information at a time and give it back with a great response time. Redis can also handle high traffic on reads, but can handle heavy writes as well.
• Key Length: The Memcached key length has a maximum of 250 bytes, whereas the Redis key length has a maximum of 2 GB. When advanced data structures or disk-backed persistence are important, Redis is used.

3. ARCHITECTURE OF MEMCACHED
Memcached implements a simple and lightweight key-value interface where all key-value tuples are stored in and served from DRAM. The following commands are used by clients to communicate with the Memcached servers:
• SET/ADD/REPLACE(key, value): add a (key, value) object to the cache.
• GET(key): retrieve the value associated with a key.
• DELETE(key): delete a key.
Internally, Memcached uses a hash table to index the key-value entries. These entries are also kept in a linked list sorted by their most recent access time. The least recently used (LRU) entry is evicted and replaced by a newly inserted entry when the cache is full [5].
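As a minimal illustration of the GET/SET/DELETE interface listed above, the following Python sketch applies the cache-aside pattern using the third-party pymemcache client; it assumes a Memcached server running on localhost:11211, and load_user_from_db is a hypothetical stand-in for the external data source the article mentions.

# Cache-aside sketch against a local Memcached server (assumptions noted above).
import json
from pymemcache.client.base import Client

client = Client(("localhost", 11211))

def load_user_from_db(user_id):
    # Hypothetical placeholder for the expensive external data source.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = client.get(key)            # GET(key)
    if cached is not None:
        return json.loads(cached)       # cache hit: no database round trip
    user = load_user_from_db(user_id)   # cache miss: query the data source
    client.set(key, json.dumps(user), expire=300)  # SET(key, value) with a TTL
    return user

def delete_user(user_id):
    client.delete(f"user:{user_id}")    # DELETE(key) invalidates the entry

if __name__ == "__main__":
    print(get_user(42))   # first call populates the cache
    print(get_user(42))   # second call is served from Memcached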


Hash Table: To look up keys quickly, the location of each key-value entry is stored in a hash table. Hash collisions are resolved by chaining: if more than one key maps into the same hash table bucket, they form a linked list. Chaining is efficient for inserting or deleting single keys; however, lookup may require scanning the entire chain.
Memory Allocation: Naive memory allocation could result in significant memory fragmentation. To address this problem, Memcached uses slab-based memory allocation. Memory is divided into 1 MB pages, and each page is further sub-divided into fixed-length chunks. Key-value objects are stored in an appropriately sized chunk. The size of a chunk, and thus the number of chunks per page, depends on the particular slab class. For example, by default the chunk size of slab class 1 is 72 bytes and each page of this class has 14563 chunks, while the chunk size of slab class 43 is 1 MB and thus there is only 1 chunk spanning the whole page. To insert a new key, Memcached looks up the slab class whose chunk size best fits this key-value object [5]. If a vacant chunk is available, it is assigned to this item; if the search fails, Memcached will execute cache eviction.
Cache policy: In Memcached, each slab class maintains its own objects in an LRU queue (see Figure 1). Each access to an object causes that object to move to the head of the queue. Thus, when Memcached needs to evict an object from the cache, it can find the least recently used object at the tail. The queue is implemented as a doubly-linked list, so each object has two pointers.
Threading: Memcached was originally single-threaded. It uses libevent for asynchronous network I/O callbacks. Later versions support multi-threading but use global locks to protect the core data structures. As a result, operations such as index lookup/update and cache eviction/update are all serialized [5].
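The following Python sketch is an illustrative model, not Memcached's actual C implementation, of the per-slab-class LRU queue and best-fit slab-class selection described above; the chunk sizes and capacities are invented for the example.

# Illustrative per-slab-class LRU behaviour (conceptual sketch only).
from collections import OrderedDict

class SlabClassLRU:
    def __init__(self, chunk_size, max_chunks):
        self.chunk_size = chunk_size      # fixed chunk size for this slab class
        self.max_chunks = max_chunks      # chunks available in the class
        self.items = OrderedDict()        # key -> value, ordered by recency

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)       # move to most-recently-used position
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        elif len(self.items) >= self.max_chunks:
            self.items.popitem(last=False)  # evict the least recently used entry
        self.items[key] = value

def pick_slab_class(item_size, chunk_sizes=(72, 96, 120, 152, 192)):
    """Return the smallest illustrative chunk size that fits the item."""
    for size in chunk_sizes:
        if item_size <= size:
            return size
    return None  # larger than every class in this toy example

if __name__ == "__main__":
    lru = SlabClassLRU(chunk_size=72, max_chunks=2)
    lru.set("a", "1"); lru.set("b", "2")
    lru.get("a")                  # "a" becomes most recently used
    lru.set("c", "3")             # evicts "b", the LRU entry
    print(list(lru.items), pick_slab_class(80))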

4. WORKING OF MEMCACHED
In the Memcached system, each item comprises a key, an expiration time, optional flags, and raw data. When an item is requested, Memcached checks the expiration time to see if the item is still valid before returning it to the client. The cache can be seamlessly integrated with the application by ensuring that the cache is updated at the same time as the database. By default, Memcached acts as a Least Recently Used cache plus expiration timeouts. If the server runs out of memory, it looks for expired items to replace. If additional memory is needed after replacing all the expired items, Memcached replaces items that have not been requested for a certain length of time (the expiration timeout period or longer), keeping more recently requested information in memory.
Working of the Memcached client [6]: First, a Memcached client object is created, and the application calls its methods to get and set cache entries. When an object is added to the cache, the Memcached client takes that object, serializes it, and sends a byte array to the Memcached server for storage. At that point, the cached object might be garbage collected from the JVM where the application is running. When the cached object is needed, the client's get() method is called. The client takes the get request, serializes it, and sends it to the Memcached server. The Memcached server uses the request to look up the object in the cache. Once it has the object, it returns the byte array to the Memcached client. The client object then takes the byte array, deserializes it to recreate the object, and returns it to the application. Even if the application runs on more than one server, all of them can point to the same Memcached server and use it for getting and setting cache entries. The Memcached client is configured in such a way that it knows all the available Memcached servers.

5. SECURITY
Memcached is mostly used for deployments within a trusted network. However, sometimes Memcached is deployed in untrusted networks and environments where administrators exercise control over the clients that are connected. SASL authentication support must be compiled into Memcached and requires the binary protocol. For simplicity, all Memcached operations are treated equally: clients with a valid need for access to low-security entries within the cache gain access to all entries within the cache. If a cache key can be predicted, guessed or found by exhaustive searching, its cache entry may be retrieved [7].

6. NEW FEATURES
Chained items were introduced in 1.4.29. With -o modern, items sized 512k or higher are created by chaining 512k chunks together. This made increasing the max item size (-I) more efficient in many scenarios, as the slab classes no longer have to be stretched to cover the full space. There was still an efficiency hole for items of roughly 512k to 5 MB, where the overhead is too big for the size of the items. This change fixes it by using chunks from other slab classes in order to "cap" off chained items. With this change, larger items should be more efficient than with the original slab allocator in all cases. Chunked items are only used with -o modern or by explicitly changing -o slab-chunk-max. It is not recommended to set slab-chunk-max smaller than 256k at this time.

7. USES OF MEMCACHED
Memcached is used for scaling large websites, and Facebook is one of its largest users [8]. It is used to alleviate database load. Facebook uses more than 800 servers supplying over 28 terabytes of memory to its users. As Facebook's popularity skyrocketed, it ran into a number of scaling issues, and this ever-increasing demand required Facebook to make modifications to both its operating system and Memcached to achieve the performance that provided the best possible experience for its users. There were thousands of computers, each running hundreds of Apache processes, which ultimately led to thousands of TCP connections open to the Memcached processes. Memcached uses a per-connection buffer to read and write data out over the network, so with hundreds of thousands of connections this memory adds up to gigabytes, memory that could be better used to store user data. Thus, Memcached came into consideration while scaling large websites.

8. CONCLUSION
Memcached is used when large websites need to be scaled. By alleviating the database load, it makes maximum use of its caching technique to find hits before the database is queried. This results in fewer round trips to the external data source or database when it comes to speeding up query results. Memcached's main advantage is its distributability when it comes to setting up a distributed cache, and it gives superior performance when multi-threaded processes are in consideration.


REFERENCES [1] Dormando, “What is memcached?” Web Page, Mar. 2015, accessed 2017-03-21. [Online]. Available: https://memcached.org/ [2] Wikipedia, “Berkeley software distribution,” Web Page, Mar. 2017, accessed 2017-03-22. [Online]. Available: https://en.wikipedia.org/wiki/ Berkeley_Software_Distribution [3] Wikipedia, “Redis,” Web Page, Mar. 2017, accessed 2017-03-22. [Online]. Available: https://en.wikipedia.org/wiki/Redis [4] Squarespace Inc., “Memcached vs redis,” Web Page, Mar. 2014, accessed 2017-03-25. [Online]. Available: http://www.openldap.org/lists/ openldap-software/200010/msg00097.html [5] S. M., “Memcached vs redis,” Blog, Mar. 2014, accessed: 23- Mar-2017. [Online]. Available: http://blog.andolasoft.com/2014/02/ memcached-vs-redis-which-one-to-pick-for-large-web-app.html [6] Sunil Patil, “Use memcached for java enterprise performance,” Web Page, Mar. 2012, accessed 2017-03-24. [Online]. Avail- able: http://www.javaworld.com/article/2078565/open-source-tools/ open-source-tools-use-memcached-for-java-enterprise-performance-part-1-architecture-and-setup. html [7] Wikipedia, “Memcached,” Web Page, Feb. 2017, online; accessed 23- Mar-2017. [Online]. Available: https://en.wikipedia.org/wiki/Memcached [8] Facebook, “Scaling memcached at facebook,” Web Page, Mar. 2015, accessed 2017-03-22. [Online]. Available: https://www.facebook.com/ note.php?note_id=39391378919&ref=mf

110 Review Article Spring 2017 - I524 1

Naiad

RAHUL RAGHATATE1,* AND SNEHAL CHEMBURKAR1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected], [email protected] paper-2, April 30, 2017

Naiad is a distributed system based on a computational model called timely dataflow, developed for the execution of data-parallel, cyclic dataflow programs. It provides an in-memory distributed dataflow framework which exposes control over data partitioning and enables features like the high throughput of batch processors, the low latency of stream processors, and the ability to perform iterative and incremental computations. These features allow the efficient implementation of many dataflow patterns, from bulk and streaming computation to iterative graph processing and machine learning. This paper explains the Naiad framework, its abstractions, and the reasoning behind it. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Naiad, iterative, dataflow, graph, stream

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2026/report.pdf

1. INTRODUCTION
The abundance of distributed dataflow systems in the big data ecosystem has resulted in achieving many goals like high throughput, incremental updating, low-latency stream processing, and iterative computation. An iterative computation can be thought of as some function being looped on its output iteratively until there are no changes between the input and the output, i.e., the function converges to a fixed point. In incremental processing, on the other hand, we start with an initial input A0 and produce some output B0. At some later point, a change δA1 to the original input A0 leads to a new input A1 = A0 + δA1. The incremental model produces an incremental update to the output, so δB1 = F(δA1) and B1 = δB1 + B0. The next state computation in the incremental model is based only on the previous state (i.e., A0 and B0) [1].
Many data processing tasks require low-latency interactive access to results, iterative sub-computations, and consistent intermediate outputs so that sub-computations can be nested and composed. Most data-parallel systems support either iterative or incremental computations with the dataflow model, but not both. Frameworks like the descendants of MapReduce [2, 3], stream processing systems [4, 5], and materialized view-maintenance engines [6] provide efficient support for incremental input but do not support iterative processing. On the other hand, iterative data-parallel frameworks like Datalog [7], recursive SQL databases [8], and systems adding iteration over a MapReduce-like model [9–12] provide a constrained programming model (e.g. stratified negation in Datalog), and none supports incremental input processing [13]. With the goal of finding a common low-level abstraction and designing a general-purpose system which resolves the above mentioned issues in computation workloads, Naiad, a timely dataflow system, was developed at Microsoft by Derek G. Murray et al.
Naiad provides a distributed platform for executing data-parallel cyclic dataflow programs. It offers high-throughput batch processing, low-latency stream processing and the ability to perform iterative and incremental computations, all in one framework [14]. In Naiad, the underlying timely dataflow computation framework provides a mechanism for iteration and incremental updates. Moreover, the high-level language support provides a structured means of expressing these computations. This framework can be utilized to build many expressive programming models on top of Naiad's low-level primitives, enabling such diverse tasks as streaming data analysis, iterative machine learning, and interactive graph mining [13].

2. ARCHITECTURE
The Naiad architecture consists of two main components: (1) incremental processing of incoming updates and (2) low-latency real-time querying of the application state. Figure 1 shows a Naiad application that supports real-time queries on continually updated data. The dashed rectangle represents iterative processing that incrementally updates as new data arrive [15].

Fig. 1. Naiad application that supports real-time queries on continually updated data [15].

As Figure 1 shows, the query path is clearly separated from the update path. This results in queries being processed separately, on a slightly stale version of the current application state, so the query path does not get blocked or incur delays due to the update processing. This also resolves complex situations: if queries shared the same path with updates, the queries could be accessing partially processed, incomplete or inconsistent states, which would have to be handled separately. Even though hybrid approaches that assemble the application in Figure 1 by combining multiple existing systems have been widely deployed, applications built on a single platform as in Figure 1 are typically more efficient, succinct, and maintainable [15].
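The incremental model sketched above, where B1 = B0 + F(δA1), can be illustrated with a toy Python example that uses word counting as the function F; this is purely conceptual and is not Naiad code.

# Toy illustration of incremental processing: apply F only to the delta.
from collections import Counter

def F(records):
    """Full computation over a batch of records (word count)."""
    counts = Counter()
    for line in records:
        counts.update(line.split())
    return counts

A0 = ["a b a", "c b"]        # initial input
B0 = F(A0)                   # initial output

dA1 = ["b c c"]              # a small change (delta) to the input
dB1 = F(dA1)                 # incremental update computed from the delta only
B1 = B0 + dB1                # new output: previous state combined with the update

assert B1 == F(A0 + dA1)     # matches a full recomputation on A1 = A0 + dA1
print(B1)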


3. TIMELY DATAFLOW
Applications should produce consistent results, and consistency requires coordination, both across dataflow nodes and around loops [15]. Timely dataflow is a computational model that attaches virtual timestamps to events in structured cyclic dataflow graphs, providing a coordination mechanism that allows low-latency asynchronous message processing while efficiently tracking global progress and synchronizing only where necessary to enforce consistency. The Naiad model simply checkpoints its state periodically, restoring the entire system state to the most recent checkpoint on failure. Even though this is not a sophisticated design, it was chosen in part for its low overhead: faster common-case processing allows more computation to take place in the intervals between checkpoints, and thus often decreases the total time to job completion [16].

Fig. 2. A simple timely dataflow graph [15].

Consider the simple timely dataflow graph shown in Figure 2, with a loop context nested in the top-level streaming context. Vertices A and D are not in any loop context, so there are no loop counters in their timestamps. Vertices B, C, and F are nested in a single loop context, so their timestamps have a single loop counter. The Ingress (I) and Egress (E) vertices sit on the boundary of a loop context; they monitor the loop by adding and removing loop counters to and from timestamps as messages pass through them. Naiad, as a timely dataflow computational model, uses this computational process to provide the basis for an efficient, lightweight coordination mechanism. Timely dataflow supports the following three features:

1. Structured loops allowing feedback in the dataflow,

2. Stateful dataflow vertices capable of consuming and producing records without global coordination, and

3. Notifications for vertices once they have received all records for a given round of input or loop iteration.

Structured loops and stateful dataflow vertices allow iterative and incremental computations with low latency. Notifications make it possible for Naiad to produce consistent results, at both the outputs and intermediate stages of computations, in the presence of streaming or iteration [15]. Timely dataflow in Naiad is achieved through optimizations in services such as asynchronous messaging, iterative dataflow, progress tracking and consistency.

3.1. Asynchronous messaging
All dataflow models require some means of communication for passing messages between nodes over outgoing edges. In a timely dataflow system, each node implements an OnRecv event handler that the system can call when a message arrives on an incoming edge, and the system provides a Send method that a node can invoke from any of its event handlers to send a message on an outgoing edge [16]. Messages are delivered asynchronously.

3.2. Consistency
Computations like the reduction functions Count or Average include subroutines that must accumulate all of their input before generating an output. At the same time, distributed applications commonly split input into small asynchronous messages to reduce latency and buffering. For timely dataflow to support incremental computations on unbounded streams of input as well as iteration, it has a mechanism to signal when a node (or data-parallel set of nodes) has seen a consistent subset of the input for which to produce a result [16].

3.3. Iterative Graph Dataflow
A Naiad dataflow graph is acyclic apart from structurally nested cycles that correspond to loops in the program. The logical timestamp associated with each event represents the batch of input that the event is associated with, and each subsequent integer gives the iteration count of any (nested) loops that contain the node. Every path around a cycle includes a special node that increments the innermost coordinate of the timestamp. A system-enforced rule restricts an event handler from sending a message with a time earlier than the timestamp of the event it is handling, ensuring a partial order on all of the pending events (undelivered messages and notifications) in the system and thus enabling efficient progress tracking [15].
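As a purely conceptual illustration of the OnRecv/Send handlers of Section 3.1 and the loop-counter timestamps of Section 3.3, the following Python sketch models vertices that exchange timestamped records; the class and method names are invented, and this is not Naiad's C# API.

# Conceptual sketch of timestamped message passing between dataflow vertices.
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out_edges = []                 # downstream vertices

    def send(self, timestamp, record):
        for v in self.out_edges:
            v.on_recv(timestamp, record)    # asynchronous in Naiad; direct call here

    def on_recv(self, timestamp, record):
        raise NotImplementedError

class Ingress(Vertex):
    def on_recv(self, timestamp, record):
        # entering a loop context: append a loop counter initialised to 0
        self.send(timestamp + (0,), record)

class Increment(Vertex):
    """Feedback vertex: advances the innermost loop counter each iteration."""
    def on_recv(self, timestamp, record):
        self.send(timestamp[:-1] + (timestamp[-1] + 1,), record)

class Printer(Vertex):
    def on_recv(self, timestamp, record):
        print(self.name, timestamp, record)

if __name__ == "__main__":
    ingress, incr, out = Ingress("I"), Increment("F"), Printer("B")
    ingress.out_edges = [incr]
    incr.out_edges = [out]
    # a record entering epoch 3 of the streaming context
    ingress.on_recv((3,), "record-1")       # prints: B (3, 1) record-1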

3.4. Progress Tracking
The ability to deliver notifications (an event that fires when all messages at or before a particular logical timestamp have been delivered to a particular node) promptly and safely is critical. This gives a timely dataflow system the ability to support low-latency incremental and iterative computation with consistent results. Naiad can implement progress trackers to establish the guarantee that no more messages with a particular timestamp can be sent to a node [15].

4. ACHIEVING TIMELY DATAFLOW
Each event has a timestamp and a location (either a vertex or an edge); together these are referred to as a pointstamp. The timely dataflow graph structure ensures that for any locations l1 and l2 connected by two paths with different summaries, one of the path summaries always yields adjusted timestamps earlier than the other. For each pair l1 and l2, we find the minimal path summary over all paths from l1 to l2 using a straightforward graph propagation algorithm, and record it as Ψ[l1, l2]. To efficiently evaluate this relation for two pointstamps (t1, l1) and (t2, l2), we test whether Ψ[l1, l2](t1) ≤ t2 [15]. Informally, timely dataflow supports directed dataflow graphs with structured cycles, which are analogous to structured loops in a standard programming language. This structure provides information about where records might possibly flow in the computation, allowing an implementation like Naiad to efficiently track and inform dataflow vertexes about the possibility of additional records arriving at given streaming epochs or iterations [17].

5. SYSTEM IMPLEMENTATION
Naiad is a high-performance distributed implementation of timely dataflow; it is written in C# and runs on Windows, Linux, and Mac OS. Figure 3 shows the schematic architecture of a Naiad cluster: a group of processes hosting workers that manage a partition of the timely dataflow vertices [16]. Each worker may host several stages of the dataflow graph. The workers are data nodes and keep a portion of the input data (usually a large-scale input graph, such as the Twitter follower graph) in memory, so it makes sense to move the computation (dataflow graph stages) to the data (the partitioned input graph) [18]. Each process participates in a distributed progress tracking protocol in order to coordinate the delivery of notifications. All the features of C#, including classes, structs, and lambda functions, can be used to build a timely dataflow graph from a system-provided library of generic Stream objects. The Naiad model uses a deferred execution method, for example adding a node to the internal dataflow graph at runtime while executing a method like Max on a Stream [15].
Naiad's distributed implementation also exhibits features like distributed progress tracking, a simple but extensible implementation of fault tolerance and high availability, and avoidance of micro-stragglers (events like packet loss, contention on concurrent data structures, and garbage collection, which lead to delays ranging from tens of milliseconds to tens of seconds), which are a main obstacle to scalability for low-latency workloads [15].

Fig. 3. The mapping of a logical dataflow graph onto the distributed Naiad system architecture [15].

6. PERFORMANCE EVALUATION
Naiad's performance in supporting high-throughput, low-latency, data-parallel, batch, iterative graph and streaming data processing has been examined by Murray et al. [15] using several notable microbenchmarks covering throughput, latency, protocol optimizations and scaling. The hardware configuration consists of two racks of 32 computers, each with two quad-core 2.1 GHz AMD Opteron processors, 16 GB of memory, and an Nvidia NForce Gigabit Ethernet NIC. The rack switches have 40 Gbps uplinks to the core switch. Averages across five trials are plotted, with error bars showing minimum and maximum values [15].

6.1. Throughput
The Naiad program constructs a cyclic dataflow that repeatedly performs the all-to-all data exchange of a fixed number of records. Figure 4 plots the aggregate throughput against the number of computers, with the uppermost line showing ideal throughput, the middle one depicting achievable throughput given network topology, TCP overheads, and .NET API costs, and the final line showing the throughput that Naiad achieves for a large number of 8-byte records (50M per computer) exchanged between all processes in the cluster.

Fig. 4. Evaluation of Naiad system throughput rate on a synthetic dataset [15].


6.2. Latency
The latency microbenchmark evaluates the minimal time required for global coordination. Figure 5 plots the distribution of times for 100K iterations using the median, quartiles, and 95th percentile values, indicating a low median time per iteration and an adverse impact at the 95th percentile mark due to the increased number of computers.

Fig. 5. Evaluation of Naiad system global barrier latency for a synthetic dataset [15].

6.3. Scaling
Based on input size and resource availability, strong and weak scaling microbenchmarks were run. Keeping the input fixed and increasing the compute resources, the running time shown in Figure 6 has been plotted for two applications, WordCount and weakly connected components (WCC). For WordCount, the dataset used is a Twitter corpus of 128 GB uncompressed, and for WCC, a random graph of 200M edges. The weak scaling evaluation measured the effect of increasing both the number of computers and the size of the input. Figure 7 shows the performance of WCC on a random input graph with a constant number of edges (18.2M) and nodes (9.1M) per computer. The running time degrades significantly to 29.4 s, compared to 20.4 s for a 1.1B-edge graph run on a 64-computer cluster, relative to execution on a single computer. The WordCount application, with 2 GB of compressed input per computer, also did not achieve perfect weak scaling, but compared to WCC it improved.

Fig. 6. Strong scaling microbenchmark evaluation for the Naiad system over a synthetic dataset [15].

Fig. 7. Weak scaling microbenchmark evaluation for the Naiad system over a synthetic dataset [15].

7. USE CASES
The Naiad framework has mainly been deployed for batch, streaming and graph computations serving interactive queries against the results, where Naiad responds to updates and queries with sub-second latencies [15].

7.1. Batch iterative graph computation
Implementations of graph algorithms such as PageRank, strongly connected components (SCC), weakly connected components (WCC), and approximate shortest path (ASP) in Naiad require less code complexity and show a dominant difference in running times compared to PDW [19], DryadLINQ [20] and SHS [21] for the Category A web graph [15, 22, 23].

7.2. Batch iterative machine learning
Naiad provides a competitive platform for custom implementations of distributed machine learning, and it is straightforward to build communication libraries for existing applications using Naiad's API. For example, a modified version of Vowpal Wabbit (VW), an open-source distributed machine learning library which performs iterative linear regression [24], was built on Naiad: the AllReduce implementation (processes jointly performing global averaging) requires 300 lines of code, around half as many as VW's AllReduce, and the Naiad code is at a much higher level, abstracting the network sockets and threads being used [15].

7.3. Streaming acyclic computation
Naiad reduces latency when computing the k-exposure metric for identifying controversial topics on Twitter using Kineograph, which takes snapshots of a continuously ingested graph for data-parallel computations [25].

7.4. Streaming iterative graph analytics
Twitter analysis: computing the most popular hashtag in each connected component of the graph of users mentioning other users, and providing interactive access to the top hashtag in a user's connected component.

8. USEFUL RESOURCES
Naiad: A Timely Dataflow System [15] provides an extensive study of Naiad's architecture, its working methodology, its distributed implementation, iterative and incremental data processing, data analysis applications, and comparisons with other graph processing frameworks. Moreover, [26], [27] and [13] are also good resources for learning about Naiad and programming in the Naiad framework.

9. CONCLUSION
Naiad enriches dataflow computations with timestamps that represent logical points in the computational process and provide the basis for an efficient, lightweight coordination mechanism. All of these capabilities in one package allow the development of high-level programming models on Naiad which can perform tasks such as streaming data analysis, iterative machine learning, and interactive graph mining. Moreover, the publicly reusable low-level programming abstractions of Naiad allow it to outperform many other data-parallel systems that enforce a single high-level programming model.

ACKNOWLEDGEMENTS
This work was done as part of the course "I524: Big Data and Open Source Software Projects" at Indiana University during Spring 2017. Many thanks to Professor Gregor von Laszewski and Prof. Geoffrey Fox at Indiana University Bloomington for their academic as well as professional guidance. We would also like to thank the Associate Instructors for their help and support during the course.

REFERENCES
[1] Aleksey Charapko, “One page summary: Incremental, iterative processing with timely dataflow,” Blog, Feb. 2017, accessed: 09-Apr-2017. [Online]. Available: http://charap.co/one-page-summary-incremental-iterative-processing-with-timely-dataflow/
[2] P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquin, “Incoop: Mapreduce for incremental computations,” in Proceedings of the 2nd ACM Symposium on Cloud Computing, ser. SOCC '11. New York, NY, USA: ACM, 2011, pp. 7:1–7:14, accessed: 2017-3-22. [Online]. Available: http://doi.acm.org/10.1145/2038916.2038923
[3] P. K. Gunda, L. Ravindranath, C. A. Thekkath, Y. Yu, and L. Zhuang, “Nectar: Automatic management of data and computation in datacenters,” in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI'10. Berkeley, CA, USA: USENIX Association, 2010, pp. 75–88, accessed: 2017-3-22. [Online]. Available: http://dl.acm.org/citation.cfm?id=1924943.1924949
[4] R. S. Barga, J. Goldstein, M. H. Ali, and M. Hong, “Consistent streaming through time: A vision for event stream processing,” CoRR, vol. abs/cs/0612115, 2006, accessed: 2017-3-22. [Online]. Available: http://arxiv.org/abs/cs/0612115
[5] B. Gedik, H. Andrade, K.-L. Wu, P. S. Yu, and M. Doo, “Spade: The system s declarative stream processing engine,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD '08. New York, NY, USA: ACM, 2008, pp. 1123–1134, accessed: 2017-3-22. [Online]. Available: http://doi.acm.org/10.1145/1376616.1376729
[6] A. Gupta and I. S. Mumick, “Materialized views,” A. Gupta and I. S. Mumick, Eds. Cambridge, MA, USA: MIT Press, 1999, ch. Maintenance of Materialized Views: Problems, Techniques, and Applications, pp. 145–157, accessed: 2017-3-24. [Online]. Available: http://dl.acm.org/citation.cfm?id=310709.310737
[7] S. Ceri, G. Gottlob, and L. Tanca, “What you always wanted to know about datalog (and never dared to ask),” IEEE Transactions on Knowledge and Data Engineering, vol. 1, no. 1, pp. 146–166, Mar 1989, accessed: 2017-3-24.
[8] A. Eisenberg and J. Melton, “Sql: 1999, formerly known as sql3,” SIGMOD Rec., vol. 28, no. 1, pp. 131–138, Mar. 1999, accessed: 2017-3-22. [Online]. Available: http://doi.acm.org/10.1145/309844.310075
[9] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, “Haloop: Efficient iterative data processing on large clusters,” Proc. VLDB Endow., vol. 3, no. 1-2, pp. 285–296, Sep. 2010, accessed: 2017-3-22. [Online]. Available: http://dx.doi.org/10.14778/1920841.1920881
[10] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, “Twister: A runtime for iterative mapreduce,” in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ser. HPDC '10. New York, NY, USA: ACM, 2010, pp. 810–818, accessed: 2017-3-23. [Online]. Available: http://doi.acm.org/10.1145/1851476.1851593
[11] D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand, “Ciel: A universal execution engine for distributed data-flow computing,” in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI'11. Berkeley, CA, USA: USENIX Association, 2011, pp. 113–126, accessed: 2017-3-24. [Online]. Available: http://dl.acm.org/citation.cfm?id=1972457.1972470
[12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI'12. Berkeley, CA, USA: USENIX Association, 2012, pp. 2–2, accessed: 2017-3-24. [Online]. Available: http://dl.acm.org/citation.cfm?id=2228298.2228301
[13] F. McSherry, R. Isaacs, M. Isard, and D. Murray, “Composable incremental and iterative data-parallel computation with naiad,” Tech. Rep. MSR-TR-2012-105, October 2012, accessed: 2017-3-24. [Online]. Available: https://www.microsoft.com/en-us/research/publication/composable-incremental-and-iterative-data-parallel-computation-with-naiad/
[14] HU2LA, “Review of "naiad: A timely dataflow system" < huula . webdesign + ai,” Blog, Sep. 2015, accessed: 23-Mar-2017. [Online]. Available: https://huu.la/blog/review-of-naiad-a-timely-dataflow-system
[15] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi, “Naiad: A timely dataflow system,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ser. SOSP '13. New York, NY, USA: ACM, 2013, pp. 439–455, accessed: 2017-3-22. [Online]. Available: http://doi.acm.org/10.1145/2517349.2522738
[16] D. G. Murray, F. McSherry, M. Isard, R. Isaacs, P. Barham, and M. Abadi, “Incremental, iterative data processing with timely dataflow,” Commun. ACM, vol. 59, no. 10, pp. 75–83, Sep. 2016, accessed: 2017-3-22. [Online]. Available: http://doi.acm.org/10.1145/2983551
[17] Microsoft, “Naiad-microsoft,” Web Page, Microsoft, 2017, accessed: 2017-3-22. [Online]. Available: https://www.microsoft.com/en-us/research/project/naiad/
[18] Edgar, “Metadata and the naiad post – one other good blog on data science/engineering,” Blog, Jan. 2017, accessed: 23-Mar-2017. [Online]. Available: https://theinformationageblog.wordpress.com/2017/01/09/metadata-and-the-naiad-post-one-other-good-blog-on-data-scienceengineering/
[19] Microsoft, “Microsoft analytics platform system overview | microsoft,” Web Page, Microsoft, 2017, accessed: 2017-3-26. [Online]. Available: https://www.microsoft.com/en-us/sql-server/analytics-platform-system
[20] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey, “Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language,” in Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI'08. Berkeley, CA, USA: USENIX Association, 2008, pp. 1–14, accessed: 2017-3-24. [Online]. Available: http://dl.acm.org/citation.cfm?id=1855741.1855742
[21] M. Najork, “The scalable hyperlink store,” in Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, ser. HT '09. New York, NY, USA: ACM, 2009, pp. 89–98, accessed: 2017-3-23. [Online].


Available: http://doi.acm.org/10.1145/1557914.1557933 [22] Lemur, “The clueweb09 dataset,” Web Page, Red Hat, Inc., 2017, accessed: 2017-3-24. [Online]. Available: http://lemurproject.org/ clueweb09/ [23] M. Najork, D. Fetterly, A. Halverson, K. Kenthapadi, and S. Gollapudi, “Of hammers and nails: An empirical comparison of three paradigms for processing large graphs,” in Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, ser. WSDM ’12. New York, NY, USA: ACM, 2012, pp. 103–112, accessed: 2017-3-24. [Online]. Available: http://doi.acm.org/10.1145/2124295.2124310 [24] John Lanford, “Home . johnlangford/vowpal_wabbit wiki,” Code Repository, Dec. 2016, accessed: 2017-3-24. [Online]. Available: https://github.com/JohnLangford/vowpal_wabbit/wiki [25] R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen, “Kineograph: Taking the pulse of a fast-changing and connected world,” in Proceedings of the 7th ACM European Conference on Computer Systems, ser. EuroSys ’12. New York, NY, USA: ACM, 2012, pp. 85–98, accessed: 2017-3-22. [Online]. Available: http://doi.acm.org/10.1145/2168836.2168846 [26] Microsoft Research, “Microsoft/naiad:the naiad system provides fast incremental and iterative computation for data-parallel workloads,” Code Repository, Nov. 2014, accessed: 2017-3-24. [Online]. Available: https://github.com/MicrosoftResearch/Naiad [27] [email protected], “Nuget gallery|naiad - core 0.5.0-beta,” Web Page, NET Foundation, Nov. 2014, accessed: 2017-3-26. [Online]. Available: https://www.nuget.org/packages/Microsoft.Research.Naiad/

116 Review Article Spring 2017 - I524 1

Dryad : Distributed Execution Engine

SHAHIDHYA RAMACHANDRAN1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

Technology paper 2, April 30, 2017

Dryad is a general purpose execution environment for distributed, data-parallel applications; it automatically handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting. It creates a dataflow graph by using computational 'vertices' and communication 'channels'. Dryad is the middleware abstraction that is independent of the semantics of the program. It is not suitable for real-time processing since it focuses on throughput rather than latency. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Dryad, Distribute Processing, Concurrent Processing, Dryad LINQ, Dataflow, Cluster Computing

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-IR-2027

1. INTRODUCTION
With the expansion of large-scale Internet services that depend on clusters of thousands of servers, it became more complicated to manage wide-area distributed computing. Some of the prominent challenges included high-latency, unreliable networks, control of resources by separate federated or competing entities, and issues of identity for authentication and access control [1]. Dryad was launched in an effort to make it easier for developers to write efficient parallel and distributed applications without having in-depth knowledge of concurrent programming [2]. It focused primarily on the reliability, efficiency and scalability of applications. Dryad was launched as a research project at MSR (Microsoft Research) and was presented at EuroSys'07, March 21–23, 2007, Lisboa, Portugal [3].
A Dryad application is a graph generator which can synthesize any given directed acyclic graph. These graphs are dynamic and can change depending on computations made during run-time [2]. The vertices of the directed acyclic graph define the operations that are to be performed on the data. These 'computational vertices' are written as single-threaded sequential C++ constructs and are connected using one-way channels. Channels facilitate communication between the vertices through temporary files, TCP/IP streams, and shared-memory FIFOs [1]. The graph is parallelized by distributing the vertices across multiple execution engines. Dryad can scale this across multiple processor cores on the same computer or across different physical computers connected by a network, as in a cluster [1]. Scheduling of the computational vertices on the available hardware is handled by the Dryad run-time, without any explicit intervention by the developer of the application or the administrator of the network.

2. DRYAD SYSTEM
A Dryad job is a directed acyclic graph where each vertex is a program and edges represent data channels. The logical computation graph is automatically mapped onto physical resources by the run-time. The number of vertices in the graph can be much greater than the number of execution cores in the computing cluster. During run-time a finite sequence of 'structured items' is transported through the channels. Channels produce and consume heap objects that inherit from a base type, so a vertex program reads and writes its data in the same way irrespective of how the channel serializes its data. Each application has its own serialization/deserialization routine [4].

Fig. 1. The Dryad system organization Source: [4]

The overall structure of a Dryad system is shown in Figure 1. The job manager (JM) consults the name server (NS) to discover the list of available computers. It maintains the job graph and schedules running vertices (V) as computers become available, using the daemon (D) as a proxy. Vertices exchange data through files, TCP pipes, or shared-memory channels. The shaded bar in the figure indicates the vertices in the job that are currently running [4].
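The channel model of Section 2 can be illustrated with a small, hypothetical Python sketch in which a single-threaded vertex consumes items from its input channels and writes results to an output channel, independently of how a real channel would be transported; Dryad's actual vertex programs are C++, so this is only a conceptual illustration, not its API.

# Conceptual sketch of a vertex processing items from channels (illustrative only).
from typing import Iterable, Iterator

class Channel:
    """A one-way channel carrying a finite sequence of structured items."""
    def __init__(self, items: Iterable = ()):
        self._items = list(items)

    def write(self, item):
        self._items.append(item)

    def read(self) -> Iterator:
        return iter(self._items)

class GrepVertex:
    """A single-threaded vertex: emit every input item containing a pattern."""
    def __init__(self, pattern: str):
        self.pattern = pattern

    def run(self, inputs: list, output: Channel):
        for channel in inputs:            # a vertex may have several input edges
            for item in channel.read():
                if self.pattern in item:
                    output.write(item)

if __name__ == "__main__":
    part1 = Channel(["error: disk", "ok"])
    part2 = Channel(["ok", "error: net"])
    out = Channel()
    GrepVertex("error").run([part1, part2], out)
    print(list(out.read()))               # ['error: disk', 'error: net']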

2.1. Job Manager (JM)
A Dryad job is coordinated by the job manager. It runs either within the cluster or on a user's workstation with network access to the cluster. It contains the code to construct the communication graph and also the library code to schedule the work across the available resources. The job manager handles only the control decisions and is not involved in the actual data transfer between vertices [4].

2.2. Name Server(NS) Fig. 2. Directed Acyclic Graph in Dryad job Source:[5] The Name Server lists all the available computers and exposes the position of each computer within the network topology so that scheduling decisions can be taken by taking locality into 3.2. Creating new vertices consideration [4]. All vertex programs inherit from the Dryad defined C++ base class. Each program has a textual name - unique within the ap- 2.3. Daemon(D) plication and a static factory- that has knowledge of constructing . It is a cluster that runs on each of the computers and is respon- it [6]. A graph vertex is created by calling the appropriate static sible for creating processes on behalf of the job manager. The program factory. Vertex specific parameters are set by calling first time a vertex (V) is executed its binary is sent from the job methods on the program object. These parameters along with manager to the daemon. It acts as a proxy to the job manager the unique vertex name form a closure that is sent to a remote when it checks the status of the computation and how much process for execution. data has been read and written on the channels of the remote vertices [4]. 3.3. Adding graph edges New edges are created by applying a composition operation to 2.4. Job Scheduler two existing graphs as shown in Figure 3. In this figure, Circles The task scheduler is used to queue batch jobs. A distributed are vertices and arrows are graph edges. A triangle at the bot- storage system is used that allows large files to be broken into tom of a vertex indicates an input and one at the top indicates small pieces, replicated and distributed across the local disks of an output. Boxes (a) and (b) demonstrate cloning individual the cluster computers [4]. vertices using the carat operator. The two standard connection operations are pointwise composition using >= shown in (c) 3. DRYAD GRAPH and complete bipartite composition using » shown in (d). (e) illustrates a merge using ||. The second line of the figure shows Each graph is represented as =< , , , > [4] where G VG EG IG OG more complex patterns. The merge in (g) makes use of a “sub- is a sequence of vertices with directed edges and two sets VG EG routine” from (f) and demonstrates a bypass operation. For and indicate the input and output vertices IG VG OG VG example, each A vertex might output a summary of its input to respectively.⊂ Directed⊂ Acyclic graphs are used because a vertex C which aggregates them and forwards the global statistics to can run anywhere once its inputs have been received, it does every B. Together the B vertices can then distribute the original not lead to a dead-lock situation and the finite length channels dataset (received from A) into balanced partitions. An asymmet- ensure that the process will definitely terminate [5]. Figure 2 ric fork/join is shown in (h) [6]. shows an example of a Directed Acyclic Graph. One of the pri- mary restrictions on a Dryad job is that it has to be representable 4. FEATURES as a Directed Acyclic Graph. The graph can have multiple edges between the vertices. Each circle represents a program that pro- The primary features of Dryad are fault tolerance and dynamic cesses a set of inputs and outputs the results. The inputs and modification of graph during run-time. outputs in the graph are virtual vertices of the distributed Dryad system. 4.1. Fault Tolerance When a vertex execution fails, the job manager will be notified. If 3.1. 
3.1. Inputs and Outputs
Large input files are partitioned and distributed across the computers of the cluster. The input is grouped into a graph G = <VP, ∅, ∅, VP>, where VP [6] is a sequence of virtual vertices corresponding to the partitions of the input. After the job is completed, a set of output partitions is concatenated to form a single named distributed file. An application will generally interrogate its input graphs to read the number of partitions at run-time and automatically generate the appropriately replicated graph.

3.2. Creating new vertices
All vertex programs inherit from the Dryad-defined C++ base class. Each program has a textual name, unique within the application, and a static factory that knows how to construct it [6]. A graph vertex is created by calling the appropriate static program factory. Vertex-specific parameters are set by calling methods on the program object. These parameters, along with the unique vertex name, form a closure that is sent to a remote process for execution.

3.3. Adding graph edges
New edges are created by applying a composition operation to two existing graphs, as shown in Figure 3. In this figure, circles are vertices and arrows are graph edges. A triangle at the bottom of a vertex indicates an input and one at the top indicates an output. Boxes (a) and (b) demonstrate cloning individual vertices using the caret operator. The two standard connection operations are pointwise composition using >=, shown in (c), and complete bipartite composition using >>, shown in (d). (e) illustrates a merge using ||. The second line of the figure shows more complex patterns. The merge in (g) makes use of a "subroutine" from (f) and demonstrates a bypass operation. For example, each A vertex might output a summary of its input to C, which aggregates them and forwards the global statistics to every B. Together the B vertices can then distribute the original dataset (received from A) into balanced partitions. An asymmetric fork/join is shown in (h) [6].

4. FEATURES

The primary features of Dryad are fault tolerance and dynamic modification of the graph at run-time.

4.1. Fault Tolerance
When a vertex execution fails, the job manager is notified. If the vertex reports an error cleanly or crashes, the failure is forwarded by the daemon to the job manager. If the daemon also fails, the job manager receives a heartbeat timeout. The failed vertex A is re-executed. If vertex A's inputs fail, all upstream vertices are re-executed. If vertex A is running slower than its peers, Dryad creates duplicate executions and the first output to finish is used [5].

4.2. Dynamic Graph Refinement
The application passes the initial graph at the beginning and records all the callbacks.


Fig. 3. The operators of the graph description language Source:[6]

The graphs can be modified at run-time based on the output of the computations. However, there are some restrictions: a vertex cannot be deleted once it has been executed, and the number and type of its channels cannot be altered [5].

5. USE CASES

Dryad can be used for any large scale computational application, including scientific applications and large-data processing applications such as indexing and search. High-level language compilers use Dryad as a run-time, for example SCOPE (Structured Computations Optimized for Parallel Execution) and DryadLINQ [7]. Geoffrey Fox, professor of informatics at Indiana University, used Dryad and DryadLINQ to analyze RADAR data focused on glaciers to learn more about the earth's past and its present in order to make more informed, potentially life-saving predictions about its future [8].

6. PERFORMANCE

The performance of Dryad was experimentally evaluated through two processes. In the first, a relatively simple SQL query was distributed among 10 computers in the Dryad system and its performance was compared with that of a traditional commercial SQL server. The second was a map-reduce data-mining operation applied to 10.2 TBytes of data using a cluster of around 1800 computers [9]. The experiments were run in a Microsoft Research laboratory with computers having 2 dual-core Opteron processors running at 2 GHz, 8 GB of DRAM and 4 disks (400 GB). Network connectivity was by 1 Gbit/sec Ethernet links connecting into a single non-blocking switch. Figure 4 summarizes the time in seconds to process an SQL query using different numbers of computers [9].

Fig. 4. Time taken to process SQL query Source:[9]

Fig. 5. Speedup of the SQL query Source:[9]

From Figure 5 we can see that the speedup of the SQL query computation is nearly linear in the number of computers used. The baseline is relative to Dryad running on a single computer, and times are as given in Figure 4 [9].


7. CONCLUSION

Microsoft Dryad is suitable only for applications that take the form of a directed acyclic graph. Dryad assumes that the vertices are deterministic and will fail if the application contains non-deterministic vertices. Since the job manager assumes that it has exclusive control over the computers in the cluster, it is difficult to efficiently run more than one Dryad job at a time [9]. Dryad focuses only on throughput and does not improve latency. In November 2012, Microsoft shifted its focus to Hadoop rather than continuing to improve Dryad.

REFERENCES [1] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS operating systems review, vol. 41, no. 3. Lisboa-Portugal: ACM, Mar. 2007, p. 59, accessed: 2017-2-24. [Online]. Avail- able: https://www.microsoft.com/en-us/research/wp-content/uploads/ 2007/03/eurosys07.pdf [2] C. Poulain, “Dryad - microsoft research,” Web Page, Mar. 2007, accessed: 2017-2-24. [Online]. Available: https://www.microsoft.com/ en-us/research/project/dryad/ [3] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS operating systems review, vol. 41, no. 3. Lisboa- Portugal: ACM, Mar. 2007, pp. 59–72, accessed: 2017-2- 24. [Online]. Available: https://www.microsoft.com/en-us/research/ wp-content/uploads/2007/03/eurosys07.pdf [4] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS operating systems review, vol. 41, no. 3. Lisboa- Portugal: ACM, Mar. 2007, pp. 60–61, accessed: 2017-2- 24. [Online]. Available: https://www.microsoft.com/en-us/research/ wp-content/uploads/2007/03/eurosys07.pdf [5] M. Isard, “Dryad - google tech talks,” Web Page, Nov. 2007, accessed: 2017-2-24. [Online]. Available: https://www.youtube.com/watch?v= WPhE5JCP2Ak [6] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS operating systems review, vol. 41, no. 3. Lisboa- Portugal: ACM, Mar. 2007, pp. 63–64, accessed: 2017-2- 24. [Online]. Available: https://www.microsoft.com/en-us/research/ wp-content/uploads/2007/03/eurosys07.pdf [7] Wikipedia, “Dryad (programming)-wikipedia,” Web Page, Nov. 2016, accessed: 2017-2-24. [Online]. Available: https://www.microsoft.com/en-us/research/blog/ dryad-and-dryadlinq-academic-accelerators-for-parallel-data-analysis/ [8] D. Campbell, “Dryad and dryadlinq: Academic accelerators for parallel data analysis,” Web Page, Feb. 2010, accessed: 2017-2-24. [Online]. Available: https://www.microsoft.com/en-us/research/blog/ dryad-and-dryadlinq-academic-accelerators-for-parallel-data-analysis/ [9] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS operating systems review, vol. 41, no. 3. Lisboa- Portugal: ACM, Mar. 2007, pp. 71–72, accessed: 2017-2- 24. [Online]. Available: https://www.microsoft.com/en-us/research/ wp-content/uploads/2007/03/eurosys07.pdf


A Report on Apache Apex

SRIKANTH RAMANAM1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

April 30, 2017

Apache Apex is a Hadoop YARN native big data processing platform with both stream and batch processing capabilities. This paper explores the architecture, functioning and competition of Apache Apex. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Stream, Processing, YARN, Apache, Apex, Malhar, I524

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IR-2028/report.pdf

1. INTRODUCTION

Apex is an enterprise-grade stream and batch processing platform for the Apache Hadoop ecosystem [1]. It was initially developed by DataTorrent as the core engine for RTS, a data processing and analytics platform. Apex was submitted to the Apache incubator in 2015 [2]. Later, several enterprises like CapitalOne, DirecTV, General Electric, Apple and Silver Spring Networks joined its open source community. Apache Apex was first released in 2016.

Apache Apex is built over YARN and is compatible with existing Hadoop platforms, allowing users to leverage their previous Hadoop investments and applications [3]. According to the Apache Apex website [4], "it processes big data in-motion in a way that is highly scalable, highly performant, fault tolerant, stateful, secure, distributed, and easily operable". Apex automatically handles operational aspects like state management, fault tolerance, scalability, etc. [5]. It also provides a simple API that supports Java, facilitating easy development and widespread adoption [5]. It also has a REST API facilitating compatibility with popular web technologies. Apache Apex also has a metrics API that allows users to monitor various aspects of operators in real time.

2. COMPONENTS

Apache Apex has two main components: Apex Core and Apex Malhar [6].

2.1. Apex Core
Apex Core is the framework for building distributed stream processing and analytics applications on Hadoop. It also enables building of unified batch and stream processing applications.

2.2. Apex Malhar
Malhar provides a library of operators that perform widely used functionality. These reusable blocks reduce the amount of coding required to build applications and enable users to speed up application development. Operators offered by Malhar are mainly of two types.

2.2.1. Input/Output Operators
These operators offer connectivity with a variety of existing data sources.

2.2.2. Compute Operators
These operators offer functionality for Machine Learning, Stats and Math, Pattern Matching, Query and Scripting, Stream manipulators, Parsers, and UI & Charting.

Fig. 1. Apache Apex Components [3]

3. ARCHITECTURE

Operators are the basic building blocks of Apex applications. A streaming application is built using in-built or custom operators that are connected to form a DAG (Directed Acyclic Graph) using streams.

4. APPLICATION DEVELOPMENT

Apex applications can be written in Java using any IDE supporting Java, such as Eclipse.


Other prerequisites include Maven 3.0, Apache Apex, and Apache Malhar [7].

Fig. 2. Apex Application DAG [7]

4.1. Operators
Operators are independent units of logical operations that either contribute to a part of, or implement a whole, business use case. An operator has an input port to receive data tuples and an output port to send data tuples to another operator or an external system [8].

4.1.1. Types of Operators [8]
• Input Adapter: an operator at the beginning of the DAG that receives data from an external system.
• Generic Operator: accepts tuples from the previous operator in the DAG, performs some processing task, and outputs the processed data to another operator.
• Output Adapter: an operator at the end of the DAG that outputs the data tuples to an external system.

4.1.2. Operator API [8]
• setup() initializes the operator.
• process() performs the core processing operations on data tuples and is triggered when tuples are received.
• beginWindow() and endWindow() are used for pre- and post-processing steps.
• teardown() shuts down the operator and releases the resources held by the operator.

4.2. Directed Acyclic Graph
A Directed Acyclic Graph (DAG) is constructed to accomplish a business task using several operators connected through streams [7]. A stream is a sequence of data tuples. To construct a DAG, operators are added using dag.addOperator(args) while streams are added using dag.addStream(args). Other configurations related to YARN can also be added to the DAG [9].
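Apex exposes this DAG API in Java; the Python sketch below is only a conceptual stand-in that mimics the operator lifecycle hooks and the dag.addOperator / dag.addStream wiring described above. The class names, the toy word-count logic, and the stream names are illustrative assumptions, not the real Apex or Malhar signatures.

class WordCountOperator:
    """Toy 'generic operator': counts words per streaming window."""
    def setup(self):                       # called once before processing
        self.counts = {}

    def begin_window(self, window_id):     # pre-processing hook
        self.counts.clear()

    def process(self, tuple_):             # called for every incoming tuple
        for word in tuple_.split():
            self.counts[word] = self.counts.get(word, 0) + 1

    def end_window(self):                  # post-processing hook: emit results
        return dict(self.counts)

    def teardown(self):                    # release resources
        self.counts = None


class ToyDAG:
    """Stand-in for a streaming DAG: named operators connected by streams."""
    def __init__(self):
        self.operators, self.streams = {}, []

    def add_operator(self, name, op):
        self.operators[name] = op
        return op

    def add_stream(self, name, src, dst):
        self.streams.append((name, src, dst))


dag = ToyDAG()
counter = dag.add_operator("counter", WordCountOperator())
dag.add_stream("lines", "input", "counter")    # input adapter -> counter
dag.add_stream("counts", "counter", "output")  # counter -> output adapter

# Simulate one processing window.
counter.setup()
counter.begin_window(1)
for line in ["apex streams data", "apex scales"]:
    counter.process(line)
print(counter.end_window())   # {'apex': 2, 'streams': 1, 'data': 1, 'scales': 1}
counter.teardown()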

4.3. Package
Apex applications are assembled and shared using Apache Apex Packages, which are zip files with all the necessary files to launch those applications. Apache Apex Packages are created using Maven. First a Maven project is created with the path to the application code. Then the mvn package command creates an application package in the target directory. The zip structure of such a package consists of app with jar files of the DAG code, lib with jar files of dependencies, conf with preset configurations, META-INF consisting of meta information files, and resources for other files [10].

Fig. 3. Apache Apex Application Example [3]

5. FUNCTIONING

Apache Apex applications are built as DAGs consisting of operators and streams. These applications are packaged, shared and deployed over Hadoop clusters.

5.1. Fault tolerance and Recovery
The states of operators and the application master are checkpointed regularly to a persistent store like HDFS [11]. Failed operators are automatically detected through continuous monitoring [11]. When a failure occurs, the checkpointed states of the operators are used to revive the application [11]. Data can be replayed from the checkpointed state of the operator after recovery, preventing data loss [11].

5.2. Compatibility
Apache Apex is also compatible with several popular data file systems, message systems and database systems through connectors provided by Malhar. They are shown in Figure 4.

Fig. 4. Apache Apex Interoperability [3]

6. USER INTERFACE

A command line interface called Apex CLI is available for Apache Apex [12]. It can be launched with the command apex.

A help command is available for all commands, to obtain information and syntax. Apache Apex's REST and Java APIs can be integrated with commonly used existing web technologies to create web interfaces for visual analytics. DataTorrent also offers the RTS platform, built on Apex, which provides visualization with several real-time dashboards that help monitor streaming applications [11].

7. LICENSE AND PRICING

Apache Apex is free and open source software. It is licensed under the Apache License 2.0 [4].

8. COMPETITION

Some of the competitors of Apache Apex for stream processing and analytics are [13]:

8.1. Apache Spark
This is a large-scale data processing engine that also offers stream processing [14]. However, it is not a pure streaming engine, as it accomplishes streaming through micro-batching, the fast execution of batches on small sets of data.

8.2. Apache Flink
Apache Flink is an open-source stream processing framework that processes streams in real time [15]. It is quite similar to Apex but not as widely used.

8.3. Apache Storm
This is a free and open source distributed real-time computation system [16]. It is fast but not stateful like Apache Apex.

8.4. Apache Samza
Apache Samza is a distributed stream processing framework [17]. It was first developed by LinkedIn and later open sourced [18]. It is built on top of Apache Kafka [19], a distributed streaming platform. It provides stateful streaming capabilities [18] and has great compatibility with Kafka [18]. Streaming applications can be built such that Kafka consumes the data processed by Samza [18].

9. USERS

With its stream processing capabilities, Apache Apex facilitates building large scale real-time analytics applications. Enterprises like GE, PubMatic and SilverSpring Networks are using Apex based streaming solutions [9].

10. CONCLUSION

Apache Apex is an open source YARN (Hadoop 2.0)-native platform [6]. It unifies stream and batch processing. It can be used for processing both streams of data and static files, making it more relevant in the context of the present day internet and social media. It is aimed at leveraging the present Hadoop platform and reducing the learning curve for development of applications over it. It can be used through a simple API. It enables reuse of code by not having to make drastic changes to applications, by providing interoperability with the existing technology stack. It leverages existing Hadoop platform investments.

ACKNOWLEDGEMENTS

This paper has been written as part of a class assignment for the course I524: Big Data Software and Projects, Spring 2017, School of Informatics and Computing, Indiana University, Bloomington. Special thanks to Professor Gregor von Laszewski, Dimitar Nikolov and all associate instructors for guiding us through the process of writing this paper.

REFERENCES
[1] Apache, "Apache software foundation blog apex introduction," Web Page, Apr. 2015, accessed: 2017-03-26. [Online]. Available: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces90
[2] A. Kekre, "Apache apex blog incubator," Web Page, Sep. 2015, accessed: 2017-03-26. [Online]. Available: https://www.datatorrent.com/blog/apex-accepted-as-apache-incubator-project/
[3] A. Kekre, "Apache apex blog introduction," Web Page, Sep. 2015, accessed: 2017-03-26. [Online]. Available: https://www.datatorrent.com/blog/introducing-apache-apex-incubating/
[4] Apache, "Apache apex," Web Page, 2016, accessed: 2017-03-26. [Online]. Available: https://apex.apache.org/
[5] Apache, "Apache apex documentation," Web Page, 2016, accessed: 2017-03-26. [Online]. Available: https://apex.apache.org/docs/apex/application_packages/#apache-apex-packages
[6] Wikipedia, "Apache apex wiki," Web Page, accessed: 2017-03-26. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Apex
[7] Apache, "Apache apex application development documentation," Web Page, 2016, accessed: 2017-03-26. [Online]. Available: https://apex.apache.org/docs/apex/application_development/
[8] Apache, "Apache apex application operator documentation," Web Page, 2016, accessed: 2017-03-26. [Online]. Available: https://apex.apache.org/docs/apex/operator_development/
[9] T. Weise, "Apache apex slideshare," Web Page, Jul. 2016, accessed: 2017-03-26. [Online]. Available: https://www.slideshare.net/ThomasWeise/apache-apex-stream-processing-architecture-and-applications
[10] Apache, "Apache apex application package documentation," Web Page, 2016, accessed: 2017-03-26. [Online]. Available: https://apex.apache.org/docs/apex/application_packages/#apache-apex-packages
[11] Apache, "Apache apex introduction slideshare," Web Page, Jul. 2016, accessed: 2017-03-26. [Online]. Available: https://www.slideshare.net/ThomasWeise/apache-apex-stream-processing-architecture-and-applications
[12] Apache, "Apache apex cli documentation," Web Page, 2016, accessed: 2017-03-26. [Online]. Available: https://apex.apache.org/docs/apex/operator_development/
[13] S. Hall, "The newstack article on apache apex competition," Web Page, May 2016, accessed: 2017-03-26. [Online]. Available: https://thenewstack.io/apache-gets-another-real-time-stream-processing-framework-apex/
[14] Apache, "Apache spark," Web Page, 2016, accessed: 2017-03-26. [Online]. Available: http://spark.apache.org/
[15] S. Hall, "The newstack article on apache flink," Web Page, Apr. 2016, accessed: 2017-03-26. [Online]. Available: https://thenewstack.io/apache-flink-addresses-continuous-stream-processing/
[16] Apache, "Apache storm," Web Page, 2016, accessed: 2017-03-26. [Online]. Available: http://storm.apache.org/
[17] Apache, "Apache samza," Web Page, 2016, accessed: 2017-03-26. [Online]. Available: http://samza.apache.org/
[18] S. Hall, "The newstack article on apache streaming projects," Web Page, Jun. 2016, accessed: 2017-03-26. [Online]. Available: https://thenewstack.io/apache-gets-another-real-time-stream-processing-framework-apex/
[19] Apache, "Apache kafka," Web Page, 2016, accessed: 2017-04-06. [Online]. Available: https://kafka.apache.org/


Apache Mahout

NAVEENKUMAR RAMARAJU1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors:[email protected]

Paper 2, April 30, 2017

Apache Mahout [1] is an extensible programming environment to build scalable machine learning algorithms. It has algorithms that work in tandem with frameworks like Hadoop, Spark, Flink and H2O, which specialize in dealing with large-scale data. Samsara is a vector math environment for doing linear algebra operations using distributed computing. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Mahout, Samsara, Apache, Scalable, Machine learning

https://github.com/naveenkumar2703/sp17-i524/paper2/S17-IR-2029/report.pdf

1. INTRODUCTION

Mahout is open source software for creating scalable machine learning models. The initial version of Mahout aimed to implement all ten machine learning algorithms discussed in Andrew Ng's paper "Map-Reduce for Machine Learning on Multicore" [2] with scalability in mind. Further releases of Mahout added various implementations of clustering, classification, collaborative filtering and genetic algorithms in Java, for usage on a single machine as well as in clusters using MapReduce.

As in-memory software like Spark started gaining popularity over disk based software like Hadoop, new features rolled out by Mahout did not support MapReduce. "Samsara" is a module for algebraic operations which was released in Mahout version 0.10 and is compatible only with frameworks like Spark [3], H2O [4] and Flink [5], but not MapReduce based frameworks. Samsara was mainly written in Scala and is optimized to operate well independently of the backend. Another key feature of Samsara is that it supports R-like syntax for linear algebra operations.

Mahout has many algorithms for MapReduce based frameworks like Hadoop. Only some of them are implemented for in-memory based frameworks like Spark. A full list of algorithms available in Mahout and the frameworks supported by them is discussed in section 2.

2. FEATURES

Mahout supports a wide range of machine learning algorithms for classification, collaborative filtering and clustering, and dimensionality reduction techniques like SVD, PCA and QR decomposition [6]. The Mahout Samsara environment provides linear algebra and statistics operations. Each of these is described in this section.

2.1. Samsara - Math Environment
Mahout Samsara [7] is a math environment to create and perform various math and linear algebra operations. Some of the key functionalities are BLAS (Basic Linear Algebra Subprograms), distributed row matrices, distributed ALS (Alternating Least Squares), PCA (principal component analysis), in-core and distributed SPCA (Stochastic PCA), SVD (singular value decomposition), in-core and distributed SSVD (Stochastic SVD), Eigen decomposition, Cholesky decomposition and similarity analysis. One of the main advantages of Samsara is that it supports R and Matlab like syntax using Scala's DSL (domain specific language) feature. DSLs are syntactic sugar for easy interpretation. An example is provided in section 5. Mahout Samsara is supported on the Spark, H2O and Flink engines. Samsara is not available on Hadoop and MapReduce based engines, but a different implementation supports all these math operations there.

2.2. Classification
Mahout has logistic regression using stochastic gradient descent, naive Bayes, complementary naive Bayes, and hidden Markov algorithms for classification. Naive Bayes, available on Spark and MapReduce, is the only distributed classification algorithm. The others are supported only on single machines.

2.3. Clustering
Mahout supports clustering algorithms like k-means, fuzzy k-means, streaming k-means, and canopy clustering. These are available only for single machines and MapReduce based environments.

2.4. Collaborative Filtering
Mahout has implementations of user based and item based collaborative filtering algorithms for single machine, MapReduce and Spark engines.

It also has an implementation of matrix factorization with ALS, and of weighted matrix factorization using SVD, for single machine and MapReduce.

2.5. Other
Mahout supports several other features like Latent Dirichlet Allocation, row similarity job, collocations, sparse TF-IDF (term frequency - inverse document frequency, a common feature used in information retrieval and document search) vectors from text, XML parsing, and email archive parsing for MapReduce engines. It also has an Evolutionary Processes/Genetic algorithms implementation that runs on a single machine.

3. LICENSING

Apache Mahout is open source software available for free commercial usage and is licensed under the Apache License, Version 2.0 [8]. Associated frameworks like Spark, Hadoop, Flink and H2O are also open source products.

4. ECOSYSTEM

Mahout can be used with a wide variety of distributed systems like Spark, H2O, Flink and Hadoop. It can also be used on a single machine. A key benefit of Mahout is that the machine learning or math code can be written in the same syntax independent of the backend. An illustration of how Mahout works by using the Scala DSL for math operations with various engines is given in Figure 1.

Fig. 1. Mahout Ecosystem. Source: [9]

5. USE CASES

A use case for a recommendation system and a simple linear algebra operation are provided in this section.

5.1. Recommendation system with Spark Engine
A recommendation system is used to recommend the most relevant products that a customer might be interested in, and to boost sales on online platforms. Recommendation can be done based on user similarity or item similarity. In the case of user similarity, recommendation is based on the idea that users with similar interests would buy similar products. Similar users are identified and grouped, and recommendations are made based on differences in their sets of products. Item similarity, on the other hand, is based on identifying the products that were bought together and recommending the products that are not yet in the customer's basket or purchase history. Recommendation based on item similarity is very popular in industry, as the number of product types is far smaller than the number of customers with regular purchase patterns in a large retail business. An illustration of how Mahout can be used with Spark for item based recommendation is shown in Figure 2.

Fig. 2. Recommender Illustration. Source: [10]

Mahout's MapReduce version of item similarity takes a text file that is expected to have user and item IDs and provides recommendations based on content.

5.2. SPCA
The equation to compute stochastic principal component analysis is given in equation 1:

G = BB' − C − C' + s_q s_q' (ξ'ξ)   (1)

where G is the matrix representation of the result of SPCA, B is the input or feature matrix, C is the correlation matrix of the inputs, with B' and C' as their transposes, s_q is the standard deviation of each feature, ξ is the covariance matrix, and s_q', ξ' are their respective transposes. One line of Mahout code to compute this using the Scala DSL is illustrated here:

val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)

This could be used on a single machine or a distributed cluster in Spark, H2O or Flink to perform SPCA.
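For readers more familiar with Python, the following numpy sketch evaluates the same expression on small, randomly generated in-memory matrices as a rough analogue of the distributed Samsara computation. The shapes and variable names (bt, c, s_q, xi) are assumptions chosen only to mirror the DSL line; this is not Mahout code.

import numpy as np

rng = np.random.default_rng(0)
bt = rng.standard_normal((5, 3))   # stands in for B', the transposed input matrix
c = rng.standard_normal((3, 3))    # stands in for the correlation matrix C
s_q = rng.standard_normal(3)       # per-feature standard deviations
xi = rng.standard_normal(3)        # covariance-related quantity, treated here as a vector

# Mirrors: val g = bt.t %*% bt - c - c.t + (s_q cross s_q) * (xi dot xi)
g = bt.T @ bt - c - c.T + np.outer(s_q, s_q) * np.dot(xi, xi)
print(g.shape)  # (3, 3)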

6. USEFUL RESOURCES

Apache Mahout Cookbook [11] by Piero Giacomelli is a good introductory book on Mahout. Apache Mahout: Beyond MapReduce [12] by Dmitriy Lyubimov provides exhaustive coverage of Mahout Samsara and its math operations. The Mahout website [13] has a compilation of useful resources such as books, tutorials and talks about Mahout and machine learning.


7. CONCLUSION

Apache Mahout provides an open source environment with scalable algorithms for machine learning and math operations using a DSL. It can be used with multiple backends like Spark, H2O, Flink and Hadoop. The same code can be used for single-machine and distributed computation of algorithms on any supported backend.

ACKNOWLEDGEMENTS This work was done as part of the course "I524: Big Data and Open Source Software Projects" at Indiana University during Spring 2017. Thanks to our Professor Gregor von Laszewski and associate instructors for their help and support during the course.

REFERENCES [1] Apache Software Foundation, “Apache mahout,” Web Page, Jun. 2016, accessed: 2017-3-26. [Online]. Available: http://mahout.apache.org/ [2] C. tao Chu, S. K. Kim, Y. an Lin, Y. Yu, G. Bradski, K. Olukotun, and A. Y. Ng, “Map-reduce for machine learning on multicore,” in Advances in Neural Information Processing Systems 19, P. B. Schölkopf, J. C. Platt, and T. Hoffman, Eds. MIT Press, 2007, pp. 281–288, accessed: 2017-3-26. [Online]. Available: http://papers.nips.cc/paper/ 3150-map-reduce-for-machine-learning-on-multicore.pdf [3] Apache Spark Community, “Apache spark,” Web Page, Feb. 2017, accessed: 2017-4-4. [Online]. Available: https://spark.apache.org/ [4] H20.ai, “H20,” Web Page, Jan. 2016, accessed: 2017-4-4. [Online]. Available: https://www.h2o.ai/h2o/why-h2o/ [5] Apache Flink Community, “Introduction to Apache Flink,” Web Page, Mar. 2016, accessed: 2017-4-4. [Online]. Available: https: //flink.apache.org/introduction.html [6] Apache Software Foundation, “Mahout Features by Engine,” Web Page, Jan. 2016, accessed: 2017-4-4. [Online]. Available: https: //mahout.apache.org/users/basics/algorithms.html [7] Apache Software Foundation, “Mahout samsara incore references,” Web Page, May 2016, accessed: 2017-3-26. [Online]. Available: http://mahout.apache.org/users/environment/in-core-reference.html [8] Apache Software Foundation, “Apache License 2.0,” Web Page, Jan. 2004, accessed: 2017-3-26. [Online]. Available: http: //www.apache.org/licenses/LICENSE-2.0 [9] D. Lyubimov and A. Palumbo, “Mahout 0.10.x: first mahout release as a programming environment,” Web Page, Apr. 2015, accessed: 2017-3-26. [Online]. Available: http://www.weatheringthroughtechdays.com/2015/ 04/mahout-010x-first-mahout-release-as.html [10] Apache Software Foundation, “Intro to cooccurrence recom- menders with spark,” Web Page, May 2016, accessed: 2017-3- 26. [Online]. Available: http://mahout.apache.org/users/algorithms/ intro-cooccurrence-spark.html [11] P. Giacomelli, Apache Mahout Cookbook. Packt Publishing, 2016, accessed: 2017-3-26. [12] D. Lyubimov and A. Palumbo, Apache Mahout: Beyond MapReduce. Createspace Independent Publishing Platform, 2016, accessed: 2017-3- 26. [13] Apache Software Foundation, “Mahout book, tutorials, talks,” Web Page, May 2016, accessed: 2017-3-26. [Online]. Available: https://mahout.apache.org/general/books-tutorials-and-talks.html


Neo4J

SOWMYA RAVI1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] project-000, April 30, 2017

Neo4J is a graph database designed for fast data access and management. The data is stored in the form of nodes and relationships in Neo4J. The unique approach it takes to storing data makes it efficient for data that contains a large number of relationships. Moreover, it has the ability to store trillions of data entries in a compact manner. Neo4J comes along with Cypher, a highly readable querying language. Neo4j achieves high efficiency and throughput through distributed computing, and its various clustering modes provide that distributed computing capability. This paper elaborates on the architecture of Neo4j and its uses [1]. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524

https://github.com/cloudmesh/classes/blob/master/docs/source/format/report/report.pdf

1. INTRODUCTION

Certain problems cannot be solved well with relational databases, for example a social graph representing the network of friends in a social networking website. In this case the number of relationships in the data is too extensive and relational databases perform poorly. Graph databases, on the other hand, make the task of storing large amounts of such data relatively simple and efficient. Neo4j is one such NoSQL graph database, developed to be used for the kind of problems mentioned before [2].

Neo4j is open source data management software. At its core, Neo4j stores data in the form of nodes and relationships. It is often deployed in a production environment as a fault tolerant cluster of machines. The high scalability and fast traversal times make it far more efficient than conventional relational databases [1].

2. CYPHER PROGRAMMING LANGUAGE

Neo4j uses its own programming language, Cypher, for data creation as well as querying. Cypher is capable of doing SQL-like actions. In addition, it can specifically perform a powerful kind of query called a traversal. A traversal involves moving along a specific set of nodes in the database, thereby tracing a path. This allows users to leverage the spatial structuring of the data to get valuable information, similar to network analysis [3].

3. CLUSTERING

This section discusses Neo4j's architecture with respect to clustering. Clustering is the process of grouping instances or items. In the context of Neo4j, a number of machines are clustered in a group. The machines are connected and can communicate with one another. Each machine can be visualized as a separate node in the cluster which performs a part or the whole of the task assigned by a process to Neo4j. Neo4j uses clustering of machines to achieve high throughput, availability and disaster recovery [4]. Neo4j offers two kinds of clustering:

1. Causal Clustering

2. Highly Available Clustering

Each type of cluster has a unique set of features. The mode of clustering is chosen according to the application and requirements.

3.1. Causal Clustering
The causal clustering of machines in Neo4j is aimed at providing two important features: safety and scalability [5]. For operational purposes, the cluster is usually separated into two components: the core servers and the read replicas. The architecture of causal clustering is shown in Fig. 1. All the write operations coming from the app servers are performed by the core servers, while data is read by the app servers from the read replicas. The following sub-sections detail the working of core servers and read replicas and also how they ensure safe and scalable data storage.


Fig. 1. Architecture of Causal Clustering [5]

3.1.1. Core Servers
The core servers are responsible for safe data storage. This is achieved by replicating all incoming queries/transactions using the Raft protocol (a log replication protocol) [5]. The protocol ensures the durability of data before committing to the query request. Usually, a transaction is accepted only when a majority of the servers, calculated as (N + 1)/2, have accepted it. This number is directly proportional to the number of core servers N. Hence, as the number of core servers grows, the size of the majority required for committing to an end user also increases [5].

In practice, a few machines in the core server cluster are enough to provide fault tolerance. This number is calculated using the formula N = 2F + 1, where N is the number of servers required to tolerate F faults [5]. When a core server suffers a large number of faults, it is automatically converted to a read-only server for safety purposes.

3.1.2. Read Replicas
Read replicas are Neo4j databases that scale out the incoming queries and procedures. They act like caches for the core servers, which safeguard the data. Even though the read replicas are full-fledged databases, they are equipped to perform arbitrary read-only activities [5].

Read replicas are created asynchronously by core servers through log shipping [5]. Log shipping occurs when the read replicas poll the core servers for new transactions and the transactions are shipped from the core servers to the read replicas. This polling occurs periodically. Usually, a small number of core servers ship out queries to a relatively large number of read replicas, allowing a large fan-out of workload and thereby achieving scalability [5]. The read replicas, unlike the core servers, do not participate in deciding the cluster topology.

3.1.3. Causal Consistency
In applications, data is generally read from a graph and written to a graph. In order to ensure causal consistency in the data, a write operation must take into account previous write operations. The causal consistency model for distributed computing requires every node in the system to see causally related operations in the same order. This model ensures that data can be written to core servers and that the written data can then be read from read replicas. Fig. 2 illustrates a causal cluster with causal consistency set via the Neo4j drivers.

Fig. 2. Causal Cluster with causal consistency set via Neo4j drivers [5]
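In practice, causal consistency is requested through the drivers by passing a bookmark from a write session to a later read session, so that the read is guaranteed to observe the earlier write. The following minimal Python sketch assumes the official Neo4j Python driver and a reachable causal cluster; the URI, credentials, and node label are placeholders, and the exact session keyword arguments can differ between driver versions.

from neo4j import GraphDatabase  # official Neo4j Python driver (assumed installed)

driver = GraphDatabase.driver("neo4j://cluster.example.com:7687",
                              auth=("neo4j", "password"))

# Write on a core server and capture the resulting bookmark.
with driver.session() as write_session:
    write_session.run("CREATE (:Person {name: $name})", name="Alice")
    bookmark = write_session.last_bookmark()

# A read session that carries the bookmark will not run until the chosen
# read replica has caught up to that write (causal consistency).
with driver.session(bookmarks=[bookmark]) as read_session:
    record = read_session.run("MATCH (p:Person {name: $name}) RETURN p.name AS name",
                              name="Alice").single()
    print(record["name"])

driver.close()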

4. HIGHLY AVAILABLE CLUSTER

The causal cluster discussed in the previous section focuses on safety and scalability, with a division of work between the core servers and the read replicas. The highly available cluster, in contrast, ensures continuous availability. In this type of cluster, each instance contains a full copy of the data in its local database. Thus each instance in the cluster is fully capable of performing all operations, thereby achieving high availability. The cluster can be visualized as a single master with multiple slaves, in which each instance is connected to every other instance (a 3-member cluster is shown in Fig. 3).

Fig. 3. A Highly Available cluster model [6]

Each instance also contains the logic to perform read/write operations and election management [6]. Every slave, excluding the Arbiter instance, periodically communicates with the master to keep its database up to date [6]. There is a special slave called the Arbiter, explained in the following section.

4.0.1. Arbiter Instance
The Arbiter instance is a special slave that participates in cluster management activities but does not contain any replicated data. It simply runs the Neo4j software in arbiter mode [6].


4.0.2. Transaction Propagation
Write transactions performed directly on the master will be pushed to the slaves once the transaction is successful. When a write transaction is performed on a slave, the slave synchronizes with the master after each write operation. The write operation on a slave is always performed after ensuring that the slave is synchronized with the master [6].

5. USE CASES

Neo4j can be used for fraud detection, network and IT operations, real-time recommendation systems, social networks, and identity and access management [7]. A sample use case in fraud detection is described below.

5.0.1. Neo4j for First Party Fraud Detection
First party fraud occurs when a person uses illegitimate information to secure a credit card. Often two or more people share certain information to create multiple identities and operate as a ring. Neo4j is a great tool to detect such fraud rings. In Neo4j, all the customer-related data is stored as a directed graph. When more than two people share the same contact information, like an address or SSN, and have outstanding loans or credit, this may potentially be a fraud ring. It is possible to discover such connections by writing simple queries in Cypher. In relational databases, this network is spread across tables with rows and columns, and complex joins have to be performed to find a fraud ring; joins are expensive operations. Moreover, in real-time analysis scalability could be a serious issue. Neo4j achieves scalability when operated in causal clustering mode. The simplicity and robustness of Neo4j make it a great tool for fraud detection [8].
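As an illustration of the kind of Cypher query involved, the sketch below looks for pairs of customers that point at the same SSN node and could therefore belong to a fraud ring. The node labels, relationship type and property names (Person, SSN, HAS_SSN, name, number) are a hypothetical schema, not something Neo4j prescribes, and the connection details are placeholders.

from neo4j import GraphDatabase  # official Neo4j Python driver (assumed installed)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

FRAUD_RING_QUERY = """
MATCH (p1:Person)-[:HAS_SSN]->(ssn:SSN)<-[:HAS_SSN]-(p2:Person)
WHERE id(p1) < id(p2)  // report each suspicious pair only once
RETURN p1.name AS person1, p2.name AS person2, ssn.number AS shared_ssn
"""

with driver.session() as session:
    for record in session.run(FRAUD_RING_QUERY):
        print(record["person1"], record["person2"], record["shared_ssn"])

driver.close()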

6. CONCLUSION

Neo4j, being open source, graph-based and highly scalable, is suitable for applications that deal with data containing numerous relationships. It offers two different clustering modes it can operate in, each focusing on a certain feature. In addition, Cypher, the query language, is very simple and finds relationships without performing costly operations like joins. The simplicity and versatility of Neo4j make it a great software aid for data scientists trying to analyze relationships and networks in real time as well as in batch.

REFERENCES
[1] Neo4j, "Chapter 1. introduction," Web Page, last accessed: 03.24.2017. [Online]. Available: https://neo4j.com/docs/operations-manual/current/introduction/
[2] S. Haines, "Introduction to neo4j," Web Page, last accessed: 03.24.2017. [Online]. Available: http://www.informit.com/articles/article.aspx?p=2415370
[3] R. Ostman, "Graphical database of citation network analysis," PDF Document, last accessed: 03.24.2017. [Online]. Available: http://webdocs.ischool.illinois.edu/crt/ostman.pdf
[4] Neo4j, "Chapter 4. clustering," Web Page, last accessed: 2017.02.24. [Online]. Available: https://neo4j.com/docs/operations-manual/current/clustering/
[5] Neo4j, "Causal cluster," Web Page, last accessed: 2017.03.24. [Online]. Available: https://neo4j.com/docs/operations-manual/current/clustering/causal-clustering/introduction/
[6] Neo4j, "Highly available cluster," Web Page, last accessed: 2017.03.24. [Online]. Available: https://neo4j.com/docs/operations-manual/current/clustering/high-availability/architecture/
[7] Neo4j, "Graph database use cases," Web Page, last accessed: 2017.03.24. [Online]. Available: https://neo4j.com/use-cases/
[8] P. Sadowski and G. Rathle, Fraud Detection: Discovering Connections with Graph Databases. Neo4j, 2015, last accessed: 04.08.2017.

AUTHOR BIOGRAPHIES
Sowmya Ravi is pursuing a Masters in Data Science at Indiana University. Her research interests include Machine Learning, Data Mining and Big Data Analytics.


OpenStack Nova: Compute Service of OpenStack Cloud

KUMAR SATYAM1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] paper-2, April 30, 2017

OpenStack Nova is the compute service of the OpenStack cloud system. It is designed to manage and automate pools of computer resources and can work on bare metal and high performance computing. It is written in Python. We will discuss the main components included in the Nova architecture [1].

© 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud,OpenStack,Nova, API, I524

https://github.com/satyamsah/sp17-i524/blob/master/paper2/S17-IR-2031/report.pdf

1. INTRODUCTION

Nova is responsible for spawning VM instances in the OpenStack environment. It is built on a messaging architecture which runs on several servers. This architecture allows components to communicate through a messaging queue. Nova shares a centralized SQL-based database for smaller deployments. For large deployments, an aggregation is used to manage data across multiple data stores [2].

2. ARCHITECTURE - OPENSTACK NOVA

Fig. 1. Architecture of OpenStack cloud using Nova [3]

3. NOVA FRAMEWORK

Nova is comprised of multiple server processes, each performing different functions. The user interface that OpenStack provides for Nova is a REST API. During invocation of the API, Nova communicates via an RPC (Remote Procedure Call) message passing mechanism. As shown in Fig. 1, it interacts with other OpenStack components like Keystone, Glance, Cinder, etc.

The API servers process REST requests, which typically involve database reads/writes. RPC messaging is done via the oslo.messaging library. Most of the Nova components can run on different servers and have a manager that listens for RPC messages. One of the components is Nova Compute, where a single process runs on the hypervisor it is managing. Nova has a centralized database that is logically shared between all components [4].

4. NOVA COMPONENTS

Below are the major components of Nova:

DB: An SQL database for data storage. This is a SQLAlchemy-compatible database. The database name is 'nova'. The 'nova-conductor' service is the only service that writes to the database.


API: The component that receives HTTP requests, converts commands and communicates with other components via the oslo.messaging queue or HTTP. The 'python-novaclient 7.1.0' is a Python binding to the OpenStack Nova API which acts as a client for it. The nova-api service provides an endpoint for all API queries (either OpenStack API or EC2 API), initiates most of the orchestration activities (such as running an instance) and also enforces some policy (mostly quota checks).

Scheduler: The 'nova-scheduler' is a service that determines how to dispatch compute requests. For example, it determines on which host a VM should launch. Compute is configured with default scheduler options in the /etc/nova/nova.conf file [5].

Network: It manages IP forwarding, bridges, and VLANs. The network controller with nova-network provides virtual networks to enable compute servers to interact with each other and with the public network. Compute with nova-network supports the following modes, which are implemented as Network Manager types:
• Flat Network Manager
• Flat DHCP Network Manager
• VLAN Network Manager

Volume: It manages creation, attaching and detaching of persistent volumes to compute instances. This functionality is being migrated to Cinder, a separate OpenStack service.

Compute: It manages communication with the hypervisor and virtual machines.

Conductor: It handles requests that need coordination, acts as a database proxy, or handles object conversions.

5. MAJOR TASKS PERFORMED BY OPENSTACK NOVA

The Nova service performs several major tasks [6]. Some of them are authenticating against the nova-api endpoint, listing instances, retrieving an instance by name and shutting it down, booting an instance and checking its status, attaching a floating IP address, changing a security group, and retrieving console login.
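A few of these tasks can be scripted against the Nova API with the python-novaclient binding mentioned above. The sketch below follows the session-based pattern from the novaclient documentation; the Keystone URL, credentials, and the image and flavor IDs are placeholders and will differ per deployment.

from keystoneauth1 import loading, session
from novaclient import client

# Placeholder credentials; real values normally come from an OpenStack RC file.
loader = loading.get_plugin_loader("password")
auth = loader.load_from_options(auth_url="http://controller:5000/v3",
                                username="demo", password="secret",
                                project_name="demo",
                                user_domain_name="Default",
                                project_domain_name="Default")
nova = client.Client("2", session=session.Session(auth=auth))

# List the instances visible to this project.
for server in nova.servers.list():
    print(server.id, server.name, server.status)

# Boot an instance from known image and flavor IDs (placeholders).
nova.servers.create(name="test-vm", image="IMAGE_UUID", flavor="FLAVOR_ID")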

6. NOVA AND OTHER OPENSTACK SERVICES

Nova is the main OpenStack service, as it interacts with all the other services to set up the IaaS stack. Nova interacts with the Glance service to provide images for VM provisioning. It also interacts with the queue service via the nova-scheduler and nova-conductor APIs. It interacts with Keystone for authentication and authorization services, and with Horizon for the web interface.

7. NOVA DEPENDENCE ON AMQP

AMQP is the messaging technology chosen by the OpenStack cloud. The AMQP broker sits between any two Nova components and allows them to communicate in a loosely coupled fashion. Nova components use RPC to communicate with one another. It is built on the pub/sub paradigm to gain its benefits. Decoupling between client and server (such that the client does not need to know where the servant's reference is) is a major advantage of AMQP [7].

8. ORCHESTRATION TASKS IN NOVA

The nova-conductor service plays an important role in managing the execution of workflows which involve the scheduler. Rebuilding, resizing, and building instances are managed here. This was done in order to have a better separation of responsibilities between what compute nodes should handle and what the scheduler should handle, and to clean up the path of execution. In order to query the scheduler in a synchronous manner, the query needed to happen after the API had returned a response, otherwise API response times would increase, and that is why the conductor was chosen. Changing the scheduler call from asynchronous to synchronous also helped to clean up the code [8].

The earlier logic was complicated and the scheduling logic was distributed across the code. The earlier process was changed to the following sequence. Firstly, the API receives a request to build an instance. Then, the API sends an RPC cast to the conductor to build an instance. Next, the conductor sends an RPC call to the scheduler to pick a compute node and waits for the response. If the scheduler fails, the build stops at the conductor. Lastly, the conductor sends an RPC cast to the compute node to build the instance. If the build succeeds, it stops here. If it fails, the compute node sends an RPC cast to the conductor to build an instance; this is the same RPC message that was sent by the API.

9. OPENSTACK NOVA SUPPORT

Earlier, OpenStack was supported on KVM, but its support has been extended to QEMU. Microsoft Hyper-V and VMware ESXi also provide extended support. Nova has support for XenServer and XCP through the XenAPI virtual layer. It also supports bare metal deployment and provisioning through 'Ironic'. This means it is possible to deploy machines directly; by default, it will use PXE and IPMI to provision and turn the machines on and off. From the Ironic version it also supports vendor-specific plugins which may implement additional functionality.

10. OPENSTACK NOVA IN BIG DATA SERVICES

A use case has been developed to leverage OpenStack to perform big data analytics. OpenStack used the Cassandra database for columnar data structures and PostgreSQL for relational data structures. This use case also allows us to configure the Hadoop distributed file system for large unstructured data. OpenStack Sahara has come up with core cloud components for big data [9].

11. OPENSTACK NOVA AND OTHER COMPETITORS

OpenStack Nova performs the same task as AWS EC2, Google CE and Microsoft Azure VM. With respect to benchmarking, the main difference is that the cloud administrator can upload their own images in OpenStack, whereas in AWS one needs to use the pre-defined list. AWS is used mainly as a public cloud, whereas OpenStack can be used as a private cloud. OpenStack also has the advantage of allowing a custom cloud configuration, which is not available in any of the vendor-specific clouds.

12. CONCLUSION

Here, we discussed the main components of the OpenStack compute layer, Nova, and how it interacts with other OpenStack services like Swift, Horizon, etc. We also showed a use case of running a big data problem on OpenStack, and discussed the compute service offerings provided by cloud vendors other than OpenStack.

This helped us to understand the overall picture of Nova and how it fits in to offer a unique enterprise-level IaaS cloud solution.

REFERENCES [1] Wikipedia, “Openstack nova wikipedia,” accessed: 03-23-2017. [Online]. Available: https://en.wikipedia.org/wiki/OpenStack [2] OpenStack, “Openstack nova official,” accessed: 03-23-2017. [Online]. Available: https://docs.openstack.org/developer/nova/architecture.html [3] Wikipedia, “Openstack nova bigdata,” accessed: 03-23-2017. [Online]. Available: https://en.wikipedia.org/wiki/OpenStack [4] P. Website, “Openstack nova on pepple website,” accessed: 03-23- 2017. [Online]. Available: http://ken.pepple.info/openstack/2011/04/22/ openstack-nova-architecture/ [5] O. Website, “Openstack nova schedular,” accessed: 03-23-2017. [Online]. Available: https://docs.openstack.org/kilo/config-reference/ content/section_compute-scheduler.html [6] I. Developers, “Openstack nova ,” accessed: 03-23-2017. [Online]. Available: https://www.ibm.com/developerworks/cloud/library/ cl-openstack-pythonapis/ [7] O. Website, “Openstack nova amqp,” accessed: 03-23-2017. [Online]. Available: https://docs.openstack.org/developer/nova/rpc.html [8] O. Website, “Openstack nova orchestrator,” accessed: 03-23- 2017. [Online]. Available: https://docs.openstack.org/developer/nova/ conductor.html [9] O. Website, “Openstack nova bigdata,” accessed: 03- 23-2017. [Online]. Available: https://www.openstack.org/ summit/san-diego-2012/openstack-summit-sessions/presentation/ big-data-on-openstack-a-rackspace-use-case


Heroku

YATIN SHARMA1,*, +

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] +HID - S17-IR-2034

Paper-002, April 30, 2017

Heroku is a web application hosting platform-as-a-service cloud that enables developers to build and deploy applications. It supports multiple programming languages including Ruby, Java, Node.js and Python. It is a simple and modular platform that allows developers to focus less on infrastructure and deployment and more on coding. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Heroku, PaaS, Dyno, Dyno Manifold, Logplex, Toolbelt client, Procfile.

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-IR-2034/report.pdf

1. INTRODUCTION

Heroku [1] is a Platform as a Service (PaaS) [2] that enables us to build and deploy applications in various programming languages including Java, Python, Ruby, etc. It allows us to deploy web applications seamlessly, as well as monitor them and share them with other developers instantly. It enables rapid application development for the cloud, using the underlying platform infrastructure and software add-ons to build, deploy and monitor large and scalable web applications. It is built on top of Amazon Web Services and is owned by Salesforce.com [3].

2. BENEFITS

Heroku is language agnostic and provides great flexibility in choosing an appropriate programming language to develop the web application. It has core support for Ruby, Java, Python, Node.js, Scala and Play, Go, and PHP. Apart from these, any other language can also be supported by using a feature called buildpacks. Heroku provides a lot of flexibility in managing the applications after deployment, using the Heroku command line tool running on the client machine or on the Heroku infrastructure. Heroku uses Git [4] as the means for deploying applications to its servers. Being owned by Salesforce.com [3], it has a feature called Heroku Connect enabling interaction between the applications and the Salesforce API.

3. ARCHITECTURE

The Heroku architecture consists of a platform stack containing various runtime libraries, the OS, and the underlying infrastructure. A Heroku application can be thought of as multiple processes, each consuming resources like a normal UNIX process, that run on the Heroku dyno manifold. Heroku defines each process through a configuration file called a Procfile, which is a text file placed in the root of the application that describes how the application will run. Fig. 1 below describes the high level architecture of the Heroku platform.

Fig. 1. High Level Architecture of Heroku platform

3.1. Process Management
The unit of work in the Heroku framework is called a dyno. It can be thought of as a packaged running version of the code that the application interacts with. Dynos are responsible for receiving web requests, writing output and connecting to application resources such as databases. They are fully isolated containers running on the dyno manifold, which is the building block of the execution environment. The dyno manifold is responsible for the process management, isolation, scaling, routing, and distribution that are necessary to run the web and worker processes. It is also fault tolerant and distributed in nature. If any dyno fails, the manifold restarts it automatically, removing a lot of maintenance hassles. Dynos are capable of serving many requests per second and execute in complete isolation from each other.

Each dyno gets its own virtual environment that it can use to handle its own client requests. Dynos use LXC to provide container-like behavior and achieve complete isolation from one another. There is a memory restriction of 1.5 GB per dyno, beyond which the dyno is rebooted with a Heroku error; exceeding the limit may indicate a memory leak.

3.2. Execution Flow
A process type is the declaration of the command that defines the structure to be used while instantiating a process. Heroku has two process types: the web process, which is responsible for handling HTTP client requests, and the worker process, which is responsible for executing other tasks such as running background jobs and queuing tasks.

3.3. Logging Architecture
The Heroku Logplex system provides a flexible facility by giving us an overall view of the application's runtime behavior. It forms the basis of the Heroku logging infrastructure. It routes log streams from various sources into a single output (for example an archival system). The Logplex system keeps the most recent data (typically 1500 log lines), which is important for extracting relevant information from the application being run.

3.4. HTTP Routing
The routing mesh is responsible for routing web requests to the appropriate web process dyno. It is also responsible for locating the application's web dynos within the dyno manifold and forwarding the HTTP web requests to one of them. The routing mesh uses a round robin algorithm to distribute the requests across the various dynos. Since the dynos could be running in a distributed manner, the routing mesh manages the access and routing internally and none of the dynos are statically addressable. Heroku also supports multithreaded and asynchronous applications, accepting many network connections to process client requests.

3.5. Heroku Interfaces
Heroku provides the developer with the flexibility to control various aspects of the application through various control surfaces such as process management, routing, logging, scaling, configuration, and deployment. These are available as a command-line interface (CLI), a web-based console, and a full REST API.

4. ADDONS

Heroku supports a vast number of third party add-ons [5] that any Heroku user can instantly provision as an attachable resource for their application. For example, if the application needs a PostgreSQL [6] database, this can be provisioned in Heroku. Add-ons work through Heroku's environment variables. Each time an add-on is added to the application, the application will automatically be assigned new environment variables which specify any configuration required to interact with the new add-on. Just like dynos, add-ons are easy to add, upgrade, downgrade, and remove without requiring any downtime for the application.

5. HEROKU PLATFORM API

The Heroku Platform API is a tool that allows developers to call Heroku platform services, create applications, and plug in new add-ons by simply using HTTP. It gives developers complete control over their application. The three components that define the behavior of the API are: 1) Security, 2) Schema, and 3) Data. The client accesses the API using standard methods defined for HTTP. The API then acts on the request and returns the result in JSON format.
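For example, listing an account's applications is a single authenticated HTTP GET against the platform API, with the result returned as JSON. The Python sketch below uses the requests library; the API token is a placeholder read from an environment variable.

import os
import requests

token = os.environ.get("HEROKU_API_KEY", "<api-token>")  # placeholder token

response = requests.get(
    "https://api.heroku.com/apps",
    headers={
        "Accept": "application/vnd.heroku+json; version=3",  # select API version 3
        "Authorization": "Bearer " + token,
    },
)
response.raise_for_status()

for app in response.json():  # JSON list of application objects
    print(app["name"], app["web_url"])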
service as well as instant scaling power allows scaling up to support demand from users, and scaling down when high traffic has stopped. Heroku also supports multithreaded and asynchronous applications, accepting many network connections to process client requests.
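To make the web and worker process types described above concrete, the following is a minimal sketch of a web process together with the Procfile that would declare it. It assumes a Node.js/TypeScript application whose compiled entry point is dist/server.js; the file names and the fallback port are illustrative assumptions, not details from the paper.

```typescript
// Procfile (placed in the application's root directory), declaring process types:
//   web: node dist/server.js
//   worker: node dist/worker.js
import * as http from "http";

// A web dyno must bind to the port Heroku injects through the PORT variable.
const port = Number(process.env.PORT) || 3000;

http
  .createServer((req, res) => {
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end("Hello from a Heroku web dyno\n");
  })
  .listen(port, () => {
    console.log(`web process listening on ${port}`);
  });
```

Scaling this process type up or down only changes the number of dynos running it; the code itself stays unchanged.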

3.5. Heroku Interfaces ACKNOWLEDGEMENTS Heroku provides the developer with the flexibility to control var- The author thanks Prof. Gregor von Laszewski for his technical ious aspects of the application through various control surfaces guidance. such as process management, routing, logging, scaling, configu- ration, and deployment. These are available as a command-line interface (CLI), a web-based console, and full REST API. REFERENCES [1] “Cloud application platform | heroku,” Web Page, online; accessed 4. ADDONS 13-Mar-2017. [Online]. Available: https://www.heroku.com/ [2] “Platform as a service - wikipedia,” Web Page, online; accessed Heroku supports vast amount of third party addons[5] that any 13-Mar-2017. [Online]. Available: https://en.wikipedia.org/wiki/Platform_ Heroku user can instantly provision as an attachable resource as_a_service for their application. For example, if the application needs a [3] “Salesforce.com | the customer sucsess platform to grow your PostgreSQL[6] database, it can be done in Heroku. Addons business,” Web Page, online; accessed 3-April-2017. [Online]. Available: work through Heroku’s environment variables. Each time an https://www.salesforce.com/ addon is added to the application, the application will automati- [4] “Github,” Web Page, online; accessed 13-Mar-2017. [Online]. Available: cally be assigned new environment variables which specify any https://github.com/ [5] “Add-ons - heroku elements,” Web Page, online; accessed 3-April-2017. configurations required to interact with the new addon. Just [Online]. Available: https://elements.heroku.com/addons like Dynos, Addons are easy to add, up-grade, downgrade, and [6] “Heroku postgres - add-ons - heroku elements,” Web Page, online; remove without requiring any downtime for the applications. accessed 3-April-2017. [Online]. Available: https://elements.heroku.com/ addons/heroku-postgresql 5. HEROKU PLATFORM API [7] “Secure shell- wikipedia,” Web Page, online; accessed 17-Mar-2017. [Online]. Available: https://en.wikipedia.org/wiki/Secure_Shell Heroku platform API is a tool that allows to call Heroku platform [8] “Heroku cli | heroku dev center,” Web Page, online; accessed services, create applications, and plug in new add-ons by simply 17-Mar-2017. [Online]. Available: https://devcenter.heroku.com/articles/ using HTTP. It gives the developers complete control over their heroku-cli
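As a hedged illustration of the Platform API described in Section 5, the sketch below lists an account's applications over plain HTTPS. The api.heroku.com host and the version-3 Accept header follow Heroku's public API convention; the environment variable name and the use of a global fetch (available in recent Node.js runtimes) are assumptions.

```typescript
// Hedged sketch: calling the Heroku Platform API (v3) over HTTPS.
// HEROKU_API_TOKEN is an assumed environment variable holding an OAuth token.
const token = process.env.HEROKU_API_TOKEN;

async function listApps(): Promise<void> {
  const res = await fetch("https://api.heroku.com/apps", {
    headers: {
      Accept: "application/vnd.heroku+json; version=3", // request API version 3
      Authorization: `Bearer ${token}`,
    },
  });
  // Results come back as JSON, as described in the text.
  const apps = (await res.json()) as Array<{ name: string }>;
  for (const app of apps) {
    console.log(app.name);
  }
}

listApps().catch(console.error);
```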

134 Review Article Spring 2017 - I524 3

[9] “Getting started on heroku | heroku dev center,” Web Page, online; accessed 3-April-2017. [Online]. Available: https://devcenter.heroku. com/start

135 Review Article Spring 2017 - I524 1

D3

PIYUSH SHINDE1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

Paper-2, April 30, 2017

Data Driven Documents (D3) is an open source JavaScript library used to create dynamic, interactive visualizations on a webpage. D3 uses HTML, SVG and CSS to create visualizations with a data-driven approach to Document Object Model (DOM) manipulation, enabling users to utilize the full capabilities of modern browsers and the freedom to design the right visual interface for their data [1]. This paper provides a brief introduction to D3 and its various features. It also discusses D3’s use cases, advantages and limitations. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Document Object Model, Data Visualization, Chart

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2035/report.pdf

1. INTRODUCTION 2. FEATURES

Several graphical forms can be used to envision large quantita- D3’s basic operand is the selection. Operators act on selections, tive data sets such as graphs, charts, maps and diagrams. These modifying content. Data joins bind input data to elements, pro- data sets are increasing exponentially since the advent of the ducing enter and exit sub-selections for the creation and de- World Wide Web, making the task of efficient data visualization struction of elements with respect to data. Animated transitions more challenging. As a result, web browsers are considered as interpolate attributes and styles smoothly over time [2]. ideal platforms for visualizing large data sets. 2.1. Selections Designers often employ multiple tools simultaneously for building visualizations like HTML for page content, CSS for aes- Selections allow the user to select and manipulate HTML ele- thetics, JavaScript for interaction, SVG (Scalable Vector Graphics) ments in a very simple way. D3 adopts the W3C Selectors API for vector graphics [2]. One of the most important advantages is [6] that contain predicates to select elements by tag, class, unique the uninterrupted cooperation between most of the technologies identifier, attribute, containment, or adjacency [7]. Unique selec- using the web as a platform, which is enabled by a document tion methods like union and intersection can be used on these object model (DOM). The DOM reveals the ordered structure of predicates. Multiple operations can be performed after selecting a web page, aiding it’s manipulations. an element by chaining the operations. D3 provides select and selectAll methods for single and mul- Data Driven Documents (D3) is an open source JavaScript tiple element selections. The former selects only the first element library, which helps in efficient manipulation of the document that matches the predicates, while the latter selects all matching object models based on data. It was released in 2011 by Heer, elements in document traversal order [2]. Ogievetsky and Bostock, then part of Computer Science Depart- ment of Stanford University. D3 provides a toolkit for visualizing 2.2. Data data using web standards such as HTML, SVG, and CSS.[1]. Styles, attributes, and other properties are represented as func- D3 resembles to document transformers such as jQuery tions of data in D3. They are simple, but powerful. D3 provides [3] that simplify the act of document transformation in web many built-in reusable functions and function factories, such as browsers. D3 uses the DOM’s standard SVG syntax, that shows graphical primitives for area, line, and pie charts. similarities with the graphical abstractions used in graphics The data operator binds input data to selected nodes. Data libraries such as Processing and Raphaël [4]. D3 is a generaliza- is specified as an array of values such as numbers, strings or tion of Protovis [5], and through helper modules more complex objects, and each value is passed as the first argument, along visualizations can be achieved efficiently. with the numeric index to selection functions. By default, data D3’s main features include selections, transitions and update, is joined to elements by index, the first element in the data array enter and exit functions. We will glance through these features is passed to the first node in the selection, the second element to in the next section. the second node, and so on. Once the data has been bound to

136 Review Article Spring 2017 - I524 2 the document, you can omit the data operator, D3 retrieves the Many of them are accompanied by a full documentation of previously-bound data. This allows you to recompute properties the steps to manipulate documents based on data. The most without rebinding [7]. commonly used types of data visualization are histogram, area chart, line chart, multi series line chart, bar chart, scatter-plot, pie 2.3. Enter, Update and Exit chart, graphs, trees and maps [8]. D3 also shows visualizations D3’s Enter and Exit selections, aid users to create new nodes for updating in real time. incoming data and remove outgoing nodes that are no longer Here are three real world examples of simplified data visual- required. izations using D3. When data is bound to a selection, each element in the data array is paired with the corresponding node in the selection. In 3.1. Wimbledon 2013 Data Visualisations case of fewer nodes as compared to elements of the data array, This example displays a series of ten different data visualizations the extra data elements form the enter selection, that can be of the Wimbledon tennis tournament created by Peter Cook [9]. instantiated by appending to the enter selection as shown in The original data was collected from British tennis data website figure 1. [10]. Updating nodes are the default selection—the result of the Amongst the 10 visualizations, 3 of them include a circular data operator [7]. In case of skipped enter and exit selections, D3 match tree, a bubble chart and a horizontal histogram. The automatically selects only the elements for which corresponding circular match tree displays concentric circles labelled as the data exists. Properties that are constant for the life of the element different rounds of the Wimbledon 2013 as shown in figure 2. are set once on enter, while dynamic properties are recomputed Each concentric circle displays the player names. The result was per update [2]. The initial selection can be divided into three of each match in each round is displayed by hovering the mouse parts: the updating nodes to modify, the entering nodes to add, over the names. The winner’s name is displayed in the center of and the exiting nodes to remove. Handling these three cases the circle. separately, precisely define the operations that run on each node, thereby optimizing performance and offering greater control over transitions.
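A minimal sketch of the selection, data-join and enter/update/exit pattern described in Sections 2.1-2.3, written against a D3 v4-style API; the choice of elements and the data values are illustrative assumptions.

```typescript
import * as d3 from "d3";

const data = [4, 8, 15, 16, 23, 42];

// Data join: pair the array with any existing <p> elements in <body>.
const update = d3.select("body").selectAll("p").data(data);

// Enter: data values without a matching node get a new <p> appended.
update
  .enter()
  .append("p")
  .merge(update) // operate on entering and updating nodes together
  .text((d) => `value: ${d}`);

// Exit: nodes left over after the join are removed from the document.
update.exit().remove();
```

Handling the three sub-selections separately is what lets D3 run only the operations each node actually needs.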

Fig. 1. New data () joining old nodes (orange) results in three sub-selections: enter, update and exit [7].

Fig. 2. Wimbledon 2013: Circular Match Tree [11].

The bubble chart displays pairs of bubbles connected by an arrow as shown in figure 3. Each arrow indicates a match where the winner had a lower ATP ranking than the loser. The size of each bubble is proportional to the ATP points of the player. Hovering the mouse over a bubble displays both the players and their ATP points with the result of the match. The histogram displays the list of the top 32 players of Wimbledon 2013 as shown in figure 4. It displays the number of matches won, sets won, games won and ATP ranking of all the 32 players by clicking on the respective tabs [13].

2.4. Transitions
D3’s focus on transformation extends naturally to animated transitions. Transitions gradually interpolate styles and attributes over time. D3’s interpolators support both primitives, such as numbers and numbers embedded within strings (font sizes, path data, etc.), and compound values. D3’s interpolator registry can even be extended to support complex data structures. D3 reduces overhead by adjusting only the attributes that change and allows greater graphical complexity at high frame rates. D3 allows sequencing of complex transitions via events. D3 does not replace the browser’s toolbox, but exposes it in a way that is easier to use [7].
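A correspondingly small sketch of the transition API described in Section 2.4; the selector, duration and target values are arbitrary assumptions.

```typescript
import * as d3 from "d3";

// Animate every circle's radius and fill colour over 750 ms.
// D3 interpolates numbers, colours and numbers embedded in strings automatically.
d3.selectAll("circle")
  .transition()
  .duration(750)
  .attr("r", 20)
  .style("fill", "steelblue");
```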

3. USE CASES
The examples tab of the official D3 website displays more than 400 examples of data visualizations, built using D3 [8]. Examples include visualizations for newspapers, games, libraries and tools, suggesting D3’s reliability and ability to transform even the most complex data into clear visualizations. These visualizations run interactively inside web browsers and are available for anyone who wants to visualize data.

3.2. Earthquakes in Chile since 1900
This example helps us visualize the most important seismic events in Chile since 1900 [14]. The earthquake data was retrieved from the ANSS Composite Catalog and joined with the Centennial Catalog from the USGS. A seismic event is displayed as a circle with its occurrence year as shown in figure 5. The radius and the color of the circles are a function of the earthquake magnitude.

137 Review Article Spring 2017 - I524 3

Fig. 3. Wimbledon 2013: Bubble Chart [12]. Fig. 5. Earthquakes in Chile since 1900: Map [14].

Fig. 4. Wimbledon 2013: Top 32 Players [13].

3.3. Visualizing the Racial Divide This example displays a visualization that highlights the impact of the segregation that exists in many US cities today based on race using d3 and force-directed maps [15]. The cities include Milwaukee, Chicago, St. Louis, Syracuse, Dayton, Pittsburgh, Denver, Columbus, Kansas City, Oklahoma City, Wichita, Mem- phis, Baltimore and Charleston. Each city is made up of tracts from the 2010 Census. Fig. 6. Visualizing The Racial Divide: Chicago [16]. The Census tracts are pushed away from neighboring tracts based on the change in proportion of white and black popula- tions between each neighboring tract as shown in figure 6. ble with browser’s built-in debugger. This facilitates it’s fast and Tracts having similar racial mix as their form easy debugging. groups. Spaces occur where there is a significant change in Amongst several advantages D3 has a few limitations. It’s the racial makeup between neighboring tracts. The space is pro- learning curve can be fairly steep. This is partly explained by portional to the change in racial composition between neighbors. the need for in depth knowledge of web standards, especially of SVG. Furthermore, the helper modules required for data trans- 4. BENEFITS AND LIMITATIONS formations need studying too. D3 is an open source project, with it’s source code readily avail- Unlike JavaScript libraries such as Google Charts, D3 does able on Mike Bostock’s github account [17]. It is free to use, not offer in-built (standard) charts. This makes it less efficient as which sets it apart from similar data visualization JavaScript compared to other tool-kits with custom graphs. libraries such as amCharts and FusionCharts, which need paid licenses. D3’s development is supported by a big online commu- 5. USEFUL RESOURCES nity, where users can contribute to it’s examples making it an exhaustive resource of various data visualizations. The complete documentation of D3.js, including instructions to D3’s is also a drawing library unlike conventional data vi- download and install it’s latest release, examples, and several sualization tool-kits, that simply create data visualizations. D3 tutorials, in a systematic method are available on the main web- supports dynamic visualizations, as opposed to mere static visu- site of D3.js [7]. It is also a useful resource to D3’s core concepts, alizations. techniques, blogs, books, courses, talks, videos and meetups D3 can be started using just one line of code, which is not [18]. the case with other data visualization tool-kits, often requiring The paper titled “D3: Data-Driven Documents”, compares a long installation procedure and periodic updating. D3’s com- D3 to existing web-based methods for visualization. It demon- patibility with web standards provide an added advantage that strates how D3 is at least twice as fast as Protovis, through visualizations can be shared and viewed without the need for performance benchmarks. It also describes D3’s potential for additional plug-ins. It is based on JavaScript which is compati- dynamic visualization [2].

138 Review Article Spring 2017 - I524 4

Another website provides a complete path to create interac- [16] Jim Vallandingham, “Visualizing The Racial Divide,” Web Page, tive visualization using D3.js [19]. It provides few real world accessed: 2017-03-24. [Online]. Available: http://vallandingham.me/ examples and steps to create basic pie charts, animated bar charts racial_divide/#ch and map. [17] Mike Bostock, “Mike Bostock,” Web Page, accessed: 2017-03-24. [Online]. Available: https://github.com/mbostock [18] Mike Bostock, “Tutorials,” Web Page, accessed: 2017-03-25. [Online]. 6. CONCLUSION Available: https://github.com/d3/d3/wiki/Tutorials The increasing popularity of JavaScript, has caused a major shift [19] Analytics Vidhya, “Newbie to D3.js Expert: Complete path to create interactive visualization using D3.js,” Web Page, accessed: in the direction of web development with reduced dependencies 2017-03-24. [Online]. Available: https://www.analyticsvidhya.com/ on plug-ins. Consequently, developers are trying rely on the web learning-paths-data-science-business-analytics-business-intelligence-big-data/ browsers alone to avoid dependence on external plug-ins. This newbie-d3-js-expert-complete-path-create-interactive-visualization-d3-js/ reduces the possibility of bugs or incompatibility issues as well the need to update. As an added benefit, developers can work with the existing web standards, increasing D3’s compatibility with technologies like HTML, CSS and JavaScript. This makes preparation of data visualizations easy and readily available to everyone. D3.js is an toolkit for efficiently creating complex visualizations based on large datasets. It can be used for solving real world problems by creating visualisations to better infer trends in complex large data sets.

7. ACKNOWLEDGMENTS This project was a part of the Big Data and Software and Projects (INFO-I524) course. I would like to thank Professor Gregor von Laszewski and the associate instructors for their help and support during the course.

REFERENCES [1] Mike Bostock, “Home,” Web Page, accessed: 2017-03-24. [Online]. Available: https://github.com/d3/d3/wiki [2] M. Bostock, V. Ogievetsky, and J. Heer, “D3: Data-driven documents,” IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2011. [Online]. Available: http://vis.stanford.edu/papers/d3 [3] jQuery Foundation, “jQuery,” Web Page, accessed: 2017-03-24. [Online]. Available: https://jquery.com/ [4] Dmitry Baranovskiy, “Raphaël,” Web Page, accessed: 2017-03-24. [Online]. Available: http://dmitrybaranovskiy.github.io/raphael/ [5] Mike Bostock, “Protovis,” Web Page, accessed: 2017-03-24. [Online]. Available: http://mbostock.github.io/protovis/ [6] Anne van Kesteren and Lachlan Hunt, “W3C,” Web Page, accessed: 2017-03-24. [Online]. Available: https://www.w3.org/TR/selectors-api/ [7] Mike Bostock, “D3 Data Driven Documents,” Web Page, accessed: 2017-03-24. [Online]. Available: https://d3js.org/ [8] Mike Bostock, “Gallery,” Web Page, accessed: 2017-03-24. [Online]. Available: https://github.com/d3/d3/wiki/Gallery [9] Peter Cook, “Wimbledon 2013 Data Visualisations,” Web Page, accessed: 2017-03-24. [Online]. Available: http://charts.animateddata. co.uk/tennis/index.html [10] Joseph, “Tennis-Data.co.uk,” Web Page, accessed: 2017-03-24. [Online]. Available: http://www.tennis-data.co.uk/wimbledon.php [11] Peter Cook, “Wimbledon 2013 Circular Match Tree,” Web Page, accessed: 2017-03-24. [Online]. Available: http://charts.animateddata. co.uk/tennis/matchTree.html [12] Peter Cook, “Wimbledon 2013 David and Goliath,” Web Page, accessed: 2017-03-24. [Online]. Available: http://charts.animateddata. co.uk/tennis/davidgoliath.html [13] Peter Cook, “Wimbledon 2013 Top 32 Players,” Web Page, accessed: 2017-03-24. [Online]. Available: http://charts.animateddata.co.uk/tennis/ top32.html [14] Pablo Navarro, “Earthquakes in Chile since 1900,” Web Page, accessed: 2017-03-24. [Online]. Available: http://pnavarrc.github.io/ earthquake/ [15] Jim Vallandingham, “Visualizing The Racial Divide,” Web Page, accessed: 2017-03-24. [Online]. Available: http://vallandingham.me/ racial_divide/

139 Review Article Spring 2017 - I524 1

An overview of the open source log management tool - Graylog

RAHUL SINGH1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

April 30, 2017

Graylog is an open source log management tool that allows an organization to collect, organize and analyze large amounts of data from its network activity. It enhances basic log management functionality by providing network traffic analysis, Lucene syntax based search, drill-down analysis of data using field statistics, and alert notifications generated from trigger conditions. It integrates with other open source technologies to address a larger distributed system. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Cloud, I524

https://github.com/cloudmesh/sp17-i524/blob/master/paper2/S17-IR-2036/report.pdf

1. INTRODUCTION in the architecture is Metadata Management [8]. Graylog [1] allows management of an organization’s comput- ElasticSearch - ElasticSearch is useful for storing actual ing resources in a consistent way. It allows us to centrally col- log data and perform search operations on them. ElasticSearch lect and manage log messages of an organization’s complete nodes should have as much RAM as possible and the fastest infrastructure. A user can perform search on terabytes of log disks linked to them. Messages are only stored on ElasticSearch data to discover number of failed logins, find application errors nodes. If we have data loss on ElasticSearch, all messages are across all servers or monitor the activity of a suspicious user id. gone – except if the administrator has created backups of the Graylog works on top of ElasticSearch [2] and MongoDB [3] to indices. facilitate this high availability searching. It provides a Lucene [4] like query language, a processing pipeline for data trans- formation, alerting abilities and much more. Graylog enables organizations, at a fraction of the cost, to improve IT operations efficiency, security, and reduce the cost of IT [5].

2. ARCHITECTURE Graylog is written in Java and uses a few key open source technologies like ElasticSearch and MongoDB. Additionally, for a larger setup Apache Kafka [6] or RabbitMQ [7] could be integrated to implement queueing. A basic Graylog cluster consists of the following components: Fig. 1. Graylog minimum architecture [9] Graylog server - It is the actual log processor system also responsible for implementing security. The Graylog server nodes shall be operated on the fastest CPU’s available. 3. GRAYLOG USE CASES Graylog web UI - It is the Graylog web user interface where one can view histograms, dashboards and create alerts. 3.1. Computer And Network Security The best way for intrusion detection is to monitor activity of all MongoDB - MongoDB stores Graylog configuration in- the devices in the network. Graylog allows a user to keep track formation and the non queried log messages. It’s prime purpose of all failed logins, rejected network connections or exceptions

140 Review Article Spring 2017 - I524 2 in the flow of the application. It also allows a user to integrate 4.3. Log Streams other Intrusion Detection Systems (IDS) [10] to correlate detected Graylog allows a user to create a set of rules to route messages activity with logs from all across the infrastructure [1]. into user defined categories. A user could create a stream called ’Database errors’ and create a rule to direct all messages with the 3.2. Centralized IT management source attribute as ’database’ to that stream. Thus, the stream Logging into every system and parsing the plain text log files ’Database errors’ shall catch every error message from the sys- to find meaningful data is an arduous task for any IT engineer. tem’s database hosts. A message shall be routed into every Graylog allows us to centrally collect all syslog and eventlog stream that has the corresponding matching rule for the mes- messages of the complete infrastructure thus allowing to solve sage. A message thus, can be part of many streams and not just production issues in real time. It does so by allowing the admin- one [15]. istrator to setup alerts for any trigger actions like performance degradation or exceptions. 4.4. Alerts Graylog alerts are periodical searches that can trigger some notifications when a defined condition is satisfied [16]. To get 3.3. Development and DevOps notified when more than 50 exceptions occur in the range of Graylog allows on demand monitoring of distributed applica- a minute, an alert can be created with the desired conditions. tions by giving tiered access to any developer in the organization While defining alerts, a user can also specify the method of to view system and application logs. It works on top of elastic notification once the alert condition is met. Notifications can be search nodes to facilitate search operation on terabytes of log obtained by an email or by an HTTP request to an endpoint in data in a matter of milliseconds. In case of a customer operation the system. resulting in an error, a developer shall need to search the logs with the customer id and locate the relevant logs to find the root 4.5. Dashboards cause of the problem. Graylog provides visualization through creation of dashboards that allows a user to build pre-defined views on his data to as- 4. GRAYLOG MODULES semble all of his important data only a single click away [17]. Any search result or metric shall be added as a widget on the 4.1. Sending in log data dashboard to observe trends in one single location. A user can Graylog needs a log source to serve its purpose. The message also add search result metrics like result count, statistical values, input section is launched from the System - Inputs section from field value charts and stack charts to the dashboard. These dash- the web interface (or the REST API [11]) and can be configured boards can also be shared with other users in the organization. without the need to restart any part of the system [12]. Graylog is able to accept and parse RFC 5424 [13] and RFC 4.6. Graylog REST API 3164 [14] compliant syslog messages and the Graylog Extended Both configuration settings and log data are available through Log Format (GELF) [12]. For any devices on the network that do the Graylog REST API. The Graylog web interface uses Graylog not publish RFC compliant syslog messages we need to make Rest API internally to interact with the Graylog cluster. Graylog use of the plaintext messages. 
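As a hedged illustration of sending structured log data into Graylog, the sketch below posts a GELF message to a GELF HTTP input. The input must first be created under System - Inputs; the host name, port 12201, the /gelf path and the availability of a global fetch are assumptions about a typical setup rather than values taken from the paper.

```typescript
// Minimal GELF payload: version, host and short_message are the required fields;
// custom fields carry a leading underscore.
const message = {
  version: "1.1",
  host: "web-01.example.org",
  short_message: "Login failed for user admin",
  level: 4, // syslog severity "warning"
  _service: "auth-service",
};

fetch("http://graylog.example.org:12201/gelf", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(message),
})
  .then((res) => console.log(`Graylog answered ${res.status}`))
  .catch(console.error);
```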
The GELF is a log format that REST API could be used for automation or integrating Graylog avoids the shortcomings of classic plain syslog by providing into another system, such as monitoring or ticket systems [18]. optional compression, fixed structure and limits payload length Thus a network administrator can easily integrate Graylog into to 1024 bytes. Syslog is preferred for direct logging by machines his evolving architecture and build reports and analysis. in the network while GELF is suitable for logging from within the applications. 4.7. Filtering messages Graylog can use Drools [19] to evaluate all incoming messages 4.2. Search against a user defined rules file. To discard any message before As Graylog uses ElasticSearch to facilitate searching, the search its written to elastic search or to forward it to another system, syntax followed is very close to the Lucene syntax which is the one can use Drools rules to perform custom filtering [20]. underlying implementation of Elastic Search. By default, search is performed over all fields unless a specific field is specified 5. COMPARISION WITH in the search query. Search features like fuzzy, wildcard, range Splunk [21] is considered as one of the best log management tools searches provided by Elastic Search API can be used to gain available and is Graylog’s biggest competitor. Splunk started deeper insight to data. Graylog also provides a time frame as a log analysis tool but has now grown into a full machine selector which defines the time range over which the search shall generated data processing platform. Splunk works on almost be performed. The time selector could be relative, absolute or every format of log data unlike graylog, but has a higher setup keyword based. Graylog also allows a user to save his searches cost. To deploy Splunk in a high scale environment, a user to view it later. A user needs to save his search by a unique name needs to install and configure a dedicated cluster. Table below and could load it later from the system, under the saved search represents a comparision of Graylog and Splunk for a few basic selector. factors. Graylog automatically constructs a histogram for the search results. The histogram depicts the concise number of messages 6. CONCLUSION received grouped by a certain time period that is adjustable. Based on a user’s recent search query, graylog also allows you Graylog makes big data analytics affordable by providing an to distinguish data that are not searched upon very often and open source solution that allows organizations to realize the thus can be archived on cost effective storage drives. benefits of collecting and analyzing log data and thus improving

141 Review Article Spring 2017 - I524 3

accessed : 04-09-2017. [Online]. Available: https://www.ietf.org/rfc/ Table 1. Comparision of Graylog and Splunk rfc5424.txt [22] [23] [14] C. Lonvick, “https://www.ietf.org/rfc/rfc3164.txt,” webpage, Aug 2001, Parameter Graylog Splunk accessed : 04-09-2017. [Online]. Available: https://www.ietf.org/rfc/ rfc3164.txt Business Model Opensource Commercial software [15] “Streams - graylog 2.2.1 documentation,” webpage, accessed : 03-20-2017. [Online]. Available: http://docs.graylog.org/en/2.2/pages/ Setup Time Needs time No time streams.html Learning Curve Difficult Simple [16] “Alerts - graylog 2.2.1 documentation,” webpage, accessed : 03-20- 2017. [Online]. Available: http://docs.graylog.org/en/2.2/pages/streams/ Filetypes syslog, Many alerts.html [17] “Dashboards - graylog 2.2.1 documentation,” webpage, accessed : Security Good Good 03-20-2017. [Online]. Available: http://docs.graylog.org/en/2.2/pages/ Apps Supported Low Very high dashboards.html [18] “Graylog rest api - graylog 2.2.1 documentation,” webpage, accessed : 03-21-2017. [Online]. Available: http://docs.graylog.org/en/2.2/pages/ configuration/rest_api.html operational efficiency at a reduced cost of IT. It provides an [19] “Drools - drools -business rules management system(java, open effective set of features to be adapted by any small to medium source),” webpage, accessed : 04-09-2017. [Online]. Available: size organization. Alert notifications, sharing of dashboards https://www.drools.org/ and message filtering provide most features that any network [20] “Blacklisting - graylog 2.2.1 documentation,” webpage, accessed : administrator desires from a log management system. Being an 03-20-2017. [Online]. Available: http://docs.graylog.org/en/2.2/pages/ open source tool it is cost effective compared to other log systems. blacklisting.html Graylog provides centralized monitoring and management of [21] “Operational intelligence, log management, application management, large scale distributed systems from a single point of control. enterprise security and compliance | splunk,” webpage, accessed : However, it needs an environment to be setup before it can be 04-09-2017. [Online]. Available: https://www.splunk.com/ operational. It has a steep learning curve with the responsibility [22] T. Weiss, “The 7 log management tools you need to know | takipi blog,” webpage, Apr 2014, accessed : 04-09-2017. [Online]. Available: of managing the MongoDB and ElasticSearch instances being http://blog.takipi.com/the-7-log-management-tools-you-need-to-know/ completely managed by the user. Hence, the choice to choose [23] S. Yegulalp, “Open source graylog puts splunk on notice | Graylog as an organization’s log management tool directly relies infoworld,” webpage, Feb 2015, accessed : 04-09-2017. [On- on the resources in terms of either time or money that it chooses line]. Available: http://www.infoworld.com/article/2885752/log-analysis/ to employ. open-source-graylog-puts-splunk-on-notice.html

REFERENCES AUTHOR BIOGRAPHIES [1] “Graylog | open source log management,” webpage, accessed : 03-19-2017. [Online]. Available: https://www.graylog.org/ Rahul Singh received his B.E. (Computer [2] Wikipedia, “Elasticsearch - wikipedia,” webpage, accessed : 04-09-2017. [Online]. Available: https://en.wikipedia.org/wiki/Elasticsearch Engineering) from University of Mumbai, [3] Wikipedia, “Mongodb - wikipedia,” webpage, accessed : 04-09-2017. India. He is currently pursuing his Mas- [Online]. Available: https://en.wikipedia.org/wiki/MongoDB ters in Computer Science at Indiana Uni- [4] “ - welcome to apache lucene,” webpage, accessed : versity Bloomington. 04-09-2017. [Online]. Available: https://lucene.apache.org/ [5] “High tech grunderfonds | graylog raises $2.5 million to expand opensource big data analytics platform,” webpage, Feb 2015, accessed : 04-09-2017. [Online]. Available: http://high-tech-gruenderfonds.de/en/ graylog-raises-2-5-million-to-expand-open-source-big-data-analytics-platform/ [6] “Apache Kafka,” webpage, accessed : 04-09-2017. [Online]. Available: https://kafka.apache.org/ [7] “Rabbit MQ - messaging that just works,” webpage, accessed : 04-09-2017. [Online]. Available: https://www.rabbitmq.com/ [8] Severalnines, “High availibility log processing with gray- log, and elastic search,” webpage, Mar 2016, accessed : 03-19-2017. [Online]. Available: https://severalnines.com/blog/ high-availability-log-processing-graylog-mongodb-and-elasticsearch [9] “Architectural consideration - graylog 2.2.1 documentation,” webpage, accessed : 03-19-2017. [Online]. Available: http://docs.graylog.org/en/2. 2/pages/architecture.html [10] Wikipedia, “Intrusion detection system - wikipedia,” webpage, accessed : 04-09-2017. [Online]. Available: https://en.wikipedia.org/wiki/Intrusion_ detection_system [11] Wikipedia, “Representational state transfer-wikipedia,” webpage, accessed : 04-09-2017. [Online]. Available: https://en.wikipedia.org/wiki/ Representational_state_transfer [12] “Sending in log data - graylog 2.2.1 documentation,” webpage, accessed : 03-20-2017. [Online]. Available: http://docs.graylog.org/en/2. 2/pages/sending_data.html [13] R. Gerhards, “https://www.ietf.org/rfc/rfc5424.txt,” webpage, Mar 2009,

142 Review Article Spring 2017 - I524 1

Jupyter Notebook vs Apache Zeppelin - A comparative study

SRIRAM SITHARAMAN1,*

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected]

April 30, 2017

With the development of new technologies for the purpose of exploring, analysing and visualizing big datasets, there exists a growing demand for a collaborative platform that combines the process of data analysis and visualization. The idea of computer notebooks has been around for a long time with the launch of Matlab and Mathematica. Two such platforms, Apache Zeppelin and Jupyter notebook, are taken into consideration for comparison. Both of them were evaluated against characteristics such as their ability to support multiple programming techniques, multi-user support and integrated visualization. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Jupyter, Zeppelin, Notebook, Comparison

https://github.com/cloudmesh/classes/blob/master/docs/source/format/report/report.pdf

CONTENTS Jupyter Notebook documents allows a seemless view of the code written and the corresponding output (graphs,tables,etc.). The 1 Introduction - Jupyter Notebook 1 Jupyter Notebook app is a client-server application as shown in the figure 1 that can be run on a local machine for personal use 2 Introduction - Apache Zeppelin 1 and can be accessed without the internet, or can be installed in a remote server which can be accessed via an internet connec- 3 Comparison 2 tion. The final component of the Jupyter Notebook is the kernel 3.1 Interpreter configuration ...... 2 which performs the execution of code written by user. Jupyter 3.2 Interface ...... 2 comes with a native Ipython kernel (supporting Python), and 3.3 Supported Languages ...... 2 also supports 80+ programming languages’ kernels [3]. 3.4 Visualization ...... 2 3.5 Multi-user capability ...... 2 3.6 Community support ...... 3
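Because the Notebook app is a client-server application, the kernels a server exposes can also be listed over its REST interface rather than through the browser. A hedged sketch follows; the localhost URL, port 8888, the token environment variable and the availability of a global fetch are assumptions about a default local installation.

```typescript
// List the kernels a running Jupyter Notebook server advertises.
const token = process.env.JUPYTER_TOKEN; // assumed to hold the server's access token

async function listKernelSpecs(): Promise<void> {
  const res = await fetch("http://localhost:8888/api/kernelspecs", {
    headers: { Authorization: `token ${token}` },
  });
  const body = (await res.json()) as {
    default: string;
    kernelspecs: Record<string, unknown>;
  };
  console.log(`default kernel: ${body.default}`);
  for (const name of Object.keys(body.kernelspecs)) {
    console.log(`available kernel: ${name}`);
  }
}

listKernelSpecs().catch(console.error);
```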

4 Alternatives to Jupyter and Apache Zeppelin 3 2. INTRODUCTION - APACHE ZEPPELIN 4.1 Beaker Notebook ...... 3 Apache Zeppelin [5], similar to Jupyter notebook, is also a web 4.2 SageMath ...... 3 based interactive notebook but aimed towards supporting big 5 Conclusion 3 data and data analytics. It supports multiple programming languages. It is useful in exploration of data, creation of vi- sualizations and sharing insights, as web pages, with various 1. INTRODUCTION - JUPYTER NOTEBOOK stakeholders involved in a project. It supports a range of pro- Jupyter notebook, part of project Jupyter is the third version of gramming languages like Scala, Spark SQL, Shell, etc with the IPython notebook. It is a web based interactive development default being scala. Zeppelin has an architecture similar to environment and supports multiple programming languages Jupyter notebook with an exception that notebook in Zeppelin (Python, Julia, R etc.) [1]. It started supporting Julia, Python supports the integration of multiple programming languages in and R in its initial release, hence the term Jupyter. The web a single notebook. Zeppelin’s notebook is shipped with some based interative environment provided by Jupyter is facilitated basic charts which can be used to visualize output generated through a which contains the programming by any supported programming language. Apart from that, it code written by the user as well mark down elements and out- has the option of pivot charts and Dynamic forms that can be puts that are generated as result of the written code[2]. Hence, generated inside the notebook interface.

143 Review Article Spring 2017 - I524 2

3.2. Interface Zeppelin leverages a common UI framework with Bootstrap and .js for the notebook interface. It is a easy to use interface with the support for creating dynamic forms within Zeppelin’s notebook for which input can be obtained from a programming languages’ output. Zeppelin provides an interface that is similar to RShiny’s interactive web interface which allows user to manipulate in the front end rendered using Javascript with R running in the background [8]. Jupyter, on the other hand has a simple interface that is not user interactive as Zeppelin. When jupyter notebook is launched, the first page that is encountered is the Notebook Dashboard which shows the notebook files that has been created already and residing in the server. A new notebook can be created that is associated to a specific programming language’s kernel (Python 2/Python 3/ R etc.) [9]. It’s interface simple interface having a cell where the code can be written. Once the code is executed, Fig. 1. Jupyter Architecture. The main components of the ar- the area below the cell would display the corresponding output. chitecture comprises of the browser (user interface) which contacts with the Notebook server. The Notebook server can connect with multiple kernels.[4] 3.3. Supported Languages Zeppelin has the support for the 16 interpreters mentioned in Table 1. Since Apache Zeppelin has the default interpreter as spark which is a one of the major technology used for solving 3. COMPARISON big data problems, the list includes interpreters like hbase, cas- This section would compare the following aspects of Apache sandra, elastic search etc. that works along with spark in solving Zeppelin and Jupyter notebook: big data problems.To overcome the limitation of these list of interpreters, Zeppelin provides the support for writing our own 1. Interpreter configuration interpreter as described in [10].

2. Interface Table 1. Programming languages supported by Apache Zep- 3. Supported Languages pelin file jdbc md 4. Visualization angular flink kylin postgresql 5. Multi-user capability cassandra hbase lens python

6. Community Support elasticsearch ignite livy shell

3.1. Interpreter configuration Jupyter, on the other hand has a huge list of about 80+ kernels Zeppelin interpreter is a language/data-processing-backend being supported currently [3]. Though Jupyter has an upper- that can be plugged into Zeppelin notebook. Currently, Zeppelin hand over Zeppelin over the number of supported program- supports over 30 interpreters such as Scala ( with Apache Spark ming languages, it lags behind Zeppelin in the support of using ), Python ( with Apache Spark ), Spark SQL, JDBC, Markdown, different programming language in the same notebook. Shell, etc [6]. Zeppelin has a separate Interpreter configuration page that you can be accessed for multiple language parameters. 3.4. Visualization For instance, Spark home directory as well as spark master Zeppelin provides the freedom to play around with various string can be modified to our preference. It would create the types of chart as shown in figure 2. It provides charting options Spark Context automatically so you don’t need to deal with it by default for the output generated as a result of execution of in each notebook.For example, to use Scala code in Zeppelin, code. The Apache Zeppelin community has been working on "%spark" has to be included to load the interpreter. Apache Project Helium [11], which aims to seed growth in all kinds of Zeppelin provides several interpreters as community managed visualizations. This follows the model created by pluggable interpreters [7] which can be installed at once if netinst binary interpreters. Helium aims to make adding a new visualization package has been installed. It also supports installation of 3rd simple, intuitive and can be accessed through a packaged code. party interpreters. Jupyter, on the other hand has no charting options by de- fault. Hence, it relies on existing charting libraries from the For Jupyter, IPython is the default kernel defined as Ker- programming language that the notebook uses. nel zero, which can be obtained through the kernel ipykernel. Jupyter supports the use of 80+ kernels and [3] shows how each 3.5. Multi-user capability of them can be installed, required dependencies and the corre- Zeppelin is still working towards providing multi user capability. ponding programming language version needed. Jupyter Hub [12] provides multiple user support for Jupyter. It

144 Review Article Spring 2017 - I524 3

Fig. 2. Snapshot of Integrated graph - Apache Zeppelin. Different types of plots that can be readily viewed for the generated out- puts in Zeppelin Notebook [5]

simple to setup because it leverages the Linux users and groups for Jupyter in the online communtiy. A simple for to provide authentication.Three subsystems make up Jupyter- the term IPython gives around 1.3 million results whereas for Hub as shown in Figure 3. Each user cannot touch or see any Apache Zeppelin, it is around 0.45 million. Apache Zeppelin other users because it’s restricted at the OS layer. However, is working with NFLabs, Twitter, Hortonworks, MapR, Pivotal, Jupyter Hub has to be maintained in a single server which would and IBM among many others delivering new features and fix provide access to multiple users. issues in its platform [13]. 1. a multi-user Hub ( process) 4. ALTERNATIVES TO JUPYTER AND APACHE ZEP- 2. a configurable http proxy (node-http-proxy) PELIN

3. multiple single-user Jupyter notebook servers 4.1. Beaker Notebook (Python/IPython/tornado) The Beaker Notebook is built on top of IPython kernel and was designed from the start to be a fully polyglot notebook [14]. It currently supports Python, Python3, R, Julia, JavaScript, SQL, Java, Clojure, HTML5, Node.js, C++, LaTeX, Ruby, Scala, Groovy, Kdb. It follows a cell type interface similar to Jupyter and Zep- pelin and allows the user to use multiple programming lan- guages across different cells in the same notebook. (i.e.) For example, the output of the code written in R in cell 1 can be accessed by a block of code written in Python in cell 2 for further manipulations.

4.2. SageMath Sagemath is a free open-source mathematics software system [15] built for creating an open source alternative to Magma, Maple, Mathematica, and MATLAB. It is built on top of existing open-source packages: NumPy, SciPy, matplotlib, Sympy, Maxima, GAP, FLINT, R etc. It also offers a notebook-type interface, and the Sage Notebook recently moved to the cloud with SageMathCloud in collaboration with Google’s cloud services.

Fig. 3. JupyterHub Architecture. It allows multiple users to interact with the Notebook server to access their respective notebooks by means of user-side authentication [12].

5. CONCLUSION
The very need for experiments, explorations, and collaborations in the scientific programming community is addressed by the evolution of these notebooks. Taking these into account, Apache Zeppelin is gaining the upper hand over Jupyter in providing a fluid environment to solve big data problems, despite being in the initial stages of defining multi-user support and having a relatively small community. With Zeppelin being a part of the Apache community, it would be understandable that there would be constant

145 Review Article Spring 2017 - I524 4 growth and updates. This can be strengthened from the fact that Apache Zeppelin is in the initial stages of developing He- lium which would make visualization easy to use for big data problems.

REFERENCES [1] Wikipedia, “Jupyter -debian wiki,” dec 2016, [Online; accessed 23-March-2017]. [Online]. Available: https://wiki.debian.org/Jupyter [2] Jupyter, “What is the jupyter notebook?” Web Page, jan 2017, accessed: 2017-03-02. [Online]. Available: http://jupyter-notebook-beginner-guide. readthedocs.io/en/latest/what_is_jupyter.html [3] Jupyter, “Jupyter kernels,” Web Page, feb 2017, accessed: 2017- 03-02. [Online]. Available: https://github.com/jupyter/jupyter/wiki/ Jupyter-kernels [4] Jupyter, “How ipython and jupyter notebook work,” Web Page, feb 2017, accessed: 2017-03-02. [Online]. Available: http://jupyter.readthedocs.io/ en/latest/architecture/how_jupyter_ipython_work.html [5] Apache, “Apache zeppelin,” Web Page, nov 2016, accessed: 2017-03-02. [Online]. Available: http://zeppelin.apache.org/ [6] Apache, “Interpreters in apache zeppelin,” Web Page, nov 2016, accessed: 2017-03-02. [Online]. Available: http://zeppelin.apache.org/ docs/latest/manual/interpreters.html [7] Apache, “Interpreters installation,” Web Page, nov 2016, accessed: 2017-03-02. [Online]. Available: https://zeppelin.apache.org/docs/0.6.0/ manual/interpreterinstallation.html [8] RStudio, “Shiny,” Web Page, dec 2016, accessed: 2017-03-22. [Online]. Available: https://shiny.rstudio.com/ [9] Jupyter, “Ui components - jupyter notebook,” Web Page, feb 2017, accessed: 2017-03-02. [Online]. Available: http://jupyter-notebook. readthedocs.io/en/latest/ui_components.html [10] Apache, “Writing a new interpreter,” Web Page, nov 2016, accessed: 2017-03-02. [On- line]. Available: http://zeppelin.apache.org/docs/latest/development/ writingzeppelininterpreter.html#make-your-own-interpreter [11] Wikipedia, “Helium proposal,” mar 2016, [Online; accessed 21-March- 2017]. [Online]. Available: https://cwiki.apache.org/confluence/display/ ZEPPELIN/Helium+proposal [12] Jupyter, “Jupyterhub,” Web Page, feb 2017, accessed: 2017-03-22. [Online]. Available: https://jupyterhub.readthedocs.io/en/latest/ [13] hortonworks, “Jupyterhub,” Web Page, jun 2016, accessed: 2017-03-23. [Online]. Available: https://hortonworks.com/blog/ apache-zeppelin-road-ahead/ [14] T. Sigma, “Beaker notebook,” Web Page, dec 2016, accessed: 2017-03-25. [Online]. Available: http://beakernotebook.com/ [15] S. Foundation, “Beaker notebook,” Web Page, jan 2017, accessed: 2017-03-25. [Online]. Available: http://www.sagemath.org/

146 Review Article Spring 2017 - I524 1

Introduction to Terraform

SUSHMITA SIVAPRASAD1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] project-000, April 30, 2017

This paper gives a brief introduction to Terraform and to infrastructure as code, which is the building block of Terraform. Terraform is written in the Go programming language and is a server provisioning tool in which we specify the goal we require, and Terraform creates the steps needed to reach that goal. This paper provides information on the use cases of Terraform and how it differentiates itself from other tools. Resources for learning Terraform in much more detail have been provided as well. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Terraform, IAC, Go code, Ansible

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-IR-2038/report.pdf

1. INTRODUCTION a rolling deployment can be done and the server will be updated batch wise. Tools such as Ansible are idempotent, that is they Terraform is an open source tool created by HashiCorp. It is run correctly no matter how many times we run the code[3]. an infrastructure management tool written in Go programming language.It allows users to safely and predictably create , change 2.3. Server templating tools and improve the production infrastructure and codifies the API’s into a declarative configuration file that can be shared among Server templating tools are Docker, Packer and Vagrant. A server other users as well[1]. The tool can even be treated as a code templating tool allows to create an image of a server that cap- whereby we can edit, review and create versions at the same tures a snapshot image of the operating system, the software and time. The tool is popular in allowing users to create a customized the files in it. The image of the server can then be distributed in-house solution. It allows users to build an executive plan across all of the servers using Ansible[3]. which can be used for the purpose of creating applications and 2.4. Server provisioning tools implementing the infrastructure[2]. Server provisioning tools includes Terraform, Cloudformation, Openstack Heat[3]. These tools allows to create a server by them- 2. INFRASTRUCTURE AS A CODE selves. These tools can also be used to create databases, caches, Infrastructure as a code or IAC allows us to write and execute load balancers, queues, firewall settings and other aspects from a code in order to define, deploy and update the infrastructure. the infrastructure[3]. There are 4 categories of the IAC tools, 3. WORKING OF TERRAFORM 2.1. Ad hoc scripts Infrastructure is the building block on which the application Ad Hoc scripts allows to automate anything. We can use any is created. Terraform provides us a declarative execution plan scripting language and break down the tasks we were doing for building and running applications and infrastructure. All manually and execute the script on the server. Ad hoc script that we are required to do in a declarative state is we declare allows one to write the code in the required manner[3]. the required end of the state. The tool will take care of the steps required to reach the goal state. It allows the user to not be 2.2. Configuration management tools bothered about the commands to run or the settings required Some of the main configuration management tools are Chef, to change. The user has to declare the resources using a graph- Ansible, SaltStack which are designed to install and manage based approach in order to model and apply the desired state[2]. software on any of the existing servers. These tools are designed The Go code allows Terraform to compile down into a single to manage large number of remote servers. Ansible playbook binary code for each of the supported operating systems. This can be used to configure multiple servers in a parallel mode[3]. binary is used to deploy infrastructure from our pc or a server A parameter called serial can be set in the playbook from which and we would not require an additional infrastructure to make

147 Review Article Spring 2017 - I524 2

Fig. 2. Figure depicting terraform transforming the configura- tions to API calls into the cloud providers [2]
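To make the declarative workflow of Section 3 concrete, the sketch below writes a minimal configuration to main.tf and asks the terraform binary for an execution plan. The HCL snippet, the AWS region, the AMI id and the assumption that the terraform binary is already installed are illustrative, not details from the paper.

```typescript
import { execSync } from "child_process";
import { writeFileSync } from "fs";

// A minimal declarative configuration: one provider, one desired resource.
// Terraform itself works out the API calls needed to reach this end state.
const config = `
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "web" {
  ami           = "ami-0abcdef1234567890"  # placeholder AMI id
  instance_type = "t2.micro"
}
`;

writeFileSync("main.tf", config);

// Download the provider plugin, then show the planned actions without applying them.
execSync("terraform init", { stdio: "inherit" });
execSync("terraform plan", { stdio: "inherit" });
```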

4.3. Multi-Cloud Deployment In order to increase the fault tolerance, infrastructure is spread Fig. 1. Packer is used to create an image of a server which can across multiple clouds to continue it’s progress irrespective of be installed using Ansible across all other servers [3] any faults. The current multi-cloud deployement consists of cloud specific tool for infrastructure management. Terraform al- lows a single configuration to be used across multiple providers and also handle any cross-cloud dependencies[1]. that happen. The binary makes API calls on behalf of us to one or more of the providers such as Azure, AWS, Google Cloud, etc. In this way terraform lets users to use the infrastructures provided 5. ADVANTAGES OF USING TERRAFORM by these providers as well as the authentication mechanism we Terraform allows the existing tools to focus on its strength such are already using with these providers [3]. Terraform allows as bootstrapping and initializing resources. It makes the in- users to deploy interconnected resources across various cloud frastructure deployment easy by focusing on a higher level of providers at the same time. It translates the contents of the abstraction of the datacenter and also allowing the same codi- configurations into API calls into the cloud providers as shown fication for the tools. It is cloud agnostic and allows multiple in figure 2. providers and services to be combined [4]. It separates the plan- ning and execution phase. It generates an action plan when we provide a goal state, which keeps getting updated when we add 4. USE CASES or remove new resources. The terraform graph feature allows the user to visualize the plan in order for them to know exactly 4.1. Multi-Tier Application the effects of the changes before they get implemented [4]. It is useful for building multi-tier applicatin which consists of web, cache, application, middleware and tiers of database. Ter- 6. EDUCATIONAL MATERIAL raform helps to build N-tier applications. Each of the tier can be configured in Terraform and left independent and isolated Hashicorp has provided a documentation for describing Ter- which can be easily scaled and managed by the code[2]. raform, it’s download and installation procedure. There has been 2 books published by authors Yevgeniy Brikman[3] and James Turnbull[2] which gives a very detailed insight about the 4.2. Disposable Environments working and usage of Terraform. In order to test a new application before it has been sent for production , a staging or QA Environment is required in order to 7. CONCLUSION avoid complexity and confusion. For such a situation terraform can be very handy, where the production environment can be Terraform simplifies the process of creating applications and codified and shared along with staging[2]. It can even create new implementing the infrastructure by just specifying the required environment for it to test on and which can be later disposed goal. It uses Go code which allows Terraform to compile down of , keeping only what is required. Terraform makes it easy to into a single binary code for each of the supported operating maintain paralle environements and easily dispose them of [2]. systems. We can create N- tier applications with ease by just

148 Review Article Spring 2017 - I524 3 providing the resources and the end state of the required ap- plication. It allows multiple user to work on a cloud-agnostic platform thus being a very versatile tool.

8. ACKNOWLEDGEMENT A very special thanks to Professor Gregor von Laszewski and the teaching assistants Miao Zhang and Dimitar Nikolov for all the support and guidance in getting this paper done and resolving all the technical issues faced. The paper is written during the spring 2017 semester course I524: Big Data and Open Source Software Projects at Indiana University Bloomington.

9. AUTHOR BIOGRAPHY Sushmita Sivaprasad is a graduate student in Data Science at Indiana University under the department of Informatics and Computing. She had completed her bachelors in Electronics and Communication from SRM University , India and her master’s in International Business from Hult International Business School, UAE.

REFERENCES [1] “Introduction to terraform,” Web Page, Oct. 2014, accessed: 2017-03-25. [Online]. Available: https://www.terraform.io/intro/index.html [2] J. Turnbull, The Terraform Book, 110th ed. Turnbull Press, Nov. 2016, accessed:2017-03-25. [3] Yevgeniy Brikman, Terraform: Up and Running:Writing Infrastructure as Code, 1st ed. O’reilly media, Mar. 2017, accessed : 2017-2-24. [Online]. Available: http://shop.oreilly.com/product/0636920061939.do [4] “Terraform vs. other software,” Web Page, Oct. 2014, accessed: 2017-03-27. [Online]. Available: https://www.terraform.io/intro/vs/ cloudformation.html


Google BigQuery - A data warehouse for large-scale data analytics

SAGAR VORA1

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A.

April 30, 2017

The amount of relational data generated is increasing rapidly since it is produced by a large number of sources. Moreover, this data is the information that companies want to explore and analyse quickly to identify business solutions. Google's BigQuery platform arises from the need to overcome the limitations of traditional database management systems in supporting large volumes of data. Google BigQuery is an enterprise data warehouse used for large-scale data analytics. A user can store massive datasets in BigQuery and query them directly. BigQuery runs in the cloud using the processing power provided by Google's infrastructure and provides SQL-like queries to perform analysis on massive quantities of data, providing real-time insights about the data. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: BigQuery, BigData, Google, Cloud, Database, IaaS, PaaS, SaaS, SQL

https://github.com/cloudmesh/sp17-i524/raw/master/paper2/S17-IR-2041/report.pdf

1. INTRODUCTION
Nowadays, the amount of data being collected, stored and processed continues to grow rapidly. Querying these massive datasets can be time consuming and expensive without the right hardware and infrastructure. BigQuery [1] solves this problem by providing super-fast, SQL-like queries, using the processing power of Google's infrastructure. Google BigQuery [2] is a cloud web service data warehouse used for large-scale data processing. It is suitable for businesses that cannot afford to invest heavily in infrastructure to process a huge amount of information. The platform allows users to store and retrieve large amounts of information in near real time, as well as providing some important analysis of the stored data.

2. GOOGLE BIGQUERY

To solve the architectural problems faced by Hadoop [3] MapReduce [4], Google developed the Dremel [5] application in order to process large volumes of data. Dremel was designed to deliver high performance on data spread across a huge range of servers, together with SQL support. At the 2012 Google I/O event, however, it was announced that Dremel would no longer be supported, and this led to the beginning of BigQuery, which then became the high-performance cloud offering of Google. BigQuery also makes use of SSL (Secure Sockets Layer) to take care of the security concerns related to cloud management.

2.1. System Architecture and Internal Structure
The Google BigQuery platform is a Software as a Service (SaaS) model in the cloud. As seen in figure 1, data generated from a variety of sources, such as event logs, relational databases, IoT devices like sensors and actuators, social media websites, Informatica [6] applications, etc., is loaded into the databases which are created in BigQuery. This data can then be processed and analysed using some algorithmic logic or a processing tool. Finally, the data can be represented and exported using Tableau [8], Qlikview [9], MS Excel [10] and other BI tools. It can also be exported to a Hadoop system for parallel processing.

Fig. 1. [7] System Architecture of BigQuery
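As a rough illustration of this load-then-analyse flow (not taken from the original paper), the sketch below loads a CSV file from Google Cloud Storage into a BigQuery table using recent versions of the google-cloud-bigquery Python client. The project, dataset, table and bucket names are placeholders, and authenticated credentials for a Google Cloud project are assumed.

from google.cloud import bigquery

# Assumes application-default credentials for a Google Cloud project.
client = bigquery.Client()

# Placeholder identifiers; replace with a real dataset, table and GCS file.
table_id = "my-project.my_dataset.events"
source_uri = "gs://my-bucket/events.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # let BigQuery infer the schema
)

# Load the file from Cloud Storage into the BigQuery table and wait for it.
load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()

print(client.get_table(table_id).num_rows, "rows now in", table_id)

Once the table exists, it can be queried with SQL-like statements as described in the following sections.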


Now, to store data in BigQuery, you need to create projects. Projects [11] in BigQuery act as top-level containers which store the BigQuery data. Each project is referenced by a name and a unique ID. Tables in BigQuery store the actual data, where each table has a schema which describes the field names, types and other information. In BigQuery each table must belong to a dataset. Datasets help to organise the tables and control access to them.

3. FEATURES OF BIGQUERY
Google BigQuery presents characteristics such as velocity, simplicity, and multiple access methods.

3.1. Velocity
BigQuery can process millions of records in seconds because of its columnar storage and tree-like architecture.

3.2. Columnar Storage and Tree Architecture
The data, instead of being stored in terms of rows as in standard SQL databases, is stored as columns, so the storage is column-oriented as shown in figure 2. This not only results in fast access to data but also means that only the required values are scanned, which largely reduces latency. It also results in higher compression ratios. Google reports [12] that "BigQuery can achieve columnar compression ratios of 1:10 as opposed to 1:3 when storing data in a traditional row-based format".

Fig. 2. [12] Row-oriented vs Column-oriented Storage in tree architecture

Moreover, the tree-like architecture is used for processing queries and aggregating results across a variety of different nodes, and the tree-like structure also helps in retrieving data faster. Let us see this with the help of two examples. Considering the column-oriented storage in the tree architecture format in figure 2, let node A be our root server, node B an intermediate server, and C, D, E leaf servers having local disk storage space.

Example 1: Fast retrieval
Statement: Find all the customer names which start with 'A'.
Assuming node C contains the customer names, it is as simple as traversing A -> B -> C and looking at the datasets Cr1 and Cr2 for names starting with 'A'. One need not look at the paths from A -> B -> D and A -> E. Hence, in this simple scenario, a query may not scan the entire storage structure of BigQuery, which speeds up retrieving the information. The reason why the query looks only in the path A -> B -> C is that BigQuery knows that all the customer names starting with 'A' have been placed in the local disk storage at node C itself.

Example 2: Parallel Processing
Statement: Count all the customers whose names start with 'A'.
1. Root node A sends out the query to node B, which in turn translates the query to count the names of all the customers starting with 'A'.
2. After translating, node B passes this query to leaf node C, which has data stored in columnar format in the form of the r1 and r2 tablets.
3. Node C accesses these tablets in parallel, counts the number of customers whose names start with 'A', and passes the counts Cr1 and Cr2 to node B, which sums the result and passes it to the root A as the output of the query.
So this not only increases the speed of retrieving the information, but also achieves the parallel processing that is needed when querying huge datasets.

3.3. Simplicity
Figure 3 shows the simple user interface provided by BigQuery. The dataset stored in BigQuery can be easily queried using SQL-like queries.

Fig. 3. [13] BigQuery Sample User Interface

You can enter your query in the text area provided and the results will be displayed below. The left side of the interface lists all the tables related to a particular database, which are referenced as projects in BigQuery. In figure 3, the left-hand side contains a list of all the tables related to the 'publicdata:samples' project. The query is entered in the text area below the 'Compose Query' text and its results are displayed below in tabular format.

3.4. Multiple access methods
You can access the BigQuery service in three different ways: a BigQuery browser tool, the bq command-line tool, or a REST-based API.

1. BigQuery Browser Tool: it allows users to easily browse, create tables, run queries, and export data to Google Cloud Storage.


2. bq Command-Line Tool: this is a Python command-line tool which permits managing and querying the data.

3. REST API: BigQuery can also be accessed by making calls to the REST API using a variety of client libraries such as Java, PHP or Python; a minimal Python example is sketched below.
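As an illustration of the API-based access method (an assumption-laden sketch, not an example from the paper), the snippet below submits a SQL query through the google-cloud-bigquery Python client and iterates over the result rows. It queries a public sample dataset from the bigquery-public-data project rather than the legacy publicdata:samples project shown in figure 3, and assumes authenticated Google Cloud credentials.

from google.cloud import bigquery

# Assumes application-default credentials are configured.
client = bigquery.Client()

# Count occurrences of Shakespeare words starting with 'A' in a public table.
sql = """
    SELECT word, SUM(word_count) AS total
    FROM `bigquery-public-data.samples.shakespeare`
    WHERE STARTS_WITH(word, 'A')
    GROUP BY word
    ORDER BY total DESC
    LIMIT 5
"""

query_job = client.query(sql)    # submit the query job
for row in query_job.result():   # wait for completion, then iterate over rows
    print(row.word, row.total)

The same query could equally be typed into the browser tool or passed to the bq command-line tool; the client library simply wraps the REST API mentioned above.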

4. COMPARISON OF BIGQUERY WITH IMPALA, SHARK, HIVE, REDSHIFT AND TEZ
In [14], the performance [15] of BigQuery is compared against Impala [16], Shark [17], Hive [18], Redshift [19] and Tez [20] using Intel's Hadoop benchmark [21] tools. This benchmark has three different sizes of datasets: tiny, 1node and 5nodes. The 'rankings' table has 90 million records and the 'uservisits' table has 775 million records. The data is taken from the Common Crawl [22] document corpus. The two tables have the following schemas:

Rankings table schema (lists websites and their page rank):
• pageURL (String)
• pageRank (Integer)
• avgDuration (Integer)

Uservisits table schema (stores server logs for each web page):
• sourceIP (String)
• destURL (String)
• visitDate (String)
• adRevenue (Float)
• userAgent (String)
• countryCode (String)
• languageCode (String)
• searchWord (String)
• duration (Integer)

The benchmark measures response time on three different types of queries: a basic scan query, an aggregation query, and a join query.

1. Scan Query
Figure 4 shows a general scan query on the rankings table.

Fig. 4. [14] Scan Query

This query has been fired using different values for 'X'. Query 1A means X's value is 1000, query 1B means X's value is 100 and 1C means X's value is 10.

2. Aggregation Query
Figure 5 shows an aggregation query on the uservisits table. The aggregation is performed using the sum function along with the substr function.

Fig. 5. [14] Aggregation Query

This query has been fired using different values for 'X'. Query 2A means X's value is 8, query 2B means X's value is 10 and 2C means X's value is 12.

3. Join Query
Figure 6 shows a join query between the rankings table and the uservisits table on a given condition.

Fig. 6. [14] Join Query

Like the other queries, this one also has different values for 'X'. Query 3A means X's value is '1980-04-01', query 3B means X is '1983-01-01' and 3C means X is '2010-01-01'. Illustrative shapes of these three query types are sketched below.
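The benchmark queries themselves appear only as figures in the source material and are not reproduced here. The Python strings below sketch plausible shapes for the three query types, based solely on the schemas and the parameter X described above; the exact columns and predicates used in figures 4, 5 and 6 may differ.

# X is the parameter that is varied across the A/B/C runs of each query.
X_SCAN = 1000          # e.g. query 1A
X_SUBSTR = 8           # e.g. query 2A
X_DATE = "1980-04-01"  # e.g. query 3A

# 1. Scan: filter the rankings table on pageRank.
scan_query = f"SELECT pageURL, pageRank FROM rankings WHERE pageRank > {X_SCAN}"

# 2. Aggregation: sum ad revenue per sourceIP prefix of length X,
#    combining the SUM and SUBSTR functions mentioned in the text.
aggregation_query = f"""
    SELECT SUBSTR(sourceIP, 1, {X_SUBSTR}) AS ip_prefix,
           SUM(adRevenue) AS revenue
    FROM uservisits
    GROUP BY SUBSTR(sourceIP, 1, {X_SUBSTR})
"""

# 3. Join: combine rankings and uservisits, restricting visits by date.
join_query = f"""
    SELECT uv.sourceIP, SUM(uv.adRevenue) AS totalRevenue, AVG(r.pageRank) AS avgRank
    FROM rankings r
    JOIN uservisits uv ON r.pageURL = uv.destURL
    WHERE uv.visitDate < '{X_DATE}'
    GROUP BY uv.sourceIP
"""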


4.1. Comparison results
Figure 7 shows the results.

Fig. 7. [14] Comparison of BigQuery with other storage platforms

In this comparison experiment, each query was executed at least 10 times, and the results displayed are the average values of the response time in seconds. Figure 7 shows that only in queries 1A, 1B and 1C did BigQuery take more time than Redshift, Impala (in memory) and Shark (in memory); the results for queries 2 and 3 are much better. For all the different values of X in queries 2 and 3 (A, B and C), BigQuery executed them faster than the other platforms. This shows that BigQuery can produce better results when processing complex queries on large datasets. As the number of processed records grew, BigQuery's response time was lower compared to the other storage platforms.

5. USE CASES OF BIGQUERY

5.1. Safari Books Online
Safari Books Online [23] uses BigQuery to find trends in customer purchases, manage negative feedback related to customer service, improve the sales team's effectiveness, etc. They chose BigQuery over other technologies because of its retrieval speed, obtained by querying the datasets using a familiar SQL-like language, and the lack of additional required maintenance.

5.2. RedBus
Online travel agency RedBus [24] introduced internet bus ticketing in India, incorporating thousands of bus schedules into a single booking operation. Using BigQuery, redBus was able to manage the terabytes of booking and inventory data quickly and at a lower cost than other big-data services. BigQuery also helped its engineers fix software bugs quickly, minimize lost sales and improve its customer service as well.

6. CONCLUSION
Google BigQuery is a cloud-based database service which enables large data sets to be processed quickly. BigQuery allows users to run SQL-like queries against multiple gigabytes to terabytes of data in a matter of seconds. It is suitable for ad-hoc OLAP/BI [25] queries that require results as fast as possible. As a cloud-powered parallel query database it provides extremely high full-scan query performance and cost effectiveness compared to other traditional data warehouse solutions and appliances.

ACKNOWLEDGEMENTS
I would like to thank my professor Gregor von Laszewski and all the associate instructors for their constant technical support.

REFERENCES
[1] "What is bigquery," Web Page, accessed: 2017-3-24. [Online]. Available: https://cloud.google.com/bigquery/
[2] S. Fernandes and J. Bernardino, "What is bigquery?" in Proceedings of the 19th International Database Engineering & Applications Symposium. New York, NY, USA: ACM, 2014, pp. 202–203. [Online]. Available: http://doi.acm.org.proxyiub.uits.iu.edu/10.1145/2790755.2790797
[3] "Apache hadoop," Web Page, accessed: 2017-3-25. [Online]. Available: http://hadoop.apache.org/
[4] J. Dean and S. Ghemawat, "Mapreduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008, published as an article. [Online]. Available: http://doi.acm.org/10.1145/1327452.1327492
[5] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, "Dremel: Interactive analysis of web-scale datasets," in Proc. of the 36th Int'l Conf on Very Large Data Bases, Sep. 2010, pp. 330–339. [Online]. Available: http://www.vldb2010.org/accept.htm
[6] "What is informatica," Web Page, accessed: 2017-4-07. [Online]. Available: https://www.informatica.com
[7] K. Sato, "Fluentd + google bigquery," Web Page, Mar. 2014, accessed: 2017-3-24. [Online]. Available: https://www.slideshare.net/GoogleCloudPlatformJP/google-for-1600-kpi-fluentd-google-big-query
[8] "What is tableau," Web Page, accessed: 2017-4-07. [Online]. Available: https://www.tableau.com/
[9] "Qlik," Web Page, accessed: 2017-4-07. [Online]. Available: http://www.qlik.com/us/
[10] "Microsoft excel," Web Page, accessed: 2017-4-07. [Online]. Available: https://products.office.com/en-us/excel
[11] "What is bigquery? bigquery documentation google cloud platform," Web Page, accessed: 2017-3-23. [Online]. Available: https://cloud.google.com/bigquery/what-is-bigquery
[12] V. Agrawal, "Google bigquery vs mapreduce vs powerdrill," Web Page, accessed: 2017-3-23. [Online]. Available: http://geeksmirage.com/google-bigquery-vs-mapreduce-vs-powerdrill
[13] J.-k. Kwek, "Google bigquery service: Big data analytics at google speed," Blog, Nov. 2011, accessed: 2017-3-23. [Online]. Available: https://cloud.googleblog.com/2011/11/google-bigquery-service-big-data.html
[14] V. Solovey, "Google bigquery benchmark," Blog, Jun. 2015, accessed: 2017-3-23. [Online]. Available: https://www.doit-intl.com/blog/2015/6/9/bigquery-benchmark
[15] "Amp lab big data benchmark," Web Page, accessed: 2017-4-08. [Online]. Available: https://amplab.cs.berkeley.edu/benchmark/
[16] M. Kornacker and J. Erickson, "Cloudera impala: Real-time queries in apache hadoop, for real," Blog, Oct. 2012, accessed: 2017-4-07. [Online]. Available: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/
[17] "Apache spark," Web Page, accessed: 2017-4-07. [Online]. Available: https://spark.apache.org/sql/
[18] "Apache hive," Web Page, accessed: 2017-4-07. [Online]. Available: http://hive.apache.org/
[19] "Amazon's redshift," Web Page, accessed: 2017-4-07. [Online]. Available: https://aws.amazon.com/redshift/
[20] C. Shanklin, "Announcing stinger phase 3 technical preview," Blog, Dec. 2013, accessed: 2017-4-08. [Online]. Available: https://hortonworks.com/blog/announcing-stinger-phase-3-technical-preview/


[21] "Intel's hadoop benchmark," Git Repo, accessed: 2017-4-07. [Online]. Available: https://github.com/intel-hadoop/HiBench
[22] "Common crawl," Web Page, accessed: 2017-4-07. [Online]. Available: http://commoncrawl.org/
[23] D. Peter, "How safari books online uses bigquery for business intelligence," Web Page, accessed: 2017-4-08. [Online]. Available: https://cloud.google.com/bigquery/case-studies/safari-books
[24] "Travel agency masters big data with google bigquery," Web Page, accessed: 2017-4-08. [Online]. Available: https://cloud.google.com/customers/redbus/
[25] "OLAP," Web Page, accessed: 2017-4-08. [Online]. Available: http://olap.com/olap-definition/


Hive

DIKSHA YADAV1,*, +

1School of Informatics and Computing, Bloomington, IN 47408, U.S.A. *Corresponding authors: [email protected] +HID - S17-IR-2044

Paper-002, April 30, 2017

Hive is an open source data warehousing solution which is built on top of Hadoop. It structures data into understandable and conventional database terms like tables, columns, rows and partitions. It supports HiveQL queries which have structure like SQL queries. HiveQL queries are compiled to map reduce jobs which are then executed by Hadoop. Hive also contains Metastore which includes schemas and statistics which is useful in query compilation, optimization and data exploration. © 2017 https://creativecommons.org/licenses/. The authors verify that the text is not plagiarized.

Keywords: Hive, Hadoop, HiveQL, SQL, HDFS, RDBMS

https://github.com/cloudmesh/sp17-i524/tree/master/paper2/S17-IR-2044/report.pdf

1. INTRODUCTION
Hive is an ETL and open source data warehousing solution which is built on top of the Hadoop Distributed File System. Hive was built in January 2007 and open sourced in August 2008. It structures data into understandable and conventional database terms like tables, columns, rows and partitions. It supports HiveQL queries, which are structured like SQL queries. HiveQL queries are compiled to map reduce jobs which are then executed by Hadoop. Hive also contains the Metastore, which includes schemas and statistics that are useful in query compilation, optimization and data exploration. In short, Hive can be used for analyzing huge datasets, performing encapsulation of data and running ad hoc queries [1].

2. ARCHITECTURE
The Hive architecture, as provided in Fig. 1 from [2], has the following components. Database: it consists of tables created by the user; the Hadoop Distributed File System and/or HBase are used as data storage techniques to store data in the file system. Metastore: it contains information about the system; it can be accessed by different components as and when needed, and all components of Hive interact with the metastore. Interfaces: both a user interface and an application programming interface are present in Hive; the external interfaces include a command line interface and a web user interface, and Hive also contains JDBC and ODBC application programming interfaces. Driver: it manages HiveQL statements at every stage, including the compilation, optimization and execution stages; a session handle is created every time a HiveQL statement is received from any interface or the thrift server, which records information like the number of output rows, execution time, etc. Query compiler: it compiles HiveQL queries into directed acyclic graphs representing map reduce tasks. Execution engine: it executes the tasks generated by the compiler. Hive server: it provides a JDBC/ODBC server and a thrift interface. When a HiveQL statement is received from an interface, the driver invokes the compiler to translate the HiveQL statement into a directed acyclic graph of map reduce jobs; the map reduce jobs are then submitted by the driver to an execution engine (like Hadoop) in topological order [2].

Fig. 1. Hive Architecture

3. QUERY EXECUTION IN HIVE
When a query is submitted to Hive, the compiler compiles the query. The compiled query is executed by an execution engine like MapReduce. Resources are then allocated across the clusters for the application by the resource manager, YARN. The data used by the query is stored in HDFS (Hadoop Distributed File System). Supported data formats are AVRO, Parquet, ORC and text. When the results of the query are ready, they are sent back over the JDBC/ODBC connection [3]. A minimal client-side sketch of this flow is given below.
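As a client-side illustration of this flow (an assumption-laden sketch, not part of the paper), the snippet below submits a HiveQL query to a HiveServer2 instance through the third-party PyHive library, which talks to the thrift interface mentioned above. The host, port, username, database and table name are placeholders.

from pyhive import hive  # third-party client for HiveServer2 (pip install pyhive)

# Placeholder connection details for a HiveServer2 endpoint.
conn = hive.Connection(host="localhost", port=10000,
                       username="hadoop", database="default")
cursor = conn.cursor()

# The HiveQL below is compiled by Hive into map reduce jobs and executed
# on the cluster; the results come back over the thrift connection.
cursor.execute("SELECT col1, COUNT(*) FROM sample_table GROUP BY col1")

for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()

Any JDBC/ODBC-based client would follow the same submit, compile, execute and fetch pattern; PyHive is used here only to keep the example in Python.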


4. FEATURES
Hive can be used with structured or semi-structured data only. HiveQL does not require the user to deal with complex MapReduce programming; in fact, the user works with concepts similar to relational databases, like tables, schemas, rows and columns. Hive supports four file formats: text file, sequence file, ORC and RCFile. HiveQL syntax is similar to SQL syntax. Hive queries execute on Hadoop's infrastructure rather than on a traditional database. Hive uses the partition concept for data retrieval, and it supports custom user defined functions for data cleansing, filtering, etc. Hive can be used in two modes, local mode and MapReduce mode; the selection of the mode depends on conditions like the data size and the number of data nodes in Hadoop. By default, Hive runs in MapReduce mode [4]. A sketch of a partitioned table definition is given below.
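To illustrate the partition and file-format support just described, the sketch below creates a hypothetical table partitioned by date and stored as ORC, then queries a single partition. It reuses the PyHive connection style from the earlier sketch; the table, columns and connection details are illustrative only.

from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# A hypothetical log table, partitioned by day and stored in the ORC format.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS web_logs (
        ip      STRING,
        url     STRING,
        status  INT
    )
    PARTITIONED BY (log_date STRING)
    STORED AS ORC
""")

# A query that filters on the partition column only reads the files of the
# matching partition instead of scanning the whole table.
cursor.execute("SELECT COUNT(*) FROM web_logs WHERE log_date = '2017-04-01'")
print(cursor.fetchone())

cursor.close()
conn.close()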

5. SYSTEM REQUIREMENTS
Hive is cross-platform, so it does not need any specific operating system to work. Minimum system requirements: CPU speed: Intel Dual-Core 2.4 GHz or AMD equivalent; RAM: 2 GB; OS: Windows 7; video card: NVIDIA GeForce 8800GT; sound card: DirectX®-compatible; free disk space: 1 GB. Preferred system requirements: CPU speed: Intel Dual-Core 2.4 GHz or AMD equivalent; RAM: 4 GB; OS: Windows 7; video card: NVIDIA GeForce GTX 260; sound card: DirectX®-compatible; free disk space: 1 GB [5].

6. COMPARISON OF HIVE WITH OTHER TRADITIONAL DATABASES
Traditional databases like RDBMSs follow a schema-on-write approach, that is, read and write many times, while Hive follows a schema-on-read approach, that is, write once and read many times. With schema on write, the database checks at load time whether the data follows the table representation given by the user, while with the schema-on-read approach this is checked only at run time. This saves loading time for Hive, where traditional databases take longer. Hive does not support record-level updates like an RDBMS does; for example, we cannot perform delete, update or insert operations at the record level in Hive as we can in an RDBMS. Hive does not support OLTP (Online Transaction Processing); it only supports OLAP (Online Analytical Processing), whereas an RDBMS supports both OLTP and OLAP. For dynamic data analysis, an RDBMS would be preferred if quick responses are needed. Hive is suitable for data warehousing applications, where analysis is done on static data and fast responses are not needed. One more difference between Hive and an RDBMS is that Hive is scalable, and that too at low cost, while scalability comes at a higher cost in an RDBMS [6].

7. POPULARITY OF HIVE
The popularity of Hive is increasing with time. This can be seen in the plot made by the DB-Engines Ranking, which ranks database management systems according to their status and popularity. The plot in Fig. 2 shows the popularity of Hive over time.

Fig. 2. Hive Popularity [7]

8. RESOURCES FOR LEARNING HIVE
Someone new to Hive can start learning it by going through the following links in sequence:
Install Hive: https://www.edureka.co/blog/apache-hive-installation-on-ubuntu?utm_source=quora&utm_medium=crosspost&utm_campaign=social-media-edureka-ab
Hive Tutorial: https://www.edureka.co/blog/hive-tutorial/?utm_source=quora&utm_medium=crosspost&utm_campaign=social-media-edureka-ab


Top Hive commands with examples: https://www.edureka.co/blog/hive-commands-with-examples?utm_source=quora&utm_medium=crosspost&utm_campaign=social-media-edureka-ab

9. ACKNOWLEDGEMENT
I am grateful to Dr. Gregor von Laszewski for providing the appropriate paper template.

10. CONCLUSION
Hive is an ETL and data warehouse tool on top of the Hadoop framework which is used for processing structured and semi-structured data. Hive provides a flexible query language, HiveQL, for querying and processing data. It makes work easier for users, as they do not have to deal with MapReduce programming complexity when using the SQL-like structure of HiveQL. It provides many new and better features compared to an RDBMS, which has certain limitations. It supports writing and deploying custom user defined scripts and user defined functions.

REFERENCES
[1] "Hive," Web Page, online; accessed 24-March-2017. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Hive
[2] "Hive article," Web Page, online; accessed 24-March-2017. [Online]. Available: https://docs.treasuredata.com/articles/hive
[3] "Hive architecture," Web Page, online; accessed 07-April-2017. [Online]. Available: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_performance_tuning/content/ch_hive_architectural_overview.html
[4] "Hive," Web Page, online; accessed 07-April-2017. [Online]. Available: http://www.guru99.com/introduction-hive.html
[5] "System requirements hive," Web Page, online; accessed 07-April-2017. [Online]. Available: https://www.systemrequirementslab.com/cyri/requirements/the-hive/14593
[6] A. Thusoo et al., "Hive - a petabyte scale data warehouse using hadoop," Paper. [Online]. Available: http://infolab.stanford.edu/~ragho/hive-icde2010.pdf
[7] "Ranking hive," Web Page, online; accessed 20-March-2017. [Online]. Available: http://db-engines.com/en/ranking_trend/system/Hive
