CIS 612 Advanced Topics in Database
Big Data Project
Lawrence Ni, Priya Patil, James Tench

Abstract
Implementing a Hadoop-based system for processing big data and performing analytics is a task that many others have completed in the past, and there is ample documentation describing the process. From the perspective of a beginner, or even someone with little knowledge of implementing a big data system from scratch, the process can be overwhelming. Wading through the documentation, making mistakes along the way, and correcting those mistakes are considered by many to be part of the learning process. This paper shares our experience installing Hadoop on an Amazon Web Services cluster and analyzing the collected data in a meaningful way. The goal is to describe the areas where we encountered trouble so that the reader may benefit from our learning.

Introduction
The Hadoop-based installation was implemented by Lawrence Ni, Priya Patil, and James Tench as a group, working on the project over a series of Sunday afternoons. Between meetings, the individual members performed additional research to prepare for the following meeting. The Hadoop installation on Amazon Web Services (AWS) consisted of four servers hosted on micro EC2 instances. The cluster was set up with one NameNode and three DataNodes. In a production implementation, multiple NameNodes would be configured to account for machine failures. In addition to running Hadoop, the NameNode ran Hive as the warehouse layer used to query the data. Every step was first implemented on a local machine for testing before any job was run on the AWS cluster. On our local machines we ran MongoDB to query the JSON data in an easy manner. In addition, the team implemented a custom Flume agent to handle streaming data from Twitter's firehose.

AWS
Amazon Web Services offers various products that can be used in a cloud environment; running an entire cluster of machines in the cloud this way is a form of infrastructure as a service. To get started with setting up a cloud infrastructure, you begin by creating an account with AWS. AWS offers a free tier, which provides low-end machines; for our implementation, these low-end machines served our needs. After creating an account with AWS, the documentation for creating an EC2 instance is the place to start. An EC2 instance is the standard type of machine that can be launched in the cloud. The entire AWS setup was as easy as following a wizard to launch the instances.

Configuration
After successfully launching four instances, getting the machines to run Hadoop required downloading the Hadoop files and configuring each node. This is the first spot where the group encountered configuration issues. The trouble was minor and easy to resolve, but it came down mostly to remembering how the steps differ from installing Hadoop in pseudo-distributed mode. Hadoop communicates via SSH and must be able to do so without being prompted for a password. For AWS machines to communicate, the connection must be made via SSH with your digitally signed key available. To remedy the communication problem, a copy of the PEM file used locally was copied to each machine. Once the file was in place, a connection entry was made in the ~/.ssh/config file with the IP address information for the other nodes (a sample entry is sketched below). The next step after configuring the SSH connection settings was to set up each of the Hadoop config files. Again, this process was straightforward.
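As an illustration of the passwordless SSH setup described above, a minimal sketch of a ~/.ssh/config entry follows. The host alias, user name, key file name, and IP address are hypothetical placeholders rather than the actual values from our cluster, and the login user depends on the AMI chosen (for example ec2-user for Amazon Linux or ubuntu for Ubuntu).

    # ~/.ssh/config on the NameNode (repeat a block for each DataNode)
    Host datanode1
        # private IP of the first DataNode (hypothetical)
        HostName 172.31.20.11
        User ec2-user
        IdentityFile ~/.ssh/cluster-key.pem

With an entry like this on every node, a plain "ssh datanode1" connects without a password prompt, which is what the Hadoop start-up scripts need when they reach out to the other nodes.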
Following the documentation on the Apache Hadoop website was all that was needed to complete the configuration. The key differences between installing on a cluster and installing in pseudo-distributed mode were creating a slaves file, setting the replication factor, and adding the IP addresses of the DataNodes.

Flume
The Twitter firehose API was chosen as the data source for our project. The firehose is a live stream of "tweets" coming from Twitter. To connect to the API, it is necessary to go to Twitter's developer page and register as a developer. Upon registration you may create an "app" and obtain an API key for it. This key is used to connect to, and download data from, the various Twitter APIs. Because the data arrives as a stream (as opposed to responses from a REST API), a method is needed for moving the data from the stream into HDFS, and Flume provides this capability. Flume works with sources, channels, and sinks. A source is where data enters Flume; in our case it is the streaming API. A channel is the mechanism used to hold data as it moves toward permanent storage; for this project, memory was used as the channel. Finally, the sink is where data is ultimately stored; in our case, data is stored in HDFS.

Flume is also very well documented, and the documentation will guide you through the majority of the process of creating a Flume agent. One area documented on the Flume website references the Twitter API and warns the user that the code is experimental and subject to change. This was the first area of configuring Flume where trouble was encountered. For the most part, the Apache Flume example worked for downloading data and storing it in HDFS. However, the Twitter API allows filtering of the data via keywords passed with the API request, and the default Apache implementation did not support passing keywords, so there was no filter. To get around this problem, there is a well-documented Java class from Cloudera that adds the ability to use a Flume agent with a filter condition. For our project we elected to copy the Apache implementation and modify it by adding the filter code from Cloudera. Once this was in place, Flume was streaming data from Twitter to HDFS.

After a few minutes of letting Flume run on a local machine, the program began throwing exceptions, and the exceptions kept increasing. To solve this problem it was necessary to modify the Flume agent config file so that the memory channel was flushed to disk often enough. After modifying the transaction capacity setting, and some trial and error, the Flume agent began running without exceptions. The key to getting the program to run without exceptions was to set the channel's transaction capacity higher than the sink's batch size. Once this was working as desired, the Flume agent was copied to the NameNode on AWS. The NameNode launched the Flume agent, which was allowed to download data for several days.

Flume Java code

MongoDB
The Twitter API sends data in JSON format. MongoDB handles JSON naturally because it stores data in a binary JSON format called BSON. For these reasons, we used MongoDB on a local machine to better understand the raw data. Sample files were copied from the AWS cluster to a local machine and imported into MongoDB via the mongoimport command. Once the data was loaded, the mongo query language was used to view the format of the tweets, test for valid data, and review simple aggregations.
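As a brief sketch of the kind of local inspection described above, the commands below assume a hypothetical sample file named tweets.json loaded into a database named twitter; the field names come from the standard Twitter tweet format rather than from our actual sample files.

    mongoimport --db twitter --collection tweets --file tweets.json

    // from the mongo shell: check the volume and structure of the data
    use twitter
    db.tweets.count()
    db.tweets.findOne()

    // simple aggregation: ten largest follower counts by screen name
    db.tweets.aggregate([
        { $group: { _id: "$user.screen_name", followers: { $max: "$user.followers_count" } } },
        { $sort: { followers: -1 } },
        { $limit: 10 }
    ])

Quick checks like these confirmed that fields such as user.screen_name and user.followers_count were present and usable before any cluster jobs were built around them.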
Realizing we wanted a method to process large amounts of data directly on HDFS, the group decided that MongoDB would not be the best choice for direct manipulation of the data on HDFS. For those reasons, the use of MongoDB was limited to analyzing and reviewing sample data.

MapReduce
The first attempt to process large queries on the Hadoop cluster involved writing a MapReduce job. The JSONObject library created by Douglas Crockford was used to parse the raw JSON and extract the components being aggregated. A MapReduce job for a single summary metric was easily implemented by using the JSONObject library to extract screen_name as the key and followers_count as the value. Once again, the job was tested locally first, then run on the cluster. With about 3.6 GB of data, the cluster processed our count job in about 90 seconds. We did not consider this bad performance for four low-end machines processing almost 4 GB of data. Although the MapReduce job was not difficult to create in Java, it lacked the flexibility of running arbitrary ad hoc queries at will. This led to the next phase of processing our data on the cluster.

MapReduce code

HIVE
Apache Hive, like the other products mentioned above, was very well documented and easy to install on the cluster by following the standard docs. Moving data into Hive proved to be the challenge. For Hive to process data, it needs a method for serializing and deserializing records when a query is run; this is referred to as a SerDe. Finding a JSON SerDe was the easy part: we used the Hive-JSON-Serde from user rcongiu on GitHub. The initial trouble with setting up the Hive table was telling the SerDe what the format of the data would look like. Typically a create table statement must be written that defines each field inside the nested JSON document. During the development and implementation of the table, many of the data fields that we expected to hold a value were returning null. This is where we learned that for the SerDe to work properly, the table definition needs to be very precise. Because each tweet from Twitter did not always contain complete data, our original implementation was failing. To create the correct schema definition, another library called hive-json-schema by user quux00 on GitHub was used. This tool was very good at auto-generating a Hive schema when provided with a single sample JSON document.
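To make the table-definition discussion concrete, the following is a heavily simplified sketch of a Hive table backed by the rcongiu JSON SerDe, together with the kind of ad hoc query that motivated the move from MapReduce to Hive. The table name, HDFS location, and the small subset of fields are hypothetical; a real tweet schema, such as the one generated for us by hive-json-schema, contains many more nested fields.

    -- external table over the raw JSON written by Flume (simplified sketch)
    CREATE EXTERNAL TABLE tweets (
      id_str STRING,
      created_at STRING,
      text STRING,
      `user` STRUCT<screen_name:STRING, followers_count:INT>
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    LOCATION '/user/flume/tweets';

    -- ad hoc query: ten screen names with the largest follower counts
    SELECT `user`.screen_name, MAX(`user`.followers_count) AS followers
    FROM tweets
    GROUP BY `user`.screen_name
    ORDER BY followers DESC
    LIMIT 10;

The important lesson for us was that every nested field referenced in the table definition has to match the JSON exactly; generating the full definition from a sample document with hive-json-schema was what resolved the null values we had been seeing.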