International Journal of Pure and Applied Mathematics, Volume 118, No. 14, 2018, 229-233
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue

LOG DATA PROCESSING WITH MAPREDUCE

L. Chandra Sekhar Reddy1, Dr. D. Murali2
1 Assistant Professor, Department of CSE, CMR College of Engineering & Technology, T.S.
2 Professor, Department of CSE, Malla Reddy College of Engineering for Women, T.S.
dabbumurali@.com, [email protected]

Abstract: The advent of Big Data has presented new difficulties and has made the conventional frameworks no longer adequate. The Hadoop framework has been fruitful in dealing with the Big Data challenges, and MapReduce has been one of the key methodologies in taking care of the ever-expanding computational demands imposed by Big Data. Most traditional database systems fail not only because of the huge data volumes but also because of the wide variety of data: traditional frameworks are not built for processing unstructured and semi-structured data. This paper reviews the processing of unstructured log data with MapReduce. An efficient analysis of a stock dataset is implemented and the phases involved are elucidated.

Keywords: Hadoop, MapReduce, Unstructured data, Log data, Stock data processing, Mapper, Reducer, Hadoop Distributed File System

1. Introduction

We live in a world which is heavily reliant on and driven by data, and there has been an explosion of information lately. Ninety percent of the information on the planet has been created in the most recent two years; that is, more information was produced in the previous two years than in the whole prior history of mankind [1]. Information is growing exponentially: by the year 2020 it is expected that each individual on earth will produce 1.7 MB of data every second, and the aggregate accumulated data, currently 4.4 Zettabytes, will go up to 44 Zettabytes, or 44 trillion Gigabytes. By then we will have more than 6.1 billion smartphone users and 50 billion devices connected to the web, and 33% of all collected information will be stored in the cloud. So far, only 0.5% of all collected information is actually examined and utilized, which demonstrates the colossal potential for data mining here [2].

1.1. Hadoop framework

Hadoop is a framework from the Apache Software Foundation, and numerous companies are utilizing Hadoop to take care of their big data issues. Hadoop is an open source framework for a distributed environment that handles data storage, processing, security and privacy. The hardware needed to run Hadoop is cost effective and can be scaled horizontally, unlike in the case of traditional database systems [3]. Organizations like Google, Yahoo, and Facebook had challenges with huge data quantities early on, and the Hadoop project came about as a result of their endeavors to deal with their data sizes. It is intended to run on a vast number of machines that do not share any memory, and it accounts for hardware failures through its distributed architecture. Information is spread over the group of machines using HDFS, the Hadoop Distributed File System, and data is processed using MapReduce, a Java-based programming model for processing data [4]. The main components of Hadoop are:
1. HDFS (Hadoop Distributed File System)
2. MapReduce

HDFS (Hadoop Distributed File System)

The Hadoop Distributed File System (HDFS) is intended to store huge data sets dependably, and to stream those data sets at high bandwidth to client applications. In a huge cluster, a number of servers both host directly attached storage and execute client-submitted jobs. By distributing storage and processing across numerous servers, the cluster can grow as needed while staying cost effective at every size [5].

2. MapReduce architecture

MapReduce comprises a Job Tracker, to which applications submit MapReduce jobs. The Job Tracker pushes jobs out to idle Task Tracker nodes in the cluster, while attempting to keep the work as near to the data as it can [6]. With a rack-aware file system, the Job Tracker knows which node holds the data and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes on the same rack. This lessens traffic on the primary network. In the case of a Task Tracker failure or timeout, the job is rescheduled.


Fig. 1: MapReduce architecture

2.2. MapReduce workflow

MapReduce is a programming model and an associated execution framework for processing huge datasets. A MapReduce (MR) job is performed in two stages, the Map and the Reduce phase; the master picks unassigned workers and assigns them a map or reduce task depending on the phase. Before the Map task begins, the input file is moved to the Hadoop Distributed File System. While loading, the file is divided into numerous blocks of similar size, normally 64 MB, and each block is kept in three copies to ensure fault tolerance [7].

Developers define a map function that processes a key/value pair to create intermediate key/value pairs, and a reduce function that merges all values related to the same key. Numerous real tasks are expressible in this model, like the one presented in this paper [8].
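In the notation of the original MapReduce paper [8], these two user-supplied functions have the following types, where (k1, v1) is an input key/value pair and (k2, v2) an intermediate one:

map:    (k1, v1)       → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)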

3. Unstructured Data Processing - Limitations with Traditional RDBMS

A traditional RDBMS not only fails to handle Big Data volumes but is also not equipped to process unstructured data. In the first place, data sizes have expanded hugely, into the range of petabytes (1 petabyte being 1,024 terabytes), and an RDBMS finds it almost impractical to deal with such colossal data volumes. To address this, RDBMS deployments added more CPUs and more memory to the system to scale up vertically [9].

Secondly, most of the data comes in unstructured or semi-structured formats: social networking content, audio, video and emails. This second issue, related to unstructured data, is outside the domain of an RDBMS, on the grounds that relational databases cannot handle unstructured information. They are designed and organized to suit structured data, for example weblogs and transactional data. MapReduce empowers developers with the ability to filter and aggregate the unstructured data and then process it with the business logic [10].

4. Processing stocks data

We have a subset of a stock dataset with information about stock symbols. Each line in the dataset carries the information about one stock symbol for one day: the opening price, closing price, high, low, volume, and so on. Each line thus represents one record: the first field is the exchange name, the next the symbol, then the date, the opening price, the closing price, the high and low for the day, and the volume. Using this dataset, we would like to find the maximum closing price for each stock symbol across several days.

To implement the solution to our problem, we will be using a single-node cluster developed by Cloudera. We will also be using the Eclipse IDE to write the Java classes for the mapper and reducer. Let us have a look at our dataset.
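The dataset itself is not reproduced in the paper, so the following records are purely hypothetical, laid out in the field order just described (exchange, symbol, date, open, close, high, low, volume):

NYSE,AAA,2010-02-08,27.85,28.12,28.40,27.60,1402300
NYSE,AAA,2010-02-09,28.10,27.93,28.25,27.71,1187400
NYSE,BBB,2010-02-08,12.40,12.55,12.70,12.31,904200

Note that the mapper in Section 4.1 reads the closing price from the fourth field (items[3]); verify the column order of your copy of the dataset and adjust the index if it differs.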

4.1. Mapper Class

Create a Java class named (let's say) StockMapper, extend the Mapper class from the org.apache.hadoop.mapreduce package, and override the map method to implement the mapper logic in this class. Let us work on the logic: the input consists of lines of text, but they arrive as Hadoop's Text type rather than as Strings, so we first convert each value into a String. Once it is converted, you can use the split method to separate each line, or record, into its comma-separated fields. From each record you will need two fields, the stock symbol and its closing price, so assign them to two new variables. The logic will look something similar to this:

String line = value.toString();
String[] items = line.split(",");
String stockSymbol = items[1];
float closePrice = Float.parseFloat(items[3]);
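Putting the pieces together, a minimal sketch of the complete mapper might look as follows. Only the snippet above appears in the original; the generic signature Mapper<LongWritable, Text, Text, FloatWritable> is an assumption consistent with the text (line offset and line in, symbol and price out).

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits one (stock symbol, closing price) pair per input record.
public class StockMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] items = line.split(",");
        // Assumed field positions: items[1] = symbol, items[3] = closing
        // price, as in the snippet above; adjust to your dataset's layout.
        String stockSymbol = items[1];
        float closePrice = Float.parseFloat(items[3]);
        context.write(new Text(stockSymbol), new FloatWritable(closePrice));
    }
}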

4.2. Reducer Class

Create a Java class named (let's say) StockReducer, extend the Reducer class from the org.apache.hadoop.mapreduce package, and override the reduce method to implement the reducer logic in this class. From the mapper, we already have the stock symbols and their respective closing price values. Now, using a foreach loop, we find the maximum closing price for each stock symbol by comparing all of its values. The logic will look something like this:

float maxClosePrice = Float.MIN_VALUE;
for (FloatWritable value : values) {
    maxClosePrice = Math.max(maxClosePrice, value.get());
}
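Again, a minimal sketch of the full class under the same assumptions; the original gives only the loop above. Note that Float.MIN_VALUE is the smallest positive float, so it is a safe starting value only because closing prices are positive; Float.NEGATIVE_INFINITY would be the more defensive choice.

import java.io.IOException;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (symbol, list of closing prices) and emits (symbol, maximum).
public class StockReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {

    @Override
    protected void reduce(Text key, Iterable<FloatWritable> values, Context context)
            throws IOException, InterruptedException {
        float maxClosePrice = Float.MIN_VALUE;
        for (FloatWritable value : values) {
            maxClosePrice = Math.max(maxClosePrice, value.get());
        }
        context.write(key, new FloatWritable(maxClosePrice));
    }
}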
4.3. Driver Class

This is the class where execution of the code starts, from the main method. In this class, we import classes such as Job, FileInputFormat, FileOutputFormat, TextInputFormat, TextOutputFormat and Path. This is also the class where you assign the mapper and reducer classes for execution of the logic. FileInputFormat is used for adding the input path, and FileOutputFormat is used to set the output path where the output will be stored.
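A minimal sketch of such a driver, using only the classes named above; the package name com.mapreduce.stocks is taken from the hadoop jar command below, and the mapper and reducer classes are assumed to live in the same package.

package com.mapreduce.stocks;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Wires the mapper and reducer together and submits the job.
// args[0] is the input path in HDFS, args[1] the output folder.
public class StocksDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(StocksDriver.class);
        job.setJobName("Max closing price per stock symbol");

        job.setMapperClass(StockMapper.class);
        job.setReducerClass(StockReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Types of the (key, value) pairs emitted by the reducer.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FloatWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Blocks until the job completes; exit code reflects success.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}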
Once we have written our Java classes, the next step is to create a jar file from them by exporting the classes from the IDE. In our example, we have created a jar file named StockPrice.jar. To execute the program in Hadoop, we first need to upload the data into HDFS. The command that we use to insert data into HDFS is:

hadoop fs -put /home/training/Desktop/StocksData /user/training/Stocks

Let us now execute our logic in Hadoop:

hadoop jar /home/training/Desktop/StockPrice.jar (jar file)
    com.mapreduce.stocks.StocksDriver (driver class)
    /user/training/Stocks (input file path in HDFS)
    /user/training/StockOutput (output folder path)

Here the first path is the input data file in HDFS and the second path is the output folder path.

5. Results

Let us now see the results of the logic we have implemented to find the maximum closing price of each stock symbol. The console output during execution shows the complete MapReduce process as it runs. Once the job finishes, we can view the output in the HDFS browser as well as from the terminal.
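With the default TextOutputFormat, each reducer writes its output to a part file inside the output folder; part-r-00000 is the conventional name of the first (here, the only) reducer's file. A typical way to inspect it from the terminal would be:

hadoop fs -cat /user/training/StockOutput/part-r-00000

Each line of this file holds one stock symbol and its maximum closing price, separated by a tab.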
6. Conclusion

This paper describes how log data, specifically unstructured data, is processed using the MapReduce programming paradigm. MapReduce is a very efficacious approach to filter and analyze large datasets. In this paper, a sample of unstructured data, a stock dataset, is used to demonstrate the implementation of a MapReduce job. This work gives a better understanding of developing MapReduce jobs and encourages more research into finding efficient ways to process unstructured log data.

References

[1] Samiddha Mukherjee and Ravi Shaw, "Big Data - Concepts, Applications, Challenges and Future Scope," International Journal of Advanced Research in Computer and Communication Engineering, vol. 5, issue 2, February 2016.
[2] Marr, Bernard. "Big Data: 20 Mind-Boggling Facts Everyone Must Read." Forbes, Forbes Magazine, 19 Nov. 2015.
[3] Harshawardhan S. Bhosale and Devendra P. Gadekar, "A Review Paper on Big Data and Hadoop," International Journal of Scientific and Research Publications, vol. 4, issue 10, October 2014.
[4] Donna De Capite, Techniques in Processing Data on Hadoop, SAS Institute Inc., Cary, NC, 2014.
[5] Shvachko, Konstantin, et al. "The Hadoop Distributed File System." 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, doi:10.1109/msst.2010.5496972.
[6] Gunarathne, Thilina, and Srinath Perera. Hadoop MapReduce Cookbook. Packt Publishing, 2013.
[7] Lee, Kyong-Ha, et al. "Parallel Data Processing with MapReduce." ACM SIGMOD Record, vol. 40, no. 4, Nov. 2012, p. 11, doi:10.1145/2094114.2094118.
[8] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce." Communications of the ACM, vol. 51, no. 1, Jan. 2008, p. 107, doi:10.1145/1327452.1327492.
[9] Anne Shields. "Why Traditional Database Systems Fail to Support 'Big Data.'" Market Realist, 25 Jul. 2014, marketrealist.com/2014/07/traditional-database-systems-fail-support-big-data/.
[10] Subramaniyaswamy, V., et al. "Unstructured Data Analysis on Big Data Using Map Reduce." Procedia Computer Science, vol. 50, 2015, pp. 456-465, doi:10.1016/j.procs.2015.04.015.

L. Chandra Sekhar Reddy received his B.Tech (CSE) from VITS College, affiliated to JNTU, in 2007, and his M.Tech (CSE) from GNIT, JNTUH, in 2014. He has been pursuing his Ph.D. at JJTU, Rajasthan, since 2017. He is working at CMR College of Engineering & Technology, Medchal, Hyderabad. His research interests include Data Mining, Big Data and Compiler Design.

Dabbu Murali received his B.Tech (CSE) and M.Tech (CS) from JNTU College of Engineering, Kukatpally, Hyderabad. He received his Ph.D. from JNT University, Hyderabad, in 2016. He has 15 years of teaching and research experience. Presently he is working at Malla Reddy Engineering College for Women, Maisamaguda, Hyderabad. His research interests include Data Mining, Big Data, Computer Networks, Compiler Design, Design and Analysis of Algorithms and Machine Learning.
