Imperial Journal of Interdisciplinary Research (IJIR) Vol-3, Issue-1, 2017 ISSN: 2454-1362, http://www.onlinejournal.in

Analysis of Web Server Logs Using Apache Hive to Identify the User Behaviour

B. Shalem Raju¹, Dr. K. Venkata Ramana²
¹M.Tech, Dept. of Computer Science and Systems Engineering, Andhra University College of Engineering (A), Visakhapatnam, Andhra Pradesh, India.
²M.Tech, Ph.D, Asst. Professor, Dept. of CS&SE, Andhra University College of Engineering (A), Visakhapatnam, India.

Abstract: A web server log is a log file created and stored automatically by a web server, recording who visited the site, where they came from, and what they did on the website. These log files carry useful information for service providers, so analyzing them can give insights into website traffic patterns, user activity, customer interest, and so on. Because of the heavy use of the web, log files are growing at a much faster rate and increasing in size, and processing such rapidly growing log files with relational database technology has become a challenging task. Hence, to analyze such huge datasets we require a parallel processing framework and a reliable data storage mechanism (Hadoop). Hadoop handles big data: a massive amount of data is processed on a cluster of commodity hardware. In this paper we present the approach used for pre-processing high-volume web log files, extracting the site statistics, and learning the user behaviour using the Hadoop MapReduce framework, the Hadoop Distributed File System, and the HiveQL query language. The overall site statistics, such as the number of user visits, frequent visits, identification of unauthorized users, and interesting user access patterns, are obtained.

Keywords: Big data, customer behaviour analysis, Hadoop, log analysis, web server logs

1. Introduction.

In today's competitive environment, manufacturers and service providers are keen to know whether they provide the best service or product to customers and whether customers look forward to using their service or buying their product. Service providers need to know how to make their websites or web applications interesting to customers and how to improve advertising strategies to attract them. All these questions can be answered by log files. Log files contain a list of the actions that occur whenever a customer accesses the service provider's website or web application. Every hit to the website is logged in a log file, and these log files are stored on the web servers. The raw web log file contains one line of text for each hit to the website, with information about who visited the site, where they came from, and what they did on the website. These log files carry useful information for service providers, so analyzing them gives insights into website traffic patterns, user activity, customer interest, and so on.

2. Related Work.

As data centres generate thousands of terabytes or petabytes of log files a day, it is highly challenging to store and analyze such high volumes of log files. Analyzing log files is complicated because of their high volume and complex structure, and traditional database techniques have failed to handle them efficiently due to their large size. In 2009, Andrew Pavlo and Erik Paulson compared SQL DBMSs with Hadoop MapReduce and suggested that Hadoop MapReduce loads data faster than an RDBMS [4]; traditional RDBMSs also cannot handle very large datasets. This is where big data technologies play a major role in handling large sets of data. Hadoop is a well-suited platform that stores log files and runs MapReduce programs over them in parallel, and for enterprises it offers a new way to store and analyze data. Hadoop is an open-source project created by Doug Cutting under the administration of the Apache Software Foundation, and it scales to thousands of nodes with petabytes of data. While it can be used on a single machine, its true capability lies in scaling to hundreds or thousands of computers. Tom White describes Hadoop as specially designed to work on large volumes of data using commodity hardware in parallel [6]. Hadoop breaks log files into equal-sized blocks, and these blocks are evenly distributed among the nodes of the cluster; it further replicates the blocks over multiple nodes to provide reliability and fault tolerance. For large log files, the parallel computation of MapReduce improves performance by breaking a job into many tasks, and Hadoop implementations show that the MapReduce program structure can be an effective solution for analyzing large volumes of weblog files in a Hadoop environment.


The Hadoop-MR log file analysis tool [1], which provides a statistical report on total hits of a web page, traffic sources, and user activity, was run on two machines with three instances of Hadoop by distributing the log files evenly among all nodes. A generic log analyzer framework for different kinds of log files has also been implemented as distributed query processing to minimize the response time for users, and it can be extended to further log formats [10]. The Hadoop framework has likewise been used to handle large volumes of data in a cluster for web log mining, where data cleaning, the major part of preprocessing, was performed to remove inconsistent data and unique identification of fields was carried out to track the user behaviour.

3. Hadoop Map Reduce.

Hadoop is an open-source framework for large-scale computation and data processing on a cluster of commodity hardware. It allows applications to work with thousands of computationally independent computers. The core principle of Hadoop is moving the computation to the data rather than moving the data to the computation. Hadoop breaks the large volume of data into smaller pieces, and each piece can be handled independently on a different machine. To accomplish parallel execution, Hadoop implements the MapReduce programming model. MapReduce is a Java-based distributed programming model that consists of two phases: a parallel "Map" phase followed by an aggregating "Reduce" phase. A map function processes a key/value pair (k1, v1) to generate a set of intermediate key/value pairs, and a reduce function merges all intermediate values [v2] associated with the same intermediate key (k2):

Map (k1, v1) → [(k2, v2)]
Reduce (k2, [v2]) → [(k3, v3)]

Maps are the individual tasks that transform the input records into intermediate records. A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks. The framework sorts the output of the map phase, which then becomes the input to the reduce tasks. Both the input and the output of a job are stored in the Hadoop file system. The Hadoop cluster consists of a single NameNode, a master that manages the file system namespace and regulates access to files by clients, and a number of DataNodes, usually one per node in the cluster, which periodically report to the NameNode the list of blocks they store. HDFS replicates files a configured number of times and automatically re-replicates the data blocks of nodes that have failed. Using HDFS a file can be created, deleted, and copied, but it cannot be updated. The file system uses TCP/IP for communication between the cluster nodes.

Fig 1. Map Reduce Architecture

Fig 2. System Architecture
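To make this model concrete in terms of the tool used later in this paper, the hypothetical HiveQL aggregation below (the table and column names are illustrative, not the project's exact schema) is compiled by Hive into exactly such a map/reduce pair: the map phase emits one intermediate (ip, 1) pair per log record, and the reduce phase sums the values that share the same ip key.

    -- Illustrative HiveQL aggregation over a web log table.
    -- Map:    each record (k1, v1) emits the intermediate pair (ip, 1)      -> (k2, v2)
    -- Reduce: the counts sharing one ip (k2, [v2]) are summed to (ip, hits) -> (k3, v3)
    SELECT ip, COUNT(*) AS hits
    FROM web_log
    GROUP BY ip;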

4. Protocol.

The Internet protocol suite is the conceptual model and set of communications protocols used on the Internet and similar computer networks. It is commonly known as TCP/IP because the original protocols in the suite are the Transmission Control Protocol (TCP) and the Internet Protocol (IP). The Internet protocol suite provides end-to-end data communication, specifying how data should be packetized, addressed, transmitted, routed and received. This functionality is organized into four abstraction layers which are used to sort all related protocols according to the scope of networking involved. From lowest to highest, the layers are the link layer, containing communication methods for data that remains within a single network segment (link); the internet layer, connecting independent networks, thus providing internetworking; the transport layer, handling host-to-host communication; and the application layer, which provides process-to-process data exchange for applications.



5. Proposed Methodology.

Log files generated by a web server consist of a large volume of data that cannot be handled by a traditional database or conventional programming languages for computation. The proposed work, shown in Fig 3, aims at preprocessing the log file using Hadoop. The work is split into phases, where the storage and processing are done in HDFS. Web server log files are first copied into the Hadoop file system and then loaded into a Hive table. Data cleaning, which is done using the Hive Query Language, is the first phase carried out in our project as a pre-processing step. Web server log files contain a number of records that correspond to automatic requests generated by web robots, and such records usually carry a large volume of misleading, erroneous, and incomplete information. In our project, web log entries carrying requests from robots, spiders, and web crawlers are removed: requests created by web robots are not considered useful data and are filtered out from the log data during pre-processing, and entries that have a status of "error" or "failure" are removed as well. Further, the few access records generated by automatic search engine agents are identified and eliminated from the access log. The identification of the status code is the most important task carried out in the data cleaning: only the log lines with a status code value of "200" are identified as correct entries, so only those lines are extracted and stored in a Hive table for further analysis. The next step is to identify unique users and the unique fields of date, status code, and referred URL. These unique values are retrieved and used for further analysis, for example to find the total URLs referred on a particular date or the maximum number of successful status codes on a precise date.

In our project the Hadoop framework is used to run the log processing in the pseudo-distributed mode of the cluster. The web server logs from http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html (covering a period of two months, July 1995 and August 1995) are used for processing in the Hadoop framework. The log files are analyzed on CentOS 6.6 with Apache Hadoop 1.1.2 and Apache Hive 0.10.0.

Fig 3. Flowchart describing the methodology

Pseudo Distributed Mode. The Hadoop framework incorporates five daemons, namely NameNode, DataNode, JobTracker, TaskTracker, and Secondary NameNode. In pseudo-distributed mode all the daemons run on the local machine, which effectively simulates a cluster.

Apache Hive [11] is an important tool in the Hadoop ecosystem that provides a Structured Query Language called HiveQL to query the data stored in the Hadoop Distributed File System. The log files that are stored in HDFS are loaded into a Hive table and cleaning is performed. The cleaned web log data is then used to analyze daily statistics, unique users and unique URLs, monthly statistics, and so on.
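To make the cleaning phase concrete, the following minimal HiveQL sketch shows the kind of statements involved. It assumes the raw NASA log has already been parsed into tab-separated fields and uses illustrative names (web_log, cleaned_log, and their columns, as well as the HDFS path); the exact schema and robot-filtering rules used in the project may differ.

    -- Illustrative table layout for the parsed access log.
    CREATE TABLE IF NOT EXISTS web_log (
      ip      STRING,
      log_ts  STRING,   -- e.g. 01/Jul/1995:00:00:01 -0400
      request STRING,   -- e.g. GET /history/apollo/ HTTP/1.0
      status  INT,
      bytes   BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Copy the parsed, tab-separated log file from HDFS into the Hive table.
    LOAD DATA INPATH '/user/hadoop/logs/nasa_jul_aug_1995.tsv' INTO TABLE web_log;

    -- Data cleaning: keep only successful requests (status code 200) and drop
    -- robot/spider/crawler traffic (robots.txt fetches are a simple indicator).
    CREATE TABLE cleaned_log AS
    SELECT ip, log_ts, request, status, bytes
    FROM web_log
    WHERE status = 200
      AND request NOT LIKE '%robots.txt%';

    -- Unique users (distinct client hosts) remaining after cleaning.
    SELECT COUNT(DISTINCT ip) AS unique_users FROM cleaned_log;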

Fig 4. Raw log data processing and visualizing

Figure 4 illustrates copying the raw log files into HDFS; preprocessing is then done using the Apache Hive data warehouse tool. In this project we used the GeoIP database to identify the country each request came from: by converting the IP addresses into numerical format and looking them up in the GeoIP database, we can identify the geographical location of the user.
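This lookup can be expressed in HiveQL roughly as below. The sketch assumes a geoip table shaped like Table 1 (numeric start and end addresses, country code and name) alongside the illustrative cleaned_log table from Section 5; a dotted address a.b.c.d is converted to the number a*16777216 + b*65536 + c*256 + d and matched against the address ranges. The range join is written as a cross join with a filter, which is simple but can be expensive on large tables.

    -- Convert each dotted-quad client address to its numeric form.
    CREATE VIEW IF NOT EXISTS log_ip_num AS
    SELECT ip,
           CAST(split(ip, '\\.')[0] AS BIGINT) * 16777216 +
           CAST(split(ip, '\\.')[1] AS BIGINT) * 65536 +
           CAST(split(ip, '\\.')[2] AS BIGINT) * 256 +
           CAST(split(ip, '\\.')[3] AS BIGINT) AS ip_num
    FROM cleaned_log;

    -- Match each numeric address against the GeoIP ranges (schema as in Table 1)
    -- and count the hits coming from each country.
    SELECT g.country_name, COUNT(*) AS hits
    FROM log_ip_num l
    CROSS JOIN geoip g
    WHERE l.ip_num BETWEEN g.ip_start AND g.ip_end
    GROUP BY g.country_name;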



6. Results.

One of the main advantages of data cleaning is that it produces quality results with increased efficiency. The results are shown in the tables below.
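For reference, statistics of the kind shown in Tables 2 and 4 can be produced by simple HiveQL aggregations; the sketch below uses the illustrative web_log table from Section 5 rather than the project's exact script.

    -- Total hits and total bandwidth transferred, as in Table 2
    -- (assuming the bytes column holds the response size of each request).
    SELECT COUNT(*)                          AS total_hits,
           ROUND(SUM(bytes) / (1024 * 1024)) AS bandwidth_mb
    FROM web_log;

    -- Requests per HTTP status code, as in Table 4; code 403 flags attempts to
    -- reach forbidden pages and code 500 flags internal server errors.
    SELECT status, COUNT(*) AS total_requests
    FROM web_log
    GROUP BY status
    ORDER BY total_requests DESC;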

Table 1. GeoIP database sample
IP Start Address   IP End Address   Country Code   Country Name
16777216           16777471         AU             Australia
16777472           16778239         CN             China
16778240           16779263         AU             Australia
16779264           16781311         CN             China
16781312           16785407         JP             Japan
16785408           16793599         CN             China
16793600           16809983         JP             Japan
16809984           16842751         TH             Thailand

The following table shows the total hits to the site and the bandwidth used.

Table 2. Total hits and bandwidth used
Total Number of Hits    3463573
Bandwidth Used (MB)     65559140

The following table shows the result of the user activity analysis.

Table 3. Sample result
IP Address        Country   Hits   Bandwidth (MB)
163.206.89.4      USA       9697   171059
198.133.29.18     USA       4204   103202
163.205.156.16    USA       2963   16828
163.206.137.21    USA       2787   166604
163.205.1.45      USA       2017   66181
128.159.122.180   USA       1680   10810
128.217.62.1      USA       1598   7299
163.205.11.31     USA       1549   17150

In this section we obtain general information related to the website, such as how many times the website was hit, the total number of visitors, the bandwidth used, and so on. It lists out all the information that one should know about the website.

Table 4. Status of the requests
Status   Description             Total Requests
200      OK                      3102414
304      Not Modified            266805
302      Found                   73095
404      Not Found               20882
403      Forbidden               224
500      Internal Server Error   65
501      Not Implemented         41
400      Bad Request             8

The above table shows some of the important statistics of the site used to identify behaviour: for example, status code 403 shows how many users are trying to access unauthorized pages, and status code 500 identifies internal server errors.

7. Conclusion.

Websites are one of the vital means for organizations to promote themselves. In order to obtain the desired outcomes for a particular site, we have to do log analysis, which improves the business strategies and also delivers measurable reports. In this project, web server log files are analyzed with the help of the Hadoop framework. Data is stored on multiple nodes in a cluster so that the processing time required is reduced, and MapReduce works on large datasets, giving effective results. The process shows the number of users visiting the pages, the users' movements, and which parts of the site customers are interested in. From these reports, business groups can assess which parts of the site need to be improved, who the potential customers are, which regions the site gets more hits from, and so on. This will help organizations plan future advertising activities. Log analysis can be done using a wide range of techniques, but what matters is response time. The Hadoop MapReduce model provides parallel distributed processing and reliable data storage for huge volumes of web log files, and Hadoop's ability to move processing to the data rather than moving data to the processing improves response time.

References.

[1] Sayalee Narkhede and Tripti Baraskar, "HMR Log Analyzer: Analyze Web Application Logs over Hadoop MapReduce", International Journal of UbiComp (IJU), vol. 4, no. 3, July 2013.
[2] Liu Zhijing, Wang Bin, (2003) "Web mining research", International Conference on Computational Intelligence and Multimedia Applications, pp. 84-89.
[3] Yang, Q. and Zhang, H., (2003) "Web-Log Mining for Predictive Caching", IEEE Trans. Knowledge and Data Eng., 15(4), pp. 1050-1053.
[4] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker, (2009) "A Comparison of Approaches to Large-Scale Data Analysis", ACM SIGMOD '09.
[5] Yogesh Pingle, Vaibhav Kohli, Shruti Kamat, Nimesh Poladia, (2012) "Big Data Processing using Apache Hadoop in Cloud System", National Conference on Emerging Trends in Engineering & Technology.
[6] Tom White, (2009) "Hadoop: The Definitive Guide", O'Reilly, Sebastopol, California.
[7] Jeffrey Dean and Sanjay Ghemawat, (2004) "MapReduce: Simplified Data Processing on Large Clusters", Google Research Publication.



[8] Chen-Hau Wang, Ching-Tsorng Tsai, Chia-Chen Fan, Shyan-Ming Yuan, "A Hadoop Based Weblog Analysis System", 2014 7th International Conference on Ubi-Media Computing and Workshops.
[9] Sayalee Narkhede and Tripti Baraskar, "HMR Log Analyzer: Analyze Web Application Logs over Hadoop MapReduce", International Journal of UbiComp (IJU), vol. 4, no. 3, July 2013.
[10] Milind Bhandare, Vikas Nagare et al., "Generic Log Analyzer Using Hadoop MapReduce Framework", International Journal of Emerging Technology and Advanced Engineering (IJETAE), vol. 3, issue 9, September 2013.
[11] Savitha K, Vijaya M S, "Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies", International Journal of Advanced Computer Science and Applications, vol. 5, no. 1, 2014.
