Analyzing Logs from Proxy Server and Captive Portal Using K-Means Clustering Algorithm

Middle East Journal of Applied Science & Technology, Vol.3, Iss.4, Pages 10-31, October-December

Rolysent K. Paredes1, Alberto L. Yoldan Jr.2 & Jonard B. Bolanio3

1College of Computer Studies, Misamis University, Ozamiz City, Philippines.
2Management Information Systems, Misamis University, Ozamiz City, Philippines.
3Management Information Systems, Misamis University, Ozamiz City, Philippines.

Copyright: ©2020 Rolysent K. Paredes et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Article Received: 28 June 2020. Article Accepted: 24 November 2020. Article Published: 21 December 2020.

ABSTRACT

Traffic on the World Wide Web is rapidly increasing, and users' various interactions with websites generate an enormous amount of data. Thus, web data has become one of the most valuable resources for information retrieval and knowledge discovery. The study utilized the logs from the Proxy Server and Captive Portal database and used Web Usage Mining to discover useful and interesting patterns from the web data. Moreover, the k-means clustering algorithm was used to group user access patterns, specifically the number of user sessions and the websites accessed by network users. The results show that users are, most of the time, actively engaged in using the internet.

Keywords: Web usage mining, Clustering, Web data visualization.

1. Introduction

In this digital age, data is considered a useful and valuable asset (Lohr, 2012) that an organization, whether educational or commercial, needs to have. It is becoming not only more available but also more understandable (Lohr, 2012).
Data comes from many different sources and in many different formats (Levitus, 2013): smartphones, social networks, online shopping, electronic communication, GPS, and instrumented machinery all produce torrents of data as a by-product of their ordinary operations (McAfee, 2012). In recent years, the huge influx of information onto the World Wide Web has helped users not only to retrieve information but also to discover knowledge (Sumathi, 2011). It is said that more data crosses the internet every second than was stored in the whole internet just 20 years ago (McAfee, 2012). The growth in the amount of data available on the web has made it essential to find intelligent ways to retrieve the data needed and to understand users' behavioral patterns in collecting that data (Pamutha et al., 2012). This data consists of user interactions on the web and is recorded in web logs (Deshmukh and Shelke, 2015). Web log files are found in three locations: web servers, client browsers, and web proxy servers.

In the study, pfSense was utilized as the source of the web log. It is one of many free and open source software packages that can be used as a proxy server. According to Ribeiro and Pereira (2009), pfSense is currently a viable replacement for commercial firewalling/routing packages. Its features include, among others: firewall, routing, QoS differentiation, NAT, redundancy, load balancing, VPN, reporting and monitoring, real-time information, gateway, and a captive portal. Therefore, pfSense on a single machine can act as a gateway and proxy server and can authenticate users through the captive portal. One of the ways a user can be authenticated is with a username and password. The usernames, passwords, and other captive portal transactions are stored in a MySQL database (pfSense, 2016).
A web log file provides full and accurate usage data for a web server, but it does not record cached pages visited. A web proxy server takes HTTP requests from the user, passes them to the web server, and returns the web server's result to the user (Lokeshkumar et al., 2014). These log files are primary sources for data analysis (Zhang, 2012) and visualization. However, identifying users from a proxy server's web logs is difficult, since the logs contain the actual HTTP requests from many clients to multiple web servers (Losarwar and Joshi, 2012). To address the user identification problem, a captive portal database can be used to determine the particular user behind each entry in the web log (pfSense, 2016).

To gain a competitive advantage from web logs, web mining technology provides techniques to extract knowledge from web data (Geeta et al., 2008). Web mining has three main areas, namely web content mining, web structure mining, and web usage mining (Liu, 2007). Web content mining is the process of extracting and integrating useful data, information, and knowledge from web page contents (Lakshmi et al., 2013). Web structure mining considers the web as a graph where pages are nodes and hyperlinks are edges (Srivastava et al., 2000). Web usage mining (WUM), on the other hand, implements preprocessing on the web log files. It is an application of data processing techniques that discover usage patterns of users from the available web data (Pamutha, 2012). Although web mining has been utilized to extract knowledge from web logs, there is still limited effort in applying it in higher education institutions (HEIs) in the Philippines.
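The user identification step mentioned above, in which captive portal records are used to attribute proxy log entries to users, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the record types `PortalSession` and `LogEntry`, their fields, and the helper `identify_user` are hypothetical, and matching on client IP address plus login time window is an assumed strategy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PortalSession:
    """Hypothetical captive portal record: who used which IP, and when."""
    username: str
    ip: str
    start: float  # login time (Unix seconds)
    end: float    # logout time (Unix seconds)


@dataclass
class LogEntry:
    """Hypothetical proxy log entry reduced to the fields needed here."""
    timestamp: float
    client_ip: str
    url: str


def identify_user(entry: LogEntry, sessions: List[PortalSession]) -> Optional[str]:
    """Return the user whose portal session covers this log entry, if any."""
    for s in sessions:
        if s.ip == entry.client_ip and s.start <= entry.timestamp <= s.end:
            return s.username
    return None  # no session matched: the request stays unattributed


# Illustrative data only: two users share one IP address at different times.
sessions = [
    PortalSession("student01", "10.0.0.5", 1000.0, 2000.0),
    PortalSession("employee07", "10.0.0.5", 2500.0, 3000.0),
]
entries = [
    LogEntry(1500.0, "10.0.0.5", "http://example.org/a"),
    LogEntry(2700.0, "10.0.0.5", "http://example.org/b"),
    LogEntry(2200.0, "10.0.0.5", "http://example.org/c"),
]
users = [identify_user(e, sessions) for e in entries]
```

Matching on the time window as well as the IP address matters because the same address may be reused by different users over the course of a day.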
Similarly, only a few schools in the United States have the time or motivation to analyze their internet usage, which is essential to learning more about the usage habits and patterns of internet users on campus (Daniels et al., 2012). Furthermore, in the Philippines, reports on patterns of internet use are typically gathered by a few social research institutes such as Social Weather Stations (SWS) (Labucay and Stations, 2011). Hence, the purpose of the study is to cluster and visualize the statistical information, or access patterns, of students and employees in their use of the internet, specifically the number of sessions and the websites accessed. The web usage mining (WUM) technique was used to obtain the necessary information from the proxy server's web logs. For pattern discovery, k-means was used as the clustering algorithm because of its simplicity and speed, which allow it to run on large datasets (Poornalatha and Raghavendra, 2011). The generated data and graphs may assist school administrators in data analysis and decision making related to internet utilization and other aspects of the school.

2. Theoretical Framework

2.1 Review of Related Literature

The increase in the amount of data available through the web has made it essential to find ways to retrieve the data needed and to understand users' behavioral patterns in collecting the data (Pamutha et al., 2012). This data consists of user interactions on the web and is recorded in web logs (Deshmukh and Shelke, 2015). Web log files are found in web servers, web proxy servers, and client browsers. A web log file provides full and accurate usage data for a web server, but it does not record cached pages visited. To make use of the logs, the study of Geeta et al. (2008) revealed that web mining technology offers techniques to extract knowledge from web data.
It is the application of data mining techniques on web data (Sathiyamoorthi and Murali Bhaskaran, 2009). Web mining is an important field of data mining. To support performance, web personalization, and website schema modification, data mining techniques are applied to content, structure, and log files (Lokeshkumar et al., 2014). Web mining is also an invaluable aid in transforming human-understandable content into machine-understandable semantics (Khede and Raikwal, 2015). Web mining has three main areas, namely web content mining, web structure mining, and web usage mining (Liu, 2007).

Web content mining is a technique for extracting and integrating useful data, information, and knowledge from web page contents (Lakshmi et al., 2013). Lakshmi et al. (2013) concluded that web pages are among the most valuable advertising tools internationally for foundations, institutions, and similar organizations. Therefore, the suitability of the standards, content, and design of web pages is imperative for system administrators and web designers.

Web structure mining, on the other hand, considers the web as a graph where pages are nodes and hyperlinks are edges (Srivastava et al., 2000). It is the discovery of the link structure of the web. Hyperlinks are the sources of pure navigation, and they help to show which web pages are linked to which next set of web pages. The well-known PageRank algorithm, proposed by Larry Page and Sergey Brin, is based on the link structure of the WWW (Khede and Raikwal, 2015).

Another area of web mining is web usage mining. Gomathi (2008) implemented preprocessing on web log files utilizing web usage mining (WUM). WUM is an application of data processing techniques that discover usage patterns of users from the available web data. It helps improve the service of web-based applications.
The user access log files present significant information about a web server. Web usage mining has been applied to solve several real-world problems by discovering user navigational patterns, which leads to improvements in website design. Moreover, by studying users' web access patterns, recommendations for pertinent web content improvements can readily be made (Pamutha, 2012). All three areas of web mining focus on the process of discovering implicit, previously unknown, and potentially useful information from the Web.
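The k-means grouping of users by number of sessions and number of websites accessed, as described in the introduction, can be illustrated with a minimal sketch. This is not the study's implementation: the sample points, the function name `kmeans`, and the deterministic choice of the first k points as initial centroids are assumptions made for illustration (practical implementations usually initialize centroids randomly or with k-means++).

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(points: Sequence[Point], k: int, max_iter: int = 100) -> Tuple[List[int], List[Point]]:
    """Plain Lloyd's k-means; returns (cluster label per point, centroids)."""
    # Assumption: seed centroids with the first k points so runs are repeatable.
    centroids: List[Point] = [tuple(p) for p in points[:k]]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids: List[Point] = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: centroids did not move
        centroids = new_centroids
    return labels, centroids


# Illustrative data only: (number of sessions, number of websites accessed) per user.
usage = [(2, 5), (3, 6), (2, 4), (40, 90), (42, 95), (38, 88)]
labels, centroids = kmeans(usage, k=2)
```

With this sample data the light users and the heavy users end up in separate clusters, which is the kind of grouping the study visualizes. Because the two features may differ widely in scale in real logs, normalizing them before clustering is a common precaution.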
