Middle East Journal of Applied Science & Technology Vol.3, Iss.4, Pages 10-31, October-December 2020

Analyzing Logs from Proxy Server and Captive Portal Using K-Means Clustering Algorithm

Rolysent K. Paredes1, Alberto L. Yoldan Jr.2 & Jonard B. Bolanio3

1College of Computer Studies, Misamis University, Ozamiz City, Philippines.
2Management Information Systems, Misamis University, Ozamiz City, Philippines.
3Management Information Systems, Misamis University, Ozamiz City, Philippines.

Copyright: ©2020 Rolysent K. Paredes et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Article Received: 28 June 2020 Article Accepted: 24 November 2020 Article Published: 21 December 2020

ABSTRACT

Traffic on the web is rapidly increasing, and users’ various interactions with websites generate an enormous amount of data. Web data has thus become one of the most valuable resources for information retrieval and knowledge discovery. The study utilized the logs from the proxy server and captive portal database and used web usage mining to discover useful and interesting patterns from the web data. Moreover, the k-means clustering algorithm was used to group user access patterns, specifically the number of user sessions and the number of websites accessed by the network users. Based on the results, users were most engaged in utilizing the internet during daytime hours: more websites were accessed from 9:00 AM to 5:59 PM, and more sessions were made from 8:00 AM to 4:59 PM.

Keywords: Web usage mining, Clustering, Web data visualization.

1. Introduction

In this digital age, data is considered a useful and valuable asset (Lohr, 2012) that an organization, whether educational or commercial, needs to have. It is becoming not only more available but also more understandable (Lohr, 2012). Data comes from many different sources in many different formats (Levitus, 2013): smartphones, social networks, online shopping, electronic communication, GPS, and instrumented machinery all produce torrents of data as a by-product of their ordinary operations (McAfee, 2012).

In recent years, the huge influx of information onto the World Wide Web has helped users not only to retrieve information but also to discover knowledge (Sumathi, 2011). It is said that more data now crosses the internet every second than was stored in the whole internet just 20 years ago (McAfee, 2012). The increase in the size of the data available on the web has made it essential to find intelligent ways to retrieve the data needed and to understand the user’s behavioral pattern in collecting that data (Pamutha et al., 2012). This data is composed of user interactions on the web and is recorded in web logs (Deshmukh and Shelke, 2015). Web log files are found in three different locations: web server logs, the client browser, and the web proxy server.

In the study, pfSense was utilized as the source of the web log. It is one of the many free and open-source software packages that can be used as a proxy server. According to Ribeiro and Pereira (2009), pfSense is currently a viable replacement for commercial firewalling/routing packages. Its features include, among others, firewall, routing, QoS differentiation, NAT, redundancy, load balancing, VPN, reporting and monitoring, real-time information, gateway, and a captive portal. Therefore, pfSense on a single machine can act as a gateway and proxy server and can authenticate users through the captive portal. One way to authenticate a user is with a username and password. The usernames, passwords, and other captive portal transactions are stored in a MySQL database (pfSense, 2016). A web log file provides full and accurate usage data for a web server, but it does not record cached pages visited. A web proxy server takes HTTP requests from

ISSN: 2582-0974 [10] www.mejast.com

the user, passes them to the web server, and returns the web server’s result to the user (Lokeshkumar et al., 2014). These log files are primary sources for data analysis (Zhang, 2012) and visualization. However, identifying the users from the proxy server’s web logs is difficult since the logs contain the actual HTTP requests from many clients to multiple web servers (Losarwar and Joshi, 2012). To address the user identification problem, a captive portal database can be utilized to identify the particular user behind each entry in the web log (pfSense, 2016).

To gain competitive advantage from the web logs, web mining technology provides techniques to extract knowledge from web data (Geeta et al., 2008). Web mining has three main areas, namely web content mining, web structure mining, and web usage mining (Liu, 2007). Web content mining is the process of extracting and integrating useful data, information, and knowledge from web page contents (Lakshmi et al., 2013). Web structure mining considers the web as a graph where pages are nodes and hyperlinks are edges (Srivastava et al., 2000). On the other hand, web usage mining (WUM) implements preprocessing on the web log files. It is an application of data processing techniques that discover usage patterns of users from the available web data (Pamutha, 2012).

Although web mining has been utilized to extract knowledge from web logs, there is still limited effort in applying it in higher education institutions (HEIs) in the Philippines. Even in the United States, only a few schools have the time or motivation to run analysis on their internet usage, which is essential to learning more about the usage habits and patterns of internet users on campus (Daniels et al., 2012). Furthermore, in the Philippines, reports on the patterns of internet use are typically gathered by a few social research institutes such as Social Weather Stations (SWS) (Labucay and Stations, 2011).

Hence, the purpose of the study is to cluster and visualize the statistical information on the access patterns of students and employees in their utilization of the internet, specifically the number of sessions and the number of websites accessed. The web usage mining (WUM) technique was used to obtain the necessary information from the proxy server’s web logs. For pattern discovery, k-means was used as the clustering algorithm because of its simplicity and speed, which allow it to run on large datasets (Poornalatha and Raghavendra, 2011). The generated data and graphs may assist school administrators in their data analysis and decision making related to the utilization of the internet and other aspects of the school.

2. Theoretical Framework

2.1 Review of Related Literature

The increase in the size of the data available through the web has made it essential to find ways to retrieve the data needed and to understand the user’s behavioral pattern in collecting the data (Pamutha et al., 2012). This data is composed of user interactions on the web and is recorded in web logs (Deshmukh and Shelke, 2015). Web log files are located in web server logs, the web proxy server, and the client browser. A web log file provides full and accurate usage data for a web server, but it does not record cached pages visited.

To make use of the logs, the study of Geeta et al. (2008) revealed that web mining technology offers techniques to extract knowledge from web data. It is the application of data mining techniques on web data (Sathiyamoorthi and Murali Bhaskaran, 2009). Web mining is one of the necessary fields of data mining. To achieve performance, web personalization, and schema modification of a website, data mining is applied to content, structure, and log files (Lokeshkumar et al., 2014). It is also an invaluable help in the transformation from human-understandable content to machine-understandable semantics (Khede and Raikwal, 2015).

Web mining has three main areas, namely web content mining, web structure mining, and web usage mining (Liu, 2007). Web content mining is a technique for extracting and integrating useful data, information, and knowledge from web page contents (Lakshmi et al., 2013). Lakshmi et al. (2013) concluded that web pages are one of the most valuable advertisement tools in the international arena for foundations, institutions, etc. Therefore, the suitability of the standards, content, and design of web pages is imperative for system administrators and web designers.

On the other hand, web structure mining considers the web as a graph where pages are nodes and hyperlinks are edges (Srivastava et al., 2000). Web structure mining is the discovery of the link structure of the web. The hyperlinks are the sources of pure navigation. It helps to understand which web pages are linked to which next set of web pages. The famous PageRank algorithm proposed by Larry Page and Sergey Brin is based on the link structure of the WWW (Khede and Raikwal, 2015).

Another area of web mining is web usage mining. Gomathi (2008) implemented preprocessing on web log files utilizing web usage mining (WUM). WUM is an application of data processing techniques that discover usage patterns of users from the available web data. It supports improved service in web-based applications. The user access log files present significant information about a web server. WUM has been applied to solve several real-world problems by discovering user navigational patterns, which leads to improvements in website design. Moreover, by studying the user’s web access patterns, recommendations on pertinent web content improvements can readily be made (Pamutha, 2012). All three areas or types of web mining focus on the process of knowledge discovery of implicit, previously unknown, and potentially useful information from the Web. Each of them considers different mining objects of the Web (da Costa and Gong, 2005), which are vital in data visualization.

Nowadays, many organizations and institutions utilize the capabilities of data visualization because it is a powerful means to explore large datasets and a reliable way to present results to a wider audience (Brodlie, 2012). Lavrač (2007) discovered that the use of data mining and decision support methods, including novel visualization techniques, can lead to better performance in decision making, can improve the effectiveness of developed solutions, and enables the tackling of new types of problems that have not been addressed before.

Kärkkäinen et al. (2013) implemented a gateway which filters traffic to the Internet so that only authenticated clients gain access, through the use of a captive portal. The authentication server prompts the user to enter his/her credentials. All other traffic is dropped until the user has been authenticated (using a local database or some backend service).

Ivancsy and Kovacs (2006) revealed that clustering is the process of grouping objects together in such a way that the objects that belong to the same group are similar and those that belong to different groups are dissimilar. Clustering techniques are used in many applications, for example, biological and financial applications.

In the study of Chitraa and Thanamani (2012), k-means was utilized for their enhanced clustering technique for web usage mining because it minimizes clustering errors. K-means is a simple and very fast algorithm for partitioning the input dataset into k clusters. Each cluster is represented by an adaptively changing centroid (also called the cluster center), starting from some initial values called seed points. The algorithm computes the distances between the inputs (also called input data points) and the centroids and assigns each input to the nearest centroid.

Ansari (2014) concluded that k-means clustering produces moderately higher accuracy and lower clustering error compared with the k-medoids clustering algorithm. It also seems to be superior to the Fuzzy C-Means algorithm (Ghosh and Dubey, 2013).

3. Operational Framework

The architectural design of the study (Figure 1) is anchored on the web usage mining approach of Chitraa et al. (2010), modified to include five stages, namely: 1) data collection; 2) preprocessing; 3) data transformation; 4) pattern discovery; and 5) pattern analysis.

Fig.1. Architectural Design of the Study

3.1. Data Collection

Data collection is the first step in the web usage mining process (Chitraa et al., 2010). In the study, all the test data are from Misamis University, Philippines, and cover the school year 2017-2018.

Fig.2. Web Log File Entries

Proxy servers are employed to improve navigation speed through caching, and they collect data from the users accessing huge groups of web servers (Losarwar and Joshi, 2012). The web logs from the proxy server contain the actual HTTP requests from multiple clients to multiple web servers (Chitraa et al., 2010). Web logs were collected from the institution’s pfSense proxy servers. These web logs are in .log format, and Figure 2 shows the entries in the file. Table 1 describes the attributes of a web log file.

Table 1. Attributes for each entry of a Web Log

Meanwhile, the captive portal log database was collected from a separate MySQL server. It can be exported using any database management tool (e.g., phpMyAdmin, Navicat, SQLyog), and the output file is typically SQL-formatted. The file contains data from the school year 2017-2018, consisting of active users and the radius account log entries whose account terminate cause attribute is Idle-Timeout or Session-Timeout. Using a database management tool, the file was imported into the user sessions database. The data from the file is necessary for identifying the users and sessions.

Figure 3 presents the Entity-Relationship Diagram of a captive portal log database, which is composed of two entities, namely Radius Accounts and Radius Account Logs. The Radius Accounts entity is where the accounts of all the users reside. Users authenticate with these accounts before connecting to the internet. The assigned IP address and group are also found in that entity. Every time a user logs in to or out of the captive portal, a new entry is saved in Radius Account Logs along with other important information about that session. The relationship between the two entities is one to many, since a single radius account can have many radius account logs.

Radius Accounts: id (PK), username (PK), attribute, op, value, ipaddress, macaddress, group, active

Radius Account Logs: radacctid (PK), acctsessionid, acctuniqueid, username, nasipaddress, nasportid, nasporttype, acctstarttime, acctstoptime, acctsessiontime, acctterminatecause, framedipaddress

Fig.3. Entity Relationship Diagram of a Captive Portal Database

Table 2 describes the attributes of the Radius Accounts entity, and Figure 4 shows what the entries for this entity look like.

Table 2. Attributes of a Radius Account Entity

Fig.4. Radius Account Entity Sample Record Set

Table 3 describes the attributes of the Radius Account Logs entity, and Figure 5 shows what the entries for this entity look like.

Table 3. Attributes of a Radius Account Logs Entity

Fig.5. Radius Account Logs Entity Sample Record Set

3.1.1. User Sessions Database

The user sessions database stores all the web log entries that have undergone data cleaning, together with the data imported from the captive portal database. It is a modification of the captive portal database in which a Web Logs entity is added. Figure 6 shows the user sessions database.

Web Logs: log_id (PK), date_time, source_ip, url, domain, website_ip

Radius Accounts: id (PK), username (PK), attribute, op, value, ipaddress, macaddress, group, active

Radius Account Logs: radacctid (PK), acctsessionid, acctuniqueid, username, nasipaddress, nasportid, nasporttype, acctstarttime, acctstoptime, acctsessiontime, acctterminatecause, framedipaddress

Fig.6. User Sessions Entity Relationship Diagram

Web Logs is the entity that stores all cleaned entries from a web log file. The relationship between Radius Accounts and Web Logs is one to many: a single radius account can have many web logs. Table 4 lists the attributes of Web Logs with their descriptions.

Table 4. Web Logs Entity Attributes
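The three entities of the user sessions database can be sketched as a relational schema. The following is a minimal illustration only, not the authors' database definition: the table names, column types, and the use of SQLite (instead of the MySQL server used in the study) are assumptions made so the sketch is self-contained, while the column names follow Figure 6.

```python
import sqlite3

# Hypothetical schema mirroring the entities listed for Figure 6.
# Table names and column types are assumptions; the study used MySQL.
SCHEMA = """
CREATE TABLE radius_accounts (
    id         INTEGER PRIMARY KEY,
    username   TEXT NOT NULL UNIQUE,
    attribute  TEXT,
    op         TEXT,
    value      TEXT,
    ipaddress  TEXT,
    macaddress TEXT,
    "group"    TEXT,
    active     INTEGER
);
CREATE TABLE radius_account_logs (
    radacctid          INTEGER PRIMARY KEY,
    acctsessionid      TEXT,
    acctuniqueid       TEXT,
    username           TEXT REFERENCES radius_accounts(username),
    nasipaddress       TEXT,
    nasportid          TEXT,
    nasporttype        TEXT,
    acctstarttime      TEXT,
    acctstoptime       TEXT,
    acctsessiontime    INTEGER,
    acctterminatecause TEXT,
    framedipaddress    TEXT
);
CREATE TABLE web_logs (
    log_id     INTEGER PRIMARY KEY,
    date_time  TEXT,
    source_ip  TEXT,
    url        TEXT,
    domain     TEXT,
    website_ip TEXT
);
"""


def create_user_sessions_db(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_user_sessions_db()
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
```

Note that the Web Logs entity as listed carries no username column, so in this sketch a web log entry would be linked to an account through its source IP address; that linkage is an assumption, since the figure itself did not survive extraction.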

3.2. Data Preprocessing

The purpose of preprocessing is to transform the unstructured raw data into a set of user profiles (Dong, 2009). It has three major tasks, namely: 1) data cleaning, 2) user identification, and 3) session identification. Data cleaning is the removal of irrelevant data (Suresh and Padmajavalli, 2006). User identification determines the user who made each session, while session identification delimits sessions by the login and logoff activities of the users (Dunham, 2006).

3.2.1. Data Cleaning

For data cleaning, the algorithm of Yuan et al. (2003) was modified and implemented using PHP (PHP: Hypertext Preprocessor), a programming language for developing web applications. Along with the PHP code, Structured Query Language (SQL) was used to interact with the user sessions database. In this step, all entries that have a status of “error” or “failure” are removed. Access records generated by automatic search engine agents are then identified and removed from the web log. The process also removes requests for non-analyzed resources such as images, multimedia files, and page style files (Suneetha and Krishnamoorthi, 2009), for example, requests for graphical page content (*.jpg and *.gif images), requests for any other file that might be included in a web page, and navigation sessions performed by robots and web spiders. The only fields considered for this task are the timestamp, client IP, URL, and server IP. These data are saved into the web logs table of the user sessions database. Figure 7 presents the code for this task.

Fig.7. Algorithm for Data Cleaning implemented using PHP Programming Language
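The cleaning rules described above can be illustrated with a short sketch. This is not the PHP code of Figure 7, which is not reproduced in the text; it is a Python approximation under stated assumptions. The whitespace-separated line layout (timestamp, client IP, status, URL, server IP, user agent), the extension list, and the robot markers are all hypothetical.

```python
from urllib.parse import urlparse

# Assumed lists; the paper names only images, multimedia, and style files.
IGNORED_EXTENSIONS = (".jpg", ".jpeg", ".gif", ".png", ".css", ".mp3", ".mp4")
ROBOT_MARKERS = ("bot", "crawler", "spider")
BAD_STATUS_WORDS = ("error", "failure")


def clean_entry(line):
    """Return a cleaned record, or None if the entry should be discarded.

    Assumed (hypothetical) layout, whitespace separated:
    timestamp client_ip status url server_ip [user_agent]
    """
    parts = line.split(None, 5)
    if len(parts) < 5:
        return None  # malformed entry
    timestamp, client_ip, status, url, server_ip = parts[:5]
    user_agent = parts[5].lower() if len(parts) == 6 else ""

    # Rule 1: drop entries whose status reports an error or failure.
    if any(word in status.lower() for word in BAD_STATUS_WORDS):
        return None
    # Rule 2: drop requests made by robots and web spiders.
    if any(marker in user_agent for marker in ROBOT_MARKERS):
        return None
    # Rule 3: drop requests for images, multimedia, and style files.
    parsed = urlparse(url)
    if parsed.path.lower().endswith(IGNORED_EXTENSIONS):
        return None

    # Keep only the fields stored in the web logs table.
    return {
        "date_time": timestamp,
        "source_ip": client_ip,
        "url": url,
        "domain": parsed.hostname,
        "website_ip": server_ip,
    }


sample = [
    "1506820000 10.0.0.5 TCP_MISS/200 http://example.edu/index.html 93.184.216.34 Mozilla/5.0",
    "1506820001 10.0.0.5 TCP_MISS/200 http://example.edu/logo.gif 93.184.216.34 Mozilla/5.0",
    "1506820002 10.0.0.6 ERROR/503 http://example.edu/ 93.184.216.34 Mozilla/5.0",
    "1506820003 10.0.0.7 TCP_MISS/200 http://example.edu/page 93.184.216.34 Googlebot/2.1",
]
cleaned = [rec for rec in map(clean_entry, sample) if rec is not None]
print(len(cleaned))
```

Only the first sample line survives: the second requests an image, the third has an error status, and the fourth comes from a robot.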

After running the code, 103,502,136 log entries were found in the single web log file for school year 2017-2018. Of these, 12,343,914 entries (about 12% of the total) were considered clean or relevant and were saved in the web logs database table.

3.2.2. User Identification

User identification from the proxy server’s web or access logs is difficult since the logs contain the actual HTTP requests from many clients to multiple web servers (Losarwar and Joshi, 2012). To address this issue, a captive portal database can be utilized. Captive portals, along with the FreeRADIUS feature of pfSense, can be used to assign a specific IP address to each device and to authenticate users before they can connect to the internet (pfSense, 2016). Figure 8 is a modification of the algorithm used by Losarwar and Joshi (2012) to identify the user for each entry in the web log file.

Fig.8. Algorithm for User Identification
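One plausible way to resolve the user behind a web log entry is to look for a captive portal session with the same IP address whose time window contains the entry's timestamp. The sketch below illustrates that idea only; it is not the algorithm of Figure 8, which is not reproduced in the text. The matching rule and the timestamp format are assumptions, while the field names follow the entity attributes listed earlier.

```python
from datetime import datetime

# Assumed timestamp format (the usual MySQL DATETIME text form).
FMT = "%Y-%m-%d %H:%M:%S"


def identify_user(log_entry, sessions):
    """Return the username whose session covers the log entry, else None.

    A session matches when its framedipaddress equals the entry's
    source_ip and the entry's time lies between acctstarttime and
    acctstoptime (inclusive).
    """
    when = datetime.strptime(log_entry["date_time"], FMT)
    for session in sessions:
        if session["framedipaddress"] != log_entry["source_ip"]:
            continue
        start = datetime.strptime(session["acctstarttime"], FMT)
        stop = datetime.strptime(session["acctstoptime"], FMT)
        if start <= when <= stop:
            return session["username"]
    return None


# Hypothetical sample data for illustration only.
sessions = [
    {
        "username": "student01",
        "framedipaddress": "10.0.0.5",
        "acctstarttime": "2017-10-02 09:00:00",
        "acctstoptime": "2017-10-02 10:30:00",
    },
    {
        "username": "employee07",
        "framedipaddress": "10.0.0.9",
        "acctstarttime": "2017-10-02 08:00:00",
        "acctstoptime": "2017-10-02 12:00:00",
    },
]
entry = {"date_time": "2017-10-02 09:15:00", "source_ip": "10.0.0.5"}
print(identify_user(entry, sessions))
```

A linear scan is used for clarity; with millions of log entries, an index on the IP address and start time would be the more practical choice.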

3.2.3. Session Identification

A session is the grouping of the activities of a single user from the web log files. As long as the user is connected to the website, the activity belongs to the session of that particular user. Most of the time, a 30-minute time-out is taken as the default session time-out (Losarwar and Joshi, 2012). Sessions can be retrieved quickly from the captive portal logs since users need to log in first before they can connect to the internet (pfSense, 2016). The login and logoff represent the logical start and end of the session (Dunham, 2006). Figure 9 presents the algorithm for identifying an individual session. Both Idle-Timeout and Session-Timeout are indicators that users created sessions.

Fig.9. Algorithm for Session Identification
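As a rough illustration of the rule stated above, sessions can be selected from the Radius Account Logs by their terminate cause. This sketch is not the algorithm of Figure 9; the record layout as Python dictionaries and the helper names are assumptions, while the two terminate causes come from the text.

```python
from collections import Counter

# Terminate causes that the text treats as indicators of a session.
SESSION_CAUSES = {"Idle-Timeout", "Session-Timeout"}


def identify_sessions(radius_logs):
    """Keep only the log rows that represent completed user sessions."""
    return [
        row for row in radius_logs
        if row.get("acctterminatecause") in SESSION_CAUSES
    ]


def sessions_per_user(radius_logs):
    """Count identified sessions for each username."""
    return Counter(row["username"] for row in identify_sessions(radius_logs))


# Hypothetical sample rows for illustration only.
radius_logs = [
    {"username": "student01", "acctterminatecause": "Idle-Timeout"},
    {"username": "student01", "acctterminatecause": "Session-Timeout"},
    {"username": "student01", "acctterminatecause": "NAS-Reboot"},
    {"username": "employee07", "acctterminatecause": "Idle-Timeout"},
]
counts = sessions_per_user(radius_logs)
print(counts["student01"], counts["employee07"])
```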

3.3. Data Transformation

In this stage, data from the user sessions database is extracted and transformed into a comma-separated values (CSV) file. This file contains the dataset necessary for discovering session patterns. The CSV file serves as the input to the Matlab software, which generates the clusters using k-means. Figure 10 presents the PHP code for getting the values of the attribute: sites accessed by students and employees for a specified time range for each semester in school year 2017-2018.

Fig.10. PHP code to extract and transform the values for the number of websites being accessed to CSV file
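The transformation step can be approximated as counting records per time range and writing the counts to CSV. The sketch below is an illustration under assumptions, not the PHP code of Figure 10: it assumes one-hour time ranges, a MySQL-style timestamp format, and hypothetical column names for the CSV file.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count records per hour of day (0 to 23); missing hours count as 0."""
    counts = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def hour_label(hour):
    """Format an hour as a 12-hour range label such as '9:00 AM-9:59 AM'."""
    suffix = "AM" if hour < 12 else "PM"
    hour12 = hour % 12 or 12
    return f"{hour12}:00 {suffix}-{hour12}:59 {suffix}"


def write_csv(counts, out):
    """Write one row per time range to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["time_range", "count"])  # hypothetical header names
    for hour, count in enumerate(counts):
        writer.writerow([hour_label(hour), count])


# Hypothetical sample timestamps for illustration only.
timestamps = [
    "2017-10-02 09:15:00",
    "2017-10-02 09:47:10",
    "2017-10-02 17:05:00",
]
buffer = io.StringIO()
write_csv(hourly_counts(timestamps), buffer)
rows = buffer.getvalue().splitlines()
print(rows[10])
```

In practice the output would be written to a file on disk, one file per semester and per attribute (websites accessed or sessions made), and then loaded by the clustering step.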

Table 5 depicts the data from the CSV file generated during data transformation. This includes the 1st semester data pertaining to the websites accessed by the students and employees for a specified time range.

Table 5. Data on the number of websites accessed by students and employees for specific time range for 1st Semester, S.Y. 2017-2018

Table 6 shows the data from the CSV file generated during data transformation. This includes the 2nd semester data pertaining to the websites accessed by the students and employees for a specified time range.

Table 6. Data on the number of websites accessed by students and employees for specific time range for 2nd Semester, S.Y. 2017-2018

Figure 11 is the PHP code for getting the values of the attribute: sessions made by the students and employees for a specified time range for each semester in school year 2017-2018.

Fig.11. PHP code to transform extracted values for the number of sessions to CSV file

Tables 7 and 8 show the data from the generated CSV files. These files contain the number of sessions for a specified time range made by the students and employees for the two semesters of school year 2017-2018.

Table 7. Data on the number of sessions made by students and employees for a specific time range for 1st Semester, S.Y. 2017-2018

Table 8. Data on the number of sessions made by students and employees for a specific time range for 2nd Semester, S.Y. 2017-2018

3.4. Pattern Discovery

Once all user transactions have been identified, a variety of data mining techniques are performed for pattern discovery in web usage mining (Chitraa et al., 2010), and one of those is clustering (Domenech and Lorenzo, 2007). Clustering techniques are widely utilized in web usage mining (WUM) to capture similar trends and interests among users accessing a website.

Clustering aims to divide a data set into groups or clusters where inter-cluster similarities are minimized while intra-cluster similarities are maximized (Ansari, 2014). The knowledge discovered from the clustering may be used to analyze the session patterns of the users (Poornalatha and Raghavendra, 2011).

The k-Means clustering algorithm is one of the most commonly used methods for partitioning the data (Ansari, 2014). It is more suitable for large datasets. k-Means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Euclidean distance is used as a metric.
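The description above corresponds to the standard k-means objective. As a sketch, with notation that is ours rather than the paper's, let x_1, ..., x_n be the observations, C_1, ..., C_k the clusters, and \mu_j the mean of cluster C_j; k-means then seeks the partition that minimizes the within-cluster sum of squared Euclidean distances:

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Assigning each observation to the cluster with the nearest mean, as the text describes, is the step that reduces this sum for fixed means.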

The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets (Poornalatha and Raghavendra, 2011).

The k-means algorithm is iterative in nature and repeats for each object. It iterates until the objects are stable (i.e., no object changes group) (Teknomo, 2007). K-means clustering is simple, and the necessary steps it follows are:

1. The number of clusters, K, is determined.
2. Assume a centroid or center for each of the K clusters. Any object can be randomly chosen as an initial centroid, or the first K objects can serve as the initial centroids.
3. The distance of each object from each of the centroids is calculated.
4. Group the objects based on minimum distance (find the closest centroid for each object).

The centroids are then recomputed from the new groups, and steps 3 and 4 repeat until no object changes group. Figure 12 presents the flowchart of the k-means Nearest Neighbor algorithm.
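The steps listed above can be sketched in a few lines of plain Python. This is an illustrative implementation, not the Matlab code used in the study (which relied on Matlab's built-in k-means function); the function name, the choice of the first K objects as initial centroids, and the sample data are assumptions.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points (tuples of numbers) into k groups.

    Returns (labels, centroids), where labels[i] is the index of the
    centroid assigned to points[i].
    """
    # Step 2: use the first k objects as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)

    for _ in range(max_iter):
        # Steps 3 and 4: compute distances and pick the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when no object changes group.
        if new_labels == labels:
            break
        labels = new_labels

        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids


# Hypothetical one-dimensional data for illustration only.
data = [(1,), (2,), (10,), (11,)]
labels, centroids = kmeans(data, k=2)
print(labels, centroids)
```

With this sample, the two low values end up in one cluster and the two high values in the other. Because the initial centroids affect the outcome, different seed choices can yield different clusters on real data.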


Fig.12. K-Means Nearest Neighbor Flowchart

To perform the k-means algorithm, Matlab was used. Matlab is one of the most popular software packages, especially in technical and engineering fields, and it has a toolbox with a built-in function to calculate k-means (Savaş and Yıldız, 2010). Visualization techniques, such as graphing user access patterns, can also be done using the software.

Figure 13 is the Matlab code to calculate and graph the clusters for the number of websites that are being accessed by the students and employees.

Fig.13. Matlab code for clustering and graphing the number of websites accessed by students and employees

Table 9. Summary on the number of websites accessed by students and employees for SY 2017-2018

Table 10. Clustered data on the number of websites accessed for 1st Semester, SY: 2017-2018

Table 11. Clustered data on the number of websites accessed for 2nd Semester, SY: 2017-2018

Table 9 presents the data in the CSV file that is loaded into Matlab. This data is required for the code shown in Figure 13 to work. After running the Matlab code, the clusters for the number of websites accessed are identified; they are presented in Tables 10, 11, and 12.

Table 12. Clustered data on the number of websites accessed throughout the SY: 2017-2018

Figure 14 shows the Matlab code to calculate and graph the clusters for the number of sessions made by the students and employees.

Fig.14. Matlab code for clustering and graphing sessions

After running the Matlab code, the clusters for the number of sessions made by the students and employees are identified; they are presented in Tables 13, 14, and 15.

Table 13. Clustered Sessions Data for 1st Semester, SY: 2017-2018

Table 14. Clustered Sessions Data for 2nd Semester, SY: 2017-2018

Table 15. Clustered Sessions Data for SY: 2017-2018


3.5. Pattern Analysis

The last phase of web usage mining is pattern analysis, which deals with the visualization and interpretation of unusual patterns for users and the filtering out of uninteresting information. Visualization helps an analyst better understand navigation patterns and predict trends in the data (Geeta et al., 2008). Thus, visualization techniques, such as graphing patterns, are utilized for easier interpretation of the results (Lakshmi, 2013).

3.5.1. User Access Patterns Graphs

Figure 15 shows the data points of the clusters for the number of websites accessed by the students and employees for the 1st semester of school year 2017-2018. Based on Table 10 and Figure 15, group 2 (in red) has more data points compared to group 1 (in blue), which means that students and employees tend to access more sites from 9:00 AM to 5:59 PM.

Fig.15. Data points of the clusters on the number of websites being accessed for 1st Semester, SY: 2017-2018

Figure 16 displays the data points of the clusters for the number of websites accessed by the students and employees for the 2nd semester of the school year 2017-2018. Based on Table 11 and Figure 16, group 2 (in blue) has more data points compared to group 1 (in red), which means that students and employees still tend to access more sites from 9:00 AM to 5:59 PM.

Fig.16. Data points of the clusters on the number of websites being accessed for 2nd Semester, SY: 2017-2018

Figure 17 shows the data points of the clusters for the number of websites accessed by the students and employees for the whole school year 2017-2018. Based on Table 12 and Figure 17, group 2 (in blue) has more data points compared to group 1 (in red), which means that for the whole school year students and employees were inclined to browse more sites from 9:00 AM to 5:59 PM.

Fig.17. Data points of the clusters on the number of websites accessed throughout SY: 2017-2018

Figure 18 presents the data points of the clusters for the number of sessions made by the students and employees for the 1st semester of school year 2017-2018. Based on Table 13 and Figure 18, group 1 (in blue) has more data points compared to group 2 (in red), which means that students and employees are likely to have more sessions from 8:00 AM to 4:59 PM.

Fig.18. Data points of the clusters for the number of sessions made by students and employees for 1st Semester, SY: 2017-2018

Figure 19 illustrates the data points of the clusters for the number of sessions made by the students and employees for the 2nd semester of the school year 2017-2018. Based on Table 14 and Figure 19, group 1 (in blue) has more data points compared to group 2 (in red), which means that students and employees still tend to have more sessions from 8:00 AM to 4:59 PM.

Fig.19. Data points of the clusters for the number of sessions made by students and employees for 2nd Semester, SY: 2017-2018

Figure 20 shows the data points of the clusters for the number of sessions made by the students and employees throughout the school year 2017-2018. Based on Table 15 and Figure 20, group 1 (in blue) has more data points compared to group 2 (in red), which means that during the school year 2017-2018, students and employees were more active in using the Internet from 8:00 AM to 4:59 PM.

Fig.20. Data points of the clusters for the number of sessions made by students and employees throughout SY: 2017-2018

4. Conclusion

Analyzing the logs coming from the proxy server and captive portal may help school administrators in their data analysis pertaining to the utilization of the internet on campus, such as determining the times of heavy traffic in the network. Based on the results, users were most engaged in utilizing the internet during daytime hours: more websites were accessed from 9:00 AM to 5:59 PM, and more sessions were made from 8:00 AM to 4:59 PM. The analysis can also be used to identify when the students and employees stay active in browsing the internet and the number of websites they access. It is recommended that future work explore clustering algorithms other than k-means in identifying and grouping web user patterns, utilize web and captive portal logs from previous school years for wider data coverage, and extract other attributes from the logs, since the study focused only on the number of sessions created and websites accessed by the students and employees for a given time range.

Declarations

Source of Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Competing Interests Statement

The authors declare no competing financial, professional and personal interests.

Consent for publication

We declare that we consent to the publication of this research work.

References

Ansari, Z. (2014). Web User Session Cluster Discovery Based on k-Means and k-Medoids Techniques. International Journal of Computer Science & Engineering Technology (IJCSET), 1105-1113.

Brodlie, K., Osorio, R. A., & Lopes, A. (2012). A review of uncertainty in data visualization. In Expanding the frontiers of visual analytics and visualization (pp. 81-109). Springer London.

Chitraa, V., Davamani, D., & Selvdoss, A. (2010). A survey on preprocessing methods for web usage data. arXiv preprint arXiv:1004.1257.

Chitraa, V., & Thanamani, A. S. (2012). An Enhanced Clustering Technique for Web Usage Mining. International Journal of Engineering Research & Technology (IJERT), Vol. 1.

da Costa, M. G., & Gong, Z. (2005, July). Web structure mining: an introduction. In 2005 IEEE International Conference on Information Acquisition (pp. 6-pp). IEEE.

Daniels, A., King, S., & Warwick, J. (2012). “A Week in the Life”: A Visual Analysis of Internet Use by School-Age Students.

Deshmukh, C. R., & Shelke, R. R. (2015). URL Mining Using Agglomerative Clustering Algorithm. International Journal of Science, Engineering and Computer Technology, 5(2), 24.

Domenech, J. M., & Lorenzo, J. (2007, December). A tool for web usage mining. In International Conference on Intelligent Data Engineering and Automated Learning (pp. 695-704). Springer Berlin Heidelberg.


Dong, D. (2009, May). Exploration on web usage mining and its application. In Intelligent Systems and Applications, 2009. ISA 2009. International Workshop on (pp. 1-4). IEEE.

Dunham, M. H. (2006). Data mining: Introductory and advanced topics. Pearson Education India.

Geeta, R.B., Shashikumar, Totad, G., Prasad, & Reddy PVGD. (2009). Amalgamation of Web Usage Mining and Web Structure Mining.

Gomathi, C., Moorthi, M., & Duraiswamy, K. (2008). Preprocessing of Web Log Files in Web Usage Mining. The Icfai Journal of Information Technology, 35J-2008-03-06-01, pp. 55-66.

Ghosh, S., & Dubey, S. K. (2013). Comparative analysis of k-means and fuzzy c-means algorithms. International Journal of Advanced Computer Science and Applications, 4(4).

Ivancsy, R., & Kovacs, F. (2006, February). Clustering techniques utilized in web usage mining. In Proceedings of the 5th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases (pp. 237-242).

Kärkkäinen, T., Pitkänen, M., & Ott, J. (2013). Enabling ad-hoc-style communication in public wlan hot-spots. ACM SIGMOBILE Mobile Computing and Communications Review, 17(1), 4-13.

Khede, M. A., & Raikwal, M. J. (2015). Applying Web Usage and Structural Mining for Web-Page Recommendations: A Survey.

Labucay, I. D., & Stations, S. W. (2011, September). Internet use in the Philippines. In Annual Conference of the World Association for Public Opinion (pp. 21-21).

Lakshmi, N., Rao, R. S., & Reddy, S. S. (2013). An Overview of Preprocessing on Web Log Data for Web Usage Analysis. International Journal of Computer Applications. India, 2(4), 274-279.

Lavrač, N., Bohanec, M., Pur, A., Cestnik, B., Debeljak, M., & Kobler, A. (2007). Data mining and visualization for decision support and modeling of public health-care resources. Journal of Biomedical Inf., 40(4), 438-447.

Levitus, S., Antonov, J. I., Baranova, O. K., Boyer, T. P., Coleman, C. L., Garcia, H. E., & Reagan, J. R. (2013). The world ocean database. Data Science Journal, 12(0), WDS229-WDS234.

Liu, B. (2007). Web data mining: exploring hyperlinks, contents, and usage data. Springer Sci. & Business Media.

Lohr, S. (2012). The age of big data. New York Times, 11.

Lokeshkumar, R., Sindhuja, R., & Sengottuvelan, D. P. (2014). A Survey on Pre-processing of Web Log File in Web Usage Mining to Improve the Quality of Data. International Journal of Emerging Technology and Advanced Engineering, ISSN, 2250-2459.

Losarwar, V., & Joshi, D. M. (2012, July). Data Preprocessing in Web Usage Mining. In International Conference on Artificial Intelligence and Embedded Systems (ICAIES'2012) July (pp. 15-16).


McAfee, A., Brynjolfsson, E., Davenport, T. H., Patil, D. J., & Barton, D. (2012). Big data. The management revolution. Harvard Bus Rev, 90(10), 61-67.

Pamutha, T., Chimphlee, S., Kimpan, C., & Sanguansat, P. (2012). Data preprocessing on web server log files for mining users access patterns. International Journal of Research and Reviews in Wireless Communications (IJRRWC), Vol. 2.

pfSense (2016). pfSense Overview. Available at ://www.pfsense.org/about-pfsense/ [Retrieved on Aug. 8, 2016]

Poornalatha, G., & Raghavendra, P. S. (2011, July). Web user session clustering using modified K-means algorithm. In International Conference on Advances in Computing and Communications (pp. 243-252). Springer Berlin Heidelberg.

Ribeiro, A., & Pereira, H. (2009). L7 Classification and Policing in the pfSense Platform. In 21st International Teletraffic Congress (ITC 21), Paris, France.

Sathiyamoorthi, V., & Murali Bhaskaran, V. (2009). Data Preparation Techniques for Web Usage Mining in World Wide Web-An Approach. International Journal of Recent Trends in Engineering, Vol. 2, No. 4.

Savaş, K., & Yıldız, K. (2010). A Web Based Clustering Analysis Toolbox (WBCA) design Using MATLAB. Procedia-Social and Behavioral Sciences, 2(2), 5276-5280.

Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. N. (2000). Web usage mining: Discovery and applications of usage patterns from web data. Acm Sigkdd Explorations Newsletter, 1(2), 12-23.

Sumathi, C. P., Padmaja Valli, R., & Santhanam, T. (2011). An Overview of Preprocessing of Web Log Files for Web Usage Mining. Journal of Theoretical and Applied Information Technology, Vol. 34, No. 1.

Suneetha, K. R., & Krishnamoorthi, R. (2009). Identifying user behavior by analyzing web server access log file. IJCSNS International Journal of Computer Science and Network Security, 9(4), 327-332.

Suresh, R. M., & Padmajavalli, R. (2006, December). An overview of data preprocessing in data and web usage mining. In 2006 1st International Conference on Digital Information Management.

Teknomo, K. (2007). “K-Means Clustering Tutorial,” Available at http://www.croce.ggf.br/dados/K%20mean%20Clustering1.pdf [Retrieved on August 9, 2016]

Yuan, F., Wang, L. J., & Yu, G. (2003, November). Study on data preprocessing algorithm in web log mining. In Machine Learning and Cybernetics, 2003 International Conference on (Vol. 1, pp. 28-32). IEEE.

Zhang, Y., Xiao, Y., Chen, M., Zhang, J., & Deng, H. (2012). A survey of security visualization for computer network logs. Security and Communication Networks, 5(4), 404-421.
