Application of ARIMA(1,1,0) Model for Predicting Time Delay of Search Engine Crawlers

26 Informatica Economică vol. 17, no. 4/2013 Application of ARIMA(1,1,0) Model for Predicting Time Delay of Search Engine Crawlers Jeeva JOSE1, P. Sojan LAL2 1 Department of Computer Applications, BPC College, Piravom, Kerala, India 2 School of Computer Sciences, Mahatma Gandhi University, Kottayam, Kerala, India [email protected], [email protected] World Wide Web is growing at a tremendous rate in terms of the number of visitors and number of web pages. Search engine crawlers are highly automated programs that periodically visit the web and index web pages. The behavior of search engines could be used in analyzing server load, quality of search engines, dynamics of search engine crawlers, ethics of search engines etc. The more the number of visits of a crawler to a web site, the more it contributes to the workload. The time delay between two consecutive visits of a crawler determines the dynamicity of the crawlers. The ARIMA(1,1,0) Model in time series analysis works well with the forecasting of the time delay between the visits of search crawlers at web sites. We con- sidered 5 search engine crawlers, all of which could be modeled using ARIMA(1,1,0).The re- sults of this study is useful in analyzing the server load. Keywords: ARIMA, Search Engine Crawler, Web logs, Time delay, Prediction Introduction before it crawls the web pages. The crawlers 1 Crawlers also known as ‘bots’, ‘robots’ or which access this file first and proceeds to ‘spiders’ are highly automated programs crawling are known as ethical crawlers and which are seldom regulated manually[1][2]. other crawlers who do not access this file are Crawlers form the basic building blocks of called unethical crawlers. The robots.txt file search engines which periodically visit the contains the information about which pages web sites, identify new web sites, update the are allowed for crawling and which all fold- new information and index the web pages in ers and pages are denied access. Certain pag- search engine archives. The log files generat- es and folders are denied access because they ed at web sites play a vital role in analyzing contain sensitive information which is not in- user as well as the behavior of the crawlers. tended to be publically available. There may Most of the works in web usage mining or be situations where two or more versions of a web log mining is related to user behavior as page will be available one as html and other they have application in target advertising, one as pdf. The crawlers can be made do online sales and marketing, market basket avoid crawling the pdf version to avoid re- analysis, personalization etc. There is open dundant crawling. Also certain files like Ja- source software available like Google Ana- vaScript, images, style sheets etc. can be lytics which measures the number of visitors, avoided for saving the time and bandwidth. duration of the visits, the demographic from There are two ways to do this. First one is which the visitor comes etc. But it cannot with the help of robots meta tag and the other identify search engine visits because Google one is with the help of robots.txt file. The ro- Analytics track users with the help of JavaS- bots.txt file contains the list of all user agents cript and search engine crawlers do not ena- and the folders or pages which are disallowed ble the JavaScript embedded in web pages [30]. The structure of a robots.txt file is fol- when the crawlers visit the web sites [3]. lows. The search engine crawlers initially access User-agent: the robots.txt file which specifies the Robot Disallow: Exclusion Protocol. Robots.txt is a text file “User-agent:” is the search engine crawler kept at the root of the web site directory. The and “Disallow:” lists the files and directories crawlers are supposed to access this file first to be excluded from indexing. In addition to DOI: 10.12948/issn14531305/17.4.2013.03 Informatica Economică vol. 17, no. 4/2013 27 “User-agent:” and “Disallow:” entries, com- ther increase, when there are a number of ment lines are included by putting the # sign mobile crawlers from different search en- at the beginning of the line. For example all gines. user agents are disallowed from accessing the all these mobile crawlers will stay in the folder /a.# All user agents are disallowed to memory of the remote system and will see the /a directory. consume lot of memory that could have User-agent: * otherwise been used for some other use- Disallow: /a/ ful purposes; The crawlers which initially access the ro- it can also happen that the remote system bots.txt and then the subsequent files or fold- may not allow the mobile crawlers to re- ers are known as ethical crawlers whereas side permanently in its memory due to others are known as unethical crawlers. Some security reasons; crawlers like “Googlebot”, “Yahoo! Slurp” in case a page changes very quickly then and “MSNbot” cache the robots.txt file for a the mobile crawler immediately accesses web site and hence during the modification the changed page and sends it to the of robots.txt file, these robots may disobey search engine to maintain up-to-date in- the rules. Roughly, a crawler starts oﬀ with dex. This will result in wastage of net- the URL for an initial page p0. It retrieves p0, work bandwidth and CPU cycles etc [30]. extracts any URLs in it, and adds them to a Recently web crawlers are used for focused queue of URLs to be scanned. Then the crawling, shopbot implementation and value crawler gets URLs from the queue (in some added services on the web. As a result more order), and repeats the process. Every page active robots are crawling on the web and that is scanned is given to a client that saves many more are expected to follow which will the pages, creates an index for the pages, or increase the search engine traffic and web summarizes or analyzes the content of the server activity [4]. The Auto Regressive In- pages [26]. Certain crawlers avoid too much tegrated Moving Average (ARIMA) Model load on a server by crawling the server at a was used to predict the time delay between low speed during peak hours of the day and two consecutive visits of a search engine at a high speed during late night and early crawler. We used the differenced first-order morning [2]. A crawler for a large search en- autoregressive model, ARIMA(1,1,0) for gine has to address two issues. First, it has to forecasting the time delay between two con- have a good crawling strategy, i.e., a strategy secutive visits of search engine crawlers. for deciding which pages to download next. Second, it needs to have a highly optimized 2 Background Literature system architecture that can download a large There are several works that mentions about number of pages per second while being ro- the search engine crawler behavior. A fore- bust against crashes, manageable, and con- casting model is proposed for the number of siderate of resources and web servers [24]. pages crawled by search engine crawlers at a There are two important aspects in designing web site [3]. Sun et al has conducted a large efficient web spiders, i.e. crawling strategy scale study of robots.txt [2]. A characteriza- and crawling performance. Crawling strate- tion study and metrics of search engine gy deals with the way the spider decides to crawlers is done to analyze the qualitative what pages should be downloaded next. Gen- features, periodicity of visits and the perva- erally, the web spider cannot download all siveness of visits to a web site [4]. The work- pages on the web due to the limitation of its ing of a search engine crawler is explained in resources compared to the size of the web [5]. Neilsen NetRatings is one of the leading [28]. internet and digital media audience infor- The mobile crawlers that always stay in the mation and analysis services. NetRatings memory of the remote system occupy a con- have provided a study on the usage statistics siderable portion of it. This problem will fur- of search engines in United States [6]. Com- DOI: 10.12948/issn14531305/17.4.2013.03 28 Informatica Economică vol. 17, no. 4/2013 mercial search engines play a lead role in the traffic generated at web sites is contribut- World Wide Web information dissemination ed by search engine crawlers [13]. The ad- and access. The evidence and possible causes vantages of preprocessing are: of search engine bias is also studied [7]. An the storage space is reduced as only the empirical pilot study is done to see the rela- data relevant to web mining is stored; tionship between JavaScript usage and web the user visits and image files are re- site usage. The intention was to establish moved so that the precision of web min- whether JavaScript based hyperlinks attract ing is improved. or repel crawlers resulting in an increase or The web logs are unstructured and unformat- decrease in web site visibility [8]. The ethics ted raw source of data. Unsuccessful status of search engine crawlers is identified using codes and entries pertaining to irrelevant data quantitative models [9]. Analysis of the tem- like JavaScript, images, stylesheets etc. in- poral behavior of search engine crawlers at cluding user information are removed. The web sites is also done [10]. There is a signifi- most widely used log file formats are Com- cant difference in the time delay between and mon Log File Format and Extended Log File among various search engine crawlers at web Format.

Application of ARIMA(1,1,0) Model for Predicting Time Delay of Search Engine Crawlers

Study of Web Crawler and Its Different Types

Distributed Web Crawling Using Network Coordinates

Panacea D4.1

Report for Portal Specific

Scalability and Efficiency Challenges in Large-Scale Web Search

Towards a Distributed Web Search Engine

Distributed Web Crawlers Using Hadoop

Design and Implementation of Distributed Web Crawler for Drug Website Search Using Hefty Based Enhanced Bandwidth Algorithms

A Practical Geographically Distributed Web Crawler

A Focused Web Crawler Driven by Self-Optimizing Classifiers

UCYMICRA: Distributed Indexing of the Web Using Migrating Crawlers

DATA20021 Information Retrieval Lecture 5: Web Crawling Simon J