Intelligent Web Agent for Search Engines
International Conference on Trends and Advances in Computation and Engineering, TRACE-2010

Avinash N. Bhute, Harsha A. Bhute, Dr. B. B. Meshram

Avinash N. Bhute is with Sinhgad College of Engineering, Department of Information Technology, as Assistant Professor, Pune, Maharashtra, India; email: [email protected]. Harsha A. Bhute is pursuing an ME in Computer Engineering (Computer Networks) at S.C.O.E., Pune, Maharashtra, India; [email protected]. Dr. B. B. Meshram is Professor and Head of the Computer Engineering Department, VJTI, Mumbai, Maharashtra, India; [email protected].

Abstract

In this paper we review studies of the growth of the Internet and the technologies that are useful for information search and retrieval on the Web; search engines are used to retrieve this information efficiently. We collected data on the Internet from several different sources, e.g., current as well as projected numbers of users, hosts, and Web sites. The trends cited by these sources are consistent and point to exponential growth in the past and in the coming decade. It is therefore not surprising that about 85% of Internet users surveyed claim to use search engines and search services to find specific information, yet users are not satisfied with the performance of the current generation of search engines: retrieval is slow, communication delays are high, and the quality of retrieved results is poor. Web agents, programs acting autonomously on some task, are already present in the form of spiders, crawlers, and robots. Agents offer substantial benefits as well as hazards, and because of this their development must involve careful attention to technical detail. This paper illustrates the different types of agents, crawlers, robots, etc. used for mining the contents of the Web in a methodical, automated manner, and also discusses the use of crawlers to gather specific types of information from Web pages, such as harvesting e-mail addresses.

Keywords: Chatterbot, spiders, Web agents, Web crawler, Web robots.

I. INTRODUCTION

The potential Internet-wide impact of autonomous software agents on the World Wide Web [1] has spawned a great deal of discussion and occasional controversy. Based upon experience in the design and operation of the Spider [2], such tools can create substantial value for users of the Web. Unfortunately, agents can also be pests, generating substantial loads on already overloaded servers and generally increasing Internet backbone traffic. Much of the discussion in this paper is about agents, their types, and their architecture.

II. WEB CRAWLER

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms [3], Web spiders, and Web robots; the activity itself is alternatively called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code, and for gathering specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam). A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
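As a rough illustration of the seed-and-frontier behaviour just described, the following Python sketch maintains a frontier queue and a visited set. It is a minimal example rather than the system discussed in this paper; the seed URL, the page limit, and the helper names are assumptions chosen for the illustration.

    # Minimal illustrative crawler: seeds, a crawl frontier, and a visited set.
    # The seed URL and page limit are placeholders, not values from the paper.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href attribute of every anchor tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=50):
        frontier = deque(seeds)          # URLs still to be visited
        visited = set()                  # URLs already fetched
        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                 # unreachable pages are simply skipped
            visited.add(url)
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)      # resolve relative links
                if absolute.startswith("http") and absolute not in visited:
                    frontier.append(absolute)      # grow the crawl frontier
        return visited

    # Example usage (hypothetical seed): crawl(["https://example.org/"])

The order in which URLs are taken from the frontier, how often pages are re-fetched, and how requests are spread over servers and processes are exactly the policy questions discussed next.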
A. Crawling policies

There are three important characteristics of the Web that make crawling it very difficult: its large volume, its fast rate of change, and dynamic page generation. These characteristics combine to produce a very large number of possible crawlable pages. The large volume implies that the crawler can only download a fraction of the Web pages within a given time. The behaviour of a Web crawler is the outcome of a combination of the following policies: a selection policy, a re-visit policy, a politeness policy, and a parallelization policy.

i. Selection policy

The selection policy states which pages to download. Because a crawler always downloads just a fraction of the Web pages, and given the current size of the Web, even large search engines cover only a portion of the publicly available Internet; a study by Lawrence and Giles showed that no search engine indexed more than 16% of the Web in 1999 [4]. It is therefore highly desirable that the downloaded fraction contains the most relevant pages and not just a random sample of the Web. Cho et al. made the first study on policies for crawl scheduling. Their data set was a 180,000-page crawl from the stanford.edu domain, on which a crawling simulation was run with different strategies [5]. The ordering metrics tested were breadth-first, backlink count, and partial PageRank calculations. One of the conclusions was that if the crawler wants to download pages with high PageRank early in the crawling process, then the partial PageRank strategy is the best, followed by breadth-first and backlink count. However, these results are for just a single domain. Najork and Wiener [6] performed an actual crawl of 328 million pages using breadth-first ordering. They found that a breadth-first crawl captures pages with high PageRank early in the crawl (but they did not compare this strategy against other strategies). The explanation given by the authors for this result is that "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates."

ii. Re-visit policy

The re-visit policy states when to check for changes to the pages. The Web has a very dynamic nature, and crawling even a fraction of the Web can take a very long time, usually measured in weeks or months. By the time a Web crawler has finished its crawl, many events could have happened, including creations, updates, and deletions of pages. From the search engine's point of view, there is a cost associated with not detecting an event and thus holding an outdated copy of a resource. The most-used cost functions are freshness and age [7]. Two simple re-visiting policies were studied by Cho and Garcia-Molina [8]:

Uniform policy: re-visit all pages in the collection with the same frequency, regardless of their rates of change.

Proportional policy: re-visit more often the pages that change more frequently; the visiting frequency is directly proportional to the (estimated) change frequency.

(In both cases, the repeated crawling of pages can be done either in a random or in a fixed order.) Cho and Garcia-Molina proved the surprising result that, in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl. The explanation for this result is that, when a page changes too often, the crawler wastes time trying to re-crawl it too quickly and still cannot keep its copy of the page fresh.
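As a rough illustration of the two cost functions mentioned above, the sketch below follows the commonly used definitions associated with freshness and age: a copy is fresh while it still matches the live page, and its age grows with the time elapsed since the live page changed. The record fields and timestamps are assumptions for the example, not notation taken from the paper.

    # Illustrative freshness and age measures for a single crawled page.
    # Field names and example timestamps are placeholders for this sketch.
    from dataclasses import dataclass

    @dataclass
    class PageRecord:
        last_modified: float   # time the live page last changed (seconds)
        last_synced: float     # time the crawler last refreshed its copy

    def freshness(page: PageRecord) -> int:
        """1 if the local copy still matches the live page, else 0."""
        return 1 if page.last_synced >= page.last_modified else 0

    def age(page: PageRecord, now: float) -> float:
        """0 while the copy is up to date; otherwise the time elapsed
        since the live page changed and the copy became stale."""
        if page.last_synced >= page.last_modified:
            return 0.0
        return now - page.last_modified

    # Example: a page that changed at t=100 but was last synced at t=90
    # is stale; at t=130 its freshness is 0 and its age is 30.

Under this view, the uniform policy tends to maximize average freshness because pages that change faster than the crawler can revisit them contribute little freshness no matter how often they are re-crawled.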
iii. Politeness policy

The politeness policy states how to avoid overloading Web sites. Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can have a crippling impact on the performance of a site: if a single crawler performs multiple requests per second and/or downloads large files, a server can have a hard time keeping up with requests from multiple crawlers.

As noted by Koster [9], the use of Web crawlers is useful for a number of tasks but comes with a price for the general community. The costs of using Web crawlers include:

Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism over a long period of time;

Server overload, especially if the frequency of accesses to a given server is too high;

Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle; and

Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol [10], a standard by which administrators indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Recently, commercial search engines such as Ask Jeeves, MSN, and Yahoo have begun to support an extra "Crawl-delay:" parameter in the robots.txt file, which indicates the number of seconds to delay between requests (a minimal sketch of a crawler respecting these directives is given after Section III below).

iv. Parallelization policy

The parallelization policy states how to coordinate distributed Web crawlers. A parallel crawler is a crawler that runs multiple crawling processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and avoiding repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, since the same URL can be found by two different crawling processes.

III. WEB CRAWLER ARCHITECTURE

Web crawlers are a central part of search engines, and details of their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.
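The sketch referred to in the politeness discussion above is given here. It checks a site's robots.txt with Python's standard urllib.robotparser before fetching, and respects a Crawl-delay value when one is present. The user-agent string, the one-second default delay, and the helper functions are assumptions made for the example, not part of the protocol or of this paper.

    # Illustrative politeness check: consult robots.txt and its Crawl-delay
    # directive before fetching a URL. User agent and defaults are placeholders.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse, urlunparse

    USER_AGENT = "ExampleCrawler"      # assumed name for this sketch
    _last_request = {}                 # host -> time of the previous request

    def robots_rules(url):
        """Return (allowed, delay_seconds) for the given URL."""
        parts = urlparse(url)
        robots_url = urlunparse((parts.scheme, parts.netloc,
                                 "/robots.txt", "", "", ""))
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(robots_url)
        try:
            rp.read()                  # download and parse robots.txt
        except OSError:
            return True, 1.0           # unreachable robots.txt: assume a default delay
        delay = rp.crawl_delay(USER_AGENT) or 1.0
        return rp.can_fetch(USER_AGENT, url), float(delay)

    def wait_before_request(url):
        """Return False if the URL is disallowed; otherwise sleep so that
        requests to the same host are spaced by at least the crawl delay."""
        allowed, delay = robots_rules(url)
        if not allowed:
            return False
        host = urlparse(url).netloc
        elapsed = time.time() - _last_request.get(host, 0.0)
        if elapsed < delay:
            time.sleep(delay - elapsed)
        _last_request[host] = time.time()
        return True

In a real crawler the parsed robots.txt would be cached per host rather than re-fetched on every request; the per-host spacing shown here is also the natural point at which a parallel crawler would partition URLs among its processes.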