
Information Retrieval: Web Based Analysis and Searching


Karan Vombatkere ([email protected]) and Avram Webberman ([email protected])
Department of Computer Science, University of Rochester
CSC 461: Database Systems, Term Paper
(Dated: 8 December 2017)

This paper provides a survey of Information Retrieval (IR) methods and techniques in relation to web crawling and web based searching. The information retrieval model is introduced and defined, and compared with the structured database model. The main approaches to IR and query modes are discussed. Several techniques for performing and optimizing web based Information Retrieval are then evaluated and compared. Additionally, the challenges of web crawling for IR are summarized, and avenues for future research and trends are discussed.

I. INTRODUCTION

Information retrieval is the process of retrieving documents from a collection in response to a query or a search request by a user. An IR system differs from a database system in that it does not limit the user to a specific query language (such as SQL or MongoDB). Additionally, an IR system offers flexibility because the user does not need any prior knowledge of the schema or structure of a particular database. Thus, the user is able to issue a free-form search request, which is taken as input by the IR system and processed to provide the user with the desired information.

Web search engines are one of the dominant ways to find information, and it is thus critical for Information Retrieval systems to make the Web a quickly accessible repository of knowledge in the form of a virtual digital library. IR systems are often characterized at different levels: types of users, data, and information need. This characterization is essential to design an efficient IR system that can address specific problems.

It is important to highlight the difference between database systems and Information Retrieval systems. The primary difference lies in the fact that databases handle highly structured and organized data, and are heavily optimized towards fast querying using a specified language such as SQL. IR systems, on the other hand, particularly web based search systems, must process free-form search input over unstructured data on the Web and return a list of pointers to documents. The interesting aspect of an IR system is that there is often no 'correct' answer as a consequence of the free-form search request, and the documents to be returned must undergo complex statistical analysis to determine the most relevant set (and ordering) of information. Note that sometimes a search request might even have a 'correct' answer (such as a mathematical query), and the IR system might then be expected to return that answer. A good example of this is a modern search engine.

The following Table 27.1 from Fundamentals of Database Systems [1] summarizes the comparison between databases and Information Retrieval systems.

II. APPROACHES TO INFORMATION RETRIEVAL

Web search combines both browsing and retrieval and is thus one of the main applications of Information Retrieval today, where web pages are analogous to the documents mentioned previously. Efficient searching and retrieval is achieved using an indexed repository of web pages, usually built with the technique of inverted indexing. Web pages are then ranked in order of relevance, and information is retrieved according to the user's query, based on several web analysis techniques which are discussed in more detail subsequently.

There are two main approaches to IR: statistical and semantic. In the statistical approach, documents are analyzed and broken down into chunks of text, and each word or phrase is counted, weighted, and measured for relevance or importance. These statistical properties are then compared with terms from the query, and a relevance ranking for each document is generated. The methods/models used for this relevance assessment are the Boolean model, the Vector Space model, and the Probabilistic model.

In the semantic approach to IR, the syntactic, lexical, sentential, and pragmatic levels of knowledge-understanding are used to generate a relevance ranking for documents. The development of a sophisticated semantic system requires complex knowledge bases of semantic information as well as retrieval heuristics.

III. WEB SEARCH AND ANALYSIS

The application of data analysis techniques for the discovery and analysis of useful information from the Web is known as Web Analysis. The primary goals of web analysis are to optimize and personalize the relevance of search results in IR, as well as to identify trends that may be of value to users. The three broad sub-categories of web analysis, in terms of structure, content, and usage, are detailed below.

1. Web Structure Analysis: Web search engines crawl the web and create an index to the web for searching purposes. The goal is to generate a structural representation of the website and its webpages, by focusing on the inner structure of documents and on the linking structure defined by the hyperlinks at the inter-document level. The PageRank algorithm is an example of a method to rank the relevance of a web page based on the importance of all the pages that link to it. If we consider a web surfer clicking on links to traverse pages randomly, PageRank estimates how often a particular page would be visited, ranking nodes based on their structural importance. Algorithms like PageRank help identify a suitable relevance ranking for information retrieval when there are hundreds of thousands of potential webpages relevant to a particular query. The HITS algorithm is another ranking algorithm that exploits the link structure of the Web.

2. Web Content Analysis: This is the process of discovering useful information from Web content/data/documents, which contain a combination of structured and unstructured data. Structured data extraction usually involves the use of a wrapper technique that looks for different structural characteristics of the information on the page (either manually encoded or using AI methods) to extract specific content. Web crawlers also often integrate information from diverse web pages to reduce content redundancy. The generation of concept hierarchies and/or clustering methods is another technique used to organize web content in order to facilitate more accurate and efficient IR.

3. Web Usage Analysis: Web usage analysis is the application of data analysis techniques to discover usage patterns from Web data, in order to understand and better serve the needs of IR systems. Web usage data describes the pattern of usage of Web pages, such as IP addresses, page references, and the date and time of accesses for a user/application. Web usage analysis consists of three main phases: preprocessing, pattern discovery, and pattern analysis. Once the data is processed into a form ready for analysis, methods from statistics, machine learning, and data mining are used to discover patterns in the usage data that might provide insights and highlight trends in the data.

Web Crawling: Web Crawlers are programs that "crawl" through the internet, generating copies of the Web pages that they visit and downloading the content to a database. The crawler visits a URL, parses the content of that URL, and looks for unvisited URLs to add to its list of pages to visit in the future. While Web Crawlers have multiple uses, they are used by search engines to provide fast searches to the user.

This paper will discuss three main types of web crawlers. These categories are referred to as focused, distributed, and incremental.

Focused Web Crawlers are web crawlers that collect Web pages with some specific properties by prioritizing a crawl frontier. The crawl frontier dictates the logic that the crawler should follow when visiting websites. One type of focused web crawler uses a keyword based approach, extracting URLs from Web pages that contain specific keywords and ignoring those that do not. Another type uses exemplary documents rather than keywords.

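The PageRank random-surfer model described under Web Structure Analysis can likewise be sketched as a simple power iteration. The four-page graph is hypothetical and the damping factor 0.85 is the conventional choice; this is illustrative, not the production algorithm:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline probability of a random jump.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages  # dangling pages spread rank everywhere
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))  # "C": the page most other pages link to
```

The scores form a probability distribution over pages, so a page's rank can be read directly as the fraction of time the random surfer spends on it.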
Exemplary documents are information documents that specify the intellectual structure of some specific field.

Distributed Web Crawlers represent a type of distributed computing in which search engines use multiple computers to index the content of the internet. There are several varieties of distributed web crawlers. One of these uses a data mining based approach, a process for analyzing data across various fields and extracting valuable information. Some of the data mining concepts used for efficient retrieval of information are clustering, classification, and fuzzy sets.

A different approach to distributed web crawling is based on scalability, which refers to increased throughput as the number of crawler nodes increases. One distributed, scalable web crawler implementation is the DCrawler.

Some of the other approaches to distributed web crawling rely on Peer-To-Peer networks, distributed hash tables, and a model based approach that uses an estimation of the accuracy and availability of the distributed web crawler.

Incremental Web Crawlers implement the process of revisiting and prioritizing URLs. One approach to incremental web crawlers calculates the refresh time of pages to improve freshness. One proposed implementation of an efficient incremental web crawler dynamically calculates the refresh time of each page for the next update, as well as which Web pages need frequent updating.

IV. CHALLENGES AND FURTHER TRENDS

Challenges: Various challenges to the development of good information retrieval techniques exist, due largely to both the unstructured nature of the data and the subjective nature of evaluating the results. Some of the major challenges of information retrieval in Web searching relate to Web Structure, Crawling and Indexing, Searching, and Data Collection and Evaluation.

The challenges of Web Structure refer to issues in defining the document collection of the web, as well as in examining how the unique structure of these documents affects how information is retrieved. One challenge is determining whether it is more relevant to the user to retrieve a single Web page or a collection of Web pages (a Web site). Another challenge related to Web Structure is the connection of pages by hyperlinks, which makes the relevance of different documents dependent on each other. For example, a Web page itself might not be relevant, but it could link to a Web page that is relevant.

Crawling and Indexing present the challenge of finding an architecture that provides fresh information as well as full coverage of the web. This is problematic when using a centralized search engine. One question is whether it would be beneficial to use a distributed architecture instead.

The challenge of Searching is to find techniques that use all available evidence to return high quality search results to users. Doing this efficiently involves using known information about the user and their history. Other questions involve how to use language models to represent information as precisely as possible. Another interesting concern is how to measure how trustworthy a certain piece of information is, and how to incorporate that as a retrieval metric. This introduces concerns about how much power this provides to search engine vendors.

Future Trends: Three main future trends in information retrieval will be discussed: Faceted Search, Social Search, and Conversational Search.

Faceted Search allows users to explore by filtering available information according to facets. A facet represents various characteristics of a class of objects, which are mutually exclusive and exhaustive. For example, a media facet could consist of oil, watercolor, stone, metal, mixed media, etc.

Social Search involves the use of other individuals in searching. Research has shown that other individuals are a valuable source of information when a user is searching. Some examples of collaborative social searching are remote collaboration on search tasks, the use of social networks for searching, and the use of expertise networks.

Conversational Search is an interactive and collaborative information finding interaction. Users interact with and provide feedback to an intelligent agent that aids the search process. A semantic retrieval model is used in conjunction with natural language understanding to provide faster and more relevant search results. The agent also works to connect users together, similarly to social search.
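The faceted filtering described under Future Trends can be sketched as a simple conjunction of attribute predicates; the artwork items and facet names here are hypothetical examples, not a real catalog schema:

```python
def facet_filter(items, selections):
    """Keep only items whose attributes match every selected facet value."""
    return [item for item in items
            if all(item.get(facet) == value
                   for facet, value in selections.items())]

artworks = [
    {"title": "Pond",  "medium": "oil",        "era": "modern"},
    {"title": "Cliff", "medium": "watercolor", "era": "modern"},
    {"title": "Torso", "medium": "stone",      "era": "classical"},
]
# Selecting the "oil" value of the media facet narrows the result set.
print(facet_filter(artworks, {"medium": "oil"}))  # only "Pond" remains
```

Because facet values are mutually exclusive and exhaustive, each additional selection can only narrow the result set, which is what makes faceted navigation predictable for users.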

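Returning to the statistical models of Section II, the Vector Space model can be sketched as tf-idf weighting plus cosine similarity. The toy corpus and the exact weighting variant (raw term frequency, idf smoothed by adding one) are illustrative assumptions, not a prescribed formulation:

```python
import math
from collections import Counter

def build_index(documents):
    """tf-idf vectors: term frequency scaled by inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # +1 keeps common terms nonzero
    vectors = [{t: count * idf[t] for t, count in Counter(doc).items()}
               for doc in tokenized]
    return vectors, idf

def rank(query, vectors, idf):
    """Order document indices by cosine similarity to the query vector."""
    q = {t: c * idf.get(t, 0.0)
         for t, c in Counter(query.lower().split()).items()}
    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return sorted(range(len(vectors)),
                  key=lambda i: cosine(q, vectors[i]), reverse=True)

docs = ["web information retrieval and search",
        "relational database query languages",
        "crawling the web for retrieval"]
vectors, idf = build_index(docs)
order = rank("web retrieval", vectors, idf)
print(order[-1])  # document 1, the database document, ranks last
```

Rare terms get large idf weights, so a query term that appears in few documents dominates the ranking, which is the core intuition behind the statistical relevance assessment described earlier.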
V. CONCLUSION

As the amount of unstructured information being generated continues to increase, it is important for information retrieval techniques to continue to improve in order to take full advantage of this information. The Web is a massive repository of unstructured data and one of the foremost applications of Information Retrieval. As the Web is still a relatively recent phenomenon, methods for improving the ways in which we search the Web and analyze the quality of the results will continue to evolve.

VI. REFERENCES

1. Elmasri, Ramez, and Shamkant Navathe. (2017). Fundamentals of Database Systems, 7th ed., Chapter 27. Pearson.

2. Saini, Chandni, and Vinay Arora. (2016). Information retrieval in web crawling: A survey. 2635-2643. 10.1109/ICACCI.2016.7732456.

3. Wazih Ahmad, Mohd, and M.A. Ansari. (2012). A Survey: Soft Computing in Intelligent Information Retrieval Systems. Proceedings - 12th International Conference on Computational Science and Its Applications, ICCSA 2012. 26-34. 10.1109/ICCSA.2012.15.

4. Aslam, Jay, et al. (2003). Challenges in Information Retrieval and Language Modeling.