
Information Retrieval: Web Based Analysis and Searching


Karan Vombatkere ([email protected]) and Avram Webberman ([email protected])
Department of Computer Science, University of Rochester
CSC 461: Database Systems, Term Paper
(Dated: 8 December 2017)

This paper provides a survey of Information Retrieval (IR) methods and techniques in relation to web crawling and web based searching. The information retrieval model is introduced and defined, and compared with the structured database model. The main approaches to IR and query modes are discussed. Several techniques for performing and optimizing web based Information Retrieval are then evaluated and compared. Additionally, the challenges of web crawling for IR are summarized, and avenues for future research and trends are discussed.

I. INTRODUCTION

Information retrieval is the process of retrieving documents from a collection in response to a query or a search request by a user. An IR system differs from a database system in that it does not limit the user to a specific query language (such as SQL or MongoDB). Additionally, an IR system offers flexibility because the user does not need any prior knowledge of the schema or structure of a particular database. Thus, the user is able to issue a free-form search request, which is taken as input by the IR system and processed to provide the user with the desired information.

Web search engines are one of the dominant ways to find information, and it is thus critical for Information Retrieval systems to make the Web a quickly accessible repository of knowledge in the form of a virtual digital library. IR systems are often characterized at different levels: types of users, data, and information need. This characterization is essential to design an efficient IR system that can address specific problems.

It is important to highlight the difference between database systems and Information Retrieval systems. The primary difference lies in the fact that databases handle highly structured and organized data, and are heavily optimized towards fast querying using a specified language such as SQL. IR systems, on the other hand, particularly web based search systems, must process free-form search input over unstructured data on the Web and return a list of pointers to documents. The interesting aspect of an IR system is that there is often no 'correct' answer as a consequence of the free-form search request, and the documents to be returned must undergo complex statistical analysis to determine the most relevant set (and ordering) of information. Note that sometimes a search request might even have a 'correct' answer (such as a mathematical query), and the IR system might then be expected to return that answer. A good example of this is a modern search engine.

The following Table 27.1 from Fundamentals of Database Systems [1] summarizes the comparison between databases and Information Retrieval systems.

II. APPROACHES TO INFORMATION RETRIEVAL

Web search combines both browsing and retrieval and is thus one of the main applications of Information Retrieval today, where web pages are analogous to the documents mentioned previously. Efficient searching and retrieval is achieved using an indexed repository of web pages, usually built with the technique of inverted indexing. Web pages are then ranked in order of relevance, and information is retrieved according to the user's query, based on several web analysis techniques which are discussed in more detail subsequently.

There are two main approaches to IR: statistical and semantic. In the statistical approach, documents are analyzed and broken down into chunks of text, and each word or phrase is counted, weighted, and measured for relevance or importance. These statistical properties are then compared with terms from the query, and a relevance ranking for each document is generated. The methods/models used for this relevance assessment are the Boolean model, the Vector Space model, and the Probabilistic model.

In the semantic approach to IR, the syntactic, lexical, sentential, and pragmatic levels of knowledge-understanding are used to generate a relevance ranking for documents. The development of a sophisticated semantic system requires complex knowledge bases of semantic information as well as retrieval heuristics.

III. WEB SEARCH AND ANALYSIS

The application of data analysis techniques for the discovery and analysis of useful information from the Web is known as Web Analysis. The primary goals of web analysis are to optimize and personalize the relevance of search results in IR, as well as to identify trends that may be of value to users. The three broad sub-categories of web analysis, in terms of structure, content, and usage, are detailed below.

1. Web Structure Analysis: Web search engines crawl the web and create an index to the web for searching purposes. The goal is to generate a structural representation of the website and its webpages, by focusing on the inner structure of documents and on the linking structure defined by the hyperlinks at the inter-document level. The PageRank algorithm is an example of a method to rank the relevance of a web page based on the importance of all the pages that link to it. If we consider a web surfer clicking on links to traverse pages randomly, PageRank estimates how often a particular page would be visited, ranking nodes based on their structural importance. Algorithms like PageRank help identify a suitable relevance ranking for information retrieval when there are hundreds of thousands of potential webpages relevant to a particular query. The HITS algorithm is another ranking algorithm that exploits the link structure of the Web.

2. Web Content Analysis: This is the process of discovering useful information from Web content/data/documents, which contain a combination of structured and unstructured data. Structured data extraction usually involves the use of a wrapper technique that looks for different structural characteristics of the information on the page (either manually encoded or using AI methods) to extract specific content. Web crawlers also often integrate information from diverse web pages to reduce content redundancy. The generation of concept hierarchies and/or clustering methods is another technique used to organize web content in order to facilitate more accurate and efficient IR.

3. Web Usage Analysis: Web usage analysis is the application of data analysis techniques to discover usage patterns from Web data, in order to understand and better serve the needs of IR systems. Web usage data describes the pattern of usage of Web pages, such as IP addresses, page references, and the date and time of accesses for a user/application. Web usage analysis consists of three main phases: preprocessing, pattern discovery, and pattern analysis. Once the data is processed into a form ready for analysis, methods from statistics, machine learning, and data mining are used to discover patterns in the usage data that might provide insights and highlight trends in the data.

Web Crawling: Web Crawlers are programs that "crawl" through the internet, generating copies of the Web pages that they visit and downloading the content to a database. The crawler visits a URL, parses the content of that URL, and looks for unvisited URLs to add to its list of pages to visit in the future. While Web Crawlers have multiple uses, they are used by search engines to provide fast searches to the user.

This paper will discuss three main types of web crawlers. These categories are referred to as focused, distributed, and incremental.

Focused Web Crawlers are web crawlers that collect Web pages with some specific properties by prioritizing a crawl frontier. The crawl frontier dictates the logic that the crawler should follow when visiting websites. One type of focused web crawler uses a keyword based approach, extracting URLs from Web pages that contain specific keywords and ignoring those that do not. Another type uses exemplary documents rather than keywords.

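The PageRank random-surfer model described under Web Structure Analysis can likewise be sketched as a simple power iteration. The four-page graph is hypothetical and the damping factor 0.85 is the conventional choice; this is illustrative, not the production algorithm:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline probability of a random jump.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks or pages  # dangling pages spread rank everywhere
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))  # "C": the page most other pages link to
```

The scores form a probability distribution over pages, so a page's rank can be read directly as the fraction of time the random surfer spends on it.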
Exemplary documents are information documents that specify the intellectual structure of some specific field.

Distributed Web Crawlers represent a type of distributed computing in which search engines use multiple computers to index the content of the internet. There are several varieties of distributed web crawlers. One of these uses a data mining based approach, a process for analyzing data across various fields and extracting valuable information. Some of the data mining concepts used for efficient retrieval of information are clustering, classification, and fuzzy sets.

A different approach to distributed web crawling is based on scalability, which refers to increased throughput as the number of crawler nodes increases. One distributed, scalable web crawler implementation is the DCrawler.

Some of the other approaches to distributed web crawling rely on Peer-To-Peer networks, distributed hash tables, and a model based approach that uses an estimation of the accuracy and availability of the distributed web crawler.

Incremental Web Crawlers implement the process of revisiting and prioritizing URLs. One approach to incremental web crawlers calculates the refresh time of pages to improve freshness. One proposed implementation of an efficient incremental web crawler dynamically calculates the refresh time of each page for the next update, as well as which Web pages need frequent updating.

IV. CHALLENGES AND FURTHER TRENDS

Challenges: Various challenges to the development of good information retrieval techniques exist, due largely to both the unstructured nature of the data and the subjective nature of evaluating the results. Some of the major challenges of information retrieval in Web searching relate to Web Structure, Crawling and Indexing, Searching, and Data Collection and Evaluation.

The challenges of Web Structure refer to issues in defining the document collection of the web, as well as in examining how the unique structure of these documents affects how information is retrieved. One challenge is determining whether it is more relevant to the user to retrieve a single Web page or a collection of Web pages (a Web site). Another challenge related to Web Structure is the connection of pages by hyperlinks, which makes the relevance of different documents dependent on each other. For example, a Web page itself might not be relevant, but it could link to a Web page that is relevant.

Crawling and Indexing present the challenge of finding an architecture that provides fresh information as well as full coverage of the web. This is problematic when using a centralized search engine. One question is whether it would be beneficial to use a distributed architecture instead.

The challenge of Searching is to find techniques that use all available evidence to return high quality search results to users. Doing this efficiently involves using known information about the user and their history. Other questions involve how to use language models to represent information as precisely as possible. Another interesting concern is how to measure how trustworthy a certain piece of information is, and how to incorporate that as a retrieval metric. This introduces concerns about how much power this provides to search engine vendors.

Future Trends: Three main future trends in information retrieval will be discussed: Faceted Search, Social Search, and Conversational Search.

Faceted Search allows users to explore by filtering available information according to facets. A facet represents various characteristics of a class of objects, which are mutually exclusive and exhaustive. For example, a media facet could consist of oil, watercolor, stone, metal, mixed media, etc.

Social Search involves the use of other individuals in searching. Research has shown that other individuals are a valuable source of information when a user is searching. Some examples of collaborative social searching are remote collaboration on search tasks, the use of social networks for searching, and the use of expertise networks.

Conversational Search is an interactive and collaborative information finding interaction. Users interact with and provide feedback to an intelligent agent that aids the search process. A semantic retrieval model is used in conjunction with natural language understanding to provide faster and more relevant search results. The agent also works to connect users together, similarly to social search.
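The faceted filtering described under Future Trends can be sketched as a simple conjunction of attribute predicates; the artwork items and facet names here are hypothetical examples, not a real catalog schema:

```python
def facet_filter(items, selections):
    """Keep only items whose attributes match every selected facet value."""
    return [item for item in items
            if all(item.get(facet) == value
                   for facet, value in selections.items())]

artworks = [
    {"title": "Pond",  "medium": "oil",        "era": "modern"},
    {"title": "Cliff", "medium": "watercolor", "era": "modern"},
    {"title": "Torso", "medium": "stone",      "era": "classical"},
]
# Selecting the "oil" value of the media facet narrows the result set.
print(facet_filter(artworks, {"medium": "oil"}))  # only "Pond" remains
```

Because facet values are mutually exclusive and exhaustive, each additional selection can only narrow the result set, which is what makes faceted navigation predictable for users.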

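Returning to the statistical models of Section II, the Vector Space model can be sketched as tf-idf weighting plus cosine similarity. The toy corpus and the exact weighting variant (raw term frequency, idf smoothed by adding one) are illustrative assumptions, not a prescribed formulation:

```python
import math
from collections import Counter

def build_index(documents):
    """tf-idf vectors: term frequency scaled by inverse document frequency."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}  # +1 keeps common terms nonzero
    vectors = [{t: count * idf[t] for t, count in Counter(doc).items()}
               for doc in tokenized]
    return vectors, idf

def rank(query, vectors, idf):
    """Order document indices by cosine similarity to the query vector."""
    q = {t: c * idf.get(t, 0.0)
         for t, c in Counter(query.lower().split()).items()}
    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return sorted(range(len(vectors)),
                  key=lambda i: cosine(q, vectors[i]), reverse=True)

docs = ["web information retrieval and search",
        "relational database query languages",
        "crawling the web for retrieval"]
vectors, idf = build_index(docs)
order = rank("web retrieval", vectors, idf)
print(order[-1])  # document 1, the database document, ranks last
```

Rare terms get large idf weights, so a query term that appears in few documents dominates the ranking, which is the core intuition behind the statistical relevance assessment described earlier.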
V. CONCLUSION

As the amount of unstructured information being generated continues to increase, it is important for information retrieval techniques to continue to improve in order to take full advantage of this information. The Web is a massive repository of unstructured data and one of the foremost applications of Information Retrieval. As the Web is still a relatively recent phenomenon, methods for improving the ways in which we search the Web and analyze the quality of the results will continue to evolve.

VI. REFERENCES

1. Elmasri, Ramez, and Shamkant Navathe. (2017). Fundamentals of Database Systems, 7th ed., Chapter 27. Pearson.

2. Saini, Chandni, and Vinay Arora. (2016). Information retrieval in web crawling: A survey. 2635-2643. 10.1109/ICACCI.2016.7732456.

3. Wazih Ahmad, Mohd, and M.A. Ansari. (2012). A Survey: Soft Computing in Intelligent Information Retrieval Systems. Proceedings - 12th International Conference on Computational Science and Its Applications, ICCSA 2012. 26-34. 10.1109/ICCSA.2012.15.

4. Aslam, Jay, et al. (2003). Challenges in Information Retrieval and Language Modeling.