The Funcsearch Search Engine
Improving search engine technology (Master thesis)
Paul de Vrieze, s920851
7-Mar-2002

Abstract

While searching the internet for textual pages works rather well, searching for interactive pages can still be problematic. This thesis presents a solution to this problem. The solution uses a classification system that sorts pages into different classes to extend search engine functionality. Possible classes are textual web-pages, link pages, and interactive pages.

The classification system specifically improves searching for interactive pages. To explain why, the difference between textual web-pages and interactive web-pages needs to be explained first. In textual web-pages the more important parts of the page are formed by words. In interactive pages, tags largely determine the structure of the page. Traditional search engines are based upon information retrieval, which is mainly designed for textual content (and is not HTML-specific). Because tags play a big role in interactive pages, the traditional approach works less than perfectly.

Most literature on search engines tries to improve searching for textual pages. Other work does try to improve searching for other kinds of pages, but it looks to metadata for the solution: the wanted pages are found by searching metadata. While this idea works very well, there is one problem. As yet there is almost no metadata publicly available about web-pages. This metadata would have to be created by the authors of the web-pages. While authors of large sites may provide metadata, small web-sites probably won't in the foreseeable future.

To improve searching for interactive pages a classification system can be used. This system classifies pages into groups based on what kind of page each page is. It also works well for textual web-pages and can add some information there too.
To allow for refinement in searching, and for improved performance, the classification system is built as a tree. At the top are general groups. Those groups have children which are more specific. A general group might, for example, mark a page as interactive. A more specific group might then see it as a reservation page. The performance increase comes from the fact that only children whose parents evaluate with the highest score are evaluated.

A prototype has been developed that incorporates these ideas. The prototype functions as a search engine that allows one to order the results based on their class. The results returned also include a summary of the classification scores of the pages. This allows users to better choose the pages they want to examine further. The prototype system functions well within its limitations. Those limitations primarily concern speed and are not serious, as they would not be a problem in a production system.

Preface

This thesis is the result of my master project. The project started with the question of how to improve searching for interactive pages (web services) on the internet. The reason for this is that searching for those pages doesn't always go perfectly. The project involved writing a thesis and making a prototype. During the project the chosen solution for improving the search for interactive pages proved to also improve searching for other pages. This resulted in a broadened research question that includes improving the results for other pages too.

Contents

1 Introduction
  1.1 The research question
  1.2 The difference between interactive pages and text pages
2 Research question and biggest demands and properties of the solution
  2.1 The research question
  2.2 Literature
    2.2.1 Backing Technologies
    2.2.2 Structure of the internet
      2.2.2.1 TCP/IP
      2.2.2.2 HTTP
      2.2.2.3 HTML
      2.2.2.4 XML/XHTML
    2.2.3 Search engine technology
      2.2.3.1 The robot
      2.2.3.2 The retrieval system
    2.2.4 Limitations
      2.2.4.1 HTML limitations
      2.2.4.2 robot/interface design limitations
      2.2.4.3 Information retrieval limitations
      2.2.4.4 The problem
      2.2.4.5 The solution
    2.2.5 General improvement techniques
      2.2.5.1 robot improvements
      2.2.5.2 ranking improvements
      2.2.5.3 user interface improvements
      2.2.5.4 meta search engines
    2.2.6 Specific improvements from literature
      2.2.6.1 The use of ontologies
      2.2.6.2 Public search infrastructure
      2.2.6.3 Tailored Resource Descriptions
      2.2.6.4 Interactive user interface
      2.2.6.5 Automatic RDF meta-data generation
      2.2.6.6 Resource Description Framework (RDF)
  2.3 The theory
    2.3.1 How to get the page structure?
    2.3.2 How to look at the page structure?
    2.3.3 How to recognise page structures
    2.3.4 The classification tree
      2.3.4.1 Insignificant classification
      2.3.4.2 Frameset Classification
      2.3.4.3 Text Classification
      2.3.4.4 Links Classification
      2.3.4.5 Interactive Classification
  2.4 Demands on the theory
3 The prototype
  3.1 From the theory to the prototype
  3.2 Justification of differences
  3.3 Detailed architecture (5.3)
    3.3.1 The place of the FunSearch engine in its environment
    3.3.2 The structure of the FunSearch engine
  3.4 Filling in the architecture
    3.4.1 Query Search Engine
    3.4.2 Load Pages
    3.4.3 Classify Pages
    3.4.4 Sort Pages
    3.4.5 User Interface
  3.5 Implementation
    3.5.1 Querying
      3.5.1.1 Interface SearchInterface
      3.5.1.2 Class AltavistaSearch
      3.5.1.3 Class Buffered Search
    3.5.2 Loading pages
      3.5.2.1 Class HtmlStore
      3.5.2.2 Class PageLoader
    3.5.3 Classifying pages
      3.5.3.1 Class Classifier
      3.5.3.2 Abstract Class Classification
      3.5.3.3 Class Counter
      3.5.3.4 Class FormClassification
    3.5.4 Sorting pages
    3.5.5 User Interface
    3.5.6 Class WebServer
    3.5.7 Interface WebResourceProvider
    3.5.8 Class ClassificationResource
4 Evaluation
  4.1 Evaluation of the prototype
  4.2 From the prototype to the theory
  4.3 Conclusion
5 Suggestions
A Bibliography
B Samenvatting in het Nederlands (Summary in Dutch)

Chapter 1 - Introduction

1.1 The research question

This thesis describes a research project that started with the following question: Can searching for interactive pages on the internet be improved? The implicit demand on the solution was that searching for other pages should not be made harder.

In this thesis I repeatedly use the terms interactive (web-)page and text page. By interactive page I mean any web-page that allows a user to give feedback (by forms) to the server. Pages that use form elements as links do not count as interactive. Text pages are pages whose content is mainly textual (apart from layout).

A rough analysis of the literature led to the conclusion that current research on search engines focuses on the retrieval of text pages. Interactive pages can hardly be seen as text pages, as I will make clear in section 1.2. As the focus of the research was on interactive pages, a new solution was necessary. Further analysis of the problem pointed in the direction of grouping pages into interactive and non-interactive pages. This division makes it possible to treat interactive pages separately, and to allow searches for interactive pages only. As there are many different kinds of interactive pages, it makes sense not only to divide between interactive and other web-pages, but also to divide between the different kinds of interactive pages. This line of thought leads to a classification system.

After the decision to look to a web-page classification system for the solution, it made sense to look at the possibilities of classifying non-interactive pages too.
This decision can be justified by the fact that text searching, while not as problematic as interactive page searching, can still be improved. This widening of the subject called for a new research question. The research question used during the remainder of the research is: How can searching on the internet be made more efficient?

1.2 The difference between interactive pages and text pages

Traditional search engines see web-pages as little more than plain text files from which some information needs to be discarded. If one presumes that most web-pages are in essence documents, this approach works. There are nevertheless more and more "different" pages, such as interactive pages, that barely contain text elements, or whose text elements cannot clearly describe the page.

Interactive pages differ from textual pages in one important respect. While tags in textual pages only determine the layout, some tags in interactive pages determine the nature of the page. An example of an interactive page is visible in figure 1.1 on the following page. This figure shows the hotels.com booking page. Sub-figure 1.1(a) shows the page as rendered by a web-browser. This is how a person will see the page. Sub-figure 1.1(b) shows a part of the source of that page.
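
The distinction made in this section, between tags that only determine layout and tags that determine the nature of a page, can be illustrated with a small sketch. The Python code below is not the classifier developed in this thesis; it is a minimal illustration under stated assumptions. The names `PageProfile`, `profile`, `looks_interactive`, the sample pages, and the choice of `input`, `select`, `textarea` and `button` as "interactive" tags are all hypothetical and chosen only for this example. It counts the words of visible text and the form controls in a page, and treats a page as possibly interactive when it contains a form with at least one non-hidden control.

```python
from html.parser import HTMLParser

# Assumption for this sketch: these tags are taken to signal interactivity.
FORM_CONTROL_TAGS = {"input", "select", "textarea", "button"}


class PageProfile(HTMLParser):
    """Collects simple word and tag counts for one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = 0        # words of visible text
        self.forms = 0        # number of <form> elements
        self.controls = 0     # non-hidden form controls
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "form":
            self.forms += 1
        elif tag in FORM_CONTROL_TAGS:
            input_type = (dict(attrs).get("type") or "").lower()
            # Hidden inputs do not let the user give any feedback.
            if tag == "input" and input_type == "hidden":
                return
            self.controls += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script and style contents are not visible text.
        if self._skip_depth == 0:
            self.words += len(data.split())


def profile(html):
    """Parse an HTML string and return its PageProfile."""
    parser = PageProfile()
    parser.feed(html)
    parser.close()
    return parser


def looks_interactive(html):
    """Rough heuristic: a form with at least one visible control."""
    p = profile(html)
    return p.forms > 0 and p.controls > 0


# Two hypothetical sample pages, invented for this illustration.
TEXT_PAGE = (
    "<html><body><h1>Hotels</h1>"
    "<p>A short essay about hotels in Amsterdam.</p></body></html>"
)
BOOKING_PAGE = (
    '<html><body><form action="/book">'
    '<input type="text" name="city">'
    '<select name="nights"><option>1</option></select>'
    '<input type="submit" value="Search">'
    "</form></body></html>"
)
```

Calling `looks_interactive(TEXT_PAGE)` and `looks_interactive(BOOKING_PAGE)` separates the two sample pages, although the booking page contains almost no text for a text-based search engine to index. The sketch does not handle the case mentioned in section 1.1 of pages that use form elements merely as links; recognising those would need additional rules.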