
Improving search engine technology (Master's thesis)

Paul de Vrieze s920851

7 March 2002

Abstract

While searching the internet for textual pages works rather well, searching for interactive pages can still be problematic. This thesis presents a solution to this problem: a classification system that classifies pages into different classes in order to extend search engine functionality. Possible classifications are textual web-pages, link pages, and interactive pages. The classification system specifically improves searching for interactive pages. To explain this, the difference between textual web-pages and interactive web-pages first needs to be made clear. In textual web-pages the more important parts of the page are formed by words; in interactive pages, tags largely determine the nature of the page. Traditional search engines are based upon information retrieval, which is designed mainly for textual content and is not HTML-specific. As tags play a big role in interactive pages, the traditional approach works less than perfectly for them. Most literature on search engines tries to improve searching for textual pages. Other work does try to improve searching for other kinds of pages, but looks for the solution in metadata: the idea is to search metadata for the wanted pages. While this idea works very well, there is one problem: as yet almost no metadata about web-pages is publicly available. This metadata would have to be created by the authors of the web-pages, and while authors of large sites may provide metadata, small web-sites probably won't in the foreseeable future. To improve searching for interactive pages a classification system can be used. This classification system classifies pages into groups based on what kind of page a certain page is. It also works well for textual web-pages and can add some information there too. To allow for refinement in searching, and for improved performance, the classification system is built as a tree. At the top are general groups; those groups have children that are more specific. A general group might, for example, mark a page as interactive, while a more specific classification might then see it as a reservation page. The performance increase comes from the fact that only the children of the highest-scoring parents are evaluated. A prototype has been developed that incorporates these ideas. The prototype functions as a search engine that allows one to order the results based on their class. The results returned also include a summary of the classification scores of the pages, which allows users to better choose the pages they want to examine further. The prototype system functions well within its limitations. Those limitations primarily concern speed and are not of a serious nature, as they would not be a problem for a production system.

Preface

This thesis is the result of my master project. The project started with the question of how to improve searching for interactive pages (web services) on the internet. The reason for this is that searching for those pages doesn't always go perfectly. The project involved writing a thesis and building a prototype. During the project the chosen solution for improving the search for interactive pages proved to also improve searching for other pages. This resulted in a broadened research question that includes improving the results for other pages too.

Contents

1 Introduction
  1.1 The research question
  1.2 The difference between interactive pages and text pages

2 Research question and biggest demands and properties of the solution
  2.1 The research question
  2.2 Literature
    2.2.1 Backing Technologies
    2.2.2 Structure of the internet
      2.2.2.1 TCP/IP
      2.2.2.2 HTTP
      2.2.2.3 HTML
      2.2.2.4 XML/XHTML
    2.2.3 Search engine technology
      2.2.3.1 The robot
      2.2.3.2 The retrieval system
    2.2.4 Limitations
      2.2.4.1 HTML limitations
      2.2.4.2 robot/interface design limitations
      2.2.4.3 Information retrieval limitations
      2.2.4.4 The problem
      2.2.4.5 The solution
    2.2.5 General improvement techniques
      2.2.5.1 robot improvements
      2.2.5.2 ranking improvements
      2.2.5.3 user interface improvements
      2.2.5.4 meta search engines
    2.2.6 Specific improvements from literature
      2.2.6.1 The use of ontologies
      2.2.6.2 Public search infrastructure
      2.2.6.3 Tailored Resource Descriptions
      2.2.6.4 Interactive user interface
      2.2.6.5 Automatic RDF meta-data generation
      2.2.6.6 Resource Description Framework (RDF)
  2.3 The theory
    2.3.1 How to get the page structure?
    2.3.2 How to look at the page structure?
    2.3.3 How to recognise page structures
    2.3.4 The classification tree
      2.3.4.1 Insignificant classification
      2.3.4.2 Frameset Classification
      2.3.4.3 Text Classification
      2.3.4.4 Links Classification
      2.3.4.5 Interactive Classification
  2.4 Demands on the theory

3 The prototype
  3.1 From the theory to the prototype
  3.2 Justification of differences
  3.3 Detailed architecture (5.3)
    3.3.1 The place of the FunSearch engine in its environment
    3.3.2 The structure of the FunSearch engine
  3.4 Filling in the architecture
    3.4.1 Query Search Engine
    3.4.2 Load Pages
    3.4.3 Classify Pages
    3.4.4 Sort Pages
    3.4.5 User Interface
  3.5 Implementation
    3.5.1 Querying
      3.5.1.1 Interface SearchInterface
      3.5.1.2 Class AltavistaSearch
      3.5.1.3 Class BufferedSearch
    3.5.2 Loading pages
      3.5.2.1 Class HtmlStore
      3.5.2.2 Class PageLoader
    3.5.3 Classifying pages
      3.5.3.1 Class Classifier
      3.5.3.2 Abstract Class Classification
      3.5.3.3 Class Counter
      3.5.3.4 Class FormClassification
    3.5.4 Sorting pages
    3.5.5 User Interface
    3.5.6 Class WebServer
    3.5.7 Interface WebResourceProvider
    3.5.8 Class ClassificationResource

4 Evaluation
  4.1 Evaluation of the prototype
  4.2 From the prototype to the theory
  4.3 Conclusion

5 Suggestions

A Bibliography

B Samenvatting in het Nederlands (Summary in Dutch)

Chapter 1 - Introduction

1.1 The research question

This thesis describes a research project that started with the following question: can searching for interactive pages on the internet be improved? The implicit demand on the solution was that searching for other pages wasn't made harder. In this thesis I repeatedly use the terms interactive (web-)page and text page. With interactive page I mean any web-page that allows a user to give feedback (by forms) to the server. Pages that use form elements as links do not count as interactive. Text pages are pages whose content is mainly textual (apart from layout). A rough analysis of the literature led to the conclusion that current research on search engines focuses on the retrieval of text pages. Interactive pages can hardly be seen as text pages, as I will make clear in section 1.2. As the point of the research was interactive pages, a new solution was necessary. Further analysis of the problem pointed in the direction of grouping pages into interactive and non-interactive pages. This division allows interactive pages to be treated separately, and allows searches for interactive pages only. As there are many different kinds of interactive pages, it makes sense not only to divide between interactive and other web-pages, but also to distinguish between the different kinds of interactive pages. This line of thought leads to a classification system. After the decision to look to a web-page classification system for the solution, it made sense to also look at the possibilities of classifying non-interactive pages. This decision can be justified by the fact that text searching, while not as problematic as interactive page searching, can still be improved. This widening of the subject called for a new research question. The research question used during the remainder of the research is: how can searching on the internet be made more efficient?

1.2 The difference between interactive pages and text pages

Traditional search engines see web-pages as little more than plain text files from which some information needs to be discarded. If one presumes that most web-pages are essentially documents, this works. There are nevertheless more and more "different" pages, like interactive pages, that barely contain text elements, or whose text elements cannot clearly describe the page. Interactive pages differ from textual pages in one important respect: while tags in textual pages only determine the layout, some tags in interactive pages determine the nature of the page. An example of an interactive page is visible in figure 1.1. This figure shows the hotels.com booking page. Sub-figure 1.1(a) shows the page as rendered by a web-browser. This is how a person will see the page. Sub-figure 1.1(b) shows a part of the source of that page. This is the page as it is received by a web-browser, or any other web client (like a search engine). What stands out here is the large number of tags in comparison to the number of text elements. Search engines use techniques from information retrieval. Information retrieval is based on texts, not on HTML. This means search engines need to convert the HTML of the page into plain text that can be used for information retrieval. The conversion of HTML into text is done in a rather straightforward way: basically, all tags are stripped, and what is left over is a text. The result of such a conversion is visible in

sub-figure 1.1(c). This figure shows the source, but with all tags removed. As one can see, the resulting text doesn't really make much sense anymore. For web-pages with textual content the above-mentioned conversion does not result in such strange texts. Instead the conversion is rather clean, and the text is very readable.

[Figure 1.1: Hotels.com booking page. (a) The page as rendered by a web-browser; (b) its source; (c) the page as seen by search engines, i.e. the source with all tags stripped, which reads like: "Hotels.com - the whole world of hotels ... Search results for Europe Netherlands Tilburg ... Ibis Tilburg 100% Dr Hub Van Doorneweg 105, Tilburg, 5026 RB Distance: 0km / 0m Change $50 - $100 per room night Type: Hotel Chain restaurant, bar, airport transfer ... Make a reservation at this hotel. View Web site ..."]

Chapter 2 - Research question and biggest demands and properties of the solution

2.1 The research question

The main question of the research is how to improve searching for pages. In the light of the classification solution that led to the extension of the original question, this question can be broken down into the following subquestions (see figure 2.5):

• How to find pages that include pages of the desired classification?

• How to present the results?

• How to recognise different pages? (How to classify a page?)

The first two of these questions are common to all search engines. The first question can easily be answered in terms of traditional search engines: the set of pages that includes the wanted pages is the set returned by a normal search engine for the user query. This means that a traditional design can be applied to get this set. The presentation of the results poses a bigger problem. To keep things easy, for the users and in the design, the user interface should be based on traditional search engine interfaces. Extensions are necessary to show users the classification of all results, and to allow them to sort on classifications. The optimisation of the user interface is not a part of this thesis though, and is subject to further research. The third question, how to recognise different pages, is very relevant, and can be considered the main subquestion of the research. The solution presented in this thesis is to look at the page structure. This solution will be explained in section 2.3.

2.2 Literature

This section describes literature relevant to the research question.

2.2.1 Backing Technologies

First I will review the technologies that make search engines possible. This means I will first explain the internet, and then the protocols and standards involved in getting a web-page to a user.

2.2.2 Structure of the internet

The internet is a computer network that evolved from the ARPANET, which was developed in 1969 by the Advanced Research Projects Agency (ARPA) of the U.S. Department of Defence [11]. It was built to survive a nuclear attack: the network continues to function when connections disappear or new connections appear. The network can also grow almost without limits. The only limit is the address space. The current internet protocol, IP version 4, only has an address space of 32 bits. That means a total of 2^32 = 4,294,967,296 ≈ 4 billion addresses are available, although some of those are reserved. The new version 6 will provide an address space of 2^128 ≈ 3.4 × 10^38 addresses. That should be sufficient for the near and more distant future. This section discusses the internet, and especially the parts of it that are relevant to this thesis. This is also illustrated in figure 2.1.

[Figure 2.1: The internet protocols — a layered stack with XHTML, HTML and XML at the top, carried over HTTP, which runs on top of TCP (next to UDP and ICMP), which in turn runs on top of IP.]

2.2.2.1 TCP/IP

When the internet is discussed, one often hears about TCP/IP. TCP and IP are two separate protocols though, where TCP relies heavily on IP, but where TCP can easily be replaced with some other protocol such as UDP.

IP  IP takes care of low-level addressing and transport of data packets over a multi-subnet network. The IP protocol itself is useless without one of its peers: TCP, UDP, or ICMP. UDP is a datagram protocol that can send a message to a certain port on a certain computer without any guarantee of arrival. Its biggest use can be found in multimedia applications such as broadcasting, where retransmission is not a real option. ICMP is the control protocol that is used by IP for internal purposes. TCP will be discussed next.

TCP  TCP is a protocol that (together with IP) provides a guaranteed full-duplex channel between two hosts, much like a telephone call. When packets are received damaged, or not at all, TCP itself takes care of retransmission. This reliability without programmer intervention is why most protocols are based on TCP rather than on UDP.

2.2.2.2 HTTP

The hypertext transfer protocol [3] is a request-response protocol with the main purpose of sending files from a server to a client. Old versions of the protocol always closed the connection after a response from the server; new versions allow the connection to stay open under certain conditions. The command set provided by HTTP is very limited. The most important commands take the form of a request for a resource specified by a URL, possibly with arguments. The HTTP server (or web server) doesn't look at the contents of a file, so for related files such as images or style-sheets new requests must be made. Further, it is not possible for a single request to ask for multiple files. HTTP was designed to be the protocol that carries hypertext files written in the Hypertext Mark-up Language (HTML) [12]; HTML is described in the next subsection.
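As a concrete illustration of the request-response pattern, the following sketch performs a single HTTP GET using only the Java standard library; the URL is an arbitrary example and the response is simply printed.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SimpleHttpGet {
    public static void main(String[] args) throws Exception {
        // One request: ask the server for a single resource identified by a URL.
        URL url = new URL("http://www.example.com/index.html");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // One response: a status code, headers, and the body of the requested file.
        System.out.println("Status: " + conn.getResponseCode());
        System.out.println("Content-Type: " + conn.getContentType());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        // Embedded images or style-sheets referenced by the page would each
        // need their own request; HTTP itself does not bundle them.
        conn.disconnect();
    }
}
```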

2.2.2.3 HTML

HTML is the mark-up language used for web-pages. Web-pages are supposed to be "hypertext" pages. Hypertext is a special kind of text that has the property of being able to point to other documents. The HTML language is a representation language: it is well suited to determine how a page should look, but it has no way to specify unambiguously what data means. The biggest problem HTML poses for parsers (a parser is a computer program that reads the HTML) is the fact that it is not a strict language. It is permissible to omit lots of tags, and web-browsers still try to make the best of it. The problem is that different user agents do that in different ways, so "imperfect" pages can be rendered very differently by different web-browsers. Further complicating things, search engines that want to look into the HTML structure must first "correct" the HTML. A complicating factor in this is that sometimes incorrect HTML is used deliberately to get a layout on a certain browser that is not possible otherwise.

2.2.2.4 XML/XHTML

XML [16] is a general mark-up language that is designed as a lightweight version of SGML, the language after which HTML was modelled. It has a very strict syntax that makes it a lot easier for computer programs to parse. XML is not an equivalent of HTML though. XML in itself is nothing more than a standard for storing and passing structured data; on its own it is not very useful, but together with a DTD (document type definition) or schema for a specific data structure it is very powerful. That's where XHTML [10] comes in. Where HTML is an application of SGML, XHTML is an application of XML, mapped within the constraints of XML. This means a standard XML parser that knows the XHTML DTD can parse XHTML. A programmer of an XHTML browser thus only needs to use an XML library to get a tree of XHTML tags. The only things the programmer has to do are to add the meaning of the XHTML tags and to make a visual representation of the tree. XHTML was designed with compatibility in mind. The tags it supports are exactly the same ones, with the same meaning, as those in HTML. Actually, XHTML written with a few guidelines in mind can even be parsed by HTML parsers. This means XHTML has the strictness and extensibility of XML, while retaining compatibility with the very popular HTML. XHTML is not yet in widespread use on the internet, so it is unacceptable for an application to only support XHTML. The most viable solution for search engine parsers is to have an engine that "corrects" the HTML input and then provides an XHTML-like representation internally. The W3C (World Wide Web Consortium: http://www.w3.org) and others are seeking to extend the web with meta-data about documents. Those standards are not yet in the recommendation stage though, and rarely used. Most meta-data solutions try to insert meta-data in the file itself (or in a reference). This has the problem that most page writers are not librarians, and so not adept at creating good meta-data. Furthermore it allows for search engine spamming (see subsection 2.2.3.1).
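The "correct the HTML, then work on a tree" approach mentioned above can be sketched as follows. The example assumes the jsoup library (not used in the thesis itself) purely to illustrate how a parser with built-in error correction turns sloppy HTML into a well-formed tree that can then be queried much like XHTML.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlTreeExample {
    public static void main(String[] args) {
        // Deliberately sloppy HTML: unclosed tags, missing quotes.
        String sloppy = "<html><body><p>Hello <b>world<p>Second paragraph"
                      + "<form action=search.cgi><input name=q></form>";

        // The parser applies error correction and always yields a complete tree.
        Document doc = Jsoup.parse(sloppy);

        // The corrected tree can now be walked like an XHTML document.
        for (Element form : doc.select("form")) {
            System.out.println("form action: " + form.attr("action"));
            System.out.println("inputs: " + form.select("input").size());
        }
        System.out.println("text only: " + doc.text());
    }
}
```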

2.2.3 Search engine technology

A search engine consists of two parts: a spider or robot, and a retrieval system. In figure 2.2 this structure is visualised. The spider is responsible for finding the web-pages, and the retrieval system is responsible for querying the database and presenting the results to the user. Why the need for two parts? Finding a page does not happen at the moment one asks a search engine for something, as that would take far too long, and it will only get worse as the internet grows bigger and bigger. A search engine like Altavista [1] is not to be mistaken for an internet directory like Yahoo [17]. Internet directories are human-made hierarchies where pages are classified according to their subject. Every page in a directory needs to be put there by someone. Search engines, on the other hand, require no human intervention whatsoever, but don't provide categories of pages. Next I will explain how robots work, and how retrieval systems work.

[Figure 2.2: General structure of a search engine — a robot fetches pages from web servers into the engine's database, and a search interface answers client queries from that database.]

2.2.3.1 The robot

The job of the robot (sometimes called spider) is twofold. Its main job is to get a page from a server, make some kind of summary of the page, and store the summary in a large database. Its other job is to look at the page to find other pages that can be visited by the robot, and to put them in a list of pages to be visited. The latter job can often be done at the same time as the summarising. A summary of a web-page as created by the robot can be seen as meta-data.

Which pages to visit  Search engines look for pages to visit (to include in the database) on the pages they have already visited. This is based on the premise that every page is linked to by at least one other page. The trick is to have a good starting point that links to a lot of other pages. Of course the premise that there is a path to every web-page from a certain start page is not necessarily true. That, and the fact that it takes time to follow the links to a certain page (it can take weeks before a certain page on the list is visited), is why most search engines offer the possibility to submit a web page directly. Some people try to abuse search engines by looking at the particular logic used. They try to trick the search engine into believing they should be ranked very high for a certain search term that might not be relevant or that is the name of a competitor. As search engines try to get the best results possible, they don't like this "search engine spamming". To counter it, they create anti-spamming policies. As a result of these policies certain pages or hosts can be excluded from indexing.
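The list of pages to be visited can be sketched as a simple frontier queue. The class below is a minimal illustration rather than the design of an actual robot: fetching, summarising and link extraction are stubs, and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CrawlFrontier {
    private final Deque<String> toVisit = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    public void add(String url) {
        // Only queue a URL once; the premise is that every page is reachable
        // from some page that is already known.
        if (seen.add(url)) {
            toVisit.add(url);
        }
    }

    public void crawl(String startUrl, int maxPages) {
        add(startUrl);
        int visited = 0;
        while (!toVisit.isEmpty() && visited < maxPages) {
            String url = toVisit.poll();
            String page = fetch(url);              // load the page (stub)
            store(url, summarise(page));           // summarise and store (stub)
            for (String link : extractLinks(page)) {
                add(link);                         // grow the list of pages to visit
            }
            visited++;
        }
    }

    // Stubs standing in for the real robot components.
    private String fetch(String url) { return ""; }
    private String summarise(String page) { return ""; }
    private void store(String url, String summary) { }
    private List<String> extractLinks(String page) { return List.of(); }
}
```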

Summary extraction  After a spider loads a page, a summary needs to be made. Normally some kind of information retrieval is used. Information retrieval, though, is based on plain texts, so most search engines extend it a little in this process. Altavista [1], for example, puts extra emphasis on the contents of the title element of an HTML page. The search engine Google [4], which some consider the best of the moment, uses an algorithm called PageRank [18].

The PageRank algorithm gives each page an importance rank. This rank is based on how many pages link to the page. It doesn't only look at the number of links to a page, though, but also at the rank of the "voting" pages themselves. This means that a page gets a high score if a lot of high-scoring pages link to it. This system works so well that Google is at present the number one search engine. Other strategies value words appearing early in the page higher than words occurring later. For the information retrieval part of the summary extraction process, the web-page is first converted to a plain text as visible in figure 1.1(c). This happens by stripping all HTML tags from the document, where images are normally replaced by their text representation (specified in alt attributes). This text can be represented as an n-dimensional vector <w1, ..., wn> [18], where wi represents the weight of the i-th word in the vocabulary. If that word doesn't appear in the document the value is zero; if it does, the value depends on the number of times it appears in the document, and on how common the word is in normal use of the language. This means that a very common word like "the" would get a very low score, while an uncommon word gets a high score. Because the amount of storage space a search engine has is limited, while still huge, a search engine normally excludes overly common words from the dictionary. Examples of such words are: a, an, and, the, or, etc. After the page has been summarised, the summary, consisting of the vector, the URL, a text excerpt, the title, and other data, is stored in the database. As the web consists of many terabytes of data, it is obvious that search engines have a big storage need.
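The vector construction described above can be sketched as follows; the stop-word list and the weighting by word commonness are illustrative assumptions, not the actual formulas used by any particular engine.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class TermVector {
    // A tiny stop-word list; a real engine would use a much larger one.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "or", "of");

    /**
     * Builds a sparse term vector for the stripped page text. The weight of a
     * word grows with its frequency in the document and shrinks with its
     * assumed frequency in general language use (documentFrequency).
     */
    public static Map<String, Double> build(String strippedText,
                                            Map<String, Double> documentFrequency) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : strippedText.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // too common to be worth storing
            }
            counts.merge(token, 1, Integer::sum);
        }
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Words the vocabulary does not know are treated as rare.
            double commonness = documentFrequency.getOrDefault(e.getKey(), 0.001);
            vector.put(e.getKey(), e.getValue() / commonness);
        }
        return vector;
    }
}
```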

Spider limitations  Search engine spiders have their limitations. They require a lot of disk space and memory to store their data. Furthermore they need to crawl the web continuously for new, updated and removed pages. The biggest search engines consist of huge server farms, but they still cannot manage to visit a page more than once every couple of weeks; Google [4] presently claims to visit a page once every four weeks. Because a server may be down for some reason, pages that are not available will not be removed from the database directly; the spider tries a couple of times first [8]. These two facts lead to the conclusion that the database of a search engine is, at best, always somewhat out of date.

2.2.3.2 The retrieval system

The retrieval system is what a user of a search engine sees. It contains the user interface, a database interface, and, most importantly, a ranking algorithm.

The user interface  The user interface of the search engine determines to a large extent the user's view of the system. Most search engines use lightweight interfaces that are fast to download even over slow connections. Normally the opening screen gives an area where the words to search for can be entered, and a search button. Furthermore most engines provide the possibility to go to an "advanced" mode, where things like the number of results, languages, etc. can be specified.

The database interface  The database interface constructs a database query from the keywords entered by the user and returns the summaries of the relevant pages.

The ranking algorithm  The ranking algorithm is what most search engines consider their "trade secret". This algorithm tries to order the pages in such a way that the page that is most relevant (according to the ranking system) gets the highest rank and is returned to the user as the first result. Not only the score of a page for the search words is relevant for these algorithms; most search engines also try to determine the "quality" of a page and incorporate that in the algorithm. The ranking algorithm also has a big influence on the summary extraction process, because all data needed by the ranking algorithm needs to be made available during summary extraction.

2.2.4 Limitations

Search engines have various limitations. They will be handled in the following subsections.

2.2.4.1 HTML limitations

HTML has a number of problems that make the work of advanced search engines more difficult.

• HTML allows for incorrect structures. HTML browsers are designed in such a way that they are able to cope with a lot of errors in HTML pages. Sometimes it is even useful to violate the HTML specification to get a visual effect that is otherwise unavailable. This makes the processing for programs that expect compliant pages more difficult though.

• HTML only specifies how the page should be rendered, but not what the functions of the elements in the page are. It is very hard for a computer to recognise elements, like a navigation bar, that are very easy to recognise for humans.

• HTML has crippled interaction support. Interaction with web-pages, in the form of form tags, relies on a program on the web-server. There is no way, though, to specify which pages can result from a form. A search engine cannot "fill in" a form by itself, as there are infinite possibilities to enter data and for new pages to be returned. A lot of web-sites use wizard-like forms that consist of multiple sub-forms. Search engines can only look at the first part of such a wizard; the other parts stay unavailable to the search engine forever. These forms are useful from the point of view of user interface design, but very bad for automatic service discovery by search engines.

2.2.4.2 robot/interface design limitations

The biggest problem with the robot/interface design is the fact that there is a big time lag between the reading of a page and the query by someone looking for it. As a consequence, some of the returned pages are always already gone. Google [4] tries to reduce the impact of this problem by providing a cache of most indexed pages.

2.2.4.3 Information retrieval limitations

2.2.4.4 The problem

Most search engines use algorithms derived from Information Retrieval (IR). The problem with this is that because IR works on plain texts, the web-pages have to be reduced to plain text first, after which a summary is created. Each step can be seen as creating a model of the higher-level document (the web-page, and then the plain text). Let us consider the following example, which shows the problems with this reduction in steps. Figure 2.3 shows the result of reduction in steps. Figure 2.3(a) shows the original picture. It is a picture of the sparks generated by a welding torch. Figure 2.3(b) shows the picture after it has been reduced to half its size. (For visibility it has also been scaled up again to the original size.) Figure 2.3(c) shows the picture when it has been reduced in 5 equal steps. (The reductions

were done with a 256-colour palette. With normal 24-bit graphics, the differences are a lot smaller, although the same principle is valid.)

[Figure 2.3: Comparison of reductions in 1 and in 5 steps — (a) original, (b) scaled in 1 step, (c) scaled in 5 steps.]

As figure 2.3 shows, it is better to reduce pictures in one step. A small picture can actually be seen as a model of the bigger one, and when a model is made of something there is always an information loss. Let us call this loss L. This loss of information can be split into two separate components. One component is the loss of reduction (Lr). For images this means the loss of 75% of the pixels when reducing to half size (in both dimensions); for computational ease the percentage of remaining pixels/information is used in the examples. The other component is the loss of accuracy (La), which arises because models can never hope to capture all parts of the original, but must approximate them. In the example of the picture, the scaling algorithm must choose an optimal representation of the original at the lower resolution, and this optimal representation can deviate from the original. For the pictures it means that, because there are only 256 different colours, the nearest colour must be chosen. Using the two components discussed above, the loss of information can be written as L = Lr + La. The loss of information for one-step scaling can then be given as Lo = 25% + Lao, where Lo is the loss. For the five-step scaling it is more complicated to derive the proper formula (Lf1 means the loss for five-step scaling at the first step):

$$L_{f1} = L_{rf1} + L_{af1} = \left(\frac{180^2}{200^2}\right)\% + L_{af1} \approx 81\% + L_{af1} \qquad (2.1)$$

$$L_{f2} = L_{rf2} + L_{af2} = \left(\frac{160^2}{180^2}\right)\% + L_{af2} \approx 79\% + L_{af2} \qquad (2.2)$$

$$L_{f3} = L_{rf3} + L_{af3} = \left(\frac{140^2}{160^2}\right)\% + L_{af3} \approx 77\% + L_{af3} \qquad (2.3)$$

$$L_{f4} = L_{rf4} + L_{af4} = \left(\frac{120^2}{140^2}\right)\% + L_{af4} \approx 73\% + L_{af4} \qquad (2.4)$$

$$L_{f5} = L_{rf5} + L_{af5} = \left(\frac{100^2}{120^2}\right)\% + L_{af5} \approx 69\% + L_{af5} \qquad (2.5)$$

$$L_f = L_{rf1} \cdot L_{rf2} \cdot L_{rf3} \cdot L_{rf4} \cdot L_{rf5} + L_{af1} + L_{af2} + L_{af3} + L_{af4} + L_{af5} \qquad (2.6)$$

$$L_f \approx 25\% + L_{af1} + L_{af2} + L_{af3} + L_{af4} + L_{af5} \qquad (2.7)$$

On the assumption that $L_{af1} \approx L_{af2} \approx L_{af3} \approx L_{af4} \approx L_{af5}$, equation 2.7 can be transformed to:

$$L_f = 25\% + 5 L_{af} \qquad (2.8)$$

While Lao (the loss of accuracy in one-step scaling) is not necessarily equal to Laf (the loss of accuracy in one step of five-step scaling), it is safe to assume that $L_{ao} < \sum_{i=1}^{n} L_{af_i}$ for any $n \geq 2$, given that the start and end points are the same for both transformations. This shows that making a model of a model always loses more information than making the model directly. With web-pages the loss lies in losing all information stored in the tags of the page: there is no possibility anymore to consider the structure of the page.
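As a check on equations 2.6 and 2.7, the reduction factors telescope, so the five-step route retains exactly the same fraction of pixels as the one-step route; the difference between the two routes lies entirely in the accumulated accuracy losses:

$$L_{rf1} \cdot L_{rf2} \cdot L_{rf3} \cdot L_{rf4} \cdot L_{rf5} = \frac{180^2}{200^2}\cdot\frac{160^2}{180^2}\cdot\frac{140^2}{160^2}\cdot\frac{120^2}{140^2}\cdot\frac{100^2}{120^2} = \frac{100^2}{200^2} = 25\%$$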

2.2.4.5 The solution

There are some solutions to this problem. One can adapt the IR summariser to be able to read tags, or to know the last tag. By this I mean that the summarising component either gets the tags delivered to it (for example by the lexical analyser), or can query for the last passed tag. The problem is that within information retrieval there is not really a way to use that information. That is the reason most search engines don't use this kind of strategy. A lot of search engines use another trick: they look at a few specific tags. That is, they identify the title of the page by reading the contents of the title tag, and they look at some meta tags in the web-page, especially the summary. For a real solution to this problem one needs to really look at the HTML structure. Information retrieval does have one big strength: it is very good at retrieving documents on the basis of a number of keywords, which is a function a search engine must perform. The solution must therefore be a summariser that looks at the structure of the web-page, divides it into parts with certain properties or weights, and then runs the information retrieval engine on those parts. The drawback of this solution is that it is easier said than done to look at the HTML structure and draw good conclusions from it. It is doable though.

2.2.5 General improvement techniques

2.2.5.1 robot improvements

As search engines need to look at a lot of internet pages, they normally have very fast internet connections with a lot of bandwidth. This means that when a robot decides to index a server, it can make a lot of requests very quickly. Because the internet connection of the robot is often bigger than that of the server, the robot can put a heavy load on the server. Normally robots are implemented to prevent this problem. They prevent it by using some form of random queueing: for example, a robot does not take the first page from its list, but chooses one at random from the first so many (for example the first thousand).

2.2.5.2 ranking improvements

The biggest issue with search engines is the ranking of pages. Traditional search engines used a rank solely based on the similarity of the page with the requested keywords. The biggest problem with that design is the fact that it is easy to spam a search engine that only uses such methods. While spamming can be made harder by using some detection methods, it still is a considerable problem. It is also impossible for a computer to judge the quality of a page by looking only at that page; when looking at a lot of pages this becomes a lot easier. Google, for example, uses an algorithm called PageRank that computes the quality/importance of a page by looking at the links to it, and at the importance of the pages on which those links appear.

2.2.5.3 user interface improvements

Most search engine user interfaces are designed for ease of use and speed. This is good enough for people who don't need to use them on a professional basis. Professionals, though, need more complex interfaces, as they are not satisfied with result sets that merely might contain the wanted page. This means there is a desire for different interfaces for experts and novices [9]. One thing a professional might want to do is to improve the results by refining the query. The search engine can help the professional by suggesting related words that the user might add to the query.
Overmeer [9] suggests starting with an interface that puts the emphasis on selecting suitable keywords instead of on an immediate presentation of results (see figure 2.4). Such an interface makes it easier to create a query that retrieves only the wanted pages and little more.

[Figure 2.4: Selection on keywords according to [9] — three panels (Keyword, Related, Forbidden) listing terms such as bears, grizzly, zoo and teddybears together with their site, page and hit counts.]

2.2.5.4 meta search engines

Because it is not always certain that a single search engine returns all pages available on the internet, meta search engines have been invented. Meta search engines have knowledge of more pages, because they know the union of what is available on each of the search engines they use. Meta search engines don't have spiders of their own, but perform searching using the following steps:

1. For each used search engine, translate the user query to an equivalent query on that engine.

2. Query each search engine simultaneously.

3. Retrieve the result pages from each of the engines.

4. Parse every result page for its results.

5. Find duplicate results, and resolve conflicts.

6. Calculate a score for all results.

7. Sort the results.

8. Present the sorted results to the user.

A big disadvantage of meta search engines is the fact that they have no record of the result pages, and so cannot calculate a score in the traditional way by comparing the document vector to the query vector; they have to rank based on the ranks of the underlying search engines. Another problem with meta search engines is that they often don't offer the rich query possibilities search engines provide. Because the query needs to be mapped onto all queried search engines, the meta search engine needs to look at the query features supported by those search engines, and use some conflict resolution when feature differences exist. Often this means that the meta search engine provides only the lowest common denominator of the individual search engine capabilities. A third problem is the fact that the returned results are based on the average of all queried search engines, while the result of a single engine may be better balanced [8].
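The steps above can be sketched as follows, assuming a hypothetical SearchBackend adapter per underlying engine; translating queries and parsing result pages (steps 1 and 4) are exactly the engine-specific parts hidden behind that interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetaSearch {
    /** One adapter per underlying search engine (hypothetical interface). */
    interface SearchBackend {
        String translate(String userQuery);          // step 1: engine-specific query
        List<Result> search(String engineQuery);     // steps 2-4: query, fetch, parse
    }

    record Result(String url, String title, double engineScore) { }

    public List<Result> search(String userQuery, List<SearchBackend> engines) {
        // Steps 1-4: translate and query every engine (sequentially here,
        // in parallel in a real system).
        Map<String, Result> byUrl = new HashMap<>();
        Map<String, Double> combinedScore = new HashMap<>();
        for (SearchBackend engine : engines) {
            for (Result r : engine.search(engine.translate(userQuery))) {
                // Step 5: duplicates are recognised by URL; their scores are combined.
                byUrl.putIfAbsent(r.url(), r);
                combinedScore.merge(r.url(), r.engineScore(), Double::sum);
            }
        }
        // Steps 6-8: score, sort and return; since there are no document vectors,
        // the score can only be based on the ranks given by the engines.
        List<Result> results = new ArrayList<>(byUrl.values());
        results.sort((a, b) -> Double.compare(
                combinedScore.get(b.url()), combinedScore.get(a.url())));
        return results;
    }
}
```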
2.2.6 Specific improvements from literature

2.2.6.1 The use of ontologies

In [7] Modica, Gal, and Jamil describe a technique for computer-aided ontology creation. While in itself not very much search engine related, the ontologies created could very well be used to classify pages. The ontology is created by first extracting form elements from a "cleaned" DOM [15] tree that was generated from the HTML page. Cleaned in this case means that all errors in the HTML have been removed, and layout tags etc. have also been removed. For every form element a type, label and name is identified. The data collected this way is then used as a base ontology. As a lot of web services consist of multiple pages, a user is required to help the extraction tool visit all relevant pages. All parts of the service are put together in the ontology. After the base ontology is created, the adaptation phase starts. In this phase, the user suggests similar web sites to browse. In the case of a car rental service, this might mean the base ontology is based on Avis, and for adaptation Alamo and Hertz are used. In the adaptation phase, each suggested site goes through the extraction system like the base ontology did. The resulting ontology, though, is merged with the existing ontology. This merging uses heuristics to reduce "noise" like hyphens, different capitalisations, and stop terms (like a, the, etc.). Matching happens by using substring matching, content matching and thesaurus matching. Substring matching is successful when terms match. Content matching happens for fields with select, radio, or check box options, and looks at the value sets; if those match to a high degree, a match is declared. Thesaurus matching uses a thesaurus for synonyms. This thesaurus is expanded whenever a synonym is determined by the user. Matches found by the program, and unmatched tags, are presented to the user for corrections. While the FunSearch engine cannot itself create ontologies, ontologies created in such a way could very well be used by its classifications. The classification that could be done for these pages would use the ontology extraction as used in the adaptation phase, but instead of merging the two ontologies, a match score would be calculated.

2.2.6.2 Public search infrastructure

In [8] Overmeer suggests building a public, modularised search engine infrastructure. The modules he identifies are: page fetch/storage, indexing, and the user interface. Each of those modules can be used separately, and possibly even for free. A number of advantages of this structure are identified. For the fetching module, this means:

• As fetching puts less strain on a web-site, because of a better implementation and fewer spiders, web-sites can reopen themselves to spiders, resulting in better coverage.

• It is easier for web-masters to specify fetch frequency, authorisation, and other data, because it only needs to be specified once, not separately for every different spider.

• Engines are cheaper to build, as they need not store pages themselves anymore, but only need to process them into indexes.

• Internet traffic by spiders will be reduced, so more of the bandwidth is available for real use.

For the indexing module this means people can experiment more easily with different indexing functionality without the need for a lot of expensive hardware. For the user interface modules this advantage weighs even more, as they often don't require any significant storage of information, or fast hardware (with limited use).

2.2.6.3 Tailored Resource Descriptions

Cawsey in [2] states the usefulness of having personalised search result descriptions. This paper uses XSLT [14] to transform RDF [13] data for a page into a natural-language description. When it is possible to obtain such rich RDF data, it is certainly a good idea to use personalised descriptions of web-pages.

2.2.6.4 Interactive user interface

In [9] Overmeer suggests an improved user-interface for search engines that is more interactive than the traditional one. A description of this idea can be found in section 2.2.5.3.

2.2.6.5 Automatic RDF meta-data generation

If every site had reliable meta-data in a standardised format, the work of a search engine would be a lot easier. In [6] Jenkins et al. describe an automatic meta-data generator. The system starts with a classifier that automatically classifies a web page according to the Dewey Decimal Classification (DDC) classes. Their argument for this classification is that it is a universal classification scheme covering all subject areas and geographically global information.
While the Dewey Decimal Classification is certainly universal, its use is limited to the classification of web-pages that are mainly electronic versions of paper documents (irrespective of whether a paper version exists). The Dewey Decimal Classification is not so useful for interactive pages, or for pages without a specific subject (such as company or personal home-pages). Classification based on the Dewey Decimal Classification also means that the classification is by subject, not by type, which is what the FunSearch engine classifies on. The meta-data the authors use is based on something they call the Wolverhampton Core, which is basically a subset of the Dublin Core [5]. Table 2.1 lists all elements of the Wolverhampton Core; the description explains how each element is obtained. If one uses the methods described in this table to get the elements, one can easily create an RDF document for the web-page. The proposed classification could be used within the FunSearch engine to further refine the text classification. The informational part of this classification is fairly easy. For the search part it would be most useful if multiple paths could be chosen. Unfortunately the FunSearch engine uses a tree classification structure that was not designed for multiple inheritance, so the design would have to be enhanced to account for this possibility.

1. Unique accession number — Number assigned by the system. Purpose: uniquely identifies the resource.
2. Title — Taken from the HTML title element. Purpose: usually helps in discerning the subject matter.
3. URL¹ — The URL given to the system, used to extract the document for classification. Purpose: indicates the location of the document.
4. Abstract — Either the first 25 words found in the body of the page or, if present, taken from the Description META tag. (A much more sophisticated abstracting technique could be used in future implementations.) Purpose: provides further clues about the subject.
5. Keywords — Terms found within the document that match terms found within the class representatives of the DDC classes found to be appropriate. Purpose: indicate key issues/topics.
6. Classmarks — DDC classmarks found to be appropriate as a consequence of the classification process. Purpose: indicate subject area(s).
7. Word count — The number of words found on the page, including the title. Purpose: indicates extent, detail, download time.
8. Classification date — The system date when the classification took place (GMT or BST). Purpose: indicates currency of the metadata.
9. Last modified date when classified — Taken from the HTTP Last-modified header. (Given as "not known" if equal to the 'epoch', 1st January 1970.) Purpose: indicates currency of the information.

¹ The classifier only handles individual HTML documents, so the URL, not the URI, is appropriate. The URL is not used as an identifier within the search engine because it is possible for the same page to have more than one URL; this is one of the causes of repetitions in automated search engine results.

Table 2.1: The 'Wolverhampton Core' (copy from [6])

2.2.6.6 Resource Description Framework (RDF)

RDF is a foundation for processing meta-data; it provides interoperability between applications that exchange machine-understandable information on the web [13]. While RDF relies on a lot of features that XML [16] provides, it can be used without XML as long as the required features are provided by the alternative technology. Basically RDF is a way to describe and store resources. It does not make any assumptions about domains or semantics.
To achieve this, RDF has a class system like that used in object-oriented programming. Classes are grouped in schemas, and can inherit from other classes. As each class is described in terms of a parent class, a program understanding the parent class can at least know something about the child class. While RDF itself is a very useful technology, it is not used in the FunSearch engine. The main reason is that it is overkill: the classification data stays within one program, and an application-specific data structure has no major drawbacks there. Using RDF would probably also incur a performance penalty. The classification results could probably be converted to RDF very easily though, as the classification cache is already saved using XML.

2.3 The theory

The problem of improving the performance of search engines can be broken down into subproblems, as already explained in section 2.1. The question of how to recognise interactive pages can be broken down into subquestions too. This results in the question tree given in figure 2.5.

[Figure 2.5: Question tree — "Improve searching interactive pages" breaks down into "How to find pages that include interactive pages", "How to recognise interactive pages", and "How to show interactive pages". Recognition breaks down further into "How to get the page structure", "How to look at the page structure", and "How to recognise page structures", with leaves such as statistical analysis, heuristics, how to cope with incorrect HTML, and how to recognise interactive pages, text pages, and pages with links.]

Recognition of web-pages must be done by looking at the structure (say, the tags) of these pages. This raises the questions of how to get the structure, how to look at the structure, and how to recognise a structure as indicating a page of a certain kind. Each of these questions will be handled in its own section (2.3.1, 2.3.2, and 2.3.3). There is another problem though. Recognising what kind of page a page is means that the page must be classified. The reason for classifying is the observation that there are different kinds of web-pages; examples are various kinds of interactive pages, text pages, and pages with links. This theory presumes that in most cases the user knows what kind of page he wants, or at least which pages he doesn't want. If this is true, and there is a way for a user to tell from the search results what kind of page each result is, the search results have a better quality for the user. To improve on that, if the results are limited to only those of a certain kind, the precision of the results is increased. To determine the kind of page at hand, a classification scheme can be used. As there are theoretically infinite classification possibilities, and search engines should be as fast as possible, there must be some limitation on it. A way to do this is to use a classification tree (as in figure 2.6). In a classification tree, there is a root classification which includes all pages; we call this set of pages S. The assumption is that we can make a number of subclassifications S1, S2, S3, ..., Sn of this root classification for which the following points hold:

$$\forall x \forall y\,[x \neq y] \rightarrow S_x \cap S_y = \emptyset$$

$$\forall x\; S_x \subset S$$

$$S_1 \cup S_2 \cup S_3 \cup \ldots \cup S_n = S$$

In short, they must be proper subsets with no overlap between each other. Every subset Sn can itself also have subsets (T1, T2, ..., Tn) for which the same rules apply. This way a limitless tree can be constructed.
The leaves of this tree are the classifications, which we shall call Z1, ..., Zn. As each classification can only have one parent set ($\exists x\,(T_x \supset Z) \wedge \forall y\,[y \neq x] \rightarrow (T_y \not\supset Z)$, where Z is a classification and the T are the candidate parent sets), a certain classification only needs to be evaluated if its parent has been evaluated as applicable. In short, the tree setup of the classifications allows for a huge reduction in the number of classifications that need to be evaluated, at the cost of making and using evaluation functions for the supersets. The classification tree used in the prototype is described in section 2.3.4.

2.3.1 How to get the page structure?

The problem of how to get the page structure is rather easily solved. As web-pages are logically structured as a tree of elements, they can be represented in the computer as such. Getting the structure of the page means that the web-page must be parsed and stored as a tree in computer memory. As mentioned in section 2.2.4.1, HTML doesn't need to be well-formed to be visible in a web-browser. This means some form of error correction needs to occur; the prototype that has been developed uses the strategies described in section 3.5.2.1.

2.3.2 How to look at the page structure?

With a tree representation of a web-page in hand, the real problems occur. There must now be some usable model of the structure for the classifications to work with. The elements of this model can be divided into two categories: statistics and heuristics. Statistics are abstract data created by going through the document and counting various things. Examples of statistics are: word count, tag count, link count and form count. As these statistics are rather general, and benefit a lot from being calculated at the same time, the best solution is to calculate them only once for each document, and have each classification use these calculations. Heuristics are specific algorithms that determine how close a page is to a certain classification. For convenience we express this closeness on a scale from 0 to 100%, where 100% means a perfect fit and 50% means indifferent. Heuristics can make use of the statistics, but can also look at the main document. They might check for the presence or absence of certain words, or look at specific tag structures. Heuristics are, unlike statistics, specific to a certain classification. This means their calculation can be delayed until a specific classification or parent classification is evaluated. In some cases it might be useful to have these specific heuristics use specialised statistics computed by their parent classification.

2.3.3 How to recognise page structures

Recognition of pages is rather straightforward. To recognise a page one basically needs to walk the tree. One starts with the root node. As the root node has a perfect score for all pages, we need to evaluate its child nodes. As we assumed that the set of pages in a super-classification is equal to the union of the pages in its children, one of the children needs to evaluate positively. This child can also have children, in which case we need to evaluate those too. The evaluation of a classification works by applying a certain heuristic to the page and checking whether the score for that particular heuristic is high enough (in general higher than 50%).
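The tree walk just described can be sketched with a hypothetical Classification type (the prototype's actual classes in section 3.5.3 differ): every node scores the page, its direct children are scored, and the walk descends only into the best-scoring applicable child, so most of the tree is never evaluated. Concrete subclasses would implement the score from statistics such as word count, link count or form count.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical classification node; the prototype's real classes differ. */
abstract class Classification {
    private final String name;
    private final List<Classification> children = new ArrayList<>();

    Classification(String name) { this.name = name; }

    void addChild(Classification child) { children.add(child); }

    /** Heuristic score between 0.0 and 1.0 for the given page statistics. */
    abstract double score(Map<String, Double> pageStatistics);

    /**
     * Walks the tree: records this node's score, evaluates its direct children,
     * and descends only into the best-scoring child that is applicable (score
     * above 0.5). Grandchildren of the other children are never evaluated.
     */
    Map<String, Double> classify(Map<String, Double> stats) {
        Map<String, Double> result = new LinkedHashMap<>();
        result.put(name, score(stats));
        Classification best = null;
        double bestScore = 0.5;            // the "applicable" threshold
        for (Classification child : children) {
            double childScore = child.score(stats);
            result.put(child.name, childScore);
            if (childScore > bestScore) {
                best = child;
                bestScore = childScore;
            }
        }
        if (best != null) {
            // Descend one branch only; the winner's score is simply recomputed
            // (to the same value) when its own subtree is classified.
            result.putAll(best.classify(stats));
        }
        return result;
    }
}
```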
2.3.4 The classification tree

This thesis also presents a classification tree. It is a limited tree that was built for testing purposes, and for use in the prototype. This tree is visible in figure 2.6. Below I will describe each of the classifications in this tree.

[Figure 2.6: The classification tree — the RootClassification has the children Insignificant, Frameset, Text, Links, and Interactive. Text has the children Short Text, Medium Text, and Long Text; Links has Annotated Links and Not Annotated Links; Interactive has Complex, By Mailform, With Search Function, With Date Selection, and With List Subscription.]

2.3.4.1 Insignificant classification

Pages with very few words on them are classified as Insignificant. If the number of words is smaller than 40, the page cannot be reliably classified. This classification takes care of that problem by classifying all very small pages as Insignificant. Most pages are by far larger than 40 words, so this classification shouldn't be a limitation to the system. The Insignificant classification has no children.

2.3.4.2 Frameset Classification

Pages can be split into multiple independent parts using framesets. The problem is that the actual contents of each of the frames are inside other files. Browsers that don't support frames can still display contents if there is a <noframes> part in the file. Since most browsers do support frames, and the representation that would be classified from the <noframes> part differs from the representation shown to the user, the choice has been made to classify all pages with framesets as Frameset. The Frameset classification has no children.

2.3.4.3 Text Classification

A lot of the web-pages on the internet are "enhanced" texts. Articles on the internet, too, are basically texts. That is why there is a Text classification. The score for this classification is the reverse of the score for the Links classification; that means Text + Links = 1. For a description of the Links classification see subsection 2.3.4.4. In short, a page is considered to be a text if the number of links per word is below a certain value, and the standard deviation of the distance in words between the links is bigger than 1.5 times the average distance.

Short text  Because it is often useful to know the size of a web-page, the text classification has three subclassifications that indicate size. If a web-page contains fewer than 250 words, it is assumed to be short. Of these pages one can normally say that they are not articles or other reference documents.

Medium text  If a web-page contains more than 250 words and fewer than 2000, it is assumed to be medium sized.

Long text  If a web-page contains more than 2000 words, it is assumed to be long.

2.3.4.4 Links Classification

A lot of web-pages have the basic purpose of pointing users to other relevant pages. There are two kinds of such pages: pages that give a lot of bare links (undescribed links), and pages that give links with descriptions (described links). Links to the current page are discarded.
The score for this classification depends on two factors:

$$\mathrm{lnkRateScore} = \frac{6}{\langle\text{average distance in words to each reference}\rangle \cdot 2}$$

so that every average distance below or equal to 6 gives a sub-score of at least 50%, and:

$$\mathrm{stdScore} = \begin{cases} 0 & \text{if } \langle\text{links}\rangle < 5 \\ 3\, e^{\,1.5 - \frac{\langle\text{standard deviation of the distance between the link tags}\rangle}{\langle\text{average distance in words to each reference}\rangle}} & \text{if } \langle\text{links}\rangle \geq 5 \end{cases}$$

so that stdScore is zero when there are fewer than 5 links out of the page, and high when the distance between the link tags is rather regular. The score is calculated from those factors as follows:

$$\mathrm{score} = \frac{\arctan(\mathrm{stdScore} + \mathrm{lnkRateScore})}{\frac{1}{2}\pi}$$

The result of the arctan function is always between $-\frac{1}{2}\pi$ and $\frac{1}{2}\pi$. Because both lnkRateScore and stdScore are positive by definition, the effective range is within $[0, \frac{1}{2}\pi)$. Dividing by $\frac{1}{2}\pi$ normalises the result to lie between 0 and 1. A value of lnkRateScore + stdScore equal to 1 gets an arctan of $\frac{1}{4}\pi$, which corresponds to 50%.

Not annotated links  When a page is classified as links it must be either annotated or not. A page is classified as not annotated links when the number of words per link is smaller than 10. When it is more, the page is classified as annotated links. The scores for these two classifications are thus each other's opposite.

Annotated links  When a page is classified as links it must be either annotated or not. A page is classified as annotated links when the number of words per link is bigger than 10. When it is less, the page is classified as not annotated links. The scores for these two classifications are thus each other's opposite.
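Written out as code, the Links score looks as follows. The constants 6, 2, 3, 1.5 and the five-link minimum follow the formulas as reconstructed above, so this is a sketch of one possible reading rather than the prototype's exact implementation.

```java
public class LinksScore {
    /**
     * Score between 0 and 1 for the Links classification, computed from the
     * number of outgoing links, the average distance in words between them,
     * and the standard deviation of that distance.
     */
    public static double score(int links, double avgDistance, double stdDeviation) {
        // Sub-score for link density: an average distance of 6 words or less
        // gives a value of at least 0.5.
        double lnkRateScore = 6.0 / (avgDistance * 2.0);

        // Sub-score for regularity: zero for fewer than five links, high when
        // the links are spaced regularly (low deviation relative to the mean).
        double stdScore = 0.0;
        if (links >= 5) {
            stdScore = 3.0 * Math.exp(1.5 - stdDeviation / avgDistance);
        }

        // arctan maps the positive sum into [0, pi/2); dividing by pi/2
        // normalises it so that a sum of 1 corresponds to 50%.
        return Math.atan(stdScore + lnkRateScore) / (Math.PI / 2.0);
    }
}
```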
With search function A lot of pages on the internet contain an embedded search functionality. A form is classified as a search form if the tags or the tag names contain either the word "search" or the word "go".

With date selection At reservation sites there is often the possibility to specify start and end dates. This classification recognises such possibilities and classifies such forms as Date Selection forms. It works by recognising words such as: year, month, Jan, Feb, Mar, etc.

With list subscription A lot of web-sites try to bind visitors to themselves by offering the possibility to receive regular e-mails from the site maintainers. That kind of e-mail is called a mailing list; when users indicate they want to get those mails, they subscribe to the list. This classification tries to identify forms that give a possibility to subscribe to a mailing list. It looks for the words "join" and "subscribe" in conjunction with words like "mail", "send", etc. If there is such a combination, the form is classified as a list subscription form.

2.4 Demands on the theory

The system described by the theory should satisfy certain demands. The first one is that it should actually be possible to implement such a system. Other demands are:

• The experience of using the system should be at least as good as the experience with a regular search engine. This comes down to the following issues:

  – The responsiveness of the engine should be within an acceptable range, even if the system is under a "rush-hour" load.
  – The user interface should be understandable for people who are used to regular search engines.
  – The user should be able to easily get an overview of the search results. This means, for example, that one result should not take half of the space available in the browser (which would limit the simultaneous view to only two results).

• The search results should not be worse than those of a regular search engine.

• The system should have an "economic validity". That means the costs of this system should not be too much higher than those of a conventional search engine.

The demands described above will be specified further and evaluated in chapter 4. The evaluation will be based on a prototype that was developed and is described in chapter 3.

Chapter 3 - The prototype

This chapter describes the prototype system that was developed to demonstrate the viability of the theory. First I will discuss the link between the prototype and the theory. Next I will justify the differences between the prototype and the theory. Third, the prototype architecture will be described. After that I will go into some detail of each part of the architecture, and finally I will describe the implementation details of the most important parts of the prototype.

3.1 From the theory to the prototype

To prove the validity of the theory a prototype was developed. This prototype implements an example classification tree that shows the most important of the possible classifications. For a production system this tree should be more complete, though. The prototype is also the most important instrument to measure the validity of the theory.
For that reason the prototype implements most parts of the theory and tries to give an answer to most of the demands on the theory (as described in section 2.4).

3.2 Justification of differences

A prototype is by definition not a production system, and thus may have limitations that would make a production system unusable. There are several such limitations in the prototype that was developed. The biggest difference between the prototype and an ideal system is the fact that it works on top of an existing search engine. This has several disadvantages:

• Every query first needs to be passed to the search engine, and the results need to be collected. This slows down the response of the prototype.

• As the results of the query are obtained from the original search engine, the classification of those results needs to take place at query time. For classification the pages also need to be downloaded, and as not all web servers are very fast, this can take a considerable time. The prototype has a time limit for loading pages to deal with this particular problem.

• As the prototype cannot obtain the match quality of the search results for the query, that quality cannot be used in the sorting of results. This means the result sorting cannot be optimal.

While the above disadvantages cannot be disregarded, they are not a serious problem for a prototype. There is also the fact that developing a full-scale system is very expensive. For a full search engine one would need a robot crawling the internet, and the results acquired by that robot would need to be stored somewhere. All of that requires a rather large computing capacity. To lessen these capacity problems one could decide to only index a specific part of the internet, such as all documents within a certain domain. The problem with such an approach is that there is little guarantee of the same heterogeneity as on the full internet. Building a full-scale system also has another disadvantage: it is very likely that implementing a full system with adaptations to classify search results takes more time to develop than a prototype that builds on an existing search engine in the way used here. This is even true if a standard search engine is used as a starting point.

Another limitation of the prototype is the limited amount of classifications offered. While it is surely not good to have too many classifications, the prototype doesn't really offer enough. The prototype is a prototype, though, and not a production system. The classifications offered by the prototype have the main function of proving that it is possible to do this classifying. In that they succeed.

3.3 Detailed architecture

In this section I will describe the architecture of the prototype, called the FunSearch engine.

3.3.1 The place of the FunSearch engine in its environment

The FunSearch search engine is not a full search engine; it is a search engine enhancement. The reason for this design is that it focuses the effort on the enhancement, not on the search engine. A lot of functions implemented by search engines are not implemented in the FunSearch system. The FunSearch engine effectively piggy-backs on an existing search engine.
For the default implementation Altavista[1] is used, but any search engine could be used (see section 3.5.1).

[Figure 3.1: Structure of the FunSearch Engine context. The diagram shows the FunSearch interface positioned between the client on one side and the web servers, the robot, and the backing search engine on the other.]

As one can see in figure 3.1, the FunSearch engine acts transparently to the users. It takes care of querying the search engine and retrieving the results. This setup is comparable to that of a meta search engine, although meta search engines normally have multiple search engines as back-ends. The FunSearch engine architecture has advantages and disadvantages. Advantages of building the FunSearch engine on top of Altavista are:

• As the database of Altavista is used, no expensive hardware is necessary, and a desktop computer can be used for development.

• Altavista takes care of query parsing; this means no logic is necessary for that in the FunSearch engine.

• Because the data-intensive functions are performed at Altavista, the FunSearch engine can stay relatively lightweight. This means it can be tested quickly and often, which makes development easier.

• Because Altavista is used as back-end, the performance increase can easily be measured by comparing the results returned by Altavista with those returned by the FunSearch engine.

The FunSearch engine architecture also has disadvantages. The biggest disadvantage is the fact that the system is strictly query-response oriented. This means that everything happens after the user has made a request. This is contrary to the normal design, where the main part of the job is done by an autonomous process (the spider) that fills a database. For a normal search engine, the query-response part is only responsible for retrieving results for a query, sorting them, and presenting them to the user. With the FunSearch engine, it is responsible for everything; a sketch of how these steps fit together follows after this list. Everything means:

• Query Altavista with the keywords, and get the result pages.

• Parse the Altavista result pages into a collection of abstract SearchResult objects.

• Retrieve each of the web-pages associated with the search results.

• Parse each retrieved web-page, and classify it.

• Handle pages that have been removed from the internet (this could also mean renamed or moved).

• Handle pages that are situated on servers that don't respond (these only time out after a very long time).

• Sort the classified pages.

• Present the sorted list of classified pages to the user.

In addition, the engine must also be fast, as the user is waiting for it to respond and wants results as quickly as possible.
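The following minimal Java sketch shows how these query-time steps could fit together. None of the names below are taken from the prototype source; they are placeholders for the classes described in sections 3.4 and 3.5.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical orchestration of the query-response steps listed above. The
    // names are invented; they only illustrate the order of the work at query time.
    public final class QueryPipelineSketch {

        record Classified(String url, double score) {}

        public List<String> handleQuery(String keywords, int requested) {
            List<String> urls = queryBackingEngine(keywords);   // 1. query the back-end and parse its result pages
            List<String> pages = loadPages(urls);               // 2. retrieve the pages (with a time limit)
            List<Classified> scored = pages.stream()            // 3. classify every page that arrived in time
                    .map(p -> new Classified(p, classify(p)))
                    .sorted(Comparator.comparingDouble(Classified::score).reversed())  // 4. sort on score
                    .collect(Collectors.toList());
            return scored.stream().limit(requested)             // 5. present the best results to the user
                    .map(Classified::url)
                    .collect(Collectors.toList());
        }

        // Trivial stand-ins so the sketch compiles; the real work is described in sections 3.4 and 3.5.
        private List<String> queryBackingEngine(String keywords) {
            return List.of("http://example.org/a", "http://example.org/b");
        }
        private List<String> loadPages(List<String> urls) { return urls; }
        private double classify(String page) { return (page.length() % 10) / 10.0; }
    }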
To increase performance the FunSearch engine does the following:

• Cache the answers given by Altavista to a certain query.

• Maintain a list of pages that are not available, so there is no need to attempt to load them again.

• Cache classifications, so pages will not be loaded and classified twice.

• Maintain a timeout of 15 seconds when loading pages. All pages that are still being loaded after this timeout are put in the cache, but are no longer presented to the user in this response.

• Stop waiting for page loading to finish or time out when at least the requested amount of results plus 10 have been loaded and classified.

Although the FunSearch engine is built on top of Altavista, any search engine could have been used. It would also not be difficult to adapt the FunSearch engine to use a different backing engine.

3.3.2 The structure of the FunSearch engine

The FunSearch engine is built in a query-response fashion, as can be seen in figure 3.2 and, in more detail, in figure 3.3 at the end of this chapter. Everything starts when a user posts a query to the web interface. First the query is forwarded to the search engine. Following that, the pages are loaded, classified, and sorted. The results are then returned to the user. Below I will describe each part of this process in detail. Finally I will describe the HTML/XML parser used.

[Figure 3.2: Data-Flow Diagram of the FunSearch Engine. The user interface passes the query to the search engine; the search results flow through the Load Pages and Classify Pages processes into Sort Pages, and the sorted classifications are returned to the user interface.]

3.4 Filling in the architecture

3.4.1 Query Search Engine

The query part is not really sophisticated, as the system just passes the query on to the backing search engine. Parsing the results is not really difficult either, although it is a delicate process, because the representation of the results at a search engine can change. Such a change can easily break the result parser.

3.4.2 Load Pages

When the results from the search engine have been read in, all pages that were indicated as results are loaded in a concurrent process. The reason for the concurrency is the fact that a typical web client spends most of its time waiting for the response of the web server to come in. It would be unwise not to do something else in the meantime; that is why multiple pages are loaded at the same time. Since pages that have arrived are also parsed and stored in a tree representation, experimentation showed that 40 concurrent threads is a good amount. The advantage of having a large number of threads is that no single unresponsive page (where the server doesn't reply within a reasonable time) can block the system.

The page loading system uses a fixed number of threads to load pages. This is implemented by maintaining a list of pages that are not yet assigned to a thread, and a list of all threads. As soon as one thread dies, the next page is removed from the first list, and a thread is started to load that page. One feature improves the responsiveness of the system a lot: the loader system stops starting new page loads 15 seconds after it started. Threads that are already loading pages will continue, but their results will not be returned to the process that requested the loading of the pages; instead they will automatically be classified and put into the cache of classified pages. The limit of 15 seconds seems reasonable, since most responsive pages are loaded within that time, and the user doesn't need to wait for exceptional cases. Pages that do not load at all, but generate an exception (probably because the server serving them is off-line, or because the page doesn't exist anymore), are put into a cache of unresponsive pages. This cache is necessary because unresponsive pages don't get into the cache of classified pages, while the attempt to load them often costs a lot of time in which nothing happens.

The page loading not only stops after 15 seconds, but also when "enough" pages have been loaded. "Enough" in this case means <the first result on the page> + <the amount of results per page> + 10. To give a reasonable result when not all pages are loaded, the pages that were returned first by the search engine are also loaded first. Although that doesn't mean they are returned first, it does mean they have the biggest chance to be among the returned pages. This measure especially improves the responsiveness of the system.
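As an illustration of the idea of a fixed thread pool with a hard deadline, the sketch below uses the standard java.util.concurrent utilities. It is not the prototype's PageLoader class (described in section 3.5.2.2), which manages its threads by hand; the failure cache and the later classification of late pages are also left out.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    // Illustrative sketch of concurrent page loading with a fixed number of
    // threads and a 15-second deadline. Not the prototype's PageLoader class.
    public final class ConcurrentLoadSketch {

        static List<String> loadAll(List<String> urls, int threads, long deadlineMillis)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(pool.submit(() -> fetch(url)));      // every page gets its own task
            }
            long deadline = System.currentTimeMillis() + deadlineMillis;
            List<String> loaded = new ArrayList<>();
            for (Future<String> f : futures) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) break;                       // deadline passed: return what arrived in time
                try {
                    loaded.add(f.get(remaining, TimeUnit.MILLISECONDS));
                } catch (ExecutionException | TimeoutException e) {
                    // failed or still loading: skipped here; the prototype caches such pages instead
                }
            }
            pool.shutdownNow();                                  // the prototype lets late threads finish; here we simply stop
            return loaded;
        }

        // Stand-in for the real HTTP fetch.
        private static String fetch(String url) {
            return "<html><body>" + url + "</body></html>";
        }

        public static void main(String[] args) throws InterruptedException {
            List<String> pages = loadAll(List.of("http://example.org/a", "http://example.org/b"), 40, 15_000);
            System.out.println(pages.size() + " pages loaded");
        }
    }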
3.4.3 Classify Pages

As soon as a page has been loaded, the classification starts. Classification does not happen in a concurrent process, because it is processor intensive. Instead it happens sequentially, but it already starts at the moment the first page is loaded; the classification doesn't wait for all pages to be loaded. The classification itself was described in detail in section 2.3.

3.4.4 Sort Pages

After the pages are classified they need to be sorted on relevance. The first sort key is their score for the requested classification. If that score is the same, the page that has the highest rank with the backing search engine is sorted highest.

3.4.5 User Interface

The user interface translates the list of classified pages into HTML. A special feature is that the interface offers instant help for the classifications when the user moves the mouse over a classification. For further help, the user can click on the classification to get a description of it. The user interface also offers a query form where a new query can be entered.

3.5 Implementation

The implementation of the meta-search engine uses a modular design. The main modules are the search engine interface, the HTML/XML parsing layer, the classification, and the representation. These modules will be handled in that order in this section.

3.5.1 Querying

The search engine layer uses an adaptable design that allows an easy change of search engine. This flexibility is provided by the SearchInterface interface. The classes AltavistaSearch and BufferedSearch implement this interface.

3.5.1.1 Interface SearchInterface

The SearchInterface interface has the following methods:

 public String getEngineIdentifier();
 public void setQuery(String Query);
 public String getQuery();
 public int doQuery();
 public int doQuery(int startindex);
 public SearchResult getResult(int index);
 public int getResultCount();
 public int queryFor(int index);

The String getEngineIdentifier() function returns the name of the search engine that is being used (e.g. "google", "altavista", etc.). The interface is stateful: the query first has to be set with setQuery(String Query) (it can be retrieved with getQuery()). The doQuery() function queries for the first x results, where x is the default amount of results returned at once by the implementation. doQuery(int startindex) does the same, but starts with result number startindex. Both functions return the amount of results now in the buffer (this need not be x, because there may be fewer results in total). The individual search results can be retrieved with SearchResult getResult(int index); the index follows the original order returned by the underlying engine. The default implementation provided by AbstractSearchInterface automatically performs the doQuery function if required. int getResultCount() returns the amount of results in the buffer. Finally, the int queryFor(int index) function makes a query of which the result with number <index> must be part.
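The fragment below shows how a caller might use this interface. It compiles against the SearchInterface and SearchResult types listed above and an implementation such as AltavistaSearch (described in the next subsection); the no-argument constructor is an assumption.

    // Illustrative use of the SearchInterface contract. Only the interface
    // methods are taken from the thesis; the constructor is assumed.
    public class SearchInterfaceUsage {
        public static void main(String[] args) {
            SearchInterface search = new AltavistaSearch();   // or any other implementation
            search.setQuery("concert ticket reservation");
            int inBuffer = search.doQuery();                  // fetch the first batch of results
            for (int i = 0; i < search.getResultCount(); i++) {
                SearchResult result = search.getResult(i);    // results keep the engine's original order
                // ... hand the result to the page loader / classifier ...
            }
            System.out.println(inBuffer + " results from " + search.getEngineIdentifier());
        }
    }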
3.5.1.2 Class AltavistaSearch

The AltavistaSearch class implements SearchInterface. It uses www.altavista.com to resolve the queries. The maximum amount of results returned per query is 50 (this limit is set by altavista.com). The class uses HTTP to get the results. This works well, although it is an imperfect solution: problems arise, for example, when the layout of the result pages changes. When that happens, the code needs to be adapted to account for it. The need to parse the returned pages further decreases the speed of the functions.

3.5.1.3 Class BufferedSearch

The BufferedSearch class performs two functions. First of all it caches all results (they can be saved as XML files if the application wants that), and second it loads all results up to a specified maximum, which can be bigger than the maximum returned at one time by the "parent" SearchInterface. BufferedSearch thus greatly increases both the speed and the usability of other SearchInterface implementations.
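The sketch below illustrates the buffering idea as a wrapper around another SearchInterface. It is not the prototype's BufferedSearch class: the XML saving and the handling of the back-end's 50-result batches are left out, and the reliance on the back-end fetching further batches by itself is an assumption based on the AbstractSearchInterface description above.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal sketch of the buffering idea behind BufferedSearch: wrap another
    // SearchInterface and keep every result fetched so far, so repeated access
    // does not hit the back-end again. Not the prototype's class.
    public class BufferingSketch {

        private final SearchInterface backend;
        private final List<SearchResult> buffer = new ArrayList<>();

        public BufferingSketch(SearchInterface backend) {
            this.backend = backend;
        }

        // Returns result number index, fetching it from the back-end only the first time.
        public SearchResult resultAt(int index) {
            while (buffer.size() <= index) {
                // AbstractSearchInterface is described as querying further batches itself when needed.
                buffer.add(backend.getResult(buffer.size()));
            }
            return buffer.get(index);
        }
    }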
3.5.2 Loading pages

After the results of the underlying search engine are known, we want to classify the pages concerned. That means we first need to load those pages into a machine-readable format. This is where HtmlStore comes in.

3.5.2.1 Class HtmlStore

An HtmlStore can be seen as an abstraction of an HTML/XML tree. It has functions to reach all the elements in the tree, to save the tree as an ASCII file, and to load an HTML/XML file into memory. The class uses its knowledge of HTML (there is an XML configuration file for that) to read HTML files. For XML it needs to be informed of the structure; alternatively a general tree can be used, at the cost of losing some error-correction possibilities. The HtmlStore class uses its knowledge of HTML to try to correct a lot of errors in the files that it parses. It detects missing end tags and tries to insert them at appropriate places. It also knows about "default" tags that can be left out, such as tbody, body, col, etc.

The parsing algorithm The FunSearch engine uses a proprietary XML/HTML parser. This parser uses knowledge of the document structure to fill in gaps, or to identify invalid tags. Invalid tags can automatically be removed from the tree. The parser works as follows:

1. First a filter is applied while loading the file. This filter deletes double spaces and changes all whitespace (tab, carriage return, linefeed, space) into spaces. Comments, though, are passed unchanged.

2. Next the resulting stream is split into separate tags, and the text elements between them.

3. After a tag object or text object is created, the object that represents the document structure is asked to put the new tag in the tree, with the current position as an additional argument. The structure object looks at the object at the current position and checks which sub-objects are allowed for this object. If the new tag is one of the allowed tags, it is made a sub-tag of the current object. If not, there are two possibilities. First, the current object can have a default tag; if so, the method checks whether this default tag has the new tag as a possible sub-tag. If it does, the default tag is added and becomes the current object, and the new tag is created as a sub-tag of the default tag. If the default tag has a default tag itself, that default tag is considered next (recursively). Second, when there are no default tags, the method looks at the parents of the current object. If the direct parent has the new tag as a possible sub-tag, it is assumed that a close tag was forgotten, and the new tag becomes a sub-tag of the parent. If the parent doesn't allow the new tag as a sub-tag, the parent's parent is considered, and so on. If no parent can be found for the new tag, it is added as an unknown tag as a child of the current tag. The stepping back is necessary to account for certain parts of the HTML specification. For example, <P> tags need not be closed, but it is important to identify the place where they end. As the block-level <P> tag may only contain in-line tags, and in-line tags cannot contain block-level tags, any block-level tag automatically means the end of the paragraph. (A small sketch of this parent-walking rule follows at the end of this subsection.)

4. When a new tag is added to the tree, the new current tag needs to be identified. If the new tag is text, an empty tag (such as the <img> tag in HTML), or a tag that uses the XML short ending, the parent tag stays the current tag; otherwise the new tag becomes the new current tag.

5. Step 3 is repeated as long as there are more tags.

The tree that is generated from the file is equivalent to a DOM[15] tree. Depending on the application, invalid tags can be removed, ignored, or used to invalidate the whole tree. This, however, needs to happen after tree creation. The removal system is very useful when one is only interested in a subset of all tags, for example when one only wants to look at the forms in an HTML page. There are then two options. When one already has a tree parsed with the full HTML document structure, one can reapply a structure to get a new tree. One can also use the subset document structure to parse the file directly. Once the tree is constructed, invalid tags can be removed, and a clean tree is created.
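The following simplified sketch illustrates the parent-walking rule of step 3. The allowed-children table is a tiny invented fragment of the real HTML structure description, and default tags are left out; the class is not part of the prototype.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Simplified sketch of step 3: try the current tag, then walk up through the
    // parents until one allows the new tag, assuming a close tag was forgotten.
    public final class TagPlacementSketch {

        static final Map<String, Set<String>> ALLOWED = Map.of(
                "html", Set.of("body"),
                "body", Set.of("p", "div", "form"),
                "p",    Set.of("a", "b", "i"),       // <p> may only contain in-line tags
                "div",  Set.of("p", "a", "form"));

        static final class Node {
            final String name;
            final Node parent;
            final List<Node> children = new ArrayList<>();
            Node(String name, Node parent) { this.name = name; this.parent = parent; }
        }

        // Inserts newTag under current or one of its ancestors, and returns the inserted node.
        static Node place(Node current, String newTag) {
            Node candidate = current;
            while (candidate != null && !ALLOWED.getOrDefault(candidate.name, Set.of()).contains(newTag)) {
                candidate = candidate.parent;                        // a block-level tag implicitly closes <p>, etc.
            }
            Node parent = (candidate != null) ? candidate : current; // no ancestor fits: keep it as an unknown child
            Node node = new Node(newTag, parent);
            parent.children.add(node);
            return node;
        }

        public static void main(String[] args) {
            Node html = new Node("html", null);
            Node body = place(html, "body");
            Node p = place(body, "p");
            // A <div> is not allowed inside <p>, so it ends the paragraph and lands under <body>.
            Node div = place(p, "div");
            System.out.println(div.parent.name);   // prints "body"
        }
    }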
3.5.2.2 Class PageLoader

While the HtmlStore class takes care of the actual loading of one HTML page, the PageLoader class performs a more complex function. The PageLoader class is initialised with an array of URLs to load and, optionally, auxiliary data in the form of an array of Object instances. The next thing to specify is the number of threads to use for loading pages. When the PageLoader is then asked to start, it starts the specified number of threads to load pages. The moment one of these threads finishes, a new thread is started to load another page. In addition, it is possible to set a time limit after which the main (controlling) thread returns; by default this timeout is set to 15 seconds. Once this timeout has passed, no new threads will be started. Running threads still continue, but their pages are not put in the vector of returned pages; those pages are automatically classified instead. Because the classification algorithm automatically caches classifications, the page thus ends up in the cache.

The PageLoader class itself implements Runnable, so it can be started as a thread. The vector of loaded pages can be obtained with the getPages() function, the auxiliary data through the getOutData() function, and the URLs with the getUrls() function. The elements in those vectors belong together. It is necessary to use those vectors because it is very unlikely that the pages are loaded in their original order. It is also not certain that all pages are loaded, either because of the timeout or because some pages are simply not loadable. That is also the reason one can pass auxiliary data to the PageLoader class. In my program, for example, the auxiliary data is used to store SearchResults, so the loaded pages can later be reordered into the original order.

There is one extra function the PageLoader class performs: whenever a page doesn't load (for example because it doesn't exist, or its host is unreachable), it keeps the address in a list. If it is asked to load this page another time, it compares the time of the failure with the current time. If the difference is smaller than one day, it will not try to load the page. Especially when hosts are unreachable, this function greatly reduces the response time of repeated page loads. Since it is very uncommon for all pages pointed to by a 200-page search to be loadable, it is important that an application checks the size of one of the vectors that contain the results.

3.5.3 Classifying pages

After the pages are loaded, the most important function needs to be performed: the loaded pages need to be classified.

3.5.3.1 Class Classifier

The Classifier class is used to classify a page. When its run() method is called it builds the classification tree. The basic function of this class is to provide abstraction from the Classification class, to store extra data, and to cache classifications. The caching of classifications improves the performance a lot, because pages in the cache don't need to be loaded from the internet (by the PageLoader class) again.

3.5.3.2 Abstract Class Classification

The Classification class doesn't perform any function by itself, but its children do. The children each represent a different classification. They calculate their own score, and use the Counter class for their calculations. Besides that, they have a describeHtml(java.io.PrintStream out) function that writes a description of the classification to the PrintStream given as parameter.
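As an illustration of how such a subclass could look, the sketch below scores a page purely on its word count, using the thresholds from section 2.3.4.3. The base class, the Counter accessor, and all names here are assumptions; the prototype's real Classification and Counter classes are only described informally in this chapter.

    // Hypothetical classification subclass in the spirit of the hierarchy above.
    // Only the 250/2000-word thresholds come from section 2.3.4.3.
    abstract class ClassificationSketch {
        abstract double score(CounterSketch counter);
    }

    // Minimal stand-in for the statistics collector described in section 3.5.3.3.
    final class CounterSketch {
        final int words;
        CounterSketch(int words) { this.words = words; }
    }

    // A size classification: 1 for pages between 250 and 2000 words, 0 otherwise.
    final class MediumTextSketch extends ClassificationSketch {
        @Override
        double score(CounterSketch counter) {
            return (counter.words > 250 && counter.words < 2000) ? 1.0 : 0.0;
        }

        public static void main(String[] args) {
            System.out.println(new MediumTextSketch().score(new CounterSketch(800)));   // prints 1.0
        }
    }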
3.5.3.3 Class Counter

The Counter class does the heavy work in the classification process. It walks through the web-page and collects all kinds of statistics. These statistics are used by the subclasses of Classification. The statistics collected by this class are:

• words: The amount of words in the page. A word is everything that is not a tag and is separated by a tab, linefeed, space, or carriage return.

• tags: The amount of tags on the page. A tag is an HTML structure element; an example is <P>, which designates the beginning of a new paragraph. Tags that only have a layout function are not counted, because they only have meaning for the text and say nothing about the complexity or structure of the page.

• references: The amount of reference tags. Reference tags (<a href> tags) are the tags that refer to other pages on the internet. They could also be called hyper-links.

• average distance between references: A number calculated from the amount of references and the amount of words, so the distance is measured in words.

• standard deviation of the distance between references: This gives an indication of the distribution of the references along the page.

• forms: The amount of forms in the page. Forms are places where a user can give input that will be sent to the server to request a custom-made page. Forms are essential for enabling servers to interact with users. Pages that are part of a web-based e-service must contain forms, which means this counter can easily recognise any interactive page.

3.5.3.4 Class FormClassification

The FormClassification class is a subclass of the Classification class, but it is worth mentioning because it adds an extra dimension: it stores information per form, where each web-page can have multiple forms. The class uses a formula to derive the page score from the form scores.

3.5.4 Sorting pages

The results from classifying are next sorted using the standard Java sorting tool. The higher their score, the higher their position.

3.5.5 User Interface

After the pages have been sorted, the user still needs to be able to get the results. That is where the representation comes in. The program uses its own web server for the representation. The reason for being web-based is, besides the wide availability of web browsers, that it seems logical to have a web-based search engine where one can simply click on the link one wants to follow. The reason to have an own server, and not a web application that runs within another web server, is that the performance of the engine is increased significantly by caching. Loading the cache is slow, though, not to mention other start-up costs. So the extra performance of having an own web server greatly outweighs the costs of writing a small web server.

3.5.6 Class WebServer

The WebServer class is a small modular web server. It works together with the WebResourceProvider interface. It is not a full HTTP/1.1 implementation, but basic functions like HEAD, GET, and POST are supported. The WebServer class itself doesn't know what to return for a request; that is where the WebResourceProvider interface comes in. The WebServer class maintains a list of WebResourceProviders. When a query comes in, the resource part of the URL is compared to the list. If there is a match, the matching WebResourceProvider is used to return a web-page; if not, the root WebResourceProvider is used.

3.5.7 Interface WebResourceProvider

WebResourceProvider is the interface that classes must implement to be a resource provider for the WebServer class. Classes that implement this interface don't need to know HTTP, except for the headers that are sent; they can however resort to default headers, in which case they don't even need to know about those. The main class that implements WebResourceProvider is the AbstractResourceProvider class, which makes it easier to implement the interface. AbstractResourceProvider is a "default" implementation of the WebResourceProvider interface. It provides most of the functions defined in the interface. Only the writeBody function is declared abstract, because that should be specific to each resource provider or even resource.
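As an illustration, the sketch below shows what a minimal resource provider could look like. Only the abstract writeBody function is mentioned in the thesis; the base class and the signature used here are assumptions made for this example.

    import java.io.PrintStream;

    // Hypothetical resource provider in the spirit of section 3.5.7. The base
    // class and signature are assumptions; only writeBody is named in the thesis.
    abstract class AbstractResourceProviderSketch {
        // Subclasses only have to produce the body; headers get defaults.
        abstract void writeBody(PrintStream out);
    }

    final class HelloResource extends AbstractResourceProviderSketch {
        @Override
        void writeBody(PrintStream out) {
            out.println("<html><body><h1>Hello from the FunSearch web server</h1></body></html>");
        }

        public static void main(String[] args) {
            new HelloResource().writeBody(System.out);   // normally the WebServer would call this
        }
    }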
3.5.8 Class ClassificationResource

The ClassificationResource class is the web interface of the FunSearch search engine. It presents two views. The first, main, view consists of two parts: a search interface and a view of the search results. The second view is meant to provide users with an explanation of the various classifications and to show their relations. The interface tries to increase its responsiveness by returning results to the user when either 15 seconds have passed since it started loading pages, or when there are ten more pages available than requested. In the latter case the engine still loads the pages that are in its queue. Furthermore it contains a cache with a list of pages that it could not load; those pages will not be tried again within 24 hours. This greatly increases the responsiveness of the interface.

[Figure 3.3: Activity Diagram of the FunSearch engine. The diagram shows the WebServer, ConnectionHandler, and PageLoader objects: a connection is accepted, the HTTP request is read, the search engine is queried, pages are loaded and classified as they arrive, the remaining pages are classified once enough pages are loaded (or the loader indicates it is finished), the pages are sorted, the results are printed to the stream, and the connection is closed.]

Chapter 4 - Evaluation

In evaluating search engines, there are two standard measurements: recall and precision. To explain these, I first need to introduce some variables. P is the set of all relevant web-pages in existence at a certain point in time t. R is the set of all results returned for a search at the same time t. C is the set of all relevant results (R ∩ P = C). p, r, and c are defined as the sizes of the capitalised sets (e.g. p is the amount of elements in P).

Recall is defined as c/p. A high recall means that most of the pages that should be returned by a perfect search engine are indeed returned.

Precision is defined as c/r. A high precision means that most of the returned pages are relevant. Precision is often a problem with traditional search engines when searching for interactive pages.
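As an illustration with made-up numbers: suppose p = 40 relevant pages exist, a query returns r = 20 results, and c = 10 of those results are relevant. Then

    recall = c/p = 10/40 = 0.25        precision = c/r = 10/20 = 0.5

A second engine that returns r = 100 results with the same c = 10 would have the same recall, but a much lower precision of 0.1.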
4.1 Evaluation of the prototype

To evaluate the prototype we need to define a number of categories to evaluate on. These categories are:

• Recall.

• Precision.

• Responsiveness. While not that important for a prototype, it must still be reasonable.

• Understandability of the user interface. A system is only useful when people use it, and to do that they must understand the system. The easier the interface, the more willing they are to learn it.

The prototype has two modes: original sorting mode and classification sorting mode. These modes have different recall and precision values, so I will discuss the precision and recall of those modes separately. First I will discuss responsiveness and understandability.

Responsiveness The prototype has a reasonable responsiveness. If the query has already been cached, the responsiveness is outstanding; without the cache the responsiveness is worse. The prototype employs several measures to increase responsiveness:

• The system caches queries and classifications. This way time-consuming actions only need to be performed once, not every time.

• The system shows preliminary results as soon as the amount of classified results is 10 higher than the amount of requested results. This means results 1 to 20 will be presented to the user as soon as 30 results are classified, and results 21 to 40 as soon as 50 results are classified. Because the system caches the results, results 21 to 40 are in most cases available almost as soon as the user has finished reviewing the first 20 results.

• The system returns anyway after 15 seconds. The search results often contain pages that are on unresponsive servers and cannot be loaded, and the network timeout is rather long, so the system uses its own timeout. Pages that were still loading at the time of the timeout are still classified and put in the cache, but not presented to the user.

• The system tries to load 40 pages at a time. After a couple of tests this appeared to be the optimal setting to keep the classification system busy but not overwhelmed. Loading 40 pages at a time also prevents unresponsive servers from locking the system into unresponsiveness.

As said, these measures bring the system's responsiveness to a reasonable level.

Figure 4.1: The FunSearch engine user interface

Understandability of the user interface The prototype has a user interface that looks a lot like that of any other search engine (see figure 4.1). The only unusual element is the classification box that allows specifying the classification to sort on. Although no formal user test was done on the interface, a little asking around indicated that the user interface is understandable. To further help things, the prototype also has a built-in help that explains all classifications.

Recall in original order mode The recall in original order mode does not differ from the recall of a normal search engine. The same pages are returned, and in the same order as on a normal search engine.

Precision in original order mode While precision in the technical sense of the word doesn't change from what is achieved by a normal search engine, one can also look at it in another way: precision can be defined as the number of pages that were classified correctly. Looked at that way, the prototype performs rather well. Although the classification system uses a percentual score, the classification that gets chosen as the main classification tends to be correct.

Recall in classification sorting mode While in classification sorting mode all results presented by the original search engine are still used, one can redefine recall by taking the pages that score higher than 50% for the chosen classification as the retrieved pages (R). While a full evaluation of the recall can only be done by a user panel review, the preliminary results are promising.

Precision in classification sorting mode The precision of the prototype in classification sorting mode can be evaluated analogously to the recall in this mode: the pages with a score above 50% are taken as R. Depending on the classification, the precision in this mode is higher for the prototype than for the backing search engine.

4.2 From the prototype to the theory

As seen in the last section, the prototype functions well at classifying pages. While the performance and responsiveness are not that good, in a production system this would not be a problem, as the classification would take place in the robot part of the search engine. Precision and recall would probably also increase in a production system compared to the prototype, because the full collection of pages in the database would then be available.
4.3 Conclusion

Taking everything into account, the conclusion can be drawn that classifying web-pages does improve searching on the internet in general, and searching for interactive pages in particular. The classification system gives users more information about the results, and enables them to search for specific kinds of pages. The costs of this functionality in terms of user experience in a production system will be very low: searching using these classifications will not take (noticeably) more time, and the increase in complexity of the search interface is very small. All in all this means that a classification system is a good addition to existing search engine technology, especially since the use of a classification system does not exclude the use of other measures that increase search performance.

Chapter 5 - Suggestions

There are a couple of things that can be done to enhance the FunSearch engine and make it production ready:

• First of all, the classification could happen within the robot of a search engine. This would especially improve the searching capabilities, as the ordering of results would not need to happen on a subset of all possible results (the FunSearch engine only gets the first 200 results from Altavista), but on all of them. Furthermore, the response time of the engine would improve, as the time-intensive parts of the system would be performed in the robot and not in the user interface part of the search engine.

• An XML-based, configurable ontology tree structure could be implemented for the interactive classifications. While a lot of work, it would be necessary for a production system to have more classifications than the current system has.

• The system is written in Java. To get good performance with many concurrent users, or as part of a robot that tries to summarise as many pages as possible in a certain time, the system should be implemented in a native language such as C++. Porting the application should not be too difficult, though. Another option would be to use a native compiler for Java.

• A more extensive help system than the one presently implemented should be included in the system. As the FunSearch engine is only a prototype, the added complexity of such a system is not warranted in the current implementation.

Appendix A - Bibliography

[1] http://www.altavista.com.

[2] Alison Cawsey. Presenting tailored resource descriptions: will XSLT do the job? Computer Networks, 33(1-6):713-722, June 2000.

[3] R. Fielding, U.C. Irvine, J. Gettys, J. Mogul, H. Frystyk, and T. Berners-Lee. Hypertext transfer protocol -- HTTP/1.1. IETF RFC 2068, January 1997.

[4] http://www.google.com.

[5] The Dublin Core Metadata Initiative. Dublin core metadata element set, version 1.1: Reference description. http://dublincore.org/documents/1999/07/02/dces/, July 1999.

[6] Charlotte Jenkins, Mike Jackson, Peter Burden, and Jon Wallis. Automatic RDF metadata generation for resource discovery. Computer Networks, 31(11-16):1305-1320, May 1999. Also available at http://www.scit.wlv.ac.uk/~ex1253/rdf_paper/.

[7] Giovanni Modica, Avigdor Gal, and Hasan M. Jamil. The use of machine-generated ontologies in dynamic information seeking. In Proc. Sixth International Conference on Cooperative Information Systems, September 2001.

[8] Mark A.C.J. Overmeer. My personal search engine. Computer Networks, 31(21):2271-2279, November 1999.

[9] Mark A.C.J. Overmeer. A search interface for my questions. Computer Networks, 31(21):2263-2270, November 1999.

[10] Steven Pemberton, Murray Altheim, Daniel Austin, Frank Boumphrey, John Burger, Andrew W. Donoho, Sam Dooley, Klaus Hofrichter, Philipp Hoschka, Masayasu Ishikawa, Warner ten Kate, Peter King, Paula Klante, Shin'ichi Matsui, Shane McCarron, Ann Navarro, Zach Nies, Dave Raggett, Patrick Schmitz, Sebastian Schnitzenbaumer, Peter Stark, Chris Wilson, Ted Wugofski, and Dan Zigmond. XHTML 1.0: The extensible hypertext markup language. http://www.w3.org/TR/2000/REC-xhtml1-20000126, January 2000.
[11] William Stallings and Richard Van Slyke. Business Data Communications. Prentice-Hall, Inc., 3rd edition, 1997. ISBN 0-13-594581-X.

[12] W3C. HTML 4.01 specification. http://www.w3.org/TR/1999/REC-html401-19991224, December 1999.

[13] W3C. Resource Description Framework (RDF) model and syntax specification. http://www.w3.org/TR/1999/REC-rdf-syntax-19990222, February 1999.

[14] W3C. XSL Transformations (XSLT) version 1.0. http://www.w3.org/TR/1999/REC-xslt-19991116, November 1999.

[15] W3C. Document Object Model (DOM) level 1 specification (second edition). http://www.w3.org/TR/2000/WD-DOM-Level-1-20000929, September 2000.

[16] W3C. Extensible Markup Language (XML) 1.0 (second edition). http://www.w3.org/TR/2000/REC-xml-20001006, October 2000.

[17] http://www.yahoo.com.

[18] Dell Zhang and Yisheng Dong. An efficient algorithm to rank web resources. Computer Networks, 33(1-6):449-455, June 2000.

Appendix B - Samenvatting in het Nederlands

Terwijl het zoeken op internet naar tekstuele pagina's behoorlijk goed werkt, kan het zoeken naar interactieve pagina's nog lastig zijn. Deze scriptie presenteert een oplossing voor dit probleem. De gepresenteerde oplossing gebruikt een classificatiesysteem dat pagina's classificeert in verschillende klassen om de functionaliteit van zoekmachines uit te breiden. Mogelijke classificaties zijn: tekstuele webpagina's, webpagina's met links en interactieve webpagina's. Het classificatiesysteem verbetert in het bijzonder het zoeken naar interactieve pagina's. Om dit uit te leggen moet eerst het verschil tussen interactieve webpagina's en tekstuele webpagina's worden uitgelegd. In tekstuele webpagina's worden de belangrijkste delen gevormd door woorden. Bij interactieve pagina's bepalen juist de tags (controle-elementen van HTML) de structuur van de pagina. Traditionele zoekmachines zijn gebaseerd op "information retrieval". Information retrieval is echter gebaseerd op tekstuele inhoud (en is niet specifiek voor HTML). Omdat bij interactieve pagina's tags een grote rol spelen, werkt de traditionele aanpak niet optimaal. De meeste literatuur over zoekmachines probeert het zoeken naar tekstuele pagina's te verbeteren. Anderen proberen het zoeken naar andere soorten webpagina's te verbeteren, maar gebruiken metadata als oplossing: zij willen de metadata doorzoeken voor de gevraagde pagina's. Hoewel dit idee werkt, is er een probleem. Tot nu toe is er bijna geen publiek toegankelijke metadata over webpagina's. Die metadata zou gecreëerd moeten worden door de auteurs van de webpagina's. Hoewel het denkbaar is dat auteurs van grotere sites metadata beschikbaar stellen, zullen kleinere sites dit waarschijnlijk voorlopig niet doen. Voor het verbeteren van het zoeken naar interactieve pagina's kan een classificatiesysteem gebruikt worden. Dit classificatiesysteem classificeert pagina's in groepen die zijn gebaseerd op wat voor soort pagina een bepaalde pagina is. Dit classificatiesysteem werkt ook goed voor tekstuele pagina's en kan ook daar informatie toevoegen voor de gebruiker.
Om gedetailleerder zoeken mogelijk te maken en om de performance te verbeteren is het classificatiesysteem als een boomstructuur opgezet. Bovenin zitten de algemene groepen. Die groepen hebben "kinderen" die specifieker zijn. Een algemene groep zou een pagina bijvoorbeeld als interactief kunnen benoemen; een meer specifieke classificatie/groep zou die pagina als een reserveringspagina kunnen zien. De verbetering in performance zit in het feit dat alleen de kindclassificaties waarvan de ouderclassificaties de hoogste score hebben behaald, geëvalueerd worden. Er is een prototype ontwikkeld dat deze ideeën uitwerkt. Het prototype functioneert als een zoekmachine die het mogelijk maakt de resultaten te sorteren op basis van de classificatie. De resultaten bevatten ook een samenvatting van de scores voor de verschillende classificaties. Dit vergemakkelijkt het voor gebruikers om te kiezen welke pagina's ze verder willen bekijken. Het prototype functioneert goed binnen zijn beperkingen. Die beperkingen liggen bovenal in de snelheid, en omdat ze bij een productiesysteem niet van toepassing zouden zijn, zijn ze niet serieus van aard.