Context Aware Web-mining Engine Christopher Goh Zhen Fung Koay Tze Min River Valley High School Raffles Institution Singapore Singapore
[email protected] [email protected] Abstract—The context of a user’s query is often lost when using extracted to download more layers of web documents. These search engines such as Google, as they limit the number of search documents are ranked according to their similarity to the query, terms that can be input. Therefore, they may not return the most using natural language processing with machine learning relevant or desired results often. This project developed a context- techniques, to return search results of higher relevancy. aware web mining engine (CAWE), which allows the user to use the entire content of multiple documents as the query, to search and rank We compared the performance of CAWE with Google, the web documents by their similarity to it. CAWE combines the search most widely used search engine in the world, holding 70.69% results from Google, Yahoo and Bing search engines, and further of the search engine market share as of November 2015 [1]. crawls the links found in the downloaded web documents. The web Most popular search engines (Bing, DuckDuckGo, Baidu, documents are then ranked according to their content’s relevance to Yahoo, etc.) make use of similar PageRank techniques like the query documents. This ability for document matching is enabled Google, with variants of semantic matching [2]. by natural language processing techniques. Experiments showed that CAWE performed better than Google in finding more relevant web B. Drawbacks of Current Popular Search Engines documents.