Context-Aware Web-mining Engine

Christopher Goh Zhen Fung
River Valley High School
Singapore
[email protected]

Koay Tze Min
Raffles Institution
Singapore
[email protected]

Abstract—The context of a user's query is often lost when using search engines such as Google, as they limit the number of search terms that can be input; as a result, they often fail to return the most relevant or desired results. This project developed a context-aware web-mining engine (CAWE) that allows the user to use the entire content of multiple documents as the query, and to search for and rank web documents by their similarity to it. CAWE combines the search results of the Google, Yahoo and Bing search engines, and further crawls the links found in the downloaded web documents. The web documents are then ranked according to their content's relevance to the query documents. This document-matching capability is enabled by natural language processing techniques. Experiments showed that CAWE performed better than Google at finding relevant web documents. CAWE has numerous potential applications, such as terrorist profiling for counter-terrorism and disaster monitoring for humanitarian relief operations.

Keywords—natural language processing, machine learning, web-mining, data extraction, document search engine

I. INTRODUCTION

A. Purpose of Research Area

Since its invention, the Internet has become an unimaginably vast repository of information. By itself, this information is disorganized and messy, so users have to rely on search engines such as Google to index and search the web. While search engines have proved useful, they are not context-aware. For example, searching for "crane" when you want to find out about the crane vehicle might net you results about the crane bird. Of course, the user can add keywords to give the search engine more context, such as searching for "crane vehicle" instead. However, there is a limit to how much context can be given, since only a limited number of search terms is allowed. The context of the query is thus often lost, the returned results are not always the most relevant, and the desired documents may remain hidden within the links of the returned results. To find truly relevant information, a user often visits multiple websites and manually assesses each one, an extremely tedious and time-consuming process, especially if the content of these sites is lengthy.

This project aimed to develop an engine that lets users find documents on the Internet that are more relevant to their query, by ensuring that the context of the query is better captured. Titled the Context-Aware Web-mining Engine (CAWE), it extracts keywords from an arbitrary-length user query, then submits the keywords to multiple search engines to obtain the first layer of web documents. These web documents are downloaded, and the URLs within them are extracted to download further layers of web documents. The documents are ranked according to their similarity to the query, using natural language processing with machine learning techniques, to return search results of higher relevancy.

We compared the performance of CAWE with Google, the most widely used search engine in the world, holding 70.69% of the search engine market share as of November 2015 [1]. Most other popular search engines (Bing, DuckDuckGo, Baidu, Yahoo, etc.) use PageRank-like techniques similar to Google's, with variants of semantic matching [2].

B. Drawbacks of Current Popular Search Engines

• Popular search engines tend to rank the web documents deemed most widely used first (social networking sites, organization websites, etc.). This can be useful in some cases, but not for research and information acquisition, where highly relevant information is sought.
• Usually, only a limited-length query of up to 32 words can be input into the search engine, which makes document matching impossible.
• Many websites use search engine optimization techniques that can artificially inflate their search engine rankings.

These properties give results that are not truly ranked by relevance, and make it tedious for users to search through pages of results to identify relevant sites.

II. RESEARCH MATERIALS AND METHODS

A. Engine Creation

CAWE was programmed in Java with IntelliJ IDEA, a Java IDE [3]. CAWE has two main phases, the Searcher and the Ranker, as illustrated in Fig. 1.

[Figure 1. Flowchart of CAWE's web-mining process. Input: user query (text/documents). Phase 1, Searcher: 1. keyword extraction from query; 2. web crawling; 3. feature identification. Phase 2, Ranker: 1. ranking by unigram similarity; 2. ranking by name similarity. Output: documents ranked based on relevance. Further applications: name mining, profile search, humanitarian aid, defence intelligence (e.g. counter-terrorism), academic research.]

1) Phase 1: Searcher

The Searcher takes in an arbitrary-length query or query document(s), and outputs a list of web documents.

a) Keyword Extraction from Query

Keywords are needed to activate multiple search engines to provide the first layer of web documents. These keywords are extracted from the query documents. To do so, the contents of the query files are first tokenized into unigrams using the tokenizer from the Apache Lucene libraries. Stop words (insignificant words like "the") are excluded. The unigrams are then ranked by importance, with the top 10 unigrams suggested to the user as keywords for the search engines. The importance of each unigram is computed from its term frequency-inverse document frequency (TF-IDF) and a factor that indicates whether it is a named entity.

TF-IDF measures the importance of a term in a collection of documents: the higher the value, the more important the term. The TF-IDF of a term t is calculated as shown in (1).

TFIDF_t = TF_t × IDF_t    (1)

Term frequency (TF) is the number of times a term appears in a document; the greater the term frequency, the more important the term is in that document. As different documents have different numbers of terms, the term frequencies are normalized as shown in (2).

Normalized TF_t = TF_t / n    (2)

where t is the term and n is the total number of terms in the document.

Document frequency (DF) is the number of documents in a collection in which a particular term appears. Less important words such as "people" are so commonly used that they appear in almost every document, so their document frequencies are often higher than those of more important keywords that occur in fewer documents. Inverse document frequency (IDF) therefore assigns higher weight to less frequently used terms. The IDF of a term t is calculated as shown in (3).

IDF_t = log(N / DF_t)    (3)

where N is the total number of documents in the collection.

In CAWE, the DF values of words are pre-trained on 441,803 text articles extracted from Wikipedia, the world's largest online encyclopaedia [4]. In addition, the names of people, organizations and locations are given more weight than other terms; these names are extracted using the named entity recognizer (NER) from the Stanford CoreNLP libraries [5]. The user may enter the suggested keywords into CAWE, or come up with their own, to be used as the starting point for the web crawler.
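To make Eqs. (1)-(3) concrete, here is a minimal sketch of TF-IDF keyword scoring in Java. It is illustrative rather than CAWE's actual code: it tokenizes with a simple regular-expression split instead of the Apache Lucene tokenizer, uses a toy stop-word list, and assumes a hypothetical pre-trained document-frequency table (df) and corpus size (totalDocs); the named-entity weighting factor is omitted.

```java
import java.util.*;

// Minimal TF-IDF keyword scorer illustrating Eqs. (1)-(3).
// Assumptions: regex tokenization (CAWE uses Apache Lucene), a toy
// stop-word list, and a hypothetical pre-trained DF table.
public class KeywordScorer {

    static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and", "to", "in");

    // Returns the query's unigrams ranked by TF-IDF, highest first.
    // CAWE would suggest the top 10 of these to the user.
    static List<Map.Entry<String, Double>> rank(String query,
                                                Map<String, Integer> df,
                                                int totalDocs) {
        Map<String, Integer> tf = new HashMap<>();
        int n = 0; // total number of terms in the query document (Eq. 2)
        for (String t : query.toLowerCase().split("\\W+")) {
            if (t.isEmpty() || STOP_WORDS.contains(t)) continue;
            tf.merge(t, 1, Integer::sum);
            n++;
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double normTf = (double) e.getValue() / n;                // Eq. (2)
            double idf = Math.log((double) totalDocs
                    / Math.max(1, df.getOrDefault(e.getKey(), 0)));   // Eq. (3)
            scores.put(e.getKey(), normTf * idf);                     // Eq. (1)
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = Map.of("crane", 120, "vehicle", 800, "bird", 600);
        rank("the crane vehicle lifts the heavy load", df, 441803)
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}
```

Since the final score is monotonic in IDF, the base of the logarithm in Eq. (3) does not affect the resulting keyword order.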
b) Web Crawling

The input keywords are used to form search queries on three search engines (Google, Bing and Yahoo), and the Searcher fetches the URL of every search result using the Jsoup HTML parser [6]. Every URL fetched is added to a multithreaded processing queue, where the raw HTML, PDF or Word document content is downloaded from each URL. All URLs found in the raw documents are extracted and added to the processing queue again, so that their content is downloaded in a breadth-first-search manner. The depth of the crawl is configurable by the user.

c) Feature Identification

Content-tagged text (e.g. paragraph-tagged HTML text within <p> tags, header text, etc.) is extracted from the raw documents for ranking, using Jsoup (for HTML), Apache POI (for Microsoft Office) [7] and Apache PDFBox (for PDF) [8]. This removes most of the likely irrelevant content, such as navigation text and advertisements.
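The sketch below illustrates the breadth-first crawl and the paragraph-text extraction described above, using the Jsoup API. It is a deliberately simplified, single-threaded version: CAWE uses a multithreaded processing queue and also handles PDF and Word documents via Apache PDFBox and Apache POI, which are omitted here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

// Single-threaded breadth-first crawler sketch using Jsoup.
// Only HTML is handled; PDF/Word downloading is omitted.
public class Crawler {

    // Crawls up to maxDepth layers from the seed URLs and maps each
    // fetched URL to its paragraph-tagged text (feature identification).
    static Map<String, String> crawl(List<String> seedUrls, int maxDepth) {
        Map<String, String> pageText = new LinkedHashMap<>();
        Set<String> visited = new HashSet<>(seedUrls);
        Deque<String> queue = new ArrayDeque<>(seedUrls);
        for (int depth = 0; depth < maxDepth && !queue.isEmpty(); depth++) {
            int layerSize = queue.size(); // process one BFS layer at a time
            for (int i = 0; i < layerSize; i++) {
                String url = queue.poll();
                try {
                    Document doc = Jsoup.connect(url).get();
                    // Keep only content-tagged text, dropping navigation and ads.
                    pageText.put(url, doc.select("p").text());
                    for (Element link : doc.select("a[href]")) {
                        String next = link.attr("abs:href"); // resolve relative URLs
                        if (!next.isEmpty() && visited.add(next)) {
                            queue.add(next); // enqueue for the next layer
                        }
                    }
                } catch (Exception e) {
                    // Skip pages that fail to download or parse.
                }
            }
        }
        return pageText;
    }
}
```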
2) Phase 2: Ranker

The Ranker ranks the downloaded web documents according to each one's relevance to the query documents. The relevance is measured by calculating the cosine similarity between each document and the query, taking into account two factors, unigram similarity and name similarity, before giving a final rank.

Cosine similarity is a measure of the similarity between two vectors of an inner product space, calculated as the cosine of the angle between them [9]. In this case, the vectors represent the two documents being compared: each dimension of the vectors corresponds to an individual term in the documents, and the TF-IDF of each term, computed as described earlier, is assigned as the value in that dimension. The cosine similarity of two vectors A and B is given in (4).

cosine similarity = cos θ = (A · B) / (‖A‖ ‖B‖) = (Σ_i A_i B_i) / (√(Σ_i A_i²) × √(Σ_i B_i²))    (4)

where A_i and B_i are the components of vectors A and B respectively.

Here, the cosine similarity between two documents ranges from 0 to 1, with 0 meaning totally unrelated (no shared terms) and 1 meaning exactly similar. The closer the similarity score is to 1, the more relevant the documents are to each other.

a) Ranking by Unigram Similarity

Like the query, every web document is first tokenized and cleared of stop words, before the TF-IDF values of all its remaining unigrams are calculated and used to form the document's vector. Next, the vector of the query content is aligned with the vector of each web document, and the cosine similarity between them is calculated (a code sketch of this computation is given below).

b) Ranking by Name Similarity
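As an illustration of Eq. (4) and the unigram-similarity ranking above, the following sketch computes the cosine similarity of two documents represented as sparse TF-IDF vectors (term-to-weight maps). It is a minimal example, not CAWE's implementation.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Cosine similarity between two sparse TF-IDF document vectors (Eq. 4).
// Each map entry is one dimension: term -> TF-IDF weight.
public class CosineSimilarity {

    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        Set<String> terms = new HashSet<>(a.keySet());
        terms.addAll(b.keySet()); // union of the dimensions of A and B
        double dot = 0, normA = 0, normB = 0;
        for (String t : terms) {
            double x = a.getOrDefault(t, 0.0);
            double y = b.getOrDefault(t, 0.0);
            dot += x * y;   // contributes only when the term is in both documents
            normA += x * x;
            normB += y * y;
        }
        if (normA == 0 || normB == 0) return 0; // an empty vector shares no terms
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // TF-IDF weights are non-negative, so the score lies in [0, 1].
        Map<String, Double> query = Map.of("crane", 0.9, "vehicle", 0.7);
        Map<String, Double> doc = Map.of("crane", 0.8, "vehicle", 0.5, "lift", 0.3);
        System.out.println(similarity(query, doc)); // close to 1: highly relevant
    }
}
```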