International Journal of Web Engineering 2013, 3(1): 1-10 DOI: 10.5923/j.web.20130301.01
Enhancing Semantic Search Engine by Using Fuzzy Logic in Web Mining
Salah Sleibi Al-Rawi1,*, Rabah N. Farhan2, Wesam I. Hajim2
1Information Systems Department, College of Computer, Anbar University, Ramadi, Anbar, Iraq 2Computer Science Department, College of Computer, Anbar University, Ramadi, Anbar, Iraq
Abstract  The present work describes the system architecture of a collaborative approach for semantic search engine mining. The goal is to enhance the design, evaluation and refinement of web mining tasks using a fuzzy logic technique for improving semantic search engine technology. It involves the design and implementation of a web crawler for automating the search process, and examines how these techniques can aid in improving the efficiency of existing Information Retrieval (IR) technologies. The work is based on term frequency-inverse document frequency (tf*idf), which depends on the Vector Space Model (VSM). The average time the system needed to retrieve between 20 and 120 pages for a number of keywords was calculated to be 4.417 sec. The increase percentage (updating rate) of the databases, measured over five weeks of experiments and tens of queries, was 5.8 pages/week. Based on recall and precision measurements, the results reach 95%-100% accuracy for some queries within acceptable retrieval time, aided by a spellcheck technique.

Keywords  Semantic Search Engine, Information Retrieval, Web Mining, Fuzzy Logic
1. Introduction

Due to the complexity of human language, the computer cannot understand and interpret users' queries well, and it is difficult to locate information on a specific website effectively and efficiently because of the large amount of information[1]. There are two main problems in this area. First, when people use natural language to search, the computer cannot understand and interpret the query correctly and precisely. Second, the large amount of information makes it difficult to search effectively and efficiently[2].

Berners-Lee et al.[3] indicated that the Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

Al-Rawi Salah et al.[4] suggested building a Semantic Guided Internet Search Engine to present an efficient search engine that crawls, indexes and ranks web pages by applying two approaches. The first was implementing semantic principles through the searching stage, which depended on a morphology concept (applying stemming) and a synonyms dictionary. The second was implementing a guided concept during the query-input stage, which assisted the user in finding suitable, corrected words. The advantage of the guided concept is to reduce the probability of inserting wrong words in a query. As for a feature-rich semantic search engine, searches are divided into a classic search and an information search. If you search for a term that has more than one meaning, it will give you the chance to choose what you were originally looking for, with its disambiguation results.

Leelanupab, T.[5] explored three aspects of diversity-based document retrieval: 1) recommender systems, 2) retrieval algorithms, and 3) evaluation measures, and provided an understanding of the need for diversity in search results from the users' perspective. He developed an interactive recommender system for the purpose of a user study. Designed to facilitate users engaged in exploratory search, the system features content-based browsing, aspectual interfaces, and diverse recommendations. While the diverse recommendations allow users to discover more and different aspects of a search topic, the aspectual interfaces allow users to manage and structure their own search process and results regarding aspects found during browsing. The recommendation feature mines implicit relevance feedback information extracted from a user's browsing trails and diversifies recommended results with respect to document contents.

* Corresponding author: [email protected] (Salah Sleibi Al-Rawi)
Published online at http://journal.sapub.org/web
Copyright © 2013 Scientific & Academic Publishing. All Rights Reserved

One way to improve the efficiency of a semantic search engine is Web Mining (WM). WM is the data mining technique that automatically discovers or extracts information from web documents. It consists of the following tasks[6]:
resource finding; information selection and pre-processing; generalization; and analysis.

The present work describes the system architecture of a collaborative approach for semantic search engine mining. The aim of this study is to enhance the design, evaluation and refinement of web mining tasks using a fuzzy logic technique for improving semantic search engine technology. All main parameters in this proposal are based on the term frequency-inverse document frequency (tf*idf), which depends on the VSM.

2. Theory and Methods

2.1 Vector Space Model (VSM)

The VSM suggests a framework in which the matching between query and document can be represented as a real number. The framework not only enables partial matching but also provides an ordering or ranking of retrieved documents. The model views documents as a "bag of words" and uses a weight-vector representation of the documents in a collection. The model provides the notions "term frequency" (tf) and "inverse document frequency" (idf), which are used to compute the weights of index terms with regard to a document. The notion "tf" is computed as the number of occurrences of that term, normally weighted by the largest number of occurrences. Mathematically this may be treated as follows.

Let N be the number of documents in the system and ni be the number of documents in which the index term ti appears. Let freq_{i,j} be the raw frequency of term ki in the document dj. Then the normalized term frequency is given by[7]:

    tf_{i,j} = freq_{i,j} / max_l freq_{l,j}    (1)

where max_l freq_{l,j} is the maximum number of occurrences of any term tl in document dj. The inverse document frequency idf is given by:

    idf_i = log(N / n_i)    (2)

The classical term weighting scheme is based on the following equation:

    w_{i,j} = tf_{i,j} × idf_i    (3)

There are also some variations of equation (3), such as the one proposed by Salton et al.[7]:

    w_{i,j} = (0.5 + 0.5 · freq_{i,j} / max_l freq_{l,j}) × log(N / n_i)    (4)

The retrieval process is based on the computation of a similarity function, also known as the cosine function, which measures the similarity of the query and document vectors[8]:

    sim(d_j, q) = (Σ_{i=1}^{t} w_{i,j} × w_{i,q}) / (sqrt(Σ_{i=1}^{t} w_{i,j}²) × sqrt(Σ_{i=1}^{t} w_{i,q}²))    (5)

where d_j and q are the vector representations of document j and query q.

The tf*idf algorithm is based on the well-known VSM, which typically uses the cosine of the angle between the document and the query vectors in a multi-dimensional space as the similarity measure. Vector length normalization can be applied when computing the relevance score R_{i,q} of page Pi with respect to query q[9]:

    R_{i,q} = Σ_j [(0.5 + 0.5 · tf(i,j) / max tf_i) · idf_j] / (sqrt(Σ_j (0.5 + 0.5 · tf(i,j) / max tf_i)²) × sqrt(Σ_j idf_j²))    (6)

where tf(i,j) is the term frequency of query term Qj in page Pi, max tf_i is the maximum term frequency of a keyword in Pi, and

    idf_j = log(N / Σ_{i=1}^{N} C(i,j))    (7)

where C(i,j) indicates whether term j occurs in page Pi. The full VSM is very expensive to implement, because the normalization factor is very expensive to compute. In the tf*idf algorithm the normalization factor is not used; that is, the relevance score is computed by:

    R(i,q) = Σ_j (0.5 + 0.5 · tf(i,j) / max tf_i) · idf_j    (8)

The tf*idf weight (term frequency-inverse document frequency) is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in IR and text mining. The tf*idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others[10]. One of the simplest ranking functions is computed by summing the tf*idf for each query term; many more sophisticated ranking functions are variants of this simple model. It can be shown that tf*idf may be derived as:

    tf*idf(t, d, D) = tf(t, d) × idf(t, D)    (9)

where D is the total number of documents in the corpus. A high weight in tf*idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the log function is always greater than or equal to 1, the value of idf (and tf*idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the log approaches 1, making idf and tf*idf approach 0. If a 1 is added to the denominator, a term that appears in all documents will have negative idf, and a term that occurs in all but one document will have idf equal to zero[11].

The above concepts were exploited to build an algorithm that can be used to reduce the frequency in the page. The proposed ranking algorithm will be implemented as given in Algorithm (1). In the suggested system, all or most of the priority parameters were taken into consideration; they are listed and defined in Table (1). Each is given a value according to its importance, and these values are estimated in a reasonable way.
Table 1. Parameters Importance Values

Tag    Value    Description
       0.10     Defines information about the document

n : Number of parameters of Style

Table 2. Style Values

Style    Value