Semantic Search Engine, Information Retrieval, Web Mining, Fuzzy Logic


International Journal of Web Engineering 2013, 3(1): 1-10
DOI: 10.5923/j.web.20130301.01

Enhancing Semantic Search Engine by Using Fuzzy Logic in Web Mining

Salah Sleibi Al-Rawi1,*, Rabah N. Farhan2, Wesam I. Hajim2

1Information Systems Department, College of Computer, Anbar University, Ramadi, Anbar, Iraq
2Computer Science Department, College of Computer, Anbar University, Ramadi, Anbar, Iraq

* Corresponding author: [email protected] (Salah Sleibi Al-Rawi)
Published online at http://journal.sapub.org/web
Copyright © 2013 Scientific & Academic Publishing. All Rights Reserved

Abstract The present work describes the system architecture of a collaborative approach for semantic search engine mining. The goal is to enhance the design, evaluation and refinement of web mining tasks using a fuzzy logic technique for improving semantic search engine technology. It involves the design and implementation of a web crawler for automating the search process, and examines how these techniques can help improve the efficiency of existing Information Retrieval (IR) technologies. The work is based on term frequency inverse document frequency (tf*idf), which depends on the Vector Space Model (VSM). The average time the system took to respond when retrieving between 20 and 120 pages for a number of keywords was calculated to be 4.417 s. The increase percentage (updating rate) was calculated on the databases over five weeks of the experiment and for tens of queries to be 5.8 pages/week. Measured by recall and precision, the results are more accurate, reaching 95%-100% for some queries within an acceptable retrieval time and with a spellcheck technique.

Keywords Semantic Search Engine, Information Retrieval, Web Mining, Fuzzy Logic

1. Introduction

Due to the complexity of human language, the computer cannot understand and interpret users' queries well, and it is difficult to determine the information on a specific website effectively and efficiently because of the large amount of information[1]. There are two main problems in this area. First, when people use natural language to search, the computer cannot understand and interpret the query correctly and precisely. Second, the large amount of information makes it difficult to search effectively and efficiently[2].

Berners-Lee et al.[3] indicated that the Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

Al-Rawi Salah et al.[4] suggested building a Semantic Guided Internet Search Engine to present an efficient search engine that crawls, indexes and ranks web pages by applying two approaches. The first was implementing semantic principles in the searching stage, which depended on the morphology concept (applying stemming) and a synonyms dictionary. The second was implementing the guided concept during the query input stage, which assisted the user in finding suitable and correct words. The advantage of the guided concept is that it reduces the probability of inserting wrong words in a query.

In a feature-rich semantic search engine, searches are divided into a classic search and an information search. If you search for a term that has more than one meaning, it will give you the chance to choose what you were originally looking for, with its disambiguation results.

Leelanupab, T.[5] explored three aspects of diversity-based document retrieval: 1) recommender systems, 2) retrieval algorithms, and 3) evaluation measures, and provided an understanding of the need for diversity in search results from the users' perspective. He developed an interactive recommender system for the purpose of a user study. Designed to facilitate users engaged in exploratory search, the system features content-based browsing, aspectual interfaces, and diverse recommendations. While the diverse recommendations allow users to discover more and different aspects of a search topic, the aspectual interfaces allow users to manage and structure their own search process and results regarding aspects found during browsing. The recommendation feature mines implicit relevance feedback information extracted from a user's browsing trails and diversifies recommended results with respect to document contents.

One way to improve the efficiency of a semantic search engine is Web Mining (WM). WM is the Data Mining technique that automatically discovers or extracts information from web documents. It consists of the following tasks[6]: resource finding, information selection and pre-processing, generalization and analysis.

The present work describes the system architecture of a collaborative approach for semantic search engine mining. The aim of this study is to enhance the design, evaluation and refinement of web mining tasks using a fuzzy logic technique for improving semantic search engine technology. All main parameters in this proposal are based on the term frequency inverse document frequency (tf*idf), which depends on the VSM.

2. Theory and Methods

2.1 Vector Space Model (VSM)

The VSM suggests a framework in which the matching between query and document can be represented as a real number. The framework not only enables partial matching but also provides an ordering or ranking of retrieved documents. The model views documents as a "bag of words" and uses a weight vector representation of the documents in a collection. The model provides the notions of "term frequency" (tf) and "inverse document frequency" (idf), which are used to compute the weights of index terms with regard to a document. The notion "tf" is computed as the number of occurrences of the term, normally weighted by the largest number of occurrences. Mathematically this may be treated as follows:

Let $N$ be the number of documents in the system and $n_i$ be the number of documents in which the index term $t_i$ appears. Let $\mathrm{freq}_{i,j}$ be the raw frequency of term $t_i$ in the document $d_j$. Then the normalized term frequency is given by[7]:

$$\mathrm{tf}_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_{l} \mathrm{freq}_{l,j}} \tag{1}$$

where $\max_{l} \mathrm{freq}_{l,j}$ is the maximum number of occurrences of any term $t_l$ in document $d_j$. The inverse document frequency, idf, is given by:

$$\mathrm{idf}_i = \log \frac{N}{n_i} \tag{2}$$

The classical term weighting scheme is based on the following equation:

$$w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i \tag{3}$$

There are also some variations of equation 3, such as the one proposed by Salton et al.[7]:

$$w_{i,q} = \left(0.5 + \frac{0.5\,\mathrm{freq}_{i,q}}{\max_{l} \mathrm{freq}_{l,q}}\right) \times \log \frac{N}{n_i} \tag{4}$$

The retrieval process is based on the computation of a similarity function, also known as the cosine function, which measures the similarity of the query and document vectors[8]:

$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}} \tag{5}$$

where $\vec{d_j}$ and $\vec{q}$ are the vector representations of document $j$ and query $q$, and $t$ is the number of index terms.

The tf*idf algorithm is based on the well-known VSM, which typically uses the cosine of the angle between the document and the query vectors in a multi-dimensional space as the similarity measure. Vector length normalization can be applied when computing the relevance score, $R_{i,q}$, of page $P_i$ with respect to query $q$[9]:

$$R_{i,q} = \frac{\sum_{Q_j \in q} \left(0.5 + 0.5\,\frac{TF(Q_j, P_i)}{TF_{\max}}\right) \cdot IDF_j}{\sqrt{\sum_{Q_j \in P_i} \left(0.5 + 0.5\,\frac{TF(Q_j, P_i)}{TF_{\max}}\right)^{2} \cdot \left(IDF_j\right)^{2}}} \tag{6}$$

where
$TF(Q_j, P_i)$: the term frequency of $Q_j$ in $P_i$
$TF_{\max}$: the maximum term frequency of a keyword in $P_i$
$IDF_j$:

$$IDF_j = \log \frac{N}{\sum_{i=1}^{N} C(Q_j, P_i)} \tag{7}$$

Here $C(Q_j, P_i)$ is 1 if page $P_i$ contains $Q_j$ and 0 otherwise, so the sum counts the pages that contain $Q_j$, as $n_i$ does in equation (2).

The full VSM is very expensive to implement, because the normalization factor is very expensive to compute. In the tf*idf algorithm, the normalization factor is not used. That is, the relevance score is computed by:

$$R_{i,q} = \sum_{Q_j \in q} \left(0.5 + 0.5\,\frac{TF(Q_j, P_i)}{TF_{\max}}\right) \cdot IDF_j \tag{8}$$

The tf*idf weight (term frequency-inverse document frequency) is a numerical statistic which reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in IR and text mining. The tf*idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others[10].

One of the simplest ranking functions is computed by summing the tf*idf for each query term; many more sophisticated ranking functions are variants of this simple model.

It can be shown that tf*idf may be derived as:

$$\mathrm{tf{*}idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \tag{9}$$

where $D$ is the total number of documents in the corpus.

A high weight in tf*idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the log function is always greater than or equal to 1, the value of idf (and tf*idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the log approaches 1, making idf and tf*idf approach 0. If a 1 is added to the denominator, a term that appears in all documents will have a negative idf, and a term that occurs in all but one document will have an idf equal to zero[11]. The above concepts were exploited to build an algorithm that can be used to reduce the frequency in the page. The proposed ranking algorithm will be implemented as given in Algorithm (1).

In the suggested system all or most of the priority parameters were taken into consideration, as listed and defined in Table (1). Each is given a value according to its importance. These values are estimated in a reasonable way.

Table 1. Parameters Importance Values

| Tag | Value | Description |
|---|---|---|
| <head> | 0.10 | Defines information about the document |
| <title> | 0.10 | Defines the title of a document |
| <base> | 0.05 | Defines a default address or a default target for all links on a page |
| <link> | 0.05 | Defines the relationship between a document and an external resource |
| <meta> | 0.05 | Defines metadata about an HTML document |
| <style> | 0.10 | Defines style information for a document |

The style value is extracted from a collection of elements as illustrated in Table (2).

n: Number of parameters of Style

Table 2. Style Values

| Style | Value |
|---|---|
| Bold | 0.20 |
| Italic | 0.20 |
| Underline | 0.20 |
| Font Size | 0.10 |
| Font Type | 0.10 |
| Color | 0.05 |
| Blue | 0.05 |
| Red | 0.05 |
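To make the weighting and scoring formulas of Section 2.1 concrete, the following is a minimal Python sketch of equations (1) to (5) and of the unnormalized relevance score of equation (8). It is an illustration under stated assumptions, not the authors' implementation: the function names, the three-document toy collection, and the choice of the natural logarithm are all hypothetical, and terms unknown to the collection are simply given an idf of zero.

```python
import math
from collections import Counter

def tf(tokens):
    """Equation (1): raw frequency divided by the largest frequency in the text."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

def idf(docs):
    """Equation (2): log(N / n_i) for every term (natural log is an assumption)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / n) for term, n in doc_freq.items()}

def doc_weights(tokens, idf_values):
    """Equation (3): w_ij = tf_ij * idf_i."""
    return {t: v * idf_values.get(t, 0.0) for t, v in tf(tokens).items()}

def query_weights(tokens, idf_values):
    """Equation (4): (0.5 + 0.5 * normalized frequency) * idf."""
    return {t: (0.5 + 0.5 * v) * idf_values.get(t, 0.0)
            for t, v in tf(tokens).items()}

def cosine(doc_w, query_w):
    """Equation (5): dot product divided by the product of the vector lengths."""
    dot = sum(w * query_w.get(t, 0.0) for t, w in doc_w.items())
    norm_d = math.sqrt(sum(w * w for w in doc_w.values()))
    norm_q = math.sqrt(sum(w * w for w in query_w.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

def relevance(query_tokens, page_tokens, idf_values):
    """Equation (8): score summed over the query terms, without normalization."""
    page_tf = tf(page_tokens)
    return sum((0.5 + 0.5 * page_tf.get(t, 0.0)) * idf_values.get(t, 0.0)
               for t in set(query_tokens))

# Hypothetical toy collection, already tokenized.
docs = [
    "fuzzy logic web mining".split(),
    "semantic search engine".split(),
    "semantic web mining crawler".split(),
]
idf_values = idf(docs)
query = "semantic mining".split()

q_w = query_weights(query, idf_values)
scores = [cosine(doc_weights(d, idf_values), q_w) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
simple_scores = [relevance(query, d, idf_values) for d in docs]
```

In this toy collection the third document contains both query terms, so it ranks first under the cosine measure and under the simpler score. The simpler score skips the two square roots of equation (5), which is the cost saving the text attributes to dropping the normalization factor.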
