On Designing Semantic Lexicon-Based Architectures for Web Information Retrieval
Total Page:16
File Type:pdf, Size:1020Kb
International Journal on Advances in Internet Technology, vol 3 no 1 & 2, year 2010, http://www.iariajournals.org/internet_technology/ 170 On Designing Semantic Lexicon-Based Architectures for Web Information Retrieval Vincenzo Di Lecce, Marco Calabrese Domenico Soldo DIASS myHermes S.r.l. Politecnico di Bari – II Faculty of Engineering Taranto - Italy Taranto, Italy e-mail: [email protected] e-mail: {v.dilecce, m.calabrese}@aeflab.net Abstract—In this work, a novel framework for designing Web user query to several search engines, collect their responses Information Retrieval systems with particular reference to and finally propose them to the user according to certain semantic search engines is presented. The key idea is to add the criteria. Meta-search engines, however, have still to deal semantic dimension to the classical Term-Document Matrix with the problem of mixing information coming from thus having a three-dimensional dataset. This enhancement different sources, which is an awkward task to accomplish, allows for defining a lexico-semantic user interface where the unless some semantic approach is pursued. query process is performed at the conceptual level thanks to Semantic search engines should attempt to understand the the use of a Semantic Lexicon. WordNet Semantic Lexicon is user query at the ontology level. They should also offer a used here as golden ontology for handling polysemy and pictorial representation of the retrieved dataset, letting the synonymy, hence it is useful for disambiguating user queries at user have the impression to move within a semantic search the semantic level. A layered multi-agent system is employed for supporting the design process. Particular emphasis is given space. In order to be really effective, they require a strong to formal system knowledge representation, the interface layer theoretical knowledge model, which has to be sufficiently managing user-system interaction and the markup layer robust to allow for indexing heterogeneous data scattered performing the semantic tagging process. across the Web. In this work, a novel framework for designing textual Keywords-component; information retrieval; semantic information retrieval systems with particular reference to lexicon; WordNet; MAS; semantic query semantic search engines is presented. A semantic dimension is added to the classical term-document matrix thus having a I. INTRODUCTION three-dimensional dataset. In this view, the user is forced to adopt a new semantic query paradigm, which is closer to Since its advent, the World Wide Web (hereinafter WWW human understanding than to traditional keyword-based or simply the Web) has increased dramatically in size and techniques. The query process is performed at the conceptual number of interlinked resources. This trend enforces search level thanks to the use of a Semantic Lexicon considered as a engine developers to adopt Web document indexing golden ontology useful for the sense disambiguation task. A techniques, which exchange scalability for fair layered Multi-Agent System (MAS) is employed for precision/recall performances. Surveying the literature of the supporting the whole design process. latest years it is easy to notice the growing consensus about This article is an extension of a previous work presented in the need for involving semantics in retrieval systems. the ICIW 2009 Conference [1] specifically focused on Approaches that employ low-level features as indexing semantic tagging of Web resources using MAS architecture. parameters are prone in fact to a number of pitfalls, like the In the present paper, the critical point of including semantics inherent ambiguity of polysemous query words. At the in text retrieval systems is handled under a more general and present time, commercial search engines provide relevant complete perspective that involves Web ontology modeling. responses if the user is good enough when submitting the The final aim is to bridge the gap between traditional search right query. Therefore, the access to high-quality information engines based on term-document indexing and emerging on the Web may be still problematic for unskilled users. semantic requirements by means of a suitable model, which Traditional search engines are conceptually based on a embeds terms, documents and semantics into a single term-document look-up table (also known as Term- knowledge representation. Document Matrix or TDM for short). Lexical terms The outline of the paper is as follows: Section II reports conveyed by the user query play the role of entries; related work in Information Retrieval with particular documents populate a (ranked) list of weblinks that match reference to search engines and semantic tagging aspects; user query terms according to a given metric. The user is Section III describes WordNet architecture [2] and its required to discern among the given options and choose the usefulness for the scope of this work; Section IV proposes one that is supposed to be closest to his/her intentions. the new three-dimensional information retrieval framework; A more sophisticated type of Web information retrieval Section V presents the used multi-agent system architecture; systems is represented by meta-search engines, which relay Section VI comments the carried out experiments and 2010, © Copyright by authors, Published under agreement with IARIA - www.iaria.org International Journal on Advances in Internet Technology, vol 3 no 1 & 2, year 2010, http://www.iariajournals.org/internet_technology/ 171 prototypal implementations; conclusions are sketched in between the set of query keywords and the set of document Section VII. terms. Although the query search may be restricted by using Boolean and/or operators (thus providing a more selective II. RELATED WORK filtering on the search space), the quality of document Information Retrieval (IR) is finding material (usually retrieval is significantly affected by the ranking strategy. A documents) of unstructured nature (usually text) satisfying simple comparison among the principal Web search engines an information request from within large collections (usually shows in fact how different the retrieved document could be, stored on computers) [3]. Automated IR systems are even in response to the same user query word. conceptually related to object and query. In the context of IR The well-known Page Rank Algorithm [6] has been one systems, an object is an entity, which keeps or stores of the keys to success for the Google Web search engine. It information in a database, i.e. in a structured repository. User represents undoubtedly one of the most single important queries are then matched to objects stored in the database. A contributions to the field of IR in the latest years. The Page document is, therefore, an opportune collection of data Rank Algorithm employs a fast convergent and effective objects. random-walk model for ranking graph nodes like Often the documents themselves are not kept or stored hyperlinked Web resources [7]. It is based on the bright directly in the IR system, instead they are represented in the assumptions that weblinks may be interpreted as “votes” system by document surrogates automatically generated by given from the source page to the destination page. The vote the same IR system by means of a document analysis. expressed by a link is in fact weighted by the “reference” Nowadays there are two approaches to document analysis: (Page Rank value) of the pages from where the links come, statistical and semantic. in accordance with the formula provided by the authors: The statistical approach was initially proposed by Lhun. In 1958 he wrote: “It is here proposed that the frequency of ⎛ ⎞ word occurrence in an article furnishes a useful PR()T1 PR(Tn ) PR()A = (1− d )+ d ∗⎜ +K+ ⎟ (1) measurement of word significance. It is further proposed that ⎝ C()T1 C()Tn ⎠ the relative position within a sentence of words having given values of significance furnish a useful measurement for where PR(X) function gives the Page Rank value of page X, determining the significance of sentences. The significance A is the webpage pointed by T , T ,... T webpages, C(T ) is factor of a sentence will therefore be based on a combination 1 2 n i the number of links outgoing from page Ti and finally d is a of these two measurements” [4]. It is interesting to note that properly set constant value. The previous formula is this approach is still used in many modern IR systems. recursive. By highly ranking the most referenced pages, Page On the other hand, a Semantic Information Retrieval Rank represents a good prior filter to the enormous system exploits the notion of semantic similarity (based on heterogeneous search space. In addition to this, the simple lexical and semantic relations) between concepts to graphic view provided by Google home page can be easily determine the relevancy of a certain document. One way of understood by a great variety of users. In many real cases, incorporating semantic knowledge into a representation is Google apparent precision, however, can be partially mapping document terms to ontology-based concepts. In [5], ascribed to the poor syntax underpinning the user query and for example, a formal ontology-based model for representing to the self-influence it has had on users in the way they Web resources is presented. Starting from semantic Web formulate the query. Everyone can experience how much the standards as well as established ontologies the authors retrieval performances decrease with more complex human- reformulate the IR task into a data retrieval task assuming like queries. that more expressive resources and query models allow for a The adoption of more sophisticated retrieval functions can precise match between content and information needs. In this help reduce the misbalance now pending on the ranking work instead, the term-concept mapping is provided by a algorithms. golden ontology expressed in the form of a Semantic Lexicon like WordNet. The usefulness of this choice will be B. Semantic IR techniques explained throughout the text further on.