LAICOS: an Open Source Platform for Personalized Social Web Search

LAICOS: An Open Source Platform for Personalized Social Web Search ∗ Mohamed Reda∗ Hakim Hacid Mokrane Bouzeghoub Bouadjenek Sidetrade PRiSM Laboratory PRiSM Laboratory 114 Rue Gallieni, 92100 Versailles University Versailles University Boulogne-Billancourt, France mokrane.bouzeghoub@ [email protected] [email protected] prism.uvsq.fr ABSTRACT sents an interesting opportunity for enhancing existing data In this paper, we introduce LAICOS, a social Web search management systems or proposing new ones. engine as a contribution to the growing area of Social In- Information Retrieval (IR) is performed every day in an formation Retrieval (SIR). Social information and personal- obvious way over the Web [1], typically under a search en- ization are at the heart of LAICOS. On the one hand, the gine. However, finding relevant information on the Web still social context of documents is added as a layer to their tex- becomes harder for end-users as: (i) usually, the user doesn’t tual content traditionally used for indexing to provide Per- necessarily know what he is looking for until he reaches it, sonalized Social Document Representations. On the other and (ii) even if the user knows what he is looking for, he hand, the social context of users is used for the query expan- doesn’t always know how to formulate the right query to sion process using the Personalized Social Query Expansion find it (except if in cases of navigational queries [7]). In ex- framework (PSQE) proposed in our earlier works. We de- isting IR systems, queries are usually interpreted and processed using document indexes and/or ontologies, which are scribe the different components of the system while relying 1 on social bookmarking systems as a source of social infor- hidden for users. The resulting documents are not neces- mation for personalizing and enhancing the IR process. We sarily relevant from an end-user perspective, in spite of the show how the internal structure of indexes as well as the ranking performed by the Web search engine. query expansion process operated using social information. To improve the IR process and reduce the amount of irrel- evant documents, there are mainly three possible improve- Categories and Subject Descriptors: H.3.3 [Informa- ment tracks: (i) query reformulation using extra knowledge, tion Systems]: Information Search and Retrieval i.e. expansion or refinement of the user query, (ii) post fil- General Terms: Algorithms, Experimentation. tering or re-ranking of the retrieved documents (based on the user profile or context), and (iii) improvement of the IR Keywords: Information Retrieval, Social networks. model, i.e. the way documents and queries are represented and matched to quantify their similarities. 1. INTRODUCTION In this demonstration, we introduce LAICOS, an open source Personalized Social Web Search Engine prototype, The Web 2.0 has strengthened end-users position in the Web in which social information and personalization are at the through their integration in the heart of the ecosystem of heart of the IR process. Actually, this prototype is relying on content generation. This has been made possible mainly social bookmarking systems as a source of social information through the availability of tools such as social networks, so- for personalizing and enhancing the IR process but can be cial bookmarking systems, social news sites, etc. impacting extended to use any source of social metadata, i.e. tweets, the way information is produced, processed, and consumed comments, etc. by both humans and machines. As a result, on the one hand, the user is no longer able to digest the large quantity of information he has access to, and is generally overwhelmed 2. ANATOMY OF LAICOS by it. On the other hand, this abundance of information captures in general an explicit feedback of users and repre- Figure 1 illustrates the different components of LAICOS, which are: (i) a set of connectors, crawlers, and a database ∗ storing the data, (ii) data indexing engines, and (iii) a query This work has been mainly done when the authors was at processing engine. In the following, we briefly describe some Bell Labs France, Centre de Villarceaux. of these components. 2.1 Crawlers in LAICOS Permission to make digital or hard copies of all or part of this work for personal or LAICOS possesses two crawlers: (i) a crawler for web classroom use is granted without fee provided that copies are not made or distributed 2 for profit or commercial advantage and that copies bear this notice and the full cita- pages based on Heritrix , which was specifically designed for tion on the first page. Copyrights for components of this work owned by others than web archiving and crawling. For each crawled web page, (ii) ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re- a folksonomy crawler engine will download all the annota- publish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. 1We also refer to documents as web pages or resources. KDD’13, August 11–14, 2013, Chicago, Illinois, USA. 2 Copyright 2013 ACM 978-1-4503-2174-7/13/08 ...$15.00. http://crawler.archive.org/index.html 1446 • Tags Docs: it stores the posting list of web pages for tags. In particular, for each tag, this structure stores: Web Interface for the id of the web page which is tagged with this tag search and results and the amount of users who have used this tag to Crawling Query-time process Folksonomy components annotate this web page. crawler PSQE Web crawler • Users Tags: it stores the posting list of tags for users. Searchers Ranking In particular, for each user, this structure stores: the id of the tag used by this user and the amount of web pages tagged by this user with this considered tag. Document collection and their Retrieval PSDV Retrieval social annotations • Bookmarks: it stores the posting list of tags for a Indexing document and a user. In particular, for each unique process pair of web page and a user, this structure stores: the Social Indexer Social Index ids of tags used by this user to annotate this web page. Textual Textual Indexer Index Structure Contents Structure Contents Figure 1: Architecture of LAICOS. Docs id (32) Tags id (32) (44) Number of users (4) (44) Number of users (4) Number of tags (4) Number of web pages (4) 3 tions that are associated to it through APIs . Annotations Byte offset in Byte offset in Tags Docs are recovered from social bookmarking systems especially, Docs Users (4) (4) the delicious website4. Users id (32) Docs Users userid (32) 2.2 Social Indexes in LAICOS (44) Number of web pages (4) (40) Number of tags (4) Number of tags (4) Byte offset in Crawled web pages and their social annotations are stored bookmarks(4) into a repository. Two indexing engines are responsible for Byte offset in indexing and keeping up to date the following index struc- Users Tags (4) tures: (1) A textual content based index structure, which is based on indexing the collection of crawled documents using Tags Docs docid (32) Users Tags tagid (32) the inverted index structure. The Apache Lucene search en- (36) Number of users (4) (36) Number of web pages (4) 5 gine leverages this part . (2) A social based index structure, Bookmarks (32) tagid(32) which is based on the crawled annotations assigned by users to web pages in social bookmarking websites. We imple- Table 1: Details on the format and compression ment our own indexing engine and structure for this part. used for each index data structure. The numbers The textual-content based index structure is well described in parenthesis are the size of each entry in bytes. in [12]. Hence, we only describe the social-based index structure of LAICOS. This index consists of the following seven main data structures (See Table 1 for the internal format 2.3 Query pre-processing engine in LAICOS and compression used for each structure): In IR systems, queries are usually pre-processed by being reformulated. This process includes either: (i) the reduction • Docs: it stores web pages’ ids (md5 hash of a web of queries, which is a technique to reduce long queries to page name), the number of tags and users associated more effective ones [11], or (ii) expansion of queries, which to the web page, as well as the offset in the Docs Users consists of enriching the user’s initial query with additional posting list. information [10]. In LAICOS, queries are interpreted and processed using • Tags: it stores tags’ ids (md5 hash of the tag text), the Personalized Social Query Expansion (PSQE) frame- the number of web pages and users associated to the work [5]. In order to achieve a social and a personalized tag, and the offset in the Tags Docs posting list. expansions of a query, PSQE considers: (i) the semantic similarity between candidate terms and the query, and (ii) • Users: it stores users’ ids (md5 hash of the user user the extent to which the candidate terms are likely to be in- name), the amount of web pages and tags associated teresting to the user. The experiments performed on the to the user, and the offset in the Users Tags posting PSQE framework in [5] show that it enhances web search list. significantly. • Docs Users: it stores the posting list of users for web 2.4 IR model in LAICOS pages. In particular, for each web page, this structure stores: the id of the user who tags this web page, the Modeling in IR consists of two main tasks [1]: (i) the def- amount of tags he has used to annotate this web page, inition of a conceptual model to represent documents and and the offset in the Bookmarks posting list.

Load more