Information Systems 56 (2016) 1–18


Social networks and information retrieval, how are they converging? A survey, a taxonomy and an analysis of social information retrieval approaches and platforms

Mohamed Reda Bouadjenek a,b,*, Hakim Hacid c, Mokrane Bouzeghoub d

a INRIA & LIRMM, University of Montpellier, France
b Department of Computing and Information Systems, University of Melbourne, Parkville 3010, Victoria, Australia
c Zayed University, United Arab Emirates
d PRiSM Laboratory, University of Versailles-Saint-Quentin-en-Yvelines (UVSQ), France

* Corresponding author. Tel.: +33643837691. E-mail addresses: [email protected] (M.R. Bouadjenek), [email protected] (H. Hacid), [email protected] (M. Bouzeghoub).
http://dx.doi.org/10.1016/j.is.2015.07.008

Article history: Received 18 October 2014; Accepted 9 July 2015; Available online 28 August 2015

Keywords: Information Retrieval; Social networks; Social Information Retrieval; Social recommendation

Abstract: There is currently a considerable amount of research aimed at bridging the gap between Information Retrieval (IR) and Online Social Networks (OSN). This is mainly done by enhancing the IR process with information coming from social networks, a process called Social Information Retrieval (SIR). The main question one might ask is: what would be the benefits of using social information (no matter whether it is content or structure) in the information retrieval process, and how is this currently done? With the growing number of efforts towards the combination of IR and social networks, it is necessary to build a clearer picture of the domain and synthesize the efforts in a structured and meaningful way. This paper reviews the different efforts in this domain and intends to provide a clear understanding of the issues as well as a clear structure of the contributions. More precisely, we propose (i) to review some of the most important contributions in this domain in order to understand the principles of SIR, (ii) a taxonomy to categorize these contributions, and finally, (iii) an analysis of some of these contributions and tools with respect to several criteria, which we believe are crucial to design an effective SIR approach. This paper is expected to serve researchers and practitioners as a reference that helps them structure the domain, position themselves and, ultimately, propose new contributions or improve existing ones.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

With the emergence of the social Web, the Web has evolved from a static Web, where users were only able to consume information, to a Web where users are also able to produce information. This evolution is commonly known as the Social Web or Web 2.0. Thus, Web 2.0 has introduced a new freedom for the user in his relation with the Web by facilitating his interactions with other users who have similar tastes or share similar resources. Social platforms and networks (such as MySpace (https://myspace.com/), Facebook (https://www.facebook.com), and LinkedIn (https://www.linkedin.com/)), collaborative tagging sites (like delicious (https://delicious.com/), CiteULike (http://www.citeulike.org/), and Flickr (https://www.flickr.com/)), and microblogging sites (like Twitter (https://twitter.com) and Yammer (https://www.yammer.com/)) are certainly the most adopted technologies in this new era. These platforms are commonly used as a means to communicate with other users, exchange messages, share resources (photos and videos), comment on news, create and update profiles, interact and play online games, etc. In addition to dedicated social platforms, traditional content provider sites such as newspapers tend to become more social, since they provide users with means for sharing, commenting, constructing, and linking documents together [1,2], e.g., through "share on social networks" buttons. This has also been facilitated by initiatives like OpenID (http://www.openid.net/) and OpenSocial (http://www.opensocial.org/).

These collaborative tasks that make users more active in generating content are among the most important factors behind the constantly growing quantity of available data. In such a context, a crucial problem is to enable users to find relevant information with respect to their interests and needs. This task is commonly referred to as Information Retrieval (IR). IR is performed every day in an obvious way over the Web [3], typically using a search engine. However, classic models of IR do not consider the social dimension of the Web. They model web pages (in this paper, we also refer to web pages as documents) as a mixture of static homogeneous terms generated by the same creators, i.e., the authors of the web pages. The ranking algorithms are then often based on (i) a query and document text similarity (e.g., the cosine measure and Okapi BM25 [4]), and (ii) the existing hypertext links that connect these web pages (e.g., PageRank [5] and HITS [6]).
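To make this non-social baseline concrete, the sketch below (an illustrative toy, not taken from any of the surveyed systems) ranks a small document collection against a query with a tf-idf weighted cosine similarity, i.e., the kind of query–document text matching mentioned above; link-based signals such as PageRank or HITS would be combined with such a score separately.

```python
import math
from collections import Counter

# Toy collection: classic IR treats a document as a bag of terms
# written by its author, with no social context around it.
docs = {
    "d1": "social networks connect users and share content",
    "d2": "information retrieval ranks documents for a query",
    "d3": "users annotate web pages with tags in folksonomies",
}

def tf_idf_vectors(corpus):
    """Build sparse tf-idf vectors for a small in-memory corpus."""
    tokenized = {d: text.split() for d, text in corpus.items()}
    n_docs = len(corpus)
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    vectors = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        vectors[d] = {t: (1 + math.log(c)) * math.log(n_docs / df[t])
                      for t, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

doc_vectors = tf_idf_vectors(docs)
query = {"information": 1.0, "retrieval": 1.0, "query": 1.0}
ranking = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]),
                 reverse=True)
print(ranking)  # 'd2' comes first for this query
```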
Therefore, classic models of IR, and even the IR paradigm itself, have to be adapted to the socialization of the Web in order to fully leverage the social context that surrounds web pages and users. Indeed, exploiting social information has a number of advantages (for IR in particular): first, feedback information in social networks is provided directly by the user, so accurate information about users' interests can be harvested as people actively express their opinions on social platforms. Second, a huge amount of social information is published and available with the agreement of the publishers. Exploiting this information should not violate user privacy, in particular social tagging information, which does not contain sensitive information about users. Finally, social resources are often accessible, as most social networks provide APIs to access their data (even if, often, a monetized contract must be established before any large scale use).

There is currently a number of research works undertaken to improve the IR process with information coming from social networks. This is commonly known as "Social Information Retrieval" (SIR). This paper intends to provide a clear understanding of the various efforts performed in the domain of SIR, and in this perspective, we propose:

1. An objective review of some of the most representative research contributions and existing tools in this domain to understand the principles of SIR as they are currently formulated.
2. A taxonomy to categorize these contributions in order to structure this wide domain.
3. An analysis of some of these contributions and tools with respect to several criteria considered as crucial to design an effective SIR approach or to appreciate one.

This study is useful for researchers, engineers, and practitioners in this domain to help in understanding the trends, challenges and expectations of a SIR approach. The rest of this paper is organized as follows: in Section 2, we discuss the main concepts used throughout this paper. In Section 3, we introduce our taxonomy of SIR approaches and tools, while describing each of its categories in Sections 4–6. In Section 7, we elaborate on an analysis of some of these contributions and tools based on several dimensions. Finally, we give some future directions in Section 8 and we conclude in Section 9.

2. Background

Information Retrieval is the process of recovering stored information from large datasets to satisfy users' information needs. Salton [7] and Baeza et al. [3] defined IR as follows:

Definition 1 (Information Retrieval). Information Retrieval (IR) is the science that deals with the representation, storage, organization of, and access to information items in order to satisfy the user requirements concerning this information.

An IR system is evaluated on its accuracy and its ability to retrieve high quality information/documents that maximize users' satisfaction, i.e. the more the answers correspond to the users' expectations, the better the system is. Information Retrieval is a very well established domain, and several articles and books are devoted to it, e.g., [3,7]. Since our focus is on the social part of IR, we do not go into more detail in the description of the IR domain.

This section introduces and defines the basic concepts used throughout this paper. In Section 2.1 we define what a social network is and discuss the main models of social relationships (note that we do not intend to provide a complete review of the SNA domain, but rather to present the main underlying principles, which could be leveraged in IR systems and which are helpful to understand the analysis provided in this paper). Then, in Section 2.2, we introduce the notion of Social Information Retrieval, which bridges social networks and information retrieval.

2.1. Social networks and social network analysis

Nowadays, social networks are at the heart of the Web 2.0. A social network is defined as follows:

Definition 2 (Online social network). An online social network is the social structure which emerges from human interactions through a networked application.


Example 3 (Online social network). Facebook is certainly the most popular social network that handles relationships between individuals. There are many social network sites that manage, in addition to users, objects. These include documents on CiteULike, images on Flickr, videos on YouTube, etc.

The structure of a social network can be constructed in two ways: (i) either explicitly declared by the user, e.g., friendship links in Facebook, or (ii) implicitly inferred from the behavior and the common interests of users, e.g., the Social Network of Web services [8]. To understand the underlying social structures and phenomena, a set of techniques and methods exist, which are known as Social Network Analysis (SNA) techniques [9]. SNA introduces methods and metrics (e.g., centrality and influence) for analyzing a social network, e.g., for measuring the role of individuals and groups of individuals in a social network.

Each social network might be characterized by the relationships that link its users, e.g., friend relationships, follower–following relationships, and publisher–subscriber relationships. Hence, we mainly distinguish three models of social relationships, described hereafter (we only describe the main models of social relationships that we believe are the most adopted in social networks).

2.1.1. Symmetric relationships

Many social networks manage symmetric relationships, which translate the same consideration of the relation by the entities, i.e. users, participating in it. Social networks that include these relationships allow, for example, users to maintain a list of friends and thus create friendship relations. The friendship relation is of the form: Alice considers Bob as a friend and she explicitly adds Bob to her list of friends; the relation is instantiated once Bob accepts the request. The strength of a friendship link between two users can then reflect, for example, the degree of trust and the degree of mutual interest. This model is more dedicated to creating and maintaining personal relationships with trusted persons with whom one is expected to share personal elements and contents. Facebook is a good example of a social network that includes these relationships: if we consider the users Alice and Bob on Facebook, and Alice has a relationship with Bob, then Bob has the same relationship with Alice.

2.1.2. Asymmetric relationships

Similarly, many social networks manage social relationships that can link two users with two different perspectives, depending on the user. These relationships illustrate the concept of follower–following or publisher–subscriber and are generally at the heart of microblogging platforms, e.g., Twitter and Yammer. These social networks allow a user to create and maintain a list of followed people by permitting him to subscribe to their information streams. This model of social networks is more dedicated to the dissemination of information than to the mutual sharing of information. For example, if we consider a relationship between a user Alice and a user Bob on Twitter, Alice may subscribe to the content published by Bob, while Bob does not necessarily subscribe to the content of Alice, as he considers her from a different perspective, e.g., his own perspective.

2.1.3. k-partite relationships

The two previous models of social relationships generally involve only one type of node, i.e. users. In this last model, social relationships imply k types of nodes, e.g., users, resources, and tags. Social bookmarking websites are representatives of such models.

Social bookmarking websites are based on the techniques of social tagging or collaborative tagging. The principle behind social bookmarking platforms is to provide the user with means to annotate resources on the Web, e.g., URIs on delicious, videos on YouTube, images on Flickr, or academic papers on CiteULike. Thus, there are three different kinds of entities, users, tags, and resources, involved in 3-partite relationships. The annotations (also called tags) can be shared with other users. This unstructured (or freely structured) approach to classification, with users assigning their own labels, is often referred to as a folksonomy [10,11].
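As an illustration of this 3-partite structure, the following sketch (a toy model, not tied to any particular platform) stores a folksonomy as a set of (user, tag, resource) triples and derives the two projections that SIR approaches typically exploit: the tag profile of a user and the social annotation profile of a resource.

```python
from collections import defaultdict

# A folksonomy is a set of (user, tag, resource) assignments.
triples = [
    ("alice", "python",   "https://docs.python.org"),
    ("alice", "tutorial", "https://docs.python.org"),
    ("bob",   "python",   "https://docs.python.org"),
    ("bob",   "search",   "https://example.org/ir-intro"),
    ("carol", "search",   "https://example.org/ir-intro"),
]

user_tags = defaultdict(lambda: defaultdict(int))      # user -> tag -> count
resource_tags = defaultdict(lambda: defaultdict(int))  # resource -> tag -> count

for user, tag, resource in triples:
    user_tags[user][tag] += 1
    resource_tags[resource][tag] += 1

# The user projection is a simple interest profile; the resource projection
# is a crowd-sourced summary that can complement the page's own text.
print(dict(user_tags["alice"]))                         # {'python': 1, 'tutorial': 1}
print(dict(resource_tags["https://docs.python.org"]))   # {'python': 2, 'tutorial': 1}
```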
It is important to notice that most of the existing social networking platforms are not restricted to managing only one kind of social graph; they may manage several social relationships. For example, delicious, which is a tagging based platform, manages (1) a social graph of ternary relationships between users, tags, and documents and (2) a publisher–subscriber social graph that provides users with means to view all the bookmarks saved by interesting people, such as friends, co-workers, and favorite bloggers.

Finally, besides these social relationships, social content is generated and exchanged between users. This content is the second important component of online social networks, and it is leveraged in several domains, such as the SIR domain discussed in this paper.

In the next section, we discuss the concept of social information retrieval, which links IR and social networks.

2.2. Social information retrieval

With the intrinsic social property of the Web, a large range of applications and services make the user more interactive with Web resources, and a lot of information that concerns both users and resources is constantly generated. This information can be very useful in information retrieval tasks for both user and resource modeling. However, classical models of IR are blind to this social context that surrounds both users and resources. Therefore, the domains of IR and SNA have been bridged, resulting in Social Information Retrieval (SIR) models [12]. Very often, SIR models extend conventional IR models in order to incorporate social information. However, as we shall discuss in the next sections, new IR paradigms and concepts have also emerged in order to provide a new way of operating information retrieval.

The meaning of the concept Social Information Retrieval can be very broad, but we propose the following definition:

Definition 4 (Social Information Retrieval). Social Information Retrieval is the process of leveraging social information (both social relationships and social content) to perform an IR task with the objective of better satisfying the users' information needs.

SIR aims to provide relevant content and information to users in the domains of information retrieval, research, and recommendation, covering topics such as social tagging, collaborative querying, social network analysis, subjective relevance judgments, Q&A systems, and collaborative filtering [12].
Several existing platforms investigate this track in order to improve the search paradigm, as illustrated in Fig. 1. These include Social Bing, Google+, Aardvark, Yahoo! Answers, etc. Research in this domain has emerged and become very present in the daily (virtual) life of users. Investigating the IR domain from this perspective seems to be very promising to improve the representation, the storage, the organization of, and the access to information. The large number of works in this domain is certainly a good indicator of the interest it is attracting.

In order to provide a better overview and understanding of all the work done in the SIR domain, we propose a taxonomy to categorize and classify the proposed methods. Basically, this taxonomy considers these different contributions from the point of view of the way they exploit social information. Each category in the taxonomy is then discussed and examples are given. This taxonomy is presented in the next section.

3. A taxonomy for social information retrieval

There are many contributions in the domain of SIR. Each of these contributions considers a particular social network type and uses social information to perform an IR task differently. For example, on the one hand, tags in folksonomies have been found useful for Web search, personalized search and contextualized search. Indeed, Gupta et al. [13] provide a survey summarizing different properties of tags along with their usefulness for IR, such as tag semantics, recommendations using tags, tag profiling, etc. Also, Heymann et al. [14] analyze folksonomies and conclude that social bookmarking systems can provide search results not currently provided by a conventional search engine (approximately 25% of URLs posted by users are new, unindexed pages). On the other hand, microblogging systems like Twitter have also been found useful for users to share or submit specific questions to be answered by friends, families, colleagues, or even an unknown person (using a hashtag for a specific topic).

From these two examples, we understand that different models of social networks have been used differently for IR tasks. Thus, through the bibliographical study that we have performed, we distinguished several SIR approaches that can be categorized according to the way social information is used.

Fig. 1. Examples of two social search engines that allow users to both submit questions to be answered and answer questions asked by other users. (a) Yahoo! Answers. (b) Aardvark.

Fig. 2. A taxonomy for Social Information Retrieval models. The taxonomy is organized as follows:

- Social Information Retrieval
  - Social Web Search
    - Query Reformulation
      - Query Reduction (no contribution identified)
      - Query Expansion (QE): Personalized QE; Non-Personalized QE
    - Results Ranking: Personalized Ranking; Non-Personalized Ranking
    - Indexing: Personalized Indexing; Non-Personalized Indexing
  - Social Search
    - Q&A Tools
    - Social Content Search
    - Collaborative Search
  - Social Recommendation
    - Item Recommendation
    - User Recommendation
    - Topic Recommendation

Therefore, we propose a taxonomy for grouping these different initiatives and building a common understanding of this domain. Fig. 2 summarizes this taxonomy of SIR models, which is mainly composed of three categories:

1. Social Web Search, in which social information is used in order to improve the classic IR process, e.g., document re-ranking, query reformulation, and user profiling. We discuss this category of SIR approaches in Section 4.
2. Social Search, in which it is a matter of finding information only with the assistance of social resources, such as by asking friends or unknown persons online for assistance [15]. This second category is discussed in Section 5.
3. Social Recommendation, in which the user's social network is used to provide better recommendations, e.g., using a social trust network [16]. This category is discussed in Section 6.

Several contributions closely related to the domain of Social Information Retrieval are discussed in this paper. The objective is not to discuss the whole set of contributions but to point to the closest ones as illustrative contributions. In the following, we discuss and detail these different categories, while giving some illustrative examples.

4. Social web search

We consider this category to include techniques that improve the conventional IR process using social information. In existing IR systems, queries are usually interpreted and processed using document indexes and/or ontologies, which are hidden from users. The resulting documents are not necessarily relevant from an end-user perspective, in spite of the ranking performed by the search engine.

To improve the classic IR process and reduce the amount of irrelevant documents, there are mainly three possible improvement tracks: (i) query reformulation, which includes expansion or reduction of the query, (ii) post-filtering or re-ranking of the retrieved documents (based on the user profile or context), and (iii) improvement of the IR model, i.e. the way documents and queries are represented and matched to quantify their similarities. Here, we consider the use of social information in these three tracks.

4.1. Query reformulation

In IR systems, users generally express their needs through a set of keywords that summarize their information needs. Thus, different users are expected to use different keywords to express the same need (e.g., synonyms), and vice versa (i.e., the same keywords can be used by different users to express different information needs). Query reformulation can bring a solution to this problem. It is defined as follows:

Definition 5 (Query reformulation). Query reformulation is the process which consists of transforming an initial query Q into another query Q′. This transformation may be either a reduction or an expansion. Query Reduction (QR) [17] reduces the query such that superfluous information is removed, while Query Expansion (QE) [18] enhances the query with additional information likely to occur in relevant documents.

To the best of our knowledge, there are no contributions in Query Reduction using social information; all the existing work focuses on Query Expansion. In the latter, we distinguish two types of approaches: (i) non-personalized QE and (ii) personalized QE.

4.1.1. Non-personalized social query expansion

In traditional query expansion methods, the database used for the expansion is often constructed according to the comparison between term distributions in the retrieved documents and in the whole document collection. Here, this database of terms is enhanced using social information, without any personalization. The underlying idea is therefore to leverage the interactions of users with the system to implicitly and collaboratively build a database of terms, which is expected to feed the expansion process. This yields a user-based vocabulary source for query expansion.

Many methods have been proposed in this area. Lioma et al. [19] provide social QE by considering query expansion as a logical inference and the addition of tags as an extra deduction in this process. In the same spirit, Jin et al. [20] propose a method in which the expansion terms are selected from a large amount of social tags in a folksonomy. A tag co-occurrence method for similar term selection is used to choose good expansion terms from the candidate tags, directly according to their potential impact on the retrieval effectiveness. The work in [21] proposes a unified framework to address complex queries on multi-modal "social" collections; the proposed approach includes a query expansion strategy that incorporates both textual and social elements. Finally, Lin et al. [22] propose to enrich the source of expansion terms, initially composed of relevance feedback data, with social annotations. In particular, they propose a learning-based term ranking approach built on this source in order to enhance and boost the IR performance.
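A minimal sketch of the co-occurrence intuition behind such non-personalized approaches is given below; it is not the algorithm of [19–22], and the data and weighting are assumed for illustration, but it shows how a collaboratively built tag database can suggest expansion terms for a query.

```python
from collections import defaultdict
from itertools import combinations

# Tag assignments per resource, e.g. harvested from a social bookmarking site.
resource_tags = {
    "r1": {"python", "programming", "tutorial"},
    "r2": {"python", "programming", "web"},
    "r3": {"java", "programming"},
    "r4": {"python", "data", "science"},
}

# Count how often two tags annotate the same resource.
cooc = defaultdict(lambda: defaultdict(int))
for tags in resource_tags.values():
    for a, b in combinations(sorted(tags), 2):
        cooc[a][b] += 1
        cooc[b][a] += 1

def expand(query_terms, k=2):
    """Add the k tags that co-occur most often with the query terms."""
    scores = defaultdict(int)
    for term in query_terms:
        for other, count in cooc.get(term, {}).items():
            if other not in query_terms:
                scores[other] += count
    extra = sorted(scores, key=scores.get, reverse=True)[:k]
    return list(query_terms) + extra

print(expand(["python"]))  # e.g. ['python', 'programming', ...]
```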
4.1.2. Personalized social query expansion

In query expansion, providing merely a uniform expansion to all users is often neither really suitable nor efficient, as the relevance of documents is relative for each user [23]. Thus, a simple and uniform query expansion is not enough to provide satisfactory search results for each user. Personalized social query expansion refers to the process of expanding the same query differently for each user using social information.

Example 6. Let us consider the query Q = "Computer science". The user Bob may have the expanded query Q′ = "Computer science technology programming java", whereas the expanded query Q′ = "Computer science technology Internet information" may be more suitable for Alice, depending on their topics of interest.

Several efforts have been made to tackle this problem of personalized query expansion, in particular in the context of folksonomies. Hence, the authors in [24,25] consider SIR from both the query expansion and the results ranking perspectives. Briefly, this strategy consists of adding to the query q the k expansion tags with the largest similarity to the original tags in order to enrich its results. For each query, the query initiator u ranks results using BM25 and tag similarity scores. Bertier et al. [26] propose the TagRank algorithm, an adaptation of the PageRank algorithm, which automatically determines which tags best expand a list of tags in a given query. This is achieved by creating and maintaining a TagMap matrix, a central abstraction that captures the personalized relationships between tags, which is constructed by dynamically computing an estimation of the distance between taggers, based on the cosine similarity between tags and items.

Biancalana et al. [27] proposed Nereau, a query expansion strategy where the co-occurrence matrix of terms in documents is enhanced with metadata retrieved from social bookmarking services. The system can record and interpret users' behavior in order to provide personalized search results according to their interests, in such a way that candidate expansion terms are selected based on the original terms inserted by the user. The input queries are analyzed according to the collected data; the system then produces different query expansions, each one for a different semantic field, before carrying out the search. The final result is a page in which results are grouped in different blocks, each of them categorized through keywords to facilitate for the user the choice of the result that is most coherent with his interests. This method has been improved in [28] using three-dimensional matrices, where the added dimension is represented by semantic classes (i.e., categories comprising all the terms that share a semantic property) related to the folksonomy extracted from social bookmarking services.

Finally, Bouadjenek et al. [29] propose an approach that ranks terms for expansion purposes. The ranking process takes into account: (i) the semantic similarity between the tags composing a query, and (ii) a social proximity between the query and the user for a personalized expansion.
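To make the contrast with the previous subsection concrete, the following sketch shows one possible scoring scheme (assumed for illustration, not taken from [24–29]) in which an expansion candidate is ranked by a blend of its similarity to the query tags and its weight in the profile of the issuing user; the parameter alpha and the toy similarities and profiles are hypothetical.

```python
def personalized_expansion(query_tags, candidates, tag_sim, user_profile,
                           alpha=0.6, k=2):
    """Rank candidate tags by a blend of query similarity and user affinity.

    tag_sim:      dict mapping (tag_a, tag_b) -> similarity in [0, 1]
    user_profile: dict mapping tag -> normalized usage frequency for this user
    """
    def score(tag):
        q_sim = max(tag_sim.get((tag, q), tag_sim.get((q, tag), 0.0))
                    for q in query_tags)
        return alpha * q_sim + (1 - alpha) * user_profile.get(tag, 0.0)

    ranked = sorted(candidates, key=score, reverse=True)
    return list(query_tags) + ranked[:k]

# Hypothetical similarities and profiles for the users of Example 6.
tag_sim = {("computer science", "programming"): 0.7,
           ("computer science", "java"): 0.5,
           ("computer science", "internet"): 0.6,
           ("computer science", "information"): 0.4}
bob_profile = {"programming": 0.9, "java": 0.8, "internet": 0.1}
alice_profile = {"internet": 0.9, "information": 0.7, "java": 0.05}

candidates = ["programming", "java", "internet", "information"]
print(personalized_expansion(["computer science"], candidates, tag_sim, bob_profile))
print(personalized_expansion(["computer science"], candidates, tag_sim, alice_profile))
```

With these assumed values, Bob's query is expanded with "programming" and "java" while Alice's is expanded with "internet" and "information", mirroring Example 6.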
captures the personalized relationships between tags, The work in [36] proposes a method to use microblogging which is constructed by dynamically computing the esti- data stream to compute novel and effective features for mation of a distance between taggers, based on cosine ranking fresh URLs, i.e., “uncrawled” documents likely to be similarity between tags and items. relevant to queries where the user expects documents which Biancalana et al. [27] proposed Nereau, a query expan- are both topically relevant as well as fresh. The proposed sion strategy where the co-occurrence matrix of terms in method consists of a machine-learning based approach, that documents is enhanced with meta-data retrieved from predicts effective rankings for queries–url pair. Recently He social bookmarking services. The system can record and et al. [37] propose a new method to predict popularity of interpret users' behavior, in order to provide personalized items (i.e., webpages) based on users' comments, and to search results, according to their interests in such a way incorporate this popularity into a ranking function. that allows the selection of terms that are candidates of the expansion based on original terms inserted by the user. The 4.2.2. Personalized ranking input queries are analyzed according to collected data, then Several approaches have been proposed to personalize the system makes different query expansions, each one to a ranking of search results using social information [24,38–43]. different semantic field, before carrying out the search. The Almost all these approaches are in the context of folkso- final result is a page in which results are grouped in dif- nomies and follow the common idea that the ranking score ferent blocks, each of them categorized through keywords of a document d retrieved when a user u submits a query Q is to facilitate for the user the choice of the result that is most driven by (i) a term matching, which calculates the similarity coherent with his interests. This method has been improved between Q and the textual content of d to generate a user in [28] using three-dimensional matrices, where the added unrelated ranking score; and (ii) an interest matching, which dimension is represented by semantic classes (i.e., cate- calculates the similarity between u and d to generate a user gories comprising all the terms that share a semantic related ranking score. Then a merge operation is performed property) related to the folksonomy extracted from social fi bookmarking services. to generate a nal ranking score based on the two previous Finally, Bouadjenek et al. [29] propose an approach that ranking scores. A number of these algorithms are reviewed considers ranking terms for expansion purposes. The ranking and evaluated in [44], while considering different social process takes into account: (i) the semantic similarity contexts. between tags composing a query, and (ii) a social proximity between the query and the user for a personalized expansion. 4.3. Indexing and modeling using social information

4.3. Indexing and modeling using social information

In the social Web, a social context is often associated with web pages, and it can tell a lot about their content. For example, this social context includes annotations, comments, "like" mentions, etc. Consequently, as pointed out in [45], social contextual summarization is required to strengthen the textual content of web pages. Several research works [46–49] reported that adding tags to the content of a document enhances the search quality, as tags are good summaries of documents [49] (e.g., document expansion [50,14]). In particular, social information can be useful for documents that contain few terms, where a simple indexing strategy is not expected to provide good retrieval performance (e.g., the Google homepage, http://www.google.com/: there are only very few terms on the page itself, but thousands of up-to-date annotations available on delicious are associated with it, so its social annotations are eventually more useful for indexing).

Throughout our analysis of the state of the art, we noticed that social information has mainly been used in two ways for modeling and enhancing document representation: (i) either by adding social metadata to the content of documents, e.g., document expansion, or (ii) by personalizing the representation of documents, following the intuition that each user has his own vision of a given document.

4.3.1. Document expansion (non-personalized indexing)

Some works investigate the use of social metadata for enriching the content of documents. In [47,51,48,52], the authors index a document with both its textual content and its associated tags, modeled as in the Vector Space Model (VSM) [53]. However, each method uses a different algorithm for weighting the social metadata, e.g., tf-idf [47,52], term quality [47,51], etc. Also, Zhang et al. [46] propose a framework to enhance document representation using social annotations. The framework consists in representing a Web document in a dual-vector representation: (i) an enhanced textual content vector and (ii) an enhanced social content vector, each component being calculated from the other.
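A minimal sketch of this kind of document expansion is shown below; the uniform tag boost is an assumed stand-in for the tf-idf or term-quality weighting used in the cited work.

```python
from collections import Counter

def expanded_vector(content, tags, tag_weight=2.0):
    """Index a document by its textual content plus its social tags.

    Tags are counted like ordinary terms but boosted by tag_weight, a
    simple placeholder for the weighting schemes used in the literature.
    """
    vector = Counter(content.split())
    for tag, count in Counter(tags).items():
        vector[tag] += tag_weight * count
    return dict(vector)

homepage_text = "search images maps play news"            # very few terms
homepage_tags = ["search", "google", "engine", "web", "google"]

print(expanded_vector(homepage_text, homepage_tags))
# The tag vocabulary ('google', 'engine', 'web') now makes the page
# retrievable for queries its own text does not cover.
```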

4.3.2. Personalized indexing and modeling of documents

Given a document, each user has his own understanding of its content. Therefore, each user employs a different vocabulary to describe, comment on, and annotate this document. For example, if we look at the homepage of YouTube, a given user can tag it using "video", "Web" and "music", while another can tag it using "news", "movie", and "media".

Following this observation, Bouadjenek et al. [45] proposed a framework for modeling personalized representations of documents based on social annotations. This framework is based on matrix factorization, where the idea is to provide a personal representation of a given document for a particular user, which is then used for query processing. Also, Amer-Yahia et al. [54] investigate efficient top-k processing in collaborative tagging sites. The idea is that the score of an answer is computed as its popularity among the members of a seeker's network. Basically, the solution is to create personalized indexes based on clustering strategies, which achieve different compromises between storage space and processing time. Finally, in [55] the authors proposed a dual personalized ranking function, which adopts two profiles: an extended user profile and a personalized document profile. Briefly, for each document, the method computes for each individual user a personalized document profile to better summarize his/her perception of this document. The proposed solution is to estimate this profile based on the perception similarities between users.

5. Social search

Social platforms like Twitter and Facebook allow users to share and publish information with their friends and often with the general public. In addition to this, users use them to answer very precise and highly contextualized queries, or queries for which the relevant content has not been authored yet, e.g., asking about a conference event using its hashtag on Twitter. We refer to such a process of finding information as Social search, and we define it as follows.

Definition 7 (Social search). Social search is the process of finding information only with the assistance of social entities, by considering the interactions or contributions of users.

Thus, social search is associated with platforms that are defined as search engines specifically dedicated to social data management, such as Facebook. The main ingredient to perform a social search is user interaction, including (i) social content (e.g., comments and tweets) and (ii) social relations (e.g., finding a person with a certain expertise). Hence, social search systems index either social content and offer a means for users to search that content [56], or social relations and allow the user to find persons who are likely to respond to specific needs [57]. We divide social search into three main categories: (1) question/answering tools, (2) content search, and (3) collaborative search. These categories are detailed in the following.

5.1. Social question/answering (Q&A)

Despite the development of Web search assistance techniques that help users express their needs, such as navigational queries [58] and query auto-completion [59,60], many queries still remain unanswered. Dror et al. [61] argue that this is mainly due to two reasons: (i) the intent behind the query not being well expressed/captured and (ii) the absence of relevant content.

To tackle these issues, Question/Answering (Q&A) systems have emerged to connect people so that they can help each other answer questions. Examples of such systems include Yahoo! Answers (http://answers.yahoo.com/), WikiAnswers (http://wiki.answers.com/), Aardvark (acquired by Google on February 11, 2010; in September 2011, Google announced it would discontinue a number of its products, including Aardvark), ChaCha (http://www.chacha.com/), and Ask.com (http://www.ask.com/). Basically, Q&A systems provide a means for answering several types of questions, such as recommendation (e.g., building a new playlist, any ideas for good running songs?), opinion seeking (e.g., I am wondering if I should buy the Kitchen-Aid ice cream maker?), factual knowledge (e.g., Does anyone know a way to put Excel charts into LaTeX?), and problem solving (e.g., How do I solve this Poisson distribution problem?). Morris et al. [15] conducted a survey in which they study the types of questions asked, their frequency, and users' motivations for asking their social network rather than using a traditional search engine.

The main problem facing such systems is the response time as well as the quality of the answers. For example, Zhang et al. [62] report that on the Java Developer Forum, the average waiting time of a high expertise user to get a reply to a question is about 9 h, compared with 40 min for a low expertise user. As for Microsoft's Live Q&A site, Hsieh and Counts [63] state that 80% of queries receive an answer, with an average response time of 2 h and 52 min. As for Aardvark [57], 87.7% of questions received at least one answer, and 57.2% received their first answer in less than 10 min.
As for Social platforms like Twitter and Facebook allow users to Aadvark [57], 87.7% of questions received at least 1 answer, share and publish information with their friends and often and 57.2% received their first answer in less than 10 min. with general public. In addition to this, users use them to Dror et al. [61], study a dataset of 4 month period of Yahoo answer very precise and highly contextualized queries, or queries for which the relevant content has not been 15 authored yet, e.g., asking about a conference event using its http://answers.yahoo.com/ 16 http://wiki.answers.com/ hashtag on Twitter. We refer to such a process of finding 17 Aardvark has been acquired by Google on February 11, 2010. In information as Social search,andwedefine it as follows. September 2011, Google announced it would discontinue a number of its products, including Aardvark. fi 18 De nition 7 (Social search). Social search is the process of http://www.chacha.com/ finding information only with the assistance of social 19 http://www.ask.com/ 8 M.R. Bouadjenek et al. / Information Systems 56 (2016) 1–18

Dror et al. [61] study a dataset covering a 4-month period of Yahoo! Answers interactions, and state that the average time for answering queries is 10 min, while almost all answers were given within 60 min of the question's creation time. Moreover, in order to properly rank answers in Q&A systems, an interesting method has been proposed by Dalip et al. [64]. The authors proposed a learning approach for ranking that uses features specific to the Q&A domain, which was able to significantly outperform state-of-the-art baselines.

5.2. Social content search

Social platforms allow users to provide, publish and spread information, e.g., commenting or tweeting about an event. In such a context, a huge quantity of information is created in social media, which represents a valuable source of relevant information. Hence, many users use social media to gather recent information about a particular event by searching collections of posts, messages and statuses. Therefore, social content search systems come as a means to index content explicitly created by users on social media and to provide real-time search support [65].

There are several social content search engines that index such real-time content streams. These include TwitterSearch (https://twitter.com/search-home), Social Bing (http://www.bing.com/social), Collecta [66], OneRiot [67] (OneRiot was acquired by Walmart in September 2011), etc. Social content search systems deal with a different kind of content than classic search engines. Indeed, posts and statuses published on social media are often short, frequent, and do not change after being published, while web pages are rich, generated more slowly, and evolve after creation [56]. Dealing with such content is challenging because it requires real-time, recency-sensitive query processing; a recency-sensitive query refers to a query for which the user expects documents that are both topically relevant and fresh [36,68]. A study performed by Teevan et al. [56] gives an overview of the motivations that lead a user to use a social content search system rather than a classic search engine. This study reveals that social content search systems receive queries that are shorter, more popular, and less likely to evolve as part of a session than Web queries. The main goal is to find temporally relevant information (e.g., breaking news, real-time content, and popular trends) and information related to people (e.g., content directed at the searcher, information about people of interest, and general sentiment and opinion).
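One simple way to make scoring recency-sensitive, shown here only as an illustrative formula rather than the method of any surveyed system (which typically learn such signals), is to decay a topical relevance score with the age of a post; the half-life value is an assumed knob.

```python
import math

def recency_score(topical_relevance, age_hours, half_life_hours=6.0):
    """Decay topical relevance exponentially with the age of a post.

    half_life_hours controls how quickly freshness fades; it is an
    assumed parameter, not a value taken from the surveyed systems.
    """
    decay = math.exp(-math.log(2) * age_hours / half_life_hours)
    return topical_relevance * decay

# A slightly less relevant but very fresh post can outrank an older one.
print(recency_score(0.8, age_hours=1))    # ~0.71
print(recency_score(0.9, age_hours=12))   # ~0.22
```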
Dong et al. [36] propose a method that uses a microblogging data stream to compute novel and effective features for ranking fresh URLs, i.e., "uncrawled" documents likely to be relevant to recency-sensitive queries. The proposed method consists of a machine-learning based approach that predicts effective rankings for query–URL pairs. Also, Efron et al. [68] propose to estimate the temporal density of relevant documents, starting with an initial set of results from a baseline retrieval model; a re-ranking of the results is then performed. Their contributions lie in a method to characterize this temporal density function using kernel density estimation, with and without human relevance judgments, and in an approach to integrating this information into a standard retrieval model. Finally, Vosecky et al. [69] propose a framework for collaborative personalized Twitter search, which exploits the user's social connections in order to obtain a comprehensive account of her preferences. This framework includes a novel user model structure to manage the topical diversity in Twitter and to enable query disambiguation.

5.3. Social collaborative search

One of the weaknesses of the search engines available today (e.g., Google, Yahoo!, Bing) is the fact that they are designed for a single user who searches alone. Thus, users cannot benefit from each other's experience for a given search task. Morris [70,71] conducted a survey of 204 knowledge workers in a large technology company which revealed that 97% of respondents reported engaging in at least one of the collaborative search tasks described in the survey. For example, 87.7% of respondents reported having watched over someone's shoulder for query suggestions, and 86.3% reported having e-mailed someone to share the results of a Web search.

In such a context, Morris [72] developed SearchTogether, a collaborative search interface where several users who share an information need collaborate and work together to fulfill that need. The authors discuss the way SearchTogether facilitates collaboration by satisfying criteria like awareness, division of labor, and persistence. Similarly, Filho et al. [73] proposed Kolline, a search interface that aims at facilitating information seeking for inexperienced users by allowing more experienced users to collaborate with them.

Paul and Morris [74] investigate sensemaking for collaborative Web search, where sensemaking is defined as the act of understanding information. The study revealed several themes regarding the sensemaking challenges of collaborative Web search, e.g., awareness, timeliness and sensemaking hand-off. Based on their findings, they proposed CoSense, a system that supports sensemaking for collaborative Web search tasks and provides enhanced group awareness by including a time-line view of all queries executed during the search process. Even though these features help to enhance participants' communication and sensemaking during their search activities, users still have to sort among different documents and analyze them one by one to find relevant information.

Finally, SearchTeam (http://searchteam.com/) is a concrete example of a collaborative search system currently available. Collaborative search in SearchTeam is conducted within a SearchSpace: collaborators search the Web together, save and edit their results in the SearchSpace, and pick up next time where they left off. Results are organized into folders within the SearchSpace. Within these folders, collaborators can comment on search results, "like" them, post their thoughts, add search results, post links, or upload documents.

Users can log in using their Facebook, Twitter, Google, LinkedIn, and Yahoo accounts, and can pick collaborators from their contacts and social networks.

In summary, in this section we discussed the Social Search category of SIR. We showed that the search paradigm has evolved to provide users with tools and methods for asking more sophisticated and contextualized queries. Social content search systems allow users to mine and find relevant information within posts and user comments, Q&A systems allow very contextualized queries to be answered by other users, and social collaborative search systems come as a means to help users share their experiences and findings. In the next section, we present Social Recommendation, in which recommendation is performed based on the user's social network.

6. Social recommendation

The third category of SIR models considers the filtering and recommendation domains (e.g., content-based filtering, collaborative filtering, recommender systems). Basically, recommendation aims at predicting the interest that users would give to an item/entity they have not yet explicitly considered. There are two main methods of recommendation: (i) an approach based on recommending items that are similar to those in which the user has shown interest in the past, known as the "content-based" approach (CB), and (ii) an approach that intends to recommend items to the user based on other individuals who are found to have similar preferences or tastes, known as the "collaborative filtering" approach (CF). We define social recommendation as follows:

Definition 8 (Social Recommendation). Social recommendation is a set of techniques that attempt to suggest (i) items (e.g., movies, music, books, news, web pages), (ii) social entities (e.g., people, events, groups), or (iii) topics of interest (e.g., sport, culture, and cooking) that are likely to be of interest to the user, through the use of social information.

On one hand, in recent years, many personalized recommendation features based on the user's social network have been developed and integrated into popular websites. This is mainly done by prompting the user to connect his social network accounts to their services, the ultimate goal being to collect as much data as possible that characterize the user. Examples include video recommendation on YouTube based on the Google+ profile, or movie recommendation on IMDb through registration using a Facebook account. This functionality has been reported to lead to an improvement of the recommendation services.

On the other hand, many social platforms have understood the power of learning over their data for building recommender services. Examples include targeted advertising on social platforms like Facebook, group recommendation again on Facebook, follower recommendation on Twitter, or topic and web page recommendation on delicious. Moreover, there are other social web services for which recommendation is at the heart of their business, for example social news aggregation services like Digg, which presents the stories expected to be most interesting to a user based on the preferences of similar users. It is clear that all these services have been improved by exploiting their social information.

In the following, we categorize social recommender systems according to the type of output they intend to recommend: (i) items recommendation, (ii) users recommendation, and (iii) topics recommendation.

6.1. Items recommendation

Items recommendation has probably attracted the most attention in recommender systems. Classical items recommendation methods are based on the assumption that users are independent and identically distributed. Thus, they do not assume any additional structure, including the social network that surrounds users. This does not reflect the real behavior of users, since they normally ask friends for recommendations before acting, e.g., buying a product. Many researchers have therefore started exploring social relations to improve recommender systems (including implicit social information, which can be employed to improve traditional recommendation methods [75]), essentially to tackle the cold-start problem [76,77,16,78]. However, as pointed out in [79], only a small subset of user interactions and activities is actually useful for social recommendation.

Among collaborative filtering based approaches, Liu and Lee [80] proposed very simple heuristics to increase recommendation effectiveness by combining social network information. Guy et al. [81] proposed a ranking function of items based on social relationships. This ranking function has been further improved in [82] to include social content, such as terms related to the user.

Another common approach to CF attempts to factorize an (incomplete) matrix R of dimension N × M containing the observed ratings r_{i,j} (the rating of user i for item j) into a product R ≈ U^T V of latent feature matrices U and V. In this context, following the intuition that a person's social network will affect his behavior on the Web, Ma et al. [16] propose to factorize both the users' social network matrix and the rating record matrix. The main idea is to fuse the user–item matrix with the users' social trust network by sharing a common latent low-dimensional user feature matrix. This approach has been improved in [83] by taking into account only trusted friends for recommendation while sharing the user latent feature matrix. An almost similar approach has been proposed in [84,85], which includes in the factorization process trust propagation and trust propagation with inferred circles of friends in social networks, respectively. In the same context, other approaches have proposed to consider social regularization terms while factorizing the rating matrix. The idea is to handle friends with dissimilar tastes differently in order to represent the taste diversity of each user's friends [86–88]. A number of these methods are reviewed, analyzed and compared in [89].
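The sketch below (a didactic gradient-descent toy written with numpy, not the exact model of [16] or [83–88]) illustrates the shared latent factor idea: the user factors U are fitted jointly to the observed ratings and to a social term that pulls each user towards the average factors of the users he trusts; all hyperparameters are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 users x 5 items, 0 means "unobserved".
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 0],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0]], dtype=float)
# Trust matrix: T[i, j] = 1 if user i trusts user j.
T = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

n_users, n_items, k = R.shape[0], R.shape[1], 2
U = 0.1 * rng.standard_normal((n_users, k))
V = 0.1 * rng.standard_normal((n_items, k))
lam, beta, lr = 0.05, 0.5, 0.01      # regularization, social weight, step size
observed = R > 0

for _ in range(2000):
    E = observed * (R - U @ V.T)     # error on observed ratings only
    # Pull each user's factors towards the mean factors of trusted friends.
    social_pull = U - (T @ U) / np.maximum(T.sum(axis=1, keepdims=True), 1)
    U += lr * (E @ V - lam * U - beta * social_pull)
    V += lr * (E.T @ U - lam * V)

prediction = U @ V.T
print(np.round(prediction, 1))       # missing cells are now filled in
```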

Table 1. Summary of the considered dimensions for each selected approach or tool. In the original table, a check mark means the dimension (i.e. functionality) is provided, a dash (–) means it is not provided, and a question mark (?) means nothing can be said about the dimension because of a lack of information; these marks do not imply any "positive" or "negative" judgment about the tools, only the presence or absence of the considered dimension (the per-tool marks are not reproduced here).

Compared approaches and tools, grouped by taxonomy category. Social Web Search: QE [22], [29]; Ranking [39], [31], [43]; Indexing [46], [47]. Social Search: Q&A [57], ChaCha; Content [66], TwitterSearch; Collaborative [72], [74]. Social Recommendation: Item [16], [87]; User [90], [103]; Topic [99], [92].

Analysis dimensions: (1) Social networks: symmetric relationships; asymmetric relationships (CS = content subscription, TR = trust relations); ternary relationships; their own social network. (2) Social network data: content; structure; metadata. (3) Data source: social network; Web; provided by users. (4) Personalization: profile; content-based; collaborative filtering. (5) Complexity and adaptability: scalability; dynamicity; data sparsity; cold start. (6) Evaluation: formal; end-user. (7) Socialization. (8) Privacy. (9) Industrialization.

6.2. Users recommendation

As mentioned above, social network platforms have adopted the strategy of suggesting friends (or groups of friends) to increase the connectivity among their users. Hence, content-based approaches were proposed in [90] to match the content of user profiles and determine user similarities for recommendation. Groh et al. [91] generated user neighborhood information from known social network structures and demonstrated that collaborative filtering based on such neighborhoods outperforms classic collaborative filtering methods. Symeonidis et al. [92] proposed a unified ternary semantic analysis framework to perform user recommendation. Guy et al. [90] describe a user interface for providing users with recommendations of people to invite into their social network; the approach is based on aggregated information collected from various sources. Hannon et al. [93] use content-based and collaborative-based approaches to evaluate a range of different user profiling and recommendation strategies, many of which have been implemented in [94]. Finally, in the context of tagging systems, Wang et al. [95] propose to connect users with similar tastes by measuring their similarities based on the tags they share in an inferred network of tags.

6.3. Topics and tags recommendation

Recently, topic and tag recommendation has attracted significant attention as a way to produce high quality hot-lists for users. Recommendation of tags also helps users to choose the right tags, as tagging is not constrained by a controlled vocabulary or annotation guidelines. Hence, the work in [96] provides a comprehensive evaluation and comparison of several state-of-the-art tag recommendation algorithms on three different real world datasets. A content-based collaborative filtering technique has been proposed in [97] to automate tag assignment to blogs. Hotho et al. [30] propose to project the three-dimensional correlations onto three 2D correlations; the two-dimensional correlations are then used to build conceptual structures similar to the hyperlink structures used by Web search engines. The work in [98,92] has been shown to generate high quality tag recommendations that outperform baseline methods such as most-popular models and collaborative filtering [96]. Also, Krestel et al. [99] proposed an approach that uses Latent Dirichlet Allocation to expand the tag sets of objects annotated by only a few users. The work in [100] proposed to learn tag relevance by taking into account three kinds of correlations: tag co-occurrence, tag visual correlation, and image-conditioned tag correlation. Specifically, they adopted RankBoost [101] to learn an optimal combination of these multi-modality correlations and generated a ranking function for tag recommendation. Recently, Zhu et al. [102] proposed a method of tag recommendation based on the neighbor voting graph of tags; they cast the social tag relevance learning problem as an adaptive teleportation random walk process on the voting graph.

Finally, several approaches have been proposed that unify these three types of recommendations, i.e. items, users, and topics, under one framework. Carmel et al. [47] propose a framework for social bookmark weighting, which allows estimating the effectiveness of each bookmark individually for several IR tasks. To do this, they propose several recommendation strategies, such as tag recommendation, user recommendation, and document recommendation; the values obtained from the three strategies are merged in order to effectively estimate the bookmark quality. In the same spirit, the framework proposed by Symeonidis et al. [92] acts in the same way: it models the three types of entities by a 3-order tensor, on which multiway latent semantic analysis and dimensionality reduction are performed using both the Singular Value Decomposition method and the Kernel-SVD smoothing technique. Also, Wei et al. [103] propose to leverage a quaternary relationship among users, items, tags and ratings to provide recommendations. They propose a unified framework for user recommendation, item recommendation, tag recommendation and item rating prediction by modeling the quaternary relationship among users, resources, tags, and ratings as a 4-order tensor, and cast the recommendation problem as a multi-way latent semantic analysis problem.
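As a rough illustration of these tensor-based formulations (a simplified single-mode reconstruction, not the full HOSVD pipeline of [92] nor the 4-order model of [103], and with toy data), the sketch below builds a small user x resource x tag tensor, smooths its tag-mode unfolding with a truncated SVD, and reads tag recommendations off the reconstructed tensor.

```python
import numpy as np

users = ["alice", "bob", "carol"]
resources = ["r1", "r2"]
tags = ["python", "search", "web", "java"]

# Y[u, r, t] = 1 if user u annotated resource r with tag t.
Y = np.zeros((len(users), len(resources), len(tags)))
assignments = [("alice", "r1", "python"), ("alice", "r1", "web"),
               ("bob",   "r1", "python"), ("bob",   "r2", "search"),
               ("carol", "r2", "search"), ("carol", "r2", "web")]
for u, r, t in assignments:
    Y[users.index(u), resources.index(r), tags.index(t)] = 1.0

# Unfold the tensor so that tags are columns, smooth it with a rank-2
# truncated SVD, and fold it back: unobserved (user, resource, tag) cells
# receive non-zero scores that can be used for tag recommendation.
M = Y.reshape(len(users) * len(resources), len(tags))
U_, s, Vt = np.linalg.svd(M, full_matrices=False)
rank = 2
M_hat = (U_[:, :rank] * s[:rank]) @ Vt[:rank, :]
Y_hat = M_hat.reshape(Y.shape)

def recommend_tags(user, resource, k=2):
    scores = Y_hat[users.index(user), resources.index(resource)]
    order = np.argsort(scores)[::-1]
    used = {t for u, r, t in assignments if u == user and r == resource}
    return [tags[i] for i in order if tags[i] not in used][:k]

print(recommend_tags("bob", "r1"))   # e.g. tags that similar users attached to r1
```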
It is clear that social networks represent a valuable source of information for improving and developing effective and efficient recommendation algorithms. In the next section, we give an overview of many SIR platforms, methods and approaches, as well as their potential drawbacks. The main objective is to understand the impact of the social dimension on the IR process, the weaknesses, and the possible future contributions in this domain.

7. Analysis of SIR methods and platforms

The objective in this section is to analyze the richness and weaknesses of some SIR tools and approaches. Among the SIR tools and methods discussed in the previous sections, only a few have evolved into a concrete commercial prototype. The tools selected for analysis are listed in Table 1, and their descriptions were provided in the previous sections. We have selected these approaches and tools following these main criteria: (i) they are the most popular ones at the time this analysis is performed (judging from our research), (ii) they are the most referenced ones and are published in top venues (highly selective conferences such as SIGIR, WWW, CIKM, VLDB, WSDM, etc.), and finally, (iii) this limited selection is motivated by the fact that our objective is not to analyze all the existing approaches and tools, but the most representative ones, as illustrations of the underlying principles and ideas.

For this analysis, we chose the dimensions listed in Table 1, covering various aspects of the social information used and of the problem to be solved. We consider these dimensions as crucial for designing an effective and efficient SIR approach. For illustration purposes, we only discuss some of the above tools for each dimension. Table 1 summarizes the analysis dimensions and the classification of the different tools according to these dimensions. In the following sections, we discuss these dimensions, their meaning, their importance, and the extent to which they are considered by the tools.

It is clear that social networks represent a valuable source of information for improving and developing effective and efficient recommendation algorithms. In the next section, we give an overview of many SIR platforms, methods, and approaches, as well as their potential drawbacks. The main objective is to understand the impact of the social dimension on the IR process, its weaknesses, and the possible future contributions in this domain.

7. Analysis of SIR methods and platforms

The objective in this section is to analyze the richness and weaknesses of some SIR tools and approaches. Among the SIR tools and methods discussed in the previous sections, only a few have evolved into concrete commercial prototypes. The selected tools for analysis are provided in Table 1, and their description was given in the previous sections. We have selected these approaches and tools following three main criteria: (i) they are the most popular ones at the time this analysis was performed (judging from our research), (ii) they are the most referenced ones and are published in highly selective venues (conferences such as SIGIR, WWW, CIKM, VLDB, and WSDM), and finally, (iii) our objective is not to analyze all the existing approaches and tools, but only the most representative ones, as illustrations of the underlying principles and ideas.

For this analysis, we choose the dimensions listed in Table 1, covering various aspects of the social information used and the problem to be solved. We consider these dimensions as crucial for designing an effective and efficient SIR approach. For illustration purposes, we only discuss some of the above tools for each dimension. Table 1 summarizes the analysis dimensions and the classification of the different tools according to these dimensions. In the following sections, we discuss these dimensions, their meaning, their importance, and the extent to which they are considered by the tools.

7.1. Social networks

This dimension is related to the kind of social network leveraged by a SIR approach. Each SIR approach relies on at least one kind of social network, either by using its social content or by exploiting its social relations. Depending on the considered approach, its application, and its purpose, it can use:

• Social networks with symmetric relationships, in which users explicitly declare their social relations as friends. Users can also express their opinions, comment on news, and share resources. This represents a valuable source of information for user modeling and profiling [57,66], which can be reused to build many interesting services, e.g., recommendation and personalization services.
• Social networks with asymmetric relationships, which can be divided into two categories: (i) content subscription social networks, e.g., microblogging systems, for which most of the considered SIR approaches rely on the temporal aspect of these networks, e.g., to answer recency-sensitive queries [36], and (ii) trust-based social networks, e.g., Epinions (http://www.epinions.com/), for which SIR approaches use the trust degree between users, e.g., for collaborative filtering [16].
• Social networks with ternary relations, e.g., bookmarking systems. As discussed in Section 2.1, the structures generated in these systems have been proven to be valuable knowledge for building many SIR approaches. Annotations and social metadata explicitly provided by users have been exploited in different ways, for example to extract correlated terms [22], build user profiles [39], compute social relevance [31], and enhance documents [46].
• Their own social networks, built upon information gathered from (i) the user, who explicitly provides information about himself, (ii) the aggregation and crawling of both social networks and the Web, e.g., building a FOAF (Friend Of A Friend) ontology, which is a machine-readable structure that describes people, or (iii) an inference process based on user behavior, e.g., Aardvark [57].

7.2. Social data sources

This dimension concerns the kind of social information leveraged by SIR approaches. We distinguish the following three kinds of social information, which are embedded within a social network:

• Social content, which is the content generated by users through their interactions and activities, i.e., publishing, annotating, commenting on, or rating content and entities. This social content is most of the time useful to enrich entities and thus enhance their logical representation [46,47,36], to extract correlated terms, e.g., through term co-occurrence [22], and to pull out user profiles [39,57].
• Social relations explicitly or implicitly declared by users. Indeed, social networks exhibit various relationships between the entities of their social graphs. These relationships can hold between entities of the same type, e.g., friendship relations between users and similarity between resources, or between entities of different types, e.g., an authorship relation between a user and a document, a description relation between a term and a user, etc. Social relations are useful for building a SIR approach and have been used in different ways, e.g., for recommendation [16] by handling trust relations, or for extracting correlated terms by leveraging their relations over documents [22].
• Social metadata, which refer to information embedded inside social platforms and data, such as geo-location information and time-stamps. Metadata can be easily exploited for a social search task (e.g., forwarding a geographically contextualized query to the right person [57]) or for a social recommendation purpose (e.g., recommending the right event at the right place [104]).
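As a small illustration of how social content can be mined, the sketch below (purely illustrative triples, not data from any cited system) derives tag co-occurrence counts from (user, resource, tag) bookmarking triples; co-occurrence statistics of this kind are the signal exploited to extract correlated terms for query expansion in [22]:

```python
from collections import defaultdict
from itertools import combinations

# Toy folksonomy: (user, resource, tag) triples, as produced by a bookmarking system.
triples = [
    ("u1", "d1", "python"), ("u1", "d1", "programming"),
    ("u2", "d1", "programming"), ("u2", "d2", "python"),
    ("u3", "d2", "scripting"), ("u2", "d2", "scripting"),
]

def tag_cooccurrence(triples):
    """Count how often two distinct tags annotate the same resource."""
    tags_per_resource = defaultdict(set)
    for _user, resource, tag in triples:
        tags_per_resource[resource].add(tag)
    cooc = defaultdict(int)
    for tags in tags_per_resource.values():
        for t1, t2 in combinations(sorted(tags), 2):
            cooc[(t1, t2)] += 1
    return dict(cooc)

print(tag_cooccurrence(triples))
# {('programming', 'python'): 1, ('python', 'scripting'): 1}
```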
7.3. Data sources

The data sources dimension refers to the sources of information used by SIR approaches. The latter are most of the time based not only on social data sources, but also on other sources of information, such as:

• The content of web pages, which contains valuable information that can be extracted by performing classic textual processing [22,47,46].
• Ontologies, e.g., the FOAF (Friend Of A Friend) ontology, which can be used to infer a social network [105].
• Information explicitly provided by users. Some SIR approaches ask users to explicitly provide information to enrich their internal data model [57].

Note that, basically, social information is only used to improve and enhance information retrieval tasks. Hence, we believe that the use of trusted sources of information (e.g., information provided by the authors of web pages) is necessary for the proper functioning of SIR approaches (actually, almost all SIR approaches combine such data [47,46]). A trade-off should be found to properly weight each data source, while avoiding overfitting.

7.4. Personalization

As discussed in the previous sections, some SIR approaches are based on personalization, e.g. [29,45,38,39,57,26,43]. Personalization allows differentiating between individuals by emphasizing their specific domains of interest and their preferences. It is a key point in IR, and the demand from numerous users for adapted results is constantly increasing [106]. Several techniques exist to provide personalized services, among which: (i) user profiling, (ii) content-based approaches, and (iii) collaborative-filtering-based approaches.

7.4.1. Profile-based methods

A user profile is a collection of data describing a specific user, provided explicitly or implicitly by him. Therefore, a profile refers to the digital representation of a person's identity, which mainly includes the description of his characteristics and his domains of interest and expertise. We distinguish two types of profiles that differ in the way they are constructed:

• Profile constructed offline: some SIR approaches construct profiles offline and maintain them incrementally, which makes them more efficient with regard to execution time. However, profiles computed offline decrease the dynamicity of the approach, since new data are not instantly taken into account.
• Profile constructed on the fly (online): in contrast, profiles computed online increase the dynamicity of a SIR approach and its effectiveness, at the cost of execution time.

Examples of work focusing their strategies on user profiling include [39,43,29,38,45]. The main issue with the usage of profiles is the ability to update the user's preferences quickly, especially because users interact a lot on social media. Thus, this problem has to be considered in order to make use of dynamic profiles.
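A user profile can be maintained as a weighted term vector that is updated incrementally as new social activity is observed. The sketch below is an assumption-laden illustration (the class, its parameters, and the data are not taken from any surveyed system): it applies an exponential time decay so that recent interests outweigh older ones, which is one simple way to keep an online profile dynamic:

```python
import math
import time

class DecayingProfile:
    """User profile as a term -> weight map with exponential time decay."""

    def __init__(self, half_life_days=30.0):
        self.decay = math.log(2) / (half_life_days * 86400.0)  # per-second decay rate
        self.weights = {}    # term -> accumulated weight
        self.last_seen = {}  # term -> timestamp of last update

    def observe(self, term, weight=1.0, now=None):
        """Fold a new observation (e.g., a tag used by the user) into the profile."""
        now = time.time() if now is None else now
        old = self.weights.get(term, 0.0)
        if term in self.last_seen:
            old *= math.exp(-self.decay * (now - self.last_seen[term]))
        self.weights[term] = old + weight
        self.last_seen[term] = now

    def top_terms(self, k=5, now=None):
        """Return the k currently strongest interests, after applying decay."""
        now = time.time() if now is None else now
        decayed = {
            t: w * math.exp(-self.decay * (now - self.last_seen[t]))
            for t, w in self.weights.items()
        }
        return sorted(decayed.items(), key=lambda kv: -kv[1])[:k]

profile = DecayingProfile(half_life_days=7)
t0 = time.time()
profile.observe("python", now=t0 - 30 * 86400)  # an old interest
profile.observe("socialIR", now=t0)             # a fresh interest
print(profile.top_terms(now=t0))                # 'socialIR' now ranks above 'python'
```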
7.4.2. Content-based methods

Personalization in content-based approaches aims at providing users with information similar to that which they previously consumed. This category of methods is best suited to recommender systems, where items and users' preferences are usually described with keywords. The algorithms then try to recommend items that are similar to those the user liked in the past. In particular, various candidate items are compared with items previously rated by the user, and the best-matching items are recommended [92,81].
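A minimal content-based recommender in the spirit described above can be sketched with scikit-learn (assumed available; the item catalogue and the set of liked items are toy data): candidate items are ranked by the cosine similarity between their TF-IDF vectors and the centroid of the items the user liked in the past:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue of item descriptions and the items a user liked in the past.
items = {
    "i1": "social network analysis and graph mining",
    "i2": "deep learning for image classification",
    "i3": "collaborative filtering and recommender systems",
    "i4": "graph based social recommendation",
}
liked = ["i1", "i4"]

ids = list(items)
docs = [items[i] for i in ids]
tfidf = TfidfVectorizer().fit_transform(docs)

# Represent the user by the centroid of the items he liked.
user_vector = np.asarray(tfidf[[ids.index(i) for i in liked]].mean(axis=0))
scores = cosine_similarity(user_vector, tfidf).ravel()

ranking = sorted(
    (i for i in ids if i not in liked),
    key=lambda i: -scores[ids.index(i)],
)
print(ranking)  # items most similar to the user's past likes come first
```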
7.4.3. Collaborative-filtering-based methods

Personalization based on collaborative filtering aims to provide users with information consumed by many similar users. The collaborative filtering process can be summarized in two steps: (1) look for users who share the same interests and behavior as a given user, i.e., the user who is currently using the SIR approach, and (2) provide the considered user with content and information based on the users found in step (1). Examples of representative efforts in this category include [16,87,52].

7.5. Complexity and applicability

The complexity of a method characterizes it from several perspectives of its applicability. We consider four dimensions to describe and characterize the complexity and the applicability of a method.

7.5.1. Scalability

Scalability is the ability of a method to continue to work well when its context is changed. This refers to its ability to scale to a very large dataset (number of users, resources, objects, documents, etc.) while continuing to meet users' needs (both precision and recall) and to take full advantage of the data in terms of performance, e.g., execution time and relevance of information. Some SIR approaches are able to scale to very large datasets [16,86,87], while others are not [47], because of the complexity of their algorithms.

7.5.2. Dynamicity

Dynamicity refers to the ability of a method to consider new data and to quickly update its model. Considering new data is a key problem for SIR methods, since they are based on social information, which grows quickly with the intense activity of users who are constantly commenting, editing, publishing, and sharing information. Some SIR methods have a model that can be easily updated [29,39,57], whereas others do not, e.g., methods based on machine learning techniques, which are among the most difficult to update since the model has to be rebuilt each time new data are to be considered [16,83,86,22].

7.5.3. Data sparsity

Data sparsity is a term used to designate how much data we have for the dimensions of a dataset. The term sparse data is most of the time associated with matrices, where a sparse matrix is a matrix populated primarily with zeros [107]. Therefore, this dimension refers to the ability of an approach to operate over sparse data. Indeed, sparse data lead to poor result quality, since there is not enough information to process. However, some approaches handle this problem effectively by considering other sources of data, e.g., the content of web pages or other social networks [16,29,57,22].
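Data sparsity is easy to quantify. The snippet below (toy numbers) stores a user–item interaction matrix in a compressed sparse format and reports the fraction of observed cells, the figure that typically motivates falling back on auxiliary social sources:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy interaction data: (user index, item index, rating).
rows = np.array([0, 0, 1, 2, 3])
cols = np.array([1, 4, 2, 0, 3])
vals = np.array([5.0, 3.0, 4.0, 2.0, 1.0])

n_users, n_items = 1000, 5000
interactions = csr_matrix((vals, (rows, cols)), shape=(n_users, n_items))

density = interactions.nnz / (n_users * n_items)
print(f"observed cells: {interactions.nnz}")
print(f"density: {density:.6%}  (sparsity: {1 - density:.6%})")
```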
7.5.4. Cold-start problem

The cold start is a potential problem for systems that must handle new entities effectively [78], e.g., users and items. In other words, it concerns the issue that the system cannot draw any inference for users or items about which it has not yet gathered sufficient information. Recommender systems are among the systems most affected by the cold-start problem, since they need a lot of information to make predictions. However, many approaches deal efficiently with this problem by relying on other data sources [39,16,83,86].
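A common way to soften the cold-start problem, in line with the fallback strategies cited above, is to back off to the social graph when a newcomer has no interaction history. The sketch below (illustrative data and function names) recommends the items most popular among a new user's declared contacts and falls back to global popularity when no social signal is available:

```python
from collections import Counter

def cold_start_recommend(user, friends, ratings, k=3):
    """Recommend items for a user with little or no history, using friends'
    items first and global popularity as a last resort."""
    history = ratings.get(user, {})
    counts = Counter()
    for friend in friends.get(user, set()):
        counts.update(item for item in ratings.get(friend, {}) if item not in history)
    if not counts:  # no usable social signal: fall back to global popularity
        for judged in ratings.values():
            counts.update(item for item in judged if item not in history)
    return [item for item, _ in counts.most_common(k)]

friends = {"newbie": {"alice", "bob"}}
ratings = {"alice": {"doc1": 5, "doc2": 4}, "bob": {"doc2": 3, "doc3": 5}}
print(cold_start_recommend("newbie", friends, ratings))
# 'doc2' ranks first, since both of the newcomer's friends rated it
```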
7.6. Evaluation

Evaluation is a critical part of any research work. This dimension intends to track the evaluation methodology followed by the proposed methods. We distinguish two main categories of evaluation: (i) formal evaluation, where the evaluation is performed offline using a public or a specifically built dataset, and (ii) end-user evaluation, where users are generally asked to judge and provide feedback on a method. In general, almost all the methods we studied involve a formal evaluation, where different aspects are studied with a comparison to the state of the art. Few of them, such as [39,14,72], consider end-user evaluation.

7.7. Socialization

Some SIR approaches put users in contact and encourage them to socialize and collaborate in order to satisfy a particular information need. This dimension refers to this particularity of socialization between users. Thus, many approaches put users in contact so that they can benefit from each other's experience. In particular, many social recommender systems offer this feature, since they provide explanations of why a recommendation has been made and what led to the recommendation [81,90,82]. Then, if the recommendation is based on the experience of other users, e.g., their ratings, the current user can contact them to obtain further information and precision. Also, as stated before, social search approaches are exclusively based on the assistance of social entities. This automatically puts users in contact, as is the case in Q&A systems (Aardvark [57]) and social collaborative search systems (SearchTogether [72], Kolline [73], and CoSense [74]).

7.8. Privacy management

Social information is sensitive with respect to user privacy. Some approaches do not consider the privacy of users, as they spread sensitive information about them. Even if users make their social accounts public, we believe that reusing these data in other value-added services is still a user privacy problem. Hence, social recommender systems and social search systems are probably the approaches most exposed to users' privacy problems. Examples where information about users needs to be divulged include:

1. In recommender systems, we need to justify why a recommendation has been made.
2. In Q&A systems, we need to justify why we put two users in contact.
3. In social content search systems, we show who said what about what.
4. In social collaborative search systems, we show who searched for what.

Finally, many approaches use social information in the back office and thus do not spread sensitive information about users. This mainly includes social Web search approaches [22,39,47,46,43].

7.9. Industrialization

Developing effective and efficient algorithms is good, but putting them into action is better. Hence, this final dimension is related to production and commissioning. Some SIR approaches studied in this paper have evolved into concrete commercial prototypes, e.g., Aardvark [57], SearchTogether [72], or Collecta [66], while other approaches remain research contributions. This is certainly a big limitation of most of the contributions, i.e., their inability to reach an industrial level, despite the interest these approaches may have in this domain.

8. Discussion and future directions

In this section, we discuss some thoughts and future directions for research related to SIR. We consider two perspectives: (i) a category perspective, where the discussed aspects are related to the categories of SIR approaches, and (ii) a dimension perspective, where the discussion is driven by the analysis dimensions.

8.1. Approaches' category perspective

In social Web search, especially in the social query reformulation part, we reported that there is no approach that considers query reduction. Query reduction is a technique that rewrites long queries into shorter, more effective ones [17]. Hence, investigating a social query reformulation approach that considers both query expansion and query reduction seems promising and offers good research perspectives. This can be done by suggesting to users alternative queries that better match their requirements (see Example 9 and the sketch that follows it).

Example 9. The query "interesting tickets for four people to warm weather countries for vacation" can be rewritten as "cheap family vacation ticket" or "luxury family vacation plan". The new proposed query can be more precise/concise and is eventually expected to provide personalized, similar, or even better results than the long one.
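A crude form of query reduction can be sketched as follows; the frequency tables below are made up for illustration and do not correspond to any cited approach. Each non-stopword term of the long query is scored by how often it is used as a social tag and how often it occurs in the collection, and only the top-scoring terms are kept:

```python
from collections import Counter

STOPWORDS = {"for", "to", "of", "the", "a", "an", "and"}

# Toy evidence: how often each term appears as a social tag and in the collection.
tag_freq = Counter({"vacation": 40, "tickets": 25, "cheap": 30, "family": 35, "warm": 5})
collection_freq = Counter({"tickets": 120, "countries": 80, "vacation": 200,
                           "weather": 60, "warm": 40, "people": 10})

def reduce_query(query, keep=4, tag_weight=2.0):
    """Keep the `keep` most promising terms of a long query, scoring each term
    by its popularity as a social tag and its frequency in the collection."""
    terms = [t for t in query.lower().split() if t not in STOPWORDS]
    scores = {t: tag_weight * tag_freq[t] + collection_freq[t] for t in terms}
    kept = set(sorted(scores, key=scores.get, reverse=True)[:keep])
    return " ".join(t for t in terms if t in kept)

long_query = "interesting tickets for four people to warm weather countries for vacation"
print(reduce_query(long_query))  # "tickets weather countries vacation"
```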
Regarding the IR modeling part (document representation and ranking functions), we believe that the temporal dimension is a key aspect that has not been deeply investigated. This includes considering the evolution of users' behavior and profiles over time. Indeed, users are expected to evolve over time, which makes their interests and tastes change. IR systems should constantly learn information about users in order to adapt their results.

As for the social recommendation track, this topic has been deeply investigated, whether for item, tag, or user recommendation. Item recommendation is probably the part that has attracted the most attention. However, we also believe that the temporal dimension of recommender systems can help to tackle the problems of (i) hot topics, i.e., news and fresh information, (ii) the evolution of user profiles over time, i.e., user interests evolve and change over time, and (iii) the diversity of information, i.e., not annoying users with similar information. Indeed, for the first problem, information is time-dependent, meaning that it attracts much attention at a given moment and is quickly forgotten after a while. Many users of social media state that the freshness of information is a key point in a recommender system. The second problem deals with the evolution and update of user profiles. For example, the opinion of a user concerning something may change over time, as he grows or reads news about the topic. The third problem is also a feeling that users have when they use social media (Facebook in particular). Most of the time, when a piece of recent information appears, all users begin to publish articles that deal with the same information, and users are quickly overwhelmed by similar information published by each other. We believe that, at a given time, the recommender system should know that a given user is already aware of this information, and consequently it should be hidden [87].

Finally, with the advent of social networks, many tools and approaches have emerged to deal with the weaknesses of classic search engines. This includes social content search engines (for opinion seeking), query recency search (Q&A tools for contextualized search), and collaborative search engines (for collectively answering a common information need). In these areas, many improvements are still possible, such as finding the right persons to answer a given question in Q&A tools, or finding users who share common information needs and relating them so that they can work together to fulfill this need.

8.2. Dimension perspective

From the data perspective, most of the existing approaches use either the content or the structure of social networks. Although a few techniques have started leveraging both parts, there is still room for improvement to encourage this usage, since it has been shown that it may help [82]. Furthermore, it is interesting to note that social metadata have hardly been used. Considering this type of data may bring additional capabilities to a method, e.g., contextual information retrieval. The same observation also applies to the data sources dimension, where approaches tend to leverage one source or another. Since the social Web can be seen as a complementary level of the traditional Web, combining the different sources can provide more capabilities for any approach.

On the other hand, exploiting personal information resulting from social interactions should naturally benefit the user. This is achieved through personalization, which is added to the SIR process. More efforts have to be put towards reinforcing personalization in SIR. This is also justified by the huge amount of information available to the user, which needs to be handled on his behalf, e.g., to display the most interesting piece of information at the right moment and at the right place. Moreover, it seems natural to consider the capability of any approach to provide a socialization functionality as a bootstrapping mechanism. We observed that many methods use social information but do not generate, or favor the generation of, social information in their turn. This could certainly benefit them, e.g., by providing a better characterization and contextualization of the user, as done in [57].

Social networks contain sensitive data directly related to the user. Data privacy is a big challenge that has to be considered when designing a new SIR method. To handle this issue, the current techniques and methods leverage the "public" part of the social data, i.e., the part of the data that is open to the public and for which the user gave explicit consent to share it publicly. However, we believe that there is still a huge part of the data that can be leveraged to further improve the IR process. Investigations have to be made in this area following innovative techniques such as information granularity (similar to what is practiced in medical informatics [108]).

Dealing with social data implies being confronted with their large size, their diversity, and their dynamics. This dimension has to be considered as a priority when designing a new SIR approach, as social data are inherently large and complex. Some existing techniques consider only a part of it, e.g., tackling the size but not the dynamics. Thus, there is a need for more investigation to solve scalability issues related to SIR, especially if a method targets a production objective. These issues can be handled with existing technologies such as Hadoop and its associated technologies, but they also have to be investigated from the algorithmic perspective, as the considered data and structures are often complex.

There is also room for contribution from the evaluation perspective of SIR systems and approaches. In fact, although extensive evaluation is carried out using different datasets and comparisons to other approaches, more evaluations and protocols have to be set up to involve real users in this process. As mentioned in the previous sections, some contributions have started involving users, but not in a substantial manner. Beyond involving users, working on elaborate protocols to involve them may help a lot in this process. Last but not least, in order to ensure the reproducibility of evaluation results, there should be an effort to share the code of the approaches within the community, to ensure transparency, efficiency, and objectivity in this domain.

9. Conclusion

In this paper, we provided an in-depth review of the topic of Social Information Retrieval (SIR). In particular, we proposed a taxonomy to classify and categorize SIR approaches into three main categories, namely: (i) social Web search, (ii) social search, and (iii) social recommendation. We showed that these three sub-categories are fundamentally different in the way they leverage and use social information. Many methods have been proposed to improve the classic IR process, while others have proposed new search paradigms based on the socialization between users.

Social platforms represent valuable sources of information and knowledge that can be reused to improve many services (especially search services). On the one hand, initiatives like OpenID and the mashup concept promote the possibility for social platforms to share their data with other applications, e.g., the full user name, the profile picture, the gender, the username and user id (account number), and even the list of friends and contacts. On the other hand, users are encouraged to share and publish content on social platforms and to make their social information publicly accessible. Therefore, this social dimension of the Web has attracted the attention of many researchers in the IR community.

We discussed many SIR methods with respect to some dimensions that we judge essential in order to assess the robustness and the effectiveness of a SIR approach. We consider these dimensions as factors that indicate the extent to which a SIR approach can be applicable, and in what context, i.e., the size and sparsity of the data. Finally, we discussed some interesting improvements for some SIR categories from a research perspective.

References

[1] S. Amer-Yahia, J. Huang, C. Yu, Building community-centric information exploration applications on social content sites, in: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, ACM, New York, NY, USA, 2009, pp. 947–952.
[2] S. Amer-Yahia, L.V.S. Lakshmanan, C. Yu, Socialscope: enabling information discovery on social content sites, in: CIDR, 2009.

[3] R.A. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, [28] C. Biancalana, F. Gasparetti, A. Micarelli, G. Sansonetti, Social 2nd edition, . Addison-Wesley Longman Publishing Co., Inc, Boston, semantic query expansion, ACM Trans. Intell. Syst. Technol. 4 (4) MA, USA, 2010. (2013) 60:1–60:43. [4] S.E. Robertson, K. Sparck Jones, Systems, Taylor [29] M.R. Bouadjenek, H. Hacid, M. Bouzeghoub, J. Daigremont, Perso- Graham Publishing, London, UK, 1988 Ch. Relevance weighting of nalized social query expansion using social bookmarking systems, search terms, pp. 143–160. in: Proceedings of the 34th International ACM SIGIR Conference on [5] S. Brin, L. Page, The anatomy of a large-scale hypertextual web Research and Development in Information Retrieval, SIGIR '11, search engine, Comput. Netw. ISDN Syst. 30 (1–7) (1998) 107–117. ACM, New York, NY, USA, 2011, pp. 1113–1114. [6] J.M. Kleinberg, Authoritative sources in a hyperlinked environment, [30] A. Hotho, R. Jäschke, C. Schmitz, G. Stumme, Information retrieval J. ACM 46 (5) (1999) 604–632. in folksonomies: search and ranking, in: Y. Sure, J. Domingue (Eds.), [7] G. Salton, Automatic Information Organization and Retrieval, The Semantic Web: Research and Applications, 2006. McGraw Hill Text, 1968. [31] S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, Z. Su, Optimizing web search [8] A. Maaradji, H. Hacid, R. Skraba, A. Vakali, Social web mashups full using social annotations, in: Proceedings of the 16th International completion via frequent sequence mining, in: SERVICES, 2011, pp. Conference on World Wide Web, WWW '07, ACM, New York, NY, 9–16. USA, 2007, pp. 501–510. [9] W. Stanley, F. Katherine, Social Network Analysis: Methods and [32] T. Takahashi, H. Kitagawa, S-bits: social-bookmarking induced topic Applications (Structural Analysis in the Social Sciences), 1st edition, search, in: Proceedings of the 2008 the Ninth International Con- Cambridge University Press, New York, 1994. ference on Web-Age Information Management, WAIM '08, IEEE [10] T. Hammond, T. Hannay, B. Lund, J. Scott, Social bookmarking tools: Computer Society, Washington, DC, USA, 2008, pp. 25–30. a general review, D-Lib Mag. 11 (4),. (2005). [33] T. Takahashi, H. Kitagawa, A ranking method for web search using [11] B. Lund, T. Hammond, T. Hannay, M. Flack, Social bookmarking tools social bookmarks, in: Proceedings of the 14th International Con- (ii): a case study—connotea, D-Lib Mag. 11 (4),. (2005). ference on Database Systems for Advanced Applications, DASFAA [12] D. Goh, S. Foo, Social Information Retrieval Systems: Emerging '09, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 585–589. Technologies and Applications for Searching the Web Effectively, [34] Y. Yanbe, A. Jatowt, S. Nakamura, K. Tanaka, Towards improving Information Science Reference—Imprint of: IGI Publishing, Hershey, web search by utilizing social bookmarks, in: Proceedings of the PA, 2007. 7th International Conference on Web Engineering, ICWE'07, [13] M. Gupta, R. Li, Z. Yin, J. Han, Survey on social tagging techniques, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 343–357. SIGKDD Explor. Newsl. 12 (1) (2010) 58–72. [35] F. Abel, N. Henze, D. Krause, Ranking in folksonomy systems: can [14] P. Heymann, G. Koutrika, H. Garcia-Molina, Can social bookmarking context help? in: Proceedings of the 17th ACM Conference on improve web search? 
in: Proceedings of the 2008 International Information and Knowledge Management, CIKM '08, ACM, New Conference on Web Search and Data Mining, WSDM '08, ACM, New York, NY, USA, 2008, pp. 1429–1430. York, NY, USA, 2008, pp. 195–206. [36] A. Dong, R. Zhang, P. Kolari, J. Bai, F. Diaz, Y. Chang, Z. Zheng, H. Zha, [15] M.R. Morris, J. Teevan, K. Panovich, What do people ask their social Time is of the essence: improving recency ranking using twitter networks, and why? A survey study of status message Q&A beha- data, in: Proceedings of the 19th International Conference on vior, in: Proceedings of the SIGCHI Conference on Human Factors in World Wide Web, WWW '10, ACM, New York, NY, USA, 2010, pp. Computing Systems, CHI '10, ACM, New York, NY, USA, 2010, pp. 331–340. 1739–1748. [37] X. He, M. Gao, M.-Y. Kan, Y. Liu, K. Sugiyama, Predicting the [16] H. Ma, H. Yang, M.R. Lyu, I. King, Sorec: social recommendation popularity of web 2.0 items based on user comments, in: Pro- using probabilistic matrix factorization, in: Proceedings of the 17th ceedings of the 37th International ACM SIGIR Conference on ACM Conference on Information and Knowledge Management, Research & Development in Information Retrieval, SIGIR '14, ACM, CIKM '08, ACM, New York, NY, USA, 2008, pp. 931–940. New York, NY, USA, 2014, pp. 233–242. [17] G. Kumaran, V.R. Carvalho, Reducing long queries using query [38] M.R. Bouadjenek, H. Hacid, M. Bouzeghoub, SoPRa: a new social quality predictors, in: Proceedings of the 32nd International ACM personalized ranking function for improving web search, in: Pro- SIGIR Conference on Research and Development in Information ceedings of the 36th International ACM SIGIR Conference on Retrieval, SIGIR '09, ACM, New York, NY, USA, 2009, pp. 564–571. Research and Development in Information Retrieval, SIGIR '13, [18] Efthimis N. Efthimiadis, Interactive query expansion: a user-based ACM, New York, NY, USA, 2013, pp. 861–864. evaluation in a relevance feedback environment, J. Am. Soc. Inf. Sci. [39] D. Carmel, N. Zwerdling, I. Guy, S. Ofek-Koifman, N. Har'el, I. Ronen, 51, 11 (2000) 989–1003, http://dx.doi.org/10.1002/1097-4571. E. Uziel, S. Yogev, S. Chernov, Personalized social search based on [19] C. Lioma, R. Blanco, M.-F. Moens, A logical inference approach to the user's social network, in: Proceedings of the 18th ACM Con- query expansion with social tags, in: ICTIR, 2009. ference on Information and Knowledge Management, CIKM '09, [20] S. Jin, H. Lin, S. Su, Query expansion based on folksonomy tag co- ACM, New York, NY, USA, 2009, pp. 1227–1236. occurrence analysis, in: 2009 IEEE International Conference on [40] M.G. Noll, C. Meinel, Web search personalization via social book- Granular Computing, 2009. marking and tagging, in: ISWC'07/ASWC'07, 2007. [21] A. Mantrach, J.-M. Renders, A general framework for people [41] D. Vallet, I. Cantador, J.M. Jose, Personalizing web search with retrieval in social media with multiple roles, in: R. Baeza-Yates, A. folksonomy-based user and document profiles, in: Proceedings of de Vries, H. Zaragoza, B. Cambazoglu, V. Murdock, R. Lempel, F. the 32nd European Conference on Advances in Information Silvestri (Eds.), Advances in Information Retrieval, Lecture Notes in Retrieval, ECIR'2010, Springer-Verlag, Berlin, Heidelberg, 2010, pp. Computer Science, vol. 7224, Springer, Berlin, Heidelberg, 2012, pp. 420–431. 512–516. [42] Q. Wang, H. Jin, Exploring online social activities for adaptive [22] Y. Lin, H. Lin, S. Jin, Z. 
Ye, Social annotation in query expansion: a search personalization, in: Proceedings of the 19th ACM Interna- machine learning approach, in: Proceedings of the 34th Interna- tional Conference on Information and Knowledge Management, tional ACM SIGIR Conference on Research and Development in CIKM '10, ACM, New York, NY, USA, 2010, pp. 999–1008. Information Retrieval, SIGIR '11, ACM, New York, NY, USA, 2011, pp. [43] S. Xu, S. Bao, B. Fei, Z. Su, Y. Yu, Exploring folksonomy for perso- 405–414. nalized search, in: Proceedings of the 31st Annual International [23] J. Pitkow, H. Schütze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, ACM SIGIR Conference on Research and Development in Informa- E. Adar, T. Breuel, Personalized search, Commun. ACM 45 (9) (2002) tion Retrieval, SIGIR '08, ACM, New York, NY, USA, 2008, pp. 155– 50–55. 162. [24] M. Bender, T. Crecelius, M. Kacimi, S. Michel, T. Neumann, J.X. [44] M.R. Bouadjenek, A. Bennamane, H. Hacid, M. Bouzeghoub, Eva- Parreira, R. Schenkel, G. Weikum, Exploiting social relations for luation of personalized social ranking functions of information query expansion and result ranking, in: ICDE Workshops, 2008. retrieval, in: F. Daniel, P. Dolog, Q. Li (Eds.), Web Engineering, [25] R. Schenkel, T. Crecelius, M. Kacimi, S. Michel, T. Neumann, J.X. Lecture Notes in Computer Science, vol. 7977, Springer, Berlin, Parreira, G. Weikum, Efficient top-k querying over social-tagging Heidelberg, 2013, pp. 283–290. networks, in: Proceedings of the 31st Annual International ACM [45] M.R. Bouadjenek, H. Hacid, M. Bouzeghoub, A. Vakali, Using social SIGIR Conference on Research and Development in Information annotations to enhance document representation for personalized Retrieval, SIGIR '08, ACM, New York, NY, USA, 2008, pp. 523–530. search, in: Proceedings of the 36th International ACM SIGIR Con- [26] M. Bertier, R. Guerraoui, V. Leroy, A.-M. Kermarrec, Toward perso- ference on Research and Development in Information Retrieval, nalized query expansion, in: SNS, 2009. SIGIR '13, ACM, New York, NY, USA, 2013, pp. 1049–1052. [27] C. Biancalana, A. Micarelli, C. Squarcella, Nereau: a social approach [46] X. Zhang, L. Yang, X. Wu, H. Guo, Z. Guo, S. Bao, Y. Yu, Z. Su, SDOC: to query expansion, in: WIDM, 2008. exploring social wisdom for document enhancement in web M.R. Bouadjenek et al. / Information Systems 56 (2016) 1–18 17

mining, in: Proceedings of the 18th ACM Conference on Informa- Development in Information Retrieval, SIGIR '14, ACM, New York, tion and Knowledge Management, CIKM '09, ACM, New York, NY, NY, USA, 2014, pp. 33–42. USA, 2009, pp. 395–404. [69] J. Vosecky, K.W.-T. Leung, W. Ng, Collaborative personalized twitter [47] D. Carmel, H. Roitman, E. Yom-Tov, Social bookmark weighting for search with topic-language models, in: Proceedings of the 37th search and recommendation, VLDB J. 19 (6) (2010) 761–775. International ACM SIGIR Conference on Research & Development in [48] P.A. Dmitriev, N. Eiron, M. Fontoura, E. Shekita, Using annotations Information Retrieval, SIGIR '14, ACM, New York, NY, USA, 2014, pp. in enterprise search, in: Proceedings of the 15th International 53–62. Conference on World Wide Web, WWW '06, ACM, New York, NY, [70] M.R. Morris, Collaborating alone and together: investigating per- USA, 2006, pp. 811–817. sistent and multi-user web search activities, Technical Report, [49] K. Bischoff, C.S. Firan, W. Nejdl, R. Paiu, Can all tags be used for Microsoft Research, 2007. search? in: Proceedings of the 17th ACM Conference on Informa- [71] M.R. Morris, A survey of collaborative web search practices, in: tion and Knowledge Management, CIKM '08, ACM, New York, NY, Proceedings of the SIGCHI Conference on Human Factors in Com- USA, 2008, pp. 193–202. puting Systems, CHI '08, ACM, New York, NY, USA, 2008, pp. 1657– [50] A. Singhal, F. Pereira, Document expansion for speech retrieval, in: 1660. Proceedings of the 22nd Annual International ACM SIGIR Con- [72] M.R. Morris, E. Horvitz, Searchtogether: an interface for colla- ference on Research and Development in Information Retrieval, borative web search, in: Proceedings of the 20th Annual ACM SIGIR '99, ACM, New York, NY, USA, 1999, pp. 34–41. Symposium on User Interface Software and Technology, UIST '07, [51] D. Carmel, H. Roitman, E. Yom-Tov, Who tags the tags? A frame- ACM, New York, NY, USA, 2007, pp. 3–12. work for bookmark weighting, in: Proceedings of the 18th ACM [73] F.F. Filho, G.M. Olson, P.L. de Geus, Kolline: a task-oriented system Conference on Information and Knowledge Management, CIKM '09, for collaborative information seeking, in: Proceedings of the 28th ACM, New York, NY, USA, 2009, pp. 1577–1580. ACM International Conference on Design of Communication, SIG- – [52] S.-Y. Chen, Y. Zhang, Improve Web Search Ranking with Social DOC '10, ACM, New York, NY, USA, 2010, pp. 89 94. Tagging, MSM. [74] S.A. Paul, M.R. Morris, Cosense: enhancing sensemaking for colla- [53] G. Salton, A. Wong, C.S. Yang, A vector space model for automatic borative web search, in: Proceedings of the SIGCHI Conference on indexing, Commun. ACM 18 (11) (1975) 613–620. Human Factors in Computing Systems, CHI '09, ACM, New York, NY, – [54] S. Amer-Yahia, M. Benedikt, L.V.S. Lakshmanan, J. Stoyanovich, USA, 2009, pp. 1771 1780. Efficient network aware search in collaborative tagging sites, Proc. [75] H. Ma, An experimental study on implicit social recommendation, VLDB Endow. 1 (1) (2008) 710–721. in: Proceedings of the 36th International ACM SIGIR Conference on [55] Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Research and Development in Information Retrieval, SIGIR '13, – Julia Stoyanovich, Efficient network aware search in collaborative ACM, New York, NY, USA, 2013, pp. 73 82. fi tagging sites, Proc. VLDB Endow 1 (1) (2008) 710–721. http://dx. [76] L. Chen, P. Wei, W. 
Zhe, Hete-cf: Social-based collaborative ltering recommendation using heterogeneous relations, in: The IEEE doi.org/10.14778/1453856.1453934. [56] J. Teevan, D. Ramage, M.R. Morris, #twittersearch: a comparison of International Conference on Data Mining, ICDM'14, 2014. [77] J. Lin, K. Sugiyama, M.-Y. Kan, T.-S. Chua, Addressing cold-start in microblog search and web search, in: Proceedings of the Fourth app recommendation: latent user models constructed from twitter ACM International conference on Web Search and Data Mining, followers, in: Proceedings of the 36th International ACM SIGIR WSDM '11, ACM, New York, NY, USA, 2011, pp. 35–44. Conference on Research and Development in Information Retrieval, [57] D. Horowitz, S.D. Kamvar, The anatomy of a large-scale social SIGIR '13, ACM, New York, NY, USA, 2013, pp. 283–292. search engine, in: Proceedings of the 19th International Conference [78] S. Sedhain, S. Sanner, D. Braziunas, L. Xie, J. Christensen, Social on World Wide Web, WWW '10, ACM, New York, NY, USA, 2010, pp. collaborative filtering for cold-start recommendations, in: Pro- 431–440. ceedings of the 8th ACM Conference on Recommender Systems, [58] A. Broder, A taxonomy of web search, SIGIR Forum 36 (2) (2002) RecSys '14, ACM, New York, NY, USA, 2014, pp. 345–348. 3–10. [79] S. Sedhain, S. Sanner, L. Xie, R. Kidd, K.-N. Tran, P. Christen, Social [59] P. Anick, R.G. Kantamneni, A longitudinal study of real-time search affinity filtering: Recommendation through fine-grained analysis of assistance adoption, in: Proceedings of the 31st Annual Interna- user interactions and activities, in: Proceedings of the First ACM tional ACM SIGIR Conference on Research and Development in Conference on Online Social Networks, COSN '13, ACM, New York, Information Retrieval, SIGIR '08, ACM, New York, NY, USA, 2008, pp. NY, USA, 2013, pp. 51–62. – 701 702. [80] F. Liu, H.J. Lee, Use of social network information to enhance col- [60] R.W. White, G. Marchionini, Examining the effectiveness of real- laborative filtering performance, Expert Syst. Appl. 37 (2010) – time query expansion, Inf. Process. Manag. 43 (3) (2007) 685 704. 4772–4778. [61] G. Dror, Y. Koren, Y. Maarek, I. Szpektor, I want to answer; who has [81] I. Guy, N. Zwerdling, D. Carmel, I. Ronen, E. Uziel, S. Yogev, S. Ofek- a question? Yahoo! answers recommender system, in: Proceedings Koifman, Personalized recommendation of items of the 17th ACM SIGKDD International Conference on Knowledge based on social relations, in: Proceedings of the third ACM con- Discovery and Data Mining, KDD '11, ACM, New York, NY, USA, ference on Recommender systems, RecSys '09, ACM, New York, NY, – 2011, pp. 1109 1117. USA, 2009, pp. 53–60. [62] J. Zhang, M.S. Ackerman, L. Adamic, K.K. Nam, Qume: a mechanism [82] I. Guy, N. Zwerdling, I. Ronen, D. Carmel, E. Uziel, Social media fi to support expertise nding in online help-seeking communities, recommendation based on people and tags, in: Proceedings of the in: Proceedings of the 20th Annual ACM Symposium on User 33rd International ACM SIGIR Conference on Research and Devel- Interface Software and Technology, UIST '07, ACM, New York, NY, opment in Information Retrieval, SIGIR '10, ACM, New York, NY, – USA, 2007, pp. 111 114. USA, 2010, pp. 194–201. [63] G. Hsieh, S. Counts, Mimir: a market-based real-time question and [83] H. Ma, I. King, M.R. 
Lyu, Learning to recommend with social trust answer service, in: Proceedings of the SIGCHI Conference on ensemble, in: Proceedings of the 32nd International ACM SIGIR Human Factors in Computing Systems, CHI '09, ACM, New York, NY, Conference on Research and Development in Information Retrieval, USA, 2009, pp. 769–778. SIGIR '09, ACM, New York, NY, USA, 2009, pp. 203–210. [64] D.H. Dalip, M.A. Gonçalves, M. Cristo, P. Calado, Exploiting user [84] M. Jamali, M. Ester, A matrix factorization technique with trust feedback to learn to rank answers in q&a forums: a case study with propagation for recommendation in social networks, in: Proceed- stack overflow, in: Proceedings of the 36th International ACM SIGIR ings of the Fourth ACM Conference on Recommender Systems, Conference on Research and Development in Information Retrieval, RecSys '10, ACM, New York, NY, USA, 2010, pp. 135–142. SIGIR '13, ACM, New York, NY, USA, 2013, pp. 543–552. [85] X. Yang, H. Steck, Y. Liu, Circle-based recommendation in online [65] B.J. Jansen, A. Chowdury, G. Cook, The ubiquitous and increasingly social networks, in: Proceedings of the 18th ACM SIGKDD Inter- significant status message, Interactions 17 (3) (2010) 15–17. national Conference on Knowledge Discovery and Data Mining, [66] B.J. Jansen, Z. Liu, C. Weaver, G. Campbell, M. Gregg, Real time KDD '12, ACM, New York, NY, USA, 2012, pp. 1267–1275. search on the web: queries, topics, and economic value, Inf. Pro- [86] H. Ma, D. Zhou, C. Liu, M.R. Lyu, I. King, Recommender systems with cess. Manag. 47 (4) (2011) 491–506. social regularization, in: Proceedings of the Fourth ACM Interna- [67] T. Peggs, The inner workings of a real time search engine: Thoughts tional Conference on Web Search and Data Mining, WSDM '11, on realtime search, 2009 (white paper). ACM, New York, NY, USA, 2011, pp. 287–296. [68] M. Efron, J. Lin, J. He, A. de Vries, Temporal feedback for tweet [87] J. Noel, S. Sanner, K.-N. Tran, P. Christen, L. Xie, E.V. Bonilla, E. search with non-parametric density estimation, in: Proceedings of Abbasnejad, N. Della Penna, New objective functions for social the 37th International ACM SIGIR Conference on Research & collaborative filtering, in: Proceedings of the 21st International 18 M.R. Bouadjenek et al. / Information Systems 56 (2016) 1–18

Conference on World Wide Web, WWW '12, ACM, New York, NY, International Conference on World Wide Web, WWW '06, ACM, USA, 2012, pp. 859–868. New York, NY, USA, 2006, pp. 953–954. [88] L. Yu, R. Pan, Z. Li, Adaptive social similarities for recommender [98] A. Hotho, R. Jäschke, C. Schmitz, G. Stumme, Folkrank: a ranking systems, in: Proceedings of the Fifth ACM Conference on Recom- algorithm for folksonomies., in: LWA, 2006. mender Systems, RecSys '11, ACM, New York, NY, USA, 2011, pp. [99] R. Krestel, P. Fankhauser, W. Nejdl, Latent Dirichlet allocation for 257–260. tag recommendation, in: Proceedings of the Third ACM Conference [89] X. Yang, Y. Guo, Y. Liu, H. Steck, A survey of collaborative filtering on Recommender Systems, RecSys '09, ACM, New York, NY, USA, based social recommender systems, Comput. Commun. 41 (0) 2009, pp. 61–68. (2014) 1–10. [100] L. Wu, L. Yang, N. Yu, X.-S. Hua, Learning to tag, in: Proceedings of [90] J. Chen, W. Geyer, C. Dugan, M. Muller, I. Guy, Make new friends, the 18th International Conference on World Wide Web, WWW '09, but keep the old: recommending people on social networking sites, ACM, New York, NY, USA, 2009, pp. 361–370. in: Proceedings of the SIGCHI Conference on Human Factors in [101] Y. Freund, R. Iyer, R.E. Schapire, Y. Singer, An efficient boosting Computing Systems, CHI '09, ACM, New York, NY, USA, 2009, pp. algorithm for combining preferences, J. Mach. Learn. Res. 4 (2003) 201–210. 933–969. [91] G. Groh, C. Ehmig, Recommendations in taste related domains: [102] X. Zhu, W. Nejdl, M. Georgescu, An adaptive teleportation random collaborative filtering vs. social filtering, in: Proceedings of the walk model for learning social tag relevance, in: Proceedings of the 2007 International ACM Conference on Supporting Group Work, 37th International ACM SIGIR Conference on Research & Develop- GROUP '07, ACM, New York, NY, USA, 2007, pp. 127–136. ment in Information Retrieval, SIGIR '14, ACM, New York, NY, USA, [92] P. Symeonidis, A. Nanopoulos, Y. Manolopoulos, "A Unified Fra- 2014, pp. 223–232. mework for Providing Recommendations in Social Tagging Systems [103] C. Wei, W. Hsu, M. L. Lee, A unified framework for recommenda- Based on Ternary Semantic Analysis," in Knowledge and Data tions based on quaternary semantic analysis, in: Proceedings of the Engineering, IEEE Transactions on, 22 (2) (2010) 179–192, 34th International ACM SIGIR Conference on Research and Devel- http://dx.doi.org/10.1109/TKDE.2009.85. opment in Information Retrieval, SIGIR '11, ACM, New York, NY, [93] J. Hannon, M. Bennett, B. Smyth, Recommending twitter users to USA, 2011, pp. 1023–1032. follow using content and collaborative filtering approaches, in: [104] M. Sarwat, J. Levandoski, A. Eldawy, M. Mokbel, Lars: an efficient Proceedings of the Fourth ACM Conference on Recommender and scalable location-aware recommender system, IEEE Trans. Systems, RecSys '10, ACM, New York, NY, USA, 2010, pp. 199–206. Knowl. Data Eng. 26 (6) (2014) 1384–1399. [94] J. Hannon, K. McCarthy, B. Smyth, Finding useful users on twitter: [105] Peter Mika, Social Networks and the Semantic Web (Semantic Web twittomender the followee recommender, in: Proceedings of the and Beyond), Springer-Verlag New York, Inc, Secaucus, NJ, USA, 33rd European Conference on Advances in Information Retrieval, 2007. ECIR'11, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 784–787. [106] N.J. Belkin, Some(what) grand challenges for information retrieval, [95] X. Wang, H. Liu, W. 
Fan, Connecting users with similar interests via SIGIR Forum 42 (1) (2008) 47–54. tag network inference, in: Proceedings of the 20th ACM Interna- [107] J. Stoer, R. Bulirsch, Introduction to Numerical Analysis, 3rd edition, tional Conference on Information and Knowledge Management, Springer, New York, 2002. CIKM '11, ACM, New York, NY, USA, 2011, pp. 1019–1024. [108] K. Caine, R. Hanania, Patients want granular privacy control over [96] Robert Jäschke, Leandro Marinho, Andreas Hotho, Lars Schmidt- health information in electronic medical records, J. Am. Med. Inf. Thieme, Gerd Stumme, Tag recommendations in social book- Assoc. 20 (1). marking systems, AI Commun 21 (2008) 231–247. [97] G. Mishne, Autotag: a collaborative approach to automated tag assignment for weblog posts, in: Proceedings of the 15th Collaborative and Social Information Retrieval and Access: Techniques for Improved User Modeling

Max Chevalier University of Toulouse, IRIT (UMR 5505), France

Christine Julien University of Toulouse, IRIT (UMR 5505), France

Chantal Soulé-Dupuy University of Toulouse, IRIT (UMR 5505), France

Published by Information Science Reference (an imprint of IGI Global), Hershey, PA • New York, 2009. ISBN 978-1-60566-306-7 (hardcover), ISBN 978-1-60566-307-4 (ebook).

Chapter VII Combining Relevance Information in a Synchronous Collaborative Information Retrieval Environment

Colum Foley Dublin City University, Ireland

Alan F. Smeaton Dublin City University, Ireland

Gareth J. F. Jones Dublin City University, Ireland

ABSTRACT

Traditionally, information retrieval (IR) research has focussed on a single-user interaction modality, where a user searches to satisfy an information need. Recent advances in both Web technologies, such as the sociable Web of Web 2.0, and computer hardware, such as tabletop interface devices, have enabled multiple users to collaborate on many computer-related tasks. Due to these advances there is an increasing need to support two or more users searching together at the same time in order to satisfy a shared information need, which we refer to as Synchronous Collaborative Information Retrieval. Synchronous Collaborative Information Retrieval (SCIR) represents a significant paradigmatic shift from traditional IR systems. In order to support an effective SCIR search, new techniques are required to coordinate users' activities. In this chapter we explore the effectiveness of a sharing-of-knowledge policy on a collaborating group. Sharing of knowledge refers to the process of passing relevance information across users: if one user finds items of relevance to the search task, then the group should benefit in the form of improved ranked lists returned to each searcher. In order to evaluate the proposed techniques the


authors simulate two users searching together through an incremental feedback system. The simulation assumes that users decide on an initial query with which to begin the collaborative search and proceed through the search by providing relevance judgments to the system and receiving a new ranked list. In order to populate these simulations we extract data from the interaction logs of various experimental IR systems from previous Text REtrieval Conference (TREC) workshops.
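The sharing-of-knowledge policy studied in this chapter can be illustrated with a Rocchio-style reformulation in which the relevance judgments contributed by both collaborating searchers are pooled before the shared query vector is updated. The sketch below is a simplified stand-in for the authors' incremental feedback system, with toy term vectors and standard Rocchio parameters (alpha and beta values are assumptions, not taken from the chapter):

```python
import numpy as np

def rocchio(query_vec, relevant_docs, alpha=1.0, beta=0.75):
    """Positive-only Rocchio update: move the query toward the centroid of the
    documents judged relevant (here pooled across collaborating searchers)."""
    if not relevant_docs:
        return query_vec
    centroid = np.mean(relevant_docs, axis=0)
    return alpha * query_vec + beta * centroid

# Toy term space and an initial shared query vector.
query = np.array([1.0, 0.0, 0.5, 0.0])

# Relevance judgments contributed independently by the two collaborating searchers.
judged_by_user1 = [np.array([0.9, 0.1, 0.4, 0.0])]
judged_by_user2 = [np.array([0.8, 0.0, 0.6, 0.3])]

# Sharing of knowledge: pool both users' judgments before reformulating the query.
shared = judged_by_user1 + judged_by_user2
print(rocchio(query, shared))
```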

INTRODUCTION plore a subset of a document collection in order to reduce the redundancy associated with multiple The phrase “Collaborative Information Re- people viewing the same documents. Sharing of trieval” has been used in the past to refer to knowledge enables collaborating users to benefit many different technologies which support from the knowledge of their collaborators. Early collaboration in the information retrieval (IR) SCIR systems provided various awareness cues process. Much of the early work in collabora- such as chat windows, shared whiteboards and tive information retrieval has been concerned shared bookmarks. By providing these cues, with asynchronous, remote collaboration via the these systems enabled the collaborating searchers reuse of previous search results and processes in to coordinate their activities in order to achieve collaborative filtering systems, collaborative re- a division of labour and sharing of knowledge. ranking, and collaborative footprinting systems. However, coordinating activities amongst users Asynchronous collaborative information retrieval can be troublesome, requiring too much cognitive supports a passive, implicit form of collaboration load (Adcock et al., 2007). where the focus is to improve the search process Recently we have seen systems to support a for an individual. more system-mediated division of labour by divid- Synchronous collaborative information ing the results of a search query amongst searchers retrieval (SCI R) is a n e me rg i ng for m of col labor a- (Morris and Horvitz, 2007), or defining searcher tive IR in which a group of two or more users are roles (Adcock et al., 2007). However, there has explicitly collaborating in a synchronised manner been no work to date which addresses the system- in order to satisfy a shared information need. The mediated sharing of knowledge across collaborat- motivation behind these systems is related to both ing searchers. In this chapter we introduce our the ever-growing corpus of human knowledge on techniques to allow for effective system-mediated the web, the improvement of social awareness on sharing of knowledge. We evaluate how a sharing the internet today, and the development of novel of knowledge policy affects the performance of a computer interface devices. SCIR systems repre- group of users searching together collaboratively. sent a significant paradigmatic shift in focus and But first, in the next section, we provide a compre- motivation compared with traditional IR systems hensive account of work to date in synchronous and asynchronous collaborative IR systems. The collaborative information retrieval. development of new IR techniques is needed to exploit this. In order for collaborative IR to be effective there needs to be both an appropriate SYNCHRONOUS COLLABORATIVE division of labour, and an effective sharing of INFORMATION RETRIEVAL knowledge across collaborating searchers (Zebal- los, 1998; Foley et al., 2006). Division of labour Information retrieval (IR), as defined by Baeza- enables each collaborating group member to ex- Yates and Ribeiro-Neto (1999), is concerned with


the representation, storage, organisation of and access to information items. The purpose of an IR system is to satisfy an information need.

Synchronous collaborative information retrieval (SCIR) systems are concerned with the realtime, explicit, collaboration which occurs when multiple users search together to satisfy a shared information need; these systems represent a significant paradigmatic shift in IR systems from an individual focus to a group focus. As such, these systems represent a more explicit, active form of collaboration, where users are aware that they are collaborating with others towards a common, and usually explicitly stated, goal. This collaboration can take place either remotely or in a co-located setting. These systems have gained in popularity and now, with the ever-growing popularity of the social web (or Web 2.0), support for explicit, synchronous collaborative information retrieval is becoming more important than ever.

The benefit of allowing multiple users to search together in order to satisfy a shared information need is that it can allow for a division of labour and a sharing of knowledge across a collaborating group. Division of labour means that each member of a collaborating group can explore a subset of information, thereby reducing the redundancy associated with two or more people viewing the same documents and improving the efficiency of a search. Some methods proposed here include increasing the awareness amongst users of each collaborative searcher's progress (Diamadis and Polyzos, 2004; Smeaton et al., 2006) or system-mediated splitting of a task (Foley et al., 2006; Adcock et al., 2007). The ability to effectively share information is the foundation of any group activity (Yao et al., 1999). Sharing of knowledge across group members involved in a collaborative search can occur by providing awareness of other searchers' progress through the search, and this can be achieved by enabling direct chat facilities (Gianoutsos and Grundy, 1996; Gross, 1999; Krishnappa, 2005) or group blackboards (Gianoutsos and Grundy, 1996; Cabri et al., 1999) so that brainstorming activities can be facilitated.

The first examples of SCIR tools were built using a distributed architecture where software enabled communication across groups of remote users. Recently the development of new computing devices has facilitated the development of co-located SCIR tools. We will now outline research to date in each of these areas.

Synchronous Remote Collaborative Information Retrieval

SCIR systems have been developed to enable remote users to search and browse the web together. These systems often require users to log in to a particular service or may require the use of particular applications in order to facilitate collaboration.

GroupWeb (Greenberg and Roseman, 1996) represents an early collaborative browsing environment and was built upon the GroupKit groupware toolkit (Roseman and Greenberg, 1996). In GroupWeb, several users could log onto a collaborative browsing session and the web browser was used as a group "presentation tool". A master browser (or "presenter") selected a page and this page was displayed to each group member using a form of "What You See Is What I See" (WYSIWIS). The system also supported synchronous scrolling and independent scrolling on a web page and supported the use of telepointers (showing others' mouse pointers on the page) in order to allow users to focus the attention of the group and to enact gestures. GroupWeb provided an annotation window where groups could attach shared annotations to pages when viewing, and these annotations could be viewed by all group members. In GroupWeb, group members were tightly coupled. Enabling each user to see the same documents increases awareness amongst group members but can be an inefficient technique for exploring the vastness of the web.


The W4 browser (Gianoutsos and Grundy, 1996) extended the GroupWeb system to allow users to browse the web independently whilst synchronising their work. In W4, a user could view all pages viewed by other users, they could chat with each other, share bookmarks (i.e. documents deemed relevant), and see a shared WYSIWIS white-board to brainstorm. Users could also embed chat sessions, links and annotations directly into a web page. A similar approach was employed by Cabri et al. (1999), which used a proxy server to record documents viewed by others. These documents were then displayed to each user in a separate browser window. The system also made others aware of these viewed documents by editing the HTML mark-up in pages viewed by each collaborating searcher (links to pages already viewed by other users in a session were indicated using different colours).

The above systems all required users to explicitly log onto a service to support collaborative searching. Systems have been developed in order to make users who are browsing the web aware of others who may be nearby in order to facilitate a more spontaneous collaboration. Donath and Robertson (1994) developed a tool which enabled people to see others currently viewing the same web page as themselves. The system also allowed them to interact with these people and coordinate their activities in order to travel around the web as a group. Sidler et al. (1997) extended this approach in order to allow users to identify other searchers within their neighbourhood to enable spontaneous collaboration.

SearchTogether (Morris and Horvitz, 2007) was a prototype system which incorporated many synchronous and asynchronous tools to enable a small group of remote users to work together to satisfy a shared information need. SearchTogether was built to support awareness of others, division of labour, and persistence of the search process. Awareness of others was achieved by representing each group member with a screen name and photo. Whenever a team member performed a new search, the query terms were displayed in a list underneath their photo. By clicking on a search query a user could see the results returned for this query, and this reduced the duplication of effort across users. When visiting a page, users could also see which other users had visited that page previously. Users could also provide ratings for pages using a thumbs-up or thumbs-down metaphor. Support for division of labour was achieved through an embedded text chat facility, a recommendation mechanism, and a split search and multi-search facility. Using split search a user could divide the results of their search with a collaborating searcher and, using multi-search, a search query could be submitted to different search engines, each associated with different users.

The Adaptive Web Search (AWS) system proposed by Dalal (2007) represented a combination of personalised, social and collaborative search. The system was a type of meta-search system in which users could search using multiple search engines and maintain a preference vector for a particular engine based on their long and short term search contexts, user goals and geographic location. Users could perform social searching by having their preference vector influenced by others depending on a level of trust.

A commercial application of synchronous collaborative IR is available in the popular Windows Live Messenger, an instant messaging service. During a chat session, users can search together by having the results from a search displayed to each user (Windows Live Messenger, 2007). Netscape Conferencer (Netscape Conferencer, 2001) allows multiple users to browse the web together using WYSIWIS, where one user controls the navigation, and chat facilities and whiteboards are implemented to facilitate communication.

Synchronous Co-Located Collaborative Information Retrieval

Recent advances in ubiquitous computing devices such as mobile phones and PDAs have allowed


researchers to begin exploring techniques for spontaneous collaborative search.

Maekawa et al. (2006) developed a system for collaborative web browsing on mobile phones and PDAs. In this system a web page was divided into several components and these components were distributed across the devices of collaborating users. WebSplitter (Han et al., 2000) was a similar system for providing partial views to web pages across a number of users and potentially across a number of devices available to a user (e.g. laptop, PDA).

Advances in single display groupware (SDG) technology (Stewart et al., 1999) have enabled the development of collaborative search systems for the co-located environment. The advantages of such systems are that they improve the awareness of collaborating searchers by bringing them together in a face-to-face environment. Increased awareness can enable both a more effective division of labour and a greater sharing of knowledge.

Let's Browse (Lieberman et al., 1999) was a co-located web browsing agent which enabled multiple users standing in front of a screen (projected display onto a wall) to browse the web together based on their user profiles. A user profile in the system consisted of a set of weighted keywords of their interests and was built automatically by extracting keywords from both the user's homepage and those pages around it. Users wore electronic badges so that they could be identified as they approached the screen. A collaborating group of users using Let's Browse were shown a set of recommended links to follow from the current page, ordered by their similarity to the aggregated users' profiles.

The tangible interface system developed by Blackwell et al. (2004) allowed a group of users to perform "Query-By-Argument", whereby a series of physical tokens with RFID transmitters could be arranged on a table to develop a team's query.

The TeamSearch system developed by Morris et al. (2006) enabled a group of users collaborating around an electronic tabletop to sift through a stack of pictures using collaborative Boolean query formulation. The system enabled users to locate relevant pictures from a stack by placing query tokens on special widgets which corresponded to predefined metadata categories for the images.

Figure 1. Físchlár-DiamondTouch


Físchlár-DiamondTouch (shown in Figure 1) was a multi-user video search application developed by the authors and others at the Centre for Digital Video Processing at Dublin City University (Smeaton et al., 2006). Físchlár-DiamondTouch was developed on an interactive table known as DiamondTouch (Dietz and Leigh, 2001) and used the groupware toolkit DiamondSpin (Shen et al., 2004) from Mitsubishi Electric Research Labs (MERL). The system allowed two users to collaborate in a face-to-face manner in order to interact with a state-of-the-art video retrieval application, Físchlár (Smeaton et al., 2001). Users could enter a free text query using an on-screen keyboard. This query was then issued to the search engine and a list of the 20 top ranked keyframes (an image from a video chosen as a representative of a particular video shot) was displayed upon the screen (the most relevant in the middle and decreasing in relevance as the images spiralled out). Keyframes were rotated to the nearest user in order to provide for an implicit division of labour. Two versions of Físchlár-DiamondTouch were evaluated in TRECVid 2005 (Foley et al., 2005), one which provided for increased awareness amongst users and one which was designed for improved group efficiency. This represented the first time any group had performed collaborative search in any TREC or TRECVid workshop.

In an effort to improve collaborative search effectiveness through "algorithmically-mediated collaboration", the "Cerchiamo" system was developed by the FXPAL TRECVid team (Adcock et al., 2007). Cerchiamo was designed to support two users working together to find relevant shots of videos. Two users worked under predefined roles of "prospector" and "miner". The role of the prospector was to locate avenues for further exploration, while the role of the miner was to explore these avenues. The system used information from users' interactions to determine the next shots to display on-screen and provide a list of suggested query terms.

In this section we have described work to date in synchronous collaborative information retrieval (SCIR). Early SCIR systems focussed on providing tools to increase awareness across collaborating users. Research has suggested that the performance of a group searching in a SCIR system can be improved by allowing for a division of labour and sharing of knowledge. In early SCIR systems, the onus for coordinating the group's activities was placed on the users. Work has been done into allowing for a system-mediated division of labour, but no work, as yet, has attempted to implement a system-mediated sharing of knowledge in order to improve the quality of ranked lists returned to users searching together in an adhoc search task.

SHARING OF KNOWLEDGE IN SYNCHRONOUS COLLABORATIVE INFORMATION RETRIEVAL

Suppose two users are searching together to satisfy a shared information need using a state-of-the-art SCIR system as described in the previous section. When two or more users come together in an SCIR environment, there are several ways in which to initiate the collaborative search. For example, users may each decide to formulate their own search query, or users may decide on a shared, group query. Having generated an initial ranked list for an SCIR search, either as a result of a shared query or a separate query for each collaborator, these results can be divided across users using a simple round-robin strategy, where, for a collaborative search involving two users with a shared initial query, user 1 would receive the first document in the ranked list, user 2 would receive the second ranked document, user 1 the third, and so on until all results are distributed across the users. This is the approach proposed by Morris and Horvitz (2007).
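The round-robin strategy just described is simple enough to sketch directly. The following Python fragment is only an illustration of that strategy; the function name and document identifiers are invented for the example and are not taken from any of the systems discussed:

```python
def round_robin_split(ranked_list, num_users=2):
    # Distribute a single ranked list across the collaborating users:
    # user 0 receives ranks 1, 3, 5, ..., user 1 receives ranks 2, 4, 6, ...
    per_user = [[] for _ in range(num_users)]
    for rank, doc_id in enumerate(ranked_list):
        per_user[rank % num_users].append(doc_id)
    return per_user

# A shared initial query returns one ranked list, which is split in two.
user1_docs, user2_docs = round_robin_split(["d1", "d2", "d3", "d4", "d5"])
# user1_docs -> ['d1', 'd3', 'd5'], user2_docs -> ['d2', 'd4']
```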


As users examine documents and find those relevant to the search, they may save them to a "bookmarked" area. What these users are doing is providing explicit relevance judgments to the search engine. In traditional, single-user IR, these relevance judgments are often used in a process known as relevance feedback to improve the quality of a user's query by reformulating it based on this relevance information. Over a number of relevance feedback iterations, an IR system can build a short term profile of the user's information need. At present, SCIR systems do not use this new relevance information directly in the search process to re-formulate a user's query; instead it is used simply as a bookmark and therefore we believe that this information is wasted. No attempt is made to utilise this relevance information during the course of an SCIR search to improve the performance of a collaborating group of users. As a consequence, the collaborating group does not see the benefit of this relevance information in their ranked lists.

Relevance feedback is an IR technique which has been proven to improve the performance of the IR process through incorporating extra relevance information provided by users (in the form of documents identified as relevant by a user) into an automatic query reformulation process. The basic operation of relevance feedback consists of two steps: (1.) query expansion, whereby significant terms from documents judged relevant are identified and appended to the user's original query, and (2.) relevance weighting, which biases the weights of each query term based on this relevance information.

If we provided a relevance feedback mechanism in an SCIR system, then when a member of an SCIR group initiates a relevance feedback operation, if their search partner has provided relevance judgments to the system, we could incorporate both users' relevance judgments into the feedback process. This could enable better quality results to be returned to the user. Furthermore, such a sharing of knowledge policy could allow users to benefit from the relevance of a document without having to view the contents of the document. It is not clear, however, how multi-user relevance information should be handled in a relevance feedback process.

Collaborative Relevance Feedback

One of the simplest ways to incorporate multi-user relevance information into a feedback process is to assume that one user has provided all the relevance judgments made by all users and then initiate a standard, single-user, relevance feedback process over these documents. There may, however, be occasions where it is desirable to allow for a user-biased combination of multi-user relevance information. Therefore, we will outline how the RF process can be extended to allow for a weighted combination of multi-user relevance information in a collaborative relevance feedback process. Combination of evidence is an established research problem in IR (Croft, 2002); in our work we are interested in investigating the combination of multi-user relevance information within the relevance feedback process. We use the probabilistic model for retrieval, which is both theoretically motivated and proven to be successful in controlled TREC experiments, first shown in (Robertson et al., 1992). In the probabilistic retrieval model the relevance feedback processes of Query Expansion and Term Reweighting are treated separately (Robertson, 1990). Figure 2 presents a conceptual overview of the collaborative relevance feedback process for two users searching together. When the relevance feedback process is initiated, user 1 has provided 3 relevance judgments and user 2 has provided 4. As we can see, we have a choice as to what stage in the relevance feedback process we can combine this information. In particular, we have identified three stages in the process at which we can combine relevance information.


Figure 2. Combining relevance information, the 3 choices


Combining Inputs to the Relevance Feedback Process (A)

The relevance feedback process uses all available relevance information for a term in order to assign it a score for both query expansion and term reweighting. If we have relevance information from multiple co-searchers, combining this information before performing relevance feedback should result in an improved combined measure of relevance for these terms. This is the rationale behind this novel method for combining relevance information, which we refer to as partial-user weighting, as the evidence for relevance or non-relevance of a term is composed of the combined partial evidence from multiple users.

We will now outline the derivation for the partial-user relevance weight and partial-user offer weight. From Robertson and Spärck Jones (1976), we can see that the probability of relevance of a term is defined as:

    w(i) = \log \frac{p(1 - q)}{q(1 - p)}    (1)

where
p = probability that a document contains term i given that it is relevant
q = probability that a document contains term i given that it is non-relevant

The appropriate substitutions for p and q are the proportions:

    p = \frac{r_i}{R}    (2)

    q = \frac{n_i - r_i}{N - R}    (3)

where
r_i = Number of relevant documents in which term i occurs


R = Number of identified relevant documents
n_i = Number of documents in the collection in which term i occurs
N = Number of documents in the collection

The probability that a document contains term i given that it is relevant, p, is equal to the proportion of all relevant documents in which the term i occurs. The probability that a document contains term i given that it is non-relevant, q, is equal to the proportion of all non-relevant documents that contain the term. Applying these substitutions to equation 1 we get the standard relevance weighting formula (Robertson and Spärck Jones, 1976):

    rw(i) = \log \frac{\frac{r_i}{R}\left(1 - \frac{n_i - r_i}{N - R}\right)}{\frac{n_i - r_i}{N - R}\left(1 - \frac{r_i}{R}\right)}    (4)

If we assume that in a collaborative search session we have U collaborating users searching, then the proportions for p and q, in equations 2 and 3 respectively, can be extended as follows:

    p = \sum_{u=0}^{U-1} \alpha_u \frac{r_{ui}}{R_u}    (5)

    q = \sum_{u=0}^{U-1} \alpha_u \frac{n_i - r_{ui}}{N - R_u}    (6)

where
n_i, N are as before
r_{ui} = Number of relevant documents identified by user u in which term i occurs
R_u = Number of relevant documents identified by user u
\alpha_u = Determines the impact of user u's proportions on the final term weight, and \sum_{u=0}^{U-1} \alpha_u = 1

Therefore we have extended the proportions using a linear combination of each user's relevance statistics. Using this approach, the probability that a document contains term i, given that it is relevant, is equal to the sum of the proportions for relevance from each user. The probability that a document contains term i, given that it is not relevant, is equal to the sum of the proportions of non-relevance. Each of these values is multiplied by a scalar constant α_u, which can be used to vary the effect of each user's proportion in the final calculation, and a default value of 1/U can be used to consider all users equally.

One important consideration when combining multi-user relevance information is what to do when a term has not been encountered by a user (i.e. the term is not contained in the user's relevance judgments). There are two choices here: we can either allow the user that has not encountered the term to still contribute to the shared weight, or we can choose to assign a weight to a term based solely on the relevance and non-relevance proportions of users that have actually encountered the term.

If we wish to incorporate a user's proportions for a term regardless of whether the term appears in any of the user's relevance judgments, then the term will receive a relevance proportion, p, of 0 (as r_{ui} = 0 for that user) and a non-relevance proportion, q, of n_i/(N - R_u), from a user who has not encountered the term.

If we do not wish to incorporate a user's proportion for a term, in the case that they have not encountered the term, then the shared relevance and non-relevance proportions of p and q in equation 5 and equation 6 respectively will only be composed of the proportions from users who have encountered the term.
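As a concrete illustration, the following Python sketch computes the standard relevance weight of equation 4 and the extended partial-user proportions of equations 5 and 6. It is only a restatement of the formulas above under assumed variable names, not the authors' implementation:

```python
from math import log

def rsj_relevance_weight(r_i, R, n_i, N):
    # Equation 4: substitute p = r_i/R and q = (n_i - r_i)/(N - R)
    # into w(i) = log[p(1 - q) / (q(1 - p))].
    p = r_i / R
    q = (n_i - r_i) / (N - R)
    return log(p * (1 - q) / (q * (1 - p)))

def partial_user_proportions(r_ui, R_u, alphas, n_i, N):
    # Equations 5 and 6: each user's own proportions are scaled by
    # alpha_u (the alphas should sum to 1; 1/U treats all users equally)
    # and summed to give the shared p and q.
    p = sum(a * r / R for a, r, R in zip(alphas, r_ui, R_u))
    q = sum(a * (n_i - r) / (N - R) for a, r, R in zip(alphas, r_ui, R_u))
    return p, q

# Two users: user 0 saved 3 relevant documents (all containing the term),
# user 1 saved 4 (one containing the term); collection statistics assumed.
p, q = partial_user_proportions(r_ui=[3, 1], R_u=[3, 4],
                                alphas=[0.5, 0.5], n_i=40, N=10000)
```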


Applying the extended proportions of p and q, in equations 5 and 6 respectively, to the probability of relevance from equation 1 results in our partial-user relevance weight (purw):

    purw(i) = \log \frac{\left(\sum_{u=0}^{U-1} \alpha_u \frac{r_{ui}}{R_u}\right)\left(1 - \sum_{u=0}^{U-1} \alpha_u \frac{n_i - r_{ui}}{N - R_u}\right)}{\left(\sum_{u=0}^{U-1} \alpha_u \frac{n_i - r_{ui}}{N - R_u}\right)\left(1 - \sum_{u=0}^{U-1} \alpha_u \frac{r_{ui}}{R_u}\right)}    (7)

For practical implementation of the standard relevance weighting formula (equation 4), and to limit the errors associated with zeros such as dividing by zero, a simple extension is commonly used that adds a constant to the values in the proportions. Applying the proportions suggested in Robertson and Spärck Jones (1976), known as the Jeffrey prior, to equation 7 results in:

    purw(i) = \log \frac{\left(\sum_{u=0}^{U-1} \alpha_u \frac{r_{ui} + 0.5}{R_u + 1}\right)\left(1 - \sum_{u=0}^{U-1} \alpha_u \frac{n_i - r_{ui} + 0.5}{N - R_u + 1}\right)}{\left(\sum_{u=0}^{U-1} \alpha_u \frac{n_i - r_{ui} + 0.5}{N - R_u + 1}\right)\left(1 - \sum_{u=0}^{U-1} \alpha_u \frac{r_{ui} + 0.5}{R_u + 1}\right)}    (8)

So far we have shown how the partial-user method can be applied to the standard relevance weighting formula, which is used for reweighting terms in the relevance feedback process. Now we will consider applying the scheme to the offer weighting formula (Robertson, 1990), which is used to rank terms for query expansion:

    ow(i) = r_i \cdot rw(i)    (9)

Using a linear combination approach, the r_i value in equation 9 can be extended to include a weighted combination of each collaborating user's r_i value, to produce a partial-user offer weight (puow):

    puow(i) = \left(\sum_{u=0}^{U-1} \alpha_u r_{ui}\right) \cdot purw(i)    (10)

where
\alpha_u = Determines the impact of each user's r value on the final weight, and \sum_{u=0}^{U-1} \alpha_u = 1
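A minimal sketch of equations 8 and 10, assuming the reconstruction given above (the smoothed proportions substituted into the log-odds form, and the weighted r value multiplied by the partial-user relevance weight). The "no contr" variant discussed earlier would simply drop users who have not encountered the term from the sums and renormalise their α values; only the "contr" case is shown:

```python
from math import log

def partial_user_relevance_weight(r_ui, R_u, alphas, n_i, N):
    # Equation 8: partial-user proportions with the 0.5 / 1.0 correction
    # ("Jeffrey prior"), substituted into log[p(1 - q) / (q(1 - p))].
    p = sum(a * (r + 0.5) / (R + 1) for a, r, R in zip(alphas, r_ui, R_u))
    q = sum(a * (n_i - r + 0.5) / (N - R + 1)
            for a, r, R in zip(alphas, r_ui, R_u))
    return log(p * (1 - q) / (q * (1 - p)))

def partial_user_offer_weight(r_ui, alphas, purw_i):
    # Equation 10: the r_i factor of the offer weight (equation 9) is
    # replaced by a weighted combination of each user's r_ui.
    return sum(a * r for a, r in zip(alphas, r_ui)) * purw_i

purw_i = partial_user_relevance_weight(r_ui=[3, 1], R_u=[3, 4],
                                       alphas=[0.5, 0.5], n_i=40, N=10000)
puow_i = partial_user_offer_weight(r_ui=[3, 1], alphas=[0.5, 0.5],
                                   purw_i=purw_i)
```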


Combining Outputs of the Relevance Feedback Process (B)

This method of combination operates at a higher level of granularity than the previous method by treating the relevance feedback process as a black box. In this method, for each user, relevance weighting and offer weighting are calculated separately using only a searcher's own relevance statistics (i.e. terms from documents identified as relevant by the user and their distribution in these relevant documents). The outputs from these processes (i.e. the scores) are combined to produce a combined weight. Combination is therefore performed at a later stage in the relevance feedback process than the method proposed in the previous section.

For relevance weighting, we calculate the combined relevance weight (crw) using a linear combination of relevance weight scores from all users:

    crw(i) = \sum_{u=0}^{U-1} \alpha_u \, rw_u(i)    (11)

For offer weighting we can follow the same approach, by calculating the offer weight separately for each user and then combining afterwards to produce a combined offer weight (cow):

    cow(i) = \sum_{u=0}^{U-1} \alpha_u \, ow_u(i)    (12)

where
\alpha_u = determines the impact of each user's contribution on the final score, and \sum_{u=0}^{U-1} \alpha_u = 1

As with the partial-user method, we can either include or leave out a user's contribution to either the combined relevance weight or combined offer weight if they have not encountered the term in their own set of relevance judgments. Once again the α_u variable can be used to control the impact of each user's evidence on the combination and a default value can be set to 1/U for all users to give all users the same weighting.

Combining Outputs of the Ranking Process (C)

This stage of combination operates at a higher level of granularity than either of the previous methods, as here we treat the entire search engine as a black box and combination is performed at the ranked list or document level. Combining the outputs from multiple ranking algorithms has become a standard method for improving the performance of IR systems' ranking (Croft, 2002).

In order to produce a combined ranked list, a reformulated query is generated for each collaborating user, based on their own relevance information, and these relevance feedback queries are then submitted to the search engine in order to produce separate ranked lists, one for each user. These ranked lists are then combined in order to produce a combined ranked list. Combination at the document level can be achieved, as before, by performing a linear combination of the document scores produced by the search engine, to arrive at a combined document score (cds):

    cds(d, q) = \sum_{u=0}^{U-1} \alpha_u \, s_{ud}    (13)

where
s_{ud} = the relevance score for document d in relation to user u's query
\alpha_u = determines the impact of each user's contribution on the final document score, and \sum_{u=0}^{U-1} \alpha_u = 1
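Since equations 11–13 are all linear, α-weighted combinations of per-user scores, a single helper captures them. This is only an illustration; the numeric values below are invented:

```python
def combined_score(per_user_scores, alphas):
    # Shared form of equations 11-13: an alpha-weighted linear
    # combination of scores computed independently for each user.
    return sum(a * s for a, s in zip(alphas, per_user_scores))

# Method B: combine each user's own relevance weight / offer weight for a term.
crw_i = combined_score([1.8, 2.3], alphas=[0.5, 0.5])    # equation 11
cow_i = combined_score([5.4, 2.3], alphas=[0.5, 0.5])    # equation 12

# Method C: combine the scores a document receives against each user's
# reformulated query, giving the combined document score.
cds_dq = combined_score([14.1, 12.2], alphas=[0.5, 0.5])  # equation 13
```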

In this section we have outlined how a system-mediated sharing of knowledge can be achieved in an SCIR search, by incorporating each group member's relevance judgments into a collaborative relevance feedback process. We have proposed three methods by which the standard relevance feedback formula can be extended into a collaborative relevance feedback process. Evaluating these techniques will allow us to establish how a sharing of knowledge policy, via collaborative relevance feedback, impacts on the performance of a collaborating group. In the next section we will outline how we plan to evaluate these techniques.

Experimental Setup

In this chapter, we are evaluating many different approaches to combining relevance information. It would have been infeasible to evaluate each of these approaches thoroughly using real user experiments. Instead, by using simulations we can evaluate our proposed approaches effectively while ensuring that our evaluations are realistic.


Requirements Analysis

Previous IR experiments that have used user simulations have focussed on a single user's interactions with an IR system. Here we are attempting to simulate a synchronous collaborative IR environment: a dynamic, collaborative simulation. We are conscious that the simulation should be realistic of future systems in any device or interface which could support SCIR search, i.e. desktop search, tabletop search, PDA or Apple iPhone search, etc.

Our SCIR simulations will simulate a search involving two collaborating users. Recent studies of the collaborative nature of a search task have shown how the majority of synchronous collaborative search sessions involve a collaborating group of size two (Morris and Horvitz, 2007), and therefore we believe that this group size is the most appropriate to model, though the techniques proposed could scale to larger groups.

One of the important considerations for any SCIR system is how to begin a collaborative search. For the experiments reported here the search assumes that one initial query has been formulated. In a real system, this query could be formulated by one user or by both users collaboratively. By only requiring one query from the set of users, we can limit the interactions needed by users with the search system. Although querying may be easy using the standard keyboard and mouse combination, interactions with phones or other handheld devices can be difficult.

In these experiments we are interested in evaluating how a system-mediated sharing of knowledge policy can operate alongside an explicit division of labour policy. Therefore, the simulated SCIR system will implement a system-mediated division of labour where the results returned to a searcher at any point in the search are automatically filtered in order to remove:

1. Documents seen by their search partner.
2. Documents assumed seen by their search partner. These are documents that are in their search partner's current ranked list.

In our simulations, users do not manually reformulate their queries during the search; instead, in order to receive new ranked lists during the search, users use a simple relevance feedback mechanism. In any SCIR search, users may provide multiple relevance assessments over the course of the search session, and therefore we have a choice as to when to initiate a relevance feedback operation. For example, we could choose to perform feedback after a user provides a relevance judgment, or after the user has provided a certain number of relevance judgments. For the purposes of the experiments reported in this chapter, our SCIR simulations operate by initiating a relevance feedback operation for a user each time they provide a relevance judgment, thereby returning a new ranked list of documents to the user. This approach is known as Incremental Relevance Feedback, a method first proposed by Aalbersberg (1992). Using the Incremental RF approach, a user is provided with a new ranked list of documents after each relevance judgment, rather than accumulating a series of relevance judgments together and issuing them in batch to the RF process. This can enable users to benefit immediately from their relevance judgments. Furthermore, studies have shown that applying feedback after only one or two relevant documents have been identified can substantially improve performance over an initial query (Spärck Jones, 1997).

Another choice for any SCIR system, related to feedback granularity, is whether to present a user with a new ranked list only when they perform feedback themselves or when they or their search partner provides feedback.
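The simulated interaction just described, incremental feedback on every judgment with results filtered for the division of labour policy, can be sketched as follows. The function and field names are illustrative only, and `run_feedback` stands in for whatever search engine and relevance feedback implementation is used; this is not the authors' code:

```python
def filter_for_division_of_labour(ranked_list, partner_judged,
                                  partner_current, depth=30):
    # Remove documents the partner has judged, plus documents assumed
    # seen: those in the top `depth` of the partner's current list.
    excluded = set(partner_judged) | set(partner_current[:depth])
    return [doc for doc in ranked_list if doc not in excluded]

def incremental_feedback_step(user, partner, doc_id, run_feedback):
    # One simulated step: `user` saves `doc_id`, incremental relevance
    # feedback is run immediately, and the fresh list is filtered
    # against the partner's activity before being returned to the user.
    user["judged"].append(doc_id)
    new_list = run_feedback(user["query"], user["judged"])
    user["current"] = filter_for_division_of_labour(
        new_list, partner["judged"], partner["current"])
    return user["current"]
```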


Presenting users with a new ranked list when their search partner performs relevance feedback may allow users to benefit more quickly from their partner's relevance judgments. However, deployment of such an intensive SCIR system would require designers to develop novel interface techniques to allow for the seamless updating of a user's ranked list. Furthermore, users searching in such an intensive system may suffer from cognitive overload by being presented with new ranked lists, seemingly, at random. In this chapter we will evaluate the effects of both of these interaction environments on an SCIR search.

Figure 3 presents a conceptual overview of the SCIR system we will simulate in our evaluations. Referring to Figure 3, the data required to populate our SCIR simulations is:

• An initial query (Q) – as outlined above, the simulated SCIR session begins with an initial query entered by the set of two users.
• Series of relevance judgments (RJ) – these are explicit indications of relevance made by a user on a particular document.
• Timing information – this represents the time, in seconds, relative to the start of the search session at which relevance judgments were made. This timing information is used to order relevance judgments in an SCIR simulation and allows us to model SCIR sessions in which collaborating searchers are providing relevance information at distinct times and at different rates in the process.

Methodology

In order to populate these simulations we have mined data from previous TREC interactive search experiments. The Text REtrieval Conference (TREC) is an annual workshop established in 1992 under the auspices of the National Institute of Standards and Technology (NIST) in an attempt to promote research into IR. For each TREC, a set of tasks or tracks are devised, each evaluating different aspects of the retrieval process. The purpose of the TREC interactive task is for a searcher to locate documents of relevance to a stated information need (a search "topic") using a search engine and to save them. For the interactive track at TREC 6 – TREC 8 (1997-1999), each participating group that submitted results for evaluation was required to also include rich format data with their submission. This data consisted of transcripts of a searcher's significant events during a search, such as their initial query, the documents saved (i.e. relevance judgments), and their timing information. For our simulations we extracted data from several groups' submissions to TREC 6 – TREC 8.
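To make the shape of this mined data concrete, the following is one possible (assumed) encoding of a simulation's inputs. The topic and group query come from the Figure 4 example discussed below; the document identifiers and times are illustrative rather than actual rich format records:

```python
# Initial query, per-user relevance judgments and their timings
# (seconds from the start of the session).
simulation = {
    "topic": "Hubble Telescope Achievements",
    "initial_query": "positive achievements hubble telescope data",
    "users": [
        {"judgments": [(62, "FT921-7107"), (91, "FT924-286")]},
        {"judgments": [(49, "FT944-128"), (86, "FT924-286")]},
    ],
}

# The timing information orders all judgments into one simulated session.
events = sorted((t, u, doc)
                for u, user in enumerate(simulation["users"])
                for t, doc in user["judgments"])
```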

Figure 3. A simulated SCIR session

(The figure shows a simulated session: the two users issue a shared query Q to the SCIR system and each then provides relevance judgments (RJ) at different times, receiving an updated result list after each judgment.)


Unfortunately, it was not possible to extract data for our simulations from all participating groups' submissions to these TRECs. This was due to a variety of reasons, but typically some groups had either failed to submit rich format data or the data they submitted was not complete as it lacked some of the data needed to build our simulations. In our simulations we used this extracted data to simulate two users, who originally searched the topic separately as part of their group's submission, searching together with an SCIR system.

Figure 4 shows a conceptual overview of a simulated SCIR session involving two users whose rich format data was extracted from TREC data, searching on a TREC topic entitled "Hubble Telescope Achievements". From Figure 4, we can see that the search begins with the group query "positive achievements hubble telescope data", which is the concatenation of both users' original queries. By the time user 1 provides their first relevance judgment on document FT921-7107, user 2 has already provided a relevance judgment, on document FT944-128. By the time user 2 makes their second relevance judgment on document FT924-286, user 1 has made their first relevance judgment on FT921-7107.

By extracting rich format data associated with different users' interactions on a search topic, we can acquire multiple heterogeneous simulations, where the data populating our simulations is from real users searching to satisfy the same information need on a standardised corpus. There were a total of 20 search topics used in these TREC workshops, with varying degrees of difficulty, and therefore our simulations were evaluated across these 20 topics.

Relevance Judgments

The SCIR simulations proposed thus far are based on taking static rich format data, which records a user's previous interactions with a particular search engine, and imposing our SCIR simulated environment on this data. By imposing our own simulated environment on this rich format data, we cannot assume that users would have saved the same documents as they did during their original search, as recorded in the rich format data. Before we can proceed with our simulations we need to replace these static relevance judgments with relevance judgments based on the ranked lists that simulated users are presented with. Although in any simulation we can never predict, with absolute certainty, the actions of a user, in the simulations used in our experiments it is important that the relevance judgments are a reasonable approximation of real user behaviour.

Figure 4. Conceptual overview of two searchers searching together

(The figure shows the two users' timelines for this session: the shared query "positive achievements hubble telescope data" is issued to the SCIR system, after which each user provides relevance judgments on Financial Times documents at different points in time.)


Our solution is to simulate the user providing a relevance judgment on the first relevant document, i.e. highest ranked, on their current ranked list, where the relevance of the documents is judged according to the TREC relevance assessments for the topic ("qrels").

Although we can never be fully certain that a user will always save the first relevant document that they encounter on a ranked list (i.e. rather than the second or third), recent studies have shown that users tend to examine search results from top to bottom, "deciding to click each result before moving to the next" (Craswell et al., 2008). Therefore we believe that this approximation of a real user's action is reasonable.

Before finalising our simulations, we also need to enforce an upper limit on the number of documents a user will examine in order to locate a document on which to provide a relevance judgment. For example, it would be unreasonable to assume that a user would look down as far as rank 900 in the ranked list in order to find a relevant document. Instead, we limit the number of documents that a simulated user will examine to the top 30 documents in their ranked list. Although in a real world system users may be willing to examine more or fewer documents according to the device they are using for searching, we feel that 30 is a reasonable figure to assume users will examine in any SCIR search.

After performing relevance feedback, the relevance judgments made by a user are never returned to them again for the duration of that search. As our baseline SCIR system will implement a division of labour policy, we also remove these seen documents from their search partner's subsequent list. Furthermore, as we assume that users will examine a maximum of 30 documents, our baseline system will also remove these documents from their search partner's ranked list.
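The simulated judgment behaviour described above amounts to a short rule, sketched here with assumed names (the qrels are simply the set of documents TREC assessors judged relevant for the topic):

```python
def simulate_judgment(ranked_list, qrels, depth=30):
    # Scan the list top-down and save the first (highest ranked)
    # relevant document within the top `depth` positions.
    for doc_id in ranked_list[:depth]:
        if doc_id in qrels:
            return doc_id
    return None  # no relevant document in the examined portion
```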
Evaluation Metric

At each stage in an SCIR session, each collaborating user will have associated with them a ranked list of documents. This list could have been returned to the user either as a result of the initial query or after performing relevance feedback. In traditional, single-user IR the accuracy of each individual searcher's ranked list can be evaluated using standard IR measurements such as average precision (AP), a measure which favours systems that rank relevant documents higher in a returned ranked list of documents. In our work we are concerned with the performance of a group of users, and therefore we need to be able to assign a score to the collaborating group at any particular point in the search process. What we need is a measure which captures the quality and diversity across collaborating users' ranked lists. Our solution is to calculate the total number of unique relevant documents across users' ranked lists at a certain cut-off and use this figure as our group score. In our simulations, as we assume that users will examine the top 30 documents in the ranked list, our measure of quality is taken at a cut-off of 30 documents from each user's list (i.e. a total of 60 documents). This performance measure will enable us to capture both the quality and diversity across collaborating users' ranked lists, and in particular the parts of the lists that they will examine.

As described earlier, in our simulations, before returning a new list to a searcher, all relevance judgments made by the searcher are removed. For the purposes of calculating the group score we also include these saved documents in the calculation.
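A sketch of the group score as defined above: the number of unique relevant documents across the top 30 of each user's current list, with each user's already-saved documents added back in. The variable names are assumptions made for the example:

```python
def group_score(current_lists, saved_lists, qrels, depth=30):
    # Pool the examined portion of every user's list together with the
    # documents they have already saved, then count unique relevant ones.
    pool = set()
    for current, saved in zip(current_lists, saved_lists):
        pool.update(current[:depth])
        pool.update(saved)
    return len(pool & set(qrels))
```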


Measurement Granularity

In our simulations of SCIR search, a user is presented with a new ranked list of documents each time they make a relevance judgment. Taking a measurement of the group performance after each of these events allows us to capture the change in group performance over the course of a search.

Figure 5 illustrates the procedure followed to calculate the performance of a group over the duration of an SCIR simulation involving two collaborating users. The SCIR simulation begins with a shared query; at this point we measure the total number of relevant documents in the top 60 positions of this list (top 30 for each user). As this figure represents the initial group score before any relevance feedback is provided to the system, it is plotted at position 0 on the x-axis of the graph at the bottom of Figure 5. The first relevance feedback iteration is initiated after user 2 provides a relevance judgment after 63 seconds. At this point, user 2's current list is updated as a result of a feedback iteration; however, user 1 is still viewing the results of the initial query. We calculate the group score at this point by counting the total number of unique relevant documents across these two ranked lists (labelled "GS"), including the one relevance judgment made. As this is the first relevance feedback iteration in the SCIR session, this group score is plotted at position 1 on the graph. The measurement proceeds in this manner, by calculating the number of unique relevant documents across each user's current ranked list after each relevance judgment; these figures are plotted after each relevance feedback iteration in order to show the group's performance over the course of the entire search.

Figure 5. Measurement granularity used in experiments

(The figure shows the two users' relevance judgment timelines above a graph plotting the group score, the number of unique relevant documents across the users' lists, against the number of relevance judgments made.)


From this graph, we can also generate a single performance figure for the entire search, by averaging the group score after the initial query and each subsequent feedback iteration.

Significance Testing

We will use this single performance figure in order to test for significance in our results. In this chapter we use randomisation testing (Kempthorne and Doerfler, 1969), a non-parametric significance test, to test for statistical significance and use a significance threshold of p < 0.05. All results with p values less than this threshold are considered as significant. These tests will allow us to understand whether observed differences in performance are due to chance or point to real differences in system performance. Due to the lack of assumptions made by the randomisation test, it can be applied to data whose underlying distribution would not satisfy the conditions for a parametric test. Furthermore, even when the conditions for parametric tests are justified, the randomisation test has been shown to be of similar power to these tests.

In this section we outlined our proposed evaluation methodology. We proposed a novel method by which a group of users can be simulated searching together. We also proposed techniques for measuring the performance of a group of searchers at any point in the search. In the next section we will present the results from our evaluation.
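One common way of implementing such a paired randomisation test over the 20 per-topic performance figures is sketched below; the exact procedure used in the experiments may differ in details such as the number of permutations:

```python
import random

def randomisation_test(scores_a, scores_b, trials=10000, seed=0):
    # Two-sided paired randomisation test: under the null hypothesis the
    # labels of each (a, b) pair are exchangeable, so we randomly swap
    # them and count how often the absolute mean difference is at least
    # as large as the one actually observed.
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(trials):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) / len(diffs) >= observed:
            hits += 1
    return hits / trials  # estimated p-value
```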
Results

In this section we present the results from our evaluations of each of the collaborative relevance feedback techniques (Type A, B and C), described in the section "Collaborative Relevance Feedback". We also evaluate the performance of a standard single-user relevance feedback mechanism, which assumes that all relevance judgments made in the search were made by one pseudo-user. Through these experiments we will investigate if passing relevance information across searchers in the feedback process can improve an SCIR search over a baseline SCIR system that implements just a division of labour policy. For both the partial-user (A) and combined weighting (B) techniques, we will run two variations of the technique: one which allows a user who has not encountered a term in their relevance judgments to contribute to its relevance and offer weight (Contr), and another which only considers the weighting for a term from the users that have encountered it (No Contr).

As outlined in the previous section, an important consideration for any SCIR system is when to provide users with the relevance information from their search partner. In these experiments, we evaluated both a dynamic, intensive environment, where users' ranked lists are updated when either they or their search partner make relevance judgments, alongside a more static environment, where users' ranked lists are only updated after they make judgments themselves.

Figure 6 plots the performance of all combination techniques for both the static SCIR environment and the dynamic SCIR environment, along with the baseline system of an SCIR system implementing just a division of labour policy (SCIR + Full Div). The graph at the top shows the performance of the techniques over the entire search, while the graph at the bottom shows the performance of systems over the first few iterations only. Table 3.3 shows the single performance figure across all topics for all runs. When examining Figure 6, it should be noted that as these values are computed across a number of simulated runs with differing numbers of relevance judgments, values at later iterations (i.e. > 11 on the x-axis) may not be as representative of an overall trend as values for earlier iterations. For example, the sudden drop-off at around 35 relevance judgments is due to this value being calculated based on only one or two simulated runs.

As we can see from Table 3.3, all collaborative relevance feedback techniques, except the document fusion technique (C), provide small


improvements in performance over the SCIR + Full Div system, with the best performing system, the dynamic partial user (A) no contr technique, providing a 1.5% improvement. Running significance testing over the single performance figures, however, reveals no significant difference between any SCIR system implementing a collaborative relevance feedback process, for either feedback environment (i.e. static or dynamic), and the SCIR system with no combination of relevance information (SCIR + Full Div). When we relax the significance threshold, we find in the static environment that the combined weighting (contr) (B) method and the pseudo user method outperform the SCIR + Full Div system at significance values of p = 0.165 and p = 0.186 respectively, while in the dynamic environment the partial user (no contr) (A) and combined weighting (B) techniques outperform the SCIR + Full Div system at significance values of p = 0.186 and p = 0.169 respectively.

Comparing the performance across collaborative RF techniques from Figure 6 and Table 3.3, it does appear that the document fusion technique, for both the static and dynamic environments, does not perform as well as the term-based techniques. Significance tests reveal that the dynamic collaborative RF techniques of pseudo user, partial-user (A), and combined weighting (B) all significantly outperform the dynamic document fusion technique (C). However, no difference could be found at the significance threshold between any static term-based technique and the static document fusion technique. When we relax our significance threshold to a threshold of p < 0.1, we do find that the techniques of pseudo user, partial-user, and combined weighting perform better than the static document fusion technique. Although not strictly significant according to our threshold, these p values suggest that the results are unlikely due to chance.

Comparing the overall performance of the contribution versus no contribution techniques for both partial-user (A) and combined weighting (B), we find no significant difference.
Examining the bottom graph in Figure 6, we can see that the combination of relevance information techniques do provide a more substantive increase in performance over the SCIR system with just a division of labour for the first few iterations. Our significance tests confirm that for iterations 2 - 5 all collaborative RF techniques, for both static and dynamic environments, significantly outperform the SCIR system with full division, with the best performing system, the dynamic combined weighting no contr technique (B), providing a 4.8% improvement over the SCIR + Full Div system for these iterations.

Next we compare the performance of static versus dynamic feedback environments. From Figure 6 and Table 3.3, it appears that the SCIR systems operating in a dynamic feedback environment provide a modest increase in performance over their static counterparts. Our significance tests reveal that only the partial-user technique (A) shows any significant difference between the running of the technique in static versus dynamic mode, and no significant improvement could be found between any dynamic collaborative relevance feedback technique and the static combined weighting technique.

Discussion

We explored the effects of a collaborative relevance feedback technique operating alongside an explicit division of labour policy in a synchronous collaborative information retrieval system. We hypothesised that by incorporating each user's relevance information into a collaborative relevance feedback mechanism, the ranked lists returned to the searchers could be improved.

Firstly, comparing the performance of the collaborative relevance feedback techniques, we found that the term-based techniques of pseudo-user, partial user and combined weighting all outperform the document fusion technique. However, no significant difference could be found at a significance threshold of p < 0.05.


Figure 6. Comparison of collaborative relevance feedback techniques and baseline division of labour system across all topics

(Top graph: number of relevant documents against number of relevance judgments over the entire search; bottom graph: the same measure over the first few iterations only. Runs plotted: SCIR + Full Div, and the Static and Dynamic variants of Pseudo User, Partial User Contr, Partial User No Contr, Combined Contr, Combined No Contr, and Document Fusion.)


Table 3.3. Single performance figure comparison of collaborative relevance feedback techniques and baseline division of labour system across all topics


When we relax the threshold we do find that all term-based techniques outperform the document fusion technique. These results suggest that for both the dynamic and static environments a term-based technique can outperform the document-based fusion technique.

Over the entire search, no significant differences could be found across term-based techniques. In particular, the collaborative techniques of partial user and combined weighting perform similarly to the standard single user (pseudo-user) method.

Our results show small improvements can be made over some techniques by implementing an intensive, dynamic environment. However, due to the slenderness of these differences and the fact that not all static techniques can be significantly improved upon, it may not be worthwhile implementing such a policy, given the discussed difficulties that such an environment presents for both the system designer and the user.

Our results show that over the entire search, the collaborative relevance feedback techniques do provide modest increases in performance over the SCIR system with just a division of labour. Although at the significance threshold of p < 0.05 no significance can be found, improvements could be found at relaxed significance thresholds. However, when we examine the performance of the group over the first few iterations of feedback, we do find that all the collaborative relevance feedback techniques, in both the static and dynamic environments, significantly improve the performance over the SCIR + Full Div system. This result is interesting, and suggests that although users may benefit from gaining relevance judgments from their search partner early in the search, after a number of iterations this benefit is reduced.

We believe that the collaborative relevance feedback process of aggregating relevance information is causing users' ranked lists to become too similar. Although, by implementing the division of labour policy, we are ensuring unique documents across the top 30 ranked positions for both users, we feel that the aggregation may be causing a loss of uniqueness across users' lists.

Figure 7. Comparisons of the total amount of unique documents across users’ ranked lists for SCIR system with collaborative RF and without, across all topics

(The graph plots the proportion of unique documents against the number of relevance judgments for the SCIR + Full Div system and the Pseudo User, Partial User, Combined and Document Fusion techniques.)


In order to investigate this hypothesis, in Figure 7 we plot the proportion of unique documents across the top 1000 documents of each user's ranked lists for the SCIR system with full division only and all static collaborative relevance feedback techniques.

As we can see, there is a clear difference in the total number of unique relevant documents across users between the collaborative relevance feedback systems and the SCIR + Full Div system. This difference is significant for all techniques and across all topics. This result confirms our hypothesis that the collaborative relevance feedback process is causing users' ranked lists to become too similar. This finding is intuitive: one of the great advantages of having multiple users tackle a search task is that the task can be divided. By ensuring a complete division of labour, in the SCIR + Full Division system, we are allowing users to make unique relevance judgments; however, by implementing a collaborative relevance feedback process in such an environment, where the relevance feedback process for a user uses the relevance information of their search partner, we are causing users to lose this uniqueness. Interestingly, however, the gap is less substantial between the SCIR system with full division and the SCIR system implementing document fusion. The fact that the document fusion technique provides substantially more unique documents than all the term-based techniques suggests that the term-based techniques are causing the selection of similar terms for expansion between users. The document fusion technique, allowing for a later stage of combination, does not suffer from this problem; however, as our results in this section have shown, this does not lead to this technique outperforming the others in terms of discovering relevant unique documents. This does not, of course, mean that the introduction of unique documents degrades performance, but that the collaborative relevance feedback mechanism needs to strive to allow for the introduction of more unique relevant documents.

FUTURE TRENDS

Our results have shown that the proposed, term-based, collaborative relevance feedback techniques perform similarly to the standard single-user relevance feedback formula. This result is not surprising, as all techniques are attempting to aggregate each user's relevance information. The advantage of the collaborative techniques over the standard single user technique is that they can allow for a user-biased combination (by changing the α value associated with each user), but in these experiments this α was set to 0.5 to consider all users equally. In a real-world system, an SCIR system could exploit this α value in order to allow users to bias the relevance feedback process in favour of their own relevance judgments or in favour of their search partner. Alternatively, an SCIR system could use this α value to bias the collaborative relevance feedback process in favour of expert searchers through an authority weighting mechanism.

Our results have shown how a collaborative relevance feedback mechanism causes a loss of uniqueness across collaborating users' ranked lists. An alternative way of using multiple users' relevance judgments in an SCIR search is to implement a Complementary Relevance Feedback technique. Whereas the collaborative relevance feedback techniques discussed in this chapter attempt to make users' queries more similar, a complementary technique would make them more distinct.
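As a small illustration of the kind of authority weighting mentioned above (the scoring scheme is an assumption, not something evaluated in this chapter), the α values could be derived from per-user authority scores and normalised to sum to 1, as the combination formulas require:

```python
def authority_alphas(authority_scores):
    # Normalise per-user authority scores into alpha values summing to 1.
    total = sum(authority_scores)
    return [score / total for score in authority_scores]

# An "expert" searcher gets three times the influence of a novice partner.
alphas = authority_alphas([3.0, 1.0])   # -> [0.75, 0.25]
```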
CONCLUSION

Synchronous collaborative information retrieval refers to an explicit and active form of collaboration whereby users are collaborating in realtime to satisfy a shared information need. The benefits that such a system can provide over users searching independently are that it can enable both a division of labour and a sharing of knowledge


across collaborating searchers. Although there has been some work to date into system-mediated division of labour, there has been no work which has investigated how a system-mediated sharing of knowledge can be realised in an adhoc search, which can be either remote or co-located, despite the fact that SCIR systems in the literature have allowed users to make explicit relevance judgments in the form of bookmarks.

In this chapter, we have outlined how to make use of these bookmarks in order to benefit the group. We have proposed several techniques by which the standard relevance feedback formula can be extended to allow for relevance information from multiple users to be combined in the process. We have also outlined a novel evaluation methodology by which a group of users, who had previously searched for a search topic independently as part of a TREC interactive track submission, were simulated searching together.

Our results have shown that over an entire SCIR search, the passing of relevance information between users in an SCIR search does not improve the performance of the group. However, over the first few iterations of feedback only, the combination of relevance information does provide a significant improvement. This is an interesting result and encourages us and others to pursue further work in this area.

REFERENCES

Aalbersberg, I. J. (1992). Incremental relevance feedback. In SIGIR '92: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 11–22, Copenhagen, Denmark. ACM Press.

Adcock, J., Pickens, J., Cooper, M., Anthony, L., Chen, F., & Qvarfordt, P. (2007). FXPAL interactive search experiments for TRECVID 2007. In TRECVid 2007 - Text REtrieval Conference TRECVID Workshop, Gaithersburg, MD, USA.

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison Wesley.

Blackwell, A. F., Stringer, M., Toye, E. F., & Rode, J. A. (2004). Tangible interface for collaborative information retrieval. In CHI '04: Extended Abstracts on Human Factors in Computing Systems, pages 1473–1476, Vienna, Austria. ACM Press.

Cabri, G., Leonardi, L., & Zambonelli, F. (1999). Supporting cooperative WWW browsing: A proxy-based approach. In 7th Euromicro Workshop on Parallel and Distributed Processing, pages 138–145, University of Madeira, Funchal, Portugal. IEEE Press.

Craswell, N., Zoeter, O., Taylor, M., & Ramsey, B. (2008). An experimental comparison of click position-bias models. In Proceedings of WSDM 2008, pages 87–94, Palo Alto, CA, USA. ACM Press.

Croft, W. B. (2002). Combining approaches to information retrieval. In Advances in Information Retrieval, volume 7 of The Information Retrieval Series, pages 1–36. Springer, Netherlands.

Dalal, M. (2007). Personalized social & real-time collaborative search. In WWW '07: Proceedings of the 16th International Conference on World Wide Web, pages 1285–1286, Banff, Alberta, Canada. ACM Press.

Diamadis, E. T., & Polyzos, G. C. (2004). Efficient cooperative searching on the web: system design and evaluation. International Journal of Human-Computer Studies, 61(5), 699–724.

Dietz, P., & Leigh, D. (2001). DiamondTouch: A multi-user touch technology. In UIST '01: Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology, pages 219–226, Orlando, Florida, USA. ACM Press.

Donath, J. S., & Robertson, N. (1994). The Sociable Web. In Proceedings of the Second International WWW Conference, Chicago, IL, USA.




