WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA

The Anatomy of a Large-Scale Social Search Engine

Damon Horowitz Sepandar D. Kamvar Stanford University [email protected] [email protected]

ABSTRACT The differences how people find information in a library We present Aardvark, a social search engine. With Aard- versus a village suggest some useful principles for designing vark, users ask a question, either by instant message, email, a social search engine. In a library, people use keywords to web input, text message, or voice. Aardvark then routes the search, the knowledge base is created by a small number of question to the person in the user’s extended social network content publishers before the questions are asked, and trust most likely to be able to answer that question. As compared is based on authority. In a village, by contrast, people use to a traditional web search engine, where the challenge lies natural language to ask questions, answers are generated in in finding the right document to satisfy a user’s information real-time by anyone in the community, and trust is based need, the challenge in a social search engine like Aardvark on intimacy. These properties have cascading effects — for lies in finding the right person to satisfy a user’s information example, real-time responses from socially proximal respon- need. Further, while trust in a traditional search engine is ders tend to elicit (and work well for) highly contextualized based on authority, in a social search engine like Aardvark, and subjective queries. For example, the query“Do you have trust is based on intimacy. We describe how these considera- any good babysitter recommendations in Palo Alto for my tions inform the architecture, algorithms, and user interface 6-year-old twins? I’m looking for somebody that won’t let of Aardvark, and how they are reflected in the behavior of them watch TV.” is better answered by a friend than the Aardvark users. library. These differences in information retrieval paradigm require that a social search engine have very different archi- tecture, algorithms, and user interfaces than a search engine Categories and Subject Descriptors based on the library paradigm. H.3.3 [Information Systems]: Information Storage and The fact that the library and the village paradigms of Retrieval knowledge acquisition complement one another nicely in the offline world suggests a broad opportunity on the web for social information retrieval. General Terms Algorithms, Design 1.2 Aardvark In this paper, we present Aardvark, a social search engine based on the village paradigm. We describe in detail the ar- Keywords chitecture, ranking algorithms, and user interfaces in Aard- Social Search vark, and the design considerations that motivated them. We believe this to be useful to the research community for two reasons. First, the argument made in the original 1. INTRODUCTION Anatomy paper [4] still holds true — since most search en- gine development is done in industry rather than academia, 1.1 The Library and the Village the research literature describing end-to-end search engine Traditionally, the basic paradigm in information retrieval architecture is sparse. Second, the shift in paradigm opens has been the library. Indeed, the field of IR has roots in up a number of interesting research questions in informa- the library sciences, and itself came out of the Stan- tion retrieval, for example around expertise classification, ford Digital Library project [18]. While this paradigm has implicit network construction, and conversation design. clearly worked well in several contexts, it ignores another Following the architecture description, we present a statis- age-old model for knowledge acquisition, which we shall call tical analysis of usage patterns in Aardvark. We find that, as “the village paradigm”. In a village, knowledge dissemina- compared to traditional search, Aardvark queries tend to be tion is achieved socially — information is passed from per- long, highly contextualized and subjective — in short, they son to person, and the retrieval task consists of finding the tend to be the types of queries that are not well-serviced by right person, rather than the right document, to answer your traditional search engines. We also find that the vast ma- question. jority of questions get answered promptly and satisfactorily, and that users are surprisingly active, both in asking and Copyright is held by the International World Wide Web Conference Com- mittee (IW3C2). Distribution of these papers is limited to classroom use, answering. and personal use by others. Finally, we present example results from the current Aard- WWW 2010, April 26–30, 2010, Raleigh, North Carolina, USA. vark system, and a comparative evaluation experiment. What ACM 978-1-60558-799-8/10/04.

431 WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA we find is that Aardvark performs very well on queries that Question Analyzer deal with opinion, advice, experience, or recommendations, classifiers while traditional corpus-based search engines remain a good Gateways

IM T

choice for queries that are factual or navigational. ransport Email Layer Msgs Conversation Manager RSR Twitter Routing iPhone Engine 2. OVERVIEW Web {RS}* SMS 2.1 Main Components The main components of Aardvark are: Index Importers Database 1. Crawler and Indexer. To find and label resources that Web Content Users Facebook contain information — in this case, users, not docu- Graph LinkedIn Topics ments (Sections 3.2 and 3.3). Etc. Content 2. Query Analyzer. To understand the user’s information need (Section 3.4). Figure 1: Schematic of the architecture of Aardvark 3. Ranking Function. To select the best resources to pro- vide the information (Section 3.5). more discussion); and finally, Aardvark observes the user’s 4. UI. To present the information to the user in an acces- behavior on Aardvark, in answering (or electing not to an- sible and interactive form (Section 3.6). swer) questions about particular topics. The set of topics associated with a user is recorded in Most corpus-based search engines have similar key com- the Forward Index, which stores each userId, a scored list of ponents with similar aims [4], but the means of achieving topics, and a series of further scores about a user’s behavior those aims are quite different. (e.g., responsiveness or answer quality). From the Forward Before discussing the anatomy of Aardvark in depth, it is Index, Aardvark constructs an Inverted Index. The Inverted useful to describe what happens behind the scenes when a Index stores each topicId and a scored list of userIds that new user joins Aardvark and when a user asks a question. have expertise in that topic. In addition to topics, the In- verted Index stores scored lists of userIds for features like 2.2 The Initiation of a User answer quality and response time. When a new user first joins Aardvark, the Aardvark sys- Once the Inverted Index and Social Graph for a user are tem performs a number of indexing steps in order to be able created, the user is now active on the system and ready to to direct the appropriate questions to her for answering. ask her first question. Because questions in Aardvark are routed to the user’s extended network, the first step involves indexing friendship 2.3 The Life of a Query and affiliation information. The data structure responsible A user begins by asking a question, most commonly through for this is the Social Graph. Aardvark’s aim is not to build instant message (on the desktop) or text message (on the a social network, but rather to allow people to make use of mobile phone). The question gets sent from the input de- their existing social networks. As such, in the sign-up pro- vice to the Transport Layer, where it is normalized to a cess, a new user has the option of connecting to a social net- Message data structure, and sent to the Conversation Man- work such as Facebook or LinkedIn, importing their contact ager. Once the Conversation Manager determines that the lists from a webmail program, or manually inviting friends message is a question, it sends the question to the Question to join. Additionally, anybody whom the user invites to Analyzer to determine the appropriate topics for the ques- join Aardvark is appended to their Social Graph – and such tion. The Conversation Manager informs the asker which invitations are a major source of new users. Finally, Aard- primary topic was determined for the question, and gives vark users are connected through common “groups” which the asker the opportunity to edit it. It simultaneously issues reflect real-world affiliations they have, such as the schools a Routing Suggestion Request to the Routing Engine. The they have attended and the companies they have worked routing engine plays a role analogous to the ranking function at; these groups can be imported automatically from social in a corpus-based search engine. It accesses the Inverted In- networks, or manually created by users. Aardvark indexes dex and Social Graph for a list of candidate answerers, and this information and stores it in the Social Graph, which is ranks them to reflect a combination of how well it believes a fixed width ISAM index sorted by userId. they can answer the question and how good of a match they Simultaneously, Aardvark indexes the topics about which are for the asker. The Routing Engine returns a ranked list the new user has some level of knowledge or experience. This of Routing Suggestions to the Conversation Manager, which topical expertise can be garnered from several sources: a then contacts the potential answerers — one by one, or a user can indicate topics in which he believes himself to have few at a time, depending upon a Routing Policy — and asks expertise; a user’s friends can indicate which topics they them if they would like to answer the question, until a sat- trust the user’s opinions about; a user can specify an existing isfactory answer is found. The Conversation Manager then structured profile page from which the Topic Parser parses forwards this answer along to the asker, and allows the asker additional topics; a user can specify an account on which and answerer to exchange followup if they choose. they regularly post status updates (e.g., Twitter or Face- book), from which the Topic Extractor extracts topics (from unstructured text) in an ongoing basis (see Section 3.3 for

432 WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA

3. ANATOMY ments, so the methods for acquiring and expanding a com- prehensive knowledge base are quite different. 3.1 The Model With Aardvark, the more active users there are, the more The core of Aardvark is a statistical model for routing potential answerers there are, and therefore the more com- questions to potential answerers. We use a network variant prehensive the coverage. More importantly, because Aard- of what has been called an aspect model [12], that has two vark looks for answerers primarily within a user’s extended primary features. First, it associates an unobserved class social network, the denser the network, the larger the effec- variable t ∈ T with each observation (i.e., the successful tive knowledge base. This suggests that the strategy for increasing the knowl- answer of question q by user ui). In other words, the proba- bility p(u |q) that user i will successfully answer question q edge base of Aardvark crucially involves creating a good ex- i perience for users so that they remain active and are inclined depends on whether q is about the topics t in which ui has expertise: to invite their friends. An extended discussion of this is out- side of the scope of this paper; we mention it here only to X p(ui|q) = p(ui|t)p(t|q) (1) emphasize the difference in the nature of “crawling” in social t∈T search versus traditional search. Given a set of active users on Aardvark, the effective The second main feature of the model is that it defines breadth of the Aardvark knowledge base depends upon de- a query-independent probability of success for each poten- signing interfaces and algorithms that can collect and learn tial asker/answerer pair (u , u ), based upon their degree of i j an extended topic list for each user over time, as discussed social connectedness and profile similarity. In other words, in the next section. we define a probability p(ui|uj ) that user ui will deliver a satisfying answer to user uj , regardless of the question. 3.3 Indexing People We then define the scoring function s(ui, uj , q) as the com- The central technical challenge in Aardvark is selecting position of the two probabilities. the right user to answer a given question from another user. X In order to do this, the two main things Aardvark needs s(ui, uj , q) = p(ui|uj ) · p(ui|q) = p(ui|uj ) p(ui|t)p(t|q) to learn about each user ui are: (1) the topics t he might t∈T (2) be able to answer questions about psmoothed(t|ui); (2) the Our goal in the ranking problem is: given a question q users uj to whom he is connected p(ui|uj ). Topics. Aardvark computes the distribution p(t|ui) of from user uj , return a ranked list of users ui ∈ U that max- topics known by user ui from the following sources of infor- imizes s(ui, uj , q). Note that the scoring function is composed of a query- mation: dependent relevance score p(ui|q) and a query-independent • Users are prompted to provide at least three topics quality score p(ui|uj ). This bears similarity to the ranking which they believe they have expertise about. functions of traditional corpus-based search engines such as Google [4]. The difference is that unlike quality scores like • Friends of a user (and the person who invited a user) PageRank [18], Aardvark’s quality score aims to measure are encouraged to provide a few topics that they trust intimacy rather than authority. And unlike the relevance the user’s opinion about. scores in corpus-based search engines, Aardvark’s relevance • Aardvark parses out topics from users’ existing online score aims to measure a user’s potential to answer a query, profiles (e.g., Facebook profile pages, if provided). For rather than a document’s existing capability to answer a such pages with a known structure, a simple Topic query. Parsing algorithm uses regular expressions which were Computationally, this scoring function has a number of manually devised for specific fields in the pages, based advantages. It allows real-time routing because it pushes upon their performance on test data. much of the computation offline. The only component prob- ability that needs to be computed at query time is p(t|q). • Aardvark automatically extracts topics from unstruc- Computing p(t|q) is equivalent to assigning topics to a ques- tured text on users’ existing online homepages or blogs tion — in Aardvark we do this by running a probabilistic if provided. For unstructured text, a linear SVM iden- classifier on the question at query time (see Section 3.4). The tifies the general subject area of the text, while an distribution p(ui|t) assigns users to topics, and the distribu- ad-hoc named entity extractor is run to extract more tion p(ui|uj ) defines the Aardvark Social Graph. Both of specific topics, scaled by a variant tf-idf score. these are computed by the Indexer at signup time, and then updated continuously in the background as users answer • Aardvark automatically extracts topics from users’ sta- questions and get feedback (see Section 3.3). The compo- tus message updates (e.g., Twitter messages, Facebook nent multiplications and sorting are also done at query time, news feed items, IM status messages, etc.) and from but these are easily parallelizable, as the index is sharded by the messages they send to other users on Aardvark. user. The motivation for using these latter sources of profile topic information is a simple one: if you want to be able 3.2 Social Crawling to predict what kind of content a user will generate (i.e., A comprehensive knowledge base is important for search p(t|ui)), first examine the content they have generated in engines as query distributions tend to have a long tail [13]. the past. In this spirit, Aardvark uses web content not as a In corpus-based search engines, this is achieved by large- source of existing answers about a topic, but rather, as an scale crawlers and thoughtful crawl policies. In Aardvark, indicator of the topics about which a user is likely able to the knowledge base consists of people rather than docu- give new answers on demand.

433 WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA

In essence, this involves modeling a user as a content- Connection strengths between people are computed using a generator, with probabilities indicating the likelihood she weighted cosine similarity over this feature set, normalized P will likely respond to questions about given topics. Each so that p(ui|uj ) = 1, and stored in the Social Graph ui∈U topic in a user profile has an associated score, depending for quick access at query time. upon the confidence appropriate to the source of the topic. Both the distributions p(ui|uj ) in the Social Graph and In addition, Aardvark learns over time which topics not to p(t|ui) in the Inverted Index are continuously updated as send a user questions about by keeping track of cases when users interact with one another on Aardvark. the user: (1) explicitly “mutes” a topic; (2) declines to an- swer questions about a topic when given the opportunity; 3.4 Analyzing Questions (3) receives negative feedback on his answer about the topic The purpose of the Question Analyzer is to determine a from another user. scored list of topics p(t|q) for each question q representing Periodically, Aardvark will run a topic strengthening algo- the semantic subject matter of the question. This is the only rithm, the essential idea of which is: if a user has expertise probability distribution in equation 2 that is computed at in a topic and most of his friends also have some exper- query time. tise in that topic, we have more confidence in that user’s It is important to note that in a social search system, the level of expertise than if he were alone in his group with requirement for a Question Analyzer is only to be able to knowledge in that area. Mathematically, for some user ui, understand the query sufficiently for routing it to a likely his group of friends U, and some topic t, if p(t|ui) 6= 0, answerer. This is a considerably simpler task than the chal- P then s(t|ui) = p(t|ui) + γ u∈U p(t|u), where γ is a small lenge facing an ideal web search engine, which must attempt constant. The s values are then renormalized to form prob- to determine exactly what piece of information the user is abilities. seeking (i.e., given that the searcher must translate her infor- Aardvark then runs two smoothing algorithms the pur- mation need into search keywords), and to evaluate whether pose of which are to record the possibility that the user a given web page contains that piece of information. By may be able to answer questions about additional topics not contrast, in a social search system, it is the human answerer explicitly recorded in her profile. The first uses basic col- who has the responsibility for determining the relevance of laborative filtering techniques on topics (i.e., based on users an answer to a question — and that is a function which hu- with similar topics), the second uses semantic similarity1. man intelligence is extremely well-suited to perform! The Once all of these bootstrap, extraction, and smoothing asker can express his information need in natural language, methods are applied, we have a list of topics and scores and the human answerer can simply use her natural un- for a given user. Normalizing these topic scores so that derstanding of the language of the question, of its tone of P t∈T p(t|ui) = 1, we have a probability distribution for voice, sense of urgency, sophistication or formality, and so topics known by user ui. Using Bayes’ Law, we compute forth, to determine what information is suitable to include for each topic and user: in a response. Thus, the role of the Question Analyzer in a social search system is simply to learn enough about the p(t|ui)p(ui) p(ui|t) = , (3) question that it may be sent to appropriately interested and p(t) knowledgeable human answerers. using a uniform distribution for p(ui) and observed topic As a first step, the following classifiers are run on each frequencies for p(t). Aardvark collects these probabilities question: A NonQuestionClassifier determines if the input p(ui|t) indexed by topic into the Inverted Index, which al- is not actually a question (e.g., is it a misdirected message, a lows for easy lookup when a question comes in. sequence of keywords, etc.); if so, the user is asked to submit Connections. Aardvark computes the connectedness be- a new question. An InappropriateQuestionClassifier deter- tween users p(ui|uj ) in a number of ways. While social mines if the input is obscene, commercial spam, or otherwise proximity is very important here, we also take into account inappropriate content for a public question-answering com- similarities in demographics and behavior. The factors con- munity; if so, the user is warned and asked to submit a new sidered here include: question. A TrivialQuestionClassifier determines if the in- put is a simple factual question which can be easily answered • Social connection (common friends and affiliations) by existing common services (e.g., “What time is it now?”, “What is the weather?”, etc.); if so, the user is offered an • Demographic similarity automatically generated answer resulting from traditional • Profile similarity (e.g., common favorite movies) web search. A LocationSensitiveClassifier determines if the input is a question which requires knowledge of a particular • Vocabulary match (e.g., IM shortcuts) location, usually in addition to specific topical knowledge (e.g., “What’s a great sushi restaurant in Austin, TX?”); if • Chattiness match (frequency of follow-up messages) so, the relevant location is determined and passed along to • Verbosity match (the average length of messages) the Routing Engine with the question. Next, the list of topics relevant to a question is produced • Politeness match (e.g., use of “Thanks!”) by merging the output of several distinct TopicMapper algo- rithms, each of which suggests its own scored list of topics: • Speed match (responsiveness to other users) • A KeywordMatchTopicMapper passes any terms in the 1In both the Person Indexing and the Question Analysis components, “semantic similarity” is computed by using an question which are string matches with user profile approximation of distributional similarity computed over topics through a classifier which is trained to deter- Wikipedia and other corpora; this serves as a proxy mea- mine whether a given match is likely to be semantically sure of the topics’ semantic relatedness. significant or misleading.

434 WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA

• A TaxonomyTopicMapper classifies the question text Availability: Third, the Routing Engine prioritizes can- into a taxonomy of roughly 3000 popular question top- didate answerers in such a way so as to optimize the chances ics, using an SVM trained on an annotated corpus of that the present question will be answered, while also pre- several millions questions. serving the available set of answerers (i.e., the quantity of “answering resource” in the system) as much as possible by • A SalientTermTopicMapper extracts salient phrases from spreading out the answering load across the user base. This the question — using a noun-phrase chunker and a tf- involves factors such as prioritizing users who are currently idf–based measure of importance — and finds seman- online (e.g., via IM presence data, iPhone usage, etc.), who tically similar user topics. are historically active at the present time-of-day, and who • A UserTagTopicMapper takes any user “tags” provided have not been contacted recently with a request to answer by the asker (or by any would-be answerers), and maps a question. these to semantically-similar user topics.2 Given this ordered list of candidate answerers, the Rout- ing Engine then filters out users who should not be con- At present, the output distributions of these classifiers tacted, according to Aardvark’s guidelines for preserving a are combined by weighted linear combination. It would be high-quality user experience. These filters operate largely interesting future work to explore other means of combin- as a set of rules: do not contact users who prefer to not be ing heterogeneous classifiers, such as the maximum entropy contacted at the present time of day; do not contact users model in [16]. who have recently been contacted as many times as their The Aardvark TopicMapper algorithms are continuously contact frequency settings permit; etc. evaluated by manual scoring on random samples of 1000 Since this is all done at query time, and the set of candi- questions. The topics used for selecting candidate answerers, date answerers can potentially be very large, it is useful to as well as a much larger list of possibly relevant topics, are note that this process is parallelizable. Each shard in the assigned scores by two human judges, with a third judge Index computes its own ranking for the users in that shard, adjudicating disagreements. For the current algorithms on and sends the top users to the Routing Engine. This is scal- the current sample of questions, this process yields overall able as the user base grows, since as more users are added, scores of 89% precision and 84% recall of relevant topics. more shards can be added. In other words, 9 out of 10 times, Aardvark will be able The list of candidate answerers who survive this filter- to route a question to someone with relevant topics in her ing process are returned to the Conversation Manager. The profile; and Aardvark will identify 5 out of every 6 possibly Conversation Manager then proceeds with opening channels relevant answerers for each question based upon their topics. to each of them, serially, inquiring whether they would like 3.5 The Aardvark Ranking Algorithm to answer the present question; and iterating until an answer is provided and returned to the asker. Ranking in Aardvark is done by the Routing Engine, which determines an ordered list of users (or “candidate answer- 3.6 User Interface ers”) who should be contacted to answer a question, given Since social search is modeled after the real-world process the asker of the question and the information about the ques- of asking questions to friends, the various user interfaces for tion derived by the Question Analyzer. The core ranking Aardvark are built on top of the existing communication function is described by equation 2; essentially, the Routing channels that people use to ask questions to their friends: Engine can be seen as computing equation 2 for all candidate IM, email, SMS, iPhone, Twitter, and Web-based messag- answerers, sorting, and doing some postprocessing. ing. Experiments were also done using actual voice input The main factors that determine this ranking of users are from phones, but this is not live in the current Aardvark Topic Expertise p(u |q), Connectedness p(u |u ), and Avail- i i j production system. ability: In its simplest form, the user interface for asking a ques- Topic Expertise: First, the Routing Engine finds the tion on Aardvark is any kind of text input mechanism, along subset of users who are semantic matches to the question: with a mechanism for displaying textual messages returned those users whose profile topics indicate expertise relevant to the topics which the question is about. Users whose profile topics are closer matches to the question’s topics are given higher rank. For questions which are location-sensitive (as defined above), only users with matching locations in their profiles are considered. Connectedness: Second, the Routing Engine scores each user according to the degree to which she herself — as a person, independently of her topical expertise — is a good “match” for the asker for this information query. The goal of this scoring is to optimize the degree to which the asker and the answerer feel kinship and trust, arising from their sense of connection and similarity, and meet each other’s expectations for conversational behavior in the interaction. 2A general principle in the design of Aardvark is to use hu- man intelligence wherever possible to improve the quality of the system. For the present task of Question Analysis, this involves giving both askers and answerers prompts for telling Aardvark directly what the subject of a question is. Figure 2: Screenshot of Aardvark over IM

435 WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA

aardvark: You there? I have a question about **tennis** that I think you might be able to answer. (Type 'sure', 'pass'‚ or 'busy'.) me: sure aardvark: (From Michael B./29/M/NewYork,US) Does anyone know which health clubs in Manhattan have tennis courts? Or are all the good options in the other boroughs? (Type 'pass' to skip, or 'more' for extra options.) me: why aardvark: I thought of you for Michael's question because - Michael is a friend of your friend Sally Johnson. - You know a lot about **racquet sports** and **fitness** - You and he have similar interests in **sports** (Type 'more' or change settings at http://vark.com/a/XXXXX)

me: Well there is always the Midtown Tennis Club on 8th ave @27th if you really want to stay in manhattan -- but the quality isn't great. You'd do just as well to use the public courts in Central Park. Or another good option is to join NYHRC or NYSC in manhattan, and use their courts in other boroughs... aardvark: Great -- I've sent that to Michael. Thanks for the fast answer! (Type 'Michael:' followed by a message to add something, or 'more' for options.)

Figure 4: Screenshot of Aardvark Answering Tab on Figure 3: Example of Aardvark interacting with an iPhone answerer

has contact information for. Because this kind of “reaching from Aardvark. (This kind of very lightweight interface is out” to users has the potential to become an unwelcome in- important for making the search service available anywhere, terruption if it happens too frequently, Aardvark sends such especially now that mobile device usage is ubiquitous across requests for answers usually less than once a day to a given most of the globe.) user (and users can easily change their contact settings, spec- However, Aardvark is most powerful when used through ifying prefered frequency and time-of-day for such requests). a chat-like interface that enables ongoing conversational in- Further, users can ask Aardvark “why” they were selected teraction. A private 1-to-1 conversation creates an intimacy for a particular question, and be given the option to easily which encourages both honesty and freedom within the con- change their profile if they do not want such questions in the straints of real-world social norms. (By contrast, answering future. This is very much like the real-world model of social forums where there is a public audience can both inhibit po- information sharing: the person asking a question, or the tential answerers [17] or motivate public performance rather intermediary in Aardvark’s role, is careful not to impose too than authentic answering behavior [22].) Further, in a real- much upon a possible answerer. The ability to reach out to time conversation, it is possible for an answerer to request an extended network beyond a user’s immediate friendships, clarifying information from the asker about her question, or without imposing too frequently on that network, provides a for the asker to follow-up with further reactions or inquiries key differentiating experience from simply posting questions to the answerer. to one’s Twitter or Facebook status message. There are two main interaction flows available in Aard- A secondary flow of answering questions is more similar vark for answering a question. The primary flow involves to traditional bulletin-board style interactions: a user sends Aardvark sending a user a message (over IM, email, etc.), a message to Aardvark (e.g., “try”) or visits the “Answering” asking if she would like to answer a question: for example, tab of Aardvark website or iPhone application (Figure 4), “You there? A friend from the Stanford group has a ques- and Aardvark shows the user a recent question from her net- tion about *search engine optimization* that I think you work which has not yet been answered and is related to her might be able to answer.”. If the user responds affirma- profile topics. This mode involves the user initiating the ex- tively, Aardvark relays the question as well as the name of change when she is in the mood to try to answer a question; the questioner. The user may then simply type an answer as such, it has the benefit of an eager potential answerer – the question, type in a friend’s name or email address to but as the only mode of answering it does not effectively refer it to someone else who might answer, or simply “pass” 3 tap into the full diversity of the user base (since most users on this request. do not initiate these episodes). This is an important point: A key benefit of this interaction model is that the avail- while almost everyone is happy to answer questions (see Sec- able set of potential answerers is not just whatever users tion 5) to help their friends or people they are connected to, happen to be visiting a bulletin board at the time a question not everyone goes out of their way to do so. This willingness is posted, but rather, the entire set of users that Aardvark to be helpful persists because when users do answer ques- tions, they report that it is a very gratifying experience: 3There is no shame in “passing” on a question, since nobody else knows that the question was sent to you. Similarly, they have been selected by Aardvark because of their exper- there is no social cost in asking a question, since you are not tise, they were able to help someone who had a need in the directly imposing on a friend; rather, Aardvark plays the moment, and they are frequently thanked for their help by role of the intermediary who bears this social cost. the asker.

436 WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA

"What fun bars downtown have outdoor EXAMPLE 1 EXAMPLE 2 seating?" "I've started running at least 4 days each (Question from Mark C./M/LosAltos,CA) (Question from James R./M/ "I'm just getting into photography. Any week, but I'm starting to get some knee and I am looking for a restaurant in San TwinPeaksWest,SF) suggestions for a digital camera that would ankle pain. Any ideas about how to address Francisco that is open for lunch. Must be What is the best new restaurant in San be easy enough for me to use as a this short of running less?" very high-end and fancy (this is for a small, Francisco for a Monday business dinner? beginner, but I'll want to keep using for a formal, post-wedding gathering of about 8 Fish & Farm? Gitane? Quince (a little older)? while?" "I need a good prank to play on my people). (+7 minutes -- Answer from Paul D./M/ supervisor. She has a good sense of (+4 minutes -- Answer from Nick T./28/M/ "I'm going to Berlin for two weeks and SanFrancisco,CA -- A friend of your friend humor, but is overly professional. Any SanFrancisco,CA -- a friend of your friend would like to take some day trips to places Sebastian V.) ideas?" Fritz Schwartz) that aren't too touristy. Where should I go?" For business dinner I enjoyed Kokkari fringale (fringalesf.com) in soma is a good Estiatorio at 200 Jackson. If you prefer a bet; small, fancy, french (the french actually "My friend's in town and wants to see live "I just moved and have the perfect spot for place in SOMA i recommend Ozumo (a great a plant in my living room. It gets a lot of hang out there too). Lunch: Tuesday - sushi restaurant). music. We both love bands like the Friday: 11:30am - 2:30pm Counting Crows. Any recommendations for light from the north and south, but I know I (Reply from James to Paul) (Reply from Mark to Nick) shows (of any size) to check out?" won't be too reliable with watering. Any thx I like them both a lot but I am ready to try suggestions for plants that won't die?" Thanks Nick, you are the best PM ever! something new "Is there any way to recover an unsaved (Reply from Nick to Mark) (+1 hour -- Answer from Fred M./29/M/ Excel file that was closed manually on a "Should I wear brown or black shoes with a you're very welcome. hope the days they're Marina,SF) Mac?" light brown suit?" open for lunch work... Quince is a little fancy... La Mar is pretty "I'm putting together a focus group to talk fantastic for cevice - like the Slanted Door of about my brand new website. Any tips on "I need to analyze a Spanish poem for EXAMPLE 3 peruvian food... making it as effective as possible?" class. What are some interesting Spanish poems that aren't too difficult to translate?" (Question from Brian T./22/M/Castro,SF) What is a good place to take a spunky, off-the-cuff, social, and pretty girl for a nontraditional, fun, memorable dinner date in San Francisco? "I'm making cookies but ran out of baking powder. Is there anything I can substitute?" "I always drive by men selling strawberries (+4 minutes -- Answer from Dan G./M/SanFrancisco,CA) on Stanford Ave. How much do they charge Start with drinks at NocNoc (cheap, beer/wine only) and then dinner at RNM (expensive, "I have a job interview over lunch tomorrow. per flat?" across the street). Is there any interview restaurant etiquette (Reply from Brian to Dan) Thanks! "My girlfriend's ex bought her lots of that I should know?" (+6 minutes -- Answer from Anthony D./M/Sunnyvale,CA -- you are both in the Google group) expensive presents on anniversaries. I'm "I want to give my friend something that Take her to the ROTL production of Tommy, in the Mission. Best show i've seen all year! pretty broke, but want to show her that I lasts as a graduation present, but someone care. Any ideas for things I could do that (Reply from Brian to Anthony) Tommy as in the Who's rock opera? COOL! already gave her jewelry. What else could I are not too cliche?" give her?" (+10 minutes -- Answer from Bob F./M/Mission,SF -- you are connected through Mathias' friend Samantha S.) Cool question. Spork is usually my top choice for a first date, because in addition to having great food and good really friendly service, it has an atmosphere that's perfectly in between casual and romantic. It's a quirky place, interesting funny menu, but not exactly non- Figure 5: A random set of queries from Aardvark traditional in the sense that you're not eating while suspended from the ceiling or anything

Figure 6: Three complete Aardvark interactions In order to play the role of intermediary in an ongoing conversation, Aardvark must have some basic conversational intelligence in order to understand where to direct messages identities rather than pseudonyms, facilitate interactions be- from a user: is a given message a new question, a continu- tween existing real-world relationships, and consistently pro- ation of a previous question, an answer to an earlier ques- vide examples of how to behave, users in an online commu- tion, or a command to Aardvark? The details of how the nity will behave in a manner that is far more authentic and Conversation Manager manages these complications and dis- helpful than pseudonymous multicasting environments with ambiguates user messages are not essential so they are not no moderators. The design of the Aardvark’s UI has been elaborated here; but the basic approach is to use a state carefully crafted around these principles. machine to model the discourse context. In all of the interfaces, wrappers around the messages from another user include information about the user that can 4. EXAMPLES facilitate trust: the user’s Real Name nametag, with their In this section we take a qualitative look at user behavior name, age, gender, and location; the social connection be- on Aardvark. Figure 5 shows a random sample of questions tween you and the user (e.g., “Your friend on Facebook”, asked on Aardvark in this period. Figure 6 takes a closer “A friend of your friend Marshall Smith”, “You are both in look at three questions sent to Aardvark during this period, the Stanford group”, etc.); a selection of topics the user has all three of which were categorized by the Question Analyzer expertise in; and summary statistics of the user’s activity under the primary topic “restaurants in San Francisco”.4 on Aardvark (e.g., number of questions recently asked or In Example 1, Aardvark opened 3 channels with candidate answered). answerers, which yielded 1 answer. An interesting (and not Finally, it is important throughout all of the above in- uncommon) aspect of this example is that the asker and the teractions that Aardvark maintains a tone of voice which answerer in fact were already acquaintances, though only is friendly, polite, and appreciative. A social search engine listed as“friends-of-friends”in their online social graphs; and depends upon the goodwill and interest of its users, so it is they had a quick back-and-forth chat through Aardvark. important to demonstrate the kind of (linguistic) behavior In Example 2, Aardvark opened 4 channels with candi- that can encourage these sentiments, in order to set a good date answerers, which yielded two answers. For this ques- example for users to adopt. Indeed, in user interviews, users tion, the “referral” feature was useful: one of the candidate often express their desire to have examples of how to speak answerers was unable to answer, but referred the question or behave socially when using Aardvark; since it is a novel along to a friend to answer; this enables Aardvark to tap paradigm, users do not immediately realize that they can into the knowledge of not just its current user base, but also behave in the same ways they would in a comparable real- the knowledge of everyone that the current users know. The world situation of asking for help and offering assistance. asker wanted more choices after the first answer, and resub- All of the language that Aardvark uses is intended both to mitted the question to get another recommendation, which be a communication mechanism between Aardvark and the came from a user whose profile topics related to business user and an example of how to interact with Aardvark. and finance (in addition to dining). Overall, a large body of research [8, 2, 7, 21] shows that when you provide a 1-1 communication channel, use real 4Names/affiliations have been changed to protect privacy.

437 oieuesaepriual active particularly are users Mobile adaki cieyused actively this is of Aardvark month last the All from taken answers. (9/20/2009-10/20/2009). are 386,702 period asked below given having statistics and October 90,361, the questions to to of 225,047 2009 grew of users 1, total of March a number From the 2009. 2009, 20, of March in release Aardvark. of performance and ANALYSIS 5. out. the helping took for asker answerer the the (as questions), thank examples Aardvark to these of time of most majority in It the that in note constraints. to these and interesting recognize — also to is meeting able business are evening answerers family Monday human spon- formal a and small and spunky post-wedding a gathering, a with woman, date young a taneous for to appropriate hypercustomized recommen- are restaurant are dations different Very that answers need. information get their to askers allows “restaurants” it both to related user profile a his from “dating”. came in The and detailed, topics the most has and the who is coworker; friend-of-friend-of-friend. which a a answer, the from from third to came came connection answer answer social third came second distant answer the a first only asker; The with answers. someone 3 from yielding answerers, date WWW 2010•FullPaper adakwsfis aeaalbesm-ulcyi beta a in semi-publicly available made first was usage Aardvark current the of picture a give statistics following The that is Aardvark of features interesting most the of One candi- with channels 10 opened Aardvark 3, Example In rsn ntolvl.Frt oieueso Aardvark of users mobile sur- First, is which levels. month, two per on sessions prising 3.6322 of average an hw h ubro sr e ot rmerytest- early from 2009. month October per through 7 users ing me- Figure of the month. number per and the queries shows period, 3.1 issued this user in active dian day per questions 3,167.2 other tagged or base) peoplesˆa referred user a either the answered (i.e., of engaged or gener- (73.8% passively users asked base) 66,658 (i.e., user while Aardvark this the question), on In of content (55.9% 2009. ated users March 50,526 since users period, or- 2,272 growing Aardvark, from on ganically accounts created have users

Users iue7 adakue growth user Aardvark 7: Figure A ˘ usin) h vrg ur ouewas volume query average The questions). Z ´ so coe,20,90,361 2009, October, of As oieueshad users Mobile 438 usin fe aeasbetv element subjective a have often Questions contextualized highly are Questions answering and questions of Distribution times. 9: Figure iue8 aeoiso usin ett Aardvark to sent questions of Categories 8: Figure technology &programming

Questions Answered music, movies, TV, books websites &internetapps 0.5 1.5 2.5 product reviews&help 0 1 2 x 10 faygetdlsi atmr,M? r“htare “What or MD?” Baltimore, know in you delis “Do sub- example, great a (for any have them queries of of to 64.7% element that jective shows Oc- 2009 and of March between tober questions random 1000 of tally unique). are answers of ques- 98.2% of (and 98.1% of unique Aardvark 63% are in and tions 20], [19, 57 unique between are search, of queries Web diversity in greater While a in queries. results as context of words, context. addition much The ques- other as Aardvark In times search, 3–4 have web words query. tions content traditional the are to to words compared context these func- give of of that usage 45.3% increased the words, this to of tion due some is While length query (median=13). average increased words the Aardvark, 18.6 with is 2.2 19], length between [14, is words length 2.9 query – average the where search, will search. voice-based users with mobile active Aardvark similarly exper- that be early suggest (and also considerations iments) These in natural context. feels model that query sin- with Aardvark’s query. language so a natural your and using get phones, to to used to exactly are useful crafted people Second, more that’s is browsing answer it’s phone short First, phone, a gle a on On reasons. results two search unwieldy. web for traditional is months. through 6 this for available believe surprising been quite We only is has that This service mo- [14]). a 5.68 for month average as per on terms have sessions (who absolute users bile Google in mobile of active Second, users as mobile almost [14].) times are 3 users Aardvark almost mobile of are as users active com- desktop as of point Google, a on (As parison, users. desktop than active more are 4 0−3 min local services 3−6 min 6−12 min April 26-30•RaleighNCUSA 12−30 min 30min−1hr scmae oweb to compared As Aardvark restaurants &bars finance &investing home &cooking miscellaneous travel sports &recreation business research restaurants &bars 1−4 hr manual A 4+ hr WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA

4 x 10 100 3 80 2.5 60 % 2 40

1.5 20 0

Questions 1 >2 >4 >8 >16 >32 Number of Topics 0.5 Figure 11: Distribution of percentage of users and 0 number of topics 0 1 2 3 4 5+ Answers Received asked to look at the question, and 38.0% have been Figure 10: Distribution of questions and number of able to answer. Additionally, 15,301 users (16.9% of all answers received. users) have contacted Aardvark of their own initiative to try answering a question (see Section 3.6 for an ex- the things/crafts/toys your children have made that planation of these two modes of answering questions). made them really proud of themselves?”). In par- Altogether, 45,160 users (50.0% of the total user base) ticular, advice or recommendations queries regarding have answered a question; this is 75% of all users who travel, restaurants, and products are very popular. A interacted with Aardvark at all in the period (66,658 large number of queries are locally oriented. About users). As a comparison, only 27% of Yahoo! An- 10% of questions related to local services, and 13% swers users have ever answered a question [11]. While a dealt with restaurants and bars. Figure 8 shows the smaller portion of the Aardvark user base is much more top categories of questions sent to Aardvark. The dis- active in answering questions (approximately 20% of tribution is not dissimilar to that found with tradi- the user base is responsible for 85% of the total num- tional web search engines [3], but with a much smaller ber of answers delivered to date) the distribution of proportion of reference, factual, and navigational queries, answers across the user base is far broader than on a and a much greater proportion of experience-oriented, typical user-generated-content site [11]. recommendation, local, and advice queries. Social Proximity Matters Of questions that were routed Questions get answered quickly 87.7% of questions sub- to somebody in the asker’s social network (most com- mitted to Aardvark received at least 1 answer, and monly a friend of a friend), 76% of the inline feedback 57.2% received their first answer in less than 10 min- rated the answer as ‘good’, whereas for those answers utes. On average, a question received 2.08 answers that came from outside the asker’s social network, 68% (Figure 10),5 and the median answering time was 6 of them were rated as ‘good’. minutes and 37 seconds (Figure 9). By contrast, on public question and answer forums such as Yahoo! An- People are indexable 97.7% of the user base has at least swers [11] most questions are not answered within the 3 topics in their profiles, and the median user has 9 first 10 minutes, and for questions asked on Facebook, topics in her profile. In sum, Aardvark users added only 15.7% of questions are answered within 15 min- 1,199,323 topics to their profiles; not counting over- utes [17]. (Of course, corpus-based search engines such lapping topics, this yields a total of 174,605 distinct as Google return results in milliseconds, but many of topics which the current Aardvark user base has ex- the types of questions that are asked from Aardvark pertise in. The currently most popular topics in user require extensive browsing and query refinement when profiles in Aardvark are “music”, “movies”, “technol- asked on corpus-based search engines.) ogy”, and “cooking”, but in general most topics are as specific as “logo design” and “San Francisco pickup Answers are high quality Aardvark answers are both com- soccer”. prehensive and concise. The median answer length was 22.2 words; 22.9% of answers were over 50 words 6. EVALUATION (the length of a paragraph); and 9.1% of answers in- cluded hypertext links in them. 70.4% of inline feed- To evaluate social search compared to web search, we ran back which askers provided on the answers they re- a side-by-side experiment with Google on a random sample ceived rated the answers as ‘good’, 14.1% rated the of Aardvark queries. We inserted a “Tip” into a random answers as ‘OK’, and 15.5% rated the answers as ‘bad’. sample of active questions on Aardvark that read: ”Do you want to help Aardvark run an experiment?” with a link to There are a broad range of answerers 78,343 users (86.7% an instruction page that asked the user to reformulate their of users) have been contacted by Aardvark with a re- question as a keyword query and search on Google. We quest to answer a question, and of those, 70% have asked the users to time how long it took to find a satisfac- 5A question may receive more than one answer when the tory answer on both Aardvark and Google, and to rate the Routing Policy allows Aardvark to contact more than one answers from both on a 1-5 scale. If it took longer than 10 candidate answerer in parallel, or when the asker resubmits minutes to find a satisfactory answer, we instructed the user their question to request a second opinion. to give up. Of the 200 responders in the experiment set, we

439 WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA found that 71.5% of the queries were answered successfully 8. ACKNOWLEDGMENTS on Aardvark, with a mean rating of 3.93 (σ = 1.23), while We’d like to thank Max Ventilla and the entire Aardvark 70.5% of the queries were answered successfully on Google, team for their contributions to the work reported here. We’d with a mean rating of 3.07 (σ = 1.46). The median time-to- especially like to thank Rob Spiro for his assistance on this satisfactory-response for Aardvark was 5 minutes (of passive paper, and Elijah Sansom for preparation of the figures. And waiting), while the median time-to-satisfactory-response for we are grateful to the authors of the original “Anatomy” Google was 2 minutes (of active searching). paper [4] from which we derived the title and the inspiration Of course, since this evaluation involves reviewing ques- for this paper. tions which users actually sent to Aardvark, we should ex- pect that Aardvark would perform well — after all, users 9. REFERENCES chose these particular questions to send to Aardvark be- [1] A. Banerjee and S. Basu. A Social Query Model for cause of their belief that it would be helpful in these cases.6 Decentralized Search. In SNAKDD, 2008. Thus we cannot conclude from this evaluation that social [2] H. Bechar-Israeli. From to : search will be equally successful for all kinds of questions. Nicknames, Play, and Identity on Internet Relay Chat. Further, we would assume that if the experiment were re- Journal of Computer-Mediated Communication, 1995. versed, and we used as our test set a random sample from [3] S. M. Bietzel, E. C. Jensen, A. Chowdhury, D. Grossman, and O. Frieder. Hourly analysis of a very large topically Google’s query stream, the results of the experiment would categorized web query log. In SIGIR, 2004. be quite different. Indeed, for questions such as “What is [4] S. Brin and L. Page. The anatomy of a large-scale the train schedule from Middletown, NJ?”, traditional web hypertextual Web search engine. In WWW, 1998. search is a preferable option. [5] T. Condie, S. D. Kamvar, and H. Garcia-Molina. Adaptive However, the questions asked of Aardvark do represent peer-to-peer topologies. In P2P Computing, 2004. a large and important class of information need: they are [6] J. Davitz, J. Yu, S. Basu, D. Gutelius, and A. Harris. iLink: typical of the kind of subjective questions for which it is Search and Routing in Social Networks. In KDD, 2007. difficult for traditional web search engines to provide satis- [7] A. R. Dennis and S. T. Kinney. Testing media richness fying results. The questions include background details and theory in the new media: The effects of cues, feedback, and task equivocality. Information Systems Research, 1998. elements of context that specify exactly what the asker is [8] J. Donath. Identity and deception in the virtual looking for, and it is not obvious how to translate these in- community. Communities in Cyberspace, 1998. formation needs into keyword searches. Further, there are [9] B. M. Evans and E. H. Chi. Towards a Model of not always existing web pages that contain exactly the con- Understanding Social Search. In CSCW, 2008. tent that is being sought; and in any event, it is difficult [10] D. Faye, G. Nachouki, and P. Valduriez. Semantic Query for the asker to assess whether any content that is returned Routing in SenPeer, a P2P Data Management System. In is trustworthy or right for them. In these cases, askers are NBiS, 2007. looking for personal opinions, recommendations, or advice, [11] Z. Gyongyi, G. Koutrika, J. Pedersen, and from someone they feel a connection with and trust. The H. Garcia-Molina. Questioning Yahoo! Answers. In WWW Workshop on Question Answering on the Web, 2008. desire to have a fellow human being understand what you [12] T. Hofmann. Probabilistic latent semantic indexing. In are looking for and respond in a personalized manner in real SIGIR, 1999. time is one of the main reasons why social search is an ap- [13] B. J. Jansen, A. Spink, and T. Sarcevic. Real life, real users, pealing mechanism for information retrieval. and real needs: a study and analysis of user queries on the web. Information Processing and Management, 2000. [14] M. Kamvar, M. Kellar, R. Patel, and Y. Xu. Computers 7. RELATED WORK and iPhones and Mobile Phones, Oh My!: a Logs-based Comparison of Search Users on Different Devices. In There is an extensive literature on query routing algo- WWW, 2009. rithms, particularly in P2P Networks. In [5], queries are [15] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina. The routed via a relationship-based overlay network. In [15], an- EigenTrust Algorithm for Reputation Management in P2P swerers of a multicast query are ranked via a decentralized Networks. In WWW, 2003. authority score. In [6], queries are routed through a supern- [16] D. Klein, K. Toutanova, H. T. Ilhan, S. D. Kamvar, and ode that routes to answerers based on authority, responsive- C. D. Manning. Combining heterogeneous classifiers for ness, and expertise, and in [10], supernodes maintain exper- word-sense disambiguation. In SENSEVAL, 2002. tise for routing. Banerjee and Basu [1] introduce a [17] M. R. Morris, J. Teevan, and K. Panovich. What do people ask their social networks, and why? A Survey study of routing model for decentralized search that has PageRank status message Q&A behavior. In CHI, 2010. as a special case. Aspect models have been used to match [18] L. Page, S. Brin, R. Motwani, and T. Winograd. The queries to documents based on topic similarity in [12], and PageRank citation ranking: Bringing order to the Web. queries to users in P2P and social networks based on ex- Stanford University Technical Report, 1998. pertise in [6]. Evans and Chi [9] describe a social model of [19] C. Silverstein, M. Henziger, H. Marais, and M. Moricz. user activities before, during, and after search, and Morris Analysis of a very large Web search engine query log. In et al. [17] present an analysis of questions asked on social SIGIR Forum, 1999. networks that mirrors some of our findings on Aardvark. [20] A. Spink, B. J. Jansen, D. Wolfram, and T. Saracevic. From e-sex to e-commerce: Web search changes. IEEE Computer, 2002. 6 In many cases, users noted that they sent their question to [21] L. Sproull and S. Kiesler. Computers, networks, and work. Aardvark because a previous was difficult to In Global Networks:Computers and International formulate or did not give a satisfactory result. For example: Communication. MIT Press, 1993. “Which golf courses in the San Francisco Bay Area have [22] M. Wesch. An anthropological introduction to YouTube. the best drainage / are the most playable during the winter Library of Congress, 2008. (especially with all of the rain we’ve been getting)?”

440 109

Chapter VI Social Network Models for Enhancing Reference-Based Search Engine Rankings

Nikolaos Korfiatis Copenhagen Business School, Denmark

Miguel-Ángel Sicilia University of Alcala, Spain

Claudia Hess University of Bamberg, Germany

Klaus Stein University of Bamberg, Germany

Christoph Schlieder University of Bamberg, Germany

ABSTRACT

In this chapter we discuss the integration of information retrieval information from two sources—a social network and a document reference network—for enhancing reference-based search engine rankings. In particular, current models of information retrieval are blind to the social context that surrounds infor - mation resources, thus they do not consider the trustworthiness of their authors when they present the query results to the users. Following this point we elaborate on the basic intuitions that highlight the contribution of the social context—as can be mined from social network positions for instance—into the improvement of the rankings provided in reference-based search engines. A review on ranking models in Web search engine retrieval along with social network metrics of importance such as prestige and centrality are provided as background. Then a presentation of recent research models that utilize both contexts is provided, along with a case study in the Internet-based encyclopedia Wikipedia, based on the social network metrics.

Copyright © 2008, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited. Social Network Models for Enhancing Reference-Based Search Engine Rankings

INTRODUCTION That is to say that 1 (i.e., incoming links) are considered as a positive evaluation of Since the introduction of information technol- the respective Web site. The PageRank therefore ogy, information retrieval (IR) has been an does not distinguish whether the user setting the important branch of computer and information link agrees with the content of the other Web science, mainly due to the ability to reduce the page or whether she or he disagrees. This fact time required by a user to gather contextualized underlines that current reference-based ranking information and knowledge (Baeza-Yates & algorithms often do not take into consideration Ribeiro-Neto, 1999). With the introduction of that an information resource is a result of cogni- hypertext (Conklin, 1987), information retrieval tive and social processes. In addition to its sur- methods and technologies have been able to in- rounding hyperlinks, a social context (Brown & crease their accuracy because of the high amount Duguid, 2002) underlies the referencing of those of meta-information available for the IR system resources. to exploit. That is not only information about the This suggests that a critical point in improving documents per se, but information about their link-based metrics would be that of augmenting context and popularity. However the development or weighting the pure or reference model of the World Wide Web has introduced another with social information, provided that linking dimension to the IR domain by exposing the is in many cases influenced by social ties, and social aspect of information (Brown & Duguid, trustworthiness critically depends on the social 2002) produced and consumed by humans in this relevance or consideration of the authors of the information space. pages. Recently, several independent research- Current ranking methods in information ers have provided different models for this kind retrieval which are used in Web search en- of social network analysis as applied to ranking gines as well exploit the references between or assessing the quality of Web pages (Sicilia & information resources such as the hypertextual Garcia, 2005; Hess, Stein, & Schlieder, 2006; (hyperlinked) context of Web pages in order to Stein & Hess, 2006) or activity spaces such as determine the rank of a search result (Dhyani, Usenet and wikis (Korfiatis & Naeve, 2005). Other Keong, & Bhowmick, 2002; Faloutsos, 1985). approaches such as those presented by Borner, The well-known PageRank algorithm (Page, Maru, and Goldstone (2004) have also dealt with Brin, Motwani, & Winograd, 1998; Brin & integrating different networks by analyzing the Page, 1998) has proved to be a very effective simultaneous growth of coauthor and citation paradigm for ranking the results of Web search networks in time. algorithms. In the original PageRank algorithm, The main objective of this chapter is to provide a single PageRank vector is computed, using the a survey for recently proposed measures on a docu- link structure of the Web, to capture the relative ment reference network that integrate information “importance” of Web pages, independent of any from a second source: a social network. Therefore, particular search query. Nonetheless, the as- we provide a bridge to the areas of social network sumptions of the original PageRank are biased analysis (SNA) and ranking methods applied in towards measuring external characteristics. In information retrieval. fact, Page et al. (1998) conclude their article with To this end this chapter is organized as follows: the sentence: “The intuition behind PageRank is The next section describes the basic intuition that it uses information which is external to the behind the concepts and the definitions provided Web pages themselves—their backlinks, which by this chapter. We then provide an insight to the provide a kind of peer review.” classic models of citation and link-based ranking

110 Social Network Models for Enhancing Reference-Based Search Engine Rankings

methods such as PageRank and HITS. Next we of the Web document retrieval systems such as offer insight to sociometric models, which can search engines is under the negative effects of be used to evaluate users, as well as a reference issues such as the ambiguity inherent to natural to the FOAF vocabulary, which is used to define language processing. Further, backlink models social connections on the Web. An overview of are blind to the authors of the pages, which entails two hybrid ranking models that integrate in an that every link is equally important for the final equal way both the social and hypertextual context ranking. This overlooks the fact that some links follows, and we then provide a case study of mod- can be more relevant than others, depending on eling authoritativeness on data obtained from the their authors or other parameters such as affilia - English-language version of Wikipedia by using tion with organizations, and even in some cases, some of the metrics discussed earlier in the chapter. the links may have been created with a malicious Finally, we present our conclusions and provide purpose, such as spamming or the popular case an outlook of future research directions that can of “” (Mathes, 2005). be exploited towards the combined consideration Probably the best example where the above of hypertextual and social context in a Web IR phenomena are manifested is the field of scien - system such as a Web search engine. tometrics and in particular the scholar evaluation problem. Earlier from the introduction of the im- pact factor by Garfield (1972), several researchers BACKGROUND AND MOTIVATION considered the development of metrics that can assist the evaluation of a scholar based on the The original intuition behind the design of the reputation of his scientific works usually denoted World Wide Web and the Hypertext Markup as references to his or her work (Leydesdorff, (HTML) language is that authors can publish 2001). Web documents that can provide pointers to other In particular the scholar evaluation problem documents available online. This simplified aspect can be formalized as follows: of a hypertext model adapted by Tim Berners-Lee in the original design of the World Wide Web Considering a scholar S with a collection of (Berners-Lee & Fischetti, 1999) has given to the scientific output represented by a set of scholarly

Web the morphology of an open publication system productions/publications as A = {a1, a2, ..., an}, which can evolve simultaneously by providing then his/her research impact B is the aggregation references and pointers to existing documents if the impact of the individual scholarly works as: available. However, as in the original publication B = Ω {P(a1), P(a2), ..., P(an)} where, Ω is an contexts where an author gains credibility due aggregation operation (e.g., averaging) over the to the popularity of her or his productions/affili - popularity P (ai) of his or her research produc - ations, the credibility/appropriateness of a page tion a i. To the above formalization a set of open on the Web is to some extent correlated with its issues exist: popularity. What the first search engines on the Web failed to capture was exactly this kind of • Regarding the variability of his/her research popularity which attributes the appropriateness work, how can we aggregate the impact to the query results of the search engine. of his production to a representative and Although advances in query processing have generally accepted number? made the retrieval of the query results computa- • How do we measure the popularity of the tionally efficient (Wolf, Squillante, Yu, Sethura - production? For example, do we count the man, & Ozsen, 2002), the precision and the recall same the reference provided to a research

111 Social Network Models for Enhancing Reference-Based Search Engine Rankings

work by a technical report and by a gener - Formal Representations ally accepted “prestigious” journal? Before continuing with the models representing Scientometrics provides insight to open issues credibility in both social and hypertextual con- in the evaluation of the importance of documents texts, we should first have a look at the formaliza - that represent the scientific production, whereas tions used in order to be able to comprehend the in our case we look into the evaluation of the im- metrics and the models using them. portance of Web pages. We follow the assumption A social/hypertextual network is formalized as that those Web pages/documents are productions a graph G = V E),(: which is an ordered pair of of social entities which transpose a degree of two sets. A set of vertices V = V1 ,( V2 ,V3,.....,Vn ) credibility from their social context. which represents the social entities/pages, and a Based on this assumption we develop, in the set of edges E = ( E11 , E12 , E21 ,....., Eij ) where Eij section that follows, a basic framework that under- represents the adjacent connection between the lines the core of the credibility models presented nodes i and j. In abstract form the network struc- throughout the rest of this chapter both for the ture can be represented as a symmetric matrix social and hypertextual context. As can be seen (adjacency matrix) in which the nodes are listed in Figure 1, our framework considers a kind of in both axes and a Boolean value is assigned to affiliation network (see definition bellow) where Eij , which depicts the existence of a relational hypertext documents (Web pages) are connected tie between the entities, represented in Eij . to their authors. Depending on the type of network, the adjacent This gives a two-layer network: the document connection may be: reference layer with documents (e.g., Web pages) linking to other documents, and the social layer • Edge: Indicates a connection between two with actors (authors or reviewers of the docu- nodes. In that case the graph is in the general ments) with social ties (e.g., trust between the form: G = V E),(: . When the connection is actors, coauthorship, etc.).

Figure 1. Layers of context in the mixed mode network of authors or readers (human icons) and Web resources (squares)

112 Social Network Models for Enhancing Reference-Based Search Engine Rankings

directed, the graph has the general form Network Modes and Data

G = V E,,(: Ea ) denoting that the connec- tion is directional. In principle, network data consists of two types • Signed edge: Indicates a connection between of variables: structural and compositional. two nodes which is assigned with a value. In Compositional variables represent the attributes that case, the graph is in the general form of the entities that form apart the network. This

G = V E,,(: Ed ) where the set Ed is the kind of data can be, for instance, the theme of

value set for the mapping E → Ed . When the Web page. Structural variables define the the connection is directional, the graph ties between the network entities and the mode

has the general form G = V E,,(: Ea , Ed ) under which the network is formed. Depending

where Ea and Ed is the directed connec- on the domain considered, different terminolo- tion and value set respectively. gies can be used. Depending on the measurement of structural An edge is often referenced to in the litera- variables, a network can have a mode. The term ture as arc. In a social network context, an edge mode refers to the number of entities that the and an arc differ based on the fact that an edge structural variables address in the network. denotes a reciprocal connection whereas an arc In particular, we can have the two following is a directed one. However, when the network is a types of networks: directed graph (digraph), those two terms become synonymous. • One-mode network: This is the basic Furthermore, depending on the group size, type of network where structural variables a relation can be dichotomous, trichotomous, or address the relational ties between entities connecting subgroups. belonging on the same set. The Web is in an abstract form a one-mode network linking • A dichotomous relation is a bivalence type Web documents.

of relational tie where the valued set E d is • Two-mode network or affiliation network: mapped to [0; 1]. Dichotomous relation- This type of network contains two sets of ships form units called dyads that are used entities, that is, authors and articles. An to study indirect properties between two affiliation network is a special kind of two- individual entities. On the Web, for example, mode network where at least one member of the referring link from one page to another the set has a relational tie with the member represents a dichotomous relation. of the other set. For example, an author has • A trichotomous or triad is a container of at written an article. The networks studied most six dyads that can be either signed or through the rest of this chapter are affiliation unsigned. Usually in a triad, a dyadic relation networks, where authors produce documents is referenced by a third entity which provides with which they are affiliated. a point of common affiliation. For instance, two pages are referenced by a directory page Table 1. Example structural and compositional which can be studied in a triad. variables for different applications/domains • A subgroup relation is a more complex Structural Variables Compositional Variables relation that acts as a container of triads where the members are interacting using a links Web pages/documents common flow or path. References/citations Papers/articles

113 Social Network Models for Enhancing Reference-Based Search Engine Rankings

In social network theory several other kinds where Rd is the set of pages citing pd, and Ck is of network exist such as ego-centered networks the set of pages cited by Pk. or networks based on special dyadic relations. As This resembles the idea of a random surfer. α already mentioned, we intend to study only affili - Starting at some Web page pa with probability , α ation networks such as the author-article network she follows one of ( Ca ) links, while with (1 – ) or the reviewer-article network. she stops following links and jumps randomly to some other page. Therefore, (1 – α) can be considered as a kind of basic visibility for every REFERENCE-BASED RANKINGS page. From the ranked document’s perspective,

IN WEB CONTEXT a page pd can be reached by a direct (random) jump with probability (1 – α) or by coming from ∈ As noted earlier, using the hyperlink structure of one of the pages pk Rd, where the probability the Web (or any linked document repository) to to be on pk is vis k. compute document rankings is a very successful The equation shown above gives a recursive approach (visible in the success of Google). The definition of the visibility of a Web page because basic idea of link structure-based ranking methods vis a depends on the visibility of all pages pi citing p , and vis depends on the visibilities of the pages is that the prominence of a document pd can be a i citing p and so on. For n pages this is a system of determined by looking at the documents citing pd i 2 n linear equations that has a solution. In praxis, (and the documents cited by pd ). A paper cited by many other papers must be somehow important, solving this system of equations for some million otherwise it would not be cited so much. And pages would be much too expensive (it takes too a Web page linked by many other pages gains much time), therefore an iterative approach is used prominence (visibility). (for details, see Page et al., 1998). Another view is the random surfer model. If Even the iterative approach is rather time a user starts on an arbitrary Web page, follows consuming; the rank (visibility) of all pages is not some link to another page, follows a link from computed at query time (when users are waiting this page to a third one, and so on, the likelihood for their search results), but all documents are for getting a certain page will depend on two sorted by visibility offline, and this sorted list is aspects: from how many other pages it is linked used to fulfill search requests by selecting the first and how visible the linking pages are. k documents matching the search term. 5

PageRank HITS In 1976, Pinski and Narin computed the impor- The hypertext induced topic selection (HITS) is an tance (rank) vis d of a scientific journal pd by using alternative approach using the network structure the weighted sum of the ranks vis k of the journals 3 for document ranking introduced by Kleinberg pk with papers citing pd. A slightly modified ver - sion of this algorithm (the PageRank algorithm) (1999). Kleinberg identifies two roles a page can is used by the search engine Google 4 to calculate fulfill: hub and authority. Hubs are pages referring to many authorities (e.g., linklists), authorities are the visibility vis d of a Web page pd: pages that are linked by many hubs. Now for each vis k , page p , its hub-value h and its authority value vis d 1 d d p R C a can be computed: k d k d

114 Social Network Models for Enhancing Reference-Based Search Engine Rankings

TrustRank algorithm (Gyongyi, Garcia-Molina, hd ak , ad hl p C p R k d l d & Pedersen, 2004), where link spam is semi- automatically identified by using a small seed of Here a page pd has a high hd if it links many user-rated pages. authorities (i.e., pages pi with high ai). In other All these algorithms are best suited for net- words, it is a good hub if it knows the important works with cyclic link structures. Repositories pages. The other way around, pd has a high ad if with mainly acyclic reference structure like it is linked by many good hubs, that is, it is listed citation networks of scientific papers (where in the important link lists. While the authority ad each paper cites older ones) or online discussion gives the visibility of the document ( vis d = a d ), groups where each thread consists of one initial hd is only an auxiliary value. posting and sequences of replies/follow-ups, need But HITS not only differs from PageRank by other visibility measures, such as those provided the equations used to calculate the visibility. While by Malsch et al. (2006), where creation time of a the linear equation system is solved iteratively in document is much more important (as scientific both algorithms, they differ in the set of pages papers or postings are not changed after publish- used. As PageRank does the calculation off-line ing in contrast to Web pages). ∈ on all pages pi P and at query time selects the first k matching pages from this presorted list, Linking and Evaluating Users HITS in a first step selects all pages matching the search term at query time (this gives a subset Ranking is not only a phenomenon that takes M ⊂ P) and then adds all pages linked by pages places in the Web, but derives from the archetype C = from M ( M U ∈ C ) and all pages linking pi M i stratification of a society into degrees of power R = pages from M ( M U ∈ R ). Then h and a are and dominance. In principle, in social contexts pi M i d d ∈ C∪ ∪ R computed for all pages pd M' = (M M M ). two types of ranking exist: explicit (formal) Only including pages somehow relevant to the and implicit (informal). In explicit ranking it is search query (matching pages and neighbors) may declared who is the one who has the authority improve the quality of the ranking, but produces to command due to insignia, symbols, or status that he or she is being attributed from the begin- high costs at query time, for ad and hd are not computed off-line because M' is dependent on the ning. On the other side, in implicit ranking it is search term. This is a problem in praxis because the opinion and the behaviors of the surrounding a single search term can have up to some million entities that facilitate who has the authority to matching Web pages, even if there are strategies command and decide. for pre-calculation. 6 Both implicit and explicit forms of ranking may exist depending on the purpose of the social Other Approaches group and the kind of interactions between the members. However, our study is focused only on There are other link structure-based ranking implicit forms of ranking where status is a social algorithms, for example, the hilltop algorithm, 7 attribute that emerges rather than being set. In which in a first step identifies so-called expert sociology there are several kinds of studies that pages (pages linking to a number of unaffili - tend to explain how status emerges, therefore ated pages 8) and then each page is ranked by the several research models exist. Those models are rank of the expert pages it is linked by, or the analyzed further in the following sections.

115 Social Network Models for Enhancing Reference-Based Search Engine Rankings

Sociometric Status and Prominence the two most noteworthy measures of an actor in a directed network; the measures are “prestige” In a social group one recognizes several strata that and “centrality.” characterize the members with a kind of implicit rank that is known in social science literature as Prestige “status” (Katz, 1953). Usually in a sociological interpretation, status denotes power expressed in A prestige measure is a direct representation different contexts such as political or economical. of status which often employs the non-recipro- The most basic theoretical implication of status cal connections/choices provided to that entity is the availability of choices the entity receives in along with the influence that this entity might the network which gives the entity the advantage provide to the neighboring entities. In principle, of negotiation over the others. Depending on the prestige is a non-reciprocal characteristic of an topology of the network, status can be attributed to entity. That means that if the entity is considered several nodes of the sociogram (see Figure 2). prestigious by another entity, it does not mean In SNA studies, status is depicted upon the that this entity will consider the referring entity generalization of the location of the actor in the prestigious as well. network. In particular, status is addressed by how strategic the position of that entity in a network Inner Degree is (e.g., how it affects the position of the others). Theoretical aspects of status were defined by The simplest measure of prestige is the inner Moreno and Jennings (1945) as the instances of degree index or the popularity of an entity, the sociometric “star” and “isolate.” which is defined as follows. Considering a graph

Quantifications of status employ techniques of G = V ,(: E A ) and E A , the set of the directed graph theory such as the centrality index (Sabi- connections E = ( e ,e 2,11,1 ,...... ,e , ji ) between dussi, 1966; Freeman, 1979; Friedkin, 1991) which the members of the set V then the inner degree in have been adjusted to the various representations K i of a vertex Vi is the sum of the incoming of ties in a network. In our case we summarize connections to that vertex.

Figure 2. Types of connection degree in the network. It is obvious that the sociometric “star” (first dia - gram) is considered the one with the highest inner degree, thus is more popular. When there is reciprocity, the degree of influence can be used to provide an indication of prestige in the network.

116 Social Network Models for Enhancing Reference-Based Search Engine Rankings

in Actor Degree Centrality K i = ∑e ,ij j=1 Actor degree centrality is the normalized index of Degree of Influence and Proximity the degree of an actor divided by the maximum Prestige number of vertices that exist in a network. Con- sidering a graph G = V E),( with n vertices, then

However, the inner degree index of a vertex makes the actor degree centrality C d ( ni ) will be: sense only in cases where a directional relation- ship is available (the connection is non-recipro- d ( ni ) cal), and this cannot be applied in the study of C ( n ) = d i n −1 non-directed networks. In that case the prestige is computed by the influence domain of the vertex. For a non-directed graph G = V E E),,(: , the where n −1 is the number of the remaining nodes influence domain of a vertex Vi is the number or in the graph G. Actor degree centrality is often proportion of all other vertices that are connected interpreted in the literature as the “ego density” by a path to that particular vertex. (Burt, 1982) of an actor since it evaluates the importance of the actor based on the ties that 1 connect him or her to the other members of the d i = ∑e ,ij network. The higher the actor degree centrality − N 1 j=1 is, the most prominent this person is in a network, since an actor with a high degree can potentially On the above measure E represents the set of directly influence the others. paths between the vertices Vi and V j , and N −1 is the number of all available nodes in the graph Closeness or Distance Centrality G (the total number of nodes N = V minus the node that is subject to the metric). Another measure of centrality, the closeness Having the degree of influence, we can com - centrality, considers the “geodesic distance” of a pute the “proximity prestige” (P Pi) by normal- node in a network. For two vertices a geodesic is izing the degree of the vertex by the degree of defined as the length of the shortest path between influence such as: them. For a graph G = V E),( , the closeness

in centrality C c ( Vi ) of a vertex Vi is the sum of ki PP i = geodesic distances between that vertex and all d i the other vertices in the network. Centrality n −1 Unlike prestige measures that rely mainly on C ( V ) = ic ud ,( u ) directional relations of the entities, the centrality ∑ i n i=1 index of a graph can be calculated in various ways, taking also into account the non-directional con- where the function d(ui, u n) calculates the length nections of the vertex that is examined. The most of the shortest path between the vertices i and noteworthy measures of centrality are classified by j, and n¡1 is the number of all other vertices in the degree of analysis they employ in the graph. the network. The closeness centrality can be interpreted as a measurement of the influence of a vertex in a graph: the higher its value, the

117 Social Network Models for Enhancing Reference-Based Search Engine Rankings

easier it is for that vertex to spread information The FOAF Vocabulary into that network. Distance centrality can be valuable in networks where the actor possesses Although sociometric models can provide several transitive properties that through transposition pathways on evaluating the importance of users, can be spread through direct connections such there is a crucial need for an accurate way of min- as the case of reputation. ing the relations between them in order to be able to apply those models. This can be done by using Betweenness Centrality expressive vocabularies such as FOAF. The Friend-of-a-Friend vocabulary (Brickley Betweenness is the most celebrated measure of & Miller, 2005) is an expressive vocabulary set centrality since it not only measures the promi- whose syntax is based on RDF syntax (Klyne & nence of a node based on the position or the activity, Carroll, 2004) and which is gaining popularity but also the influence of the node in information nowadays as it is used to express the connections or activity passed to other nodes Considering a between social entities in the Web along with their graph G = V E),( with n vertices, the between- hypertextual properties such as their homepages ness C B u)( for vertex n ∈V is: or e-mails. In a best-case scenario, the author of a Web page (information resource) will attach his u)( FOAF profile in the resource in order to make it ∑ ∈≠≠ Vtus st identifiable as an own production by the visitors C B u)( = ( n − )(1 n − )2 of that page. This can be observed clearly in cases such as blogs where the information resource represents the person that expresses his or her where C B u)( = 1 if and only if the shortest path from s to t passes through v and 0 otherwise. views through the blog. Furthermore connec- Betweenness can be the basis to interpret roles tion between blogs also represents a kind of a such as the “gatekeeper” or the “broker,” which directional dichotomous tie between the authors are extensive studies in communication networks. of those blogs. A vertex is considered as a “gatekeeper” if its In FOAF, standard RDF syntax is used to betweenness and inner degree is relatively high. describe the relations between various acquain- The “broker” is a vertex that has relatively high tances (relations) of the person described by the outer degree and betweenness. FOAF profile. This relation is depicted in the

Table 2. Some representative elements of the FOAF vocabulary and the representation of the relational tie

Vocabulary Element Description Type of Relational Tie foaf:knows links foaf:persons Direct Provides affiliation/membership (relates an entity with a indirect foaf:member social group) Indicates authorship (relates an information resource foaf:maker indirect with its creator) foaf:based_near Indicates a spatial affiliation of a social entity indirect Indicates a temporal affiliation of a social entity with a foaf:currentproject indirect project or an activity

118 Social Network Models for Enhancing Reference-Based Search Engine Rankings

predicate which denotes that the for the expressiveness of social connections on person who has a description of in the Web. One particular issue is that although his profile for another person, has a social con - the FOAF vocabulary (see Table 2) has a set of nection with that person as well. For example a properties for the description of several kinds of FOAF profile for one of the authors of this chapter relationships, the property is the most and the connection he has with his coauthors can common relation that is expressed in a publicly be described by the fragment of RDF code seen available FOAF profile. According to the general in Box 1. discussion in the FOAF project, the reason for Research on descriptions of social relations this is that many users prefer not to express their in the semantic Web 9 is an undergoing effort strength of social connections publicly than to which has been initiated lately to address the do it in a general way which the various concerns and sociological implications property implies.

Box 1.

Nikolaos Korfiatis Claudia Hess < /foaf:Person> Klaus Stein Dr. Miguel-Angel Sicilia

119 Social Network Models for Enhancing Reference-Based Search Engine Rankings

Within well-defined application domains, improve the measures on the document reference however, more expressive constructs are required network by using propagated information from for describing the relationships between persons. the social network. Using the social relationships for recommending sensitive information such as Avesani, Massa, Integrating Imprecise Expressions of and Tiella (2005) in Moleskiing, a platform for the Social Context in PageRank exchanging information on ski tours (e.g., snow conditions, risk of avalanches), it is highly critical Sicilia and Garcia (2005) have introduced an to get this information only from persons that you approach for integrating the social context of an consider as trustworthy, and not from everyone information resource into the core of ranking you know (or who is known by some of your algorithms such as PageRank. Although a flavor friends). Golbeck, Parcia, and Hendler (2003) of PageRank (PageRank with priors) can be used therefore provided an extension to the FOAF vo - to normalize existing rankings, the core of the cabulary so that users can explicitly specify their method relies on the expression of the relational degree of trust in another person. Specifying the ties of the social context (e.g., connections in a degree of trust in someone with respect to her or social network) using imprecise expressions. his recommendations of scientific papers gives, The idea is that of using imprecise assessments according to the trust ontology 10 by Golbeck et of social relationships as a weighting factor for al. (2003), as shown in Box 2. algorithms like PageRank, whereas imprecise expressions map better to social relation due to the difficulty of making an accurate estimation of the EVALUATING WEB PAGE strength of the relational ties and their influence IMPORTANCE THROUGH THE to the rest of the structure members. ANALYSIS OF SOCIAL TIES The model considers two cases of social rela- tions: (a) those that are formal/declared and can be This section presents two research models to mined from direct social connections, and (b) those integrate social network data into document that are informal/assumed and can be inferred by reference network-based measures: the PeopleR- common properties such as affiliation. ank approach and the trust-enhanced document The model begins by computing a metric called rankings approach. The algorithms presented PeopleRank (PPR). PPR is based on the declared

Box 2.

9

120 Social Network Models for Enhancing Reference-Based Search Engine Rankings

Table 3. Intuitions behind the development of the PeopleRank algorithm (Adapted from Sicilia & Garcia, 2005)

PageRank PeopleRank Intuitively, pages that are well cited from many places around Intuitively, the trust on the quality of pages is related to the the Web are worth looking at. degree of confidence we have on their authors.

Also, pages that have perhaps only one citation from something Pages authored or owned by people with a larger positive like the Yahoo! homepage are also generally worth looking at. prestige should somewhat be considered more relevant.

relationships connecting pairs of Even though the idea of ranking by peer’s specifications. As mentioned earlier, declarations seems intuitive and is coherent with the FOAF vocabulary has deliberately avoided the concepts explained in this chapter, the origi- more specific forms of relation like friendship or nal intuitive justification provided for PageRank endorsement, since social attitudes and conven- requires a reformulation. Table 3 provides the tions on this topic vary greatly between countries original and the socially oriented justifications. and cultures. An important parameter of the PpR compu- In consequence, the strength is provided ex- tation is that declarations of social awareness plicitly as part of the link. In the case of absence should be interpreted. Here the management of of such value, an “indefinite” middle value is vagueness plays a role, since there is no single used. With the above, the PeopleRank can be model or framework that provides a metric for defined by simply adapting the original PageRank social distance. definition: The relevance r of a social tie in that model is provided by the following general expression,

We assume a social entity/person A that has per - which provides a value for each edge ( e1, e2) in the … sons T 1 Tn which declare they know him/her directed graph formed by the explicitly declared (i.e., provide FOAF social pointers to it). The social relationships. parameter d is a damping factor which can be For a set of peers p1, p2, we define social rel - ⋅ set between 0 and 1…Also C(A) is defined as the evance as r(p1, p2) = S(p1, p2) e(p1, p2), asserting ≠ number of (interpreted) declarations of backlinks that p1 p2. going out of A’s FOAF profile. The relevance r of an edge is determined from a degree of strength S (in a scale of fuzzy numbers In the above definition we consider as an [~0; ~ 10]) weighted by a degree of evidence e interpreted declaration of backlink the non-di- about the relationship. These strengths could be rect connection between two persons that can provided by extending the current FOAF schema be inferred by other FOAF elements such as the with an additional attribute. Since we consider FOAF affiliation. the variable S as a fuzzy variable, we can define Following the above definition the PeopleR - membership in a fuzzy set such as S ε {close friend, ank of a person A(PpR (A)) can be defined as acquaintance, distant friend,…} and a membership follows: function that expresses the quantifications of the n PpR T )( variable S from its linguistic form. PpR A )( = 1( − d) + d∑ i The quantification can be done by the ex - C T )( i=1 i pression of the linguistic variable in the FOAF schema.

121 Social Network Models for Enhancing Reference-Based Search Engine Rankings

The second fuzzy variable in the relevance while people that were coworkers (as declared function is the evidence of the relation. As already by ) are credited an amount of mentioned, we infer evidence from non-direc- 0.1 per common past project. tional connections (e.g., affiliations). The more Figure 3 provides an abstract description of the common affiliations exist between two actors in process. The ComputeSocialRelevance procedure the network, the more relevant the connection crawls the social network of the actors/people in is, since the evidence is used as a normalization order to extract the social relevance factor r based factor. on the social connections. Then the PageRank Degrees of evidence support a notion of “ex- for each document is calculated taking for each ternal” evidence on the relation that completes edge as input the r factor of the social relevance the subjectively stated strength. The following between the authors. In case the documents are expression provides a formalization for this written by the same author, the social relevance evidence that is based both on the perceptions of is set to zero. “third parties” and/or affiliations. Of course the above model can be modified and extended with more parameters such as expres-

Pi e(( p 1 , p2 )) = max(Φ i∈U S ( p 1 , p2 ), P( p 1 , p2 )) sions of negative and positive evaluations in the relational ties (e.g., in the case of a citation or a review). However it provides an initial straight- asserting again that p ≠ p ≠ p and having i ∈ i 1 2 forward model for the integration of imprecise U. According to this expression, the strengths of expressions of social ties in the PageRank from a social tie in a group of users U are aggregated which further empirical analysis could be carried through simple fuzzy averaging (Φ) and the evi - out. Furthermore the procedure presented by this dence provided by third parties p ; for example, i model, unlike several information retrieval algo- work in common projects in which two persons rithms, considers a social relation as an imprecise P(p , p ) (p1; p2) collaborate P(p , p ) is obtained 1 2 1 2 assessment to which a quantification provided from FOAF declarations. by the actors in the form of Boolean values (e.g., Concretely if we follow the affiliation by knows or not) fails to capture other important the FOAF descriptor, for example foaf:Project, elements such as strength and consider it as an people coworking in a project (as declared by evaluation factor. ) are credited an amount of 1,

Figure 3. The steps of the PpR algorithm

122 Social Network Models for Enhancing Reference-Based Search Engine Rankings

Integrating Trust Networks and the reviewers of documents. The term ‘reviewer’ Document Reference Networks encompasses all persons having an opinion about the document, such as persons who read the An alternative approach for integrating different document or editors who accepted the document types of networks has been presented by Hess for publication. Both cases are addressed in the et al. (2006) and Stein and Hess (2006). They following separately, leading to two different propose to integrate information from a trust trust-enhanced visibility functions. network between persons and a document refer- ence network. A trust network has been chosen Case 1: Reviewers and Documents as a social network because trust relationships represent a strong basis for recommendations. In In this first type of a two-layer-architecture, a trust contrast to social networks based only on a ‘knows’ network between reviewers is connected with a or ‘coworkers’ relationship, trust relationships document reference network. In the document are more expressive, although more difficult to reference network, documents are linked via cita- obtain. Trust networks cannot automatically be tions or hyperlinks. In the second layer, reviewers extracted from data on the Web, but users must make statements about the degree of trust they indicate their trust relationships, as they are do- have in other reviewers in making ‘good’ reviews, ing it already in an increasing number of trust- that is, above all that they apply similar criteria based social networks. Examples for such trust to the evaluation as themselves. Depending on networks are the epinions Web of trust in which the trust metric that is used for inferring trust users rate other users with respect to the quality values between indirectly connected persons, of their product reviews, or social networking trust statements are in [0, 1] with 0 for no trust services such as and Linked-in, in which and 1 for full trust, or even [–1, 1] ranging from you can rate your friends with respect to their distrust 11 to full trust. Both layers are connected trustworthiness (among other criteria such as via reviews that is, reviewers express their ‘coolness’). In trust networks, trust values can opinions on documents. be inferred for indirectly connected persons by We have now two types of information, firstly using trust metrics such as those presented in the reviews of documents made by persons in Goldbeck, Parsia, and Hendler (2003) or Ziegler which the requesting user has a certain degree and Lausen (2004). This reflects the fact that we of trust that is, trust-weighted reviews— and trust in the real world the friends of our friends to secondly the visibility of the documents calcu- a certain degree. Trust is therefore not transitive lated on the basis of the document network. We in the strict mathematical sense, but decreases now integrate the trust-weighted reviews in the with each additional step in the trust chain. In visibility function and hence personalize it. This contrast to the measures presented in this chapter, new trust-review-weighted-visibility function (in most trust metrics calculate highly personalized the following called twr-visibility) has the property trust values for the users in the network: a trust that reviews influence the rank of a document value for a user is always computed from the to the degree of trust in the reviewer that is, a perspective of a specific user. The evaluations review by someone deemed as highly trustworthy of one and the same user can therefore greatly influences the rank to a considerable part, whereas vary between users depending on their personal reviews of less trustworthy persons have few trust relationships. influence. Having only untrustworthy reviewers, We can distinguish two roles that actors play in the user likely prefers a recommendation merely the trust networks: they can either be the authors or based on the visibility of the documents.

123 Social Network Models for Enhancing Reference-Based Search Engine Rankings

This architecture permits us to deal with two TWR vf pd tr r d 1 t r vf pd interpolation problems. Firstly, propagating trust d d values in the trust network makes it possible to The trust in the reviewers of a document consider reviews of users who are directly linked therefore determines to which degree the trust- via a trust statement as well as indirectly via some weighted reviews influence the twr-visibility. In “friends,” that is, via chains of trust. Secondly, the case that the trust in the reviewer is not absolute reviews influence not only the visibility of the (trust = 1.0), the twr-visibility of the documents reviewed papers but also indirectly the visibility citing this document determines its twr-visiblity. of other papers, namely of those papers cited by Reviews by trustworthy persons therefore influ - the reviewed documents. Figure 4 illustrates these ence indirectly the twr-visibility of not directly interpolations. Person 1 asks for a recommenda- evaluated documents. The vf TWR function permits tion about document 4 that has not directly been use of any visibility function such as the PageRank reviewed, but which visibility is influenced by formula for the propagation of the trust-enhanced the twr-visibility of document 2 that has been visibilities. The indirect connections in the trust reviewed by a person deemed as trustworthy by network are computed before calculating the twr- person 1. visibility. Any trust metric can be used for the The following function gives the twr-visibility propagation of the trust statements. The above for a document p in which r is the review of d d presented function therefore represents a general document p and t is the trust in the reviewer. d rd framework for integrating trust and document The trust in the reviewer is directly taken as the reference networks. A detailed description of trust in the review.

Figure 4. Interpolation in a document and reviewer network (From Hess et al., 2006)

Person 2 Person 1 0.95 0.1 1.0

Person 3

Read? Review: 0.0 Review: 0.95 Buy?

Document 2

Document 1 Document 4

Document 3

124 Social Network Models for Enhancing Reference-Based Search Engine Rankings

this approach and a simulation demonstrating 2. The trust values that are attributed to the the personalization provided by this function is references have to be transformed into edge in Hess et al. (2006). weights, so an edge between two documents

pa and pb has a weight wa→b, reflecting the Case 2: Authors & Documents relationship (trust) between the authors of the corresponding papers. This step is necessary In the second case, the trust network is built up be- as attributed trust values might be negative tween authors of documents. The trust statements due to distrust between the authors. In a refer to the trust in the quality and accurateness of visibility function however, we need edge the evaluated author’s documents. As in the case weights > 0. A mapping function defines of the reviewer network, the range of the trust how trust values are transferred into edge values depends on the trust metric chosen. Authors weights. Depending on the definition of the are connected via an ‘is-author’ relationship with mapping function, different trust semantics the documents they have written. Documents can can be realized. The mapping function can be written by several persons. ‘Is-author’-edges for instance be defined in a way that only are not weighted. The structure of the document such references are heavy weighted which network is identical as in the first case: references express high trust whereas another definition connect the documents. could give a high weight to distrust values in This two-layered architecture permits calcula- order to get an overview on papers which are tion of a trust-weighted visibility for each docu- not appreciated in a certain community. ment in the document reference network. The idea 3. The trust-enhanced visibility of the docu- is that the semantics of links will differ based on ments can now be calculated by a visibility whether there is a trust or a distrust relationship function that considers the edge weights. We between the citing and the cited author. This illustrate this general framework by using a reflects the fact that a link can be set in different concrete visibility function for calculating contexts. On the one hand, a link can affirm the the trust-weighted visibility, namely the authority, the importance, or the high quality of a PageRank. The original PageRank formula document, for example the citing author confirms is modified such that each page pi contributes the results of the original work by his or her own vis i not any longer with (with Ci being the experiments and therefore validates it. On the C i other, disagreement can be expressed in a link set of pages cited by pi ) to the visibility vis d ranging from different opinions to suspicions that of another page pd but distributes its visibility information is incorrect or even faked. according to the edge weight, so we get: Several steps are required to compute the trust-weighted visibility: w → α i d vis d = (1 – ) ∑ vis i ∈ ∑ wi→ j 1. Trust relationships between the authors of a piR d p ∈C citing and the cited paper are attributed to j k the reference between the documents in the document network. In the case of coauthor- This function can further be personalized so ship, more than one trust value is available. that it calculates individual recommendations for Assuming that the opinions of coauthors do each user. It can also be adapted such that docu- not vary extremely, the average of the trust ments are highly ranked which are controversially values is attributed to the reference. discussed. This approach again is a general frame-

125 Social Network Models for Enhancing Reference-Based Search Engine Rankings

work. Other visibility functions can be applied Therefore Wikipedia is an ideal case for social instead of the PageRank. information retrieval since its hyperlink context is attributing an ever-growing popularity with a A Social Information Retrieval Case set of disputes regarding the trustworthiness of Study on Wikipedia its content and the expertise of the contributors in most of the articles. Having presented metrics of social and hyper- Furthermore since it facilitates a large amount textual evaluation, we choose as a case study to of social connections over a common affiliation, model the authoritativeness of a system rich with we consider it an interesting example to discuss social interactions such as Wikipedia. Wikipedia authoritativeness over evolving documents. is based on wiki software (Leuf & Cunningham, In our case we consider the following social 2001) and is considered one of the most successful interactions: collaborative editing projects on the Web since it currently contains over one million articles 13 • W hen a contributor edits content that has and has an extensive community of contributors been submitted by someone else, then it contributing content and improving the quality establishes a tie with him or her. This is of the articles. depicted by an acceptance factor which rep- Traditional encyclopedias such us Britannica resents the percentage of the content of the are often characterized with a high level of cred - previous contributor that is visible after. ibility by domain experts, taking into account the • E very contributor that has a single or more background process that has resulted (domain contribution to the article establishes a rela- authorities contribute to the final outcome). tional tie with the other content contributors On the other hand, since it uses the WikiWiki of the article. Evidence of participation in system, Wikipedia allows the editing and creation common projects strengthens this tie. of encyclopedic articles by anyone who wishes to contribute. Its primary target is to provide free As can be seen in Figure 5, we define two different editing access and gather knowledge representing networks the articles network and the contributors the consensus of the term presented and thus not network from where we can extract both the social to evaluate the contributing authorities. and the hypertextual context for our case study. However, as the content increases along with Before continuing to the development of met- the contributing sources (see Figure 1.2), a critical rics, we need to further define the information sets issue arises regarding the credibility of Wikipedia from which we will extract the networks that will as an authoritative reference source (Lih, 2004; be provided as input to our metrics. Lipczynska 2005). The question is extended not In the context of an encyclopedia, we define only to the outcome (article) but also to the process the following information sets: of shaping the article, in which a contributor would allow another authority to submit, change, or delete a • Domain: A collection of articles which contribution accepted or not accepted by him or her. tackle a common subject (e.g., philoso- Wikipedia has internal mechanisms of managing phy). those cases such as a permission ranking system, • Category: A collection of domains which where a contributor is accredited by the level of have a common categorical and etymological participation in the shaping of the article, as well as root. For example, the domains philosophy a discussion tab on most of the articles or notifica- and economics have a kind of connection tions and warnings regarding the content. in the category of social sciences.

126 Social Network Models for Enhancing Reference-Based Search Engine Rankings

Within the above sets, the following network same article). This makes the importance layers are defined: of the contributors network higher since the content on the articles network relies • The articles network: Every article in much on the interaction by the member of Wikipedia contains references to other ar- the contributors network. ticles as well as external references. A set of links used for classification purposes is In this case study our focus is to examine a set also available in most of the active articles of of metrics for both social- and document-based the encyclopedia. Every article represents a relevance. However, both metrics must be able vertex in the article network and the internal to be combined in order to extract an overall connections between the articles the edges indicator of authoritativeness/trustworthiness. of the network. The development of those metrics is based on the • The contributors network: Wikipedia is following assumptions: a collaborative writing effort, which means that an article has multiple contributors. • T he more decentralized the editing of an We assume that a contributor establishes article, then the better this article represents a relationship with another contributor if a consensus about it. they work on the same article. In the re- • The contributors whose content has been sulted weighted network, each contributor most accepted (seen from the result of the is represented by a vertex and their social diff operation in the wiki) are attributed a ties (positive or negative) are represented level of authority regarding the article. by an edge, denoting the sequence of their • This level of authority remains only in the social interaction. The contributors network domain of the article. However, domains relies on a set of directed and inferred re- which belong in the same topic can retain lationships. In particular there is the direct the level of authority for a contributor. tie (contributor X edits the content of the contributor Y) but also an inferred tie (con- The graph that we model is a signed directed tributor X and contributor Y work on the network, with arcs signed as a factor depicting

Figure 5. Network layers in the Wiki publication model. Contributors are linked together by working on common projects (articles) in the same topic.

127 Social Network Models for Enhancing Reference-Based Search Engine Rankings

the level of acceptance of the content submitted C D ni )( = nd i )( = ∑ xij by contributor A and accepted by contributor j B. In order to model the authoritativeness of The adjacent xij represents the relational tie contributors, we selected the degree centrality. between the contributors and their contribution In our case study, we use the degree centrality over the domain of the article. This also is charac- index, which is the simplest definition of centrality terized by the visibility of the contribution in the and as mentioned is based on the incoming and final article and can be either 1 or 0. To provide the outgoing adjacent connections to other contribu- centrality, we divide the degree with the highest tors in an article graph. To measure the centrality obtained degree from the graph, which in graph at an individual level, we define the contributor theory is proved to be the number of remaining degree centrality; and to an article level, the ar- vertices (g) minus the ego ( g – 1). Therefore the ticle degree centralization which represents the contributor degree centrality can be calculated as variability of contributor degree centrality over an instance of the actor centrality index: the specific article. ( nd ) ′ i Contributor Degree Centrality C D ( ni ) = g −1 As already discussed, the inner degree (in a graph theoretic interpretation, the amount of edges Article Degree Centralization coming into a node) represents the choices the We define an article’s degree centralization C actor has over a set of other actors. However, in DM our wiki network model, the amount of incoming as the variability of the individual contributor centrality indices. The C (n*) represents the largest edges represents edits to the text; therefore the D metric of inner degree is the opposite, meaning observed contributor degree centrality: that the person with the higher inner degree has g * − the largest amount of objection/rejection in the ∑ [ C D n )( C (niD )] C = i contributor community and thus receives a kind DM ( g − )(1 g − )2 of negative evaluation from his or her fellow contributors. On the other hand, the outer degree Again we divide the variability with the high- of the vertex represents edits/participation in est variability observed in the graph in order to several parts of the article and thus gives a clue to get a normalized calculation. the activity of the person in relation to the article and the domain. Mathematically we can represent Interpretation of the Results such formalism as follows. Considering a graph representing the network of contributors for an Having defined the metrics, we provide some article contributed in the wiki, the contributor indicative measurements for a set of articles from degree centrality a contextualized expression of the category philosophy. actor degree centrality is a degree index of the As can be observed from Table 5, the article adjacent connections between the contributor and degree centralization is relatively low because of others who edit the article. From graph theory, the the small collections of articles used in the case outer degree of a vertex is the cumulative value study and the inter-connections of the actors in of its adjacent connections: the domain. However, it is enough to let us discuss some qualitative interpretations such as:

128 Social Network Models for Enhancing Reference-Based Search Engine Rankings

Table 4. Contributor degree centrality for the Wikipedia article “Immanuel Kant”

Cluster Freq Freq% CumFreq CumFreq% Representative (Outerdegree) 1 1 0.4329 1 0.4329 65.6.92.153 2 199 86.1472 200 86.5801 82.3.32.71 4 14 6.0606 214 92.6407 80.202.248.28 6 6 2.5974 220 95.2381 Snowspinner 8 3 1.2987 223 96.5368 Tim Ivorson 10 1 0.4329 224 96.9697 StirlingNewberry 12 2 0.8658 226 97.8355 24.162.198.123 16 2 0.8658 228 98.7013 JimWae 18 1 0.4329 229 99.1342 Jjshapiro 20 1 0.4329 230 99.5671 SlimVirgin 31 1 0.4329 s 231 100 Amerindianart

Table 5. Articles used in the case study along with the number of contributors and their degree centralization

Number of Article Degree Article Name Contributors Centralization (max 1) Adam Smith 276 0.039114 Aristotle 274 0.0232 Immanuel Kant 231 0.20484 Johann Wolfgang von Goethe 242 0.016682 John Locke 292 0.008581 Karl Marx 232 0.006601 Ludwig Wittgenstein 220 0.006328 Philosophy 280 0.00254 Plato 284 0.001207 Socrates 289 0.000405

means that contributions have been done by • The dispersion of the actor indices denotes individuals with interests in other domains how dependent this article is on individual as well. contributors. For instance, if an article has • T he range of the group degree centralization a very low degree of centralization, then reflects the heterogeneity of the authoring it means that the social process to shape it sources of the article. In our case, the article was highly distributed, thus resulting in an “Immanuel Kant” has a significantly higher article which has been submitted by multiple degree of centralization, which means that authorities. In our case, the articles repre- it has been contributed by authorities most sent a low degree of centralization, which concentrated in the domain of the article who thus have contributed to other articles.

129 Social Network Models for Enhancing Reference-Based Search Engine Rankings

Contributors with higher inter-relation over the 1. Define the compositional and structural same domain represent higher authorities based variables on both levels (actors, documents, on the assumptions that their interest spans the and their connections), domain to which the article belongs, and there- 2. Compute measures on the nodes (actors) of fore they have conducted background research the social network and propagate them down regarding the material they have contributed. On to the nodes (documents) in the document the other hand contributors with lesser authority network, tend to have their content erased/objected by 3. Propagate edge weights from the social contributors with higher authority. network down to the corresponding edges It should be noted that the measures developed of the document network to get weighted and presented in this case study do not actually edges on the document network, or measure the “subjective” quality of an article since 4. Apply some ranking algorithm on the docu- such a task is a cognitive process characterized ment network modified by 2 or 3 to compute by a high level of complexity. social network enhanced rankings. Those measures can contribute in the direction of providing an indicator of “consensus” related to Current information retrieval models that use an article and thus assert that it does not provide links to compute rankings as measures of the popu- controversial views or expressions of a small group larity of the Web pages fail to consider the social of persons (especially in articles with political context that is tacit in the authorship of the pages content). Thus a level of neutrality expressed in and their links. Social network analysis provides the writing of this article is asserted. sound models for dealing with these kind of mod- In the context of social-based information els and can be combined with existing backlink retrieval, those metrics can correlate with a de- models to come up with richer models. gree of acceptance of the Wikipedia article by Applying these steps gives improved ranking the domain experts and therefore provide a more algorithms to build better search engines, which valuable trustworthiness factor than traditional inherently integrate information about authors or metrics that can measure only popularity and not reviewers of documents to provide better results. trustworthiness. The algorithms described in this section therefore allow use of information about the authors of Web pages or about reviews from other users to build CONCLUSION AND OUTLOOK better information retrieval mechanisms. This also allows the identification of untrustworthy In this chapter we have provided background on information, as information from unreliable backlink models, social network concepts, and sources and link spam. their potential application to a combined model However, research issues are open to the direc- of Web retrieval that considers the links but also tion of applying such models in real context since the relations between the creators of the docu- information about the social context is something ments and the links. that is very difficult to extract and model. The However, to incorporate the metrics discussed Semantic Web can contribute to this direction by in this chapter, there must be a process that can advancing the development and use of vocabular- provide a roadmap to that direction. The follow- ies that can express social relations in various ways ing steps highlight an abstract procedure on how and associate them with content. Nevertheless, to integrate information from a social network in privacy issues should be also considered since the the computation of document rankings:

130 Social Network Models for Enhancing Reference-Based Search Engine Rankings

expression of social relations in a publicly acces- Faloutsos, C. (1985). Access methods for text. sible information space is something that exhibits ACM Computing Surveys, 17 (1), 49-74. vulnerabilities in cases such as “phishing” attacks Freeman, L.C. (1979). Centrality in social net- and social engineering (Levy, 2004). works: Conceptual clarification. Social Networks, 1(3), 215-239. REFERENCES Friedkin, N.E. (1991). Theoretical foundations for centrality measures. American Journal of Avesani, P., Massa, P., & Tiella, R. (2005). A Sociology, 96 (6), 1478-1504. trust-enhanced recommender system application: Friedkin, N.E. (1998). A structural theory of Moleskiing. Proceedings of the 20th ACM Sym - social influence (structural analysis in the social posium on Applied Computing, Santa Fe, NM. sciences). Cambridge: Cambridge University Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Press. Modern information retrieval. Harlow, UK: Ad- Garfield, E. (1972). Citation analysis as a tool in dison-Wesley. journal evaluation. Science, 178 (4060), 471-479. Berners-Lee, T., & Fischetti, M. (1999). Weav- Golbeck, J., Parsia, B., & Hendler, J. (2003). ing the Web: The original design and ultimate Trust networks on the semantic Web. Proceed- destiny of the World Wide Web by its inventor. ings of Cooperative Intelligent Agents, Helsinki, San Francisco: Harper. Finland. Borner, K., Maru, J.T., & Goldstone, R.L. (2004). Gyongyi, Z., Garcia-Molina, H., & Pedersen, J. The simultaneous evolution of author and paper (2004). Combating Web spam with TrustRank. networks. Proceedings of the National Academy Proceedings of VLDB (vol. 4). of Sciences (vol. 101, pp. 5266-5273). Hess, C., Stein, K., & Schlieder, C. (2006). Brickley, D., & Miller, L. (2005). FOAF vocabu- Trust-enhanced visibility for personalized docu - lary specification. ment recommendations. Proceedings of the 21st Brin, S., & Page, L. (1998). The anatomy of a ACM Symposium on Applied Computing, Dijon, large-scale hypertextual Web search engine. France. Proceedings of the 7th International Conference Hubbell, C.H. (1965). An input-output approach on the World Wide Web, Brisbane, Australia. to clique identification. Sociometry, 28 (4), 377- Brown, J.S., & Duguid, P. (2002). The social 399. life of information. Boston: Harvard Business Katz, L. (1953). A new status index derived from School Press. sociometric analysis. Psychometrika, 18, 39-43. Burt, R.S. (1980). Models of network structure. Kleinberg, J.M. (1999). Authoritative sources in Annual Review of Sociology, 6, 79-141. a hyperlinked environment. Journal of the ACM, Conklin, J. (1987). Hypertext: An introduction 46 (5), 604-632. and survey. IEEE Computer, 20 (9), 17-41. Klyne, G., & Carroll, J.J. (2004). Resource Dhyani, D., Keong, N.W., & Bhowmick, S.S. description framework (RDF): Concepts and (2002). A survey of Web metrics. ACM Comput - abstract syntax. W3C recommendation, World ing Surveys, 34 (4), 469-503. Wide Web Consortium.

131 Social Network Models for Enhancing Reference-Based Search Engine Rankings

Korfiatis, N., & Naeve, A. (2005). Evaluating Sabidussi, G. (1966). The centrality index of a wiki contributions using social networks: A case graph. Psychometrika, 31 , 581-603. study on Wikipedia. Proceedingsofthe 1st Online Scott, J. (2000). Social network analysis: A Conference on Metadata and Semantics Research handbook (2nd Ed.). London/Thousands Oaks, (MTSR’05). Rinton Press. CA: Sage. Leuf, B., & Cunningham, W. (2001). The wiki Sicilia, M.A., & Garcia, E. (2005). Filtering way: Collaboration and sharing on the Internet. information with imprecise social criteria: A San Francisco: Addison-Wesley Professional. FOAF-based backlink model. Proceedings of Levy, E. (2004). Criminals become tech savvy. the 4th Conference of the European Society for Security & Privacy Magazine, 2 (2), 65-68. Fuzzy Logic and Technology (EUSLAT), Bar- celona, Spain. Leydesdorff, L. (2001). The challenge of scien- tometrics: The development, measurement, and Stein, K., & Hess, C. (2006). Information retrieval self-organization of scientific communications. in trust-enhanced document networks. In Beh- Universal. rendt et al. (Eds.), Semantics, Web and mining. Berlin: Springer-Verlag. Lih, A. (2004). Wikipedia as participatory jour- nalism: Reliable sources? Metrics for evaluating Wolf, J.L., Squillante, M.S., Yu, P.S., Sethuraman, collaborative media as a news resource. Proceed- J., & Ozsen, L. (2002). Optimal crawling strate - ings of the International Symposium on Online gies for Web search engines. Proceedings of the Journalism. 11th International Conference on the World Wide Web (pp. 136-147), Honolulu, HI. Lipczynska S. (2005). Power to the people: The case for Wikipedia. Reference Reviews, 19 (2). Ziegler, C.N., & Lausen, G. (2004). Spreading activation models for trust propagation. Pro- Malsch, T., Schlieder, C., Kiefer, P., Lubcke, M., ceedings of the IEEE International Conference Perschke, R., Schmitt, M., & Stein, K. (2006). on E-Technology, E-Commerce and E-Service Communication between process and structure: (EEE04), Taipei, Taiwan. Modeling and simulating message-reference- networks with COM/TE. Journal of Artificial Societies and Social Simulation. ENDNOTES Mathes, A. (2005). Filler Friday: Google bomb - ing. Retrieved May 27, 2006, from http://uber. 1 With the term backlink, we refer to the nu/2001/04/06/ backwards reference given to an object by Page, L., Brin, S., Motwani, R., & Winograd, T. another referring object. The most well- (1998). The PageRank citation ranking: Bringing known type of a backlink is the citation order to the Web. Technical Report, Stanford provided to scientific articles and scholarly Digital Library Technologies Project, USA. work. 2 Technically a document reference network Pinski, G., & Narin, F. (1976). Citation influence is a graph with documents as nodes and for journal aggregates of scientific publications: references (links) as directed edges. The vis- Theory, with application to the literature of phys- ibility of a node is determined by analyzing ics. Information Processing and Management, the link structure. 12 (5), 297-312.

132 Social Network Models for Enhancing Reference-Based Search Engine Rankings

3 The basic idea of this algorithm goes 8 Unaffiliated means that they do not belong back to Seelay (1949), who used it in so- to the same organization. This is determined ciometrics (first published in Katz, 1953), by comparing URLs, IP addresses, and but without normalization by the number network topology. of outgoing edges and without damping 9 See the Proceedings of the 1st FOAF Work - term (which was introduced by Hubbell, shop, Galway. 1965). 10 The trust ontology is available at http://trust. 4 We do not really know how Google does mindswap.org/ont/trust.owl (last access its ranking. They claim that their ranking date: May 30). algorithm is based on PageRank with modi- 11 Although distrust statements are difficult fications. to handle in propagation: Should I trust 5 For the technical details, see Brin and Page someone who is distrusted by someone I (1998). distrust? Or should I distrust this person 6 There is no obvious reason not to use the even more? hub and authority approach for off-line 12 Statistics and data for the English Language calculation on the whole Net. There are Wikipedia. For further information about search engines claiming to use a HITS- the current size of Wikipedia, the reader based algorithm, but without giving detailed can visit: http://en.wikipedia.org/wiki/Wiki - information. pedia:Statistics (last access date: May 30, 7 Also patented by Google. 2006).

133