ARISTOTLE UNIVERSITY OF THESSALONIKI JOINT POSTGRADUATE COURSE ON «INFORMATICS AND MANAGEMENT» DEPARTMENTS OF INFORMATICS AND ECONOMIC SCIENCES

Link prediction based on Multi-modal Social Networks

Master’s Thesis of Christos Perentis (297)

Examination Committee
Supervisor: Panagiotis Symeonidis, PhD
Members: Athena Vakali, Associate Professor; Christina Boutsouki, Assistant Professor

THESSALONIKI JUNE 2011


To my parents Fryni and George

Acknowledgements

This thesis would not have been possible without the support of Dr. Panagiotis Symeonidis, whose guidance was crucial to its formation and completion. I am utterly convinced that Dr. Symeonidis will continue to conduct primary research on the broader and emerging research area of Data Mining with the same dedication. I hope that we will cooperate again in the future, addressing further challenging issues.

Last but not least, I would like to thank my parents, my family and my friends for their selfless support, not only during this thesis but throughout my studies and my life in general.

Abstract

Online social networks (OSNs) such as Facebook and MySpace enjoy wide acceptance, since they enable users to share digital content (links, pictures, videos), express or share opinions, and expand their social circle by making new friends. All these kinds of interactions in which users participate lead to the evolution and expansion of the network over time. OSNs support users by providing them with friend recommendations based on the explicit friendship network they gradually build. This task corresponds to the Link Prediction problem: given a snapshot of a social network, we try to infer which new interactions among its members are likely to occur in the near future. Most of the related work focuses on the structural properties of a single type of network to provide user recommendations. However, users form several implicit networks through their interactions with items, such as sharing a group, co-commenting on posts or co-tagging photos. In addition, the majority of earlier work uses this kind of auxiliary (user-item) source only to recommend items to users.

The main aim of this thesis is to exploit such implicit interactions relating users with items, which are formed simultaneously within social networks, in order to provide enhanced user recommendations, or to be used on their own in the absence of information from the explicit friendship network. The Katz status index, a global approach, is adopted for the computation of user "proximity" in a network. Several extensions of the Katz measure are taken into account, including the case of using only a single source as well as combined cases that use auxiliary user-item relationships. The premise that also considering auxiliary sources yields more accurate user recommendations is experimentally verified.

Keywords: Link prediction, Data mining, Online Social Networks, Recommender Systems, Multi-Modal Networks, Bipartite Networks, Experimentation, Web 2.0, Katz status index.


Contents

1 Introduction
  1.1 Motivation
  1.2 Contribution
  1.3 Outline

2 Social Networking in the WWW
  2.1 Web 2.0 Technologies
    2.1.1 Blogs
    2.1.2 Wikis
    2.1.3 Mash-ups
    2.1.4 Social Tagging Systems
  2.2 Social Networking and Social Networks
  2.3 Online Social Networks Sites
    2.3.1 Facebook
    2.3.2 Myspace
    2.3.3 Twitter
  2.4 Social Rating Networks
    2.4.1 Epinions
    2.4.2 Flixster
    2.4.3 Digg

3 Data mining in OSNs
  3.1 Social Network Analysis (SNA)
  3.2 Basic Graph Theory
    3.2.1 Unipartite Graph
    3.2.2 Bipartite Graph
    3.2.3 Multi-Modal Graph
    3.2.4 Measures and Metrics
  3.3 Community Detection
  3.4 Topological Properties
    3.4.1 Small World Networks
    3.4.2 Scale free Networks
  3.5 Visualization and Tools
  3.6 Personalization and Recommender Systems
    3.6.1 Content-Based
    3.6.2 Collaborative Filtering
    3.6.3 Hybrid
  3.7 OSNs and Recommender Systems

4 Link Prediction
  4.1 Mathematical Problem Formulation
  4.2 Related Work
    4.2.1 Different Link Prediction Tasks
      4.2.1.1 Link Prediction
      4.2.1.2 Rating Prediction
  4.3 Classification of Topological Measures for Link Prediction
    4.3.1 Node Based Methods
    4.3.2 Path Based Methods
    4.3.3 Higher Level Approaches
  4.4 Experimental Configuration

5 Methodology
  5.1 Proposed Approach
  5.2 Link Prediction based on User-User Adjacency Matrix
  5.3 Link Prediction based on User-Item Bi-Adjacency Matrix
  5.4 Link Prediction based on Multi-modal Graph

6 Experimental Evaluation and Results
  6.1 xSocial Synthetic Data Set
  6.2 Experimental Setup
    6.2.1 Evaluation Method
    6.2.2 Performance Measures
  6.3 Results and Discussion

7 Conclusion and Future Work

A Appendix
  A.1 xSocial Generator
  A.2 Matlab Code
  A.3 Visual Basic Code
  A.4 Topological Properties Computation in C++

List of Figures

2.1 Overview of the Facebook Website
2.2 Overview of the Myspace Website
2.3 Overview of the Twitter Website
2.4 Overview of the Epinions Website
2.5 Overview of the Flixster Website
2.6 Overview of the Digg Website

3.1 Example of an undirected graph and its corresponding adjacency matrix.
3.2 Example of the complete K4 graph.
3.3 Example of an undirected graph and its corresponding adjacency matrix in the power of 2.
3.4 Example of a directed graph and its corresponding adjacency matrix.
3.5 Example of an undirected weighted graph and its corresponding adjacency matrix.
3.6 Example of a bipartite graph and its corresponding adjacency matrix.
3.7 Example of a multimodal graph and its corresponding adjacency matrices.
3.8 Example of metrics in a social network.
3.9 Example of a graph with three communities in a social network.
3.10 TouchGraph screenshot visualizing friendships within a Social Network.
3.11 Example of Netdraw visualizing a social network.

4.1 Visualization of a social network evolution and the link prediction task.

5.1 Example of a unipartite friendship Social Network.
5.2 Example of a bipartite Social Network.
5.3 Example of a Multi-modal Social Network.
5.4 Illustration of all the possible combined paths.

6.1 Typical Precision-Recall curve
6.2 Comparison of tr_cKatz3, tr_cKatz2, cKatz, sKatz in terms of Precision/Recall(%).
6.3 Precision(%) diagram compared to the # of Recommended Friends.
6.4 Recall(%) diagram compared to the # of Recommended Friends.
6.5 AUC(%) statistic compared to the # of Recommended Friends.
6.6 AUC statistic compared to the Fraction of Edges observed.

List of Tables

5.1 Running example: User-User Adjacency matrix A.
5.2 User-User similarity matrix based on adjacency matrix A.
5.3 Running example: User-Item matrix R.
5.4 Running example: User-User Adjacency B = R × R^T.
5.5 User-User similarity matrix based on adjacency matrix B.
5.6 User-User similarity matrix based on Multi-modal graph.

6.1 Topological properties of xSocial 1K friendship and co-comment networks.
6.2 Precision/Recall(%) curve comparing tr_cKatz3, tr_cKatz2, cKatz, sKatz.
6.3 Comparison of tr_cKatz3, tr_cKatz2, cKatz, sKatz in terms of AUC(%) statistic.
6.4 Comparison of tr_cKatz3, tr_cKatz2, cKatz, sKatz AUC statistic vs. Fraction of Observed Edges.

1 Introduction

1.1 Motivation

The rapidly expanding use of information and communication technologies (ICT), combined with the increasing access of individuals to the World Wide Web (WWW), has made the creation and storage of data available to billions of users through a variety of devices. Users have never before been exposed to such large volumes of data, mainly because data production has become cheaper, faster and easier than ever. This exposure requires users to manage a vast and heterogeneous amount of data, as well as to filter important information on a daily basis. This phenomenon soon generated new and crucial needs for users, as well as fundamental challenges for Information Management. Specifically, the emerging research area of Data Mining has been challenged by applications such as efficient web search and the extraction of meaningful information from large volumes of heterogeneous data.

The introduction of Web 2.0 intensified the information overload issue by assigning a major role to the ordinary user, who is now not only allowed to publish and edit, but also encouraged to share digital resources within online web communities that form networks. Web 2.0 technologies, and especially online social network (OSN) sites, have dynamic characteristics that support interaction among users and items at several levels. As far as relationships among users are concerned, these systems allow users to transfer their offline friendships online or to make new online friendships. Regarding the relationships between users and items, users are able to join groups of common interest, share content (text, pictures, sound, video, etc.), declare tastes or habits, post or comment on opinions, and label items with tags. All this information is stored in an online structure that composes a user profile, accompanied by data describing all kinds of interactions among people and items.

Large volumes of data produced by users and their online interactions with items can be exploited by the research area of data mining on the World Wide Web, called Web Mining. One basic web mining technique is web structure mining, the process of using graph theory to analyze the node and connection structure of a web site. OSNs and web sites follow similar graph notations in their structure. OSNs use graphs in which different kinds of vertices (nodes) depict users or other items, while different types of edges connecting the vertices represent relationships. Following the same notation, web sites are depicted with pages as vertices and hyperlinks as edges. This correspondence paved the way for researchers' intensive study of social networks and the relationships that develop within them, under a unified technological and rigorous mathematical research framework.

So far, it has been identified that users of Web 2.0 applications suffer from an overload of the information they share, and that web mining tools have the potential to provide solutions that filter the information users handle, since both can be studied in a unified framework. A lot of academic research and many commercial applications have been oriented towards personalizing the information to which users are exposed, in order to decrease the amount of information users handle and to provide a more precise web experience. Since information about users and their interactions with other users or items is available online, whether the items are groups on an OSN or candidate products to buy, there is a clear indication that systems should recommend users or items to users. Recommender systems appear in Web 2.0 applications and sites, including product sites that support informed purchases (e.g. Amazon.com) as well as OSN sites (e.g. Facebook.com), where users or items are recommended to users based on a common relationship. It should be noted that, for recommender systems to be accepted, it is crucial that they provide the user with an understandable and transparent explanation of the reason behind each recommendation.

1.2 Contribution

Online social networks such as Facebook and Myspace recommend new friends to registered users based, in particular, on the explicit social network that users build by adding each other as friends. The recommender systems of online social network sites rely on algorithms that measure the likelihood that two users are social friends without having yet declared it. These similarity measures are based on features of the social graph, focusing particularly on the computed "proximity" of two nodes. Essentially, the main task in providing recommendations is the prediction of new relationships of the desired type that are about to be established in the near future, taking into account the existing relationships. This task, known as the Link Prediction problem, is described by David Liben-Nowell and Jon Kleinberg [55] as follows: given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future? There are two main approaches to the Link Prediction problem [55], both of which explore features of the social graph. The first is node-dependent and focuses mainly on the node structure and local features of the network. The second is path-dependent and relies on global features, examining the overall path structure of the network.

This thesis is motivated by the fact that the majority of earlier work in Link Prediction infers new interactions between users by focusing mainly on the structural properties of a single type of network. However, users can also form several implicit social networks through their daily interactions, such as commenting on people's posts, rating the same products similarly, or tagging people's photos. In order to provide more accurate recommendations, it is crucial to explore an extended social graph in which more than one type of relationship exists among users. Users may interact and be connected through implicit relationships that are sometimes more informative than the explicit ones. The basic goal is to explore to what degree the combination of the explicit network with these implicit networks enhances user recommendation in OSNs.

The proposed methodology computes the Katz status index similarity measure, which belongs to the path-dependent approaches, and extends it to auxiliary sources in order to enhance the accuracy of user recommendation. To achieve that, it is necessary to fully analyze and understand the theoretical background of node "proximity" measures, as well as to provide a toy example that forms the multi-modal network, consisting of the explicit and the implicit networks simultaneously. An experimental process is then set up in order to evaluate the proposed methodology in terms of prediction accuracy.

For the evaluation of the proposed method, a synthetic data set is used from recent research by Faloutsos et al. [18]. They designed xSocial, the first multi-modal graph generator, which is capable of producing multiple weighted time-evolving networks, using simulation techniques to match the patterns observed in OSNs. It provides an explicit friendship network as well as an implicit one, in which users are related through co-commenting on users' posts. Both networks are tested in single and combined approaches to examine their contribution within a unified user recommendation framework.
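To make the path-dependent idea concrete, the following minimal Python sketch computes a truncated Katz score, that is, the sum over path lengths l = 1..max_len of beta^l times the number of paths of length l between two users. It is an illustration under stated assumptions and not the implementation used in this thesis: the damping factor `beta`, the cut-off `max_len`, the zeroed diagonal of B, the rule of simply adding the friendship matrix A to the co-interaction matrix B = R × R^T, and the toy data are all hypothetical choices.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_katz(A, beta, max_len):
    """Return sum_{l=1..max_len} beta**l * A**l (truncated Katz scores)."""
    n = len(A)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in A]  # holds A**l, starting with l = 1
    for l in range(1, max_len + 1):
        for i in range(n):
            for j in range(n):
                scores[i][j] += (beta ** l) * power[i][j]
        power = mat_mul(power, A)
    return scores


def co_interaction(R):
    """B = R x R^T: number of items two users share (diagonal set to 0,
    an assumption made here to avoid self-loops)."""
    n = len(R)
    return [[0 if i == j else sum(a * b for a, b in zip(R[i], R[j]))
             for j in range(n)] for i in range(n)]


def recommend(scores, A, user, k):
    """Top-k users who are not yet friends of `user`, ranked by score."""
    candidates = [v for v in range(len(A)) if v != user and A[user][v] == 0]
    return sorted(candidates, key=lambda v: scores[user][v], reverse=True)[:k]


# Hypothetical toy data: friendships form the chain 0-1-2-3.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
# Hypothetical user-item matrix: users 0 and 3 commented on the same post.
R = [[1], [0], [0], [1]]

B = co_interaction(R)
C = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]  # assumed combination

friend_scores = truncated_katz(A, beta=0.1, max_len=3)
combined_scores = truncated_katz(C, beta=0.1, max_len=3)

print(recommend(friend_scores, A, 0, 2))    # → [2, 3]
print(recommend(combined_scores, A, 0, 2))  # → [3, 2]
```

In this toy case the friendship network alone ranks user 2 above user 3 for user 0, while adding the co-comment relation moves user 3 to the top, which is the kind of effect the combined approach is meant to capture.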

1.3 Outline

This postgraduate thesis is structured in 7 chapters. Chapter 1 gives an introductory summary of the subject of the thesis, its motivation and its contribution, together with an outline of the text structure. The rest of the thesis is organized as follows. Chapter 2 introduces the Web 2.0 framework, including online social and rating networks as well as the nature of relationships and interactions among users. Moreover, popular online social and rating network web sites are presented. Chapter 3 presents the field of Social Network Analysis, followed by basic Graph Theory notation and available visualization tools. In addition, personalization of information is discussed in the WWW context, along with various types of recommender systems.

Chapter 4 gives an overview of the research area of Link Prediction and the related work, and presents several methods and approaches that address the Link Prediction problem. Chapter 5 presents our methodology using a motivating example and proposes our approach to the Link Prediction problem. Specifically, we focus on the link prediction task using multiple explicit and implicit information networks, in order to provide accurate recommendations to users of OSNs. The experimental evaluation of our approach is presented in Chapter 6, using several performance measures; we then visualize the results and discuss them. Finally, Chapter 7 concludes the thesis and discusses future work. A Bibliography and an Appendix containing the source code of the implementation are also included.

2 Social Networking in the WWW

This chapter analyzes the general context of Web 2.0 technologies, including blogs, wikis and mash-ups. In addition, we focus on social tagging systems, which support several personalization services for users. Online social networking services, which enable users to share digital content and establish several types of relationships, are then introduced. Moreover, social rating networks (SRNs) are presented, which allow users to collaboratively express their preferences on items. Finally, the most popular OSN and SRN web sites are examined.

2.1 Web 2.0 Technologies

Over the past years, there have been significant changes that have permanently affected the way we use the World Wide Web. New web-based applications and services have gradually been introduced to ordinary users, allowing them to share information interactively and collaborate online through a variety of devices. These actions take place on a dynamic web platform on which users interact in an efficient and explicit way, without having to be experts in information and communication technologies (ICTs). Moreover, Web 2.0 applications are characterized by participatory information sharing, interoperability of data and applications, and user-centered design.

This new generation of the World Wide Web became popular under the term Web 2.0, which was first used by Tim O'Reilly and Dale Dougherty [73] and came from the business world. O'Reilly suggested a new version of the World Wide Web, referring not to any technical issues or updates, but to changes in the ways developers and end-users use the Web. He stated that a basic rule for success on this new platform is to build applications that take advantage of network effects and improve the more people use them.

Browsers are an integral part of exploring information and enable users to interact through applications of Web 2.0 or earlier. In previous years users had a passive relationship with the websites they visited, as they could only read and save content from them. In addition, creating personal web pages used to require specific programming knowledge. Nowadays, browsers are compatible with new technologies in order to support the central concept of Web 2.0, namely collaborative content sharing. Specifically, web browsers use software extensions and plug-ins to manage users' content and interactions. Technologies such as HTML, Ajax and Flash have enhanced the graphical and usability experience of users.

In addition, Web 2.0 applications turned to service-oriented architecture and the exploitation of the web-client concept, becoming accessible from various devices. In this way, web services for users can be improved by combining functionality from different applications, resources and content formats (XML) into new services that focus on the end-user. Building applications has gradually become a more open and collaborative process, since application programming interfaces (APIs) were opened to programmers, who can exploit and combine them in different applications. Examples are news feeds, RSS, mash-ups and Web Services. Applications and websites in the Web 2.0 context include several activities, such as searching and authoring. Moreover, links, tags, extensions (Flash, Java, ActiveX) and signals (RSS) [64] are useful tools that are integral to the Web 2.0 environment. Searching and links are well-known features that have been in use since Web 1.0, enabling users to search for content and share it, respectively. Authoring is a new ability given to users, through which they create and update content collaboratively. Tags are labels added by users to help categorize content. Tag collections in a social tagging system are called "folksonomies", meaning a type of directory created by many users.

All these tools and technologies supporting Web 2.0 contribute to a different way in which people communicate. Ordinary users share opinions, experiences and information, and are now not only the recipients of applications and web services, but also active participants. People express themselves through blogging, podcasting, contributing to RSS and tagging items. In addition, people develop new types of relationships through social bookmarking and networking. The term and the concept of Web 2.0 have gradually been introduced into many different activities of users, organizations and even governments. In the field of education, collaborative e-learning and classroom systems were based on Web 2.0, and libraries adopted active user participation. For marketing, Web 2.0 tools offer an opportunity to create an interactive relationship with consumers. Governments realized that citizens' feedback is valuable for the policy-making process and introduced e-deliberation, e-consultation and e-voting systems based on Web 2.0 features [59]. Moreover, governments exploited Web 2.0 to offer services concerning everyday transactions between them and citizens or businesses (G2C, G2B). Companies and organizations use these tools to enhance internal communication and the flow of information, as well as to improve their operations.

2.1.1 Blogs

A significant part of Web 2.0 technologies has been the introduction of blogs, which allow typical users of the World Wide Web to express their personal opinions and concerns in an explicit way. Blogs are standalone web sites, or parts of an existing web site, where opinions, information, posts, links to other websites, photos, videos and other content are published by individuals. Published content is sorted in chronological order, allowing blog visitors to browse and interact efficiently. Blogs commonly keep to a standard set of topics, such as technology, science, politics, news and many more. Blog holders, called "bloggers", usually begin the discussion and invite responses from other people in order to share opinions. In the early stages of the emergence of blogs, whose name is a contraction of the words "web log", users had to embed text into the HTML code of the website. Their widespread acceptance and usage, however, created a need to automate these tasks. This gap was bridged by open source platforms and well-known providers, who automated these collaborative processes and made them really easy for non-expert users.

In a very short time blogs were widely accepted, because they made it possible to open an online dialogue using easy-to-use tools. This dialogue has addressed potential users from every part of the world, giving them the opportunity to contribute significantly to it. Users were now able to create an online community based on democratic foundations, which promotes dialogue through the extensive use of ICTs. This worldwide collective community was called the "blogosphere", since blogs and users were interconnected through posts, comments, links and other content. The discussion that took place in the "blogosphere" became a significant gauge of public opinion on various political and other issues, as well as a medium of political pressure on governments' actions. It was easier than ever for people with common interests to express opinions, and also to engage in the discussion others who did not originally intend to take part.

Governments and political parties [74] in general soon realized the momentum gained by public debate on the Internet. Thus, interest arose in measuring emotion (positive or negative) in the blogosphere, in order to conduct political campaigns or simply to measure public opinion for decision-making purposes. Many researchers used web mining techniques to extract emotion from the Internet, and especially from blogs, in order to understand the effect of electronic public dialogue on offline life and the relationship between the two [61, 1, 92, 39, 70, 91, 67]. Governments also soon realized that the comments and views of citizens on local and global issues are valuable feedback for policy development, and that the Internet was an untapped medium for this purpose [59, 19, 60]. Governments slowly turned to the organized use of information and communication technologies and Web 2.0 technologies to provide citizens with electronic services (e-Government), but also to engage them actively in policy-making (e-Democracy, e-Participation, ePetition) [78].

2.1.2 Wikis

Wikis are another technology that emerged from the Web 2.0 phenomenon; they enhance users' collaborative learning and allow knowledge sharing. Wikis are websites whose content is formed by users, who need no administrator permission to do so. Unlike typical websites, whose content is created and added by the owner of the web page, wikis engage users in adding content collaboratively. Specifically, these websites are easily editable, so that users can add, remove, create and edit content and insert references. Wikis have the advanced feature of keeping the history of the changes each user has made to the web pages, defining a new version every time. All these versions, new and past, can be retrieved at any time. These characteristics enhance user participation and collective collaboration, aiming at content built from different user views [53].

Such websites have been widely used in enterprises and organizations, in order to keep track of the changes that participants or employees make during a collaborative task. Wiki-based systems have also been created and used to support semantic dialogue and argumentation among the members of a discussion, for research [51] or deliberative processes. The most widely known wiki on the World Wide Web is Wikipedia, an online encyclopedia started in 2001. It currently contains more than 17 million articles in more than 270 languages, edited by more than 91 thousand active contributors. Wikipedia attracts about 400 million unique visitors monthly, making it one of the most popular websites worldwide. There are two accounts of the origin of the word "wiki": the first refers to the Hawaiian "wiki wiki", meaning "in a fast way", and the second to the acronym "What I know is", which explicitly conveys the collaborative function of this tool.

2.1.3 Mash-ups

Mash-up is a term used in web development that refers to a website or web-based application that combines data from two or more sources in order to provide a new value-added service. More and more online applications and web sites that provide such services use an open application programming interface (API) framework, enabling other programmers to exploit their applications' capabilities. An API is a set of rules and specifications that allows someone to use efficiently the functions provided by a program or a service. In this context, mash-ups are websites that use the functionality of a web application and combine it with new data, providing an innovative service.

For example, portals are websites that aggregate information classified into categories, making it easy for a user to browse through a tree structure. Let us suppose that a government portal contains the addresses of all government tax offices in a city. A mash-up website could exploit the content of the portal and combine it with the Google Maps application, for example, giving users an easy way to reach such a building by means of a map. Examples of such applications can easily be found in the business world as well, adding value to existing services. One such example is an application for searching products in a variety of catalogs and categories, combined with geographical information systems (GIS). Users could define advanced queries and combine various types of data, choosing between prices and places.
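The tax office example can be illustrated with a small Python sketch that joins two data sources on a shared key. Everything in it is hypothetical: the office records stand in for what a portal might expose, the coordinate table stands in for the result of a geocoding service, and no real API (Google Maps or otherwise) is called.

```python
# Hypothetical source 1: records a government portal might publish.
offices = [
    {"name": "Tax Office A", "address": "1 Example Street"},
    {"name": "Tax Office B", "address": "2 Example Avenue"},
    {"name": "Tax Office C", "address": "3 Unknown Road"},
]

# Hypothetical source 2: address -> (latitude, longitude), as a geocoding
# service might return. The values are made up for illustration.
coordinates = {
    "1 Example Street": (40.0, 22.0),
    "2 Example Avenue": (40.5, 22.5),
}


def build_map_markers(offices, coordinates):
    """Combine both sources into map markers, skipping unknown addresses."""
    markers = []
    for office in offices:
        position = coordinates.get(office["address"])
        if position is None:
            continue  # this address could not be located in source 2
        lat, lon = position
        markers.append({"name": office["name"], "lat": lat, "lon": lon})
    return markers


markers = build_map_markers(offices, coordinates)
for marker in markers:
    print(marker["name"], marker["lat"], marker["lon"])
```

The added value of the mash-up lies in the join itself: neither source alone can place an office on a map.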

2.1.4 Social Tagging Systems

Tagging systems allow web users to bookmark content they find online, such as websites, images and videos, with the keywords (tags) they consider most suitable each time. These keywords are also called labels, and there are no particular words that users have to use when labeling objects online. The more users use tagging systems, the stronger these systems become, since it becomes easier to categorize a variety of online objects. This contribution improves searching and enhances spam detection and the personal organization of items. Social tagging systems (STS) extend tagging systems to include users and labeled items, forming three-way relationships [32, 34]. Social tagging systems exploit Web 2.0 features and relate users, items and tags in order to provide item recommendations to users [37]. Users express their preferences about interesting items they find and label them with tags [31]. This information can then be categorized and made available to other users.

In recent years, various tagging systems have been created and extended to social tagging systems, which include persons. Among the most popular are Del.icio.us¹ and Technorati². The first allows users to tag, save, manage and share web pages with other users. Technorati likewise uses users' tags to categorize posts and blogs across the web, offering users a personalized search that exploits tags. The core system developed for content organization in Delicious and Flickr was called a "folksonomy" by Thomas Vander Wal, combining the words "folk" and "taxonomy" [63]. The main difference between the two terms is that a taxonomy is a hierarchically structured categorization of terms, consisting of explicit parent-child relationships. In contrast, "folksonomies" are the sets of terms (tags) that a group of users has used to label particular content, based on common URLs.

Moreover, STS have also been applied to music data, with services such as Last.fm³ enabling users to label songs, artists and albums with tags. These systems use the tags that users share, together with declared preferences, to recommend music or to inform users about music events that match their preferred genres. STS are widely used in commercial sites [85] and in e-retailing, for example by Amazon.com⁴, which has embedded such systems with great success. Specifically, Amazon users receive recommendations of products that other users bought in a common product category. In addition, Amazon takes full advantage of the metadata users add during their purchases, such as ratings, in order to enhance the recommendation process. Web 2.0 websites that manage audio and visual content, such as Youtube⁵ and Flickr⁶, use tags to categorize content and make it easier for users to search and share.

STS and the tags added by users thus define new explicit and implicit relationships within a network of individuals, tags and items. These relationships hold useful information for efficient sharing and also form multi-modal networks. The next sections study and analyze methods that exploit such relations in order to provide better recommendations, such as collaborative filtering [31, 32], latent factor models [82] and semantics [83].
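The three-way relationships that STS form can be pictured as a set of (user, item, tag) triples. The following minimal Python sketch is illustrative only: the sample assignments, the helper `project` and the function `shared_tags` are hypothetical names introduced here, not part of any tagging system's API. It shows how an implicit user-user relationship (here, co-tagging) can be derived from such triples.

```python
from collections import defaultdict

# Hypothetical (user, item, tag) assignments; all names are illustrative only.
assignments = [
    ("alice", "photo1", "sunset"),
    ("alice", "photo2", "beach"),
    ("bob", "photo1", "sunset"),
    ("bob", "photo3", "city"),
    ("carol", "photo2", "beach"),
]


def project(triples, key_index, value_index):
    """Group one component of each triple by another (e.g. items by user)."""
    groups = defaultdict(set)
    for triple in triples:
        groups[triple[key_index]].add(triple[value_index])
    return dict(groups)


user_items = project(assignments, 0, 1)  # user -> items the user labeled
user_tags = project(assignments, 0, 2)   # user -> tags the user applied


def shared_tags(u, v):
    """Tags used by both users: an implicit user-user relationship."""
    return user_tags.get(u, set()) & user_tags.get(v, set())
```

For the sample data above, `shared_tags("alice", "bob")` contains only the tag "sunset", whereas bob and carol share no tag.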

¹ http://www.delicious.com
² http://www.technorati.com
³ http://www.last.fm
⁴ http://www.amazon.com
⁵ http://www.youtube.com
⁶ http://www.flickr.com


2.2 Social Networking and Social Networks

Social network sites and services have become an integral part of Web 2.0, gaining great popularity because of the support they offer for forming web communities and sharing content. Individuals have more opportunity than ever to form and extend their social relationships in an online environment. The basic concept of a social network existed in offline life before it was expressed and implemented using ICTs. This concept describes the connections among individuals in a social structure and can be visualized using nodes and different types of edges connecting them. Nodes refer to persons or organizations, and edges represent various types of interdependencies among them, such as friendship, kinship, collaboration, common interests, professional ties, shared ideas or views and many more.

The term social networking is usually used to refer to social network sites on the Internet. However, Boyd and Ellison [12] draw a significant distinction between these two similar and familiar terms. They argue that the term "networking" emphasizes relationship initiation, that is, the establishment of a new relationship between strangers. This can happen on a social network site, but they show that it is not the main reason such sites were built. Online social networks may allow and help strangers to form new friendships, but what makes them unique is that they enable users to bring their existing offline social network online. Moreover, they define social network sites as online services that enable users to create a unique profile with particular privacy settings, define a list of other users with whom they share a connection, and view and traverse their own list of connections and those of their connections.

Social networks are becoming very popular and drive the evolution of Web 2.0 technologies, since they provide a space where a variety of new applications can be launched to a large audience. In addition, they promote the evolution of mobile devices and their software, since it is crucial to support the rising trend toward social communication combined with geographical information systems. Users interact with their social network and can meet new people with common interests, political views or other affiliations. Moreover, users are consumers in a global market, where they buy and rate products and other items and express feelings and preferences. Online social networks are expected to co-evolve with social rating and with information about the places users visit, providing unified services to users.


2.3 Online Social Networks Sites

Online social network sites have been introduced gradually since 1997, before Web 2.0 was ever launched as a concept [12]. Even in their early stages, social network sites gained very high popularity and adoption. At that stage, these hybrid social network websites supported user interaction through chat rooms or webpages that provided personal information about users. Soon, the features expanded to include email, in order to extend users’ social graph, and later the exploration of users’ friend lists. The first social network site identified as including the basic characteristics of OSNs in a hybrid form was SixDegrees.com, which offered only user profiles and friend lists.

During the following years, new OSNs appeared with new features for users, whereas others were shut down because they were unsustainable. LiveJournal, Friendster, MySpace and LinkedIn are some of the well-known OSNs that appeared, offering users more features and different concepts. Such capabilities included searching for old classmates by school, or connecting with people involved with music, whether as fans or as composers. In other OSNs, users could be related through their professional status and form communities of researchers, business people, students, professors and many more. Bebo, Facebook, Windows Live Spaces and Twitter belong to the next generation of OSNs, which combine some of the previously mentioned features with new ones. Some of the most popular are analyzed in the next subsections, in order to give a clear view of their characteristics and of their exponential growth and adoption.

2.3.1 Facebook

Facebook is one of the most popular online social networking websites. It counts more than 500 million active users, 50% of whom log on to Facebook on any given day, and it has been translated into 70 languages⁷. The main founder of Facebook is Mark Zuckerberg, who launched it in 2004 while he was studying at Harvard. Initially, Facebook was a private social network that only Harvard students could join. It then gradually extended to other universities, and in 2006 it became available to everyone aged 13 or older. Its current revenue is estimated at 2 billion U.S. dollars. Figure 2.1 depicts the Facebook website.

⁷ Facebook’s statistics included in this section were obtained from: http://www.facebook.com/press/info.php?statistics


Figure 2.1: Overview of the Facebook Website

Facebook includes several features that have made it very popular, especially among young people. It gives users the opportunity to create a rich profile with a variety of information. One of its most characteristic features is the "Status update", where users post a short message about their current emotional or other status. Users can declare their sex, birthday, languages spoken, interests, and political and religious views. In addition, users can declare their education status and choose their schools or universities from lists, so that others can search on this basis and find old colleagues and classmates. Users can create many groups and networks affiliated with various topics. These groups can be public or private, and users can invite others to join them. New friends are recommended through a recommendation system, called "FOAF" [55], that computes user similarity based on the number of mutual friends that two users have. An average Facebook user has 130 friends, and people spend over 700 billion minutes per month on Facebook.

OSNs, as already mentioned, are online spaces where users can share content. Facebook enables users to share photos, videos and links. All these items can be posted to a personal online space called the "Wall", where "friends" can post items or comment on them. In this context, Facebook encourages users to hold an open dialogue and sometimes to deliberate on topics and items. Users can also add metadata to items, by putting labels (tags) on them or by hitting the "Like" button to declare their preference for an item. Users are notified of every action that affects their profile or an item to which they are related in some way. There are over 900 million objects that people interact with (pages, groups, events and community pages). An average user is connected to 80 community pages, groups and events and creates 90 pieces of content each month. More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums) are shared each month.

Facebook engages Web 2.0 technologies by providing APIs that programmers can use to build new applications for users. More than 250 million active users currently access Facebook through their mobile devices. Games, quizzes and many other imaginative applications that combine content, geographical places, maps and mobile devices are available, most of the time for free but in exchange for the use of users’ personal data. Notably, Facebook became one of the first OSNs in which most users registered with their real names. This shows that users have started to trust such systems, but it has also raised many privacy issues because of the uncontrolled extraction of personal, and sometimes sensitive, data by many Facebook applications. Finally, on the business side, Facebook is sustainable and profitable through advertising, Facebook gifts and the performance of its applications.

2.3.2 Myspace

Myspace is a very popular social network website that was launched in 2003 and is estimated to have 200 million user accounts. The Myspace website supports a variety of features embedded in the user’s personal profile. Users can declare their mood using several "emoticons" and fill in sections of personal information, called blurbs, such as "About Me" and "Who I’d Like to Meet". In addition, user interests are stored in a details section. Myspace was one of the first social networks to adopt Web 2.0 technologies, allowing users to share content and comment on it. Flash and HTML can be embedded in user profiles without the need for expert knowledge. Figure 2.2 presents an overview of the Myspace website.

Users share pictures and videos, and they were the first to be able to embed Youtube videos, from Youtube’s launch in 2005. Myspace became very popular because of its relationship with music and artists in general. It gave new artists and amateur composers the opportunity to share their work. Many artists and composers have used it as a promotional tool, since they can share whole playlists of their music. Lately, Myspace has been integrated with services such as Last.fm, iTunes and Napster. There is also a blogging capability, and the medium has seen interesting use for political campaigning, environmental awareness and other kinds of activism. Myspace faces strong competition from newer OSNs such as Facebook, but also from Web 2.0 services such as Youtube that have gradually become more popular.


Figure 2.2: Overview of the Myspace Website

2.3.3 Twitter

Twitter is another popular social network website, founded in 2006 by Jack Dorsey. It is estimated (the figures are unconfirmed) to have about 200 million users. Twitter’s services enable users to "follow" other users and to send short messages, known as "tweets". Following a user means that every time he or she "tweets", in other words posts a public message, the follower is notified. Users can group posts by topic or type, reply to posts, or repost messages from another user. Figure 2.3 shows Twitter’s website.

Figure 2.3: Overview of the Twitter Website


In addition, Twitter users can access the service through its website from a computer, a tablet or any other device that connects to the web. In some countries it is also possible to "tweet" from a mobile phone using an SMS service.

2.4 Social Rating Networks

Online social rating networks (SRNs) are formed in the context of social rating systems, which enable users to express their opinion about items on the web in a quantified way. These systems ask users to rank, vote for, like, dislike or otherwise express an opinion about an item. Moreover, participants can characterize arguments as pros or cons during an online deliberation about a political or other type of issue [19]. Items exposed to this evaluation process can be products or information, posts in forums or social networks, comments, movies, services and many other items on the web. Information about an item or about data is usually called "metadata". All this "metadata" contributes to an overall evaluation of various items on the web that engages users’ active participation. Rating items on the web leads to more informed browsing, more informed product purchases and exposure to more relevant content. In addition, similar opinions can be effectively summarized and used to form groups of users.

Several approaches address issues such as the recommendation of items to users, for example Collaborative Filtering [43], or the extraction of meaningful summaries from users’ reviews [49]. Users who contribute their opinions through social rating systems such as Epinions, Flixster and Digg form social rating networks. In these networks, users are not explicitly related as "friends" in a social network, but they share several implicit relationships, such as co-rating products, co-liking items, co-commenting on posts and many others. Some of them may also share an explicit relationship, such as being friends in a social network. It is crucial for recommendation methods to use every available type of relationship among users and items in order to recommend users, items such as products, or ratings [88].

2.4.1 Epinions

Epinions is a product review site, launched in 1999, that helps consumers make informed product purchases. It is a web platform where people find reliable information about products, unbiased advice, in-depth product evaluations and personalized recommendations⁸. Users of Epinions contribute two types of reviews, Express and Regular, depending on the text length, in order to give quick or detailed information to other users. In addition, users rate products using the available metadata, such as a rating scale from 1 to 5, and are encouraged to add Pros and Cons arguments supporting their declared opinion. There is also a summary of the overall expressed opinion. Members of the website can rate regular reviews on a five-level Helpfulness Scale, and express reviews on a two-valued scale of "Show" or "Don’t Show". Figure 2.4 gives an overview of the Epinions.com website.

⁸ http://www99.epinions.com/about/

Figure 2.4: Overview of the Epinions Website

Epinions motivates users to share helpful reviews by offering monetary prizes to reviewers whom customers rate highly as very helpful and as contributing to the evaluation process. In the context of social rating networks, users of Epinions primarily form implicit relationships with other users, namely those of collaboratively rating products. The Epinions website also allows users to form explicit relationships by "trusting" other users. This "web of trust" measure is based on ratings and trust declarations, and it determines the order in which opinions are shown to each user. The formula of the measure is kept secret.

2.4.2 Flixster

Flixster is a social rating network for movie rating. Registered users of Flixster can rate movies, share these ratings, discover interesting new movies and interact with other users who have similar taste in movies. Flixster provides information about local theater schedules, based on users’ zip code and using maps. In addition, there is an option for ordering tickets online. Figure 2.5 gives an overview of the Flixster website.

Figure 2.5: Overview of the Flixster Website

In 2010, Flixster acquired another popular movie rating website, "Rotten Tomatoes", gaining access to a larger volume of reviews. Both websites provide additional metadata that contribute to the rating process, such as a 5-star scale, share buttons, a recommendation button and others. Flixster has integrated with Web 2.0 applications by releasing apps for popular social networks such as Facebook and Myspace, gaining a larger audience and becoming accessible from various mobile devices (e.g. iPhone, Blackberry) and mobile operating systems (e.g. iOS, Android).

2.4.3 Digg

Digg is a very popular online social rating network that helps users share and rate news. Users share links that redirect to content (text, photos, video). The topics shared cover a variety of subjects such as business, entertainment, technology, politics and many more. The main concept of this social news website is that users share links and other users choose to "digg" them, in other words to vote positively on a submission, or to reject it. The popularity of a post is measured by the number of "diggs" it receives. In addition, users are encouraged to comment on posts, and they can save an interesting post to read later or to keep a record of it. Digg engages users in deliberation, since users can not only comment on posts but also co-comment in a threaded dialog. Furthermore, users are provided with metadata buttons such as thumbs up/down to agree or disagree with other users’ comments. Figure 2.6 gives an overview of the digg.com website.

Figure 2.6: Overview of the Digg Website

"Digging" common posts is the major relationship among users in this social rating network, and it is an implicit type of relationship. However, these relationships create the potential for new explicit relationships, such as friendship, to form. Specifically, if a user likes the posts made by another user, he or she can "follow" that user’s posts in the future. Users also receive notifications on their profile interface for any event related to them or to items with which they have been connected (e.g. posts, links). All this structured information, together with interactions and available metadata, is integrated so that Digg.com can efficiently sort users’ comments and provide recommendations of people and posts⁹.

⁹ http://about.digg.com/blog/algorithm-experiments-better-comments#suggestions


3 Data mining in OSNs

The rapid spread of the Web 2.0 technologies described in Chapter 2, combined with the great number of relationships that users can form by using them, creates the need for efficient management of large amounts of information and for the extraction of meaningful knowledge. Users of OSNs share content and form relationships, which the Data Mining research field can exploit to provide better services and personalization. In this chapter, we study the field of social network analysis using Graph Theory notation, present available tools and discuss a new age of recommendation services.

3.1 Social Network Analysis (SNA)

Social Network Analysis (SNA) is a set of methods and practices concerned with the study of the social structures that are formed within social networks. Specifically, SNA can be considered a technique for the quantitative study and visualization of the relationships that entities form, and for the extraction of meaningful conclusions about their behavior from a sociological perspective. To achieve this, SNA uses tools from various research fields such as Sociology, Graph Theory, Computer Science and Statistics.

Traditional techniques of social analysis are based on individuals and their behavior [47]. In contrast, SNA focuses on models, applications and theories that study individuals’ relationships as part of social groups [90, 48]. SNA quantitatively measures and visualizes relationships among actors, which may be people, groups, companies, organizations, or any other unit that processes information and knowledge. There are several types of relationships among them, such as friendship, professional ties, common political, religious or other views and many others. Wasserman and Faust [90] define four central points in SNA:

• Actors and their actions are regarded as related to each other, and not independent.

• The ties of relationships (connections) among actors are channels used to transfer resources.

• Network models focusing on the individual regard networks as something that limits individual action.

• Network models consider network structure (social, political, economic, etc.) as lasting patterns of interaction among actors.

As already mentioned, SNA uses several tools derived from different fields in order to analyze network data. Graph theory, discrete mathematics and several algorithms from computer science help to analyze centrality measures and the positions of actors in a social network. Several measures and metrics are defined and used to give a social perspective on actors’ fame, reputation and influence, or on network connectivity. Data mining, as part of the wide research field of computer science and information management, provides tools to detect communities [87] within social networks.

Such tools include clustering techniques like k-means, spectral analysis and categorization algorithms [72]. Other tools are statistics, latent factor methods [82] and opinion mining [41, 76]; for example, Jamali et al. analyzed the popularity of different topic categories in the Digg social rating network and tried to predict the popularity of posts based on users’ comments. In addition, sentiment mining and analysis has been used in several studies reported by Pang et al. [76] in order to categorize sentiments in blogs or public discussions. One such work was presented by Go et al. [27], who categorized tweets in particular discussions on Twitter as positive or negative. Finally, new approaches deal with the likelihood that new relationships of different types will appear between actors, known as the link prediction problem [55].

SNA is increasingly applied in different fields, due to the popularity the Web has gained lately as well as the large volume of data created through it. The rapid spread of Web 2.0, and especially the social networking phenomenon, has made it easy to gather data about users, their relationships and their daily interactions.


Apart from online social networks, for which related work has already been mentioned, SNA methods have been used effectively in a number of other fields and applications. One such field is public security and terrorism, where several studies [50, 71] used SNA as a tool for the study of criminal organizations or for the automatic discovery of the interception of private data. Moreover, SNA has been applied to business and the economy, where Gay et al. [24] exploited SNA and visualization tools to depict company alliances in the commercial world. Last but not least, co-authorship and academic research networks have been explored using SNA tools by Luo et al. [58].

The visualization of social networks and their relationships is an important part of SNA, since it allows the structure of networks to be represented explicitly. Graph theory is ideal for representing actors as nodes and their interactions as different types of edges. Section 3.2 provides a detailed presentation of graph theory and its notation. In addition, apart from the theoretical background of Graph Theory, there are several visualization tools and data structures that enable the automated study of social networks using graphical information systems.

The emerging research area of social network analysis has been promoted by a non-governmental organization since 1976. The "International Network for Social Network Analysis"¹ organizes the global conference "International Social Networks Conference - Sunbelt" every year, and also publishes the journals Connections and Social Networks.

3.2 Basic Graph Theory

Graph Theory [10] is studied within the field of Discrete Mathematics [57] and is widely used in Computer Science. Graphs are mathematical structures that model relations between entities from a certain collection. There are many different types of graphs, which visualize different networks with several types of relationships. One definition of a graph is the following: "A graph G(V,E) is a non-empty finite set V of elements called vertices accompanied by a set E of pairs of vertices called edges." Edges represent the (symmetrical) relationship between two vertices. Walks, cycles and paths are defined on graphs in order to traverse them efficiently. A walk is a sequence of successively adjacent edges; it is called closed if its first and last vertices are the same, and open if they are different. Paths are walks that do not repeat any vertex, and therefore contain no cycles. The length of a walk or cycle is the number of edges traversed in it. The distance, or geodesic distance, between two vertices (nodes) is defined as the length of the shortest path between them [30]. If there is no path connecting the two vertices, the distance is conventionally defined as infinite.

¹ http://www.insna.org/

Graphs can be efficiently represented in a common mathematical form, that of matrices, and particularly adjacency matrices. For an undirected graph, the adjacency matrix $A$ is a symmetric $n \times n$ matrix, where $n = |V|$ is the number of nodes in $G$. The entry $a_{ij}$ of the adjacency matrix equals 1 if $(v_i, v_j) \in E$; otherwise it is 0. Raising the adjacency matrix of a graph $G$ to the power $k$ yields an interesting property: the entries of the new matrix $B = A^k$ count the number of walks of length $k$ (routes in which vertices may be revisited) between each pair of nodes $(v_i, v_j)$. This property is useful for several SNA measures, helping to capture users’ relationships in greater depth.
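As an illustration of the geodesic distance defined earlier in this section, the following minimal Python sketch computes it by breadth-first search. The function name `geodesic_distance`, the adjacency-list representation and the example graph (including the isolated node E) are assumptions made for this illustration and do not come from the thesis.

```python
import math
from collections import deque


def geodesic_distance(adjacency, source, target):
    """Length of a shortest path from source to target, or math.inf if none exists."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    # No path connects the two vertices: the distance is infinite by convention.
    return math.inf


# Hypothetical undirected graph stored as adjacency lists; node "E" is isolated.
graph = {
    "A": ["B", "C", "D"],
    "B": ["A"],
    "C": ["A", "D"],
    "D": ["A", "C"],
    "E": [],
}
```

In this example, B and D are at distance 2 (via A), while the distance from A to the isolated node E is infinite.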

3.2.1 Unipartite Graph

A binary graph is one that represents a relationship between two vertices (nodes) by the existence or absence of an edge between them. This kind of graph is also called undirected, simple, symmetric or unipartite. Adjacency matrices are a common way to represent graphs. The adjacency matrix is an $n \times n$ matrix, where $n = |V|$ is the number of nodes in $G$; it therefore has $n$ rows and $n$ columns labelled by the graph nodes. For an unweighted graph without multiple edges (such as $G$), the adjacency matrix values are set as follows:

$$A[v_i, v_j] = \begin{cases} 1, & \text{if } (v_i, v_j) \in E \\ 0, & \text{if } (v_i, v_j) \notin E \end{cases}$$

Figure 3.1 shows an example of an undirected graph that visualizes the friendship relationship among users A, B, C and D in a social network.

[Graph drawing: nodes A, B, C and D, with edges A-B, A-C, A-D and C-D]

    A  B  C  D
A   0  1  1  1
B   1  0  0  0
C   1  0  0  1
D   1  0  1  0

Figure 3.1: Example of an undirected graph and its corresponding adjacency matrix.


A simple graph of order $n$ is called complete, and denoted $K_n$, when every vertex is adjacent to every other vertex. The number of edges of the complete graph is given by Equation 3.1:

$$|E| = \frac{n(n-1)}{2} \tag{3.1}$$

Figure 3.2 depicts the complete graph $K_4$ on the same vertices as the undirected graph of Figure 3.1. In this example, connecting all four nodes to one another gives six edges.

[Graph drawing: nodes A, B, C and D, with every pair of nodes connected]

Figure 3.2: Example of the complete K4 graph.

Figure 3.3 presents the matrix $B = A^2$ for the simple graph of Figure 3.1. As noted in Section 3.2, the entries of a matrix power count routes of the corresponding length, here the walks of length 2 (two-step routes that may revisit a vertex). For example, the entry (A, D) means that there is one walk of length 2 between vertices A and D (A-C-D). In addition, the diagonal entries $b_{ii}$ of the matrix give the degree of each vertex, that is, the number of edges incident to the node (node A is adjacent to the 3 nodes B, C and D).

[Graph drawing: the undirected graph of Figure 3.1]

    A  B  C  D
A   3  0  1  1
B   0  1  1  1
C   1  1  2  1
D   1  1  1  2

Figure 3.3: Example of an undirected graph and its corresponding adjacency matrix raised to the power of 2.
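The matrix of Figure 3.3 can be checked with a few lines of code. The sketch below, written in plain Python with the hypothetical helpers `mat_mul` and `mat_power`, squares the adjacency matrix of Figure 3.1; it is meant only as a worked illustration of the counting property of matrix powers.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [
        [sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_power(A, k):
    """Raise A to the integer power k >= 1 by repeated multiplication."""
    result = A
    for _ in range(k - 1):
        result = mat_mul(result, A)
    return result


nodes = ["A", "B", "C", "D"]

# Adjacency matrix of the undirected graph of Figure 3.1 (rows/columns: A, B, C, D).
A = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
]

B = mat_power(A, 2)

# For an undirected simple graph, the diagonal of A^2 holds the vertex degrees.
degrees = [B[i][i] for i in range(len(nodes))]
```

The result reproduces the matrix of Figure 3.3, and its diagonal (3, 1, 2, 2) equals the row sums of the original adjacency matrix, that is, the vertex degrees.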


Directed graphs, or digraphs, use arcs instead of edges in order to model one-way relationships between vertices. Figure 3.4 shows an example of a directed graph that corresponds to a social network.

[Graph drawing: directed graph on nodes A, B, C and D]

    A  B  C  D
A   0  1  1  0
B   0  0  0  0
C   0  0  0  1
D   1  0  0  1

Figure 3.4: Example of a directed graph and its corresponding adjacency matrix.

Weighted graphs can be either undirected or directed and have weights (numerical values) associated with their edges or arcs, respectively. Adjacency matrices that correspond to weighted graphs contain these weights as their entries. Figure 3.5 shows an example of an undirected weighted graph that corresponds to a social network.

[Graph drawing: undirected weighted graph on nodes A, B, C and D]

    A  B  C  D
A   0  2  4  5
B   2  0  3  0
C   4  3  0  3
D   5  0  3  0

Figure 3.5: Example of an undirected weighted graph and its corresponding adjacency matrix.

3.2.2 Bipartite Graph

A bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint sets V and W such that every edge connects a vertex in V to one in W; that is, V and W are independent sets. Bipartite graphs are widely used to visualize a relationship formed between two different kinds of entities. In SNA, bipartite graphs can represent the relationships of users with several kinds of items, such as groups, tags, preferences and many more. In collaboration networks, they can connect authors to the papers they have written together. In biology [15], bipartite graphs depict the connections of different elements that form substances. The adjacency matrix R of a bipartite graph is defined as follows:

$$R[v_i, w_j] = \begin{cases} 1, & \text{if } (v_i, w_j) \in E \\ 0, & \text{if } (v_i, w_j) \notin E \end{cases}$$

Figure 3.6 shows an example of a bipartite graph $G(V \cup W, E)$ that corresponds to a user-item social network. Users belong to the vertex set V, while items belong to the second vertex set W (labelled U in the figure). Edges connect nodes from the two distinct sets, defining one type of relationship between them.

[Graph drawing: users A, B, C and D in the vertex set V; items E and F in the vertex set U]

    E  F
A   1  1
B   1  0
C   1  1
D   1  0

Figure 3.6: Example of a bipartite graph and its corresponding adjacency matrix.
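One way to see how a user-item bipartite graph induces implicit relationships among users is to multiply the matrix R of Figure 3.6 by its transpose: entry (i, j) of the product counts the items connected to both user i and user j. The Python sketch below illustrates this under the assumption that R is stored as a list of rows; the function name `co_occurrence` is hypothetical, and this projection is offered as an illustration rather than as the method used later in the thesis.

```python
users = ["A", "B", "C", "D"]
items = ["E", "F"]

# User-item matrix R of Figure 3.6 (rows: users A-D, columns: items E, F).
R = [
    [1, 1],
    [1, 0],
    [1, 1],
    [1, 0],
]


def co_occurrence(R):
    """Return R * R^T: entry (i, j) counts the items linked to both user i and user j."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in R]
        for row_i in R
    ]


shared = co_occurrence(R)
```

For example, users A and C are both connected to items E and F, so the corresponding entry of `shared` is 2, whereas A and B share only item E. The diagonal holds the number of items linked to each user.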

3.2.3 Multi-Modal Graph

For multigraphs, it is assumed that the graph G can contain multiple edges connecting the same two nodes; thus, if two nodes $v_i, v_j$ are connected by an edge of E, another edge in E may also connect them. This feature enables more than one type of relationship to be depicted in a graph simultaneously. Multimodal graphs are closely related to multigraphs, but the latter allow loops [10].

Figure 3.7 shows an example of a multimodal network that corresponds to a social network with more than one type of relationship (shown in different colors). Blue edges correspond to the friendship network connecting vertices of the set V. In contrast, red edges correspond to the bipartite network depicting relationships among nodes of set V and nodes of set U. The union of the unipartite and the bipartite graphs forms the multi-modal graph, which includes both types of relationships.

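[Figure 3.7 content: users A, B, C, D in the V set of vertices; items E, F in the U set of vertices.]

User-user (friendship) adjacency matrix:

    A  B  C  D
A   0  1  1  1
B   1  0  0  0
C   1  0  0  1
D   1  0  1  0

User-item adjacency matrix:

    E  F
A   1  1
B   1  0
C   1  1
D   1  0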

Figure 3.7: Example of a multimodal graph and its corresponding adjacency matrices.

3.2.4 Measures and Metrics

This section briefly presents a number of metrics that are used to describe particular characteristics and topological properties of social networks from a graph perspective [11, 9, 68, 23].

• Average distance is the average distance between any two nodes.

• Diameter is the maximum distance between any two nodes.

• Density is the proportion of existing edges among all possible edges between the nodes of a graph. There are sparse and dense networks.

  – For undirected graphs it is defined as:

    \[ \rho = \frac{2|E|}{|V|(|V|-1)} \tag{3.2} \]

  – For directed graphs it is defined as:

    \[ \rho = \frac{|A|}{|V|(|V|-1)} \tag{3.3} \]


• Bridge is an edge whose deletion would cause its endpoints to lie in different components of a graph. If its endpoints share no common neighbors, then the edge is called a local bridge.

• Betweenness is a measure showing the number of people whom an individual connects indirectly through their direct links.

• Closeness is the degree to which an individual is near all other individuals in a network (directly or indirectly). Closeness is the inverse of the sum of the shortest distances between each individual and every other person in the network.

• Centrality: This measure gives a rough indication of the social power of a node based on how well it "connects" the network. "Betweenness", "Closeness", and "Degree" are all measures of centrality [22, 23, 11, 9].

  – Degree centrality measures the popularity of a node, but also the probability of the node being traversed.

  – Betweenness centrality is the number of times that a node n has to pass through another node r in order to reach a third node k. It is also defined in terms of the number of shortest paths that pass through a certain node:

    \[ b_i = \sum_{j,k} \frac{g_{jik}}{g_{jk}} \tag{3.4} \]

    where $g_{jk}$ is the number of shortest paths between vertices $v_j$ and $v_k$, and $g_{jik}$ is the number of those paths that pass through $v_i$.

  – Closeness centrality shows the distance of a node from the rest of the nodes of the network:

    \[ c_i = \frac{1}{\sum_j d_{ij}} \tag{3.5} \]

    where $d_{ij}$ is the distance between vertices $v_i$ and $v_j$.

• Eigenvector centrality measures the importance of a node within a network. A node’s centrality is proportional to its neighbors’ centralities. This is expressed in the equation below:

  \[ c_i = \alpha \sum_{j \neq i} c_j a_{ij} \tag{3.6} \]

• Flow betweenness centrality is the degree to which a node contributes to the sum of maximum flow between all pairs of nodes (excluding that node).

[Figure 3.8 labels: Degree Centrality, Bridge, Eigenvector, Betweenness Centrality.]

Figure 3.8: Example of metrics in a social network.

• Centralization is the difference between the numbers of links for each node, divided by the maximum possible sum of differences.

• Clustering coefficient of a node measures whether a user or person is part of a cohesive group or community:

  \[ c_i = \frac{2 e_i}{k_i (k_i - 1)} \tag{3.7} \]

  where $e_i$ is the number of arcs (for directed graphs) or edges (for undirected graphs) between the neighbors of node i, and $k_i$ is its degree.
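Some of the measures above can be illustrated with a short Python sketch. It computes the density of equation (3.2), the closeness centrality of equation (3.5) and the clustering coefficient of equation (3.7) on the friendship edges among users A to D of Figure 3.7. The function names are hypothetical, the graph is assumed to be undirected and connected, and returning 0 for nodes with fewer than two neighbors is a convention chosen here, not something stated in the text.

```python
from collections import deque

# Friendship edges of the users in Figure 3.7 (undirected).
EDGES = [("A", "B"), ("A", "C"), ("A", "D"), ("C", "D")]


def build_adjacency(edges):
    """Return {node: set of neighbours} for an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Equation (3.2): 2|E| / (|V| (|V| - 1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * m / (n * (n - 1))


def distances_from(adj, source):
    """Breadth-first search: hop distance from source to reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness(adj, node):
    """Equation (3.5): 1 / sum_j d_ij (assumes a connected graph)."""
    dist = distances_from(adj, node)
    return 1 / sum(d for other, d in dist.items() if other != node)


def clustering_coefficient(adj, node):
    """Equation (3.7): 2 e_i / (k_i (k_i - 1)); taken as 0 when k_i < 2."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Each edge among neighbours is seen from both endpoints, hence // 2.
    e = sum(len(adj[u] & nbrs) for u in nbrs) // 2
    return 2 * e / (k * (k - 1))


adj = build_adjacency(EDGES)
print(density(adj))
print(closeness(adj, "A"))
print(clustering_coefficient(adj, "A"))
```

For this small graph, node A is adjacent to every other node, so its closeness is the largest possible for four nodes, while only one of the three possible edges among its neighbors (C-D) exists.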

Moreover, centrality measures are sometimes encountered under the terms prestige, radiality, reach, structural cohesion and structural equivalence. These terms come from sociological terminology and concepts.

• Prestige in a directed graph is the term used to describe a node’s centrality.

• Radiality is the degree to which an individual’s network reaches out into the network and provides novel information and influence.

• Reach is the degree to which any member of a network can reach other members.

• Structural cohesion is the minimum number of members who, if removed from a group, would disconnect the group.

• Structural equivalence refers to the extent to which nodes have a common set of linkages to other nodes.


3.3 Community Detection

Social networks, and Web 2.0 in general, enable users to connect and communicate based on common interests, beliefs and other items that they share. It has been widely observed that these connections among users form web communities with several structural features. One of these features is the density of the connections (edges) among the members of a community, compared with the connections between nodes of different communities. A set of nodes forms a conventional community if every node in it is connected by a greater number of edges to other nodes within the community than to nodes outside of it [87]. Figure 3.9 depicts an example of three communities (red, green, blue) that satisfy this condition: within each colored community there are more edges among its nodes than edges connecting them to the outside.

Figure 3.9: Example of a graph with three communities in a social network.

Social network analysis addresses the emerging problem of how to detect communities from network data, using several methods and applications of graph theory and data mining. Block analysis refers to the approach that rearranges rows and columns of the adjacency matrix in order to trace communities by creating blocks around the diagonal. Classification methods from the emerging research area of data mining, such as K-clique and K-plex methods, address the problem of community detection by tracing more or less complete graphs. In addition, spectral analysis uses eigenvectors of the adjacency matrix to trace communities, based on eigenvector centrality and the similarity of users. Hierarchical clustering is an agglomerative method that adds similar nodes to the same clusters in order to form communities [15].
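The condition for a conventional community given above (every member has more edges inside the group than outside of it) can be checked directly. The following is a minimal sketch under the assumption that the graph is stored as a dictionary of neighbor sets; the function name and the toy graph are hypothetical and are not taken from Figure 3.9.

```python
def is_conventional_community(adj, members):
    """Check the condition quoted from [87]: every member has more
    neighbours inside the candidate set than outside of it."""
    members = set(members)
    for node in members:
        inside = len(adj[node] & members)
        outside = len(adj[node] - members)
        if inside <= outside:
            return False
    return True


# Hypothetical toy graph: two triangles joined by the single edge c-d.
ADJ = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}

print(is_conventional_community(ADJ, {"a", "b", "c"}))  # True
print(is_conventional_community(ADJ, {"c", "d"}))       # False
```

Note that this only verifies a given candidate set; it does not search for communities, which is what the detection methods listed above are for.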


3.4 Topological Properties

In this section, common topological properties that are observed in social and other networks are examined. Small-world and scale-free networks are network categories with interesting features. The properties of these networks were first traced and evaluated experimentally. Later, a significant part of research work was dedicated to their theoretical background as well as to the simulation and generation of such networks.

3.4.1 Small World Networks

Small world networks were first approached by Stanley Milgram, who claimed that everyone in the world could be connected to everyone else through six or seven steps in a human chain [65]. This interesting property was experimentally confirmed in the 1960s, when he traced the human chain that a letter followed from a sender in Nebraska to reach a receiver in Boston who was a complete stranger to the sender. This experiment was later referred to by others as the "small world" or "six degrees of separation" experiment. Regardless of the fact that a message from a person A to a person B can follow a variety of human chains of different lengths, it was confirmed that the average number of steps is only six. Recently, Goel et al. [28] supported the same claim experimentally from an algorithmic perspective. They reported that half of all chains can be completed in 6 or 7 steps.

Small world models can be found in several real-world settings and have the property that the graph diameter (the longest shortest path between two nodes) increases logarithmically with the number of nodes [6]. Small world networks are a common phenomenon in social networks, web communities, email networks, chat contacts, co-authorship networks and other networks over the Internet. Several research efforts have been systematically undertaken to study the small world phenomenon [28, 3, 46, 55], in the context of link prediction in networks and the understanding of their evolution.
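The effect of a few long-range links on the average distance and the diameter can be illustrated with a small sketch. It compares a plain ring of 12 nodes with the same ring after two "shortcut" edges are added. The ring, the shortcut positions and the function names are assumptions chosen for illustration; this is not the procedure of any of the cited studies.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_distance_and_diameter(adj):
    """Return (average shortest-path length, diameter) of a connected graph."""
    lengths = []
    for source in adj:
        dist = bfs_distances(adj, source)
        lengths.extend(d for node, d in dist.items() if node != source)
    return sum(lengths) / len(lengths), max(lengths)


def ring(n):
    """Hypothetical test graph: n nodes arranged on a cycle."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


plain = ring(12)

# Copy the ring and add two long-range "shortcut" edges.
shortcut = {i: set(nbrs) for i, nbrs in plain.items()}
for u, v in [(0, 6), (3, 9)]:
    shortcut[u].add(v)
    shortcut[v].add(u)

print(average_distance_and_diameter(plain))
print(average_distance_and_diameter(shortcut))
```

Both the average distance and the diameter drop once the shortcuts are present, which is the intuition behind short chains in large social networks.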

3.4.2 Scale free Networks

Scale free networks belong to the wider category of small world networks and follow a power law distribution in their node degree [4, 8]. In practice, this means that many social, communication or biological networks contain a small number of nodes that have a very high degree and many nodes with a low degree. As previously mentioned, degree refers to the number of edges that are incident on a node. Thus, some nodes are more popular and more influential in a network than others, while particular parts of the network are more or less sparse in terms of connectivity (edges). These properties are very important when analyzing social networks, allowing researchers to trace influential users within them.

The node degrees in scale free networks follow a power law distribution; therefore, these networks are often called power law networks. This property, and especially the high degree of particular nodes, has been exploited by algorithms that search in graphs or try to predict new connections among nodes [4]. A node with a high degree is more likely to offer a path or become a hub to other nodes in terms of searching. In the same vein, a popular node is more likely to connect to new nodes in the future than a node with low connectivity. In Chapter 4 on link prediction, measures and algorithms based on this property, such as preferential attachment [8, 55], are analyzed. Finally, there have been attempts to simulate and study the evolution of such networks, such as that of Faloutsos et al. [18], who analyzed communication networks with multi-modal features, tracing patterns and ways to generate them artificially.
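The idea that popular nodes attract new connections can be sketched with a simple growth process, in which each new node links to one existing node chosen with probability proportional to its current degree. This is a simplified illustration under stated assumptions (one link per new node, a fixed random seed, hypothetical function names); it is not the generator studied in [18].

```python
import random
from collections import Counter


def grow_preferential(n_nodes, seed=0):
    """Grow a graph one node at a time; each new node links to one existing
    node chosen with probability proportional to its current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears in this list once per incident edge, so a uniform
    # draw from it is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges


def degree_histogram(edges):
    """Return {degree: number of nodes with that degree}."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


edges = grow_preferential(2000)
hist = degree_histogram(edges)
for k in sorted(hist)[:5]:
    print(k, hist[k])
print("max degree:", max(hist))
```

Printing the histogram typically shows many nodes of degree one and a few nodes of much higher degree, which is the skewed shape described above.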

3.5 Visualization and Tools

Visualization of networks is a powerful tool for researchers and users to efficiently map entities and relationships that appear within social and other networks. The growth of social networks and of SNA has created the need to study the static and dynamic features of networks [69]. Several tools have been created that efficiently map and analyze the topology of networks and graphs. Specifically, a network layout can be arranged based on its principal components, by relating vertices and edges with springs between them, or by grouping nodes by attributes. In addition, analysis tools compute SNA measures and enable researchers to find cut points, blocks, k-cores or subgroups. Mapping network dynamics has also been explored recently, in order to capture the evolution of networks over time. There are two particular visualization techniques for this task: one is to use static mappings over time, and the second is to add movement in a virtual space.

Visualizing social networks enables researchers to map and analyze the topological and other features of networks, while for users it is an enjoyable exploration of their social graph. There are software packages and online applications available for both tasks, which allow either the visualization and analysis of networks, or the exploration of the social graph or of related links on the Web. For the first task, some of them are Pajek, Netdraw or Stocnet, while for users they are mainly Social Graph or TouchGraph. Figure 3.10 shows a screenshot of TouchGraph visualizing the friendship relationships that are formed within the Facebook social network. Nodes and edges are colored differently to represent different clusters of friendships. Figure 3.11 depicts Netdraw. Here, a social network is analyzed and nodes are arranged by the principal components of the network.

Figure 3.10: TouchGraph screenshot visualizing friendships within a Social Network.

Figure 3.11: Example of Netdraw visualizing a social network.


3.6 Personalization and Recommender Systems

The World Wide Web, and especially Web 2.0, has gradually enabled users to publish and share more and more digital content. This has contributed to the already increasing volume of information available online and has created new challenges around efficient online content searching. Emerging research areas such as data mining, information retrieval and artificial intelligence have suggested methods and techniques to support users with a personalized online browsing experience. To achieve that, users’ preferences need to be declared and other personal information needs to be accessed.

The personalization field combines the aforementioned methods to collect personal data, compare them to those of other online users or items, and finally compute similarities among them in order to produce a personalized output for users. Online personalization adjusts content or provided services to users by collecting information about users’ online browsing behavior and declared status or preferences. This task, of course, requires the processing of personal data and sometimes of sensitive personal data, a fact that has raised serious concerns in terms of privacy. It is important that users have been previously informed about the processing of their personal data and have explicitly given their consent.

Several online services enable users to personalize and customize their browsing experience based on implicit and explicit data, respectively. Such services include search engines, content providers (news, music, video, etc.), business-to-customer (B2C) services (e.g. Amazon), print media and marketing. All these services exceed the limits of browsers and personal computers, since a great number of applications serve content on different devices. Users enjoy the advantages of online personalization by saving time and money and by having a better informed browsing experience.
On the other hand, there are disadvantages such as the aforementioned privacy issues, the lack of content relevance and, moreover, the lack of returned value. Personalization is considered to be a necessary task, due to the large volume of information available online. Moreover, Web 2.0 enables the collaborative evaluation of content and services among users, since they are provided with rating tools (ratings, star scales, thumbs up/down), which offer valuable feedback to the user community. On the other hand, there is an interesting point of view expressed in an article2 of "Wired" magazine, reflecting the concern about the over-centralization of content in a small number of providers and its impact on the web.

One of the most popular tools for the personalization process is recommender systems, which have been an emerging research field since the mid-90s [5]. Both the academic and the business world have shown general interest in recommender systems, since in many applications users have to deal with information overload and need accurate personalized recommendations. We need to highlight that recommender systems are a research area that requires further improvements in the methods currently used, since the information available online increases every single moment, together with users’ needs.

Examples of current recommender system applications concern particularly content sites, online commercial sites and social networks. For content sites, such as Last.fm, StumbleUpon, AlloCine, etc., the task is to predict the rating of items by a given user or to find a list of interesting items. The data available for this task are characterized by a precise content description and explicit ratings for some users. In the commercial aspect of the recommendation issue, websites like Amazon [56] and Netflix recommend books, CDs and other products to users. Here the task is to build groups of products for bundle sales, in other words to build a list of products that the user is likely to buy. Data for this task are derived from the list of all purchases and the browsing history of all users. In the advertisement field (e.g. Google AdSense, Doubleclick), recommender systems find a list of advertisements optimized according to expected income, using data from the browsing history of all users. There are also applications in movie recommendation, such as MovieLens [66], as well as cases where vendors have incorporated recommendation capabilities into their commerce servers [5]. Recommender systems are usually classified into three categories, which are described in the next subsections.

2http://www.wired.com/magazine/2010/08/ff_webrip/all/1

3.6.1 Content-Based

Content-based recommendation refers to systems in which the user is recommended items similar to those he or she preferred in the past. The content-based approach to recommendation is rooted in the areas of information filtering and information retrieval. In content-based recommendation methods, in order to recommend items to a user u, the system tries to capture similarities among the items that user u has rated highly in the past. Therefore, only the items that are most similar to the user’s preferences are recommended. There are two main methods for content-based filtering. The first one is heuristic-based, using similarity measures such as cosine similarity or clustering techniques. The second one is model-based and uses probabilistic models that learn to predict users’ preferences from attributes.

These systems exploit data and features from users’ profiles, as well as data derived from item features and ratings and their similarity. Such recommender systems have the advantage of receiving the necessary feedback from users through the rating of items. On the other hand, there are a number of problems in these types of recommender systems. Overspecialization occurs when the system can only recommend items that score highly against the user’s profile. As a result, the user is limited to being recommended items that are similar to those already rated. To address this issue, the introduction of some randomness and the filtering out of items that are too similar to something the user has already seen were proposed.

Moreover, the new user problem refers to the inability to provide accurate recommendations due to an insufficient number of ratings from the user. Recommender systems need feedback in order to provide recommendations, and they suffer from a "cold start problem". In addition, content-based techniques rely on item features to compute similarities and provide recommendations. When a sufficient set of features is lacking, or information about item features cannot be gained automatically, recommendations tend to be worse. Finally, another problem due to limited content analysis is that if two items are described by the same set of features, they are often indistinguishable [5].
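The heuristic-based variant described above can be sketched as follows. The user profile is taken to be the mean feature vector of the items the user rated highly, and unseen items are ranked by cosine similarity to that profile. The item names, the three features and the use of a mean profile are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(item_features, liked, top_n=2):
    """Rank unseen items by cosine similarity to the mean vector of the
    items the user rated highly."""
    dims = len(next(iter(item_features.values())))
    profile = [
        sum(item_features[i][d] for i in liked) / len(liked)
        for d in range(dims)
    ]
    scores = {
        item: cosine(profile, vec)
        for item, vec in item_features.items()
        if item not in liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical items described by three made-up features
# (e.g. genre weights); not data from the thesis.
ITEMS = {
    "item1": [1.0, 0.0, 0.0],
    "item2": [0.9, 0.1, 0.0],
    "item3": [0.0, 1.0, 0.0],
    "item4": [0.8, 0.0, 0.2],
    "item5": [0.0, 0.2, 0.9],
}

print(recommend(ITEMS, liked={"item1", "item2"}))
```

The sketch also makes the overspecialization problem visible: the top recommendation is the item whose features most resemble what the user has already rated.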

3.6.2 Collaborative Filtering

Collaborative recommender (or collaborative filtering) systems try to provide recommendations to users for items that have previously been rated by other users. The key element is that the recommendation depends on the similarity of users’ profiles. The similarity of a user to other users is computed based on measures that depict the relationships of users in a network and their ratings on items. The basic concept is that matching similar users is a task based on the preferences, interests and personal profiles of users. Users who buy and rate products while keeping a user profile contribute to the formation of an online community or a rating social network. Collaborative recommender systems exploit users’ profiles to compose a group of users that are most "similar" to a target user and provide recommendations. In practice, it is the fact that a user has rated an item, and not the value of the rating, that makes him or her more similar to another user [5]. Therefore, there is no need to keep the item features.

On the other hand, the new user problem is also present here, just as in content-based systems. Moreover, this problem now extends to items, meaning that new items with a small number of ratings will not be recommended. This sparsity of ratings on new or older items makes the recommendation task more difficult and sometimes insufficient. In order to overcome this problem, systems resort to comparing user profiles only, focusing on users’ demographic features to obtain some similarities [81]. Another method that addresses the sparsity problem is the aforementioned SVD, which reduces the dimensionality of sparse rating matrices.


Unfortunately, exploiting users’ demographic features is a hard task because of the privacy issues that are raised. Personal data processing requires explicit user consent, and most of the time users appear to be cautious and reluctant.
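A user-based collaborative step of the kind described in this subsection can be sketched as follows. Similarity is computed only from which items each user has rated (Jaccard overlap of the rated-item sets), in line with the remark that the act of rating, rather than the rating value, drives similarity. The choice of Jaccard, the neighborhood size k and the data are assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap of two sets of rated items, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(rated, target, k=2, top_n=3):
    """User-based sketch: find the k users whose rated-item sets overlap
    most with the target's, then score items the target has not rated by
    the summed similarity of the neighbours who rated them."""
    sims = {
        user: jaccard(rated[target], items)
        for user, items in rated.items()
        if user != target
    }
    neighbours = sorted(sims, key=lambda u: (-sims[u], u))[:k]
    scores = {}
    for user in neighbours:
        for item in rated[user] - rated[target]:
            scores[item] = scores.get(item, 0.0) + sims[user]
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical users and the items each one has rated (made-up data).
RATED = {
    "u1": {"i1", "i2", "i3"},
    "u2": {"i1", "i2", "i4"},
    "u3": {"i2", "i3", "i5"},
    "u4": {"i6"},
}

print(recommend(RATED, "u1"))
```

User u4 shares no rated item with anyone, so all of its similarities are zero; this is the new user problem in miniature.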

3.6.3 Hybrid

Hybrid recommendation systems combine content-based and collaborative filtering methods in order to overcome the problems that arise from using a single method. Adomavicius and Tuzhilin [5] identify four ways of combining different recommendation systems. The first one implements collaborative and content-based methods separately and then combines their predictions, using a linear combination of ratings or a voting scheme. The second approach adds content-based characteristics to collaborative models. For example, the content-based model makes it possible to keep the items that are not commonly rated and then use them to compute users’ similarity.

The third approach adds collaborative characteristics to content-based models. Latent semantic indexing and SVD are usually applied to a group of content-based profiles. The last approach develops a single unifying recommendation model. Such models include rule-based classifiers, Bayesian regression models and Markov chain Monte Carlo methods for parameter estimation and prediction, as well as knowledge-based techniques [5].

3.7 OSNs and Recommender Systems

Online social networks (OSNs) enable users to bring their offline social relationships online as well as to make new friends. Popular social network web sites like Facebook and Myspace allow users to keep a personal profile, create and join common groups of interest, tag people in pictures and videos, share posts or web links and comment on posts. All these kinds of interactions contribute to higher connectivity among users and create a breeding ground for the expansion of their social graphs.

In OSNs a user can decide to expand his or her social circle by requesting that another user become part of his or her friend list. The receiver of the friend request is notified by the OSN interface and can choose to accept or reject the pending request. If he or she accepts, there is a mutual agreement that they both wish to be connected in an explicit way. Gradually, social network services have introduced automated ways to facilitate this task as well as to enhance users’ connectivity. So far, recommender systems (collaborative recommender systems) applied to online social networks have exploited users’ friend lists to provide people recommendations based on their common friends.


After the "similarity" of a user with the friends of his or her friends (the potential new friends) is computed, these candidates are ranked based on the number of common friends. In that way, users receive justified recommendations from the OSN, suggesting that a new connection can be made due to the number of mutual friends. This reasoning behind the recommendations makes the recommendation system more trusted by users and more transparent [37].

The task of discovering new links in the social graph of OSN users is referred to as the Link Prediction [55] problem. In effect, recommendation systems try to predict new possible friends, using the friend list as ground truth, and recommend them to users. The majority of research work in Link Prediction infers new interactions among users by exploiting the structure of the friendship graph. However, as has been mentioned, OSN users share several kinds of interactions, such as co-tagging photos, co-commenting on people’s posts or joining the same groups. All these types of implicit common interactions can be used to recommend to users new interesting items to check or join (links, posts, groups), or to recommend new friends who share a common activity and could be friends for that reason. In addition, these implicit relationships can be used to enhance people recommendation in cases where the explicit friendship source is not informative enough. This is the main goal and task of this postgraduate thesis: to study the accuracy and enhancement of the link prediction task in OSNs using multiple explicit and implicit sources, in order to recommend new friends to users.
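The ranking of potential friends by number of common friends can be sketched in a few lines. The friendship lists and the function name below are hypothetical; the returned counts can also serve as the justification shown to the user ("you have 2 mutual friends").

```python
def recommend_friends(friends, user, top_n=3):
    """Rank friends-of-friends of `user` by the number of mutual friends."""
    mutual = {}
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                mutual[candidate] = mutual.get(candidate, 0) + 1
    ranked = sorted(mutual, key=lambda c: (-mutual[c], c))
    return [(c, mutual[c]) for c in ranked[:top_n]]


# Hypothetical friendship lists (undirected, so each link appears twice).
FRIENDS = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}

print(recommend_friends(FRIENDS, "ann"))  # [('dan', 2), ('eve', 1)]
```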


4 Link Prediction

Online social networks (OSNs) have gradually gained high popularity as a result of their easy content sharing capability and their potential for online community formation. The computational analysis of large and complex social networks has attracted a considerable amount of attention [55], due to the fact that users connect with each other, forming different types of relationships and networks. Some such networks allow users to find old friends and make new ones, share digital content and join common groups (items). Other networks include scientists in particular research areas, where a connection between two of them means that they have collaborated in the past by coauthoring scientific articles.

Moreover, social rating networks enable users to form relations with other users or with items such as products. These two networks co-evolve over time, one depicting relationships among users and the other expressing users’ preferences for items. Understanding the underlying mechanism by which these networks evolve allows us to understand them better and to provide new services that help users become better connected to other, similar users or items. Link prediction refers to the task of inferring the new interactions of different types that will occur in the near future, taking into account the existing connections among users and items.


4.1 Mathematical Problem Formulation

The Link Prediction problem can be mathematically formulated as:

Let $V = \{v_i\}_{i=1}^{n}$ be a set of vertices forming a social network depicted by a graph $G = (V, E)$, where $E = \{e_{i,j}\}$, with $|E| = k \le n(n-1)/2$, is the set of observed edges (links) connecting elements of the set V. Then the link prediction task is to infer how likely it is that an unobserved edge $e_{i,j} \notin E$ exists between a random pair of nodes $(v_i, v_j)$ in the network.

Kleinberg et al. [55] approach the Link Prediction problem as a basic computational problem in terms of social-network evolution and define it as: "Given a snapshot of a social network at time t, we seek to accurately predict the edges that will be added to the network during the interval of time t to a given future time t’." Figure 4.1 visualizes a social network in four different snapshots with timestamps t1, t2, t3 and tn, describing its evolution within an interval of time.


Figure 4.1: Visualization of a social network evolution and the link prediction task.


Social networks have dynamic features, since the connections that form them change over time through users’ new interactions. One can easily observe that in the t1 snapshot of the social network there is a particular number of observed edges among users. At the t2 timestamp two new edges arise, depicted in orange. At the next timestamp t3, another two edges arise, this time depicted in red. It is crucial at this point to note that the red edges of t3 have arisen specifically because the orange edges formed at t2 opened new routes for the formation of new edges. This underlying evolutionary mechanism should be understood in order to address the link prediction problem, which here amounts to predicting the black dashed edges at the timestamp t4.
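The task setting can be made concrete with a small sketch: all unobserved pairs in the snapshot at time t are scored, the highest-scoring pairs are predicted, and the prediction is checked against the edges that actually appear by time t’. The number of common neighbors is used as the score purely as a placeholder; the snapshots, the function names and the precision measure are assumptions made for illustration, not the evaluation protocol of the thesis.

```python
from itertools import combinations


def normalise(edges):
    """Store each undirected edge as a sorted tuple inside a set."""
    return {tuple(sorted(e)) for e in edges}


def common_neighbours(adj, u, v):
    return len(adj.get(u, set()) & adj.get(v, set()))


def predict_links(edges_t, top_n):
    """Score every unobserved pair in the snapshot at time t and return
    the top_n pairs, using the number of common neighbours as the score."""
    observed = normalise(edges_t)
    adj = {}
    for u, v in observed:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    candidates = [
        pair for pair in combinations(sorted(adj), 2) if pair not in observed
    ]
    candidates.sort(key=lambda p: (-common_neighbours(adj, *p), p))
    return candidates[:top_n]


def precision(predicted, edges_t, edges_t_prime):
    """Fraction of predicted pairs that really appear between t and t'."""
    new_edges = normalise(edges_t_prime) - normalise(edges_t)
    hits = sum(1 for pair in predicted if pair in new_edges)
    return hits / len(predicted) if predicted else 0.0


# Hypothetical snapshots (made-up data, not the network of Figure 4.1).
SNAPSHOT_T = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("d", "e")]
SNAPSHOT_T_PRIME = SNAPSHOT_T + [("a", "d"), ("c", "d"), ("c", "e")]

predicted = predict_links(SNAPSHOT_T, top_n=2)
print(predicted, precision(predicted, SNAPSHOT_T, SNAPSHOT_T_PRIME))
```

Any of the similarity measures discussed later in this chapter could be substituted for `common_neighbours` without changing the rest of the procedure.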

4.2 Related Work

Several methods try to address the Link Prediction problem, which is applicable to a wide variety of areas such as recommender systems, the co-authorship domain, molecular biology and other fields. Specifically, in recommendation systems and social networks, both supervised and unsupervised methods deal with the link prediction task. In supervised methods, structural features are extracted from the network so that a mapping function can learn from them. Unsupervised methods, on the other hand, use several similarity measures to compute the "proximity" among nodes of a network. These measures consider global or local features of the network, dealing with node or path features.

Supervised learning approaches are based on training a binary classifier to predict whether a link exists between a given pair of nodes or not. This classification task can be carried out with well-known supervised learning algorithms such as k-nearest neighbors, support vector machines (SVM) or decision trees. Link prediction can thus be considered a classification task, where two given nodes in a network should be classified as connected or not. In order for such pairs to be classified, some features or attributes of the nodes should be available. Such attributes can be interests, affiliations, demographic data or number of friends. The features can also be graph-based, such as shortest path length or mean first passage time [25].

There has been a lot of related work in the supervised context. Getoor et al. [26] explored the relational link structure with supervised learning in probabilistic models. Taskar et al. [84] focus on predicting the existence and the type of links between entities in relational data, applying a relational Markov network framework. Moreover, Hasan et al. [35] studied link prediction as a supervised task. They compared well-known classification algorithms such as decision trees, k-NN, SVM and RBF networks, using co-authorship graphs from the scientific publication data of Elsevier BIOBASE and DBLP, in order to predict future links in these social networks. In addition, Sarukkai [79] stated the need for improved link navigation in the WWW and proposed a Markov chain based method for link prediction and path analysis.

4.2.1 Different Link Prediction Tasks

Most of the related research work on link prediction focuses on the link existence task, that is, on whether an unobserved edge between two nodes in a social network will arise in the near future or not. The link existence problem can easily be extended to the problem of link weight or rating prediction, since links with different weights are common in directed graphs. Users of online rating social networks rate products or items, adding weights to the relationships of users with items. In some cases the task is to predict this weight, while in others it is to use the weights assigned by users in order to infer new user-user relationships.

4.2.1.1 Link Prediction

Recommendations in OSNs are mainly based on the friendship graph. Most of the well-known OSNs, like Facebook and Hi5, recommend new friends by exploiting only the links available among users in the friendship list. Link existence prediction tries to infer which new interactions among users of a social network are likely to occur in the near future. There is a variety of local measures, such as the Adamic and Adar index [2, 3], the FOAF algorithm [14], preferential attachment [55] and other approaches. These local methods focus on the node structure of the network. There are also many global approaches to link prediction in a single network, which focus on the overall structure of the network. Such approaches are RWR [75], SimRank [42] and the Katz status index [45].

There are also other methods that address the Link Prediction problem by exploiting other data sources, such as comments on posts among users, co-authored papers or common tagging. For instance, Ido Guy et al. [33] proposed a novel user interface widget in order to provide users with recommendations of other people. Chen et al. [14] evaluated four recommender algorithms (Content Matching, Content-plus-Link, the FOAF algorithm and SONAR) to help users discover new friends on IBM’s OSN. Another, probabilistic method exploiting a single network is that of Clauset et al. [15], who suggested an algorithm based on the hierarchical structure of the network. They statistically fit data from a real network using a hierarchical random graph and relate the connection probability of nodes to their depth within the hierarchy. Missing links can be predicted by ranking the connection probability of nodes in the network. All these approaches are discussed in detail in Section 4.3.

4.2.1.2 Rating Prediction

Item recommendation in social networks is a research area that has been approached by many fields, such as Collaborative Filtering (CF). Most research work in CF has followed two directions, the user-based and the item-based. We first review memory-based CF approaches, which have been used for recommendation in bipartite social networks, such as user-item networks where an item can be a group in a social network or another affiliation. Resnick et al. created the GroupLens system [77], which implemented a CF algorithm based on common user preferences and belongs to the first type, the user-based CF algorithms. The system employed user similarities to form a community of nearest users. Later, a number of improvements to user-based CF algorithms were suggested, such as [13, 36].

As for the second CF direction, the item-based CF algorithm [80, 44] relies on item similarities to generate a neighborhood of nearest items. Herlocker et al. [38] weighted similarities by the number of common ratings between users and items. In addition, Karypis et al. [17] applied an item-based CF algorithm combined with conditional-probability-based similarity and cosine similarity. Recently, Jamali and Ester [40] proposed a matrix factorization technique with trust propagation in order to provide users with item recommendations in social networks. Finally, Li et al. [54] proposed a novel model, called AffRank, that uses a supervised approach with six features (product community size, member connectivity, social context, affinity rank history, evolution distance, and average rating) to predict the future rank of products according to their affinities.

There are several methods [29, 40, 88] that combine information from unipartite and bipartite graphs, building Multi-modal graphs, and focus on the rating prediction problem.
For example, TidalTrust [29] and MoleTrust [62] combine the rating data of CF systems with the unipartite friendship network (the so-called trust data of social networks) in order to improve item recommendation accuracy. In particular, TidalTrust [29] performs a modified breadth-first search (BFS) in the trust (friendship) network to compute a rating prediction, while MoleTrust [62] considers paths of friends up to a user-defined maximum depth.

Recently, Vasuki et al. [88] proposed affiliation-group recommendations based on the friendship network among users and the affiliation network between users and groups. To achieve that, they proposed two different approaches, one based on graph proximity and another using matrix latent factor models. Later, they improved the scalability of their model [89]. Finally, Kunegis et al. [52] specialized general link prediction algorithms to bipartite graphs under the spectral transformation kernel property. They evaluated their construction on rating and authorship graphs, folksonomies, document-feature networks and other types of bipartite networks.

4.3 Classification of Topological Measures for Link Prediction

A basic task of the emerging research field of Link Prediction in social and other networks is the discovery of new interactions that are likely to occur in the near future among network members. So far, there are two main approaches [55] to the link prediction problem, which correspond to two clusters of measures. The first one takes into account the local characteristics of the network, measuring the "proximity" among the nodes of a network. The second one is based on global features of the network and tries to detect its path structure. Both types of methods compute a similarity matrix using a score function score(V_x, V_y) between a pair of nodes V_x and V_y of a graph G. We also denote by Γ(x) and Γ(y) the sets of neighbors of V_x and V_y, respectively.

4.3.1 Node Based Methods

Node-based approaches to the Link Prediction problem consider node similarity in terms of node "proximity". The most popular measures are analyzed below, including common neighbors (FOAF), Jaccard similarity, the Adamic/Adar measure and preferential attachment. All these measures have been widely used in recent research on the Link Prediction problem.

• Common Neighbors - FOAF

The Common Neighbors index, also known as the Friend of a Friend (FOAF) algorithm, is a very popular and simple node-based measure. Many well-known online social networks (OSNs), such as Facebook and Hi5, have adopted it. The basic idea is that a new link is more likely to arise between two nodes V_x and V_y of a graph G if they share many common connections [14, 55].


The score function of Common Neighbors can be defined as:

\[ \mathrm{score}(V_x, V_y) = |\Gamma(x) \cap \Gamma(y)| \tag{4.1} \]

The common neighbors index is related to an effect known from sociology, which suggests that edges in a network are more likely to arise between nodes that share a third one as a common neighbor. The FOAF algorithm traverses paths of length ℓ = 2 between a node V_x and a possible friend V_y.

• Jaccard Similarity

Jaccard's coefficient is a similarity metric that has been used in the past in the information retrieval field. It measures the probability that both V_x and V_y have a feature f, for a feature f chosen at random among those that either V_x or V_y has. Here, the features are taken to be the neighbors in the graph G [55]. The score function of Jaccard's approach is given by Equation 4.2 below:

\[ \mathrm{score}(V_x, V_y) = \frac{|\Gamma(V_x) \cap \Gamma(V_y)|}{|\Gamma(V_x) \cup \Gamma(V_y)|} \tag{4.2} \]

• Adamic/Adar

Adamic and Adar [2, 3] defined a measure similar to the Jaccard similarity in order to relate two personal home pages. This concept can be extended to relate users of online social networks, since both users and webpages are connected via links. The Adamic/Adar measure can be described as a weighted version of common neighbors. Here z ranges over the features (neighbors) that two related webpages or OSN users have in common and Γ(z) is the set of neighbors of z, so that the measure counts common features while weighting rarer features more heavily [55]. To achieve this, they computed features of the webpages and scored pairs using Equation 4.3 below:

\[ \mathrm{score}(V_x, V_y) = \sum_{z \in \Gamma(V_x) \cap \Gamma(V_y)} \frac{1}{\log |\Gamma(z)|} \tag{4.3} \]

• Preferential attachment

Preferential attachment is another similarity measure based on local features of the network and the "proximity" between two nodes. The rapid growth of collaboration, social and other networks has gradually raised the question of scaling in such random networks, and preferential attachment has been one model for addressing it [7]. The basic premise behind the preferential attachment measure is that the probability that a new edge involves a node V_x is proportional to its current number of neighbors, |Γ(V_x)| [55]. Equation 4.4 below gives the score function for preferential attachment. It was proposed by Newman [72] on the basis of empirical evidence that the probability of future co-authorship between nodes V_x and V_y is highly related to the product of the numbers of existing collaborators of these nodes.

\[ \mathrm{score}(V_x, V_y) = |\Gamma(V_x)| \cdot |\Gamma(V_y)| \tag{4.4} \]
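The four node-based scores of Equations 4.1 to 4.4 can be sketched in a few lines of Python. This is only a minimal illustration, assuming an undirected graph given as an edge list; the function names and the toy graph are hypothetical and do not come from the implementations cited above.

```python
import math

def neighbors(edges):
    """Build the neighbor sets Γ(v) of an undirected graph from an edge list."""
    gamma = {}
    for u, v in edges:
        gamma.setdefault(u, set()).add(v)
        gamma.setdefault(v, set()).add(u)
    return gamma

def common_neighbors(gamma, x, y):
    # Equation 4.1: number of shared neighbors
    return len(gamma[x] & gamma[y])

def jaccard(gamma, x, y):
    # Equation 4.2: shared neighbors over all neighbors of either node
    union = gamma[x] | gamma[y]
    return len(gamma[x] & gamma[y]) / len(union) if union else 0.0

def adamic_adar(gamma, x, y):
    # Equation 4.3: a common neighbor z of two distinct nodes has
    # at least two neighbors, so log|Γ(z)| is strictly positive
    return sum(1.0 / math.log(len(gamma[z])) for z in gamma[x] & gamma[y])

def preferential_attachment(gamma, x, y):
    # Equation 4.4: product of the two neighborhood sizes
    return len(gamma[x]) * len(gamma[y])

# Hypothetical toy graph, used only for illustration
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("c", "e")]
gamma = neighbors(edges)
print(common_neighbors(gamma, "a", "d"), jaccard(gamma, "a", "d"))
```

In this toy graph the nodes a and d are not connected but share both of their neighbors, which is exactly the situation in which the local measures suggest a new link.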

4.3.2 Path Based Methods

There are also a number of methods and measures that are path-dependent and take into account the global features of the network [55]. Specifically, these approaches consider the overall structure of the network, taking into account all possible paths of different lengths ℓ between two nodes. The most popular approaches are the Katz status index, the Random Walk with Restart algorithm, the SimRank algorithm, the betweenness centrality measure, hitting time, and the shortest path algorithm.

• Katz Status Index

Leo Katz [45] introduced a status index combining sociometric analysis with basic linear algebra. Katz's method computes influential nodes in a social network by summing over the collection of paths of different lengths ℓ that connect two nodes V_x and V_y. In order to weight the paths of different lengths ℓ, he used an attenuation factor β in such a way that shorter paths connecting two nodes are graded higher than longer ones. This concept is summarized in Equation 4.5 below:

\[ \mathrm{Katz}_{\beta} = \sum_{\ell=1}^{\infty} \beta^{\ell} \, \bigl| \mathrm{paths}^{\ell}_{V_x,V_y} \bigr| \tag{4.5} \]

where paths^ℓ_{V_x,V_y} is the set of all length-ℓ paths from node V_x to V_y, which are computed from the adjacency matrix. It has already been mentioned in subsection 3.2 that the adjacency matrix raised to the power n gives the number of paths of length n connecting two nodes. The Katz measure can be written in the closed form (I − βA)^{-1} − I, where A is the adjacency matrix. The Katz measure can be applied to undirected, directed and weighted networks, depending on the entries of the adjacency matrix. The attenuation factor β is a parameter that, in the context of linear algebra, must take values that ensure that the above series converges and that the inverse (I − βA)^{-1} can be computed. The attenuation factor can take values β < 1/λ, where λ is the largest absolute eigenvalue of matrix A [45, 20]. Very small values of β give relatively higher weights to short paths. This makes the total score function focus the prediction task on the close neighbors of the node.

• Hitting time and PageRank

In a random walk on a graph G, a walker starts at a node V_x and iteratively moves to a neighbor chosen uniformly at random from the set of neighbors Γ(V_x) of its current node [55]. The hitting time H_{V_x,V_y} from a node V_x to V_y is defined as the expected number of steps needed for a random walk starting at V_x to reach V_y. In order to symmetrize the relationship between the two nodes, the commute time is defined as C_{V_x,V_y} = H_{V_x,V_y} + H_{V_y,V_x}. These two measures are path-based proximity measures and can be used alone or in combination. Equation 4.6 gives a normalized version of the hitting time, where π_{V_x} and π_{V_y} are the stationary probabilities of the respective nodes.

\[ \mathrm{score}(V_x, V_y) = -\left( H_{V_x,V_y} \cdot \pi_{V_y} \right) \tag{4.6} \]

The PageRank concept can be adopted for the link prediction task in social networks, since random resets are a basic element of the PageRank measure for web pages. The score function can be defined as the rooted PageRank measure with a parameter α ∈ [0,1]: it is the stationary probability of V_y in a random walk that returns to V_x with probability α at each step. Thus, the probability of moving to a random neighbor is 1 − α.

• SimRank

SimRank [42] is an algorithm based on the idea that two nodes are similar to the extent that they are connected to similar neighbors [55]. The scoring function that computes node proximity thus becomes a recursive similarity function. The similarity of a node with itself is defined as similarity(V_x, V_x) = 1, and a parameter γ ∈ [0,1] is used. The similarity between two different nodes can be quantified using Equation 4.7 below:

\[ \mathrm{score}(V_x, V_y) = \begin{cases} 1, & \text{if } V_x = V_y \\ \gamma \cdot \dfrac{\sum_{a \in \Gamma(V_x)} \sum_{b \in \Gamma(V_y)} \mathrm{score}(a,b)}{|\Gamma(V_x)| \cdot |\Gamma(V_y)|}, & \text{otherwise} \end{cases} \tag{4.7} \]

• Random Walk with Restart (RWR)

RWR [75, 86] is an algorithm based on Markov-chain theory and its applications, producing a model of a random walk through a graph or network. RWR's basic concept relies on a random walker that starts at a node V_x and seeks a connection path to a node V_y. Along this route, whose next edge is randomly selected among the available edges at every step, the walker returns to the starting node with probability c. The similarity matrix among the nodes of a graph can be computed using Equation 4.8 below:

\[ \mathrm{Kernel}_{RWR} = (I - \alpha P)^{-1} \tag{4.8} \]

where I is the identity matrix and P is the transition-probability matrix, whose entries store the probability that the walker moves from one node to another. The term Kernel is derived from linear algebra terminology, referring to the matrix kernel and its properties. Such kernels have the property that the similarity of two nodes increases as the number of paths connecting them increases, and decreases as it decreases.

• Shortest Path - Graph Length

The Shortest Path algorithm calculates the shortest distance between any pair of nodes V_x, V_y in a social network, which forms a similarity measure. There are a number of well-known algorithms which can provide this calculation, such as Dijkstra, Floyd-Warshall [16], or Fredman-Tarjan [21].
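Two of the path-based measures above, the RWR kernel of Equation 4.8 and the SimRank recursion of Equation 4.7, can be sketched with NumPy as follows. This is a hedged illustration rather than the implementation of [75] or [42]: it assumes an undirected graph without isolated nodes, takes P to be the row-normalized adjacency matrix, and solves the SimRank recursion by simple fixed-point iteration. The toy matrix `A_toy` and the parameter values are hypothetical.

```python
import numpy as np

def rwr_kernel(A, alpha):
    """Kernel_RWR = (I - alpha * P)^-1 (Equation 4.8).

    Assumption: P is the row-normalized adjacency matrix and
    every node has at least one neighbor.
    """
    A = np.asarray(A, dtype=float)
    P = A / A.sum(axis=1, keepdims=True)
    return np.linalg.inv(np.eye(len(A)) - alpha * P)

def simrank(A, gamma=0.8, iterations=20):
    """Fixed-point iteration of Equation 4.7 on an undirected graph."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    S = np.eye(len(A))
    for _ in range(iterations):
        # (A @ S @ A.T)[x, y] sums score(a, b) over a in Γ(x), b in Γ(y)
        S = gamma * (A @ S @ A.T) / np.outer(deg, deg)
        np.fill_diagonal(S, 1.0)  # similarity of a node with itself is 1
    return S

# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3
A_toy = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]])
K = rwr_kernel(A_toy, alpha=0.5)
S = simrank(A_toy)
```

Because every row of P sums to 1, every row of the kernel sums to 1/(1 − α), which gives a quick sanity check of the computation.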


4.3.3 Higher Level Approaches

Another category of link prediction approaches is that of higher-level methods, which includes matrix factorization (or low-rank approximation) and clustering [89, 55].

• Low-rank approximation

Low-rank approximation methods work with a relatively small part of a dataset and derive the most important relationships, reducing the dimensions and the amount of processed data. It has already been stated that user relationships in a social network, or connections among nodes in a graph, are represented by the adjacency matrix. Common linear algebra techniques can efficiently factorize a large adjacency matrix in such a way that the most important relationships are highlighted and reduced to smaller dimensions.

Specifically, one such technique is Singular Value Decomposition (SVD), from the family of latent semantic analysis (LSA) methods of the Information Retrieval research area. SVD is a common method that analyzes the structure of a large matrix (in this case an adjacency matrix A) and, for a relatively small number k, computes the rank-k matrix A_k that best approximates the original matrix A. Computations are much easier on the reduced matrix A_k, which has the "noise" filtered out; in this case the noise is the non-existence of an edge. Moreover, large adjacency matrices are very often sparse, meaning that a big part of their entries are 0 and do not contribute to similarity computations.

These kinds of techniques have been widely used for recommending items to users in social networks or Web 2.0 platforms, or for categorizing large corpora. Symeonidis et al. [82] use higher-order SVD in order to recommend posts to users based on common tags that they shared in a political blog. Vasuki et al. [89] use a graph proximity model as well as a latent factor model on a combined matrix, in order to provide affiliation recommendations to users of a social network and YouTube. We now present the latent factor model, where SVD is applied on a combined graph C in order to predict one of the four blocks of the matrix. Equation 4.9 illustrates the combined graph C, which contains the user-user network (A), while (R) holds the user-item bipartite graph. The choice of D varies; it can be D = R^T R, which counts the number of users that share any two items.

\[ C = \begin{pmatrix} A & R \\ R^{T} & D \end{pmatrix} \tag{4.9} \]


• Clustering

Clustering techniques are used in the link prediction task in order to improve the quality of a predictor by not taking into account the "weak" edges of the graph. The prediction takes place in a subgraph of the graph G. The clustering procedure is as follows: a similarity score is computed using one of the above "proximity" measures on the initial graph G and the whole set of edges E connecting the nodes. Then the (1 − ρ) fraction of the edges with the lowest computed score is deleted, for a parameter ρ ∈ [0,1]. Finally, the similarity score is recomputed using the remaining subset of edges, which are the ones held with the most confidence [55].
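Both higher-level approaches can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions, not the method of [89] or [55]: the combined matrix uses the choice D = R^T R mentioned above, the rank-k approximation is a plain truncated SVD, the pruning step keeps the ρ fraction of highest-scoring edges of an unweighted undirected graph, and the toy matrices and function names are hypothetical.

```python
import numpy as np

def combined_matrix(A, R):
    """C = [[A, R], [R^T, D]] (Equation 4.9) with the choice D = R^T R."""
    A = np.asarray(A, dtype=float)
    R = np.asarray(R, dtype=float)
    return np.block([[A, R], [R.T, R.T @ R]])

def low_rank_approximation(M, k):
    """Rank-k approximation of M obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(np.asarray(M, dtype=float), full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def prune_weak_edges(A, score, rho):
    """Delete the (1 - rho) fraction of edges with the lowest score."""
    A = np.asarray(A, dtype=float)
    score = np.asarray(score, dtype=float)
    rows, cols = np.nonzero(np.triu(A, k=1))  # each undirected edge once
    n_keep = int(round(rho * len(rows)))
    keep = np.argsort(-score[rows, cols], kind="stable")[:n_keep]
    pruned = np.zeros_like(A)
    pruned[rows[keep], cols[keep]] = 1.0
    return pruned + pruned.T

# Hypothetical toy data: 4 users, 2 items
A_toy = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]])
R_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])

C = combined_matrix(A_toy, R_toy)
C_2 = low_rank_approximation(C, k=2)
# Common-neighbor counts (A @ A) serve as the proximity score here
A_pruned = prune_weak_edges(A_toy, score=A_toy @ A_toy, rho=0.75)
```

After pruning, the similarity measure of choice would be recomputed on `A_pruned` instead of the original adjacency matrix, as described in the clustering procedure.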

4.4 Experimental Configuration

The experimental configuration for the link prediction problem is designed in a way similar to that of data mining techniques. First, data sets of social or other networks should be collected. Web mining provides tools such as web crawlers, which are employed to complete this task. Crawlers are automated software agents that explore the world wide web, following the hyperlinks that exist in web pages. After crawlers locate their target links, they can extract information and store it in a database, enabling indexing and searching.

Online social networks and other collaboration networks include users and different types of edges that connect them. There is a variety of network datasets, such as OSN, communication (email), citation, collaboration (coauthoring), web graphs, rating websites, roads, and others. Stanford's large network data set collection, SNAP¹, is a webpage dedicated to data sets commonly used by the academic community, so that different methods and approaches can be tested on a common, comparable basis. Information is stored in a graph structure, where users are the nodes and the interactions among them are the edges. Edges contained in data sets can be directed, undirected or even carry a weight, which can be a rating from a user to another user or from a user to an item.

Collecting data sets is a basic task, but the most crucial part of experimental planning is the data and experimental setup. Data should be separated into two different subsets. The first one is used to compute the similarity measures of the unsupervised approaches and to learn the functions of the supervised methods. The second subset is then used for the evaluation of the link prediction task.

¹ http://snap.stanford.edu/data


In a strict definition of the experimental process, the following assumptions can be made. Suppose that G = (V, E) is a social network, in which each edge e = (v_x, v_y) ∈ E represents an interaction between v_x and v_y at a particular timestamp t_n. As depicted in Figure 4.1, four timestamps t0 < t1 < t2 < t3 are chosen and an algorithm is given access to the network G[t0,t1]. The output of the algorithm is a list of edges not present in this time interval, but predicted to appear in G[t2,t3]. The first time interval, G[t0,t1], is defined as the training interval and G[t2,t3] as the test interval [55].

OSNs grow and evolve between different time intervals through the addition of new users and of new interactions of different types among them. The prediction of these links should be applied to nodes that exist in both the training and the test interval. Thus, preprocessing of the original data set is needed in order to exclude the nodes that are not incident to edges in both time intervals. Therefore, in the link prediction evaluation process, two different sets are defined, the training set and the test set. The task is to accurately predict the new edges that appear in the test set but not in the training set.
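The splitting procedure described above can be sketched as follows. This is a minimal illustration that assumes interactions are given as (user, user, timestamp) triples of an undirected network; the function name, the variable names and the toy interactions are hypothetical.

```python
def split_by_time(timed_edges, t0, t1, t2, t3):
    """Return (training edges, new test edges, core nodes).

    Training edges lie in [t0, t1]; test edges lie in [t2, t3].
    Only nodes incident to edges in both intervals are kept.
    """
    def edge(u, v):
        return frozenset((u, v))  # undirected: (u, v) equals (v, u)

    train = {edge(u, v) for u, v, t in timed_edges if t0 <= t <= t1}
    test = {edge(u, v) for u, v, t in timed_edges if t2 <= t <= t3}

    # Nodes that appear in both the training and the test interval
    core = set().union(*train) & set().union(*test)

    train = {e for e in train if e <= core}
    # Edges to predict: present in the test interval, absent in training
    new_edges = {e for e in test if e <= core and e not in train}
    return train, new_edges, core

# Hypothetical toy interactions (user, user, timestamp)
interactions = [("a", "b", 1), ("b", "c", 2),
                ("a", "c", 5), ("c", "d", 6), ("b", "c", 6)]
train, new_edges, core = split_by_time(interactions, 0, 3, 4, 7)
```

In this toy example the edge between c and d is excluded from evaluation because d does not appear in the training interval, and the repeated interaction between b and c is not counted as a new edge.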


5 Methodology

In this chapter, we describe the methodology we propose in order to provide enhanced friend recommendations to users in OSNs. In the first section we describe the overall proposed approach, while in the next three sections we study the recommendation process analytically, through a toy example concerning a unipartite and a bipartite network combined in a Multi-modal network.

5.1 Proposed Approach

Most of the related work focuses on link prediction on the unipartite network for friend recommendations in OSNs. Users build this explicit social network by adding each other as friends during the evolution of the network in different time intervals. On the other hand, CF and bipartite link prediction have been used for user affiliation link prediction or for rating prediction, when users co-share a group in OSNs, co-purchase products or co-rate movies in social rating networks. However, there has been little previous work on exploiting the implicit user-item network to predict new edges among users through items. It is crucial to use this implicit information in order to enhance friendship recommendation when the friendship source is not informative enough.

For this task we employ a well-known global measure for the computation of "proximity" among users in OSNs, the Katz status index [45]. As already mentioned in the previous chapter, the Katz status index is a path-dependent method, which can be calculated as the weighted sum of the numbers of paths of varying lengths connecting two nodes. First, the Katz measure is computed for the explicit unipartite friendship network and recommendations are produced for each user. Next, we transform the user-item bipartite network into a new user-user network and apply the Katz measure again. Finally, we create a combined Multi-modal graph of the two distinct networks and apply the Katz, combined Katz and truncated Katz versions, in order to obtain user recommendations in a unified framework.

5.2 Link Prediction based on User-User Adjacency Matrix

Let G be a graph with a set of nodes V and a set of edges E. Every edge is defined by a specific pair of graph nodes (v_i, v_j), where v_i, v_j ∈ V. We assume that the graph G is undirected and un-weighted, thus the graph edges do not have any weights and the order of the nodes in an edge is not important. Therefore, (v_i, v_j) and (v_j, v_i) denote the same edge of G. We also assume that G cannot have multiple edges connecting two nodes, thus if two nodes v_i, v_j are connected by an edge of E, there cannot exist another edge in E also connecting them. Finally, we assume that there cannot be loop edges in G (i.e. a node cannot be connected to itself).


Figure 5.1: Example of a unipartite friendship Social Network.

A common graph representation is the adjacency matrix A. It is an n × n matrix, where n = |V| is the number of nodes in G. Therefore, it has n rows and n columns labelled by the graph nodes. For an un-weighted graph without multiple edges (such as G), the adjacency matrix values are set as follows:

\[ A[v_i, v_j] = \begin{cases} 1, & \text{if } (v_i, v_j) \in E \\ 0, & \text{if } (v_i, v_j) \notin E \end{cases} \]

Following all previous assumptions and definitions, the adjacency matrix of an undirected and un-weighted graph such as G is a symmetric matrix whose values are 1 if two nodes are neighbors and 0 otherwise. In addition, as there are no loop edges, the main diagonal of the matrix has zero values. The adjacency matrix of the friendship network of our running example is depicted in Table 5.1.

Table 5.1: Running example: User-User Adjacency matrix A.

      U1   U2   U3   U4
U1    0    0    0    1
U2    0    0    1    ?
U3    0    1    0    ?
U4    1    ?    ?    0
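As an illustration, the adjacency matrix of Table 5.1 can be built from the friendship edges of Figure 5.1 with a few lines of NumPy. The sketch below is hypothetical helper code; it assumes that the unknown "?" entries are treated as 0, i.e. as links that do not exist yet.

```python
import numpy as np

users = ["U1", "U2", "U3", "U4"]
friendships = [("U1", "U4"), ("U2", "U3")]  # edges of the running example

def adjacency_matrix(nodes, edges):
    """Adjacency matrix of an undirected, un-weighted graph without loops."""
    index = {v: i for i, v in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)), dtype=int)
    for u, v in edges:
        if u == v:
            raise ValueError("loop edges are not allowed")
        # (u, v) and (v, u) denote the same edge, so A stays symmetric
        A[index[u], index[v]] = 1
        A[index[v], index[u]] = 1
    return A

A = adjacency_matrix(users, friendships)
print(A)
```

Setting both A[i, j] and A[j, i] enforces the symmetry assumed for undirected graphs, and rejecting loop edges keeps the main diagonal at zero.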

Since we want to investigate the relations marked with ?, we assume that they are initially equal to 0 (i.e. there are no connections between the corresponding users). It is clear from Figure 5.1 and its corresponding adjacency matrix A in Table 5.1 that U1 is connected with U4 and U2 with U3. Let us assume in our running example that we want to propose new friends to user U4.

There is a variety of global similarity measures [55] (e.g. the Katz status index, the RWR algorithm, the SimRank algorithm) for analyzing the "proximity" of nodes in a network, which are path-dependent. We adopt the Katz status index [45], calculated as the weighted sum of the numbers of paths of varying length n connecting each pair of graph nodes:

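\[ \mathrm{Katz}(A;\beta) = \beta A + \beta^{2} A^{2} + \beta^{3} A^{3} + \dots = \sum_{i=1}^{\infty} \beta^{i} A^{i} \tag{5.1} \]

which is also factorized as:

\[ \sum_{i=1}^{\infty} \beta^{i} A^{i} = (I - \beta A)^{-1} - I \tag{5.2} \]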
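The closed form of Equation 5.2 follows from the geometric (Neumann) series for matrices. A short derivation is sketched below, assuming that β < 1/λ, where λ is the largest absolute eigenvalue of A, so that the powers β^N A^N vanish as N grows:

```latex
\begin{aligned}
(I - \beta A) \sum_{i=0}^{N} \beta^{i} A^{i}
  &= I - \beta^{N+1} A^{N+1}
  && \text{(telescoping sum)} \\
\lim_{N \to \infty} \beta^{N+1} A^{N+1}
  &= 0
  && \text{(since } \beta \lambda < 1 \text{)} \\
\sum_{i=0}^{\infty} \beta^{i} A^{i}
  &= (I - \beta A)^{-1} \\
\sum_{i=1}^{\infty} \beta^{i} A^{i}
  &= (I - \beta A)^{-1} - I
  && \text{(removing the } i = 0 \text{ term, which is } I \text{)}
\end{aligned}
```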

The entries of the adjacency matrix raised to the n-th power count the n-length paths in the unipartite graph. An attenuation factor β < 1/λ, where λ is the largest absolute eigenvalue of matrix A [45, 20], is chosen so that the series converges. The factor β^n weights the n-length paths so that shorter paths are weighted higher than longer ones. The Katz measure is thus formed in a way that very long paths contribute less to the final node similarity. This contribution becomes extremely small for very high powers of the matrix, whereas the computational cost increases. We therefore define a truncated Katz measure, so that we do not have to take very long paths into account. Equation 5.3 below defines the truncated Katz measure, where A is the source adjacency matrix, β is the attenuation factor, and k is the maximum path length that Katz takes into account in the similarity computation.

\[ \mathrm{trKatz}(A;\beta;k) = \sum_{i=1}^{k} \beta^{i} A^{i} \tag{5.3} \]

Below, a 2-length path is denoted in the unipartite network; it exists if and only if U_i is adjacent to U_j and U_j is adjacent to U_k:

\[ user_i \xrightarrow{A} user_j \xrightarrow{A} user_k \]

We then apply Katz(A;0.05) to the unipartite friendship graph G, in order to provide recommendations based on the induced similarity matrix. The parameter β is set to 0.05 after computing the eigenvalues of the adjacency matrix A, which are 1 and −1. The largest absolute eigenvalue is |1| = 1, so β can take values β < 1/1. Usually, β is chosen considerably smaller than the allowed bound, so that paths of greater length are taken less into consideration. In this toy example there is a small number of users and edges and there are no paths of greater length, so this value of β fits well.

The entries of the user-user similarity matrix capture the friendship relationships in the unipartite social network. Very small values are observed because the unipartite network holds only 1-length paths. The rows of the similarity matrix show the "proximity" among users. There is no clear indication from the similarity matrix that U4 should befriend U2 or U3, since the entries of both cells (U4,U2) and (U4,U3) are zero. As we can observe, the unipartite friendship network is not informative enough to provide an accurate recommendation to U4. Thus, in the next section we try to capture similarities through common items that users share. Table 5.2 presents the similarities captured by Katz for the unipartite friendship network.

Table 5.2: User-User similarity matrix based on adjacency matrix A.

      U1     U2     U3     U4
U1    0      0      0      0.05
U2    0      0      0.05   0
U3    0      0.05   0      0
U4    0.05   0      0      0
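The computation behind Table 5.2 can be sketched with NumPy as follows. The functions `katz` and `truncated_katz` are hypothetical helper names that implement Equations 5.2 and 5.3 directly; zeroing the main diagonal of the result is an assumption made here so that self-similarities are ignored, as in the table.

```python
import numpy as np

def katz(M, beta):
    """Closed form of Equation 5.2: (I - beta*M)^-1 - I."""
    M = np.asarray(M, dtype=float)
    lam = np.max(np.abs(np.linalg.eigvals(M)))  # largest absolute eigenvalue
    if beta * lam >= 1:
        raise ValueError("beta must be smaller than 1/lambda")
    I = np.eye(len(M))
    return np.linalg.inv(I - beta * M) - I

def truncated_katz(M, beta, k):
    """Equation 5.3: sum of beta^i * M^i for i = 1..k."""
    M = np.asarray(M, dtype=float)
    total = np.zeros_like(M)
    power = np.eye(len(M))
    for i in range(1, k + 1):
        power = power @ M          # M^i
        total += beta**i * power
    return total

# Adjacency matrix A of Table 5.1, with the "?" entries set to 0
A = np.array([[0, 0, 0, 1],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]])

S = katz(A, beta=0.05)
np.fill_diagonal(S, 0.0)  # assumption: self-similarity is not of interest
print(np.round(S, 4))
```

The entries (U4,U2) and (U4,U3) remain zero because no path of any length connects U4 to U2 or U3 in the friendship network alone.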


5.3 Link Prediction based on User-Item Bi-Adjacency Matrix

Users can also form several implicit social networks [88] through their daily interactions, like co-commenting on people's posts, co-rating products, and co-tagging people's photos. These implicit relations form a network that contains edges between two types of entities (vertices in a graph), such as a User-Item bipartite graph.

Let G' = (V + W, E) be a bipartite graph with two sets of nodes V and W, and a set of edges E. Every edge is defined by a specific pair of graph nodes (v_i, w_j), where v_i ∈ V belongs to the set of users and w_j ∈ W to the set of items.


Figure 5.2: Example of a bipartite Social Network.

Following the notation of the unipartite adjacency matrix, a bi-adjacency matrix R can be written as:

\[ R[v_i, w_j] = \begin{cases} 1, & \text{if } (v_i, w_j) \in E \\ 0, & \text{if } (v_i, w_j) \notin E \end{cases} \]

We extend our running example by affiliating users with items, as depicted in Figure 5.2. The resulting bi-adjacency matrix R is given in Table 5.3.

Table 5.3: Running example: User-Item matrix R.

      I1   I2
U1    1    1
U2    1    1
U3    0    1
U4    1    0


Our main task remains the friend recommendation for U4, this time using the auxiliary user-item matrix R, since the unipartite friendship network is not informative enough. Each nonzero entry of R represents a 1-length path from a user U_i to an item I_j. Using the simple linear algebra operation of multiplying the matrix R with its transpose R^T, we transform these paths into user-user paths that carry the information contributed through items. We define a matrix B as a new unipartite user-user adjacency matrix induced from B(U_i, U_j) = R(U_i × I_j) × R^T(I_j × U_i). Its entries count 2-length paths of the following type, and it is depicted in Table 5.4:

\[ user \xrightarrow{R} item \xrightarrow{R^{T}} user \]

To avoid self loops, the main diagonal of matrix B is set to zero, although its entries would otherwise count the number of items each user is connected to. The user-user adjacency matrix B can also be seen as an undirected weighted graph, whose entries (weights) represent the number of items two users share.

Table 5.4: Running example: User-User Adjacency B = R × R^T.

      U1   U2   U3   U4
U1    0    2    1    1
U2    2    0    1    1
U3    1    1    0    0
U4    1    1    0    0

Katz(B;0.05) is next applied to user-user adjacency matrix B, in order to obtain a new similarity matrix derived from the user-item auxiliary source.

Table 5.5: User-User similarity matrix based on adjacency matrix B.

      U1       U2       U3       U4
U1    0        0.1073   0.0562   0.0562
U2    0.1073   0        0.0562   0.0562
U3    0.0562   0.0562   0        0.0056
U4    0.0562   0.0562   0.0056   0
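The steps of this section, from the bi-adjacency matrix R to the similarity matrix of Table 5.5, can be sketched with NumPy as follows. The variable names are hypothetical, and zeroing the diagonal of the final similarity matrix is an assumption made to match the zero diagonal shown in the table.

```python
import numpy as np

# Bi-adjacency matrix R of Table 5.3 (rows: U1..U4, columns: I1, I2)
R = np.array([[1, 1],
              [1, 1],
              [0, 1],
              [1, 0]])

# User-user matrix induced through items: B = R x R^T (Table 5.4)
B = R @ R.T
np.fill_diagonal(B, 0)  # avoid self loops

# Katz(B; 0.05) in the closed form (I - beta*B)^-1 - I
beta = 0.05
I = np.eye(len(B))
S_B = np.linalg.inv(I - beta * B) - I
np.fill_diagonal(S_B, 0.0)  # assumption: self-similarity is ignored

print(B)
print(np.round(S_B, 4))
```

Unlike in the friendship network, U4 now obtains non-zero scores for both U2 and U3, because 2-length paths through the shared items connect them.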

In Table 5.5, the 4th row of the similarity matrix indicates clearly that user U4 should befriend user U2 and not U3, since their similarity values are 0.0562 > 0.0056, respectively.


5.4 Link Prediction based on Multi-modal Graph

In this section, we present the approach of combining the multiple heterogeneous sources of the unipartite User-User graph and the bipartite User-Item graph. These two graphs are combined in the multi-modal graph shown in Figure 5.3 below. This approach enables recommendations to be made in a unified way, opening new paths for users to connect across two distinct sets: users and items. Similarity among users results from both the explicit user-user friendship network and the implicit user-item network. Therefore, when the friendship network fails to capture the similarity between two users, the auxiliary user-item network can be used for this task, and vice versa.


Figure 5.3: Example of a Multi-modal Social Network.

We now use a combined form of the Katz measure that includes both the unipartite and the bipartite network. The unipartite friendship network is the explicit network, while the bipartite user-item network, transformed to an auxiliary weighted user-user network, is the implicit one. It is possible to compute several types of paths of varying lengths that combine these two networks. For example, we can consider types of paths that use both sources and are not available when only a single source is used. In general, for paths of length n we get 2^n combinations of the sources A and B. For instance, for paths of length 2, four (2^2 = 4) different types of paths can be formed, such as:

\[ user_i \xrightarrow{A} user_j \xrightarrow{B} user_k \]

while for paths of total length 3 there are eight (2^3 = 8) different source combinations, such as:

\[ user_i \xrightarrow{A} user_j \xrightarrow{B} user_k \xrightarrow{A} user_m \]

Figure 5.4 visualizes all the combinations of the sources (networks) A and B for paths of length 1, 2, and 3.

Length = 1 A B

Length = 2 AA AB BA BB

Length = 3 AAA AAB ABA ABB BAA BAB BBA BBB

Figure 5.4: Illustration of all the possible combined paths.
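The combinations shown in Figure 5.4 can be enumerated programmatically. The following is a small sketch using the Python standard library; `path_types` is a hypothetical helper name, and a string such as "AB" stands for one edge from source A followed by one edge from source B.

```python
from itertools import product

def path_types(n, sources=("A", "B")):
    """All source sequences for paths of length n (2^n for two sources)."""
    return ["".join(p) for p in product(sources, repeat=n)]

for n in (1, 2, 3):
    print(f"Length = {n}:", " ".join(path_types(n)))
```

The number of path types doubles with every additional step, which is one reason why very long paths quickly become expensive to enumerate explicitly.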

Such path combinations are computed by the combined Katz measure for two sources of information, for paths of any length. Equation 5.4 below gives the mathematical formulation:

\[ \mathrm{Katz}(A+B;\beta) = \sum_{i=1}^{\infty} \beta^{i} (A+B)^{i} \tag{5.4} \]

We now present an analytical form of the combined Katz series for path lengths 1, 2, 3 up to n.

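\[
\begin{aligned}
\mathrm{Katz}(A+B;\beta) ={}& \beta (A + B) + \beta^{2} (A^{2} + B^{2} + AB + BA) \\
&+ \beta^{3} (A^{3} + B^{3} + A^{2}B + ABA + AB^{2} + B^{2}A + BAB + BA^{2}) \\
&+ \dots \\
&+ \beta^{n} (A^{n} + B^{n} + A^{n-1}B + B^{n-1}A + \dots) \\
={}& \sum_{i=1}^{\infty} \beta^{i} (A+B)^{i}
\end{aligned}
\]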
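A short worked step may clarify why the mixed terms appear in both orders. Since matrix multiplication is not commutative in general, the products AB and BA are distinct path types and both must be kept when the power is expanded. The closed form in the second line is the standard Neumann series identity; it is given here as a sketch under the assumption that β is positive and smaller than the reciprocal of the spectral radius ρ(M) of M = A + B, which is the kind of condition on β that Section 6.1 relies on. The thesis does not state which method was used to evaluate the infinite sum.

```latex
\[
(A+B)^{2} = (A+B)(A+B) = A^{2} + AB + BA + B^{2},
\qquad AB \neq BA \text{ in general}
\]
\[
\sum_{i=1}^{\infty} \beta^{i} M^{i} = (I - \beta M)^{-1} - I,
\qquad M = A + B, \quad 0 < \beta < \frac{1}{\rho(M)}
\]
```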

Following this notation, Katz(A+B; 0.05) is computed for our toy example and presented in the similarity matrix of Table 5.6. The 4th row of the combined similarity matrix suggests unambiguously that U4 should be recommended user U2 as a friend rather than U3, since their similarity values are 0.0627 > 0.0117, respectively.

Table 5.6: User-User similarity matrix based on Multi-modal graph.

      U1      U2      U3      U4
U1    0       0.1142  0.0627  0.1082
U2    0.1142  0       0.1082  0.0627
U3    0.0627  0.1082  0       0.0117
U4    0.1082  0.0627  0.0117  0
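Reading a recommendation off a row of Table 5.6 can be sketched as follows. The similarity values are copied from the table; the function `recommend` is a hypothetical helper, and the assumption that U1 is already a friend of U4 (and is therefore excluded from the candidates) is made only for this illustration.

```python
# Combined Katz similarities copied from Table 5.6.
USERS = ["U1", "U2", "U3", "U4"]
SIMILARITY = [
    [0.0, 0.1142, 0.0627, 0.1082],
    [0.1142, 0.0, 0.1082, 0.0627],
    [0.0627, 0.1082, 0.0, 0.0117],
    [0.1082, 0.0627, 0.0117, 0.0],
]


def recommend(target, existing_friends, k):
    """Return the k most similar users that are neither the target nor friends."""
    row = SIMILARITY[USERS.index(target)]
    candidates = [
        (score, user)
        for user, score in zip(USERS, row)
        if user != target and user not in existing_friends
    ]
    # Highest similarity first.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [user for _, user in candidates[:k]]


# Assumption for illustration: U1 is already a friend of U4.
print(recommend("U4", {"U1"}, k=2))
```

Under that assumption the ranking places U2 before U3, in agreement with the reading of the 4th row given above.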

We also define truncated versions of Katz in order to avoid computing paths of excessive length, since beyond a particular length paths are no longer relevant or informative enough. In addition, long paths incur an excessive cost in computation time and space. The two equations that follow correspond to the truncated Katz for two combined sources with maximum path length 2 and 3.

\[ \mathrm{trKatz}(A+B;\beta;2) = \beta (A + B) + \beta^{2} (A^{2} + B^{2} + AB + BA) \tag{5.5} \]

In Equation 5.5 above, we compute only the possible paths of length 1 and 2, using two information sources: the explicit (A) and the implicit (B) network.

\[
\begin{aligned}
\mathrm{trKatz}(A+B;\beta;3) ={}& \beta (A + B) + \beta^{2} (A^{2} + B^{2} + AB + BA) \\
&+ \beta^{3} (A^{3} + B^{3} + A^{2}B + ABA + AB^{2} + B^{2}A + BAB + BA^{2})
\end{aligned}
\tag{5.6}
\]

In Equation 5.6 above, we compute only the possible paths up to length 3, using the same two information sources, the explicit (A) and the implicit (B) network. It is easy to observe that more paths are now taken into account: all those considered by tr_cKatz2 plus those of length 3.
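The truncated sums of Equations 5.5 and 5.6 can be sketched in plain Python as follows. This is an illustrative sketch rather than the Matlab code used in the thesis; the helper names and the 3-user matrices A and B are hypothetical. It relies on the fact that summing the powers of A + B produces all the mixed products listed in the equations.

```python
def mat_add(X, Y):
    """Element-wise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def mat_mul(X, Y):
    """Standard matrix product of X and Y."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def mat_scale(c, X):
    """Multiply every entry of X by the scalar c."""
    return [[c * x for x in row] for row in X]


def truncated_katz(A, B, beta, max_len):
    """Sum of beta**i * (A + B)**i for i = 1..max_len."""
    M = mat_add(A, B)
    power = M
    total = mat_scale(beta, M)
    for i in range(2, max_len + 1):
        power = mat_mul(power, M)
        total = mat_add(total, mat_scale(beta ** i, power))
    return total


# Hypothetical toy data: U1-U2 are explicit friends (A),
# U2-U3 share an implicit link of weight 2 (B).
A = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
B = [[0, 0, 0], [0, 0, 2], [0, 2, 0]]
beta = 0.05

S2 = truncated_katz(A, B, beta, 2)
S3 = truncated_katz(A, B, beta, 3)
print(S2[0][2], S3[0][1])
```

In this toy case U1 and U3 are not linked in either network, yet S2 gives them a similarity of approximately 0.005. It comes entirely from the mixed AB path U1 → U2 → U3, which neither source could provide on its own.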


6 Experimental Evaluation and Results

In this chapter, we proceed with the experimental evaluation of the approach described in the previous chapter. As already stated, it is crucial for the link prediction problem in OSNs to take into account both the explicit and the implicit interactions that occur among users, in order to provide better recommendations. As also mentioned, our approach adopts the path-dependent Katz status index to compute "node" proximity over paths of varying length. In order to investigate the degree of enhancement that the implicit network provides as an auxiliary information source, our experimental evaluation focuses on the following four extensions of the Katz measure:

• sKatz(A;β), which refers to the single-source (explicit only) Katz measure, for infinite path length.

• cKatz(A+B;β), which refers to the combined-source (explicit and implicit) Katz measure, for infinite path length.

• tr_cKatz2(A+B;β;2), which refers to the truncated combined-source (explicit and implicit) Katz measure, for paths of length up to 2.

• tr_cKatz3(A+B;β;3), which refers to the truncated combined-source (explicit and implicit) Katz measure, for paths of length up to 3.

In the first section, we provide information about the type, the collection and the pre-processing of the data set. Next, the experimental protocol and setup are presented. Finally, the results are presented and extensively discussed in the last section. Our experiments were performed on a 2.8 GHz Pentium IV with 2 GB of memory. All algorithms and the data set pre-processing were implemented in Matlab. The experimental process was also supported by the Visual Studio Basic tool. Source code is available in the Appendix.

6.1 xSocial Synthetic Data Set

In this section, we present the data set that has been used for the evaluation process. In addition, we describe the pre-processing, carried out in Matlab, that sets up the experimental process and evaluation.

The xSocial Generator, proposed by Du, Wang and Faloutsos [18], is a multi-modal graph generator that mimics real social networking sites, simultaneously forming a network of friends and a network of their co-participation. In other words, an explicit friendship network and an implicit one are formed. In particular, xSocial consists of a network with N nodes, each of which has a preference value $f_i$. At each time step, every node (agent) performs three independent actions:

1. write a message

2. add a friend

3. comment on a message

A node chooses its friends in two different ways. First, it can select a friend based on the number of messages on which the two have both commented, which is determined by its preference $f_i$. Second, a user can follow the status updates of his friends by adding comments on their newly written messages. Further details about xSocial can be found in [18], and the executable is downloadable here¹.

¹ http://research.nokia.com/people/hao_ui_wang/index.html

The xSocial Generator produces the explicit network, called the friendship (or buddy) network, and the implicit one, the co-commenting network on users' posts. The first is an unweighted friendship network whose edges correspond to friendship relationships between users. In the co-participation network, an edge between two users indicates that they have commented on the same post at least once. There is therefore no information about exactly which post they have commented on. This could be a problem for our approach, since the user-item bipartite matrix is not available, but in practice it is not, since we directly obtain the transformed adjacency matrix B described in Section 5.3. The only difference is that B is now an unweighted, undirected adjacency matrix, where 1 denotes the existence of the implicit relationship and 0 its absence. Finally, the xSocial Generator also allows us to partially control the density of the edges that arise during the formation of the networks. This is achieved through an argument that is available in the Unix console when we run the generator. This simulation argument defines for how many simulation steps the whole network will evolve. A high value of the argument results in denser networks.

We run xSocial and create a social network of 1000 users interacting for 100,000 time steps. Multiple text (.txt) files are created, describing the formation of the social network and the interactions among users. We select only two of these files, those of the friendship and the co-commenting networks. These files contain edges, whose existence represents the explicit friendship relationship and the implicit co-commenting relationship, respectively. In the pre-processing stage that follows, we import the files into Matlab and transform the edges of the two networks into the corresponding adjacency matrices of ones and zeros. As mentioned in Section 4.4 of Chapter 4 and pointed out in Section 6.2, the train and probe sets must not contain common data. We divide the friendship network into train and test sets by placing edges alternately, one in the test set and one in the train set. The implicit network is kept as it is, preserving all of its information. Since our target is to enhance friendship recommendation, we make predictions on the friendship network while keeping the whole of the auxiliary source.
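The pre-processing described above can be sketched as follows. This is a minimal illustration in Python, not the Matlab code of the Appendix: the helper names and the four-user edge list are hypothetical, and the choice to send the first edge to the train set is an assumption, since the text only states that edges are placed alternately.

```python
def split_alternately(edges):
    """Place edges alternately in the train and test sets.

    Assumption: the first edge goes to the train set.
    """
    train = edges[0::2]
    test = edges[1::2]
    return train, test


def to_adjacency(edges, n):
    """Build a symmetric 0/1 adjacency matrix for an undirected graph on n nodes."""
    matrix = [[0] * n for _ in range(n)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix


# Hypothetical friendship edge list over 4 users (0-based ids).
friendship_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

train_edges, test_edges = split_alternately(friendship_edges)
A_train = to_adjacency(train_edges, 4)
A_test = to_adjacency(test_edges, 4)
print(train_edges, test_edges)
```

Because each edge lands in exactly one of the two lists, the train and test sets are disjoint by construction, which is the requirement stated for the train and probe sets.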
After the train and probe sets have been defined, we proceed with the computation of the similarity matrices using the four aforementioned Katz measures. Before that, we need to define the parameter β, which must satisfy β < 1/λ (where λ is the largest absolute eigenvalue of the matrix in question) to ensure that the Katz series converges. We compute the eigenvalues of the adjacency matrices A and B, as well as of the train and test adjacency matrices of A. A single value of β must be valid for all the matrices used in the computations. The smallest of the resulting bounds is 0.01, so we require β < 0.01. We choose a considerably smaller value, β = 0.0005, both to ensure that the Katz series converges and to weight short paths more heavily. Finally, we compute the 4 similarity matrices of size 1000 × 1000. Each row or column depicts the "proximity" between a user and the remaining 999 users.

Next, we compute a number of metrics that measure basic topological properties of the data set, in order to obtain information about sparsity, clustering coefficient and other properties of small-world networks discussed in Section 3.1 of Chapter 3.


• Network graph type: the type of the network can be undirected, directed or weighted.

• N: total number of nodes

• E: total number of edges

• ASD: average shortest path distance between node pairs

• ADEG: average node degree

• LCC: average local clustering coefficient

• GD: graph diameter (maximum shortest path distance)

• MAXD: maximum node degree
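The metrics listed above can be sketched for a small undirected graph as follows. This is an illustrative sketch and not the code used to produce the reported values. The conventions are assumptions, since the thesis does not state which were used: nodes of degree below 2 receive a local clustering coefficient of 0, and ASD and GD are taken over connected node pairs only. The four-node graph is hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def topological_properties(adj):
    """adj maps each node to the set of its neighbours (undirected graph)."""
    n = len(adj)
    degrees = {u: len(nbrs) for u, nbrs in adj.items()}

    # Local clustering: fraction of neighbour pairs that are themselves linked.
    lcc_values = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            lcc_values.append(0.0)
            continue
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        lcc_values.append(2.0 * links / (k * (k - 1)))

    # Shortest path lengths over all connected ordered pairs of distinct nodes.
    lengths = [d for u in adj for v, d in bfs_distances(adj, u).items() if v != u]

    return {
        "N": n,
        "E": sum(degrees.values()) // 2,
        "ADEG": sum(degrees.values()) / n,
        "MAXD": max(degrees.values()),
        "LCC": sum(lcc_values) / n,
        "ASD": sum(lengths) / len(lengths),
        "GD": max(lengths),
    }


# Hypothetical graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
props = topological_properties(graph)
print(props)
```

A breadth-first search per node is sufficient here because the graph is unweighted; for a network of 1000 nodes this remains inexpensive.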

In Table 6.1 below, topological properties of friendship and co-comment networks are presented:

Table 6.1: Topological properties of xSocial 1K friendship and co-comment networks.

Data Set                      Type        N     E     ASD   ADEG  LCC   GD  MAXD
Friendship (A)                Undirected  1000  4578  2.10  6.06  0.20  7   256
Co-comment (B), B = R * R^T   Undirected  1000  1473  0.13  1.95  0.05  8   60

As shown, the unipartite friendship network has a small average shortest path distance of 2.10 and a relatively high average node degree of 6.06. This means that most users are connected by paths of about length 2 and have, on average, 6 friends. This combination indicates that the friendship network belongs to the category of small-world networks. Small-world networks include sub-networks with connections between almost any pair of their users. This feature is measured by the local clustering coefficient (LCC); the friendship network has a value of 0.20, showing that there is a significant number of such subgroups. The co-comment network, which is also unipartite since it results from the transformation of the bipartite network, has a small ASD value of 0.13. This indicates that only a small number of users have commented on the same post. The average node degree of the co-comment network is 1.95, meaning that a user has co-commented with approximately 2 other users on average.


6.2 Experimental Setup

In this section, we present the evaluation method that has been followed as well as the performance measures used to evaluate the experimental process.

6.2.1 Evaluation Method

Our evaluation divides the friends of each target user into two sets. The first is the training set $E^T$, which is treated as known information. The second is the probe (test) set $E^P$, which is used for testing. No information in the probe set is allowed to be used for the prediction task. Clearly, $E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. Therefore, for a target user we generate recommendations based only on the friends in $E^T$.

After the division of the data set into train and probe sets, we proceed with the computation of the similarity matrices for each Katz extension. We evaluate each of the four Katz algorithms with respect to the number of provided recommendations and the percentage of observed links of the train set. Specifically, we evaluate the performance of the algorithms using the precision, recall and AUC measures, while the number of provided recommendations takes the values 1, 2, 3 and 4. In addition, the percentage of observed links of the train set that is used takes the values 0.2, 0.4, 0.6, 0.8 and 1. For the overall average performance evaluation of each algorithm, only the 0.6 percentage is used.

6.2.2 Performance Measures

Precision and Recall, widely used in the fields of Information Retrieval and Data Mining, are employed as performance measures for friend recommendations. We assume two classes, one positive and one negative. The positive class represents the prediction that an edge between two users will exist in the near future, that is, a friendship recommendation. The negative class represents the absence of a predicted link between two users, and thus the absence of a recommendation. Let the true positives be the users in the top-k list that have been correctly predicted to be recommended as friends, and let the true negatives be the users that have been correctly predicted not to be recommended as friends. Moreover, let the false positives be the users that have been wrongly predicted to be recommended as friends, and let the false negatives be the users that have been wrongly predicted not to be recommended as friends although they should have been.


For a user in the test set (a test user) who receives a list of k recommended friends (a top-k list), we define precision and recall as follows:

• Precision

Precision is the ratio of the number of relevant users in the top-k list to k. Relevant users are those in the top-k list that belong to the probe set $E^P$ of friends of the target user.

\[ \text{precision} = \frac{|\text{true\_positive}|}{|\text{true\_positive}| + |\text{false\_positive}|} \tag{6.1} \]

• Recall

Recall is the ratio of the number of relevant users in the top-k list to the total number of relevant users, that is, all friends in the probe (test) set $E^P$ of the target user.

\[ \text{recall} = \frac{|\text{true\_positive}|}{|\text{true\_positive}| + |\text{false\_negative}|} \tag{6.2} \]

Precision and recall tend to exhibit an inverse relationship, as depicted in Figure 6.1. For instance, in the link prediction problem, as we recommend a greater number of friends to each user by increasing k in the top-k list, we obtain a higher value of recall, while precision decreases. This is because with a greater number of recommendations we may infer more of the possible future connections, but we almost certainly also retrieve a greater number of irrelevant recommendations.

Figure 6.1: Typical Precision-Recall curve
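Equations 6.1 and 6.2 can be illustrated for a single target user with the following sketch. The function name, the similarity scores and the friend sets are hypothetical; the only assumption beyond the definitions above is that users already in the training set are excluded from the ranked list.

```python
def precision_recall_at_k(scores, train_friends, probe_friends, k):
    """Precision and recall of the top-k list for one target user.

    scores maps each candidate user to its similarity with the target.
    Users already in the training set are never recommended.
    """
    ranked = sorted(
        (user for user in scores if user not in train_friends),
        key=lambda user: scores[user],
        reverse=True,
    )
    top_k = ranked[:k]
    hits = sum(1 for user in top_k if user in probe_friends)
    precision = hits / k
    recall = hits / len(probe_friends) if probe_friends else 0.0
    return precision, recall


# Hypothetical similarity scores of one target user to five other users.
scores = {"b": 0.9, "c": 0.7, "d": 0.4, "e": 0.2, "f": 0.1}
train = {"b"}       # known friends (training set)
probe = {"c", "e"}  # hidden friends to be predicted (probe set)

for k in (1, 2, 3, 4):
    print(k, precision_recall_at_k(scores, train, probe, k))
```

In this toy case precision falls from 1.0 at k = 1 to 0.5 at k = 4, while recall rises from 0.5 to 1.0, which is the trade-off described above.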


Link prediction is particularly concerned with the precision metric, without ignoring recall. There is always a balance point between these two metrics. The best results are expected when the curve approaches the upper right area, taking high values in both metrics.

Moreover, we employ the AUC statistic to quantify the accuracy of the prediction algorithms and to test how much better they perform than pure chance. This measure has been used in several related works on link prediction [15]. The AUC measure represents the probability that a randomly chosen missing link, which is part of the probe set $E^P$, is given a higher similarity value than a randomly chosen non-existent link, which belongs to $U - E^T$, where U is the universal set.

• Area Under Curve (AUC)

We use the following notation to define the AUC measure in Equation 6.3: among n independent comparisons, if the missing link has a higher similarity value n′ times, and the missing link and the non-existent link have the same similarity value n″ times, we get:

\[ AUC = \frac{n' + 0.5\, n''}{n} \tag{6.3} \]

If all similarity values were generated from an independent and identical distribution, the AUC would be about 0.5. Thus, the degree to which the AUC exceeds 0.5 indicates how much better the algorithm performs than pure chance.
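Equation 6.3 can be sketched as follows. Instead of drawing n random comparisons, this illustrative version compares every missing link with every non-existent link, which is an assumption made to keep the example deterministic. The function name and the score lists are hypothetical.

```python
from itertools import product


def auc(missing_scores, nonexistent_scores):
    """AUC of Equation 6.3 over all (missing, non-existent) score pairs."""
    n_higher = 0  # n': the missing link has the higher similarity
    n_equal = 0   # n'': both links have the same similarity
    n_total = 0   # n: number of comparisons
    for s_missing, s_nonexistent in product(missing_scores, nonexistent_scores):
        n_total += 1
        if s_missing > s_nonexistent:
            n_higher += 1
        elif s_missing == s_nonexistent:
            n_equal += 1
    return (n_higher + 0.5 * n_equal) / n_total


# Hypothetical similarity values of probe links and of non-existent links.
print(auc([0.9, 0.4], [0.4, 0.1]))
```

Of the four comparisons in this example, three favour the missing link and one is a tie, so the value is (3 + 0.5) / 4 = 0.875. With many candidate pairs, sampling n comparisons at random, as in the definition above, avoids enumerating every pair.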


6.3 Results and Discussion

In this section, we present the results of the experimental evaluation of the four algorithms. We computed four variants of the Katz measure, covering the single source and the combined explicit and implicit sources with different path lengths (infinite, 2 and 3). Our main purpose is to examine whether, and to what degree, the use of auxiliary implicit information enhances user recommendations. Table 6.2 presents the results for Precision and Recall, for a fraction of observed edges equal to 0.6 and for a number of recommended friends varying from 1 to 4.

Table 6.2: Precision/Recall(%) curve comparing tr_cKatz3, tr_cKatz2, cKatz, sKatz.

                 tr_cKatz3          tr_cKatz2          cKatz              sKatz
# Recom. Fr.     Prec(%)  Rec(%)    Prec(%)  Rec(%)    Prec(%)  Rec(%)    Prec(%)  Rec(%)
1                50       0.1310    41.66    0.1092    33.33    0.0873    25       0.0783
2                25.83    0.2284    20.83    0.2184    18.75    0.1965    16.25    0.1565
3                17.96    0.3558    13.88    0.3276    12.96    0.3098    2.77     0.3081
4                12.93    0.4587    10.41    0.4368    9.89     0.4150    1.56     0.3950

Figure 6.2: Comparison of tr_cKatz3, tr_cKatz2, cKatz, sKatz in terms of Precision/Recall(%).

Figure 6.2 visualizes the Precision/Recall(%) curve. As we can observe, the tr_cKatz3 algorithm outperforms the other three algorithms in terms of precision and recall for all numbers (1, 2, 3, 4) of recommended friends.

Specifically, tr_cKatz3 achieves a precision of 50% when the number of recommended friends equals 1. For the same number of friends, tr_cKatz2, cKatz and sKatz score 41.66%, 33.33% and 25%, respectively. The tr_cKatz3 algorithm takes into account both the friendship and the co-comment network, considering paths of length up to 3. The second best performance in terms of both precision and recall is achieved by tr_cKatz2, which also uses information from both networks to provide predictions, but considers only paths of length up to 2.

In addition, it is remarkable that the two truncated versions of the combined Katz outperform the cKatz algorithm, which takes into account paths of infinite length (after a particular path length the series converges, so very long paths stop contributing to the similarity values). The cKatz algorithm also combines information from the two networks, but it clearly fails to capture similarities as effectively. Finally, sKatz presents the worst performance, since it uses only one source of information to provide recommendations. These first observations give a clear indication that, in the three cases where Katz uses an auxiliary source of information, the link prediction task is significantly enhanced.


Figure 6.3: Precision(%) diagram compared to the # of Recommended Friends.

Figure 6.3 visualizes precision against the number of recommended friends, while Figure 6.4 shows recall against the number of recommended friends. We observe a common feature for all the algorithms: they all present lower values of precision and higher values of recall as the number of recommended friends increases. This has a reasonable explanation: as k in the top-k list increases, we ask for a greater number of users to be recommended. Thus, a greater number of relevant users may be returned, but a greater number of irrelevant users is returned as well, resulting in worse precision. In the precision diagram of Figure 6.3, it is easy to see that the best performance for all algorithms is achieved when the number of recommended friends equals 1 and the worst when it equals 4. Unlike precision, recall achieves its highest scores for all algorithms when the number of recommended friends equals 4 and its lowest when it equals 1. In addition, tr_cKatz3 performs better than the other three algorithms for all numbers (1, 2, 3, 4) of recommended friends. The second best performance is achieved by tr_cKatz2 and the third by cKatz, for all numbers of friend recommendations. All three of these Katz extensions consider two different sources of information and outperform sKatz, which considers only a single source (the friendship network).


Figure 6.4: Recall(%) diagram compared to the # of Recommended Friends.

The AUC statistic is also used in the evaluation process. AUC takes into account the overall ability of an algorithm to rank all the missing connections above the non-existent ones. In other words, AUC is not concerned only with the links that are easiest to predict. Table 6.3 brings together the AUC scores of the four algorithms when the fraction of observed edges is 0.6.

Table 6.3: Comparison of tr_cKatz3, tr_cKatz2, cKatz, sKatz in terms of AUC(%) statistic.

                   tr_cKatz3  tr_cKatz2  cKatz    sKatz
# Recom. Friends   AUC(%)     AUC(%)     AUC(%)   AUC(%)
1                  75         70         66.67    62.50
2                  65.41      60.41      59.37    53.12
3                  58.48      56.94      56.48    51.38
4                  56.46      55.46      54.94    50.78


Figure 6.5: AUC(%) statistic compared to the # of Recommended Friends.

Figure 6.5 gives an overall view of the AUC statistic against the number of recommended friends. In general, tr_cKatz3 outperforms the other three algorithms, achieving the highest scores for every number of recommended friends. In particular, when the number of recommended friends equals 1, tr_cKatz3 scores 75%. In general terms, tr_cKatz2 performs second best, but its performance approaches that of cKatz for greater numbers of friend recommendations. In addition, cKatz is consistently behind the two truncated Katz algorithms. The sKatz algorithm performs worst overall, since it depends on a single source of information. Lastly, we notice that the best scores for all algorithms are achieved when the number of recommended friends equals 1, and that the scores gradually fall as we recommend more users.

We therefore take the best-performing setting of the AUC statistic for all algorithms, in which the number of recommended friends equals 1, and let the fraction of observed edges of the training set vary from 0.2 to 1. Pure chance yields 0.5 for every fraction of observed edges. Figure 6.6 plots AUC against the fraction of observed links used in the training set of our data set, for all algorithms, and Table 6.4 lists the corresponding values. We now obtain a clear view of the overall performance of all algorithms when only one friend is recommended.

Table 6.4: Comparison of tr_cKatz3, tr_cKatz2, cKatz, sKatz AUC statistic vs. Fraction of Observed Edges.

                         tr_cKatz3  tr_cKatz2  cKatz  sKatz
Fraction of Edges Obs.   AUC        AUC        AUC    AUC
1                        0.81       0.77       0.75   0.69
0.8                      0.78       0.75       0.70   0.64
0.6                      0.75       0.70       0.66   0.62
0.4                      0.67       0.64       0.61   0.55
0.2                      0.66       0.62       0.60   0.53


Figure 6.6: AUC statistic compared to the Fraction of Edges observed.

In terms of AUC, it is clear that tr_cKatz3 performs best overall when it uses the full train set information for a single recommendation. In that case tr_cKatz3 scores 81%, while sKatz scores 69% (Table 6.4). Even when only 20% of the training set is available, tr_cKatz3 achieves a score of 66%, while the single-source sKatz reaches only 53%.

The experimental evaluation of the four algorithms, which are based on the global Katz status index, was conducted in terms of precision, recall and the AUC statistic. The results indicate that combining an auxiliary source with the explicit friendship network enhances the link prediction task, providing more accurate recommendations. Additionally, the truncated methods, which consider only paths of length up to 3 and 2, outperform the infinite-length variants, which prove to be less effective and more costly. Finally, we experimentally verified that searching for possible friends in a social network using paths of length 3 yields better recommendations than the length 2 commonly used by many OSNs.


7 Conclusion and Future Work

Online social networks (OSNs) have become very popular since the introduction and rapid evolution of Web 2.0 technologies. Users are able to share content, express opinions and sentiments, and expand their social circle by establishing new friendships. Online social rating networks (SRNs) are also a significant part of Web 2.0: users collaboratively rate items and share opinions, leading to better-informed purchases and to a personalization of the online information and items that users come into contact with.

OSNs like Facebook and MySpace recommend new friends to users based on the existing explicit friendship network that they have already formed. The recommender systems of OSNs use Link Prediction approaches to predict new friendships that are likely to occur in the near future. The majority of earlier work in Link Prediction uses only the explicit friendship network, a single source of information, to provide recommendations. However, users of OSNs also form various implicit relationships, such as co-commenting on posts, co-tagging content such as photos, or co-sharing a group. These interactions can provide valuable auxiliary information to enhance the recommendation of users or items when the friendship network fails to capture similarities among users.

Previous work has used such auxiliary information only to provide users with item recommendations. We focused on using the implicit user-item network to derive relationships among users and to expand their social circle through commonly shared items. By comparing the single-source Katz status index, a path-based global measure, against several combined extensions of it, we showed that beyond the friendship network there is potential in exploiting implicit sources to enhance the recommendation of users to users.

As future work, both path-based and node-based methods should be studied for recommending users through several shared items. A growing number of OSNs provide such networks for experimental evaluation. There is also little work on recommending items in SRNs using not only the ratings but also the explicit friendship network. Most approaches for social rating networks take into account only the ratings that users give to items such as products. There is potential in using the friendship network to predict ratings and recommend products.

Link Prediction approaches and Data Mining will continue to play a central role on the Web, since volumes of information are shared online at an increasing rate. Users will need more powerful tools and services in order to manage information efficiently. The exploration and exploitation of multiple sources of information in a unified framework may be the key to more accurate recommender systems.

Bibliography

[1] L. Adamic. The Political Blogosphere and the 2004 U.S. Election: Divided They Blog. Methodology, pages 36–43, 2005.

[2] L. Adamic and E. Adar. Friends and neighbors on the web. Social Networks, 25:211–230, 2003.

[3] L. Adamic and E. Adar. How to search a social network. Social Networks, 27(3):187–203, July 2005.

[4] L. Adamic, R. Lukose, A. Puniyani, and B. Huberman. Search in power-law networks. Phys. Rev. E, 64(4):046135, Sep 2001.

[5] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.

[6] L. Amaral, A. Scala, M. Barthélémy, and H. Stanley. Classes of small-world networks. Proceedings of the National Academy of Sciences of the United States of America, 97(21):11149–11152, October 2000.

[7] A. Barabasi and R. Albert. Emergence of scaling in random networks. Science (New York, N.Y.), 286(5439):509–512, October 1999.

[8] A. Barabasi and E. Bonabeau. Scale-free networks. Sci. Am., 288(5):50–59, 2003.

[9] P. Bonacich. Power and Centrality: A Family of Measures. The American Journal of Sociology, 92(5):1170–1182, March 1987.

[10] J. Bondy and U. Murty. Graph Theory. Springer Publishing Company, Incorporated, 1st edition, 2008.

[11] S. Borgatti. Centrality and network flow. Social Networks, 27(1):55–71, 2005.

[12] D. Boyd and N. Ellison. Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication, 13(1):210–230, October 2007.

[13] J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proc. Conf. on Uncertainty in Artificial Intelligence, pages 43–52, 1998.

[14] J. Chen, W. Geyer, C. Dugan, M. Muller, and I. Guy. Make new friends, but keep the old: recommending people on social networking sites. In CHI ’09: Proceedings of the 27th international conference on Human factors in computing systems, pages 201–210, 2009.

[15] A. Clauset, C. Moore, and M. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98–101, May 2008.

[16] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms (MIT Electrical Engineering and Computer Science). The MIT Press, June 1990.

[17] M. Deshpande and G. Karypis. Item-based top-n recommendation algorithms. ACM Trans. on Information Systems, 22(1):143–177, 2004.

[18] N. Du, H. Wang, and C. Faloutsos. Analysis of large multi-modal social networks: Patterns and a generator. In ECML/PKDD (1), pages 393–408, 2010.

[19] T. Elliman, A. Macintosh, and Z. Irani. Argument maps as policy memories for informed deliberation: A research note. In European and Mediterranean Conference on Information Systems, pages 1–7. Citeseer, 2006.

[20] K. Foster, S. Muth, J. Potterat, and R. Rothenberg. A faster Katz status score algorithm. Computational & Mathematical Organization Theory, 7(4):275–285, 2001.

[21] M. Fredman and R. Tarjan. Fibonacci heaps and their uses in improved network optimization algorithms. J. ACM, 34:596–615, July 1987.

[22] L. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35–41, 1977.

[23] L. Freeman, S. Borgatti, and D. White. Centrality in valued graphs: a measure of betweenness based on network flow. Social Networks, 13(2):141–154, 1991.

[24] B. Gay and E. Loubier. Dynamics and evolution patterns of business networks. In Proceedings of the 2009 International Conference on Advances in Social Network Analysis and Mining, pages 290–295, Washington, DC, USA, 2009. IEEE Computer Society.

[25] L. Getoor and C. Diehl. Link mining: a survey. SIGKDD Explor. Newsl., 7:3–12, December 2005.

[26] L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3:679–707, 2002.

[27] A. Go, R. Bhayani, and L. Huang. Twitter Sentiment Classification using Distant Supervision. Technical report, Stanford University.

[28] S. Goel, R. Muhamad, and D. Watts. Social search in "Small-World" experiments. Proceedings of the 18th international conference on World wide web - WWW ’09, page 701, 2009.

[29] J. Golbeck. Personalizing applications through integration of inferred trust values in semantic web-based social networks. In Semantic Network Analysis Workshop at the 4th International Semantic Web Conference, 2005.

[30] A. Goldberg and C. Harrelson. Computing the shortest path: A* search meets graph theory. Technical Report MSR-TR-2004-24, 2004.

[31] S. Golder and B. Huberman. The structure of collaborative tagging systems. Jour- nal of Information Science, 32(2):198–208, April 2006.

[32] S. Golder and B. Huberman. Usage patterns of collaborative tagging systems. J. Inf. Sci., 32:198–208, April 2006.

[33] I. Guy, I. Ronen, and E. Wilcox. Do you know?: recommending people to invite into your social network. In Proceedings 13th International Conference on Intelligent User Interfaces (IUI), pages 77–86, 2009.

[34] H. Halpin, V. Robu, and H. Shepherd. The complex dynamics of collaborative tagging. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 211–220, New York, NY, USA, 2007. ACM.

[35] M. Hasan, V. Chaoji, S. Salem, and M. Zaki. Link Prediction using Supervised Learning. New York.

83 BIBLIOGRAPHY

[36] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proc. ACM SIGIR Conf., pages 230–237, 1999.

[37] J. Herlocker, J. Konstan, and J. Riedl. Explaining collaborative filtering recommen- dations. Proceedings of the 2000 ACM conference on Computer supported cooperative work - CSCW ’00, pages 241–250, 2000.

[38] J. Herlocker, J. Konstan, and J. Riedl. An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Information Retrieval, 5(4):287–310, 2002.

[39] M. Hu, A. Sun, and E. Lim. Comments-oriented blog summarization by sentence extraction. Proceedings of the sixteenth ACM conference on Conference on information and knowledge management - CIKM ’07, page 901, 2007.

[40] M. Jamali and M. Ester. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the fourth ACM conference on Recommender systems, pages 135–142, 2010.

[41] S. Jamali and H. Rangwala. Digging Digg: Comment Mining, Popularity Predic- tion, and Social Network Analysis. 2009 International Conference on Web Information Systems and Mining, pages 32–38, November 2009.

[42] G. Jeh and J. Widom. SimRank: a measure of structural-context similarity. In Pro- ceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’02, pages 538–543, New York, NY, USA, 2002. ACM.

[43] G. Karypis. Evaluation of Item-Based Top- N Recommendation Algorithms. Pro- ceedings of the tenth international conference on Information and knowledge management - CIKM’01, page 247, 2001.

[44] G. Karypis. Evaluation of item-based top-n recommendation algorithms. In Proc. ACM CIKM Conf., pages 247–254, 2001.

[45] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, March 1953.

[46] J. Kleinberg. The small-world phenomenon: An algorithmic perspective. In An- nual ACM symposium on theory of computing, volume 32, pages 163–170. Citeseer, 2000.

[47] D. Knoke and J. Kuklinski. Network analysis. Number v. 28; v. 1982 in Quantitative applications in the social sciences. Sage Publications, 1982.

84 BIBLIOGRAPHY

[48] D. Knoke and S. Yang. Social network analysis. Quantitative applications in the social sciences. Sage Publications, 2008.

[49] F. Kokkoras, E. Lampridou, K. Ntonas, and I. Vlahavas. Mopis: A multiple opin- ion summarizer. In Proceedings of the 5th Hellenic conference on Artificial Intelligence: Theories, Models and Applications, SETN ’08, pages 110–122, Berlin, Heidelberg, 2008. Springer-Verlag.

[50] G. Kolaczek. An approach to identity theft detection using social network anal- ysis. In Proceedings of the 2009 First Asian Conference on Intelligent Information and Database Systems, pages 78–81, Washington, DC, USA, 2009. IEEE Computer Soci- ety.

[51] K. Kotis, A. Papasalouros, N. Pappas, K. Zoumpatianos, and G. Vouros. e-class in ontology engineering: integrating ontologies to argumentation and semantic wiki technology. In K. Kotis, editor, Workshop on Intelligent and Innovative Sup- port for Collaborative Learning Activities (WIISCOLA), 8th International Conference on Computer Supported Collaborative Learning, 09/2009 2009.

[52] J. Kunegis, E. De Luca, and S. Albayrak. The link prediction problem in bipartite networks. In Proceedings of the Computational intelligence for knowledge-based systems design, and 13th international conference on Information processing and management of uncertainty, IPMU’10, pages 380–389, Berlin, Heidelberg, 2010. Springer-Verlag.

[53] B. Leuf and W. Cunningham. The Wiki Way: Quick Collaboration on the Web. Addison-Wesley Professional, April 2001.

[54] H. Li, S. Bhowmick, and A. Sun. Affrank: Affinity-driven ranking of products in online social rating networks. Journal of the American Society for Information Science and Technology.

[55] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social net- works. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.

[56] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7:76–80, January 2003.

[57] C. Liu. Elements of discrete mathematics (McGraw-Hill computer science series). McGraw-Hill Pub. Co., 1977.

85 BIBLIOGRAPHY

[58] Y. Luo and C. Hsu. An empirical study of research collaboration using social net- work analysis. In Proceedings of the 2009 International Conference on Computational Science and Engineering - Volume 04, pages 921–926, Washington, DC, USA, 2009. IEEE Computer Society.

[59] A. Macintosh. E-democracy and e-participation research in europe. In Hsinchun Chen, Lawrence Brandt, Valerie Gregg, Roland Traunmuller, Sharon Dawes, Ed- uard Hovy, Ann Macintosh, and Catherine A. Larson, editors, Digital Government, volume 17 of Integrated Series in Information Systems, pages 85–102. Springer US, 2008.

[60] A. Macintosh, T. Gordon, and A. Renton. Providing argument support for e- participation. Journal of Information Technology Politics, 6(1):43–59, 2009.

[61] R. Malouf and T. Mullen. Graph-based user classification for informal online po- litical discourse. In Proceedings of the 1st Workshop on Information Credibility on the Web. Citeseer, 2007.

[62] P. Massa and P. Avesani. Trust-aware collaborative filtering for recommender sys- tems. In In Proc. of Federated Int. Conference On The Move to Meaningful Internet: CoopIS, DOA, ODBASE, pages 492–508, 2004.

[63] A. Mathes. Folksonomies - cooperative classification and communication through shared metadata, 2004.

[64] A. Mcafee. Enterprise 2.0: The Dawn of Emergent Collaboration. MIT Sloan Man- agement Review, 47(3):21–28, 2006.

[65] S. Milgram. The small world problem. Psychology Today, 22:61–67, 1967.

[66] B. Miller, I. Albert, S. Lam, J. Konstan, and J. Riedl. Movielens unplugged: expe- riences with an occasionally connected recommender system. In Proceedings of the 8th international conference on Intelligent user interfaces, IUI ’03, pages 263–266, New York, NY, USA, 2003. ACM.

[67] G. Mishne and N. Glance. Leave a reply: An analysis of weblog comments. In Third annual workshop on the Weblogging ecosystem. Citeseer, 2006.

[68] A. Mislove, M. Marcon, K. Gummadi, P. Druschel, and B. Bhattacharjee. Measure- ment and analysis of online social networks. Proceedings of the 7th ACM SIGCOMM conference on Internet measurement - IMC ’07, page 29, 2007.

86 BIBLIOGRAPHY

[69] J. Moody, D. Mcfarland, and S. Bender-Demoll. Dynamic Network Visualization. American Journal of Sociology, 110(4):1206–1241, January 2005.

[70] T. Mullen and R. Malouf. A preliminary investigation into sentiment analysis of informal political discourse. In AAAI symposium on computational approaches to analysing weblogs (AAAI-CAAW), pages 159–162, 2006.

[71] S. Mullins and A. Dolnik. An exploratory, dynamic application of Social Network Analysis for modelling the development of Islamist terror-cells in the West. Be- havioral Sciences of Terrorism and Political Aggression, 2(1):3, 2010.

[72] M. Newman. Clustering and preferential attachment in growing networks. Phys. Rev. E, 64, 2001.

[73] T. O’Reilly. What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software.

[74] K. Oyvind. Political parties on Web 2 . 0 : The Norwegian Case *. Focus, (September):0–27, 2009.

[75] J. Pan, H. Yang, C. Faloutsos, and P. Duygulu. Automatic multimedia cross-modal correlation discovery. In KDD ’04: Proceedings of the 10th ACM SIGKDD interna- tional conference on Knowledge discovery and data mining, pages 653–658, 2004.

[76] B. Pang and L. Lee. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2:1–135, January 2008.

[77] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. Grouplens: An open architecture for collaborative filtering on netnews. In Proc. Conf. Computer Supported Collaborative Work, pages 175–186, 1994.

[78] J. Rose and O. Saebo. Designing Deliberation Systems. The Information Society, 26(3):228–240, May 2010.

[79] R Sarukkai. Link prediction and path analysis using Markov chains. Computer Networks, 33(1-6):377–386, June 2000.

[80] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proc. WWW Conf., pages 285–295, 2001.

[81] X. Su and T. Khoshgoftaar. A Survey of Collaborative Filtering Techniques. Ad- vances in Artificial Intelligence, 2009(Section 3):1–20, 2009.

87 BIBLIOGRAPHY

[82] P. Symeonidis and A. Deligiaouri. Recommending Posts in Political Blogs based on Tensor Dimensionality Reduction. Communication, 2010.

[83] P. Symeonidis, A. Nanopoulos, and Y. Manolopoulos. A unified framework for providing recommendations in social tagging systems based on ternary semantic analysis. IEEE Transactions on Knowledge and Data Engineering (to appear), 2009.

[84] B. Taskar, M. Wong, P. Abbeel, and D. Koller. Link Prediction in Relational Data. In in Neural Information Processing Systems, 2003.

[85] J. Thom-Santelli, M. Muller, and D. Millen. Social tagging roles: publishers, evan- gelists, leaders. In Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, CHI ’08, pages 1041–1044, New York, NY, USA, 2008. ACM.

[86] H. Tong, C. Faloutsos, and J. Pan. Fast random walk with restart and its applica- tions. In ICDM ’06: Proceedings of the 6th International Conference on Data Mining, pages 613–622. IEEE Computer Society, 2006.

[87] A. Vakali and Y. Kompatsiaris. Detecting and Understanding Web communities Yiannis Kompatsiaris. Technology, 1999.

[88] V. Vasuki, N. Natarajan, Z. Lu, and I. Dhillon. Affiliation Recommendation using Auxiliary Networks. Science, pages 103–110, 2010.

[89] V. Vasuki, N. Natarajan, Z. Lu, B. Savas, and I. Dhillon. Scalable affiliation recom- mendation using auxiliary networks. Submitted to ACM TIST, pages 1–17, 2010.

[90] S. Wasserman and K. Faust. Social Network Analysis. Methods and Applications. 1994.

[91] T. Yano, W. Cohen, and N. Smith. Predicting response to political blog posts with topic models. Proceedings of Human Language Technologies: The 2009 Annual Confer- ence of the North American Chapter of the Association for Computational Linguistics on - NAACL ’09, page 477, 2009.

[92] B. Yu, S. Kaufmann, and D. Diermeier. Exploring the characteristics of opinion ex- pressions for political opinion classification. In Proceedings of the 2008 international conference on Digital government research, pages 82–91. Digital Government Society of North America, 2008.

A Appendix

A.1 xSocial Generator

The xSocial generator provided the synthetic data set. The xSocial code is compiled on Redhat Linux AS4 64-bit, so to run it we used an Intel-based (64-bit compatible) MacBook with a virtual machine.

Documentation provided by xSocial: xSocial is a multi-modal graph generator to mimic the formation and co-evolution of multiple weighted social networks emerging from different relations simultaneously. The code is compiled on Redhat Linux AS4 64-bit.

SYNOPSIS
xSocial [the number of agents] [simulation times]

For example, to generate 100,000 agents for 10,000 simulations, just type "./xSocial 100000 10000".

xSocial generates the following files:

1. weighted_buddy_network.txt(node id1, node id2, weight) : edges represent friendship between agents, and the edge weight is the total number of times that the two agents comment on each other's posts.
2. network_buddy.txt(node id1, node id2) : unweighted buddy network with edges corresponding to friendship between agents.
3. network_comment.txt(node id1, node id2) : edges represent comment relation between agents.
4. network_cocomment.txt(node id1, node id2) : there is an edge between two agents if they have once commented on the same post.
5. Distribution_Comment.txt(#comments, fraction) : pdf of the number of comments people have put.
6. Distribution_post_comments.txt(#comments, fraction) : pdf of the number of comments that posts have received.
7. Distribution_user_posts.txt(#posts, fraction) : pdf of the number of posts that users have made.
8. Buddy2Cocomment.txt(#cocomments, probability of being friends) : CPF
9. components_comment_network.txt(time_tick, size of the 1st largest connected component, size of the 2nd largest connected component, size of the 3rd largest connected component) : evolution of connected components in comment network
10. components_weighted_buddy_network.txt(time_tick, size of the 1st largest connected component, size of the 2nd largest connected component, size of the 3rd largest connected component) : evolution of connected components in weighted buddy network
11. directory "snapshot_comment_network" comprises the snapshots of the comment network for the first 1000 time steps.
12. directory "snapshot_weighted_buddy_network" comprises the snapshots of the weighted buddy network for the first 1000 time steps.

The "analysis" file is a tool to produce other existing patterns on weighted and unweighted networks.

SYNOPSIS
analysis [network_file] [w or u]

w is for weighted network, u is for unweighted network.

For example, running "./analysis weighted_buddy_network w" will generate the following files:

weighted_buddy_network_CDPL.txt(degree, average number of maximal cliques) : Clique Degree Power-Law
weighted_buddy_network_CPL.txt(number of maximal cliques, fraction) : Clique Participation Law
weighted_buddy_network_components.txt(size of components, fraction) : pdf of connected components
weighted_buddy_network_EdgeWeightDistribution.txt(edge weight, fraction) : pdf of edge weight
weighted_buddy_network_NodeWeightDistribution.txt(node weight, fraction) : pdf of node weight
weighted_buddy_network_DegreeDistribution.txt(degree, fraction) : pdf of degree

A.2 Matlab Code

The Matlab source code pre-processes the data sets (division into train and test sets) and computes the similarity matrices for each algorithm extension.
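The train and test division can also be sketched in C++. The Matlab listing in this section assigns the edges of network_buddy alternately: edges at an even (1-based) position go to the train set and the rest to the test set. The sketch below mirrors that rule on a plain edge list; the names `Edge`, `Split` and `alternateSplit` are hypothetical and only serve the illustration.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

struct Split {
    std::vector<Edge> train;
    std::vector<Edge> test;
};

// Alternates edges between the two sets by 1-based position:
// even positions go to train, odd positions go to test.
Split alternateSplit(const std::vector<Edge>& edges) {
    Split s;
    for (std::size_t k = 0; k < edges.size(); ++k) {
        std::size_t position = k + 1;  // Matlab-style 1-based index
        if (position % 2 == 0) {
            s.train.push_back(edges[k]);
        } else {
            s.test.push_back(edges[k]);
        }
    }
    return s;
}
```

With an odd number of edges the test set receives one edge more than the train set, because the first position is odd.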

% Build the friendship adjacency matrix A (A_Adj) from network_buddy

M=1000;
A=zeros(M,M);
for i=1:size(network_buddy,1)
    index1=network_buddy(i,1);
    index2=network_buddy(i,2);
    % node ids are 0-based, Matlab indices are 1-based
    A(index1+1,index2+1)=1;
    A(index2+1,index1+1)=1;
end
clear index1 index2 ;
dlmwrite('A.txt',A, ' ');

% Build the co-comment adjacency matrix B (B_Adj) from network_cocomment

M=1000;
B=zeros(M,M);
for i=1:size(network_cocomment,1)
    index1=network_cocomment(i,1);
    index2=network_cocomment(i,2);
    B(index1+1,index2+1)=1;
    B(index2+1,index1+1)=1;
end
clear index1 index2 ;
dlmwrite('B_Adj.txt',B, ' ');

% Build the train set from network_buddy (friendship adjacency A)
A_train=zeros(M,M);
for i=1:size(network_buddy,1)
    % edges alternate: one goes to train, the next to test
    if mod(i,2)==0
        index1=network_buddy(i,1);
        index2=network_buddy(i,2);
        A_train(index1+1,index2+1)=1;
        A_train(index2+1,index1+1)=1;
    end
end
clear index1 index2 ;
dlmwrite('A_train.txt',A_train, ' ');

% Build the test set from network_buddy (friendship adjacency A)

A_test=zeros(M,M);
for i=1:size(network_buddy,1)
    % edges alternate: one goes to train, the next to test
    if mod(i,2)~=0
        index1=network_buddy(i,1);
        index2=network_buddy(i,2);
        A_test(index1+1,index2+1)=1;
        A_test(index2+1,index1+1)=1;
    end
end
clear index1 index2 ;
dlmwrite('A_test.txt',A_test, ' ');

% Create the identity matrix
temp = ones(1,M);
I = diag(temp);

% Compute the similarities
b=0.0005;
sKatz = inv(I-b*A_train)-I;
dlmwrite('sKatz.txt',sKatz, ' ');
clear sKatz ;

% Sources A and B
cKatz = inv(I-b*(A_train+B))-I;
dlmwrite('cKatz.txt',cKatz, ' ');
clear cKatz ;

% Truncated sums: paths of length up to 2 (cKatz2) and up to 3 (cKatz3)
cKatz2 = b*(A_train+B)+b^2*(A_train^2+B^2+A_train*B+B*A_train);
dlmwrite('cKatz2.txt',cKatz2, ' ');
clear cKatz2 ;
cKatz3 = b*(A_train+B)+b^2*(A_train^2+B^2+A_train*B+B*A_train) ...
    +b^3*(A_train^3+B^3+A_train^2*B+A_train*B*A_train+A_train*B^2 ...
    +B^2*A_train+B*A_train*B+B*A_train^2);
dlmwrite('cKatz3.txt',cKatz3, ' ');
clear cKatz3 ;


A.3 Visual Basic Code

For the experiments, Visual Basic source code was provided by delab.csd.auth.gr and parameterized for our experimental needs.

Public Class Form1

    Public Structure myrows
        Dim value As Double
        Dim thesi As Integer ' position (column index) of the value
    End Structure

    Dim mydim = 999 '935
    Public myrow(0 To mydim) As myrows
    Public assoi(0 To 999) ' number of ones counted per row of the train matrix

    Private Sub Form1_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load

        Dim adj(mydim, mydim) As Double
        Dim realf(mydim, mydim) As Double
        Dim w, num_rec, temp2 As Integer
        Dim temp, prec, map, map2 As Double
        Dim start As Integer
        Dim pososto_train As Double
        start = 0

        Dim FILE_NAME As String = "C:\Users\chris\Desktop\diplwmatiki_peiramata\cKatz.txt"

        num_rec = 4 ' number of recommendations made per row
        pososto_train = 60 ' percentage of observed (train) links, values range from 20 to 80

        Dim TextLine As String = ""
        Dim tmpRow(mydim) As String
        Dim prec2, rec2 As Double
        prec2 = 0
        If System.IO.File.Exists(FILE_NAME) = True Then

            Dim objReader As New System.IO.StreamReader(FILE_NAME)

            For intCount = start To mydim

                TextLine = objReader.ReadLine()
                tmpRow = TextLine.Split(" ")

                For secondInt = 0 To mydim ' tmpRow.Length - 2

                    'If tmpRow(secondInt) < 0.5 Then
                    '    tmpRow(secondInt) = 0
                    'End If

                    adj(intCount, secondInt) = tmpRow(secondInt)

                Next secondInt

            Next intCount

        Else

            MsgBox("File Does Not Exist")

        End If

        Dim FILE_NAME1 As String = "C:\Users\chris\Desktop\diplwmatiki_peiramata\A_test.txt"
        Dim TextLine1 As String = ""
        Dim tmpRow1(mydim) As String

        If System.IO.File.Exists(FILE_NAME1) = True Then

            Dim objReader1 As New System.IO.StreamReader(FILE_NAME1)
            For intCount = start To mydim
                TextLine1 = objReader1.ReadLine()
                tmpRow1 = TextLine1.Split(" ")
                For secondInt = 0 To mydim 'tmpRow.Length - 2
                    realf(intCount, secondInt) = tmpRow1(secondInt)
                Next secondInt
            Next intCount

        Else

            MsgBox("File Does Not Exist")

        End If

        Dim FILE_NAME2 As String = "C:\Users\chris\Desktop\diplwmatiki_peiramata\A_train.txt"
        Dim TextLine2 As String = ""
        Dim tmpRow2(mydim) As String

        If System.IO.File.Exists(FILE_NAME2) = True Then

            Dim objReader2 As New System.IO.StreamReader(FILE_NAME2)

            For intCount = start To 200 '1924 ' lowering this bound increases precision while recall stays the same

                TextLine2 = objReader2.ReadLine()
                tmpRow2 = TextLine2.Split(" ")

                For secondInt = 0 To 200 'tmpRow.Length - 2
                    If tmpRow2(secondInt) > 0 Then
                        assoi(intCount) = assoi(intCount) + 1
                    End If
                Next secondInt
            Next intCount

        Else

            MsgBox("File Does Not Exist")

        End If

        '------
        Dim k = 0
        Dim mm = 0
        map2 = 0
        Dim metr_for_recall = 0

        For i = 0 To mydim
            For j = 0 To mydim

                myrow(j).value = adj(i, j)
                myrow(j).thesi = j

            Next j

            ' sort the row in descending order of similarity
            For ii = 0 To mydim - 1
                For jj = ii + 1 To mydim
                    If myrow(ii).value < myrow(jj).value Then
                        temp = myrow(ii).value
                        temp2 = myrow(ii).thesi
                        myrow(ii).value = myrow(jj).value
                        myrow(ii).thesi = myrow(jj).thesi

                        myrow(jj).value = temp
                        myrow(jj).thesi = temp2
                    End If
                Next jj
            Next ii

            For mmm = 0 To mydim
                If realf(i, mmm) = 1 Then
                    metr_for_recall = metr_for_recall + 1
                End If
            Next

            k = 0 : prec = 0 : map = 0

            For w = 0 To num_rec - 1

                If myrow(w).value > 0 And assoi(i) > (pososto_train / 20) Then

                    If realf(i, myrow(w).thesi) = 1 Then
                        prec = prec + 1
                        map = map + (prec / i)
                        k = k + 1

                    Else
                        k = k + 1

                    End If
                End If

            Next w

            If k > 0 Then
                rec2 = rec2 + prec
                prec2 = prec2 + prec / k
                map2 = map2 + map
                mm = mm + k
            End If
        Next i

        TextBox1.Text += "Recommended Friends : " & num_rec
        TextBox1.Text += " Precision :"
        TextBox1.Text += Str(prec2 / mm)
        TextBox1.Text += " Recall :"
        TextBox1.Text += Str(rec2 / metr_for_recall)
        TextBox2.Text += "AUC : " & Str((prec2 + 0.5 * (mm - prec2)) / mm)
        TextBox2.Text += " MAP : " & Str(map2 / metr_for_recall)

    End Sub

    Private Sub TextBox1_TextChanged(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles TextBox1.TextChanged

    End Sub
End Class

A.4 Topological Properties Computation in C++

This section lists the C++ source code that computes the topological properties of a network.

#include <stdio.h>
#include <termios.h>
#include <unistd.h>

// Reads one character from the terminal without echo or line buffering
int getch(void)
{
    struct termios oldt, newt;
    int ch;
    tcgetattr( STDIN_FILENO, &oldt );
    newt = oldt;
    newt.c_lflag &= ~( ICANON | ECHO );
    tcsetattr( STDIN_FILENO, TCSANOW, &newt );
    ch = getchar();
    tcsetattr( STDIN_FILENO, TCSANOW, &oldt );
    return ch;
}

const int N = 1000;//1150; //5000; //1000; //1000; //1000; //1000; //100; //3694; //900; //584; //97; //7; //Number of Nodes
const int E = 52538;// 1473;// 4578;//3862; //37466; //44975; //22470; //15031; //7503; //738; //13692; //3317; //1177; //314; //10; //Number of Edges
short A[N][N]; //Adjacency Matrix
int Dist[N][N]; //Distances Matrix
int Deg[N]; //Degree Matrix

int FindDegree(int i)
{
    int sm,h;
    sm = 0;

for(h=0;h

    return(sm);
}

float ComputeLocalClusteringCoefficient(int i)
{
    int h,u,w,k;
    int v[1000];
    int totalv;
    int s;
    float lcc;

    totalv = 0;
    for(h=0;h

//printf("\n%d",totalv);

    s=0;
    for(h=0;h

    //printf("\t%d",s);

    if(totalv!=0 && totalv!=1)
    {
        lcc=(float)2*s/(totalv*(totalv-1));
    }
    else
    {
        lcc=0;
    }

    //printf("\t%f",lcc);

    return(lcc);
}

void PrintDegreeMatrix()
{
    int i;
    printf("Degree Matrix:\n");
    printf("------\n");

for(i=0;i

for(i=0;i

for(i=0;i

    int inque[N];
    int q[4*N];
    const int INF = 999999;
    int i,v,u;
    int mind,temp;
    int qe,qs,qw;
    int p;
    //int vp;
    for(i=0;i dist[v] + A[v][u])
    {
        dist[u] = dist[v] + A[v][u];
        parent[u] = v;
    }
    if (inque[u] == 0 && visit[u] == 0)
    {
        q[qe] = u;
        qe = qe + 1;
        inque[u] = 1;
    }
    }
    }
    mind = INF;
    qw = qs;
    for(i=qs;i

    p=dist[vend];
    }
    return(p);
}

int main( int argc, const char* argv[] )
{
    int id1,id2,w;
    int i,j;
    int maxSP,maxDeg;
    float avgSP,avgDeg,avgLCC;
    int p;
    FILE *f;
    FILE *fout;

    if(argc < 2)
    {
        printf("Usage: %s filename\n", argv[0]);
        return 0;
    }
    else
    {
        f = fopen(argv[1], "r");
    }

    //read graph data
    //f = fopen("graph.txt","r");
    //f = fopen("100Users4friendsVC(N97-E314).txt","r");

    //f = fopen("Real-Facebook-1K-Users4friends(N900-E3317).txt","r");
    //f = fopen("Real-Facebook-4K-Users4friends(N3694-E13692).txt","r");
    //f = fopen("Real-Hi5-0.5K-Users4friends(N584-E1177).txt","r");

    //f = fopen("Synthetic-Friends-N1000-20friends-10train(N1000-E7503).txt","r");
    //f = fopen("Synthetic-Friends-N1000-40friends-20train(N1000-E15031).txt","r");
    //f = fopen("Synthetic-Friends-N1000-60friends-30train(N1000-E22470).txt","r");
    //f = fopen("Synthetic-Friends-N1000-120friends-60train(N1000-E44975).txt","r");
    //f = fopen("Synthetic-Time-N100-20friends-10train(N100-E738).txt","r");
    //f = fopen("Synthetic-Time-N1000-20friends-10train(N1000-E7503).txt","r");
    //f = fopen("Synthetic-Time-N5000-20friends-10train(N5000-E37466).txt","r");
    //f = fopen("Real-Hi5-final-train-N1150(N1150-E3862).txt","r");

    if(f==NULL)
    {
        printf("\nFile Not Found...\n");
        return(0);
    }
    for(i=0;i

    }
    fclose(f);

    //set up degree matrix, find max and avg degree
    maxDeg=0;
    avgDeg=0;
    for(i=0;i<N;i++)
    {
        Deg[i]=FindDegree(i);
        if(Deg[i]>maxDeg) maxDeg=Deg[i];
        avgDeg=avgDeg+(float)Deg[i];
    }
    avgDeg=(float)(avgDeg/N);

    //compute average local clustering coefficient
    avgLCC=0;
    for(i=0;i

    //print matrices
    //PrintAdjacencyMatrix();
    //PrintDegreeMatrix();
    //getch();

    //printf("\nN=%d (total number of Nodes)",N);
    //printf("\nE=%d (total number of Links-Edges)",E);
    //printf("\nADEG=%f (average node degree)",avgDeg);
    //printf("\nLCC=%f (average local clustering coefficient)",avgLCC);
    //printf("\nMAXD=%d (maximum node degree)",maxDeg);
    //getch();

    //compute basic distances
    for(i=0;i

    }
    else Dist[i][j] = 0;
    }
    }

    //print distances before shortest path computations
    //printf(">BEFORE shortest path computations\n");
    //PrintDistancesMatrix();

    //update similarities with transitive shortest path calculation (where there are zeros)
    //maxSP=0;
    //avgSP=0;
    for(i=0;imaxSP) maxSP=Dist[i][j];
    //avgSP=avgSP+(float)Dist[i][j];
    }
    }
    //printf("\ntempASD=%f",avgSP);
    //printf("\ttempGD=%d",maxSP);
    }

    //print distances after shortest path computations
    //printf(">AFTER shortest path computations\n");
    //PrintDistancesMatrix();

    //Find Max SP Distance and Average (Max=Graph Diameter)
    maxSP=0;
    avgSP=0;
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            if(Dist[i][j]>maxSP) maxSP=Dist[i][j];
            avgSP=avgSP+(float)Dist[i][j];
        }
    }
    avgSP=(float)(avgSP/(N*N));

    //save distances matrix
    fout = fopen("DistancesTable.txt","w");
    for(i=0;i

    //Report Results (Topological Properties)
    printf("\nN=%d (total number of Nodes)",N);
    printf("\nE=%d (total number of Links-Edges)",E);
    printf("\nASD=%f (average shortest distance between pair nodes)",avgSP);
    printf("\nADEG=%f (average node degree)",avgDeg);
    printf("\nLCC=%f (average local clustering coefficient)",avgLCC);
    printf("\nGD=%d (graph diameter or maximum shortest path distance)",maxSP);
    printf("\nMAXD=%d (maximum node degree)",maxDeg);

    printf("\n\nEnd of Calculations.\n");
    getch();
    return(0);
}

106