Social and Information Networks
University of Toronto, CSC303, Winter/Spring 2019
Week 2: January 14, 16 (2019)

Today's agenda

Last week's lectures and tutorial covered a number of basic graph-theoretic concepts:
- undirected vs. directed graphs
- unweighted and vertex/edge-weighted graphs
- paths and cycles
- breadth-first search
- connected components and strongly connected components
- the giant component in a graph
- bipartite graphs

This lecture: Chapter 3 of the textbook, "Strong and Weak Ties". But let's first briefly return to the "romantic relations" table to see how graph structure may or may not align with our understanding of sociological phenomena.

The dispersion of an edge

In their paper, Backstrom and Kleinberg introduce various dispersion measures for an edge. The general idea of the dispersion of an edge (u, v) is that mutual neighbours of u and v should not be "well-connected" to one another.

They consider dense subgraphs of the Facebook network (where a known relationship has been stated) and study how well certain dispersion measures work in predicting a relationship edge (marriage, engagement, relationship) when one is known to exist. Another problem is to discover which individuals have a romantic relationship. Note that discovering the romantic edge, or whether one exists, raises questions about the privacy of data.

They compare their dispersion measures against the embeddedness of an edge (to be defined) and also against certain semantic interaction features (e.g., appearance together in a photo, viewing the profile of a neighbour).

Performance of (Dispersion + b)^α / (Embeddedness + c)
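To make the general idea concrete, here is a minimal Python sketch of embeddedness and absolute dispersion for an edge (u, v). The graph representation (a dict mapping each node to its set of neighbours) and the function names are assumptions for illustration; the pair-counting rule follows the paper's description, counting pairs of mutual neighbours that are neither adjacent nor share another neighbour of u besides v.

```python
from itertools import combinations

def embeddedness(adj, u, v):
    """emb(u, v): the number of mutual neighbours of u and v."""
    return len((adj[u] & adj[v]) - {u, v})

def dispersion(adj, u, v):
    """Absolute dispersion disp(u, v): count pairs (s, t) of mutual
    neighbours of u and v that are 'far apart' within u's neighbourhood,
    i.e. not adjacent and sharing no neighbour of u other than v."""
    common = (adj[u] & adj[v]) - {u, v}
    score = 0
    for s, t in combinations(common, 2):
        if t in adj[s]:
            continue  # s and t are directly linked
        if (adj[s] & adj[t] & adj[u]) - {v}:
            continue  # s and t share another common neighbour of u
        score += 1
    return score
```

For a relationship edge we expect many mutual friends (high embeddedness) drawn from different social circles, so that few of them know each other: that is exactly what makes dispersion high.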
Some experimental results for the fraction of correct predictions (precision at the first position, "Precision@1"). Recall that we argued (since the median number of friends was 200) that this fraction might be around .005 when randomly choosing an edge to be the link to the significant partner. Do you find anything surprising in these tables?

[Figure 3 (plot omitted): performance of (disp(u, v) + b)^α / (emb(u, v) + c) as a function of α, when choosing optimal values of b and c; y-axis: precision at first position; x-axis: exponent α.]

type                 embed   rec.disp.  photo   prof.view.
all                  0.247   0.506      0.415   0.301
married              0.321   0.607      0.449   0.210
married (fem)        0.296   0.551      0.391   0.202
married (male)       0.347   0.667      0.511   0.220
engaged              0.179   0.446      0.442   0.391
engaged (fem)        0.171   0.399      0.386   0.401
engaged (male)       0.185   0.490      0.495   0.381
relationship         0.132   0.344      0.347   0.441
relationship (fem)   0.139   0.316      0.290   0.467
relationship (male)  0.125   0.369      0.399   0.418

Figure 4. The performance of different measures for identifying spouses and romantic partners: the numbers in the table give the precision at the first position, the fraction of instances in which the user ranked first by the measure is in fact the true partner. Averaged over all instances, recursive dispersion performs approximately twice as well as the standard notion of embeddedness, and also better overall than measures based on profile viewing and presence in the same photo.

type                 embed   rec.disp.  photo   prof.view.
all                  0.391   0.688      0.528   0.389
married              0.462   0.758      0.561   0.319
married (fem)        0.488   0.764      0.538   0.350
married (male)       0.435   0.751      0.586   0.287
engaged              0.335   0.652      0.553   0.457
engaged (fem)        0.375   0.674      0.536   0.492
engaged (male)       0.296   0.630      0.568   0.424
relationship         0.277   0.572      0.460   0.498
relationship (fem)   0.318   0.600      0.440   0.545
relationship (male)  0.239   0.546      0.479   0.455

Figure 5. The performance of the four measures from Figure 4 when the goal is to identify the partner or a family member in the first position of the ranked list. The difference in performance between the genders has largely vanished, and in some cases is inverted relative to Figure 4.

The predictive power can be strengthened further by applying the idea of dispersion recursively, identifying nodes v for which the u-v link achieves a high normalized dispersion based on a set of common neighbors C_uv who, in turn, also have high normalized dispersion in their links with u. To carry out this recursive idea, we assign values to the nodes reflecting the dispersion of their links with u, and then update these values in terms of the dispersion values associated with other nodes. Specifically, we initially define x_v = 1 for all neighbors v of u, and then iteratively update each x_v to be

    ( Σ_{w ∈ C_uv} x_w^2  +  2 Σ_{s,t ∈ C_uv} d_v(s, t) · x_s · x_t ) / emb(u, v).

Note that after the first iteration, x_v is 1 + 2·norm(u, v), and hence ranking nodes by x_v after the first iteration is equivalent to ranking nodes by norm(u, v). We find the highest performance when we rank nodes by the values of x_v after the third iteration. For purposes of later discussion, we will call this value of x_v in the third iteration the recursive dispersion rec(u, v), and will focus on this as the main exemplar from our family of related dispersion-based measures.
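The iterative update above can be sketched in a few lines of Python. This is a minimal illustration, assuming the same dict-of-sets adjacency representation; the choice to score neighbours with no mutual friends as 0 (to avoid division by zero) is mine, not specified in the excerpt.

```python
from itertools import combinations

def recursive_dispersion(adj, u, iterations=3):
    """Iterate the update
    x_v <- (sum_{w in C_uv} x_w^2 + 2 sum_{s,t in C_uv} d_v(s,t) x_s x_t) / emb(u, v),
    starting from x_v = 1 for every neighbour v of u."""
    x = {v: 1.0 for v in adj[u]}
    for _ in range(iterations):
        new_x = {}
        for v in adj[u]:
            common = (adj[u] & adj[v]) - {u, v}  # C_uv
            if not common:
                new_x[v] = 0.0  # no mutual friends: score 0 (my choice)
                continue
            total = sum(x[w] ** 2 for w in common)
            for s, t in combinations(common, 2):
                # d_v(s, t) = 1 iff s and t are not adjacent and share
                # no neighbour of u other than v
                if t not in adj[s] and not (adj[s] & adj[t] & adj[u]) - {v}:
                    total += 2.0 * x[s] * x[t]
            new_x[v] = total / len(common)  # divide by emb(u, v)
        x = new_x
    return x
```

With iterations=1 this reproduces 1 + 2·norm(u, v) exactly, which is a useful sanity check; rec(u, v) is the value after the third iteration.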
Here d_v(s, t) takes the value 1 for pairs of non-neighboring nodes in C_uv that have no neighbors in common in G_u − {u, v}, and 0 otherwise. (See the Appendix for further mathematical properties of the recursive dispersion.)

Strengthenings of Dispersion. We can learn a function that predicts whether or not v is the partner of u in terms of the two variables disp(u, v) and emb(u, v), where the latter denotes the embeddedness of the u-v link. We find that performance is highest for functions that are monotonically increasing in disp(u, v) and monotonically decreasing in emb(u, v): for a fixed value of disp(u, v), increased embeddedness is in fact a negative predictor of whether v is the partner of u. A simple combination of these two quantities that comes within a few percent of more complicated functional forms can be obtained by the expression disp(u, v)/emb(u, v), which we term the normalized dispersion norm(u, v), since it normalizes the absolute dispersion by the embeddedness. Predicting u's partner to be the individual v maximizing norm(u, v) gives the correct answer in 48.0% of all instances.

PERFORMANCE OF STRUCTURAL AND INTERACTION MEASURES

We summarize the performance of our methods in Figure 4. Looking initially at just the first two columns in the top row of numbers ("all"), we have the overall performance of embeddedness and recursive dispersion: the fraction of instances on which the highest-ranked node under these measures is in fact the partner. As we will see below in the discussion around Figure 6, recursive dispersion also has higher performance than a wide range of other basic structural measures. We can also compare these structural measures to features derived from a variety of different forms of real-time interaction.

[…] links over their time on Facebook, and it is also correlated with the time since the relationship was first reported. (As we will see later in Figure 11, performance varies as a function of this latter quantity as well.) To understand whether there is any effect of a user's time on site beyond its relation to these other parameters, we consider a subset of users where we restrict the neighborhood size to lie between 100 and 150, and the time since the relationship was reported to lie between 100 and 200 days. Figure 9 shows that for this subset, there is a weak increase in performance as a function of time on site.

We performed initial experiments with different machine learning algorithms and found that gradient tree boosting [13] out-performed logistic regression, as well as other tree-based methods. Thus, all of our in-depth analysis is conducted with this algorithm. In our experiments, we divide the data so that 50% of the users go into a training set and 50% go into a test set. We perform 12 such divisions into sets A and B; for each division we train on set A and test on B, and then train on B and test on A. For each user u, we average over the 12 runs in which u was a test user to get a final prediction. The input to our machine learning algorithms has 480 features derived from 120 values per instance. In addition to the full set of features, we also compute performance using only the structural features, and only the interaction features.

Performance of Machine Learning Methods. We find (Figure 10) that by using boosted decision trees to combine all of the 48 structural features we analyzed, we can increase performance from 50.8% to 53.1%. We can use the same technique to predict relationships based on interaction features.

type          max.struct.  max.inter.  all.struct.  all.inter.  comb.
all           0.506        0.415       0.531        0.560       0.705
married       0.607        0.449       0.624        0.526       0.716
engaged       0.446        0.442       0.472        0.615       0.708
relationship  0.344        0.441       0.377        0.605       0.682

Figure 10. The performance of methods based on machine learning that combine sets of features. The first two columns show the highest performing individual structural and interaction features; the third and fourth columns show the highest performance of machine learning classifiers that combine structural and interaction features respectively; and the fifth column shows the performance of a classifier that combines all structural and interaction features together.
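The ranking measures discussed above (normalized dispersion and the (disp + b)^α / (emb + c) family) reduce to a few lines. A hedged sketch: the function names are mine, and the default parameter values are placeholders, not the fitted values behind the paper's Figure 3.

```python
def normalized_dispersion(disp, emb):
    """norm(u, v) = disp(u, v) / emb(u, v); taken as 0 when emb is 0."""
    return disp / emb if emb else 0.0

def combined_score(disp, emb, alpha=1.0, b=0.0, c=1.0):
    """Score of the form (disp + b)^alpha / (emb + c).
    alpha, b, c are illustrative placeholders; the paper tunes them."""
    return (disp + b) ** alpha / (emb + c)

def predict_partner(candidates):
    """candidates maps each neighbour v to its (disp, emb) pair;
    predict the v maximizing normalized dispersion."""
    return max(candidates, key=lambda v: normalized_dispersion(*candidates[v]))
```

Note how a candidate with fewer mutual friends but higher dispersion can outrank a more embedded one, which is exactly the monotonicity (increasing in disp, decreasing in emb) reported above.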