Link Prediction and the Evolution of Communities on Twitter
MASTER'S THESIS

submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE
in
COMPUTER SCIENCE
Track Information Architecture

by

Oscar Castañeda
born in Guatemala, Guatemala

Web Information Systems Group
Department of Software Technology
Faculty EEMCS, Delft University of Technology
Delft, the Netherlands
http://eemcs.tudelft.nl

© 2011 Oscar Castañeda.

NATURE | Vol 453 | 1 May 2008 LETTERS

... graph can capture behaviour of this kind using probabilities p_r that decrease as we move higher up the tree. Conversely, probabilities that increase as we move up the tree correspond to ‘disassortative’ structures in which vertices are less likely to be connected on small scales than on large ones. By letting the p_r values vary arbitrarily throughout the dendrogram, the hierarchical random graph can capture both assortative and disassortative structure, as well as arbitrary mixtures of the two, at all scales and in all parts of the network.

To demonstrate our method we have used it to construct hierarchical decompositions of three example networks drawn from disparate fields: the metabolic network of the spirochaete Treponema pallidum (ref. 18), a network of associations between terrorists (ref. 19), and a food web of grassland species (ref. 20). To test whether these decompositions accurately capture the important structural features of the networks, we use the sampled dendrograms to generate new networks, different in detail from the originals but, by definition, having similar hierarchical structure (see Supplementary Information for more details). We find that these ‘resampled’ networks match the statistical properties of the originals closely, including their degree distributions, clustering coefficients, and distributions of shortest path lengths between pairs of vertices, despite the fact that none of these properties is explicitly represented in the hierarchical random graph (Table 1, and Supplementary Fig. 3). It therefore seems that a network’s hierarchical structure is capable of explaining a wide variety of other network features as well.

The dendrograms produced by our method are also of interest in themselves, as a graphical representation and summary of the hierarchical structure of the observed network. As discussed above, our method can generate not just a single dendrogram but a set of dendrograms, each of which is a good fit to the data. From this set we can, by using techniques from phylogeny reconstruction (ref. 21), create a single consensus dendrogram, which captures the topological features that appear consistently across all or a large fraction of the dendrograms and typically is a better summary of the network’s structure than any individual dendrogram. Figure 2a shows such a consensus dendrogram for the grassland species network, which clearly reveals communities and subcommunities of plants, herbivores, parasitoids and hyperparasitoids.

Another application of the hierarchical decomposition is the prediction of missing interactions in networks. In many settings, the discovery of interactions in a network requires significant experimental effort in the laboratory or the field. As a result, our current pictures of many networks are substantially incomplete (refs 22–28). An alternative to checking exhaustively for a connection between every pair of vertices in a network is to try to predict, in advance and on the basis of the connections already observed, which vertices are most likely to be connected, so that scarce experimental resources can be focused on testing for those interactions. If our predictions are good, we can in this way substantially reduce the effort required to establish the network’s topology.

The hierarchical decomposition can be used as the basis for an effective method of predicting missing interactions as follows. Given an observed but incomplete network, we generate, as described above, a set of hierarchical random graphs (dendrograms and the associated probabilities p_r) that fit that network. Then we look for pairs of vertices that have a high average probability of connection within these hierarchical random graphs but are unconnected in the observed network. These pairs we consider the most likely candidates for missing connections. (Technical details of the procedure are given in Supplementary Information.)

We demonstrate the method by using our three example networks again. For each network we remove a subset of connections chosen uniformly at random and then attempt to predict, on the basis of the remaining connections, which have been removed. A standard metric for quantifying the accuracy of prediction algorithms, commonly used in the medical and machine learning communities, is the AUC statistic, which is equivalent to the area under the receiver operating characteristic (ROC) curve (ref. 29). In the present context, the AUC statistic can be interpreted as the probability that a randomly chosen missing connection (a true positive) is given a higher score by our method than a randomly chosen pair of unconnected vertices (a true negative). Thus, the degree to which the AUC exceeds 0.5 indicates how much better our predictions are than chance. Figure 2 shows the AUC statistic for the three networks as a function of the fraction of the connections known to the algorithm. For all three networks our algorithm does far better than chance, indicating that hierarchy is a strong general predictor of missing structure.

It is also instructive to compare the performance of our method with that of other methods for link prediction (ref. 8). Previously proposed methods include assuming that vertices are likely to be connected if they have many common neighbours, if there are short paths between them, or if the product of their degrees is large. These approaches work well for strongly assortative networks such as collaboration and citation ...

Figure 2 | Application of the hierarchical decomposition to the network of grassland species interactions. a, Consensus dendrogram reconstructed from the sampled hierarchical models. b, A visualization of the network in which the upper few levels of the consensus dendrogram are shown as boxes around species (plants, herbivores, parasitoids, hyperparasitoids and hyper-hyperparasitoids are shown as circles, boxes, down triangles, up triangles and diamonds, respectively). Note that in several cases a set of parasitoids is grouped into a disassortative community by the algorithm, not because they prey on each other but because they prey on the same herbivore.

Table 1 | Comparison of original and resampled networks

Network      ⟨k⟩_real  ⟨k⟩_samp  C_real  C_samp     d_real  d_samp
T. pallidum  4.8       3.7(1)    0.0625  0.0444(2)  3.690   3.940(6)
Terrorists   4.9       5.1(2)    0.361   0.352(1)   2.575   2.794(7)
Grassland    3.0       2.9(1)    0.174   0.168(1)   3.29    3.69(2)

Statistics are shown for the three example networks studied and for new networks generated by resampling from our hierarchical model. The generated networks closely match the average degree ⟨k⟩, clustering coefficient C and average vertex–vertex distance d in each case, suggesting that they capture much of the structure of the real networks. Parenthetical values indicate standard errors on the final digits.

Page 99. © 2008 Nature Publishing Group
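The evaluation described in the Nature text above (hide a uniformly random subset of connections, score candidate pairs, report the AUC) can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the procedure of Clauset et al. or of this thesis: the function names, the toy graph and the 20% hold-out fraction are hypothetical, and the two scoring functions are the simple baselines named in the text (common neighbours and degree product), not the hierarchical random graph model.

```python
import itertools
import random


def common_neighbours(adj, u, v):
    """Score a pair by the number of neighbours the two vertices share."""
    return len(adj[u] & adj[v])


def degree_product(adj, u, v):
    """Score a pair by the product of the two vertex degrees."""
    return len(adj[u]) * len(adj[v])


def auc(missing_scores, non_edge_scores):
    """Probability that a random missing edge outscores a random non-edge.

    Ties count as one half, so an uninformative predictor gives 0.5.
    """
    wins = 0.0
    for m in missing_scores:
        for n in non_edge_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(missing_scores) * len(non_edge_scores))


def evaluate(edges, score, fraction_removed=0.2, seed=0):
    """Hide a random subset of edges, then compare the scores of the hidden
    edges with the scores of pairs that were never connected.

    Assumes an undirected graph that is not complete (some non-edges exist).
    """
    rng = random.Random(seed)
    edges = sorted({tuple(sorted(e)) for e in edges})
    vertices = sorted({v for e in edges for v in e})
    n_hidden = max(1, int(fraction_removed * len(edges)))
    hidden = set(rng.sample(edges, n_hidden))

    # Adjacency sets built from the observed (non-hidden) edges only.
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if (u, v) not in hidden:
            adj[u].add(v)
            adj[v].add(u)

    true_edges = set(edges)
    non_edges = [p for p in itertools.combinations(vertices, 2)
                 if p not in true_edges]
    return auc([score(adj, u, v) for u, v in sorted(hidden)],
               [score(adj, u, v) for u, v in non_edges])


if __name__ == "__main__":
    # Hypothetical toy graph: two 4-cliques joined by a single bridge edge.
    toy_edges = [pair
                 for group in ([0, 1, 2, 3], [4, 5, 6, 7])
                 for pair in itertools.combinations(group, 2)] + [(3, 4)]
    for name, fn in [("common neighbours", common_neighbours),
                     ("degree product", degree_product)]:
        print(name, evaluate(toy_edges, fn))
```

The pairwise loop in `auc` follows the probabilistic reading of the statistic given in the text directly; for larger graphs a rank-based computation or sampling of pairs would be the more practical choice.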
Cover picture: Network visualization from Clauset et al. [17].

Author: Oscar Castañeda
Student id: 1398946
Email: [email protected]
Graduation Date: 24 November 2011
Graduation Section: Web Information Systems

Abstract

This research is about the influence of link prediction on the evolution of communities on Twitter. We collected tweets from three technology micro-bloggers who led us through their followings and tweets to tens of thousands of unique users over several weeks. We analyzed conventional and alternative information streams for these micro-bloggers based on URLs embedded in their tweets and in tweets of followees and followees-of-followees. We model users based on the most recent URLs embedded in their tweets and the latest users they follow, from which we infer links and extract semantic entities that are indicative of their interests. Furthermore, we propose a pipeline of methods for user modeling and personalization of communities of interest on Twitter. We test the performance of different organizational principles in community design, including the principles of hierarchy, user interests and the baseline follower mechanism on Twitter, which is based on user intuitions.

The goal of this thesis is to create a better notion of community by automatically calculating adaptive and personalized structures of followees that produce highly interesting content. Designing communities in this way is useful because it enables people to know in which community they are organized during a given period of time and because it enables community-based recommendations. Furthermore, designing communities based on organizational principles enables their automatic construction. Currently, communities are manually constructed by users through a tedious process of following and unfollowing which is based on disconnected user intuitions.
We investigate whether it is possible to infer links between Twitter users who are not explicitly connected on Twitter and explore whether such automatically inferred social networks would allow for improving content recommendations on Twitter.

Thesis Committee: