Twitter Graph Assignment

I290-1 Fall 2012: Analyzing Big Data with Twitter
Assignment 3: Twitter Graph Assignment

Assignment Goals

The goal of this assignment is to give you experience with the Twitter social graph, and a bit of exposure to some simple graph algorithms and graph code, by having you acquire data from the Twitter social graph and process it in various ways.

Background

You have probably learned about graph algorithms in your other courses; some are especially relevant to social media. We’ve already looked a bit at friending and following on Twitter. We can link up friends and followers into a graph, which can be considered the “interest graph”. We can also look at mentions of one user by another as another kind of graph, the “conversation graph” within Twitter. (A mention occurs when one user includes the handle of another user in the text of their tweet, via the @ symbol. A retweet might also be considered a mention but isn’t strictly one.) For this assignment, we will explore both the conversation graph and the interest graph and examine their properties a bit.

Since we have students who use both Python and Java in the class, this assignment tries to accommodate both. This time Python is the easier tool to use, however, and will be the tool we focus on, but you are free to use a Java equivalent. The tools to download to follow the instructions as given here are NetworkX (Python) and GraphViz (which is programming language neutral). Java options are JUNG or JGraphT or any other package that you like. This link compares JGraphT and JUNG. Another graph plotting library that looks good (but which I haven’t tried yet) is Gephi.

Part 1: Take a Look at the Conversation Graph

We first need to build up a graph of who mentions whom. We’ll use the streaming API from Assignment 2 to build this. What we want is a large set of ordered pairs of the form (User A, User B), in which User A mentions User B in a tweet.
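As a rough illustration of the ordered pairs described above, the following Python sketch pulls (author, mentioned user) pairs out of tweets stored one JSON object per line. It assumes the usual tweet JSON layout, in which the author's handle is at user.screen_name and mentions are listed under entities.user_mentions. The function names and the sample data are hypothetical and are not taken from the course materials.

```python
import json


def mention_pairs(tweet):
    """Return (author, mentioned) screen-name pairs for one tweet dict."""
    # Assumed layout: tweet["user"]["screen_name"] is the author and
    # tweet["entities"]["user_mentions"] is a list of mentioned users.
    author = (tweet.get("user") or {}).get("screen_name")
    mentions = (tweet.get("entities") or {}).get("user_mentions") or []
    if not author:
        # Non-tweet stream messages (for example delete notices) have no author.
        return []
    return [(author, m["screen_name"]) for m in mentions if m.get("screen_name")]


def pairs_from_lines(lines):
    """Parse one JSON object per line and collect all mention pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        try:
            message = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if isinstance(message, dict):
            pairs.extend(mention_pairs(message))
    return pairs


# Made-up sample input: one tweet with two mentions, one delete notice, one blank.
sample = [
    json.dumps({
        "user": {"screen_name": "Alice"},
        "text": "hi @Bob and @carol",
        "entities": {"user_mentions": [{"screen_name": "Bob"},
                                       {"screen_name": "carol"}]},
    }),
    json.dumps({"delete": {"status": {"id": 1}}}),
    "",
]

print(pairs_from_lines(sample))
# prints [('Alice', 'Bob'), ('Alice', 'carol')]
```

Each pair is one directed edge of the conversation graph, pointing from the author to the user they mentioned.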
(If you want more details, these steps are based on a series of blog posts by Isaac Hepworth.)

We are going to feed this into a graph visualization tool. One that you can use, and for which instructions are provided below, is GraphViz (but you are welcome to use other visualization tools instead). An advantage of GraphViz for this assignment is that it has the ability to extract connected components, which will be useful for plotting the biggest connected structures.

Here are some steps to follow.

(a) Download 25,000 @ mentions into a file using the sample stream. You’ll need to record the screen name of the person who wrote the tweet as well as the screen names of the users they mentioned (a user_mentions list is built into the entities portion of each tweet delivered by the streaming API). I ran the streaming API for about 15 minutes to get this many unique mentions.

(b) Lowercase the screen names (it’s easiest to do this while you download).

(c) Remove duplicates (you might want to use the unix commands sort | uniq).

(d) Create a file in a format that is acceptable to your graph plotting program. For GraphViz, you need it to read like the sample below (spaces don’t matter).

    digraph mentions {
    "000icm000" -> "miyavi_official"
    "005dw" -> "u_td5"
    "007villegas" -> "carolacortez327"
    "00amnah" -> "sontk2011"
    "00genetaylor00" -> "tcannada"
    "00nere00" -> "rafamoratete"
    "00sleepy00" -> "brjohnson2000"
    "0101s" -> "aro3_brodcast"
    }

(e) Once you have this, you can plot it in GraphViz. Because it can take a long time to render, it’s best not to just plot the whole thing in GraphViz. Instead, you can use the connected components tool to plot the subgraphs. GraphViz has several different plotting tools and parameters. Here we just give the incantation for plotting these graphs with a black background and white lines, in a square layout. Assuming your digraph is named mentions.gv, this command will create a png file that has the top 1000 connected components.
If you change the 1000 to 5, you’ll get the 5 largest connected components instead.

    ccomps -zX#0-1000 mentions.gv | \
    grep "-" | cat <(echo "digraph mentions {") - <(echo "}") | \
    sfdp -Gbgcolor=black -Ncolor=white -Ecolor=white \
    -Nwidth=0.02 -Nheight=0.02 -Nfixedsize=true \
    -Nlabel='' -Earrowsize=0.4 -Gsize=75 -Gratio=fill \
    -Tpng > ccomp0-1000.large.png

Here is the plot that I got for the largest 5 connected components for 26,993 unique mentions, as well as the largest 1000.

It would be cool to modify the graph to show labels of the user screen names for the largest connected components.

To turn in for part 1: At least two images of a Twitter mention graph based on at least 20,000 edges, although not all edges need to be in the image (as shown above, not all the edges are there, because the connected components function removed many of them). Your images may look different from what I’ve shown here; you can follow the instructions here if you want, but they are just meant to get you started. Feel free to use your imagination! If you want to do more graph analysis, please do! If you want to use more nodes, or gather mentions in some other way, or use retweets somehow, please do! These instructions are meant to be a starting point to help you explore.

Points:

• [8 pts] For producing your own graph and at least two images, and your commentary on what the image appears to mean.
• [1 pt] For enhancing the visualization in some manner.
• [1 pt] For enhancing the processing of the tweets or the graph in some manner.

Part 2: Take a Look at the Interests Graph

Now we’ll take a look at the interests graph. You can do this either with your own friends network or you can choose some user of interest to start with.
In the explanations below I’m going to assume you are using your own friends network. It’s a good idea to make sure there aren’t too many friends or too few friends in the network, however; between 100 (so not too sparse) and 500 (so not too big) is best.

(This is based on a blog post by Drew Conway.) The idea is to find users who share your interests closely but who you may be unaware of. The first step is to gather up the people connected to you via your direct friends, then find the friends of those friends. This will lead to a lot of single-chain links, so these are eliminated, and a core of more tightly linked people is extracted by using k-core analysis. From the core, you will then look for the friends of friends that most of this group follows, and that you don’t already follow. In other words, the code shows you how to effectively close the open triads. You’ll see which of these missing links are most popular, and then you can decide if you want to consider following them or not. We’ll also look at the graph properties of these networks as we go.

The first thing you need to do is download and install NetworkX if you are using Python, or another graph package if you prefer. If you don’t use Python, you can use different steps to get a similar outcome. (Here is a very nice introduction to NetworkX and graph properties.) You may follow the steps and comments in the enclosed Python file.

Here are the top results for UCBTweeter, the Twitter account for our course; looks like there are some interesting suggestions to follow!
    20 of your friends are already following UCBTweeter
    8 of your friends are already following MarsCuriosity
    7 of your friends are already following ruslansv
    7 of your friends are already following ML_Hipster
    7 of your friends are already following FakeDorsey
    6 of your friends are already following sippey
    6 of your friends are already following neiltyson
    6 of your friends are already following marissamayer
    5 of your friends are already following vkhosla
    5 of your friends are already following twoffice
    5 of your friends are already following sagemintblue
    5 of your friends are already following daltonc
    5 of your friends are already following chanian
    5 of your friends are already following MartiHearst
    5 of your friends are already following BorowitzReport
    4 of your friends are already following wattenberg
    4 of your friends are already following wangtian
    4 of your friends are already following squarecog
    4 of your friends are already following paul_irish
    4 of your friends are already following mlevchin
    4 of your friends are already following mccv
    4 of your friends are already following matei_zaharia
    4 of your friends are already following mashable
    4 of your friends are already following hackweek

Now there are other things you can do with this graph. If you convert it from a directed graph (DiGraph) to an undirected graph, you can compute the clustering coefficients and the number of triangles. The clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together, and can be computed as the fraction of possible triangles that exist; gaps suggest potentials for recommendations of new edges (or followers).
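A minimal sketch of that computation with NetworkX follows, run on a small made-up follow graph. The node names and edges are invented for illustration; with real data you would build the DiGraph from your downloaded friend lists instead.

```python
import networkx as nx

# Hypothetical toy follow graph; an edge (a, b) means "a follows b".
follows = nx.DiGraph()
follows.add_edges_from([
    ("me", "ann"), ("me", "bob"), ("ann", "bob"),
    ("bob", "cat"), ("ann", "cat"), ("cat", "dan"),
])

# Drop edge direction: two users are linked if either follows the other.
undirected = follows.to_undirected()

# Number of triangles that pass through each node.
triangles = nx.triangles(undirected)

# Local clustering coefficient: the fraction of pairs of a node's
# neighbours that are themselves connected.
clustering = nx.clustering(undirected)

# Mean of the local coefficients over all nodes.
average = nx.average_clustering(undirected)

for node in undirected.nodes():
    print(node, triangles[node], round(clustering[node], 3))
print("average clustering:", round(average, 3))
```

In this toy graph "cat" has three neighbours but only one of the three possible links among them exists, so its coefficient is 1/3; the two missing links are the kind of gap that could be suggested as a new edge.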

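Returning to the recommendation step of Part 2, here is one way the k-core pruning and friend-of-friend counting might be sketched with NetworkX. This is not the enclosed course file: the recommend function, the choice of k=2, the decision to compute the core on the undirected version of the graph, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

import networkx as nx


def recommend(graph, ego, k=2):
    """Rank accounts followed by ego's core friends that ego does not follow.

    Hypothetical helper: graph is a DiGraph whose edge (a, b) means
    "a follows b"; returns a list of (account, count), most followed first.
    """
    # Keep only the k-core, which discards single-chain links. The core is
    # computed on the undirected graph here (an assumption of this sketch).
    core = nx.k_core(graph.to_undirected(), k=k)
    if ego not in core:
        return []

    already = set(graph.successors(ego)) | {ego}
    counts = Counter()
    for friend in graph.successors(ego):
        if friend not in core:
            continue
        for candidate in graph.successors(friend):
            # Count open triads: ego -> friend -> candidate, no ego -> candidate.
            if candidate in core and candidate not in already:
                counts[candidate] += 1
    return counts.most_common()


# Made-up two-hop network around "me".
follows = nx.DiGraph()
follows.add_edges_from([
    ("me", "ann"), ("me", "bob"), ("me", "cat"),
    ("ann", "bob"), ("ann", "dan"), ("ann", "eve"),
    ("bob", "ann"), ("bob", "dan"),
    ("cat", "dan"), ("cat", "zed"),
    ("dan", "ann"), ("eve", "dan"),
])

for account, count in recommend(follows, "me"):
    print(count, "of your friends are already following", account)
# prints:
# 3 of your friends are already following dan
# 1 of your friends are already following eve
```

Here "zed" is reachable only through a single link from "cat", so it falls out of the 2-core and is never suggested, while "dan" closes three open triads at once.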