I290-1 Fall 2012: Analyzing Big Data with Twitter Assignment 3 Twitter Graph assignment

Assignment Goals

The goal of this assignment is to give you experience with the Twitter social graph, and a bit of exposure to some simple graph algorithms and graph code, by having you acquire data from the Twitter social graph and process it in various ways.

Background

You have probably learned about graph algorithms in your other courses; some are especially relevant to social media. We’ve already looked a bit at friending and following on Twitter. We can link up friends and followers into a graph, which can be considered the “interest graph”. We can also look at mentions of one user by another as another kind of “conversation graph” within Twitter. (A mention occurs when one user includes the handle of another user in the text of their tweet, via the @ symbol. A retweet might also be considered a mention but isn’t strictly one.)

For this assignment, we will explore both the conversation graph and the interest graph and examine their properties a bit.

Since we have students who use both Python and Java in the class, this assignment tries to accommodate both. This time Python is the easier tool to use and will be the tool we focus on, but you are free to use a Java equivalent. The tools to download to follow the instructions as given here are NetworkX (Python) and GraphViz (which is programming language neutral). Java options are JUNG or JGraphT or any other package that you like. This link compares JGraphT and JUNG. Another graph plotting library that looks good (but which I haven’t tried yet) is .

Part 1: Take a Look at the Conversation Graph

We first need to build up a graph of who mentions whom. We’ll use the streaming API from Assignment 2 to build this. What we want is a large set of ordered pairs of the form of (User A, User B) in which user A mentions user B in a tweet. (If you want more details, these steps are based on a series of blog posts by Isaac Hepworth.)

We are going to feed this into a graph visualization tool. One that you can use, and for which instructions are provided below, is GraphViz (but you are welcome to use other visualization tools instead). An advantage of GraphViz for this assignment is that it has the ability to extract connected components, which will be useful for plotting the biggest connected structures.

Here are some steps to follow.

(a) Download 25,000 @ mentions into a file using the sample stream. You’ll need to record the screen name of the person who wrote the tweet as well as who they mentioned (there is a user_mentions type built into the entities portion of the streaming API). I ran the streaming API for about 15 minutes to get this many unique mentions.
(b) Lowercase the screen names (it’s easiest to do this while you download).
(c) Remove duplicates (you might want to use unix commands sort | uniq).
(d) Create a file of the format that is acceptable to your graph plotting program. For GraphViz, you need it to read like the sample below (spaces don’t matter).

    digraph mentions {
    "000icm000" -> "miyavi_official"
    "005dw" -> "u_td5"
    "007villegas" -> "carolacortez327"
    "00amnah" -> "sontk2011"
    "00genetaylor00" -> "tcannada"
    "00nere00" -> "rafamoratete"
    "00sleepy00" -> "brjohnson2000"
    "0101s" -> "aro3_brodcast"
    }
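Steps (a) through (d) can be sketched in Python roughly as follows. This is a minimal illustration, not the course's reference code: it assumes each line of your download file is one tweet in the JSON form the streaming API returns (with user.screen_name and entities.user_mentions), and the helper names mention_pairs and to_dot as well as the two sample tweets are made up for the example.

```python
import json

def mention_pairs(tweet):
    """Return lowercased (author, mentioned) pairs for one decoded tweet."""
    author = tweet.get("user", {}).get("screen_name")
    mentions = tweet.get("entities", {}).get("user_mentions", [])
    if not author:
        # Non-tweet stream messages (e.g. deletes) carry no user; skip them.
        return []
    return [(author.lower(), m["screen_name"].lower()) for m in mentions]

def to_dot(pairs, name="mentions"):
    """Format the unique pairs as a GraphViz digraph, one edge per line."""
    lines = ["digraph %s {" % name]
    for a, b in sorted(set(pairs)):  # set() removes duplicate mentions
        lines.append('"%s" -> "%s"' % (a, b))
    lines.append("}")
    return "\n".join(lines)

# Two hypothetical tweets, trimmed to the fields used here.
sample = [
    '{"user": {"screen_name": "Alice"}, "entities": '
    '{"user_mentions": [{"screen_name": "Bob"}]}}',
    '{"user": {"screen_name": "alice"}, "entities": '
    '{"user_mentions": [{"screen_name": "BOB"}, {"screen_name": "carol"}]}}',
]
pairs = [p for line in sample for p in mention_pairs(json.loads(line))]
dot_text = to_dot(pairs)
print(dot_text)
```

Lowercasing before deduplicating matters here: the two sample tweets mention "Bob" and "BOB", which collapse into a single edge.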

(e) Once you have this, you can plot it in GraphViz. Because it can take a long time to render, it’s best not to just plot the whole thing in GraphViz. Instead, you can use the connected components tool to plot the subgraphs. GraphViz has several different plotting tools and parameters. Here we just give the incantation for plotting these graphs with a black background and white lines, in a square layout.

Assuming your digraph is named mentions.gv, this command will create a png file that has the top 1000 connected components. If you change the 1000 to 5, you’ll get the 5 largest connected components instead.

    ccomps -zX#0-1000 mentions.gv | \
    grep "-" | cat <(echo "digraph mentions {") - <(echo "}") | \
    sfdp -Gbgcolor=black -Ncolor=white -Ecolor=white \
    -Nwidth=0.02 -Nheight=0.02 -Nfixedsize=true \
    -Nlabel='' -Earrowsize=0.4 -Gsize=75 -Gratio=fill \
    -Tpng > ccomp0-1000.large.png
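If you would rather select the largest components in Python before handing the edges to a plotting tool, the idea can be sketched as below. This is not how ccomps is implemented; it is a plain breadth-first search that ignores edge direction (weakly connected components), and the function names and the tiny edge list are hypothetical.

```python
from collections import defaultdict, deque

def weak_components(edges):
    """Group nodes into weakly connected components, largest first."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)  # ignore direction when finding components
    seen = set()
    components = []
    for start in list(neighbors):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return sorted(components, key=len, reverse=True)

def edges_in_top(edges, n):
    """Keep only the edges that belong to the n largest components."""
    keep = set().union(*weak_components(edges)[:n])
    return [(a, b) for a, b in edges if a in keep]

edges = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "f")]
components = weak_components(edges)
largest_only = edges_in_top(edges, 1)
print([sorted(c) for c in components])
print(largest_only)
```

The filtered edge list can then be written out in the digraph format shown earlier and rendered with sfdp or another layout tool.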


Here is the plot that I got for the 5 largest connected components for 26,993 unique mentions, as well as the largest 1000.


It would be cool to modify the graph to show labels of the user screen names for the largest connected components.

To turn in for Part 1: At least two images of a Twitter mention graph based on at least 20,000 edges. Not all edges need to be in the image (as shown above, not all the edges are there, because the connected components function removed many of them). Your images may look different from what I’ve shown here; you can follow these instructions if you want, but they are just meant to get you started. Feel free to use your imagination!

If you want to do more graph analysis, please do! If you want to use more nodes, or gather mentions in some other way, or use retweets somehow, please do! These instructions are meant to be a starting point to help you explore.

Points:

• [8 pts] For producing your own graph and at least two images, and your commentary on what the image appears to mean.
• [1 pt] For enhancing the visualization in some manner.
• [1 pt] For enhancing the processing of the tweets or the graph in some manner.

Part 2: Take a Look at the Interests Graph

Now we’ll take a look at the interests graph. You can do this either with your own friends network or you can choose some user of interest to start with. In the explanations below I’m going to assume you are using your own friends network. It’s a good idea to make sure there aren’t too many friends or too few friends in the network, however; between 100 (so not too sparse) and 500 (so not too big) is best. (This is based on a blog post by Drew Conway.)

The idea is to find users who share your interests closely but who you may be unaware of. The first step is to gather up the people connected to you via your direct friends, then find the friends of those friends. This will lead to a lot of single-chain links, so these are eliminated, and a core of more tightly linked people will be extracted out, by using k-core analysis. From the core, you will then look for the friends of friends that most of this group follows, and that you don’t already follow. In other words, the code shows you how to effectively close the open triads. You’ll see which of these missing links are most popular, and then you can decide if you want to consider following them or not.
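To make the procedure concrete, here is a small pure-Python sketch of the two ideas: pruning to a k-core, then counting how many of your friends follow each account you don’t yet follow. It is an illustration under assumptions, not the enclosed course file: the follows dictionary (user mapped to the set of accounts they follow), the choice k=2, and the names k_core and recommend are all hypothetical. NetworkX provides its own k_core function if you prefer to use the library.

```python
from collections import Counter, defaultdict

def k_core(edges, k):
    """Return the nodes left after repeatedly dropping any node with
    fewer than k neighbors (edge direction is ignored)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    changed = True
    while changed:
        changed = False
        for node in list(neighbors):
            if len(neighbors[node]) < k:
                # Remove the node and detach it from its neighbors.
                for other in neighbors.pop(node):
                    neighbors[other].discard(node)
                changed = True
    return set(neighbors)

def recommend(follows, me, k=2, top=10):
    """Count, for each core account I don't follow, how many of my
    core friends follow it (the open triads that could be closed)."""
    friends = follows.get(me, set())
    edges = [(me, f) for f in friends]
    for f in friends:
        edges.extend((f, x) for x in follows.get(f, set()))
    core = k_core(edges, k)
    counts = Counter()
    for f in friends & core:
        for x in follows.get(f, set()):
            if x in core and x != me and x not in friends:
                counts[x] += 1
    return counts.most_common(top)

# Hypothetical toy network.
follows = {
    "me": {"ann", "bob", "cat"},
    "ann": {"bob", "dan", "eve"},
    "bob": {"dan", "me"},
    "cat": {"dan", "fay"},
}
suggestions = recommend(follows, "me")
for name, n in suggestions:
    print("%d of your friends are already following %s" % (n, name))
```

In this toy network "eve" and "fay" hang off a single link each, so the 2-core drops them, and "dan" is the only remaining account that closes a triad.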

We’ll also look at the graph properties of these networks as we go.

The first thing you need to do is download and install NetworkX if you are using python, or another graph package if you prefer. You can use different steps if you like to get a similar outcome if you don’t use python. (Here is a very nice introduction to NetworkX and graph properties.)

You may follow the steps and comments in the enclosed python file. Here are the top results for UCBTweeter, the Twitter account for our course; looks like there are some interesting suggestions to follow!

20 of your friends are already following UCBTweeter
8 of your friends are already following MarsCuriosity
7 of your friends are already following ruslansv
7 of your friends are already following ML_Hipster
7 of your friends are already following FakeDorsey
6 of your friends are already following sippey
6 of your friends are already following neiltyson
6 of your friends are already following marissamayer
5 of your friends are already following vkhosla
5 of your friends are already following twoffice
5 of your friends are already following sagemintblue
5 of your friends are already following daltonc
5 of your friends are already following chanian
5 of your friends are already following MartiHearst
5 of your friends are already following BorowitzReport
4 of your friends are already following wattenberg
4 of your friends are already following wangtian
4 of your friends are already following squarecog
4 of your friends are already following paul_irish
4 of your friends are already following mlevchin
4 of your friends are already following mccv
4 of your friends are already following matei_zaharia
4 of your friends are already following mashable
4 of your friends are already following hackweek

Now there are other things you can do with this graph. If you convert it from a directed graph (DiGraph) to an undirected graph, you can compute the clustering coefficients and the number of triangles. The clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together, and can be computed as the fraction of possible triangles that exist; gaps suggest potential recommendations of new edges (or followers). Play around with these tools with your interest graph.
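As a worked illustration of that definition, the sketch below counts triangles per node and divides by the number of neighbor pairs that could have formed one. It is written in plain Python on a made-up edge list so the arithmetic is visible; NetworkX’s triangles and clustering functions compute the same quantities on an undirected graph. The function name triangles_and_clustering is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def triangles_and_clustering(edges):
    """Map each node to (triangle count, local clustering coefficient),
    treating every edge as undirected."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)
    result = {}
    for node, nbrs in neighbors.items():
        # A triangle exists for each pair of neighbors that are linked.
        closed = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
        possible = len(nbrs) * (len(nbrs) - 1) // 2
        result[node] = (closed, closed / possible if possible else 0.0)
    return result

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
stats = triangles_and_clustering(edges)
for node in sorted(stats):
    print(node, stats[node])
```

Node "c" has three neighbors and therefore three possible triangles, of which only one (a, b, c) exists, so its coefficient is 1/3; the two missing edges are the kind of gap that suggests a recommendation.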

You can also output the NetworkX graphs and visualize them in GraphViz. You’ll have to write code to convert the NetworkX format to the GraphViz format or else install the pygraphviz library (I haven’t tried that yet).

To turn in for part 2: Turn in a list of 10 recommended Twitter users to follow based on closing triads in a follower list. You may use the code supplied here, modify the code, or develop your own algorithms. You may base it on your own user id or on some other Twitter handle of interest.

Points:
• [8 points] For following this code and producing your own code to generate an interest list. Discuss whether or not you’d be interested in following these discovered links.
• [1 point] For experimenting with clustering coefficients and other graph measures: explain your findings!
• [1 point] For visualizing your results in a way that sheds light on the graphs.

Part 3: Compare the Interests Graph with the Conversations Graph

Now let’s put these two parts together. In this final step, you’ll compare the @mention graph (the conversation graph) to the @friends (or @follow) graph (the interest graph). There are several ways you can go about this.
a) For those users who are most central in the interest graph of Part 2, download as many of their tweets as the API will allow you to do, extracting out the mentions from those tweets. Build a graph from those mentions, and compare it to the @friends graph that you built it from. It’s up to you how you do the comparison, but most likely you will want to use some combination of the graph analysis and visualization tools you used in Parts 1 and 2.
b) An alternative is to start with the @mentions graph from Part 1 and look at a subset of the @friends or @followers graphs, say of the centrally connected components of the @mention graph. Compare the new @friend graph to the centrally connected components derived from the @mentions graph. It’s up to you to decide how many users’ friends to look at, which ones to look at, etc. Of course, it’s best if your choices are well-motivated.
c) Your idea goes here. Please motivate your choices. If you want to do something else with Twitter graphs besides mentions/interest, please contact the Prof about it first, at least one week before the due date.
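One simple starting point for the comparison, offered only as a sketch, is to measure how much the two edge sets overlap for the same group of users. The metric names, the function compare_edge_sets, and the example edges below are all hypothetical; you are free to pick other measures.

```python
def compare_edge_sets(mention_edges, follow_edges):
    """Summarize the overlap between directed mention and follow edges."""
    mentions = set(mention_edges)
    follows = set(follow_edges)
    shared = mentions & follows
    union = mentions | follows
    return {
        "shared_edges": len(shared),
        # Jaccard similarity: shared edges over all distinct edges.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Share of mentions directed at someone the author also follows.
        "mentions_also_followed": len(shared) / len(mentions) if mentions else 0.0,
        # Share of follow links that also show up as a mention.
        "follows_also_mentioned": len(shared) / len(follows) if follows else 0.0,
    }

mention_edges = [("a", "b"), ("b", "c"), ("a", "d")]
follow_edges = [("a", "b"), ("c", "b"), ("a", "d"), ("d", "a")]
summary = compare_edge_sets(mention_edges, follow_edges)
print(summary)
```

Numbers like these only open the analysis; the commentary should explain what the overlap (or lack of it) says about how the chosen users converse versus whom they follow.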

To turn in for Part 3:
• [10 points] A comparison of the interest graph and the conversation graph for some user or users. You should act as an analyst and provide a thoughtful commentary backed up by the data that you are showing.

Important Note

As in the earlier assignments, you are expected to do your own work. For this assignment, you are allowed to find helpful code and ideas online, but to the degree that you do, be sure to provide attribution to that code and those ideas, stating where you got them from and what parts you contributed (you don’t need to do this for the code provided explicitly in the assignment). If you do use outside code, you should not take all code verbatim, but instead should combine provided code with your own, and you should not copy solutions from your classmates or other people.

This assignment was written with a great deal of help from @gilad.
