Examining the Six Degrees in Film: George Clooney As the New Kevin Bacon


Gerardo Pelayo García (16gpg1), Gary Chen (18gc5), Michael Zuo (18mz4)

Introduction

“Six Degrees of Kevin Bacon” is a parlor game built on the premise that any actor can be linked to Kevin Bacon within six degrees of separation or fewer through shared film or television work [1]. The same idea appears in the Erdős number, which measures degrees of separation from mathematician Paul Erdős through co-authorship of mathematical papers, and it generalizes to the theory of six degrees of separation, which applies to all human beings. This project aimed to find, algorithmically, the actors and actresses with the highest “Baconability”, or centrality within the network of actors, using data from the Internet Movie Database (IMDb) [2]. The end goal was to find a suitable replacement for, or runner-up to, Kevin Bacon in the context of his own parlor game.

[1] http://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon
[2] http://www.imdb.com/interfaces

Methodology

To address this problem, we treated the data as a graph of actors and films. We implemented it with nested lists: a list of actors, each with a list of associated works, and a list of films, each with a list of participating actors. One can picture the data structure either as actor nodes connected by edges of common works, or as a graph with two “colors” of nodes (actors and films) connected by simple association. Either way, our program works the same way algorithmically. To assess the centrality of each actor, we considered criteria such as the number of directly associated actors (the actor's degree centrality), the number of unique works, and the degrees of separation (distance) from Bacon. We decided to calculate the percentage of all actors connected to the actor within a certain depth of separation. We applied this assessment to 188 popular actors, which were found by running our algorithm on one thousand randomly generated subsets.
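The nested-list representation described above could look roughly like the following minimal sketch. The toy pairs and the names `build_lists`, `films_of`, `actors_in`, and `costars` are illustrative assumptions, not taken from the project's code.

```python
# Hypothetical toy data: (actor index, film index) pairs.
pairs = [(0, 0), (1, 0), (1, 1), (2, 1), (3, 2)]

def build_lists(pairs, num_actors, num_films):
    """Build the two nested lists: films per actor and actors per film."""
    films_of = [[] for _ in range(num_actors)]
    actors_in = [[] for _ in range(num_films)]
    for actor, film in pairs:
        films_of[actor].append(film)
        actors_in[film].append(actor)
    return films_of, actors_in

def costars(actor, films_of, actors_in):
    """Direct associates of an actor; the size of this set is the degree centrality."""
    return {other
            for film in films_of[actor]
            for other in actors_in[film]
            if other != actor}

films_of, actors_in = build_lists(pairs, num_actors=4, num_films=3)
```

Reading the two lists together gives the two-color view (actor to film to actor), while `costars` collapses it into the actor-only view.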

The IMDb provides two data files that list millions of pairings of actors and their films. Every actor-film association has its own line in the data file, so each actor and each work appears on multiple lines. We cleaned this data of extraneous information such as character roles and television episodes, and simplified it further by assigning each actor and each film a unique integer index that could be referenced later. This meant that the functions in our project did not have to handle text strings directly. The program used to format the data is included, but the data itself is not, due to size constraints.
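As a rough illustration of the indexing step, the sketch below assigns integer indices in order of first appearance. It assumes an already cleaned, hypothetical input with one tab-separated actor and film per line; the real IMDb files use a different layout, and `index_pairs` is an illustrative name.

```python
def index_pairs(lines):
    """Map actor and film names to integer indices and return the index pairs."""
    actor_ids, film_ids = {}, {}
    pairs = []
    for line in lines:
        # Assumed format: "<actor name>\t<film title>" (hypothetical, post-cleaning).
        actor, film = line.rstrip("\n").split("\t")
        a = actor_ids.setdefault(actor, len(actor_ids))
        f = film_ids.setdefault(film, len(film_ids))
        pairs.append((a, f))
    return pairs, actor_ids, film_ids

lines = ["Actor A\tFilm X", "Actor B\tFilm X", "Actor A\tFilm Y"]
pairs, actor_ids, film_ids = index_pairs(lines)
```

The dictionaries are only needed to translate indices back into names when reporting results; the search itself can work on the integer pairs alone.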

The core coverage-calculating algorithm we implemented uses a breadth-first graph search to find the set of actors connected to a particular actor within a depth of association of three. We found that a depth of three produced the most meaningful results without obscuring the data by completely covering the graph. The algorithm iteratively adds the actor and each of his associates to a result array, whose size is then taken. To avoid counting duplicates that arise from cycles in the graph, we tracked the actors already added to the array and skipped any that had been counted.
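A depth-limited breadth-first search of this kind might be sketched as follows. This is an illustration under assumptions rather than the project's actual code: it uses sets instead of a result array for duplicate tracking, and the toy graph (a chain of five actors) and the names `coverage`, `films_of`, and `actors_in` are hypothetical.

```python
def coverage(start, films_of, actors_in, max_depth=3):
    """Count the actors within max_depth co-star hops of start, including start."""
    seen_actors = {start}
    seen_films = set()
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for actor in frontier:
            for film in films_of[actor]:
                if film in seen_films:
                    continue  # every actor in this film has already been added
                seen_films.add(film)
                for other in actors_in[film]:
                    if other not in seen_actors:
                        seen_actors.add(other)
                        next_frontier.append(other)
        frontier = next_frontier
    return len(seen_actors)

# Toy chain: actor 0 - actor 1 - actor 2 - actor 3 - actor 4.
films_of = [[0], [0, 1], [1, 2], [2, 3], [3]]
actors_in = [[0, 1], [1, 2], [2, 3], [3, 4]]
```

On this chain, the actor at one end reaches four of the five actors within depth three, while the middle actor reaches all five, which is the kind of difference the coverage score is meant to capture.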

Naïvely evaluating every actor individually would have taken an impractical amount of time: evaluating a single actor is O(n) in the number of actors, so evaluating all actors is O(n^2). We therefore first took many small, randomized subsets of the data and ran our algorithm on each actor in them.

Because the most central actors are redundantly connected (a reasonable assumption considering the nature of the television and film industry), the top actors were fairly consistent across samples once the number of samples was large enough. The 188 most cumulatively popular actors in all samples were selected and evaluated individually against the entire data set.
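The sampling stage could be sketched as below. The text does not specify how subsets were drawn or how popularity was accumulated, so this sketch assumes that each subset is a random sample of actor-film pairs and that an actor earns one point each time he lands in a sample's top `top_k` by coverage. All names and parameters here are hypothetical.

```python
import random
from collections import Counter

def coverage(start, films_of, actors_in, max_depth=3):
    """Count the actors within max_depth co-star hops of start, including start."""
    seen, frontier = {start}, [start]
    for _ in range(max_depth):
        next_frontier = []
        for actor in frontier:
            for film in films_of[actor]:
                for other in actors_in[film]:
                    if other not in seen:
                        seen.add(other)
                        next_frontier.append(other)
        frontier = next_frontier
    return len(seen)

def sample_candidates(pairs, num_samples, sample_size, top_k, max_depth=3, seed=0):
    """Tally how often each actor ranks in the top_k of a random subset."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(num_samples):
        subset = rng.sample(pairs, min(sample_size, len(pairs)))
        films_of, actors_in = {}, {}
        for actor, film in subset:
            films_of.setdefault(actor, []).append(film)
            actors_in.setdefault(film, []).append(actor)
        scores = {a: coverage(a, films_of, actors_in, max_depth) for a in films_of}
        best = sorted(scores, key=lambda a: (-scores[a], a))[:top_k]
        tally.update(best)
    # Candidates ordered by cumulative popularity across all samples.
    return [actor for actor, _ in tally.most_common()]

# Toy star graph: actor 0 shares one film with each of actors 1 to 5.
pairs = [(0, f) for f in range(5)] + [(f + 1, f) for f in range(5)]
```

Because each subset is small, every pass is cheap, and only the shortlisted candidates need the expensive evaluation against the full data set.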

Note that our task is very parallelizable; a modified version of the program which does in fact evaluate every single actor is included. It can be run on a fast computer with speedup roughly proportional to the number of processors available. With minor modification the program can be run on multiple computers with similar speed improvement.
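Since each actor's score is independent of the others, one way to spread the work across processors is a process pool, as in the sketch below. This uses Python's standard `multiprocessing.Pool` and is not the project's own parallel program; the toy graph and the names `FILMS_OF`, `ACTORS_IN`, `coverage`, and `score_all` are assumptions.

```python
from multiprocessing import Pool

# Hypothetical toy graph (a chain of five actors); the real lists would be
# loaded from the indexed data.
FILMS_OF = [[0], [0, 1], [1, 2], [2, 3], [3]]
ACTORS_IN = [[0, 1], [1, 2], [2, 3], [3, 4]]

def coverage(start, max_depth=3):
    """Count the actors within max_depth co-star hops of start, including start."""
    seen, frontier = {start}, [start]
    for _ in range(max_depth):
        next_frontier = []
        for actor in frontier:
            for film in FILMS_OF[actor]:
                for other in ACTORS_IN[film]:
                    if other not in seen:
                        seen.add(other)
                        next_frontier.append(other)
        frontier = next_frontier
    return len(seen)

def score_all(actors, processes=None):
    """Score every actor; processes=1 runs serially, otherwise a pool is used."""
    if processes == 1:
        return {a: coverage(a) for a in actors}
    with Pool(processes) as pool:
        return dict(zip(actors, pool.map(coverage, actors)))
```

On platforms that start worker processes by spawning a fresh interpreter, a call such as `score_all(list(range(5)), processes=4)` should sit under an `if __name__ == "__main__":` guard.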

Results

Top 10 Best-Connected Actors… and Kevin Bacon

Rank  Name                    Associations
1     George Clooney          2,950,145
2     Samuel L. Jackson       2,948,974
3     Will Smith              2,947,396
4     Tom Hanks               2,947,162
5     Arnold Schwarzenegger   2,947,153
6     Hugh Jackman            2,946,940
7     Morgan Freeman          2,945,567
8     Bill Clinton            2,945,504
9     Harrison Ford           2,945,390
10    Matt Damon              2,945,346
55    Kevin Bacon             2,935,433

Of the 188 actors we evaluated against the data set, the top ten most popular actors are presented in the table above. A list of complete results for all 188 is included with this document.

George Clooney tops our list, with 2,950,145 associated actors. Surprisingly, former President Bill Clinton appears eighth on our list, possibly due to his appearances on television as himself.

The lack of diversity among the most popular actors should be noted. Of the top ten actors, none are women, only two races are represented (seven of the ten are white), and only one was born outside the United States. The first woman to appear in our rankings is Whoopi Goldberg, who places fourteenth with a score of 2,944,303. Jennifer Lopez, the highest-ranked Latina on our list, comes in at number fifteen.

The most compelling result of our project was that Kevin Bacon himself ranks only 55th on our list. This suggests that Kevin Bacon is not the best subject for the six degrees of film experiment, and that George Clooney is in fact, for this purpose, a more suitable Kevin Bacon than Bacon himself.