Visualisation and Analysis of the Internet Movie Database∗
Adel Ahmed, School of IT, University of Sydney; NICTA, Australia
Vladimir Batagelj, Discrete and Computational Mathematics, University of Ljubljana, Slovenia
Xiaoyan Fu, NICTA, Australia
Seok-Hee Hong, School of IT, University of Sydney; NICTA, Australia
Damian Merrick, School of IT, University of Sydney; NICTA, Australia
Andrej Mrvar, Social Science Informatics, University of Ljubljana, Slovenia

∗This paper is based on the winning entry of the Graph Drawing Competition 2005 [7] and an invited presentation at the Sunbelt Viszard Session [9].

Asia-Pacific Symposium on Visualisation 2007, 5 - 7 February, Sydney, NSW, Australia

ABSTRACT

In this paper, we present a case study for the visualisation and analysis of large and complex temporal multivariate networks derived from the Internet Movie Database (IMDB). Our approach is to integrate network analysis methods with visualisation in order to address scalability and complexity issues. In particular, we defined new analysis methods such as (p,q)-core and 4-ring to identify important dense subgraphs and short cycles from the huge bipartite graphs. We applied island analysis for a specific time slice in order to identify important and meaningful subgraphs. Further, a temporal Kevin Bacon graph and a temporal two mode network are extracted in order to provide insight and knowledge on the evolution.

Keywords: Large and Complex Networks, Case Study, Visualisation, Network Analysis, IMDB.

Index Terms: H.5.2 [Information Interfaces and Presentation]: User Interfaces - Algorithms; I.3.6 [Computer Graphics]: Methodology and Techniques

1 INTRODUCTION

Recent technological advances have led to the production of large volumes of data, and consequently to many large and complex network models across a number of domains. Examples include:

• Webgraphs: the entities are web pages and the relationships are hyperlinks. These are huge: the whole graph consists of billions of nodes.
• Social networks: These include telephone call graphs (used to trace terrorists), money movement networks (used to detect money laundering), and citation networks or collaboration networks. The size of the network can be medium to very large.
• Biological networks: Protein-protein interaction (PPI) networks, metabolic pathways, gene regulatory networks and phylogenetic networks are used by biologists to analyse and engineer biochemical materials. In general, they are smaller, with thousands of nodes. However, the relationships in these networks are very complex.

Understanding these networks is a key enabler for many applications. Good analysis methods are needed for these networks, and some are available. However, such methods are not useful unless the results are effectively communicated to humans. Visualisation can be an effective tool for the understanding of such networks. Good visualisation reveals the hidden structure of the networks and amplifies human understanding, thus leading to new insights, new findings and possible predictions for the future.

We can identify the following challenging research issues for analysis and visualisation of large and complex networks:

• Scalability: Webgraphs or telephone call graphs gathered by AT&T have billions of nodes. In some cases, it is impossible to visualise the whole graph, or even to load the whole graph into main memory. Hence, the design of new analysis and visualisation methods for huge networks is a key research challenge from databases to computer graphics.
• Complexity: Relationships between actors in a social network, for example, can have a multitude of attributes (for example, observed behavior can be confirmed or unconfirmed, relationships can be directed or undirected, and weighted by probabilities). Also, biological networks are quite complex in nature; for example, metabolic pathways have only a few thousand nodes, but their relationships and interactions are very complex. The data may be given by nature, but some parts of the data may be unknown to human scientists. The design of analysis and visualisation methods to resolve these complexity issues is the second research challenge.
• Network Dynamics: Real world networks are always changing over time. Many social networks, such as webgraphs, evolve relatively slowly over time. In some cases, such as telephone call networks, the data is a very fast-streamed graph. Effective and efficient modeling, analysis and visualisation for dynamic networks are challenging research topics.

One approach to solving these challenging issues is the integration of analysis with visualisation and interaction. Analysis tools for networks are not useful without visualisation, and visualisation tools are not useful unless they are linked to analysis. Further, interaction is necessary to find out more details or insights from the visualisation.

In this paper, we present a case study for our approach to integrating analysis, visualisation and interaction using large and complex temporal multivariate networks derived from the IMDB (Internet Movie Database). In general, the IMDB is a huge and very rich data set with many attributes. Note that the IMDB data set has become a challenging data set for visualisation researchers [7, 9]. For example, a multi-scale approach for visualisation of small world networks was used for data sets from IMDB [3]. A visualisation approach for dynamic affiliation networks in which events are characterized by a set of descriptors was presented [6].
A radial ripple metaphor was devised to display the passing of time and to convey relations among the different constituents through appropriate layout. Note that the method is suitable for an egocentric perspective.

As the first step of our approach, we integrate network analysis methods [5, 10] with visualisation. In particular, we defined new analysis methods such as (p,q)-core and 4-ring to identify important dense subgraphs and short cycles from the huge bipartite graphs. We applied island analysis for a specific time slice in order to identify important and meaningful subgraphs of the large and complex network. Further, a temporal Kevin Bacon graph and a temporal two mode network are extracted and visualised in order to provide insight and knowledge on the evolution of the IMDB data set.

This paper is organised as follows. In the next Section, we present a simple analysis of the IMDB data set. In Section 3, we present the integration of network analysis methods with visualisation for large bipartite graphs including (p,q)-core, 4-ring and island.

1-4244-0809-1/07/$20.00 © 2007 IEEE

Figure 1: Arcs with multiplicity at least 8

Table 1: (p,q : n1,n2) for IMDB

 p     q:   n1    n2 |  p   q:   n1    n2 |  p   q:  n1   n2
 1  1590: 1590     1 | 22  24: 1854  1153 | 43  14:  29   83
 2   516:  788     3 | 23  23:   47    56 | 44  14:  29   83
 3   212: 1705    18 | 24  23:   34    39 | 45  13:  30   95
 4   151: 4330   154 | 25  22:   42    53 | 46  13:  29   94
 5   131: 4282   209 | 26  22:   31    38 | 47  12:  29  101
 6   115: 3635   223 | 27  22:   31    38 | 48  12:  28  100
 7   101: 3224   244 | 28  20:   36    53 | 49  12:  26   95
 8    88: 2860   263 | 29  20:   35    52 | 50  11:  27  111
 9    77: 3467   393 | 30  19:   35    59 | 51  11:  26  110
10    69: 3150   428 | 31  19:   35    59 | 52  11:  16   79
11    63: 2442   382 | 32  19:   34    57 | 53  10:  35  162
12    56: 2479   454 | 33  18:   34    62 | 54  10:  35  162
13    50: 3330   716 | 34  18:   34    62 | 55  10:  34  162
14    46: 2460   596 | 35  18:   33    61 | 56  10:  34  162
15    42: 2663   739 | 36  17:   33    65 | 57   9:  35  187
16    39: 2173   678 | 37  16:   33    75 | 58   9:  33  180
17    35: 2791   995 | 38  16:   30    73 | 59   9:  33  180
18    32: 2684  1080 | 39  16:   29    70 | 60   9:  32  178
19    30: 2395  1063 | 40  15:   29    77 | 61   9:  31  177
20    28: 2216  1087 | 41  15:   28    76 | 62   9:  31  177
21    26: 1988  1087 | 42  15:   28    76 | 63   8:  31  202

• 4-ring weights on lines

3.1 (p,q)-core Analysis

The subset of vertices $C \subseteq V$ is a (p,q)-core in a bipartite (2-mode) network $N = (V_1, V_2; L)$, $V = V_1 \cup V_2$, if and only if

a. in the induced subnetwork $K = (C_1, C_2; L(C))$, $C_1 = C \cap V_1$, $C_2 = C \cap V_2$, it holds that $\forall v \in C_1 : \deg_K(v) \ge p$ and $\forall v \in C_2 : \deg_K(v) \ge q$;
b. $C$ is the maximal subset of $V$ satisfying condition a.

The basic properties of bipartite cores are:

• $C(0,0) = V$
• $K(p,q)$ is not always connected
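The (p,q)-core definition in Section 3.1 can be illustrated with a minimal Python sketch. It assumes the bipartite network is given as a list of (u, v) pairs with u in V1 and v in V2, and it repeatedly deletes vertices whose degree has dropped below the threshold of their side, which leaves the maximal subset satisfying condition a. The function name `pq_core`, the edge-list input format and the toy data are illustrative assumptions, not the authors' implementation.

```python
def pq_core(edges, p, q):
    """Return (C1, C2), the two vertex sets of the (p,q)-core.

    edges: iterable of (u, v) pairs, u in V1 and v in V2 (assumed format).
    Vertices in C1 keep degree >= p, vertices in C2 keep degree >= q.
    """
    # side -> vertex -> set of neighbours on the other side
    nbrs = {1: {}, 2: {}}
    for u, v in edges:
        nbrs[1].setdefault(u, set()).add(v)
        nbrs[2].setdefault(v, set()).add(u)

    need = {1: p, 2: q}  # degree threshold for each side

    # Vertices that already violate their threshold.
    pending = [(s, x) for s in (1, 2)
               for x, nb in nbrs[s].items() if len(nb) < need[s]]
    doomed = set(pending)

    # Delete violating vertices; each deletion may push neighbours
    # below their own threshold, so they are queued in turn.
    while pending:
        side, x = pending.pop()
        other = 3 - side
        for y in nbrs[side].pop(x):
            nbrs[other][y].discard(x)
            if len(nbrs[other][y]) < need[other] and (other, y) not in doomed:
                doomed.add((other, y))
                pending.append((other, y))

    return set(nbrs[1]), set(nbrs[2])


# Toy data (hypothetical): three V1 vertices, two V2 vertices.
sample_edges = [("a1", "m1"), ("a1", "m2"),
                ("a2", "m1"), ("a2", "m2"),
                ("a3", "m1")]

core_2_2 = pq_core(sample_edges, 2, 2)
core_0_0 = pq_core(sample_edges, 0, 0)
```

On the toy data, `core_2_2` keeps a1, a2, m1 and m2 (a3 has only one neighbour), while `core_0_0` contains every vertex that occurs in an edge, in line with the property C(0,0) = V. Isolated vertices cannot be expressed in an edge list, so this sketch ignores them.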