Introduction to and Link Analysis

David Loshin Knowledge Integrity, Inc. TDWI 2007 Spring Conference Boston, MA


1 © 2006 Knowledge Integrity Incorporated 2 www.knowledge-integrity.com (301) 754-6350

Half-Day Agenda

• Introduction to Networks, social and otherwise
• Network Connectivity Basics
• Link and Network Analysis
• Issues and Considerations for BI


In this talk, we will discuss the notion of connectivity, and why models for analyzing connections can add value to a business intelligence initiative. By reviewing the ways that objects interact through networks, we will explore whether the results of this analysis can enhance profiles, predictive analytics, and general business intelligence.

Objectives:
• Understand network connectivity basics
• Explore ways to represent networks
• Understand the types of analysis that can be performed
• Envision network analytics, data extraction, and preparation

Introduction to Networks, Social and Otherwise


Networks, Links, and Coincidences?

• How many people do you know?
  - Family, friends, co-workers, conference attendees
  - Dozens? Hundreds? Thousands?
• How well do you know them?
  - Very close, know them well, acquaintances, “just met”
• Concepts of Connectivity?
  - You know 1,000 people
  - They each know 1,000 people
  - Therefore, you are potentially connected to 1,000,000 people through just 1 link
  - By 2 links, the network could potentially extend to 1,000,000,000 people!
• “Small World Theory” – we are all connected through a very small number of links (See Milgram, Bacon)


The notion of connectivity is intriguing, especially when considering individuals, other types of parties, and the knowledge that can be derived through the analysis of connections.

Consider the example on this slide. Let’s assume that we each know about 1,000 people. Is it really true that each individual is therefore linked to 1,000,000 people? Conceptually, that would be true only if the 1,000 people I know were completely different from the 1,000 people that you know, and so on for every other pair of people in the chain.

But in reality, we all seem to run around in similar circles, and so there is a great likelihood that many of the people that I know are the same people that you know. The consequence of this is the effective “self-organization” of communities (as well as sub-communities and sub-sub-communities). By examining the relationships that exist among groups of people, we can learn who the influencers are, who the influenced are, who spans critical communication boundaries, and how information (or commerce, or viruses, etc.) flows through the selected community.

Euler’s Insight

Bridges of Konigsberg


One pastime of the residents of Konigsberg was to walk around town over the bridges between the different land areas of town. One game was to see if one could start at one location, walk over every bridge just once, and end up at the starting point. Mathematician Leonhard Euler abstracted the problem into a “graph” – a collection of nodes and links between them. By examining the graph, he was able to determine, based on the degree of each node (the number of links that touch it), that the challenge of the bridges of Konigsberg was actually impossible. This insight created the branch of mathematics referred to as graph theory, which is the fundamental basis of network (and consequently, social network) analysis.
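Euler’s degree argument can be sketched in a few lines of Python. The node labels (N and S for the two river banks, I for the island, E for the eastern land area) and the variable names are assumptions made for illustration. The rule applied is the standard result that a connected graph has a closed walk using every link exactly once only if every node has an even degree.

```python
from collections import Counter

# The seven bridges, written as links between the four land areas.
# Labels are illustrative: N/S = river banks, I = island, E = eastern area.
bridges = [("N", "I"), ("N", "I"), ("S", "I"), ("S", "I"),
           ("N", "E"), ("S", "E"), ("I", "E")]

# Degree of a node = number of bridge ends that touch it.
degree = Counter()
for a, b in bridges:
    degree[a] += 1
    degree[b] += 1

# A closed walk over every bridge exactly once needs every degree to be even.
odd_nodes = sorted(node for node, d in degree.items() if d % 2 == 1)
has_closed_walk = len(odd_nodes) == 0

print(dict(degree), "closed walk possible:", has_closed_walk)
```

Every land area turns out to have an odd degree, which is why the game cannot be won.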

Network and Link Analysis

• Linkages exist everywhere
  - Between individuals (“MCI Friends and Family”)
  - Between locations (“Bridges of Konigsberg”)
  - Between other types of objects (“Telephone network”)
  - Between individuals and other kinds of objects (“Purchasing Preferences”)
  - Between businesses (“D&B corporate hierarchies”)
• There are different kinds of links
• Each link has some sort of attribution
• Analyzing networks can provide insight for evaluating behavior patterns for different intelligence activities


There are many applications that rely on the power of the network. Each of these networks represents some attempt to exploit the different kinds of connections that exist among small groups of individuals, larger groups of individuals as well as how the groups themselves interact. Applications may be designed to seek out some interesting pattern within the network or to exploit the communication and information exchanges provided by the network.

Every node and each of its corresponding links carries certain characteristics. Each node represents an entity, while each link carries attributes that describe the nature of the relationship.

Applications of Network/Link Analysis

• Enforcement: Criminal analysis, money laundering
• Fraud detection: spambot detection, call pattern analysis
• Marketing: Customer Behavior analysis, Segmentation, collaborative filtering
• Community analyses: Account proxy (account used by more than one individual, many accounts used by one individual), research collaboration, communities of interest


More Applications…

• Health care: Contagion, disease control
• Physical: Supply chain analysis
• Transfer/Communications: Spheres of Influence, Information flows, business partnerships
• Formal Relationships: Working relationships, Influential individuals, ownership, accountability, corporate structure
• Informal Relationships: Friendship networks, extended families, social interactions, insider networks


Evidence is All Around…

• Databases
  - Transaction systems, logs, data warehouses
• Semi-structured data
  - Email, web pages, public records, filings
• Unstructured data
  - News items, prospectuses, filings


Typical business intelligence applications focus on the ability to organize information for reporting and analysis across one or more dimensions, but are not usually configured to enable network analysis. Yet data warehouses contain significant amounts of connectivity information that is suitable for network and link analysis. Other sources of information also provide connectivity data: transaction systems, database logs, software activity logs, as well as less structured sources such as emails, web logs, electronic public data filings, and other public records (e.g., real estate transactions, Uniform Commercial Code filings, etc.). In addition, text analysis applications can extract data about individuals from unstructured data to establish connections.

Example: Death Notice

Richard J. Palaima, Of Mattapan, Suddenly, Nov. 18, 2002. Beloved husband of Jonadee (Badayos) Palaima. Devoted son of Madelyn L. (George) Palaima of Braintree, and the late Richard A. Palaima. Devoted brother of John A. Palaima of Braintree, nephew of Catherine Cunningham of Rockland, Cousin & Godson of Robert Cunningham of Rockland. Funeral from the Mortimer N. Peck-Russell Peck Funeral Home, 516 Washington St., BRAINTREE, on Saturday at 9 a.m. Funeral Mass in St. Gregory's Church, Dorchester, at 10 a.m. Relatives and friends invited. Visiting hours Friday 2-4 & 7-9 p.m. Memorial donations may be sent to St. Gregory's Church, 2215 Dorchester Ave., Dorchester 02124.


This example was taken from the Boston Globe, and was available on line from 11/20/2002 - 11/21/2002. Death, birth, engagement, and wedding notices are good examples of publicly available information (published in the newspaper) configured in semi-structured form that provide a lot of data about connections. In this example, we have a description of one individual and his immediate family, his location, and his religious affiliation.

Example: Extracted Entities and Their Links

[Figure: graph of the entities extracted from the death notice (the named individuals, the towns of Mattapan, Braintree, Rockland, and Dorchester, St. Gregory's Church, and the living state “Deceased”), connected by labeled links: “married to,” “father of,” “mother of,” “brother of,” “sister of,” “is cousin of,” “is godson/godfather of,” “lives in,” “located in,” “is religiously affiliated with,” and “has living state of.”]


Good SNA Resource

• Many examples taken from Robert A. Hanneman and Mark Riddle’s online text
  - Introduction to social network methods
  - http://faculty.ucr.edu/~hanneman/nettext/


Integrating Network/Link Analysis with BI

• Network information may be embedded within data warehouse
• However:
  - Representations may not be appropriate for analysis
  - Data may need to be transformed and managed using non-relational data structures
  - Analysis lends itself to visual representation
  - Must understand concepts associated with networks, connectivity, and qualification of linkage
• Objective: Gain a conceptual understanding of
  - Network data and its representation
  - Characteristics of network relationships
  - Types of analysis performed


Network Connectivity Basics


Representing Network Data

• What is network data?
• Two types of objects:
  - Actors (entities that participate in the network)
  - Links (established relationships between the actors)
• Analysis focuses on:
  - Who the actors are
  - What their relationship is to the holistic view of the community
  - How the actors organize within the framework
• Different approaches to representation:
  - Rectangular Data
  - Adjacency Matrices
  - Graph representation


The next set of slides introduces basic concepts of social networks:
- Actors
- Links
- Representations

Rectangular Data

• Relational structure provides a “rectangular” view of the “who knows who” relationship

Actors:
Name     ID  Sex  Age  Degree
John     1   M    23   4
Betsy    2   F    25   2
George   3   M    31   2
Martha   4   F    27   1
Sam      5   M    34   3
Abigail  6   F    32   2

Links:
From  To
1     2
1     3
1     6
1     5
2     1
2     3
3     1
3     2
3     4
4     3
5     1
5     6
6     1
6     5


Standard “rectangular” data, as it appears in most databases, can be used to manage network links, but may be problematic for analysis. Links are represented via an associated table, and while this provides information about the individual, the rectangular format makes it difficult to assess “holistic” information, either about the modeled community as a whole, or about segments or patterns that emerge.

Adjacency Matrix

• This matrix represents the undirected, “who knows who” network among our actors using an adjacency matrix

Rows are the chooser, columns are the choice:

         John  Betsy  George  Martha  Sam  Abigail
John     -     1      1       0       1    1
Betsy    1     -      1       0       0    0
George   1     1      -       1       0    0
Martha   0     0      1       -       0    0
Sam      1     0      0       0       -    1
Abigail  1     0      0       0       1    -


In an adjacency matrix:
- A ‘0’ means that there is no link between the chooser and choice
- A ‘1’ means that there is a link between the chooser and choice

An adjacency matrix lends itself to certain types of analysis that cannot be done through standard rectangular representations.
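As a sketch of how the rectangular From/To rows relate to the matrix, the following Python fragment rebuilds the adjacency matrix from the link table and checks that it is symmetric. The variable names, and the choice of 0 for the diagonal cells that the slide prints as “-”, are assumptions made for illustration.

```python
# Actor IDs and the From/To link rows from the rectangular-data slide.
names = {1: "John", 2: "Betsy", 3: "George", 4: "Martha", 5: "Sam", 6: "Abigail"}
links = [(1, 2), (1, 3), (1, 6), (1, 5), (2, 1), (2, 3), (3, 1), (3, 2),
         (3, 4), (4, 3), (5, 1), (5, 6), (6, 1), (6, 5)]

ids = sorted(names)
index = {actor: pos for pos, actor in enumerate(ids)}

# Start with an all-zero matrix, then mark each (chooser, choice) pair with 1.
# The diagonal stays 0 here; the slide shows "-" for those cells.
matrix = [[0] * len(ids) for _ in ids]
for chooser, choice in links:
    matrix[index[chooser]][index[choice]] = 1

# An undirected network yields a matrix that equals its own transpose.
is_symmetric = all(matrix[i][j] == matrix[j][i]
                   for i in range(len(ids)) for j in range(len(ids)))

for actor, row in zip(ids, matrix):
    print(f"{names[actor]:8s}", row)
print("symmetric:", is_symmetric)
```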

Graphs

• Graphs contain vertices (nodes) and edges (links)

[Figure: graph of the “who knows who” network with nodes Abigail, Betsy, John, Sam, George, and Martha]


A graph is an abstract representation of the same set of information contained within the adjacency matrix. In a graph:
- A vertex represents an actor
- An edge represents a link between actors

Graph representations can feed front ends, which support different analytical processes and methods.

Connectivity Concepts

• Relationships and links
  - Binary, directed, signed
• Measures of relationships
  - Grouped ordinal, ranked ordinal, categorical, interval measures


The link between two individuals can be simple, such as the “who knows who” relationship, or can be much more complex. The different classifications of connections may be based on how much information they carry. The next slides describe different levels of information and provide some examples.

Binary Connectivity

• Represents a relationship determined by the answer to a true/false question
• Examples:
  - Person A and person B know each other
  - Organization X and organization Y contribute to the same charity
  - Person A and person B have purchased product P


A binary connection essentially represents the positive response to a true/false question, while the absence of the link reflects a negative response. Note that in the examples provided here, the connection is symmetric, or undirected. In other words, if A knows B, then B knows A.

Directed Connectivity

• The link is established based on how a question relates two entities in a directed way
• Examples:
  - Person A has emailed person B
  - Person A has visited web site W
  - Organization X has purchased services from organization Y


A directed link differs from an undirected link in that there is no assumption of symmetry. For example, A may have emailed B, but that does not mean that B emailed A. If the relationship exists in both directions, then there will be two directed arcs: one from A to B and one from B to A.

Signed Connectivity

• The link describes the nature of the relationship (e.g. positive, neutral, or negative)
• Examples:
  - Person A hates product P
  - Organization X has been categorized as an approved vendor
  - Person A has provided a neutral score for Airline U’s service


This is the first characteristic of the connection that carries a value, although the values indicate gross level data about the relationship. In this case, the sign indicates the nature of the relationship. For example, a +1 is a positive connection, 0 is a neutral connection, and -1 is a negative connection.

This notion suggests that descriptive metadata about links can embed more complex knowledge about the network and how information flows through the network.

Grouped and Ranked Ordinal

• The links have magnitude, such as “dislikes” = -1, “strongly dislikes” = -2, “vehemently dislikes” = -3
• This provides more meaningful description of the connective characteristics
• Examples:
  - Individuals and vacation destinations (desire to visit)
  - Job references (references)
• In ranked ordinal, links are ranked in order of magnitude
  - Actor X ranks the other actors in terms of who is liked most to the least


In a grouped ordinal model, the links carry both sign (to indicate the positive/negative nature) and magnitude of the relationship. The greater the absolute magnitude, the greater the connective characteristic. Grouped ordinal characterizes different quantitative measures:
- the “strength” of the connections
- the frequency of the interaction
- the intensity of the relationship

In the ranked ordinal model, instead of gauging the links based on a quantitative measure, the links are ordered based on their associative rank. While this is not common in network analysis, the information is often easy to assemble. For example, one can calculate the order of “email connectivity” by counting the number of emails that are exchanged and putting together the ranking based on the counts.
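The email example can be sketched as follows. The log entries, the actor names, and the `rank_contacts` helper are all hypothetical, and for simplicity the sketch counts only messages sent by one actor rather than messages exchanged in both directions.

```python
from collections import Counter

# Hypothetical email log: (sender, recipient) pairs, invented for illustration.
emails = [("ann", "bob"), ("ann", "bob"), ("ann", "cid"), ("ann", "dee"),
          ("ann", "dee"), ("ann", "dee"), ("bob", "ann")]

def rank_contacts(log, actor):
    """Rank the actors that `actor` has emailed, most frequent first."""
    counts = Counter(recipient for sender, recipient in log if sender == actor)
    # most_common() sorts by descending count, so rank 1 is the strongest link.
    return {recipient: rank
            for rank, (recipient, _) in enumerate(counts.most_common(), start=1)}

ann_ranks = rank_contacts(emails, "ann")
print(ann_ranks)
```

The counts themselves are discarded once the ranking is built, which is the difference between a ranked ordinal link and a measure that keeps the magnitude.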

Categorical

• Relationships are defined by category
  - A business relationship is type “1,” personal relationship is type “2,” etc.


In a categorical linkage model, the attribution of the link reflects a segmentation of the relationships based on their qualitative type.

Interval Measures

• Rankings reflect scaling
  - Difference between 2 and 3 is same as difference between 20 and 21


Interval measures not only provide a ranked order, they also capture the relative difference in “intensity” based on the interval order. This approach is the most sophisticated measurement framework. Interval measures can always be reduced in complexity to one of the other measurement types.

Matrix Analysis

        Col 1  Col 2  Col 3  Col 4
Row 1   (1,1)  (1,2)  (1,3)  (1,4)
Row 2   (2,1)  (2,2)  (2,3)  (2,4)
Row 3   (3,1)  (3,2)  (3,3)  (3,4)


A matrix is a rectangular arrangement of link data. Each row and column is assigned an identifier, and each cell within the matrix is indexed by its (row, column) address. For example, the shaded cell in the second row and the third column is addressed “(2,3).” Matrices are one way to capture a representation of a network. Each cell contains the information associated with the link between the entity represented by the row and the entity represented by the column.

Example: Adjacency Matrices

[Figure: directed graph with nodes A, B, C, and D]

Adjacency matrix:

   A  B  C  D
A  -  1  1  0
B  0  -  1  0
C  1  1  -  1
D  0  0  1  -


An adjacency matrix presents the network connectivity using labeled rows and columns, with values within each cell representing the nature of the link. In this example, there is a directed graph representing the relationships among a set of four people. The adjacency matrix shows the same set of relationships: if there is a directed arc from one person to another, then the corresponding labeled cell has a “1,” and it has a “0” otherwise. In this case, there is no concept of the relationship existing from an entity to itself, so the diagonal, which represents the self-directed arc, is left with a “-.”

For an undirected graph, the relationships are essentially reciprocal, and the matrix would be symmetric about the diagonal.

More About Matrices

• Multiple relationships can be captured in matrices of higher dimensionality

Layer 1:
   A  B  C  D
A  -  1  1  0
B  0  -  1  0
C  1  1  -  1
D  0  0  1  -

Layer 2:
   A  B  C  D
A  -  1  1  0
B  0  -  1  1
C  0  1  -  1
D  0  1  1  -

Layer 3:
   A  B  C  D
A  -  0  1  0
B  1  -  0  0
C  1  0  -  1
D  1  0  1  -

Layer 4:
   A  B  C  D
A  -  1  0  1
B  0  -  0  0
C  1  1  -  1
D  1  1  0  -


A matrix captures one set of relationships, but the analysis may need to capture multiple relationships that exist between the same set of actors. In this case, we can layer a set of two-dimensional matrices into a third dimension to capture the complete set. In this slide, each matrix is at its own layer, and we can examine the associations between all actors for one relationship by looking at one layer, or we can look at all the links between any pair of actors by looking across the third dimension. In this slide, the highlighted cells for position (C, B) show the links between C and B.
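One way to picture “looking across the third dimension” is a list of layers indexed by relationship. The sketch below uses the four layers as read from the slide, with `None` standing in for the “-” diagonal; the `links_between` helper is a hypothetical name introduced for illustration.

```python
# The four relationship layers as read from the slide.
# None stands in for the "-" shown on the diagonal.
N = None
labels = ["A", "B", "C", "D"]
layers = [
    [[N, 1, 1, 0], [0, N, 1, 0], [1, 1, N, 1], [0, 0, 1, N]],
    [[N, 1, 1, 0], [0, N, 1, 1], [0, 1, N, 1], [0, 1, 1, N]],
    [[N, 0, 1, 0], [1, N, 0, 0], [1, 0, N, 1], [1, 0, 1, N]],
    [[N, 1, 0, 1], [0, N, 0, 0], [1, 1, N, 1], [1, 1, 0, N]],
]

def links_between(source, target):
    """Look across the third dimension: one value per relationship layer."""
    i, j = labels.index(source), labels.index(target)
    return [layer[i][j] for layer in layers]

c_to_b = links_between("C", "B")
print("C -> B across layers:", c_to_b)
```

Fixing the layer index instead of the cell gives the single-relationship view described above.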

Matrix Operations

• Matrix transpose

   A  B  C  D             A  B  C  D
A  -  1  1  0          A  -  0  1  0
B  0  -  1  0    ->    B  1  -  1  0
C  1  1  -  1          C  1  1  -  1
D  0  0  1  -          D  0  0  1  -

• Matrix addition and subtraction

   A  B  C  D            A  B  C  D            A  B  C  D
A  -  1  1  0         A  -  0  1  0         A  -  1  2  0
B  0  -  1  0    +    B  1  -  1  0    =    B  1  -  2  0
C  1  1  -  1         C  1  1  -  1         C  2  2  -  2
D  0  0  1  -         D  0  0  1  -         D  0  0  2  -

• Matrix multiplication

1  2  1         1  5  1  0          8   7   9   1
2  4  1    *    3  0  2  0    =    15  12  14   1
1  3  1         1  2  4  1         11   7  11   1
2  0  1                             3  12   6   1


Matrix A is the transpose of matrix B if, for each cell (i,j), the value of B(j,i) is the same as A(i,j).

Matrix addition and subtraction are simplest: If we are adding matrices A and B into matrix C, the value of C(i,j) is equal to A(i,j) + B(i,j). Subtraction is the same, except we subtract B(i,j) from A(i,j).

Matrix multiplication is more complex: the value of C(i,j) is equal to the sum of A(i,k) multiplied by B(k,j), for k = 1 to the number of columns of matrix A, which must equal the number of rows of matrix B. In the example here, C(1,1) = (A(1,1) * B(1,1)) + (A(1,2) * B(2,1)) + (A(1,3) * B(3,1)), which is equal to (1*1) + (2*3) + (1*1) = 8.
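The three operations can be checked with a short Python sketch. The matrices `A` and `B` hold the values read from the multiplication example on the slide, and `M` is the four-node adjacency matrix with 0 substituted for the “-” diagonal; the function names are illustrative, not part of any library.

```python
# Values read from the multiplication example on the slide.
A = [[1, 2, 1],
     [2, 4, 1],
     [1, 3, 1],
     [2, 0, 1]]
B = [[1, 5, 1, 0],
     [3, 0, 2, 0],
     [1, 2, 4, 1]]
# Four-node adjacency matrix, with 0 in place of the "-" diagonal.
M = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

def transpose(m):
    """Swap rows and columns: result[j][i] == m[i][j]."""
    return [list(col) for col in zip(*m)]

def add(x, y):
    """Cell-by-cell sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def multiply(x, y):
    """C(i,j) = sum over k of x(i,k) * y(k,j)."""
    assert len(x[0]) == len(y), "columns of x must equal rows of y"
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

C = multiply(A, B)
M_plus_Mt = add(M, transpose(M))
print(C[0][0])  # (1*1) + (2*3) + (1*1)
```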

Adjacency Matrices and Multiplication

• Multiplying an adjacency matrix by itself once results in a matrix that counts the number of paths between nodes of length 2

[Figure: directed graph with nodes A, B, C, and D]

   A  B  C  D           A  B  C  D           A  B  C  D
A  0  1  1  0        A  0  1  1  0        A  1  1  1  1
B  0  0  1  0    *   B  0  0  1  0    =   B  1  1  0  1
C  1  1  0  1        C  1  1  0  1        C  0  1  3  0
D  0  0  1  0        D  0  0  1  0        D  1  1  0  1


Raising an adjacency matrix to the power X yields a matrix that counts the number of paths of length X between each pair of nodes (strictly speaking these are walks, since a node may be revisited). In this example, we see that from node B to A there is one path of length 2, and from C to itself there are 3 paths of length 2. In turn, computing the Boolean square of the adjacency matrix tells us whether there exists a path of length 2 between any two nodes.
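A minimal Python sketch of this calculation, using the four-node directed matrix from the slide with 0 on the diagonal; the variable names are illustrative.

```python
labels = ["A", "B", "C", "D"]
# Directed adjacency matrix from the slide (rows = from, columns = to).
M = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(M)

# Ordinary square: cell (i, j) counts the two-step routes i -> k -> j.
M2 = [[sum(M[i][k] * M[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]

# Boolean square: 1 if at least one two-step route exists, else 0.
bool2 = [[1 if count else 0 for count in row] for row in M2]

b, a, c = labels.index("B"), labels.index("A"), labels.index("C")
print("B to A:", M2[b][a], " C to C:", M2[c][c])
```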

Why is this relevant? Because the nature of network analysis is to explore connectivity, it is valuable not only to review direct links between actors, but also to explore how connected the actors are, including the strength/weakness of connections, the distances, “influence,” and the robustness of the connections, among other properties.

Example: Knoke Information Exchange Network

• Map of exchange of information between 10 organizations involved in the local political economy of social welfare services in a Midwestern city


Example: Knoke Information Network

• Graph and Adjacency matrix for the Knoke Information Network


Visualization Techniques

• Characterizing attribution of actors by shape, size of node, color
• Characterizing relationship/link by line size, type, thickness, decorations
• Example:
  - Blue for non-government, red for government
  - Square for generalists, circle for specialists


Graph Analysis

• Simple properties tell a lot about interaction and connectivity
• Social structures reflect aspects of both
  - Global properties – the way the entire population interacts
  - Local properties – the ways that individuals within small groups interact
• Analyze both the physical structure and the patterns of structure


Global properties are those that describe aspects of the entire community, while local properties describe how small groups and individuals interact together within the communities. These properties are analyzed based on the kinds of structures that exist inside the graph (subgraphs, cliques, components) as well as the patterns of the structures. An example of this latter point might be the recognition of a common linkage pattern that carries some sociological meaning, such as the relationships between a pair of parents and their children.

Locality – Dyads and Triads

The most common subsets to review are:
- Dyads: groups of two actors
- Triads: groups of three actors

With directed data, there are 4 possible relationships between 2 actors and 64 possible relationships among 3 actors. Relationships exhibit hierarchy, equality, exclusion, “social standing,” and patterns of behavior.
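The counts of 4 and 64 follow from treating each directed link as either present or absent. The enumeration below is a sketch under that assumption (binary links, actors treated as distinguishable); the variable names are illustrative.

```python
from itertools import permutations, product

# A dyad has 2 ordered pairs (A->B, B->A); each is either absent (0) or present (1).
dyad_pairs = list(permutations("AB", 2))
dyad_states = list(product([0, 1], repeat=len(dyad_pairs)))

# A triad has 6 ordered pairs, so 2**6 combinations of present/absent links.
triad_pairs = list(permutations("ABC", 2))
triad_states = list(product([0, 1], repeat=len(triad_pairs)))

print(len(dyad_states), len(triad_states))
```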

Simple Graph Properties

• Counts
  - Number of actors
  - Number of possible connections
  - Number of connections present
• Characteristics of network
  - Size of population
  - How small groups differ from large groups: “cohesion, solidarity, moral density”
• Characteristics of individuals
  - Number of connections
  - Density of the network
  - Source vs. sink


The first place to begin in network analysis is at the global level, looking at properties that describe the entire network. For example, the counts associated with the graph provide some insight into its “density” – the number of connections actually present vs. the number of possible connections. Next come characteristics of the network, as reflected by the size of the population and a gross-level review of groupings within the network. At a more granular level, examining the direct relationships among the individual entities and their relative connectivity gives some insight into the population itself.

Size and Density

• Network size
  - The count of the number of nodes
• Potential links
  - There are (k*(k-1)) unique ordered pairs of actors
  - The number of possible relationships grows quadratically with the number of actors
• Density is the ratio of links to the number of possible links
  - The proportion of all possible links that are present
  - Equal to the sum of links/number of possible links
  - Provides insight into movement across the network, qualifications of specific actors within the network


Many network analyses focus on the flow of “information” across the network, and that becomes a recurring theme. In the next few slides, let’s look at network properties and consider their “information exchange” features. For example, the ability to propagate information across the network is related both to its size and its density. The larger the network is, the more connections are needed to effectively propagate information, which is why we explore its size and its density.
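As a worked example of the density ratio, the sketch below applies it to the six-actor “who knows who” network from the earlier slides, which has 14 From/To rows (7 undirected links). The `density` function and its parameters are illustrative names, not taken from any tool.

```python
def density(num_actors, num_links, directed=True):
    """Links present divided by links possible among num_actors actors."""
    possible = num_actors * (num_actors - 1)   # k * (k - 1) ordered pairs
    if not directed:
        possible //= 2                          # each unordered pair counted once
    return num_links / possible

# Six actors: 14 directed rows, or equivalently 7 undirected links.
d_directed = density(6, 14)
d_undirected = density(6, 7, directed=False)
print(round(d_directed, 3), round(d_undirected, 3))
```

Both views give the same proportion, a little under half of the possible links.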

Degree

• The number of edges/links attached to the actor
• The upper limit on number of connections each actor may have is (k-1)
• In-degree is the number of directed links into the node
• The out-degree is the number of directed links out of the node
• High in-degree indicates a “sink”
• High out-degree indicates a “source”


The degree of a node describes its level of connectedness in the network. Here we distinguish between undirected and directed graphs. In undirected graphs, we just look at the degree, but in directed graphs, which have arcs that travel from one node to another, we discuss in-degree, which is the number of directed arcs into a node, and out-degree, which is the number of directed arcs that leave a node.
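With an adjacency matrix in hand, out-degree is a row sum and in-degree is a column sum. The sketch below uses the four-node directed matrix from the earlier adjacency example, with 0 in place of the “-” diagonal; the variable names are illustrative.

```python
labels = ["A", "B", "C", "D"]
# Directed adjacency matrix (rows = from, columns = to), diagonal set to 0.
M = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

# Out-degree: arcs leaving a node, i.e. the sum across its row.
out_degree = {labels[i]: sum(M[i]) for i in range(len(labels))}

# In-degree: arcs arriving at a node, i.e. the sum down its column.
in_degree = {labels[j]: sum(row[j] for row in M) for j in range(len(labels))}

print("out:", out_degree)
print("in: ", in_degree)
```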

Reachability and Point Connectivity

• Reachability:
  - An actor is reachable by another if there are connections that can trace between them
  - Provides insight into communication capability, robustness
• Connectivity:
  - The number of nodes that have to be removed so that one actor could no longer reach another
  - Provides insight into “redundancy” and robustness
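Reachability can be sketched as a breadth-first search over directed arcs. The arcs among A, B, C, and D come from the earlier directed example; node E, with a single arc into A, is a hypothetical addition included only to show an actor that can reach others without being reachable itself.

```python
from collections import deque

# Directed arcs: A-D from the earlier example, E is a hypothetical extra node.
arcs = {"A": ["B", "C"], "B": ["C"], "C": ["A", "B", "D"], "D": ["C"], "E": ["A"]}

def reachable_from(start):
    """Return every actor that can be reached by following arcs from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in arcs.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # report only the other actors
    return seen

print("from E:", sorted(reachable_from("E")))
print("from D:", sorted(reachable_from("D")))
```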


Distance

• How far is it between nodes?
  - Conceptually, the number of nodes that must be traversed to establish the connection between two actors
• Walk: A sequence that shows a traversal in the graph between two actors (ABC, ABDC, ABEBC)
• Cycle: A closed walk of distinct actors except for the originating node, which is also the destination (BCDB)
• Path: A walk in which each other actor and each other relation in the graph may be used at most one time


There may be different walks of different lengths that connect two actors. Distance can be assessed based on the characteristics of different kinds of walks:
- The total number of walks of a particular length between any two actors
- Distances can be scaled based on the size of the network
- To assess lengths in instances where links have value, sum the values along the shortest path, or take the minimum of the sums across all paths of size n or smaller

Geodesic Distance

• Geodesic distance is the number of relations in the shortest walk from one actor to another
• Flow: similar to bandwidth capacity – how many different paths are there that connect two actors
• Diameter: Largest geodesic distance in the graph


Geodesic distance is used to characterize the most efficient or optimal connection between two actors. In any network, two actors may be connected via numerous paths, but depending on the nature of the connectivity, it might be assumed that communication between any pair of actors would be performed along the shortest path.
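For unweighted links, geodesic distances can be found with a breadth-first search. The sketch below applies it to the “who knows who” network from the earlier slides and derives the diameter; the function and variable names are illustrative.

```python
from collections import deque

# Undirected "who knows who" network from the earlier adjacency matrix.
knows = {
    "John": ["Betsy", "George", "Sam", "Abigail"],
    "Betsy": ["John", "George"],
    "George": ["John", "Betsy", "Martha"],
    "Martha": ["George"],
    "Sam": ["John", "Abigail"],
    "Abigail": ["John", "Sam"],
}

def geodesics(start):
    """Shortest number of links from start to every reachable actor."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in knows[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

all_dist = {actor: geodesics(actor) for actor in knows}
# Diameter: the largest geodesic distance found anywhere in the graph.
diameter = max(max(d.values()) for d in all_dist.values())
print("Martha to Sam:", all_dist["Martha"]["Sam"], " diameter:", diameter)
```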

Dense networks have mostly short geodesic distances. The largest geodesic distance for each actor is called its “eccentricity.” In graphs that are not completely connected,

A Simple Network

[Figure: network diagram with nodes David, Jill, Ted, Helen, Bob, Len, Joe, Frank, Rene, Jack, and Kate]


Univariate and Inferencing

• An examination of the statistics across rows (i.e., “sources”) or columns (“sinks”)
  - Sum: number of links to remaining nodes
  - Mean: percentage of remaining nodes to which this node sends a link
  - Variance: how variable or “predictable” the actor is with respect to others


The statistics here review the differences between the “roles” each node takes on as sources of information. In this example, there are some significant sources – 2, 3, 5, and 8 are similar in terms of their role as information providers. Reviewing these statistics allows us to make some inferences about the population. For example:

- Actors that have high out-degree may be “communicators” or “influencers”
- Actors with low out-degree are less likely to be influencers unless they are connected to the “right” other actors
- Actors with high in-degree may be “powerful” in terms of information gathering
- Actors with high in- and out-degree may be “facilitators” – they receive information and pass it along
- Actors with high out-degree and low in-degree may be “wannabes” or “outsiders”
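The row statistics themselves are simple to compute. Since the matrix behind this slide is not reproduced in these notes, the sketch below uses the four-node directed matrix from the earlier adjacency example as a stand-in; the names and the use of population variance are assumptions made for illustration.

```python
labels = ["A", "B", "C", "D"]
# Stand-in directed adjacency matrix (rows = sources), diagonal set to 0.
M = [[0, 1, 1, 0],
     [0, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

row_stats = {}
for i, name in enumerate(labels):
    # Only the links to the remaining nodes count; skip the diagonal cell.
    others = [M[i][j] for j in range(len(labels)) if j != i]
    total = sum(others)                      # Sum: number of links sent
    mean = total / len(others)               # Mean: share of other nodes linked to
    variance = sum((x - mean) ** 2 for x in others) / len(others)
    row_stats[name] = (total, mean, variance)

for name, (total, mean, variance) in row_stats.items():
    print(f"{name}: sum={total} mean={mean:.2f} variance={variance:.2f}")
```

Running the same loop over columns instead of rows gives the corresponding “sink” statistics.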

Centrality

• The concept of centrality is bound to the concept of “power” or “influence” within the network
• The way that actors are embedded within networks imposes constraints on, or offers opportunities for, the actor and the network
• Provides qualification of favorable position, control, essentialness within the network
• Three types of measures:
  - Degree
  - Closeness
  - Betweenness


Examples

[Figure: three example networks labeled Star, Circle, and Line]


These three graphs demonstrate different kinds of centrality characteristics. In the star graph, node A has a high measure of centrality, but in the Circle graph no node has any greater centrality than any other. In the line graph, there are differing approaches to looking at centrality. In one, the edge nodes (A and G) have less centrality than the others, but a different approach may incorporate position in the line into the centrality measures also (e.g., D is more central than C or E, etc.).

Degree Centrality

• Degree centrality is based on the number of links in and out of each node
• Actors with high in-degree are “prominent”
• Actors with high out-degree are “influential”
• Normalized degrees are based on percentage of remaining actors

Closeness and Betweenness

 Closeness measures how close each node is to the others in the network  Betweenness characterizes the degree to which any individual node exists between other nodes


The geodesic distance between two nodes is the length of the shortest path connecting them; actors in favorable positions have short geodesic distances to the others and can therefore communicate faster. Closeness examines the shortest distances between a node and each of the other nodes. For example, in the star network, node A is closer to the other nodes than any other node is. In the circle network, the closeness measure is identical for each of the nodes. However, in the line network, the node in the center of the line is “closer” in total to the other nodes than the rest are.

Betweenness is a measure of how often a node in the network lies on the shortest paths between other nodes. For example, in the star network, node A has a high level of betweenness, since it is on the path between every other pair of nodes.
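To make the two measures concrete, here is a small sketch that computes them for the star and line examples. It assumes undirected, unweighted graphs, defines closeness as (n - 1) divided by the sum of geodesic distances, and counts betweenness as the share of shortest paths between each pair that pass through a node. These are common definitions, but they are assumptions here; SNA tools may normalize differently. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def undirected(edges):
    """Build an adjacency dict {node: set(neighbors)} from edge pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def bfs(graph, start):
    """Return (dist, paths): geodesic distance and number of shortest paths from start."""
    dist, paths = {start: 0}, {start: 1}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                paths[nbr] = 0
                queue.append(nbr)
            if dist[nbr] == dist[node] + 1:
                paths[nbr] += paths[node]
    return dist, paths

def closeness(graph):
    """(n - 1) / total distance to all other nodes; 0.0 if some node is unreachable."""
    n = len(graph)
    result = {}
    for node in graph:
        dist, _ = bfs(graph, node)
        total = sum(dist.values())
        result[node] = (n - 1) / total if len(dist) == n and total > 0 else 0.0
    return result

def betweenness(graph):
    """For each node, sum over node pairs the fraction of shortest paths through it."""
    info = {node: bfs(graph, node) for node in graph}
    score = {node: 0.0 for node in graph}
    for s, t in combinations(graph, 2):
        dist_s, paths_s = info[s]
        if t not in dist_s:
            continue
        for v in graph:
            if v in (s, t) or v not in dist_s:
                continue
            dist_v, paths_v = info[v]
            if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                score[v] += paths_s[v] * paths_v[t] / paths_s[t]
    return score

star = undirected([("A", leaf) for leaf in "BCDEFG"])
line = undirected(list(zip("ABCDEF", "BCDEFG")))
print(closeness(star)["A"], betweenness(star)["A"])
print(betweenness(line)["D"], betweenness(line)["C"])
```

In the line graph this gives D a higher betweenness than C, which matches the intuition that position along the line matters.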

Grouping and Graph Basics

 Connectivity is layered: an actor is embedded within some structure, which is in turn embedded within a larger structure, and so on  Components  Clique  Blocks and Cutpoints  Lots of other stuff!!!


Two vertices are in a connected component if there exists a path between them. Connected components can be drawn as subgraphs with empty space between them.

A clique is a subgraph where every node is connected to every other node.

Blocks are the parts of the graph that would become disjoint if a node were removed. That removed node is called a cutpoint. (A link whose removal disconnects the graph is usually called a bridge.)
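The connected-component definition above translates directly into a traversal. The sketch below assumes an undirected graph held as a dictionary of neighbor sets; the six-node graph and the `connected_components` name are made up for illustration.

```python
def connected_components(graph):
    """Group nodes so that two nodes share a group exactly when a path joins them."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical graph: a chain A-B-C, a pair D-E, and an isolated node F.
graph = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B"},
    "D": {"E"}, "E": {"D"},
    "F": set(),
}
components = connected_components(graph)
print(sorted(sorted(c) for c in components))
```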

Link and Network Analysis

Structural Analysis

 Evaluating substructures that exist and interact within the network  Dyads and Triads compose many graphs  Looking at position and structure within graph provides insight into social structure and embeddedness  Dyads and Triads form into larger graph substructures  Evaluating overlapping membership in graph substructures exposes “influential” entities within the network

Cliques

 A clique is a maximally connected subgraph  In a clique, each node is connected to every other node


Cliques indicate a close relationship among the members, and often exist within communities based on entity profile similarities. For example, individuals with the same racial, ethnic, or religious identities may form smaller, tighter group relationships.

The smallest cliques are dyads, followed by triads. These building blocks may be combined into larger cliques as well.

By looking at the types of relationships between nodes in the cliques, you can see behavior patterns emerge as well. For example, in a triad, the link between two of the members may be much stronger than between either of those two and the third member.

Another interesting aspect is to evaluate overlapping membership. Actors that are members of multiple cliques may have greater influence, while sets of actors that share clique memberships may be particularly close also.
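One way to explore overlapping membership is to enumerate the maximal cliques and count how many of them each pair of actors shares. The sketch below uses the basic Bron-Kerbosch recursion (without pivoting), which is one standard enumeration technique and not necessarily what any particular SNA tool uses. The graph, the function names, and the choice to count only cliques of three or more members are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node}, candidates & graph[node], excluded & graph[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(graph), set())
    return cliques

def shared_memberships(cliques, min_size=3):
    """Count, for each pair of actors, the cliques of at least min_size they share."""
    counts = Counter()
    for clique in cliques:
        if len(clique) >= min_size:
            for pair in combinations(sorted(clique), 2):
                counts[pair] += 1
    return counts

# Hypothetical network: triads A-B-C and A-B-D overlap on the pair (A, B).
graph = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
}
cliques = maximal_cliques(graph)
shared = shared_memberships(cliques)
print(shared.most_common(1))
```

Here A and B belong to two triads together, which is the kind of overlap the Knoke example highlights for COMM and MAYOR.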

Example

 In our Knoke example, the nodes for COMM and MAYOR share 5 clique memberships

N-Clique and N-Clans

 The definition of a clique is very restrictive  There are subgraph relationships that relax the clique requirements  An n-clique is a subgraph in which every node is connected to every other node by a path of length at most n  The paths connecting n-clique members may pass through nodes that are not members of the clique, which motivates the stricter definition of an n-clan  An n-clan has the additional constraint that the connections must be made through other members of the n-clique


For n-cliques, the most frequent value used is 2. This is conceptually equivalent to “a friend of a friend.”

There are other approaches to modifying the constraints associated with clique-style connectivity, such as:
•K-plex, in which a group of N nodes qualifies if every node is connected to all but K of the N nodes
•K-core, in which a group qualifies if every node is connected to at least K of the other nodes
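The difference between the n-clique and n-clan conditions is easiest to see on a hub-and-spoke graph. The sketch below checks only the distance conditions for a given set of members; it does not test maximality, so it is a partial check and not a full n-clique finder. The graph and helper names are hypothetical.

```python
from collections import deque
from itertools import combinations

def distances_from(graph, start, allowed=None):
    """BFS distances from start; if allowed is given, paths may only use those nodes."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr in dist or (allowed is not None and nbr not in allowed):
                continue
            dist[nbr] = dist[node] + 1
            queue.append(nbr)
    return dist

def within_n(graph, members, n, inside_only):
    """True if every pair of members is at distance <= n.

    inside_only=False: paths may leave the group (n-clique distance condition).
    inside_only=True: paths must stay inside the group (n-clan constraint).
    """
    allowed = set(members) if inside_only else None
    for a, b in combinations(members, 2):
        dist = distances_from(graph, a, allowed)
        if dist.get(b, float("inf")) > n:
            return False
    return True

# Hypothetical hub: A, B, and C are each linked only to H.
hub = {"A": {"H"}, "B": {"H"}, "C": {"H"}, "H": {"A", "B", "C"}}
spokes = ["A", "B", "C"]
print(within_n(hub, spokes, 2, inside_only=False))  # reachable through H
print(within_n(hub, spokes, 2, inside_only=True))   # not reachable without H
```

The three spokes are “friends of a friend” only through H, so they meet the 2-clique distance condition but fail the 2-clan constraint until H is included.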

Components

 Components are subgraphs that are connected within the network, but are disjoint from other subgraphs

Blocks and Cutpoints

 If a node or a link were removed, how would that affect the connectivity structure?  A node that, when removed, subdivides the graph into disconnected components is called a “cutpoint”  Those divisions are called “blocks”


By removing a node in the graph and bisecting the network into blocks, we identify nodes that are key to the communication or information exchange patterns.
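A direct, if brute-force, way to find such key nodes is to remove each node in turn and see whether the number of connected components goes up. The sketch below does exactly that on a made-up “bow tie” graph; it is a simple illustration and not the linear-time algorithm a production tool would likely use, and the function names are hypothetical.

```python
def count_components(graph, removed=None):
    """Count connected components, optionally pretending one node was removed."""
    seen = set() if removed is None else {removed}
    count = 0
    for start in graph:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node] - seen)
    return count

def cutpoints(graph):
    """Nodes whose removal splits the graph into more components than before."""
    baseline = count_components(graph)
    return {node for node in graph if count_components(graph, removed=node) > baseline}

# Hypothetical bow tie: triangles A-B-C and C-D-E share node C.
bow_tie = {
    "A": {"B", "C"}, "B": {"A", "C"},
    "C": {"A", "B", "D", "E"},
    "D": {"C", "E"}, "E": {"C", "D"},
}
print(cutpoints(bow_tie))
```

Removing C separates the two triangles, so C is the only cutpoint and the triangles are the resulting blocks.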

Summary: Link Analysis

 Explores the relationships between objects, and how objects are linked  Identify nodes that are “key” within a network  Determine which links are critical to the operations of the network  Assess the existence of relevant sub-networks  Evaluate “spheres of influence”  Clustering entities into communities of interest  Geographic  Intellectual  Psychographic

Example – Collaborative Filtering

Business Intelligence: The Savvy Manager's Guide (The Savvy Manager's Guides) (Paperback) by David Loshin
"Imagine that you are the sales manager for a large retail organization and that you were able, within some probability, to predict how much money..."
Key Phrases: productivity analytics, business rules system, business rules approach, Postal Service, Data Warehousing Institute, United States
(5 customer reviews)

Customers who bought this item also bought:
•Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications by Larissa T. Moss
•Performance Dashboards: Measuring, Monitoring, and Managing Your Business by Wayne W. Eckerson
•Enterprise Dashboards: Design and Best Practices for IT by Shadan Malik
•The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition) by Ralph Kimball
•Business Intelligence for the Enterprise by Mike Biere
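Viewed as link analysis, an “also bought” list can be approximated by linking items that appear in the same purchase and ranking the links by frequency. The sketch below is a deliberately simple co-occurrence count and is not a description of how Amazon actually computes its recommendations; the baskets, item names, and the `also_bought` helper are all invented for illustration.

```python
from collections import Counter

def also_bought(baskets, item, top=3):
    """Rank the items that most often share a basket with the given item."""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(other for other in basket if other != item)
    return counts.most_common(top)

# Invented purchase histories; each basket is one customer's set of items.
baskets = [
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_b"},
    {"book_a", "book_d"},
    {"book_b", "book_d"},
]
print(also_bought(baskets, "book_a", top=1))
```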

Example – Organizational Behavior

 Evaluation of communication patterns between individuals within an organization  Attribution of actors with demographic data (age, seniority, gender, department, education, etc.)  Assess:  Is work being performed within or outside of the formal structure of the organizations?  Are there self-organized “invisible barriers” emerging based on individual characteristics?  Are there informal communities of interest that may seed innovation across division boundaries?  How effective are the different communication channels?

Example - Terrorist Network

 Connectivity map by Valdis Krebs, based on data available from news sources on the world wide web

Other Examples/Resources

 Analyzing degree, centrality to assess risk of infection in a population (http://intl-aje.oxfordjournals.org/cgi/content/abstract/162/10/1024)
 Integrating raw data from multiple sources, including phone records, bank transactions, surveillance reports, vehicle sales to create a criminal network analysis program (http://ai.bpa.arizona.edu/COPLINK/publications/crimenet/Xu_CACM.doc)
 Enabling targeted marketing within self-organizing networks (www.linkedin.com, www.myspace.com)
 Krebs’ article on Identifying Terrorist Networks (http://firstmonday.org/issues/issue7_4/krebs/)
 Corporate board members and influencers (http://www.theyrule.net/)

Issues and Considerations for Business Intelligence

Challenges

 Establishing a business justification  Network Models and Metadata  Data warehouse data extraction and transformation  Semantic Analysis, Entity Extraction, and Establishing Linkage  Graph Management

Business Justification

 Are all of these examples proving known concepts after the fact?  Where can this technique add value to existing activities?  What are the costs related to implementation and socialization?


One of the criticisms of network and link analysis is the belief that the analysis exposes interesting notions that are actually already known. In evaluating the terrorist data, for example, are we determining (after the fact) that certain parties were connected, even though that fact was already known? Alternatively, one might ask the same question a different way: Today, would this kind of analysis contribute to the determination of suspicious group behavior that would warrant action?

At a more concrete level, we can evaluate the types of applications to which SNA is applied and review what benefits are expected out of the process and how those benefits are measured. For example, consider Amazon’s use of collaborative filtering – the cost is not significant, but there is a perception of high value, especially as it facilitates self-organized predictive modeling.

Data Extraction and Transformation

 Linearized or rectangular data is not necessarily suitable for network or link analysis  Synchronization of warehouse data with network data  Challenge:  Develop models for representation of network data  Provide services for transformation into and out of graph structures  Provide front-end applications for visualization

Networks and Data Models

 SNA applications require data configured to represent nodes, links, and associated link weights
 Example:
   Nodes: Ken Dave Jill Bob
   Arcs:
     Ken Dave 10
     Ken Jill 1
     Dave Bob 6


Different applications expect the input data representing the network to be laid out in a way that can be parsed easily and loaded into the tool. In the example here, the node names are enumerated, followed by an enumeration of arcs consisting of the source node, the target node, and the weight of the link. In turn, the applications will maintain an internal representation of the network (perhaps in an adjacency matrix), as well as maintain additional metadata related to actions taken by the analyst, which means that the internal representation of the network may be modified and written out during the analysis.

The challenge is to create the transformation mechanisms that extract rectangular data from traditional data sets, organize the data into the appropriate network representation, and transform the output of the SNA application back into rectangular format for use in traditional BI activities. This requires understanding the SNA application’s model as well as how the connectivity is represented within the source data sets.
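The round trip described above can be sketched with the Ken/Dave/Jill/Bob example. The exact file syntax below (a `Nodes:` line followed by an `Arcs:` section) is an assumption modeled on the slide, not the input format of any specific SNA tool, and the arcs are assumed to be directed. The function names are hypothetical.

```python
def parse_network(text):
    """Parse a 'Nodes:' line and an 'Arcs:' section into (names, adjacency matrix)."""
    names, arcs, in_arcs = [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Nodes:"):
            names = line[len("Nodes:"):].split()
        elif line.startswith("Arcs:"):
            in_arcs = True
        elif in_arcs:
            source, target, weight = line.split()
            arcs.append((source, target, float(weight)))
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0.0] * len(names) for _ in names]
    for source, target, weight in arcs:
        matrix[index[source]][index[target]] = weight
    return names, matrix

def to_rows(names, matrix):
    """Flatten the matrix back into rectangular (source, target, weight) rows."""
    return [(names[i], names[j], weight)
            for i, row in enumerate(matrix)
            for j, weight in enumerate(row) if weight]

example = """
Nodes: Ken Dave Jill Bob
Arcs:
Ken Dave 10
Ken Jill 1
Dave Bob 6
"""
names, matrix = parse_network(example)
rows = to_rows(names, matrix)
for row in rows:
    print(row)
```

The rows coming back out are in a shape that could be loaded into a warehouse table keyed by source and target identifiers.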

Social Network Metadata

 Actors
   Label or name
   Identifier
   Actor Demographics
 Links
   Relational characteristic
   Connectivity measure
   Weight/value
 Other issues to consider:
   “Link distance”
   Positioning
   Graph structure


The network embeds relationships, but the nature of how those relationships correspond to the modeled objects must be maintained. This means that the metadata associated with each of the objects should be captured, especially if it can be visualized along with the network itself. For example, in the Knoke network, the shapes and colors used indicated different aspects of the modeled actors.

The same is true for the links – different types may be represented using different line types, while the weights may be indicated using line thickness or even distance between the nodes. Because visualization is critical, the positioning of nodes on the “template” of the map may be relevant, as well as the shape that the graph should take (e.g., circle, hub and spokes, random node placement, etc.).

Eventually, most analyses will need to be reintegrated with the original source data sets; identifiers need to be assigned to each node, while these identifiers also need to be linked back to the original data.

Semantic Analysis and Entity Extraction

 Semi-structured and unstructured text contain references to entities and their relationships  Challenge:  Provide ability to identify entities within a document  Characterize entity types based on context  Establish connectivity on more than a naïve basis  Transform extracted information into format suitable for integration and analysis

Entity Extraction

 Recall our earlier example:  Richard J. Palaima, Of Mattapan, Suddenly, Nov. 18, 2002. Beloved husband of Jonadee (Badayos) Palaima. Devoted son of Madelyn L. (George) Palaima of Braintree, and the late Richard A. Palaima. Devoted brother of John A. Palaima of Braintree, nephew of Catherine Cunningham of Rockland, Cousin & Godson of Robert Cunningham of Rockland. Funeral from the Mortimer N. Peck-Russell Peck Funeral Home, 516 Washington St., BRAINTREE, on Saturday at 9 a.m. Funeral Mass in St. Gregory's Church, Dorchester, at 10 a.m. Relatives and friends invited. Visiting hours Friday 2-4 & 7-9 p.m. Memorial donations may be sent to St. Gregory's Church, 2215 Dorchester Ave., Dorchester 02124.  Text mining applications can identify names, locations, roles, and organizations within semi- and unstructured data  Often the relationship may be as simple as “appearing in the same context”


One approach for evaluating connections is to analyze text, extract the entities and related metadata (locations, roles, etc.), insert that data into a database, then extract it for the purposes of network analysis. By analyzing many similar document corpora (e.g., public records, filings, directories, news articles) and inserting the results into a data set, linkages can emerge from the data. For a good example, consult www.theyrule.net.
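Once entities have been extracted, the simplest linkage (“appearing in the same context”) is a co-occurrence count. The sketch below assumes the entity extraction step has already been done by some text mining tool and that each document is represented as a list of entity labels; the labels and the `cooccurrence_links` helper are placeholders, not real extracted data.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_links(documents):
    """Count how many documents mention each pair of entities together."""
    links = Counter()
    for entities in documents:
        for pair in combinations(sorted(set(entities)), 2):
            links[pair] += 1
    return links

# Placeholder entity lists, one per document.
documents = [
    ["person_1", "person_2", "org_1"],
    ["person_1", "org_1"],
    ["person_2", "person_3"],
]
links = cooccurrence_links(documents)
for (a, b), weight in links.most_common():
    print(a, b, weight)
```

The resulting pairs and counts can serve as the arcs and link weights fed into a network analysis tool.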

Establishing Linkage

 Quality of data used to refer to entities may be variant and suspect  Challenge:  Effectively characterize how attribution contributes to similarity scoring  Exploit data quality tools for standardization and linkage from different kinds of source data


Data cleansing, matching, and linkage tools have been used for many years for identifying duplicates among data records, as well as using similarity scoring mechanisms for householding and establishing hierarchies. The same techniques can be used to examine attributes associated with each record, parse and standardize the data, then apply the similarity scoring and linkage capabilities to the candidate entities.
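The parse, standardize, and score sequence can be illustrated with Python's standard library. This is only a sketch: commercial data quality tools use much richer parsing and scoring, the 0.85 threshold is an arbitrary assumption, and the names and helper functions are invented for the example.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

def standardize(name):
    """Lowercase, strip punctuation, and sort tokens so word order does not matter."""
    tokens = re.sub(r"[^a-z\s]", " ", name.lower()).split()
    return " ".join(sorted(tokens))

def similarity(a, b):
    """Score two names between 0.0 and 1.0 after standardization."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()

def link_records(names, threshold=0.85):
    """Return the pairs of names whose similarity meets the (assumed) threshold."""
    return [(a, b) for a, b in combinations(names, 2) if similarity(a, b) >= threshold]

# Invented candidate entity names drawn from different hypothetical sources.
candidates = ["Jane Q. Public", "Public, Jane Q.", "John Doe"]
matches = link_records(candidates)
print(matches)
```

Pairs that score above the threshold would be treated as the same node when the network is assembled.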

Graph Management

 Graph data structures are different from relational databases  Graphs do not inherently provide persistence  Large networks take up large amounts of memory  Issues:  Managing large graphs in a performance-efficient manner  Manipulating graph models in real time  Persistence of graph models  Drawing conclusions about what is demonstrated in the graph


Some of the interesting challenges that need to be addressed include:
•Providing a usable graph management utility that provides reasonable performance
•Providing persistence for graphs
•Enabling transformation into and out of graph format

Interesting Resources

 Books
   “Linked: How Everything Is Connected to Everything Else and What It Means,” Albert-Laszlo Barabasi
   “The Tipping Point,” Malcolm Gladwell
 Web sites
   http://www.insna.org/
   http://www.ire.org/sna/
   http://faculty.ucr.edu/~hanneman/nettext/
 Software
   The R project http://www.r-project.org/
   SNA tools for R http://erzuli.ss.uci.edu/R.stuff/
   UCINET trial download http://www.analytictech.com/downloaduc6.htm
   Pajek http://vlado.fmf.uni-lj.si/pub/networks/pajek/

Questions?

 If you have questions, comments, or suggestions, please contact me: David Loshin, 301-754-6350, [email protected]
