Introduction to Social Network and Link Analysis
David Loshin Knowledge Integrity, Inc. TDWI 2007 Spring Conference Boston, MA
1 © 2006 Knowledge Integrity Incorporated 2 www.knowledge-integrity.com (301) 754-6350
2 Half-Day Agenda
Introduction to Networks, social and otherwise Network Connectivity Basics Link and Network Analysis Issues and Considerations for BI
In this talk, we will discuss the notion of connectivity, and why models for analyzing connections can add value to a business intelligence initiative. By reviewing the ways that objects interact through networks, we will explore whether the results of this analysis can enhance profiles, predictive analytics, and general business intelligence
Objectives: •Understand network connectivity basics •Explore ways to represent networks •Understand the types of analysis that can be performed •Envisioning network analytics, data extraction and preparation
3 Introduction to Networks, Social and Otherwise
4 Networks, Links, and Coincidences?
How many people do you know? Family, friends, co-workers, conference attendees Dozens? Hundreds? Thousands? How well do you know them? Very close, know them well, acquaintances, “just met” Concepts of Connectivity? You know 1,000 people They each know 1,000 people Therefore, you are potentially connected to 1,000,000 people through just 2 links By 3 links, the network could potentially extend to 1,000,000,000 people! “Small World Theory” – we are all connected through a very small number of links (See Milgram, Bacon)
The notion of connectivity is intriguing, especially when considering individuals, other types of parties, and the knowledge that can be derived through the analysis of connections.
For example, consider the scenario in this slide. Let’s assume that we each know about 1000 people. But is it really true that each individual is therefore linked to 1,000,000 people? Conceptually, that would be true only if there were no overlap at all: the 1000 people known by each of my acquaintances would have to be completely different from the 1000 people known by every other acquaintance.
But in reality, we all seem to run around in similar circles, and so there is a great likelihood that many of the people that I know are the same people that you know. The consequence of this is the effective “self-organization” of communities (as well as sub-communities, and sub-sub-communities). By examining the relationships that exist among groups of people, we can learn who are the influencers, who are the influenced, who spans critical communication boundaries, and how information (or commerce, or viruses, etc.) flows through the selected community.
5 Euler’s Insight
Bridges of Konigsberg
One pastime of the residents of Konigsberg was to walk around town over the bridges between the different land areas of town. One game was to see if one could start at one location, walk over every bridge just once, and end up at the starting point. Mathematician Leonhard Euler abstracted the problem into a “graph” – a collection of nodes and links between them. By examining the graph, he was able to determine that, based on the degrees of the nodes (the number of links attached to each node), the challenge of the bridges of Konigsberg was actually impossible. However, this insight created the branch of mathematics referred to as graph theory, which is the fundamental basis of network (and consequently, social network) analysis.
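Euler's degree argument can be sketched in a few lines of Python. This is only an illustration: the land-area labels ("island", "north", "south", "east") are made-up names for the four land areas, and the check relies on the standard result that a connected graph allows a closed walk over every link exactly once only when every node has an even degree.

```python
from collections import Counter

# The seven bridges as (land area, land area) pairs.
# The labels are illustrative assumptions, not names from the slide.
BRIDGES = [
    ("island", "north"), ("island", "north"),
    ("island", "south"), ("island", "south"),
    ("island", "east"),
    ("north", "east"),
    ("south", "east"),
]


def node_degrees(edges):
    """Count how many link endpoints touch each node."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return dict(degrees)


def has_closed_tour(edges):
    """Even-degree test for a connected graph: True only if every
    node has an even number of links attached to it."""
    return all(d % 2 == 0 for d in node_degrees(edges).values())


degrees = node_degrees(BRIDGES)
print(degrees)                   # every land area has an odd degree
print(has_closed_tour(BRIDGES))  # so the walk is impossible
```

The function does not check that the graph is connected; for the bridge problem that is evident from the edge list.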
6 Network and Link Analysis
Linkages exist everywhere Between individuals (“MCI Friends and Family”) Between locations (“Bridges of Konigsberg”) Between other types of objects (“Telephone network”) Between individuals and other kinds of objects (“Purchasing Preferences”) Between businesses (“D&B corporate hierarchies”) There are different kinds of links Each link has some sort of attribution Analyzing networks can provide insight for evaluating behavior patterns for different intelligence activities
There are many applications that rely on the power of the network. Each of these networks represents some attempt to exploit the different kinds of connections that exist among small groups of individuals, larger groups of individuals as well as how the groups themselves interact. Applications may be designed to seek out some interesting pattern within the network or to exploit the communication and information exchanges provided by the network.
Every node and each of its corresponding links carries certain characteristics. Each node represents an entity, while each link carries attributes that describe the nature of the relationship.
7 Applications of Network/Link Analysis
Enforcement: Criminal analysis, money laundering Fraud detection: spambot detection, call pattern analysis Marketing: Customer Behavior analysis, Segmentation, collaborative filtering Community analyses: Account proxy (account used by more than one individual, many accounts used by one individual), research collaboration, communities of interest
8 More Applications…
Health care: Contagion, disease control Physical: Supply chain analysis Transfer/Communications: Spheres of Influence, Information flows, business partnerships Formal Relationships: Working relationships, Influential individuals, ownership, accountability, corporate structure Informal Relationships: Friendship networks, extended families, social interactions, insider networks
9 Evidence is All Around…
Databases Transaction systems, logs, data warehouses Semi-structured data Email, web pages, public records, filings Unstructured data News items, prospectuses, filings
Typical business intelligence applications focus on the ability to organize information for reporting and analysis across one or more dimensions, but are not usually configured to enable network analysis. Yet data warehouses contain significant amounts of connectivity information that is suitable to network and link analysis. Other sources of information provide connectivity data – transaction systems, database logs, software activity logs, as well as less structured systems such as emails, web logs, electronic public data filings, other public records (e.g., real estate transactions, Uniform Commercial Code, etc.). In addition, text analysis applications can extract individual data out of unstructured data to establish connections.
10 Example: Death Notice
Richard J. Palaima, Of Mattapan, Suddenly, Nov. 18, 2002. Beloved husband of Jonadee (Badayos) Palaima. Devoted son of Madelyn L. (George) Palaima of Braintree, and the late Richard A. Palaima. Devoted brother of John A. Palaima of Braintree, nephew of Catherine Cunningham of Rockland, Cousin & Godson of Robert Cunningham of Rockland. Funeral from the Mortimer N. Peck-Russell Peck Funeral Home, 516 Washington St., BRAINTREE, on Saturday at 9 a.m. Funeral Mass in St. Gregory's Church, Dorchester, at 10 a.m. Relatives and friends invited. Visiting hours Friday 2-4 & 7-9 p.m. Memorial donations may be sent to St. Gregory's Church, 2215 Dorchester Ave., Dorchester 02124.
This example was taken from the Boston Globe, and was available on line from 11/20/2002 - 11/21/2002. Death, birth, engagement, and wedding notices are good examples of publicly available information (published in the newspaper) configured in semi-structured form that provide a lot of data about connections. In this example, we have a description of one individual and his immediate family, his location, and his religious affiliation.
11 Example: Extracted Entities and Their Links
[Figure: graph of the entities extracted from the death notice (the named individuals, the towns of Mattapan, Braintree, Rockland, and Dorchester, St. Gregory's Church, and the state "Deceased"), connected by labeled links: "married to," "lives in," "father of," "mother of," "brother of," "sister of," "is cousin of," "is godson/godfather of," "has living state of," "located in," and "is religiously affiliated with."]
12 Good SNA Resource
Many examples taken from Robert A. Hanneman and Mark Riddle’s online text Introduction to social network methods http://faculty.ucr.edu/~hanneman/nettext/
13 Integrating Network/Link Analysis with BI
Network information may be embedded within data warehouse However: Representations may not be appropriate for analysis Data may need to be transformed and managed using non-relational data structures Analysis lends itself to visual representation Must understand concepts associated with networks, connectivity, and qualification of linkage Objective: Gain a conceptual understanding of Network data and its representation Characteristics of network relationships Types of analysis performed
14 Network Connectivity Basics
15 Representing Network Data
What is network data? Two types of objects: Actors (entities that participate in the network) Links (established relationships between the actors) Analysis focuses on: Who the actors are What their relationship is to the holistic view of the community How the actors organize within the framework Different approaches to representation: Rectangular Data Adjacency Matrices Graph representation
The next set of slides introduces basic concepts of social networks: -Actors -Links -Representations
16 Rectangular Data
Relational structure provides a “rectangular” view of the “who knows who” relationship
Actor table:

Name     ID  Sex  Age  Degree
John     1   M    23   4
Betsy    2   F    25   2
George   3   M    31   2
Martha   4   F    27   1
Sam      5   M    34   3
Abigail  6   F    32   2

Link table:

From  To
1     2
1     3
1     6
1     5
2     1
2     3
3     1
3     2
3     4
4     3
5     1
5     6
6     1
6     5
Standard “rectangular” data, as it appears in most databases, can be used to manage network links, but may be problematic for analysis. Links are represented via an associated table, and while this provides information about the individual, the rectangular format makes it difficult to assess “holistic” information, either about the modeled community as a whole, or about segments or patterns that emerge.
17 Adjacency Matrix
This matrix represents the undirected, “who knows who” network among our actors using an adjacency matrix (rows are the chooser, columns are the choice)

Chooser   John  Betsy  George  Martha  Sam  Abigail
John      -     1      1       0       1    1
Betsy     1     -      1       0       0    0
George    1     1      -       1       0    0
Martha    0     0      1       -       0    0
Sam       1     0      0       0       -    1
Abigail   1     0      0       0       1    -
In an adjacency matrix: A ‘0’ means that there is no link between the chooser and choice A ‘1’ means that there is a link between the chooser and choice An adjacency matrix lends itself to certain types of analysis that cannot be done through standard rectangular representations.
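As a rough sketch of the step from rectangular data to an adjacency matrix, the following Python turns the From/To link table of the “who knows who” example into a matrix. The function and variable names are illustrative, and the diagonal is stored as 0 rather than “-”.

```python
# Actor IDs and names from the "who knows who" example.
NAMES = {1: "John", 2: "Betsy", 3: "George", 4: "Martha", 5: "Sam", 6: "Abigail"}

# The From/To link table, one (from, to) pair per row.
LINKS = [
    (1, 2), (1, 3), (1, 6), (1, 5), (2, 1), (2, 3), (3, 1),
    (3, 2), (3, 4), (4, 3), (5, 1), (5, 6), (6, 1), (6, 5),
]


def to_adjacency_matrix(ids, links):
    """Build a square 0/1 matrix: cell (row, column) is 1 when the
    link table has a row from the row's actor to the column's actor."""
    index = {node_id: position for position, node_id in enumerate(ids)}
    size = len(ids)
    matrix = [[0] * size for _ in range(size)]
    for source, target in links:
        matrix[index[source]][index[target]] = 1
    return matrix


ids = sorted(NAMES)
matrix = to_adjacency_matrix(ids, LINKS)
for node_id, row in zip(ids, matrix):
    print(f"{NAMES[node_id]:<8}", *row)
```

Because every link appears in both directions in the link table, the resulting matrix is symmetric.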
18 Graphs
Graphs contain vertices (nodes) and edges (links)
[Figure: graph of the “who knows who” network with vertices Abigail, Betsy, John, Sam, George, and Martha]
A graph is an abstract representation of the same set of information contained within the adjacency matrix. In a graph: A vertex represents the actor An edge represents a link between actors Graph representations can feed visualization front ends, which support different analytical processes and methods.
19 Connectivity Concepts
Relationships and links Binary, directed, signed Measures of relationships Grouped ordinal, ranked ordinal, categorical, interval measures
The link between two individuals can be simple, such as the “who knows who” relationship, or can be much more complex. The different classifications of connections may be based on how much information they carry. The next slides describe different levels of information and provide some examples.
20 Binary Connectivity
Represents a relationship determined by the answer to a true/false question Examples: Person A and person B know each other Organization X and organization Y contribute to the same charity Person A and person B have purchased product P
A binary connection essentially represents the positive response to a true/false question, while the absence of the link reflects a negative response. Note that in the examples provided here, the connection is symmetric, or undirected. In other words, if A knows B, then B knows A.
21 Directed Connectivity
The link is established based on how a question relates two entities in a directed way Examples: Person A has emailed person B Person A has visited web site W Organization X has purchased services from organization Y
A directed link differs from an undirected link in that there is no assumption of symmetry. For example, A may have emailed B, but that does not mean that B emailed A. If the relationship exists in both directions, then there will be two directed arcs: one from A to B and one from B to A.
22 Signed Connectivity
The link describes the nature of the relationship (e.g. positive, neutral, or negative) Examples: Person A hates product P Organization X has been categorized as an approved vendor Person A has provided a neutral score for Airline U’s service
This is the first characteristic of the connection that carries a value, although the values indicate gross level data about the relationship. In this case, the sign indicates the nature of the relationship. For example, a +1 is a positive connection, 0 is a neutral connection, and -1 is a negative connection.
This notion suggests that descriptive metadata about links can embed more complex knowledge about the network and how information flows through the network.
23 Grouped and Ranked Ordinal
The links have magnitude, such as “dislikes” = -1, “strongly dislikes” = -2, “vehemently dislikes” = -3 This provides more meaningful description of the connective characteristics Examples: Individuals and vacation destinations (desire to visit) Job references (references) In ranked ordinal, links are ranked in order of magnitude Actor X ranks the other actors in terms of who is liked most to the least
In a grouped ordinal model, the links carry both sign (to indicate the positive/negative nature) and magnitude of the relationship. The greater the absolute magnitude, the greater the connective characteristic. Grouped ordinal characterizes different quantitative measures: - the “strength” of the connections - the frequency of the interaction - the intensity of the relationship
In the ranked ordinal model, instead of gauging the links based on a quantitative measure, the links are ordered based on their associative rank. While this is not common in network analysis, the information is often easy to assemble. For example, one can calculate the order of “email connectivity” by counting the number of emails that are exchanged and putting together the ranking based on the counts.
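The email example can be sketched as follows. The log entries and names are invented purely for illustration, and counting messages in both directions is one assumption about what “exchanged” means; ties are broken alphabetically, which is also just a choice made for this sketch.

```python
from collections import Counter

# Hypothetical email log of (sender, recipient) pairs; the names are made up.
EMAIL_LOG = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "ann"), ("ann", "bob"),
    ("ann", "dan"), ("cat", "ann"), ("ann", "bob"), ("ann", "cat"),
]


def rank_contacts(actor, log):
    """Rank the actor's contacts by the number of emails exchanged
    (sent plus received), most frequent contact first."""
    counts = Counter()
    for sender, recipient in log:
        if sender == actor:
            counts[recipient] += 1
        elif recipient == actor:
            counts[sender] += 1
    # Sort by descending count, then by name so the order is repeatable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, count)
            for rank, (name, count) in enumerate(ordered, start=1)]


ranking = rank_contacts("ann", EMAIL_LOG)
for rank, name, count in ranking:
    print(rank, name, count)
```

Keeping only the rank column gives the ranked ordinal measure; keeping the counts would give a quantitative measure of the strength of each link.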
24 Categorical
Relationships are defined by category A business relationship is type “1,” personal relationship is type “2,” etc.
In a categorical linkage model, the attribution of the link reflects a segmentation of the relationships based on their qualitative type.
25 Interval Measures
Rankings reflect scaling Difference between 2 and 3 is same as difference between 20 and 21
Interval measures not only provide a ranked order, they also capture the relative difference in “intensity” based on the interval order. This approach is the most sophisticated measurement framework. Interval measures can always be reduced in complexity to one of the other measurement types.
26 Matrix Analysis
Col 1 Col 2 Col 3 Col 4
Row 1 (1,1) (1,2) (1,3) (1,4) Row 2 (2,1) (2,2) (2,3) (2,4) Row 3 (3,1) (3,2) (3,3) (3,4)
A matrix is a rectangular arrangement of link data. Each row and column is assigned an identifier, and each cell within the matrix is indexed by its (row, column) address. For example, the shaded cell in the second row and the third column is addressed “(2,3).” Matrices are one way to capture a representation of a network. Each cell contains the information associated with the link between the entity represented by the row and the entity represented by the column.
27 Example: Adjacency Matrices
[Figure: directed graph on nodes A, B, C, and D]

Adjacency matrix

     A  B  C  D
A    -  1  1  0
B    0  -  1  0
C    1  1  -  1
D    0  0  1  -
An adjacency matrix presents the network connectivity using labeled rows and columns, with values within each cell representing the nature of the link. In this example, there is a directed graph representing the relationships among a set of four people. The adjacency matrix shows the same set of relationships – if there is a directed arc from one person to another, then the corresponding labeled cell has a “1,” and has a “0” otherwise. In this case, there is no concept of the relationship existing from an entity to itself, so the diagonal, which represents the self-directed arc, is left with a “-.”
For an undirected graph, the relationships are essentially reciprocal, and the matrix would be symmetric about the diagonal.
28 More About Matrices
Multiple relationships can be captured in matrices of higher dimensionality

Layer 1
     A  B  C  D
A    -  1  1  0
B    0  -  1  0
C    1  1  -  1
D    0  0  1  -

Layer 2
     A  B  C  D
A    -  1  1  0
B    0  -  1  1
C    0  1  -  1
D    0  1  1  -

Layer 3
     A  B  C  D
A    -  0  1  0
B    1  -  0  0
C    1  0  -  1
D    1  0  1  -

Layer 4
     A  B  C  D
A    -  1  0  1
B    0  -  0  0
C    1  1  -  1
D    1  1  0  -
A single matrix captures one set of relationships, but the analysis may need to capture multiple relationships that exist between the same set of actors. In this case, we can layer a set of two-dimensional matrices into a third dimension to capture the complete set. In this slide, each matrix is at its own layer, and we can examine the associations between all actors for one relationship by looking at one layer, or we can look at all the links between any pair of actors by looking across the third dimension. In this slide, the highlighted cells for position (C, B) show the links between C and B.
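One way to picture the layering is a small Python sketch that keeps one matrix per relationship and reads a single cell across all layers. The four layers are transcribed from the slide as best they can be read, with the “-” diagonal stored as 0; the relationship names (relation_1 and so on) are placeholders, since the slide does not name them.

```python
ACTORS = ["A", "B", "C", "D"]

# One adjacency matrix per relationship; the layer names are placeholders.
LAYERS = {
    "relation_1": [[0, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]],
    "relation_2": [[0, 1, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 0]],
    "relation_3": [[0, 0, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [1, 0, 1, 0]],
    "relation_4": [[0, 1, 0, 1], [0, 0, 0, 0], [1, 1, 0, 1], [1, 1, 0, 0]],
}


def links_between(source, target, layers=LAYERS, actors=ACTORS):
    """Look across the third dimension: return the value of the
    (source, target) cell in every layer."""
    row, column = actors.index(source), actors.index(target)
    return {name: matrix[row][column] for name, matrix in layers.items()}


# All links from C to B, one entry per relationship layer.
print(links_between("C", "B"))
```

Reading one whole entry of LAYERS corresponds to looking at a single layer; calling links_between corresponds to looking across the layers at one pair of actors.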
29 Matrix Operations
Matrix transpose

Original              Transpose
     A  B  C  D            A  B  C  D
A    -  1  1  0       A    -  0  1  0
B    0  -  1  0       B    1  -  1  0
C    1  1  -  1       C    1  1  -  1
D    0  0  1  -       D    0  0  1  -

Matrix addition and subtraction

     A  B  C  D            A  B  C  D            A  B  C  D
A    -  1  1  0       A    -  0  1  0       A    -  1  2  0
B    0  -  1  0   +   B    1  -  1  0   =   B    1  -  2  0
C    1  1  -  1       C    1  1  -  1       C    2  2  -  2
D    0  0  1  -       D    0  0  1  -       D    0  0  2  -

Matrix multiplication

1  2  1                          8   7   9   1
2  4  1       1  5  1  0        15  12  14   1
1  3  1   *   3  0  2  0   =    11   7  11   1
2  0  1       1  2  4  1         3  12   6   1
Matrix A is the transpose of matrix B if, for each cell (i,j), the value of B(j,i) is the same as A(i,j)
Matrix addition and subtraction are simplest: If we are adding matrices A and B into matrix C, the value of C(i,j) is equal to A(i,j) + B(i,j). Subtraction is the same, except we subtract B(i,j) from A(i,j).
Matrix multiplication is more complex – the value of C(i,j) is equal to the sum of A(i,k) multiplied by B(k,j), for k=1 to the number of elements in each row of matrix A (which must equal the number of elements in each column of matrix B). In the example here, C(1,1) = (A(1,1) * B(1,1)) + (A(1,2) * B(2,1)) + (A(1,3) * B(3,1)), which is equal to (1*1) + (2*3) + (1*1) = 8.
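The three operations can be written out in plain Python as a sketch. The function names are illustrative; the numeric matrices are the multiplication example from the slide, and the adjacency matrix is the four-node directed example with the “-” diagonal stored as 0.

```python
def transpose(m):
    """Swap rows and columns: result[j][i] == m[i][j]."""
    return [list(column) for column in zip(*m)]


def add(a, b):
    """Cell-by-cell sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def multiply(a, b):
    """C(i,j) is the sum over k of A(i,k) * B(k,j)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("each row of A must be as long as each column of B")
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# The multiplication example: a 4x3 matrix times a 3x4 matrix.
A = [[1, 2, 1], [2, 4, 1], [1, 3, 1], [2, 0, 1]]
B = [[1, 5, 1, 0], [3, 0, 2, 0], [1, 2, 4, 1]]
C = multiply(A, B)
for row in C:
    print(row)

# The directed adjacency matrix for nodes A, B, C, D.
ADJACENCY = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
SUM_WITH_TRANSPOSE = add(ADJACENCY, transpose(ADJACENCY))
```

Adding an adjacency matrix to its transpose, as in the addition example, yields a symmetric matrix in which a 2 marks a pair linked in both directions.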
30 Adjacency Matrices and Multiplication
Multiplying an adjacency matrix by itself once results in a matrix that counts the number of paths of length 2 between nodes

[Figure: the directed graph on nodes A, B, C, and D]

     A  B  C  D            A  B  C  D            A  B  C  D
A    0  1  1  0       A    0  1  1  0       A    1  1  1  1
B    0  0  1  0   *   B    0  0  1  0   =   B    1  1  0  1
C    1  1  0  1       C    1  1  0  1       C    0  1  3  0
D    0  0  1  0       D    0  0  1  0       D    1  1  0  1
Raising an adjacency matrix to the power X results in a matrix counting the number of paths of length X between nodes. In this example, we see that from node B to A there is one path of length 2, and from C to itself there are 3 paths of length 2. In turn, computing the Boolean square of the adjacency matrix tells us if there exists a path of length 2 between any two nodes.
Why is this relevant? Because the nature of network analysis is to explore connectivity, it is valuable to review not only the direct links between actors, but also how connected the actors are, including the strength/weakness of connections, the distances, “influence,” and the robustness of the connections, among other properties.
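A short Python sketch of the path-counting idea, using the four-node directed example with 0 on the diagonal. The function names are illustrative. Note that the counts include routes that revisit a node (for example C to A and back to C), which is why C has three routes of length 2 to itself.

```python
def multiply(a, b):
    """Standard matrix product for square matrices of the same size."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def matrix_power(m, exponent):
    """Multiply m by itself; cell (i, j) of the result counts the
    routes of length `exponent` from node i to node j."""
    if exponent < 1:
        raise ValueError("exponent must be at least 1")
    result = m
    for _ in range(exponent - 1):
        result = multiply(result, m)
    return result


def to_boolean(m):
    """Reduce counts to 0/1: is there at least one such route?"""
    return [[1 if value else 0 for value in row] for row in m]


NODES = ["A", "B", "C", "D"]
ADJACENCY = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]

SQUARED = matrix_power(ADJACENCY, 2)
BOOLEAN_SQUARE = to_boolean(SQUARED)
for name, row in zip(NODES, SQUARED):
    print(name, row)
```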
31 Example: Knoke Information Exchange Network
Map of exchange of information between 10 organizations involved in the local political economy of social welfare services in a Midwestern city
32 Example: Knoke Information Network
Graph and Adjacency matrix for the Knoke Information Network
33 Visualization Techniques
Characterizing attribution of actors by shape, size of node, color Characterizing relationship/link by line size, type, thickness, decorations Example: Blue for non-government, red for government Square for generalists, circle for specialists
34 Graph Analysis
Simple properties tell a lot about interaction and connectivity Social structures reflect aspects of both Global properties – the way the entire population interacts Local properties – the ways that individuals within small groups interact Analyze both the physical structure and the patterns of structure
Global properties are those that describe aspects of the entire community, while local properties describe how small groups and individuals interact together within the communities. These properties are analyzed based on the kinds of structures that exist inside the graph (subgraphs, cliques, components) as well as the patterns of the structures. An example of this latter point might be the recognition of a common linkage pattern that carries some sociological meaning, such as the relationships between a pair of parents and their children.
Locality – Dyads and Triads The most common subsets to review are Dyads: groups of two actors Triads: groups of three actors With directed data, there are 4 possible relationships between 2 actors With directed data, there are 64 relationships possible among 3 actors Relationships exhibit hierarchy, equality, exclusion, “social standing,” patterns of behavior
35 Simple Graph Properties
Counts Number of actors Number of possible connections Number of connections present Characteristics of network Size of population, How small groups differ from large groups: “cohesion, solidarity, moral density” Characteristics of individuals Number of connections Density of the network Source vs. sink
The first place to begin in network analysis is at the global level, looking at properties that describe the entire network. For example, the counts associated with the graph provide some insight into its “density” – the number of connections actually present vs. the number of possible connections. Next come characteristics of the network, as reflected by the size of the population and a gross-level review of groupings within the network. At a more granular level, examining the direct relationships among the individual entities and their relative connectivity gives some insight into the population itself.
36 Size and Density
Network size The count of the number of nodes Potential links There are (k*(k-1)) unique ordered pairs of actors The number of possible relationships grows quadratically with the number of actors Density is the ratio of links to the number of possible links The proportion of all possible links that are present Equal to the sum of links/number of possible links Provides insight into movement across the network, qualifications of specific actors within the network
Many network analyses focus on the flow of “information” across the network, and that becomes a recurring theme. In the next few slides, let’s look at network properties and consider their “information exchange” features. For example, the ability to propagate information across the network is related both to its size and its density. The larger the network is, the more connections are needed to effectively propagate information, which is why we explore its size and its density.
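Size and density can be computed directly from an adjacency matrix. The sketch below assumes a binary matrix with 0 on the diagonal and counts ordered pairs, so an undirected link contributes two cells; the function name is illustrative. The data are the six-actor “who knows who” network and the four-node directed example.

```python
def size_and_density(matrix):
    """Return (size, possible links, links present, density), counting
    ordered pairs of distinct actors."""
    k = len(matrix)
    possible = k * (k - 1)
    present = sum(1 for i in range(k) for j in range(k)
                  if i != j and matrix[i][j])
    return k, possible, present, present / possible


# "Who knows who" network: John, Betsy, George, Martha, Sam, Abigail.
WHO_KNOWS_WHO = [
    [0, 1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0],
]

# Directed example on nodes A, B, C, D.
DIRECTED = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]

print(size_and_density(WHO_KNOWS_WHO))  # 14 of 30 possible ordered pairs
print(size_and_density(DIRECTED))       # 7 of 12 possible ordered pairs
```

For a symmetric matrix, counting ordered pairs gives the same ratio as counting each undirected link once against k*(k-1)/2 possible pairs.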
37 Degree
The number of edges/links attached to the actor The upper limit on number of connections each actor may have is (k-1) In-degree is the number of directed links into the node The out-degree is the number of directed links out of the node High in-degree indicates a “sink” High out-degree indicates a “source”
The degree of a node describes its level of connectedness in the network. Here we distinguish between undirected and directed graphs. In undirected graphs, we just look at the degree, but in directed graphs, which have arcs that travel from one node to another, we discuss in-degree, which is the number of directed arcs into a node, and out-degree, which is the number of directed arcs that leave a node.
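In matrix form, out-degree is a row sum and in-degree is a column sum. A minimal Python sketch, using the four-node directed example with 0 on the diagonal (the function names are illustrative):

```python
NODES = ["A", "B", "C", "D"]
ADJACENCY = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]


def out_degrees(matrix):
    """Row sums: number of directed arcs leaving each node."""
    return [sum(row) for row in matrix]


def in_degrees(matrix):
    """Column sums: number of directed arcs arriving at each node."""
    return [sum(column) for column in zip(*matrix)]


OUT = dict(zip(NODES, out_degrees(ADJACENCY)))
IN = dict(zip(NODES, in_degrees(ADJACENCY)))
print("out-degree:", OUT)
print("in-degree: ", IN)
```

In this example node C has both the highest out-degree and the highest in-degree, so it acts as both the strongest “source” and the strongest “sink.”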
38 Reachability and Point Connectivity
Reachability: An actor is reachable by another if there are connections that can trace between them Provides insight into communication capability, robustness Connectivity: The number of nodes that have to be removed so that one actor could no longer reach another Provides insight into “redundancy” and robustness
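Both ideas can be sketched in Python on a small directed graph. The graph below is a made-up example (nodes A through E), not one from the slides, and the connectivity function uses brute force over subsets of nodes, which is only practical for very small networks.

```python
from collections import deque
from itertools import combinations

# Hypothetical directed graph: each node maps to the nodes it links to.
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": [], "E": ["A"]}


def reachable_from(graph, start, removed=frozenset()):
    """Breadth-first search: every node that can be reached from
    `start` without passing through a removed node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen and neighbor not in removed:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)
    return seen


def point_connectivity(graph, source, target):
    """Smallest number of other nodes whose removal leaves `target`
    unreachable from `source`. Returns None when the two are directly
    linked, since removing other nodes can never separate them."""
    if target in graph.get(source, ()):
        return None
    others = [node for node in graph if node not in (source, target)]
    for size in range(len(others) + 1):
        for removed in combinations(others, size):
            if target not in reachable_from(graph, source, frozenset(removed)):
                return size
    return None


print(reachable_from(GRAPH, "A"))
print(point_connectivity(GRAPH, "A", "D"))  # both B and C must be removed
print(point_connectivity(GRAPH, "E", "D"))  # removing A is enough
```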
39 Distance
How far is it between nodes? Conceptually, the number of nodes that must be traversed to establish the connection between two actors Walk: A sequence that shows a traversal in the graph between two actors (ABC, ABDC, ABEBC) Cycle: A closed walk of distinct actors except for the originating node, which is also the destination (BCDB) Path: A walk in which each other actor and each other relation in the graph may be used at most one time
There may be different walks of different lengths that connect two actors Distance can be assessed based on the characteristics of different kinds of walks: The total number of walks of a particular length between any two actors Distances can be scaled based on the size of the network To assess lengths in instances where links have value, sum the values along shortest path, or the minimum of the sums across all paths of size n or smaller
40 Geodesic Distance
Geodesic distance is the number of relations in the shortest walk from one actor to another Flow: similar to bandwidth capacity – how many different paths are there that connect two actors Diameter: Largest geodesic distance in the graph
Geodesic distance is used to characterize the most efficient or optimal connection between two actors. In any network, two actors may be connected via numerous paths, but depending on the nature of the connectivity, it might be assumed that communication between any pair of actors would be performed along the shortest path.
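For links without values, geodesic distances can be found with a breadth-first search. The sketch below uses the undirected “who knows who” network from the earlier slides and assumes the graph is connected; the function names are illustrative.

```python
from collections import deque

# Undirected "who knows who" network as an adjacency list.
WHO_KNOWS_WHO = {
    "John": ["Betsy", "George", "Sam", "Abigail"],
    "Betsy": ["John", "George"],
    "George": ["John", "Betsy", "Martha"],
    "Martha": ["George"],
    "Sam": ["John", "Abigail"],
    "Abigail": ["John", "Sam"],
}


def geodesic_distances(graph, start):
    """Number of links on the shortest route from `start` to each
    reachable actor, found by breadth-first search."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def diameter(graph):
    """Largest geodesic distance between any pair of actors
    (assumes every actor can reach every other actor)."""
    return max(max(geodesic_distances(graph, node).values()) for node in graph)


FROM_MARTHA = geodesic_distances(WHO_KNOWS_WHO, "Martha")
print(FROM_MARTHA)
print(diameter(WHO_KNOWS_WHO))
```

Martha is three links away from Sam and Abigail, which is the longest shortest route in this small network.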
Dense networks have mostly short geodesic distances. The largest geodesic distance for each actor is called its “eccentricity.” In graphs that are not completely connected,
41 A Simple Network
[Figure: a simple network of eleven actors: David, Jill, Ted, Helen, Bob, Len, Joe, Frank, Rene, Jack, and Kate]
42 Univariate Statistics and Inferencing
An examination of the statistics across rows (i.e., “sources”) or columns (“sinks”)
Sum: number of links to remaining nodes
Mean: percentage of remaining nodes to which this node sends a link
Variance: how variable or “predictable” the actor is with respect to others
The statistics here review the differences between the “roles” each node takes on as sources of information. In this example, there are some significant sources – 2, 3, 5, and 8 are similar in terms of their role as information providers. Reviewing these statistics allows us to make some inferences about the population. For example:
Actors that have high out-degree may be “communicators” or “influencers” Actors with low out-degree are less likely to be influencers unless they are connected to the “right” other actors Actors with high in-degree may be “powerful” in terms of information gathering Actors with high in- and out-degree may be “facilitators” – they receive information and pass it along Actors with high out-degree and low in-degree may be “wannabes” or “outsiders”
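The row statistics can be sketched as follows. Since the matrix behind the slide's figure is not reproduced in these notes, the example uses the small four-node directed matrix from the earlier slides instead; the function name is illustrative, and the variance is the population variance of each row's off-diagonal cells.

```python
NODES = ["A", "B", "C", "D"]
ADJACENCY = [[0, 1, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]


def row_statistics(matrix, labels):
    """Sum, mean, and variance of each actor's row, ignoring the
    diagonal cell (the actor's link to itself)."""
    stats = {}
    size = len(matrix)
    for i, label in enumerate(labels):
        values = [matrix[i][j] for j in range(size) if j != i]
        total = sum(values)
        mean = total / len(values)
        variance = sum((value - mean) ** 2 for value in values) / len(values)
        stats[label] = {"sum": total, "mean": mean, "variance": variance}
    return stats


SOURCE_STATS = row_statistics(ADJACENCY, NODES)

# Column ("sink") statistics: transpose the matrix and reuse the function.
TRANSPOSED = [list(column) for column in zip(*ADJACENCY)]
SINK_STATS = row_statistics(TRANSPOSED, NODES)

for label, values in SOURCE_STATS.items():
    print(label, values)
```

For binary links the variance of a row works out to mean * (1 - mean), so it is largest for actors that send links to about half of the other actors.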
43 Centrality
The concept of centrality is bound to the concept of “power” or “influence” within the network The way that actors are embedded within networks imposes constraints on, or offers opportunities for the actor and the network Provides qualification of favorable position, control, essentialness within the network Three types of measures: Degree Closeness Betweenness
44 Examples
[Figure: three example networks: Star, Circle, and Line]
These three graphs demonstrate different kinds of centrality characteristics. In the star graph, node A has a high measure of centrality, but in the Circle graph no node has any greater centrality than any other. In the line graph, there are differing approaches to looking at centrality. In one, the edge nodes (A and G) have less centrality than the others, but a different approach may incorporate position in the line into the centrality measures also (e.g., D is more central than C or E, etc.).
45 Degree Centrality
Degree centrality is based on the number of links in and out of each node Actors with high in-degree are “prominent” Actors with high out-degree are “influential” Normalized degrees are based on percentage of remaining actors
46 Closeness and Betweenness
Closeness measures how close each node is to the others in the network Betweenness characterizes the degree to which any individual node exists between other nodes
The geodesic distance enables actors in favorable positions to communicate faster. Closeness examines the shortest distances between a node and each of the other nodes. For example, in the star network, the A node has a higher degree of closeness to the other nodes than any other one. In the circle network, the closeness measure is essentially identical for each of the nodes. However, in the line network, the node in the center of the line is “closer” in total to the other nodes than the others.
Betweenness is a measure of how often a node in the network lies on the critical paths between other nodes. For example, in the star network, node A has a high level of betweenness, since it is on the path between every other pair of nodes.
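Both measures can be computed with a breadth-first search over geodesic (shortest) paths. The sketch below is one possible formulation, not the method of any particular SNA tool: closeness is taken as (n - 1) divided by the sum of distances to the other nodes, and betweenness uses a Brandes-style accumulation for undirected, unweighted, connected graphs. The seven-node star and line graphs and all function names are assumptions for illustration.

```python
from collections import deque


def undirected(ties):
    """Build an adjacency dictionary {node: set of neighbors}."""
    adj = {}
    for a, b in ties:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def closeness(adj):
    """(n - 1) / total geodesic distance; assumes a connected graph."""
    result = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        result[start] = (len(adj) - 1) / sum(dist.values())
    return result


def betweenness(adj):
    """Number of shortest paths between other pairs that pass through each node."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, dist = [], {s: 0}
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # count of shortest paths from s
        sigma[s] = 1
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: score[v] / 2 for v in adj}  # each pair was counted twice


star = undirected([("A", n) for n in "BCDEFG"])
line = undirected(zip("ABCDEF", "BCDEFG"))

print(closeness(star)["A"], betweenness(star)["A"])
print(closeness(line)["D"], betweenness(line)["D"])
```

In the line graph, D scores highest on both measures, which matches the observation that position in the line can be folded into the centrality measure.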
47 Grouping and Graph Basics
The structure of connectivity in which an actor is embedded within some structure, which is then embedded within a larger structure, etc. Components Clique Blocks and Cutpoints Lots of other stuff!!!
Two vertices are in a connected component if there exists a path between them. Connected components can be drawn as subgraphs with empty space between them.
A clique is a subgraph where every node is connected to every other node.
Blocks are the parts of the graph that would become disjoint if a node were removed. The removed node is called a cutpoint; a link whose removal has the same effect is usually called a bridge.
48 Link and Network Analysis
49 Structural Analysis
Evaluating substructures that exist and interact within the network Dyads and Triads compose many graphs Looking at position and structure within graph provides insight into social structure and embeddedness Dyads and Triads form into larger graph substructures Evaluating overlapping membership in graph substructures exposes “influential” entities within the network
50 Cliques
A clique is a maximally connected subgraph In a clique, each node is connected to every other node
Cliques indicate a close relationship among the members, and often exist within communities based on entity profile similarities. For example, individuals with the same racial, ethnic, or religious identities may form smaller, tighter group relationships.
The smallest cliques are dyads, followed by triads. These building blocks may be combined into larger cliques as well.
By looking at the types of relationships between nodes in the cliques, you can see behavior patterns emerge as well. For example, in a triad, the link between two of the members may be much stronger than between either of those two and the third member.
Another interesting aspect is to evaluate overlapping membership. Actors that are members of multiple cliques may have greater influence, while sets of actors that share clique memberships may be particularly close also.
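To make clique membership and overlap concrete, the sketch below enumerates maximal cliques with the basic Bron-Kerbosch recursion and then counts how many cliques each actor, and each pair of actors, belongs to. The six-tie graph is hypothetical (it is not the Knoke data), and the function and variable names are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def maximal_cliques(adj):
    """Enumerate maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical ties, chosen so that A and B share two cliques.
ties = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("D", "E")]
adj = {}
for a, b in ties:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

cliques = maximal_cliques(adj)

# How many cliques does each actor belong to?
membership = Counter(actor for clique in cliques for actor in clique)

# How many clique memberships does each pair of actors share?
shared = Counter(
    pair for clique in cliques for pair in combinations(sorted(clique), 2)
)

print(sorted(sorted(c) for c in cliques))
print(membership.most_common())
print(shared[("A", "B")])
```

Actors with the highest membership counts are candidates for the "greater influence" described above, and pairs with high shared counts are the particularly close sets of actors.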
51 Example
In our Knoke example, the nodes for COMM and MAYOR share 5 clique memberships
52 N-Clique and N-Clans
The definition of a clique is very restrictive There are subgraph relationships that relax the clique requirements An n-clique is a subgraph where every node is connected to every other node at a distance of at most n The members of an n-clique may be connected through nodes that are not members of the clique, suggesting a stricter definition: the n-clan An n-clan has the additional constraint that the connections must be made through other members of the n-clique
For n-cliques, the most frequent value used is 2. This is conceptually equivalent to “a friend of a friend.”
There are other approaches to modifying the constraints associated with clique-style connectivity, such as: K-plex, which defines the group if every node has a connection to all but K of the N nodes; K-core, which defines the group if every node is connected to at least K of the other nodes.
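The difference between the n-clique and n-clan distance conditions can be shown with a small check. This is a sketch under assumptions: the five-actor ring is hypothetical, the helper names are invented for illustration, and the functions test only the distance condition for a given group (they do not verify that the group is maximal).

```python
from collections import deque
from itertools import combinations


def distance(adj, start, goal, allowed):
    """Geodesic distance from start to goal using only nodes in `allowed`."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            return seen[v]
        for w in adj[v]:
            if w in allowed and w not in seen:
                seen[w] = seen[v] + 1
                queue.append(w)
    return float("inf")


def within_n(adj, members, n, inside_only):
    """True if every pair of members is at distance n or less.

    inside_only=False allows paths through non-members (n-clique condition);
    inside_only=True restricts paths to the members (n-clan condition).
    """
    allowed = set(members) if inside_only else set(adj)
    return all(
        distance(adj, a, b, allowed) <= n for a, b in combinations(members, 2)
    )


# Hypothetical ring of five actors: A-B-C-D-E-A.
ring = {
    "A": {"B", "E"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E"},
    "E": {"D", "A"},
}
group = ["A", "B", "C", "D"]

print(within_n(ring, group, 2, inside_only=False))  # A reaches D through outsider E
print(within_n(ring, group, 2, inside_only=True))   # without E, A is 3 steps from D
```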
53 Components
Components are subgraphs that are connected within the network, but are disjoint from other subgraphs
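Finding components amounts to repeatedly picking an unvisited node and collecting everything reachable from it. A minimal sketch, with a hypothetical six-actor network and an assumed function name:

```python
def connected_components(nodes, ties):
    """Group nodes into components: sets of mutually reachable nodes."""
    adj = {n: set() for n in nodes}
    for a, b in ties:
        adj[a].add(b)
        adj[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adj[v] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical network: one chain, one pair, and one isolated actor.
nodes = ["A", "B", "C", "D", "E", "F"]
ties = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(nodes, ties)
print(components)
```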
54 Blocks and Cutpoints
If a node or a link were removed, how would that affect the connectivity structure? A node that, when removed, subdivides the graph into components is called a “cutpoint” Those divisions are called “blocks”
By removing a node in the graph and dividing the network into blocks, we identify nodes that are key to the communication or information exchange patterns.
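A direct, if brute-force, way to find cutpoints is to remove each node in turn and check whether the number of components grows. The sketch below takes that approach for clarity; faster single-pass algorithms exist but are not shown. The five-actor graph and the function names are assumptions for illustration.

```python
def count_components(adj, removed=None):
    """Count components, optionally pretending one node has been removed."""
    seen = {removed} if removed is not None else set()
    count = 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack.extend(w for w in adj[v] if w not in seen)
    return count


def cutpoints(adj):
    """Nodes whose removal splits the graph into more components."""
    baseline = count_components(adj)
    return {v for v in adj if count_components(adj, removed=v) > baseline}


# Hypothetical graph: a triangle A-B-C with a tail C-D-E.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
print(sorted(cutpoints(graph)))
```

Here C and D are the actors whose removal would interrupt communication between the triangle and the tail.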
55 Summary: Link Analysis
Explores the relationships between objects, and how objects are linked Identify nodes that are “key” within a network Determine which links are critical to the operations of the network Assess the existence of relevant sub-networks Evaluate “spheres of influence” Clustering entities into communities of interest Geographic Intellectual Psychographic
56 Example – Collaborative Filtering
Business Intelligence: The Savvy Manager's Guide (The Savvy Manager's Guides) (Paperback) by David Loshin "Imagine that you are the sales manager for a large retail organization and that you were able, within some probability, to predict how much money..." Key Phrases: productivity analytics, business rules system, business rules approach, Postal Service, Data Warehousing Institute, United States (5 customer reviews)
Customers who bought this item also bought Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications by Larissa T. Moss Performance Dashboards: Measuring, Monitoring, and Managing Your Business by Wayne W. Eckerson Enterprise Dashboards: Design and Best Practices for IT by Shadan Malik The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition) by Ralph Kimball Business Intelligence for the Enterprise by Mike Biere
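A "customers who bought this item also bought" list can be viewed as a link analysis over a customer-item network: two items are linked when they appear in the same purchase history, and the link weight is the number of customers who bought both. The sketch below illustrates that idea only; it is not a description of any retailer's actual algorithm, and the baskets, item names, and function names are hypothetical.

```python
from collections import Counter
from itertools import permutations


def co_purchase_counts(baskets):
    """For each item, count how often every other item shares a basket with it."""
    counts = {}
    for basket in baskets:
        for item, other in permutations(set(basket), 2):
            counts.setdefault(item, Counter())[other] += 1
    return counts


def also_bought(counts, item, top=3):
    """Items most often bought together with `item` (ties broken by name)."""
    ranked = sorted(
        counts.get(item, {}).items(), key=lambda pair: (-pair[1], pair[0])
    )
    return [other for other, _ in ranked[:top]]


# Hypothetical purchase histories, one list per customer.
baskets = [
    ["bi_guide", "dw_toolkit", "dashboards"],
    ["bi_guide", "dw_toolkit"],
    ["bi_guide", "roadmap"],
    ["dw_toolkit", "dashboards"],
]
counts = co_purchase_counts(baskets)
print(also_bought(counts, "bi_guide"))
```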
57 Example – Organizational Behavior
Evaluation of communication patterns between individuals within an organization Attribution of actors with demographic data (age, seniority, gender, department, education, etc.) Assess: Is work being performed within or outside of the formal structure of the organizations? Are there self-organized “invisible barriers” emerging based on individual characteristics? Are there informal communities of interest that may seed innovation across division boundaries? How effective are the different communication channels?
58 Example - Terrorist Network
Connectivity map by Valdis Krebs, based on data available from news sources on the world wide web
59 Other Examples/Resources
Analyzing degree, centrality to assess risk of infection in a population (http://intl-aje.oxfordjournals.org/cgi/content/abstract/162/10/1024)
Integrating raw data from multiple sources, including phone records, bank transactions, surveillance reports, vehicle sales to create a criminal network analysis program (http://ai.bpa.arizona.edu/COPLINK/publications/crimenet/Xu_CACM.doc)
Enabling targeted marketing within self-organizing networks (www.linkedin.com, www.myspace.com)
Krebs’ article on Identifying Terrorist Networks (http://firstmonday.org/issues/issue7_4/krebs/)
Corporate board members and influencers (http://www.theyrule.net/)
60 Issues and Considerations for Business Intelligence
61 Challenges
Establishing a business justification Network Models and Metadata Data warehouse data extraction and transformation Semantic Analysis, Entity Extraction, and Establishing Linkage Graph Management
62 Business Justification
Are all of these examples proving known concepts after the fact? Where can this technique add value to existing activities? What are the costs related to implementation and socialization?
One of the criticisms of network and link analysis is the belief that the analysis exposes interesting notions that are actually already known. In evaluating the terrorist data, for example, are we determining (after the fact) that certain parties were connected, even though that fact was already known? Alternatively, one might ask the same question a different way: Today, would this kind of analysis contribute to the determination of suspicious group behavior that would warrant action?
At a more concrete level, we can evaluate the types of applications to which SNA is applied and review what benefits are expected out of the process and how those benefits are measured. For example, consider Amazon’s use of collaborative filtering – the cost is not significant, but there is a perception of high value, especially as it facilitates self-organized predictive modeling.
63 Data Extraction and Transformation
Linearized or rectangular data is not necessarily suitable for network or link analysis Synchronization of warehouse data with network data Challenge: Develop models for representation of network data Provide services for transformation into and out of graph structures Provide front-end applications for visualization
64 Networks and Data Models
SNA applications require data configured to represent nodes, links, and associated link weights
Example:
Nodes:
Ken
Dave
Jill
Bob
Arcs:
Ken Dave 10
Ken Jill 1
Dave Bob 6
Different applications expect the input data representing the network to be laid out in a way that can be parsed easily and loaded into the tool. In the example here, the node names are enumerated, followed by an enumeration of arcs consisting of the source node, the target node, and the weight of the link. In turn, the applications will maintain an internal representation of the network (perhaps in an adjacency matrix), as well as maintain additional metadata related to actions taken by the analyst, which means that the internal representation of the network may be modified and written out during the analysis.
The challenge is to create the transformation mechanisms that extract rectangular data from traditional data sets, organize the data into the appropriate network representation, and transform the output of the SNA application back into rectangular format for use in traditional BI activities. This requires understanding the SNA application’s model as well as how the connectivity is represented within the source data sets.
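The round trip described here can be sketched with the Ken, Dave, Jill, and Bob example. The code assumes the input is laid out with a "Nodes:" section and an "Arcs:" section, one node or arc per line; real SNA tools each define their own formats, so the layout and the function names below are illustrative assumptions only.

```python
NETWORK_TEXT = """\
Nodes:
Ken
Dave
Jill
Bob
Arcs:
Ken Dave 10
Ken Jill 1
Dave Bob 6
"""


def parse_network(text):
    """Read node names and (source, target, weight) arcs from the text."""
    nodes, arcs, section = [], [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("Nodes:", "Arcs:"):
            section = line
        elif section == "Nodes:":
            nodes.append(line)
        elif section == "Arcs:":
            source, target, weight = line.split()
            arcs.append((source, target, float(weight)))
    return nodes, arcs


def adjacency_matrix(nodes, arcs):
    """Internal representation: matrix[i][j] is the weight of the arc i -> j."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for source, target, weight in arcs:
        matrix[index[source]][index[target]] = weight
    return matrix


def to_rows(nodes, matrix):
    """Flatten the matrix back into rectangular rows for a warehouse table."""
    return [
        {"source": nodes[i], "target": nodes[j], "weight": weight}
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight
    ]


nodes, arcs = parse_network(NETWORK_TEXT)
matrix = adjacency_matrix(nodes, arcs)
rows = to_rows(nodes, matrix)
for row in rows:
    print(row)
```

One design consequence of the matrix form: a weight of zero is indistinguishable from "no link", so a data set that needs zero-weight links would require a different internal representation.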
65 Social Network Metadata
Actors Label or name Identifier Actor Demographics Links Relational characteristic Connectivity measure Weight/value Other issues to consider: “Link distance” Positioning Graph structure
The network embeds relationships, but the nature of how those relationships correspond to the modeled objects must be maintained. This means that the metadata associated with each of the objects should be captured, especially if it can be visualized along with the network itself. For example, in the Knoke network, the shapes and colors used indicated different aspects of the modeled actors.
The same is true for the links – different types may be represented using different line types, while the weights may be indicated using line thickness or even distance between the nodes. Because visualization is critical, the positioning of nodes on the “template” of the map may be relevant, as well as the shape that the graph should take (e.g., circle, hub and spokes, random node placement, etc.).
Eventually, most analyses will need to be reintegrated with the original source data sets; identifiers need to be assigned to each node, while these identifiers also need to be linked back to the original data.
66 Semantic Analysis and Entity Extraction
Semi-structured and unstructured text contain references to entities and their relationships Challenge: Provide ability to identify entities within a document Characterize entity types based on context Establish connectivity on more than a naïve basis Transform extracted information into format suitable for integration and analysis
67 Entity Extraction
Recall our earlier example: Richard J. Palaima, Of Mattapan, Suddenly, Nov. 18, 2002. Beloved husband of Jonadee (Badayos) Palaima. Devoted son of Madelyn L. (George) Palaima of Braintree, and the late Richard A. Palaima. Devoted brother of John A. Palaima of Braintree, nephew of Catherine Cunningham of Rockland, Cousin & Godson of Robert Cunningham of Rockland. Funeral from the Mortimer N. Peck-Russell Peck Funeral Home, 516 Washington St., BRAINTREE, on Saturday at 9 a.m. Funeral Mass in St. Gregory's Church, Dorchester, at 10 a.m. Relatives and friends invited. Visiting hours Friday 2-4 & 7-9 p.m. Memorial donations may be sent to St. Gregory's Church, 2215 Dorchester Ave., Dorchester 02124. Text mining applications can identify names, locations, roles, and organizations within semi- and unstructured data Often the relationship may be as simple as “appearing in the same context”
One approach for evaluating connections is to analyze text, extract the entities and related metadata (locations, roles, etc.), insert that data into a database, then extract it for the purposes of network analysis. By analyzing many similar document corpora (e.g., public records, filings, directories, news articles) and inserting them into a data set, linkages can emerge from the data. For a good example, consult www.theyrule.net.
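The simplest linkage rule, "appearing in the same context," can be sketched as follows. The code assumes that a text mining step has already produced a list of entity names per document; the document identifiers, entity names, and function name are hypothetical placeholders, and no actual entity extraction is performed here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_links(entities_by_document):
    """Weight each pair of entities by the number of documents they share."""
    weights = Counter()
    for entities in entities_by_document.values():
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical output of an entity extraction step.
extracted = {
    "doc-1": ["Person A", "Person B", "Organization X"],
    "doc-2": ["Person A", "Organization X"],
    "doc-3": ["Person B", "Person C"],
}
links = cooccurrence_links(extracted)
for (a, b), weight in sorted(links.items()):
    print(a, b, weight)
```

The resulting pairs and weights have the same source, target, weight shape that network tools expect as arcs, so they can be loaded into a database table and later extracted for analysis.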
68 Establishing Linkage
Quality of data used to refer to entities may be variant and suspect Challenge: Effectively characterize how attribution contributes to similarity scoring Exploit data quality tools for standardization and linkage from different kinds of source data
Data cleansing, matching, and linkage tools have been used for many years for identifying duplicates among data records, as well as using similarity scoring mechanisms for householding and establishing hierarchies. The same techniques can be used to examine attributes associated with each record, parse and standardize the data, then apply the similarity scoring and linkage capabilities to the candidate entities.
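The parse, standardize, and score sequence can be illustrated with Python's standard `difflib` module. This is a simplified stand-in for commercial data quality tools: the standardization rule (lowercase, strip punctuation, sort tokens), the 0.85 threshold, the sample names, and the function names are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def standardize(name):
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(sorted(tokens))


def similarity(a, b):
    """Score two names between 0.0 and 1.0 after standardization."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


def link_records(names, threshold=0.85):
    """Return pairs of names similar enough to be treated as one entity."""
    return [
        (a, b) for a, b in combinations(names, 2) if similarity(a, b) >= threshold
    ]


# Hypothetical candidate entities drawn from different sources.
records = ["Jane Q. Example", "EXAMPLE, JANE Q", "John Sample"]
linked = link_records(records)
print(linked)
```

Records that link to one another can then be collapsed into a single node before the network is built, so that one actor is not represented several times under variant spellings.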
69 Graph Management
Graph data structures are different than relational databases Graphs do not inherently provide persistence Large networks take up large amounts of memory Issues: Managing large graphs in a performance-efficient manner Manipulating graph models in real time Persistence of graph models Drawing conclusions about what is demonstrated in the graph
Some of the interesting challenges that need to be addressed include: Providing a usable graph management utility that provides reasonable performance Providing persistence for graphs Enabling transformation into and out of graph format
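One modest way to give a graph persistence is to store its arcs in a relational table and rebuild the in-memory structure on demand. The sketch below uses Python's built-in `sqlite3` module with an in-memory database so that it runs without any setup; the table name, column names, and function names are assumptions, and this approach does not address performance for very large graphs.

```python
import sqlite3


def save_graph(conn, arcs):
    """Persist (source, target, weight) arcs in a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS arcs ("
        "source TEXT NOT NULL, target TEXT NOT NULL, weight REAL NOT NULL, "
        "PRIMARY KEY (source, target))"
    )
    conn.executemany("INSERT OR REPLACE INTO arcs VALUES (?, ?, ?)", arcs)
    conn.commit()


def load_graph(conn):
    """Rebuild the in-memory graph as {source: {target: weight}}."""
    adjacency = {}
    query = "SELECT source, target, weight FROM arcs ORDER BY source, target"
    for source, target, weight in conn.execute(query):
        adjacency.setdefault(source, {})[target] = weight
        adjacency.setdefault(target, {})  # keep nodes that have no outgoing arcs
    return adjacency


conn = sqlite3.connect(":memory:")  # a file path here would persist across runs
save_graph(conn, [("Ken", "Dave", 10), ("Ken", "Jill", 1), ("Dave", "Bob", 6)])
graph = load_graph(conn)
print(graph)
```

Keeping the arcs in a table also makes the transformation back to rectangular form trivial, since the table already is the rectangular form.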
70 Interesting Resources
Books
“Linked: How Everything Is Connected to Everything Else and What It Means,” Albert-Laszlo Barabasi
“The Tipping Point,” Malcolm Gladwell
Web sites
http://www.insna.org/
http://www.ire.org/sna/
http://faculty.ucr.edu/~hanneman/nettext/
Software
The R project http://www.r-project.org/
SNA tools for R http://erzuli.ss.uci.edu/R.stuff/
UCINET trial download http://www.analytictech.com/downloaduc6.htm
Pajek http://vlado.fmf.uni-lj.si/pub/networks/pajek/
71 Questions?
If you have questions, comments, or suggestions, please contact me David Loshin 301-754-6350 [email protected]