
PURDUE UNIVERSITY GRADUATE SCHOOL Thesis/Dissertation Acceptance

This is to certify that the thesis/dissertation prepared

By Timothy La Fond

Entitled Controlling For Confounding Network Properties in Hypothesis Testing and Anomaly Detection

For the degree of Doctor of Philosophy

Is approved by the final examining committee:

Jennifer Neville, Chair

David Gleich

Chris Clifton

Daniel Aliaga

To the best of my knowledge and as understood by the student in the Thesis/Dissertation Agreement, Publication Delay, and Certification Disclaimer (Graduate School Form 32), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy of Integrity in Research” and the use of copyright material.

Approved by Major Professor(s): Jennifer Neville

Approved by: William Gorman, Head of the Departmental Graduate Program, 7/26/2016

CONTROLLING FOR CONFOUNDING NETWORK PROPERTIES IN

HYPOTHESIS TESTING AND ANOMALY DETECTION

A Dissertation

Submitted to the Faculty

of

Purdue University

by

Timothy La Fond

In Partial Fulfillment of the

Requirements for the Degree

of

Doctor of Philosophy

August 2016

Purdue University

West Lafayette, Indiana

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 Introduction
2 Networks and Hypothesis Testing
  2.1 Graph Notation
  2.2 Properties Influencing Network Formation
  2.3 Statistics for a Single Observed Network
  2.4 Delta Statistics for Pairs of Observed Networks
  2.5 Hypothesis Tests
  2.6 Hypothesis Tests on Dynamic Networks
3 Related Work
  3.1 First Component (Chapter Five): Testing Global/Community Properties using a Network Statistic and Permutation
  3.2 Second Component (Chapter 6): Distinguishing the Properties of Different Graphs using Multiple Observations of Consistent Network Statistics
  3.3 Time Series Analysis
  3.4 Network Models
4 Network Statistic Dependencies
  4.1 Graph Edit Distance
  4.2 Clustering Coefficient
  4.3 Degree Distribution
  4.4 Centrality Measures and Shortest Paths
  4.5 ...
  4.6 Netsimile Distance
  4.7 Deltacon Distance
  4.8 Summary
5 Randomization Testing Solution
  5.1 Introduction
  5.2 Problem Definition
  5.3 Method
    5.3.1 Calculating the Expected Chi-Squared Gain
  5.4 Synthetic Data
    5.4.1 Data
    5.4.2 Methodology
  5.5 Experimental Results
  5.6 Real Data Experiments
    5.6.1 Data
    5.6.2 Experimental Results
  5.7 Discussion
6 Size-Consistent Statistics for Anomaly Detection in Dynamic Networks
  6.1 Introduction
  6.2 Problem Definition and Data Model
  6.3 Properties of the ...
    6.3.1 Size Consistency
    6.3.2 Size Inconsistency
  6.4 Network Statistics
    6.4.1 Conventional Statistics
    6.4.2 Proposed Size Consistent Statistics
  6.5 Anomaly Detection Process
    6.5.1 Smoothing
    6.5.2 Complexity Analysis
  6.6 Experiments
    6.6.1 Synthetic Data Experiments
    6.6.2 Semi-Synthetic Data Experiments
    6.6.3 Real Data Experiments
  6.7 Local Anomaly Decomposition
  6.8 Summary
7 Conclusions
LIST OF REFERENCES

LIST OF TABLES

3.1 Categorization of related work in static networks, by type of statistic and network property being tested.
3.2 Categorization of related work in dynamic networks, by type of statistic and network property being tested.
3.3 Categorization of related work in static networks, by H0 generation method and network property being tested.
3.4 Categorization of related work in dynamic networks, by H0 generation method and network property being tested.
5.1 Contingency table for linked node pairs and their attribute values.
5.2 Changes to the contingency table in time t+1.
5.3 Type I error and power for the choice-based randomization method.
5.4 Number of groups detected by randomization tests.
5.5 Example groups with each possible combination of effect.
6.1 Glossary of terms.
6.2 Statistical properties of previous network statistics and our proposed alternatives.

LIST OF FIGURES

1.1 Diagram of a simple anomaly detection process.
1.2 Diagram of the anomaly detection process and confounding effects.
1.3 Common network statistics and confounded network properties.
2.1 Flowchart of the graph generation process and statistic generation.
2.2 Creation of null distribution and test point from graph statistics.
2.3 Flowchart of delta statistic generation.
5.1 Confounding effect example with the communication volume property controlled.
5.2 Illustration of homophily and influence effects on attributes and links over time.
5.3 Randomization testing procedure. Semi-synthetic graphs are created by permuting the original to construct a null distribution.
5.4 Homophily significance test method.
5.5 Influence significance test method.
5.6 Choice-based randomization method for assessing homophily.
5.7 Choice-based randomization method for assessing influence.
5.8 Method to measure Type I error rate.
5.9 Method to measure statistical power.
5.10 Power analysis of influence and homophily randomization tests. (a) Influence test power, as number of group additions increases. (b) Influence test power, as effect size increases. (c) Homophily test power, as effect size increases.
6.1 Statistic values for network data generated from the same model, but with increasing size. Behavior of (a) Consistent Statistic; (b) Inconsistent Statistic. Black pts: avg of 100 trials; red pts: [min, max].
6.2 Controlling for confounding effects through careful definition of the test statistics.
6.3 Dynamic network anomaly detection task. Given past instances of graphs created by the typical model of behavior, identify any new graph instance created by an alternative anomalous model.
6.4 Graph generation process. The matrix P* represents all possible nodes and their probabilities. By sampling |V| nodes and |W| edges the observed graph G is obtained.
6.5 Estimation of the Pt matrix from the observed W weights.
6.6 Mass of the cells in P increases as |V| decreases.
6.7 Visualization of the regions averaged to obtain p*, p, and p*_V.
6.8 Creation of the dynamic graph sequence from message stream using time step width Δ.
6.9 Anomaly detection procedure for a graph stream {G1...Gt}, graph statistic Sk, and p-value α.
6.10 Diagram of synthetic experiments and the three sets of generated graphs.
6.11 Synthetic data experimental procedure for statistic Sk using normal probability matrix PN, anomalous probability matrix PA, and Δ additional edges in False Positive graphs.
6.12 (a),(b),(c): ROC curves with false positives due to extra edges. (d),(e),(f): ROC curves with false positives due to extra nodes. (g) False positive rate vs. edges; (h) false positive rate vs. nodes; (i) true positive rate.
6.13 Diagram of semi-synthetic experiments and the three sets of generated graphs.
6.14 Semi-synthetic data experimental procedure for statistic Sk using normal probability matrix PN, anomalous probability matrix PA, and Δ additional edges in False Positive graphs.
6.15 ROC curves on the semi-synthetic dataset with varying alphas.
6.16 Detected anomalies in Enron corporation e-mail dataset. Filled circles are detections from our proposed statistics; open circles are other methods.
6.17 Detected anomalies in university student e-mail dataset.
6.18 Detected anomalies in Facebook wall postings dataset.
6.19 Subgraph responsible for most of the mass shift anomaly in the Enron network at the weeks of June 25 (before anomaly) and July 2 (during anomaly), 2001 respectively.
6.20 Subgraph responsible for most of the triangle probability anomaly in the Enron network at the weeks of May 1 (before anomaly) and May 8 (during anomaly), 2000 respectively.
6.21 Subgraph responsible for most of the graph edit distance anomaly in the Enron network at the weeks of January 28 (before anomaly) and February 4 (during anomaly), 2002 respectively.
6.22 Subgraph responsible for most of the Barrat clustering anomaly in the Enron network at the weeks of November 1 (before anomaly) and November 8 (during anomaly), 1999 respectively.
6.23 Subgraph responsible for most of the mass shift anomaly in the Facebook network at June 1 (before anomaly) and June 2 (during anomaly), 2007 respectively.
6.24 Subgraph responsible for most of the triangle probability anomaly in the Facebook network at July 20 (before anomaly) and July 21 (during anomaly), 2007 respectively.
6.25 Subgraph responsible for most of the graph edit distance anomaly in the Facebook network at October 15 (before anomaly) and October 16 (during anomaly), 2007 respectively.
6.26 Subgraph responsible for most of the Barrat clustering anomaly in the Facebook network at May 17 (before anomaly) and May 18 (during anomaly), 2007 respectively.

ABSTRACT

La Fond, Timothy. Ph.D., Purdue University, August 2016. Controlling For Confounding Network Properties in Hypothesis Testing and Anomaly Detection. Major Professor: Jennifer Neville.

An important task in network analysis is the detection of anomalous events in a network time series. These events could merely be times of interest in the network timeline or they could be examples of malicious activity or network malfunction. Hypothesis testing using network statistics to summarize the behavior of the network provides a robust framework for the anomaly detection decision process. Unfortunately, choosing network statistics that are dependent on confounding factors like the total number of nodes or edges can lead to incorrect conclusions (e.g., false positives and false negatives). In this dissertation we describe the challenges that confounding factors pose for anomaly detection in dynamic network streams. We also provide two solutions for avoiding error due to confounding factors: the first is a randomization testing method that controls for confounding factors, and the second is a set of size-consistent network statistics which avoid confounding due to the most common factors, edge count and node count.

1 INTRODUCTION

Network data mining and machine learning is a rich domain of study with numerous applications. With the ubiquitousness of the internet and the popularity of social networking services there has been a surge in the amount of available network data. With this new data has come a need for new methods of analysis: whether the data represents social interactions, computer communication or biological systems, the fundamental feature of network data is the interconnectivity of the data elements. This interdependency presents both an obstacle and a potential resource to algorithms that analyze or generate predictions from network data, as the traditional assumption of data independence does not hold. Older machine learning and data mining methods assume that the data is independently and identically distributed (IID), which is not the case with the interconnected nodes in a network. As such, new techniques like relational learning [1] are required to adapt traditional solutions to the field of network data. One technique that has been adapted to the network domain for a variety of problems is statistical hypothesis testing. Statistical hypothesis testing is an algorithmic approach to proposing suppositions about a data set and attempting to prove or disprove those suppositions in a rigorous fashion using statistics drawn from the data. Traditionally network hypothesis testing has been performed much in the same way as on independent data, although the test statistics and data models have changed to accommodate the new form of the data [2]. However, in many network hypothesis testing scenarios this translation of independent techniques is incorrect: multiple behavioral and environmental factors combine to produce the observed network, and if we assume these factors are independent we may misattribute certain aspects of the network as being due to something other than the true cause.


Figure 1.1.: Diagram of a simple anomaly detection process.

In particular we will focus on the problems associated with multi-graph testing, where the observed graphs are samples of a dynamic network system taken at different times.

Let us consider an example of a hypothesis testing task where we wish to identify anomalous behavior. Suppose we are monitoring the e-mail communications between the employees of some large company with the goal of detecting any unusual activity. Unusual activity could be an employee contacting someone far removed from their own department or sending many more e-mails than they would on a typical day, and could be the result of collusion or a compromised e-mail account. We wish to flag any times where the network at that time appears anomalous compared to what we have observed in the past. Figure 1.1 shows a conventional anomaly detection system for the e-mail scenario described above. There is a hidden Network Formation Property that influences the e-mail communication; in this example that property is simply whether or not some event has changed the pattern of communication in the network to something other than normal. This property affects the process that produces the Observed Graphs. The observed graphs are the empirical evidence of the network property. These graphs are summarized using a Graph Statistic (for example, a quantitative measure of how different a graph example is from the other examples) and then these statistic values are used as inputs to some Anomaly Detector (like a hypothesis test). From the output of the detector we can then make our conclusions as to when the formation property is something other than normal.

However, this scenario makes a crucial assumption about the relationship between the network property and the graph statistic: that nothing else in the network could be affecting the observed statistic values. In other words, there is a 1:1 mapping between the network formation property we wish to measure (change in communication behavior) and the statistic we are using to measure it. The reality is that other properties of the network, ones that we are not interested in tracking, could be affecting the observed statistic values. For example, the total number of messages sent across the network may vary greatly from day to day. Observing many e-mails in a particular day may not be a case of unusual behavior but simply many examples of normal behavior. If we measure change by using the raw difference in the number of e-mails sent, we would consider the case where one employee spams messages across the network and the case where everyone sent a few more messages because it was a busy day as equivalently unusual events.

Figure 1.2 shows this scenario where there are two hidden Network Formation Properties that influence the observed e-mail communication: the change in the behavior of the communication (who is talking to whom and the distribution of that communication) and the overall volume of communications. These properties interact in the process that produces the observed graphs. Once we have the set of detections we are unable to disambiguate which of the network properties changed in order to generate the anomaly.


Figure 1.2.: Diagram of the anomaly detection process and confounding effects.

If we were interested in finding anomalies in the change of communication and not the volume of communication, we would be unable to determine which detections are relevant to our interests. These alternate explanations for the detected anomalies are known as Confounding Effects or Confounding Factors and are one of the major challenges of anomaly detection on networks. If we were to apply our basic anomaly detector in this confounded scenario, two types of errors could arise: (1) we could conclude there is a change in communication when only the volume of communication is unusual, producing a False Positive; or (2) the variance of the communication volume translates to noise in our anomaly detector and we fail to find an anomaly, producing a False Negative. Real networks are the product of more than just two properties as well. Properties like the size of the network, the transitivity, attribute homophily and the tendency to form clusters all combine to produce the observed networks. A variety of statistics have been used in the past to measure each of these network properties, but due to the lack of a 1:1 relationship between statistic and property, conclusions about the network properties drawn from these statistics tend to be rife with errors, both false positive and false negative.

[Figure 1.3 content: network properties (Communication Pattern, Communication Volume, Node Activity Level, Transitivity) mapped to the network statistics intended to measure them (Graph Edit Distance, Message Count, Degree Distribution, Clustering Coefficient).]

Figure 1.3.: Common network statistics and confounded network properties.

Figure 1.3 shows an example of several common network statistics and the properties they are intended to measure. The black lines represent the intended relationships while the red lines are confounding factors. The purpose of this work is to provide statistical methods which avoid errors due to these confounding factors. The first approach, described in Chapter 5, uses a Randomization Testing technique to control the network properties which might affect the test statistics. The second, in Chapter 6, demonstrates how to design Consistent Statistics which prevent errors due to confounding factors. Both of these techniques allow for testing specific network properties accurately even in the presence of confounding factors.

Primary Hypothesis: Due to the inherent interconnected nature of network data, hypothesis tests regarding specific network properties are often confounded by other factors, which produces both false positive and false negative errors. This dissertation provides two approaches to disambiguate network properties and provide accurate results for network hypothesis tests: the first is to control for network properties using a semi-synthetic randomization testing method, and the second is to design test statistics which are robust to multiple confounding properties. Using either of these approaches will prevent flagging anomalies based on confounding factors, which results in a lower rate of error for anomaly detectors.

In this dissertation we will introduce algorithms designed to disambiguate which properties are present in a network, even when the statistics used are sensitive to multiple properties, through the use of carefully defined testing methodology. This dissertation will also define alternative network statistics which are insensitive to variations in the size of graph instances, which is the most common confounding effect for network hypothesis tests. The effectiveness of these statistics at detecting network anomalies without excessive errors due to the confounding effect of network size is demonstrated using experiments with synthetic, semi-synthetic, and real-world data. These methods form the foundation for accurate anomaly detection of multiple network properties in dynamic networks.

2 NETWORKS AND HYPOTHESIS TESTING

2.1 Graph Notation

A network's graph is typically expressed as $G = \{V, E\}$ where V is the vertex set (the nodes in the graph) and E is the edge set. Every edge in the edge set can be represented by $e_{i,j} \in \{0, 1\}$, where i and j are the endpoints of that edge, forming the $|V| \times |V|$-sized matrix E. Graphs can be either directed or undirected; undirected graphs will have a symmetric E. Some graphs also have attributes on the nodes or edges, X, where every entry $x_i$ is a vector $\{x_{i1}, x_{i2}, x_{i3}, \ldots\}$ representing the attribute values of node i. Another common graph representation is a weighted graph $G = \{V, W\}$, where the matrix E is replaced with a matrix W whose cells $w_{i,j}$ are continuous variables rather than 0-1 values. The weight on any edge represents either the magnitude of the relationship or the frequency of communication between the two endpoints. Dynamic networks refer to networks which change over time and need a representation that has a time component. In this work a dynamic network is represented by a series of graphs observed at consecutive time steps, where each time step is an equal-length window of time in which activity is observed. Each graph can be thought of as a "snapshot" of the activity occurring in the network during that time period.

The full dynamic network produces a series of graphs $\{G_1, G_2, G_3, \ldots, G_t, \ldots, G_T\}$ where the graph of activity starting at time t is $G_t = \{V_t, W_t\}$. In this dissertation we assume that $W_t$ always represents the number of communications sent in the time window at t (i.e. the number of e-mails sent between all pairs of individuals in a given day).
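To make the notation concrete, the following minimal sketch (plain Python with numpy; the message log and window width are hypothetical) builds the snapshot matrices $W_t$ from a stream of timestamped messages, with each cell counting the messages sent between a pair of nodes during the window.

```python
import numpy as np

# Hypothetical message log: (sender, receiver, timestamp) triples.
messages = [(0, 1, 0.5), (1, 2, 1.2), (0, 1, 1.7), (2, 0, 2.4)]
num_nodes = 3     # |V|, held fixed here for simplicity
window = 1.0      # equal-length time step width
num_steps = 3     # T

# One |V| x |V| weight matrix W_t per time step.
snapshots = [np.zeros((num_nodes, num_nodes)) for _ in range(num_steps)]
for sender, receiver, time in messages:
    t = int(time // window)              # which snapshot the message falls into
    snapshots[t][sender, receiver] += 1  # w_{i,j} counts messages from i to j

for t, W in enumerate(snapshots):
    print(f"G_{t}: total weight |W| = {W.sum():.0f}")
```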

2.2 Properties Influencing Network Formation

Scientists and researchers have conjectured that many different network properties influence network formation, based on observations of the structure of real networks or of the activities that take place in networks. A network property differs from a network statistic in that the statistic is a function that quantifies a graph instance, while a network property is a characteristic of the hypothetical process that generated the graph and may not be directly measurable. Below are some of the most common and critical network properties.

Node Activity Levels and Activity Distribution: We refer to the "activity" of a node as its propensity to engage in interactions with other nodes and form edges. The distribution of activity throughout the network then represents the discrepancy in edge participation of different nodes. It is very rare for the activity levels to be uniform across all nodes of a network: typically there will be a non-trivial number of nodes that participate in a much larger proportion of edges than others. Nodes like these are often called hub nodes since they connect many pairs of neighbors, and networks with this kind of activity distribution are called "small-world networks" [3]. Although this property is usually synonymous with the Degree Distribution of the graph, in order to emphasize the difference between network property and network statistic, we refer to Activity Level as the network property and Degree Distribution as the network statistic. The Degree Distribution statistic is described in the following section.

Transitivity: Transitivity is the tendency for edges to form triangular relationships in a network: if two nodes have a mutual neighbor, it is likely that those nodes are neighbors as well. In the context of a social network the mechanism by which transitivity is expressed might be individuals introducing their mutual friends to each other or being the vector of their interaction in some other manner. Transitivity is an extremely common property in social networks and often forms a major basis for the growth of edges and communities in the network.

Self-similarity, Homophily and Social Influence: Self-similarity is the tendency for linked nodes in a network to share similar properties in terms of any node attributes they might have. Self-similarity can arise in a number of ways: a latent effect which produces communities with similar properties (e.g. people from the same town are likely to interact and to come from the same demographics), a social influence effect where properties are transmitted through the network (e.g. product adoption spreads through the network as individuals recommend the product to friends), or a homophilous edge selection effect where edges form between nodes due to their mutual attributes (e.g. members of the same gym may interact and form a connection). Self-similarity is another property which is extremely common in social networks. It is also important due to its effect on relational learning algorithms: relational learning exploits the dependencies between linked nodes to produce better predictions, and self-similarity is the most common dependency [1].

Network Connectivity, Hub Nodes, and Network Clustering: As a network has a topographical structure, there is a concept of the distance between nodes in the network's graph. Although exact measures vary, nodes which are directly linked, share many neighbors, or are in the same local cluster (see below) are considered to be "close" in terms of network distance. A node which has a short network distance to many other nodes is often called a hub node or a central node. Clustering in the network is another structural concept. A network cluster is a set of nodes which have more links between them than with nodes outside the cluster. A cluster can be thought of as a large sub-structure within the network as a whole and may represent a distinct community.

Rate of Change: In the case of a dynamic network where multiple graph examples are observed over time, there is an expectation that, while the network is changing over time, graphs observed close together in time are likely to be similar, i.e. changes occur slowly over time. The frequency at which a network experiences changes to its graph structure is called the rate of change.

2.3 Statistics for a Single Observed Network

A wide variety of network statistics have been proposed as aids for determining the properties of networks [4]. Below are listed some of the most popular, along with the properties they are intended to represent.

Node Count, Edge Count, and Density Coefficient: The node count and edge count of a network are the simplest statistics that can be measured from a network. The node count is simply the number of nodes and the edge count is the number of edges. In this work, when discussing weighted graphs, we will use edge count to refer to the total weight of the edges:

node count: $|V|$ (2.1)

edge count: $|E|$ (2.2)

total weight: $|W| = \sum_{i,j \in V} w_{i,j}$ (2.3)

A statistic closely related to the node and edge count is the network density. The network density is simply defined as the ratio of the number of edges that exist in the graph to the number of possible edges:

$\mathrm{Density}(G) = \frac{|E|}{|V|^2}$ (2.4)

Density is one of the simplest network statistics available but it is also an extremely important one. Generally most real-world networks will have a low density; this property is known as sparsity and is exploited by some algorithms to reduce algorithm run times. Sparsity usually comes about because a single actor in a network (say a user of a social network) has an upper limit to the number of other actors they can interact with (it is impossible to connect to every other person on a social network), so as the number of nodes in a network increases the individual degrees do not increase at the same rate.
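For illustration, the sketch below (a hypothetical toy graph, not data from this dissertation) computes Equations 2.1, 2.2 and 2.4 directly from an edge set.

```python
# Hypothetical toy graph as a set of undirected edges.
edges = {(0, 1), (1, 2), (2, 0), (2, 3)}
nodes = {i for edge in edges for i in edge}

node_count = len(nodes)                 # |V|          (Eq. 2.1)
edge_count = len(edges)                 # |E|          (Eq. 2.2)
density = edge_count / node_count ** 2  # |E| / |V|^2  (Eq. 2.4)

print(node_count, edge_count, density)  # 4 4 0.25
```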

Node Degree and Degree Distribution: The degree of a node is simply its number of edges and is a statistic that measures the activity level of that particular node:

$D(i) = \sum_{j \neq i \in V} e_{i,j}$ (2.5)

or, for weighted graphs,

$D(i) = \sum_{j \neq i \in V} w_{i,j}$ (2.6)

The degree distribution, then, is the distribution of all node degrees and is a statistic that represents the distribution of activity in the network. Usually the degree distribution is assumed to follow a common functional form when modeling networks: for small-world networks this is usually a power law where the probability of a node having degree d is $P(D(i) = d) = c \cdot d^{-k}$, with c and k parameters. This function is "heavy-tailed" in the sense that nodes with a large degree are still likely under the distribution, which matches the tendency to see high degree nodes in real networks.
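A minimal sketch of computing node degrees (Eq. 2.5) and the empirical degree distribution on the same kind of hypothetical edge set; the weighted variant (Eq. 2.6) would sum edge weights instead of counting edges.

```python
from collections import Counter

edges = {(0, 1), (1, 2), (2, 0), (2, 3)}  # hypothetical toy graph

degree = Counter()                         # D(i) = sum_j e_{i,j}  (Eq. 2.5)
for i, j in edges:
    degree[i] += 1
    degree[j] += 1

# Empirical degree distribution: fraction of nodes with each observed degree d.
n = len({v for edge in edges for v in edge})
dist = {d: count / n for d, count in Counter(degree.values()).items()}
print(dict(degree))  # node degrees: {0: 2, 1: 2, 2: 3, 3: 1}
print(dist)          # {2: 0.5, 3: 0.25, 1: 0.25}
```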

Clustering Coefficient: Transitivity results in a large number of triangles in a network, so the simplest statistic to measure the transitivity is the number of triangles in the network. However, a more commonly used statistic to measure the transitivity is the clustering coefficient:

$CC(G) = \frac{3 \times |\mathrm{Triangles}|}{|\mathrm{Wedges}|}$ (2.7)

where |Triangles| is the number of connected triangles in the graph and |Wedges| is the number of connected triplets of nodes in the graph (i, j, k s.t. $e_{i,j}, e_{j,k} \in E$). Note that each triangle in a graph contributes three wedges as well, since all combinations of its vertices are connected, hence the 3 in the numerator. If each wedge is considered a potential triangular relationship, then the clustering coefficient is the ratio of realized triangular relationships to the sum of possible ones and realized ones. Although the ratio of triangles to wedges is the most common formulation of the clustering coefficient, alternatives exist [5, 6].

A set of statistics closely related to the clustering coefficient are the motif counts, where each motif is a small subgraph structure (like a triangle, or star, or square) and the count is the number of times the motif appears in a network. Triangles are a simple motif structure, so this can be thought of as a generalization of the clustering coefficient. The distribution of motifs is often used as a signature to distinguish different networks [7, 8].
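The following sketch computes Equation 2.7 by enumerating wedges and triangles from an adjacency-set representation (hypothetical toy graph: one triangle on nodes 0, 1, 2 plus a pendant edge).

```python
from itertools import combinations

edges = {(0, 1), (1, 2), (2, 0), (2, 3)}  # hypothetical toy graph
adj = {}
for i, j in edges:
    adj.setdefault(i, set()).add(j)
    adj.setdefault(j, set()).add(i)

wedges = closed_wedges = 0
for center, nbrs in adj.items():
    for u, v in combinations(nbrs, 2):  # each connected triplet, by center node
        wedges += 1
        if v in adj[u]:
            closed_wedges += 1  # every triangle closes three wedges,
                                # so this already equals 3 * |Triangles|

cc = closed_wedges / wedges       # CC(G) = 3|Triangles| / |Wedges|  (Eq. 2.7)
print(closed_wedges, wedges, cc)  # 3 5 0.6
```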

Autocorrelation, Correlation Coefficient and Chi-Squared Score: Label autocorrelation is typically measured with Pearson's correlation coefficient on the label or attribute $x_k$ across all pairs of nodes connected by an edge $e_{ij}$:

$\mathrm{cor}(x_k, x_k) = \frac{\mathrm{cov}(x_k, x_k)}{\sigma^2_{x_k}}$ (2.8)

where cov is the covariance of the attributes of linked nodes:

$\mathrm{cov}(x_k, x_k) = \frac{1}{|E|} \sum_{e_{ij} \in E} (x_{ik} - \bar{x}_k)(x_{jk} - \bar{x}_k)$ (2.9)

with $\bar{x}_k$ being the mean attribute value and $\sigma_{x_k}$ the standard deviation of the attribute. This value ranges from -1 (perfect anti-correlation) to 1 (perfect correlation), with 0 being independence. An alternative measure of autocorrelation is to take Pearson's chi-squared score on a 2 by 2 contingency table where one dimension is the similarity of a pair of nodes (i.e. $x_{ik} = x_{jk}$ or $x_{ik} \neq x_{jk}$) and the other is whether an edge exists between that pair ($e_{ij} = 1$ or $e_{ij} = 0$). This provides a measure of whether edges are more likely to form between similar nodes and is more appropriate for data with an extreme label skew. Pearson's chi-squared score is defined as:

$\chi^2 = \sum_{r,c} \frac{(O_{rc} - E_{rc})^2}{E_{rc}}$ (2.10)

where $r = e_{ij}$ and $c = I[x_{ik} = x_{jk}]$. As r and c are binary values, the node pairs of the graph can be tallied in a 2x2 contingency table where $O_{rc}$ is the number of observations in a cell of the contingency table and $E_{rc}$ is the expected value given the marginals. Correlation between node labels in a network derived from social data is extremely common due to natural human tendencies [9]. We will discuss how to measure the differences in autocorrelation of two graphs in a later chapter.
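A sketch of this chi-squared score (Eq. 2.10) on a hypothetical toy graph with binary node labels: every node pair is tallied by linked/not linked and same/different label, and the expected counts come from the table marginals.

```python
import numpy as np
from itertools import combinations

edges = {(0, 1), (1, 2), (2, 0), (2, 3)}   # hypothetical toy graph
labels = {0: "a", 1: "a", 2: "a", 3: "b"}  # hypothetical node labels

O = np.zeros((2, 2))  # rows r: no edge/edge; columns c: different/same label
for i, j in combinations(sorted(labels), 2):
    r = 1 if (i, j) in edges or (j, i) in edges else 0
    c = 1 if labels[i] == labels[j] else 0
    O[r, c] += 1

E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()  # expected from marginals
chi2 = ((O - E) ** 2 / E).sum()                       # Eq. 2.10
print(chi2)  # 3.0 for this toy example
```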

Hop Plots, Centrality, and Network Diameter: A Hop Plot is the distribution of node pairs with a particular shortest path length k between them:

$HP(k) = \sum_{i,j \in V} I[SP(i, j) = k]$ (2.11)

where the shortest path SP(i, j) is the minimum number of edges which connect i and j in an unbroken path $e_{i,n_1}, e_{n_1,n_2}, \ldots, e_{n_{k-1},j}$. Nodes with no path connecting them are considered to have an infinite shortest path distance or an N/A distance depending on the application. A hop plot gives some indication of the structure of the network: a hop plot with lower path lengths indicates a centralized structure, possibly with many high degree hubs that link many nodes, while a hop plot with longer path lengths indicates multiple poorly linked structures. If no path can be found between a node pair the hop plot usually reports this as an infinite length path, so hop plots also give information about disconnected components. Nodes with a short path to many other nodes can be considered central nodes in the network; the Closeness Centrality [10] statistic uses these paths to assign an exact centrality score to nodes. An alternative statistic to measure centrality is Betweenness Centrality [11], where the centrality of a node is proportional to the number of shortest paths that pass through that node. Eigenvector Centrality [12], where the centrality of nodes is obtained from an eigenvector decomposition of the adjacency matrix, is another alternative. Many other centrality statistics exist, including ones designed to measure centrality in networks with probabilistic edges rather than concrete ones [13], as well as random-walk based measures such as PageRank [14].

Network Diameter is a measure of how many hops are necessary to traverse the network and is defined as the maximum length of the shortest paths between nodes. As the standard measure is susceptible to outliers, typically the number of hops needed to span 90% of the network is reported instead. Network diameter tends to be relatively small in most networks and grows very slowly with an increased number of nodes. Often this is due to preferential attachment creating large hubs in the network which provide a short path between many different nodes. This is often referred to as the small-world phenomenon, which was the focus of the famous six degrees of separation experiment by Stanley Milgram [15]. Change in the network diameter can be used for change point detection [16].
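The hop plot in Equation 2.11 can be tallied with a breadth-first search from every node; the sketch below (same hypothetical toy graph as above) also reads off the diameter as the largest finite hop count.

```python
from collections import Counter, deque

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}  # hypothetical toy graph

hop_plot = Counter()
for source in adj:
    dist = {source: 0}
    queue = deque([source])
    while queue:                     # BFS yields SP(source, j) for reachable j
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    for j, d in dist.items():
        if j != source:
            hop_plot[d] += 1         # counts each ordered pair (i, j)

print(dict(hop_plot))                # {1: 8, 2: 4}
print(max(hop_plot))                 # diameter (max finite hop count) = 2
```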

2.4 Delta Statistics for Pairs of Observed Networks

There is also a collection of statistics designed to provide a measure of the similarity of two network examples, called Delta Statistics. Rather than being calculated from a single graph instance, they are a distance measure representing the similarity or dissimilarity of two graph examples. They are typically used for clustering sets of networks or for hypothesis tests determining if a network comes from some population of network examples. For a dynamic network a delta statistic is typically applied to all consecutive pairs of graph examples and used as a measure of the rate of change; this is shown in Figure 2.3.

Graph Edit Distance: Graph Edit Distance is defined as the sum of the number of changes required to make two graphs $G_1$ and $G_2$ identical. It is calculated by summing the difference in the node sets with the difference in the edge sets:

$GED(G_1, G_2) = |V_1| + |V_2| - 2|V_1 \cap V_2| + |E_1| + |E_2| - 2|E_1 \cap E_2|$ (2.12)
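Under a set representation the distance in Equation 2.12 is a few lines of code; a minimal sketch on two hypothetical snapshots (the intersections count the shared nodes and edges):

```python
def graph_edit_distance(V1, E1, V2, E2):
    # |V1| + |V2| - 2|V1 & V2| + |E1| + |E2| - 2|E1 & E2|   (Eq. 2.12)
    node_changes = len(V1) + len(V2) - 2 * len(V1 & V2)
    edge_changes = len(E1) + len(E2) - 2 * len(E1 & E2)
    return node_changes + edge_changes

V1, E1 = {0, 1, 2}, {(0, 1), (1, 2)}        # hypothetical snapshot G1
V2, E2 = {0, 1, 2, 3}, {(0, 1), (2, 3)}     # hypothetical snapshot G2
print(graph_edit_distance(V1, E1, V2, E2))  # 1 node change + 2 edge changes = 3
```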

Other similar metrics include error correcting graph matching distance [17] and maximum common subgraph [18].

Netsimile: The Netsimile network distance is a comparison statistic designed to provide a measure of how similar or dissimilar two networks are [19]. It uses a number of summary statistics such as node degree, neighbor degree, local clustering coefficient, and neighbor clustering coefficient calculated from the two networks. These statistics are gathered from each node in the network, and then aggregates like mean, standard deviation, and skew are calculated on the statistics. The distance between two networks is then simply a summation over the Canberra distances of the vectors of aggregates $X_1, X_2$ (where $X_1 = \{x_{1,1}, x_{2,1}, \ldots, x_{k,1}\}$) of the two networks:

$\mathrm{Netsimile}(G_1, G_2) = d(X_1, X_2) = \sum_{i=1}^{k} \frac{|x_{i,1} - x_{i,2}|}{|x_{i,1}| + |x_{i,2}|}$ (2.13)
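The Canberra distance of Equation 2.13 is small enough to write directly (scipy also ships one as scipy.spatial.distance.canberra); a sketch with two hypothetical aggregate vectors:

```python
def canberra(x1, x2):
    # d(X1, X2) = sum_i |x_{i,1} - x_{i,2}| / (|x_{i,1}| + |x_{i,2}|)  (Eq. 2.13)
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x1, x2) if abs(a) + abs(b) > 0)

# Hypothetical aggregate vectors (e.g. mean/std/skew of per-node features).
X1 = [2.0, 0.5, 1.3]
X2 = [2.4, 0.4, 1.3]
print(canberra(X1, X2))  # ~0.20
```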

Deltacon: The Deltacon network distance is a measure of the difference in connectivity of two networks [20]. It employs a matrix inversion to calculate an affinity matrix between all pairs of nodes for each graph. This affinity measure approximately corresponds to the random-walk distance between all node pairs. The Deltacon distance between two networks is then the absolute difference of their two affinity matrices. This approach assumes that the node set is the same for the two networks.

Degree Distribution Difference: The difference in the degree distributions of two graphs can also be used as a delta statistic. Possible measures between the two distributions are the Kolmogorov-Smirnov statistic or the Cramér-von Mises criterion [21].

Although the above are common distance metrics between graph examples, this list is by no means exhaustive. For more examples of distance metrics refer to the following papers: [22-28].

2.5 Hypothesis Tests

Before discussing how to perform hypothesis tests on networks, let us first look at how hypothesis tests function on simple data. Hypothesis testing was first proposed as a "test of significance" by Ronald Fisher [29]. His intuition was that an output should be considered significant if the likelihood of that output being generated by the assumed model is below a pre-defined threshold of significance. Usually the test is formulated as a comparison between the null hypothesis $H_0$ and some alternative hypothesis $H_1$. Usually $H_0$ represents some baseline or "uninteresting" data while $H_1$ is the case where the data is unusual or different than normal. The threshold of significance is usually denoted $\alpha$; typical values of $\alpha$ are 0.05 or 0.01. Some test statistic T is chosen and the value t is calculated from the output under consideration. The distribution of t values expected under the null hypothesis is also calculated (called the null distribution), and the significance thresholds are set such that the probability of exceeding the threshold (or thresholds in a two-tailed test) under the null distribution is equal to $\alpha$. Then, if the observed t exceeds the significance threshold, we can reject the null hypothesis as being unlikely given the observed instance and accept that some alternative hypothesis must be true. Likelihood ratio tests, t-tests, and z-tests are all examples of hypothesis tests.
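The decision rule amounts to a few lines once a null distribution of statistic values is available; a minimal sketch with a simulated null (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
null_values = rng.normal(loc=10.0, scale=2.0, size=10_000)  # simulated null for T
alpha = 0.05

# Two-tailed test: reject if t falls outside the central 1 - alpha mass.
lo, hi = np.quantile(null_values, [alpha / 2, 1 - alpha / 2])
t_observed = 15.3
reject = not (lo <= t_observed <= hi)
print(f"critical thresholds: ({lo:.2f}, {hi:.2f}); reject H0: {reject}")
```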

2.6 Hypothesis Tests on Dynamic Networks

Given the standard hypothesis testing framework, applying a hypothesis test to a dynamic network using network statistics is straightforward. By converting each graph instance into a simple value using the network statistic, one can create a null distribution by extracting statistics from a set of graphs with normal behavior (typically all previously observed graphs from a dynamic network are treated as normal). Then performing the test is as simple as calculating the statistic on the test graph instance and determining if the value lies within the critical region given by the null distribution [2].

Figure 2.1 shows the extraction of the null distribution and test value from a dynamic network. Figure 2.2 shows the hypothesis testing decision.


Figure 2.1.: Flowchart of the graph generation process and statistic generation.


Figure 2.2.: Creation of null distribution and test point from graph statistics. 18
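Concretely, the procedure of Figures 2.1 and 2.2 amounts to the loop below (a sketch with hypothetical snapshots; total edge weight stands in for an arbitrary network statistic S):

```python
import numpy as np

def statistic(W):
    return W.sum()  # stand-in for any network statistic S

rng = np.random.default_rng(1)
history = [rng.poisson(5, (20, 20)) for _ in range(50)]  # past "normal" graphs
test_graph = rng.poisson(9, (20, 20))                    # new snapshot to test

null = np.array([statistic(W) for W in history])         # null distribution
lo, hi = np.quantile(null, [0.025, 0.975])                # alpha = 0.05, two-tailed
s = statistic(test_graph)
print("anomalous" if not lo <= s <= hi else "normal")
```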

An alternative approach is to collect a distribution of distance metrics on pairs of networks, as shown in Figure 2.3. For example, if the test is designed to find time steps in a series that appear to have an unusual amount of change, the null distribution could be constructed from the distance metrics calculated at each time step versus the network in the prior time step. If a new time step appears which is more different from the prior than expected under the null, we would flag that time as a point of change.


Figure 2.3.: Flowchart of delta statistic generation.

Often there is not a suitable set of network examples to construct a null distribution. Usually the null hypothesis requires that some network property is not present in the null distribution graph examples, and if the examples are taken from real-world networks it is very difficult to guarantee that this is true. In this case it is necessary to construct a null population artificially. One way to do this is through permutation, in a process known as randomization testing [30]. Randomization testing creates new examples of a null distribution from the observed sample by randomly permuting the example in a way that creates an appropriate null distribution. In the case of a network test, the permutation operation could be random rewiring of the network links. In the first component of my work we will investigate some of the difficulties in performing proper randomization tests on networks. Another way to generate network examples is by sampling subgraphs of the initial graph, though obviously those samples are smaller than the original network. Bootstrapping or jackknifing samples from a network example is a way to produce sample networks which are the same size as the original network [31]. However, these sampling methods assume that the network can be reconstructed from sampling dyads, which may not capture some of the network-wide properties of the original. Most network sampling methods, moreover, tend to produce subgraphs which do not represent the original graph well in terms of network statistics [32] and are not suitable as the null distribution of a hypothesis test. Another approach to performing hypothesis tests is model-based hypothesis testing. Instead of relying on a distribution of statistic values, model-based hypothesis testing uses the parameters of learned graph models or the likelihood of graphs given a model in order to reject unlikely graph instances (see Chapter 3 for more detail).
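As an illustration of the general recipe (not the specific constrained permutation developed in Chapter 5), the sketch below builds a null distribution for the clustering coefficient by degree-preserving random rewiring using networkx; the observed graph is a stand-in example.

```python
import networkx as nx
import numpy as np

G_obs = nx.karate_club_graph()  # stand-in for the observed network
s_obs = nx.transitivity(G_obs)  # test statistic: clustering coefficient

null = []
for seed in range(200):         # each permuted graph is one null example
    G_null = G_obs.copy()
    # Degree-preserving rewiring: swaps endpoints of random edge pairs,
    # keeping every node's degree while destroying triangle structure.
    nx.double_edge_swap(G_null, nswap=10 * G_obs.number_of_edges(),
                        max_tries=10**6, seed=seed)
    null.append(nx.transitivity(G_null))

# One-sided empirical p-value: is the observed transitivity unusually high?
p_value = (np.sum(np.array(null) >= s_obs) + 1) / (len(null) + 1)
print(f"observed={s_obs:.3f}, null mean={np.mean(null):.3f}, p={p_value:.3f}")
```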

3 RELATED WORK

The scope of papers related to this dissertation and their relationships with each other are categorized in Tables 3.1, 3.2, 3.3, and 3.4. The four dimensions along which the papers are categorized in the tables are: (1) the network property being tested, or the goal of the hypothesis test in the paper; (2) whether the graphs being tested are dynamic or static; (3) the type of test statistic used; and (4) the method used to construct or estimate the null distribution. All of the tables have network property as a dimension. Tables 3.1 and 3.2 show the type of test statistic for static and dynamic graphs respectively, while Tables 3.3 and 3.4 show the method of null distribution generation for static and dynamic graphs. The network property and the dynamic/static nature of the data are inherent to the problem definition of the papers, while the test statistic and null distribution construction are algorithmic choices and could theoretically be replaced with another approach. As the test statistic and null distribution construction are arranged vertically in the tables, papers in the same columns are tackling the same kinds of problems and can be directly compared to one another.

Model Parameters and Likelihood both refer to model-based hypothesis testing, while Distance Measure, Basic Statistic, and Network Statistic are all statistic-based approaches. Distance Measure refers exclusively to delta statistics evaluating graph difference. Basic Statistic consists of simple statistics like edge count, node count, or the number of nodes with a particular attribute, while Network Statistic covers more complicated statistics like autocorrelation or clustering coefficient. Some of the cited papers in the tables are surveys which cover a wide variety of methods. Look to these for general information about Social Network Analysis [2, 33], Anomaly Detection [23], or Homophily and Social Influence [34].

The two major components of this dissertation, described in Chapters 5 and 6, are denoted in Tables 3.2 and 3.3 as citations [35] and [36] respectively. In the next two sections we will summarize where these papers belong in the spectrum of current methods and elaborate on which papers are most relevant to each.

Table 3.1.: Categorization of related work in static networks, by type of statistic and network property being tested.

Columns (network property): Local Property | Global Property | Community Prop. | Distinguish Graphs
Model Parameters: [37] | [2, 38, 39] | [2]
Likelihood: [37, 40, 41]
Distance Measure: [22, 23, 42] | [2, 23]
Basic Statistic: [2] | [2, 31, 43]
Network Statistic: [22, 23, 42] | [2] | [2, 23, 44–46] | [2, 23, 45–48]

Table 3.2.: Categorization of related work in dynamic networks, by type of statistic and network property being tested.

Columns (network property): Local Property | Evolution Prop. | Global Property | Community Prop. | Distinguish Graphs
Model Parameters: [49] | [50–55]
Likelihood: [56]
Distance Measure: [19, 20, 23, 27, 36] | [57]
Basic Statistic: [16] | [33, 53] | [31, 53] | [53]
Network Statistic: [17, 19, 23, 36, 58–60] | [9, 23, 35, 50, 53, 58] | [35] | [53, 61]

Table 3.3.: Categorization of related work in static networks, by H0 generation method and network property being tested.

Columns (network property): Local Property | Global Property | Distinguish Graphs | Community Property
Multiple Observations: [37] | [58] | [2, 47, 48]
Generative Model: [40, 41] | [43]
Sampling Method: [2] | [2, 31]
Permutation: [22, 42] | [2, 38, 39] | [2, 44] | [2]

Table 3.4.: Categorization of related work in dynamic networks, by H0 generation method and network property being tested.

Columns (network property): Local Property | Evolution Prop. | Global Property | Community Prop. | Distinguish Graphs
Multiple Observations: [19, 20, 23, 36, 58, 59] | [49, 57] | [51, 52, 54, 55] | [61] | [16, 17, 27, 60]
Generative Model: [33]
Sampling Method: [9, 31, 56]
Permutation: [35, 50, 62] | [35]

3.1 First Component (Chapter Five): Testing Global/Community Properties using a Network Statistic and Permutation

My first component, Chapter 5 [35], deals with determining if graphs in a dynamic network exhibit the Homophily or Social Influence properties amongst nodes belonging to particular groups. In this chapter we introduce a new method which uses a permutation test and an autocorrelation statistic to investigate and differentiate these two properties, both globally and in the network communities. As such it is categorized in the Permutation H0 row, the Network Statistic row, and both the Global Property and Community Property columns of Tables 3.2 and 3.3. The most relevant papers to this work are [62] and [9]. [62] also attempts to identify the presence of social influence using a permutation method; however, their permutation method randomizes the timestamps of different actions in the network rather than the actions themselves and thus requires detailed information about the times when events occur. It also assumes that the network has a constant edge structure. [9] is also dedicated to distinguishing homophily and social influence effects; however, it uses matched sample estimation, which is a Sampling Method rather than a Permutation approach to generating the null distribution. This approach requires careful selection of similar data instances, which is difficult when explicit node attributes are not available. Also highly relevant is [50], which investigates homophily via autocorrelation as well as some other graph properties using a permutation method called the Multiple Regression Quadratic Assignment Procedure (MRQAP). This is the same type of null generation and network statistic, but the permutation algorithm used to generate the null distribution instances lacks some of the constraints used in Chapter 5; the importance of these constraints is elaborated in that chapter. The paper also uses the parameters of an Exponential Random Graph Model (ERGM) as an alternative approach and so falls under the Model Parameter category as well.

Other related papers include those which use ERGM models to represent homophily or social influence [43, 55], papers which define their own models of network behavior [34, 63, 64], and permutation testing applied to static graph scenarios [30, 65–67].

3.2 Second Component (Chapter 6): Distinguishing the Properties of Different Graphs using Multiple Observations of Consistent Network Statistics

The second major component of this dissertation is found in Chapter 6, denoted [36] in the tables, and deals with the detection of anomalies in a dynamic network setting. This work introduces new consistent network statistics which aid in determining which graph instances have anomalous network properties when the size of those graphs is not constant. It utilizes both statistics measuring single graphs and distance measures comparing pairs of graphs and therefore falls in both the Network Statistic and Distance Measure rows, and it constructs the null distribution using Multiple Observations. As it seeks to differentiate graphs based on their network properties (and whether those properties are anomalous), it falls in the Distinguish Graphs category. The two most relevant papers to this work are the ones describing the Netsimile [19] and Deltacon [20] statistics. Both of these statistics are distance measures designed to find graphs which are different from the more typical observed graphs. Netsimile accomplishes this by creating vectors for each graph out of aggregates of network statistics and then taking the distance between those summary vectors, while Deltacon calculates a matrix for each graph representing distances between nodes and then compares those matrices. The details of each method, as well as the difficulties they have when dealing with confounded effects, will be described in more detail in Chapters 4 and 6. These statistics are used as a basis for comparison as they are both state-of-the-art detectors for dynamic network anomalies. In addition, Deltacon explicitly searches for anomalies on a dataset which this work also investigates.

Other papers which search for dynamic network anomalies using more traditional statistics include the following: [16, 17, 25, 27, 48, 59, 60]. Model-based anomaly detection is also common, the most used model being ERGM: [37, 42, 51, 52, 54, 56, 58, 68, 69], although some use other types of models [41]. Another paper which delves into the statistical consistency properties of network statistics is Borgs et al. [8], which investigates the behavior of motif-style statistics in the limit as the number of nodes sampled from some graph generative process increases. Similar to our approach, they prove a finite limit that depends only on the properties of the graph generative process; however, they only take the limit with respect to the number of nodes, making the assumption that once a pair of nodes are both in the observed set the true edge weight between them is known rather than estimated from network communications.

3.3 Time Series Analysis

As applying a network statistic to a dynamic network produces standard time series data, methods which perform anomaly detection or change point detection on time series are also relevant, even if the papers in question do not explicitly apply their methods to networks and are not listed in the tables. This section is dedicated to work relevant to time series methods. Time series analysis is a well studied field, although traditional techniques were developed for independent data rather than networks. Common tasks include anomaly detection and change point detection, which involve flagging certain time steps as unusual or as signifying a change in the stream. A number of algorithms exist for anomaly detection in time series data, including Cumulative Sum (CUSUM)-based and likelihood ratio approaches [70-76]. The basic algorithms will flag any time steps where deviation from a baseline model is detected. CUSUM methods cumulatively sum deviations from some statistic and flag at a certain threshold, while likelihood ratio methods flag when the likelihood under a certain model is too low.

For some data the concept of "normal" is not constant and will slowly change over time. To deal with a changing baseline, the model either needs to be re-learned before each detection decision, or a data processing step such as detrended fluctuation analysis (DFA) or detrended moving average (DMA) can be applied to the data as a whole before the detection step [77].
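A minimal CUSUM sketch over a statistic time series (all numbers hypothetical): positive deviations from the baseline accumulate, and an alarm is raised when the running sum crosses a threshold.

```python
import numpy as np

rng = np.random.default_rng(2)
series = np.concatenate([rng.normal(10, 1, 60),   # normal regime
                         rng.normal(12, 1, 20)])  # shifted regime (the change)

baseline, slack, threshold = 10.0, 0.5, 5.0       # hypothetical tuning values
cusum = 0.0
for t, x in enumerate(series):
    # One-sided upward CUSUM: accumulate deviations above baseline + slack.
    cusum = max(0.0, cusum + (x - baseline - slack))
    if cusum > threshold:
        print(f"change flagged at time step {t}")
        break
```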

3.4 Network Models

As mentioned before, network models can be used as the basis for a hypothesis test or anomaly detector. In addition, generative graph models provide useful test cases by producing output graphs with properties specified by the model parameters. In this section we will briefly describe the network models most relevant to this work.

Erdős-Rényi Random Graph: The Erdős-Rényi (ER) random graph is one of the earliest and simplest graph models to be developed [78]. It defines the probability of any link in the network existing as a simple uniform probability p, which can be easily computed by calculating the density of the training network. A network can then be generated by taking Bernoulli trials on all pairs of nodes or, to save computation, by placing edges into the graph uniformly at random. While ER graphs are easy to learn, produce, and analyze, they are generally poor fits to real-world networks due to the assumption of uniform degree on all nodes. The model also has no way to incorporate concepts like transitivity. It is still useful for certain test cases, though, as its properties are well known.

Chung-Lu, Transitive Chung-Lu: The Chung-Lu (CL) model is a generative graph model designed primarily to preserve the degree distribution of the graph used as its basis [79]. It is usually implemented as an edge insertion procedure where the probability of an edge between two vertices is proportional to the product of their expected degrees, with the expected degree being the same as that of the corresponding vertex in the original graph. Unlike ERGM's parameterization utility, the Chung-Lu model uses only the degrees of the original graph as a parameter, and so its utility

to hypothesis testing is in the simulation of null distributions. Another variant of the Chung-Lu model that has been developed is the Transitive Chung-Lu [80], which preserves both the degree distribution and the triangle counts of the original graph. We utilize this model later as a source for synthetic data, as it allows for easy control of the degree distribution and transitivity of the produced graphs.

Exponential Random Graphs: As mentioned earlier in this chapter, Exponential Random Graph Models (ERGMs) have been widely used as a basis for hypothesis testing due to the flexibility of their definition [69]. ERGM defines the likelihood of a particular graph instance as being proportional to the exponential of a weighted sum over multiple graph features. The likelihood for a given network configuration can be expressed as $e^{\sum_{i \in S} \theta_i s_i(G)}$, where S is some vector of network features calculated from the network and $\theta$ is a vector of feature weights. The exact features used in the model can vary, but typical ones are edge count, triangle count, and reciprocated edges. It is very straightforward to define likelihood ratio tests using the ERGM model on two networks [37]. For dynamic networks, a dynamic ERGM is an alternative to a distance-based or network test statistic [51, 56, 58]. A dynamic ERGM model has the same form as a standard ERGM but includes network features that are temporal in nature: for example, an edge retention feature could be the number of edges which appear in both the current time step and the previous time step. The learning and testing processes are identical to the standard ERGM. Dynamic ERGMs suffer from the same limitations as standard ERGMs, and in fact the computational costs are exacerbated by the need to re-learn the parameters for each time step being tested. Alternatively, the features of multiple time steps can be aggregated and a single set of parameters learned to represent the whole stream; however, this assumes that the stream can be represented with a static model, which, as discussed in later sections, is rarely the case in real dynamic networks.

Unfortunately, ERGM has suffered from what are known as degeneracy issues, which is the tendency to place more probability mass on unusual graph states such as the empty graph or the complete graph [81]. To avoid this, certain "alternating" statistics have been defined which do not place increased preference on a greater number of edges [69]. In addition, the model learns inconsistent parameters on subsets of networks or on networks of different sizes [82]. Learning the model from data also takes an impractical amount of time even on moderately large networks. For these reasons the ERGM model is impractical to apply to the problems discussed in this dissertation, despite its ubiquitousness.
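For concreteness, a sketch of the two simplest generators above (plain numpy; the sizes and degree sequence are hypothetical). The ER graph links every pair with the same probability p, while the Chung-Lu step makes the probability of edge (i, j) proportional to the product of target degrees.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100

# Erdos-Renyi: every pair linked independently with uniform probability p.
p = 0.05
er_edges = {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p}

# Chung-Lu: P(edge i, j) = min(1, d_i * d_j / sum(d)) for a target degree
# sequence d (here a hypothetical heavy-tailed sequence).
d = rng.zipf(2.5, n).clip(max=n - 1).astype(float)
total = d.sum()
cl_edges = {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < min(1.0, d[i] * d[j] / total)}

print(len(er_edges), len(cl_edges))
```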

4 NETWORK STATISTIC DEPENDENCIES

Although many network statistics have been created for the purpose of analyzing networks, these statistics often suffer from confounding effects. Rather than being sensitive only to a single network property, they are affected by changes in many network properties, as illustrated in Figure 1.3. The main issue is that many statistics have been conflated with the property they are intended to measure: for instance, the clustering coefficient is often reported as the transitivity of the network even when the network lacks any tendency toward closing triangular communications and the observed value is instead driven by the number and density of its communities. These multiple possible causes for an observed statistic make drawing specific conclusions about the network properties very difficult. Before discussing the behavior of the statistics, we need to describe more formally the network properties and how they produce the observed graphs. Let us assume that there are k hidden network properties that control the graph generation process, each parameterized with a single value from the set θ1, θ2, ..., θk. These values can be thought of as the inputs to the graph generation process that constructs the observed graphs, which can be represented by the full joint model P(G|θ1, θ2, ..., θk). Although the exact graph probability function and how the network properties affect it are unknown, we would like to draw inferences about the properties of the network, via their parameters, using hypothesis tests. Assume that we are testing a network property parameterized by θi. Our null hypothesis is

H0: θi = θ0, with the alternative hypothesis being H1: θi ≠ θ0. Let the test statistic for the hypothesis be denoted S and let θi be referred to as the hypothesized parameter. Although technically S can be any network statistic, a good choice of statistic would

be one which satisfies the statements S|(θi = θ0) = S0 and S|(θi ≠ θ0) ≠ S0, where S0

is the expected value of the test statistic under θ0. Violations of the first statement imply false positive errors, as S does not produce consistent values even when the null

hypothesis is true, while violations of the second imply false negatives. Essentially, the optimal statistic has a one-to-one mapping between parameter space and statistic space with no variance, which allows the critical region to be trivially defined as

simply ¬S0. Of course, in any realistic scenario the test statistic will have variance due to the randomness of the graph generation procedure, which will produce some amount of error. The utility of the test statistic comes from its dependence on the hypothesized parameter. But what happens when the statistic is dependent on other parameters,

i.e., P(S|θi) ≠ P(S|θi, θj)? There can be situations where variation in the θj parameter causes a change in the test statistic even when the hypothesized parameter is held constant: S|(θi = θ0, θj = c1) ≠ S|(θi = θ0, θj = c2). If the θj that generated the test case is not the same as the parameter that generated the null distribution, this discrepancy could push S into the critical region and cause the null to be rejected even when the null hypothesis is true. This alternative property is often referred to as a confounding effect, as the mutual dependency of S on each of these property parameters prevents us from disambiguating which property is truly different between the test and null cases (false positives). In addition, we could fail to find instances where the property is different from the null case due to the noise introduced by other properties (false negatives). One solution is to ensure that any property that could be a confounding effect is held constant in both the null distribution and the test case. This is generally impractical, as edge and node counts could be confounding properties, which would mean that only hypothesis tests using identically sized networks would be valid, a harsh restriction. A less constraining solution is to use statistics which are independent of all properties other than the one they are intended to measure, but defining such statistics is extremely difficult in practice. In fact, virtually every commonly used network statistic harbors these kinds of mutual dependencies, primarily a dependence on the size (node and edge count) of the network. The rest of this Chapter will be dedicated to demonstrating the dependencies of common network statistics on two fundamental network properties: θe, the parameter controlling

the number of edges in the network, and θv, the parameter controlling the number of vertices. Note that the network density is the result of these two parameters, so statistics that are dependent on the network density are dependent on both. For simplicity of the following formulas, assume that θe and θv are the exact number of edges and nodes the network models will produce; in reality there will be some variance, but the formulas will still hold.

4.1 Graph Edit Distance

The graph edit distance (defined in Section 6.4) is the number of edits that one network needs to undergo to transform into another network. It is an attempt to measure the "rate of change" property of a dynamic network: the faster the rate of change, the more edits needed at each new time step. However, the graph edit distance is also highly sensitive to the total volume of communications in the network as well as the number of nodes. The edit distance is an absolute sum of differences between the two networks, but the number of possible differences between the two networks is directly related to the edge and node counts.

Let θe1, θv1 be the edge and node counts in time t − 1 and θe2, θv2 be the counts for time t. The minimum possible edit distance between the two networks occurs when the smaller graph is a subgraph of the larger, and has an edit distance of abs(θe2 − θe1) + abs(θv2 − θv1). The maximum possible distance is attained when the graphs overlap minimally, giving an edit distance of abs(θv2 − θv1) + min(θe2 + θe1, max(θv2² − θe1, θv1² − θe2)). So the range of possible edit distances is defined by the edge and node counts in the two time steps. The expected edit distance can also be dependent on the

network size in many cases. Consider the scenario where each posterior network G2

is generated by randomly permuting each edge in the previous network G1 to a new unoccupied node pair with probability p. This gives us p · θe1 expected permutations,

which corresponds to an expected edit distance of 2·p·θe1. Note that the edit distance

is still dependent on the total edge count in the two graphs even though the edge counts do not change across the time steps.
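As a concrete illustration of this dependence, the sketch below (a minimal Python implementation written for this illustration, treating a network as a set of edges over a fixed node set) computes the edit distance of the permutation scenario just described and compares it against the 2·p·θe1 expectation.

  import random

  def edit_distance(edges1, edges2):
      # With a shared node set, the edit distance reduces to the size of
      # the symmetric difference of the two edge sets.
      return len(edges1 ^ edges2)

  rng = random.Random(1)
  n, p = 200, 0.1
  g1 = {(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < 0.05}

  # Permute each edge to a new unoccupied node pair with probability p.
  g2 = set(g1)
  for edge in list(g1):
      if rng.random() < p:
          g2.discard(edge)
          while True:
              i, j = sorted(rng.sample(range(n), 2))
              if (i, j) not in g2:
                  g2.add((i, j))
                  break

  # Each permutation contributes one deletion plus one insertion, so the
  # observed distance concentrates near 2 * p * |E1|.
  print(edit_distance(g1, g2), 2 * p * len(g1))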

4.2 Clustering Coefficient

The clustering coefficient, defined in Section 6.4, is intended to measure transitivity, the tendency to form links between mutual friends, and is another common network statistic. Like the edit distance, the clustering coefficient is also dependent on the number of edges in the network. In the most trivial cases, a network with fewer than three edges will always have a clustering coefficient of zero, while a completely connected graph will always have its wedges closed, leading to a clustering coefficient of one. More generally, increased network density can lead to higher clustering coefficients through random chance alone. To demonstrate this, consider an Erdős-Rényi graph parameterized

by P(p, θv), where p is the edge probability and θv the number of nodes to generate. p can be thought of as a parameter for the density of the graph and can be measured with the network density statistic (|E|/|V|²). However, this density parameter can also affect the value of statistics intended to measure transitivity in the graph. The expected number of triangles in an ER graph is (N choose 3)·p³, as each triple of vertices needs to be fully connected to form a triangle, while the expected number of wedges is (N choose 3)·(3p²(1 − p) + 3p³) = (N choose 3)·3p², as for every triple of vertices there are three ways to form a single wedge, or one way to form three wedges in the form of a triangle. This gives an expected clustering coefficient of 3·(N choose 3)·p³ / ((N choose 3)·3p²) = p. So in a network model that has no property directly introducing transitive behavior, the clustering coefficient is directly affected by the density parameter of the network. If we performed a hypothesis test looking for changes in the transitivity property of the network on graphs generated by this model, we might falsely conclude that the transitivity is changing when in reality only the density parameter is changing. The local clustering coefficient suffers from this same dependency, as for every pair of neighbors of a node (a wedge) the probability of closing a triangle is p.
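The claim that the expected clustering coefficient of an ER graph equals p is easy to check empirically. The sketch below, a hypothetical helper written only for this illustration, samples ER graphs and reports the global clustering coefficient (closed wedges over total wedges), which tracks the density parameter p.

  import random
  from itertools import combinations

  def er_clustering(n, p, seed=0):
      rng = random.Random(seed)
      adj = {v: set() for v in range(n)}
      for i, j in combinations(range(n), 2):
          if rng.random() < p:
              adj[i].add(j)
              adj[j].add(i)
      # Wedges: pairs of neighbors around a center node; a wedge is closed
      # when the third edge is present (each triangle is counted once per
      # center, i.e. three times in total, matching 3*triangles/wedges).
      wedges = closed = 0
      for v in range(n):
          for u, w in combinations(sorted(adj[v]), 2):
              wedges += 1
              if w in adj[u]:
                  closed += 1
      return closed / wedges if wedges else 0.0

  # With no transitivity property in the model, the statistic still tracks p.
  for p in (0.05, 0.1, 0.2):
      print(p, er_clustering(300, p))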

4.3 Degree Distribution

The degree distribution naturally has a direct relationship with the edge count of a network, since large degree nodes can only exist in networks with a greater total edge count. This places a hard upper limit on the degree of any node: D(v) ≤ θe. Even for node degrees less than this hard limit, a denser network has greater odds of containing a node with a very high degree than a less dense network. Suppose we have an initial network with maximum degree d. If we densify this network by adding e edges, there is some probability that the maximum degree node will be an endpoint of one of the new edges and the maximum degree will increase. Although this may not change the relative shape of the degree distribution, measures which compare degree distributions using the exact counts of particular degrees will be affected.
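A quick simulation makes the effect visible; the helper below is hypothetical and simply draws uniformly random graphs with a fixed node count while growing the edge count, reporting the average maximum degree.

  import random
  from collections import Counter

  def max_degree(n_nodes, n_edges, rng):
      # Maximum degree of a uniformly random graph with a fixed edge count.
      deg = Counter()
      edges = set()
      while len(edges) < n_edges:
          i, j = rng.sample(range(n_nodes), 2)
          e = (min(i, j), max(i, j))
          if e not in edges:
              edges.add(e)
              deg[i] += 1
              deg[j] += 1
      return max(deg.values())

  rng = random.Random(0)
  # Same node count, growing edge count: the expected maximum degree rises.
  for m in (200, 400, 800):
      print(m, sum(max_degree(500, m, rng) for _ in range(20)) / 20)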

4.4 Centrality Measures and Shortest Paths

The shortest path between two nodes is the minimum number of edge traversals needed to connect them. The set of all shortest paths in a network is usually summarized in a hop plot. Centrality measures are statistics designed to represent how central or important certain nodes are to the network and include betweenness centrality, closeness centrality, and others. In general, the centrality of the nodes in a network and the shortest paths of the network are the result of properties controlling the topology of the graph and properties which determine the importance of different nodes. However, greater density will also lead to shorter path lengths and higher centrality measures on average, and the path lengths will converge to one as a network becomes fully connected. Consider a network where the density is increasing over time through

single edge insertions. Every time an edge is inserted, the path length of one pair of nodes drops to one. In addition, all other paths either remain the same length or decrease if the connected nodes were on one of the shortest paths. As every additional

edge can only reduce the lengths of the shortest paths, as θe increases the network will have shorter path lengths even if other network parameters are held constant.
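The following sketch (hypothetical code written for this argument, not from any referenced system) starts from a sparse ring graph and repeatedly inserts random edges, recomputing the mean shortest-path length with breadth-first search; the average only decreases as θe grows.

  import random
  from collections import deque

  def avg_path_length(adj):
      # Mean shortest-path length over reachable ordered pairs, via BFS.
      total = pairs = 0
      for s in adj:
          dist = {s: 0}
          queue = deque([s])
          while queue:
              u = queue.popleft()
              for v in adj[u]:
                  if v not in dist:
                      dist[v] = dist[u] + 1
                      queue.append(v)
          total += sum(dist.values())
          pairs += len(dist) - 1
      return total / pairs

  rng = random.Random(0)
  n = 100
  adj = {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}  # sparse ring
  for _ in range(3):
      for _ in range(150):  # densify with random edge insertions
          i, j = rng.sample(range(n), 2)
          adj[i].add(j)
          adj[j].add(i)
      print(avg_path_length(adj))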

4.5 Autocorrelation

As mentioned in Chapter 2, there are two major properties conjectured to affect how self-similarity appears in networks: the propensity for forming links between nodes that are similar, known as homophily, and the propensity for nodes to adopt the attributes of their neighbors, known as social influence. Both of these properties are measured with the same set of autocorrelation statistics, which means that they are confounded. In addition, the density of the network can affect autocorrelation statistics. One of the autocorrelation statistics described in Chapter 2 is Pearson's chi-squared score, defined as:

χ² = Σ_{eij ∈ E} (Erc − Orc)² / Erc    (4.1)

where r = eij and c = I[xik = xjk]. In other words, the autocorrelation is maximized when pairs of nodes with an edge have the same attribute and pairs of nodes with different attributes do not share an edge. Consider a simple graph where the chi-squared score is maximized: nodes with the same attribute form a fully connected clique, while no edges lie between nodes with different attributes. Now suppose we construct a new graph by adding new edges to this existing graph, increasing the density. As all similar pairs of nodes are already connected, the new edges must be placed between dissimilar nodes, and the chi-squared score will decrease. Therefore, the autocorrelation is not simply a function of the self-similarity property of the network but also of its density.

4.6 Netsimile Distance

The network distance measure used in the Netsimile algorithm [19] uses standard network statistics like node degree and local clustering coefficient as components when calculating the distance score between two networks. As these components are dependent on the number of edges and nodes in the network, Netsimile will produce a greater distance score if the difference in the number of edges and nodes in the two networks is large. Netsimile also suffers from issues due to its use of the Canberra distance when calculating the total distance score. The Canberra distance is defined

as CD(s¹, s²) = |s¹ − s²| / (s¹ + s²), where in the case of Netsimile s¹ and s² are aggregates of local statistics calculated from the two networks. Suppose that the Canberra distance is

being used to detect change points in a dynamic network: s¹ is calculated from time t1, s² from t2, and the Canberra distance between the two is compared to previous adjacent time-step pairs to see if the difference is greater than expected. Here

|s¹ − s²| is the signal strength of the anomaly; however, as the magnitude of s¹ increases, a greater signal strength is required to reject the null due to the normalization. If

s¹ is a statistic dependent on network size that produces values proportional to the size, then sections of the stream where the network is larger require more dramatic anomalies to produce enough signal strength to trigger a detection; increased network size therefore results in reduced test sensitivity.
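A two-line numeric example shows the normalization at work: the same absolute anomaly signal produces a much smaller Canberra distance when the underlying statistics are large. The values below are invented purely for illustration.

  def canberra(s1, s2):
      return abs(s1 - s2) / (s1 + s2)

  # Identical signal strength |s1 - s2| = 50 at two network sizes:
  print(canberra(100, 150))    # 0.20   (small network: strong signal)
  print(canberra(1000, 1050))  # ~0.024 (large network: weak signal)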

4.7 Deltacon Distance

The Deltacon algorithm [20] generates affinity scores between all pairs of nodes in a network slice, places these values in an affinity matrix, then produces a distance score between two network slices which is the difference in their affinity matrices. The affinity score is calculated through a matrix inversion and can be thought of as an estimate of the random-walk with restarts distance between any two nodes. Random-walk with restart will generally produce greater distances between nodes in less well-connected networks, as the random walk is likely to become stuck in local structures.

In contrast, a completely connected network will have uniform walk distances between all nodes, as direct paths are available between all node pairs. Therefore, if the two networks that Deltacon is comparing differ in their densities, the Deltacon distance score is likely to be large even if the parameters that generated the networks are similar.

4.8 Summary

As shown above, all of the statistics mentioned so far suffer from confounding factors and are not reliable measures of their intended network properties. In particular, properties which control the edge count, node count, and density are confounding factors for virtually every network statistic. In the next Chapter we will provide a new method for avoiding confounding factors when testing for the existence of homophily and social influence properties, and in Chapter 6 we will elaborate on how to avoid edge count, node count, and density as confounding factors more generally.

5 RANDOMIZATION TESTING SOLUTION

In this Chapter we will describe the first major component of this work, which avoids confounding effects in hypothesis tests through the use of permutation testing. With carefully designed permutation procedures, null distributions can be obtained in which possible confounding factors are controlled, preventing them from affecting the output of the test.

5.1 Introduction

As mentioned before, autocorrelation between linked nodes in a graph is a common characteristic of relational and social network datasets due to the nature of social interactions. For example, friends are more likely to share political views than randomly selected pairs of individuals, so a positive autocorrelation is expected. The presence of autocorrelation offers a unique opportunity to improve predictive models because inferences about one object can be used to improve inferences about related objects. Indeed, recent work in relational learning has exploited this property in the development of collective inference models, which can make more accurate predictions by jointly inferring class label values throughout a network (see e.g., [83-85]). In addition, the gains that collective models achieve over conditional models (which reason about each instance independently) increase as autocorrelation levels increase in the data [1]. A number of widely occurring network properties give rise to observed autocorrelation. Social phenomena, including social influence [86], diffusion processes [87], and the principle of homophily [88], can cause autocorrelated observations through their influence on the social interactions that govern the data generation process. Alternatively, a hidden condition or event, whose influence is correlated among instances

that are closely located in time or space, can produce autocorrelated observations through joint influence on link and attribute changes [89,90]. A key question for understanding and exploiting behavior in social network domains is determining which network properties are the root cause of observed autocorrelation. Since autocorrelation is the primary motivation to use relational and network models over conventional machine learning techniques, it stands to reason that a better understanding of the causes of autocorrelation will inform the development of improved models and learning algorithms. For example, although previous work in relational learning and statistical network analysis has focused primarily on static graphs, recent efforts have turned to the analysis of dynamic networks and the development of temporally-evolving models (e.g., [91,92]). In order to deal with the enormous increase in dimensionality associated with modeling both temporal and relational dependencies, these methods restrict the set of dependencies that they consider (e.g., through choice of model form). The ability to accurately distinguish which temporal-relational patterns (e.g., homophily) occur in real-world datasets will ensure that researchers can include the most promising dependencies in their restricted set of patterns. Research in social psychology and sociology has developed two main theories of the social network properties that are responsible for the autocorrelation observed in social systems. Social influence refers to processes in which interactions with others cause individuals to conform (e.g., people change their attitudes to be more similar to those of their friends). Homophily refers to processes of social selection, where individuals are more likely to form ties with "similar" individuals (e.g., people choose to be friends with people who share their beliefs). Both homophily and social influence can produce autocorrelation, since their outcomes result in linked individuals sharing attribute values. These two effects are therefore generally confounded: a hypothesis test using an autocorrelation statistic as the test statistic cannot determine which property contributed to the network, or which is more prominent if both are present.

In this chapter, we focus on the task of differentiating between influence and homophily effects and determining, from the observed autocorrelation dependencies, whether the effects are significant. Recently, there have been a number of empirical studies that investigate (and model) either social influence or homophily effects in real-world datasets (e.g., [34,63,64]). However, these efforts have focused primarily on demonstrating the presence of homophily and influence; they do not provide the means to estimate effect sizes from data or determine whether the effects are statistically significant. Exceptions include the work of Snijders et al. [55], Anagnostopoulos et al. [62], and Aral et al. [9]. Snijders et al. [55] develop a time-evolving exponential random graph model that can represent homophily and influence effects. Their method supports hypothesis tests for each effect, but the applicability of the approach is limited by the suitability of the model form (e.g., random graph model, Markov assumption). The recent work of Anagnostopoulos et al. [62], on the other hand, presents a model-free approach to assessing influence effects with randomization tests. The limitation of their framework, however, is the assumption that the network structure (i.e., links) does not change over time, so they cannot distinguish homophily effects. Aral et al. [9] correct this issue with the development of a matched sample estimation framework that accounts for homophily effects, but the method uses additional node behaviors and characteristics in the matching process, so it will have limited applicability in data with few observed attributes and/or time steps. First we will outline a general randomization framework for datasets where both attribute values and links change over time, where changes can consist of either additions or deletions. The aim is to determine the significance of each property and to distinguish the contributions of influence and homophily effects. We define a randomization test based on the permutation of action choices and consider the gain in correlation over one time step in the graph, assessing the amount of gain that is due to each of the effects. The randomization procedure then produces an empirical sampling distribution of expected gains under the null hypothesis (that there is no influence and/or homophily effect), and if the observed gain is greater than expected


Figure 5.1.: Confounding effect example with the communication volume property controlled.

under the null, we can conclude there is a significant influence/homophily effect. In terms of confounding effects, this randomization approach is a way of controlling network properties so that they cannot be confounding factors, as shown in Figure 5.1. If influence or homophily is held constant in both the null distribution examples and the test point, then any conclusion will not be affected by the controlled network properties. We evaluate this method on semi-synthetic social network data, showing that the test has low Type I error (i.e., it does not incorrectly conclude there is an effect when in fact the data are random) and high power when the data exhibit sufficient change over time (i.e., it correctly concludes there is an effect when there is one). We then apply the method to a real-world dataset to investigate the sources of observed autocorrelation. Our analysis of a public university Facebook network shows that autocorrelation in group memberships is due to significant influence and homophily effects, yet different groups exhibit different types of behavior.


Figure 5.2.: Illustration of homophily and influence effects on attributes and links over time.

5.2 Problem Definition

In this Chapter we will consider relational data represented as an undirected, attributed graph G = (V, E), with the nodes V representing objects in the data (e.g., people) and the edges E representing relationships (e.g., friendships) between pairs of objects

(eij: vi and vj are friends). Each node v ∈ V has a number of associated attributes X^v = (X^v_1, ..., X^v_m) (e.g., age, gender). We assume that both the attributes and links may vary over time. First, attribute values may change at each time step t: Xt = {X^v_t} = {(X^v_1t, ..., X^v_mt)}. Second, relationships may change at each time step. This results in a different data graph

Gt = (V, Et) for each time step t, where the nodes remain constant but the edge set may vary (i.e., Et ≠ Et′ for some t, t′). Figure 5.2 illustrates influence and homophily dependencies. If there is a significant influence effect, then we expect the attribute values in t + 1 to depend on the link structure in t. On the other hand, if there is a significant homophily effect, then we expect the link structure in t + 1 to depend on the attributes in t. If either influence or homophily properties are present in the data, the data will exhibit a positive autocorrelation value at any given time step t. Any autocorrelation statistic, such as Pearson's correlation coefficient or information gain, can be used to assess the association between these pairs of values of X. In this work, we use

Pearson's chi-square statistic, as described in the previous Chapter. This choice of statistic will be elaborated on in a later section.

Autocorrelation

Let PR = {(vi, vj): eij ∈ E} be the set of related instance pairs in G. Let X be a binary attribute defined on the nodes V. Then we compute the relational autocorrelation of X in G with the contingency table shown in Table 5.1.

Table 5.1.: Contingency table for linked node pairs and their attribute values.

                    Xi = Xj = x    ¬(Xi = Xj = x)
(vi, vj) ∈ PR            a                b
(vi, vj) ∉ PR            c                d

Define autocorrelation as the chi-square statistic computed from T (with dof = 1):

C(X, G) = χ² = (ad − cb)²·N / [(a + b)(c + d)(b + d)(a + c)]

where N = a + b + c + d is the total count over all cells of T. The first column of the contingency table counts pairs of nodes that both have the same value for attribute X. The second column counts pairs of nodes that do not match on X. The first row counts pairs of nodes that are related in G. The second row counts pairs of nodes that are not linked in G. Note that this contingency table encompasses every possible combination of nodes in the graph and thus has a stable size even when the total number of links changes (i.e., it does not depend on the size of E).

compute the chi-square statistic C(Xt,Gt) from the graph Gt using the attribute values in Xt. Using a similar table for the attributes and links in t + 1, compute

C(Xt+1, Gt+1).
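A direct implementation of this statistic, sketched below with a helper name of our own choosing, builds the four contingency counts by iterating over all node pairs and then applies the chi-square formula above; the same function evaluated at (Xt, Gt) and (Xt+1, Gt+1) yields the delta statistic of the next paragraph.

  from itertools import combinations

  def autocorrelation(nodes, edges, attr, x=True):
      # a, b, c, d follow the contingency table in Table 5.1.
      a = b = c = d = 0
      for vi, vj in combinations(nodes, 2):
          match = (attr[vi] == x) and (attr[vj] == x)
          linked = (vi, vj) in edges or (vj, vi) in edges
          if linked and match:
              a += 1
          elif linked:
              b += 1
          elif match:
              c += 1
          else:
              d += 1
      n = a + b + c + d
      denom = (a + b) * (c + d) * (b + d) * (a + c)
      return (a * d - c * b) ** 2 * n / denom if denom else 0.0

  # Delta statistic over one time step:
  # delta = autocorrelation(V, E_t1, X_t1) - autocorrelation(V, E_t0, X_t0)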

Now that the correlation can be computed at any time step, the change in correlation can be used as a delta statistic to assess the effects of homophily and influence, where the observed delta correlation from t to t + 1 is:

Δχ²(t, t + 1) = C(Xt+1, Gt+1) − C(Xt, Gt)

The delta in correlation from one time step to the next can be due to: (1) homophily gains due to changes in the graph structure in t + 1, or (2) influence gains due to changes in attributes in t + 1. To show this, we will define influence and homophily and show how they impact the chi-square statistic from one time step to the next.

Homophily

Homophily is typically used to refer to the general tendency of people to associate with similar others (see e.g., [88]). We operationalize it in the following way to investigate whether attribute similarity influences choice of friends (i.e., are friendships formed based on a pair's attribute similarity).

Let Xt and Xt+1 be the attribute values at time t and t + 1 respectively, and let PR(t) and PR(t+1) be the related pairs at time t and t + 1 respectively. Let L^ij_t+1 refer to the case when pair (vi, vj) forms a link at time t + 1 (i.e., (vi, vj) ∉ PR(t) and (vi, vj) ∈ PR(t+1)). Let U^ij_t+1 refer to the case when pair (vi, vj) drops a link at time

t + 1 (i.e., (vi, vj) ∈ PR(t) and (vi, vj) ∉ PR(t+1)). Then a dataset exhibits homophily if the following hold:

p(L^ij_t+1 | (X^i_t = X^j_t = x)) > p(L^ij_t+1 | ¬(X^i_t = X^j_t = x))
p(U^ij_t+1 | ¬(X^i_t = X^j_t = x)) > p(U^ij_t+1 | (X^i_t = X^j_t = x))

In other words, the probability of link formation over time (from t to t + 1) is higher for pairs with matching attribute values, and the probability of link dissolution over time is higher for non-matching pairs.

Social Influence

Social influence is typically used to refer to the general case of a person's behavior

being influenced by others (see e.g., [93]). We operationalize it in the following way to investigate whether a person's friends influence their intrinsic attributes (i.e., are attribute values changed to match one's friends).

Let Xt and Xt+1 be the attribute values at time t and t + 1 respectively. Let PR(t) and PR(t+1) be the related pairs at time t and t + 1 respectively. Let A^ij_t+1 refer to

the case when pair (vi, vj) change their attribute values to agree at time t + 1 (i.e., ¬(X^i_t = X^j_t = x) and (X^i_t+1 = X^j_t+1 = x)). Let D^ij_t+1 refer to the case when pair (vi, vj) change their attribute values to diverge at time t + 1 (i.e., (X^i_t = X^j_t = x) and ¬(X^i_t+1 = X^j_t+1 = x)). Then a dataset exhibits social influence if the following hold:

p(A^ij_t+1 | (vi, vj) ∈ PR(t)) > p(A^ij_t+1 | (vi, vj) ∉ PR(t))
p(D^ij_t+1 | (vi, vj) ∉ PR(t)) > p(D^ij_t+1 | (vi, vj) ∈ PR(t))

In other words, the probability of agreement over time (from t to t + 1) is higher for related pairs, and the probability of disagreement over time is higher for unrelated pairs.

Delta Influence

If social influence is present in the network, then the measured chi-squared score will increase over time. This can be shown using the definition of social influence and calculating the change in the chi-squared score.

Theorem 5.2.1 Let Xt and Xt+1 be attribute values at time t and t + 1 respectively and let PR(t) be the related pairs at time t. Let k = |A^ij_t+1| be the number of agreements from t to t + 1. Let m = |D^ij_t+1| be the number of disagreements from t to t + 1, and let k = m. Then if an influence effect is present in the data, the autocorrelation will increase when one considers the attribute changes from time t to time t + 1:

C(Xt+1, Gt) > C(Xt, Gt)

Proof As defined earlier, the chi-squared score for a two-by-two contingency table is:

C(Xt, Gt) = χ²_t = (ad − cb)²·N / [(a + b)(c + d)(b + d)(a + c)]

where a, b, c, d, N are defined with respect to Xt and PR(t). Let k̂r and k̂u be the expected number of agreements among related and unrelated people respectively in

time t + 1. Similarly, let m̂r and m̂u be the expected number of disagreements among related and unrelated people respectively. The changes to the contingency table are shown in Table 5.2.

Table 5.2.: Changes to the contingency table in time t +1.

                      X^i_t = X^j_t = x    ¬(X^i_t = X^j_t = x)
(vi, vj) ∈ PR(t)         k̂r − m̂r               −k̂r + m̂r
(vi, vj) ∉ PR(t)         k̂u − m̂u               −k̂u + m̂u

Since there is influence in the data, the probability of agreement is higher for related pairs and the probability of disagreement is higher for unrelated pairs. Thus

k̂r > k̂u and m̂r < m̂u. Subsequently, the change to the a, d diagonal is positive:

Δad = (k̂r − m̂r) + (−k̂u + m̂u)
    = (k̂r − k̂u) + (m̂u − m̂r) > 0

And the change to the b, c diagonal is negative:

Δbc = (k̂u − m̂u) + (−k̂r + m̂r)
    = (k̂u − k̂r) + (m̂r − m̂u) < 0

However, the net change to N and each marginal is 0 (since k = m, k̂r + k̂u = k, and m̂r + m̂u = m). Let [ad]′ = (a + k̂r − m̂r)(d + m̂u − k̂u). From the above we know that [ad]′ > ad. Similarly, let [cb]′ = (c + k̂u − m̂u)(b + m̂r − k̂r). Again, from the above we know that [cb]′ < cb. Therefore:

C(Xt+1, Gt) = ([ad]′ − [cb]′)²·N / [(a + b)(c + d)(b + d)(a + c)]
            > C(Xt, Gt)

Delta Homophily

Just as with the case of social influence, the presence of homophily in the network will increase the observed autocorrelation values over time, and given the definition of homophily it can be proven that the chi-squared score will increase.

Theorem 5.2.2 Let PR(t) and PR(t+1) be the sets of related pairs at time t and t + 1 respectively and let Xt be the attributes at time t. Let k = |L^ij_t+1| be the number of link additions from t to t + 1. Let m = |U^ij_t+1| be the number of link dissolutions from t to t + 1, and let k = m. Then if a homophily effect is present in the data, the autocorrelation will increase when one considers the link changes from time t to time t + 1:

C(Xt, Gt+1) > C(Xt, Gt)

The proof of this theorem follows the same form and argument as that of Theorem 5.2.1. Now that we have illustrated the connection between gains in autocorrelation and the presence of influence/homophily, we define the correlation decomposition problem as follows. Given two sequential graph observations Gt and Gt+1 and their attributes Xt

and Xt+1, determine (1) if the difference between C(Xt, Gt+1) and C(Xt, Gt) represents a significant presence of homophily, and (2) if the difference between C(Xt+1, Gt)

and C(Xt, Gt) represents a significant presence of influence. In the next section, we will describe a new randomization testing approach to evaluating these two hypotheses.

5.3 Method

Randomization tests are a model-free, computationally-intensive statistical technique for hypothesis testing [94]. The tests generate many replicates of an actual data set, typically called pseudosamples, and use the pseudosamples to estimate a score distribution. Pseudosamples are generated by randomly reordering (or permuting) the values of one or more variables in an actual data set. A score is then calculated for each pseudosample, and the distribution of these randomized scores is used to estimate a sampling distribution for the score statistic under the null hypothesis. The value of the observed score on the original data is then compared to the distribution of scores on the randomized pseudosamples, and if it is significantly higher (or lower) than this distribution, the observed score is deemed significant. Figure 5.3 illustrates the randomization test procedure. Null distribution examples are constructed by permuting the original graph, then a hypothesis test using the original as the test point is performed. If the permutation is done in a way that holds constant the network properties other than the one being tested, then those properties will not be confounding factors. In contrast to conventional hypothesis tests, randomization tests make a relatively small number of assumptions about the data. For example, randomization tests make no assumptions about the form of the distributions from which variable values are drawn. In addition, they can be used to form sampling distributions for estimators whose precise statistical properties are not known. The key issue in developing a randomization test is to formulate an appropriate null hypothesis and permute the data in a way that accurately reflects it.


Figure 5.3.: Randomization testing procedure. Semi-synthetic graphs are created by permuting the original to construct a null distribution.

Figures 5.4-5.5 outline the significance testing algorithm that we employ throughout this chapter. The significance tests determine whether an observed gain in autocorrelation is significant by using a randomization test to estimate an empirical sampling distribution of the gains that would be expected if the change in links (attributes) were random and thus not due to homophily (influence). The empirical sampling distribution is estimated from the gains observed in pseudosamples generated by the randomization procedure. The method compares the observed gain value to the empirical sampling distribution; if the value is higher than (1 − α/2)% (or lower than α/2%) of the scores observed in the randomized data, the gain is deemed to be significant and the null is rejected. The gain in autocorrelation from one time step to the next can be due to: (1) homophily gains due to friend changes in t + 1, or (2) influence gains due to changes in attributes in t + 1. To separate the effects of influence and homophily, we define two different randomization tests to use as the Randomize() method inside the significance test.

HomophilySigTest(Gt, Gt+1, Xt, Xt+1, numIters, α):

  ΔR = ∅
  // compute original gain
  ΔO = C(Xt, Gt+1) − C(Xt, Gt)
  // randomize
  for iter in 1..numIters do
    G′t+1 = Randomize(Gt, Gt+1)
    Compute Δr = C(Xt, G′t+1) − C(Xt, Gt)
    ΔR = ΔR ∪ {Δr}
  end for
  // test significance of delta
  if ΔO > the (1 − α/2) critical value of ΔR then
    Return significant/positive
  else if ΔO < the α/2 critical value of ΔR then
    Return significant/negative
  else
    Return not significant
  end if

Figure 5.4.: Homophily significance test method.

InfluenceSigTest(Gt, Gt+1, Xt, Xt+1, numIters, α):

  ΔR = ∅
  // compute original gain
  ΔO = C(Xt+1, Gt) − C(Xt, Gt)
  // randomize
  for iter in 1..numIters do
    X′t+1 = Randomize(Xt, Xt+1)
    Compute Δr = C(X′t+1, Gt) − C(Xt, Gt)
    ΔR = ΔR ∪ {Δr}
  end for
  // test significance of delta
  if ΔO > the (1 − α/2) critical value of ΔR then
    Return significant/positive
  else if ΔO < the α/2 critical value of ΔR then
    Return significant/negative
  else
    Return not significant
  end if

Figure 5.5.: Influence significance test method.
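Rendered as runnable Python, the common skeleton of Figures 5.4 and 5.5 looks roughly like the sketch below; gain_fn and randomize_fn stand in for the chi-square delta computation and the choice-based permutation of the next subsection, and both names are our own.

  def significance_test(delta_obs, null_deltas, alpha=0.05):
      # Two-sided test against the empirical null distribution of gains.
      ordered = sorted(null_deltas)
      k = len(ordered)
      lo = ordered[int((alpha / 2) * k)]
      hi = ordered[min(k - 1, int((1 - alpha / 2) * k))]
      if delta_obs > hi:
          return "significant/positive"
      if delta_obs < lo:
          return "significant/negative"
      return "not significant"

  def randomization_test(delta_obs, gain_fn, randomize_fn,
                         num_iters=1000, alpha=0.05):
      # Each pseudosample is a permuted copy of the data; its gain is one
      # draw from the null distribution.
      null_deltas = [gain_fn(randomize_fn()) for _ in range(num_iters)]
      return significance_test(delta_obs, null_deltas, alpha)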

As noted above, the key issue in developing a randomization test is to formulate an appropriate null hypothesis and permute the data in a way that accurately reflects it. We formulate three null hypotheses with respect to homophily and influence:

• H_0^H: link changes are random and are not due to attribute values in t (i.e., no homophily effect)

• H_0^I: attribute changes are random and are not due to friends in t (i.e., no social influence effect)

• H_0^F: both attribute and link changes are random (i.e., neither a homophily nor an influence effect)

To identify possible permutations for these null hypotheses, consider four types of data changes that can occur in the data from time t to time t +1:

Edge additions: Δ_E^+ = {eij : eij ∈ Et+1 ∧ eij ∉ Et}

Edge deletions: Δ_E^− = {eij : eij ∈ Et ∧ eij ∉ Et+1}

Attribute additions: Δ_X^+ = {x^v : x^v ∈ Xt+1 ∧ x^v ∉ Xt}

Attribute deletions: Δ_X^− = {x^v : x^v ∈ Xt ∧ x^v ∉ Xt+1}

Note that attribute value changes can be easily modeled as an addition/deletion pair. Clearly, homophily will impact edge changes and influence will affect attribute changes.

For the null hypothesis concerning homophily (H_0^H), the goal is to randomize the edge changes to remove any association with attribute values in t. To do this, one can randomize the choice of edge target so that it does not depend on the attributes of the source node. For example, if node i adds a link to node j at time t + 1, then one can maintain the edge addition in t + 1 but randomize the choice of target node j to replace eij with eij′, so that any association due to attribute similarity between i and j is destroyed. However, to ensure that the degree of j′ remains the same after randomization, j′ must have been part of an edge addition ekj′ in the original set. The randomization procedure for edges can be thought of as swapping the endpoints of edge additions/deletions such that each node has the same number of additions and deletions in the randomized set, but the partners of those links have changed.

Randomization for the null hypothesis concerning influence (H_0^I) follows a similar procedure, swapping attribute adoptions and abandonments between nodes and thereby removing any influence of the edges in t. If node i adds an attribute value x at time t + 1, then one can maintain the attribute addition in t + 1 but randomize the choice of value to replace x with x′, so that any similarity of the attribute value x with the attribute values of i's linked friends in t is destroyed. We call the procedures based on this form of randomization choice-based methods, since they randomize the results of choices (attribute/link changes). Figures 5.6-

5.7 outline the specifics of the choice-based method for H_0^H (homophily) and H_0^I (influence) respectively. One can combine the two methods, randomizing both the

attribute and link changes, to estimate a distribution for H_0^F. However, calculating choice-based randomizations is non-trivial. A particular target edge or attribute can be selected for swapping only if it has not been selected before, and nodes cannot add edges or attributes they already had in time step t, nor can they drop edges or attributes they lack in t. Given these sets of constraints, it may be difficult to find a valid random assignment (apart from the original), which is problematic for a test that depends on generating a distribution of random pseudosamples. We address this issue by taking a greedy assignment approach. First, collate the edge and attribute changes such that all additions and deletions for a node or attribute can be decided at once. Then, sort the nodes and attributes from those with the fewest random options to those with the most. Random options here refers to the amount of freedom a node has when selecting additions or deletions and is given by the number of available selections

Randomize_choice^hom(Gt, Gt+1):

  Δ_E^+ = {eij ∈ Et+1 ∧ eij ∉ Et} (added links in t + 1)
  Δ_E^− = {eij ∈ Et ∧ eij ∉ Et+1} (dropped links in t + 1)
  // targets for random selections
  T^+ = Δ_E^+
  T^− = Δ_E^−
  // randomize
  for eij ∈ Δ_E^+ do
    Randomly select ekj′ ∈ T^+, where j′ ∉ E^i_t
    Replace eij in G′t+1 with eij′
    Remove ekj′ from T^+
  end for
  for eij ∈ Δ_E^− do
    Randomly select ekj′ ∈ T^−, where j′ ∈ E^i_t
    Add eij to G′t+1
    Remove eij′ from G′t+1
    Remove ekj′ from T^−
  end for
  Return (G′t+1)

Figure 5.6.: Choice-based randomization method for assessing homophily.

Randomize_choice^inf(Xt, Xt+1):

  Δ_X^+ = {x^v ∈ Xt+1 ∧ x^v ∉ Xt} (added attributes in t + 1)
  Δ_X^− = {x^v ∈ Xt ∧ x^v ∉ Xt+1} (dropped attributes in t + 1)
  // targets for random selections
  T^+ = Δ_X^+
  T^− = Δ_X^−
  // randomize
  for x^v ∈ Δ_X^+ do
    Randomly select x′^u ∈ T^+, where x′ ∉ X^v_t
    Replace x^v in X′t+1 with x′^v
    Remove x′^u from T^+
  end for
  for x^v ∈ Δ_X^− do
    Randomly select x′^u ∈ T^−, where x′ ∈ X^v_t
    Add x^v to X′t+1
    Remove x′^v from X′t+1
    Remove x′^u from T^−
  end for
  Return (X′t+1)

Figure 5.7.: Choice-based randomization method for assessing influence.

minus the number of assignments needed. This value can be calculated for edge additions as |{ekj′ ∈ T^+ : j′ ∉ E^i_t}| − |{eij ∈ Δ_E^+}|, and similar values can be calculated for the deletions, as well as for the attribute cases. The assumption in the greedy approach is that nodes and attributes with many random options at the start are unlikely to run out of available options even if their assignments are decided later in the algorithm. However, the greedy approach cannot guarantee that a node or attribute will have a valid random option at the point in the algorithm at which it is assigned. If this is the case, the particular node or attribute retains its original assignment. This will not preserve the degree of nodes and attributes in the original data (since those original assignments may already have been given to others in the randomization). However, since the assignment for that node/attribute is identical to the original, it will prevent Type I errors from occurring. Ideally, after assigning additions and deletions to a node or attribute one could recompute the random options available to the remaining nodes/attributes and re-sort the list of edges/deletions. However, recomputing in this manner during the algorithm is computationally very expensive and did not provide an increase in performance in practice, so we do not consider it further.
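In Python, the core swap for edge additions might look like the sketch below: a simplified, unoptimized rendering of Figure 5.6 that skips the greedy ordering and, as described above, falls back to the original assignment when no valid swap remains. All names are our own.

  import random

  def randomize_edge_additions(added, edges_t, seed=0):
      # added: list of (source, target) edge additions in t+1;
      # edges_t: set of edges existing at time t.
      rng = random.Random(seed)
      pool = [j for (_, j) in added]  # targets available for swapping
      rng.shuffle(pool)
      randomized = []
      for (i, j_orig) in added:
          pick = None
          for idx, j in enumerate(pool):
              # The randomized edge must be new: not a self-loop and not
              # already present at time t.
              if j != i and (i, j) not in edges_t and (j, i) not in edges_t:
                  pick = pool.pop(idx)
                  break
          # Fall back to the original target when stuck (guards Type I error).
          randomized.append((i, pick if pick is not None else j_orig))
      return randomized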

5.3.1 Calculating the Expected Chi-Squared Gain

In addition to determining the significance of each effect, the sampling distributions from the randomization tests also provide an estimate of the expected gain for each effect independently (where the full randomization is used to remove any joint effects from the estimation of the gains due to homophily and influence):

E[gain_hom] = μ_{gains_{R^I}} − μ_{gains_{R^F}}

E[gain_inf] = μ_{gains_{R^H}} − μ_{gains_{R^F}}

E[gain_oth] = μ_{gains_{R^F}}

Define the overall expected gain as a function of these independent components to compare the proportion of gain due to each effect:

E[gain_obs] = E[gain_hom] + E[gain_inf] + E[gain_oth]
            = [μ_{gains_{R^I}} − μ_{gains_{R^F}}] + [μ_{gains_{R^H}} − μ_{gains_{R^F}}] + μ_{gains_{R^F}}
            = μ_{gains_{R^I}} + μ_{gains_{R^H}} − μ_{gains_{R^F}}
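The decomposition is simple arithmetic once the three null means are in hand; a hypothetical helper:

  def expected_gains(mu_RI, mu_RH, mu_RF):
      # mu_RI, mu_RH, mu_RF: mean gains under the influence-randomized,
      # homophily-randomized, and fully randomized null distributions.
      gain_hom = mu_RI - mu_RF
      gain_inf = mu_RH - mu_RF
      gain_oth = mu_RF
      return gain_hom, gain_inf, gain_oth  # sums to mu_RI + mu_RH - mu_RF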

5.4 Synthetic Data Experiments

This Section describes an investigation of the characteristics of the proposed randomization tests on semi-synthetic data. There are two properties of the statistical tests that need to be examined: (1) Type I error: the probability of rejecting a true null hypothesis (i.e., incorrectly concluding that there is a significant effect when there is not); (2) Power: the probability of rejecting a false null hypothesis (i.e., correctly concluding that there is a significant effect when there is one). If a statistical test has elevated levels of Type I error, many of the conclusions drawn from the test may be incorrect. In contrast, if a statistical test has low statistical power, legitimate effects may not be detected as significant. To create synthetic data, we start with a base of real-world social network data and use the distributions of observed changes to generate posterior data with different characteristics. On data with random changes we evaluate the Type I error of the tests; on data with simulated homophily/influence effects we evaluate their statistical power. The results show that the proposed tests have low Type I error and that power increases as the number of data changes increases.

5.4.1 Data

Using a small subset of the Facebook network, comprising 56,000 Purdue students and 3 million communications collected in March 2008 (see Section 5.6 for more details), as the basis for time t, we generated semi-synthetic data sets for time t + 1 designed to either maximize or minimize the presence of homophily or influence effects. The synthetic data in time t + 1 uses the distribution of changes in the original Facebook sample (i.e., the number of adds/drops per person); however, our data generation procedure chooses a new set of changes to ensure the data have certain characteristics (e.g., homophily).

More specifically, we hold Gt and Xt constant but generate new data for Gt+1 and

Xt+1 to create three types of datasets:

Random: Changes are made randomly so there is no homophily or influence effect.

Homophily-rich: Attribute changes are made randomly; link changes are designed to maximize homophily.

Influence-rich: Link changes are made randomly; attribute changes are designed to maximize influence.

To generate data with homophily, link additions are chosen to maximize similarity among the incident nodes. When selecting a new link, the following weight is assigned to each possible link: 1 + γ(#overlaps), where #overlaps is the number of attribute values that the nodes share. The link weights are then normalized over all possible new links (pairs of unlinked nodes) to produce a probability of selecting any given link. These probabilities are used to randomly select the appropriate number of link additions, weighting the likelihood heavily toward similar pairs of nodes. Dropping links is done in a similar manner, except that the probability is calculated across current links in t and the inverse is used to drop links among the pairs of nodes that are most dissimilar.

To generate data with influence, we add attribute values in a similar way. Here we again compute a weight for each attribute value, 1 + γ(#overlaps), but #overlaps is defined to be the number of neighbors who have the attribute value under consideration. Likewise, the inverse is used to determine which attribute values are dropped.
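The weighted selection can be sketched in a few lines; for brevity this hypothetical helper samples with replacement, whereas the generation procedure above selects distinct links.

  import random

  def weighted_link_sample(candidates, overlaps, n_add, gamma=100.0, seed=0):
      # candidates: list of unlinked node pairs; overlaps[pair]: number of
      # shared attribute values. Weight 1 + gamma*overlaps favors similar pairs.
      rng = random.Random(seed)
      weights = [1.0 + gamma * overlaps[pair] for pair in candidates]
      return rng.choices(candidates, weights=weights, k=n_add)

  pairs = [(0, 1), (0, 2), (1, 2)]
  ov = {(0, 1): 3, (0, 2): 0, (1, 2): 2}
  print(weighted_link_sample(pairs, ov, n_add=2))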

5.4.2 Methodology

Type I errors correspond to cases when the null hypothesis is incorrectly rejected: in other words, false positive assessments of significance, when there is in fact no significant homophily/influence effect. To estimate the Type I error rate for each of the tests, we generated data with random changes in time t + 1. Thus any observed gain in C is entirely random, so any assessment of significance will correspond to a Type I error. To evaluate the Type I error of the tests, we used the procedure outlined in Figure 5.8. Type II errors correspond to cases when the null hypothesis is incorrectly accepted: in other words, false negative assessments of significance, when there is in fact a significant homophily/influence effect. Power is the complement of the Type II error rate, the proportion of significant effects that are correctly identified (1 − P(TypeII)). To estimate the Type II error of the tests, we generated data for time t + 1 with either homophily or influence effects. Thus all observed gains in C should be deemed significant, and any test that fails to do so will correspond to a Type II error. To evaluate the statistical power of the tests, we used the procedure outlined in Figure 5.9.

5.5 Experimental Results

We evaluated the Type I error rates of each test using semi-synthetic data where the attribute and link changes are made at random. The power of the homophily and influence tests was evaluated using the Homophily-rich and Influence-rich data, respectively. For these experiments, unless otherwise noted, we used N = 20, α =

TypeIError(Gt, Xt, ΔG, ΔX, N, α):

  for i in 1..N do
    Generate G^i_t+1, X^i_t+1 with random changes
    numIncorrectSigTests = 0
    for j in 1..N do
      SignificanceTest(Gt, Xt, G^i_t+1, X^i_t+1, N, α)
      if significant then
        numIncorrectSigTests++
      end if
    end for
    typeIError(i) = (1/N)·numIncorrectSigTests
  end for
  avgTypeIError = (1/N)·Σ_i typeIError(i)

Figure 5.8.: Method to measure Type I error rate.

Power(Gt, Xt, ΔG, ΔX, N, α):

  for i in 1..N do
    Generate G^i_t+1, X^i_t+1 with homophily/influence
    numIncorrectSigTests = 0
    for j in 1..N do
      SignificanceTest(Gt, Xt, G^i_t+1, X^i_t+1, N, α)
      if not significant then
        numIncorrectSigTests++
      end if
    end for
    typeIIError(i) = (1/N)·numIncorrectSigTests
  end for
  avgPower = (1/N)·Σ_i (1 − typeIIError(i))

Figure 5.9.: Method to measure statistical power.

0.05, γ = 100, where N is the number of synthetic graphs generated, and report average rates over 50 attribute values (i.e., group memberships). The first column of Table 5.3 reports the Type I rates of the choice-based randomization test. Since we used α = 0.05, we expect the error rates to be equal to 0.05.

Table 5.3.: Type I error and power for the choice-based randomization method.

                  P(TypeI)               Power                 Power
                  Rand. ΔX, Rand. ΔG     Rand. ΔX, Hom. ΔG     Infl. ΔX, Rand. ΔG
Homophily test    0.035                  0.57                  na
Influence test    0.04                   na                    0.76

This is indeed the case for both the homophily and the influence tests, indicating that the tests are likely to be accurate in practice. Next we evaluated statistical power. In this case we created data with only the effect we were attempting to identify: (1) random attribute changes with homophily-based link changes, or (2) random link changes with influence-based attribute changes. The maximum powers achieved with these tests are shown in columns 2 and 3 of Table 5.3, and Figure 5.10 shows the interaction between the parameters of the semi-synthetic data and the statistical power achieved. The power of the initial test for identifying homophily was 0.57, meaning an effect was correctly identified 57% of the time on data generated to include homophily. The average power for detecting influence on our initial semi-synthetic datasets was significantly lower at 0.20, the lowest point shown in Figure 5.10(a). We conjectured that this was due to the presence of fewer changes in attribute values in the original data, compared to changes in the link structure. In particular, many fewer attribute values were being added: in the Facebook sample, the average number of link additions per node was 4.08, while the average number of attribute additions was 0.26 for the same set. As we increase the number of additions for each node in the graph, the power increases (shown in Figure 5.10(a)), reaching a value of 0.65 at 10 additional attribute additions. This is partially due to an increase in effect size (i.e., an increase in the overall level of influence), but the quantity of attribute changes adds an additional effect, over and above any increase in correlation gain due to increased influence.

Figure 5.10.: Power analysis of influence and homophily randomization tests. (a) Influence test power, as the number of group additions increases (x-axis: boost to attribute additions per node). (b) Influence test power, as effect size increases (x-axis: chi gain). (c) Homophily test power, as effect size increases (x-axis: chi gain).

The synthetic data generates homophily and influence by selecting link and attribute changes such that the correlation between node neighbors is maximized. If no such changes are available, there will be less homophily and influence present in the synthetic data, which degrades the power of the randomization tests. In addition, the distribution after randomizing will be closer to the original data when fewer changes can be randomized, which also reduces the power of the tests. Figures 5.10(b)-5.10(c) show the change in statistical power as we systematically increase or decrease the effect size for either influence or homophily. Specifically, we varied γ to change the probability of selecting links and groups that produce autocorrelation in the synthetic data. The influence test had 10 additional attribute additions, as described previously. For the influence test we used γ values of 1, 10, 20, 50, and for the homophily test, values of 1, 10, 50, 100. We then plotted the expected power of the tests against the chi gain of the groups. Greater chi gain places a group further from the null distribution, increasing the effect size. As expected, the power of each test decreases as the effect size decreases, dropping from the maximum values of 0.57 and 0.76 down to minimums of 0.035 and 0.04.

5.6 Real Data Experiments

5.6.1 Data

We evaluated our approach on data from the public Purdue Facebook network. Facebook is a popular online social network site with over 250 million members worldwide. Members create and maintain a personal profile page, which contains information about their views, interests, and friends, and can be listed as private or public. Friendship links are undirected and are formed through an invitation by one user along with a confirmation by the other. To be affiliated with a University network, users must have a valid email account within the appropriate domain (e.g., purdue.edu); the members thus consist of students, faculty, staff, and alumni. The network we considered comprised more than 3 million public friendship links among 56,000 members. Users had an average and median degree of 46 and 81 respectively. The friendship links and group memberships were collected in both March 2008 and March 2009, which are used for t and t + 1. In addition to the friendship links, we considered a set of attributes corresponding to public group memberships. Group membership information is posted on the users' profile pages. Each "group" maintains a separate page reflecting some interest (e.g., friends of AAAI), and users who share that interest can become members of the group. For this work, we considered the set of 2648 (public) Facebook users belonging to the class of 2011 student network. For the first time step, we used the friendship links and group memberships from March 2008; for the second, those from March 2009. The students in this sample belong to 494 groups, so we consider each group membership as a binary attribute that can change from one time step to the next.

Table 5.4.: Number of groups detected by randomization tests

                            Influence
                            Significant    ¬Significant
Homophily   Significant          7             111
            ¬Significant        25             351

5.6.2 Experimental Results

To investigate homophily and influence in Facebook, we computed the observed correlation gain for each group membership attribute from t = 2008 to t + 1 = 2009. We then applied the choice-based randomization procedure to determine whether the gains exhibited significant homophily and/or influence effects. Due to the low Type I error of the choice-based test, the discovered significant patterns are likely to be correct. However, the low power for influence identification (given the observed number of attribute changes in the sample) means that the influence test may not be able to detect all effects in the data. Of the 494 groups, notably 143 (29%) exhibited a significant correlation gain of some type. The assessments of significance are summarized in Table 5.4. Note that there are more groups that exhibit significant homophily effects (118) than significant influence effects (32). This is likely due to the larger number of link changes, which results in higher power for the homophily test. To explore the types of groups exhibiting each type of effect, we examined a set of group names selected at random from each cell in Table 5.4; Table 5.5 lists some examples from each category. Groups with significant homophily effects seem to include opportunities for members to meet in person. For example, Boiler Gold Rush is a freshman orientation program for Purdue where members are likely to meet, members of the Capture the Flag group presumably meet to play the game, and Levee Tan is a local tanning salon.

Table 5.5.: Example groups with each possible combination of effect.

Homophily and Influence (7 groups):
    Purdue Habitat for Humanity
    Tell 10 to Tell 10
    Levee Tan
    I started doing homework but I ended up on Facebook

Homophily Only (111 groups):
    Purdue Capture the Flag
    Honors Engineering Community 2007-2008
    Boiler Gold Rush 2008
    Purdue Opportunity Awards 2007-2008

Influence Only (25 groups):
    NOBAMA IN 08
    I bet I can still find 1,000,000 people who dislike George Bush
    Hokay, so here's the Earth
    i need numberss asap

No Effect (351 groups):
    I support Welcome Home
    We miss Cody Lehe
    4-H alumni
    Harrison Band

Groups with significant influence effects seem to have a political or activist aspect to them. This includes anti-Obama and anti-Bush groups as well as groups like Habitat for Humanity and Tell 10 to Tell 10, which is a breast cancer awareness group. Another group of note is i need numberss [sic] asap. This group was created by a user who had lost his phone and wanted his friends to post their phone numbers to the group wall. The members of this group already had friendship links between them and were joining the group to share phone numbers with other friends, which naturally produces a detectable influence effect.

5.7 Discussion

In this Chapter we have shown how randomization testing can use a delta chi-squared score to distinguish two network properties that are often confounded. By permuting only the edge selections or the attribute changes we can find the delta chi-squared score associated with both edge changes and attribute changes. As homophily is an edge selection property and social influence is an attribute change property, this lets us test each separately using the same statistic but different null distributions, avoiding the confounding of these two properties.

The networks in the two slices are not guaranteed to have the same number of edges or nodes; in fact it is rarely the case that the network size remains constant. The statistic used, the chi-squared score, is dependent on the number of entries in the contingency table, and the delta chi-squared will be biased if the sum of the table changes. The total number of entries is N^2, so if the number of nodes changes there will be some bias in the delta chi-squared score. However, the accuracy of the test is not affected by this bias in the test statistic. As the delta chi-squared scores of the null distribution are generated from permutations that approximately preserve the edge and node counts of the two network slices, the same bias will be present in the null distribution. While the null distribution will not be centered around zero, it is still appropriate to perform the hypothesis test using this null distribution and the test accuracy will not suffer. Properties that could normally confound a hypothesis test are not an issue if those properties are identical in both the null distribution and the test case.

However, permutation testing is not easy to apply in all cases. In this work we have assumed that the edge generation process is purely due to the homophilic preferences of the nodes and the attribute adoption process is purely due to influence. What if the hypotheses were to distinguish triangle preference from homophily? One would need a permutation that could re-select the edges preserving triangles or homophily while destroying the other, which requires a more complex model of the edge probabilities than random reassignment. This becomes even more difficult if other properties like the random walk distances or the centrality of nodes are to be preserved: currently there is no clear way to perform a network permutation while keeping these sorts of path properties constant. Rather than performing a very complex and very constrained permutation, it would be simpler if statistics were available that had a one-to-one correspondence with the processes they are meant to measure. In the next Chapter we will show that utilizing Size-Consistent Statistics for network hypothesis tests can provide this one-to-one correspondence under the proper conditions.

6 SIZE-CONSISTENT STATISTICS FOR ANOMALY DETECTION IN DYNAMIC NETWORKS

Table 6.1.: Glossary of terms

G_t : Observed graph at time t

V_t : Set of vertices in graph G_t, size N

W_t : Weighted adjacency matrix of G_t

|W_t| : Total weight of W_t

P^*_t : True distribution of edge weights in the underlying model, size |V^*| x |V^*|

A^*_t : Adjacency matrix of P^*_t; i.e. a^*_{ij,t} = I[p^*_{ij,t} > 0]

V^* : True vertex set of the underlying model, V_t ⊂ V^*

P_t : Renormalized distribution of edge weight on vertex set V_t, used to sample G_t

A_t : Adjacency matrix of P_t; i.e. a_{ij,t} = I[p_{ij,t} > 0]

|A_t| : Number of nonzero cells in the adjacency matrix

\hat{P}_t : Approximate distribution of edge weights estimated from G_t: \hat{p}_{ij,t} = w_{ij,t}/|W_t|

\hat{A}_t : Adjacency matrix of G_t; i.e. \hat{a}_{ij,t} = I[w_{ij,t} > 0]

\bar{p}^*_t : Mean value of any nonzero cell in P^*_t

\bar{p}_t : Mean value of any nonzero cell in P_t

\bar{\hat{p}}_t : Mean value of any nonzero cell in \hat{P}_t

\bar{p}^*_t|V_t : Mean value of the P^*_t cells that belong to vertex subset V_t

w_{row_i,t} : Total weight in row i of W_t

\bar{p}^*_{row,t} : Expected mass in any row of P^*_t

\bar{p}_{row,t} : Expected mass in any row of P_t

\bar{\hat{p}}_{row,t} : Expected mass in any row of \hat{P}_t

\bar{p}^*_{row,t}|V_t : Expected mass of rows in P^*_t, excluding any rows or cells that do not belong to V_t

6.1 Introduction

In this Chapter, we will focus on the task of anomaly detection in a dynamic network where the structure of the network is changing over time. For example, each time step could represent one day's worth of activity on an e-mail network or the communications of a computer network. The goal is then to identify any time steps where the pattern of those communications seems abnormal compared to those of other time steps.

This differs from the task in the previous Chapter: instead of distinguishing the effects of two network properties, we want to compare new graph examples versus older ones and see if the properties responsible for the new graphs differ from the "normal" ones that created the older examples. In addition, certain network properties - the ones controlling edge count, node count, and density - are changing constantly. As discussed in Chapter 4, these graph size properties are confounding factors for virtually all network statistics, making it difficult to test for changes in other properties.

The construction of the null distribution also differs from the previous Chapter. Before, the null was created through a permutation process that controlled the properties of the null distribution graphs. Here past observations of the dynamic network make up the null graphs rather than semi-synthetic examples, and as such the properties of those graphs cannot be strictly controlled like before. Instead, the approach will be to carefully define the test statistics such that they are robust to changes in the confounding factors - if the statistic values do not change as the confounding properties change, then no testing errors will occur. The flowchart in Figure 6.2 outlines this approach.

A typical real-world network experiences many changes in the course of its natural behavior, changes which are not examples of anomalous events. The most common of these is variation in the volume of edges. In the case of an e-mail network where the edges represent messages, the network could be growing in size over time or there could be random variance in the number of messages sent each day. The statistics used to measure the network properties are usually intended to capture some effect of the network other than simply the volume of edges: for example, the common clustering coefficient is a measure of transitivity, which is the propensity for triangular interactions in the network. However, statistics such as the clustering coefficient are Statistically Inconsistent as the size of the network changes - more or fewer edges/nodes change the output of the statistic even when the transitivity property is constant, making graph size a confounding factor. Statistical consistency and inconsistency are described in more detail in Section 6.3. Even on an Erdős-Rényi network, which does not explicitly capture transitive relationships through a network property, the clustering coefficient will be greater as the number of edges in the network increases, as more triangles will be closed due to random chance. When statistics vary with the number of edges in the network, it is not valid to compare different network time steps using those statistics unless the number of edges is constant in each time step.

Table 6.1 shows a glossary of terms that will be used throughout this Chapter.

Some, like the terms G_t, V_t, and W_t, are from the dynamic graph definitions used previously. The other terms will be explained as they are used throughout the Chapter. Figure 6.1 shows the effect of statistical (in)consistency. During the experiment, pairs of graphs were generated using a Chung-Lu generative model (described in Section 6.6) with a certain number of total edges. Subfigure (a) shows the values of a Size Consistent statistic called Probability Mass Shift (described in Section 6.4) calculated on pairs of graphs, while Subfigure (b) shows the same for the Netsimile statistic (described in a previous Chapter). Each black point shows the average value of 100 generated graph pairs while the red points are the minimum and maximum of these pairs. As the edge weight increases (x-axis) the statistically consistent Mass Shift (6.1a) maintains a consistent mean, whereas the statistically inconsistent Netsimile (6.1b) varies wildly, even though all graphs are generated from the same underlying model (Chung-Lu [79] with a power-law degree distribution).


Figure 6.1.: Statistic values for network data generated from the same model, but with increasing size. Behavior of (a) Consistent Statistic; (b) Inconsistent Statistic. Black pts: avg of 100 trials, red pts: [min, max].

In this Chapter, we will analytically characterize statistics by their sensitivity to network size, and offer principled alternatives that are consistent estimators of network behavior, which empirically give more accurate results when finding anomalies in dynamic networks with varying sizes. In terms of confounding effects this approach eliminates confounding by ensuring that the test statistics used do not vary when the confounding network properties change, bringing the statistics closer to the ideal one-to-one relationship with network properties.


Figure 6.2.: Controlling for confounding effects through careful definition of the test statistics.

The major contributions of this Chapter are:

• We define Size Inconsistent and Size Consistent properties of network statistics and show that Size Consistent statistics have fewer false positives and false negatives than Inconsistent statistics.

• We prove that several commonly used network statistics are Size Inconsistent and fail to capture the network behavior with varying network densities.

• We introduce provably Size Consistent statistics which measure changes to the network structure regardless of the total number of observed edges.

• We demonstrate that our proposed statistics converge quickly and have superior ROC performance compared to conventional statistics.

6.2 Problem Definition and Data Model

Let G = {V, W} be a weighted graph that represents a network, where V is the node set and W is the weighted adjacency matrix representing messages or some other interaction, with w_{ij} the number of messages between nodes i and j. Let |V| and |W| refer to the number of nodes, and total weight of the edges, in G respectively. A dynamic network is simply a set of graphs {G_1, G_2, ..., G_T} where each graph represents network activity within a consistent-width time step (e.g. one step per day).

Problem definition: Given a stream of graph examples {G_1, G_2, ..., G_{t-1}} drawn from a normal model M^n, and a graph G_t drawn from an unknown model, determine if G_t was drawn from M^n or some alternative model M^a.

Given an observed graph, we wish to decide if this graph exhibits the same behavior (network properties) as past graph examples or if it is likely the product of some different, anomalous properties. We will be solving this problem with hypothesis tests utilizing network statistics as the test statistics. If S_k(G) is some network statistic designed to measure a network property k, then the set of statistics calculated on the normal examples {S_k(G_1), S_k(G_2), ..., S_k(G_{t-1})} forms the empirical null distribution, and the value S_k(G_t) is the test point. For this work we will use a two-tailed test with a p-value of α = 0.05. Anomalous test cases where the null hypothesis is rejected correspond to true positives; normal cases where the null hypothesis is rejected correspond to false positives. Likewise anomalous cases where the null is not rejected correspond to false negatives and normal cases where the null is not rejected correspond to true negatives. The anomaly detection procedure is summarized in Figure 6.3 from model down to null distribution and test point.
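A minimal Python sketch of this two-tailed empirical test follows; the helper name and the synthetic numbers are illustrative assumptions, not part of the dissertation's own tooling:

    import numpy as np

    def two_tailed_test(null_scores, test_score, alpha=0.05):
        # Critical points reject the most extreme alpha/2 mass on each side
        # of the empirical null distribution.
        lower = np.quantile(null_scores, alpha / 2)
        upper = np.quantile(null_scores, 1 - alpha / 2)
        return test_score < lower or test_score > upper  # True = anomaly flag

    # Example: 50 null statistic values S_k(G_1)...S_k(G_50), one test point.
    rng = np.random.default_rng(0)
    null = rng.normal(0.5, 0.05, size=50)
    print(two_tailed_test(null, 0.9))   # True: far outside the empirical null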

If all the graph examples have the same number of edges and nodes then graph size cannot be a confounding factor regardless of the choice of test statistic: those properties are naturally controlled in the data. However, if M^n and M^a produce graphs with a variable number of edges and nodes then any test statistic needs to be robust to changes in the graph size. Ideally, if G_x, G_y ∼ M but |V_x| ≠ |V_y|, |W_x| ≠ |W_y|, we would still want S_k(G_x) ≈ S_k(G_y) to be true.

To accommodate the observations of graphs of varying size, let us assume the models that generated the observed graphs are hidden but take the form of a multinomial sampling procedure. Let P^* be a |V^*| × |V^*| matrix where the rows and columns represent a node set V^* and the sum of all cells is equal to 1. Here V^* represents a large set of possible nodes, i.e., larger than the set we may see in any one graph G. The entry p^*_{ij,t} specifies the probability that a randomly sampled message at time t is between i and j. Let |V| and |W| be drawn from distributions M_V and M_W.

Figure 6.3.: Dynamic network anomaly detection task. Given past instances of graphs created by the typical model of behavior, identify any new graph instance created by an alternative anomalous model.


Figure 6.4.: Graph generation process. The matrix P^* represents all possible nodes and their interaction probabilities. By sampling |V| nodes and |W| edges the observed graph G is obtained.

The full generative process for all graphs is then:

• Draw |V| ∼ M_V. Select V from V^* uniformly at random.

• Construct P by selecting the rows/columns from P^* that correspond to V and normalize the probabilities to sum to 1 (i.e., p_{ij} = \frac{1}{Z} p^*_{ij}, where Z = \sum_{ij \in V} p^*_{ij}).

• Draw |W| ∼ M_W. Sample |W| messages using probabilities P.

• Construct the graph G = (V, W) from the sampled messages.

G is the output of a multinomial sampling procedure on P, with each independent message sample increasing the weight of one cell in W. P itself is a set of probabilities obtained by sampling V from V^*. This graph generation process is summarized in Figure 6.4.
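As an illustration, here is a small Python sketch of this sampling procedure. It assumes P^* is available as a dense numpy array; the function name and the toy dimensions are hypothetical:

    import numpy as np

    def sample_graph(P_star, n_nodes, n_messages, rng):
        # Draw V uniformly at random from V*, then renormalize P over V.
        V = rng.choice(P_star.shape[0], size=n_nodes, replace=False)
        P = P_star[np.ix_(V, V)]
        P = P / P.sum()                     # probabilities sum to 1 on V
        # Draw |W| messages with a single multinomial sample.
        W = rng.multinomial(n_messages, P.ravel()).reshape(P.shape)
        return V, W

    rng = np.random.default_rng(0)
    P_star = rng.random((20, 20))
    np.fill_diagonal(P_star, 0)             # no self-messages
    P_star /= P_star.sum()
    V, W = sample_graph(P_star, n_nodes=10, n_messages=500, rng=rng)
    print(W.sum())                          # 500 sampled messages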

Given this generative process, the difference between normal and anomalous graphs is characterized by differences in their underlying models. Let the normal model be represented by P^{*n}, M_V^n and M_W^n and let the anomalous model be represented by P^{*a}, M_V^a and M_W^a. Finding instances where M_V or M_W are anomalous is trivial since we can use the count of nodes or messages as the test statistic. Finding instances of graphs drawn from P^{*a} is more difficult as our choice of network statistics affects whether we are sensitive to changes in |W| or |V|.

If we redefine our network statistics to be functions over P^* instead of G we avoid the problem of graph size as a confounding factor, as P^* is independent of M_V or M_W. However, since P^* is unobservable, there is no way to calculate S_k(P^*) directly. Instead we can only calculate S_k(G) from the observed graph G. S_k(G) is an estimate of S_k(P) using the sampled messages W to estimate the underlying probabilities, and S_k(P) itself is an estimate of the true S_k(P^*) on a subset V of the total nodes. So just as the sampling procedure follows P^* → P → G, the estimation procedure follows the inverse steps S_k(G) → S_k(P) → S_k(P^*).

Delta statistics like Graph Edit Distance can also be used for anomaly detection. In this case the empirical statistic will be S_k(G_1, G_2), where G_1 and G_2 are generated using the graph generation procedure described above, and the true value of the statistic is S_k(P_1^*, P_2^*). In order to be consistent, delta statistics should not change when either graph changes in size.

Ideally, S_k(G) = S_k(P) = S_k(P^*) and we would have the same output regardless of W and V, being sensitive only to changes in the model. However this is typically not attainable in practice as it is difficult to estimate the true statistic value from graphs that are extremely small: few edges and nodes provide less evidence of the underlying properties. In addition, an unbiased statistic with extremely high variance is also a poor test statistic. In many scenarios the best statistics are ones which converge to the value of S_k(P^*) as |V|, |W| increase, a property that we will refer to as Size Consistency. In the next section we will formally define the properties of Size Consistent and Size Inconsistent statistics and show how they affect the accuracy of hypothesis tests.

6.3 Properties of the Test Statistic

6.3.1 Size Consistency

As described previously, a statistic S_k(P^*) depends on the properties of the procedure that generated the graph instance and is a measure of the graph properties independent of the exact number of edges and nodes in the graph. Although the empirical statistic S_k(G) may not be independent of the edge and node count, if it converges to S_k(P^*) as |V| and |W| increase it is a reasonable approximation as long as |V| and |W| are large enough. The bias of the empirical statistic due to graph size is abs(S_k(P^*) - S_k(G)); if this bias converges to 0 as |V| and |W| increase then S_k(G) is Size Consistent.

Definition 6.3.1 A statistic S_k(G) is Size Consistent w.r.t. S_k(P^*) if:

\lim_{|W| \to \infty} S_k(G) - S_k(P) = 0
AND \lim_{|V| \to |V^*|} S_k(P) - S_k(P^*) = 0

Delta statistics have the same requirements for consistency as standard statistics, except that the edge and node counts of both graphs must be increasing.

Definition 6.3.2 A delta statistic S_k(G_1, G_2) is Size Consistent w.r.t. S_k(P_1^*, P_2^*) if:

\lim_{|W_1|, |W_2| \to \infty} S_k(G_1, G_2) = S_k(P_1, P_2)
AND \lim_{|V_1|, |V_2| \to |V^*|} S_k(P_1, P_2) = S_k(P_1^*, P_2^*)

Theorem 6.3.1 False Positive Rate for Size Consistent Statistics
Let {G_1...G_x} be a finite set of "normal" graphs drawn from P^*, M_W^n and M_V^n and let G_test be a test graph drawn from P^*, M_W^a and M_V^a. Let |W_min| be the minimum edge count in both {G_x} and G_test and |V_min| be the minimum node count. For a hypothesis test using a Size Consistent test statistic S_k and a p-value of α, as |W_min| → ∞ and |V_min| → |V^*| the probability of identifying G_test as a false positive approaches α.

Proof If S_k(G) is a consistent estimator of S_k(P), then as |W_min| → ∞, S_k(G) → S_k(P), and if S_k(P) is a consistent estimator of S_k(P^*) then as |V_min| → |V^*|, S_k(P) → S_k(P^*). If {G_1...G_x} and G_test are drawn from P^{*n}, then S_k(G_1)...S_k(G_x) and S_k(G_test) are converging to the same distribution of values and the hypothesis test will reject with the p-value probability of α.

As the number of edges and nodes drawn for the null distribution and test instance increase, the bias abs(S_k(P^*) - S_k(G)) of the statistic calculated on those networks converges to zero. This means that S_k(G) effectively becomes equal to S_k(P^*), and the outcome of the hypothesis test is only dependent on whether the test instance and null examples were both drawn from P^{*n} or if the test instance was drawn from P^{*a}. Even if the test case has an unusual number of edges or nodes, as long as the number of edges and nodes is not too small there will not be a false positive.

Size consistency is also beneficial in the case of false negatives. A statistic which is sensitive to changes in the edge or node count will produce a null distribution with high variance if M_V^n or M_W^n have high variance, which increases the chance of producing false negatives. A size consistent statistic will have less variance as |V| and |W| increase, so as long as the minimum outputs of M_V^n or M_W^n are not too small the variance will be negligible.

Theorem 6.3.2 False Negative Rate for Size Consistent Statistics

Let G_test be a network that is anomalous (i.e., drawn from P^{*a}, M_V^n, M_W^n) with respect to property k, and {G_1...G_x} be graph examples drawn from P^{*n}, M_V^n, M_W^n. Let |W_min| be the minimum of M_W^n and |V_min| be the minimum of M_V^n. As |W_min| → ∞ and |V_min| → |V^*| the probability of failing to reject G_test approaches 0 if P^{*n} ≠ P^{*a}.

Proof If S_k(G) is a consistent estimator of S_k(P), then as |W_min| → ∞, S_k(G) → S_k(P), and if S_k(P) is a consistent estimator of S_k(P^*) then as |V_min| → |V^*|, S_k(P) → S_k(P^*). If S_k(P^{*a}) ≠ S_k(P^{*n}), then as |W_min| → ∞ and |V_min| → |V^*| the statistic S_k(G_test) and the set of statistics S_k(G_1), S_k(G_2), ...S_k(G_x) converge to different values and G_test will be flagged as an anomaly with probability 1.

Now that we have investigated the effects of size consistency, we must look at the effects of its inverse.

6.3.2 Size Inconsistency

Size Inconsistency is the inverse of size consistency: if a statistic is not size con- sistent, then it is size inconsistent.

Definition 6.3.3 A statistic S_k(G) is Size Inconsistent w.r.t. S_k(P^*) if:

\lim_{|W| \to \infty} S_k(G) - S_k(P) ≠ 0
OR \lim_{|V| \to |V^*|} S_k(P) - S_k(P^*) ≠ 0

where S_k is a nontrivial function (a trivial function being one that is a constant, ∞, or -∞ for all input values). This definition also applies to the delta statistic case.

Theorem 6.3.3 False Positives for Size Divergent Statistics
Let {G_1...G_x} be a finite set of "normal" graphs drawn from P^*, M_W^n, and M_V^n and G_test be a test graph drawn from P^*. If S_k(G) diverges with increasing |W| or |V| and M_W^n, M_V^n have finite bounds, there is some |V| or |W| for which a hypothesis test using S_k(G) as the test statistic will incorrectly flag G_test as an anomaly.

Proof When the set of graphs {G_1...G_x} is used to estimate an empirical distribution of the null S_k, the distribution is bounded by max[S_k({G_1...G_x})] and min[S_k({G_1...G_x})], so the critical points φ_lower and φ_upper of a hypothesis test using this set of graphs will be within these bounds. Since an increasing |W_test| or |V_test| implies S_k(G_test) diverges, there exists a |W_test| or |V_test| such that S_k(G_test) is not within φ_lower and φ_upper and will be rejected by the test.

Size Inconsistency generally occurs when the value of a statistic is a linear function of the edge weight or the number of nodes in the graph: when the edge weight or the number of nodes goes to infinity, the output of the statistic also diverges. If a statistic is dependent on the size of a graph, then two graphs both drawn from P^{*n} may produce entirely different values and a false positive will occur. A second problem occurs when the edge counts in the estimation set have high variance. If the statistic is dependent on the number of edges, noise in the edge counts translates to noise in the statistic values, which lowers the statistical power (i.e. the percentage of true anomalies detected) of the test. With a sufficient amount of edge count noise, the signal is completely drowned out and the statistical power of the anomaly detector drops to zero.

Theorem 6.3.4 False Negatives for Size Divergent Statistics

Let G_test be a network that is anomalous (i.e., drawn from P^{*a}) with respect to property k. If S_k(G) diverges with increasing |W| or |V| there exists some M_W^n or M_V^n with sufficient variance such that a hypothesis test with p-value α and empirical null distribution S_k will fail to detect S_k(G_test) as an anomaly with probability 1 - α.

Proof Let G_test be the test network drawn from P^{*a}, M_W^n, and M_V^n, and {G_1...G_x} be the set of null distribution graphs drawn from P^{*n}, M_W^n and M_V^n. If S_k(G) is a divergent function of |W| or |V|, then the variance of the null distribution S_k estimated from {G_1...G_x} is dependent on the variance of |W| and |V|. If the variance of sampled |W| or |V| is sufficiently large, the variance of S_k will increase to cover all possible S_k(G) values, and the test instance will fail to be flagged as an anomaly with probability 1 - α.

With a sufficient amount of edge count noise, the statistical power of the anomaly detector drops to zero. Regardless of whether a time step is an example of an anomaly or not, if the variance of the test statistic is dominated by random noise in the edge count, the time step will only be flagged due to random chance.

These theorems show that variation in the number of edges or nodes can lead to both false positives and false negatives from anomaly detectors that look for unusual network properties other than edge count. These theorems have been defined using a statistic calculated on a single network, but some statistics are delta measures which are measured on two networks. In these cases, the edge counts of either or both of the networks can cause edge dependency issues.

6.4 Network Statistics

In this section we introduce our set of proposed size consistent statistics, and analyze multiple existing statistics to determine if they are size consistent or inconsistent. These properties are summarized in Table 6.2; Fast Convergence indicates fewer necessary edge/node observations to obtain a high level of accuracy.

6.4.1 Conventional Statistics

Graph Edit Distance The graph edit distance (GED) mentioned in Chapter 2 is often used in anomaly detection tasks. GED on a weighted graph is typically defined as:

GED(G_1, G_2) = |V_1| + |V_2| - 2|V_1 \cap V_2| + \sum_{ij \in V_1 \cup V_2} abs(w_{ij,1} - w_{ij,2})   (6.1)
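A direct Python sketch of Eq. 6.1, assuming weights are stored as dictionaries keyed by node pairs (an illustrative representation, not one prescribed by the text):

    def graph_edit_distance(V1, W1, V2, W2):
        # Node term: |V1| + |V2| - 2|V1 ∩ V2|
        node_term = len(V1) + len(V2) - 2 * len(V1 & V2)
        # Weight term: absolute weight differences over all observed pairs
        weight_term = sum(abs(W1.get(e, 0) - W2.get(e, 0))
                          for e in set(W1) | set(W2))
        return node_term + weight_term

    # Two small graphs sharing nodes 2 and 3:
    print(graph_edit_distance({1, 2, 3}, {(1, 2): 3},
                              {2, 3, 4}, {(1, 2): 1, (2, 3): 2}))  # 2 + 4 = 6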

Claim 1 GED is a Size Inconsistent statistic.

Consider the case where G_1 and G_2 are both drawn from P^*. Let |W_2| = |W_1| + W_Δ where W_Δ is some constant value. The expected difference in weights between two nodes i, j in G_2 versus G_1 is:

E[w_{ij,2} - w_{ij,1}] = (|W_1| + W_Δ) p_{ij} - |W_1| p_{ij} = W_Δ p_{ij}

Then, the limit as |W_1|, |W_2| increase is

\lim_{|W_1|, |W_2| \to \infty} GED(G_1, G_2) = \lim_{|W_1|, |W_2| \to \infty} |V| + |V| - 2|V \cap V| + \sum_{ij \in V} W_Δ p_{ij} = W_Δ

As GED(G_1, G_2) is converging to a constant value, it is not converging to a nontrivial S_k and the first condition of Size Consistency is violated.

Degree Distribution and Degree Dist. Difference As defined before, the Degree Distribution of a graph is the distribution of node degrees. In this task we will find the difference between the degree distributions of two graphs using a delta statistic. Define the delta statistic Degree Distribution Difference DD(G_1, G_2) between two graphs as:

DD(G_1, G_2) = \sum_{bin_k \in Bins} \Big( \sum_{i \in V_1} I[D_i(G_1) \in bin_k] - \sum_{i \in V_2} I[D_i(G_2) \in bin_k] \Big)^2   (6.2)

where Bins is a consecutive sequence of equal size bins which encompass all degree values in both graphs. Note that this value is an approximation of the Cramér-von Mises criterion between the two empirical degree distributions. Let the probabilistic degree of node i be D_i(P^*) = \sum_{j \neq i \in V^*} p^*_{ij}. Then let the value of DD(P_1^*, P_2^*) be:

Table 6.2.: Statistical properties of previous network statistics and our proposed alternatives.

                          Inconsist.   Consist.   Fast Convergence
Mass Shift                                 ✓              ✓
Probabilistic Degree                       ✓              ✓
Triangle Probability                       ✓              ✓
Graph Edit Distance            ✓
Degree Distribution            ✓
Barrat Clustering                          ✓
Netsimile                      ✓
Deltacon                                   ✓

DD(P_1^*, P_2^*) = \sum_{bin_k \in Bins} \Big( \sum_{i \in V^*} I[D_i(P_1^*) \in bin_k] - \sum_{i \in V^*} I[D_i(P_2^*) \in bin_k] \Big)^2   (6.3)

Claim 2 DD(G1,G2) is a Size Inconsistent statistic.

Let G_1, G_2 be drawn from P^* using the same node set V and let |W_2| = |W_1| + W_Δ for some constant W_Δ. As |W_1|, |W_2| increase, the value of D_i(G_2) - D_i(G_1) converges to

(|W_1| + W_Δ) \sum_{j \neq i \in V} p^*_{ij} - |W_1| \sum_{j \neq i \in V} p^*_{ij} = W_Δ \sum_{j \neq i \in V} p^*_{ij}   (6.4)

So for sufficiently large W_Δ, at least one node will be placed into a higher bin for G_2 versus G_1, and the limit \lim_{|W_1|,|W_2| \to \infty} DD(G_1, G_2) is not equal to DD(P^*, P^*). This violates the first condition of Size Consistency and therefore the Degree Distribution Difference is Size Inconsistent.

Other measures create aggregates using the degrees of multiple nodes [19,60] but as the degree is size inconsistent these aggregates tend to be so as well.

Weighted Clustering Coefficient Clustering coefficient is a measure of transitivity, the propensity to form triangular relationships in a network. As the standard clustering coefficient defined in Chapter 2 is not designed for weighted graphs, we will analyze a weighted clustering coefficient, specifically the Barrat weighted clustering coefficient (C_B) [5]:

C_B(G) = \frac{1}{|V|} \sum_i \frac{1}{(k_i - 1) D_i(G)} \sum_{j,k} \frac{w_{ij} + w_{ik}}{2} a_{ij} a_{ik} a_{jk}   (6.5)

where a_{ij} = I[w_{ij} > 0], D_i(G) = \sum_j w_{ij}, and k_i = \sum_j a_{ij}. Other weighted clustering coefficients exist but they behave similarly to the Barrat coefficient.
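The following is a direct O(|V|^3) Python sketch of Eq. 6.5 for a small dense weight matrix; it is illustrative only and makes no attempt at the fast approximations discussed in Section 6.5.2:

    import numpy as np

    def barrat_clustering(W):
        A = (W > 0).astype(float)      # a_ij = I[w_ij > 0]
        D = W.sum(axis=1)              # D_i(G): weighted degree
        k = A.sum(axis=1)              # k_i: neighbor count
        n = W.shape[0]
        total = 0.0
        for i in range(n):
            if k[i] < 2 or D[i] == 0:
                continue               # nodes without two neighbors contribute 0
            s = 0.0
            for j in range(n):
                for h in range(n):
                    s += (W[i, j] + W[i, h]) / 2 * A[i, j] * A[i, h] * A[j, h]
            total += s / ((k[i] - 1) * D[i])
        return total / n

    W = np.array([[0, 2, 1], [2, 0, 3], [1, 3, 0]], dtype=float)
    print(barrat_clustering(W))        # 1.0 for a complete weighted triangle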

Claim 3 C_B is a Size Consistent statistic that converges to

C_B(P^*) = \frac{1}{|V^*|} \sum_i \frac{1}{(a^*_i - 1) \sum_j p^*_{ij}} \sum_{j,k} \frac{p^*_{ij} + p^*_{ik}}{2} a^*_{ij} a^*_{ik} a^*_{jk}.   (6.6)

First we will find C_B(P) by taking the limit as |W| → ∞:

\lim_{|W| \to \infty} C_B(G)
= \lim_{|W| \to \infty} \frac{1}{|V|} \sum_i \frac{1}{(a_i - 1) D_i(G)} \sum_{j,k} \frac{w_{ij} + w_{ik}}{2} a_{ij} a_{ik} a_{jk}
= \frac{1}{|V|} \sum_i \frac{1}{(a_i - 1) |W| \sum_j p_{ij}} \sum_{j,k} \frac{|W|(p_{ij} + p_{ik})}{2} a_{ij} a_{ik} a_{jk}
= \frac{1}{|V|} \sum_i \frac{1}{(a_i - 1) \sum_j p_{ij}} \sum_{j,k} \frac{p_{ij} + p_{ik}}{2} a_{ij} a_{ik} a_{jk}
= C_B(P)

Now we take the limit \lim_{|V| \to |V^*|} C_B(P):

\lim_{|V| \to |V^*|} C_B(P)
= \lim_{|V| \to |V^*|} \frac{1}{|V|} \sum_i \frac{1}{(a_i - 1) \sum_j p_{ij}} \sum_{j,k} \frac{p_{ij} + p_{ik}}{2} a_{ij} a_{ik} a_{jk}
= \lim_{|V| \to |V^*|} \frac{1}{|V|} \sum_i \frac{1}{(a_i - 1) \frac{1}{\sum_{ij \in V} p^*_{ij}} \sum_j p^*_{ij}} \cdot \frac{1}{\sum_{ij \in V} p^*_{ij}} \sum_{j,k} \frac{p^*_{ij} + p^*_{ik}}{2} a^*_{ij} a^*_{ik} a^*_{jk}
= \frac{1}{|V^*|} \sum_i \frac{1}{(a^*_i - 1) \sum_j p^*_{ij}} \sum_{j,k} \frac{p^*_{ij} + p^*_{ik}}{2} a^*_{ij} a^*_{ik} a^*_{jk}

since the renormalization constants in the numerator and denominator cancel.

Other weighted clustering coefficients such as those proposed by Onnela et al. [95] and Holme et al. [96] behave in a similar manner.

Deltacon The core element of the Deltacon statistic is the Affinity Matrix which is a measure of the closeness (in terms of random walk distance) between all nodes in a graph. Pairs of graphs with similar Affinity Matrices are scored as being more likely to be from the same distribution.

Claim 4 Deltacon is a Size Consistent statistic.

The Affinity Matrix S is approximated with Fast Belief Propagation and is estimated with S ≈ I + ε·A + ε^2·A^2, where A is the adjacency matrix and ε is the coefficient of attenuating neighbor influence. As |W| → ∞ and |V| → |V^*| the adjacency matrix A approaches A^*, which is the adjacency matrix of P^*, so the statistic does converge to the value given by the Affinity Matrix difference calculated on the true P^* graphs. However, this convergence will be slow in practice, as a small difference between A and A^* can cause large changes in the path lengths between nodes if the missing edges are critical bridges between graph regions.
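A Python sketch of this second-order approximation follows; the rootED (Matusita) comparison between affinity matrices is the distance Deltacon is built on, and the value of eps here is an arbitrary illustrative choice:

    import numpy as np

    def affinity(A, eps=0.1):
        # Fast Belief Propagation approximation: S ~ I + eps*A + eps^2*A^2
        return np.eye(A.shape[0]) + eps * A + eps**2 * (A @ A)

    def deltacon_distance(A1, A2, eps=0.1):
        S1, S2 = affinity(A1, eps), affinity(A2, eps)
        return np.sqrt(np.sum((np.sqrt(S1) - np.sqrt(S2)) ** 2))

    A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
    A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    print(deltacon_distance(A1, A2))   # larger when structures differ more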

Netsimile Netsimile is an aggregate statistic using multiple simple statistics to form descriptive vectors. These statistics include number of neighbors, clustering coefficient (unweighted), two-hop neighborhood size, clustering coefficient of neighbors, and total edges, outgoing edges, and neighbors of the egonet.

Claim 5 Netsimile is a Size Inconsistent statistic.

Statistics that use the raw edge count such as D_i(G) will not be consistent as shown earlier, so aggregates that use these types of statistics will also be inconsistent. The statistic uses the Canberra distance \frac{abs(S_k(G_1) - S_k(G_2))}{S_k(G_1) + S_k(G_2)} for each component statistic S_k as a form of normalization, but as the component statistics diverge to infinity the Canberra distance converges to 0 and the normalization is still inconsistent.

6.4.2 Proposed Size Consistent Statistics

We will now define a set of Size Consistent statistics designed to measure network properties similar to the previously described dependent statistics, but without the sensitivity to total network edge count. They will also be designed such that the empirical estimations converge to the true values as quickly as possible.


Figure 6.5.: Estimation of the \hat{P}_t matrix from the observed W_t weights.

These statistics use a matrix \hat{P}_t, where \hat{p}_{ij,t} = w_{ij,t}/|W_t|, which is an empirical estimate of P_t obtained by normalizing the weight matrix as shown in Figure 6.5. Obtaining this matrix can be thought of as a reversal of the sampling process shown in Figure 6.4. Although \hat{P}_t is only an estimate of P_t, it is an unbiased one, and given an increasingly large |W_t| it will eventually be exactly equal to P_t. Therefore, empirical statistics which use \hat{P}_t in place of P_t as their input will converge to the true statistic calculated on P_t, and the statistic converges w.r.t. |W_t|.

However, this does not guarantee that \hat{P}_t will converge in probability to P^*_t as the number of nodes in V_t increases. In fact, the value of any cell p_{ij,t} is inversely proportional to |V_t|: as both P_t and P^*_t are probability distributions which sum to 1, the more cells in either matrix, the lower the probability mass in each cell on average. This concentrating effect as V_t is sampled from V^*_t is demonstrated in Figure 6.6.


Figure 6.6.: Mass of the cells in P increases as |V| decreases.

The solution to avoiding this concentration of probability mass is to introduce normalizing terms which negate the effect. These terms are \bar{p}^*_t = \frac{1}{|A^*_t|} \sum_{ij \in V^*_t} p^*_{ij,t} for S_k(P^*_t) and the empirical term \bar{\hat{p}}_t = \frac{1}{|\hat{A}_t|} \sum_{ij \in V} \hat{p}_{ij,t} for S_k(\hat{P}_t), where |A^*_t| and |\hat{A}_t| are the number of nonzero cells in P^*_t and \hat{P}_t respectively. Replacing each p^*_{ij,t} and \hat{p}_{ij,t} term in a statistic with \frac{p^*_{ij,t}}{\bar{p}^*_t} and \frac{\hat{p}_{ij,t}}{\bar{\hat{p}}_t} ensures that the statistic also converges as |V_t| increases.

The utility of the \bar{p}^*_t and \bar{\hat{p}}_t terms is to normalize the probability mass concentration effect when the size of |V| changes. As P is a proper probability distribution and sums to a total of 1, decreasing |V| also causes the number of cells in P to decrease and the probability mass in each cell to rise (illustrated in Figure 6.6). Normalizing by the mean of each nonzero cell \bar{p} allows the terms of the consistent statistics to converge as |V| increases and ensures that the bias remains small. Another way to consider this term is that \bar{p}^* is a renormalization of \bar{p}^*|_V, where \bar{p}^*|_V is the mean of the subset of P^* cells that belong to V. As \bar{p}^*|_V is the sample mean approximating \bar{p}^*, it is an unbiased estimator of \bar{p}^*, and the inverse \frac{1}{\bar{p}^*|_V} is a consistent estimator of \frac{1}{\bar{p}^*} due to Slutsky's theorem. The regions spanned by each of these terms are shown in Figure 6.7.


Figure 6.7.: Visualization of the regions averaged to obtain \bar{p}^*, \bar{\hat{p}}, and \bar{p}^*|_V.

Probability Mass Shift We will now introduce a new consistent statistic called Probability Mass Shift, which is a measure of change in the underlying P^* matrices which produced two graphs. Similar to graph edit distance, when used on a dynamic network it is a measure of the rate of change the network is experiencing; unlike graph edit distance, it is consistent with respect to the size of the input graphs. Let P_1^* and P_2^* be probability distributions over a node set V^*. Define the Probability Mass Shift between P_1^* and P_2^* to be:

MS(P_1^*, P_2^*) = \frac{1}{Z_{V^*}} \sum_{ij \in V^*} \Big( \frac{p^*_{ij,1}}{\bar{p}^*_1} - \frac{p^*_{ij,2}}{\bar{p}^*_2} \Big)^2   (6.7)

where the term \bar{p}^*_x = \frac{1}{|A^*_x|} \sum_{ij \in V^*} p^*_{ij,x} refers to the average value of nonzero cells in P^*_x and Z_{V^*} = \binom{|V^*|}{2}. Let the Probability Mass Shift between node subsets V_1, V_2 ⊂ V^* be:

MS(P_1, P_2) = \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big( \frac{p_{ij,1}}{\bar{p}_1} - \frac{p_{ij,2}}{\bar{p}_2} \Big)^2   (6.8)

where V_\cap is the intersection of V_1 and V_2, p_{ij,x} = \frac{p^*_{ij,x}}{\sum_{ij \in V_\cap} p^*_{ij,x}}, and \bar{p}_x = \frac{1}{|A_x|} \sum_{ij \in V_\cap} p_{ij,x}. Now define the empirical Probability Mass Shift MS(G_1, G_2) to be:

MS(G_1, G_2) = \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big( \frac{\hat{p}_{ij,1}}{\bar{\hat{p}}_1} - \frac{\hat{p}_{ij,2}}{\bar{\hat{p}}_2} \Big)^2   (6.9)

where \bar{\hat{p}}_x = \frac{1}{|\hat{A}_x|} \sum_{ij \in V_\cap} \hat{p}_{ij,x}, |\hat{A}_x| = \sum_{ij \in V_\cap} I[w_{ij,x} > 0], and \hat{p}_{ij,x} = \frac{w_{ij,x}}{\sum_{ij \in V_\cap} w_{ij,x}}.

Theorem 6.4.1 MS(G_1, G_2) is a size consistent statistic which converges to MS(P_1^*, P_2^*).

Since \bar{p}_x = 1/|A_x| on V_\cap, the ratio \frac{p_{ij,x}}{\bar{p}_x} equals |A_x| p_{ij,x}, so

\lim_{|V_1|, |V_2| \to |V^*|} MS(P_1, P_2)
= \lim_{|V_1|, |V_2| \to |V^*|} \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} (|A_1| p_{ij,1} - |A_2| p_{ij,2})^2
= \lim_{|V_1|, |V_2| \to |V^*|} \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big( \frac{|A_1|}{\sum_{ij \in V_\cap} p^*_{ij,1}} p^*_{ij,1} - \frac{|A_2|}{\sum_{ij \in V_\cap} p^*_{ij,2}} p^*_{ij,2} \Big)^2   (6.10)

As \frac{|A_x|}{\sum_{ij \in V_\cap} p^*_{ij,x}} = \frac{\sum_{ij \in V_\cap} I[p^*_{ij,x} > 0]}{\sum_{ij \in V_\cap} p^*_{ij,x}} ≈ \frac{1}{\bar{p}^*_x}, this is an approximation calculated over only a subset of the nodes V_\cap, denoted as \frac{1}{\bar{p}^*_x|_{V_\cap}}.

\lim_{|V_1|, |V_2| \to |V^*|} MS(P_1, P_2)
= \lim_{|V_1|, |V_2| \to |V^*|} \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big( \frac{p^*_{ij,1}}{\bar{p}^*_1|_{V_\cap}} - \frac{p^*_{ij,2}}{\bar{p}^*_2|_{V_\cap}} \Big)^2
= \lim_{|V_1|, |V_2| \to |V^*|} \frac{1}{Z_{V^*}} \sum_{ij \in V^*} \Big[ \frac{(p^*_{ij,1})^2}{(\bar{p}^*_1|_{V_\cap})^2} + \frac{(p^*_{ij,2})^2}{(\bar{p}^*_2|_{V_\cap})^2} - 2 \frac{p^*_{ij,1} p^*_{ij,2}}{(\bar{p}^*_1|_{V_\cap})(\bar{p}^*_2|_{V_\cap})} \Big]   (6.11)

\bar{p}^*_x|_{V_\cap} and (\bar{p}^*_x|_{V_\cap})^2 are the sample mean and square of the sample mean of the value of the P^* cells in V_\cap respectively, and according to Slutsky's Theorem their inverses \frac{1}{\bar{p}^*_x|_{V_\cap}} and \frac{1}{(\bar{p}^*_x|_{V_\cap})^2} converge in probability to \frac{1}{\bar{p}^*_x} and \frac{1}{(\bar{p}^*_x)^2} as |V_\cap| → |V^*|.

\lim_{|V_1|, |V_2| \to |V^*|} MS(P_1, P_2) = \frac{1}{Z_{V^*}} \sum_{ij \in V^*} \Big[ \frac{(p^*_{ij,1})^2}{(\bar{p}^*_1)^2} + \frac{(p^*_{ij,2})^2}{(\bar{p}^*_2)^2} - 2 \frac{p^*_{ij,1} p^*_{ij,2}}{\bar{p}^*_1 \bar{p}^*_2} \Big] = MS(P_1^*, P_2^*)   (6.12)

\lim_{|W_1|, |W_2| \to \infty} MS(G_1, G_2)
= \lim_{|W_1|, |W_2| \to \infty} \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} (|\hat{A}_1| \hat{p}_{ij,1} - |\hat{A}_2| \hat{p}_{ij,2})^2
= \lim_{|W_1|, |W_2| \to \infty} \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big( |\hat{A}_1| \frac{w_{ij,1}}{|W_1|} - |\hat{A}_2| \frac{w_{ij,2}}{|W_2|} \Big)^2
= \lim_{|W_1|, |W_2| \to \infty} \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big[ \frac{|\hat{A}_1|^2}{|W_1|^2} w_{ij,1}^2 + \frac{|\hat{A}_2|^2}{|W_2|^2} w_{ij,2}^2 - 2 \frac{|\hat{A}_1||\hat{A}_2|}{|W_1||W_2|} w_{ij,1} w_{ij,2} \Big]   (6.13)

Let the minimum value of any nonzero cell in P_x be a finite ε. The probability of any node pair not being sampled from P_x is (1 - ε)^{|W_x|}, which is converging to 0. Once every nonzero cell has been sampled, |\hat{A}_x| = |A_x|, so this term is converging and can be replaced:

= \lim_{|W_1|, |W_2| \to \infty} \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big[ \frac{|A_1|^2}{|W_1|^2} w_{ij,1}^2 + \frac{|A_2|^2}{|W_2|^2} w_{ij,2}^2 - 2 \frac{|A_1|}{|W_1|} \frac{|A_2|}{|W_2|} w_{ij,1} w_{ij,2} \Big]
= \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \big[ |A_1|^2 p_{ij,1}^2 + |A_2|^2 p_{ij,2}^2 - 2 |A_1||A_2| p_{ij,1} p_{ij,2} \big]
= MS(P_1, P_2)   (6.14)

We can improve upon the empirical version of the statistic by calculating the amount of bias for given |W_1|, |W_2| values and compensating. The expectation of w_{ij,x}^2 for any node pair i, j given |W_x| can be written as:

E_{W_x}[w_{ij,x}^2] = Var(w_{ij,x}) + E_{W_x}[w_{ij,x}]^2
= Var(Bin(|W_x|, p_{ij,x})) + E_{W_x}[Bin(|W_x|, p_{ij,x})]^2
= |W_x| p_{ij,x}(1 - p_{ij,x}) + |W_x|^2 p_{ij,x}^2   (6.15)

We can rewrite the expectation of the empirical mass shift:

\frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big[ \frac{|A_1|^2}{|W_1|^2} \big( |W_1| p_{ij,1}(1 - p_{ij,1}) + |W_1|^2 p_{ij,1}^2 \big) + \frac{|A_2|^2}{|W_2|^2} \big( |W_2| p_{ij,2}(1 - p_{ij,2}) + |W_2|^2 p_{ij,2}^2 \big) - 2 \frac{|A_1|}{|W_1|} \frac{|A_2|}{|W_2|} |W_1| p_{ij,1} |W_2| p_{ij,2} \Big]
= \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big[ (|A_1| p_{ij,1} - |A_2| p_{ij,2})^2 + \frac{|A_1|^2}{|W_1|} p_{ij,1}(1 - p_{ij,1}) + \frac{|A_2|^2}{|W_2|} p_{ij,2}(1 - p_{ij,2}) \Big]
= MS(P_1, P_2) + \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big[ \frac{|A_1|^2}{|W_1|} p_{ij,1}(1 - p_{ij,1}) + \frac{|A_2|^2}{|W_2|} p_{ij,2}(1 - p_{ij,2}) \Big]   (6.16)

which is equal to MS(P_1, P_2) plus a bias term. Although we have shown the empirical mass shift to be size consistent, we can improve the rate of convergence by subtracting our estimate of the bias:

MS(G_1, G_2) = \frac{1}{Z_{V_\cap}} \sum_{ij \in V_\cap} \Big[ (|\hat{A}_1| \hat{p}_{ij,1} - |\hat{A}_2| \hat{p}_{ij,2})^2 - \frac{|\hat{A}_1|^2}{|W_1|} \hat{p}_{ij,1}(1 - \hat{p}_{ij,1}) - \frac{|\hat{A}_2|^2}{|W_2|} \hat{p}_{ij,2}(1 - \hat{p}_{ij,2}) \Big]   (6.17)
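A Python sketch of the corrected statistic in Eq. 6.17; it assumes (as a simplifying representation) that the weights of the two graphs are given as 1-D arrays with one entry per node pair of the common node set V_∩:

    import numpy as np

    def mass_shift(w1, w2):
        Z = len(w1)                                 # number of node pairs in V_cap
        p1, p2 = w1 / w1.sum(), w2 / w2.sum()       # empirical probabilities
        A1, A2 = (w1 > 0).sum(), (w2 > 0).sum()     # nonzero cell counts
        shift = (A1 * p1 - A2 * p2) ** 2            # note p / p_bar = |A| * p
        bias = (A1**2 / w1.sum()) * p1 * (1 - p1) \
             + (A2**2 / w2.sum()) * p2 * (1 - p2)
        return (shift - bias).sum() / Z

    rng = np.random.default_rng(0)
    w1 = rng.multinomial(5000, np.full(100, 0.01)).astype(float)
    w2 = rng.multinomial(8000, np.full(100, 0.01)).astype(float)
    print(mass_shift(w1, w2))   # near 0: same underlying P, different sizes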

Probabilistic Degree Distance The Probabilistic Degree Distance is a delta statistic that measures the difference between the degree distributions of two graphs in a size-consistent manner. It is defined as:

PDD(P_1^*, P_2^*) = \sum_{bin_k \in Bins} \Big( \frac{1}{|V^*|} \sum_{i \in V^*} I[PD_i(P_1^*) \in bin_k] - \frac{1}{|V^*|} \sum_{i \in V^*} I[PD_i(P_2^*) \in bin_k] \Big)^2   (6.18)

where Bins is a consecutive sequence of equal size bins which encompass all PD_i values, PD_i(P^*) = \frac{1}{\bar{p}^*} \sum_{j \in V^*} p^*_{ij} is the Probabilistic Degree of a node i, and \bar{p}^* = \frac{1}{|V^*|} is the average probability mass per node. We can rewrite the probabilistic degree as PD_i(P^*) = |V^*| \sum_{j \in V^*} p^*_{ij}.

As the name suggests, the probabilistic degree is a normalized version of node degree, and the distribution of probabilistic degrees replaces the standard degree distribution. Before we can begin our proofs about the consistency of the PDD, we must first analyze the behavior of this probabilistic degree distribution. The probabilistic degree of a node can be represented in terms of the mean of the masses in the cells of that node in P^*_k, giving the CDF:

F^*_{row,k}(x) = \sum_{i \in V^*_k} I\Big[x > \frac{\sum_{j \neq i \in V^*_k} p^*_{ij,k}}{\bar{p}^*_{row,k}}\Big]   (6.19)

where \bar{p}^*_{row,k} = \frac{\sum_{ij \in V^*_k} p^*_{ij,k}}{N^*} = \frac{1}{N^*}. We can rewrite the CDF as

F^*_{row,k}(x) = \sum_{i \in V^*_k} I\Big[x > N^* \sum_{j \neq i \in V^*_k} p^*_{ij,k}\Big]   (6.20)

As before, let us investigate the effect of node sampling by calculating the value of the statistic using the normalized probabilities P_1, P_2 on node subsets V_1, V_2:

F_{row,k}(x) = \sum_{i \in V_k} I\Big[x > \frac{\sum_{j \neq i \in V_k} p_{ij,k}}{\bar{p}_{row,k}}\Big]   (6.21)

where \bar{p}_{row,k} = \frac{1}{N}. Since p_{ij,k} = \frac{p^*_{ij,k}}{\sum_{ij \in V_k} p^*_{ij,k}}, we can rewrite F_{row,k}(x) as

F_{row,k}(x) = \sum_{i \in V_k} I\Big[x > N^* \frac{1}{\sum_{jl \in V_k} p^*_{jl,k}} \sum_{j \neq i \in V_k} p^*_{ij,k}\Big]
= \sum_{i \in V_k} I\Big[x > \frac{1}{\bar{p}^*_{row,k}|_{V_k}} \sum_{j \neq i \in V_k} p^*_{ij,k}\Big]   (6.22)

where \bar{p}^*_{row,k}|_{V_k} is the mean probability mass per row in P^*, excluding any cells/rows that do not belong to the set V_k. If we take the expectation of the PDF for a particular i with respect to the node sample V_k:

E_{V_k}\Big[\frac{1}{\bar{p}^*_{row,k}|_{V_k}} \sum_{j \neq i \in V_k} p^*_{ij,k}\Big]
= E_N E_{V_k|N}\Big[\frac{1}{\bar{p}^*_{row,k}|_{V_k}} \sum_{j \neq i \in V_k} p^*_{ij,k}\Big]
= E_N \Big[ E_{V_k|N}\Big[\frac{1}{\bar{p}^*_{row,k}|_{V_k}}\Big] \cdot E_{V_k|N}\Big[\sum_{j \neq i \in V_k} p^*_{ij,k}\Big] \Big]   (6.23)

If we apply Wald's equation to E_{V_k|N}[\sum_{j \neq i \in V_k} p^*_{ij,k}] we obtain

E_{V_k|N}\Big[\sum_{j \neq i \in V_k} p^*_{ij,k}\Big] = E_{V_k|N}[|A_{row_i,k}|] \cdot \bar{p}^*_{ij,k}   (6.24)

If we assume that the probability mass in row i is evenly distributed amongst the columns, then the fraction of row mass in V_k versus V^*_k is equal to the fraction of their sizes:

E_{V_k|N}[|A_{row_i,k}|] \cdot \bar{p}^*_{ij,k} = \frac{N}{N^*} |A^*_{row_i,k}| \cdot \bar{p}^*_{ij,k} = \frac{N}{N^*} \sum_{j \neq i \in V^*_k} p^*_{ij,k}   (6.25)

Now if we approximate E_{V_k|N}\big[\frac{1}{\bar{p}^*_{row,k}|_{V_k}}\big] with a Taylor expansion we obtain:

E_{V_k|N}\Big[\frac{1}{\bar{p}^*_{row,k}|_{V_k}}\Big] = \frac{1}{E_{V_k|N}[\bar{p}^*_{row,k}|_{V_k}]} + \frac{Var(\bar{p}^*_{row,k}|_{V_k})}{(E_{V_k|N}[\bar{p}^*_{row,k}|_{V_k}])^3}   (6.26)

If we make the same assumption that row mass is roughly evenly distributed across the columns of the matrix,

E_{V_k|N}[\bar{p}^*_{row,k}|_{V_k}] = \frac{N}{N^*} \bar{p}^*_{row,k}

We can also rewrite the variance term as

Var\Big(\frac{\sum_{ij \in V_k} p^*_{ij,k}}{N}\Big) = \frac{1}{N^2} Var\Big(\sum_{ij \in V_k} p^*_{ij,k}\Big)

Putting this together we have

E_{V_k|N}\Big[\frac{1}{\bar{p}^*_{row,k}|_{V_k}}\Big] = \frac{N^*}{N} \frac{1}{\bar{p}^*_{row,k}} + \frac{N^{*3}}{N^3} \frac{1}{\bar{p}^{*3}_{row,k}} \frac{1}{N^2} Var\Big(\sum_{ij \in V_k} p^*_{ij,k}\Big)   (6.27)

A typical degree distribution of a social network tends to be power-law in type, which means that a handful of nodes have a large degree and most have a very small degree. Again we will assume that covariance between edge probabilities is limited to within row/column pairs. If we assume that the majority of nodes have a sub O(N^{1/2}) number of neighbors, then the 1/N^2 term will be greater than the number of covariance terms and the bias from these nodes will converge to 0. Likewise, if the handful of high degree nodes have a sub O(N) number of neighbors, their covariance terms will also be less than 1/N^2 and the bias will also converge to 0.

Multiplying the two expectations together, the remaining N and N^* factors cancel:

\frac{N^*}{N} \frac{1}{\bar{p}^*_{row,k}} \cdot \frac{N}{N^*} \sum_{j \neq i \in V^*_k} p^*_{ij,k} = \frac{1}{\bar{p}^*_{row,k}} \sum_{j \neq i \in V^*_k} p^*_{ij,k}   (6.28)

which is the true PDF calculated on P^*_k. This means that the CDF of the row masses converges to the correct distribution as N approaches N^*.

Degree Distribution Edge Bias Now we will consider the CDF of the empirical row mass calculated from an edge sampling W_k:

\hat{F}_{row,k}(x) = \sum_{i \in \hat{V}_k} I\Big[x > \frac{\sum_{j \neq i \in \hat{V}_k} \hat{p}_{ij,k}}{\bar{\hat{p}}_{row,k}}\Big]   (6.29)

where \bar{\hat{p}}_{row,k} = \frac{1}{\hat{N}_k} and \hat{V}_k is the set of nodes that have at least one edge in W_k.

If we take the expectation of the PDF with respect to W_k we obtain

E_{W_k}\Big[\hat{N}_k \sum_{j \neq i \in \hat{V}_k} \hat{p}_{ij,k}\Big]   (6.30)

As all rows in P_k have at least one cell with nonzero probability, as |W_k| increases \hat{V}_k → V_k, since the probability of sampling at least one edge from every node approaches 1. So if we take the limit as |W_k| increases:

\lim_{|W_k| \to \infty} E_{W_k}\Big[\hat{N}_k \sum_{j \neq i \in \hat{V}_k} \hat{p}_{ij,k}\Big]
= E_{W_k}\Big[\hat{N}_k \sum_{j \neq i \in V_k} \hat{p}_{ij,k}\Big]
= \hat{N}_k \sum_{j \neq i \in V_k} E_{W_k}[\hat{p}_{ij,k}]
= N_k \sum_{j \neq i \in V_k} p_{ij,k}   (6.31)

So

\lim_{|W_k| \to \infty} E_{W_k}[\hat{F}_{row,k}(x)] = F_{row,k}(x)   (6.32)

Now let us define the empirical probabilistic degree distance and analyze its behavior. The empirical probabilistic degree is PD_i(G) = |V| \sum_{j \in V} \hat{p}_{ij} and the empirical version of the delta statistic on G_1, G_2 is:

PDD(G_1, G_2) = \sum_{bin_k \in Bins} \Big( \frac{1}{|V_1|} \sum_{i \in V_1} I[PD_i(G_1) \in bin_k] - \frac{1}{|V_2|} \sum_{i \in V_2} I[PD_i(G_2) \in bin_k] \Big)^2   (6.33)
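A Python sketch of the empirical PDD in Eq. 6.33; the number of bins here is an arbitrary illustrative choice:

    import numpy as np

    def prob_degrees(W):
        P = W / W.sum()
        return P.shape[0] * P.sum(axis=1)     # PD_i(G) = |V| * sum_j p_ij

    def pdd(W1, W2, n_bins=20):
        d1, d2 = prob_degrees(W1), prob_degrees(W2)
        # Equal-width bins encompassing all probabilistic degree values:
        bins = np.linspace(0.0, max(d1.max(), d2.max()), n_bins + 1)
        h1, _ = np.histogram(d1, bins=bins)
        h2, _ = np.histogram(d2, bins=bins)
        return np.sum((h1 / len(d1) - h2 / len(d2)) ** 2)

    rng = np.random.default_rng(0)
    W1 = rng.integers(0, 5, (30, 30)).astype(float)
    W2 = rng.integers(0, 5, (40, 40)).astype(float)
    print(pdd(W1, W2))   # small for graphs with similar degree shape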

Theorem 6.4.2 PDD(G_1, G_2) is a size consistent statistic which converges to PDD(P_1^*, P_2^*).

First take the limit of the Probabilistic Degree for a node as |W| increases:

\lim_{|W| \to \infty} PD_i(G) = \lim_{|W| \to \infty} |V| \sum_{j \in V} \hat{p}_{ij} = \lim_{|W| \to \infty} |V| \sum_{j \in V} \frac{w_{ij}}{|W|} = |V| \sum_{j \in V} p_{ij} = PD_i(P)   (6.34)

If we take the same limit over the Probabilistic Degree Difference we obtain:

\lim_{|W| \to \infty} PDD(G_1, G_2) = \sum_{bin_k \in Bins} \Big( \frac{1}{|V_1|} \sum_{i \in V_1} I[PD_i(P_1) \in bin_k] - \frac{1}{|V_2|} \sum_{i \in V_2} I[PD_i(P_2) \in bin_k] \Big)^2 = PDD(P_1, P_2)   (6.35)

If we take the limit as |V| → |V^*| of PD_i(P):

\lim_{|V| \to |V^*|} PD_i(P) = \lim_{|V| \to |V^*|} |V| \sum_{j \in V} p_{ij} = \lim_{|V| \to |V^*|} \frac{|V|}{\sum_{ij \in V} p^*_{ij}} \sum_{j \in V} p^*_{ij}   (6.36)

\frac{|V|}{\sum_{ij \in V} p^*_{ij}} can be rewritten as \frac{1}{\bar{p}^*|_V}, where \bar{p}^*|_V is the average probability mass per node in V. As this is an inverse mean, it will converge to the true value \frac{1}{\bar{p}^*}, and therefore \lim_{|V| \to |V^*|} PD_i(P) = \frac{1}{\bar{p}^*} \sum_{j \in V^*} p^*_{ij} = |V^*| \sum_{j \in V^*} p^*_{ij} = PD_i(P^*). If we take the limit on the PDD we obtain a similar result:

\lim_{|V| \to |V^*|} PDD(P_1, P_2) = \sum_{bin_k \in Bins} \Big( \frac{1}{|V^*|} \sum_{i \in V^*} I[PD_i(P_1^*) \in bin_k] - \frac{1}{|V^*|} \sum_{i \in V^*} I[PD_i(P_2^*) \in bin_k] \Big)^2 = PDD(P_1^*, P_2^*)

If we take the expectation over V:

E[PDD(P_1, P_2)] = E\Big[\sum_{bin_k \in Bins} \Big( \frac{1}{|V_1|} \sum_{i \in V_1} I[PD_i(P_1) \in bin_k] - \frac{1}{|V_2|} \sum_{i \in V_2} I[PD_i(P_2) \in bin_k] \Big)^2\Big]
= \sum_{bin_k \in Bins} E\Big[\Big( \frac{1}{|V_1|} Bin(|V_1|, p_{k,1}) - \frac{1}{|V_2|} Bin(|V_2|, p_{k,2}) \Big)^2\Big]
= \sum_{bin_k \in Bins} E\Big[\Big( \frac{1}{|V_1|} Bin(|V_1|, p_{k,1}) \Big)^2\Big] + E\Big[\Big( \frac{1}{|V_2|} Bin(|V_2|, p_{k,2}) \Big)^2\Big] - 2 E\Big[\frac{1}{|V_1|} Bin(|V_1|, p_{k,1})\Big] E\Big[\frac{1}{|V_2|} Bin(|V_2|, p_{k,2})\Big]   (6.37)

where p_{k,x} is the probability of any node selected from V_x belonging to bin k. Using the same approach as with the Mass Shift, we obtain a bias correction of \frac{p_{k,1}(1 - p_{k,1})}{|V_1|} + \frac{p_{k,2}(1 - p_{k,2})}{|V_2|}, which is subtracted from the empirical statistic.

Triangle Probability As the name suggests, the triangle probability (TP) statistic is an approach to capturing the transitivity of the network and an alternative to traditional clustering coefficient measures. Define the triangle probability as:

TP(P^*) = \frac{1}{Z^*} \sum_{ijk \in V^*} \Big(\frac{1}{\bar{p}^*}\Big)^3 p^*_{ij} p^*_{ik} p^*_{jk} = \frac{1}{Z^*} \sum_{ijk \in V^*} |A^*|^3 p^*_{ij} p^*_{ik} p^*_{jk}   (6.38)

The empirical version on G is:

TP(G) = \frac{1}{\hat{Z}} \sum_{ijk \in V} |\hat{A}|^3 \hat{p}_{ij} \hat{p}_{ik} \hat{p}_{jk}   (6.39)

where Z^* = \binom{|V^*|}{3} and \hat{Z} = \binom{|V|}{3}.

Theorem 6.4.3 TP(G) is a size consistent statistic which converges to TP(P^*).

As before, if we take the limit with increasing |W|:

\lim_{|W| \to \infty} \frac{1}{\hat{Z}} \sum_{ijk \in V} |\hat{A}|^3 \hat{p}_{ij} \hat{p}_{ik} \hat{p}_{jk} = \lim_{|W| \to \infty} \frac{1}{\hat{Z}} \sum_{ijk \in V} \frac{|\hat{A}|^3}{|W|^3} w_{ij} w_{ik} w_{jk}
= \frac{1}{Z} \sum_{ijk \in V} |A|^3 p_{ij} p_{ik} p_{jk} = TP(P)   (6.40)

Now if we take the limit as |V| → |V^*|:

\lim_{|V| \to |V^*|} \frac{1}{Z} \sum_{ijk \in V} |A|^3 p_{ij} p_{ik} p_{jk}
= \lim_{|V| \to |V^*|} \frac{1}{Z} \sum_{ijk \in V} \frac{|A|^3}{(\sum_{ij \in V} p^*_{ij})^3} p^*_{ij} p^*_{ik} p^*_{jk}
= \lim_{|V| \to |V^*|} \frac{1}{Z} \sum_{ijk \in V} \frac{1}{(\bar{p}^*|_V)^3} p^*_{ij} p^*_{ik} p^*_{jk}   (6.41)

Similar to the approach before, \frac{1}{(\bar{p}^*|_V)^3} converges to \frac{1}{(\bar{p}^*)^3} by Slutsky's Theorem, so the final limit is

\frac{1}{Z^*} \sum_{ijk \in V^*} \frac{1}{(\bar{p}^*)^3} p^*_{ij} p^*_{ik} p^*_{jk} = TP(P^*)   (6.42)

As with the Mass Shift, let us take the expectation w.r.t. |W| and see if we can improve the rate of convergence with a bias correction:

E[TP(G)] = E\Big[\frac{1}{\hat{Z}} \sum_{ijk \in V} |\hat{A}|^3 \hat{p}_{ij} \hat{p}_{ik} \hat{p}_{jk}\Big] = \frac{1}{\hat{Z}} \sum_{ijk \in V} |\hat{A}|^3 E\Big[\frac{1}{|W|^3} w_{ij} w_{ik} w_{jk}\Big]   (6.43)

As before, let us assume that we have enough edge samples so that |\hat{A}| = |A|:

\frac{1}{\hat{Z}} \sum_{ijk \in V} |A|^3 E\Big[\frac{1}{|W|^3} w_{ij} w_{ik} w_{jk}\Big]   (6.44)

The quantity E[\frac{1}{|W|^3} w_{ij} w_{ik} w_{jk}] can be calculated with

E\Big[\frac{1}{|W|^3} w_{ij} w_{ik} w_{jk}\Big] = \frac{1}{|W|^3} E[w_{ij} w_{ik} w_{jk}]
= \frac{1}{|W|^3} \big( E[w_{ij} w_{ik}] E[w_{jk}] - cov(w_{ij} w_{ik}, w_{jk}) \big)
= \frac{1}{|W|^3} \big( E[w_{ij}] E[w_{ik}] E[w_{jk}] - E[w_{jk}] cov(w_{ij}, w_{ik}) - cov(w_{ij} w_{ik}, w_{jk}) \big)
= \frac{1}{|W|^3} \big( |W|^3 p_{ij} p_{ik} p_{jk} + |W|^2 p_{ij} p_{ik} p_{jk} - cov(w_{ij} w_{ik}, w_{jk}) \big)

The covariance term can be expanded with the formula for products of random variables [97]:

cov(w_{ij} \cdot w_{ik}, w_{jk})
= E[w_{ij}] cov(w_{ik}, w_{jk}) + E[w_{ik}] cov(w_{ij}, w_{jk}) + E[(w_{ij} - E[w_{ij}])(w_{ik} - E[w_{ik}])(w_{jk} - E[w_{jk}])]
= -2|W|^2 p_{ij} p_{ik} p_{jk} + E\big[ w_{ij} w_{ik} w_{jk} - E[w_{ij}] w_{ik} w_{jk} - E[w_{ik}] w_{ij} w_{jk} - E[w_{jk}] w_{ik} w_{ij} + E[w_{ij}] E[w_{ik}] w_{jk} + E[w_{jk}] E[w_{ik}] w_{ij} + E[w_{ij}] E[w_{jk}] w_{ik} - E[w_{ij}] E[w_{jk}] E[w_{ik}] \big]
= -2|W|^2 p_{ij} p_{ik} p_{jk} + E[w_{ij} w_{ik} w_{jk}] - |W| p_{ij} E[w_{ik} w_{jk}] - |W| p_{ik} E[w_{ij} w_{jk}] - |W| p_{jk} E[w_{ik} w_{ij}] + 3|W|^3 p_{ij} p_{ik} p_{jk} - |W|^3 p_{ij} p_{ik} p_{jk}
= -2|W|^2 p_{ij} p_{ik} p_{jk} + E[w_{ij} w_{ik} w_{jk}] - 3|W|^3 p_{ij} p_{ik} p_{jk} + |W| p_{ij} cov(w_{ik}, w_{jk}) + |W| p_{ik} cov(w_{ij}, w_{jk}) + |W| p_{jk} cov(w_{ik}, w_{ij}) + 3|W|^3 p_{ij} p_{ik} p_{jk} - |W|^3 p_{ij} p_{ik} p_{jk}
= -2|W|^2 p_{ij} p_{ik} p_{jk} + E[w_{ij} w_{ik} w_{jk}] - 3|W|^2 p_{ij} p_{ik} p_{jk} - |W|^3 p_{ij} p_{ik} p_{jk}
= -5|W|^2 p_{ij} p_{ik} p_{jk} + E[w_{ij} w_{ik} w_{jk}] - |W|^3 p_{ij} p_{ik} p_{jk}

By plugging the covariance into the original equation we obtain:

E\Big[\frac{1}{|W|^3} w_{ij} w_{ik} w_{jk}\Big] = \frac{1}{|W|^3} \big( |W|^3 p_{ij} p_{ik} p_{jk} + |W|^2 p_{ij} p_{ik} p_{jk} + 5|W|^2 p_{ij} p_{ik} p_{jk} - E[w_{ij} w_{ik} w_{jk}] + |W|^3 p_{ij} p_{ik} p_{jk} \big)
2 E\Big[\frac{1}{|W|^3} w_{ij} w_{ik} w_{jk}\Big] = \frac{1}{|W|^3} \big( 2|W|^3 p_{ij} p_{ik} p_{jk} + 6|W|^2 p_{ij} p_{ik} p_{jk} \big)
E\Big[\frac{1}{|W|^3} w_{ij} w_{ik} w_{jk}\Big] = \frac{1}{|W|^3} \big( |W|^3 p_{ij} p_{ik} p_{jk} + 3|W|^2 p_{ij} p_{ik} p_{jk} \big) = p_{ij} p_{ik} p_{jk} + \frac{3}{|W|} p_{ij} p_{ik} p_{jk}

So the bias term is \frac{3}{|W|} p_{ij} p_{ik} p_{jk}. If we subtract the empirical version of this term to compensate, we obtain the corrected empirical Triangle Probability:

TP(G) = \frac{1}{\hat{Z}} \sum_{ijk \in V} |\hat{A}|^3 \Big( \hat{p}_{ij} \hat{p}_{ik} \hat{p}_{jk} - \frac{3}{|W|} \hat{p}_{ij} \hat{p}_{ik} \hat{p}_{jk} \Big)
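A Python sketch of the corrected Triangle Probability, assuming W is stored as an upper-triangular weight matrix so each node pair appears once; the brute-force O(|V|^3) loop is for illustration only:

    import numpy as np
    from math import comb
    from itertools import combinations

    def triangle_probability(W):
        tw = W.sum()                              # |W|: total edge weight
        P = W / tw                                # empirical probabilities
        A = (W > 0).sum()                         # |A|: nonzero cells
        total = 0.0
        for i, j, k in combinations(range(W.shape[0]), 3):
            prod = P[i, j] * P[i, k] * P[j, k]
            total += prod - (3.0 / tw) * prod     # subtract the 3/|W| bias
        return (A ** 3) * total / comb(W.shape[0], 3)

    W = np.triu(np.ones((6, 6)), k=1) * 2.0       # small dense test graph
    print(triangle_probability(W))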

6.5 Anomaly Detection Process

In order to perform anomaly detection on a dynamic network, the collection of messages needs to first be converted into a sequence of graph instances. As each message consists of a pair of nodes and an associated timestamp, after picking a time step width Δ the graph at each sequential time step t is created by adding all messages falling between t and t + Δ to the matrix W_t, producing a sequence of graphs. The algorithm is described in Figure 6.8.

Then, a statistic value needs to be calculated at every graph instance in the stream. As the length of the stream is usually short compared to the size of the graphs, the computational complexity depends on the cost of calculating the network statistics on the largest graph instance. In order to calculate our consistent statistics, \hat{P}_t must be estimated, which is easily obtained by normalizing the observed messages W_t. Then the network statistic scores are calculated at each time step. This generates a set of standard time series which can be analyzed with traditional time series anomaly detection techniques.

Selection of a proper Δ time step width is crucial. Due to the nature of size-consistent statistics, larger values of Δ will reduce the error associated with statistical bias, but larger values also reduce the granularity of the detection algorithm, making it harder to pinpoint the exact time that the anomaly occurred.

Now that we have a stream of graphs we can perform the anomaly detection process. For every time step t the graph G_t becomes the test graph and the graphs G_{t-1}, G_{t-2}, ..., G_{t-k} become the null distribution examples (here we use k = 50). By applying S_k to each graph we obtain both the test point and the null distribution. Given a certain p-value α, we then set the critical points to be the values which reject the most extreme α/2 values from the null distribution on both sides. If the test point S_k(G_t) falls outside of these critical points we can reject the null hypothesis and raise an anomaly flag. This detection algorithm is described in Figure 6.9.

6.5.1 Smoothing

Rather than calculating delta statistics using a weighted matrix W_{t-1} which contains only the communications of the immediately prior time step, an aggregate of prior time steps W_{t-k}, ..., W_{t-1} can be used by simply calculating the average weighted matrix \bar{W} from W_{t-k}, ..., W_{t-1} and then calculating S_k(W_t, \bar{W}) as the delta statistic.

GraphProcess(messages, Δ):
    t_start = 0, t_end = Δ
    while t_start < last timestamp in messages do
        W = [ ], V = {}
        for m_{ij,t} in messages do
            if t_start ≤ t < t_end then
                Add nodes i, j to V
                w_ij = w_ij + 1
            end if
        end for
        Add G = (V, W) to the graph sequence
        t_start = t_start + Δ, t_end = t_end + Δ
    end while

Figure 6.8.: Creation of the dynamic graph sequence from message stream using time step width Δ.

The advantage of this approach is that it measures the distance of the current behavior from the average behavior seen in a range of recent past timesteps, and as such is less susceptible to flagging time t due to an outlier in W_{t-1}.

Another smoothing option is to use a moving window approach with overlapping time steps, i.e. calculate W_t, W_{t+δ}, W_{t+2δ}, ... where W_{t+δ} starts at time t + δ and ends at time t + δ + Δ. This effectively allows for a larger time step without sacrificing granularity, as it should be straightforward to find which δ-wide time span an anomaly occurred in.
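A minimal sketch of the averaged-matrix variant, with a trivial stand-in for the delta statistic (any of the statistics defined above could be substituted; the names here are illustrative):

    import numpy as np

    def smoothed_score(delta_stat, W_hist, W_t):
        # Compare time t against the element-wise average of the k prior
        # weight matrices rather than against W_{t-1} alone.
        W_bar = np.mean(np.stack(W_hist), axis=0)
        return delta_stat(W_t, W_bar)

    hist = [np.ones((5, 5)) * c for c in (1.0, 2.0, 3.0)]
    print(smoothed_score(lambda a, b: abs(a.sum() - b.sum()),
                         hist, np.ones((5, 5)) * 2.5))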

AnomalyDetection(G_1, G_2, ..., G_t, S_k, α):
    for i in 50...t do
        for j in 1...i-1 do
            Add S_k(G_j) to NullDistr
        end for
        CriticalPoints = CalcCPs(NullDistr, α)
        if S_k(G_i) outside CriticalPoints then
            Generate Anomaly Flag at time i
        end if
    end for

Figure 6.9.: Anomaly detection procedure for a graph stream {G_1...G_t}, graph statistic S_k and p-value α.

A prior edge weight value for the cells of W_t is another option. Instead of using W_t, one can use W'_t where w'_{ij,t} = w_{ij,t} + c for some value of c. In general, c should be small, usually less than 1, as this prior value adds c|V_t|^2 total weight to the matrix, and c|V_t|^2 << |W_t| in the ideal case. Larger values of c can easily wash out the actual network behavior, leading all of the graph examples to seem uniform.

So far the P matrices have been estimated with a frequentist approach, using the observed message frequencies to estimate the probabilities. If one desires to assign a prior distribution to the P matrix, a Bayesian approach is easily implemented by choosing a Beta distribution for each cell in P and using them as conjugate priors for normalized binomial distributions, with the observed message frequencies as the evidence. The reason we did not utilize this approach is because it is difficult to choose proper prior distributions: due to the sparsity of most networks, the vast majority of cells in P are zero. Similar to the prior edge weight approach, assigning a nonzero prior to all cells in P tends to dilute the network, but deciding which cells to assign a zero is nontrivial. Because 0 is the natural value for most pairs of nodes in the network, trying to smooth by assigning a non-zero prior to all these node pairs is detrimental.

6.5.2 Complexity Analysis

Statistics like probability mass shift and probabilistic degree can be calculated at each step in O(|A_t|) time, making their overall complexity O(|A_t| t), where |A_t| is the number of nonzero elements in W_t. Triangle probability, on the other hand, is more expensive as some triangle counting algorithm must be applied. The fastest counting algorithms typically run in O(|V|^k) time where 2 < k < 3. If the maximum number of neighbors of any node is bounded by n_max, we can approximate the triangle count with O(|V| n_max^2 t). Note that any other statistic-based approach such as Netsimile that utilizes triangle count or clustering coefficient must make the same approximations in order to run in linear time.

6.6 Experiments

Now that we have established the properties of size-consistent and -inconsistent statistics, we will show the tangible effects of these statistical properties using synthetic, semi-synthetic, and real-world datasets. The objective for the synthetic and semi-synthetic experiments is to maximize the true positive detection rate (where a true positive is flagging a graph generated with anomalous parameters) and minimize the false positive rate (where a false positive is flagging a graph with unusual edge count or node count but generated with normal parameters). The real-world experiments will be an exploratory analysis, demonstrating how to discover and explain events in a real-world dynamic graph.

We will compare each of the consistent statistics to the conventional one it was intended to replace: graph edit distance for probability mass shift, degree distribution difference to probabilistic degree difference, and Barrat weighted clustering to triangle probability. In addition we will also compare the performance of the consistent statistics to Netsimile and Deltacon. Netsimile is an aggregate statistic which attempts to measure graph differences in a variety of dimensions and as such can be applied to find many types of anomalies. Deltacon on the other hand measures graph differences through the distances between nodes in the graphs and attempts to find anomalies of an entirely different type than the consistent statistics.

6.6.1 Synthetic Data Experiments

In order to create data with specific known properties we used generative graph models. There are four types of graphs generated:

1. normal graph examples which are used to create the null distribution for a hypothesis test.

2. edge false positive graphs which are generated using the same model parameters but with additional sampled edges.

3. node false positive graphs which are generated using the same model parameters but with additional sampled nodes.

4. true positive graphs which are generated with a normal number of edges and nodes but different model parameters.

The first three sets of graphs are created with the same generative model but with varying edges and nodes in the output graphs, while the last set uses a different generative model. An illustration of the null distribution, false positive distribution with additional edges, and true positive distribution is shown in figure 6.10.

First a set of normal graph examples is created using the process described in 1., which will form the null distribution graphs. A statistic is calculated for each graph example and, given an α value, the two critical points are found. Then a false positive graph set is created using either 2. or 3. and a true positive graph set created using 4., and statistics are calculated for each. The percentage of false positive graphs outside the critical points becomes the false positive rate, while the percentage of true positive graphs outside the critical points becomes the true positive rate. By varying the value of α and plotting the true positive vs. false positive rate for each value we can create an ROC curve showing the tradeoff of true anomalous instances found versus falsely flagged instances. The circle on the ROC curves represents selecting a p-value of 0.05.

The number of edges in the normal and true positive graphs ranges from 300k-400k while the edge false positive graphs have 400k-500k, and the number of nodes in the normal and true positive graphs is 25k while the node false positive graphs have 30k. An equal number of graphs of each type were generated.

For a statistic that detects the model changes reasonably well we expect the false positive distribution to be very close to the null distribution, while the anomalous distribution is significantly different. Ideal performance on the ROC curve would be a horizontal line across the entire top of the plot: this would indicate perfect performance in detecting true positive graphs even at low p-values, and a false positive rate that stays low until the p-value is increased. For comparison, a diagonal line with a slope of 1 would indicate random performance, where each false positive and true positive graph is flagged as anomalous using an unbiased coin flip. Any statistic which has a curve below this line has more sensitivity to the additional edges or nodes of the false positive graphs than to the model changes of the true positive graphs. Some of the statistics evaluated even have a vertical line at the right of the plot: this indicates that, no matter the p-value picked, all false positive graphs are being flagged but not all true positive graphs are; this is the worst possible regime for a statistic to be in.

To evaluate delta statistics, graphs were generated in pairs, the first drawn from the normal/false positive/true positive model and the other always from the normal model, and the delta statistic calculated between them.

To test the performance of graph change statistics like graph edit distance and probability mass shift, synthetic data was generated using a mixture model that either samples edges from a static normal graph instance from 1. or from an anomalous graph from 4. The initial graph has a power-law degree distribution with an exponent of 2.0 and was generated using a Chung-Lu sampling process, while the alternative graph was generated with an Erdos-Renyi graph model.


Figure 6.10.: Diagram of synthetic experiments and the three sets of generated graphs.

The normal model draws edge samples only from the initial graph, while the alternative model draws 5% of the edges from the alternative graph. The performance of these statistics is shown in figure 6.12 (a) and (d). Mass shift strictly dominates the other statistics as either the edge or node count changes.

To evaluate the ability to detect degree distribution changes, synthetic graphs were also generated using a Chung-Lu process; however, anomalous graph instances were generated by altering the exponent parameter of the power law determining the degree distribution rather than using a mixture model. The normal graph instances have a power-law degree distribution with an exponent of 2.0 while the true positive graphs have an exponent of 1.8. The performance is shown in figure 6.12 (b) and (e).
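For reference, a hedged sketch of Chung-Lu edge sampling as used in these experiments: nodes receive expected degrees following a power law with the chosen exponent, and each edge endpoint is drawn with probability proportional to its expected degree. The weight construction d_i proportional to i^(-1/(exponent-1)) and the function names are assumptions of this sketch.

import numpy as np

def chung_lu_edges(n, exponent, m, seed=0):
    # Expected degrees from a power law with the given exponent;
    # both endpoints of each of the m edges are drawn proportionally.
    rng = np.random.default_rng(seed)
    degrees = np.arange(1, n + 1) ** (-1.0 / (exponent - 1.0))
    p = degrees / degrees.sum()
    src = rng.choice(n, size=m, p=p)
    dst = rng.choice(n, size=m, p=p)
    return [(u, v) for u, v in zip(src, dst) if u != v]  # drop self-loops

normal_edges = chung_lu_edges(1000, exponent=2.0, m=5000)
anomalous_edges = chung_lu_edges(1000, exponent=1.8, m=5000, seed=1)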

The transitivity experiments were done by creating graphs with a varying amount of triangle closures. To create each graph, a KPGM model with a seed matrix of $\begin{bmatrix} 0.4 & 0.3 \\ 0.3 & 0 \end{bmatrix}$ is used to sample an initial edge set. These parameters were selected to create a graph with a branching pattern and few natural triangles.

Synthetic(P_N, P_A, |W|, |V|, δ, S_k, α):
  for i in 1...50 do
    NormalGraphs.add(GenerateGraph(|W|, |V|, P_N))
    FalsePosGraphs.add(GenerateGraph(|W| + δ, |V|, P_N))
    AnomalousGraphs.add(GenerateGraph(|W|, |V|, P_A))
  end for
  NullDistr = {S_k(G), G ∈ NormalGraphs}
  CriticalPoints = CalcCPs(NullDistr, α)
  for G in FalsePosGraphs do
    if S_k(G) outside CriticalPoints then FalsePositives++ end if
  end for
  for G in AnomalousGraphs do
    if S_k(G) outside CriticalPoints then TruePositives++ end if
  end for

Figure 6.11.: Synthetic data experimental procedure for statistic S_k using normal probability matrix P_N, anomalous probability matrix P_A, and δ additional edges in the false positive graphs.
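The procedure in Figure 6.11 translates directly into code once the statistic values S_k(G) have been computed for each graph set; below is a minimal Python sketch using empirical quantiles as the two-sided critical points, with stand-in normal draws in place of real graph statistics.

import numpy as np

def detection_rates(null_stats, fp_stats, tp_stats, alpha):
    # Two-sided critical points from the empirical null distribution,
    # then the fraction of each graph set flagged outside them.
    lo = np.quantile(null_stats, alpha / 2)
    hi = np.quantile(null_stats, 1 - alpha / 2)
    flag = lambda s: (s < lo) or (s > hi)
    fpr = np.mean([flag(s) for s in fp_stats])
    tpr = np.mean([flag(s) for s in tp_stats])
    return fpr, tpr

rng = np.random.default_rng(0)
null_stats = rng.normal(0.0, 1.0, 50)  # stand-in statistic values
fp_stats = rng.normal(0.2, 1.0, 50)    # same model, extra edges
tp_stats = rng.normal(2.0, 1.0, 50)    # anomalous model
for alpha in (0.01, 0.05, 0.25):       # sweeping alpha traces the ROC
    print(alpha, detection_rates(null_stats, fp_stats, tp_stats, alpha))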

Then, with probability ρ, each edge is removed and replaced with a triangle closure by performing a random walk (identical to the technique used in the Transitive Chung-Lu model [80]). The normal graphs were generated with ρ = 0.05 while the alternative graphs had ρ = 0.055. The results are in figures 6.12 (c) and (f).

Figures 6.12 (g)-(i) show the effect of changing (g) edges, (h) nodes, or (i) the model parameter on the transitivity statistics. The zero point on the false positive plots compares graphs of the same size and model, which will produce false positives at the p-value rate, while deviating in either direction introduces more false positives. The power in figure (i) depends on the deviation in the model parameter.
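A hedged sketch of this rewiring step (in the spirit of the Transitive Chung-Lu technique [80], with assumed function names): each selected edge is replaced by an edge from its source to the endpoint of a two-step random walk, which closes a triangle through the intermediate node.

import random

def close_triangles(edges, rho, seed=0):
    rnd = random.Random(seed)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    out = []
    for u, v in edges:
        if rnd.random() < rho:
            step1 = rnd.choice(adj[u])      # one step from u
            step2 = rnd.choice(adj[step1])  # second step
            if step2 != u:
                out.append((u, step2))      # closes triangle u-step1-step2
                continue
        out.append((u, v))                  # otherwise keep the edge
    return out

edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(close_triangles(edges, rho=0.5))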


Figure 6.12.: (a),(b),(c): ROC curves with false positives due to extra edges. (d),(e),(f): ROC curves with false positives due to extra nodes. (g) false positive rate vs. edges, (h) false positive rate vs. nodes, (i) true positive rate.

6.6.2 Semi-Synthetic Data Experiments

Although synthetically driven experiments have the advantage of complete control over the network properties of the generated graphs, these experiments give inherently artificial results, and the utility of any conclusions drawn from those experiments depends heavily on the comprehensiveness of the experiments.


Figure 6.13.: Diagram of semi-synthetic experiments and the three sets of generated graphs.

To ensure that these results generalize to more realistic scenarios, we have also evaluated the statistics using a set of semi-synthetic experiments where the normal and false positive graph examples of 1. – 3. are sampled from real-world networks and the true positive anomaly examples of 4. are artificially inserted. These experiments show that the proposed consistent statistics are superior at discovering anomalies inserted into real-world data.

The first step in generating the graph sets is to aggregate all graph instances from a dynamic network source into a single graph example which will become our normal graph source. All normal graph examples are generated from this source graph by first sampling an active node set, obtaining the subgraph over those active nodes, then sampling edges with replacement to create the sample graph. By aggregating all instances over time we smooth out any variations that occur over the lifespan of the network and obtain the "average" behavior of the network to use as our normal examples.

SemiSynthetic(P_N, P_A, |W|, |V|, δ, S_k, α):
  for i in 1...50 do
    NormalGraphs.add(SampleGraph(|W|, |V|, P_N))
    FalsePosGraphs.add(SampleGraph(|W| + δ, |V|, P_N))
    AnomalousGraphs.add(SampleGraph(|W|, |V|, P_A))
  end for
  NullDistr = {S_k(G), G ∈ NormalGraphs}
  CriticalPoints = CalcCPs(NullDistr, α)
  for G in FalsePosGraphs do
    if S_k(G) outside CriticalPoints then FalsePositives++ end if
  end for
  for G in AnomalousGraphs do
    if S_k(G) outside CriticalPoints then TruePositives++ end if
  end for

Figure 6.14.: Semi-synthetic data experimental procedure for statistic S_k using normal probability matrix P_N, anomalous probability matrix P_A, and δ additional edges in the false positive graphs.

False positive examples are created by sampling additional nodes or edges from the same source network. True positive examples are sampled from a separate, alternate source instance which is created by permuting the original source graph in some way. To generate network change anomalies, the alternate source has 5% of its edges, selected uniformly at random, changed relative to the source; degree distribution anomalies are generated by taking 30% of the edges of high degree nodes (high degree meaning in the top 50% of nodes) and reassigning them uniformly at random; and transitivity anomalies are generated by performing triangle closures: selecting an initial node, randomly walking two steps, then linking the endpoints of the walk to form a triangle. The semi-synthetic data generation process is shown in figure 6.13 and the exact algorithm for generating the data is described in Figure 6.14.

The input $P_N$ is created by dividing the aggregated normal graph described above by $|W|$, and the input $P_A$ is created by modifying the aggregated normal graph in one of the ways described above and then dividing by $|W|$. The dataset used for the underlying graph was the University E-mail dataset described in Section 6.6.3 and used in the real data experiments; when aggregated, this data forms a graph with 54102 total nodes and 5236591 total messages. Edge false positives are generated by creating graphs with 20k nodes and either 400k or 600k edges, while node false positives are generated by sampling either 20k or 30k nodes and sampling edges equal to 20 times the number of nodes. Sampling edges as a ratio of nodes in the node false positive experiment holds the density of the graphs constant.

We analyze the performance of the statistics using the same ROC approach as with the synthetic data. Figure 6.15 shows the resulting ROC curves. As with the synthetic experiments, (a)-(c) show mass shift statistics, degree distribution statistics, and transitivity statistics respectively when the false positives are generated with additional edges, while (d)-(f) have false positives generated via additional nodes. The proposed consistent statistics have superior performance in most cases, and none of the competing statistics perform well in both the additional edges and additional nodes scenarios.
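A minimal sketch of the sampling step described above, with assumed names: draw an active node set from the aggregated source graph, restrict to the induced subgraph, then draw edges with replacement so that repeated draws accumulate as edge weight.

import random

def sample_graph(src_edges, n_nodes, n_edges, seed=0):
    rnd = random.Random(seed)
    nodes = sorted({u for e in src_edges for u in e})
    active = set(rnd.sample(nodes, min(n_nodes, len(nodes))))
    # induced subgraph over the active node set
    induced = [e for e in src_edges
               if e[0] in active and e[1] in active]
    if not induced:
        return []
    # edges drawn with replacement; duplicates act as edge weight
    return [rnd.choice(induced) for _ in range(n_edges)]

src = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (2, 4)]
normal = sample_graph(src, n_nodes=4, n_edges=10, seed=1)
edge_fp = sample_graph(src, n_nodes=4, n_edges=15, seed=2)  # extra edges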

6.6.3 Real Data Experiments

Now let us investigate the types of anomalies found when these statistics are applied to three real-world networks and contrast these events with those found by other detectors. The first dataset is the Enron communication data, a subset of e-mail communications from prominent figures of the Enron corporation (150 individuals, 47088 total messages) with a time step width of one week, used in papers such as Priebe et al [60]. The second is the University E-mail data, e-mail communications of students from one university in the 2011-2012 school year (54102 individuals, 5236591 total messages), sampled daily and described in detail in the paper by LaFond et al [36].


Figure 6.15.: ROC curves on the semi-synthetic dataset with varying alphas.

The third is a Facebook network subset made up of postings to the walls of students in the 2007-2008 school year (444829 individuals, 4171383 total messages), also from the same university and sampled daily. The Facebook dataset was also used in a paper by LaFond [35] and is described there in more detail.

Figure 6.16 shows the results of multiple statistic detectors when applied to the set of e-mail data from the Enron corporation, including our three proposed statistics, the raw message count, Netsimile, and Deltacon. Time step 143 represents the most significant event in the stream, Jeffrey Skilling's testimony before Congress on February 6, 2002. The detected triangle anomalies at time steps 50-60 coincide with Enron's price manipulation strategy known as "Death Star," which was put into action in May 2000.

[Figure 6.16 timeline annotations: May 2000: price manipulation strategy "Death Star" implemented; July 2000: Enron/Blockbuster deal; December 2000: Skilling takes over as CEO; April 2001: "asshole" conference call; May 2001: Mintz sends memorandum to Skilling on LJM paperwork; July 2001: Skilling approaches Lay about resigning; September 2001: Enron stock begins to falter; February 2002: Skilling testifies before Congress. Legend: Edge Count, Mass Shift, Probabilistic Degree, Triangle Probability, Netsimile, Deltacon. X-axis: Time (weeks).]

Figure 6.16.: Detected anomalies in Enron corporation e-mail dataset. Filled circles are detections from our proposed statistics, open circles are other methods.

Other events include the CEO transition from Lay to Skilling in December 2000, the "asshole" conference call featured prominently in the book "The Smartest Guys in the Room," and Skilling approaching Lay about resigning.

Netsimile has difficulty detecting most of the important events in the Enron timeline. Although it accurately flags the time of the Congressional hearings, the other points it flags, particularly early on, do not correspond to any notable events and are probably false positives due to the artificial sensitivity of the algorithm on very sparse network slices. Deltacon detects a greater range of events than Netsimile but still fails to detect several important events such as the price manipulation and Skilling's attempted resignation. In general it generates detections more frequently in the region between May and December 2001, which is also the region of highest message activity, and fails to generate detections in times with fewer messages.

[Figure 6.17 timeline annotations: August 29: semester begins; September 5: Labor Day; November 1: basketball season begins; December 25: Christmas; January 16: Martin Luther King Day; February 14: Valentine's Day. X-axis: Time (days).]

Figure 6.17.: Detected anomalies in university student e-mail dataset.

Figure 6.17 shows the detected time steps of the University E-mail dataset. Several major events from the academic school year, like the start of the school year and Christmas break, are shown. The consistent statistics appear to flag times closer to holidays and other events than the competing statistics do. Unfortunately, as the text content of the messages was unavailable, it is impossible to determine whether the detected conversations correspond to specific events based on the dialogues of users.

Figure 6.18 shows the detected events of the Facebook wall data and the explanations for the detected events. Some of the listed events are holidays while others were obtained by investigating the time steps flagged as anomalies; see Section 6.7 for an explanation of this process. Some events of interest are: the "Race to 2k Posts," where a pair of individuals noticed they were nearing two thousand posts on one of their walls and decided to reach that mark in one night, generating much more traffic between them than usual (over 160 posts); the "Divorce w/ Third Party," where a pair of individuals were going through a messy breakup and a mutual friend was cracking jokes and egging them on; and a discussion about Tiger Woods' odds in the 2007 Open Championship.

[Figure 6.18 timeline annotations: August 6: Painting Room; April 26: Online Quiz; October 16: Cramming for Exams; July 21: 2007 Open Championship; June 2: Race to 2k Posts; September 12: Exams/Projects; January 1: New Year's; January 14: classes resume; March 12-17: Spring Break; August 27: classes begin; December 25: Christmas; June 25: Divorce w/ Third Party. X-axis: Time (days).]

Figure 6.18.: Detected anomalies in Facebook wall postings dataset.


6.7 Local Anomaly Decomposition

After flagging a time step as anomalous it is useful to have some indication as to what is happening in the network at that time that generated the flag. One tool for investigating the flagged time step is local anomaly decomposition, where the network is broken down into the subgraphs that contribute the most value to the total statistic score at that time step. For many statistics, like mass shift or triangle probability, which are summations over the edges, nodes, or triplets of the graph, this process is trivial: each component of the summation has an associated anomaly score, and the components that provide the most anomaly score are the ones investigated. For others, such as the probabilistic degree difference (PDD), which cannot be easily decomposed into node and edge contributions, this approach is nearly impossible. Anomaly score decomposition is more useful when the score is skewed rather than uniformly distributed, as it is easier to highlight a concise region that contributes the most towards the anomaly.

To demonstrate the decomposition, we applied the statistics to the real-world networks and sorted all of the nodes (for Barrat clustering) or edges (all other statistics) from highest to lowest contribution to the anomaly score sum. From there we selected the components with the highest anomaly score contribution, totaling at least 20% of the log of the anomaly score, to be part of the visualized anomaly. We then plotted all of the selected components as well as any adjacent edges and nodes. We investigated the Enron and Facebook datasets as these have names/message content associated with the graphs; the University E-mail dataset has neither, so these graphs are omitted.

Figures 6.19 - 6.22 show the local subgraphs reported by mass shift, triangle probability, graph edit distance, and Barrat clustering respectively. The left subgraph shows activity in the time step immediately prior to the anomaly while the right shows the subgraph during the anomaly. Red nodes and edges are among the top anomaly contributors while black edges and nodes are merely adjacent; the thickness of an edge corresponds to its weight in that time step.

Figure 6.19 shows an unusually large amount of communication between Senior Vice President Richard Shapiro and Government Relations Executive Jeff Dasovich immediately before Skilling approaches Lay about resigning as CEO. Figure 6.20 shows the triangular communications occurring between members of the Enron legal department during the price-fixing strategy in California. Both of these methods find succinct subgraphs to represent the anomalies occurring at these times. Figure 6.21, on the other hand, shows graph edit distance reporting nearly the entirety of the network at that time. While this does represent an event (the Congressional hearings), there is no interpretation of the event other than that there were many messages being sent at that time. Barrat clustering identifies the legal department in Figure 6.22 but does so at a time with relatively low communication. Barrat clustering normalizes by node degree, which makes it more likely to report triangles with less weight as long as the participating nodes don't communicate with anyone else.
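For statistics that decompose as sums, the selection step can be written compactly; the following Python sketch (with assumed names, taking the contribution threshold as an argument rather than hard-coding the 20%-of-log rule used above) ranks components by their contribution and keeps the smallest prefix that reaches the threshold.

import math

def top_anomaly_components(contributions, threshold):
    # contributions maps a component (edge, node, or triplet) to its
    # share of the total anomaly score; keep the highest contributors
    # until their cumulative score reaches the threshold.
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for comp, score in ranked:
        kept.append(comp)
        cum += score
        if cum >= threshold:
            break
    return kept

contrib = {('a', 'b'): 5.0, ('b', 'c'): 2.0, ('c', 'd'): 0.5}
total = sum(contrib.values())
print(top_anomaly_components(contrib, 0.2 * math.log(total)))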


Figure 6.19.: Subgraph responsible for most of the mass shift anomaly in the Enron network at the weeks of June 25 (before anomaly) and July 2 (during anomaly), 2001 respectively.


Figure 6.20.: Subgraph responsible for most of the triangle probability anomaly in the Enron network at the weeks of May 1 (before anomaly) and May 8 (during anomaly), 2000 respectively.

Figures 6.23 - 6.26 show the local subgraphs found in the Facebook dataset. Figure 6.23 shows the event we named "Race to 2k Posts"; at this time a pair of individuals noticed they were closing in on two thousand posts on their walls and decided to reach that goal in one night. The result is a massively higher amount of communication than was typical between the two in prior time steps.


Figure 6.21.: Subgraph responsible for most of the graph edit distance anomaly in the Enron network at the weeks of January 28 (before anomaly) and February 4 (during anomaly), 2002 respectively.


Figure 6.22.: Subgraph responsible for most of the Barrat clustering anomaly in the Enron network at the weeks of November 1 (before anomaly) and November 8 (during anomaly), 1999 respectively.

Figure 6.24 shows the communications occurring during the 2007 Open Championship golf tournament. The three individuals with the most communication were arguing about the odds that Tiger Woods would win the tournament. Graph edit distance, by contrast, identifies no coherent local structure in Figure 6.25. It is likely that this event signifies a global increase in communication rather than a change in the distribution of messages. As the additional edges were distributed throughout the network, when looking for subgraphs that generated the most anomaly score the majority of the network has similar scores, so a random chunk of the network is found. Figure 6.26 shows the structure found by Barrat clustering; as before it finds a set of triangular communications with relatively low weights, around 2 - 4, while the anomaly found by triangle probability has about 12 messages per edge.


Figure 6.23.: Subgraph responsible for most of the mass shift anomaly in the Facebook network at June 1 (before anomaly) and June 2 (during anomaly), 2007 respectively.

6.8 Summary

In this chapter we have demonstrated that dependence on network edge count hinders the ability of statistics to detect certain changes in dynamic networks. To remedy this we have introduced the concept of Size Consistency and shown that statistics with this property are less affected by edge count variation.


Figure 6.24.: Subgraph responsible for most of the triangle probability anomaly in the Facebook network at July 20 (before anomaly) and July 21 (during anomaly), 2007 respectively.


Figure 6.25.: Subgraph responsible for most of the graph edit distance anomaly in the Facebook network at October 15 (before anomaly) and October 16 (during anomaly), 2007 respectively.

We proposed three Size Consistent network statistics, Mass Shift, Probabilistic Degree, and Triangle Probability, to replace the graph edit distance, degree distribution, and clustering coefficient statistics. These statistics are provably Size Consistent, and we demonstrated using synthetic trials that anomaly detectors using our statistics have superior performance on variable-sized networks.


Figure 6.26.: Subgraph responsible for most of the Barrat clustering anomaly in the Facebook network at May 17 (before anomaly) and May 18 (during anomaly), 2007 respectively.

The framework for developing Size Consistent network statistics can be applied to new statistics in the future. We hope that researchers who propose new network statistics will analyze the effects that changing network size has on their proposed statistics and ensure that those statistics meet the Size Consistency requirements.

7 CONCLUSIONS

In the course of this work, we have identified one of the main problems facing hypothesis tests regarding network properties: the multi-dependence between common network statistics and the properties they are supposed to measure. We have framed this issue of confounding effects as a mismatch between the one-to-one relationship desired between network statistics and network properties and the reality of a complex set of dependencies. These interdependencies prevent proper conclusions about the network properties, as we cannot disambiguate between statistic and property, leading to both false positives and false negatives when we attempt to use said statistics for tasks like anomaly detection.

To remedy this we have proposed several methods to reduce these confounding errors by eliminating some or all of these interdependencies. We have shown that a carefully designed permutation test can produce an artificial population where non-permuted properties are the same as in the original example, allowing testing of the hypothesized property in isolation. Rather than applying the single available statistic and conjecturing about which network property is responsible, we utilized strict control of the semi-synthetically generated null distributions to test for each property separately.

In cases where the permutation would be too complex to perform, we have introduced a set of consistent statistics which are effectively independent of the most common confounding factors, the edge count and node count of the network. We have proven that the proposed empirical size-consistent statistics are indeed consistent with respect to edge and node count and that the number of errors they produce decreases as the size of the network increases. Using synthetic, semi-synthetic, and real-world data we have shown the advantage of these statistics over conventional alternatives when detecting anomalies in a dynamic network, producing fewer false positives and false negatives. Furthermore, the framework with which we demonstrated the advantage of consistent statistics is generalizable: future statistics which meet the consistency requirements laid out in this work will also avoid confounding due to graph size.

This work has demonstrated the critical nature of confounding effects when performing hypothesis tests or anomaly detection on networks; future work on hypothesis tests for networks should take steps to control those effects or use consistent statistics as demonstrated in this dissertation.

Contributions

• Identified the multiple dependence between network statistics and network properties, most importantly network size, as a major obstacle to accurate network hypothesis testing and anomaly detection.

• Developed a randomization testing approach to address this problem when a proper permutation operation can be defined. We then demonstrated its utility in distinguishing two typically confounded network properties.

• Proposed a size-consistent statistic approach to dynamic network anomaly detection which measures network properties without interference from edge and node count dependencies.

• Demonstrated the utility of the size-consistent statistics in a dynamic anomaly detection scenario, showing how changes in the size of the network can confound standard statistics but are not a problem for size-consistent statistics.

• Shown how to investigate the regions of a network most responsible for detected anomalies and provided examples of novel events discovered through this method.

Future Work

The framework for consistent network statistics outlined in this dissertation is applicable to new and existing network statistics. In this dissertation, we developed size-consistent variants of a subset of existing network statistics; hop-plot, random walk distance, centrality, and motif statistics are all possible candidates for improvement using the size-consistent framework.

This work has focused on dynamic networks rather than sets of static networks. While dynamic networks encompass many interesting scenarios, they carry an assumption of node identifiability, i.e. nodes are uniquely labelled and graphs from the same dynamic network can have their nodes paired by ID. In the case of static networks there may be no mapping between nodes in each network, which either limits the types of network statistics that can be applied or requires some kind of mapping procedure to take place. Probability mass shift, for example, assumes that the rows and columns of the $P^*$ matrices match node-for-node; to apply it without this mapping requires solving the graph isomorphism problem.

In this dissertation we have proven consistency properties for network statistics. As shown in related work, network models like ERGM also suffer from size inconsistency [82]. Developing network models which are valid for hypothesis tests, which can be learned in a scalable fashion, and which obtain consistent parameter values when learning from data of different sizes would be a useful research direction. These consistent statistics may be the foundation for new graph model definitions which avoid the problems that ERGM suffers from.

Many existing anomaly detection approaches only report the network as experiencing an event when the event has already started to occur. In real-world settings, advance notice of possible anomalies may be much more valuable than detecting current problems. Given examples of anomalous behavior, we would want to find graph structures in previous graph observations which indicate this type of anomaly and learn a predictive model. This type of problem requires causal reasoning about possible precursors to anomalies and is a direction with a tremendous amount of potential, allowing destructive anomalies to be predicted and possibly averted.

LIST OF REFERENCES

[1] D. Jensen, J. Neville, and B. Gallagher. Why collective inference improves relational classification. SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 593–598, 2004.
[2] R. Hanneman and M. Riddle. Introduction to Social Network Methods. University of California Riverside, 2005.
[3] D. Watts and S. Strogatz. Collective dynamics of small-world networks. Nature, 393(6684):440–442, 1998.
[4] M. Brinkmeier and T. Schank. Network statistics. Network Analysis, pages 293–317, 2005.
[5] J. Saramäki et al. Generalizations of the clustering coefficient to weighted complex networks. Physical Review, 2007.
[6] S. Soffer and A. Vázquez. Network clustering coefficient without degree-correlation biases. Physical Review, 71.5, 2005.
[7] P. Wang et al. Efficiently estimating motif statistics of large networks. Transactions on Knowledge Discovery from Data, 2014.
[8] C. Borgs, J. Chayes, L. Lovász, V. Sós, B. Szegedy, and K. Vesztergombi. Graph limits and parameter testing. In ACM Symposium on Theory of Computing, 2006.
[9] S. Aral, L. Muchnik, and A. Sundararajan. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51):21544–21549, 2009.
[10] K. Stephenson and M. Zelen. Rethinking centrality: Methods and examples. Social Networks, 11(1):1–37, 1989.
[11] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25.2:163–177, 2001.
[12] P. Bonacich. Some unique properties of eigenvector centrality. Social Networks, 29:555–564, 2007.
[13] J. Pfeiffer and J. Neville. Methods to determine node centrality and clustering in graphs with uncertain structure. International AAAI Conference on Web and Social Media, 2011.
[14] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report SIDL-WP-1999-66, Stanford InfoLab, November 1999.

[15] S. Milgram. The small world problem. Psychology Today, 2.1:60–67, 1967.
[16] M. Gaston, M. Kraetzl, and W. Wallis. Using graph diameter for change detection in dynamic networks. Australasian Journal of Combinatorics, 35:299, 2006.
[17] P. Shoubridge, M. Kraetzl, and D. Ray. Detection of abnormal change in dynamic networks. In Information, Decision and Control, pages 557–562. IEEE, 1999.
[18] P. Shoubridge, M. Kraetzl, W. Wallis, and H. Bunke. Detection of abnormal change in a time series of graphs. Journal of Interconnection Networks, 3(2):85–101, 2002.
[19] M. Berlingerio, D. Koutra, and C. Faloutsos. Network similarity via multiple social theories. International Conference on Advances in Social Networks Analysis and Mining, 2012.
[20] D. Koutra, J. Vogelstein, and C. Faloutsos. Deltacon: A principled massive-graph similarity function. Proceedings of the SIAM International Conference on Data Mining, 2013.
[21] M. Stephens. EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69(347):730–737, 1974.
[22] M. Dow and J. Cheverud. Comparison of distance matrices in studies of population structure and genetic microdifferentiation: Quadratic assignment. American Journal of Physical Anthropology, 68.3:367–373, 1985.
[23] L. Akoglu, H. Tong, and D. Koutra. Graph based anomaly detection and description: A survey. Data Mining and Knowledge Discovery, 29.3:626–688, 2015.
[24] H. Bunke. Error correcting graph matching: On the influence of the underlying cost function. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9):917–922, 1999.
[25] H. Bunke, P. Dickinson, A. Humm, C. Irniger, and M. Kraetzl. Computer network monitoring and abnormal event detection using graph matching and multidimensional scaling. In Industrial Conference on Data Mining, pages 576–590. Springer, 2006.
[26] M. Peabody. Finding groups of graphs in databases. PhD dissertation, Drexel University, 2002.
[27] P. Papadimitriou, A. Dasdan, and H. Garcia-Molina. Web graph similarity for anomaly detection. Journal of Internet Services and Applications, 1(1):19–30, 2010.
[28] K. Henderson, B. Gallagher, L. Li, L. Akoglu, T. Eliassi-Rad, H. Tong, and C. Faloutsos. It's who you know: Graph mining using recursive structural features. In SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 663–671. ACM, 2011.
[29] R. Fisher. Statistical Methods and Scientific Inference. Hafner Publishing Co., 1956.

[30] D. Jensen, J. Neville, and M. Rattigan. Why collective inference improves relational classification. SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 593–598, 2004.
[31] T. Snijders. Non-parametric standard errors and tests for network statistics. Connections, 1999.
[32] S.H. Lee, P. Kim, and H. Jeong. Statistical properties of sampled networks. Physical Review, 73.1, 2006.
[33] K. Borner, J. Maru, and R. Goldstone. The simultaneous evolution of author and paper networks. Proceedings of the National Academy of Sciences, 101:5266–5273, 2004.
[34] L. Backstrom et al. Group formation in large social networks: Membership, growth, and evolution. SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 44–54, 2006.
[35] T. La Fond and J. Neville. Randomization tests for distinguishing social influence and homophily effects. World Wide Web Conference, 2010.
[36] T. La Fond, J. Neville, and B. Gallagher. Anomaly detection in networks with changing trends. Outlier Detection and Description under Data Diversity at the International Conference on Knowledge Discovery and Data Mining, 2014.
[37] K. Faust and J. Skvoretz. Comparing networks across space and time, size and species. Sociological Methodology, 2002.
[38] N. Contractor, S. Wasserman, and K. Faust. Testing multitheoretical, multilevel hypotheses about organizational networks: An analytic framework and empirical example. Academy of Management Review, 31.3:681–703, 2006.
[39] D. Krackhardt. Predicting with networks: Nonparametric multiple regression analysis of dyadic data. Social Networks, 10.4:359–381, 1988.
[40] S. Moreno, J. Neville, and S. Kirshner. Learning mixed kronecker product graph models with simulated method of moments. SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.
[41] S. Moreno and J. Neville. Network hypothesis testing using mixed kronecker product graph models. International Conference on Data Mining, 2013.
[42] P. Yates and N. Mukhopadhyay. An inferential framework for biological network hypothesis tests. BMC Bioinformatics, 2013.
[43] J. Young. How do they 'end up together'? A social network analysis of self-control, homophily, and adolescent relationships. Journal of Quantitative Criminology, 27:251, 2011.
[44] J. Boyd, W. Fitzgerald, and R. Beck. Computing core/periphery structures and permutation tests for social relations data. Social Networks, 28, 2006.
[45] B. Gallagher. Matching structure and semantics: A survey on graph-based pattern matching. Association for the Advancement of Artificial Intelligence, pages 45–53, 2006.

[46] K. Henderson et al. RolX: Structural role extraction & mining in large graphs. SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.
[47] T. Zimmerman and N. Nagappan. Predicting defects using network analysis on dependency graphs. International Conference on Software Engineering, 2008.
[48] J. Wang and C. Q. Chiu. Detecting online auction inflated-reputation behaviors using social network analysis. North American Association for Computational Social and Organizational Sciences, 2005.
[49] R.S. Burt. Decay functions. Social Networks, 22, 2000.
[50] D. Lazer et al. The coevolution of networks and political attitudes. Political Communications, 27.3:248–274, 2010.
[51] B. Desmarais and S. Cranmer. Micro-level interpretation of exponential random graph models with application to estuary networks. Policy Studies Journal, 2012.
[52] T. Snijders. Models for longitudinal network data. Models and Methods in Social Network Analysis 1, 2005.
[53] K. Borner, S. Sanyal, and A. Vespignani. Network science. Annual Review of Information Science and Technology, 41, 2007.
[54] T. Snijders, G. Van de Bunt, and C. Steglich. Introduction to stochastic actor-based models for network dynamics. Social Networks, 2010.
[55] T. Snijders, C. Steglich, and M. Schweinberger. Modeling the co-evolution of networks and behavior. Longitudinal Models in the Behavioral and Related Sciences, pages 41–71, 2007.
[56] S. Hanneke, W. Fu, E. Xing, et al. Discrete temporal models of social networks. Electronic Journal of Statistics, 2010.
[57] A. Sanil, D. Banks, and K. Carley. Models for evolving fixed node networks: Model fitting and model testing. Social Networks, 17:65–81, 1995.
[58] I. McCulloh. Detecting changes in a dynamic social network. Technical Report CMU-ISR-09-104, 2009.
[59] M. Mongiovì, P. Bogdanov, R. Ranca, E. Papalexakis, C. Faloutsos, and A. Singh. Netspot: Spotting significant anomalous regions on dynamic networks. In SIAM International Conference on Data Mining, 2013.
[60] C. Priebe et al. Scan statistics on Enron graphs. Computational & Mathematical Organization Theory, 2005.
[61] R. Madhavan, B.R. Koka, and J.E. Prescott. Networks in transition: How industry events (re)shape interfirm relationships. Strategic Management Journal, 19:439–459, 1998.
[62] A. Anagnostopoulos, R. Kumar, and M. Mahdian. Influence and correlation in social networks. SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 7–15, 2008.

[63] P. Singla and M. Richardson. Yes, there is a correlation: From social networks to personal behavior on the web. World Wide Web Conference, pages 655–664, 2008.
[64] D. Crandall et al. Feedback effects between similarity and social influence in online communities. SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 160–168, 2008.
[65] R. Milo et al. Network motifs: Simple building blocks of complex networks. Science, 298:5594:824–827, 2002.
[66] H. Eldardiry and J. Neville. A resampling technique for relational data graphs. SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.
[67] G. Robins, T. Snijders, P. Wang, M. Handcock, and P. Pattison. Recent developments in exponential random graph (p*) models for social networks. Social Networks, 29:192–215, 2007.
[68] D. Banks and K. Carley. Models for network evolution. Journal of Management Studies, 1996.
[69] T. Snijders et al. New specifications for exponential random graph models. Sociological Methodology, 2006.
[70] A. Tartakovsky, H. Kim, B. Rozovskii, and R. Blažek. A novel approach to detection of 'denial-of-service' attacks via adaptive sequential and batch-sequential change-point detection methods. IEEE Transactions on Signal Processing, 2006.
[71] V. Siris and F. Papagalou. Application of anomaly detection algorithms for detecting SYN flooding attacks. Computer Communications, 2006.
[72] I. McCulloh and K. Carley. Social network change detection. Technical Report No. CMU-ISR-08-116, 2008.
[73] M. Basseville and I. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice Hall, 1993.
[74] Y. Kawahara and M. Sugiyama. Change-point detection in time-series data by direct density-ratio estimation. In SIAM International Conference on Data Mining, 2009.
[75] J. Chen and A. Gupta. On change point detection and estimation. Communications in Statistics-Simulation and Computation, 2001.
[76] G. Box, G. Jenkins, G. Reinsel, and G. Ljung. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 2015.
[77] L. Xu et al. Quantifying signals with power-law correlations: A comparative study of detrended fluctuation analysis and detrended moving average techniques. Physical Review, 2005.
[78] P. Erdős and A. Rényi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.

[79] W. Aiello, F. Chung, and L. Lu. A random graph model for massive graphs. Symposium on Theory of Computing, 2000.
[80] J. Pfeiffer, T. La Fond, S. Moreno, and J. Neville. Fast generation of large scale social networks while incorporating transitive closures. In IEEE SocialCom, 2012.
[81] D. Hunter, M. Handcock, C. Butts, S. Goodreau, and M. Morris. ergm: A package to fit, simulate and diagnose exponential-family models for networks. Journal of Statistical Software, 2008.
[82] C. R. Shalizi and A. Rinaldo. Consistency under sampling of exponential random graph models. The Annals of Statistics, 2013.
[83] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 307–318, 1998.
[84] J. Neville and D. Jensen. Iterative classification in relational data. Proceedings of the Workshop on Statistical Relational Learning, 17th National Conference on Artificial Intelligence, pages 42–49, 2000.
[85] B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. Proceedings of the 17th International Joint Conference on Artificial Intelligence, pages 870–878, 2001.
[86] P. Marsden and N. Friedkin. Network studies of social influence. Sociological Methods and Research, 22(1):127–151, 1993.
[87] P. Doreian. Network autocorrelation models: Problems and prospects. Spatial Statistics: Past, Present, and Future, pages 369–389, 1990.
[88] M. McPherson, L. Smith-Lovin, and J. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27:415–445, 2001.
[89] T. Mirer. Economic Statistics and Econometrics. Macmillan Publishing Co, 1983.
[90] L. Anselin. Spatial Econometrics: Methods and Models. Kluwer Academic Publishers, 1998.
[91] F. Guo, S. Hanneke, W. Fu, and E. Xing. Recovering temporally rewiring networks: A model-based approach. Proceedings of the 24th International Conference on Machine Learning, 2007.
[92] U. Sharan and J. Neville. Temporal-relational classifiers for prediction in evolving domains. International Conference on Data Mining, 2008.
[93] W. Mason, F. Conrey, and E. Smith. Situating social influence processes: Dynamic, multidirectional flows of influence within social networks. Personality and Social Psychology Review, 11:3:279–300, 2007.
[94] E. Noreen. Computer Intensive Methods for Hypothesis Testing: An Introduction. John Wiley & Sons, 1989.
[95] J. Onnela, J. Saramäki, J. Kertész, and K. Kaski. Intensity and coherence of motifs in weighted complex networks. Physical Review, 71(6):065103, 2005.

[96] P. Holme, S. Park, B. Kim, and C. Edling. Korean university life in a network perspective: Dynamics of a large affiliation network. Physica A: Statistical Mechanics and its Applications, 373:821–830, 2007.
[97] G. Bohrnstedt and A. Goldberger. On the exact covariance of products of random variables. Journal of the American Statistical Association, 1969.

VITA

Timothy La Fond was born in Orange, California, on February 3, 1986. After finishing high school, he attended the University of Washington from 2004 to 2008, earning a B.S. in computer engineering. He received an M.S. in computer science/statistics from Purdue University in December 2012 and a Ph.D. in computer science from Purdue University in August 2016.