Controlling for Confounding Network Properties in Hypothesis Testing and Anomaly Detection
Timothy La Fond
Purdue University

Purdue University
Purdue e-Pubs
Open Access Dissertations, Theses and Dissertations
8-2016

Controlling for confounding network properties in hypothesis testing and anomaly detection
Timothy La Fond, Purdue University

Follow this and additional works at: https://docs.lib.purdue.edu/open_access_dissertations
Part of the Computer Sciences Commons, and the Statistics and Probability Commons

Recommended Citation
Fond, Timothy La, "Controlling for confounding network properties in hypothesis testing and anomaly detection" (2016). Open Access Dissertations. 791.
https://docs.lib.purdue.edu/open_access_dissertations/791

This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact [email protected] for additional information.

Graduate School Form 30 Updated
PURDUE UNIVERSITY GRADUATE SCHOOL
Thesis/Dissertation Acceptance

This is to certify that the thesis/dissertation prepared
By: Timothy La Fond
Entitled: Controlling For Confounding Network Properties in Hypothesis Testing and Anomaly Detection
For the degree of: Doctor of Philosophy
Is approved by the final examining committee:
Jennifer Neville (Chair)
David Gleich
Chris Clifton
Daniel Aliaga

To the best of my knowledge and as understood by the student in the Thesis/Dissertation Agreement, Publication Delay, and Certification Disclaimer (Graduate School Form 32), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy of Integrity in Research” and the use of copyright material.
Approved by Major Professor(s): Jennifer Neville
Approved by: William Gorman, Head of the Departmental Graduate Program
Date: 7/26/2016

CONTROLLING FOR CONFOUNDING NETWORK PROPERTIES IN HYPOTHESIS TESTING AND ANOMALY DETECTION

A Dissertation Submitted to the Faculty of Purdue University by Timothy La Fond
In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
August 2016
Purdue University
West Lafayette, Indiana

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 Introduction
2 Networks and Hypothesis Testing
  2.1 Graph Notation
  2.2 Properties Influencing Network Formation
  2.3 Statistics for a Single Observed Network
  2.4 Delta Statistics for Pairs of Observed Networks
  2.5 Hypothesis Tests
  2.6 Hypothesis Tests on Dynamic Networks
3 Related Work
  3.1 First Component (Chapter Five): Testing a Global/Community Properties using a Network Statistic and Permutation
  3.2 Second Component (Chapter 6): Distinguishing the Properties of Different Graphs using Multiple Observations of Consistent Network Statistics
  3.3 Time Series Analysis
  3.4 Network Models
4 Network Statistic Dependencies
  4.1 Graph Edit Distance
  4.2 Clustering Coefficient
  4.3 Degree Distribution
  4.4 Centrality Measures and Shortest Paths
  4.5 Autocorrelation
  4.6 Netsimile Distance
  4.7 Deltacon Distance
  4.8 Summary
5 Randomization Testing Solution
  5.1 Introduction
  5.2 Problem Definition
  5.3 Method
    5.3.1 Calculating the Expected Chi-Squared Gain
  5.4 Synthetic Data Experiments
    5.4.1 Data
    5.4.2 Methodology
  5.5 Experimental Results
  5.6 Real Data Experiments
    5.6.1 Data
    5.6.2 Experimental Results
  5.7 Discussion
6 Size-Consistent Statistics for Anomaly Detection in Dynamic Networks
  6.1 Introduction
  6.2 Problem Definition and Data Model
  6.3 Properties of the Test Statistic
    6.3.1 Size Consistency
    6.3.2 Size Inconsistency
  6.4 Network Statistics
    6.4.1 Conventional Statistics
    6.4.2 Proposed Size Consistent Statistics
  6.5 Anomaly Detection Process
    6.5.1 Smoothing
    6.5.2 Complexity Analysis
  6.6 Experiments
    6.6.1 Synthetic Data Experiments
    6.6.2 Semi-Synthetic Data Experiments
    6.6.3 Real Data Experiments
  6.7 Local Anomaly Decomposition
  6.8 Summary
7 Conclusions
LIST OF REFERENCES

LIST OF TABLES

3.1 Categorization of related work in static networks, by type of statistic and network property being tested.
3.2 Categorization of related work in dynamic networks, by type of statistic and network property being tested.
3.3 Categorization of related work in static networks, by H0 generation method and network property being tested.
3.4 Categorization of related work in dynamic networks, by H0 generation method and network property being tested.
5.1 Contingency table for linked node pairs and their attribute values.
5.2 Changes to the contingency table in time t + 1.
5.3 Type I error and power for the choice-based randomization method.
5.4 Number of groups detected by randomization tests
5.5 Example groups with each possible combination of effect.
6.1 Glossary of terms
6.2 Statistical properties of previous network statistics and our proposed alternatives.

LIST OF FIGURES

1.1 Diagram of a simple anomaly detection process.
1.2 Diagram of the anomaly detection process and confounding effects.
1.3 Common network statistics and confounded network properties
2.1 Flowchart of the graph generation process and statistic generation.
2.2 Creation of null distribution and test point from graph statistics.
2.3 Flowchart of delta statistic generation.
5.1 Confounding effect example with the communication volume property controlled.
5.2 Illustration of homophily and influence affect on attributes and links over time.
5.3 Randomization testing procedure. Semi-synthetic graphs are created by permuting the original to construct a null distribution.
5.4 Homophily significance test method
5.5 Influence significance test method
5.6 Choice-based randomization method for assessing homophily
5.7 Choice-based randomization method for assessing influence
5.8 Method to measure Type I error rate.
5.9 Method to measure statistical power.
5.10 Power analysis of influence and homophily randomization tests. (a) Influence test power, as number of group additions increases. (b) Influence test power, as effect size increases. (c) Homophily test power, as effect size increases.
6.1 Statistic values network data generated from same model, but with increasing size. Behavior of (a) Consistent Statistic; (b) Inconsistent Statistic. Black pts: avg of 100 trials, red pts: [min, max].
6.2 Controlling for confounding effects through careful definition of the test statistics.
6.3 Dynamic network anomaly detection task. Given past instances of graphs created by the typical model of behavior, identify any new graph instance created by an alternative anomalous model.
6.4 Graph generation process. The matrix P∗ represents all possible nodes and their interaction probabilities. By sampling |V| nodes and |W| edges the observed graph G is obtained.
6.5 Estimation of the Pt matrix from the observed W weights.
6.6 Mass of the cells in P increase as |V| decreases.
6.7 Visualization of the regions averaged to obtain p∗, p, and p∗_V.
6.8 Creation of the dynamic graph sequence from message stream using time step width Δ.
6.9 Anomaly detection procedure for a graph stream {G1...Gt}, graph statistic Sk and p-value α.
6.10 Diagram of synthetic experiments and the three sets of generated graphs.
6.11 Synthetic data experimental procedure for statistic Sk using normal probability matrix PN, anomalous probability matrix PA, and Δ additional edges in False Positive
