Streaming Temporal Graphs

by

Eric L. Goodman

B.S. Computer Science and Math, Brigham Young University, 2003

M.S. Computer Science, Brigham Young University, 2005

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2019

This thesis entitled: Streaming Temporal Graphs written by Eric L. Goodman has been approved for the Department of Computer Science

Dr. Dirk Grunwald

Dr. Daniel Massey

Dr. Eric Keller

Dr. Qin Lv

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above-mentioned discipline.

Goodman, Eric L. (Ph.D., Computer Science)

Streaming Temporal Graphs

Thesis directed by Dr. Dirk Grunwald

We provide a domain specific language called the Streaming Analytics Language (SAL) to write concise but expressive analyses of streaming temporal graphs. We target problems where the data comes as an infinite stream and where the volume is prohibitive, requiring a single pass over the data and tight spatial and temporal complexity constraints. Also, each item in the stream can be thought of as an edge in a graph, and each edge has an associated timestamp and duration.

Cyber security data is a real-world example of a streaming temporal graph. Machines communicate with each other within a network, forming a streaming sequence of edges with temporal information.

As such, we elucidate the value of SAL by applying it to a large range of cyber-related problems. With a combination of vertex-centric computations that create features per vertex, and subgraph matching to

find communication patterns of interest, we cover a wide spectrum of important cyber use cases. As an example, we discuss Verizon’s Data Breach Investigations Report, and show how SAL can be used to capture most of the nine different categories of cyber breaches. Also, we apply SAL to discovering botnet activity within network traffic in 13 different scenarios, with an average area under the curve (AUC) of the receiver operating characteristic (ROC) of 0.87.

Besides SAL as a language, as another contribution we present an implementation we call the Streaming Analytics Machine (SAM). With SAM, we can run SAL programs in parallel on a cluster, achieving rates of a million netflows per second, and scaling to 128 nodes or 2560 cores. We compare SAM to another streaming framework, Apache Flink, and find that Flink cannot scale past 32 nodes for the problem of finding triangles (a subgraph of three interconnected nodes) within the streaming graph. Also, SAM excels when the subgraphs are frequent, continuing to find the expected number of subgraphs, while Flink performance degrades and under-reports. Together, SAL and SAM provide an expressive and scalable infrastructure for performing analyses on streaming temporal graphs.

Acknowledgements

A big thanks goes to my advisor, Dirk Grunwald, who helped me hone this dissertation over a lengthy period of time. The path I took was shaped by him and allowed me to create a much better product and a more relevant research contribution.

I also thank the committee members for their time and input. This includes not only the current committee of Daniel Massey, Eric Keller, Sangtae Ha, and Qin Lv, but also former members

John Black and Sriram Sankaranarayana.

Sandia National Laboratories funded my doctoral degree through their university programs, for which I am grateful. I especially appreciate the unwavering support and patience from my managers, Judy Spomer and Michael Haass.

My parents were instrumental in fostering my curiosity and desire to learn, which led me down this path of educational advancement. I still haven’t invented the transmat beam, but that will be my next project.

Contents

Chapter

1 Introduction 1

2 The Streaming Analytics Language 5

2.1 Case Study ...... 14

2.2 Static Vertex Attributes ...... 22

2.3 Temporal Subgraph Matching ...... 23

3 SAL as a Query Language for Cyber Problems 29

3.1 Crimeware ...... 30

3.2 Cyber-Espionage ...... 33

3.3 Denial of Service ...... 34

3.4 Point of Sale ...... 36

3.5 Privilege Misuse ...... 37

3.6 Web Applications ...... 39

3.7 Summary ...... 40

4 Related Work 41

4.1 Vertex-centric Computing ...... 42

4.2 Graph Domain Specific Languages ...... 47

4.2.1 Green-Marl ...... 48

4.2.2 Ligra ...... 49

4.2.3 Galois ...... 49

4.2.4 Gemini ...... 50

4.2.5 Grazelle ...... 51

4.2.6 GraphIt ...... 51

4.3 Linear Algebra-based Graph Systems ...... 52

4.3.1 Combinatorial BLAS ...... 53

4.3.2 Knowledge Discovery Toolbox ...... 53

4.3.3 Pegasus ...... 53

4.3.4 GraphMAT ...... 54

4.4 Graph APIs ...... 54

4.5 GPU Graph Systems ...... 55

4.6 SPARQL ...... 56

4.7 ...... 59

4.8 Data Stream Management Systems ...... 59

4.9 Network Telemetry ...... 61

5 Implementation 63

5.1 Partitioning the Data ...... 64

5.2 Vertex-centric Computations ...... 66

5.3 Subgraph Matching ...... 68

5.3.1 Data Structures ...... 70

5.3.2 Algorithm ...... 75

5.4 Summary ...... 82

6 Vertex Centric Computations 83

6.1 Classifier Results ...... 83

6.2 Scaling ...... 86

6.3 Summary ...... 91

7 Temporal Subgraph Matching 92

7.1 SAM Parameter Exploration ...... 94

7.2 Apache Flink ...... 96

7.3 Comparison Results: SAM vs Flink ...... 98

7.4 Simulating other Constraints ...... 103

7.5 Summary ...... 104

8 Conclusions 106

Bibliography 108

Appendix

A SAL and Flink Programs 125

A.1 Machine Learning Example ...... 125

A.2 Apache Flink Triangle Implementation ...... 126

Tables

Table

4.1 A table of vertex-centric frameworks and various attributes...... 46

5.1 This table shows the conciseness of the SAL language. For the Disclosure pipeline

(discussed in Section 2.1), the machine learning pipeline (discussed in Section 6.1),

and the temporal triangle query (discussed in Chapter 7) the amount of code needed

is reduced between 10-25 times...... 63

5.2 A listing of the major classes in SAM and how they map to language features in SAL.

We also make the distinction between consumers, producers, and feature creators.

Producers all share a parallelFeed method that sends the data to registered con-

sumers, which is how parallelization is achieved. Each of the consumers receive data

from producers. If a consumer is also a feature creator, it performs a computation on

the received data, generating a feature, which is then added to a thread-safe feature

map...... 67

6.1 A more detailed look at the AUC results. The second column is the AUC of the

ROC after training on the first half of the data and then testing on the second half.

The third column is reversed: trained on the second half and tested on the first half. 87

Figures

Figure

1.1 This dissertation is an intersection of four different research areas and presents a

domain specific language (DSL) to express streaming temporal graph and machine

learning queries...... 2

2.1 This figure represents a common pattern for vertex-centric computations in SAL

programs. A stream of edges is partitioned into N vertices that are then operated

on by o operators that compute over a sliding window of length n items...... 6

2.2 Illustration showing the subgraph nature of a Watering Hole Attack. The target

machine accesses a popular site, Bait. Shortly after accessing Bait, Target begins

communicating to a new machine, the Controller ...... 10

2.3 This figure demonstrates the use of the STREAM BY statement to logically divide

a single stream into multiple streams. The STREAM BY statement is followed

by a FOREACH GENERATE statement that creates estimates on the top two

most frequent destination ports and their frequencies. The FILTER statement uses

this information to downselect to IPs where the top two ports account for 90% of

the traffic...... 16

2.4 This figure demonstrates the use of the TRANSFORM and COLLAPSE BY

statements to define parts of the Disclosure pipeline...... 19

3.1 A subgraph showing the flow of messages for data exfiltration. A controller sends

a small control message to an infected machine, which then sends large amounts of

data to a drop box...... 33

3.2 A subgraph showing an XSS attack...... 39

4.1 SPARQL Query number two from the Lehigh University Benchmark...... 57

5.1 Architecture of the system: data comes in over a socket layer and is then distributed

via ZeroMQ...... 64

5.2 The push pull object...... 65

5.3 Traditional Compressed Sparse Row (CSR) data structure...... 72

5.4 Modified Compressed Sparse Row (CSR) data structure...... 72

5.5 An overview of the algorithm employed to detect subgraphs of interest. Asyn-

chronous threads are launched to consume each edge. The consume threads then

performs five different operations. Another set of threads are pulling requests from

other nodes for edges within the Request Communicator. Finally, one more set of

threads pull edges from other nodes in reply to edge requests...... 75

6.1 This graphic shows the AUC of the ROC curve for each of the CTU scenarios. We

first train on the first half of the data and test on the second half, and then switch. . 87

6.2 Weak Scaling Results: For each run with n nodes, a total of n million netflows

were run through the system with a million contiguous netflows per node randomly

selected from the CTU dataset. There were two types of runs, with renamed IPs (to

create the illusion of a larger network) or with randomized IPs. For the renamed

IPs, there continues to be improved throughput until 61 nodes/ 976 cores. For the

randomized IPs, the scaling continues through the total 64 nodes / 1024 cores. . . . 89

6.3 This shows the weak scaling efficiency. The efficiency continues to degrade for the

renamed IPs, but for the randomized IPs, the efficiency hovers a little above 0.5. This

is largely due to the fact that the randomized IPs almost perfectly load-balances the

work, while the renamed IPs encounters load imbalances because of the power-law

nature of the data...... 89

7.1 We vary the number of pull threads and the number of push sockets (NPS) for the

RequestCommunicator to determine the best setting for maximum throughput. The

RequestMap object utilizes the RequestCommunicator during the process(edge)

function...... 94

7.2 Shows the maximum throughput for SAM and Flink. We vary the number of nodes

with values of 16, 32, 64, and 120. We vary the number of vertices, |V |, between

25,000 and 150,000. Flink has the best throughput with 32 nodes, after which adding

additional nodes hurts the overall aggregate rate and the number of found triangles

continues to decrease. SAM is able to scale to 120 nodes...... 101

7.3 Another view of the data where we plot the aggregate throughput versus the total

number of triangles to find. This view emphasizes that SAM does better with more

complex problems, i.e. more triangles...... 101

7.4 Another look at the maximum throughput for SAM and Flink. Here we relax the re-

striction that the number of reported triangles must be higher than 85% of expected,

but keep the latency restrictions that the computation must complete within 15 sec-

onds of termination of input. Flink struggles increasing to 64 nodes, and then drops

consistently with 120 nodes. SAM continues to increase throughput through 120

nodes...... 102

7.5 This graph shows the ratio of found triangles divided by the number of expected

triangles given by Theorem 2. The rate for each method was the same as from

Figure 7.4. SAM stays fairly consistent throughout the runs, with an average of

0.937. For Flink, the ratio decreases as the number of nodes increases. The average

ratio starts at 0.873 with 16 nodes, and then decreases to an average of 0.521 with

120 nodes...... 102

7.6 We keep the first edges of queries (kq) with probabilities {0.01, 0.1, 1}. We also vary

|V | to range from 25,000 to 150,000. This figure is a plot of edges/second versus the

number of nodes...... 104

7.7 We vary the rate at which the first edges of queries are kept (kq = keep queries)

and plot the edges per day versus the rate at which triangles occur...... 105

Chapter 1

Introduction

Graphs, sets of entities interconnected with edges, are a popular and powerful method for representing many problems. Cyber systems [109, 154], where machines communicate with each other through network interfaces, naturally create interaction graphs over time. Additionally, social networks [54, 140], the electrical grid [79], protein-protein interaction networks [193, 103], neuronal dynamics [31], and many other domains are all naturally modeled as graphs. While there is a rich history and theory behind graphs, analyzing their temporal nature is less well understood, and the temporal dimension can be critical to analyzing and detecting patterns of interest. In particular, activity within cyber systems can be naturally described as a graph with temporal information on the edges. The machines in the network are the nodes, while communications between machines form edges with several properties, one of them being the interval of time during which the communication occurred.

Another dimension to consider is the data generation rate. For example, cyber data produced on high-speed networks with 100 Gb/s links and multiple-Tb/s switches creates a vast stream of data that is challenging to analyze. A category of research called streaming and semi-streaming [144, 59] has produced algorithms with strict spatial and temporal constraints to deal with infinite streams, but these operators have largely been narrowly customized to specific problems, and not incorporated into a larger framework for data analysis. There is also a wealth of projects with streaming APIs [12, 190, 188, 95, 172, 11, 78, 13, 10, 64], but a standardized, efficient way of expressing streaming analyses has not arisen.

Figure 1.1: This dissertation is an intersection of four different research areas and presents a domain specific language (DSL) to express streaming temporal graph and machine learning queries.

The main contribution of this dissertation is to combine within one domain specific language

(DSL) the ability to succinctly express

• graph operations,

• temporal patterns,

• machine learning pipelines, and

• streaming operators.

We call this DSL the Streaming Analytics Language or SAL. SAL can be applied to any stream of data. The only requirement is that the stream be a sequence of tuples. However, to flesh out the details of SAL, we focused our efforts on analyzing network traffic and detecting malicious behavior.

Network analysis has all the components previously discussed. Machines interacting with each other on a network produce a graph, where the nodes are Internet Protocol endpoints (IPs) and the edges are communications between IPs. Each communication has a start time and a duration, fulfilling the temporal aspect of the problem. Additionally, there are communication patterns among nodes, i.e. subgraphs, that are of interest and may indicate malicious behavior, and these patterns have a temporal dimension that is vital to correctly identifying the subgraph. For instance, say node A talks to node B at time t1 and B talks to C at time t2. You might infer causality if t1 < t2, but the opposite is not true, even though the topological structure is the same either way. There are several approaches that utilize the topological graph structure of cyber interactions [91, 97, 53, 113, 178, 109, 191, 218, 206, 145] or the temporal dimension of the communication [23, 53, 196, 109, 191, 85, 206, 164]. We seek to combine these ad-hoc approaches within a unifying framework, namely SAL.

In terms of machine learning, there are many approaches that use machine learning to classify network traffic. For example:

• botnet detection [138, 23, 84, 223, 198, 105, 184]

• intrusion detection [30, 185, 124, 167]

• classifying network traffic by application [101, 26, 146, 212, 57, 214]

• insider threat detection [134, 69, 92]

An overall theme behind these works is to extract meaningful features and then perform machine learning, either in an unsupervised or supervised approach. The contribution of this work is to express these myriad approaches within a single DSL and framework.

Finally, cyber systems quickly become a streaming problem as the size of the network grows.

Data is generated with such speed that often a single pass over the data is all that can be accomplished. Additionally, tight spatial and temporal complexity constraints are required to keep pace with data creation, and to not overwhelm limited computational resources. There are a number of streaming operators that have been published [16, 72, 132, 46, 221, 14, 48, 42, 141].

SAL provides these streaming operators as first-class citizens in the language.

SAL can be categorized into two types of computations: vertex-centric and subgraph matching. For vertex-centric, the data is partitioned by vertices in the graph, and calculations are computed based upon the tuples belonging to the vertex. More formally, let T be an infinite sequence of tuples that have a source vertex and a destination vertex, i.e. each tuple is an edge with vertices defined along with potentially other fields. Let V be the set of all vertices. Let Tv be the sequence of tuples where t ∈ Tv ⇒ source(t) = v ∨ target(t) = v for some v ∈ V . We refer to vertex-centric computations as any function f that is computed continuously over each sequence Tv for some subset of V .

Subgraph matching is finding subgraphs that fit topological, vertex, and edge constraints.

Topological constraints define a subgraph isomorphism problem between a graph G = (V, E) and a query graph H = (V′, E′). The task is to find all subgraphs g ∈ G such that there is a bijection f that maps the vertices and edges of g to the vertices and edges of H. Namely, let g = (V̂, Ê) where V̂ ⊆ V and Ê ⊆ E ∩ (V̂ × V̂). Then for g to be isomorphic to H, there must exist a bijection f : V̂ → V′ such that (a, b) ∈ Ê ⇒ (f(a), f(b)) ∈ E′. Vertex constraints define further requirements on the properties of the vertices of g that must be fulfilled for a match. Often, features are derived for vertices using vertex-centric computations, and these features are then used to further refine the subgraph query. Edge constraints define requirements on the fields within an edge and between edges. For example, one can compare the start time of one edge, start(e0), to the start time of another edge, start(e1), to ensure temporal ordering such as start(e0) < start(e1). These two components of SAL, vertex-centric and subgraph matching, provide an expressive base from which to succinctly describe many of the analyses discussed earlier.

Besides the language itself as a contribution, we also present a scalable implementation that translates SAL into C++ code that runs in parallel on a cluster of machines. We call this interpreter the Streaming Analytics Machine, or SAM. The code for SAM is over 19,000 lines, and scales to

128 nodes or 2560 cores and can process one million edges per second from streaming netflow data.

The rest of this dissertation is organized as follows: Chapter 2 introduces SAL and the various ways of expressing streaming computations. Chapter 3 gives example SAL programs to express a variety of cyber use cases. Chapter 4 discusses related work. Chapter 5 covers the implementation of SAM, and how SAL maps to SAM. Chapter 6 discusses performance achieved via vertex-centric computations, both in terms of scalability and classifier accuracy. Chapter 7 presents our results with subgraph matching. Chapter 8 concludes.

Chapter 2

The Streaming Analytics Language

In this chapter we present the Streaming Analytics Language (SAL), define the different aspects of the language, and describe case studies and examples to illustrate the concepts. SAL is a combination of imperative statements used for feature extraction and declarative, SPARQL-like

[166] statements, used to express subgraph queries.

The streaming aspect of this work introduces important constraints on the design of SAL and what operations we allow to be expressed. We utilize streaming algorithms, an area of research where the desire is to keep the spatial and temporal requirements polylogarithmic, i.e. O((log n)^k) for some k, where n is the size of the sliding window over which computations are performed [144]. Sometimes those constraints, in particular the temporal complexity, are relaxed

[130]. Several algorithms have been published within these constraints:

• K-medians [16]

• Frequent Items [72]

• Mean/Frequency Counts [132, 46, 221]

• Quantiles [14]

• Rarity [48]

• Variance [16, 221]

• Vector norms [46]

• Similarity [48]

• Count distinct elements [42, 141]

We expose these algorithms as first-class citizens in SAL.

The motivation behind the tight spatial bounds is that we are operating over an infinite stream of data where the ingest rates prohibit storing the entirety of the data within a sliding window. For example, a common structure of many vertex-centric SAL programs is represented in Figure 2.1. There is a stream of edges that is partitioned into N streams, one for each vertex. Each vertex has o operators applied to it, each of which uses a window of length n. Assume that the o operators utilize different aspects of the edge data, i.e. they don’t share the same data, and that each element of the window requires b bytes. Non-streaming approaches would require N × o × n × b bytes while streaming approaches require N × o × log(n) × b bytes. Given example numbers of a million vertices, 10 operators, a window size of 5000, and 4 bytes per element, non-streaming approaches would necessitate about 190 GB while streaming approaches require only 500 MB. While 190 GB could reasonably fit within the memory of a single system, it doesn’t take much additional complexity to make the program unable to reside within memory. Our implementations of operators within SAL all have polylogarithmic spatial complexity.
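To make the arithmetic explicit (our own check of the two figures above, taking the logarithm base 2 for the window term):

N · o · n · b = 10^6 · 10 · 5000 · 4 B ≈ 2 × 10^11 B ≈ 190 GB

N · o · log2(n) · b = 10^6 · 10 · log2(5000) · 4 B ≈ 4.9 × 10^8 B ≈ 500 MB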

Figure 2.1: This figure represents a common pattern for vertex-centric computations in SAL programs. A stream of edges is partitioned into N vertices that are then operated on by o operators that compute over a sliding window of length n items.

While some streaming algorithms are defined to operate on the entire stream, we focus on the sliding window model, where only recent inputs contribute to feature calculation. We believe the sliding window model is more appropriate for cyber data, where the environment is constantly changing over time and threats evolve and signatures are dynamic. The sliding window model can also be subcategorized into either a window defined by a particular time duration, or by the last n items. Currently SAL only supports expressing windows over the last n items, but many of the underlying algorithms are easily adapted to temporal windows, so it would not be difficult to add support for temporal windows.

These streaming operators are specified with FOREACH statements, which have the following form:

<FeatureName> = FOREACH <StreamName> GENERATE <operator>

For all logical streams contained by <StreamName>, the <operator> is applied to each stream, creating a time-varying feature indexed by each logical stream. (A sketch of how such per-stream operators can be organized in code follows the operator list below.)

The operators that can be specified are listed below:

• ave: Returns the estimated average value over the sliding window using an exponential

histogram approach [47]. If not using default values, the operator has the form ave(f, n, k)

where f is the name of the field in the tuple, n is the size of the sliding window, and k is a

parameter of the exponential histogram model that determines the size of each bin in the

histogram.

• sum: Returns the estimated sum over the sliding window. It also uses an exponential

histogram and has the same form sum(f, n, k).

• var: The estimated variance. We again use the exponential histogram approach with the

same signature, var(f, n, k).

• topk: Returns the estimated top k most frequent items. We use an implementation of

Golab et al. [73]. It has the form topk(f, n, b, k) where f is the targeted field, n is the size

of the sliding window, b is the size of the basic window, and k is how many frequent items

to report.

• median: The estimated median value. Arasu and Manku [14] provide an approach for

estimating quantiles to find a median value.

• countdistinct: This creates an estimate of the number of distinct elements. Metwally et al.

[141] and Clifford and Cosma [42] produced algorithms that can be used for this calculation.
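To make the FOREACH GENERATE semantics concrete, the following is a minimal C++ sketch (not SAM's actual classes; the names and the exact sliding average are illustrative assumptions) of one operator applied per logical stream, keyed here by DestIp. SAM would use a polylogarithmic estimator such as an exponential histogram in place of the exact average shown.

#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>

// Exact sliding-window average, shown only for clarity.
class SlidingAverage {
public:
  explicit SlidingAverage(std::size_t window) : window_(window) {}
  void update(double v) {
    values_.push_back(v);
    sum_ += v;
    if (values_.size() > window_) { sum_ -= values_.front(); values_.pop_front(); }
  }
  double value() const {
    return values_.empty() ? 0.0 : sum_ / static_cast<double>(values_.size());
  }
private:
  std::size_t window_;
  std::deque<double> values_;
  double sum_ = 0.0;
};

struct Netflow { std::string sourceIp, destIp; double srcTotalBytes; };

// One operator per logical stream: the feature is indexed by the key (DestIp).
class ForeachAve {
public:
  explicit ForeachAve(std::size_t window) : window_(window) {}
  void consume(const Netflow& n) {
    ops_.try_emplace(n.destIp, window_).first->second.update(n.srcTotalBytes);
  }
  double feature(const std::string& destIp) const {
    auto it = ops_.find(destIp);
    return it == ops_.end() ? 0.0 : it->second.value();
  }
private:
  std::size_t window_;
  std::unordered_map<std::string, SlidingAverage> ops_;
};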

Another related area of research is semi-streaming [144, 59], which is a relaxation from polylog spatial requirements to O(N · polylog N) for graph algorithms, where N is the number of vertices in the graph. Many graph algorithms cannot be computed in less than O(N · polylog N) space, hence the need to relax the requirement. While we do not implement any semi-streaming algorithms, it should be noted that many SAL vertex-centric programs have the general form of a semi-streaming operation. For example, there are N different streams, one for each vertex. For each of the N streams we compute a O(polylog n) spatial complexity streaming operation, where n is the window size. Overall, the entire pipeline has spatial complexity O(N · polylog n), which closely resembles the definition of semi-streaming.

We will introduce SAL with an example, provided by Listing 2.1, which specifies a query to match a type of malicious pattern known as a “Watering Hole Attack”. In a Watering Hole Attack, an attacker infects websites (e.g. via malvertisements) that are frequented by the targeted community. When someone from the targeted community accesses the website, their machine becomes infected and begins communicating with a machine controlled by the attacker. Figure 2.2 illustrates the attack as a set of nodes and edges.

Listing 2.1: SAL Code Example

1  // Preamble Statements
2  WindowSize = 1000;
3
4  // Connection Statements
5  Netflows = VastStream("localhost", 9999);
6
7  // Partition Statements
8  PARTITION Netflows BY SourceIp, DestIp;
9  HASH SourceIp WITH IpHashFunction;
10 HASH DestIp WITH IpHashFunction;
11
12 // Defining Features
13 Top1000 = FOREACH Netflows GENERATE topk(DestIp, 100000, 10000, 1000);
14
15 // Subgraph definition
16 SUBGRAPH ON Netflows WITH source(SourceIp)
17   AND target(DestIp)
18 {
19   trg e1 bait;
20   trg e2 controller;
21   start(e2) > end(e1);
22   start(e2) - end(e1) < 10;
23   bait in Top1000;
24   controller not in Top1000;
25 }

In this SAL program there are five parts:

• Preamble Statements

• Partition Statements

• Connection Statements

• Feature Definition

• Subgraph Definition

Figure 2.2: Illustration showing the subgraph nature of a Watering Hole Attack. The target machine accesses a popular site, Bait. Shortly after accessing Bait, Target begins communicating to a new machine, the Controller.

Preamble statements allow for global constants to be defined that are used throughout the program.

In the above listing, line 2 defines the default window size, i.e. the number of items in the sliding window.

After the preamble are the connection statements. Line 5 defines a stream of netflows called

Netflows. VastStream tells the SAL interpreter to expect netflow data of a particular format (we use the same format for netflows as found in the VAST Challenge 2013: Mini-Challenge 3 dataset

[201]). Each participating node in the cluster receives netflows over a socket on port 9999. The

VastStream function creates a stream of tuples that represent netflows. For the tuples generated by VastStream, keywords are defined to access the individual fields of the tuple, listed below. A more complete definition can be found in the data description provided by VAST.

• SamGeneratedId: Each netflow is assigned a unique ID by SAM.

• Label: This field is for when a netflow has a label, such as when we are processing labeled

data, or when SAM assigns a label during testing. This is often used for supervised machine

learning where individual netflows are labeled as malicious or benign.

• TimeSeconds: The number of seconds since the epoch plus microseconds as a decimal

value.

• ParseDate: A date that is easier to read.

• DateTime: A numeric version of ParseDate

• IpLayerProtocol: The IP protocol number.

• IpLayerProtocolCode: The text version of the IP protocol.

• SourceIp: The source IP.

• DestIp: The destination IP.

• SourcePort: The source port.

• DestPort: The destination port.

• MoreFragments: Used when the flow is a long running data stream. Nonzero when more

fragments will follow.

• CountFragments: If nonzero, indicates that this record is not the first for the flow.

• DurationSeconds: An integer indicating the time in seconds that the flow existed.

• SrcPayloadBytes: The sum of bytes that are part of the payload coming from the source.

• DestPayloadBytes: The sum of bytes that are part of the payload coming from the

destination.

• SrcTotalBytes: The total bytes coming from the source.

• DestTotalBytes: The total bytes coming from the destination.

• SrcPacketCount: The total number of packets coming from the source.

• DestPacketCount: The total number of packets coming from the destination.

• RecordForceOut: Indicates that the record was pushed out before shutdown, and may

be incomplete.

There are several different standard netflow formats. SAL currently supports only one format, the one used by VAST ([201]). Adding other netflow formats is a straightforward task. In fact,

SAL can easily be extended to process any type of tuple. Using our current SAL interpreter,

SAM (see Chapter 5), the process is to define a C++ std::tuple with the required fields and a function object that accepts a string and returns an std::tuple. Once a mapping is defined from the desired keyword (e.g. VastStream) to the std::tuple, this new tuple type can then be used in SAL connection statements. The mapping is defined via the Scala Parser Combinator [175] and is discussed in more detail in Chapter 5.
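As an illustration of that extension mechanism, here is a hypothetical sketch (the tuple layout, field choice, and comma-separated input format are our own assumptions, not SAM's actual definitions) of a new tuple type and a function object that parses one line of text into it:

#include <sstream>
#include <string>
#include <tuple>
#include <vector>

// A minimal "netflow-like" tuple: time, source IP, dest IP, dest port, total bytes.
using MyFlow = std::tuple<double, std::string, std::string, int, long>;

// Function object that accepts a string and returns the tuple.
struct MyFlowParser {
  MyFlow operator()(const std::string& line) const {
    std::vector<std::string> f;
    std::stringstream ss(line);
    std::string item;
    while (std::getline(ss, item, ',')) f.push_back(item);
    return MyFlow(std::stod(f.at(0)), f.at(1), f.at(2),
                  std::stoi(f.at(3)), std::stol(f.at(4)));
  }
};

Once a keyword is mapped to a pair like this, the new tuple type can be used in connection statements just as VastStream is.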

Following the connection statements is the definition of how the tuples are partitioned across the cluster. Line 8 specifies that the netflows should be partitioned separately by SourceIp and

DestIp. Each node in the cluster acts as an independent sensor and receives a separate stream of netflows. These independent streams are then re-partitioned across the cluster. In this example, each node is assigned a set of Source IP addresses and Destination IP addresses. IP addresses are assigned to nodes using a common hash function. The hash function used is specified on lines 9 and

10, called the IpHashFunction. This is another avenue for extending SAL. Other hash functions can be defined and mapped to SAL constructs, similar to how other tuples can be added to SAL.

The process is to define a function object that accepts the tuple type and returns an integer, and then map the function object to a keyword using the Scala Parser Combinator.
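A hypothetical example of such a function object (the template parameters and field index are assumptions for illustration): it accepts the tuple type and returns an integer, so that the same IP always maps to the same node in the cluster.

#include <cstddef>
#include <functional>
#include <string>
#include <tuple>

// Hashes one string field (e.g. SourceIp or DestIp) of an arbitrary tuple type.
template <typename TupleType, std::size_t IpField>
struct IpFieldHash {
  std::size_t operator()(const TupleType& t) const {
    return std::hash<std::string>{}(std::get<IpField>(t));
  }
};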

The next part of the SAL example program defines features. On line 13 we use the FOREACH GENERATE statement to calculate the most frequent destination IPs across the stream of netflows. The second parameter defines the total window size of the sliding window. The third parameter defines the size of a basic window [73], which divides up the sliding window into smaller chunks. The fourth parameter defines the number of most frequent items to keep track of. In this case, the FOREACH GENERATE statement operates on a single stream. In Section 2.1 we will examine how the same statement can operate on multiple streams.

Finally, lines 19 and 20 define the subgraph of interest in terms of declarative, SPARQL-like statements [166] that have the form <source node> <edge> <target node>. Lines 21 and 22 define temporal constraints on the edges, specifying that e1 ends before e2 starts and that no more than ten seconds pass from the end of e1 to the start of e2. Also, lines 23 and 24 define constraints on the vertices based on the previously defined feature Top1000. Vertex bait should be a frequent destination while controller should be infrequent.

We will cover other Domain Specific Languages in Section 4.2, but as a quick note, the language probably closest to the non-subgraph portion of SAL is Pig Latin [158]. Pig Latin is an imperative language developed as a procedural analog to SQL. The developers of Pig found that imperative map-reduce [51] style approaches were popular for data-intensive parallel tasks, but too low-level and rigid, requiring significant development time and enabling low levels of reuse. Also, the declarative syntax of SQL was unnatural to the data analysts. Pig Latin’s purpose is to provide a high-level language that makes it easy to describe the flow of data over a set of operations, but that uses more abstract, higher-level constructs than map-reduce.

Pig Latin describes the flow of data over a set of operations, similar to SAL. However, given SAL’s focus on streaming data instead of batch processing, the semantics of the two languages diverge. The biggest difference is the STREAM BY statement. While Pig processes batches of tuples, one of SAL’s main use cases is to divide a never-ending stream of tuples into separate substreams, to each of which a pipeline of operations is applied. Pig has a FOREACH GENERATE statement, but its purpose is to transform a set of tuples into another set of tuples, whereas for SAL, the semantics are to create a feature for each substream of data.

Concerning the graph-matching portion of SAL, we elected to use a declarative approach, where the user describes the subgraph of interest rather than specifying a procedural query plan for how the subgraphs are found. The language closest to our approach is SPARQL [166] (a recursive acronym for SPARQL Protocol and RDF Query Language), which is a query language over Resource Description Framework (RDF) [1] data. In preliminary work [76] we mapped SPARQL code onto parallel vertex-centric programming frameworks, which are discussed in more detail in Section 4.1.

From that experience we reasoned that it is easier to declare subgraphs of interest rather than create the plan to find them. Depending on the order of execution, the computational requirements can be orders of magnitude different. In our opinion, it is better to let a query optimizer handle the organization of the subgraph query plan.

While we use some concepts of SPARQL, namely basic graph patterns that describe the topological structure of the subgraph, there are aspects of SAL that significantly diverge from

SPARQL and necessitate a separate approach. The differences between the requirements of SAL and SPARQL-related efforts are described in more detail in Section 4.6.

For the rest of this chapter we will do the following:

• Look at a case study of using SAL based on the problem of finding botnets in Section 2.1.

• Integrate with static vertex attributes in Section 2.2.

• Present temporal subgraph queries in Section 2.3.

2.1 Case Study

In this section, we examine one approach for creating a classifier for detecting botnets, namely

Disclosure [23]. This example will help elucidate how SAL can be used to create succinct representations of machine learning pipelines that previously were developed in an ad-hoc fashion. Also, it will demonstrate what cannot be expressed by SAL if we want to remain within the bounds of streaming calculations. Some of the features created by Disclosure require algorithms that do not comply with the desired constraints of streaming and semi-streaming. As such we have decided to not include those features within SAL at this time. If space is not a concern, SAL can always be extended to use non-streaming approaches that do not have polylogarithmic space constraints. In

any case, our intent with SAL is to winnow the stream of data to something more manageable for

more intensive study. The full program is provided in Listing 2.2.

Listing 2.2: SAL Disclosure Program

1  Netflows = VastStream("localhost", 9999);
2  PARTITION Netflows BY SourceIp, DestIp;
3  HASH SourceIp WITH IpHashFunction;
4  HASH DestIp WITH IpHashFunction;
5  VertsByDest = STREAM Netflows BY DestIp;
6  Top2 = FOREACH VertsByDest GENERATE topk(DestPort, 10000, 1000, 2);
7  Servers = FILTER VertsByDest BY Top2.value(0) + Top2.value(1) > 0.9;
8  FlowsizeAveIncoming = FOREACH Servers GENERATE ave(SrcTotalBytes);
9  FlowsizeAveOutgoing = FOREACH Servers GENERATE ave(DestTotalBytes);
10 FlowsizeVarIncoming = FOREACH Servers GENERATE var(SrcTotalBytes);
11 FlowsizeVarOutgoing = FOREACH Servers GENERATE var(DestTotalBytes);
12 UniqueIncoming = FOREACH Servers GENERATE countdistinct(SrcTotalBytes);
13 UniqueOutgoing = FOREACH Servers GENERATE countdistinct(DestTotalBytes);
14 DestSrc = STREAM Servers BY DestIp, SourceIp;
15 TimeLapseSeries = FOREACH DestSrc TRANSFORM
16   (TimeSeconds - TimeSeconds.prev(1)) : TimeDiff;
17 TimeDiffVar = FOREACH TimeLapseSeries GENERATE var(TimeDiff);
18 TimeDiffMed = FOREACH TimeLapseSeries GENERATE median(TimeDiff);
19 DestOnly = COLLAPSE TimeLapseSeries BY DestIp FOR TimeDiffVar, TimeDiffMed;
20 AveTimeDiffVar = FOREACH DestOnly GENERATE ave(TimeDiffVar);
21 VarTimeDiffVar = FOREACH DestOnly GENERATE var(TimeDiffVar);

Many botnets have one or a small set of command and control servers that issue commands to a large number of infected machines, which then perform attacks such as distributed denial-of-service, stealing data, spam [183], or even use the computational resources for Bitcoin mining [102]. Disclosure focuses on identifying command and control (C&C) servers within a botnet.

The first part of the Disclosure pipeline is to identify servers. They define servers as an IP address where the top two ports account for 90% of the flows. Server definition is handled by lines 5 - 7. Line 5 demonstrates the use of a STREAM BY statement. It logically divides the input stream (Netflows) into multiple streams by a set of keys, which in this case is a single key,

DestIp. The FOREACH GENERATE statement on line 6 then operates on each of these individual streams. It calculates an estimate of the top two ports that receive traffic for each destination IP using the topk operator. We then follow that with a filter on line 7. The value(n) function returns the frequency of the nth most frequent item (zero-based indexing). Netflows where the destination IP has two ports that account for 90% of the traffic to that IP are allowed through.

Figure 2.3 demonstrates the flow graphically.

Figure 2.3: This figure demonstrates the use of the STREAM BY statement to logically divide a single stream into multiple streams. The STREAM BY statement is followed by a FOREACH GENERATE statement that creates estimates on the top two most frequent destination ports and their frequencies. The FILTER statement uses this information to downselect to IPs where the top two ports account for 90% of the traffic.

Once the servers have been determined, the authors of Disclosure hypothesized that flow size distributions for C&C servers are distinguishable from those of benign servers. An example they give is that C&C servers generally have a limited number of commands, and thus flow sizes are drawn from a small set of values. On the other hand, benign servers will generally have a much wider range of values. To detect these differences between C&C servers and benign servers, they create three different types of features based on flow size:

• Statistical Features

• Autocorrelation

• Unique flow sizes

For the statistical features, they extract the mean and standard deviation for the size of incoming and outgoing flows for each server. This is easy to express in SAL, and can be found on lines 8 -

11. For each IP in the set of servers, we use the FOREACH GENERATE statement combined with either the ave operator or var operator.

The Disclosure authors also generate features using autocorrelation on the flow sizes. The idea is that C&C servers often have periodic behavior that would be evident from an autocorrelation calculation. For each server, the flow sizes coming in with each netflow can be thought of as a time series. They divide this signal up into 300 second intervals and calculate the autocorrelation of the time series. Unfortunately, we are not aware of a polylogarithmic streaming algorithm for calculating autocorrelation. As such, we currently do not allow autocorrelation to be expressed within SAL; mixing algorithms with vastly different spatial and temporal complexity requirements would negate many of the benefits of the language. For a problem with N vertices, calculating autocorrelation for each vertex would require O(N · n) space where n is the length of the window. Applying polylogarithmic streaming operators to N vertices would require O(N · log(n)^k) space, offering significant savings and calling into question the scalability of Disclosure to handle large problems. That said, there is nothing preventing an extension of SAL with an autocorrelation operation, though the documentation would need to be clear about the complexity.
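To illustrate why autocorrelation resists a polylogarithmic streaming formulation, the standard lag-k sample estimate below (our own illustrative C++ code, not part of SAL or Disclosure) needs the full window of samples in memory, i.e. O(n) space per vertex:

#include <cstddef>
#include <vector>

// Lag-k sample autocorrelation over a buffered window of flow sizes.
double lagAutocorrelation(const std::vector<double>& x, std::size_t lag) {
  const std::size_t n = x.size();
  if (n <= lag) return 0.0;
  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= static_cast<double>(n);
  double num = 0.0, den = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    den += (x[i] - mean) * (x[i] - mean);
    if (i + lag < n) num += (x[i] - mean) * (x[i + lag] - mean);
  }
  return den == 0.0 ? 0.0 : num / den;
}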

The final set of features based on flow sizes involves finding the set of unique flow sizes. The hypothesis is that botnets have a limited set of messages, and so the number of unique flow sizes will be smaller than for a typical benign server. To find an estimate of the number of unique flow sizes, one can use the countdistinct operator, as shown on lines 12 - 13. However, Disclosure goes a step further than just the number of unique flow sizes. They create an array with the counts for each unique element and then compute unspecified statistical features on this array. While this is not exactly expressible in SAL, and having an exact answer would break our space constraints, one could use topk to get an estimate of the counts for the most frequent elements.

Besides features based on flow size, Disclosure also computes features focusing on the client access patterns. The hypothesis is that all the bots accessing a particular C&C server will exhibit very similar behavior, while the behavior of clients accessing benign servers will not be so uniform.

Disclosure defines a time series for each server-client pair by calculating the inter-arrival times between consecutive connections. For example, if we had n netflows that occurred at times t0, t1, ..., tn, then the series would be t1 − t0, t2 − t1, ..., tn − tn−1. To specify this time series in SAL, one uses the TRANSFORM statement. The TRANSFORM statement allows one to transform from one tuple representation to another.

For this example, we need to transform from the original netflow tuple representation to a tuple that has three values: the SrcIp, DestIp, and the inter-arrival time. On line 14, we use the

STREAM BY statement to further split the stream of netflows into source-destination IP pairs.

Line 16 then follows with the TRANSFORM statement that calculates the inter-arrival time.

Since SourceIp and DestIp are the keys defined by the STREAM BY statement, those values are included by default as the first two values of the newly defined tuple. This part of the pipeline is represented in the left side of Figure 2.4.

On line 16 we introduce the prev(i) function. The prev(i) function returns the value of the related field i items back in time. So TimeSeconds.prev(1) returns the value of TimeSeconds in the tuple that occurred just previous to the current tuple, thus giving us the inter-arrival time.

The colon followed by TimeDiff gives a label to the tuple value and can be referred to in later

SAL statements (e.g. line 17).

Now that we have the time series expressed in SAL, we can then add the feature extraction methods that Disclosure performs on the inter-arrival times. Disclosure calculates the minimum, maximum, median, and standard deviation. The median and standard deviation can be expressed in SAL, and are found on lines 17 and 18.

Figure 2.4: This figure demonstrates the use of the TRANSFORM and COLLAPSE BY statements to define parts of the Disclosure pipeline.

However, maximum and minimum are not currently supported in SAL. The reason is that max and min provably require O(n) space, where n is the size of the window, when computing over a sliding window [46]. As discussed previously, we made the design decision to limit operators to be polylogarithmic so that when we compute over N vertices, the total space complexity is O(N · log(n)^k) instead of O(N · n). When computing over the entire data stream, one can find the max/min by keeping track of a single number, but the sliding window adds complexity as the max/min expires. A mixture of the two models, over the entire stream and sliding windows, might be sufficient for creating features. For example, one could use the scheme presented below to estimate the maximum value:

Here, we create a sequence of max values, max_0, max_1, ..., where each max_i is calculated during the presentation of n consecutive items; however, only two values need to be stored at any one time, max_i and max_{i+1}. When querying for the current max after x items, the index of the maximum value for item x is

f(x) = 0 if x < n/2, and f(x) = ⌊2x/n⌋ − 1 otherwise, (2.1)

where n is the size of the window. This scheme would allow for constant space requirements and

O(n) processing time; however, it remains to be seen if max/min features extracted in this manner are useful in practical situations and we leave it as future work.
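One simple variant of such a block-based scheme could be sketched as follows (our own illustration, not part of SAM): two partial maxima over consecutive blocks of n/2 items are kept, and a query returns the larger of the completed block and the block currently filling, approximating the maximum over roughly the last n items in constant space.

#include <algorithm>
#include <cstdint>
#include <limits>

class ApproxSlidingMax {
public:
  explicit ApproxSlidingMax(uint64_t window) : half_(window / 2) {}

  void insert(double value) {
    current_ = std::max(current_, value);
    if (++count_ == half_) {      // the filling block is complete
      previous_ = current_;       // it becomes the completed block
      current_ = std::numeric_limits<double>::lowest();
      count_ = 0;
      havePrevious_ = true;
    }
  }

  // Estimate of the max over (roughly) the last `window` items.
  double query() const {
    return havePrevious_ ? std::max(current_, previous_) : current_;
  }

private:
  uint64_t half_;
  uint64_t count_ = 0;
  bool havePrevious_ = false;
  double previous_ = std::numeric_limits<double>::lowest();
  double current_ = std::numeric_limits<double>::lowest();
};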

For the features derived from the time series to be applicable to classifying servers, we need to combine the features from all the clients. To do so, we use the COLLAPSE BY statement so that SourceIp is no longer one of the keys used to split the data flow. The BY clause contains a list of keys that are kept. Unspecified keys are removed from the key set. On line 19 DestIp is kept while SourceIp is not specified, meaning that it is no longer used as a key to split the data flow.

The semantics of the COLLAPSE BY statement are the following: Let k+ be the tuple elements that are kept by COLLAPSE, and let k− be the tuple elements that are no longer used as a key. For each tuple t that appears in the stream S of data, let k+(t) be the subtuple with only tuple elements from k+, and let k−(t) be the subtuple with only tuple elements from k−. Also, let r(t) be the subtuple with the remaining elements that are neither in k+(t) nor in k−(t). Define L to be the set of unique k+(t) for all t ∈ S, i.e.

For each l ∈ L, we define another set, Ml:

[ Ml = k−(t) (2.3)

t∈S,k+(t)=l

COLLAPSE BY creates a mapping for each set Ml, mapping the elements of Ml to the most recently seen r(t) associated with each m ∈ Ml. These mappings can then be operated on by the

FOREACH GENERATE statement.

As an example, say we have the following DestIp, SourceIp pairs with the following values for TimeDiffMed: Netflow # DestIp SourceIp TimeDiffMed

i 192.168.0.1 192.168.0.100 1.5

i + 1 192.168.0.2 192,168.0.100 5

i + 2 192.168.0.1 192.168.0.101 2.1

i + 3 192.168.0.1 192.168.0.100 2 The following graphic illustrates the result of applying line 19 to the above data: 22

After processing tuple i, there will be one map, for destination IP 192.168.0.1, and it will have one key-value pair, < 192.168.0.100, 1.5 >. After tuple i + 1, there will be two maps, one for 192.168.0.1 and one for 192.168.0.2. The map for 192.168.0.2 will have the key-value pair

< 192.168.0.100, 5 >. After tuple i + 2, there will be another element in the map for 192.168.0.1, namely < 192.168.0.101, 2.1 >. After tuple i + 3, in the map for 192.168.0.1, the value for key

192.168.0.100 is replaced with 2.

Once we have collapsed back to DestIp, we can then calculate statistics across the set of clients for each server. The Disclosure paper does not specify which statistics are calculated, but we assume standard operators such as average, variance, and median were used, all of which can be expressed in SAL. Lines 20 - 21 provide some examples.

That concludes our exploration of how SAL can be used to express concepts from a real pipeline defined in another paper. While there are some operations that cannot be expressed, e.g. the autocorrelation and max/min features, most of the features could be expressed in SAL. We believe SAL provides a succinct way to express streaming machine learning pipelines.

To use this pipeline for machine learning purposes, a SAL pipeline can have two operating modes: training and testing. In the training mode, the pipeline is run against a finite set of data, usually with labels, and any features created by the pipeline will be appended per input tuple.

This feature set along with the labels is used to train a classifier offline. The testing phase then applies the pipeline to an online stream of data, transforming each input tuple into defined feature set, and then applies the trained classifier to the feature set to assign a label.

2.2 Static Vertex Attributes

The previous section outlined features that can be defined and that depend upon the stream of data to calculate them. However, there may exist attributes associated with vertices that remain relatively static. A common use case we’ve found is knowing when a vertex is a local IP address and when it is outside the firewall. This is helpful in defining subgraphs of interest, further refining the meaning and constraining what is considered. For the Watering Hole Attack, we can add the line

LocalInfo = VertexInfo()

and then within the subgraph section we can add

trg in LocalInfo;
bait not in LocalInfo;
controller not in LocalInfo;

This allows us to create the constraints that trg is behind the firewall while bait and controller are outside the firewall, further clarifying the intent of the query.

The general format for in and not in is <node variable> in <vertex attributes name> and <node variable> not in <vertex attributes name>.

The in and not in check to see if the node variable is a defined key within the static vertex attributes. We can think of the Static Vertex Attributes as a table with a primary key. Attributes of the table are accessed via the following form: <vertex attributes name>(<node variable>).<field name>. For example, if we have a field Type that informs what the purpose of an IP is, we can add the following types of constraints:

LocalInfo(trg).Type == ”Server”

LocalInfo(pos).Type == ”POS”

Where the first line indicates that the vertex trg is a Server and the second line indicates that the vertex pos is a POS or point-of-sale machine.

We will see more examples of Static Vertex Attributes when we walk through how SAL can express common cyber queries in Chapter 3.

2.3 Temporal Subgraph Matching

While the previous section focused on the streaming operators expressed as imperative statements in SAL, here we provide more details on subgraph pattern matching, which we later show to be polynomial in the number of possible edges over a sliding window. Subgraph matching uses declarative statements with a syntax inspired by SPARQL [166]. A SAL subgraph matching query has the following pieces:

• Graph Definition: SAL defines a stream of tuples, e.g. Netflows, but in order to perform

subgraph matching, SAL needs to know what constitutes a vertex within the tuple. In

particular, we support directed graphs where each edge has a direction, emanating from a

source and terminating at a target. The definition of the graph is specified with a statement

of the form

SUBGRAPH ON <stream> WITH source(<source field>) AND target(<target field>) { ... }

The tuple fields <source field> and <target field> must be of the same type. In our implemen-

tation there is a std::tuple that is mapped to a particular tuple type in SAL. The two

fields must be of the same type in the std::tuple. Within the curly braces, { ... } , the

following three types of statements are made.

• Edge Definition: These are statements that define the topological structure of the graph

and come in the form <source node> <edge> <target node>. Each element is a variable name for a

node or an edge. Variable names are alphanumeric but begin with an alphabetic character.

• Edge Constraints: Using the keywords defined for the tuple, one can form a constraint on edges in terms of a comparison of two arithmetic expressions:

<EdgeConstraint> := <Expr> <Comp> <Expr>
<Expr>   := <Term> | <Expr> + <Term> | <Expr> − <Term>
<Term>   := <Factor> | <Term> ∗ <Factor> | <Term> / <Factor>
<Factor> := ( <Expr> ) | − <Factor> | <Variable>.<Field> | start(<Variable>) | end(<Variable>) | <RealNumber> | <String>
<Comp>   := < | > | <= | >= | ==

In order to evaluate correctly, the variable name must match a variable specified for an

edge in an Edge Definition statement.

• Vertex Constraints: One can also define constraints on vertices:

<VertexConstraint> := <node> <VertexComp> <FeatureName> |
                      <node> <VertexComp> <VertexInfo> |
                      <VertexInfo>(<node>).<FieldName> <Comp> <RVertexInfo>
<VertexComp> := in | not in
<RVertexInfo> := <String> | <Float> | <Integer>

Vertex constraints have three basic forms. The first form checks to see if a vertex is found

in a previously defined feature, e.g. bait in Top1000 checks to see if the vertex bait is

found in the feature Top1000. In order to evaluate correctly, the node variable must refer

to a vertex variable defined in an Edge Definition statement, the feature variable must refer

to a variable defined by a Feature Definition Statement and the key of the feature must

be of the same type as the vertex type. The second form is similar to the first, except it

checks to see if the vertex variable is a key in a defined VertexInfo table. The third form

accesses a field of a VertexInfo table with the vertex variable as a key and checks that it
satisfies the comparison operation.

With the overall process for subgraph definition outlined, we now discuss the complexity of the problem. We begin with a discussion of subgraph isomorphism and how that problem relates to temporal subgraph matching.

Subgraph isomorphism, the decision problem of determining for a pair of graphs, H and G, if G contains a subgraph that is isomorphic to H, is known to be NP-complete. Additionally, the problem of finding all matching subgraphs within a larger graph is NP-hard [122]. However, when we consider a streaming environment with bounds on the temporal extent of the subgraph, the problem becomes polynomial. We introduce some definitions to lay the groundwork for discussing complexity. 26

Definition 1. Graph: G = (V,E) is a graph where V are vertices and E are edges. Edges e ∈ E are tuples of the form (u, v) where u, v ∈ V .

Definition 2. Temporal Graph: G = (V,E) is a temporal graph where V are vertices and E are temporal edges. Temporal edges e ∈ E are tuples of the form (u, v, t, δ) where u, v ∈ V and t, δ ∈ R and represent the start time of the edge and its duration, respectively.

Definition 3. Streaming Temporal Graph: A graph G is a streaming temporal graph when G is a temporal graph where V is finite and E is infinite.

To express subgraph queries against a temporal graph, we need the notion of temporal constraints on edges. We previously introduced constraints on edges, which have the general form of a comparison between two arithmetic expressions involving tuple fields, i.e.

<Constraint> := <Expr> <Comp> <Expr>

Temporal constraints are simply constraints where the expressions involve the temporal fields of the tuples. For convenience we define start() to return the time that the edge started in seconds, and end() as the time that the edge ended. Otherwise, to obtain a value of a tuple field, one uses the form <variable>.<field>, e.g. e1.TimeSeconds. Now we define Temporal Subgraph Queries using temporal constraints.

Definition 4. Temporal Subgraph Query: For a temporal graph G, a temporal subgraph query is composed of a graph H with edges of the form (u, v), where each of the edges may have temporal constraints defined. The result of the query is all subgraphs H′ of G that are isomorphic to H and that fulfill the defined temporal constraints.

Below is an example triangular temporal subgraph query:

Listing 2.3: Temporal Triangle Query

1 SUBGRAPH ON Netflows WITH source(SourceIp) AND target(DestIp)
2 {
3   x1 e1 x2;
4   x2 e2 x3;
5   x3 e3 x1;
6   start(e3) - start(e1) <= 10;
7   start(e1) < start(e2);
8   start(e2) < start(e3);
9 }

The first part of the query, with the variables x1, x2, and x3, defines the subgraph structure H. In this example, we have defined a triangle. The second part of the query defines the temporal constraints. Edge e1 starts before e2, e2 starts before e3, and the entire subgraph must occur within a 10 second window from the start of the first edge to the start of the last edge. For temporal queries where a maximum window is specified, the problem of finding all matching subgraphs is polynomial. As far as we are aware, no one else has examined the complexity of subgraph matching within a streaming setting with temporal constraints.

Theorem 1. For a streaming temporal graph G, let n be the maximum number of edges that arise during a time duration of length ∆q. A temporal subgraph query q with d edges and maximum temporal extent ∆q and computed on an interval of edges of length ∆q has temporal and spatial complexity of O(n^d).

Proof. For an interval of length ∆q, there are at most n edges. Assume each of those n edges satisfies one edge of q. Then there are at most n potential matching subgraphs in the interval. Now assume that for each of those n potential matching subgraphs, each of the n edges satisfies another of the edges in q. Then there are n^2 potential matching subgraphs. If we continue for d iterations, we have a maximum of n^d matching subgraphs. Thus, finding and storing all the matching subgraphs requires O(n^d) operations and O(n^d) space.

The point of Theorem 1 is to show that subgraph matching on streaming data is tractable: over any window of time about the length of the temporal subgraph query, the computation has polynomial complexity, as long as the number of edges in a query has some maximum value.
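To make the bound concrete, the following C++ sketch enumerates matches by brute force over a window; it is an illustration of the counting argument, not SAM’s matching algorithm, and the Edge type and predicate representation are assumptions. With n edges in the window and d query-edge predicates, each level of the recursion scans all n edges for every partial match, giving the O(n^d) time and space of Theorem 1.

#include <functional>
#include <vector>

// A temporal edge within the current window (assumed fields).
struct Edge { int src, dst; double start, duration; };

// preds[i] decides whether edge e can serve as the i-th query edge given the
// edges already bound (structural and temporal constraints alike).
using EdgePred = std::function<bool(const Edge& e, const std::vector<Edge>& bound)>;

// Brute-force enumeration: try every window edge at every query position.
void enumerate(const std::vector<Edge>& window,
               const std::vector<EdgePred>& preds,
               std::vector<Edge>& bound,
               std::vector<std::vector<Edge>>& results) {
  if (bound.size() == preds.size()) {   // all d query edges bound: a full match
    results.push_back(bound);
    return;
  }
  const EdgePred& pred = preds[bound.size()];
  for (const Edge& e : window) {        // n candidates at this level
    if (pred(e, bound)) {
      bound.push_back(e);
      enumerate(window, preds, bound, results);
      bound.pop_back();
    }
  }
}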

In this section we’ve introduced the main principles of expressing a subgraph query: defining the source and target fields, edge expressions, edge constraints, and vertex constraints. In the next chapter we will see examples of SAL applied to common patterns arising in cyber security.

Chapter 3

SAL as a Query Language for Cyber Problems

This chapter covers several cyber use cases and analyzes how they can be expressed as SAL programs. To ensure broad coverage, we use Verizon’s Data Breach Investigations Report (DBIR) of 2018 [203] as the basis for our discussion. The DBIR presents an overview of cyber security incidents and breaches. They define an incident as a security event that compromises the integrity, confidentiality or availability of an information asset. A breach is more substantial, and is defined as an incident that results in the confirmed disclosure, not just potential exposure, of data to an unauthorized party. As of the 2018 report, Verizon had recorded 333,000 incidents and 16,000 breaches. They group these observed cyber attacks into nine different categories, namely

• Crimeware

• Cyber-Espionage

• Denial of Service

• Lost and Stolen Assets

• Miscellaneous Errors

• Payment Card Skimmers

• Point of Sale

• Privilege Misuse

• Web Applications

Verizon reports that 94% of incidents and 90% of breaches fall within the nine categories. They have another category, Everything Else, for attacks that don’t fit the nine. We will discuss how

SAL could be used to define the patterns that comprise each category. However, we do not cover

Lost and Stolen Assets (e.g. a stolen laptop) and Payment Card Skimmers (e.g. a skimming device physically implanted on an ATM), as these categories relate to physical security, which is out of scope. Also, we do not cover Miscellaneous Errors, which largely deals with misconfigurations (e.g. default passwords on a database).

For the code listings in this section, to avoid frequent repetition we assume that a Netflows stream has already been created, and that the data has been partitioned. For an example see lines

5-10 of Listing 2.1.

3.1 Crimeware

The Crimeware category has instances of malware that were largely opportunistic in nature and are financially motivated, but that didn’t fit a specific pattern. An example of this is the

ZeroAccess botnet [209]. The ZeroAccess botnet had its heyday in 2012 when it had over one million infected hosts at its disposal. The motivations behind ZeroAccess were purely financial, with much of the revenue coming from click fraud. The botmasters would send lists of urls of advertisements, and the controlled machines would simulate browsers clicking on those advertisements. They would then collect the ad revenue from the generated activity. At its height, researchers estimate that

ZeroAccess garnered around $100,000 a day from click fraud.

To detect click fraud in general, one could start with the hypothesis that the click-through behavior of bots is different from that of humans. It is probably difficult to know what traffic constitutes a click-through, but we can monitor how clients interact with web servers. Humans interact with the web in Poisson-like or bursty distributions [202], while a bot would likely create relatively constant activity.

We begin by creating a definition of a web server in Listing 3.1, similar to our approach for

Disclosure in Section 2.1. However, we focus on IPs where the majority of the traffic is directed towards port 80.

Listing 3.1: Definition of Web Servers

1 VertsByDest = STREAM Netflows BY DestIp;
2 Top1 = FOREACH VertsByDest GENERATE topk(DestPort,10000,1000,1);
3 Servers = FILTER VertsByDest BY Top1.value(0) > 0.6;
4 Servers = FILTER Servers BY Top1.key(0) == 80;

Also similar to the Disclosure use case, we use a STREAM BY statement to create logical streams of source/destination pairs in Listing 3.2. We then find the inter-arrival times and calculate statistics, again similar to Disclosure.

Listing 3.2: Calculating Inter-arrival Times

1 DestSrc = STREAM Servers BY DestIp, SourceIp;
2 TimeLapseSeries = FOREACH DestSrc TRANSFORM
3   (TimeSeconds - TimeSeconds.prev(1)) : TimeDiff;
4 TimeDiffVar = FOREACH TimeLapseSeries GENERATE var(TimeDiff);
5 TimeDiffMed = FOREACH TimeLapseSeries GENERATE median(TimeDiff);
6 TimeDiffAve = FOREACH TimeLapseSeries GENERATE ave(TimeDiff);
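Operators such as var and ave in Listing 3.2 must be computed without storing the full series of inter-arrival times. As an illustration (not necessarily how SAM implements these operators), Welford’s online algorithm maintains a running mean and variance in constant space; the streaming median would require a different structure, such as a quantile sketch. In this scenario, each logical (DestIp, SourceIp) stream would keep its own accumulator for TimeDiff.

#include <cstddef>

// Welford's online algorithm: running mean and variance in O(1) space.
class OnlineStats {
 public:
  void update(double x) {
    ++n_;
    double delta = x - mean_;
    mean_ += delta / static_cast<double>(n_);
    m2_ += delta * (x - mean_);   // uses the updated mean
  }
  double mean() const { return mean_; }
  double variance() const {
    return n_ > 1 ? m2_ / static_cast<double>(n_) : 0.0;
  }
 private:
  std::size_t n_ = 0;
  double mean_ = 0.0;
  double m2_ = 0.0;
};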

In contrast to Disclosure, we are interested in statistics about the clients, not the servers. We want to categorize clients into two bins: bot-generated and human-generated behavior. As such we collapse on the source IP, and calculate features based on the aggregates in Listing 3.3.

Listing 3.3: Calculating Time-based Client Statistics

1 SrcOnly = COLLAPSE TimeLapseSeries BY SourceIp FOR TimeDiffVar, TimeDiffMed,
2   TimeDiffAve;
3 AveTimeDiffVar = FOREACH SrcOnly GENERATE ave(TimeDiffVar);
4 VarTimeDiffVar = FOREACH SrcOnly GENERATE var(TimeDiffVar);
5 AveTimeDiffMed = FOREACH SrcOnly GENERATE ave(TimeDiffMed);
6 VarTimeDiffMed = FOREACH SrcOnly GENERATE var(TimeDiffMed);
7 AveTimeDiffAve = FOREACH SrcOnly GENERATE ave(TimeDiffAve);
8 VarTimeDiffAve = FOREACH SrcOnly GENERATE var(TimeDiffAve);

From these generated features, one could use a clustering-based approach to understand the data. In an ideal world, there would be one cluster for human-generated traffic and another for bot-generated traffic. However, it is more likely that an iterative process is required. In the case of the ZeroAccess bot, it was a peer-to-peer-based bot, and communication between peers used four particular ports: 16464, 16465, 16470, and 16471. In the early phases of analyzing data, this kind of peculiar communication pattern might become apparent from clustering on a first pass, and one can add a filter to look for that specific pattern as in Listing 3.4.

Listing 3.4: Suspicious Ports

1 VertsByDest = STREAM Netflows BY DestIp;
2 Top10Ports = FOREACH VertsByDest GENERATE topk(DestPort,10000,1000,10);
3 BadClients = FILTER VertsByDest BY 16464 in Top10Ports;
4 BadClients = STREAM BadClients BY SourceIp;
5 Top1 = FOREACH VertsByDest GENERATE topk(DestPort,10000,1000,1);
6 Servers = FILTER VertsByDest BY Top1.value(0) > 0.6;
7 Servers = FILTER Servers BY Top1.key(0) == 80;
8 Servers = FILTER Servers BY SourceIp in BadClients;

On line 4, we switch the stream key from DestIp to SourceIp so that on line 8 we can test whether the SourceIp of the Servers traffic is in BadClients. At the end, Servers is filtered down to netflows where the destination IP is characterized as a web server and the client IP has generated traffic on port 16464.

The ZeroAccess bot is a good example of how SAL could be used to test hypotheses. One starts with an idea of what click fraud looks like from network behavior, and then extracts useful data and calculates features that allow researchers to test their hypotheses.

3.2 Cyber-Espionage

According to the Verizon report, this category involves malicious activity linked to state-affiliated actors and/or exhibiting the motive of espionage. They also state that 93% of breaches are attributed to state-affiliated groups or nation-states. In another report, Symantec [195] notes that the number of targeted attack groups, i.e. groups that are professional, highly organized, and target specifically rather than indiscriminately, grew at a rate of 29 groups a year between 2015 and 2017, from a total of 87 to 140. The evidence points to the growing sophistication and technical capabilities of adversarial groups.

Figure 3.1: A subgraph showing the flow of messages for data exfiltration. A controller sends a small control message to an infected machine, which then sends large amounts of data to a drop box.

A common activity of cyber-espionage is the exfiltration of data. Joslyn et al. [109] provide an exfiltration query that we express in terms of SAL. Figure 3.1 presents the overall graph structure.

Listing 3.5 gives the SAL code. We assume a Static Vertex Attributes table named Local has been defined and is a single column with IP addresses. The presence of an IP address in the table implies that the address is behind the firewall. Lines 4 and 5 define the topology of the graph. The temporal constraints on lines 6 and 7 define the temporal ordering: e1 happens before e2, and the time between the end of e1 and the start of e2 should be less than ten seconds. Lines 8 through 10 specify that target is a machine behind the firewall, while dropbox and controller are not local. Finally, lines 11 and 12 state that the number of bytes coming from the controller to the target is relatively small (< 20 bytes) while the data going from target to dropbox is large (> 1000 bytes).

Listing 3.5: Exfiltration Query

1 SUBGRAPH ON Netflows WITH source(SourceIp)
2   AND target(DestIp)
3 {
4   controller e1 target;
5   target e2 dropbox;
6   end(e1) < start(e2);
7   start(e2) - end(e1) < 10;
8   target in Local;
9   dropbox not in Local;
10  controller not in Local;
11  e1.SrcPayloadBytes < 20;
12  e2.SrcPayloadBytes > 1000;
13 }

3.3 Denial of Service

Denial of Service is an attack where the intent is to deny intended users access to a networked resource. Often, Denial of Service is a distributed undertaking, with large botnets employed to send enormous volumes of traffic to the victim, overwhelming resources so that legitimate users are unable to reach the site. This is known as Distributed Denial of Service, or DDoS. Jonker et al. [108] note that one third of all active /24 networks experienced DDoS attacks over a two year period. Attacks have been seen passing 1 terabit per second, with an attack against GitHub peaking at 1.35 Tbps [56]. DDoS occurs with alarming frequency, with Jonker et al. reporting about 30,000 attacks per day.

The primary means of conducting a DDoS is through a botnet. There are even DDoS-as-a-Service offerings that allow you to rent a botnet for a DDoS attack [173]. One structure for botnets is the command and control variant (C&C). Below we list a potential query to identify a subgraph associated with denial of service. There is one controller node that sends out a message to multiple infected machines, which then send messages to a target node. The average size of the messages from the infected machines is small (e.g. a SYN flood). This query matches only a subgraph of the intended behavior, which in practice involves many infected machines.

Listing 3.6: Denial of Service Query

1 VertsBySrc = STREAM Netflows BY SourceIp;
2 FlowsizeAve = FOREACH VertsBySrc GENERATE ave(SrcPayloadBytes);
3 SmallMessages = FILTER VertsBySrc BY FlowsizeAve < 10;
4 SUBGRAPH ON Netflows WITH source(SourceIp) AND target(DestIp)
5 {
6   controller e1 node1;
7   controller e2 node2;
8   node1 e3 target;
9   node2 e4 target;
10  start(e1) < start(e2);
11  start(e2) < start(e3);
12  start(e3) < start(e4);
13  start(e4) - start(e1) < 10;
14  node1 in SmallMessages;
15  node2 in SmallMessages;
16 }

3.4 Point of Sale

Point of Sale (POS) intrusions cover attacks that are conducted remotely but ultimately target POS terminals and controllers. During a POS transaction, communication of the sensitive information is largely encrypted. The vulnerable time, when credit card and other information is not encrypted, is when it is loaded into the memory of a POS terminal. Thus, most POS-associated malware scans the memory of running processes and performs a regular expression search for credit card information, which is then aggregated and communicated back to the attacker.

The Target Breach [180], which occurred in 2013, was one of the first large-scale POS breaches, where 40 million card credentials and 70 million customer records were stolen. While each POS breach has some unique elements, in particular how the POS malware is distributed across a retailer’s network, the general pattern of how data is collected and communicated back to the attacker has many commonalities:

• POS malware skims the memory of running processes, using regular expressions to discover

credit card information.

• Found credential information is sent to a server. In the case of the Target Breach, a small

set of compromised servers behind Target’s firewall aggregated the information from a

large number of POS terminals. The compromised servers sent the aggregated credit card

information to an external address controlled by the attacker.

A basic SAL query that could cover this pattern is provided in Listing 3.7. In this example, we have a Static Vertex Attributes table named Local that has two columns: the key column with the IP addresses, and another column with the label type. This allows us to check if an IP is behind the firewall, and whether it is a server or a POS machine.

Listing 3.7: POS Subgraph Query

1 SUBGRAPH ON Netflows WITH source(SourceIp)
2   AND target(DestIp)
3 {
4   pos e1 server;
5   server e2 drop;
6   Local(pos).type == "POS";
7   Local(server).type == "Server";
8   drop not in Local;
9 }

3.5 Privilege Misuse

The privilege misuse category largely deals with insider threat. A brochure from the FBI

[58] lists several examples of insider threats, with a common theme of collecting many documents, electronically or physically, sometimes on the order of hundreds of thousands. One pattern that could be considered suspicious is downloading or accessing many documents within a short period of time. For this query, we have a different graph than what we’ve used previously. We need richer semantic information than what is usually provided by netflows. Namely, we need a way of identifying document accesses by individuals. This requires that we have authentication in place to determine who is generating the traffic, and that we can identify when a particular document is being accessed. For many companies and institutions, this type of infrastructure is already in place.

For the purposes of discussion, we will assume that there are tuples generated with the given

fields:

• PersonId: the identifier of the person accessing the document.

• DocumentId: the identifier assigned to the document.

• TimeSeconds: the time in seconds from epoch that the access occurred.

We refer to this stream of tuples as a DocumentStream. We will assume that there is a hash function defined that will partition the data by PersonId. The SAL program in Listing 3.8 finds the average time between document accesses, and creates a filtered list of users whose average time between document accesses is less than 0.1 seconds.

Listing 3.8: Malicious Insider Query

1 Documents = DocumentStream("localhost", 9999);
2 PARTITION Documents BY PersonId;
3 HASH PersonId WITH PersonHashFunction;
4 ByPerson = STREAM Documents BY PersonId;
5 TimeLapseSeries = FOREACH ByPerson TRANSFORM
6   (TimeSeconds - TimeSeconds.prev(1)) : TimeDiff;
7 TimeDiffAve = FOREACH TimeLapseSeries GENERATE ave(TimeDiff);
8 Downloaders = FILTER TimeLapseSeries BY TimeDiffAve < 0.1;

3.6 Web Applications

Web application attacks are defined as any incident in which a web application was the vector of attack. An attack that fits this category is Cross-site Scripting (XSS). An XSS attack is possible when a web site accepts unsanitized user input that is then displayed back on the website.

For example, if a site accepts comments, a malicious actor can insert JavaScript code that is then added to the HTML page as a comment. When other users visit the compromised URL, the malicious JavaScript is run, opening the door for theft of cookies, event listeners that register keystrokes, and phishing through the use of fake or modified logins.

Figure 3.2: A subgraph showing an XSS attack.

The general pattern is similar to a Watering Hole Attack. Here a client browser accesses a frequent site or server. The downloaded HTML has malicious JavaScript embedded, which causes the client browser to access a relatively rare site and upload data to the attacker-controlled site.

Figure 3.2 presents the subgraph of interest. However, we can add additional details to make the query more specific to XSS. Listing 3.9 provides a query for XSS attacks in SAL. Lines 1-3 create a definition of a server, similar to the Disclosure query. On line 4 we create a reference to a Static Vertex Attributes table, DomainInfo. DomainInfo has two columns, the key column being IP, and another column being the associated domain. This allows us to test that the domain of the server or frequent site is not equal to the domain of the malicious IP (line 12).

Also important are lines 15 and 16, which access the application protocol being used. Most netflow formats do not include application protocol information; however, for some of the more common protocols (e.g. HTTP, HTTPS, SMTP), tools like Wireshark [208] are relatively good at decoding application layer protocols. Thus it is conceivable that the netflow tuples could be augmented with common application level protocols.

Listing 3.9: XSS Attack Query

1 VertsByDest = STREAM Netflows BY DestIp;
2 Top2 = FOREACH VertsByDest GENERATE topk(DestPort,10000,1000,2);
3 Servers = FILTER VertsByDest BY Top2.value(0) + Top2.value(1) > 0.9;
4 DomainInfo = VertexInfo("domainurl", 5000);
5
6 SUBGRAPH ON Netflows WITH source(SourceIp) AND target(DestIp)
7 {
8   client e1 frequent;
9   client e2 malicious;
10  frequent in Servers;
11  client not in Servers;
12  DomainInfo(frequent).domain != DomainInfo(malicious).domain;
13  start(e2) - start(e1) <= 10;
14  start(e1) < start(e2);
15  e1.ApplicationProtocol == "http";
16  e2.ApplicationProtocol == "http";
17 }

3.7 Summary

In this chapter we’ve applied SAL to a broad range of cyber problems, using Verizon’s Data

Breach Investigations Report as the basis for exploring coverage. SAL covers a wide range of use cases within the cyber realm.

Chapter 4

Related Work

Given that we are combining aspects of graph theory, machine learning, and streaming into one domain specific language, there is a large body of related work.

In terms of graph research, there are several related subareas:

• Vertex-centric Computing: This is a popular way to express parallel graph operations based

on the Bulk Synchronous Parallel paradigm [200]. While a useful approach for batch

processing of static graphs, its application to streaming graph computations is less relevant.

We discuss this avenue of research in Section 4.1.

• Graph Domain Specific Languages: A number of DSLs have been developed to express

graph algorithms. However, these DSLs largely target shared-memory architectures, op-

erate on static graphs, and do not have the ability to express machine learning pipelines.

Graph DSLs are discussed in Section 4.2.

• Linear Algebra-based Graph Systems: Many graph algorithms can be expressed as compu-

tations on matrices and as such, systems have been developed that adopt this approach.

These approaches are discussed in Section 4.3.

• Graph APIs: In Section 4.4 we discuss other libraries that tackle graph computations, but

aren’t necessarily DSLs.

• Graph computations using GPU: In Section 4.5 we discuss approaches that target the GPU

for graph computations.

• SPARQL: The semantic web [21] is a graph with a rich set of edge types. As such, the

query language for the semantic web, SPARQL, has as its basis basic graph patterns

that are essentially subgraphs. SPARQL influenced the design of SAL. Related efforts are

discussed in Section 4.6.

Machine learning is a rich and active research area. We discuss common frameworks in

Section 4.7. Data Stream Management Systems (DSMS) are designed to handle streaming data as opposed to static data. The initial approaches were largely ports of relational database concepts to streaming data. Lately a plethora of projects with distributed APIs for handling streaming have become available, largely through Apache projects. These efforts are discussed in Section 4.8. We also discuss Network Telemetry in Section 4.9, which is an area dealing with network monitoring using network infrastructure devices such as switches.

4.1 Vertex-centric Computing

The idea of vertex-centric programming was popularized by Google with their paper on Pregel

[131], a distributed graph platform with operations similar to MapReduce. Vertex-centric programming is based on Valiant’s Bulk Synchronous Parallel (BSP) [200] paradigm. BSP prescribes a method for organizing computation in order to both predict parallel performance and simplify the task of parallelizing algorithms. There are three phases:

(1) Concurrent computation: Each node performs computation independent of all other nodes.

Memory access occurs only from the nodes’ local memory.

(2) Communication: The nodes communicate with each other.

(3) Barrier Synchronization: All nodes are stalled until the communication phase is over.

Algorithms described in terms of vertex-centric computation have a similar pattern. For example, consider the PageRank algorithm. Each vertex is assigned an initial PageRank value, and then the algorithm iterates until the PageRank values stabilize:

(1) Each vertex takes its current PageRank score, divides by the number of outgoing edges,

and then sends that value to all of its neighbors. This is the communication phase of BSP.

(2) Progress halts until all vertices have received messages with updated PageRank values from

their neighbors. This is the synchronization phase of BSP.

(3) Each vertex re-computes the PageRank score based upon the updated values from its

neighbors. This is the concurrent computation phase of BSP.

In essence, vertex-centric programming is BSP where the communication phase is limited to nodes connected by edges in the graph.
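As a generic illustration of this pattern (not tied to Pregel or any particular framework), the C++ sketch below simulates PageRank supersteps: a communication phase that sends rank shares along edges, an implicit barrier, and a computation phase that recomputes each vertex’s score. The adjacency representation and damping factor are assumptions, and dangling-node mass is ignored for brevity.

#include <cstddef>
#include <vector>

// BSP-style PageRank: each superstep is (communicate, barrier, compute).
// 'out' holds each vertex's outgoing neighbor list.
std::vector<double> pageRank(const std::vector<std::vector<int>>& out,
                             int supersteps, double damping = 0.85) {
  const std::size_t n = out.size();
  std::vector<double> rank(n, 1.0 / n);
  for (int step = 0; step < supersteps; ++step) {
    // Communication phase: each vertex sends rank/outdegree to its neighbors.
    std::vector<double> incoming(n, 0.0);
    for (std::size_t v = 0; v < n; ++v) {
      if (out[v].empty()) continue;               // dangling mass ignored here
      double share = rank[v] / out[v].size();
      for (int w : out[v]) incoming[w] += share;
    }
    // Barrier: all messages delivered before any vertex recomputes.
    // Concurrent computation phase: recompute each vertex's score.
    for (std::size_t v = 0; v < n; ++v)
      rank[v] = (1.0 - damping) / n + damping * incoming[v];
  }
  return rank;
}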

Since Pregel was developed, a number of other platforms have adopted the vertex-centric programming model. Below is a list:

• Giraph is the open source Apache version of Pregel [93].

• GraphLab [127, 126] and PowerGraph [74] arose from the same group at Carnegie Mellon.

GraphLab started as a shared-memory implementation. PowerGraph moved the approach

to a distributed platform and used the vertex-centric approach for parallelization. Pow-

erGraph uses MPI as the underlying mechanism for parallelization and is restricted to

in-memory computations with no fault tolerance. The latest instantiation, GraphLab Cre-

ate, backs away from enabling custom algorithm development, but instead defines a python

interface backed by C++ code. GraphLab Create added support for out-of-core computa-

tions.

• GraphX [210] extends Spark’s Resilient Distributed Dataset (RDD) to form a new abstrac-

tion called the Resilient Distributed Graph (RDG). While not strictly vertex-centric, they

show through their API calls to updateVertices and aggregateNeighbors that they

can implement the patterns defined by Pregel and PowerGraph. Because GraphX is built

on top of Spark, it is distributed and the RDG mechanism provides fault-tolerance.

• Grace [165] is an in-memory graph management system. It is relatively unique in that

it includes support for transactions and ACID semantics. Besides an API for performing

updates to the graph (e.g. vertex or edge addition or deletion), the programming paradigm

is also vertex-centric.

• Graph Processing System (GPS) [171, 99] is a Pregel-like distributed system that added

these features: 1) API support for global computations, 2) dynamic repartitioning, and 3)

distribution of high-degree vertices. In [99] they add Green-Marl [98] as a domain specific

language that is then mapped to GPS’s code base. Green-Marl is a DSL for expressing

graph algorithms, which we discuss in more detail in Section 4.2.1.

• GraphChi [119] examines out-of-core graph computations on a single node by performing

operations on sets of edges that are stored sequentially on disk. Disabling random access

increases the number of edges read from disk, but the speed of the sequential access more

than compensates.

• CuSha [116] is a vertex-centric graph processing system that is CUDA-based. To overcome

the problem of irregular memory accesses, resulting in poor GPU performance, they use

two graph representations: G-Shards and Concatenated Windows. G-Shards uses the shard

concept from GraphChi [119] to organize edges into ordered sets for memory access patterns

better suited for GPU computations. Concatenated Windows is a modification of G-Shards

for sparse graphs. The size of a shard in G-Shards is inversely related to sparsity: the

sparser the graph, the smaller the shard size on average. This can lead to GPU under-

utilization. Concatenated Windows combines shards into the same computation window.

• GraphGen [156] is an FPGA approach for vertex-centric computations. It takes a vertex-

centric graph specification and compiles that into the targeted FPGA platform. They show

speedups between 2.9x and 14.6x over CPU implementations.

• X-Stream [170] can perform both in-memory and out-of-core computations on graphs, but

only for a single shared-memory machine. The novelty of this approach is that the programming

paradigm is edge-centric rather than vertex-centric. By adopting an edge-centric model,

they can avoid some of the overhead incurred by GraphChi [119], which must pre-sort its

shards by source vertex.

• VENUS [38] introduces optimizations to the original out-of-core approach of GraphChi

[119], such as handling the structure of the graph, which doesn’t change, differently from

vertex values, which change over the computation.

• Pregelix [29] was one of the first distributed approaches that allowed out-of-core workloads

for vertex-centric programming. They employ database-style storage management and map

vertex-centric programs to a query plan of joins and computations on tables.

• CHAOS [169], another distributed out-of-core approach, is a generalization of the X-Stream

approach [170] for a distributed setting. The main assumption behind CHAOS is that the

network is not the bottleneck, but instead reading from disk. Thus if a node needs a shard

from a remote node, the latency is about the same as reading from a local drive. They

show weak scaling results to 32 nodes, processing a graph 32 times larger than on one node,

but only taking 1.61 times longer.

• PowerLyra’s [37] contribution is to recognize that processing low-degree vertices should

be done differently than high-degree vertices in the gather-apply-scatter approach (BSP).

High-degree vertices have their edges partitioned across the cluster, while low degree vertices

are contained entirely within one node.

• GraphD [211] is another distributed out-of-core approach. They show significant im-

provement over Pregelix and CHAOS for certain classes of problems. Part of the reason

behind the speedup is GraphD’s ability to eliminate external-memory joins by using mes-

sage combiners.

• FlashGraph [219] designs its vertex-centric computations around an array of solid-state

drives (SSDs). Vertex information is kept in memory, while edges are read-only and reside

on the SSDs. Their performance compares favorably to in-memory approaches, and they

are able to compute on a 129 billion edge graph on a single node.

• G-store [118] is another vertex-centric approach utilizing flash drives. They outperform

FlashGraph by up to 2.4× and can run some algorithms on a trillion edge graph in tens of

minutes. They improve the cache hit ratio on multi-core CPUs using a tile-based physical

grouping on disks. They also use a slide-cache-rewind approach where G-Store tries to

utilize data that has already been fetched.

Table 4.1 provides a summary of all the vertex-centric methods discussed above. It shows which approaches run on only a single node and which can run on multiple nodes in a distributed setting. Also, it shows which frameworks are in-memory only and which can utilize out-of-core computations or flash drives.

[Table 4.1 lists the frameworks Pregel, Giraph, GraphLab, PowerGraph, GraphX, GPS, GraphChi, X-Stream, VENUS, Pregelix, CHAOS, PowerLyra, GraphD, FlashGraph, and G-Store against the attributes Single Node, Multiple Nodes, In-memory Only, Out-of-core, and Flash.]

Table 4.1: A table of vertex-centric frameworks and various attributes.

None of the above methods provide any native support for temporal information. However, it would be possible to encode temporal information on the edges (most provide support for defining arbitrary data structures on the edges and vertices). Then it would be incumbent upon the programmer to properly use this information to implement various forms of the temporal logic.

The biggest impediment to using vertex-centric computing for a streaming setting is that it is by nature a batch process. Adapting it to a streaming environment would likely entail creating overlapping windows of time over which to perform queries in batch. Of course, SAL is a much higher level of abstraction than any of the interfaces provided by these frameworks.

4.2 Graph Domain Specific Languages

A major contribution of this work is the presentation of a domain specific language for streaming temporal graph computations that allows succinct algorithmic descriptions. In this section we give an overview of other domain specific languages that specifically target graph computations.

We discuss the following works:

• Green-Marl [98]

• Ligra [181]

• Galois [152]

• Gemini [220]

• Grazelle [83]

• GraphIt [217]

A limitation for most of these DSLs, except for Gemini, is that they target shared memory machines. While some of the concepts could be generalized to distributed memory approaches, their current implementations do not include support for distributed settings. Another work, called

Gluon [49], is an approach that can be used to convert existing shared-memory infrastructures into distributed frameworks. Gluon provides a lightweight API that allows an existing shared memory graph system to partition the graph across a cluster. The primary requirement is that the shared memory graph system provides a vertex-centric programming model (discussed in Section 4.1). The creators of Gluon provide a performance evaluation of Ligra, Galois, and IrGL (a GPU, single-node graph system) [159] that have been extended to a distributed setting, called D-Ligra, D-Galois, and

D-IrGL. Overall they show that D-Galois and D-IrGL outperform Gemini.

Regardless of whether these DSL approaches can be adjusted to a distributed setting, a fundamental difference between this set of work and our own is the streaming aspect of our approach and domain. An underlying assumption of the approaches in this section is that the data is static.

The algorithmic solutions, partitioning, operation scheduling, and underlying data processing rely upon this assumption.

4.2.1 Green-Marl

Green-Marl [98] attempts to increase programmer productivity through the use of high level constructs that can be used to succinctly describe graph algorithms. One of these constructs is traversals over the graph, either breadth-first or depth-first. Below is their implementation of betweenness centrality using Green-Marl. They iterate over each node and perform breadth-first search from the node to find the number of shortest paths through each node. They make use of

InBFS and InRBFS, which iterate through nodes in breadth-first and reverse breadth-first order, respectively.

Listing 4.1: Green-Marl Betweenness Centrality

1 Procedure Compute_BC(G: Graph, BC: Node_Prop(G)) {
2   G.BC = 0; // initialize BC
3   Foreach(s: G.Nodes) {
4     // define temporary properties
5     Node_Prop(G) Sigma;
6     Node_Prop(G) Delta;
7     s.Sigma = 1; // Initialize Sigma for root
8     // Traverse graph in BFS-order from s
9     InBFS(v: G.Nodes From s)(v != s) {
10      // sum over BFS-parents
11      v.Sigma = Sum(w: v.UpNbrs) {w.Sigma};
12    }
13    // Traverse graph in reverse BFS-order
14    InRBFS(v != s) {
15      v.Delta = Sum(w: v.DownNbrs) { // sum over BFS-children
16        v.Sigma / w.Sigma * (1 + w.Delta)
17      };
18      v.BC += v.Delta @s; // accumulate BC
19    }}}

When BFS and DFS are not sufficient to express a desired computation, the framework allows

you to access the vertices and edges and write the for-loops yourself. For many situations, Green-

Marl provides implicit parallelism, with the compiler automatically parallelizing loops. Besides

the BFS and DFS traversals and implicit parallelism, Green-Marl also provides various collections,

deferred assignment (allows for bulk synchronous consistency), and reductions.

4.2.2 Ligra

Ligra [181] has more flexible mechanisms than Green-Marl for writing graph traversals. While

Green-Marl only allows BFS and DFS, Ligra can operate over arbitrary sets of vertices on each

iteration. This allows it to naturally and succinctly express algorithms for radii estimation and

Bellman-Ford shortest path, which do not use BFS or DFS traversals.

4.2.3 Galois

The researchers behind Galois [152] argue that BSP-style computations are insufficient for

performance across a variety of input graphs and graph algorithms. They present a light-weight infrastructure that can represent the APIs of GraphLab [127], GraphChi [119], and Ligra [181].

Galois also provides fine-grained scheduling mechanisms through the use of priorities per task.

4.2.4 Gemini

Gemini [220] is a graph DSL similar to Ligra, but with an implementation that addresses shortcomings of graph computations in a distributed setting. In McSherry et al. [139], the authors note that a single threaded approach beats many scalable distributed graph frameworks. Gemini adopts several approaches to make scalability efficient for graph computations:

• Adaptive push/pull vertex updates depending upon sparsity/density of active edge set:

When there are sparse/few active edges, pushing updates to vertices along outgoing edges is

generally more efficient while for dense/many active edges, it is more efficient to pull vertex

updates from neighboring vertices.

• Chunk-based partitioning: They try to capture locality within the graph structure by using

a simple scheme for creating contiguous chunks of vertices.

• CSR and CSC: Similar to SAM, they index edges using compressed sparse row and com-

pressed sparse column format. However, they employ bitmaps and a doubly compressed

sparse column approach to reduce the memory footprint of both data structures.

• NUMA-aware: They apply their chunk-based partitioning recursively in an attempt to

factor in NUMA-characteristics of today’s multicore architectures.

• Work-stealing: While the overall computation is a bulk-synchronous parallel computation,

they employ types of work-stealing within a node to better load balance computation

intranode.

With these optimizations, Gemini is able to outperform other distributed graph implementations by between 8.91× and 39.8× on an eight-node cluster connected with 100 Gbps InfiniBand.

4.2.5 Grazelle

Grazelle [83] focuses on pull-based graph algorithms where an outer loop iterates over destination vertices and an inner loop over in-edges. They optimize this pattern with a scheduler-aware interface for the parallel loops that allows each thread to execute consecutive iterations, and a vector-sparse format that overcomes some of the limitations of compressed-sparse data structures, particularly with unaligned memory accesses that occur with low-degree vertices. They compare their performance on a single node against four other graph systems, with improvements ranging between 4.6× and 66.8×.

4.2.6 GraphIt

GraphIt [217] is another domain specific language for graph applications. They make the argument that computations on graphs are difficult to reason about. For example, the performance of an implementation of a graph algorithm can vary widely depending on the structure of the underlying graph. Their DSL, GraphIt, separates the what (the end result) from the how (the schedule of operations). This allows programmers to specify a correct algorithm, and then separately tinker with the specific optimization strategies. They also present an autotuner that will automatically search for optimizations.

GraphIt allows programmers to reason about three different tradeoffs in optimizations: locality, work efficiency, and parallelism. Often, one can optimize one dimension at the detriment of the others. For example, load-balancing is often an issue with graph computations. One can spend more time and instructions (hurting work efficiency) to enable better load-balancing and locality. It is often difficult to know a priori which optimizations are appropriate for a given algorithm and set of data. GraphIt allows programmers (and the autotuner) to specify eleven different optimizations which have varying effects on the tradeoff space. These strategies include: DensePull, DensePush, DensePull-SparsePush, DensePush-SparsePush, edge-aware-vertex-parallel, edge-parallel, bitvector, vertex data layout, cache partitioning, NUMA partitioning, and kernel fusion.

A GraphIt program has two sections: the algorithm specification and the scheduling optimizations. The algorithm API is divided into set operators, vertexset operators, and edgeset operators. Low level details such as synchronization and deduplication are abstracted away and dealt with by the compiler. Lines of the algorithm specification can be annotated with labels (of the form #label#), which are later referenced in the scheduling section where specific optimizations can be applied.

4.3 Linear Algebra-based Graph Systems

Many algorithms on graphs are readily described in terms of matrix operations using the adjacency matrix, A, of the graph. For instance, the number of triangles in the graph is Trace(A^3)/6 (assuming an undirected graph with no self-loops). Each (i, j) element of A^3 is the number of paths of length three going from i to j. Elements along the diagonal describe three-paths going from a node back to itself, i.e. triangles. Since each node reports its participation in the triangle, and does so twice (once in each direction), we divide by six to get the total number of unique triangles. There are many other examples of graph algorithms that can be described as operations on the adjacency matrix [70, 114].
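For illustration, the following C++ sketch applies this identity to a small dense adjacency matrix; it is a cubic-time toy, not how a scalable linear algebra system would compute it. For the complete graph on three vertices, each diagonal entry of A^3 is 2, so the trace is 6 and the count is 1.

#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;

// Dense matrix multiply, sufficient for a small illustrative graph.
Matrix multiply(const Matrix& a, const Matrix& b) {
  std::size_t n = a.size();
  Matrix c(n, std::vector<long long>(n, 0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < n; ++k)
      for (std::size_t j = 0; j < n; ++j)
        c[i][j] += a[i][k] * b[k][j];
  return c;
}

// Number of triangles = Trace(A^3) / 6 for an undirected graph with no self-loops.
long long countTriangles(const Matrix& adj) {
  Matrix a3 = multiply(multiply(adj, adj), adj);
  long long trace = 0;
  for (std::size_t i = 0; i < adj.size(); ++i) trace += a3[i][i];
  return trace / 6;
}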

While many graph algorithms can be expressed as operations on sparse matrices, it is unclear if that abstraction is best in terms of programmer productivity or parallel performance. Satish et al. [174] begin to look at these issues, though mostly addressing performance issues with

Combinatorial BLAS [32] and other approaches. For our work, it is unclear how to account for temporal properties on the edges. Multiplying an adjacency matrix with itself does not account for the temporal constraints on the edges. Also, the question remains how streaming operations would be implemented. On what version of the adjacency matrix would the operations be performed?

Frameworks that express graph algorithms as linear algebra include the following:

• Combinatorial BLAS [32]

• Knowledge Discovery Toolbox [128]

• Pegasus [110]

• GraphMat [192]

We discuss these approaches below.

4.3.1 Combinatorial BLAS

Buluç and Gilbert [32] present one of the first libraries for performing graph computations in terms of linear algebra primitives. They name it Combinatorial BLAS, combinatorial because combinatorics and graph theory are strongly related, and BLAS, after the canonical Basic Linear

Algebra Subroutines, a well known library of successful primitives packages. Since adjacency matrices are the primary target of the library, distributed sparse matrices are first class citizens in

Combinatorial BLAS. However, there are other functions that use dense matrices, and dense and sparse vectors. Similar to BLAS, they define a set of primitive functions. For example, SpGEMM is a function that multiplies two sparse matrices. In their paper introducing Combinatorial BLAS, the authors provide details for two algorithms implemented with the new API, betweenness centrality and Markov clustering.

4.3.2 Knowledge Discovery Toolbox

Another notable work using a linear algebra-type approach includes the Knowledge Discovery

Toolbox (KDT) [128]. KDT uses Combinatorial BLAS as computational kernels but presents

Python as the API.

4.3.3 Pegasus

Pegasus [110] implements a sparse matrix-vector multiplication (GIM-V, or generalized iterated matrix-vector multiplication) on top of Hadoop and describes how this single primitive is sufficient for algorithms such as PageRank, random walk with restart, diameter estimation, and connected components.

4.3.4 GraphMat

GraphMat [192] provides a way of expressing vertex-centric programs which are then mapped to matrix-style computations. They argue that this approach garners the productivity of vertex-centric programming while retaining the performance of optimized sparse matrix operations. They show speedups between 1.1× and 7× over other graph frameworks.

4.4 Graph APIs

In this section we discuss other APIs that have been developed for processing graph algorithms:

• Parallel Boost Graph Library (PBGL) [81]

• The MultiThreaded Graph Library (MTGL) [22]

• Small-world Network Analysis and Partitioning (SNAP) [18]

• STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation [17]

• Grappa [149, 148]

The Parallel Boost Graph Library (PBGL) [81] builds upon the Boost Graph Library (BGL)

[182], using concepts such as generic programming and the visitor pattern. Like the Standard

Template Library, BGL and PBGL extensively use templates for class types so that the same algorithm can be applied to different graph representations. For example, you could have an adjacency-matrix representation of a graph or a compressed sparse row graph, but breadth-first search would be written generically enough to apply to both representations.

Another concept important to BGL and PBGL is the visitor pattern. Visitors allow programmers to modify algorithms to extend functionality in a fine-grained way. For example, the examine_vertex function of the BFS visitor could be used to increment a counter on each vertex in order to count the number of paths going through each vertex in a betweenness calculation. In this way, it is similar to the BFS traversals of Green-Marl and Ligra, just expressed in a different way.
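As a small illustration of the visitor concept, the sketch below uses the sequential BGL (not the distributed PBGL) to attach a custom BFS visitor that counts how many times each vertex is examined; the graph and the counting behavior are chosen purely for demonstration.

#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/breadth_first_search.hpp>
#include <cstddef>
#include <iostream>
#include <vector>

using Graph = boost::adjacency_list<boost::vecS, boost::vecS, boost::undirectedS>;

// A custom visitor: increments a counter each time BFS examines a vertex.
struct ExamineCounter : boost::default_bfs_visitor {
  std::vector<int>* counts;
  explicit ExamineCounter(std::vector<int>* c) : counts(c) {}
  template <typename Vertex, typename G>
  void examine_vertex(Vertex u, const G&) const { ++(*counts)[u]; }
};

int main() {
  Graph g(4);
  boost::add_edge(0, 1, g);
  boost::add_edge(1, 2, g);
  boost::add_edge(2, 3, g);

  std::vector<int> counts(boost::num_vertices(g), 0);
  ExamineCounter vis(&counts);
  boost::breadth_first_search(g, boost::vertex(0, g), boost::visitor(vis));

  for (std::size_t v = 0; v < counts.size(); ++v)
    std::cout << "vertex " << v << " examined " << counts[v] << " time(s)\n";
  return 0;
}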

The MultiThreaded Graph Library [22] was directly inspired by PBGL, using many of the same concepts such as generic templates and visitors. However, instead of a distributed setting as with PBGL, MTGL focuses on shared-memory supercomputers such as the Cray XMT.

Another set of shared-memory approaches are SNAP [18] and STINGER [17]. Both SNAP and STINGER come from Georgia Tech. SNAP is a library for analyzing primarily static graphs

(with some support for dynamic graphs) on shared memory machines. The primary building blocks are called graph kernels, which include BFS, connected components, and minimum spanning tree.

It is unclear what constitutes a kernel, and what is the set of expressible algorithms given a set of kernels. While SNAP was more focused on static graph analysis, STINGER is for dynamically changing graphs. Again, STINGER is limited to shared memory systems. With STINGER, they developed several dynamic graph algorithms for connected components [136], clustering coefficient

[55], and betweenness centrality [80].

Grappa [149, 148] is an effort to create the illusion of a shared memory machine on a distributed cluster. They create a latency tolerant software layer on top of commodity hardware to support data-intensive applications such as graph problems and mitigate latency problems associated with small messages, poor locality, and the need for fine-grained synchronization. In essence, they want to recreate the nice features of the Tera MTA without the exorbitant cost of specialized hardware [8]. In [148] they show how they can implement various frameworks such as GraphLab and Spark and achieve greater performance than obtainable with native algorithms.

4.5 GPU Graph Systems

Programming graph computations on GPUs is particularly challenging due to the usually random memory access patterns of many graph algorithms. However, several optimizations can be employed to enable GPU-facilitated computations. Totem [68] and GStream [176] are some of the

first works to process graphs with GPUs where the size of the graph is larger than what is available on the device. IrGL [159] is an intermediate-level program representation that allows graph algorithms to be described that are then translated into CUDA code with a compiler. The compiler automatically utilizes three types of optimizations for data-driven problems to allow efficient utilization of the GPU. Groute [19] provides an asynchronous programming model that can run on multi-GPU systems. Pan et al. also provide a framework for graph computations on multi-GPU systems, published at the same time as Groute. While an important body of work, to date GPU graph systems focus on static graph analysis and do not consider streaming.

4.6 SPARQL

SPARQL [66] (a recursive acronym for SPARQL Protocol and RDF Query Language) is a

SQL-like declarative language for querying the semantic web. It is used to express subgraphs of interest in an RDF (Resource Description Framework) dataset. RDF is described in terms of triples: a subject, a predicate, and an object. This can be thought of as describing a graph, where the subject and object are nodes in the graph and the predicate is a labeled directed edge between the two. SPARQL influenced our syntax in how temporal subgraphs are expressed in SAL.

Below is an example SPARQL query, query number 2 from the popular Lehigh University

Benchmark (LUBM) [86]. It finds graduate students who are members of a department in the same university where they received their undergraduate degree. Figure 4.1 gives a visual representation of the query graph.

Listing 4.2: LUBM Query 2

PREFIX rdf:
PREFIX ub:
SELECT ?X, ?Y, ?Z
WHERE {?X rdf:type ub:GraduateStudent .
       ?Y rdf:type ub:University .
       ?Z rdf:type ub:Department .
       ?X ub:memberOf ?Z .
       ?Z ub:subOrganizationOf ?Y .
       ?X ub:undergraduateDegreeFrom ?Y}

Figure 4.1: SPARQL Query number two from the Lehigh University Benchmark.

The main building block of a SPARQL query is the basic graph pattern (BGP). The WHERE clause in LUBM query 2 is a BGP with six triple patterns. At its core is a triangle, with additional edges defining the vertex types. In general, SPARQL is well suited for specifying subgraphs of interest, though it does have problems when the query graph is automorphic, as this will produce duplicate answers [77]. Besides BGPs, SPARQL also has other constructs such as unions, negations, optionals, aggregations, and property paths.

The SPARQL standard does not address temporal aspects of the graph. There has been some work in extending RDF and SPARQL to encompass temporal notions. In Gutierrez et al. [89, 88] they present an extension of RDF to include temporal intervals. They provide a sketch of a query language. Each edge has a temporal interval associated with it, [t1, t2], or just [T] for short. You can then use those intervals in queries. Using one of their examples, if you want to find students that have taken a Masters-level course sometime since 2000, you would express the query

(?X, takes, ?C) : [?T], (?C, type, Master) : [?T], 2000 ≤ ?T, ?T ≤ Now

In Tappolet and Bernstein [197], they present a query language, τ-SPARQL, for temporal queries which is directly translatable to standard SPARQL queries. However, τ-SPARQL is not sufficiently expressive for our needs. It is suited to finding edges valid at a given time, or returning intervals of time when edges were valid, but not to expressing constraints on the temporal ordering of edges relative to each other in a subgraph query.

Much of the semantic web research has focused on relatively static collections of RDF triples with querying and reasoning implemented with that assumption. Another avenue has explored streaming environments, where RDF triples are generated continuously. Several benchmarks are commonly used to compare streaming RDF frameworks:

• LSBench [121] is a social network-based data generator with posts, comments, likes, follows

etc., that you would typically find with social network activity.

• SRBench [215] has its basis in the Linked Open Data cloud [137] with real-world sensor

data, LinkedSensorData [117], which is published U.S. weather data.

• CityBench [7] focuses on smart-city, IoT-enabled applications.

Approaches that specifically target streaming RDF and continuous SPARQL queries include C-

SPARQL [62], Wukong [179], Wukong+S [216], EP-SPARQL [9], SPARQLStream [35], and CQELS

[120]. While the streaming aspect of this problem and the task of finding subgraphs overlap with our goals, there are several aspects not addressed which our work targets. The temporal ordering of edges is ignored; they find subgraphs within windows of time, but do not specify the relative times of when edges occur, which is an important dimension when describing cyber attacks. Also, they do not compute attributes on the streaming data, which can then be used as features for a machine learning solution, or as a filtering mechanism for subgraph selection. RDF triples allow for a rich set of edge types; however, for cyber problems, we have edges with multiple attributes, which is awkward to express in RDF, the mechanism being reification [204]. Finally, with the exception of Wukong+S [216] that has scaling numbers to 8 nodes, none of the streaming RDF/SPARQL approaches are distributed.

4.7 Machine Learning

In terms of machine learning, there are several popular frameworks:

• Scikit-learn [163] is a Python code base with a range of machine learning approaches. It is

a useful tool for machine learning and data mining research and projects, but it is largely

limited to computations on a single node without much support for parallelization.

[129] focuses on distributed computations using a linear algebra framework

and mathematically focused Scala DSL.

• Deep learning frameworks: With the popularity of deep learning, several software frame-

works have been developed that allow the programmer to specify neural network archi-

tectures and modify hyperparameters. Some frameworks of note include TensorFlow [4],

PyTorch [162], and Keras [40]. All of these frameworks largely concentrate on employing

neural networks on a single node with one or more GPUs available.

All of these approaches are valuable, allowing common machine learning tasks to be utilized and explored with well-designed APIs; however, none address the streaming aspect of this work.

4.8 Data Stream Management Systems

Data Stream Management Systems (DSMS) are designed to specifically tackle streaming data. One of the first efforts to address data management for streaming data is the Aurora project

[36, 3]. In the Aurora system model, a variety of data sources feed into a directed acyclic graph

(DAG) of processing operations. Operators can be one of filter, map (a generalized projection operator), union (merging two or more streams into a single stream), bsort (buffered sort, an approximate sort operator), aggregate (applying a function to a window of tuples), join (joining two streams together over a specified time window), resample (similar to join but with a generalized time window), and split (an implicit operation deduced from the DAG). The DAG feeds into three types of endpoints: continuous queries, views, and ad-hoc queries.

While the Aurora project focused on stream management within a single node, Medusa [39] expanded that to handle a distributed setting, covering both the situation where the computational resources are under a single administrative domain (intra-participant distribution) and the more general case when services are provided by multiple autonomous participants (inter-participant distribution). For intra-participant distributions, Aurora is deployed on multiple nodes, and local decisions and pair-wise interactions between nodes are used to reconfigure the workload when resource usage becomes unbalanced. For inter-participant distributions, Medusa uses the economic principles of an agoric system [142] to adjust the load management. In such a system participants provide services such as data sources or analytical capabilities, and participants form contracts where they pay for those services. The idea is that market forces will drive the system to an efficient balance of computational resources.

The Borealis project [2] superseded the Aurora and Medusa projects, combining the two foci into one. Borealis adds the ability to revise queries when corrections to data streams are provided.

It also allows dynamic query modification, e.g. switching to less expensive operations when the system becomes overloaded. Finally, Borealis adds a framework for optimizing various QoS metrics in a heterogeneous network comprised of servers and sensors. This family of work, Aurora, Medusa, and Borealis, address an array of requirements for a streaming system, but focus largely on porting

SQL-like operations to a streaming environment, and they do not support streaming operators such as topk as first-class citizens.

IBM developed a streaming language, Streams Processing Language (SPL) [96], and its predecessor, SPADE [67]. SPL is the main avenue for programming applications in the IBM Streams platform. The main abstraction of SPL is a graph where the edges are streams of data (tuples), and the nodes are operators. SPL is very flexible, allowing for new operators to be defined using language concepts from C, Java, SQL, Python, and ML. Comparing SAL to SPL, SAL is much more domain specific. SAL comes with native support for streaming and semi-streaming operators.

SPL does not support these types of operators as first-class citizens, but given the expressiveness of the language, they could conceivably be programmed into a custom operator.

There are many recent frameworks that provide streaming APIs:

[190]

[188]

• Apache Flink [12]

[95]

[172]

[11]

• Apache Gearpump [13]

[64]

[10]

• Google Cloud Dataflow [78]

Many of these could potentially be the basis of a backend implementation for SAL. Zhang et al.

[216] report success in adopting Storm as the backend of a streaming C-SPARQL engine [62], while

Spark has significantly longer latencies. We initially targeted Spark streaming as an implementation backend for SAL but found that the overhead of keeping state from one discretized portion of the stream to the next was prohibitive. In Chapter 7 we report on performance using Apache Flink for streaming temporal subgraph matching. We found mixed results comparing SAM and Flink, largely dependent on the frequency of occurring subgraphs. Regardless, our main contribution in this work is the domain specific language of SAL. SAM was developed as a prototype implementation, and it is certainly possible to develop another implementation using these frameworks as a backend.

4.9 Network Telemetry

In our work with cyber data, we focus on netflow data. Netflow data is a summary of packet data going through a switch. Packets are organized by a 5-tuple key: source IP, destination IP, source port, destination port, and protocol. Statistics are gathered over related packets that share the same 5-tuple key. Previously, netflow aggregations were either performed using custom silicon (fixed-function ASICs) on high-end routers, or the packet traffic was mirrored to remote collectors for analysis [186]. However, recently there is a trend towards programmable packet forwarding, opening the door for hybrid computations between the telemetry plane (the switch) and the analysis plane. P4 [27], short for Programming Protocol-independent Packet Processors, is a domain specific language to express computations on network traffic data. Switches such as the Barefoot Tofino [151] are P4-enabled, providing the performance of fixed-function ASICs with the benefit of being reconfigurable.

Recent work has demonstrated the benefits of hybrid telemetry/analysis plane computations. Sonata [87] defines a declarative language for expressing network traffic computations, and then automatically partitions the computations between the data and analysis planes. Flowradar [125] strikes a balance between cheap switches with a paucity of resources and a remote collector with ample resources to enable netflow calculations without sampling. Another work, *Flow [187], has mechanisms for concurrent and dynamic queries by shifting aggregation functions from the telemetry plane to the analysis plane.

This work in network telemetry is a natural path forward for SAL. Our current work focuses on network traffic data after netflow calculations have already been performed; in essence, we are operating entirely in the analysis plane. However, opening the door for custom calculations at the packet level creates opportunities for analysts to develop new features for SAL pipelines. Also, we can create a unifying approach that deploys SAL programs not only to the analysis plane but also moves some of the computations up into the telemetry plane. Finally, most of the current work in network telemetry focuses on data coming from one switch. Combining data from multiple switches is largely unexplored. Our work with SAL and SAM has started in this direction. In Chapters 6 and 7 we show coordination of up to 128 nodes, each acting as a separate sensor.

Chapter 5

Implementation

In this chapter we discuss how we go from SAL code to an implementation that runs in parallel across a cluster. To translate SAL code, we used the Scala Parser Combinator Library [175]. This allowed us to express the grammar of SAL and map those elements to C++ code that utilizes a prototype parallel library that we wrote to execute SAL programs. We call this library the Streaming Analytics Machine, or SAM, which is over 19,000 lines of code. Table 5.1 below shows the difference between directly writing C++ code utilizing SAM and writing the same program in SAL. Writing in SAL uses about 10-25 times fewer lines of code.

This chapter is divided into three sections. We first talk about how data is partitioned across the cluster via SAM in Section 5.1. Then we discuss vertex-centric computations such as feature creation in Section 5.2. Finally we discuss the data structures and algorithm used for subgraph matching in Section 5.3.

                            C++   SAL
Disclosure                  520    20
Machine Learning Pipeline   482    34
Temporal Triangle Query     238    13

Table 5.1: This table shows the conciseness of the SAL language. For the Disclosure pipeline (discussed in Section 2.1), the machine learning pipeline (discussed in Section 6.1), and the temporal triangle query (discussed in Chapter 7), the amount of code needed is reduced between 10-25 times.

Figure 5.1: Architecture of the system: data comes in over a socket layer and is then distributed via ZeroMQ.

5.1 Partitioning the Data

SAM is architected so that each node in the cluster receives tuple data (see Figure 5.1). For the prototype, the only ingest method is a simple socket layer. In maturing SAM, other options such as Kafka [65] would be obvious alternatives. We then use ZeroMQ [6] to distribute the netflows across the cluster. The pattern used to distribute the data occurs more than once within SAM, and is encapsulated into a ZeroMQ Push Pull Communicator object, which we will refer to as the Communicator. The Communicator uses the push pull paradigm of ZeroMQ, where producers submit data to push sockets and consumers collect from corresponding pull sockets. In general, push sockets bind to an address and the pull sockets connect to an address, allowing multiple pull sockets to collect messages from the same address.

Behind the scenes, ZeroMQ opens multiple ports in the dynamic range (49152 to 65535) for each ZMQ socket that is opened logically in the code. Nevertheless, for some uses of the Communicator we found that contention on the ZMQ sockets created significant performance bottlenecks. As such, we created the option to specify multiple push sockets per node; anytime a node needs to send data to another node, it randomly selects one of the push sockets to use.

Figure 5.2: The push pull object.

Figure 5.2 gives an overview of the design. If there are n nodes in the cluster and p push sockets per destination node, in total each node has (n − 1)p push sockets. However, for the partitioning phase, only one push socket was needed. The need for multiple push sockets arose because of subgraph matching, and is discussed in more detail in Section 5.3.

When multiple push sockets were necessary, we also found that instantiating multiple threads to collect the data increased throughput: t threads are dedicated to pulling data from the pull sockets. Again, for the partitioning phase, one thread was sufficient. In our experiments, we found 4 push sockets per node and 16 total pull threads worked best for most situations. However, this is probably a consequence of the number of cores per node, which in this case was 20. These results are discussed more in Section 7.1.
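To make the pattern concrete, the following is a minimal sketch of the push/pull Communicator idea described above, written against cppzmq (zmq.hpp). The class name, constructor arguments, and the per-socket mutexes are illustrative assumptions and do not reflect SAM's actual interface; the sketch only shows p push sockets chosen at random per send and t pull threads feeding a callback. In SAM there is a set of push sockets per destination node; here a single set is kept for brevity.

#include <zmq.hpp>

#include <functional>
#include <mutex>
#include <random>
#include <string>
#include <thread>
#include <vector>

class PushPullCommunicator {
public:
  // bindUrls: the p push sockets this node binds; connectUrls: the push
  // addresses of other nodes that our pull sockets connect to.
  PushPullCommunicator(zmq::context_t& ctx,
                       const std::vector<std::string>& bindUrls,
                       const std::vector<std::string>& connectUrls,
                       std::size_t numPullThreads,
                       std::function<void(std::string)> callback)
      : pushLocks_(bindUrls.size()) {
    for (const auto& url : bindUrls) {
      pushSockets_.emplace_back(ctx, zmq::socket_type::push);
      pushSockets_.back().bind(url);
    }
    // t pull threads, each with its own pull socket connected to every peer.
    zmq::context_t* ctxPtr = &ctx;
    for (std::size_t i = 0; i < numPullThreads; ++i) {
      std::thread([ctxPtr, connectUrls, callback] {
        zmq::socket_t pull(*ctxPtr, zmq::socket_type::pull);
        for (const auto& url : connectUrls) pull.connect(url);
        while (true) {
          zmq::message_t msg;
          if (pull.recv(msg, zmq::recv_flags::none))
            callback(msg.to_string());  // e.g. an edge or an edge request
        }
      }).detach();  // shutdown handling omitted in this sketch
    }
  }

  // Randomly pick one of the p push sockets to avoid contention on a single
  // socket; ZeroMQ sockets are not thread-safe, hence the per-socket mutex.
  void send(const std::string& data) {
    thread_local std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<std::size_t> pick(0, pushSockets_.size() - 1);
    std::size_t i = pick(gen);
    std::lock_guard<std::mutex> guard(pushLocks_[i]);
    zmq::message_t msg(data.data(), data.size());
    pushSockets_[i].send(msg, zmq::send_flags::none);
  }

private:
  std::vector<zmq::socket_t> pushSockets_;
  std::vector<std::mutex> pushLocks_;
};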

As discussed in Chapter 2, the data is partitioned with SAL statements of the form:

HASH <field> WITH <hash function>;

SAM applies the specified hash function to the given tuple field and sends the tuple to the proper node. For all of our experiments, we have the following two statements:

HASH SourceIp WITH IpHashFunction;

HASH DestIp WITH IpHashFunction;

We do this since almost all computations for cyber analysis, our initial target domain, involve aggregating statistics and features per IP address. For each netflow that a node receives, it computes a hash (shared across the cluster) of the source IP modulo the cluster size to determine where to send the netflow. Similarly, the node hashes the destination IP modulo the cluster size and sends the netflow to that node. Thus, for each netflow a node receives over the socket layer, it sends the same netflow twice over ZeroMQ. Each node is then prepared to perform calculations on a per-IP basis. SAL can be extended to use other hash functions, as discussed in Chapter 2.
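A simplified sketch of this partitioning step is shown below (not SAM's actual code): each netflow is sent twice, once according to its source IP and once according to its destination IP, so that every node ends up with all edges touching the IPs it is responsible for. The Netflow type, the hash function, and the Communicator interface are stand-ins.

#include <cstddef>
#include <functional>
#include <string>

struct Netflow {
  std::string sourceIp;
  std::string destIp;
  std::string raw;  // the full netflow record as received
};

// Stand-in for SAL's IpHashFunction; any hash shared by all nodes works.
inline std::size_t ipHash(const std::string& ip) {
  return std::hash<std::string>{}(ip);
}

template <typename Communicator>
void partitionNetflow(const Netflow& flow, std::size_t clusterSize,
                      Communicator& comm) {
  // Hash mod cluster size picks the node responsible for each IP.
  std::size_t bySource = ipHash(flow.sourceIp) % clusterSize;
  std::size_t byDest = ipHash(flow.destIp) % clusterSize;
  comm.sendTo(bySource, flow.raw);  // node aggregating this source IP
  comm.sendTo(byDest, flow.raw);    // node aggregating this destination IP
}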

That concludes our discussion of how tuple data is partitioned across the cluster. Once the data is partitioned, we can then perform queries involving vertex-centric computations and subgraph matching.

5.2 Vertex-centric Computations

The partitioning phase can be thought of as defining vertices in the graph. For cyber data, partitioning by source IP and destination IP means that each node in the cluster has access to all the tuples/edges related to a particular IP. Thus we can easily calculate statistics regarding each IP with no further communication within the cluster for these downstream calculations.

There are three different interfaces (defined as abstract or base classes) that are key to vertex-centric computations:

• Data Source: These classes are how the cluster receives data. As mentioned before, we implemented one avenue to ingest data through socket communications called ReadSocket. ReadSocket implements the AbstractDataSource class, which has two methods: connect and receive. For ReadSocket, connect opens the socket for communication while receive starts a loop to ingest the data. This data is then made available to Consumers through the Producer interface, discussed next.

• Producers: These objects produce tuples that can be consumed by other classes called Consumers. The important methods are registerConsumer, which allows a Consumer to express interest in the output of a Producer, and parallelFeed. Pseudocode for the parallelFeed method is shown in Algorithm 1. It emits tuples to consumers in parallel.

• Consumers: Consumers implement the AbstractConsumer class, which has the method consume(tuple). Producers call this method to supply the consumers with tuples to ingest.

Algorithm 1 Producer: Feeding Tuples in Parallel
 1: function Producer.parallelFeed(edge)
 2:   mutex.lock()
 3:   queue[numItems] ← edge
 4:   ++numItems
 5:   if numItems ≥ queueLength then
 6:     for all c ∈ consumers do        ▷ Iterate over all consumers
 7:       for all e ∈ queue do          ▷ Parallel loop
 8:         c.consume(e)
 9:       end for
10:     end for
11:   end if
12:   mutex.unlock()
13: end function
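For readers who prefer C++ to pseudocode, the following is a condensed sketch of the three interfaces and the parallelFeed logic of Algorithm 1. Class and method names follow the text above, but the bodies (the thread-per-tuple loop, the queue length, the Tuple alias) are simplified assumptions rather than SAM's actual implementation; a production version would use a thread pool.

#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

using Tuple = std::string;  // stand-in for a netflow/edge tuple

class AbstractConsumer {
public:
  virtual ~AbstractConsumer() = default;
  virtual void consume(const Tuple& tuple) = 0;
};

class AbstractDataSource {
public:
  virtual ~AbstractDataSource() = default;
  virtual bool connect() = 0;  // e.g. ReadSocket opens its socket here
  virtual void receive() = 0;  // loop that ingests data and feeds a Producer
};

class Producer {
public:
  void registerConsumer(std::shared_ptr<AbstractConsumer> c) {
    consumers_.push_back(std::move(c));
  }

  // Buffers edges; once the buffer is full, every registered consumer sees
  // every buffered edge, with the inner loop run in parallel (Algorithm 1).
  void parallelFeed(const Tuple& edge) {
    std::lock_guard<std::mutex> guard(mutex_);
    queue_.push_back(edge);
    if (queue_.size() >= queueLength_) {
      for (auto& c : consumers_) {
        std::vector<std::thread> workers;
        for (const auto& e : queue_)  // the "parallel loop" of the pseudocode
          workers.emplace_back([c, e] { c->consume(e); });
        for (auto& w : workers) w.join();
      }
      queue_.clear();  // reset the buffer (implicit in the pseudocode)
    }
  }

private:
  std::vector<std::shared_ptr<AbstractConsumer>> consumers_;
  std::vector<Tuple> queue_;
  std::size_t queueLength_ = 128;  // illustrative buffer size
  std::mutex mutex_;
};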

Many SAL statements and operators map to the consumer and/or producer interfaces and to an underlying C++ class that provides the specific functionality. Table 5.2 gives an overview.

C++ Name            Consumer  Producer  Feature Creator  Maps To
ReadSocket                    x                          FlowStream
ZeroMQPushPull      x         x                          N/A
Filter              x         x                          FILTER
Transform           x         x                          TRANSFORM
CollapsedConsumer   x                   x                COLLAPSE
Project             x                                    COLLAPSE
EHSum               x                   x                sum
EHAve               x                   x                ave
EHVariance          x                   x                var
TopK                x                   x                topk

Table 5.2: A listing of the major classes in SAM and how they map to language features in SAL. We also make the distinction between consumers, producers, and feature creators. Producers all share a parallelFeed method that sends the data to registered consumers, which is how parallelization is achieved. Each of the consumers receives data from producers. If a consumer is also a feature creator, it performs a computation on the received data, generating a feature, which is then added to a thread-safe feature map.

Often consumers also generate features. Each node creates a thread-safe feature map to collect features that are generated. The function signature is as follows:

updateInsert(string key, string featureName, Feature f)

The key is generated by the key fields specified in the STREAM BY statement. The featureName comes from the identifier specified in the FOREACH GENERATE statement. For example, in the SAL snippet below, the key is created by using a string hash function on the concatenation of the source IP and destination IP. The identifier is Feature1, as specified in the FOREACH GENERATE statement.

DestSrc = STREAM Netflows BY SourceIp, DestIp;

Feature1 = FOREACH DestSrc GENERATE ave(SrcTotalBytes);

The scheme for ensuring thread-safety in the feature map comes from Goodman et al. [75].
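A minimal sketch of such a thread-safe feature map is shown below, using the updateInsert signature just given. The per-bucket locking here is a simplification for illustration and is not necessarily the scheme of Goodman et al. [75]; the Feature type and bucket count are stand-ins.

#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

struct Feature {
  double value = 0.0;  // e.g. the current ave or var for a key
};

class FeatureMap {
public:
  void updateInsert(const std::string& key, const std::string& featureName,
                    const Feature& f) {
    std::size_t b = bucket(key, featureName);
    std::lock_guard<std::mutex> guard(locks_[b]);
    buckets_[b][key + "|" + featureName] = f;  // insert or overwrite
  }

  bool lookup(const std::string& key, const std::string& featureName,
              Feature& out) const {
    std::size_t b = bucket(key, featureName);
    std::lock_guard<std::mutex> guard(locks_[b]);
    auto it = buckets_[b].find(key + "|" + featureName);
    if (it == buckets_[b].end()) return false;
    out = it->second;
    return true;
  }

private:
  static constexpr std::size_t kNumBuckets = 1024;

  std::size_t bucket(const std::string& key,
                     const std::string& featureName) const {
    return std::hash<std::string>{}(key + "|" + featureName) % kNumBuckets;
  }

  std::array<std::unordered_map<std::string, Feature>, kNumBuckets> buckets_;
  mutable std::array<std::mutex, kNumBuckets> locks_;
};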

The Project class provides the functionality of the COLLAPSE statement. The Project class creates features similar to the Feature Creator classes, but they cannot be accessed directly through the API. The features are Map features, in other words the sets Ml using the terminology from Section 2.1. These Map features are added to the same feature map used for all other generated features. These Map features can then be used by the CollapsedConsumer class, which can be specified to calculate statistics on the map for each kept key in the stream.

5.3 Subgraph Matching

In this section we focus on the data structures, processes, and algorithms that enable a subgraph query. For discussion, we repeat the Watering Hole Attack query previously given in Listing 2.1.

Listing 5.1: Subgraph for Watering Hole Attack

1  SUBGRAPH ON Netflows WITH source(SourceIp)
2    AND target(DestIp)
3  {
4    target e1 bait;
5    target e2 controller;
6    start(e2) > end(e1);
7    start(e2) - end(e1) < 10;
8    bait in Top1000;
9    controller not in Top1000;
10 }

Subgraph queries are translated from the SPARQL-like syntax to a sequence of C++ objects, namely of type EdgeExpression, EdgeConstraintExpression, and VertexConstraintExpression. In the Watering Hole Attack query, the first two topological statements become EdgeExpressions, in essence assigning labels to vertices and edges. The next two lines define temporal constraints on the edges, which become EdgeConstraintExpressions. The last two lines are examples of VertexConstraintExpressions. SAM takes the EdgeExpressions and EdgeConstraintExpressions and develops an order of execution based upon the temporal ordering of the edges. In this example, e1 occurs before e2.

The approach we have taken assumes there is a strict temporal ordering of the edges. For our intended use case of cyber security, this is generally an acceptable assumption. Most queries have causality implicit in the question. For example, the Watering Hole Attack is a sequence of actions, with the initial cause being the malvertisement exploiting a vulnerability in a target system, which then causes other actions, all of which are temporally ordered. As another example, for botnets with a command and control server, commands are issued from the command and control server, which then produce subsequent actions by the bots, again yielding a temporal ordering of the interactions. Our argument is that requiring strict temporal ordering of edges is not a huge constraint, or at the very least covers many relevant questions an analyst might ask of cyber data. We leave as future work an implementation that can address queries that do not have a temporal ordering on the edges.

It should be noted that while SAM, the implementation, currently requires temporal ordering of edges, SAL, the language, is not so constrained. SAM can certainly be expanded to include other approaches that do not require temporal ordering of edges.

Once the subgraph query has been finalized with the order of edges determined, SAM is now set to find matching subgraphs within the stream of data. Before discussing the actual algorithm, we will first discuss the data structures used by SAM.

5.3.1 Data Structures

We make use of a number of thread-safe custom data structures, namely

• Compressed Sparse Row (CSR)

• Compressed Sparse Column (CSC)

• ZeroMQ Push Pull Communicator (Communicator)

• Subgraph Query Result Map (ResultMap)

• Edge Request Map (RequestMap)

• Graph Store (GraphStore)

In parentheses are the shortened names we will use to refer to them in the remainder of this thesis.

5.3.1.1 Compressed Sparse Row

CSR is a common way of representing sparse matrices, with the earliest known description in a 1967 article [32]. The data structure is useful for representing sparse graphs since graphs can be represented with a matrix, the rows signifying source vertices and the columns destination vertices. Instead of a |V| × |V| matrix, where |V| is the number of vertices, CSRs have a space requirement of O(|E|), where |E| is the number of edges. In the case of sparse graphs, |E| ≪ |V| × |V|.

Figure 5.3 presents the traditional CSR data structure. There is an array of size |V | where each element of the array points to a list of edges. For array element v, the list of edges are all the edges that have v as the source. With this index structure, we can easily find all the edges that have a particular vertex v as the source. However, a complete scan of the data structure is required to find all edges that have v as a destination, leading to the need for the compressed sparse column data structure, described in the next section.

CSRs as presented work well with static data where the number of vertices and edges remains constant. However, our work requires edges that expire, and also allows for the possibility that vertices may come and go. To handle this situation, we create an array of lists of lists of edges, with each array element protected with a mutex. Figure 5.4 shows the overall structure. Instead of an array of size |V|, we create an array of bins that are accessed via a hash function. When an edge is consumed by SAM, it is added to this CSR data structure. The source of the edge is hashed, and if the source has never been seen, a list is added to the first level. Then the edge is added to the list. If the vertex has been seen before, the edge is added to the existing edge list. Additionally, SAM keeps track of the longest duration of a registered query, and any edges that are older than the current time minus the longest duration are deleted. Checks for old edges are done anytime a new edge is added to an existing edge list.

The CSR has mutexes protecting each bin, and any access to a bin must first gain control of the mutex. This will be important later when we describe the algorithm for finding subgraph matches. Regarding performance, as each bin has its own mutex, the chance of collision and thread contention is very low. In metrics gathering, waiting for mutex locks of the CSR data structure has never been a significant factor in any of our experiments.
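The following is a simplified sketch of this modified CSR (the CSC is the same structure indexed by the target vertex). The Edge type, the bin count, and the expiry test are illustrative stand-ins; the point is the hash-selected bins of per-vertex edge lists, one mutex per bin, and expiration checks performed when adding to an existing list.

#include <cstddef>
#include <list>
#include <mutex>
#include <string>
#include <vector>

struct Edge {
  std::string source;
  std::string target;
  double time = 0.0;  // seconds
};

class CompressedSparseRow {
public:
  CompressedSparseRow(std::size_t numBins, double maxQueryDuration)
      : bins_(numBins), locks_(numBins), maxDuration_(maxQueryDuration) {}

  void addEdge(const Edge& e) {
    std::size_t b = std::hash<std::string>{}(e.source) % bins_.size();
    std::lock_guard<std::mutex> guard(locks_[b]);
    for (auto& vertexList : bins_[b]) {
      if (!vertexList.empty() && vertexList.front().source == e.source) {
        // Drop edges older than the longest registered query duration.
        vertexList.remove_if([&](const Edge& old) {
          return old.time < e.time - maxDuration_;
        });
        vertexList.push_back(e);
        return;
      }
    }
    bins_[b].push_back({e});  // first edge seen for this source vertex
  }

  // Collect all current edges whose source matches the given vertex.
  std::vector<Edge> findEdges(const std::string& source) const {
    std::size_t b = std::hash<std::string>{}(source) % bins_.size();
    std::lock_guard<std::mutex> guard(locks_[b]);
    std::vector<Edge> found;
    for (const auto& vertexList : bins_[b])
      if (!vertexList.empty() && vertexList.front().source == source)
        found.insert(found.end(), vertexList.begin(), vertexList.end());
    return found;
  }

private:
  // Each bin holds a list of per-vertex edge lists ("lists of lists").
  std::vector<std::list<std::list<Edge>>> bins_;
  mutable std::vector<std::mutex> locks_;
  double maxDuration_;
};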

Figure 5.3: Traditional Compressed Sparse Row (CSR) data structure.

Figure 5.4: Modified Compressed Sparse Row (CSR) data structure.

5.3.1.2 Compressed Sparse Column

The Compressed Sparse Column (CSC) is exactly the same as the CSR except that we index on the target vertex instead of the source vertex. Both the CSR and CSC are needed, depending upon the query invoked. We may need to find an edge with a certain source (CSR), or an edge with a certain target (CSC).

5.3.1.3 ZeroMQ PushPull Communicator

This object was discussed previously in Section 5.1. Subgraph matching led to the modifi- cation of multiple push sockets and multiple pull threads. There are two different communicators used: and Request Communicator and an Edge Communicator. The Request Communicator sends requests to other nodes for edges fitting a particular pattern. The Edge Communicator sends matching edges back. It is the Edge Communicator that benefits most from having multiple push sockets and threads. In our experiments, we found 4 push sockets per node and 16 total pull threads worked best for the Edge Communicator in most situations, which makes sense because each socket had 20 cores. These results are discussed more in Section 7.1.

The Communicator allows for callback functions to be registered. These callback functions are invoked whenever a pull socket receives data. We use callback functions to enable proper handling of requests by the Request Communicator and edges with the Edge Communicator. These are discussed in more detail in Sections 5.3.2.2 and 5.3.2.3.

5.3.1.4 Subgraph Query Result Map

The ResultMap is used to store intermediate query results. There are two central methods, add and process. The method add takes as input an intermediate query result, creates additional query results based on the local CSR and CSC, and then creates a list of edge requests for edges on other nodes that are needed to complete queries. The method process is similar, except that it takes an edge that has been received by the node and checks to see if that edge satisfies any existing intermediate subgraph queries. The details of processing intermediate results and edges will be discussed in greater depth in Section 5.3.2.

The ResultMap uses a hash structure to store results. It is an array of vectors of intermediate query results. The array is of fixed size, so it must be set appropriately large to handle the number of results generated during the maximum time window specified over all registered queries; otherwise, excessive time is spent linearly probing the vectors of intermediate results. Mutexes protect access to each array slot. Each intermediate result is indexed by the source, the target, or a combination of the source and target, depending on what the query needs to satisfy the next edge. The index is the result of a hash function, and that index is used to store the intermediate result within the array data structure.
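Below is a pared-down sketch of this storage scheme: a fixed array of vectors of intermediate results with one mutex per slot, indexed by source, target, or both, according to which variables of the next unmatched edge are already bound. The IntermediateResult fields and the hash choice are assumptions for illustration only.

#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

struct IntermediateResult {
  bool nextSourceBound = false;  // is the source of the next edge bound?
  bool nextTargetBound = false;  // is the target of the next edge bound?
  std::string nextSource;
  std::string nextTarget;
  // ... matched edges, expiry time, etc.
};

class ResultMap {
public:
  explicit ResultMap(std::size_t capacity)
      : slots_(capacity), locks_(capacity) {}

  void store(const IntermediateResult& r) {
    std::size_t index = indexFor(r) % slots_.size();
    std::lock_guard<std::mutex> guard(locks_[index]);
    slots_[index].push_back(r);
  }

private:
  std::size_t indexFor(const IntermediateResult& r) const {
    std::hash<std::string> h;
    if (r.nextSourceBound && r.nextTargetBound)    // both bound
      return h(r.nextSource + r.nextTarget);
    if (r.nextSourceBound) return h(r.nextSource); // only source bound
    return h(r.nextTarget);                        // only target bound
  }

  std::vector<std::vector<IntermediateResult>> slots_;
  std::vector<std::mutex> locks_;
};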

5.3.1.5 Edge Request Map

The RequestMap is the object that receives requests for edges that match certain criteria, and sends out edges to the requestors whenever an edge is found that matches the criteria. There are two important methods to mention: addRequest(EdgeRequest) and process(Edge). The addRequest method takes an EdgeRequest object and stores it. The process method takes an Edge and checks to see if there are any registered edge requests that match the given edge.

The main data structure is again a hash table. The approach to hashing is similar to the ResultMap, as it depends on whether the source is defined in the request data structure, the target is defined, or both. The index is then used to find a bin within an array of lists of EdgeRequest objects. Similar to the ResultMap, whenever process is called, edge requests that have expired are removed from the list.
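To illustrate, here is a small sketch of the behavior just described: addRequest stores a request in the bin chosen by its defined fields, and process checks a new edge against the three possible bins, pruning expired requests and handing matches to an Edge Communicator callback. The EdgeRequest fields, the matching test, and the sendEdge hook are assumptions, not SAM's real signatures.

#include <cstddef>
#include <functional>
#include <list>
#include <mutex>
#include <string>
#include <vector>

struct Edge {
  std::string source;
  std::string target;
  double time = 0.0;
};

struct EdgeRequest {
  std::string source;       // empty if unspecified
  std::string target;       // empty if unspecified
  double expireTime = 0.0;  // seconds
  int requestingNode = 0;
};

class RequestMap {
public:
  RequestMap(std::size_t capacity,
             std::function<void(int, const Edge&)> sendEdge)
      : slots_(capacity), locks_(capacity), sendEdge_(std::move(sendEdge)) {}

  void addRequest(const EdgeRequest& req) {
    std::size_t i = indexFor(req.source, req.target);
    std::lock_guard<std::mutex> guard(locks_[i]);
    slots_[i].push_back(req);
  }

  // Check a new edge against outstanding requests, pruning expired ones.
  void process(const Edge& edge) {
    for (std::size_t i : {indexFor(edge.source, ""),
                          indexFor("", edge.target),
                          indexFor(edge.source, edge.target)}) {
      std::lock_guard<std::mutex> guard(locks_[i]);
      for (auto it = slots_[i].begin(); it != slots_[i].end();) {
        if (it->expireTime < edge.time) {
          it = slots_[i].erase(it);  // expired request
        } else {
          if (matches(*it, edge)) sendEdge_(it->requestingNode, edge);
          ++it;
        }
      }
    }
  }

private:
  static bool matches(const EdgeRequest& r, const Edge& e) {
    return (r.source.empty() || r.source == e.source) &&
           (r.target.empty() || r.target == e.target);
  }
  std::size_t indexFor(const std::string& src, const std::string& trg) const {
    return std::hash<std::string>{}(src + "|" + trg) % slots_.size();
  }

  std::vector<std::list<EdgeRequest>> slots_;
  std::vector<std::mutex> locks_;
  std::function<void(int, const Edge&)> sendEdge_;
};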

5.3.1.6 Graph Store

The GraphStore object uses all the previous data structures, orchestrating them together to perform subgraph queries. It has a pointer to the ResultMap and to the RequestMap. It also has pointers to two Communicators, one for sending/receiving edges to/from other nodes (the Edge Communicator), and another for sending/receiving edge requests (the Request Communicator). GraphStore also maintains the CSR and CSC data structures. How all these pieces work together is described in the next section.

5.3.2 Algorithm

Figure 5.5: An overview of the algorithm employed to detect subgraphs of interest. Asynchronous threads are launched to consume each edge. Each consume thread performs five different operations. Another set of threads pulls requests from other nodes for edges within the Request Communicator. Finally, one more set of threads pulls edges sent from other nodes in reply to edge requests.

The algorithm for finding matching subgraphs employs three sets of threads:

• GraphStore consume threads.

• The threads associated with the Request Communicator.

• The threads associated with the Edge Communicator.

5.3.2.1 GraphStore Consume(Edge) threads

The GraphStore object's primary method is the consume function. GraphStore is tied into a producer, and any edges that the producer generates are fed to the consume function. For each consume call, an asynchronous thread is launched. We found that occasionally a call to consume would take an exorbitant amount of time, orders of magnitude greater than the average time. These outliers were a result of delays in the underlying ZeroMQ communication infrastructure, which proved difficult to entirely prevent. As such, we added the asynchronous approach, effectively hiding these outliers by allowing work to continue even if one thread was blocked by ZeroMQ.

Each consume thread performs the following actions:

(1) Adds the edge to the CSR and CSC data structures.

(2) Processes the edge against the ResultMap, looking to see if any intermediate results can be further developed with the new edge.

(3) Checks to see if this edge satisfies the first edge of any registered queries. If so, adds the results to the ResultMap.

(4) Processes the edge against the RequestMap to see if there are any outstanding requests that match the new edge.

(5) Steps two and three can result in new edges being needed that may reside on other nodes. These edge requests are accumulated and sent out.
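The sketch below summarizes this consume path in C++. Each call is handed to an asynchronous task, mirroring the approach used to hide the occasional ZeroMQ stall; all of the member types are simplified stand-ins for the objects of Section 5.3.1, and the method signatures are assumptions rather than SAM's actual API.

#include <future>
#include <string>
#include <vector>

struct Edge { std::string source, target; double time = 0.0; };
struct EdgeRequest { std::string source, target; };

// Minimal stubs standing in for the real data structures.
struct EdgeIndex { void addEdge(const Edge&) {} };  // CSR / CSC
struct ResultMapStub {
  std::vector<EdgeRequest> process(const Edge&) { return {}; }
  std::vector<EdgeRequest> add(const Edge&) { return {}; }
};
struct RequestMapStub { void process(const Edge&) {} };
struct RequestCommunicatorStub { void send(const EdgeRequest&) {} };

class GraphStore {
public:
  void consume(const Edge& edge) {
    // Fire-and-forget async task; a real implementation would also manage
    // thread lifetimes and back-pressure.
    pending_.push_back(std::async(std::launch::async, [this, edge] {
      csr_.addEdge(edge);                        // (1) index by source...
      csc_.addEdge(edge);                        //     ...and by target
      auto requests = resultMap_.process(edge);  // (2) extend existing results
      auto fresh = checkQueries(edge);           // (3) edge starts a new query?
      requests.insert(requests.end(), fresh.begin(), fresh.end());
      requestMap_.process(edge);                 // (4) answer outstanding requests
      for (const auto& r : requests)             // (5) request remote edges
        requestCommunicator_.send(r);
    }));
  }

private:
  std::vector<EdgeRequest> checkQueries(const Edge& edge) {
    // In SAM this iterates over registered queries (Algorithm 3); here we
    // simply hand the edge to the result map.
    return resultMap_.add(edge);
  }

  EdgeIndex csr_, csc_;
  ResultMapStub resultMap_;
  RequestMapStub requestMap_;
  RequestCommunicatorStub requestCommunicator_;
  std::vector<std::future<void>> pending_;
};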

For steps 2-4, we provide greater detail below:

Step 2: ResultMap.process(edge): ResultMap's process(edge) function is outlined in Algorithm 2. At the top level, process(edge) dives down into three process(edge, indexer, checker) calls. Intermediate query results are indexed by the next edge to be matched. The srcIndexFunc, trgIndexFunc, and bothIndexFunc are lambda functions that perform a hash function against the appropriate field of the edge to find where relevant intermediate results are stored within the hash table of the ResultMap. The srcCheckFunc is a lambda function that takes as input a query result. It checks that the query result's next edge has a bound source variable and an unbound target variable. Similarly, trgCheckFunc returns true if the query result's next edge has an unbound source variable and a bound target variable. The bothCheckFunc returns true if both variables are bound. These lambda functions are passed to the process(edge, indexer, checker) function, where the logic is the same for all three cases by using the different lambda functions.

The process(edge, indexer, checker) function (line 12) uses the provided indexer to find the location (line 14) of the potentially relevant intermediate results in the hash table (class member alr). Line 15 finds the time of the edge. We use this on line 20 to judge whether an intermediate result has expired and can be deleted. On line 18 we iterate through all the intermediate results found in the bin alr[index]. We use the checker lambda function to make sure the intermediate result is of the expected form, and if so, we try to add an edge to the result. The addEdge function (line 23) adds an edge to the intermediate result if it can, and if so, returns a new intermediate result but leaves the pre-existing result unchanged. This allows the pre-existing intermediate result to match with other edges later. On line 25 we add the new result to the list, gen, of newly generated intermediate results. On line 31 we pass gen to processAgainstGraph.

The processAgainstGraph function is used to see if there are existing edges in the CSR or CSC that can be used to further complete the queries. The overall approach is to generate frontiers of modified results, and continue processing the frontier until no new intermediate results are created. Line 39 sets the frontier to be the beginning of gen's iterator. Then we push back a null element onto gen on line 40 to mark the end of the current frontier. Line 41 continues iterating until there are no longer any new results generated. Line 42 iterates over the frontier until a null element is reached. Lines 45 and 46 find edges from the graph that match the provided intermediate result on the frontier. Then on line 47 we try to add the edges to the intermediate result, creating new results that are added to the new frontier on line 50. Eventually the updated list of intermediate results is returned to the process(edge, indexer, checker) function. This list is then added to the hash table on line 33. The function add_nocheck is a private method that differs from the public add in that it doesn't check the CSR and CSC, since that step has already been taken. It also generates edge requests for edges that will be found on other nodes; that list is returned on line 35 and then by the overarching process function on line 9.

Step 3: Checking registered queries: When a GraphStore object receives a new edge, it must check to see if that edge satisfies the first edge of any registered queries. The process is outlined in Algorithm 3. The GraphStore.checkQueries(edge) function is straightforward. On line 3 we iterate through all registered queries, checking to see if the edge satisfies the first edge of the query (line 4), and if so, adding a new intermediate result to the ResultMap (line 5).

The ResultMap's add function begins on line 11. We may create intermediate results where the next edge belongs to another node, so we initialize a list, requests, to store those requests (line 12). We need to check the CSR and CSC to see if the new result can be extended further using local knowledge of the graph, so we make a call to processAgainstGraph on line 15. On line 16 we iterate through all generated intermediate results. If the result is not complete, we calculate an index (line 18) and place it within the array of lists of results (line 20), with mutexes to protect access. If the result is complete, we send it to the output destination (line 23). At the very end we return a list of requests, which is then aggregated by GraphStore's checkQueries function, returned on line 8, and eventually used by step 5 to send out the requests to other nodes.

Step 4: RequestMap.process(edge): The purpose of RequestMap's process(edge) is to take a new edge and find all open edge requests that match that edge. The process is detailed in Algorithm 4. The opening logic of the RequestMap.process(edge) function is similar to that of ResultMap's process function. There are three lambda functions for indexing into RequestMap's hash table structure, which is an array of lists of EdgeRequests (ale). There are also three lambda functions for checking that an edge matches a request. The indexer and the checker for each of the three cases are passed to the process(edge, indexer, checker) function on lines 2-4. The process(edge, indexer, checker) function iterates through all of the edge requests with the same index as the current edge (line 11), deleting all that have expired (line 13). If the edge matches the request, the edge is then sent to another node using the Edge Communicator (line 16).

Algorithm 2 ResultMap: Processing a New Edge
 1: function ResultMap.process(edge)
 2:   requests, a list of edge requests.
 3:   requests += process(edge, srcIndexFunc,
 4:                       srcCheckFunc)
 5:   requests += process(edge, trgIndexFunc,
 6:                       trgCheckFunc)
 7:   requests += process(edge, bothIndexFunc,
 8:                       bothCheckFunc)
 9:   return requests
10: end function
11:
12: function ResultMap.process(edge, indexer, checker)
13:   gen, a list of intermediate results generated.
14:   index ← indexer(edge)
15:   currentTime ← getTime(edge)
16:   mutexes[index].lock()
17:   requests, a list of edge requests.
18:   for all r ∈ alr[index] do
19:     if r.isExpired(currentTime) then
20:       erase(r)
21:     else
22:       if checker(r) then
23:         newR ← r.addEdge(edge)
24:         if newR not null then
25:           gen.push_back(newR)
26:         end if
27:       end if
28:     end if
29:   end for
30:   mutexes[index].unlock()
31:   gen ← processAgainstGraph(gen)
32:   for all g ∈ gen do
33:     requests += add_nocheck(g)
34:   end for
35:   return requests
36: end function
37:
38: function ResultMap.processAgainstGraph(gen)
39:   frontier ← gen.begin()
40:   gen.push_back(null)
41:   while frontier ≠ gen.end() do
42:     while frontier ≠ null do
43:       if !frontier.complete() then
44:         ++frontier
45:         foundEdges = csr.findEdges(frontier)
46:         foundEdges += csc.findEdges(frontier)
47:         for all edge ∈ foundEdges do
48:           newR ← frontier.addEdge(edge)
49:           if newR not null then
50:             gen.push_back(newR)
51:           end if
52:         end for
53:       end if
54:     end while
55:     frontier ← frontier.erase()
56:     gen.push_back(null)
57:   end while
58:   return gen
59: end function

Algorithm 3 Checking Registered Queries
 1: function GraphStore.checkQueries(edge)
 2:   requests, a list of edge requests.
 3:   for all q ∈ queries do
 4:     if q.satisfiedBy(edge) then
 5:       requests += resultMap.add(Result(edge))
 6:     end if
 7:   end for
 8:   return requests
 9: end function
10:
11: function ResultMap.add(result)
12:   requests, a list of edge requests.
13:   gen, a list of intermediate results generated.
14:   gen.push_back(result)
15:   gen ← processAgainstGraph(gen)
16:   for all g ∈ gen do
17:     if !g.complete() then
18:       index = g.hash()
19:       mutexes[index].lock()
20:       alr[index].push_back(g)
21:       mutexes[index].unlock()
22:     else
23:       output(g)
24:     end if
25:   end for
26:   return requests
27: end function

Algorithm 4 RequestMap: Processing a New Edge
 1: function RequestMap.process(edge)
 2:   process(edge, srcIndexFunc, srcCheckFunc)
 3:   process(edge, trgIndexFunc, trgCheckFunc)
 4:   process(edge, bothIndexFunc, bothCheckFunc)
 5: end function
 6:
 7: function RequestMap.process(edge, indexer, checker)
 8:   index ← indexer(edge)
 9:   currentTime ← getTime(edge)
10:   mutexes[index].lock()
11:   for all e ∈ ale[index] do
12:     if e.isExpired(currentTime) then
13:       erase(e)
14:     else
15:       if checker(e, edge) then
16:         edgeCommunicator.send(edge)
17:       end if
18:     end if
19:   end for
20:   mutexes[index].unlock()
21: end function

5.3.2.2 Request Communicator Threads

Another group of threads are the pull threads of the Request Communicator. These threads pull from the ZMQ pull sockets dedicated to gathering edge requests from other nodes. One callback is registered, the requestCallback. The requestCallback calls two methods: addRequest and processRequestAgainstGraph. The addRequest method registers the request with the RequestMap. The method processRequestAgainstGraph checks to see if any existing edges already stored locally match the edge request.

5.3.2.3 Edge Communicator Threads

The last group of threads to discuss are the pull threads of the Edge Communicator and its callback function, edgeCallback. Upon receiving an edge through the callback, ResultMap.process(edge) is called, which is discussed in detail in Section 5.3.2.1. The method process(edge) produces edge requests, which are then sent via the Request Communicator, the same as Step 5 of GraphStore.consume(edge).

5.4 Summary

In this chapter we've covered the implementation of the Streaming Analytics Machine, or SAM, which is a library used for expressing both streaming machine learning pipelines and temporal subgraph queries. SAM is over 19,000 lines of code. The Streaming Analytics Language, SAL, maps onto SAM classes, allowing SAL to run in parallel across a cluster.

Chapter 6

Vertex Centric Computations

In this chapter we discuss performance of vertex-centric computations expressed in SAL, i.e. programs that do not utilize subgraph matching. We discuss both the performance achieved for a trained classifier obtained from a SAL pipeline and the scaling observed on a cluster of size 64 nodes.

6.1 Classifier Results

To validate the value of SAL in expressing pipelines and in filtering out benign data, we used a simple program where we stream the netflows two ways, using two STREAM BY statements, one keyed by destination IP and the other by source IP. Then we create features based on the ave and var streaming operators for each of the fields:

(1) SrcTotalBytes

(2) DestTotalBytes

(3) DurationSeconds

(4) SrcPayloadBytes

(5) DestPayloadBytes

(6) SrcPacketCount

(7) DestPacketCount

This results in 28 total features. The entire program can be found in Appendix A.1.

We apply this pipeline to CTU-13 [63], which contains 13 botnet scenarios. CTU-13 has some nice characteristics, including:

• Real botnet attacks: A set of virtual machines were created and infected with botnets.

• Real background traffic: Traffic from their university router was captured at the same time as the botnet traffic. The botnet traffic was bridged into the university network.

• A variety of protocols and behaviors: The scenarios cover a range of protocols used by the malware, such as IRC, P2P, and HTTP. Also, some scenarios sent spam, while others performed click-fraud, port scans, DDoS attacks, or Fast-Flux.

• A variety of bots: The 13 scenarios use 7 different bots.

Since the SAL program inherently encodes historical information (each generated feature is calculated on a sliding window of data), we can't treat each netflow independently and thus can't create random subsets of the data as is usual in a cross-validation approach. We instead split each scenario into two parts. We want to keep the number of malicious netflows about the same in each part (we need enough examples to train on), so we find the point in the scenario, timewise, where the malicious examples are balanced. Namely, we have two sets, P1 and P2, where ∀n1 ∈ P1, ∀n2 ∈ P2, TimeSeconds(n1) < TimeSeconds(n2) and |Malicious(P1)| ≈ |Malicious(P2)|, where TimeSeconds returns the time in seconds of the netflow and Malicious returns the subset of all the malicious netflows in the provided set. With each scenario split into two parts, we train on the first part and test on the second part. Then we switch: train on the second and test on the first. We make use of a Random Forest Classifier [28] as implemented in scikit-learn [163].

After generating the features as specified by the SAL program in Appendix A.1, we performed a greedy search over the features to down-select to the most important ones. The process is outlined in Algorithm 5, but the general notion is to add features one at a time until no improvement is found in the average AUC over the 13 scenarios.

Algorithm 5 Greedy Process for Selecting Features
features ← Features from SAL program
selected ← ∅
bestAUC ← 0
prevAUC ← −1
improvement ← True
while improvement do
  iterBestAUC ← 0
  iterBestFeature ← None
  for f ∈ features do
    candidateSet ← selected ∪ f
    candidateAUC ← Average AUC using candidateSet
    if candidateAUC > iterBestAUC then
      iterBestAUC ← candidateAUC
      iterBestFeature ← f
    end if
  end for
  if iterBestAUC > bestAUC then
    bestAUC = iterBestAUC
    features = features − iterBestFeature
    selected = selected ∪ iterBestFeature
  else
    improvement = False
  end if
end while

We found that the following eight features provided the best performance across the 13 scenarios. They are listed in the order they were added using the greedy approach above.

(1) Average DestPayloadBytes

(2) Variance DestPayloadBytes

(3) Average DestPacketCount

(4) Variance DestPacketCount

(5) Average SrcPayloadBytes

(6) Average SrcPacketCount

(7) Average DestTotalBytes

(8) Variance SrcTotalBytes

Using the above eight features, Figure 6.1 shows the AUC of the ROC for each of the 13 scenarios. In four of the scenarios, 1, 3, 6, and 11, training on either half was sufficient for the other half, with AUCs between 0.926 and 0.998. For some scenarios, namely 2, 5, 8, 12, and 13, the first half was sufficient to obtain AUCs between 0.922 and 0.992 on the second half, but the reverse was not true. For scenario 10, training on the second half was predictive of the first half, but not the other way around. Scenario 9 had AUCs of 0.820 and 0.887, which is decent, but lackluster compared to the other scenarios. The classifier had issues with scenarios 4 and 7. Scenario 7 did not perform well, probably because there were only 63 malicious netflows, as can be seen in Table 6.1. We are not sure why the classifier struggled with scenario 4.

6.2 Scaling

For our scaling experiments, we made use of Cloudlab [168], which is a set of clusters distributed across three sites, Utah, Wisconsin, and South Carolina, where researchers can provision a set of nodes to their specifications. This allowed us to create an image where our code, SAM, was deployed and working, and then replicate that image to a cluster size of our choice. In particular, we make use of the Clemson system in South Carolina. The Clemson system has 16 cores per node, 10 Gb/s Ethernet, and 256 GB of memory. We were able to allocate a cluster with 64 nodes.

Figure 6.1: This graphic shows the AUC of the ROC curve for each of the CTU scenarios. We first train on the first half of the data and test on the second half, and then switch.

Table 6.1: A more detailed look at the AUC results. The second column is the AUC of the ROC after training on the first half of the data and then testing on the second half. The third column is reversed: trained on the second half and tested on the first half.

Scenario  AUC (first)  AUC (second)  Netflows   Botnet Netflows  % Bot
1         0.944        0.926         2,824,636  40,961           1.45%
2         0.930        0.887         1,808,122  20,941           1.16%
3         0.998        0.996         4,710,638  26,822           0.57%
4         0.476        0.775         1,121,076  2,580            0.23%
5         0.922        0.843         129,832    901              0.69%
6         0.972        0.976         558,919    4,630            0.83%
7         0.604        0.761         114,077    63               0.06%
8         0.992        0.886         2,954,230  6,127            0.21%
9         0.820        0.887         2,087,508  184,987          8.86%
10        0.666        0.950         1,309,791  106,352          8.12%
11        0.996        0.995         107,251    8,164            7.61%
12        0.903        0.796         325,471    2,168            0.67%
13        0.938        0.770         1,925,149  40,003           2.08%


Figure 6.2 shows the weak scaling results. Weak scaling examines the solution time where the problem size scales with the number of nodes, i.e. there is a fixed problem size per node. For our experiments, each node was fed one million netflows. Thus, for n nodes, the total problem size is n million netflows. Each node had available the entire CTU dataset concatenated into one file. Then we randomly selected for each node a contiguous chunk of one million netflows. For each randomly selected chunk, we renamed the IP addresses to simulate a larger network instead of replaying the same IP addresses, just from different time frames. For comparison we also ran another round with completely random IP addresses, such that an IP address had very little chance of being in multiple netflows. This helped us determine if scaling issues on the realistic data set were from load balancing problems or some other issue. Each data point is the average of three runs.

As can be seen from Figure 6.2, the renamed IP set of runs peaks at 61 nodes or 976 cores, where we obtain a throughput of 373,000 netflows per second or 32.2 billion per day. For the randomized IP set of runs, the scaling is noticeably steeper, reaching a peak throughput of 515,000 netflows per second (44.5 billion per day) with 63 nodes. Figure 6.3 takes a look at the weak scaling efficiency. This is defined as E(n) = T1/Tn, where Ti is the time taken by a run with i nodes. Efficiency degrades more quickly for the renamed IP set of runs, while the randomized IP runs hover close to 0.4. We believe the difference is due to load balance issues. For the randomized IP set of runs, the work is completely balanced between all 64 nodes. The renamed IP runs are more realistic, akin to a power law distribution, where a small set of nodes accounts for most of the traffic.

In this situation, it becomes difficult to partition the work evenly across all the nodes.

As far as we are aware, we are the first to show scalable distributed network analysis on a cluster of size 64 nodes. The most direct comparison to other works in terms of scaling is Bumgardner and Marek [33]. In this work, they funnel netflows through a 10 node cluster running Storm [190].

Figure 6.2: Weak Scaling Results: For each run with n nodes, a total of n million netflows were run through the system with a million contiguous netflows per node randomly selected from the CTU dataset. There were two types of runs, with renamed IPs (to create the illusion of a larger network) or with randomized IPs. For the renamed IPs, there continues to be improved throughput until 61 nodes/ 976 cores. For the randomized IPs, the scaling continues through the total 64 nodes / 1024 cores.

Figure 6.3: This shows the weak scaling efficiency. The efficiency continues to degrade for the renamed IPs, but for the randomized IPs, the efficiency hovers a little above 0.5. This is largely due to the fact that the randomized IPs almost perfectly load-balance the work, while the renamed IPs encounter load imbalances because of the power-law nature of the data.

They call their approach a hybrid stream/batch system because it uses Storm to stream netflow data into a batch system, Hadoop [90], for analysis. The stream portion is what is most similar to our work. Over the Storm pipeline, they achieve a rate of 234,000 netflows per second, or about 23,400 netflows per second per node. Our pipeline with ten nodes achieved a rate of 13,900 netflows per second per node. However, their pipeline is much simpler. They do not partition the netflows by IP and calculate features. The only processing they perform during streaming is adding subnet information to the netflows, which does not require partitioning across the cluster.

There is other work that performs distributed network analysis, but its focus is on packet-level data [123, 111, 133]. As such it is difficult to compare, as their metrics are in terms of GB/s, and of course the nature of packet analysis is significantly different from netflow analysis.

In terms of botnet identification, Botfinder [199] and Disclosure [23] report single node batch performance numbers. Our approach on a single 16-core node achieved a rate of 30,500 netflows per second. For Botfinder, they extract 5 features on a 12-core Intel Core i7 chip, achieving a rate of 46,300 netflows per second. Disclosure generates greater than nine features (the text is ambiguous in places, referencing unspecified statistical features) on a 16-core Intel Xeon CPU E5630. They specify that they run the feature extraction for one day's worth of data in 10 hours and 53 minutes, but they do not clearly state if that is on both data sets they chose to evaluate or just one. If it is both, the rate is roughly 40,000 netflows per second. Both of these approaches are batch processes.

In their hybrid stream/batch system, Bumgardner and Marek [33] note a tenfold increase in throughput going from streaming to batch processing. For the streaming phase, they achieved a throughput of 234,000 netflows per second (on ten nodes). They then run MapReduce jobs on the stored data in batch, achieving rates between 1.7 million and 2.6 million netflows per second (again on ten nodes). Since our approach is streaming, comparing to rates achieved by batch processes may not be a fair characterization. During stream processing, the network becomes the limiting factor, while during batch processing, the network is less of an issue.

6.3 Summary

In this chapter we've exercised the vertex-centric portion of SAL. We developed a SAL program for the CTU-13 [63] dataset and gathered both classifier and scaling performance. The trained classifier obtained an average AUC of the ROC of 0.87. In an operational setting, this could be used to greatly reduce the amount of traffic needing to be analyzed. SAL can be used as a first pass over the data using spatially and temporally efficient streaming algorithms. After down-selecting, more expensive algorithms can be applied to the remaining data.

In terms of scalability using SAM, we are able to scale to 61 nodes and 976 cores, obtaining a throughput of 373,000 netflows per second or 32.2 billion per day. On completely load-balanced data, we obtain greater efficiency out to 64 nodes than on the real data, indicating that SAM could be improved with a more intelligent partitioning strategy. In the end, we've shown SAL can be used to succinctly express machine learning pipelines and that our implementation can scale to large problem sizes.

Chapter 7

Temporal Subgraph Matching

In this chapter we explore the performance obtained by SAL and SAM with regards to temporal subgraph matching. We focus on the temporal triangle query presented earlier in Chapter 2 but repeated in its entirety in Listing 7.1 below for convenience. The query is looking for a set of three vertices, x1, x2, and x3, where there are three edges forming the triangle: x1 → x2 → x3 → x1. Also, edge e1 happens before e2, which happens before e3. All together the edges must occur within a ten second window. We chose this query because it is complex enough to meaningfully exercise the framework, but simple enough that it could be a component of a larger query.

The contributions of this chapter are the following:

• We present scaling performance of SAM, showing that it continues to increase throughput up to 128 nodes or 2560 cores, the maximum cluster size we could reserve.

• We also compare SAL and SAM to Apache Flink [12], which is another framework designed and built for streaming applications. The comparison reveals that SAM and Flink possess different strengths depending on the parameter regime. SAM does better when triangles occur frequently (roughly greater than five triangles per second per node), while Flink achieves the best throughput when triangles are rare. Also, while SAM scales to 128 nodes, Flink plateaus after 32 nodes.

• SAL again provides savings in terms of programming complexity. The full SAL program for temporal triangles requires 13 lines of code, while a direct implementation using SAM takes 238 lines and the Apache Flink approach is 228 lines long.

• We simulate vertex constraints that would filter out many intermediate results. In this situation SAM obtains a rate of 1 million netflows per second or 91.8 billion netflows per day.

Listing 7.1: Temporal Triangle Query

1  Netflows = VastStream("localhost", 9999);
2  PARTITION Netflows BY SourceIp, DestIp;
3  HASH SourceIp WITH IpHashFunction;
4  HASH DestIp WITH IpHashFunction;
5  SUBGRAPH ON Netflows WITH source(SourceIp) AND target(DestIp)
6  {
7    x1 e1 x2;
8    x2 e2 x3;
9    x3 e3 x1;
10   start(e3) - start(e1) <= 10;
11   start(e1) < start(e2);
12   start(e2) < start(e3);
13 }

For all of our experiments in this chapter we again use Cloudlab [168]. In particular, we make use of the Wisconsin cluster using node type c220g5, which has two Intel Xeon Silver 4114 10-core CPUs at 2.20 GHz and 192 GB ECC DDR4-2666 memory per node. We run experiments up to 128 nodes or 2560 cores. This is in contrast to Chapter 6, which had 16 cores per node.

Section 7.1 explores hyperparameters that were critical for SAM scaling. Section 7.2 discusses the Apache Flink implementation of the triangle query that we used for comparison. Then in Section 7.3 we present the comparison between SAM and Flink running the same query. Section 7.4 simulates vertex constraints that would filter out many initial intermediate results, giving us an idea of the performance when the subgraph pattern is relatively rare.

Figure 7.1: We vary the number of pull threads and the number of push sockets (NPS) for the RequestCommunicator to determine the best setting for maximum throughput. The RequestMap object utilizes the RequestCommunicator during the process(edge) function.

7.1 SAM Parameter Exploration

In doing performance analysis of SAM and the algorithm outlined in Section 5.3.2, we found that a large portion of the time was spent in the process(edge) function of the RequestMap object. In particular, the time spent sending edges owned by one node to other nodes was significant. As such, it became the focus of performance enhancements.

The solution we developed, discussed earlier in Section 5.3.1.3 and shown in Figure 5.2, was to create multiple push sockets to send data to an individual node, and also multiple pull threads to receive the data. Having multiple push sockets prevents hot spots, where multiple threads are trying to use the same socket. Multiple pull threads allows data ingestion to happen in parallel.

To find the ideal combination of push sockets and pull threads, we performed the following experiment. We ran the temporal triangle query of Listing 2.3 with the following parameters:

• The table capacity for the ResultMap, CSR, CSC, and RequestMap was set to 2 million.

• The number of unique vertices, |V|, was set to 50,000.

• The rate at which edges were produced was 1000 per second.

• The width of the time window was 11 seconds. If an item is more than 11 seconds old, it is deleted. Since the length of the query was 10 seconds, this gives us a little buffer if data comes in late from other nodes.

• We allocated a 64 node cluster on Wisconsin. As mentioned earlier, these nodes have 20 cores.

With the above parameters, we then varied the number of pull threads from 1 to 24 and the number of push sockets from 1 to 8. Figure 7.1 displays the average time spent in RequestMap.process(edge), the function that calls RequestCommunicator.send(message) to send out edges to other nodes according to recorded requests. The dominating factor in reducing the time spent in RequestMap.process(edge) was increasing the number of pull threads to around 16. No sizeable benefit is garnered by adding more than 16 threads, likely due to the fact that each node had 20 total cores. We did do some experiments with more than 24 threads, but the constant thread context switching became overwhelming and SAM could not keep pace.

The other trend is that adding multiple push sockets yields some benefit. There is a large jump in performance from 1 to 2 push sockets. The trend continues, and is especially clear when there is just one pull thread. However, at 16 pull threads, the distinction becomes more muddled.

There are some oddities we are currently at a loss to explain. For example, when the number of push sockets is 3, there is a large dip when the number of pull threads is 10. This behavior appears to be repeatable, but we have not investigated it thoroughly. Regardless, using 16 pull threads and 4 push sockets appears to be a good parameter setting for this experiment, and that is what we used for the Flink/SAM comparison in Section 7.3.

As a result of our technique, we were able to increase both network and processor utilization. Using one push socket per node in the cluster and one pull thread for all receiving sockets, we sent about 20-30 thousand packets a second. With 16 pull threads and 4 push sockets, we pushed that up to about 225 thousand packets a second. We also increased CPU utilization: it went from about 2-3 cores to almost full utilization at 15-20 cores. With this increased utilization we were able to obtain scaling on problems to 128 nodes, as we will see in Sections 7.3 and 7.4.

7.2 Apache Flink

We currently map SAL onto SAM as the implementation; SAL is converted into SAM code using the Scala Parser Combinator [175]. However, we wanted to compare how another framework would perform, and it is certainly possible to map SAL into other languages and frameworks. A prime candidate for evaluation is Apache Flink [12], which was custom designed and built with streaming applications in mind. In this section, we outline the approach taken for implementing the triangle query of Listing 2.3 using Apache Flink. Once the groundwork has been laid, the actual specification of the algorithm is relatively succinct. The Java code using Flink 1.6 is presented in Listing 7.2. The full code is provided in Appendix A.2.

Listing 7.2: Apache Flink Triangle Implementation

1  final StreamExecutionEnvironment env =
2    StreamExecutionEnvironment.getExecutionEnvironment();
3  env.setParallelism(numSources);
4  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
5
6  NetflowSource netflowSource = new NetflowSource(numEvents, numIps, rate);
7  DataStreamSource<Netflow> netflows = env.addSource(netflowSource);
8
9  DataStream<Triad> triads = netflows
10   .keyBy(new DestKeySelector())
11   .intervalJoin(netflows.keyBy(new SourceKeySelector()))
12   .between(Time.milliseconds(0),
13     Time.milliseconds((long) queryWindow * 1000))
14   .process(new EdgeJoiner(queryWindow));
15
16 DataStream<Triangle> triangles = triads
17   .keyBy(new TriadKeySelector())
18   .intervalJoin(netflows.keyBy(new LastEdgeKeySelector()))
19   .between(Time.milliseconds(0),
20     Time.milliseconds((long) queryWindow * 1000))
21   .process(new TriadJoiner(queryWindow));
22
23 triangles.writeAsText(outputFile, FileSystem.WriteMode.OVERWRITE);

Lines 1-4 define the streaming environment, such as the number of parallel sources (generally the number of nodes in the cluster) and that the event time of the netflow should be used as the timestamp (as opposed to ingestion time or processing time). Lines 6-7 create a stream of netflows. NetflowSource is a custom class that generates the netflows in the same manner as was used for the SAL/SAM experiments. Lines 9-14 find triads, which are sets of two connected edges that fulfill the temporal requirements. The variable queryWindow on line 13 was set to ten in our experiments, corresponding to a temporal window of 10 seconds.

Line 11 uses an interval join. Flink has different types of joining approaches; the one that was most appropriate for our application is the interval join. It takes elements from two streams (in this case the same stream, netflows) and defines an interval of time over which the join can occur. In our case, the interval starts with the timestamp of the first edge and ends queryWindow seconds later. This join is a self-join: it merges the netflows data stream keyed by the destination IP with the netflows data stream keyed by the source IP. In the end we have a stream of pairs of edges with three vertices of the form A → B → C.

Once we have triads that fulfill the temporal constraints, we can define the last piece: finding another edge that completes the triangle, C → A. Lines 16-21 perform another interval join, this time between the triads and the original netflows data streams. The join is defined to occur such that the time difference between the starting time of the last edge and the starting time of the first edge must be less than queryWindow seconds.

Overall, the complete code is 228 lines long, and can be found in Appendix A.2. This number includes the code that would likely need to be generated were a mapping between SAL and Flink completed. This includes classes to do key selection (SourceKeySelector, DestKeySelector, LastEdgeKeySelector, TriadKeySelector), the Triad and Triangle classes, and classes to perform joining (EdgeJoiner and TriadJoiner). However, the code for the Netflow and NetflowSource classes was not included, as those classes would likely be part of the library, as they are with SAM. For comparison, the SAL version of this query is 13 lines long.

7.3 Comparison Results: SAM vs Flink

In this section we compare the performance of SAM with the custom Apache Flink implementation. In both cases, we run the triangle query of Listing 2.3, which matches triangles that occur within ten seconds of the start time of the first edge. We examine weak scaling, wherein the number of elements per node is kept constant as we increase the number of nodes. In all runs, each node is fed 2,500,000 edges. The edges are netflows, as that is our intended initial target domain.

There are many different formats for netflows. We use the format of the VAST Challenge 2013: Mini-Challenge 3 dataset [201], where each netflow has 19 fields. However, for this experiment we only change the source and destination IPs, which are randomly selected from a pool of |V| vertices for each netflow generated. We find the highest rate achievable by each approach by incrementally increasing the rate at which edges are produced until the throughput is too great for the method to keep pace.
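As an illustration of the edge generation (a sketch only; the actual generator is the custom NetflowSource class, and the handling of the remaining VAST fields is omitted here), each synthetic netflow simply draws its endpoints uniformly from the vertex pool:

import java.util.Random;

// Both endpoints are drawn uniformly at random from a pool of |V| vertices;
// all other fields of the VAST-format netflow are left as fixed filler.
class RandomEndpointChooser {
    private final int numVertices; // |V|
    private final Random random = new Random();

    RandomEndpointChooser(int numVertices) {
        this.numVertices = numVertices;
    }

    /** Returns the (source, destination) vertex ids for the next netflow. */
    int[] nextEndpoints() {
        return new int[] { random.nextInt(numVertices), random.nextInt(numVertices) };
    }
}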

For the Apache Flink runs we set both the job manager and task manager heap sizes to 32 GB. The number of taskmanager task slots was set to 1. As we only ever ran one Flink job at a time, this seemed to be the most appropriate setting. We set the number of sources to be the size of the cluster.

We can obtain an estimate of the number of triangles as a function of r, the rate of edges produced per second per node, |V|, the number of unique vertices, |E|, the number of edges produced per node, t, the length of the query in seconds, and n, the number of nodes in the cluster.

Theorem 2. Let r be the rate at which edges are presented per node. Let |V| be the number of vertices from which target and source vertices are randomly selected. Let t be the length of the query in seconds for the triangle query from Listing 2.3. Let n be the number of nodes in the cluster. Assuming a continuous stream of edges, if we select a contiguous set of |E| edges from each node from the stream of edges, the total number of expected triangles fulfilling the query of Listing 2.3 is (n|E| / (2|V|)) · (nrt / |V|)^2.

Proof. For each edge e_a that is presented in the stream, there is a set, E_b, of edges that occur after e_a that fulfill the temporal constraints, where |E_b| = nrt. For each e_b ∈ E_b, the probability of target(e_a) being equal to source(e_b) is 1/|V|. Thus the expected number of triads matching the first two edges of Listing 2.3 for each edge e_a is nrt / |V|. For each matching triad, there is a set, E_c, of edges that occur after e_b that fulfill the temporal constraints. However, since triads are expected to occur uniformly over the time period t, and the query must be completed within t seconds, each triad does not have the same expected number of edges in E_c. For the first triad, the expected number of matching edges is nrt / |V|^2, because the first triad has the entire t seconds worth of edges to match against, and each edge must fulfill both target(e_b) = source(e_c) and target(e_c) = source(e_a). The last triad occurs near the end of the time interval and is expected to have no edges to match against, so it contributes 0 expected triangles. The series of expected triangles from the first triad to the last forms an arithmetic sequence of length nrt / |V|, so for each edge e_a the expected number of triangles is (1 / (2|V|)) · (nrt / |V|)^2. Since there are n nodes, each producing |E| edges, the total number of expected triangles over all the edges is (n|E| / (2|V|)) · (nrt / |V|)^2.
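For intuition, consider the 32-node, |V| = 25,000 configuration reported later in this section as a worked example of our own: the aggregate rate of 60,800 edges per second corresponds to r ≈ 1,900 edges per second per node, and with |E| = 2,500,000 edges per node and t = 10 seconds the theorem predicts

(n|E| / (2|V|)) · (nrt / |V|)^2 = (32 · 2,500,000 / (2 · 25,000)) · (32 · 1,900 · 10 / 25,000)^2 = 1,600 · (24.32)^2 ≈ 9.5 × 10^5 triangles.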

For both implementations, we kept increasing the rate until the run time exceeded |E|/r + 15 seconds, where r is the rate of edges produced per second. Latencies longer than 15 seconds usually indicated that the implementation was unable to keep pace. Also, we stopped increasing the rate when the number of triangles found was less than 85% of the expected value. Generally the solutions stayed above this threshold unless they were struggling.
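These two conditions can be restated as a simple per-run acceptance check. The sketch below is our own restatement of the criteria; the method and parameter names are illustrative and not part of the actual experiment harness, and expectedTriangles is the value given by Theorem 2.

class RunAcceptance {
    static boolean keptPace(double runtimeSeconds, long foundTriangles,
                            double expectedTriangles, long edgesPerNode,
                            double ratePerNode) {
        // The run must finish within 15 seconds of the last edge being produced.
        boolean lowLatency = runtimeSeconds <= edgesPerNode / ratePerNode + 15.0;
        // It must also report at least 85% of the triangles predicted by Theorem 2.
        boolean enoughTriangles = foundTriangles >= 0.85 * expectedTriangles;
        return lowLatency && enoughTriangles;
    }
}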

Figure 7.2 shows the overall max aggregate throughput achieved by SAM and Flink. We varied the number of vertices, |V|, to be 5,000, 10,000, 25,000, 50,000, 100,000, and 150,000.

For |V| = 5,000 or 10,000, SAM obtained the best overall rate, continuing to garner increasing aggregate throughput up to 120 nodes (we could only obtain 120 nodes for this experiment; the next section reports 128-node performance). Flink struggled with more than 32 nodes. While Flink continued to report results, the percentage of expected triangles fell dramatically to about 60-70% for 64 nodes, and around 50% for 120 nodes. While Flink was not able to scale past 32 nodes, it showed the best overall throughput for |V| = 50,000 and 150,000. |V| = 25,000 was near the middle ground, with Flink performing slightly better: 32 nodes obtained an aggregate rate of 60,800 edges per second while SAM obtained 60,000 edges per second on 120 nodes.

Figure 7.3 provides another view of the data. We show the max aggregate throughput versus a measure of the complexity of the problem: the total number of triangles. The graph emphasizes the fact that SAM performs better with more complex problems, i.e. with more total triangles to find.

Figure 7.2: Shows the maximum throughput for SAM and Flink. We vary the number of nodes with values of 16, 32, 64, and 120. We vary the number of vertices, |V|, between 25,000 and 150,000. Flink has the best throughput with 32 nodes, after which adding additional nodes hurts the overall aggregate rate and the number of found triangles continues to decrease. SAM is able to scale to 120 nodes.

Figure 7.3: Another view of the data where we plot the aggregate throughput versus the total number of triangles to find. This view emphasizes that SAM does better with more complex problems, i.e. more triangles.

Figure 7.4: Another look at the maximum throughput for SAM and Flink. Here we relax the restriction that the number of reported triangles must be higher than 85% of expected, but keep the latency restriction that the computation must complete within 15 seconds of the termination of input. Flink struggles when increasing to 64 nodes, and then drops consistently with 120 nodes. SAM continues to increase throughput through 120 nodes.

Figure 7.5: This graph shows the ratio of found triangles divided by the number of expected triangles given by Theorem 2. The rate for each method was the same as in Figure 7.4. SAM stays fairly consistent throughout the runs, with an average of 0.937. For Flink, the ratio decreases as the number of nodes increases. The average ratio starts at 0.873 with 16 nodes, and then decreases to an average of 0.521 with 120 nodes.


Figure 7.4 displays the max aggregate throughput again, but this time without the restriction that the number of reported triangles must be higher than 85% of expected. This figure emphasizes the fact that Flink struggles after 32 nodes. Figure 7.5 examines the ratio of found triangles divided by expected triangles for each approach. SAM stays fairly consistent, with an average of 93.7% of expected. Flink decreases as the number of nodes increases, starting with 87.3% of expected on average with 16 nodes, then decreasing to 52.1% of expected with 120 nodes.

To summarize this section, SAM and Flink have competing strengths. SAM excels when the triangle formation rate is greater than 5 per second per node, and SAM scales to 120 nodes. Flink does better when triangles are relatively rare, but cannot scale past 32 nodes. Both approaches show some loss due to network latencies: if some data arrives too late, it is not included in the calculations for either SAM or Flink. However, SAM has an advantage in this regard, maintaining a consistently higher ratio of found to expected triangles than Flink across the full range of node counts.

7.4 Simulating other Constraints

We envision SAL being used to select queries based not only on the subgraph structure but also on vertex-based constraints. The Watering Hole query of Listing 2.1 is an example: there are constraints defined on both the bait and controller vertices. In this section we simulate vertex constraints by randomly dropping the first edge of the triangle query, i.e. we process the netflow through SAM, but we force the edge to not match the first edge of the triangle query. This gives us an idea of SAM's performance when the subgraph queries are more selective. The triangle query of Listing 2.3 as specified creates an intermediate result for every consumed netflow since there are no vertex constraints.
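As a rough illustration of the sampling logic (SAM itself is implemented in C++; the Java fragment below is only a sketch and the class and method names are ours, not SAM's), keeping the first edge with a given probability can be expressed as:

import java.util.Random;

// Every netflow is still processed, but it is allowed to act as the *first*
// edge of the triangle query only with probability keepProbability,
// mimicking a selective vertex constraint.
class FirstEdgeSampler {
    private final double keepProbability; // e.g. 1.0, 0.01, or 0.001
    private final Random random = new Random();

    FirstEdgeSampler(double keepProbability) {
        this.keepProbability = keepProbability;
    }

    boolean allowAsFirstEdge() {
        // Drawn once per consumed netflow; when this returns false the edge
        // remains eligible to match the second and third edges of the query.
        return random.nextDouble() < keepProbability;
    }
}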

We compare SAM's performance over three settings: keeping the first edge all of the time, 1% of the time, and 0.1% of the time. We vary |V| to be 25,000, 50,000, 100,000, and 150,000. Figure 7.6 presents the results. The general trends are as expected. As we increase |V|, the number of triangles created decreases, and the rate goes up. Also, increasing the rate at which first edges are kept causes a decrease in the maximum achievable rate, again because more triangles are being produced, which causes a greater computational burden. The maximum achievable rate is 1,062,400 netflows per second, when we keep 0.01% of the first edges and |V| = 150,000.

Figure 7.6: We keep the first edges of queries (kq) with probabilities {0.01, 0.1, 1}. We also vary |V| to range from 25,000 to 150,000. This figure is a plot of edges/second versus the number of nodes.

As the number of triangles is likely the predominant factor in determining maximum throughput, another way of looking at the data is to plot the rate versus triangles/second, which we do in Figure 7.7. We only examine the runs with 128 nodes, and vary the number of kept queries. Also, in this figure we calculate the rate as edges/day. As expected, the rate at which triangles occur is inversely proportional to maximum throughput. The maximum rate achieved is 91.8 billion edges per day.

7.5 Summary

In this chapter we presented the performance of SAM on subgraph queries, namely a temporal triangle query, and compared SAM to a custom solution written for Apache Flink. SAM excels when triangles occur frequently (greater than 5 per second per node) while Flink is best when triangles are rare. For all parameter settings, SAM showed improved throughput up to 128 nodes or 2560 cores, while Flink's rate started to decrease after 32 nodes. For more selective queries, SAM is able to obtain an aggregate rate of 1 million netflows per second. While we have presented SAM as the implementation for SAL, Apache Flink shows promise as another potential underlying backend. Future work understanding the performance differences between the two platforms will likely be beneficial for both approaches.

Figure 7.7: We vary the rate at which the first edges of queries are kept (kq = keep queries) and plot the edges per day versus the rate at which triangles occur.

We also have further evidence that programming in SAL is an efficient way to express streaming subgraph queries. SAL requires only 13 lines of code, while programming directly with SAM is 238 lines and the Flink approach is 228 lines.

Chapter 8

Conclusions

In this dissertation we have presented a new domain specific language called the Streaming Analytics Language, or SAL. SAL ingests streaming data in the form of tuples, and allows analyses that combine graph theory, streaming, temporal information, and machine learning. In particular we examined streaming cyber data and showed how SAL can be used to express a wide range of queries relevant to cyber security. We used Verizon's Data Breach Investigations Report as a backdrop for discussing categories of attacks involving computer networks, and showed how SAL can be utilized to detect the corresponding malicious patterns. We also showed that SAL programs are succinct, requiring 10-25 times fewer lines of code than writing directly with our library, the Streaming Analytics Machine or SAM, or with another framework such as Apache Flink.

As another contribution, we have presented SAM, which is one possible implementation target of SAL. We examined the scaling performance of SAM on two different problems: a vertex-centric computation to detect botnet activity and a temporal subgraph query. We showed scaling to 61 nodes or 976 cores for the vertex-centric computations, a scaling result not seen in related netflow analysis work. We also garnered an average area under the curve of the receiver operating characteristic of 0.87 in detecting botnet activity. For subgraph matching, we showed scaling to 128 nodes or 2560 cores for temporal triangle finding. An alternative approach using custom-written Apache Flink code could not scale past 32 nodes, and SAM had an overall higher throughput when triangles occurred at a rate greater than 5 per second per node.

For future work in terms of the underlying implementation, a fecund avenue could be to combine the strengths of our approach, SAM, with those of streaming frameworks such as Apache Flink. We showed that SAM excels at subgraph matching when the pattern is frequent; however, Flink was much more efficient with rare patterns. If we can tune the approach based on the distribution present in the data, SAL would be able to handle high-throughput streaming data with greater robustness.

In terms of the expressibility of SAL, we primarily focused on homogeneous data: all data sources were of the same type, namely netflows. However, the cyber domain has many different levels and layers that could be combined in the same workflow. For example, host-based intrusion detection systems can generate alerts that can then be combined with network-level analysis. In terms of Point of Sale breaches, unusual memory activity by a process on a Point of Sale machine might be a weak indicator, but combining it with corresponding network traffic that indicates exfiltration from the same machine could produce a much stronger overall picture. SAL could be used to succinctly express this combination of patterns, allowing for fewer false positives and overall stronger indicators of malicious activity.

Finally, network telemetry is another avenue of research where SAL could be extended. Our work so far has assumed that netflow generation has already been taken care of at a variety of sensors. We then ingest the netflows and perform our analysis with SAL. However, switches today allow for configurable, programmable packet forwarding, opening opportunities for SAL to be compiled as a combination of switch computing resources and an analytics plane.

Bibliography

[1] Resource description framework. W3C Recommendation, February 2014.

[2] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Uğur Çetintemel, Mitch Cherniack, Jeong Hyon Hwang, Wolfgang Lindner, Anurag S. Maskey, Alexander Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stan Zdonik. The design of the borealis stream processing engine. In 2nd Biennial Conference on Innovative Data Systems Research, CIDR 2005, pages 277–289, 2005.

[3] Daniel J. Abadi, Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120–139, August 2003.

[4] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[5] A. Ahmed, A. Lisitsa, and C. Dixon. A misuse-based network intrusion detection system using temporal logic and stream processing. In Network and System Security (NSS), 2011 5th International Conference on, pages 1–8, Sept 2011.

[6] Faruk Akgul. ZeroMQ. Packt Publishing, 2013.

[7] Muhammad Intizar Ali, Feng Gao, and Alessandra Mileo. Citybench: A configurable bench- mark to evaluate rsp engines using smart city datasets. In International Semantic Web Conference, 2015.

[8] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The tera computer system. SIGARCH Comput. Archit. News, 18(3b):1–6, June 1990.

[9] Darko Anicic, Paul Fodor, Sebastian Rudolph, and Nenad Stojanovic. Ep-sparql: A unified language for event processing and stream reasoning. In Proceedings of the 20th International

Conference on World Wide Web, WWW ’11, pages 635–644, New York, NY, USA, 2011. ACM.

[10] Apache. Apache apex. apex.apache.org, 2019.

[11] Apache. Apache beam. https://beam.apache.org, 2019.

[12] Apache. Apache flink. flink.apache.org, 2019.

[13] Apache. Apache gearpump. gearpump.apache.org, 2019.

[14] Arvind Arasu and Gurmeet Singh Manku. Approximate counts and quantiles over sliding windows. In Proceedings of the Twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’04, pages 286–296, New York, NY, USA, 2004. ACM.

[15] Aurelius. Titan. http://thinkaurelius.github.io/titan/, 2015.

[16] Brain Babcock, Mayur Datar, Rajeev Motwani, and Liadan O’Callaghan. Maintaining variance and k-medians over data stream windows. In Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’03, pages 234–243, New York, NY, USA, 2003. ACM.

[17] D. A. Bader, J. Berry, A. Amos-Binks, D. Chavarría-Miranda, C. Hastings, K. Madduri, and S. C. Poulous. Stinger: Spatio-temporal interaction networks and graphs (sting) extensible representation. Technical report, Georgia Institute of Technology, 2009.

[18] David A. Bader and Kamesh Madduri. Snap, small-world network analysis and partitioning: An open-source parallel graph framework for the exploration of large-scale networks. In IPDPS, pages 1–12. IEEE, 2008.

[19] Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. Groute: An asynchronous multi-gpu programming model for irregular computations. In Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’17, pages 235–248, New York, NY, USA, 2017. ACM.

[20] Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali. Groute: An asynchronous multi-gpu programming model for irregular computations. SIGPLAN Not., 52(8):235–248, January 2017.

[21] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American, 284(5):34–43, May 2001.

[22] Jonathan W. Berry, Bruce Hendrickson, Simon Kahan, and Petr Konecny. Software and algo- rithms for graph queries on multithreaded architectures. Parallel and Distributed Processing Symposium, International, 0:495, 2007.

[23] Leyla Bilge, Davide Balzarotti, William Robertson, Engin Kirda, and Christopher Kruegel. Disclosure: Detecting botnet command and control servers through large-scale netflow analysis. In Proceedings of the 28th Annual Computer Security Applications Conference, ACSAC '12, pages 129–138, New York, NY, USA, 2012. ACM.

[24] David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, April 2012.

[25] Andre Bolles, Marco Grawunder, and Jonas Jacobi. Streaming sparql extending sparql to process data streams. In Proceedings of the 5th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC’08, pages 448–462, Berlin, Heidelberg, 2008. Springer-Verlag.

[26] Dario Bonfiglio, Marco Mellia, Michela Meo, Dario Rossi, and Paolo Tofanelli. Revealing skype traffic: When randomness plays with you. SIGCOMM Comput. Commun. Rev., 37(4):37–48, August 2007.

[27] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. P4: Pro- gramming protocol-independent packet processors. SIGCOMM Comput. Commun. Rev., 44(3):87–95, July 2014.

[28] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001.

[29] Yingyi Bu, Vinayak Borkar, Jianfeng Jia, Michael J. Carey, and Tyson Condie. Pregelix: Big(ger) graph analytics on a dataflow engine. Proc. VLDB Endow., 8(2):161–172, October 2014.

[30] A. L. Buczak and E. Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys Tutorials, 18(2):1153– 1176, Secondquarter 2016.

[31] Ed Bullmore and Olaf Sporns. Complex brain networks: graph theoretical analysis of struc- tural and functional systems. Nature Reviews Neuroscience, 10:186–198, 2009.

[32] Aydın Buluç and John R. Gilbert. The combinatorial blas: Design, implementation, and applications. Int. J. High Perform. Comput. Appl., 25(4):496–509, November 2011.

[33] Vernon K.C. Bumgardner and Victor W. Marek. Scalable hybrid stream and hadoop net- work analysis system. In Proceedings of the 5th ACM/SPEC International Conference on Performance Engineering, ICPE ’14, pages 219–224, New York, NY, USA, 2014. ACM.

[34] Elie Bursztein, Jonathan Aigrain, Angelika Moscicki, and John C. Mitchell. The end is nigh: Generic solving of text-based captchas. In Proceedings of the 8th USENIX Conference on Offensive Technologies, WOOT’14, pages 3–3, Berkeley, CA, USA, 2014. USENIX Associa- tion.

[35] Jean-Paul Calbimonte, Oscar Corcho, and Alasdair J. G. Gray. Enabling ontology-based access to streaming data sources. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web - Volume Part I, ISWC’10, pages 96–111, Berlin, Heidel- berg, 2010. Springer-Verlag.

[36] Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Monitoring streams: A new class of data management applications. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 215–226. VLDB Endowment, 2002.

[37] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15, pages 1:1–1:15, New York, NY, USA, 2015. ACM.

[38] J. Cheng, Q. Liu, Z. Li, W. Fan, J. C. S. Lui, and C. He. Venus: Vertex-centric streamlined graph computation on a single pc. In 2015 IEEE 31st International Conference on Data Engineering, pages 1131–1142, April 2015.

[39] Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Donald Carney, Ugur Cetintemel, Ying Xing, and Stan Zdonik. Scalable Distributed Stream Processing. In CIDR 2003 - First Biennial Conference on Innovative Data Systems Research, Asilomar, CA, Jan- uary 2003.

[40] François Chollet et al. Keras. https://keras.io, 2015.

[41] James R. Clapper. Statement for the record, worldwide threat assessment of the us intelligence community, January 2014.

[42] P. Clifford and I. A. Cosma. A statistical analysis of probabilistic counting algorithms. ArXiv e-prints, January 2008.

[43] Cisco Corporation. Cisco visual networking index: Forcast and methodology, 2014-2019, 2015.

[44] Matthew Crosston. World gone cyber mad: How mutually assured debilitation is the best hope for cyber deterrence. Strategic Studies Quarterly, 5(1), Spring 2011.

[45] Damballa. The command structure of the aurora botnet. Technical report, Damballa, 2010.

[46] Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Maintaining stream statis- tics over sliding windows: (extended abstract). In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’02, pages 635–644, Philadelphia, PA, USA, 2002. Society for Industrial and Applied Mathematics.

[47] Mayur Datar and Rajeev Motwani. The sliding-window computational model and results. In Charu C. Aggarwal, editor, Data Streams: Models and Algorithms (Advances in Database Systems). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[48] Mayur Datar and S. Muthukrishnan. Estimating Rarity and Similarity over Data Stream Windows, pages 323–335. Springer Berlin Heidelberg, Berlin, Heidelberg, 2002.

[49] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, and Keshav Pingali. Gluon: A communication-optimizing substrate for distributed heterogeneous graph analytics. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2018, pages 752–768, New York, NY, USA, 2018. ACM.

[50] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, and Keshav Pingali. Gluon: A communication-optimizing substrate for distributed heterogeneous graph analytics. SIGPLAN Not., 53(4):752–768, June 2018.

[51] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[52] Roger Dingledine, Nick Mathewson, and Paul Syverson. Tor: The second-generation onion router. In Proceedings of the 13th Conference on USENIX Security Symposium - Volume 13, SSYM’04, pages 21–21, Berkeley, CA, USA, 2004. USENIX Association.

[53] Hristo Djidjev, Gary Sandine, Curtis Storlie, and Scott Vander Wiel. Graph Based Statistical Analysis of Network Traffic. In Proceedings of the Ninth Workshop on Mining and Learning with Graphs, August 2011.

[54] D. Ediger, K. Jiang, J. Riedy, D. A. Bader, C. Corley, R. Farber, and W. N. Reynolds. Massive social network analysis: Mining twitter for social good. In 2010 39th International Conference on Parallel Processing, pages 583–593, Sep. 2010.

[55] David Ediger, Karl Jiang, E. Jason Riedy, and David A. Bader. Massive streaming data analytics: A case study with clustering coefficients. In IPDPS Workshops, pages 1–8. IEEE, 2010.

[56] Github Engineering. February 28th ddos incident report. https://githubengineering.com/ddos-incident-report/, 3 2018.

[57] J. Erman, A. Mahanti, and M. Arlitt. Qrp05-4: Internet traffic identification using machine learning. In IEEE Globecom 2006, pages 1–6, Nov 2006.

[58] FBI. The insider threat.

[59] Joan Feigenbaum, Sampath Kannan, Andrew McGregor, Siddharth Suri, and Jian Zhang. On graph problems in a semi-streaming model. Theor. Comput. Sci., 348(2):207–216, December 2005.

[60] James Wood Forsyth Jr. and Billy E. Pope. Structural causes and cyber effects why interna- tional order is inevitable in cyberspace. Strategic Studies Quarterly, 8(4), Winter 2014.

[61] James Wood Forsyth Jr. and Billy E. Pope. Structural causes and cyber effects: A response to our critics. Strategic Studies Quarterly, 9(2), Summer 2015.

[62] Davide Francesco Barbieri, Daniele Braga, Stefano Ceri, Emanuele Della Valle, and Michael Grossniklaus. C-sparql: A continuous query language for rdf data streams. International Journal of Semantic Computing, 4:487, 03 2010.

[63] S. Garca, M. Grill, J. Stiborek, and A. Zunino. An empirical comparison of botnet detection methods. Computers & Security, 45(Supplement C):100 – 123, 2014.

[64] Nishant Garg. Apache Kafka. Packt Publishing, 2013.

[65] Nishant Garg. Apache Kafka. Packt Publishing, 2013.

[66] Steve H. Garlik, Andy Seaborne, and Eric Prud'hommeaux. SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/.

[67] Bugra Gedik, Henrique Andrade, Kun-Lung Wu, Philip S. Yu, and Myungcheol Doo. Spade: The system s declarative stream processing engine. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1123–1134, New York, NY, USA, 2008. ACM.

[68] Abdullah Gharaibeh, Elizeu Santos-Neto, Lauro Beltrão Costa, and Matei Ripeanu. Efficient large-scale graph processing on hybrid CPU and GPU systems. CoRR, abs/1312.3018, 2013.

[69] Iffat A. Gheyas and Ali E. Abdallah. Detection and prediction of insider threats to cyber security: a systematic literature review and meta-analysis. Big Data Analytics, 1(1):6, 2016.

[70] J.R. Gilbert, V.B. Shah, and S. Reinhardt. A unified framework for numerical and combina- torial computing. Computing in Science Engineering, 10(2):20–25, March 2008.

[71] Jan Goebel and Thorsten Holz. Rishi: Identify bot contaminated hosts by irc nickname evaluation. In Proceedings of the First Conference on First Workshop on Hot Topics in Understanding Botnets, HotBots’07, pages 8–8, Berkeley, CA, USA, 2007. USENIX Associ- ation.

[72] Lukasz Golab, David DeHaan, Erik D. Demaine, Alejandro Lopez-Ortiz, and J. Ian Munro. Identifying frequent items in sliding windows over on-line packet streams. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, IMC ’03, pages 173–178, New York, NY, USA, 2003. ACM.

[73] Lukasz Golab, David DeHaan, Erik D. Demaine, Alejandro Lopez-Ortiz, and J. Ian Munro. Identifying frequent items in sliding windows over on-line packet streams. In Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, IMC ’03, pages 173–178, New York, NY, USA, 2003. ACM.

[74] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Pow- ergraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, pages 17–30, Berkeley, CA, USA, 2012. USENIX Association.

[75] Eric Goodman, M. Nicole Lemaster, and Edward Jimenez. Scalable hashing for shared mem- ory supercomputers. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 41:1–41:11, New York, NY, USA, 2011. ACM.

[76] Eric L. Goodman and Dirk Grunwald. Using vertex-centric programming platforms to im- plement sparql queries on large graphs. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms, IA3 ’14, pages 25–32, Piscataway, NJ, USA, 2014. IEEE Press.

[77] Eric L. Goodman and Edward Jimenez. Triangle finding: How graph theory can help the semantic web. In Joint Workshop on Scalable and High-Performance Semantic Web Systems, 2012.

[78] Google. Cloud dataflow. https://cloud.google.com/dataflow, 2019.

[79] I. Gorton, Z. Huang, Y. Chen, B. Kalahar, S. Jin, D. Chavarría-Miranda, D. Baxter, and J. Feo. A high-performance hybrid computing approach to massive contingency analysis in the power grid. In 2009 Fifth IEEE International Conference on e-Science, pages 277–283, Dec 2009.

[80] Oded Green, Robert McColl, and David A. Bader. A fast algorithm for streaming between- ness centrality. In Proceedings of the 2012 ASE/IEEE International Conference on Social Computing and 2012 ASE/IEEE International Conference on Privacy, Security, Risk and Trust, SOCIALCOM-PASSAT ’12, pages 11–20, Washington, DC, USA, 2012. IEEE Com- puter Society.

[81] Douglas Gregor and Andrew Lumsdaine. The parallel bgl: A generic library for distributed graph computations. In In Parallel Object-Oriented Scientific Computing, 2005.

[82] Gremlin. https://github.com/tinkerpop/gremlin/wiki, 2015.

[83] Samuel Grossman, Heiner Litz, and Christos Kozyrakis. Making pull-based graph process- ing performant. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’18, pages 246–260, New York, NY, USA, 2018. ACM.

[84] Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee. Botminer: Clustering analysis of network traffic for protocol- and structure-independent botnet detection. In Proceedings of the 17th Conference on Security Symposium, SS’08, pages 139–154, Berkeley, CA, USA, 2008. USENIX Association.

[85] Guofei Gu, Junjie Zhang, and Wenke Lee. Botsniffer: Detecting botnet command and control channels in network traffic. In NDSS, 2008.

[86] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. Lubm: A benchmark for owl knowledge base systems. J. Web Sem., 3(2-3):158–182, 2005.

[87] Arpit Gupta, Rob Harrison, Marco Canini, Nick Feamster, Jennifer Rexford, and Walter Willinger. Sonata: Query-driven streaming network telemetry. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’18, pages 357–371, New York, NY, USA, 2018. ACM.

[88] Claudio Gutierrez, Carlos Hurtado, and Alejandro Vaisman. Temporal rdf. In European Conference on the Semantic Web, pages 93–107, 2005.

[89] Claudio Gutierrez, Carlos A. Hurtado, and Alejandro Vaisman. Introducing time into rdf. IEEE Trans. on Knowl. and Data Eng., 19(2):207–218, February 2007.

[90] Hadoop. hadoop.apache.org, 2019.

[91] A. Hagberg, N. Lemons, A. Kent, and J. Neil. Connected components and credential hopping in authentication graphs. In 2014 Tenth International Conference on Signal-Image Technology and Internet-Based Systems, pages 416–423, Nov 2014.

[92] Diana Haidar, Mohamed Medhat Gaber, and Yevgeniya Kovalchuk. Anythreat: An opportunistic knowledge discovery approach to insider threat detection. CoRR, abs/1812.00257, 2018.

[93] Minyang Han and Khuzaima Daudjee. Giraph unchained: Barrierless asynchronous parallel execution in pregel-like graph processing systems. Proc. VLDB Endow., 8(9):950–961, May 2015.

[94] Richard Harang. Learning and semantics. In Alexander Kott, Cliff Wang, and Robert F. Erbacher, editors, Cyber Defense and Situational Awareness, volume 62 of Advances in Information Security, pages 201–217. Springer International Publishing, 2014.

[95] Heron. https://github.com/apache/incubator-heron, 2019.

[96] Martin Hirzel, Scott Schneider, and Buğra Gedik. Spl: An extensible language for distributed stream processing. ACM Trans. Program. Lang. Syst., 39(1):5:1–5:39, March 2017.

[97] Emilie Hogan, John R. Johnson, and Mahantesh Halappanavar. Graph coarsening for path finding in cybersecurity graphs. In Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop, CSIIRW ’13, pages 7:1–7:4, New York, NY, USA, 2013. ACM.

[98] Sungpack Hong, Hassan Chafi, Eric Sedlar, and Kunle Olukotun. Green-marl: A dsl for easy and efficient graph analysis. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 349–362, New York, NY, USA, 2012. ACM.

[99] Sungpack Hong, Semih Salihoglu, Jennifer Widom, and Kunle Olukotun. Simplifying scalable graph processing with a domain-specific language. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’14, pages 208:208– 208:218, New York, NY, USA, 2014. ACM.

[100] HP. Cyber risk report 2015, 2015.

[101] Y. Hu, D. Chiu, and J. C. S. Lui. Application identification based on network behavioral profiles. In 2008 16th Interntional Workshop on Quality of Service, pages 219–228, June 2008.

[102] Danny Yuxing Huang, Hitesh Dharmdasani, Sarah Meiklejohn, Vacha Dave, Chris Grier, Damon McCoy, Stefan Savage, Nicholas Weaver, Alex C. Snoeren, and Kirill Levchenko. Botcoin: Monetizing stolen cycles. In 21st Annual Network and Distributed System Security Symposium, NDSS 2014, San Diego, California, USA, February 23-26, 2014, 2014.

[103] T. Indulekha, G. Aswathy, and P. Sudhakaran. A graph based algorithm for clustering and ranking proteins for identifying disease causing genes. In 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 1022–1026, Sep. 2018.

[104] SANS Institute. Confirmation of a coordinated attack on the ukrainian power grid, January 2016.

[105] Z. Jadidi, V. Muthukkumarasamy, E. Sithirasenan, and M. Sheikhan. Flow-based anomaly detection using neural network optimized with gsa algorithm. In 2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops, pages 76–81, July 2013.

[106] Z. Jadidi, V. Muthukkumarasamy, E. Sithirasenan, and M. Sheikhan. Flow-based anomaly detection using neural network optimized with gsa algorithm. In 2013 IEEE 33rd International Conference on Distributed Computing Systems Workshops, pages 76–81, July 2013.

[107] Zahra Jadidi, Vallipuram Muthukkumarasamy, Elankayer Sithirasenan, and Kalvinder Singh. Performance of flow-based anomaly detection in sampled traffic. JNW, 10(9):512–520, 2015.

[108] Mattijs Jonker, Alistair King, Johannes Krupp, Christian Rossow, Anna Sperotto, and Al- berto Dainotti. Millions of targets under attack: A macroscopic characterization of the dos ecosystem. In Proceedings of the 2017 Internet Measurement Conference, IMC ’17, pages 100–113, New York, NY, USA, 2017. ACM.

[109] Cliff Joslyn, Sutanay Choudhury, David Haglin, Bill Howe, Bill Nickless, and Bryan Olsen. Massive scale cyber traffic analysis: A driver for graph database research. In First International Workshop on Graph Data Management Experiences and Systems, GRADES ’13, pages 3:1–3:6, New York, NY, USA, 2013. ACM.

[110] U Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. Pegasus: Mining peta-scale graphs. Knowl. Inf. Syst., 27(2):303–325, May 2011.

[111] A. M. Karimi, Q. Niyaz, Weiqing Sun, A. Y. Javaid, and V. K. Devabhaktuni. Distributed network traffic feature extraction for a real-time ids. In 2016 IEEE International Conference on Electro Information Technology (EIT), pages 0522–0526, May 2016.

[112] David Kennedy, Jim O’Gorman, Devon Kearns, and Mati Aharoni. Metasploit: The Penetration Tester’s Guide. No Starch Press, San Francisco, CA, USA, 1st edition, 2011.

[113] Alexander D. Kent, Lorie M. Liebrock, and Joshua C. Neil. Authentication graphs: Analyzing user behavior within an enterprise network. Computers & Security, 48:150 – 166, 2015.

[114] J. Kepner and J. Gilbert. Graph Algorithms in the Language of Linear Algebra. Society for Industrial and Applied Mathematics, 2011.

[115] S. Khattak, N.R. Ramay, K.R. Khan, A.A. Syed, and S.A. Khayam. A taxonomy of botnet behavior, detection, and defense. Communications Surveys Tutorials, IEEE, 16(2):898–924, Second 2014.

[116] Farzad Khorasani, Keval Vora, Rajiv Gupta, and Laxmi N. Bhuyan. Cusha: Vertex- centric graph processing on gpus. In Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, HPDC ’14, pages 239–252, New York, NY, USA, 2014. ACM.

[117] Knoesis. Linkedsensordata, 2010.

[118] P. Kumar and H. H. Huang. G-store: High-performance graph store for trillion-edge process- ing. In SC ’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 830–841, Nov 2016.

[119] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. Graphchi: Large-scale graph computation on just a pc. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI '12), Hollywood, October 2012.

[120] Danh Le-Phuoc, Minh Dao-Tran, Josiane Xavier Parreira, and Manfred Hauswirth. A native and adaptive approach for unified processing of linked streams and linked data. In Proceedings of the 10th International Conference on The Semantic Web - Volume Part I, ISWC’11, pages 370–388, Berlin, Heidelberg, 2011. Springer-Verlag.

[121] Danh Le-Phuoc, Minh Dao-Tran, Minh-Duc Pham, Peter Boncz, Thomas Eiter, and Michael Fink. Linked stream data processing engines: Facts and figures. In Proceedings of the 11th International Conference on The Semantic Web - Volume Part II, ISWC’12, pages 300–312, Berlin, Heidelberg, 2012. Springer-Verlag.

[122] Jinsoo Lee, Wook-Shin Han, Romans Kasperovics, and Jeong-Hoon Lee. An in-depth com- parison of subgraph isomorphism algorithms in graph databases. In Proceedings of the 39th international conference on Very Large Data Bases, PVLDB’13, pages 133–144. VLDB En- dowment, 2013.

[123] Yeonhee Lee and Youngseok Lee. Toward scalable internet traffic measurement and analysis with hadoop. SIGCOMM Comput. Commun. Rev., 43(1):5–13, January 2012.

[124] Bingdong Li, Jeff Springer, George Bebis, and Mehmet Hadi Gunes. A survey of network flow applications. Journal of Network and Computer Applications, 36(2):567 – 581, 2013.

[125] Yuliang Li, Rui Miao, Changhoon Kim, and Minlan Yu. Flowradar: A better netflow for data centers. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 311–324, Santa Clara, CA, 2016. USENIX Association.

[126] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed graphlab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716–727, April 2012.

[127] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Graphlab: A new framework for parallel machine learning. CoRR, abs/1006.4990, 2010.

[128] Adam Lugowski, David Alber, Aydın Buluç, John R. Gilbert, Steve Reinhardt, Yun Teng, and Andrew Waranis. A flexible open-source toolbox for scalable complex graph analysis. In Proceedings of the Twelfth SIAM International Conference on Data Mining (SDM12), pages 930–941, April 2012.

[129] Dmitriy Lyubimov and Andrew Palumbo. Apache Mahout: Beyond MapReduce. CreateSpace Independent Publishing Platform, USA, 1st edition, 2016.

[130] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams, May 1998.

[131] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages 135–146, New York, NY, USA, 2010. ACM.

[132] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 346–357. VLDB Endowment, 2002.

[133] S. Marchal, X. Jiang, R. State, and T. Engel. A big data architecture for large scale security monitoring. In 2014 IEEE International Congress on Big Data, pages 56–63, June 2014.

[134] M. Mayhew, M. Atighetchi, A. Adler, and R. Greenstadt. Use of machine learning in big data analytics for insider threat detection. In MILCOM 2015 - 2015 IEEE Military Communications Conference, pages 915–922, Oct 2015.

[135] Brian M. Mazanec. Why international order in cyberspace is not inevitable. Strategic Studies Quarterly, 9(2), Summer 2015.

[136] R. McColl, O. Green, and D.A. Bader. A new parallel algorithm for connected compo- nents in dynamic graphs. In High Performance Computing (HiPC), 2013 20th International Conference on, pages 246–255, Dec 2013.

[137] John McCrae. The linked open data cloud, 2019.

[138] Christopher D. McDermott, Farzan Majdani, and Andrei Petrovski. Botnet detection in the internet of things using deep learning approaches. In 2018 International Joint Conference on Neural Networks, IJCNN 2018, Rio de Janeiro, Brazil, July 8-13, 2018, pages 1–8, 2018.

[139] Frank McSherry, Michael Isard, and Derek G. Murray. Scalability! but at what cost? In Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems, HO- TOS’15, pages 14–14, Berkeley, CA, USA, 2015. USENIX Association.

[140] Natarajan Meghanathan, editor. Graph Theoretic Approaches for Analyzing Large-Scale Social Networks. IGI Global, Hershey, PA, 2018.

[141] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In Proceedings of the 11th International Conference on Extending Database Technology: Advances in Database Technology, EDBT ’08, pages 618–629, New York, NY, USA, 2008. ACM.

[142] Mark S. Miller and Eric K. Drexler. Markets and computation: Agoric open sys- tems. In Bernardo Huberman, editor, The Ecology of Computation. Elsevier Science Publishers/North-Holland, 1988.

[143] Information Warfare Monitor. Tracking ghostnet: Investigating a cyber espionage network. Technical report, Information Warfare Monitor, 2009.

[144] S. Muthukrishnan. Data streams: Algorithms and applications. Found. Trends Theor. Comput. Sci., 1(2):117–236, August 2005.

[145] Shishir Nagaraja, Prateek Mittal, Chi-Yao Hong, Matthew Caesar, and Nikita Borisov. Bot- grep: Finding p2p bots with structured graph analysis. In Proceedings of the 19th USENIX Conference on Security, USENIX Security’10, pages 7–7, Berkeley, CA, USA, 2010. USENIX Association.

[146] Akihiro Nakao and Ping Du. Toward in-network deep machine learning for identifying mobile applications and enabling application specific network slicing. IEICE Transactions on Communications, E101.B, 01 2018.

[147] Srinivas Narayana, Anirudh Sivaraman, Vikram Nathan, Prateesh Goyal, Venkat Arun, Mo- hammad Alizadeh, Vimalkumar Jeyakumar, and Changhoon Kim. Language-directed hard- ware design for network performance monitoring. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’17, pages 85–98, New York, NY, USA, 2017. ACM.

[148] Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. Latency-tolerant software distributed shared memory. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 291–305, Santa Clara, CA, July 2015. USENIX Association.

[149] Jacob Nelson, Brandon Myers, A. H. Hunter, Preston Briggs, Luis Ceze, Carl Ebeling, Dan Grossman, Simon Kahan, and Mark Oskin. Crunching large graphs with commodity proces- sors. In Proceedings of the 3rd USENIX Conference on Hot Topic in Parallelism, HotPar’11, pages 10–10, Berkeley, CA, USA, 2011. USENIX Association.

[150] Neo4j. Neo4j - the worlds leading graph database, 2012.

[151] Barefoot Networks. Barefoot tofino. https://barefootnetworks.com/products/brief-tofino/.

[152] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 456–471, New York, NY, USA, 2013. ACM.

[153] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 456–471, New York, NY, USA, 2013. ACM.

[154] S. Noel, E. Harley, K.H. Tam, M. Limiero, and M. Share. Chapter 4 - cygraph: Graph-based analytics and visualization for cybersecurity. In Venkat N. Gudivada, Vijay V. Raghavan, Venu Govindaraju, and C.R. Rao, editors, Cognitive Computing: Theory and Applications, volume 35 of Handbook of Statistics, pages 117 – 167. Elsevier, 2016.

[155] Steven Noel, Eric Harley, Kam Him Tam, and Greg Gyor. Big-data architecture for cyber attack graphs representing security relationships in nosql graph databases. 2016.

[156] E. Nurvitadhi, G. Weisz, Y. Wang, S. Hurkat, M. Nguyen, J. C. Hoe, J. F. Martnez, and C. Guestrin. Graphgen: An fpga framework for vertex-centric graph computation. In 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines, pages 25–28, May 2014.

[157] Joseph S. Nye. Cyber power. Belfer Center for Science and International Affairs, Harvard Kennedy School, May 2010.

[158] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pages 1099–1110, New York, NY, USA, 2008. ACM.

[159] Sreepathi Pai and Keshav Pingali. A compiler for throughput optimization of graph algorithms on gpus. In Proceedings of the 2016 ACM SIGPLAN International Conference on

Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2016, pages 1–19, New York, NY, USA, 2016. ACM.

[160] Sreepathi Pai and Keshav Pingali. A compiler for throughput optimization of graph algo- rithms on gpus. SIGPLAN Not., 51(10):1–19, October 2016.

[161] Y. Pan, Y. Wang, Y. Wu, C. Yang, and J. D. Owens. Multi-gpu graph analytics. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 479–490, May 2017.

[162] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De- Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[163] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[164] G. Pellegrino, Q. Lin, C. Hammerschmidt, and S. Verwer. Learning behavioral fingerprints from netflows using timed automata. In 2017 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), pages 308–316, May 2017.

[165] Vijayan Prabhakaran, Ming Wu, Xuetian Weng, Frank McSherry, Lidong Zhou, and Maya Haridasan. Managing large graphs on multi-cores with graph awareness. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference, USENIX ATC’12, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association.

[166] Eric Prud’hommeaux and Andy Seaborne. SPARQL Query Language for RDF. W3C Rec- ommendation, January 2008. http://www.w3.org/TR/rdf-sparql-query/.

[167] Paulo Angelo Alves Resende and André Costa Drummond. A survey of random forest based methods for intrusion detection systems. ACM Comput. Surv., 51(3):48:1–48:36, May 2018.

[168] Robert Ricci, Eric Eide, and The CloudLab Team. Introducing CloudLab: Scientific infras- tructure for advancing cloud architectures and applications. USENIX ;login:, 39(6), December 2014.

[169] Amitabha Roy, Laurent Bindschaedler, Jasmina Malicevic, and Willy Zwaenepoel. Chaos: Scale-out graph processing from secondary storage. 2015.

[170] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. X-stream: Edge-centric graph pro- cessing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 472–488, New York, NY, USA, 2013. ACM.

[171] Semih Salihoglu and Jennifer Widom. Gps: A graph processing system. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management, SSDBM, pages 22:1–22:12, New York, NY, USA, 2013. ACM.

[172] Samza. samza.apache.org, 2019.

[173] J. J. Santanna, R. van Rijswijk-Deij, R. Hofstede, A. Sperotto, M. Wierbosch, L. Z. Granville, and A. Pras. Booters an analysis of ddos-as-a-service attacks. In 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), pages 243–251, May 2015.

[174] Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, and Pradeep Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD ’14, pages 979–990, New York, NY, USA, 2014. ACM.

[175] Scala parser combinator. https://github.com/scala/scala-parser-combinators, 2017. [Online; accessed October-2017].

[176] Hyunseok Seo, Jinwook Kim, and Min-Soo Kim. Gstream: A graph streaming processing method for large-scale graphs on gpus. SIGPLAN Not., 50(8):253–254, January 2015.

[177] Hyunseok Seo, Jinwook Kim, and Min-Soo Kim. Gstream: A graph streaming process- ing method for large-scale graphs on gpus. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pages 253– 254, New York, NY, USA, 2015. ACM.

[178] M. Shao, J. Li, F. Chen, and X. Chen. An efficient framework for detecting evolving anomalous subgraphs in dynamic networks. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, pages 2258–2266, April 2018.

[179] Jiaxin Shi, Youyang Yao, Rong Chen, Haibo Chen, and Feifei Li. Fast and concurrent rdf queries with rdma-based distributed graph exploration. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, pages 317–332, Berkeley, CA, USA, 2016. USENIX Association.

[180] Xiaokui Shu, Ke Tian, Andrew Ciambrone, and Danfeng Yao. Breaking the target: An analysis of target data breach and lessons learned. CoRR, abs/1701.04940, 2017.

[181] Julian Shun and Guy E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. SIGPLAN Not., 48(8):135–146, February 2013.

[182] Jeremy G. Siek, Lie-Quan Lee, and Andrew Lumsdaine. The Boost Graph Library: User Guide and Reference Manual. Addison-Wesley Professional, 2001.

[183] Sérgio S.C. Silva, Rodrigo M.P. Silva, Raquel C.G. Pinto, and Ronaldo M. Salles. Botnets: A survey. Computer Networks, 57(2):378 – 403, 2013. Botnet Activity: Analysis, Detection and Shutdown.

[184] Kamaldeep Singh, Sharath Chandra Guntuku, Abhishek Thakur, and Chittaranjan Hota. Big data analytics framework for peer-to-peer botnet detection using random forests. Information Sciences, 278:488 – 497, 2014.

[185] R. Sommer and V. Paxson. Outside the closed world: On using machine learning for network intrusion detection. In 2010 IEEE Symposium on Security and Privacy, pages 305–316, May 2010.

[186] John Sonchack, Adam J. Aviv, Eric Keller, and Jonathan M. Smith. Turboflow: Information rich flow record generation on commodity switches. In Proceedings of the Thirteenth EuroSys Conference, EuroSys ’18, pages 11:1–11:16, New York, NY, USA, 2018. ACM.

[187] John Sonchack, Oliver Michel, Adam J. Aviv, Eric Keller, and Jonathan M. Smith. Scaling hardware accelerated network monitoring to concurrent and dynamic queries with *flow. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 823–835, Boston, MA, 2018. USENIX Association.

[188] Spark. spark.apache.org, 2019.

[189] Anna Sperotto, Ramin Sadre, Frank van Vliet, and Aiko Pras. A Labeled Data Set for Flow-Based Intrusion Detection, pages 39–50. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[190] Storm. storm.apache.org, 2019.

[191] X. Sun, Y. Tan, Q. Wu, and J. Wang. Hasse diagram based algorithm for continuous temporal subgraph query in graph stream. In 2017 6th International Conference on Computer Science and Network Technology (ICCSNT), pages 241–246, Oct 2017.

[192] Narayanan Sundaram, Nadathur Rajagopalan Satish, Md. Mostofa Ali Patwary, Subramanya Dulloor, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. Graphmat: High performance graph analytics made productive. CoRR, abs/1503.07241, 2015.

[193] J. Susymary and R. Lawrance. Graph theory analysis of protein-protein interaction network and graph based clustering of proteins linked with zika virus using mcl algorithm. In 2017 International Conference on Circuit ,Power and Computing Technologies (ICCPCT), pages 1–7, April 2017.

[194] Symantec. Messagelabs intelligence report, March 2011.

[195] Symantec. Internet security threat report, 2018.

[196] R. Talpade, G. Kim, and S. Khurana. Nomad: traffic-based network monitoring framework for anomaly detection. In Proceedings IEEE International Symposium on Computers and Communications (Cat. No.PR00250), pages 442–451, July 1999.

[197] Jonas Tappolet and Abraham Bernstein. Applied temporal rdf: Efficient temporal querying of rdf data with sparql. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 308–322, Berlin, Heidelberg, 2009. Springer-Verlag.

[198] Florian Tegeler, Xiaoming Fu, Giovanni Vigna, and Christopher Kruegel. Botfinder: Finding bots in network traffic without deep packet inspection. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, CoNEXT ’12, pages 349–360, New York, NY, USA, 2012. ACM.

[199] Florian Tegeler, Xiaoming Fu, Giovanni Vigna, and Christopher Kruegel. Botfinder: Finding bots in network traffic without deep packet inspection. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, CoNEXT '12, pages 349–360, New York, NY, USA, 2012. ACM.

[200] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, August 1990.

[201] VAST. Vast challenge 2013: Mini-challenge 3. http://vacommunity.org/VAST+Challenge+ 2013%3A+Mini-Challenge+3, 2013. [Online; accessed October-2017].

[202] Alexei Vázquez, João Gama Oliveira, Zoltán Dezső, Kwang-Il Goh, Imre Kondor, and Albert-László Barabási. Modeling bursts and heavy tails in human dynamics. Phys. Rev. E, 73:036127, Mar 2006.

[203] Verizon. 2018 data breach investigations report, 2018.

[204] W3C. Rdf reification, 2019.

[205] David Y. Wang, Stefan Savage, and Geoffrey M. Voelker. Juice: A longitudinal study of an seo botnet. In NDSS. The Internet Society, 2013.

[206] P. Wang, F. Wang, F. Lin, and Z. Cao. Identifying peer-to-peer botnets through periodicity behavior analysis. In 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/ 12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE), pages 283–288, Aug 2018.

[207] Gregory C. Wilshusen. Cybersecurity: A better defined and implemented national strategy is needed to address persistent challenges, 2013.

[208] Wireshark. wireshark.org.

[209] James Wyke. The zeroaccess botnet: Mining and fraud for massive financial gain. In Sophos Technical Paper, 2012.

[210] Reynold S. Xin, Joseph E. Gonzalez, Michael J. Franklin, and Ion Stoica. Graphx: A resilient distributed graph system on spark. In First International Workshop on Graph Data Management Experiences and Systems, GRADES ’13, pages 2:1–2:6, New York, NY, USA, 2013. ACM.

[211] D. Yan, Y. Huang, M. Liu, H. Chen, J. Cheng, H. Wu, and C. Zhang. Graphd: Distributed vertex-centric graph processing beyond the memory limit. IEEE Transactions on Parallel and Distributed Systems, 29(1):99–114, Jan 2018.

[212] ChengGuo Yin, ShuangQing Li, and Qi Li. Network traffic classification via hmm under the guidance of syntactic structure. Computer Networks, 56(6):1814 – 1825, 2012.

[213] Danny Yuxing Huang, Hitesh Dharmdasani, Sarah Meiklejohn, Vacha Dave, Chris Grier, Damon McCoy, Stefan Savage, Nicholas Weaver, Alex Snoeren, and Kirill Levchenko. Botcoin: Monetizing stolen cycles. In Proceedings of the Network and Distributed System Security Symposium, NDSS ’14, 2014.

[214] Sebastian Zander, Thuy Nguyen, and Grenville Armitage. Self-learning ip traffic classification based on statistical flow characteristics. In Constantinos Dovrolis, editor, Passive and Active Network Measurement, pages 325–328, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg.

[215] Ying Zhang, Pham Minh Duc, Oscar Corcho, and Jean-Paul Calbimonte. Srbench: A streaming rdf/sparql benchmark. In Proceedings of the 11th International Conference on The Semantic Web - Volume Part I, ISWC’12, pages 641–657, Berlin, Heidelberg, 2012. Springer-Verlag.

[216] Yunhao Zhang, Rong Chen, and Haibo Chen. Sub-millisecond stateful stream querying over fast-evolving linked data. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP ’17, pages 614–630, New York, NY, USA, 2017. ACM.

[217] Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman P. Amarasinghe. Graphit - A high-performance DSL for graph analytics. CoRR, abs/1805.00923, 2018.

[218] Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, and Eliot Gillum. Botgraph: Large scale spamming botnet detection. In Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI’09, pages 321–334, Berkeley, CA, USA, 2009. USENIX Association.

[219] Da Zheng, Disa Mhembere, Randal Burns, Joshua Vogelstein, Carey E. Priebe, and Alexander S. Szalay. Flashgraph: Processing billion-node graphs on an array of commodity ssds. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 45–58, Santa Clara, CA, 2015. USENIX Association.

[220] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. Gemini: A computation- centric distributed graph processing system. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 301–316, Savannah, GA, 2016. USENIX Association.

[221] Yunyue Zhu and Dennis Shasha. Statstream: Statistical monitoring of thousands of data streams in real time. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB ’02, pages 358–369. VLDB Endowment, 2002.

[222] M. Zolotukhin, T. Hämäläinen, T. Kokkonen, and J. Siltanen. Online detection of anomalous network flows with soft clustering. In 2015 7th International Conference on New Technologies, Mobility and Security (NTMS), pages 1–5, July 2015.

[223] Javier Álvarez Cid-Fuentes, Claudia Szabo, and Katrina Falkner. An adaptive framework for the detection of novel botnets. Computers & Security, 79:148 – 161, 2018.

Appendix A

SAL and Flink Programs

This Appendix contains complete code samples of the SAL and Apache Flink programs used in this work.

A.1 Machine Learning Example

This SAL query is used on the CTU-13 dataset [63] as discussed in Chapter 6.

Listing A.1: Machine Learning Program

Edges = VastStream("localhost", 9999);
PARTITION Edges BY SourceIp, DestIp;
HASH SourceIp WITH IpHashFunction;
HASH DestIp WITH IpHashFunction;
VerticesByDest = STREAM Edges BY DestIp;
VerticesBySrc = STREAM Edges BY SourceIp;
feature0  = FOREACH VerticesBySrc GENERATE ave(SrcTotalBytes,10000,2);
feature1  = FOREACH VerticesBySrc GENERATE var(SrcTotalBytes,10000,2);
feature2  = FOREACH VerticesBySrc GENERATE ave(DestTotalBytes,10000,2);
feature3  = FOREACH VerticesBySrc GENERATE var(DestTotalBytes,10000,2);
feature4  = FOREACH VerticesBySrc GENERATE ave(DurationSeconds,10000,2);
feature5  = FOREACH VerticesBySrc GENERATE var(DurationSeconds,10000,2);
feature6  = FOREACH VerticesBySrc GENERATE ave(SrcPayloadBytes,10000,2);
feature7  = FOREACH VerticesBySrc GENERATE var(SrcPayloadBytes,10000,2);
feature8  = FOREACH VerticesBySrc GENERATE ave(DestPayloadBytes,10000,2);
feature9  = FOREACH VerticesBySrc GENERATE var(DestPayloadBytes,10000,2);
feature10 = FOREACH VerticesBySrc GENERATE ave(SrcPacketCount,10000,2);
feature11 = FOREACH VerticesBySrc GENERATE var(SrcPacketCount,10000,2);
feature12 = FOREACH VerticesBySrc GENERATE ave(DestPacketCount,10000,2);
feature13 = FOREACH VerticesBySrc GENERATE var(DestPacketCount,10000,2);
feature14 = FOREACH VerticesByDest GENERATE ave(SrcTotalBytes,10000,2);
feature15 = FOREACH VerticesByDest GENERATE var(SrcTotalBytes,10000,2);
feature16 = FOREACH VerticesByDest GENERATE ave(DestTotalBytes,10000,2);
feature17 = FOREACH VerticesByDest GENERATE var(DestTotalBytes,10000,2);
feature18 = FOREACH VerticesByDest GENERATE ave(DurationSeconds,10000,2);
feature19 = FOREACH VerticesByDest GENERATE var(DurationSeconds,10000,2);
feature20 = FOREACH VerticesByDest GENERATE ave(SrcPayloadBytes,10000,2);
feature21 = FOREACH VerticesByDest GENERATE var(SrcPayloadBytes,10000,2);
feature22 = FOREACH VerticesByDest GENERATE ave(DestPayloadBytes,10000,2);
feature23 = FOREACH VerticesByDest GENERATE var(DestPayloadBytes,10000,2);
feature24 = FOREACH VerticesByDest GENERATE ave(SrcPacketCount,10000,2);
feature25 = FOREACH VerticesByDest GENERATE var(SrcPacketCount,10000,2);
feature26 = FOREACH VerticesByDest GENERATE ave(DestPacketCount,10000,2);
feature27 = FOREACH VerticesByDest GENERATE var(DestPacketCount,10000,2);
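Each line of Listing A.1 applies a windowed operator (ave or var) to one netflow field, keyed by the IP that defines the vertex. To make the feature computation concrete, the following is a minimal Java sketch of a per-key sliding-window mean and variance. It assumes the second argument of ave/var (10000) is the window length in netflows and treats the third argument as an implementation-specific parameter of SAM's operators; it keeps the raw window in a deque purely for clarity, so it is an illustration rather than SAM's actual single-pass, bounded-memory implementation.

// Illustrative only: exact per-key sliding-window mean and variance.
// Assumption: the window is the last N observations of a field for a given key.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SlidingFeature {
  private final int windowSize;
  private final Map<String, Deque<Double>> windows = new HashMap<>();
  private final Map<String, Double> sums = new HashMap<>();
  private final Map<String, Double> sumSquares = new HashMap<>();

  public SlidingFeature(int windowSize) { this.windowSize = windowSize; }

  /** Add one observation (e.g., SrcTotalBytes) for a vertex key (e.g., SourceIp). */
  public void add(String key, double value) {
    Deque<Double> w = windows.computeIfAbsent(key, k -> new ArrayDeque<>());
    w.addLast(value);
    sums.merge(key, value, Double::sum);
    sumSquares.merge(key, value * value, Double::sum);
    if (w.size() > windowSize) {            // evict the oldest observation
      double old = w.removeFirst();
      sums.merge(key, -old, Double::sum);
      sumSquares.merge(key, -old * old, Double::sum);
    }
  }

  /** Windowed mean, analogous to ave(field, N, k) for this key. */
  public double ave(String key) {
    Deque<Double> w = windows.get(key);
    return (w == null || w.isEmpty()) ? 0.0 : sums.get(key) / w.size();
  }

  /** Windowed variance, analogous to var(field, N, k) for this key. */
  public double var(String key) {
    Deque<Double> w = windows.get(key);
    if (w == null || w.isEmpty()) return 0.0;
    double mean = ave(key);
    return sumSquares.get(key) / w.size() - mean * mean;  // E[X^2] - E[X]^2
  }
}

In SAM itself these quantities are maintained in a single pass with bounded memory per key, which the exact deque above deliberately does not attempt to replicate.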

A.2 Apache Flink Triangle Implementation

This is the full Apache Flink program used to compare against the SAL triangle query as discussed in Section 7.2. We present the Netflow class, the NetflowSource class, and the actual benchmark code. The Netflow class represents the VAST netflow format [201]. NetflowSource extends the RichParallelSourceFunction class, which allows the generation of netflow data across the Flink cluster. Only the benchmark code was counted in the total line count for the Apache Flink implementation.

Listing A.2: Apache Flink Netflow Class

package edu.cu.flink.benchmarks;

public class Netflow {
  public int samGeneratedId;
  public int label;
  public double timeSeconds;
  public String parseDate;
  public String dateTimeString;
  public String protocol;
  public String protocolCode;
  public String sourceIp;
  public String destIp;
  public int sourcePort;
  public int destPort;
  public String moreFragments;
  public int countFragments;
  public double durationSeconds;
  public long srcPayloadBytes;
  public long destPayloadBytes;
  public long srcTotalBytes;
  public long destTotalBytes;
  public long firstSeenSrcPacketCount;
  public long firstSeenDestPacketCount;
  public int recordForceOut;

  public Netflow(int samGeneratedId,
                 int label,
                 double timeSeconds,
                 String parseDate,
                 String dateTimeString,
                 String protocol,
                 String protocolCode,
                 String sourceIp,
                 String destIp,
                 int sourcePort,
                 int destPort,
                 String moreFragments,
                 int countFragments,
                 double durationSeconds,
                 long srcPayloadBytes,
                 long destPayloadBytes,
                 long srcTotalBytes,
                 long destTotalBytes,
                 long firstSeenSrcPacketCount,
                 long firstSeenDestPacketCount,
                 int recordForceOut)
  {
    this.samGeneratedId = samGeneratedId;
    this.label = label;
    this.timeSeconds = timeSeconds;
    this.parseDate = parseDate;
    this.dateTimeString = dateTimeString;
    this.protocol = protocol;
    this.protocolCode = protocolCode;
    this.sourceIp = sourceIp;
    this.destIp = destIp;
    this.sourcePort = sourcePort;
    this.destPort = destPort;
    this.moreFragments = moreFragments;
    this.countFragments = countFragments;
    this.durationSeconds = durationSeconds;
    this.srcPayloadBytes = srcPayloadBytes;
    this.destPayloadBytes = destPayloadBytes;
    this.srcTotalBytes = srcTotalBytes;
    this.destTotalBytes = destTotalBytes;
    this.firstSeenSrcPacketCount = firstSeenSrcPacketCount;
    this.firstSeenDestPacketCount = firstSeenDestPacketCount;
    this.recordForceOut = recordForceOut;
  }

  public String toString()
  {
    return timeSeconds + ", " + sourceIp + ", " + destIp;
  }
}

Listing A.3: Apache Flink NetflowSource Class

package edu.cu.flink.benchmarks;

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.functions.source.*;
import org.apache.flink.streaming.api.watermark.Watermark;

import java.time.Instant;
import java.util.Random;

/**
 * This creates netflows from a pool of IPs.
 */
public class NetflowSource extends RichParallelSourceFunction<Netflow> {

  /// The number of netflows to create.
  private int numEvents;

  /// How many netflows created.
  private int currentEvent;

  /// How many unique IPs in the pool
  private int numIps;

  /// Used to create the netflows with randomly selected ips
  private Random random;

  /// Time from start of run
  private double time = 0;

  /// The time in seconds between netflows
  private double increment = 0.1;

  /// When the first netflow was created
  private double t1;

  public int samGeneratedId;
  public int label;
  public double timeSeconds;
  public String parseDate;
  public String dateTimeString;
  public String protocol;
  public String protocolCode;
  public String sourceIp;
  public String destIp;
  public int sourcePort;
  public int destPort;
  public String moreFragments;
  public int countFragments;
  public double durationSeconds;
  public long srcPayloadBytes;
  public long destPayloadBytes;
  public long srcTotalBytes;
  public long destTotalBytes;
  public long firstSeenSrcPacketCount;
  public long firstSeenDestPacketCount;
  public int recordForceOut;

  /**
   * @param numEvents The number of netflows to produce.
   * @param numIps The number of ips in the pool.
   * @param rate The rate of netflows produced in netflows/s.
   */
  public NetflowSource(int numEvents, int numIps, double rate)
  {
    this.numEvents = numEvents;
    currentEvent = 0;
    this.numIps = numIps;
    this.increment = 1 / rate;
    //this.random = new Random();

    label = 0;
    parseDate = "parseDate";
    dateTimeString = "dateTimeString";
    protocol = "protocol";
    protocolCode = "protocolCode";
    sourcePort = 80;
    destPort = 80;
    moreFragments = "moreFragments";
    countFragments = 0;
    durationSeconds = 0.1;
    srcPayloadBytes = 10;
    destPayloadBytes = 10;
    srcTotalBytes = 15;
    destTotalBytes = 15;
    firstSeenSrcPacketCount = 5;
    firstSeenDestPacketCount = 5;
    recordForceOut = 0;
  }

  @Override
  public void run(SourceContext<Netflow> out) throws Exception
  {
    RuntimeContext context = getRuntimeContext();
    int taskId = context.getIndexOfThisSubtask();

    if (currentEvent == 0) {
      t1 = ((double) Instant.now().toEpochMilli()) / 1000;
      this.random = new Random(taskId);
    }

    while (currentEvent < numEvents)
    {
      long t = Instant.now().toEpochMilli();
      double currentTime = ((double) t) / 1000;
      if (currentEvent % 1000 == 0) {
        double expectedTime = currentEvent * increment;
        double actualTime = currentTime - t1;
        System.out.println("Expected time: " + expectedTime +
                           " Actual time: " + actualTime);
      }

      double diff = currentTime - t1;
      if (diff < currentEvent * increment)
      {
        long numMilliseconds = (long)(currentEvent *
                                      increment - diff) * 1000;
        Thread.sleep(numMilliseconds);
      }

      int source = random.nextInt(numIps);
      int dest = random.nextInt(numIps);
      String sourceIp = "node" + source;
      String destIp = "node" + dest;
      Netflow netflow = new Netflow(currentEvent, label, time,
                                    parseDate, dateTimeString,
                                    protocol, protocolCode,
                                    sourceIp, destIp, sourcePort,
                                    destPort, moreFragments,
                                    countFragments,
                                    durationSeconds, srcPayloadBytes,
                                    destPayloadBytes, srcTotalBytes,
                                    destTotalBytes, firstSeenSrcPacketCount,
                                    firstSeenDestPacketCount,
                                    recordForceOut);
      out.collectWithTimestamp(netflow, t);
      out.emitWatermark(new Watermark(t));
      currentEvent++;
      time += increment;
    }
  }

  @Override
  public void cancel()
  {
    this.currentEvent = numEvents;
  }
}

Listing A.4: Apache Flink Triangle Benchmark

package edu.cu.flink.benchmarks;

import org.apache.commons.cli.*;
import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.*;
import org.apache.flink.streaming.api.environment.*;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.assigners.*;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class BenchmarkTrianglesNetflow {

  private static class SourceKeySelector
    implements KeySelector<Netflow, String>
  {
    @Override
    public String getKey(Netflow edge) {
      return edge.sourceIp;
    }
  }

  private static class DestKeySelector
    implements KeySelector<Netflow, String>
  {
    @Override
    public String getKey(Netflow edge) {
      return edge.destIp;
    }
  }

  private static class LastEdgeKeySelector
    implements KeySelector<Netflow, Tuple2<String, String>>
  {
    @Override
    public Tuple2<String, String> getKey(Netflow e1)
    {
      return new Tuple2<String, String>(e1.destIp, e1.sourceIp);
    }
  }

  private static class Triad
  {
    Netflow e1;
    Netflow e2;

    public Triad(Netflow e1, Netflow e2) {
      this.e1 = e1;
      this.e2 = e2;
    }

    public String toString()
    {
      String str = e1.toString() + " " + e2.toString();
      return str;
    }
  }

  /**
   * Key selector that returns a tuple with the source of the first
   * edge and the destination of the second edge.
   */
  private static class TriadKeySelector
    implements KeySelector<Triad, Tuple2<String, String>>
  {
    @Override
    public Tuple2<String, String> getKey(Triad triad)
    {
      return new Tuple2<String, String>(triad.e1.sourceIp, triad.e2.destIp);
    }
  }

  private static class Triangle
  {
    Netflow e1;
    Netflow e2;
    Netflow e3;

    public Triangle(Netflow e1, Netflow e2, Netflow e3)
    {
      this.e1 = e1;
      this.e2 = e2;
      this.e3 = e3;
    }

    public String toString()
    {
      String str = e1.toString() + " " + e2.toString() + " " +
        e3.toString();
      return str;
    }
  }

  private static class EdgeJoiner
    extends ProcessJoinFunction<Netflow, Netflow, Triad>
  {
    private double queryWindow;

    public EdgeJoiner(double queryWindow)
    {
      this.queryWindow = queryWindow;
    }

    @Override
    public void processElement(Netflow e1, Netflow e2,
                               Context ctx, Collector<Triad> out)
    {
      if (e1.timeSeconds < e2.timeSeconds) {
        if (e2.timeSeconds - e1.timeSeconds <= queryWindow) {
          out.collect(new Triad(e1, e2));
        }
      }
    }
  }

  private static class TriadJoiner
    extends ProcessJoinFunction<Triad, Netflow, Triangle>
  {
    private double queryWindow;

    public TriadJoiner(double queryWindow)
    {
      this.queryWindow = queryWindow;
    }

    @Override
    public void processElement(Triad triad, Netflow e3,
                               Context ctx, Collector<Triangle> out)
    {
      if (triad.e2.timeSeconds < e3.timeSeconds) {
        if (e3.timeSeconds - triad.e1.timeSeconds <= queryWindow) {
          out.collect(new Triangle(triad.e1, triad.e2, e3));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {

    Options options = new Options();
    Option numNetflowsOption = new Option("nn", "numNetflows", true,
      "Number of netflows to create per source.");
    Option numIpsOption = new Option("nip", "numIps", true,
      "Number of ips in the pool.");
    Option rateOption = new Option("r", "rate", true,
      "The rate that netflows are generated.");
    Option numSourcesOption = new Option("ns", "numSources", true,
      "The number of netflow sources.");
    Option queryWindowOption = new Option("qw", "queryWindow", true,
      "The length of the query in seconds.");
    Option outputFileOption = new Option("out", "outputFile", true,
      "Where the output should go.");
    Option outputNetflowOption = new Option("net", "outputNetflow", true,
      "Where the netflows should go (optional).");
    Option outputTriadOption = new Option("triad", "outputTriads", true,
      "Where the triads should go (optional).");

    numNetflowsOption.setRequired(true);
    numIpsOption.setRequired(true);
    rateOption.setRequired(true);
    numSourcesOption.setRequired(true);
    queryWindowOption.setRequired(true);
    outputFileOption.setRequired(true);

    options.addOption(numNetflowsOption);
    options.addOption(numIpsOption);
    options.addOption(rateOption);
    options.addOption(numSourcesOption);
    options.addOption(queryWindowOption);
    options.addOption(outputFileOption);
    options.addOption(outputNetflowOption);
    options.addOption(outputTriadOption);

    CommandLineParser parser = new DefaultParser();
    HelpFormatter formatter = new HelpFormatter();
    CommandLine cmd = null;

    try {
      cmd = parser.parse(options, args);
    } catch (ParseException e) {
      System.out.println(e.getMessage());
      formatter.printHelp("utility-name", options);

      System.exit(1);
    }

    int numEvents = Integer.parseInt(cmd.getOptionValue("numNetflows"));
    int numIps = Integer.parseInt(cmd.getOptionValue("numIps"));
    double rate = Double.parseDouble(cmd.getOptionValue("rate"));
    int numSources = Integer.parseInt(cmd.getOptionValue("numSources"));
    double queryWindow = Double.parseDouble(
      cmd.getOptionValue("queryWindow"));
    String outputFile = cmd.getOptionValue("outputFile");
    String outputNetflowFile = cmd.getOptionValue("outputNetflow");
    String outputTriadFile = cmd.getOptionValue("outputTriads");

    // get the execution environment
    final StreamExecutionEnvironment env =
      StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(numSources);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    NetflowSource netflowSource = new NetflowSource(numEvents, numIps,
                                                    rate);
    DataStreamSource<Netflow> netflows = env.addSource(netflowSource);

    if (outputNetflowFile != null) {
      netflows.writeAsText(outputNetflowFile,
                           FileSystem.WriteMode.OVERWRITE);
    }

    DataStream<Triad> triads = netflows
      .keyBy(new DestKeySelector())
      .intervalJoin(netflows.keyBy(new SourceKeySelector()))
      .between(Time.milliseconds(0),
               Time.milliseconds((long) queryWindow * 1000))
      .process(new EdgeJoiner(queryWindow));

    if (outputTriadFile != null) {
      triads.writeAsText(outputTriadFile,
                         FileSystem.WriteMode.OVERWRITE);
    }

    DataStream<Triangle> triangles = triads
      .keyBy(new TriadKeySelector())
      .intervalJoin(netflows.keyBy(new LastEdgeKeySelector()))
      .between(Time.milliseconds(0),
               Time.milliseconds((long) queryWindow * 1000))
      .process(new TriadJoiner(queryWindow));

    triangles.writeAsText(outputFile, FileSystem.WriteMode.OVERWRITE);

    env.execute();
  }

}
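For reference, a hypothetical invocation of this benchmark is shown below. The option flags (-nn, -nip, -r, -ns, -qw, -out) come directly from the command-line parser in Listing A.4 and the class name from its package declaration; the jar name and the parameter values are assumptions chosen only for illustration, with the job submitted through Flink's standard flink run client.

flink run -c edu.cu.flink.benchmarks.BenchmarkTrianglesNetflow flink-benchmarks.jar \
  -nn 1000000 -nip 10000 -r 10000 -ns 16 -qw 10 -out /tmp/triangles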