UC Berkeley Electronic Theses and Dissertations
Title Scalable Network Forensics
Permalink https://escholarship.org/uc/item/11g2w9sz
Author Vallentin, Matthias
Publication Date 2016
Peer reviewed|Thesis/dissertation
Scalable Network Forensics
by Matthias Vallentin
A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy
in
Computer Science
in the
Graduate Division
of the
University of California, Berkeley
Committee in charge:
Professor Vern Paxson, Chair
Professor Michael Franklin
Professor David Brillinger
Spring 2016

Scalable Network Forensics

Copyright 2016 by Matthias Vallentin
Abstract
Scalable Network Forensics
by
Matthias Vallentin
Doctor of Philosophy in Computer Science
University of California, Berkeley
Professor Vern Paxson, Chair
Network forensics and incident response play a vital role in site operations, but for large networks they pose daunting difficulties: coping with the ever-growing volume of activity and the resulting logs. On the one hand, logging sources can generate tens of thousands of events per second, which a system supporting comprehensive forensics must somehow continually ingest. On the other hand, operators greatly benefit from interactive exploration of disparate types of activity when analyzing an incident, scrambling to ferret out answers to key questions: How did the attackers get in? What did they do once inside? Where did they come from? What activity patterns serve as indicators reflecting their presence? How do we prevent this attack in the future? Operators can only answer such questions by drawing upon high-quality descriptions of past activity recorded over extended time.

A typical analysis starts with a narrow piece of intelligence, such as a local system exhibiting questionable behavior, or a report from another site describing an attack they detected. The analyst then tries to locate the described behavior by examining past activity, often cross-correlating information of different types to build up additional context. Frequently, this process in turn produces new leads to explore iteratively (“peeling the onion”), continuing and expanding until ultimately the analyst converges on as complete an understanding of the incident as they can extract from the available information.

This process, however, remains manual and time-consuming, as no single storage system efficiently integrates the disparate sources of data that investigations often involve. While standard Security Information and Event Management (SIEM) solutions aggregate logs from different sources into a single database, their data models omit crucial semantics, and they struggle to scale to the data rates that large-scale environments require.
In this thesis we present the design, implementation, and evaluation of VAST (Visibility Across Space and Time), a distributed platform for high-performance network forensics and incident response that provides both continuous ingestion of voluminous event streams and interactive query performance. VAST offers a type-rich data model to avoid loss of critical semantics, allowing operators to express activity directly. Similarly, strong typing persists throughout the entire system, enabling type-specific optimization at lower levels while retaining type safety during querying for a less error-prone interaction.

A central contribution of this work concerns our novel type-specific indexes that directly support the type’s common operations, e.g., top-k prefix search for IP addresses. We show that composition of these indexes allows for a powerful and unified approach to fine-grained data localization, which directly supports the workflows of security investigators. VAST leverages a native implementation of the actor model to scale both intra-machine, across available CPU cores, and inter-machine, over a cluster of commodity systems.

Our evaluation with real-world log and packet data demonstrates the system’s potential to support interactive exploration at a level beyond what current systems offer. We release VAST as free open-source software under a permissive license.
To my parents
Contents
Contents
List of Figures
List of Tables
List of Algorithms

1 Introduction
  1.1 Use Cases
    1.1.1 Incident Response
    1.1.2 Network Troubleshooting
    1.1.3 Insider Abuse
  1.2 Goals
    1.2.1 Interactivity
    1.2.2 Scalability
    1.2.3 Expressiveness
  1.3 Outline

2 Background
  2.1 Literature Search
  2.2 Related Work
    2.2.1 Traditional Databases
    2.2.2 Modern Data Stores
    2.2.3 Distributed Computing
    2.2.4 Network Forensics Domain
  2.3 High-Level Message Passing
    2.3.1 Actor Model
    2.3.2 Implementations
  2.4 Accelerating Search
    2.4.1 Hash and Tree Indexes
    2.4.2 Inverted and Bitmap Indexes
    2.4.3 Space-Time Trade-off
    2.4.4 Composition

3 Architecture
  3.1 Data Model
    3.1.1 Type System
    3.1.2 Query Language
  3.2 Components
    3.2.1 Import
    3.2.2 Archive
    3.2.3 Index
    3.2.4 Export
  3.3 Deployment
    3.3.1 Component Distribution
    3.3.2 Fault Tolerance
  3.4 Summary

4 Implementation
  4.1 Message Passing Challenges
    4.1.1 Adapting to Load Fluctuations with Flow Control
    4.1.2 Resolving Routing Inefficiencies with Direct Connections
  4.2 Composable and Type-Rich Indexing
    4.2.1 Boolean Index
    4.2.2 Integral Index
    4.2.3 Floating Point Index
    4.2.4 Duration & Time Index
    4.2.5 String Index
    4.2.6 IP Address Index
    4.2.7 Subnet Index
    4.2.8 Port Index
    4.2.9 Container Indexes
  4.3 Query Processing
    4.3.1 Expression Normalization
    4.3.2 Evaluating Expressions
    4.3.3 Finite State Machines
  4.4 Code Base

5 Evaluation
  5.1 Measurement Infrastructure
    5.1.1 Machines
    5.1.2 Data Sets
  5.2 Correctness
  5.3 Throughput
  5.4 Latency
  5.5 Scaling
  5.6 Storage
    5.6.1 Archive Compression
    5.6.2 Index Overhead
  5.7 Summary

6 Conclusion
  6.1 Summary
  6.2 Outlook
    6.2.1 Systems Challenges
    6.2.2 Algorithmic Challenges

Bibliography

A Multi-Component Range Queries
List of Figures
1.1 Thesis structure

2.1 A summary of research on network forensics over the last decade
2.2 The actor model
2.3 CAF performance compared to other actor model implementations
2.4 Efficient access of the base data through an index
2.5 Juxtaposition of inverted and bitmap indexes
2.6 Design choices to map keys to identifier sets
2.7 Illustrating how different encoding schemes work
2.8 Equality, range, and interval coding
2.9 Value decomposition example

3.1 The type system of VAST
3.2 High-level system architecture of VAST
3.3 VAST deployment styles
3.4 Event ingestion overview
3.5 Event ingestion at the archive
3.6 Index architecture
3.7 Index lookup
3.8 Predicate cache at a partition
3.9 Historic query architecture
3.10 Candidate check optimization
3.11 Continuous query architecture
3.12 Client-server deployment
3.13 Variation of client-server deployment for interactive processing
3.14 Cluster deployment showing distributed ingestion

4.1 Flow control signaling
4.2 Message routing in CAF
4.3 IEEE 754 double precision floating-point
4.4 IEEE 754 floating-point index
4.5 Appending to the string index
4.6 Expression normalization: hoisting
4.7 Expression normalization: negation normal form (NNF) and negation absorbing
4.8 The query state machine

5.1 Throughput measured at various points during import
5.2 Indexing runtime for event batches of size 65,536
5.3 CPU utilization when indexing Bro events
5.4 CPU utilization per actor instance during import
5.5 Query pipeline reflecting various stages of single-node execution
5.6 Index latency (full computation of hits) as a function of cores
5.7 CPU utilization during query execution
5.8 CPU utilization per actor instance during export
5.9 Per-node CPU utilization during ingestion
5.10 Index completion latency as a function of nodes
5.11 PCAP: Compression vs. Savings
5.12 Bro: Compression vs. Savings
5.13 PCAP: Decompression vs. Savings
5.14 Bro: Decompression vs. Savings
5.15 PCAP: Compression vs. Decompression
5.16 Bro: Compression vs. Decompression
5.17 PCAP: Compression vs. Decompression
5.18 Bro: Compression vs. Decompression

A.1 Evaluation tree for the algorithm RangeEval-Opt
List of Tables
2.1 Related work evaluated with respect to network forensics
2.2 Comparison of popular actor model implementations
2.3 Our notation to formally describe inverted and bitmap indexes
2.4 Established optimality results for coding schemes and query classes
2.5 Enumeration of bit vector compression algorithms
2.6 Lookup algorithms for equality, range, and interval coding

3.1 VAST’s query language
3.2 Potential data structures for meta indexes over event data

4.1 Summary of append and lookup operations on high-level indexes
4.2 Summary of append and lookup operations on high-level indexes

5.1 Data sets used for our evaluation
5.2 Test queries for throughput and latency evaluation
5.3 Benchmark of various compression algorithms
5.4 Storage overhead relative to the base data
List of Algorithms
1 Assignment of IDs to a batch of events
2 identifier serving requests for IDs
3 StringLookup: looking up a value in the string index
A.1 RangeEval-Opt
Acknowledgments
I would like to express my sincere gratitude to a number of people who made this work possible. Their continuous support, feedback, and encouragement provided the nourishing environment for this dissertation to take shape.

My heartfelt appreciation goes to Vern Paxson. His deeply respectful and tactful treatment of his students made having him as my advisor an outstanding experience. I feel incredibly fortunate to have had countless opportunities to interact with him. His visionary ideas, coupled with a unique sense for abstraction and analytical rigor, never cease to inspire me.

Additionally, I am delighted to have known Robin Sommer as a colleague, mentor, and friend for over a decade. He patiently listened to all my fledgling architectural ideas and helped separate the wheat from the chaff. His extensive experience and invaluable advice had a profound impact on this project. My gratitude extends to Seth Hall, whose unstoppable enthusiasm and desire to reap the fruits of this work fueled my motivation and grounded it in an operational context. I am also indebted to Dominik Charousset and his work on the C++ Actor Framework. Our numerous technical discussions forged an enjoyable, organic collaboration.

Several others played a major role in completing this work. My full and genuine thanks to David Brillinger and Michael Franklin for serving on my dissertation committee. Their advice and comments significantly strengthened this work. I have received great support from the International Computer Science Institute (ICSI). In particular, I am grateful for my two fabulous office mates, Johanna Amann and Mobin Javed, who always had an open ear for my unformed ideas. I thank the whole ICSI staff for making this such a pleasant stay, and in particular Maria Eugenia Quintana for her healthy, positive attitude.
Also thanks to Christian Kreibich for our brainstorming sessions on devising apt plots for visualizing data, and to Nicholas Weaver for reliably playing devil’s advocate. Moreover, my thanks go to the operational security team at the Lawrence Berkeley National Laboratory (LBNL), and particularly Aashish Sharma for his real-world perspectives on incident response. Special thanks to Yahel Ben-David for his support on everything wireless, to Keith Lehigh for sharing insights from his immensely large network, to Gregory Bell for his stimulating ideas, and to Mark Allman for coining the term VAST.

Many others impacted this work. In particular, I thank Samir Al-Sheikh, Justin Azoff, Scott Campbell, Vlad Grigorescu, Andreas Reuter, Fabrice Ryba, Robert Schmidt, Thomas Schmidt, Audrey Sillers, Pedro Simoes, Adam Slagell, Vincent Stoffer, and Matthias Wählisch.

This work was supported by National Science Foundation grants 0716640, 1161799, 1237265, 1348077, and by U.S. ARL MURI grant W911NF-09-1-0553.

Finally, I express my deepest gratitude and admiration to my parents and my sister. Without their unconditional love and perpetual support, I would have never reached this point.
Chapter 1
Introduction
One cannot not communicate.
Paul Watzlawick
Large networks never sleep. Computers continuously emit a stream of activity as an artifact of their communication. On the one hand, humans generate this activity, fueled by their deep desire for omnipresent connectivity. Mobile devices aggressively attempt to jump on available wireless networks, retrieving their latest content feeds while sending upstream a wealth of collected user telemetry in order to deliver an even richer, more personalized, and fully captivating user experience. To keep up with these steep user demands, operators of enterprise networks continually expand their backend infrastructure by providing more computational resources, storage capacity, and bandwidth. On the other hand, closed feedback control systems operate independently of human input, relying on a wealth of sensors to steer decision making. Microwaves, cars, and power plants receive more and more networking capabilities to automatically monitor and synchronize state changes. Overall, today’s countless devices form an omnipresent communication network, perpetually relaying information.

The network attack surface increases as a side effect of this endless growth. Operators struggle to defend both a network’s perimeter and its deep interior. The larger the network, the more complex its behavior and the harder it becomes to defend. Once connected to the Internet, any device unwillingly exposes itself to a horde of attackers waiting for the user to make a mistake: by visiting a malicious website, leaving open a vulnerable service, or simply ignoring security indicators which warn about potentially intercepted communication. Attacks constitute an inherent part of large networks. Operators and security staff spend a significant amount of time quarantining infected hosts, notifying administrators, and implementing protective measures to prevent future occurrences of similar attacks.
In addition to the continuous process of establishing secure network operations, security analysts also perform retrospective analysis of attacks. During the investigation of a security incident, the network activity footprint becomes essential data. Analysts rely on it, scrambling to ferret out answers to key questions: How did the attackers get in? What did they do once inside? Where did they come from? What activity patterns serve as indicators reflecting their presence? How do we prevent this attack in the future? Operators can only answer such questions by drawing upon high-quality logs of past activity recorded over extended time.

Incident analysis often starts with a narrow piece of intelligence, typically a local system exhibiting questionable behavior, or a report from another site describing an attack they detected. The analyst then tries to locate the described behavior by examining logs of past activity, often cross-correlating information of different types to build up additional context. Frequently, this process in turn produces new leads to explore iteratively (“peeling the onion”), continuing and expanding until ultimately the analyst converges on as complete an understanding of the incident as they can extract from the available information [8].

This process, however, suffers from fragmentation across space: analysis remains manual and time-consuming, as no single system efficiently integrates the numerous sources of data (e.g., system status messages, firewall alerts, notices from intrusion detection systems, network monitoring logs) that investigations often involve. The lack of uniform analysis procedures forces security analysts to agglomerate relevant activity from disparate sources to construct a coherent picture, with cumbersome and error-prone ad-hoc data processing methodologies. Existing security information and event management (SIEM) solutions aggregate logs from different sources into a single database, but their data models omit crucial type semantics, and they struggle to scale to the data rates that large-scale environments require.
Furthermore, this process suffers from fragmentation across time: existing methods of processing events relating to past activity starkly differ from how analysts express activity occurring in the future. For example, sifting through logs to find an infected machine follows a very different methodology compared to configuring a firewall to block the same incident in the future. This discrepancy prevents analysts from “closing the loop” efficiently: after a successful post-mortem investigation, an analyst must transfer the gained insight to a domain with different idiosyncrasies.

Based on these needs, and drawing upon extensive experience working closely with operational security staff, we set out in this thesis to address the inherent deficiencies of this process: we design and implement VAST (Visibility Across Space and Time), a system to streamline network forensics at scale.
1.1 Use Cases
The term forensic stems from the Latin word forensis, the adjective corresponding to the noun forum. In Roman times, the forum was a marketplace to exchange goods, as well as a social place to give speeches and exchange opinions [100]. During criminal trials, the forum served as a court of justice, hearing and deciding disputes [79, 167]. The adjective forensic therefore has a predominantly juridical connotation today. More broadly, forensics (or forensic science) as a discipline encompasses methods and means to collect evidence during an investigation, often in a criminal context. Digital forensics applies this notion to digital technology and crime scenes [159]. This thesis has network forensics as its central theme, a branch of digital forensics that involves reconstructing the activities which led to a security incident [185].

Our perspectives regarding the practice and requirements for network forensics stem from close contact with operational security staff at the Lawrence Berkeley National Laboratory (LBL) and the Bro [147, 30] community, which includes numerous incident responders and operators from large networks. In the process we have assisted and observed forensic analyses and gained a deeper understanding of prevalent use cases in the domain, which we introduce in the following.
1.1.1 Incident Response

After a security breach occurs, the tasked analyst must quickly isolate the scope and impact of the issue to provide actionable determinations, such as the need to remove connectivity, rebuild systems, and/or reauthenticate users. The analysis often starts with a narrow piece of intelligence, typically a local system exhibiting questionable or clearly problematic behavior, or a report from another site (facing similar threats) of a particular sequence of activity that they identified as associated with a successful attack on their own systems. Sometimes the information is less discriminating, such as a new list of malicious IP addresses, domains, URLs, or malware executables.

Starting with this intelligence, the analyst tries to locate the described behavior by examining logs of past activity. Often this entails cross-correlating entries from multiple logs of different types. Upon locating the activity directly corresponding to the intelligence, the analyst then begins an interactive procedure of gathering additional context associated with the activity, again drawing upon multiple logs, and inspecting the findings for relevance. Not infrequently this context in turn suggests additional exploration to undertake, for example by producing new items of intelligence relating to how an attacker behaved or how to locate additional systems exhibiting similar activity. At this point the analysis iterates (“peeling the onion”), continuing and expanding until ultimately the analyst converges on as complete an understanding of the incident as they can extract from the available information.

To close the loop after completing the investigation of an incident, the analyst ideally codifies the accumulated knowledge (for example, the full set of contextual indicators of a system compromise or the presence of the attacker) and installs rules in the site’s security monitoring to receive notifications if the activity ever occurs in the future.
Doing so may further involve what-if checks on historic log data to gauge the degree to which these rules may result in false positives.
Today, the process of configuring the monitor remains fully separate from the analysis procedure; the analyst must hand-translate the findings from their investigation in order to express their detection in the rule language(s) used by the site’s monitoring.
1.1.2 Network Troubleshooting

Configuration bugs rarely manifest in an obvious manner; rather, operators have to pinpoint the error by sifting through heaps of logs from diverse sources. The more context and perspectives operators can draw upon, the more confidence they develop in the analysis, but the layered, modular structure of network architecture and protocols renders this process a complex task.

When confronted with a high-level user complaint (consider the example of email system failures), the operator typically kicks off the analysis by spot-checking a dashboard with aggregate statistics about the present state. These aggregates cover information across the whole network stack, ranging from link-layer to application-specific details, and ideally exist in various temporal granularities, such as seconds, hours, and days. A natural representation of this data is in the form of time series, allowing operators to apply numerous statistical techniques for data exploration and correlation. For example, an operator may look at aggregates of mail traffic to find a significant drop in connections, and then check DNS requests involving the mail server to notice that the DNS server generates only sporadic replies. Finally, the operator inspects the DNS activity more closely, revealing that the server process entered an out-of-memory crash cycle due to a torrent of requests from infected machines in a student dormitory.

Unlike incident response, which exhibits a bottom-up characteristic in that typical analysis begins with a concrete fact and then turns into a wider search, network troubleshooting resembles a top-down process, starting from abstract symptoms requiring deeper investigation.
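The time-series workflow sketched above can be made concrete with a small example. The event format, per-minute bucketing, and drop threshold below are our own illustrative assumptions, not part of any particular monitoring product:

```python
from collections import Counter
from datetime import datetime, timedelta

def bucket_counts(events, service):
    """Aggregate a connection log into per-minute counts for one service,
    producing the kind of time series an operator would spot-check."""
    counts = Counter()
    for ts, svc in events:
        if svc == service:
            # Truncate each timestamp to its minute bucket.
            counts[ts.replace(second=0, microsecond=0)] += 1
    return [counts[minute] for minute in sorted(counts)]

def significant_drops(series, factor=0.5):
    """Flag buckets whose volume falls below `factor` times the running
    mean of all preceding buckets (a deliberately crude heuristic)."""
    return [i for i in range(1, len(series))
            if series[i] < factor * (sum(series[:i]) / i)]

# Synthetic log: steady SMTP traffic for three minutes, then a collapse.
base = datetime(2016, 3, 1, 12, 0)
events = [(base + timedelta(seconds=2 * i), "smtp") for i in range(90)]
events += [(base + timedelta(minutes=3, seconds=20 * i), "smtp") for i in range(3)]

series = bucket_counts(events, "smtp")
print(series, significant_drops(series))  # [30, 30, 30, 3] [3]
```

An operator noticing such a drop would then pivot to the corresponding DNS or server logs, exactly as in the mail-failure scenario above.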
1.1.3 Insider Abuse

Insider abuse is difficult to apprehend because authorized actions form a policy violation only when analyzed in context. To illustrate, consider an employee accessing a sensitive document on an internal machine, copying it to their laptop, and then exfiltrating it via email from there. While each action in isolation does not appear to reflect malicious activity, the sequence of actions constitutes a policy violation. Investigating potential instances of such attacks often requires reinspecting activity previously deemed benign, again using a “peeling the onion” approach.

A carefully orchestrated insider attack often manifests over long periods of time, with piecemeal extraction of sensitive data to stay under the radar of monitoring systems. From a detection standpoint, this requires relating specific, temporally distant data points. The analyst must stitch together scattered pieces of evidence into a coherent chain of actions. Looking for descriptions of activity that substantiate the case frequently resembles searching for a needle in a haystack. The Snowden leaks have demonstrated the efficacy of a carefully orchestrated exfiltration process. Today’s massive storage systems can archive copious amounts of data, but it remains an open challenge to efficiently access and correlate the subsets relevant to an investigation.
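The detection challenge of relating temporally distant, individually benign actions can be sketched as an ordered-subsequence match over an audit log. The action names and log format below are hypothetical, purely for illustration:

```python
# Each action alone looks benign; only the ordered chain constitutes a
# policy violation. Action names and the log format are made up here.
SUSPECT_CHAIN = ["access_sensitive", "copy_to_laptop", "email_external"]

def find_violations(log):
    """Return users whose actions contain SUSPECT_CHAIN as an ordered,
    not necessarily contiguous, subsequence (however far apart in time)."""
    progress = {}   # user -> index of the next expected action
    violators = []
    for ts, user, action in sorted(log):  # process in chronological order
        i = progress.get(user, 0)
        if i < len(SUSPECT_CHAIN) and action == SUSPECT_CHAIN[i]:
            progress[user] = i + 1
            if progress[user] == len(SUSPECT_CHAIN):
                violators.append(user)
    return violators

log = [
    (1, "alice", "access_sensitive"),
    (2, "bob", "copy_to_laptop"),    # no prior access: chain not started
    (5, "alice", "copy_to_laptop"),
    (9, "alice", "email_external"),  # completes the chain
    (9, "bob", "email_external"),
]
print(find_violations(log))  # ['alice']
```

Note that the matching deliberately ignores the gaps between actions: the events completing the chain may lie weeks apart, which is precisely why an investigation needs efficient access to long-term archives.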
1.2 Goals
Based on these needs, and drawing upon our experience working closely with operational security staff, we formulate three key goals for a system supporting the forensic process.
1.2.1 Interactivity

The potential damage that an attacker can wreak inside an organization grows quickly as a function of time, making fast detection and containment a vital concern. Per §1.1, forensic investigations exhibit a highly iterative workflow, in which analysts repeatedly query a system to inspect various data subsets to support their case. Often, a “taste” of the full query result suffices to make a decision in this triaging phase. We thus postulate query “taste” latencies on the order of seconds for the analysis to remain viable.

Analyst time is a scarce and costly resource. The more cases security staff can solve per unit time, the higher the return on investment. Aside from the economic benefit, an interactive system also boosts productivity [66], a synergy which amplifies the utility of forensic analysts.
1.2.2 Scalability

Large networks generate a torrent of data relevant for post-facto investigations. Archiving and retrieving this data requires a scalable system which efficiently utilizes available resources, both within a single machine as well as across multiple machines. For intra-machine scalability, a system should carefully multiplex its computational tasks over the available CPU cores, as well as schedule I/O accesses to intelligently harness persistent storage devices without incurring high latencies. For inter-machine scalability, a system should distribute its data and computation over multiple machines. Adding a new node to a system should increase its capacity linearly.

But scaling does not only mean growing in size and power; it also means shrinking when the system runs over-provisioned. For example, a system which replicates its components over multiple machines for redundancy or load-balancing could consolidate its footprint by migrating state to a smaller set of machines, allowing the freed capacity to be powered off.
1.2.3 Expressiveness

Data comes in many formats, and some exhibit more structure than others. For example, one event may represent simply the unavailability of a resource as a boolean flag, while another contains detailed, hierarchical information about protocol data (e.g., an HTTP request with query parameters, belonging to a TCP session represented as a connection 4-tuple). Since network forensics encompasses correlation of data from various sources, a supporting system must accommodate and represent them uniformly without losing structural information.

In addition to retaining structure, keeping and assigning type information during data import avoids losing domain-specific semantics, which can prove valuable later at search time. Rich typing also allows analysts to work within their domain, as opposed to spending time translating their workflows to lower-level system primitives. Moreover, a strict type system enables powerful system optimizations. For example, when an analyst looks for an IP address instead of a plain string, type-specific optimizations can speed up the search.
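To make the last point concrete, the following sketch contrasts an untyped substring match with a typed subnet lookup. This is only a demonstration of the principle, using Python's standard ipaddress module rather than VAST's own index machinery, and the log values are invented:

```python
import ipaddress

# Addresses stored as plain strings, as a naive log store might keep them.
logs = ["10.0.0.1", "10.0.0.254", "110.0.0.9", "10.1.2.3"]

# Untyped search: a substring match for the prefix picks up 110.0.0.9 too.
string_hits = [s for s in logs if "10.0." in s]

# Typed search: parse once, then ask a well-defined subnet question.
net = ipaddress.ip_network("10.0.0.0/16")
typed_hits = [s for s in logs if ipaddress.ip_address(s) in net]

print(string_hits)  # ['10.0.0.1', '10.0.0.254', '110.0.0.9'] (false positive)
print(typed_hits)   # ['10.0.0.1', '10.0.0.254']
```

A type-aware index can answer such subnet questions directly rather than via string scans, which hints at the kind of optimization a dedicated IP address index makes possible.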
1.3 Outline
We structure this thesis according to Figure 1.1 and briefly summarize the remaining chapters in the following.
§2: Background. We begin with summarizing the status quo of network forensics, finding a strikingly thin treatment of this field in an academic context. In §2.2, we then branch out to related work in the systems community, to assemble the building blocks necessary to build a scalable system for network forensics. In particular, we look at a high-performance messaging substrate in §2.3 and indexing techniques to accelerate search in §2.4.

§3: Architecture. In this chapter we present the design of our VAST system which enables scalable network forensics. After introducing VAST’s type-rich data model in §3.1, we present each component in detail in §3.2.

§4: Implementation. In this chapter we look under the hood of VAST and highlight key aspects of the implementation. We discuss challenges with the underlying message-passing framework in §4.1 and then shift our attention to high-level indexing in §4.2. Thereafter, we explain in §4.3 how VAST achieves an efficient query processing engine.

§5: Evaluation. In this chapter we evaluate VAST along several dimensions. After introducing our measurement infrastructure and data sets in §5.1, we show how we ensure correctness of operation in §5.2. In §5.3, we examine throughput during data ingestion and indexing. In §5.4, we look at the responsiveness of queries by examining the execution pipeline. We seek to understand intra-machine and inter-machine scaling in §5.5, and finish with a quantification of storage overhead in §5.6.
§6: Conclusion. The last chapter concludes the thesis. We summarize key insights and lessons learned in §6.1. Throughout the course of this dissertation project, numerous fruitful ideas beyond the scope of this thesis emerged, which we sketch in §6.2.
[Figure 1.1 depicts the chapter flow: Introduction (§1) → Background (§2) → Architecture (§3) and Implementation (§4) → Evaluation (§5) → Conclusion (§6).]

Figure 1.1: Thesis structure.
Chapter 2
Background
Technological progress has merely provided us with more efficient means for going backwards.
Aldous Huxley
Before designing and implementing a system for network forensics, we take a close look at existing solutions, analyzing their contributions and shortcomings. In §2.1, we begin with a comprehensive study of the field of network forensics from an academic standpoint. Afterwards, we broaden our view in §2.2 to discuss existing technology for data-intensive applications at scale, summarizing work ranging from traditional databases and modern data stores to frameworks for distributed computation. Then we step down to a lower level of system design and study the communication layer of distributed systems. In §2.3 we focus on flexible message passing abstractions that form the cornerstone of our system, enabling it to scale both within a single machine as well as over a cluster of multiple machines. Efficient search is another pillar of a system which must perform lookups in massive data archives. In §2.4 we summarize apt database technology to answer high-dimensional queries.

2.1 Literature Search
The academic treatment of large-scale network forensics is strikingly thin. We concluded this after comprehensively examining 4,795 papers and 181 presentations of 13 security and systems conferences on forensics (FIRST, DFRWS), security (USENIX Security, ACM CCS, IEEE S&P, NDSS, ACSAC, USENIX HotSec), databases (VLDB), and systems (USENIX LISA, USENIX ATC, SOSP, USENIX NSDI). As we illustrate in Figure 2.1, the last decade of research on network forensics paints a fragmented picture: only occasional interest, even in security-centric venues. Let us walk through the existing landscape.

The Digital Forensic Research Workshop (DFRWS) annual conference primarily covers traditional host-based forensics, with network forensics only appearing occasionally. For example,
Figure 2.1: A summary of research on network forensics over the last decade in relevant venues, based on analysis of 4,795 papers and 181 presentations. Each number reflects the publications per venue relevant to central network forensics issues, which sum up to 13 in total.
the session Network Forensics and Live Response in 2011 covered file carving [22], anonymizing NetFlow data [169], and a framework for network-wide “live” forensics [51], i.e., investigating an incident by looking at state from running machines, as opposed to searching through archives of past data. In the following year, the session Large Scale Forensics emphasized the scaling aspect, but merely at the scope of host-level forensics [83, 68]. The
session Android & Network Forensics in 2015 introduced work on mobile-device network forensics at small scales [194]. Similarly, the Forum of Incident Response and Security Teams (FIRST) hosts annual meetings where field experts give presentations. FIRST covered unified logging in 2005 [171], but the very theme of network forensics did not come up until 2015, and then with a narrow, small-scale focus [98] or a high-level experience report from a proprietary system [101, 27]. Even forensics-centric venues like DFRWS and FIRST do not appear to tackle the underlying systems challenges that arise with network forensics at larger scales. In 2011 and 2014, USENIX Security featured a Forensics Analysis session, but with a focus on single-machine forensics. The Annual Computer Security Applications Conference (ACSAC) in 2005 included work on network forensics, which utilizes evidence graphs for multi-stage attack detection [196]. In 2010, ACSAC covered host-based forensics in one paper, and in 2011, a Situational Awareness session—without an emphasis on network forensics. In 2013, one paper presented techniques to automatically normalize and extract knowledge from enterprise logs (e.g., IP-to-host mappings and communication profiles), as well as support the incident response process by flagging suspicious users based on behavioral analysis [206]. The USENIX HotSec workshop in 2008 featured a Network Forensics track where a concept paper [8], written by colleagues, precisely laments the lack of adequate tooling and proposes guidelines for designing an apt system. In fact, this paper marks the starting point for this thesis project. In 2011, one HotSec paper emphasized the research opportunities in digital forensics, because existing solutions do not translate into readily usable technology [193] or simply do not scale [82].
In 2012, the USENIX Annual Technical Conference (ATC) published work on a key issue in network forensics [181]: the absence of adequate tools to efficiently support the domain-specific workflows (see §2.2.4). In 2015, USENIX ATC brought back the theme, although with a narrower focus on network-level flows [119]. At the USENIX Large Installation System Administration (LISA) conference in 2005, the operationally driven need for a unified representation of authentication log data came up [168], as well as the topic of network awareness, albeit at the level of network packets only [102]. Field practitioners regularly articulate the need for a unified view on activity [49, 171, 82, 191], and we explicitly state this aspect as a necessary design goal for a network forensics system in §1.2. In 2010, LISA brought back network awareness along with a packet-oriented tool for passive service discovery and heuristics-based intrusion detection [21], as well as the theme of unified log analysis [153] and correlation [115]. While these topics relate to important aspects of network forensics, the presented solutions do not support the iterative peeling-the-onion workflow of security analysts. The 2011 USENIX Symposium on Networked Systems Design and Implementation (NSDI) featured research which introduces a high-level framework for network event analysis [191]. The proposed language allows for expressing composable behavior models, which represent a sequence of states. In addition to traditional boolean filter operations, the language supports notions of causal behavior relationships, partial/total ordering, concurrent operations, and value dependencies. Implementing such a framework in a scalable system for interactive forensic analyses remains an open challenge for future work. Overall, our examination of existing work in the field suggests that network forensics has received very selective attention.
The 13 venues we inspected include a total of 13 publications relating to central network forensics questions. We witness several attempts to improve the status quo in this domain, but available solutions fail to meet the systems challenges we postulate in §1.2: network forensics at scale requires both real-time data ingestion as well as an interactive query engine grounded in a flexible data model. Therefore, we now take a closer look at available technology from the database and systems communities, with the goal of identifying architectural building blocks to assemble a system that can meet the requirements of network forensics at scale. In §2.2, we conduct a survey of related work, ranging from databases over modern distributed computing frameworks to research specific to network forensics outside the previously examined venues. Thereafter, we turn to two architectural cornerstones we deem crucial for network forensics at scale: high-performance message passing (§2.3) and indexing to accelerate search (§2.4).

2.2 Related Work
In this section we contextualize related work, evaluating existing solutions with respect to their aptitude for the domain per the use cases in §1.1. Since network forensics involves handling massive amounts of data, the following discussion concentrates on work in databases (§2.2.1 and §2.2.2), distributed systems (§2.2.3), as well as domain-specific solutions (§2.2.4). We summarize the main aspects of the existing landscape with respect to the requirements of network forensics in Table 2.1.
2.2.1 Traditional Databases

Traditional relational database management systems (DBMS) have existed since the 1970s and still find wide application today in scenarios requiring transactional semantics with strong consistency guarantees. These systems use the popular structured query language (SQL) as a universal means to access data, which expresses queries in a flexible, declarative style [37, 106]. Two common DBMS workload distinctions have emerged over time: online transactional processing (OLTP) and online analytical processing (OLAP) systems. OLTP workloads consist of numerous short transactions in the form of inserts, updates, and deletes. System evaluations typically express the performance of OLTP systems in transactions per second. OLAP workloads occur in data warehouses [46] and consist of complex, high-dimensional queries over read-mostly data sets. Data imports often take place periodically in batches, following the extract-transform-load (ETL) paradigm [189]. A common database deployment not only keeps day-to-day operations data (such as employees, salaries, clients, orders) in a
relational DBMS which faces OLTP workloads, but also, for decision support and analytics, a data warehouse, which exhibits OLAP workloads. While DBMS execute historical queries over data that already exists on permanent storage, data stream management systems (DSMS) [35, 87] answer continuous queries that operate on data arriving after the query has been issued. Put differently, a historical query transports the query to the data, whereas a continuous query routes the data through the query. Once the original data has traveled past a continuous query, the engine discards it. Thus, streaming databases do not offer persistence, besides snapshotting query state and meta data. DSMS often rely on synopsis data structures (e.g., HyperLogLog [74, 96], Count-Min Sketch [56], Bloom filters [26, 179]) to succinctly summarize and retain certain aspects of the original data.

Table 2.1: Related work evaluated with respect to its aptitude for the domain of network forensics. The table compares relational DBMS (e.g., PostgreSQL [150], MySQL [138], Oracle [144]), streaming databases (e.g., Borealis [1], STREAM [11], TimeStream [152]), data warehouses (e.g., Hive [183], Greenplum [50], Avatara [203]), NoSQL stores (e.g., Redis [156], MongoDB [137], Riak [160]), MapReduce technologies (e.g., Hadoop [94]), distributed computation frameworks (e.g., Spark [209], DryadLINQ [208]), log aggregators (e.g., splunk [24]), and VAST along generic capabilities (generic computation, stream processing, complex queries, horizontal scalability) and network forensics capabilities (continuous ingestion, rich data model, type safety, interactive search, iterative analysis). ∗ Decoupled from query execution engine. † Only when working set fits into memory.
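To convey the flavor of such synopsis structures, the following sketch shows a minimal Bloom filter in Python. It is a hypothetical toy, not code from any of the cited systems: a Bloom filter trades exactness for space, so membership tests may yield false positives but never false negatives, which suffices for a DSMS to distinguish "definitely not seen" from "possibly seen".

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a fixed-size bit array summarizing set
    membership with k salted hash functions."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting a cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True only if every position is set; added items always match.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("10.0.0.1")
print("10.0.0.1" in bf)   # True: no false negatives
print("10.0.0.2" in bf)   # almost certainly False at this low load
```

A real DSMS would size the bit array and hash count from the expected cardinality and a target false-positive rate; the constants here are arbitrary.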
Per §1.1, in the domain of network forensics analysts gather evidence to arrive at a best possible explanation for an incident. This reasoning involves investigation of immutable descriptions of activity from the past. Modification of data in the OLTP sense does not occur. (However, modification may occur when aging old data into more space-efficient representations.) Therefore, data warehousing and OLAP-style workloads predominate in network forensics. However, conventional ETL systems only support bulk loading, as opposed to continuously integrating data as it arrives. When archiving network activity, there exists no opportunity to interrupt query execution and bulk-load the system: queries must execute alongside a continuous ingestion process. Streaming databases also have relevance for network forensics. While post-facto analysis examines past data via historical queries, incident response further encompasses a preventative aspect, namely to ensure that the investigated security incident or similar attacks can be detected again in the future. To close the loop, the analyst installs the developed historical query as a continuous query, ideally by simply flipping a switch. In this sense, the domain requires a streaming capability in addition to the data warehousing capability. The idea to unify the two analysis styles is not new [158], but existing prototypes do not scale beyond a single machine and do not offer interactive, OLAP-style querying.
2.2.2 Modern Data Stores

Modern data-hungry applications demand high scalability and performance, which traditional DBMS cannot provide due to their monolithic system architecture. NoSQL stores [36] have emerged as a new breed of databases to specifically overcome these limitations by focusing on (i) horizontal scalability, (ii) a simple programmatic interface, typically in the form of key-value access, (iii) weaker consistency than ACID (atomicity, consistency, isolation, durability), commonly referred to as BASE (basically available, soft state, eventually consistent), and (iv) fault tolerance via partitioning and replication. NoSQL stores primarily find application in OLTP workload scenarios [177], but typical network forensics queries exhibit OLAP characteristics: users ask fine-grained high-dimensional queries as opposed to performing sheer key-value lookups. Consequently, we only consider NoSQL stores as a building block of a bigger system, instead of a solution by itself; e.g., in conjunction with indexing where a lookup yields keys to access the base data. Unlike NoSQL stores, data warehouses [46] aim for answering high-dimensional queries over large bodies of immutable data in a timely manner. Leading vendors in this space, such as Oracle, Teradata, SAP [163], and EMC Greenplum [50], address the growing demand for large-scale analytics in real-time with proprietary parallelization techniques. Big Internet companies, such as Google, Facebook, Twitter, and LinkedIn, assemble their big data stack out of a combination of techniques to cover a broad range of use cases [166]. Their ecosystem features dedicated frameworks for cluster management, data storage, and distributed processing—including specific solutions for data warehousing [183, 184, 118, 203, 95].
A particularly notable data warehouse is Google’s Dremel [133], a system which stores semi-structured data in situ in a columnar format, using a SQL interface for ad-hoc queries with interactive response times. Queries execute in a serving tree structure where intermediate nodes aggregate results from their children. This hierarchical form of aggregation enables an efficient, scalable execution platform outside the MapReduce [59] paradigm. Another data warehouse from Google is Mesa [91], which supports real-time data import and query functionality, designed for fault-tolerant, multi-datacenter deployments. The system exhibits a relational data model, supports aggregation at various resolutions, and can handle evolving schemata. Like many large-scale systems from Google, Mesa leverages existing infrastructure: Colossus, the successor of the Google File System (GFS) [85], as distributed filesystem for bulk storage, BigTable [42] for storing meta data, and MapReduce [59] for query execution. VAST shares similarities with the objectives of Dremel and Mesa: answering high-dimensional queries consisting of multiple predicates over large bodies of immutable data. While Mesa stores its data in tables and relies on indexes primarily to accelerate seek times, VAST exclusively uses indexes to compute query results. This improves latency, but comes at the cost of a limited query language imposed by the index operations. Vertica offers a commercial version [116] of the column store C-Store [176]. Like Mesa, the system aims to support both analytic real-time workloads and transactional updates. The open-source data warehouse Druid [205] also relies on columnar storage. To accelerate filter lookups, Druid additionally uses indexes. The system architecture places specific functionality into dedicated nodes responsible for, e.g., data ingestion, historical queries, and client interaction.
Unlike Druid, VAST separates deployment from program logic, which allows for flexible operation in a single operating system process or over a cluster of machines. VAST also relies on indexes to accelerate search, but does not decompose the base data into columnar format. Instead, VAST stores the base data in compressed blocks in a key-value store. Although this design provides only row-based access and not arbitrary query execution, it simplifies sharing of the base data with other applications. In the future, we may investigate decomposing the base data into columnar format or interfacing it with other storage layers. ElasticSearch [69] is a commercial, document-oriented database built on top of Apache Lucene [126], which provides a full-text inverted index (see §2.4.2). ElasticSearch effectively wraps Lucene behind a RESTful [73] API. Partitioning and replication provide horizontal scaling and fault-tolerance. Since all data arrives in JSON format, ElasticSearch uses “mappings” to define a schema that translates JSON to Lucene data types. Similarly, Solr [173] builds on top of Lucene to provide a distributed indexing solution. As of this writing, Lucene (and thereby ElasticSearch and Solr) does not support IPv6 addresses, which renders it unsuitable for network forensics today. It neither supports schema evolution, nor handles container types. VAST features a flexible, semi-structured data model and has support for IPv6 addresses as well as containers. Moreover, VAST can ingest high-volume streams of input data, ranging from raw network packets to high-level application logs.
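The serving-tree execution that Dremel employs can be pictured with a small sketch: partial results at the leaves are aggregated level by level until a single root result remains. The code below is an illustrative toy under the simplifying assumption of a pure sum aggregate; it does not reflect Dremel’s actual interface.

```python
def serving_tree_sum(leaf_results, fanout=2):
    """Aggregate partial per-leaf results hierarchically: each inner
    node sums the results of up to `fanout` children, repeating until
    a single root value remains."""
    level = list(leaf_results)
    while len(level) > 1:
        level = [sum(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

# Eight leaf servers each report a partial record count; with fanout 2
# the root receives the total after three aggregation rounds.
print(serving_tree_sum([3, 1, 4, 1, 5, 9, 2, 6]))  # 31
```

The appeal of this shape is that no single node ever aggregates more than `fanout` inputs, so the aggregation work parallelizes across the tree.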
2.2.3 Distributed Computing

Aside from database technology, distributed compute architectures provide scalable general-purpose execution platforms to efficiently process large bodies of data. These systems offer data-parallel execution engines spanning multiple nodes. These architectures focus less on sophisticated storage structures, but rather on distributed algorithms. The MapReduce [59] paradigm remains the cornerstone of distributed compute frameworks. Popularized through its open source implementation Hadoop [198], the model enables computation of arbitrary functions over a cluster of commodity machines using two primitives: a map function to generate intermediate key-value pairs, and a reduce function to merge pairs sharing the same key. The runtime orchestrates the distributed execution of these functions, handles data transfers and inter-node communication, and guarantees fault-tolerance by restarting failed jobs on different machines. MapReduce jobs execute off a distributed filesystem [85, 170, 146] in which the runtime also stores the result. This incurs high I/O load during the initial full data scan and when storing the result at the end. Although one can avoid touching the filesystem for intermediate results by relying on in-memory pipelines [55], the fundamentally batch-oriented nature of this execution model remains. In network forensics, a typical workflow involves several iterations over a query until having located the relevant information. With MapReduce-based execution, each query iteration translates into a separate job, which cannot deliver an interactive experience. The availability of Hadoop spawned numerous high-level query engines [148, 139, 183, 23] that offer declarative languages that compile queries into a sequence of MapReduce jobs, and therefore inherit the same limitations. Dryad [105] provides a domain-specific language consisting of higher-level constructs to build an acyclic dataflow graph.
In contrast to MapReduce, where the runtime sets up the dataflow graph, Dryad requires users to specify the low-level communication patterns. The framework also targets batch processing and materializes results to the filesystem. Spark [209] distributes data over the main memory of available machines in a cluster. Once a working set has been pre-loaded into memory, the provided functional primitives allow for efficient data sharing between stages of computation. Spark introduces the notion of a resilient distributed dataset (RDD), a parallel data structure which provides fault-tolerance by tracking the transformations applied to it. In addition to operating on preloaded working sets, Spark also offers a streaming mode in which computations occur in small deterministic batches on top of the RDD model [210]. Spark also offers a Succinct RDD [4], an optimization for search, which stores the base data in situ in compressed, flat files. This representation does not require decompression when searched. Succinct leverages suffix trees internally, which support point, wildcard, and lexicographical lookup on strings. Other data types (e.g., arithmetic, compound) require transformations into strings to maintain a lexicographical ordering. Succinct exhibits high preprocessing costs and modest sequential throughput, rendering it inappropriate for high-volume scenarios such as network forensics, where a continuous stream of data arrives. When the working set fits in memory, Succinct offers
competitive performance, but not when primarily executing off the filesystem. In practice, security analysts can rarely define a working set a priori, which can result in thrashing due to frequently loading and evicting data from memory. For example, assume that data partitions reside on the filesystem sorted by time. Since there can only be one sort order, a lookup along a different dimension (e.g., a spatial aspect, such as an IP address) can exhibit poor locality and spread over numerous partitions. A complex query may even require a full scan of all partitions, in which case distributed in-memory computing provides no benefit over MapReduce execution. Therefore, we envision VAST going hand-in-hand with Spark, where VAST quickly pinpoints a tractable working set and then hands it off to Spark for more complex analysis if needed.
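To make the batch-oriented map/reduce contract discussed above concrete, here is a single-process toy of the two primitives. The flow records and field names are illustrative, and the sketch elides the distribution, shuffling, and fault tolerance that a real runtime such as Hadoop provides.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy single-process MapReduce: apply map_fn to each record,
    group the intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map: record -> (key, value) pairs
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Count connections per destination port over flow-like records
# (field names are made up for illustration).
flows = [
    {"src": "10.0.0.1", "dst_port": 443},
    {"src": "10.0.0.2", "dst_port": 443},
    {"src": "10.0.0.1", "dst_port": 22},
]

result = map_reduce(
    flows,
    map_fn=lambda flow: [(flow["dst_port"], 1)],
    reduce_fn=lambda port, counts: sum(counts),
)
print(result)  # {443: 2, 22: 1}
```

Each iterative refinement of a forensic query would rerun this whole pipeline from scratch, which is exactly the batch-oriented limitation the text describes.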
2.2.4 Network Forensics Domain

In addition to general-purpose systems, there exist several approaches which aim to directly support the inherent workflows of the network forensics domain. Most approaches operate on raw network traffic and offer search capabilities at varying granularity. The Bro network security monitor [147, 30] offers a high-level platform for network analysis along with a type-rich, event-based scripting language. Bro reassembles raw packets into transport-layer byte streams, which protocol analyzers then dissect into fine-grained streams of application-specific events. Users can write handlers for these events to perform arbitrary computation. Bro also ships with a library of scripts which record the protocol activity in detailed log files. In previous work, we developed the Bro cluster [187] to scale the analysis to multi-Gbps links. In this mode, a load-balancer dispatches the stream of packets over a set of backend nodes, such that packets from the same connection arrive at the same node. Since Bro produces only log files and does not come with a persistence component, manual search quickly runs into scalability issues. VAST complements Bro by providing a scalable solution for persistence that can natively ingest its logs. The Time Machine [130] records raw network traffic and builds tree indexes (see §2.4.1) for a limited set of packet header fields. To cope with large traffic volumes, the system tracks connections and foregoes packets belonging to the same flow after the connection byte stream has reached a certain threshold. The choice of tree-based indexes prevents efficient composition of hits. For example, querying both IP source and destination address requires an index which spans both fields. Such a design does not scale to higher dimensions. Similarly, pcapIndex [75] indexes packet headers, but relies on bitmap indexes (see §2.4.2) for better composability.
VAST represents a superset of both the Time Machine and pcapIndex: it supports the same cutoff functionality. In VAST, packets simply constitute a special type of input. FloSIS [119] provides a higher-level interface to bulk traffic storage at the granularity of flows instead of packets. To efficiently access flow data, FloSIS uses a two-stage indexing approach. At the first level, 4 Bloom filters [26] (for the connection 4-tuple) and 2 timestamps (for beginning and end) track whether a data partition does not contain the queried data and can
be skipped. At the second level, sorted arrays of flow meta data provide a logarithmic lookup method to the base data. Similar to the Time Machine and pcapIndex, the system operates exclusively on network traffic and exhibits an architecture only suitable for this form of data. NetStore [86] is a column store also geared towards flow archiving. It includes two indexes to accelerate search: (i) an interval tree to select the appropriate partition based on a temporal constraint, and (ii) inverted indexes over the connection 5-tuple. NetStore can handle insertion rates on the order of 10 K records/second, and exhibits average query latencies on the order of tens of seconds over a data set containing 62 M records, each of which contains 12 fixed-size numeric fields. These performance figures remain too low for an interactive query experience at already moderate data volumes, and the system architecture does not scale beyond single-machine deployments. NET-Fli [76] is a single-machine NetFlow indexer which leverages bitmap indexes to accelerate search. The system comes with a promising (though patented) bit vector encoding scheme, COMPAX. NET-Fli sustains import rates up to 0.5–1.2 M NetFlow records per second (with 12 numeric fields per record), with room for an order-of-magnitude improvement when offloading index construction to a GPU [77]. From a high-level view, VAST exhibits a similar conceptual architecture: the base data resides in an archive, with horizontally partitioned bitmap indexes pointing back into it. However, VAST differs in several salient points. First, VAST integrates this dichotomous archive-index design into a distributed architecture to scale beyond single-machine deployments. Second, rather than targeting only specific data sources, such as packets or NetFlow records, VAST offers a generic data model which then maps to type-specific bitmap index layouts.
Third, we distribute VAST as open-source software under a permissive BSD license free of patents, allowing the community to reuse it as a building block for higher-level applications. The separation of base data into archive and partitioned index also exists in other systems [181]. A notable variation includes a separate meta index to identify the relevant main index partitions to load during a query. This multi-stage indexing resembles multi-resolution indexes [172], except that the top-level index points to partitions as opposed to individual data records. Splunk [24] is a commercial log aggregation system focusing on time-series data. Unlike many other systems, splunk applies typing at search time while internally keeping data as plain strings. As new event data arrives, splunk tokenizes it into keywords and builds inverted indexes over the keywords as well as over event meta data [136]. As a query execution backend, splunk implements its own MapReduce engine which operates on partitions divided by time. The late binding of type information incurs significant performance hits during result materialization, especially when dealing with massive data volumes. The lack of internal typing also prevents type-specific storage and query optimizations. Furthermore, splunk cannot dynamically adapt its use of CPU resources to changes in workload; e.g., users statically configure the number of threads per indexer. VAST follows the opposite approach: strong typing enables type-specific optimizations and provides a type-safe query language to
prevent analysts from making subtle mistakes. In summary, we find that existing work in the field consists of carefully engineered low-level solutions that solve the problem for a specific type of input. However, comprehensive network forensics draws from numerous data sources. We believe a system must not tailor its architecture to a specific data format, and instead should operate on higher-level notions of data types. Some systems exhibit a promising architecture that separates base data and composable indexes. For a scalable system, we must integrate these ideas into a distributed design. In the following, we shift the focus away from holistic systems and discuss the lower-level primitives required to design a scalable distributed system.
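The composability that distinguishes bitmap indexes from tree indexes can be sketched in a few lines: each field gets its own index, and a multi-field query reduces to bitwise operations on the per-field hit vectors. This is an illustrative toy using plain equality encoding without compression (no COMPAX or similar), not VAST’s implementation.

```python
from collections import defaultdict

class BitmapIndex:
    """Toy equality-encoded bitmap index: one bit vector per distinct
    value, where bit i marks whether record i carries that value."""

    def __init__(self, values):
        self.size = len(values)
        self.bitmaps = defaultdict(int)  # value -> bit vector as Python int
        for row, value in enumerate(values):
            self.bitmaps[value] |= 1 << row

    def lookup(self, value):
        return self.bitmaps.get(value, 0)

# Index two fields of four flow records independently.
src = BitmapIndex(["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"])
dst_port = BitmapIndex([443, 22, 22, 443])

# A conjunction across fields is a single bitwise AND of the hit
# vectors; a tree index would need a composite key spanning both fields.
hits = src.lookup("10.0.0.1") & dst_port.lookup(22)
matching_rows = [row for row in range(4) if hits >> row & 1]
print(matching_rows)  # [2]
```

Adding a third or fourth predicate costs only one more AND per field, which is why this scheme scales to high-dimensional queries where composite tree indexes do not.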
2.3 High-Level Message Passing
Network forensics at scale requires processing massive volumes of data in short time spans. To handle the high data ingestion rates while supporting interactive search, a system in this domain must meet ambitious throughput and latency requirements—by resorting to either expensive, special-purpose hardware or a cluster of commodity machines. The latter model often qualifies as the only choice for operators in the face of limited budgets. When designing a distributed system, performance and scale become chief concerns. How well does the system utilize the native parallelism of modern multi-core CPUs? Does it scale up linearly with the number of nodes, and does it also scale down to smaller workloads? By scaling down, we mean low fixed overhead and effortless deployment in small setups. A flexible, efficient architecture exploits all available resources, both within the same machine as well as across multiple instances. Today, developers rely on vastly different techniques to bridge the gap between intra-machine and inter-machine scaling. On the one hand, performance-attentive developers employ low-level concurrency primitives (e.g., tasks, threads, semaphores) to squeeze out maximum performance on a single machine. The implementation results in complex, tightly coupled, and difficult-to-compose systems. On the other hand, deploying software across machines involves network communication in the form of message passing. Socket management and serialization account for a substantial fraction of the code base, with complicated platform-specific event-loop APIs that result in inversion of control: a fragmentation of program logic due to callback registration with the asynchronous runtime. Combining these two styles in a single application creates a complex code base, difficult to maintain and communicate to new contributors. Developers should not have to rely on system-specific idiosyncrasies to harness the available resources of modern hardware and networks.
Instead, a flexible runtime should offer a unified abstraction that allows for expressing concurrent control flow safely, independent of deployment, and without sacrificing performance. In §2.3.1 we present the architectural primitive to achieve this goal, and in §2.3.2 we describe how we select an implementation optimal for our purposes.
Figure 2.2: The actor model [97]. Each vertex represents an actor. All actors execute concurrently, but each instance sequentially processes one message at a time. The dotted frames represent process boundaries. Actors can both communicate within the same process as well as talk to actors in different processes or remote nodes.
2.3.1 Actor Model

The actor model [97] offers a paradigm that unifies concurrency and distribution. In this model, computational entities—called actors—execute independently and in parallel. Using unique, location-independent addresses, actors communicate asynchronously solely via message passing. Because actors do not share state, data races cannot occur by design. Each actor possesses a message queue called a mailbox [5], from which it dequeues and sequentially processes one message at a time. The actor’s behavior determines how to process the next message. In response to a message, the behavior can include any of the following three actions: (i) create (or spawn) new actors, (ii) send messages to other actors, or (iii) designate the behavior to use for the next message in the mailbox. Figure 2.2 visualizes an exemplary topology in the form of a directed graph. A vertex represents an actor instance and an edge between two actors indicates that one actor references another, either by spawning it or sending messages to it. Throughout this thesis, we reference specific actor types in small caps font. A related model of computation is communicating sequential processes (CSP) [99], in which processes communicate with other processes via synchronous channels. As a result, the sender cannot transmit before the receiver is ready. This creates a tighter coupling compared to the asynchronous fire-and-forget semantics of actor messaging. Moreover, CSP emphasizes the channel as opposed to its endpoints: actors have a location-independent address whereas processes remain anonymous. In the context of distributed systems, the focus on endpoints provides a powerful advantage: unlike CSP, the actor model contains a flexible failure propagation model based on monitors, links, and hierarchical supervision trees [13].
These primitives allow for isolating failure domains and implementing local recovery strategies, and thus constitute an essential capability at scale, where component failure is the norm
rather than the exception. For these reasons, we deem the actor model a superior fit for our requirements.
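As a concrete illustration of these semantics, the following minimal Python sketch (hypothetical, not part of any production runtime) models an actor as a FIFO mailbox drained by a single thread; the behavior processes one message at a time and may designate a replacement behavior for the next message. Real runtimes such as Erlang or CAF add addressing, supervision, and network transparency on top.

```python
import queue
import threading

class Actor:
    """Minimal actor: a mailbox plus a behavior, processed sequentially."""

    def __init__(self, behavior):
        self.mailbox = queue.Queue()   # unbounded FIFO mailbox
        self.behavior = behavior       # current message handler
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):
        """Asynchronous fire-and-forget delivery: enqueue and return."""
        self.mailbox.put(msg)

    def _run(self):
        while True:
            msg = self.mailbox.get()   # dequeue one message at a time
            if msg is None:            # poison pill terminates the actor
                return
            # The behavior may spawn actors, send messages, and return a
            # new behavior to use for the next message (or None to keep it).
            new_behavior = self.behavior(self, msg)
            if new_behavior is not None:
                self.behavior = new_behavior
```

Because each mailbox is drained by exactly one thread, no two messages of the same actor are ever processed concurrently, which is what rules out data races on actor-local state.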
2.3.2 Implementations Designing a scalable high-performance architecture for modern distributed systems poses an ambitious set of challenges. The actor model provides an excellent fit for this task, but an implementation of this paradigm must meet the following requirements in order to qualify as a viable choice in practice.
Native Performance An efficient runtime not only scales across machines, but also maximizes resource utilization within a machine. Many actor runtimes execute in an abstract machine. Most notably, Erlang [12] provided the first industrial-strength implementation as a dynamically typed language, which still finds wide application today. But statically typed languages also feature implementations, such as the Akka framework [7] in Java or Scala.
Type Safety The set of all valid messages an actor can process constitutes its interface. An actor model implementation can check message validity either at runtime or at compile time. Runtime type checking facilitates rapid prototyping and enables loosely coupled systems, but requires extensive unit testing to ensure correctness. Conversely, compile-time type checking can guarantee correct operation, but comes at the cost of longer development times and tighter coupling, because the sender must have the receiver’s interface definitions available.
Heterogeneous Targets Actors offer a single abstraction to separate program logic from deployment. When communicating within the same process, the runtime typically delivers a message by enqueuing a light-weight pointer in the receiver’s mailbox. When communicating across process boundaries, the runtime serializes the message, sends it over the network, and deserializes it on the other end before the receiver handles it. In addition to efficient in-memory and network-level messaging, a runtime may also feature GPU actors in the form of OpenCL [143] kernels. This opens up new opportunities to harness the massive SIMD parallelism offered by modern graphics cards, which proves particularly effective for offloading expensive computation, such as index construction [9, 77].
There exist numerous general-purpose implementations of the actor model, each of which makes different design choices in a trade-off space of safety, usability, and performance. Table 2.2 summarizes our comparison of various existing implementations, which we describe according to the features below.
Features: Native Execution; Garbage Collection; Pattern Matching; Copy-On-Write Messaging; Failure Propagation; Dynamic Behaviors; Compile-Time Type Checking; Run-Time Type Checking; Exchangeable Scheduler; Network Actors; GPU Actors.
Implementations (Backend): Erlang [13] (BEAM); Elixir [70] (BEAM); Akka/Scala [7] (JVM); SALSA Lite [62]† (JVM); ActorFoundry [2] (JVM); Pulsar [151] (JVM); Pony [47] (LLVM); Charm++ [109]∗ (C++); Theron [182]∗ (C++); libprocess [124]∗ (C++); CAF [44]∗ (C++).
∗ Via reference counting, as opposed to tracing garbage collection. † Only in SALSA, not SALSA Lite.
Table 2.2: Comparison of popular actor model implementations.
Native Execution. Frameworks that compile down to native instructions can deliver superior performance compared to those running in abstract machines, but cannot provide equally effective fault isolation.
Garbage Collection. Many runtimes which execute in abstract machines run with tracing garbage collection as opposed to reference counting. While this simplifies the programming model, it introduces non-deterministic performance spikes which in practice require excessive tuning to achieve comparable performance.
Pattern Matching. Most functional languages offer pattern matching as a first-class primitive to define functions. The ability to express messages as typed tuples and apply pattern matching to select a specific message handler provides a natural form of multiple dispatch in actor model implementations.
Copy-On-Write Messaging. An actor runtime with copy-on-write messaging offers an environment free of data races, since it avoids by design mutable access to message
contents from multiple threads of execution. Moreover, multiple actor instances can work with the same message instance without incurring a copy when only performing read-only accesses. Data-intensive applications especially benefit from this feature.
Failure Propagation. The flexible failure propagation in the actor model (via monitors, links, and hierarchical supervision trees [13]) proves highly valuable in large-scale distributed systems.
Dynamic Behaviors. By definition of the actor model, an actor can change its behavior as a reaction to a message. This feature facilitates, for example, the implementation of concepts such as finite state machines.
Compile-Time Type Checking. Static type checking catches message type incompatibilities already during the compilation phase. For example, a program would not compile if an actor sends a message that a receiver cannot handle. The compiler needs access to the type system of the actor runtime to enforce the type checks.
Run-Time Type Checking. Run-time type checking offers a weaker form of type safety, where the actor receiving a message determines whether it can handle it. Many library-based solutions offer this guarantee, as it requires less intricate interaction with the host language’s type system.
Exchangeable Scheduler. The actor runtime includes a scheduler which maps the execution of actors to hardware threads. Some applications require high throughput while others demand low latency. A flexible runtime allows for tuning the scheduler to the application’s needs, e.g., adjusting for throughput, latency, or fairness.
Network Actors. The ability to send a message to an actor on a remote machine in the same way as to a local actor creates a network-transparent system which decouples deployment from logic.
GPU Actors. Spawning actors on general-purpose graphics cards allows for offloading compute-intensive tasks.
Backend. The backend describes the target environment or host language of the actor model implementation.
To better understand the performance characteristics of the different implementations, we summarize experiment results from previous work in Figure 2.3. The C++ Actor Framework (CAF) [44] outperforms all other implementations in terms of CPU performance, memory utilization, and scaling. Even message-passing frameworks which operate at lower levels, such as OpenMPI [80], do not exhibit distinguishably better performance, as Figure 2.3(d) illustrates. This shows that a high degree of abstraction does not necessarily impose a performance penalty. Moreover, CAF supports both weakly and strongly typed actors, but
Figure 2.3: CAF performance compared to other actor model implementations [43]. Reproduced with permission. (a) Runtime as a function of cores. (b) Speedup as a function of cores. (c) Memory consumption during execution. (d) Overhead compared to OpenMPI [80]. Figures (a)–(c) illustrate a mixed-case scenario, which involves number factorization while stressing the actor runtime at the same time. Figure (d) shows the overhead of CAF compared to OpenMPI [80] for computing Mandelbrot images.
without introducing tight coupling between sender and receiver. To this end, CAF features a globally checked type system where users register message types. CAF also supports subtyping: partial knowledge of an interface suffices to send messages as long as the used subset does not violate the type system. To the best of our knowledge, only CAF provides a general-purpose actor runtime which subsumes in-memory, network-based, and GPU actors under a single, location-transparent programming model. For these reasons, we deem CAF a superior choice for our requirements. Moreover, we have closely collaborated with the developers since the inception of CAF and continue to do so on a regular basis. Indeed, use cases from VAST motivated several features available in CAF today. This close collaboration allowed us not only to harness a powerful message-passing middle layer, but also to shape it according to our needs. CAF also ships with a
permissive BSD license.
2.4 Accelerating Search
Forensic analysts often sift through data in an explorative fashion (see §1.1). A selective probe here, following a hunch there. “Why did a workstation connect to the customer database at 3 a.m.? Why does the same machine send a torrent of outbound DNS TXT traffic to a machine across the globe? Oh, apparently this happened every night since last week.” The puzzle often reveals itself when putting together all the pieces, where each part can lead to new threads of illuminating evidence. To perform such analyses, enormous volumes of descriptions of activity must be readily searchable. But sequential scans cannot deliver an interactive query experience at scale, as the data size exceeds the aggregate memory capacity even of clusters of machines. Consequently, supporting the inherently iterative analysis style requires a more selective form of data access.
2.4.1 Hash and Tree Indexes An index grants efficient data access through an indirection, at the cost of additional data structures. Common database indexes,1 such as B-trees and hash tables [19, 112], provide a fast entry point into the base data via a search key. These indexes constitute the building blocks of traditional DBMSs, due to their well-understood properties and robustness under various workloads. We illustrate their high-level structure in Figure 2.4. Either a tree or a hash table can yield an entry point into the base data at the bottom, from where sequential seeks navigate through the individual records. However, these traditional indexes do not compose efficiently during search in higher dimensions: they require a total order on the key space, which does not exist for more than one dimension. Even special higher-dimensional indexes still suffer from the “curse of dimensionality,” i.e., index space usage and lookup time quickly grow exponentially as the number of predicates in the query expression increases [20, 81]. In forensic investigations, even simple queries exhibit a high dimensionality. To illustrate, consider the query “give me all outbound DNS requests since last week,” which we can model as a boolean expression with three predicates, A ∧ B ∧ C, where A represents outbound connections, B DNS requests, and C connections with a timestamp within the last week. From there, additional restrictions of the search space seem plausible, e.g., restricting the search to specific IP addresses, UDP ports, or DNS record types. Thus, even simple analyses quickly span multiple dimensions.
1The database community uses “indexes” more frequently than “indices” as plural of “index.” Therefore, we only use “indices” when referring to a set of positions in a sequence, whereas “indexes” when referring to the data structure.
(a) A tree index. (b) A hash table index. Figure 2.4: Efficient access of the base data through an index.
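To make the entry-point idea concrete, this illustrative sketch (names and data are assumptions, not from the thesis) emulates both access paths over base data sorted by key: an exact-match lookup in the spirit of a hash index, and a seek-then-scan in the spirit of a B-tree.

```python
import bisect

# Hypothetical base data: records keyed by timestamp, stored in key order.
base = [(3, "record-3"), (7, "record-7a"), (7, "record-7b"),
        (12, "record-12"), (15, "record-15"), (21, "record-21")]
keys = [k for k, _ in base]

def point_lookup(key):
    """Exact match: jump to the first occurrence of the search key."""
    i = bisect.bisect_left(keys, key)
    out = []
    while i < len(base) and keys[i] == key:
        out.append(base[i][1])
        i += 1
    return out

def range_scan(lo, hi):
    """Tree-style access: seek to lo, then scan sequentially up to hi."""
    i = bisect.bisect_left(keys, lo)
    out = []
    while i < len(base) and keys[i] <= hi:
        out.append(base[i][1])
        i += 1
    return out
```

Both functions share the same pattern the figure depicts: the index yields an entry point, and sequential seeks through the base data do the rest.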
2.4.2 Inverted and Bitmap Indexes An index that does not suffer from the curse of dimensionality is the inverted index [112], the building block of modern information retrieval. The inverted index maps search keys to a set of identifiers representing the unique elements in the base data, as we illustrate in Figure 2.5(a): keys A–D map to lists of identifiers (aka. position lists). An isomorphic data structure is the bitmap index [199, 141], which we depict in Figure 2.5(b). It also maintains a list of identifiers per lookup key, but in the form of bit vectors.2 The position of a 1-bit represents the identifier. In our simplified example, key B maps to the bit vector ⟨100001⟩, which corresponds to the set {0, 5}. In our inverted index example, answering the query expression B ∨ (C ∧ D) translates to retrieving the corresponding position lists L_B = {0, 5}, L_C = {2, 4, 5, 6}, L_D = {2}, and applying set operations on them: L_B ∪ (L_C ∩ L_D) = {0, 2, 5}. Assuming the inverted index maps keys to sorted arrays, the set operations involve merging the arrays. Decades of research went into the question of how to perform these operations efficiently [103, 72, 18, 61, 15]. Similarly, the bitmap index evaluates the expression B ∨ (C ∧ D) by retrieving the bit vectors B_B = ⟨100001⟩, B_C = ⟨0010111⟩, B_D = ⟨0010000⟩, and then computing the bitwise combination B_B | (B_C & B_D). As the bit vector size and number of operations increase, it becomes important to choose an efficient evaluation algorithm [175, 201, 121]. Unlike tree and hash indexes, which point directly into the base data, inverted and bitmap index lookups scale linearly with the number of predicates in an expression, thereby posing an attractive alternative for high-dimensional queries. The isomorphic relationship of inverted and bitmap indexes allows us to treat them interchangeably in higher-level algorithmic reasoning [29]. In fact, there exist hybrid approaches
2The literature often uses the term “bitmap” to refer to a bit vector, i.e., a sequence of bits. We use the term “bitmap” only when referring to a “bitmap index,” a mapping from values to bit vectors.
Figure 2.5: Juxtaposition of inverted and bitmap indexes: two isomorphic data structures which map lookup keys (A–D) to identifier sets representing unique entries in the base data. (a) An inverted index. (b) A bitmap index.
which combine both index types in a single data structure [38]. Currently, VAST implements its algorithms with bitmap indexes only. One reason is evolutionary: we built an early prototype of VAST on top of a specific bitmap indexing library [200], which we later replaced with our own abstractions. Another reason is technical: operations from boolean algebra directly map to native CPU instructions, enabling a uniform algorithmic framework for set intersection, union, and complement. In practice, bit vectors remain in compressed form in memory, as we discuss in the next section, and bitwise operations do not require decompression to perform set operations. In the future, we plan to investigate adaptive and hybrid approaches, but a comparative study goes beyond the scope of this thesis.
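The isomorphism can be demonstrated directly: using the position lists from the example above, the sketch below (with Python integers standing in for uncompressed bit vectors) evaluates B ∨ (C ∧ D) once with set operations and once with bitwise operations.

```python
# Position lists for keys B, C, and D from the running example.
L_B, L_C, L_D = {0, 5}, {2, 4, 5, 6}, {2}

# Inverted-index evaluation: set union and intersection.
result_sets = L_B | (L_C & L_D)

def to_bits(positions):
    """Encode a position list as an integer bitmask (bit i <=> id i)."""
    mask = 0
    for p in positions:
        mask |= 1 << p
    return mask

def to_positions(mask):
    """Decode a bitmask back into a position list."""
    return {i for i in range(mask.bit_length()) if mask >> i & 1}

# Bitmap-index evaluation: the same lookup with bitwise OR and AND.
result_bits = to_bits(L_B) | (to_bits(L_C) & to_bits(L_D))
```

Converting the bitwise result back to positions yields exactly the set-operation result, illustrating why higher-level reasoning can treat the two representations interchangeably.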
2.4.3 Space-Time Trade-off Research on inverted and bitmap indexes strives to reduce space requirements and increase speed, i.e., to improve the space-time trade-off. Reducing space involves decreasing either the cardinality, i.e., the number of distinct values (“columns”), or the size of the identifier sets (“rows”). At the same time, reduced space comes at the cost of time, because the techniques induce extra processing cycles during lookup. In the following, we present a unified discussion of techniques which affect the space-time trade-off. Specifically, we explain binning, coding, and compression. Before doing so, we introduce some common notation and terminology to formalize the previously introduced concepts, which Table 2.3 summarizes.
2.4.3.1 Terminology Let X be a finite domain of values representing the key space in index lookups. Without loss of generality, we restrict X such that |X| ≤ 2^w for some constant w. An identifier set
Notation         Description
x ∈ X            Value x from domain X
B                Bit vector: sequence of 0s and 1s
|B|              The number of 1s in B
B[i]             The i-th bit of B
L ⊂ N₀           Position list
|L|              The number of identifiers in L
W                Inverted index
M                Bitmap index
S                Abstract identifier set: B or L
I                Abstract index: W or M
|I| = N          Index size: total values added
#I = C           Index cardinality: distinct values
0                Identifier set with no identifiers
1                Identifier set with all identifiers
⟨βk, …, β1⟩      k-component base
⟨xk, …, x1⟩      Decomposed x according to β
K^β = ⟨Ik, …, I1⟩  Multi-component index using β
E                Equality-encoded S
R                Range-encoded S
I                Interval-encoded S
A | B            Bitwise OR
A & B            Bitwise AND
A ⊕ B            Bitwise XOR
¬A               Bitwise NOT
A ∨ B            Logical OR
A ∧ B            Logical AND
A ⊕ B            Logical XOR
¬A               Logical NOT
A ∪ B            Union
A ∩ B            Intersection
A Δ B            Symmetric difference
Ā                Complement
EQ               Equality query class
1RQ              One-sided range query class
2RQ              Two-sided range query class
RQ               Range query class: 1RQ ∪ 2RQ
Table 2.3: Our notation to formally describe inverted and bitmap indexes.
S ⊂ N₀ represents a finite set of identifiers which point to external base data (e.g., records, tuples, documents, events). Each value x has associated with it one identifier α ∈ S, which we denote by x^(α) as needed. The representation of S can have two forms in our framework. In the context of inverted indexes, S manifests as a position list L, which simply contains the integer values of S. In the context of bitmap indexes, S manifests as a bit vector B, which is an ordered sequence of 0s and 1s. The size |S| reflects the number of identifiers in S. For position lists, the size translates to the number of elements in the list, whereas for bit vectors it corresponds to the count (aka. Hamming weight, population count, or sideways sum), i.e., the number of 1-bits. For example, the equivalent identifier sets B = ⟨100101⟩ and L = {0, 3, 5} both have size |B| = |L| = 3. Let B[i] extract the i-th bit of a bit vector B of n bits, where 0 ≤ i ≤ n. Then we can convert a bit vector into a position list by L = {α | 0 ≤ α ≤ n ∧ B[α] = 1}. We define two special identifier sets: 0 represents the empty identifier set and 1 the complete identifier set. In terms of bit vectors, they represent a bit vector with all 0s and all 1s, respectively. We define an inverted index W as a mapping from values to position lists, and a bitmap index M as a mapping from values to bit vectors. That is, they both represent a mapping from values to identifier sets. When the context does not require differentiation, we subsume them
under an index I that maps values to identifier sets. The index cardinality3 #I = C refers to the number of distinct values in I, and the index size |I| = N to the total number of values present in I. The two basic primitives of an index I are adding new values and looking up existing values under a logical predicate. Conceptually, adding a new value x^(α) means the index adds α to the identifier set of x and increases |I| by 1, but bumps #I only if x ∉ I beforehand. It is possible to add the same value x multiple times under different identifiers, i.e., the set of all values in an index is given by {x^(αi) | αi ≠ αj ∧ 0 ≤ i ≠ j < |I|}. After adding x for the first time, x ∈ I holds true. A lookup produces an identifier set S describing all x ∈ I for which a given query predicate matches. In particular, we distinguish the following classes of predicates in accordance with the literature [40]:
EQ The class of equality queries concerns predicates involving operators ◦ ∈ {=, ≠}. The point lookup I ◦ x returns the identifier set S = {α | z^(α) ◦ x ∧ z ∈ I}. Examples of EQ-queries include I = 42 and I ≠ 0.
1RQ The class of one-sided range queries describes inequality predicates with respect to the total ordering of the value domain X. 1RQ-queries have the same format as EQ-queries but use the operator set ◦ ∈ {<, ≤, ≥, >}. Examples of 1RQ-queries include I ≤ 42 and I > 0.
2RQ The class of two-sided range queries brackets a value: x ≤ I ≤ y where x ≤ y. A lookup yields the identifier set S = {α | x ≤ z^(α) ≤ y ∧ x ≤ y ∧ z ∈ I}. We can express any 2RQ-query as a conjunction of two 1RQ-queries: x ≤ I ≤ y ≡ x ≤ I ∧ I ≤ y. An example of a 2RQ-query is 1900 ≤ I ≤ 2000.
RQ This class subsumes both 1RQ and 2RQ.
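A toy equality-coded index (an illustrative sketch, mapping each value to its identifier set) shows the query classes in action; note how the 2RQ lookup evaluates as the conjunction of two 1RQ lookups:

```python
import operator

class ToyIndex:
    """Maps each value to its identifier set; |I| and #I as defined above."""
    OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
           "<=": operator.le, ">": operator.gt, ">=": operator.ge}

    def __init__(self):
        self.sets = {}   # value -> identifier set; #I == len(self.sets)
        self.size = 0    # |I|: total number of values added

    def add(self, x, alpha):
        """Add value x under identifier alpha; #I grows only if x is new."""
        self.sets.setdefault(x, set()).add(alpha)
        self.size += 1

    def lookup(self, op, x):
        """EQ or 1RQ lookup: union of identifier sets of matching values."""
        cmp = self.OPS[op]
        out = set()
        for value, ids in self.sets.items():
            if cmp(value, x):
                out |= ids
        return out

    def range_lookup(self, lo, hi):
        """2RQ as a conjunction of two 1RQ lookups: lo <= I and I <= hi."""
        return self.lookup(">=", lo) & self.lookup("<=", hi)
```

Scanning every distinct value per lookup is of course naive; the coding schemes below exist precisely to reduce the number of identifier-set accesses per query class.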
2.4.3.2 Binning Binning reduces the index cardinality by grouping multiple values into a single one. It also discretizes continuous values, such as time or floating-point numbers. A binning function f : X → Z with |X| ≥ |Z| is a surjection, e.g., f : R → Z with f(x) = ⌊x⌋. However, binning introduces false positives in the lookup process, which requires a candidate check against the base data in order to fully resolve ambiguities. Choosing an apt binning strategy depends on the value distribution and access patterns. Basic strategies include equi-width binning, which partitions the value domain into equal-sized intervals, and equi-depth binning, which groups values such that each bin has the same sum of value frequencies. More sophisticated algorithms to compute the bin boundaries can outperform basic strategies by a factor of two in terms of candidate check time [162].
3Graham et al. [88] denote the cardinality of a set A by #A, which inspired us to adopt this notation.
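As an illustration, the sketch below applies equi-width binning with an assumed width of 10 to floating-point base data; a binned lookup first fetches the candidate set for the value’s bin and then runs the candidate check against the base data to weed out false positives.

```python
import math

# Equi-width binning (hypothetical width of 10): a surjective f: R -> Z.
def bin_of(x, width=10.0):
    return math.floor(x / width)

# Hypothetical base data; the identifier of a value is its position.
base = [3.2, 12.7, 19.9, 41.0, 44.4]

# Build the binned index: bin -> identifier set.
index = {}
for alpha, x in enumerate(base):
    index.setdefault(bin_of(x), set()).add(alpha)

def lookup_eq(x):
    """EQ lookup over the binned index with a candidate check."""
    candidates = index.get(bin_of(x), set())
    # The candidate check resolves false positives against the base data.
    return {alpha for alpha in candidates if base[alpha] == x}
```

Here 44.4 lands in the same bin as 41.0, so the bin lookup alone over-approximates the answer; only the candidate check makes the result exact.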
Figure 2.6: Design choices to map keys to identifier sets. Value type, access patterns, and index cardinality influence the choice of data structure (e.g., tree/trie, hash table, sorted array).
A candidate check can easily dominate the entire query execution time, due to the materialization of the additional base data (high I/O costs) and extra post-processing. Therefore, choosing an efficient binning strategy requires careful tuning and domain knowledge, or advanced adaptive algorithms. A more robust method relies on constructing multiple indexes at different resolutions [172], and then performing a lookup over multiple levels.
2.4.3.3 Coding The coding scheme determines how an index incorporates new values (encoding) and looks up existing ones (decoding). Two aspects of coding exist, one concerning the mapping of values to identifier sets and the other the representation of the identifier sets themselves. In general, mapping values to identifier sets requires an associative data structure. Figure 2.6 shows several exemplars: trees, hash tables, or arrays. Only trees and sorted arrays preserve the order of values, which can prove useful when it comes to executing meta queries on keys to select specific subsets of identifier sets, e.g., in range queries, when accessing all identifiers with a common prefix, or in similarity queries. In the following discussion, we focus only on sorted arrays to access identifier sets, because they provide constant-time access by establishing a 1:1 relationship between array indices and each x ∈ X. At first, this approach may seem naive because it quickly becomes prohibitive for high index cardinalities, but binning and multi-component indexes (see below) render it advantageous, because it allows for ignoring the cost of identifier set retrieval and supports all lookup operations due to the total ordering.
Figure 2.7: Illustrating how different coding schemes associate values with identifier sets for C = 10: equality coding maintains one identifier set per value, range coding C − 1 identifier sets each representing a half-open range, and interval coding ⌈C/2⌉ identifier sets covering [x, x + ⌈C/2⌉ − 1]. (a) Equality. (b) Range. (c) Interval. We give a concrete example in Figure 2.8.
To represent identifier sets, the literature distinguishes three main schemes, which we illustrate in Figure 2.7:4
Equality Coding. This scheme associates each value with exactly one identifier set. The index I consists of C identifier sets I = {E0, …, E(C−1)}. Encoding a value x^(α) involves appending α to Ex. EQ-queries take place in constant time, but RQ-queries require merging multiple identifier sets. The first row in Table 2.6 shows the lookup algorithms for equality decoding.
Range Coding. This scheme associates each value with a range of identifier sets. The index I consists of C − 1 identifier sets I = {R0, …, R(C−2)}. Encoding a value x^(α) involves appending α to all Ri where i ≥ x. Without loss of generality, we can always assume the operator ≤, because the following identities allow for computing a lookup under the remaining inequality operators [39]:
I < x ≡ I ≤ x − 1 (2.1)
I > x ≡ ¬(I ≤ x) (2.2)
I ≥ x ≡ ¬(I ≤ x − 1) (2.3)
The second row in Table 2.6 shows the lookup algorithms for range decoding.
Interval Coding. This scheme splits the index into overlapping slices, each of which covers half of the value space. The index I consists of ⌈C/2⌉ identifier sets I = {I^0, …, I^(⌈C/2⌉−1)},
4Hybrid schemes further combine these schemes: equality-range, OREO, and equality-interval coding [40].
Figure 2.8: Equality, range, and interval coding exemplified using bitmap indexes with C = 8. Equality coding requires 8 identifier sets, range coding 7, and interval coding 4 identifier sets.
where I^x covers all values in the interval [x, x + m] with m = ⌈C/2⌉ − 1.5 Encoding a value x^(α) therefore involves appending α to all identifier sets {I^i | x ∈ [i, i + m]}. The third row in Table 2.6 shows the lookup algorithms for interval decoding.
Figure 2.8 illustrates the coding schemes with bitmap index examples for C = 8. For the equality-encoded bitmap index, the value 7^(6) corresponds to a 1-bit in E7 at position 6. For range encoding, the value 3^(1) results in 1-bits at position 1 in all bit vectors {Rz | 3 ≤ z}. For interval coding, the value 0^(5) has only a 1-bit in the bit vector I^0 for the interval [0, 3], because 0 ∉ [z, z + 3] for all z > 0.
Optimality. Each coding scheme tailors its structure to a specific query class. Most prominently, equality coding requires the fewest bit vector scans for EQ, whereas range coding requires the fewest for 1RQ. This spawns the natural question “which scheme is optimal for a specific query class?” In the following, we work with an existing notion of optimality from the literature [39]. Let Time(S, C, Q) and Space(S, C, Q) denote the time and space costs of a scheme S for an index with cardinality C and query class Q. Time costs refer to the number of identifier set accesses, and space costs to the number of identifier sets the encoding requires. A coding scheme S is optimal with respect to a query class Q for a fixed cardinality C, if there exists no other scheme S′ such that:
1. Time(S′, C, Q) ≤ Time(S, C, Q)
5There also exists a variation when C is odd [41].
Coding      EQ Query (I = x)    1RQ Query (I ≤ x)    2RQ Query (x ≤ I ≤ y)    RQ Query (1RQ ∪ 2RQ)
Equality
Range∗
Interval†
∗ iff C ≤ 5. † iff C ≥ 14.
Table 2.4: Established optimality results for coding schemes and query classes [40]. Intuitively, a coding scheme is optimal for a given index with cardinality C and query class if no other scheme (of those compared) can dominate it with respect to both space and time costs.
2. Space(S′, C, Q) ≤ Space(S, C, Q)
3. at least one of inequality (1) or (2) is strict
We summarize established optimality results in Table 2.4. Multiple optimal schemes may exist for the same query class, where one strictly dominates in time and the other in space.
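A small sketch of range coding over uncompressed bit vectors (Python integers as bitmasks; the cardinality and example values are assumptions) shows why it suits 1RQ lookups: encoding a value x sets α’s bit in every Ri with i ≥ x, so I ≤ x reads a single identifier set, and I > x follows by complementing, per identity (2.2).

```python
C, N = 8, 6                      # index cardinality and number of identifiers
values = [3, 6, 2, 4, 0, 7]      # value of identifier alpha = list position

# Range coding: bit alpha is set in R[i] iff values[alpha] <= i.
R = [sum(1 << a for a, v in enumerate(values) if v <= i) for i in range(C - 1)]
ALL = (1 << N) - 1               # the complete identifier set "1"

def le(x):
    """I <= x: a single identifier-set access."""
    return ALL if x >= C - 1 else R[x]

def gt(x):
    """I > x == NOT(I <= x): the complement of a single access."""
    return ~le(x) & ALL
```

By contrast, an equality-coded index would need to union up to x + 1 identifier sets to answer the same I ≤ x lookup, which is exactly the time/space tension Table 2.4 captures.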
2.4.3.4 Compression While binning reduces the index cardinality, compression reduces the size of an index by shrinking the space of the identifier sets. There exists a large body of research dealing with the compression of position lists for inverted indexes [165, 120, 145]. Because we focus on bitmap indexes in this thesis, we refer the interested reader to this literature. Likewise, bit vector compression algorithms have received significant attention. We list notable algorithms in Table 2.5. All shown algorithms operate on the basis of run-length encoding (RLE), although there also exists a hybrid approach combining RLE with position lists [38]. Compression renders bitmap indexes a viable data structure in practice. Previous work found that inverted indexes occupy half the space of compressed bitmap indexes for some workloads, and that bitmap indexes perform worse than inverted indexes at higher cardinalities [25]. However, newer compression methods [60, 76] compress twice as well, and attribute value decomposition [39] provides an effective means to control the index cardinality. Bitmap indexes also have the advantage that computing the complement merely involves flipping bits, unlike generating a new list of identifiers. While we see many point-wise comparisons [202, 60, 76, 93], a comprehensive benchmark of all known algorithms remains an open research project.
Algorithm            Publication
BBC [10]             1995
WAH [201]            2004
COMPAX [76]          2010
CONCISE [52]         2010
WBC/EWAH [202, 121]  2010
PLWAH [60]           2010
DFWAH [164]          2011
PWAH [188]           2011
VLC [57, 65]         2011
VAL [92]             2014
Table 2.5: Enumeration of bit vector compression algorithms.
When we started the implementation of VAST, we could only choose between two algorithms, EWAH [121] and CONCISE [52], because all other known algorithms at this time fell under patent restrictions, preventing us from releasing our project as free open-source software. We chose EWAH because it trades space for time: while exhibiting a slightly larger footprint, it executes faster in certain conditions [93] because it can skip entire blocks of words. In the future, we plan to assess the performance of other algorithms as well.
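The shared principle behind these algorithms can be sketched with a naive run-length encoder over a bit string (illustrative only; word-aligned schemes such as WAH and EWAH pack runs into machine words so that bitwise operations can run directly on the compressed form):

```python
def rle_compress(bits):
    """Run-length encode a bit string as (bit, run-length) pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1       # extend the current run
        else:
            runs.append([b, 1])    # start a new run
    return [(b, n) for b, n in runs]

def rle_decompress(runs):
    """Expand (bit, run-length) pairs back into a bit string."""
    return "".join(b * n for b, n in runs)
```

Sparse identifier sets, which dominate in practice, consist of long 0-runs punctuated by few 1-bits, which is why RLE-style schemes compress them so well.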
2.4.4 Composition The techniques of binning, coding, and compression introduced in §2.4.3 apply to one index instance. In particular, binning helps to reduce space consumption by reducing the index cardinality, but its surjective nature introduces candidate checks. Multi-component indexes [199, 39] represent an orthogonal technique to reduce the index cardinality, without introducing the need for candidate checks. By splitting an index into multiple components, each responsible for a piece of the value, the technique achieves an exponential reduction in space by decreasing the size of the value domain by a multiplicative factor for each component.
More formally, a k-component base (or radix) β = ⟨βk, …, β1⟩ defines a decomposition scheme for a domain with ∏(i=1..k) βi distinct values. A base is well-defined if βi ≥ 2 for all 1 ≤ i ≤ k. We only consider well-defined bases and define the cardinality of β as |β| = k.6 Given a fixed base, we can decompose a value x into k components ⟨xk, …, x1⟩ as a linear combination:7

x = Σ(i=1..k) xi · ∏(j=1..i−1) βj

6Because we only consider well-defined bases, the cardinality of a base equals the zero “norm.” 7Unlike previous work [39], we use the same dimensionality for coefficients and base vectors to avoid off-by-one confusions.
where xi = ⌊x / ∏(j=1..i−1) βj⌋ mod βi
Figure 2.9: Value decomposition according to base β = ⟨2, 3⟩, for equality coding (=) and range coding (≤): x1 = 2 = ⟨0, 2⟩, x2 = 3 = ⟨1, 0⟩, x3 = 5 = ⟨1, 2⟩, x4 = 4 = ⟨1, 1⟩, x5 = 1 = ⟨0, 1⟩, x6 = 0 = ⟨0, 0⟩.
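The decomposition formula can be exercised directly; `decompose` below is an illustrative helper (not from the thesis) that computes xi = ⌊x / ∏(j<i) βj⌋ mod βi, with component 1 as the least-significant, rightmost position.

```python
from math import prod

def decompose(x, base):
    """Decompose x into <x_k, ..., x_1> according to base <b_k, ..., b_1>."""
    radices = list(reversed(base))   # b_1, b_2, ..., b_k (least significant first)
    comps = []
    for i, b in enumerate(radices):
        # x_i = floor(x / (b_1 * ... * b_{i-1})) mod b_i; prod([]) == 1
        comps.append((x // prod(radices[:i])) % b)
    return list(reversed(comps))     # back to <x_k, ..., x_1>
```

Running it on the examples from the text reproduces the stated decompositions, including the mixed-radix case.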
This decomposition scheme directly applies to the index structure as well: a k-component index K^β = ⟨I_k, …, I_1⟩ consists of k indexes, where each I_i covers a total of β_i values. The per-component indexes can have varying coding, binning, and compression strategies according to §2.4.3. A base is uniform if β_i = β_j for all i ≠ j. A uniform base with β_i = 2 for all 1 ≤ i ≤ k yields the bit-sliced index [199], because each x_i can only take on the values 0 and 1. We denote this special case by Θ^k = K^β where |β| = k and β_i = β_j = 2 for all i ≠ j. Further, we define Φ^k = K^β with ∏_{i=1}^{k} β_i ≤ 2^k as an index which supports up to 2^k values.

As an example, the value 1337 decomposes into ⟨1, 3, 3, 7⟩ in base β = ⟨10, 10, 10, 10⟩, and 42 becomes ⟨2, 10⟩ in β = ⟨16, 16⟩. Non-uniform bases allow for mixed radix representations, as common in numeral systems for time and calendar dates. For example, a value of x = 443,230 seconds decomposes into "5 days, 3 hours, 7 minutes, and 10 seconds," which we can represent as x = ⟨5, 3, 7, 10⟩ in base β = ⟨365, 24, 60, 60⟩. Figure 2.9 illustrates value decomposition using bitmap indexes, for both equality and range coding according to the fixed base β = ⟨2, 3⟩. Appending the value x = 5, for example, requires first decomposing x into ⟨1, 2⟩ and then adding each x_i to the i-th component of the index.

Choosing an apt base depends on workload distribution and query patterns. At one end of the spectrum lies the time-optimal index, which corresponds to the single-component base β = ⟨C⟩. For large values of C, this base may occupy an impractical amount of space. At the other end of the spectrum lies the space-optimal bit-sliced index Θ^k with β = ⟨2, …, 2⟩ and k = ⌈log₂ C⌉ components. This index has a much smaller footprint, but each lookup requires accessing O(log C) components. Approximative analysis of this spectrum shows
that the optimal number of index components, for the index with the best space-time trade-off, is two [39]. However, this analysis, conducted over uncompressed bitmap indexes, does not factor in the effect of bit vector compression, which may change the outcome.

Appending a value x to a multi-component index K^β involves two steps: first, decompose x according to β such that x = ⟨x_k, …, x_1⟩; then append each x_i to index component I_i for all 1 ≤ i ≤ k. Performing a lookup according to a given predicate requires a few more steps [40]:

1. Normalization. Before accessing the index, the normalization phase rewrites the predicate in terms of interval queries. For example, the predicate K^β ∈ {6, 19, 20, 21, 22, 35} becomes (K^β = 6) ∨ (19 ≤ K^β ≤ 22) ∨ (K^β = 35). After this step, we can further normalize two-sided range queries into two one-sided range queries, e.g., (19 ≤ K^β ≤ 22) ≡ (K^β ≥ 19) ∧ (K^β ≤ 22).

2. Decomposition. If the predicate includes a value constant, e.g., has the form K^β ∘ x, we decompose the value according to β. Thereafter, the query plan consists only of lookups at the component level. For example, K^⟨10,10⟩ ≤ 85 first becomes ⟨I_2, I_1⟩ ≤ ⟨8, 5⟩ and then (I_2 ≤ 7) ∨ ((I_2 = 8) ∧ (I_1 ≤ 5)). The algorithm which performs this decomposition depends on the relational operator in the predicate. For equality queries we proceed as follows:
    \mathrm{EQ}(i, x) = \bigwedge_{j=1}^{i} (I_j = x_j) \tag{2.4}
This allows us to compute K^β = x as EQ(|β|, x), and K^β ≠ x as its complement ¬EQ(|β|, x). For one-sided range queries, decomposition depends both on the query class and the coding scheme. The algorithm less-than-or-equal (LE)⁸ implements the multi-component range lookup as follows:
    \mathrm{LE}(i, x) =
    \begin{cases}
    (I_i \le x_i - 1) \lor (\theta_i \land \mathrm{LE}(i-1, x)) & i > 1,\; x_i > 0 \\
    \theta_i \land \mathrm{LE}(i-1, x) & i > 1,\; x_i = 0 \\
    (I_i \le x_i - 1) \lor \mathrm{LE}(i-1, x) & i > 1,\; x_i = \beta_i - 1 \\
    I_i \le x_i & i = 1
    \end{cases} \tag{2.5}
The extra parameter θ_i depends on the coding scheme and means either I_i = x_i or I_i ≤ x_i. This yields K^β ≤ x = LE(|β|, x) and, in conjunction with Equation 2.1 through Equation 2.4, a generic lookup algorithm for all relational operators {<, ≤, ≥, >}.

A small optimization can take place when all components of a value except the most significant digit have their maximum value, i.e., when x_i = β_i − 1 for all 1 ≤ i < k. For example, K^β ≤ 799 with β = ⟨10, 10, 10⟩ simplifies to I_3 ≤ 7, which avoids the two extra index component lookups I_2 ≤ 9 and I_1 ≤ 9.

⁸Algorithm LE with range coding is equivalent to algorithm RangeEval-Opt, which we showcase in Appendix A.

In summary, this yields the multi-component lookup algorithm ℓ to look up a value x in a multi-component index K^β under a relational operator ∘:
    \ell(K^\beta, \circ, x) =
    \begin{cases}
    \mathrm{EQ}(|\beta|, x) & \circ \in \{=\} \\
    \neg\,\mathrm{EQ}(|\beta|, x) & \circ \in \{\neq\} \\
    \mathrm{LE}(|\beta|, x) & \circ \in \{\le\} \\
    \neg\,\mathrm{LE}(|\beta|, x) & \circ \in \{>\} \\
    \mathrm{LE}(|\beta|, x - 1) & \circ \in \{<\} \land x > 0 \\
    \neg\,\mathrm{LE}(|\beta|, x - 1) & \circ \in \{\ge\} \land x > 0 \\
    \mathbf{0} & (\circ \in \{<\} \land x = 0) \lor (\circ \in \{>\} \land x = C - 1) \\
    \mathbf{1} & (\circ \in \{\ge\} \land x = 0) \lor (\circ \in \{\le\} \land x = C - 1)
    \end{cases} \tag{2.6}

3. Decoding. Finally, each component-level predicate translates into operations on encoded identifier sets. For example, (I_3 ≤ 7) becomes the complement of E^8 ∨ E^9 under the assumption of equality coding. The decoding step relies on the algorithms which we summarize in Table 2.6.

[Table 2.6: Lookup algorithms for equality, range, and interval coding in the query classes EQ, 1RQ, and 2RQ. The notation conforms to §2.4.3, with the addition of 1 representing the position list with all identifiers, or the bit vector with all 1s.]
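As a concrete illustration of decomposition and equality lookup, the following Python sketch models a multi-component index with hash maps over plain ID sets instead of compressed bit vectors. All names are ours, and range lookups via LE are omitted:

```python
def decompose(x, base):
    """Decompose x into <x_k, ..., x_1> for base <b_k, ..., b_1>, where
    x_i = floor(x / (b_{i-1} * ... * b_1)) mod b_i."""
    components = []
    for b in reversed(base):        # compute x_1 first, then x_2, ...
        components.append(x % b)
        x //= b
    return list(reversed(components))

def eq_lookup(index, base, x):
    """EQ(|base|, x) per Equation 2.4: intersect per-component matches.
    index[i] maps a component value to the set of matching event IDs."""
    result = None
    for component, xi in zip(index, decompose(x, base)):
        hits = component.get(xi, set())
        result = hits if result is None else result & hits
    return result if result is not None else set()

# Build a two-component index over base <2, 3> for four events.
base = [2, 3]
index = [{} for _ in base]
for event_id, value in enumerate([5, 3, 5, 0]):
    for component, xi in zip(index, decompose(value, base)):
        component.setdefault(xi, set()).add(event_id)
```

For instance, `decompose(443230, [365, 24, 60, 60])` yields `[5, 3, 7, 10]`, matching the "5 days, 3 hours, 7 minutes, 10 seconds" example, and `eq_lookup(index, base, 5)` intersects the matches of both components.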
Chapter 3
Architecture
Il semble que la perfection soit atteinte non quand il n'y a plus rien à ajouter, mais quand il n'y a plus rien à retrancher. ("It seems that perfection is attained not when there is nothing more to add, but when there is nothing more to take away.") Antoine de Saint-Exupéry
Building a network forensics system that operates efficiently at scale involves meeting fundamental systems challenges with adequate design choices. The major challenge lies in finding appropriate abstractions that fit the problem domain. In this chapter we present the architecture of VAST, our system for network forensics at scale. After iterating through several prototypes, the design converged to a stable point. Throughout the discussion, we report our experiences and lessons learned from these iterations. We begin with a presentation of the richly typed data model and query language in §3.1. Thereafter, we introduce the key system components in §3.2 and show how to deploy them in §3.3.

3.1 Data Model
Since analyst time is a costly resource, maximizing productivity becomes a chief economic concern. When analysts can reason within their domain, without having to translate their thought processes to a narrow interface, they perform fewer context switches and operate more efficiently. In this light, we equip VAST with a data model rich in types and operations, geared towards domain-specific idioms and workflows. In §3.1.1 we introduce the type system, and in §3.1.2 we explain the query language.

3.1.1 Type System

VAST's data model consists of types, which define the physical interpretation of data. A type's signature includes a type name and type attributes. A value combines a type with a
Figure 3.1: The type system of VAST, which consists of basic and recursive types. The latter split into container and compound types.
data instance. An event is a value with additional metadata, such as a timestamp, a unique identifier, and arbitrary key-value pairs. Figure 3.1 illustrates the type system schematically: basic types represent a single value (booleans, signed/unsigned integers, floating-point numbers, times and durations, strings, IPv4 and IPv6 addresses, subnets, ports), container types hold multiple values of the same type (vectors, sets, tables), and compound types act as heterogeneous structures (records), where each named field holds a value of an arbitrary type. A schema describes the access structure of one or more types.

Two types are equal if and only if they have the same signature. Two types are congruent if they have the same physical representation. Equality implies congruence. As an example, consider the basic type T = count representing a 64-bit unsigned integer. Built-in types have no name. We can create a copy U = T with name ν(U) = bytes, indicating that this type represents a number of bytes. The signatures of T and U now differ since ν(T) ≠ ν(U). Hence they are no longer equal but still congruent. Let x be a value with type U and data 42. The value x becomes an event once we attach a timestamp, an ID, plus optional metadata such as ⟨source = ids, group = dmz⟩.

3.1.2 Query Language

VAST's query language supports filtering data according to boolean algebra. Table 3.1 lists the key syntactic elements. A query expression consists of one or more predicates connected
Table 3.1: Key syntactic elements of VAST's query language.

  Boolean Expression    Symbol
  Conjunction           E1 && E2
  Disjunction           E1 || E2
  Negation              ! E
  Group                 (E)
  Predicate             LHS ∘ RHS

  Relational Operator ∘   Symbol
  Arithmetic              <, <=, ==, !=, >=, >
  Membership              in, !in
  Match                   ~, !~

  Type       Examples
  bool       T, F
  int        +42, -42
  count      42
  real       -4.2
  duration   10ms, 8mins, 1h
  time       2014-09-25
  string     "foo", "b\x2Ar"
  addr       10.0.0.1, ::1
  subnet     192.168.0.0/24
  port       80/tcp, 53/udp, 8/icmp
  vector     …

Extractors (usable as LHS or RHS) are described below.
Schema extractor. This extractor enables named access of types in the schema. The operator “.” dereferences record fields, similar to structs in C. For example, in the predicate http.method == "POST", http.method is a schema extractor, and "POST" is the value of type string to match in the record type with name http and field method.
Meta extractor. This extractor refers to event metadata. The syntax uses &key to reference a metadata key. There exist predefined keys for mandatory metadata, such as &time for the event timestamp. For example, the predicate &time > now - 1d selects all events within the last 24 hours.
Type extractor. This extractor leverages the strict typing in VAST to perform queries over all values having a given type. For example, the predicate :addr in 10.0.0.0/8 applies to all IP addresses (any value or record field of type addr). The extractor :T allows any combination of built-in types or type names for T.
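The type equality and congruence rules of §3.1.1 can be sketched compactly. The following model is our own simplification, reducing a type's signature to a representation kind plus an optional name:

```python
class Type:
    """Model of a type signature: a representation kind plus optional name."""
    def __init__(self, kind, name=None):
        self.kind = kind        # physical representation, e.g., "count"
        self.name = name        # built-in types carry no name

    def __eq__(self, other):
        # Equality: identical signatures (same kind and same name).
        return (self.kind, self.name) == (other.kind, other.name)

    def congruent(self, other):
        # Congruence: same physical representation; equality implies this.
        return self.kind == other.kind

T = Type("count")                # built-in 64-bit unsigned integer
U = Type("count", name="bytes")  # named copy of T
```

Here `T` and `U` are no longer equal, because their names differ, yet remain congruent, mirroring the T/U example from §3.1.1.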
Figure 3.2: High-level system architecture of VAST showing the key components: import, archive, index, and export.
Figure 3.3: VAST deployment styles. (a) Single-machine deployment: all actors run within one node. (b) Cluster deployment: multiple peering nodes form a shared-nothing system.
3.2 Components
From a high-level view, VAST consists of four key components, which we show in Figure 3.2: (i) import to parse data from various sources into events and assign them globally unique identifiers, (ii) archive to compress and store events with key-value access for retrieval, (iii) index to accelerate queries by quickly identifying the set of events to extract from the archive, and (iv) export to spawn queries and relay results to sinks of various output formats. We model each component abstractly as a set of actors (see §2.3.1), which can execute all
Figure 3.4: Event ingestion overview. source generates batches of events and relays them to importer. There, the events receive unique IDs before importer forwards them to archive and index.

within the same process, across separate processes on the same host, or on different machines. Unless we explicitly mention process boundaries, we assume that actors run within the same process. The fundamental actor in VAST is node, which acts as a container for other actors. Typically, a node maps to a single process. nodes can peer to form a cluster, and must then achieve consensus over global state. VAST uses Raft [142] as its consensus algorithm, with a key-value interface to the global state, akin to etcd [71]. We refer to this globally replicated state as the metastore. Each node has access to its own metastore instance, which performs the fault-tolerant distribution of values. When VAST runs in a cluster deployment, the system exhibits a shared-nothing architecture [132], where each node constitutes a fully independent system plus the metastore. In Figure 3.3 we show the two common deployment styles: single-machine and cluster.

Next, we present the design of the four key components in more detail. We describe in §3.2.1 how events enter the system and receive a unique ID. In §3.2.2 we discuss how archive stores and serves events, and in §3.2.3 how index processes them. In §3.2.4 we describe event retrieval in more detail.
3.2.1 Import

The import component handles data ingestion. Various types of sources parse data into events. Once a source has completed a batch of events, it sends it to importer, where each event receives a unique identifier. Thereafter, importer relays the batch to archive and index. Figure 3.4 illustrates this process. Two notable aspects of the import process concern generating events and identifiers.
3.2.1.1 Batch Generation

A temporal or spatial condition can trigger the generation of a batch of events. For example, a source could send its events once per second to ensure that they arrive at importer with at most a second of delay. This works well for sources with very low event rates. Conversely, sources that exhibit a high event rate could cap the batch size (e.g., at 100 K events) to avoid stressing the I/O subsystem with overly large messages causing bufferbloat [114, 84]. To accommodate both low-latency and high-throughput requirements, we can combine temporal and spatial mechanisms: when event rates remain low enough, source generates a batch once per unit time. When the number of events per unit time reaches a configured upper bound, source sends the current batch regardless of the time constraint.
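The combined temporal/spatial trigger can be sketched as follows. This is our own minimal model (parameter names and the flush callback are hypothetical); a production source would also need a timer to flush an idle batch without waiting for the next event:

```python
import time

class BatchBuilder:
    """Flush a batch after max_delay seconds or max_size events,
    whichever comes first."""
    def __init__(self, flush, max_size=100_000, max_delay=1.0):
        self.flush = flush              # callback handing a batch downstream
        self.max_size = max_size        # spatial trigger (event count cap)
        self.max_delay = max_delay      # temporal trigger (seconds)
        self.batch = []
        self.started = time.monotonic()

    def add(self, event):
        self.batch.append(event)
        full = len(self.batch) >= self.max_size
        stale = time.monotonic() - self.started >= self.max_delay
        if full or stale:
            self._flush()

    def _flush(self):
        if self.batch:
            self.flush(self.batch)      # hand the completed batch over
        self.batch = []
        self.started = time.monotonic()
```

Under low event rates the temporal trigger dominates, bounding delivery latency; under high rates the spatial trigger caps message size.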
3.2.1.2 ID Generation

Each event represents a unique description of activity which analysts need to be able to unambiguously reference during forensic investigations. To this end, an event requires a unique identifier (ID) as metadata, independent of its value. The event ID also establishes a link between the archive and index components: an index lookup yields a set of IDs, each of which identifies a single event in the archive. Given this scenario, we impose the following requirements on ID generation:
64-bit. To represent a sufficiently large number of events, we should use a pool of IDs that will not be exhausted in any foreseeable future. A 32-bit ID space yields only 4 B events, which large sites easily produce within a day. Therefore, we choose a pool with an effective size of 64 bits. At the same time, we should not use a larger pool, because modern processors can efficiently operate on 64-bit integers and store them compactly without resorting to compound structures.
Monotonic. The indexes require monotone IDs, because they are append-only data structures which internally maintain compressed identifier sets (see §2.4.3). Consequently, they cannot efficiently support appending a value with an ID smaller than the previously inserted one. ID generation should therefore be monotonic.
Distributed. ID generation should support both single-machine and cluster deployments. To support decentralized ID generation, some systems partition the ID into several pieces. For example, Twitter's Snowflake [111] partitions a 64-bit ID into 3 pieces: a 42-bit timestamp, a 10-bit machine identifier, and a 12-bit sequence number. However, this scheme encodes topology information into the IDs. What happens if there ever exist more than 1,024 ID-generating machines in the system? Why waste 9 bits when only one machine exists? To avoid topology-related pitfalls, we should not encode deployment assumptions into the ID space.
Precondition: Ei: Event i in a batch received from a source.
1 handler relay(E_0, …, E_{N−1})
2   [o, n) ← identify(N)  ▷ Obtain N new IDs (see Algorithm 2).
3   for i = 0 to n − o − 1 do
4     id(E_i) ← o + i  ▷ Assign IDs to events.
5   end for
6   send(archive, E_0, …, E_{n−o−1})  ▷ Send the first n − o events to archive.
7   send(index, E_0, …, E_{n−o−1})  ▷ Send the first n − o events to index.
8   if n − o < N then
9     relay(E_{n−o}, …, E_{N−1})  ▷ Recurse when having received insufficient IDs.
10  end if
11 end handler
Algorithm 1: Assignment of IDs when importer receives a new batch of events E0,...,EN−1 from a source.
The monotonicity requirement precludes approaches involving randomness, such as a universally unique identifier (UUID) [117]. Moreover, any random ID generation algorithm in a space of N IDs experiences collisions after approximately √N IDs due to the birthday paradox. In combination with the 64-bit requirement, this would degenerate the effective size of the ID space from 2^64 to 2^32, and therefore not yield enough IDs. Instead, we simply choose a 64-bit counter for IDs. The first valid ID starts at 0 and the last ends at 2^64 − 2. The ID 2^64 − 1 represents the invalid (default) ID. To represent contiguous sequences of IDs, we use half-open ranges [α_0, α_1), [α_1, α_2), …, where α_j − α_i with i < j represents the number of IDs in the respective range. To support distributed ID generation, our algorithm relies only on a single global counter in the metastore. Requesting a range of N IDs translates to incrementing this counter by N, which returns the old and new counter values as a pair [o, n) denoting the allocated half-open range with n − o = N IDs.

We describe the concrete procedure in Algorithm 1, which explains how importer receives a batch of events E_0, …, E_{N−1}, assigns IDs to them, and then forwards them to archive and index. Upon receiving N new events, importer asks a helper actor identifier for N new IDs (line 2). We describe how identifier handles this request shortly. The response [o, n) to this request represents the range of new IDs, which importer assigns to the first n − o events (line 4). importer then forwards these events to both archive and index (lines 6–7). If n − o = N, the algorithm terminates. If n − o < N, the algorithm recurses for the remaining N − (n − o) events (line 9).

In Algorithm 2 we describe how identifier, a helper actor of importer that interacts with the metastore, handles a request for N new IDs. To avoid high latencies from frequent
Precondition:
  R_i = [l_i, r_i): a range of IDs, where |R_i| = r_i − l_i represents the number of IDs in R_i.
  R = ⟨R_0, …, R_{m−1}⟩: a queue of m ID ranges.
  B: the batch size, i.e., the number of IDs to request from the metastore.
  F: a factor 0 < F < 1 indicating when to replenish.
  T: the timestamp of the last replenishment.
  D↑: the duration below which B increases.
  D↓: the duration above which B decreases.
  N: the number of requested IDs.
1 handler identify(N)
2   if N > |R_0| then N ← |R_0|  ▷ Do not hand out more IDs than in the front range.
3   respond(l_0, l_0 + N)  ▷ Return IDs back to sender.
4   l_0 ← l_0 + N
5   if l_0 = r_0 then R ← R \ R_0  ▷ Remove (exhausted) front range from queue.
6   if Σ_{i=0}^{m−1} |R_i| < F · B then  ▷ Replenish when running low on IDs.
7     if now − T < D↑ then B ← 2B
8     else if now − T > D↓ then B ← B − 0.1B
9     R_m ← replenish(B)  ▷ Get B new IDs from the metastore.
10    R ← R ∪ R_m  ▷ Add range to the back of the queue.
11    T ← now
12  end if
13 end handler
Algorithm 2: identifier serving requests for IDs. All variables except N denote persistent state across handler invocations. After serving a request (line 3), identifier replenishes its local pool of IDs when it drops below a fraction F of the batch size (line 6).
interaction with the metastore, identifier keeps a local cache of IDs and only replenishes it when running low on IDs. To find the right cache size, the algorithm operates adaptively (lines 7–8): it doubles the cache size when replenishing twice within a fixed time interval D↑, and decreases the cache size when more than D↓ has passed since the last replenishment. This algorithm supports both low-latency and high-throughput ID allocation due to the direct communication between importer and identifier, the local ID cache at identifier, and the adaptive calibration to a dynamic number of IDs that is a function of the event rate.
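The interplay of Algorithms 1 and 2 boils down to fetch-and-add on a single counter plus a local range cache. The following sketch is a simplified model (no adaptive batch sizing, no replication; all names are ours):

```python
class Metastore:
    """Stand-in for the replicated metastore: one global ID counter."""
    def __init__(self):
        self.counter = 0

    def add(self, n):
        """Fetch-and-add: returns the allocated half-open range [old, new)."""
        old = self.counter
        self.counter += n
        return old, self.counter

class Identifier:
    """Local ID cache that replenishes in batches (Algorithm 2 without
    the adaptive batch sizing: B stays fixed)."""
    def __init__(self, metastore, batch_size=1_000):
        self.metastore = metastore
        self.batch_size = batch_size
        self.lo = self.hi = 0           # current front range [lo, hi)

    def identify(self, n):
        """Hand out at most n IDs as a half-open range. The caller recurses
        for any remainder, as in Algorithm 1."""
        if self.lo == self.hi:          # cache exhausted: replenish
            self.lo, self.hi = self.metastore.add(self.batch_size)
        n = min(n, self.hi - self.lo)   # never exceed the front range
        allocated = (self.lo, self.lo + n)
        self.lo += n
        return allocated
```

A request that straddles the end of the cached range receives a short allocation and must ask again, which mirrors the recursion in Algorithm 1.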
3.2.2 Archive

The archive holds a full copy of the raw data in compressed form. It acts as a key-value store which maps event IDs to events. During data ingestion, importer relays batches of events
Figure 3.5: Event ingestion at the archive. When archive receives a batch of events from importer, it compresses the batch and appends it to the current segment.
to archive (see Figure 3.4), and for queries archive answers lookups for specific IDs with the corresponding events (see §3.2.4.1).

To avoid frequent I/O operations for small amounts of data, archive internally groups batches of events into compressed, fixed-size sequences (by default 128 MB) sorted by event ID. We term these sequences segments. When a batch of events arrives, archive compresses the batch and appends it to the current segment. Once the segment reaches its maximum size, archive writes it to the filesystem and hands it over to an in-memory least-recently-used (LRU) cache. We illustrate this process in Figure 3.5.

archive performs I/O at the granularity of segments, as opposed to more fine-grained units like batches or individual events. The rationale for this design is two-fold: first, for a common storage architecture based on spinning disks, we want to favor sequential I/O over significantly more expensive seeks. Second, aggregating batches into segments reduces the fragmentation and overall size of the associative data structure which maps event IDs to events, as we describe next.

To map events to segments in a space-efficient way, we devised a range map data structure. A range map acts as an associative array with ranges as keys that additionally coalesces adjacent key ranges if and only if they have equal values. Consider the following example, which illustrates how to add entries to a range map. Assume that events arrive in batches B_i^j = [i, j), where i denotes the ID of the first and j − 1 the ID of the last event in the batch. Further, assume that a segment reaches its maximum size after 30 events. After the batches B_0^10, B_10^25, and B_25^40 arrive at archive, the segment consists of S = ⟨B_0^10, B_10^25⟩, but not B_25^40, as it would exceed the maximum segment size. archive records the segment in the range map under the entry [0, 25) → S and moves S into the LRU cache. Thereafter, archive starts a new segment S′ = ⟨B_25^40⟩ and waits for further events to arrive. In practice, the maximum segment size exceeds the batch size significantly.

archive compresses each batch in a segment. To determine which compression algorithm works best in this scenario, we present a comparative benchmark in §5.6.1. When archive receives a request for a specific event ID, it looks up the corresponding segment in the range
map. If the segment does not exist in the LRU cache, archive loads it from the filesystem and inserts it into the cache. Finally, archive delivers the batch in which the requested event ID exists.
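The range map can be sketched as a sorted list of half-open ranges that coalesces an inserted range with its predecessor when both map to the same segment. This is our own simplified model (linear-scan lookup instead of a balanced tree):

```python
class RangeMap:
    """Associative array keyed by half-open ID ranges [l, r) that coalesces
    adjacent ranges mapping to the same value."""
    def __init__(self):
        self.entries = []               # sorted, disjoint (l, r, value) triples

    def insert(self, l, r, value):
        if self.entries:
            pl, pr, pv = self.entries[-1]
            if pr == l and pv == value: # adjacent with equal value: coalesce
                self.entries[-1] = (pl, r, value)
                return
        self.entries.append((l, r, value))

    def lookup(self, x):
        for l, r, value in self.entries:
            if l <= x < r:
                return value
        return None
```

Replaying the example from above, inserting [0, 10) and [10, 25) for segment S yields the single entry [0, 25) → S, and [25, 40) starts a new entry for the next segment.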
3.2.3 Index

The index accelerates search and thereby represents the cornerstone technology for enabling interactive queries over large-scale data sets. We present an architecture which supports integrating new events while simultaneously answering queries. Moreover, we show how our design efficiently supports the inherently iterative workflow of network forensics.

The index transposes events from sequential values into columnar structures, while keeping a full copy of the events in the archive. This design contrasts with traditional column-stores [176, 133], which dematerialize each event during ingestion and materialize it again later at query time. VAST opts for a sequential archive layout because it reflects the natural form of the data, allowing for easier sharing with other applications and simpler storage. In the future, we may consider switching to a more involved transposition scheme of the base data in combination with special-purpose compression schemes [78].

A scalable design supports breaking up data and compute into smaller, composable pieces. Therefore, the index consists of horizontal partitions, each of which holds a fixed number of events. We found that a partition size on the order of 1–10 M events works well in practice, which confirms previously established results [181]. In future work, we plan to perform an extensive analysis of the effect of partition sizes, and also of how partition consolidation and merging can improve performance. To identify the set of relevant partitions during querying, there exists a separate meta index. Winnowing down the set of partitions to inspect is called partition pruning, and all major database systems support this feature. This entails a two-stage lookup process: first find the set of qualifying partitions, then the set of qualifying events within each relevant partition.
Since not all partitions may fit in memory at the same time, a scheduler takes care of swapping them in and out according to a schedule derived from the query expression. Figure 3.6 shows the key components of the index and the partition structure. We first describe how to add new events in §3.2.3.1, followed by how to perform lookups in §3.2.3.2. Thereafter, we sketch the meta index in §3.2.3.3.

3.2.3.1 Construction

Integrating new events must not only occur efficiently, but also asynchronously and in parallel while serving queries. To this end, we distinguish two types of partitions: active and passive. Both respond to lookups, but only an active partition can receive new events. After having reached its maximum number of events, an active partition becomes passive and thereafter immutable. The index maintains exactly one active partition at a time. In an earlier prototype of VAST, we experimented with load-balancing events over multiple active
Figure 3.6: Index architecture. The index consists of multiple horizontal partitions which accommodate a disjoint piece of the event data. Inside a partition, there exists one indexer per event type, and in addition records have one helper actor per field (here: conn has 5 fields). The meta index helps during lookup with finding the relevant set of partitions, to which index then forwards the query expression.
partitions, but our measurements showed that this approach does not improve performance, because each partition itself offers enough potential concurrency to sufficiently parallelize index construction. That is, a partition internally spawns enough actors that the runtime always fully utilizes the available hardware concurrency of the machine.

When new event batches arrive, index relays them to both the meta index and the currently active partition. We defer the discussion of meta index details to §3.2.3.3 and focus here on the construction of bitmap indexes inside a partition. There, each event type has its own indexer, which contains bitmap indexes to store the event details. For recursive record types, each indexer maintains one helper actor per field, as we illustrate in Figure 3.6 for the conn event. In comparison to a relational database, indexers represent concurrent columns of a table.
Figure 3.7: Index lookup. When index receives a query, e.g., :addr in 10.0.0.0/8 && :port == 80/tcp, it identifies the relevant partitions and forwards the query to them. Each partition deconstructs the expression into its predicates and forwards each to the indexers which can process it. The indexers perform the actual bitmap index lookup and return a bit vector to partition, which evaluates the expression for each bit vector and relays new hits back to index, which in turn forwards them to the client that requested the lookup.
In fact, partition relays a batch of events concurrently to all indexers, each of which only works on the subset matching its type. The copy-on-write message passing semantics of the actor model implementation makes this an efficient operation (see §2.3.1), because sending immutable messages to multiple receivers does not copy the message. This is a crucial property for data-intensive applications that seek a high degree of concurrency. It also resembles GPU programming: we make the data "globally" available and each execution context operates only on its relevant subset, thereby avoiding extra preprocessing and ensuring that the data exists in memory exactly once.
3.2.3.2 Lookup

The lookup process takes as input a query expression and produces as output a bit vector representing the event IDs of index hits. When index receives the query expression from a client in the form of an abstract syntax tree (AST), it performs the following steps:
1. Look up the meta index to identify the set of relevant partitions (see §3.2.3.3).

2. Relay the expression to each relevant partition.
3. Receive arriving hits from partitions and forward them to the client.
When partition receives the query expression from index, it deconstructs the AST into predicates. The type of the predicate determines how to dispatch it further to one or more indexers. Consider an example query consisting of two predicates :port == 53/udp && foo > 42. partition sends the first predicate to all indexers that manage a value of type port, and the second predicate to the single indexer responsible for the event of type foo = count. When an indexer receives a predicate and applies it to the contained bitmap index, it responds with the bit vector representing the matching index hits. partition accumulates all hits and re-evaluates the AST as it sees fit. When an evaluation triggers a change in the result for the AST, partition relays the delta upstream to index, which ultimately forwards it to the client. The entire lookup operates asynchronously, incrementally, and in parallel: all loaded passive partitions process the query at the same time and send back to index only the incremental delta of new hits since the last AST evaluation. index unconditionally relays new hits from each partition back to the client. This results in a continuous stream of hits, allowing the client to extract results from archive in a pipelined and asynchronous fashion. To accelerate lookups and provide a truly interactive experience, partition maintains a predicate cache. This structure enables flexible recombination of predicates from already answered lookups. Moreover, it ensures efficient incremental expression refinement, i.e., when adding m new predicates to a previously answered expression of n predicates, looking up the hits for the new expressions only requires time O(m). To illustrate this concept, consider the following query expressions and predicates, which we visualize in Figure 3.8:
Q1 = A ∧ B
Q2 = A ∧ C
Q3 = (A ∧ B) ∨ (C ∧ D)
Q4 = (B ∧ C) ∨ D
Assume that a partition first receives query Q1, consisting of the two predicates A and B. Upon receiving Q2, partition only needs to retrieve C and re-evaluate the AST, since A already resides in the cache. After answering Q3, partition can answer Q4 directly from the cache.
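The caching behavior above can be sketched as follows. This is an illustrative Python model with invented names (not VAST's actual C++ implementation): hits for individual predicates are memoized as ID sets, so a new expression only pays index lookups for its previously unseen predicates.

```python
class PredicateCache:
    """Memoizes per-predicate hits so expression refinement costs O(m)
    index lookups for m new predicates."""

    def __init__(self, index_lookup):
        self.index_lookup = index_lookup  # predicate -> set of matching event IDs
        self.cache = {}

    def hits(self, predicate):
        if predicate not in self.cache:            # cache miss: consult the index
            self.cache[predicate] = self.index_lookup(predicate)
        return self.cache[predicate]

    def evaluate(self, expr):
        """expr is a nested tuple AST: ('and', e1, e2), ('or', e1, e2),
        or a bare predicate name."""
        if isinstance(expr, tuple):
            op, lhs, rhs = expr
            l, r = self.evaluate(lhs), self.evaluate(rhs)
            return l & r if op == 'and' else l | r
        return self.hits(expr)

# Toy index with four predicates A-D over event IDs.
table = {'A': {1, 2, 3}, 'B': {2, 3, 4}, 'C': {3, 4, 5}, 'D': {5, 6}}
cache = PredicateCache(lambda p: table[p])

q1 = cache.evaluate(('and', 'A', 'B'))                       # fetches A and B
q2 = cache.evaluate(('and', 'A', 'C'))                       # only fetches C
q3 = cache.evaluate(('or', ('and', 'A', 'B'), ('and', 'C', 'D')))
q4 = cache.evaluate(('or', ('and', 'B', 'C'), 'D'))          # served from cache
print(q1, q2, q3, q4)
```

After Q1 and Q2, all four predicates reside in the cache, so Q3 and Q4 require no further index lookups, mirroring the example in the text.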
3.2.3.3 Meta Index

When index receives a query, it first determines the set of relevant partitions by consulting the meta index. This index has different requirements than the event indexes within a partition, where each insert operation references a distinct event. For the meta index, each insert operation references the same partition until the index rotates the active partition. Furthermore, the meta index always resides in memory. To better understand the space
Figure 3.8: Predicate cache at a partition. When adding m new predicates to a previously answered expression of n predicates, looking up the hits for the new expressions only requires time O(m).
constraints, consider this example. By default, a partition has a capacity of 1 M events. Assuming that one index instance can accommodate 10 B events with a given hardware setup, the meta index contains 10 K entries. The amount of budgeted memory for the meta index places an upper bound on how big a single entry can grow. With a 1 GB memory budget for the entire meta index, no entry can grow beyond 100 KB. In the following, we outline the different aspects of an entry in the meta index.
Timestamp. Many queries restrict the search to a specific point or window in time. For example, the predicate &time > now - 1h filters all events within the last hour. We track the time window of all events within a partition by keeping track of the first and last event timestamp.

Type. When a query includes a schema or type extractor, we want to be able to tell whether a partition contains events where this schema applies or whether it contains events of the given type. For example, given an event of type foo = record{i: int, p: port} and
Type               Index
bool               Two bits
int, count, real   Tree & binning (see §2.4.3)
duration           Tree & binning (see §2.4.3)
time               Intervals
string             Tokenization (see §4.2.5) & Bloom filter
addr, subnet       Radix tree
port               Bit vector
vector[T], set[T]  Recursive over T

Table 3.2: Potential data structures for meta indexes over event data.
query foo.i == 42 || :port == 53/udp, the meta index lookup would tell whether a partition contains an event to which the schema extractor foo.i and the type extractor :port apply. To this end, a meta index maintains a set of distinct types.
Metadata. In addition to &time, an event can include other key-value metadata, e.g., &source = "dmz" to highlight where the event came from. In practice, the number of unique values does not grow very large, and we can simply store their hash digests. When a key contains a value with higher cardinality (e.g., an IP address value, as in &sensor = 10.1.1.42), we can employ probabilistic data structures to summarize the set of associated values, such as Bloom filters [26]. This restricts the lookup operation to membership tests, which we deem acceptable for event metadata, since they often represent opaque tags. These data structures can generate false positives during lookups, which would result in loading a partition that does not contain events with the queried metadata. This only increases resource consumption (and perhaps latency), but does not affect the correctness of the query result. For example, if we restrict the size of a Bloom filter to m = 10,000 bits and are willing to tolerate a false positive probability of φ = 0.01, then we can compute its capacity κ (i.e., the maximum number of elements the Bloom filter supports while maintaining φ during lookups) as follows:¹
κ = −m (ln 2)² / ln φ = 1,043
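The capacity computation can be verified with a few lines of Python. This is a minimal sketch (the function name is ours, not VAST's), assuming the optimal number of hash functions from the footnote:

```python
import math

def bloom_capacity(m_bits: int, fpp: float) -> int:
    """Maximum number of elements an m-bit Bloom filter can hold while
    keeping the false positive probability at or below fpp, assuming the
    optimal number of hash functions k = (m/n) ln 2."""
    return int(-m_bits * math.log(2) ** 2 / math.log(fpp))

print(bloom_capacity(10_000, 0.01))  # → 1043
```

With m = 10,000 bits and φ = 0.01, the function reproduces the κ = 1,043 from the text.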
Values. The previous three aspects concern event metadata. When receiving a query such as :addr in 10.0.0.0/8, we would like to not only figure out whether the partition includes an event with type addr, but also whether there exists an address value in the given subnet. Since each type comes with type-specific query operations, the corresponding index must support them as well. For example, numeric types support
¹This assumes that we choose the optimal number of hash functions k∗ = (m/n) ln 2.
inequality search, unlike addresses and subnets, which instead require top-k prefix search. Table 3.2 lists appropriate candidate data structures for each type. For bool, two bits suffice: one to capture the presence of T, and one to capture the presence of F. For arithmetic types (int, count, real, duration), trees keep an ordering on the values, and binning (see §2.4.3) helps to control the value cardinality. For time, an interval representing the minimum and maximum point in time encountered works well. For string, tokenization (see §4.2.5) and Bloom filters can prove a good fit. For addr and subnet, radix trees support top-k prefix search. For port, a bit vector efficiently captures the presence of all 65,536 ports. VAST currently only supports indexes over event metadata; we defer the implementation of value-centric meta indexes to future work. A major challenge concerns sustaining high insertion rates for tree indexes, which we need for types supporting range queries.
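A few of the entry types from Table 3.2 can be sketched concisely. The following Python model uses invented class names and simplified semantics (VAST defers value-centric meta indexes to future work); each entry answers the question "could this partition contain a hit?":

```python
class BoolEntry:
    """Two bits: one for the presence of T, one for the presence of F."""
    def __init__(self):
        self.seen = [False, False]
    def add(self, value):
        self.seen[int(value)] = True
    def may_contain(self, value):
        return self.seen[int(value)]

class TimeEntry:
    """Interval of minimum and maximum timestamp seen in the partition."""
    def __init__(self):
        self.lo, self.hi = float('inf'), float('-inf')
    def add(self, ts):
        self.lo, self.hi = min(self.lo, ts), max(self.hi, ts)
    def may_overlap(self, begin, end):
        return begin <= self.hi and end >= self.lo

class PortEntry:
    """Bit vector capturing the presence of all 65,536 ports."""
    def __init__(self):
        self.bits = bytearray(65536 // 8)
    def add(self, port):
        self.bits[port // 8] |= 1 << (port % 8)
    def may_contain(self, port):
        return bool(self.bits[port // 8] & (1 << (port % 8)))

ports = PortEntry()
ports.add(53)
print(ports.may_contain(53), ports.may_contain(80))  # → True False
```

Note that the port bit vector occupies a fixed 8 KB per partition, comfortably below the 100 KB per-entry budget derived earlier.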
3.2.4 Export

The export component implements lookup and retrieval of events, and in that sense acts as the dual to the import component. VAST spawns one exporter per query, which acts as the liaison between archive and index for historical queries, or as the receiver of a stream of events for continuous queries. An exporter sends its results to sinks, which perform sink-specific operations on the event stream, such as rendering it to the console, writing it to a file, or relaying it via the network. VAST currently supports ASCII, JSON, PCAP, Bro [30] ASCII log, and Kafka [108] sinks. We begin by walking through the process of issuing historical queries in §3.2.4.1 and then discuss continuous queries in §3.2.4.2.

3.2.4.1 Historical Queries

Historical queries execute over existing data. We illustrate this process in Figure 3.9. When a VAST node receives a query, it spawns an exporter which handles processing of this particular query instance. exporter first parses the query string into an AST and then performs normalization passes, such as moving extractors to the left-hand side of a predicate and transforming the AST into negation normal form (see §4.3.1). Then, exporter sends the normalized AST to index, which processes the lookup request according to §3.2.3.2. As the stream of index hits arrives back at exporter, it translates the bit vector into numeric event IDs to ask archive for the corresponding raw events. When exporter receives the batch of events, it may perform a candidate check to filter out false positives, which can occur due to surjective bitmap indexes (e.g., when using binning for floating-point values). Finally, exporter sends the qualifying results to sink. The process terminates after exporter has no more unprocessed hits.

A candidate check may consume a non-negligible amount of time and may even dominate query processing [172]. We can reduce the cost either by reducing the index false positive
Figure 3.9: Historical query architecture. exporter sends the query :addr in 10.0.0.0/8 && :port == 80/tcp to index, which returns a bit vector corresponding to the hits. exporter traverses the bit vector and asks archive to retrieve the corresponding batches of events. After performing a candidate check, it sends qualifying results to a sink, which processes the events further by rendering them to the screen.
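The hit-extraction flow of Figure 3.9 can be sketched as follows. This is a simplified synchronous Python model with hypothetical helper names; the actual system operates asynchronously and fetches events in compressed batches:

```python
def ids_from_bitvector(bits):
    """Translate a bit vector (list of 0/1, position = event ID) into IDs."""
    return [i for i, bit in enumerate(bits) if bit]

def run_query(bits, archive, candidate_check):
    """Fetch candidate events from the archive and prune index false
    positives with the candidate check."""
    results = []
    for event_id in ids_from_bitvector(bits):
        event = archive[event_id]          # archive lookup by event ID
        if candidate_check(event):         # weed out false positives
            results.append(event)
    return results

# Toy archive: event ID -> event. Because the index bins floating-point
# values, event 2 is a false positive for the predicate x > 4.2.
archive = {0: {'x': 1.0}, 1: {'x': 4.7}, 2: {'x': 4.1}, 3: {'x': 9.9}}
hits = [0, 1, 1, 1]                        # index says IDs 1, 2, 3 may match
print(run_query(hits, archive, lambda e: e['x'] > 4.2))
```

The index reports three candidates, but the candidate check discards event 2, leaving only the two true matches.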
Figure 3.10: Candidate check optimization. Consider the two events foo = count and bar = int, and the query expression foo == 42 || bar == 53/udp. For event foo, the candidate checker prunes the right half of the tree (left AST), and for event bar the left half (right AST). This minimizes the candidate check operations per event.
Figure 3.11: Continuous query architecture. There exists one exporter per query that can serve both historical and continuous requests. For continuous queries, exporter subscribes to the full event feed at importer and locally filters out the matching events.
rate or by reducing the time spent performing the check. Once exporter receives the events from archive, only the latter option remains. To this end, exporter derives a unique AST for each event type from the query expression. For example, a query expression of the form foo == 42 || bar == 53/udp for two events foo = count and bar = int yields two different candidate checkers: one for foo with only the first predicate, and one for bar with only the second. We illustrate this in Figure 3.10: the left AST shows the candidate checker for event foo, which does not include the shaded nodes, since they only apply to event bar. Similarly, the right AST shows the candidate checker for event bar, with the nodes involving event foo pruned. Without this pruning optimization, each event would have to be checked against the entire AST, incurring many more operations than necessary.

Result extraction occurs in a pull-based fashion: exporter only interacts with archive when instructed to do so, e.g., when a human sink requests extracting more results. Such a design avoids performing unnecessary work and frees available resources for other tasks. It also caters to the use case where analysts only want to see a “taste” of the result, as opposed to the full set of associated activity.
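The per-type pruning from Figure 3.10 can be sketched as follows. This Python model uses invented names and flattens the AST to a disjunction of predicates for brevity; the idea is the same: each event is only checked against the predicates its schema can satisfy.

```python
def derive_checker(predicates, schema_fields):
    """predicates: list of (field, test) pairs from a disjunction.
    Keep only the tests whose field exists in the event's schema."""
    return [test for field, test in predicates if field in schema_fields]

def check(event, checker):
    """Evaluate the pruned disjunction against a single event."""
    return any(test(event) for test in checker)

# Query: foo == 42 || bar == 53 over events of type foo and type bar.
query = [('foo', lambda e: e['foo'] == 42),
         ('bar', lambda e: e['bar'] == 53)]

foo_checker = derive_checker(query, {'foo'})   # one predicate instead of two
bar_checker = derive_checker(query, {'bar'})

print(check({'foo': 42}, foo_checker))  # → True
print(check({'bar': 80}, bar_checker))  # → False
```

Deriving the checkers once per event type amortizes the pruning cost over the entire result stream, so the per-event work shrinks to the relevant predicates only.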
3.2.4.2 Continuous Queries

VAST unifies historical and continuous queries in one architecture. The need for continuous queries arises when analysts have identified a query that describes an issue and would like to receive a notification if the same activity occurs in the future. Figure 3.11 displays the continuous query architecture. The same exporter which answers historical queries can also process continuous queries by subscribing to the full event feed
Figure 3.12: Client-server deployment. A node runs in a daemon process and publishes itself at a TCP endpoint. By connecting to this endpoint, client processes obtain an actor handle for the remote node in order to control it, e.g., based on command-line arguments, through an interactive shell, or by sending messages to it programmatically. This remote actor behaves just like a local actor, but the actor runtime transparently forwards messages over the TCP connection.
at importer. Because importer and exporter run in the same process, a subscription effectively represents a “tap” into the event stream: importer simply shares the event feed with exporter. This makes it possible to attach numerous exporters to the same feed, each of which executes concurrently. At the implementation level, this sharing translates into copy-on-write message passing: importer only forwards a pointer to the immutable data to the running exporters. However, filtering out the matching events does consume CPU cycles. To minimize the amount of work exporter must perform per event, we rely on the same efficient construction of candidate checkers as for historical queries.
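The subscription mechanism can be sketched as follows. This is an illustrative Python model with invented names: importer shares each immutable batch by reference with all subscribed exporters, which filter locally (real copy-on-write message passing happens inside the actor runtime, not in user code).

```python
class Importer:
    """Forwards each event batch to all subscribed exporters."""
    def __init__(self):
        self.subscribers = []
    def subscribe(self, exporter):
        self.subscribers.append(exporter)
    def ingest(self, batch):
        batch = tuple(batch)               # treat the batch as immutable
        for exporter in self.subscribers:  # shared by reference, no copies
            exporter.filter(batch)

class Exporter:
    """Applies its candidate check to the shared feed."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.matches = []
    def filter(self, batch):
        self.matches.extend(e for e in batch if self.predicate(e))

importer = Importer()
dns = Exporter(lambda e: e.get('port') == 53)
web = Exporter(lambda e: e.get('port') == 80)
importer.subscribe(dns)
importer.subscribe(web)
importer.ingest([{'port': 53}, {'port': 80}, {'port': 22}])
print(dns.matches, web.matches)
```

Each exporter sees the full feed but keeps only its own matches, mirroring how multiple continuous queries share one ingestion stream.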
3.3 Deployment
So far we described VAST’s main components from a conceptual perspective, without emphasizing concrete physical deployment styles. In §3.2 we mentioned single-machine and cluster deployment styles, without going into details. In §3.3.1, we take a closer look at various options for mapping components to processes and machines. Thereafter, we sketch in §3.3.2 how a cluster deployment can achieve fault tolerance.

3.3.1 Component Distribution

Users spin up a VAST instance by launching a daemon process that spawns a node, which initially consists of archive, index, and importer actors. The actor runtime publishes node at a given TCP endpoint, e.g., 127.0.0.1:42000. When users connect to this endpoint, they obtain a remote actor handle to node, which gives them the opportunity to send it
Figure 3.13: Variation of client-server deployment for interactive processing. As in Figure 3.12, a client connects to a server node through a TCP connection. The difference lies in component placement: in order to ingest events from standard input and display query results on standard output, the corresponding components must operate in a process that the user controls interactively.
control messages. For example, a user may spawn a new JSON source to ingest a log file, or spawn a PCAP source to start reading packets from a network interface. We show this form of control in Figure 3.12. Users can also run VAST in a “one-shot” mode, where node terminates after ingesting a single file or answering a single query. This scenario does not involve a TCP connection.

We designed VAST in accordance with the UNIX philosophy: the command line utility vast can ingest data from standard input and display query results on standard output. Behind the scenes, this results in a slightly different component distribution, as we show in Figure 3.13 for both ingestion and querying.² This shows the power of the actor model in general, and the modularity of VAST’s architecture in particular: the network-transparent messaging layer allows for flexibly wiring components at runtime. The flexibility also proves handy in heterogeneous clusters. When a cluster includes one machine with very fast solid-state disks, deploying an index instance there can result in significant latency gains.

Even though VAST supports spinning up individual components, the more common deployment form places one node on each machine, with each node running all core actors. This style of deployment distributes the data across all machines, which has the advantage of spreading I/O and computational load (in the best case evenly) over all available machines. We illustrate this deployment style in Figure 3.14 for a user who performs distributed ingestion. The user can choose to connect to any available node, because they all share the same metastore, which contains the topology information, e.g., the number of peers and the running components. The client then enumerates all importers by querying the metastore. After connecting them

²In practice, a user typically either ingests data or issues queries, although VAST technically supports ingestion and querying simultaneously.
Figure 3.14: Cluster deployment showing distributed ingestion. Users can connect to any node in the cluster, because they all have the same data in the metastore about the topology. Ingesting a single file by default entails round-robin load-balancing of the event batches over all importers, which the client obtains by querying the metastore.
as downstream components of the source, the ingestion process begins and load-balances the generated event batches over the importers in a round-robin fashion.
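The round-robin distribution can be sketched as follows. This Python model uses invented names; it also shows removal of an importer from the schedule, which becomes relevant when a machine fails (a scenario the fault tolerance discussion below covers):

```python
class Source:
    """Load-balances event batches over downstream importers round-robin."""

    def __init__(self, importers):
        self.importers = list(importers)
        self.position = 0

    def next_importer(self):
        importer = self.importers[self.position % len(self.importers)]
        self.position += 1
        return importer

    def on_down(self, importer):
        """React to a DOWN message: drop the unreachable importer from
        the round-robin schedule."""
        self.importers.remove(importer)

source = Source(['importer@hostA', 'importer@hostB', 'importer@hostC'])
first_round = [source.next_importer() for _ in range(3)]
source.on_down('importer@hostB')            # hostB failed
second_round = [source.next_importer() for _ in range(2)]
print(first_round, second_round)
```

After the DOWN notification, the schedule shrinks to the two remaining importers and ingestion continues without interruption.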
3.3.2 Fault Tolerance

As the number of components and machines increases, the odds of failure increase as well. Large deployments routinely face broken hardware, network outages, resource exhaustion, device misconfiguration, and software bugs. Reliably operating a distributed system amidst this potential for failure requires a resilient system architecture. Hardware and software can fail not only as a whole unit, but also in individual components. In VAST, a source may terminate due to an exception because it cannot handle a certain corner case in the input.

VAST provides fine-grained, local fault isolation at the actor level by relying on two mechanisms from the actor model: links and monitors. Two linked components have a mutual dependency: if one goes down, the other terminates as well. Monitors offer a weaker form of error propagation: if a component fails, the runtime notifies all monitors. These primitives allow for creating hierarchical supervision trees [13], independent of process or machine boundaries. To give a concrete example: a source monitors its downstream
importers over which it load-balances its events (see Figure 3.3(b)). If one machine in the cluster fails, source will eventually receive a DOWN message once the runtime no longer has a live connection to the machine, and as a result remove the invalid importer from its round-robin load-balancing schedule.

Comprehensive fault tolerance in a distributed setting poses a significant challenge, which goes beyond the scope of this thesis. In the following, we offer design ideas based on current trends in the field of reliable storage systems at scale. In a distributed setting, we assume that machines come with enough “insurance” to protect against data loss, e.g., by relying on RAID or network-attached storage. While these techniques provide durability and prevent data loss, they do not address availability: a failed machine remains inaccessible until it comes back. Replication and erasure codes provide availability at the cost of extra storage and bandwidth during repairs [197, 161]. The simplicity of replication makes it an attractive choice in many applications in practice. For example, distributed file systems commonly replicate their data three times [85, 170, 34]. However, the massive growth in storage demands renders replication an increasingly expensive approach and has spurred research on more efficient erasure codes [64, 110]. Traditional erasure codes suffer from high reconstruction costs in terms of disk I/O and network transfers (e.g., Reed-Solomon [157]). But newer erasure codes offer optimal reconstruction with respect to I/O, storage, and network bandwidth [154]. Today’s systems for search perform replication at the granularity of data shards/partitions [69, 173], as opposed to at the storage layer. This gives greater flexibility when composing heterogeneous machine clusters that do not share a distributed filesystem.
Conversely, relying on a distributed file system would have the advantage of (i) delegating the complex task to a lower layer, and (ii) transparently benefiting from evolution in erasure codes. In either case, a system must take care of its durable data. VAST’s durable data resides at archive and index. After a failure, another node should take over responsibility for the data of the lost node. To do so, each node must record in the meta store (see §3.2) what slice of data it manages. Otherwise, it would remain impossible for a node to identify the unavailable data. archive needs to record the IDs of the segments it accommodates, whereas index records the UUIDs of the partitions it manages. In this fashion, when a node fails during query execution, another node can take over its archive and index data and re-execute the query on its behalf. exporter can also periodically checkpoint its state (consisting of index hits and event identifiers of results that have passed the candidate check) to reduce the amount of duplicate results.

Data loss can still occur during ingestion, when archive and index have not yet written their data to the filesystem. We minimize this risk by writing data out as quickly as possible. When the machine crashes while writing out state, we may end up with corrupted data. For improved reliability, we could also perform opportunistic, concurrent archiving/indexing of identical data and accept the node that completes first as authoritative. Modern data ingestion pipelines often make use of brokers (e.g., Kafka [108]), which allow for reliable
processing without introducing redundancy. Brokers offer reliability by buffering the data until receiving an acknowledgement. When consuming logs from a broker, VAST would wait until having written its state to disk before sending an acknowledgement.
3.4 Summary
In this chapter, we described the system architecture of VAST. We model the system components in terms of actors—concurrently executing entities which solely communicate via message passing—to specify the program logic and data flow independent of deployment. VAST runs on a single machine as well as on a multi-machine cluster, and we showed how to replicate and compose the core building blocks to arrive at a flexible distributed system.

In particular, we described VAST’s four key components: import, archive, index, and export. The import component converts data from numerous different formats into VAST’s expressive data model. We explained how events enter the system and receive a unique identifier before they get dispatched to the archive and index components. The archive acts as bulk storage for the raw data, whereas the index exists to accelerate and answer queries. The dual to the import component is the export component, which handles query execution. We described how VAST processes historical and continuous queries in a unified framework.

Moreover, we showed how to deploy the various components over one or more machines. Only a cluster deployment can cope with the massive data volumes of large sites. The increase in machines also increases the probability of machine failure, and we ended this chapter with a discussion of fault tolerance. The actor model offers fine-grained hierarchical fault isolation, which facilitates local recovery. To ensure reliability in a multi-node scenario, we sketched how modern storage layers can provide the necessary building blocks for a more reliable distributed system.

A central theme throughout the design concerns type safety and type richness, which enable optimizations and allow users to express their data without losing crucial semantics. VAST’s query language enforces strong typing, allowing users to safely use type-specific operations, such as top-k prefix search on IP addresses.
In the next chapter, we pick up this topic and show how we leverage strong typing to build a composable indexing framework. 62
Chapter 4
Implementation
The problems are solved, not by giving new information, but by arranging what we have known since long. Ludwig Wittgenstein
The architecture we describe in §3 sketches the high-level data flow between the different VAST components. In this chapter, we take a deeper look at the components’ interior with a focus on implementation details that render the system an efficient platform for network forensics. In §4.1, we describe how we overcome challenges with VAST’s bedrock component: the message passing layer CAF (see §2.3.2). Thereafter, we introduce in §4.2 our composable indexing framework for VAST’s rich types. Finally, in §4.3 we highlight performance-critical aspects of VAST’s query execution engine.
4.1 Message Passing Challenges
During our implementation of VAST, we faced two challenges that arose in combination with CAF, the actor model implementation we use for concurrency and distribution. First, we articulate the problem of overload in §4.1.1 and describe how we solved the issue with flow control. Second, we describe a message routing inefficiency in §4.1.2 and explain how we resolved it.
4.1.1 Adapting to Load Fluctuations with Flow Control

For decades, the networking community has dealt with the problem of overload, a situation where a sender generates messages at a higher rate than a receiver can process them. As a result, the message queue at the receiver fills up until memory runs out. Protocols architect themselves
out of this issue via flow control: sender and receiver implicitly negotiate a tractable message rate to avoid an overload scenario. Recent work applies flow control to operating systems with QoS-inspired techniques: by ordering applications with respect to latency sensitivity, the OS can prioritize packets accordingly [90].

We consider overload from a more fine-grained perspective, within a single application consisting of components that communicate via message passing. In particular, we frame the problem in a network-transparent way: the components may all reside in the same process, within the same machine across multiple processes, or distributed over multiple machines. The actor model [97] offers an apt vehicle to express high-level message passing independent of component deployment. Because it operates fully asynchronously, an actor can send numerous messages regardless of whether the receiver can process them. While the asynchrony enables efficient non-blocking, concurrent computation, the decoupled communication structure also allows overload scenarios to occur easily. This differs from a related model of concurrent computation, communicating sequential processes (CSP) [99], where sending a message is a synchronous operation and the receiver can block to signal overload implicitly. Per §2.3.1, the flexible failure propagation of the actor model and the absence of blocking render it a better fit for our use cases. Therefore, for the remainder of the discussion, we focus only on asynchronous flow control in the context of the actor model.

A short-sighted attempt to remedy overload provisions more buffer capacity so that the system can receive more messages. However, this ill-advised fix causes bufferbloat [84, 114] and ultimately worsens the situation by introducing higher latency and jitter. More buffer space does not address the root cause: a slow component on the critical path. We approach flow control from two angles: sensing and adaptation.
Sensing concerns identifying load information and making it available. Adaptation concerns changing behavior according to a received flow-control signal. Load sensing falls into two classes:
End-to-end. When two components communicate in a request-response pattern, the round-trip time (RTT) can serve as an external measure of load. In particular, this metric works in scenarios where messages travel over multiple hops. For example, sliding window protocols use timeouts to generate a flow-control signal: if the response does not arrive within the required latency, the sender considers the response lost and signals an overload condition.
Hop-by-hop. Regardless of the communication pattern, a component can monitor various internal performance metrics to determine its load status, such as CPU utilization, memory consumption, or message queue size. Local introspection can yield various load signals. A continuous signal normalized to the interval [0, 1] offers a range between underload and overload. One can define
thresholds for underload and overload within that range, or apply control theory to incorporate feedback.
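Hop-by-hop sensing with thresholds can be sketched as follows. This Python model uses invented names and derives the load signal from the mailbox fill ratio; the two watermarks form a hysteresis band so the component does not oscillate around a single cutoff:

```python
class LoadSensor:
    """Maps queue size to a load signal in [0, 1] and emits
    overload/underload transitions using hysteresis thresholds."""

    def __init__(self, capacity, high=0.8, low=0.3):
        self.capacity = capacity
        self.high, self.low = high, low
        self.overloaded = False

    def update(self, queue_size):
        """Return 'overload', 'underload', or None per measurement."""
        signal = min(queue_size / self.capacity, 1.0)
        if not self.overloaded and signal >= self.high:
            self.overloaded = True
            return 'overload'
        if self.overloaded and signal <= self.low:
            self.overloaded = False
            return 'underload'
        return None

sensor = LoadSensor(capacity=1000)
print(sensor.update(500))   # below the high watermark
print(sensor.update(900))   # crosses into overload
print(sensor.update(600))   # hysteresis: still overloaded, no new signal
print(sensor.update(200))   # drops below the low watermark: underload
```

Only the transitions generate signals, which keeps the volume of flow-control messages low even when the load hovers near a threshold.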
After receiving a signal from the flow-control sensor, a component can react to it with two principal strategies:
Back-pressure. When a component senses a critical load, it must react to prevent the system from keeling over. If the component produces data locally, it can reduce its sending rate to reduce pressure downstream. If the component does not produce data itself, it must propagate the signal upstream—all the way up to the source. If it only adjusted the downstream sending rate, the load issue would still persist, but one hop closer to the source. For example, the Reactive Streams initiative standardizes back-pressure for the Java Virtual Machine and JavaScript [155], but the concept has found wide application in the networking domain for decades [180, 192, 135, 207].
Load Shedding. When resource usage of an overloaded component grows beyond a critical point, load shedding alleviates the situation. This can mean dropping new messages to prevent the message queue from growing, skipping an intensive calculation to shed CPU load, or expunging state to free memory. Even though load shedding can affect the correctness of operation, it may prevent fatal crashes due to resource exhaustion or violations of strict latency requirements. For example, Internet routers drop packets when their queues overflow, the Bro network monitor can choose different depths of packet analysis based on its CPU load [67], and streaming databases can dynamically drop messages such that the maximum relative error across queries is minimized [14].
Load shedding offers a simple, local solution to overload when lost data does not affect correctness of operation, e.g., when the lack of a response causes the sender to retransmit a message. But unless the sender adapts its resource usage, load shedding only addresses the overload situation symptomatically. Conversely, back-pressure attempts to alleviate the load problem by notifying the culprit. A component only needs to know how to propagate or react to flow-control messages and adjust resource usage accordingly, whereas load shedding only proves effective when the sender can infer from the response how to adapt.

When components run both within the same machine and across machines, the sending rate constitutes the common denominator for reacting to load changes. For example, when machine B experiences heavy CPU and memory load and propagates overload upstream to component A on a different machine, A can only avoid sending more data downstream to alleviate the situation. Conversely, when A and B reside on the same machine, flow-control messages can contain more system-specific information and trigger a wider range of adaptations, such as
Figure 4.1: Flow control signaling. Overloaded components (E, I, J) report upstream (B, C, F, G) when they sense that they cannot keep up with the current message rate. A component which receives a flow control signal continues forwarding upstream by default, e.g., when G forwards a signal from J to F. The load-balancer F overrides this default behavior by temporarily removing G and I from the schedule.
adjusting CPU, memory, and I/O resource consumption. For the scope of this thesis, we only focus on the generic network-transparent scenario and therefore only consider adaptation of the sending rate.

In VAST, components map to actors, but the actor model implementation CAF currently does not support flow control. There exist several scenarios in VAST where a sender can easily overload a receiver. Most notably, this can occur during ingestion, when events arrive faster than the system can handle them. Bursts in the input rate, which can occur during denial-of-service attacks or flash crowds, exacerbate capacity planning. To prevent data loss of non-critical, latency-insensitive events, queuing brokers (e.g., Kafka [108]) act as low-pass filters to absorb peaks, but only if the ingestion capacity remains greater than the mean event rate.

During ingestion, event producers (sources) can overload components downstream (importer, archive, index, and exporter). To avoid overflow of the actors’ mailboxes at these components, we implemented a simple yet effective back-pressure mechanism: when an actor becomes overloaded, it sends an overload message to all of its registered upstream components, which either propagate it further or, if data producers themselves, throttle their sending rate. When an overloaded actor becomes underloaded again, it sends an underload message to signal upstream senders that it can handle more data again. This basic mechanism works well to prevent system crashes due to overload.

To illustrate, consider the example topology in Figure 4.1, which models the data flowing between various actors. Overloaded actors (E, I, J) report upstream (B, C, F, G) when they sense that they cannot keep up with the current message rate. An actor that receives a flow control signal continues forwarding it upstream by default, e.g., when G forwards signals from J to F.
Assuming F has a load-balancer function and distributes every message to G, H, and I, it can override the default propagation behavior by temporarily removing G and I from the schedule, while still being able to make progress via the path to H. We implemented this functionality by adding a new actor type to CAF. This actor keeps track
Figure 4.2: Message routing in CAF. Actors drawn with a dashed circle represent proxy actors created by the CAF runtime. When A sends a message to B, the runtime first enqueues the message in the local proxy for B, which then delivers it to the node where B actually runs. In this example, A has obtained a reference to C via B. Therefore, a message from A to C travels via the node of B.
of its upstream nodes. By default, it forwards flow control messages upstream, but users can override the behavior that handles overload and underload messages, e.g., in a load-balancer scenario. We are currently working with the CAF developers to integrate flow control deeper into the CAF runtime.
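The overload/underload signaling described above can be sketched as a small Python model. This is purely illustrative: VAST and CAF are written in C++, and all names here (`Component`, `LoadBalancer`, `on_downstream_signal`) are hypothetical, not the actual API. The sketch shows the two behaviors from Figure 4.1: default upstream propagation until a data producer throttles itself, and the load-balancer override that masks overloaded children as long as one path can still make progress.

```python
# Illustrative model of VAST's back-pressure signaling (hypothetical
# names, not the real CAF/VAST API).

class Component:
    def __init__(self, name):
        self.name = name
        self.upstream = []      # components that send to us
        self.throttled = False  # set on data producers (sources)

    def signal(self, kind):
        """Receive a flow-control signal and propagate it upstream."""
        if not self.upstream:   # data producer: adjust sending rate
            self.throttled = (kind == "overload")
            return
        for sender in self.upstream:
            sender.signal(kind)

class LoadBalancer(Component):
    def __init__(self, name, downstream):
        super().__init__(name)
        self.schedule = list(downstream)  # round-robin targets

    def on_downstream_signal(self, child, kind):
        """Override: mask overloaded children instead of propagating,
        escalating upstream only when no downstream path remains."""
        if kind == "overload":
            if child in self.schedule:
                self.schedule.remove(child)
            if not self.schedule:        # all paths blocked: escalate
                self.signal("overload")
        else:                            # underload: child recovered
            if not self.schedule:        # we were fully blocked before
                self.signal("underload")
            self.schedule.append(child)
```

For the topology of Figure 4.1, a source A feeding the load-balancer F with children G, H, and I stays unthrottled while F can still route around an overloaded child; only when all children report overload does the signal reach A.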
4.1.2 Resolving Routing Inefficiencies with Direct Connections

During our evaluation of distributed VAST deployments, we discovered a deficiency in CAF's message routing. When deploying a system across multiple nodes, the logical topology may diverge from the physical topology, which can lead to inefficient message routing that particularly penalizes performance in data-intensive scenarios. For example, consider three actors A, B, and C, each running on its own node. Figure 4.2 illustrates the scenario. Actors with a dashed circle represent proxy actors created by the CAF runtime, which act as a local handle to interact with the actor. There exists exactly one connection between two nodes, and CAF multiplexes messages over this single connection. In the example, there exist two connections: between the left-most and middle node, and between the middle and right-most node. When A sends a message to B, the runtime first enqueues the message in the local proxy for B, which then delivers it to the node where B actually runs. In this example, A has obtained a reference to C via the node in the middle. Therefore, a message from A to C spans two hops. In this design, the propagation of actor handles over the network determines the message forwarding table of each node. The issue arises whenever an application sends an actor handle as a message to an actor on a remote node. In multi-node systems that scale by distributing load across multiple nodes, a divergence of physical and logical topology can introduce opaque bottlenecks at lower layers. In VAST, this occurs naturally in the meta store (see §3.2), the distributed key-value store for global system state. When an operator spawns a new actor on a node which peers with other nodes, the node saves the actor handle in its local meta store, from where it automatically replicates to all other nodes.
When a user later wants to access this actor via another node in the system, it may receive an inefficient path. To solve this problem, we added a new direct connection optimization to CAF. When enabled, two nodes automatically establish a direct connection when they exchange actor handles. Internally, each CAF node contains a networking broker, a special actor which adds a thin layer of message framing for remote communication. We enhanced the protocol of this broker such that it supports establishing direct connections to nodes for which no direct route exists. As a result, we can now automatically create a full mesh between all nodes in a cluster and avoid inefficiencies in the underlying routing. Even though a full mesh requires a quadratic number of connections, we do not anticipate more than on the order of hundreds of nodes participating in a VAST cluster, for which a full mesh setup works well.
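The hop-count effect of the optimization can be illustrated with a toy graph model (not CAF code; the `Node` class and `hops` helper are hypothetical). Before the optimization, a message from A to C must transit B's node; after establishing the direct connection, it takes a single hop.

```python
# Toy model: actor handles propagated via an intermediate node yield
# multi-hop routes; the direct-connection optimization shortens them.

class Node:
    def __init__(self, name):
        self.name = name
        self.links = set()      # nodes with a direct connection

    def connect(self, other):
        self.links.add(other)
        other.links.add(self)

def hops(src, dst):
    """BFS over direct links: number of hops a message travels."""
    frontier, seen, n = {src}, {src}, 0
    while dst not in frontier:
        frontier = {x for f in frontier for x in f.links} - seen
        seen |= frontier
        n += 1
    return n

a, b, c = Node("A"), Node("B"), Node("C")
a.connect(b)            # A peers with B
b.connect(c)            # B peers with C; A reaches C only via B
two_hop = hops(a, c)    # 2 hops before the optimization
a.connect(c)            # direct connection: part of the full mesh
one_hop = hops(a, c)    # 1 hop afterwards
```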
4.2 Composable and Type-Rich Indexing
Until now, we glossed over the implementation details which link VAST's type-rich data model to concrete index structures. What happens once the data values reach the concrete indexes? How do we append data and retrieve results? In this section, we present our novel indexing framework, which elevates the indexes from §2.4.2 and §2.4.3 into a higher level of abstraction. In our implementation, we exclusively rely on bitmap indexes, but we keep the following discussion abstract because the concepts apply to both bitmap and inverted indexes.

The index design space we describe in §2.4.3 offers various tuning knobs with vastly different performance implications. Choosing the right set of parameters depends on the specific use case. For each type in VAST's data model, there exist different requirements derived from the supported query operations. For example, a typical operation on numeric values involves inequality queries, whereas IP address lookups primarily involve top-k prefix search. However, supporting inequality and top-k search requires very different index layouts and parameterizations.

Before discussing layout and operations of each data type in detail, we establish some overarching framing. Recall from §3.1.1 that a value consists of a type and corresponding data. A value can exhibit no data, in which case it only carries type information. We define a value index V = ⟨N, D⟩ as a composite data structure consisting of a null index N to represent whether data is null (implemented as a single identifier set), and a data index D, which represents a type-specific index. A value index V supports the same key operations as the index from §2.4.2:
1. V ← x^(α): append a value x with ID α. This operation consists of two parts: it records whether x is null, and only if non-null, stores x^(α) in the concrete data index. (This process resembles checking the validity of a pointer, and dereferencing it if and only if it is non-null.)
2. V ∘ x: look up value x under operator ∘. This operation retrieves the identifier set S = {α | z^(α) ∘ x ∧ z ∈ V}. The value index only supports the subset of relational operators ∘ according to its data index. For example, a range lookup fails for an IP address index because it only supports top-k prefix search.
In the following, we present the construction of data indexes for the types VAST defines. We also provide a summary of append and lookup operations of all concrete data indexes in Table 4.1 and Table 4.2.
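The null-handling wrapper of the value index can be sketched as follows. This is an illustrative Python model, not VAST's C++ implementation; the trivial `EqualityIndex` stand-in (a hypothetical name) only exists so the sketch is self-contained.

```python
# Sketch of the value index V = <N, D>: a null index N (identifier
# set of IDs whose data is null) plus a type-specific data index D.

class ValueIndex:
    def __init__(self, data_index):
        self.null = set()        # null index N
        self.data = data_index   # data index D

    def append(self, x, alpha):
        """V <- x^(alpha): record nullness; store only non-null values."""
        if x is None:
            self.null.add(alpha)
        else:
            self.data.append(x, alpha)

    def lookup(self, op, x):
        """V o x: delegate to D, which may reject unsupported operators."""
        return self.data.lookup(op, x)

# Trivial stand-in data index supporting only equality lookups.
class EqualityIndex:
    def __init__(self):
        self.ids = {}            # value -> identifier set

    def append(self, x, alpha):
        self.ids.setdefault(x, set()).add(alpha)

    def lookup(self, op, x):
        if op != "==":
            raise ValueError("unsupported operator: " + op)
        return self.ids.get(x, set())
```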
4.2.1 Boolean Index

The boolean index B operates only on the two values true and false, and therefore consists of a single identifier set S. Presence in S indicates true, absence false. To add a value x^(α), we proceed as follows:
B ← x^(α) ≡ S ← α if x = true        (4.1)

A boolean index only supports equality and inequality lookups:
B ∘ x ≡
    S̄    if x = false
    S    if x = true        (4.2)
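A minimal sketch of these operations in Python (illustrative, not the bitmap-backed implementation) follows; it assumes dense, contiguous IDs so that the complement can be taken over the range from the first to the last appended ID, as the text explains next.

```python
# Sketch of the boolean index B: a single identifier set S, with the
# complement framed by the first and last ID appended to the index.

class BooleanIndex:
    def __init__(self):
        self.s = set()           # presence in S indicates true
        self.first = None        # smallest ID seen
        self.last = None         # largest ID seen

    def append(self, x, alpha):
        if self.first is None:
            self.first = alpha
        self.last = alpha
        if x:
            self.s.add(alpha)

    def lookup(self, x):
        """Equality lookup B = x (assumes dense, contiguous IDs)."""
        if x:
            return set(self.s)
        frame = set(range(self.first, self.last + 1))
        return frame - self.s    # complement over [first, last]
```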
When computing the complement S̄ of an identifier set S, we take as reference frame the first and last ID of the index. For example, if a boolean index B = {4, 5, 6} begins at ID 2 and ends at ID 8, the lookup operation B = false yields {2, 3, 7, 8} as result.

4.2.2 Integral Index

Integral types represent number types, such as count and int. The major challenge for such types lies in supporting both lookups using the operators {<, ≤, =, ≠, ≥, >} and a high-cardinality value domain. To address the operator challenge, we look back to the coding schemes from §2.4.3. The operators VAST's query language supports stem from the query class 1RQ ∪ EQ, i.e., the query class of one-sided range and equality queries as in the above-mentioned set of operators. For this class, both range and interval coding constitute viable candidates. We choose range coding due to its simplicity and defer a thorough empirical analysis that compares range and interval coding to future work. To address the high-cardinality challenge, we rely on multi-component indexes (see §2.4.4). How many components do we need for the 64-bit integral types count and int? To represent 2^64 distinct values with a uniform base β = ⟨b, …, b⟩, we need ⌈log_b 2^64⌉ components. We choose b = 10 to cater to human arithmetic, which yields 20 components and thus (10 − 1) ×
20 = 180 bit vectors across all range-encoded components. (Note that in practice, unused components do not occupy space.) We briefly experimented with smaller values, such as b = 2, to minimize the number of identifier sets to 64. While this results in fewer identifier sets, each individual one becomes harder to compress. As mentioned above, since our focus lies on putting together an end-to-end system, as opposed to finding the optimal parameterization, we defer an in-depth empirical study of this subject to future work. An interesting direction to explore in the future could aim at deriving an optimal base for a given value distribution, and then re-applying this base to existing indexes as well as applying it to new data.

The use of multi-component indexes introduces a complication: value decomposition according to a fixed base does not work directly with signed integer arithmetic. We cannot interpret a w-bit signed integer as an unsigned integer, because two's complement representation exhibits a different bitwise total ordering. But we can transform w-bit signed integers from [−2^(w−1), 2^(w−1)) into unsigned w-bit integers using a "bias" of 2^(w−1), which simply shifts the smallest value of −2^(w−1) to start at 0 in the unsigned representation:
f_w(x) = x + 2^(w−1)
We use this technique to specify signed integral indexes in terms of unsigned indexes. We define two indexes for unsigned and signed integral numbers: the count index C = Φ^64 and the integer index I = Φ^64 with respect to the above-mentioned bias. To add a value to a count index, we compute:
C ← x^(α) ≡ Φ^64 ← x^(α)        (4.3)

To add a value to an integer index, we compute:
I ← x^(α) ≡ Φ^64 ← f_64(x^(α))        (4.4)

The lookup follows analogously:
C ∘ x ≡ Φ^64 ∘ x        (4.5)

And for an integer index:
I ∘ x ≡ Φ^64 ∘ f_64(x)        (4.6)
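The bias transform f_w and the base-10 value decomposition can be worked through concretely. The sketch below is illustrative (the function names are ours, not VAST's); it shows that the smallest signed 64-bit value maps to 0, the largest to 2^64 − 1, and that 20 base-10 digits suffice to represent any transformed value.

```python
# Bias transform for signed integers and base-b value decomposition
# as used by the 20-component integral index (illustrative names).

W = 64
BIAS = 1 << (W - 1)  # 2^(w-1)

def f_w(x):
    """Map a signed w-bit integer into the unsigned domain."""
    return x + BIAS

def decompose(x, base=10, components=20):
    """Split an unsigned value into digits, least significant first."""
    digits = []
    for _ in range(components):
        digits.append(x % base)
        x //= base
    return digits
```

For example, f_w(−2^63) = 0 and f_w(2^63 − 1) = 2^64 − 1, so the whole signed domain becomes representable by the unsigned multi-component index.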
Figure 4.3: IEEE 754 double-precision floating point. The 64-bit value includes a 1-bit sign, an 11-bit exponent, and a 52-bit significand. There exist two special exponent values: 0 indicates subnormals; 2047 means ∞ if the significand (or mantissa) equals 0, and NaN (Not a Number) otherwise.
4.2.3 Floating Point Index

For IEEE 754 double-precision floating point values [107], we use a different concept. This type consists of one sign bit s, 11 bits for the exponent e, and 52 bits for the significand m. Assembling a value from these components involves computing:
(−1)^s × 2^(e−1023) × (1 + Σ_{i=1}^{52} m_{52−i} 2^(−i))

The exponent assumes a special role. With 11 bits, it can represent values in the range [0, 2048). Two exponent values have a particular meaning: the lowest value 0 indicates subnormals, and the highest value 2047 means either ∞ if the significand equals 0, or NaN (Not a Number) otherwise. All of these special values have their negative equivalent, as we show in Figure 4.3. Per the above formula, the exponent comes with a bias of 1023. The smallest value e = 1 yields −1022 and the largest value e = 2046 yields 1023. Subnormals (e = 0) fill the underflow gap around 0, and change the computation of the floating-point value as follows:
(−1)^s × 2^(−1022) × Σ_{i=1}^{52} m_{52−i} 2^(−i)

Instead of attempting to perform a similar transformation as for signed integers to map the number space into the domain of 64-bit unsigned integers, we construct a special floating-point index. This index closely follows the structure of the IEEE 754 double-precision format: one identifier set for the sign, 11 for the exponent, and 0–52 for the significand. (A variation for IEEE 754 single precision works analogously.) Figure 4.4
sign (1) | exponent (11) | significand (0–52)
Figure 4.4: IEEE 754 floating-point index. The structure directly follows the representation in Figure 4.3, where each bit in the value maps to its own identifier set S_i.
depicts the structure of the real index. By allowing for a variable number of identifier sets for the significand, we get the same effect as rounding. This allows users to customize the desired precision. Instead of effectively implementing a bit-sliced index for exponent and significand, we could equally choose a multi-component bitmap index for some base β. Let F = ⟨S, E, M⟩ denote the floating point index for type real, with a boolean index S = B for the sign, a bit-sliced index E = Θ^11 for the exponent, and a bit-sliced index M = Θ^52 for the significand. Likewise, let x_s denote the sign bit, x_e the 11-bit exponent, and x_m the 52-bit significand of x. Adding a value x = ⟨x_s, x_e, x_m⟩ involves computing:
F ← x^(α) = ⟨x_s, x_e, x_m⟩^(α) ≡
    S ← x_s^(α)
    E ← x_e^(α)        (4.7)
    M ← x_m^(α)
We define ◦̄ as the "mirrored" operator of ◦ (e.g., < and >, or ≤ and ≥). Then, a lookup for a value x and operator ◦ translates into computing:
F ◦ x ≡
    S = x_s ∧ E ◦ x_e ∧ M ◦ x_m          if ◦ ∈ {=, ≠}
    S = 0 ∧ E ◦ x_e ∧ M ◦ x_m            if x ≥ 0 ∧ ◦ ∈ {>, ≥}
    S = 1 ∨ (E ◦ x_e ∧ M ◦ x_m)          if x ≥ 0 ∧ ◦ ∈ {<, ≤}        (4.8)
    S = 0 ∨ (E ◦̄ x_e ∧ M ◦̄ x_m)          if x < 0 ∧ ◦ ∈ {>, ≥}
    S = 1 ∧ E ◦̄ x_e ∧ M ◦̄ x_m            if x < 0 ∧ ◦ ∈ {<, ≤}
We define equality lookup of two floating point values x and y as |x − y| < ε, where ε is determined by the precision of the significand. Unlike with exact indexes, equality does not imply transitivity.
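Extracting the ⟨x_s, x_e, x_m⟩ triple that the floating-point index stores can be sketched with the standard library (this is an illustration of the bit layout, not VAST's code; `decompose_double` and `significand_prefix` are hypothetical names):

```python
# Decompose an IEEE 754 double into <sign, exponent, significand>.
import struct

def decompose_double(x):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # 11 bits, bias 1023
    significand = bits & ((1 << 52) - 1)   # 52 bits
    return sign, exponent, significand

def significand_prefix(m, k):
    """Keep only the k most significant bits of the significand,
    mirroring the rounding effect of using k identifier sets."""
    return m >> (52 - k)
```

For example, 1.0 decomposes into sign 0, exponent field 1023 (unbiased 0), and significand 0, while −2.0 yields sign 1 and exponent field 1024.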
4.2.4 Duration & Time Index

VAST represents data of type duration as 64-bit signed integers in nanosecond resolution. Thus, the maximum representable time span amounts to ±292.3 years. Since duration and int are congruent (per §3.1.1, two types are congruent if they have the same physical representation), we express durations with an integer index. The type time describes a fixed point in time, which VAST internally expresses as a duration relative to an epoch. We anchor time points at the UNIX epoch, January 1, 1970. Since the epoch constitutes an intrinsic part of the type, time is congruent to duration.

For both duration and time, it makes sense to explore a different value decomposition than for standard integral indexes. As we sketched in §2.4.4, when humans express time, they naturally think in hours, minutes, and seconds, as opposed to representing every duration value in seconds only. For example, the base β = ⟨365, 24, 60, 60, 1000, 1000⟩, which ranges from nanoseconds, milliseconds, seconds, hours, days, to years, fits the time domain more aptly than a uniform base 10 as in generic arithmetic. (This base results in Σ_{i=0}^{5} β_i = 2509 identifier sets. To achieve a smaller footprint, one may split sub-second components into 10 × 100 instead of 1000, yielding a total of 729 identifier sets, for example. Further base decomposition allows for trading space against time.)

Aside from a natural fit for the domain, this mixed-radix representation brings another advantage: analysis of periodicity. For example, an infected machine may check once a minute with its backend infrastructure whether to perform a certain task, such as displaying an ad or attacking another machine. Assuming analysts catch an instance of this check at time 08:05:42, they may look at activity with that particular second value of 42, ignoring all other components of time. With a domain-specific base, it suffices to look at a single component of the underlying index, which can have a substantial impact on query latency.

In summary, we define the duration index as D = I and the time index as T = D. Append and lookup operations are identical to those of the integral index, which we defined in §4.2.2.
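The mixed-radix decomposition can be sketched as follows. This is an illustrative model with an assumption made explicit: we treat the two trailing 1000-components as milliseconds within a second and microseconds within a millisecond, so the input is a value in microseconds. The periodicity example from the text then amounts to reading a single component.

```python
# Mixed-radix decomposition for time values, using the base
# <365, 24, 60, 60, 1000, 1000> from the text (most significant
# component first). Input assumed to be in microseconds.

BASE = [365, 24, 60, 60, 1000, 1000]

def decompose_time(x):
    """Split a value into mixed-radix components, least significant
    first: [us, ms, seconds, minutes, hours, days]."""
    parts = []
    for radix in reversed(BASE):
        parts.append(x % radix)
        x //= radix
    return parts
```

For a timestamp at 08:05:42 within a day, the seconds component alone yields the value 42, so a periodicity query touches only that one component of the index.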
4.2.5 String Index

Operationally relevant intelligence often takes the form of particular markers or patterns in strings, such as a hostname, a URL path, or an email subject. To identify these markers, analysts express their queries as equality, substring, or similarity lookups. Therefore we focus on the operators {=, ≠, ∈, ∉, ∼, !∼}, where the last two represent similarity search. Regular expressions certainly prove even more powerful, but supporting this form of lookup via index structures poses major challenges. Therefore, most search systems that do not scan the full data offer only substring search or limited forms of "globbing."
Unlike bounded numeric types, strings prove more difficult to index due to their variable length and the desired query operations. To leverage existing numeric index types, a common approach to string indexing involves a dictionary that maps each unique string to a non-negative integer [174]. However, constructing a space-efficient dictionary poses a challenge in itself [131, 104]. While a dictionary-based design achieves a constant-space index structure, it only supports equality lookups: for substring search, one must search the entire dictionary key space to obtain a set of string identifiers, and then look up this set in the index. This takes time O(C + h), where C denotes the cardinality and h the number of hits.

A slight modification skips the indirection through numeric identifiers and directly attaches the keys to the identifier sets, as in Figure 2.6. That is, the identifier sets represent the codomain of the dictionary. However, this approach no longer benefits from the exponential cardinality reduction we obtain from multi-component indexes, and would require space O(NC) for an index with N values, as opposed to O(N Σ_{i=0}^{n−1} β_i). Unless the dictionary implementation relies on a hash table, it also exhibits a much lower throughput compared to the constant-time multi-component framework with numeric data.

Instead of using a stateful dictionary, another approach simply relies on a hash function to compute a unique string identifier [181]. The difference to the bijective dictionary is that two string values may now map to the same identifier, which requires a candidate check. For a hash function that produces a 64-bit digest, collisions affecting the candidate check approximately start to occur after √(2^64) ≈ 4 billion events. While space-optimal due to the absence of a stateful dictionary, and time-efficient due to fast computation, this approach does not support substring search.
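The stateless hash-based identifier scheme can be sketched in a few lines. This is illustrative only: the choice of BLAKE2 here is ours, not necessarily what [181] or VAST uses, and the equality lookup would still require a candidate check against the base data because distinct strings may collide in the 64-bit digest space.

```python
# Stateless string identifiers via a 64-bit hash digest.
import hashlib

def string_id(s):
    """Map a string to a 64-bit digest (one possible hash choice)."""
    digest = hashlib.blake2b(s.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

# An equality lookup indexes string_id(x) like any numeric value;
# hits are candidates only and must be verified against the raw data.
```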
We propose a new approach to string indexing that supports both equality and substring search, yet operates in a stateless fashion without a dictionary. Moreover, our approach can support similarity search as well. In a nutshell, our string index S = ⟨φ, κ_0, …, κ_M⟩ consists of an index φ = K^β for the string length, plus M character indexes κ_i = Φ^8, where M is the length of the largest string added to the index so far. Figure 4.5 illustrates how the index behaves when appending strings of various sizes. Each append operation adds the length of the string to the size index φ, and each character to the character index κ_i. For example, for the string foo, we have φ ← 3, κ_0 ← f, κ_1 ← o, and κ_2 ← o. More formally, we add a string value x^(α) = ⟨x_1, …, x_n⟩^(α) to S as follows:
S ← ⟨x_1, …, x_n⟩^(α) ≡
    φ ← n^(α)
    κ_i ← x_i^(α)  ∀ 1 ≤ i ≤ n        (4.9)

Performing a lookup of a substring x involves computing: