High Performance Hypergraph Analytics of Domain Name System Relationships

High Performance Hypergraph Analytics of Domain Name System Relationships Cliff A Joslyn Sinan Aksoy Dustin Arendt Pacific Northwest National Laboratory Pacific Northwest National Laboratory Pacific Northwest National Laboratory Seattle, WA, USA Richland, WA, USA Richland, WA, USA [email protected] [email protected] [email protected] Louis Jenkins Brenda Praggastis Emilie Purvine University of Rochester Pacific Northwest National Laboratory Pacific Northwest National Laboratory Rochester, NY, USA Seattle, WA, USA Seattle, WA, USA [email protected] [email protected] [email protected] Marcin Zalewski Pacific Northwest National Laboratory Seattle, WA, USA [email protected] Abstract—We report on the use of novel mathematical methods methods are well known in discrete mathematics, and are in hypergraph analytics over a large quantity of DNS data. Hy- closely related to important objects in data science such as pergraphs generalize graphs, as used in network science, to better bipartite graphs, set systems, partial orders, finite topologies, model complex multiway relations in cyber data. Specifically, casting DNS data from Georgia Tech’s ActiveDNS repository as and especially graphs proper, which they directly generalize hypergraphs allows us to fully represent the interactions between (every graph is a 2-uniform hyergraph). In HPC, hypergraph- collections of domains and IP addresses. To facilitate large-scale partitioning methods help enable parallel matrix computations analytics, we fielded an analytical pipeline of two capabilities. [7], and have applications in VLSI [12]. In the network HyperNetX (HNX) is a Python package for the exploration science literature, researchers have devised several path and and visualization of hypergraphs, acting as a frontend. For the backend, the Chapel HyperGraph Library (CHGL) is a motif-based hypergraph data analytics (albeit fewer than their library for high performance hypergraph analytics written in graph counterparts), such as in clustering coefficients [14] and the exascale programming language Chapel. CHGL was used centrality metrics [8]. to process gigascale DNS data, performing compute-intensive Complex data commonly analyzed using network science calculations for data reduction and segmentation. Identified methods, and especially including cyber data, often contain portions are then sent to HNX for both exploratory analysis and knowledge discovery targeting known tactics, techniques, multi-way interactions. But while they thus present naturally and procedures. as hypergraphs, still hypergraph treatments are very unusual Index Terms—Hypergraphs, DNS, high performance comput- compared to graph representations of the same data. This ing, Chapel. is due at least to the greater mathematical, conceptual, and I. INTRODUCTION computational complexity of hypergraph methods, since as n Many problems in data analytics involve rich interactions data objects, hypergraphs scale as O(2 ) in the number of 2 amongst multiple entities, for which graph representations are vertices n, as opposed to O(n ) for graphs. In the face of commonly used. High order (high dimensional) interactions, this, complex data are typically collapsed or are simplified to which abound in cyber and social networks, can only be graphs to ease analysis. represented in graphs as highly inefficiently coded, “reified” Our research group is dedicated to facing the challenge of labeled subgraphs. Lacking multi-dimensional relations, it the complexity of hypergraphs in order to gain the formal is hard to address questions of “community interaction” in clarity and support for analysis of complex cyber data they graphs: e.g., how is a collection of entities A connected to provide. A substantial high-performance computing (HPC) another collection B through chains of other communities?; component is thus necessary, despite hypergraph analytics where does a particular community stand in relation to other not receiving much attention in the software engineering communities in its neighborhood? community at large, and the HPC community in particular. We Hypergraphs [2] are generalizations of graphs which al- thus pursue a two-fold approach to developing our methods: low edges to connect any number of vertices. Hypergraph 1) We employ the Chapel Hypergraph Library (CHGL, https://github.com/pnnl/chgl) [11]), a library for hyper- PNNL-SA-139836 graph computation in the emerging Chapel programming language [5], [6], for HPC hypergraph processing, large scale analysis, and data segmentation. 2) We explore single hypergraphs or collections of hypergraphs using HyperNetX (HNX, https://github.com/pnnl/HyperNetX), a Python library being developed by PNNL for exploratory data analytics and visualization of hypergraphs. In our work, CHGL and HNX are two stages of an analytical pipeline: CHGL provides a highly abstract interface for implementation of HPC hypergraph algorithms over large data, identifying segments and subsets which can then be passed to Fig. 1: (Left) An Euler diagram of an example hypergraph H. HNX for more detailed analysis. (Right) Its incidence matrix I. In this paper we first introduce the rudiments of hypergraph mathematics and hypernetwork science in the context of our CHGL and HNX capabilities. We then describe the DNS data set, selections of the ActiveDNS data sets from Georgia An s-path is a sequence of edges he0; e1; : : : ; eni such Tech University. We then describe CHGL, before going on that each ei−1; ei are s-adjacent for 1 ≤ i ≤ n; and an to describe the results of our demonstration analyses. These s-component is a maximal collection of edges any pair of include both basic global statistics like degree and edge size which is connected by an s-path. The s-diameter of an distributions, as well as exploratory and targeted discovery of s-component is the length of its longest shortest s-path. small components. The exploratory discovery involves motif Comparing again to graphs, graph paths are all 1-paths, and mining and computation of simple hypergraph metrics to graph components all 1-components. Our example has two 1- discover outliers. On the other hand, targeted discovery is components (shown obviously), but also four 2-components motivated by known bad activity. We built a blacklist of IP (listed edge-wise) fA; F; G; Hg; fB; Dg; fCg and fEg. It’s addresses and domains that follow a known pattern used by 3- and 4-components are each single edges of size larger than a global criminal operation as described in a FireEye Threat 3 or 4 (respectively), and it has no 5 or higher components. Research Blog entry [4]. Given a hypergraph H, it is possible to construct smaller representations which capture important properties: II. HYPERGRAPH ANALYTICS • Note that in our example, the edges A = F and H = hV; Ei V An undirected hypergraph is a pair with a B = D, and the vertices 1 = 9 and 7 = 8, are E finite, non-empty set of vertices, and a non-empty multiset equivalent, represented as duplicate columns and rows e 2 E 8e 2 of hyperedges (or just “edges” when clear), where in I respectively. Collapsing is the process of combining E; e ⊆ V . Hypergraphs can be represented in many forms, these and replacing them with a representative, while also H two of which are shown in Fig. 1 for a small example with possibly maintaining a multiplicity count to be used for V = f1; 2;:::; 9g jV j = 9 1 , representing IP addresses. On a weighting. The edges E are hereby transformed from a the left is an Euler diagram showing each of eight hyperedges multiset to a set. A; B; : : : ; H, representing domains, as a “lasso” around its • Additionally, note that after collapsing, the smaller 1- V × E I vertices. On the right is a incidence matrix , where component becomes an isolated singleton, effectively a hv; ei 2 I v 2 e v 2 a non-null cell indicates that for some collection of non-interacting vertices, or a diagonal block V; e 2 E We call. each hyperdge e 2 E an s-edge where s = jej. in I. These are especially common in DNS data. Pre- Thus all graphs are hypergraphs, in that all graph edges are collapse, an isolated singleton would indicate the normal, 2-edges, for example H = f4; 5g, saying that the domain H uninteresting behavior in DNS where a single IP is has two IPs 4 and 5. But F = f1; 2; 3; 9g is a 4-edge, with associated with a single domain, and vice versa. But post- domain F having those four IPs. This is not representable in a collapse, they indicate a collection of IPs and domains graph. Where each column of the incidence matrix of a graph which are universally associated only with themselves, has exactly two cells, those of hypergraphs are unrestricted. effectively forming a set of domain and IP aliases. In this Our research group is pursuing hypergraph analytics as an work, these are counted and pruned, but in the future they analog to graph analytics [13]. While our development is could be attended to with respect to their multiplicities. consistent with others in the literature [8], [15], our notation • Finally, note that H ⊂ G is an included edge. Non- and concepts are somewhat distinct. We say that two edges included edges are called toplexes, and not only is the e; f 2 E are s-adjacent if je \ fj ≥ s for s ≥ 1. An s-star is collection of toplexes much smaller than H, but it is a set of edges F ⊆ E sharing exactly a common intersection sufficient to derive some hypergraph information, for f ⊆ V , with jfj ≥ s, so that 8ei; ej 2 F we have ei \ej = f. example s-components. 1H can also be represented as a bipartite graph on the disjoint union V tE, Table I shows some important statistics for our example, with each component a distinct part. first for the initial hypergraph, then after collapsing, and Non-Singleton Initial Collapsed Components In order to explore large volumes of DNS mappings jV j 9 7 6 we turned to ActiveDNS, a data set maintained by jEj 8 6 5 the Astrolavos Lab at Georgia Institute of Technology Aspect ratio 1.125 1.167 1.200 # Cells 23 14 13 (https://activednsproject.org).

Load more