IDENTIFYING APPLICATION PROTOCOLS IN COMPUTER NETWORKS USING VERTEX PROFILES

By

Edward G. Allan, Jr.

A Thesis Submitted to the Graduate Faculty of

WAKE FOREST UNIVERSITY

in Partial Fulfillment of the Requirements

for the Degree of

MASTER OF SCIENCE

in the Department of Computer Science

December 2008

Winston-Salem, North Carolina

Approved By:

Errin W. Fulp, Ph.D., Advisor

Examining Committee:

David J. John, Ph.D., Chairperson
William H. Turkett, Jr., Ph.D.

Acknowledgements

This thesis is the product of many people’s labors, not just my own. The ideas contained in the pages that follow have been formulated and refined for over a year, with the guidance and support of several people, whose assistance I would be remiss not to mention. I would like to thank Wake Forest University and GreatWall Systems, Inc. for their support. This research was funded by GreatWall Systems, Inc. via the United States Department of Energy STTR grant DE-FG02-06ER86274.¹

I would also like to thank my parents for their support throughout my years at Wake Forest, both as an undergraduate and as a graduate student. Without their encouragement and financial assistance, none of this would have been possible. I also would not be where I am today without the help of my friends, who have made these past several years some of the most enjoyable and most memorable yet.

My thesis committee members, Dr. David John and Dr. William Turkett, Jr., were instrumental in providing me with feedback throughout the research and writing process. Their comments and criticism have undoubtedly enabled the success of this endeavor. I would especially like to thank Dr. Turkett for selflessly spending hours assisting me and stepping in as my “adopted advisor” during Dr. Errin Fulp’s sabbatical.

Last, but certainly not least, I must thank my advisor, Dr. Errin Fulp. I have been fortunate to work with him in a variety of contexts for more than five years now, and he has been a tremendous influence on both my personal and academic development. His relaxed personality and great sense of humor kept me off-task just enough to save my sanity, while his insight and guidance allowed me to complete my studies and be ready to move on to the next chapter in my life.

Many thanks again to all who have helped me along the way — you are much appreciated.

¹The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the DOE or the U.S. Government.

Table of Contents

Acknowledgements...... ii

Illustrations...... vi

Abbreviations...... viii

Abstract...... x

Chapter 1 Introduction ...... 1
1.1 Issues in Network Management and Security ...... 2
1.2 Current Methods of Network Analysis ...... 2
1.2.1 Applications and Port Numbers ...... 3
1.2.2 Packet Inspection ...... 4
1.3 Interdisciplinary Study of Network Communications ...... 4
1.3.1 Social Networks ...... 5
1.3.2 Biological Networks and Motifs ...... 6
1.4 Outline ...... 7

Chapter 2 Computer Networks and Communications ...... 8
2.1 Network Topologies and Architectures ...... 8
2.2 Reference Models ...... 10
2.2.1 The OSI Model ...... 10
2.2.2 The TCP/IP Model ...... 12
2.3 Layer 3: The Network Layer ...... 13
2.4 Layer 4: The Transport Layer ...... 13
2.5 Layer 7: The Application Layer ...... 14

Chapter 3 Graph Analysis ...... 16
3.1 Graph Terminology and Basic Properties ...... 16
3.2 Types of Graphs ...... 17
3.3 Traditional Graph Measures ...... 18
3.3.1 Distances and Path Lengths ...... 18
3.3.2 Centrality Measures ...... 19
3.3.3 Clustering Coefficient ...... 21


3.3.4 Application of Traditional Graph Measures in Computer Networks ...... 22
3.4 Network Motifs ...... 22
3.4.1 Definition of a Motif ...... 23
3.4.2 Function of Motifs ...... 24
3.5 Analysis of Application Graphs ...... 25

Chapter 4 Data Selection and Considerations ...... 26
4.1 Network Trace Files ...... 26
4.2 Challenges Associated with Network Data Collection ...... 26
4.2.1 Data Capture ...... 27
4.2.2 Privacy and Sanitization of Data ...... 28
4.2.3 Network and Data View ...... 29
4.3 Data Sources ...... 30
4.3.1 Dartmouth College Wireless Traces ...... 31
4.3.2 LBNL/ICSI Enterprise Tracing Program ...... 31
4.3.3 OSDI Conference Network Traces ...... 31
4.4 Protocol Selection ...... 32

Chapter 5 Experimental Methodology ...... 36
5.1 Hardware and Linux System ...... 36
5.2 Packet Capture and Storage ...... 37
5.3 Creation of Application Graphs ...... 37
5.4 Traditional Graph Measures ...... 39
5.5 Motif Analysis ...... 40
5.6 Vertex Profiles ...... 43
5.7 K-Nearest Neighbor Classification ...... 44
5.7.1 Measuring Profile Separation ...... 45
5.7.2 Cross Validation of Classification Results ...... 46
5.8 Genetic Algorithm Feature Weighting ...... 46
5.8.1 Overview of Genetic Algorithms ...... 47
5.8.2 Feature Weighting ...... 48

Chapter 6 Results and Analysis ...... 49
6.1 Preliminary Investigations ...... 49
6.2 Initial Results ...... 50
6.2.1 Traditional Graph Measure Profiles ...... 51
6.2.2 Motif-based Profiles ...... 54
6.3 Weighted Profiles and Key Attributes ...... 57

6.3.1 Attribute Weights of Traditional Graph Measures ...... 58
6.3.2 Attribute Weights of Motif-based Measures ...... 59
6.4 Comparison of Profile Types ...... 61
6.5 Considerations for Optimizing Classifier Performance ...... 63
6.6 Limitations of Current Approach ...... 66

Chapter 7 Conclusions and Future Work ...... 67

References ...... 71

Appendix A Examples of Application Graphs ...... 76

Appendix B Code Listings...... 78

Appendix C Test Parameters...... 85

Appendix D Additional Classification Results ...... 87

Vita ...... 88

Illustrations

List of Tables

4.1 Summary statistics of three trace files examined ...... 31

5.1 Graph orders for each application protocol ...... 38

6.1 Classification accuracy of 65 application graphs ...... 50
6.2 An example confusion matrix with three classes ...... 50
6.3 Confusion matrix of unweighted traditional graph measures ...... 52
6.4 Number of single and multi-class ties for traditional graph measures ...... 53
6.5 Confusion matrix of unweighted motif-based profiles ...... 55
6.6 Number of single and multi-class ties for motif-based profiles ...... 55
6.7 Percentage of original data used in motif-based profiles ...... 57
6.8 Attribute weights for traditional graph measures ...... 58

C.1 FANMOD test parameters ...... 85

D.1 Confusion matrix of 65 application graphs using motif frequencies ...... 87
D.2 Confusion matrix of weighted traditional graph measures ...... 87
D.3 Confusion matrix of weighted motif profiles ...... 87

List of Figures

1.1 Example output from NetStat ...... 3
1.2 Graphical depiction of a social network with two distinctly visible clusters ...... 6

2.1 Four network topologies: bus, ring, star and mesh [1] ...... 9
2.2 The OSI and TCP/IP reference models [2] ...... 11
2.3 An IP datagram header [2] ...... 13
2.4 UDP and TCP datagram headers [2] ...... 14
2.5 Example communication between a client and a web server ...... 15

3.1 A graph with five nodes and five edges ...... 17
3.2 Schematic view of motif detection [3] ...... 23
3.3 All 13 configurations of order 3 connected subgraphs [3] ...... 24


3.4 A feed-forward loop ...... 24

4.1 Tcpdump output containing timestamp, protocol, source IP, source port, destination IP, destination port, packet length and packet flags ...... 27

5.1 Overview of the proposed methodology and tools used ...... 36
5.2 Storing packets from a pcap file into a MySQL database ...... 37
5.3 A motif with colored vertices ...... 41
5.4 FANMOD edge-switching process for generating random networks [4] ...... 42
5.5 Arrays representing vertex profiles ...... 43
5.6 Single-point crossover of two binary strings ...... 48

6.1 Profile collisions for traditional graph measures ...... 54
6.2 Profile collisions for motif-based profiles ...... 56
6.3 Depiction of three application graphs: HTTP, AIM and SSH ...... 57
6.4 Accuracy of unweighted vs. weighted traditional graph measure profiles ...... 59
6.5 The ten highest-weighted motifs and their corresponding weights ...... 60
6.6 Accuracy of unweighted vs. weighted motif-based profiles ...... 61
6.7 Accuracy comparison of unweighted profile types ...... 62
6.8 Accuracy of single attribute classification ...... 64
6.9 Comparison of profile types as the size of the training set increases ...... 65

A.1 Application graphs depicting AIM communications ...... 76
A.2 Application graphs depicting DNS communications ...... 76
A.3 Application graphs depicting HTTP communications ...... 76
A.4 Application graphs depicting Kazaa communications ...... 77
A.5 Application graphs depicting MSDS communications ...... 77
A.6 Application graphs depicting Netbios communications ...... 77
A.7 Application graphs depicting SSH communications ...... 77

Abbreviations

Acronyms

AIM - AOL Instant Messenger™

API - Application Programming Interface

AUP - Acceptable Use Policy

DNS - Domain Name Service

FFL - Feed-forward loop

HTTP - HyperText Transfer Protocol

IANA - Internet Assigned Numbers Authority

IDS - Intrusion Detection System

IP - Internet Protocol

MSDS - Microsoft Directory Share

OSI - Open Systems Interconnection

P2P - Peer-to-peer

SANS™ - SysAdmin, Audit, Networking, and Security

SMTP - Simple Mail Transfer Protocol

SSH - Secure Shell

TCP - Transmission Control Protocol

UDP - User Datagram Protocol

VoIP - Voice over IP

viii ix

Symbols

|V| is the number of vertices in a graph

eij is an edge from vertex i to vertex j

deg(v) is the degree of vertex v

id(v) is the indegree of vertex v

od(v) is the outdegree of vertex v

N(v) is the set of nodes in the neighborhood of vertex v

e(v) is the eccentricity of vertex v

rad(G) is the radius of graph G

diam(G) is the diameter of graph G

d(u, v) is the distance between vertex u and vertex v

CD(v) is the degree centrality of vertex v

CB(v) is the betweenness centrality of vertex v

CC (v) is the closeness centrality of vertex v

xi is the eigenvector centrality of vertex i

C(v) is the clustering coefficient of vertex v

φ is a port number associated with an application (e.g., 80 for HTTP)

Abstract

Edward G. Allan, Jr.

Identifying Application Protocols in Computer Networks Using Vertex Profiles

Thesis under the direction of Errin W. Fulp, Ph.D., Associate Professor of Computer Science

Security and management of computer network resources exemplify two critical activities that challenge system administrators. They face potential threats from outside intruders as well as internal users who already have access to the organization’s assets. It is imperative that administrators are aware of what applications are being executed, but the use of data encryption techniques and non-standard port numbers presents difficulties that must be overcome.

To that end, this thesis introduces a novel method to identify application protocols based on the analysis of application graphs, which model application-level communications between computers. The performance of two types of node descriptions, called vertex profiles, is compared. “Traditional” vertex profiles characterize each node using several well-studied graph measures. Furthermore, this work uniquely applies motif-based analysis, which has previously been used primarily in systems biology, to the study of application graphs by creating a second type of vertex profile based on a node’s participation in statistically significant motifs. Machine learning techniques are employed to evaluate the importance of specific profile features. The experimental results, using a nearest-neighbor classifier, show that this type of analysis can correctly classify the applications observed with greater than 80% accuracy.

Chapter 1: Introduction

Managing and securing today’s critical data networks is a daunting and expensive task. According to INPUT [5], demand for vendor-furnished information systems and services by the U.S. government will increase from $71.9 billion in 2008 to $87.8 billion in 2013. This money funds such tasks as system modernization, information sharing, IT management and information security. As computer networks increase in size, speed and complexity, and malicious hackers develop more sophisticated attacks, traditional methods of managing and securing these networks begin to break down.

This thesis proposes a novel approach to identifying the actions of hosts within a network by examining the properties of application graphs, which model the social and functional interactions of hosts with one another at the software application level (e.g., HTTP, FTP, etc.). With the aid of machine learning techniques and algorithms, this method exploits graph characteristics of each host in the application graph, such as its connectedness, its position in the graph and the shapes of the subgraphs in which it is found. One distinct advantage of this approach is that classification can be performed “in the dark”, meaning that the packet payloads are either unavailable or have been encrypted, rendering deep packet inspection futile.

Knowing what activities users on the network are participating in is crucial to network administrators who must manage allocations, network configurations, performance, security and access policies. The following sections of this chapter provide background information and motivation for the study.


1.1 Issues in Network Management and Security

To protect itself from litigation and to help ensure the integrity of its network, an organization (such as a school, business, or government) will often develop an Acceptable Use Policy, or AUP. An AUP defines what behaviors are acceptable for internet browsing, what applications can be run by users and other relevant guidelines for usage. The SANS Security Policy Project [6] provides several resources and templates for such policies. Take, for example, a policy that does not allow users to run a personal web server using an organization’s computing resources. Identifying such behavior can help to preserve network bandwidth that is otherwise used for legitimate business activities.

Not only can failure to comply with an organization’s AUP waste computing resources, it can also have serious security implications. Continuing with the example above, running an improperly configured web server or hosting insecure web application files gives an attacker an easy point of entry into the network. A study performed by MITRE from 2001-2006 notes a sharp increase in the number of public reports of vulnerabilities that are specific to web applications [7]. For several years buffer overflow attacks had been the most common, but they were overtaken in 2005 by web application vulnerabilities such as SQL injection, cross-site scripting (XSS) and remote file inclusion. It is, therefore, in a network administrator’s best interest to ensure that the network is properly utilized in accordance with the policies and guidelines adopted by the organization.

1.2 Current Methods of Network Analysis

Several tools allow system administrators to determine which applications are being used on a network. This information assists them in the maintenance and protection of networked systems. Sophisticated users, however, are able to hide their activities, which could potentially include actions that are against the organization’s AUP, or worse yet, are illegal. This section examines a few of the tools used by administrators and identifies some of their weaknesses.

1.2.1 Applications and Port Numbers

When data is sent to a computer over a network, the destination port number identifies which application on the host computer should receive and process the data. Many applications use port numbers specified by the Internet Assigned Numbers Authority [8]. For example, FTP servers use ports 20 and 21, while web servers use port 80 by default. NetStat is a command line tool that shows information about network connections, both incoming and outgoing [9]. Figure 1.1 demonstrates the output of the NetStat command.

    $ netstat -ta
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address      Foreign Address    State
    tcp        0      0 localhost:2208     *:*                LISTEN
    tcp        0      0 *:sunrpc           *:*                LISTEN
    tcp        0      0 *:auth             *:*                LISTEN
    tcp        0      0 *:35763            *:*                LISTEN
    tcp        0      0 localhost:ipp      *:*                LISTEN
    tcp        0      0 localhost:smtp     *:*                LISTEN
    tcp        0      0 localhost:36699    *:*                LISTEN
    tcp6       0      0 *:ssh              *:*                LISTEN

Figure 1.1: Example output from NetStat

Network administrators could look and see that a host on the network is listening on port 80, indicating the presence of a web server. The administrator could then shut down that service and take appropriate disciplinary action toward the user. The problem with this method of detecting network applications is that while many do run on a known port number, they do not necessarily have to. If a web server were reconfigured to listen for connections on port 6000, clients could still connect to it through their web browser by typing http://www.example.com:6000. A user wishing to hide their activities might attempt to disguise an application by using such a non-standard port number. Chapter 2 describes port numbers and other networking concepts in more detail.
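The weakness described above is visible in the services database that port-based tools consult. The short Python sketch below (an illustration, not part of this thesis's toolchain) looks up the service conventionally registered for a port using the standard library's `socket.getservbyport`; because the mapping only reflects IANA convention, the relocated web server from the example is not recognized as HTTP.

```python
import socket

def registered_service(port, proto="tcp"):
    """Look up the service conventionally registered for a port in the
    system services database (an IANA-derived list, e.g. /etc/services)."""
    try:
        return socket.getservbyport(port, proto)
    except OSError:
        return "unregistered"

# Port 80 resolves to the expected web service on most systems...
print(registered_service(80))
# ...but a web server moved to port 6000 is invisible to this lookup
# (on many systems port 6000 is registered to X11, a different application).
print(registered_service(6000))
```

The lookup answers "what usually runs here", not "what is actually running here", which is exactly the gap this thesis addresses.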

1.2.2 Packet Inspection

Another method of detecting network applications is to scrutinize the data contained in each packet as it traverses the network. Packets contain information such as HTTP requests, email headers and MP3 filename searches, as well as protocol-specific session initiations and version numbers that can be used to identify a particular application. Wireshark is a popular network protocol analyzer that has several useful features for viewing packet contents, reassembling sessions and gathering statistics about network data [10]. Packet inspection is commonly used in intrusion detection systems (IDS) such as Snort [11]. A rule-based engine searches packet data, compares it against a list of known attacks and generates a predefined response (such as notifying an administrator). The problem with packet inspection is that traffic is increasingly encrypted. Data payloads that have been transformed into ciphertext are not human-readable until they are decrypted with the appropriate key, nor do the payloads match the known attack strings in the case of an IDS.
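A toy illustration of why encryption defeats payload matching: the Python sketch below scans a payload for protocol fingerprints in the spirit of a rule-based engine (the signatures are invented for illustration and are not real Snort rules). The match succeeds on plaintext but finds nothing once the same bytes are scrambled.

```python
# Hypothetical protocol fingerprints; real IDS rule sets are far richer.
SIGNATURES = {
    b"GET /":   "HTTP request",
    b"SSH-2.0": "SSH banner",
    b"EHLO":    "SMTP greeting",
}

def inspect(payload: bytes):
    """Return the labels of all signatures found in the payload."""
    return [label for sig, label in SIGNATURES.items() if sig in payload]

plaintext = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n"
print(inspect(plaintext))        # ['HTTP request']

# XOR with a fixed byte stands in for encryption here: same data,
# but the bytes on the wire are opaque to the matcher.
ciphertext = bytes(b ^ 0x5A for b in plaintext)
print(inspect(ciphertext))       # [] -- the fingerprints no longer match
```

The approach developed in this thesis sidesteps this limitation by never looking at payload bytes at all.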

1.3 Interdisciplinary Study of Network Communications

It is therefore the goal of this study to look beyond current methods for identifying network behavior and propose a novel approach that relies upon high-level communication patterns observed among hosts. To accomplish this goal, this study borrows ideas and algorithms from several disciplines. Networks are not unique to computer science; they exist in mathematics, sociology, biology, communications and other areas of study as well. Graphs, collections of objects (sometimes called nodes) linked by edges, are the abstract model which allows for the analysis of any type of network. They can represent relationships among friends, the interaction of biological entities in a transcriptional regulation network, the collaboration between authors of research papers [12], as well as a myriad of other problem spaces. Chapter 3 illustrates the properties of graphs in more depth.

1.3.1 Social Networks

One key area of study that this thesis borrows from is social network analysis, which focuses on relationships among social entities (also known as actors), and on the patterns and implications of these relationships [13]. The properties of social graphs reveal interesting information such as the spread of disease or material goods through the network, as well as what actors are “influential” (politically, socially, etc.). Social network analysis also has military and intelligence applications. Yang and Ng provide visualizations and analysis of weblog social networks related to terrorism and other crime-related matters [14].

To provide a simple working example of social network analysis, Figure 1.2 depicts the author’s social network of friendships taken from the popular social networking web site Facebook™. There are two clearly visible “clusters” of friends in the graph, created by nodes in each cluster sharing many common links with other nodes in the cluster. In the context of this social network, it means that many of the author’s friends in each group are also friends with each other. The group on the left is primarily comprised of relationships formed during the author’s tenure at Wake Forest University, while the cluster on the right is primarily comprised of relationships formed prior to and during high school.

Figure 1.2: Graphical depiction of a social network with two distinctly visible clusters

Several concepts pertaining to social networks can be extended to the study of application graphs performed in this work. Application graphs model the social relationships between clients and servers in a computer network by showing with which web servers users choose to interact, with whom they communicate via instant messaging clients and with whom they choose to share files. For example, the application graph for AOL Instant Messenger™ might show several chat clients communicating with a central chat server, which then passes messages along to the intended recipients. Characteristics of these high-level interactions are used to identify the software application through which the communication occurs. Section 3.3 elaborates upon the graph measures frequently used to quantify aspects of social networks.

1.3.2 Biological Networks and Motifs

The study of biological networks is another key field from which ideas for this thesis are borrowed. Cellular processes are regulated by the interactions of several molecules such as proteins and DNA [15]. These complex interactions can be modeled as graphs. One particular method used to analyze these graphs is to search within them for motifs: recurring, significant patterns of interconnections. Milo et al. find motifs in several types of networks including biochemistry, neurobiology, ecology and engineering. They suggest that motifs are the basic structural elements capable of defining broad classes of networks [3].

Motif analysis is often used in biology [3, 16, 17, 18], but has not yet been applied to application graphs. One goal of this study is to determine if a motif or groups of motifs can help identify what application a computer is using. It finds that several protocols use similar motifs, partly due to the fact that many applications have a client-server architecture (described in Section 2.1). However, there is still enough distinction in how the applications are used at a social level to determine what they are based on the models developed in this work. Chapter 6 discusses some of the motifs found in application graphs.

1.4 Outline

The following is an outline of the remaining parts of this thesis. Chapter 2 covers information regarding computer networks and the different reference models, and details the network layers used to create application graphs. Chapter 3 introduces several concepts relating to graph theory and “traditional” measurement techniques of graphs, and provides more information about motifs. Data sources and application protocol selection are covered in Chapter 4. Chapter 5 specifies the tools used in this thesis and introduces the machine learning techniques used for the modeling and classification of application types. A discussion of the results obtained and an analysis of key motifs and graph metrics is handled in Chapter 6, as well as a comparison between traditional graph measures and a motif-based approach. Finally, Chapter 7 concludes this study and explores possible topics for future research.

Chapter 2: Computer Networks and Communications

Undoubtedly the interconnection of computers and networks to the world wide web has increased mankind’s ability to share information, perform research and become more efficient at everyday tasks. However, not all users have benign intentions. Illegal hacking, cyber terrorism and fraud wreak havoc on governments, corporations and individuals alike. Data encryption is often used to disguise malicious activity as well as legitimate activity from observation. By exploring the communication patterns found within networks, this study shows that it is still possible to gain some insight into what applications are being utilized. The following sections introduce several basic concepts related to network architectures, protocols and applications.

2.1 Network Topologies and Architectures

Network topologies describe the arrangement and mapping of networked elements, such as computers, printers, wires and routers. Mappings can be physical or logical. Physical topology describes where the elements are actually located and how they are interconnected with wires. Logical topology, on the other hand, refers to the path data appears to take when traveling from one network host to another [1]. A network’s logical topology might be very different from the underlying physical topology, but it is bound by the network protocols that direct how the data moves across the network. Application graphs are a generalization of logical topologies in that they provide a picture of how data moves between hosts, but from a very high-level view.

Figure 2.1: Four network topologies: bus, ring, star and mesh [1]

There are several shapes used to describe network topologies, including bus, tree, star, mesh and ring. In the case of a physical network, these shapes have an impact on network performance, reliability and ease of management. For example, a bus network is cost-effective and easy to implement, but the architecture can only support a limited number of hosts and a bad cable will bring down the entire network. A star network allows for the isolation of the periphery nodes, but the central hub might be a single point of failure for the network. Logical topologies show the exchange of information between entities that are not physically connected by the network infrastructure. For example, IBM’s Token Ring network technology is a logical ring but is physically wired in a star topology.

In terms of software application models, two prevalent architectures are found in computer networks: the client-server model and peer-to-peer (P2P) architectures. In the client-server model, a client machine is responsible for initiating a request to some application running on another computer. The server waits for an incoming request from a client and then sends a response. Client-server architecture allows for computing responsibilities to be divided up among servers in the network, where one computer might act as a web server, another as an email server and so on. While the data sent between the client and server might go through several network devices, the logical data flow is a single link between the two nodes. A star network could then be induced by several clients connecting to a common server (see Figure 2.1). In a P2P network, nodes both initiate and respond to requests from other computers on the network known as peers. Consequently, the logical topology of such interactions could form a mesh network. This study examines the characteristics of logical topologies extended to the application layer, modeled as application graphs.
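The structural difference between the two architectures shows up directly in simple degree statistics of the logical topology. The Python sketch below (hosts and flows are invented for illustration) counts destination in-degrees over application-level flow records: client-server traffic concentrates edges on a single hub, the star pattern, while P2P traffic spreads edges among the peers.

```python
from collections import Counter

# Hypothetical application-level flow records as (source, destination)
# pairs; the host names are made up for this example.
client_server = [("h1", "srv"), ("h2", "srv"), ("h3", "srv"), ("h4", "srv")]
peer_to_peer  = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p1", "p3")]

def in_degrees(flows):
    """In-degree of each destination: a star topology concentrates
    edges on one hub, while P2P traffic spreads them out."""
    return Counter(dst for _, dst in flows)

print(in_degrees(client_server))  # one hub ('srv') with in-degree 4
print(in_degrees(peer_to_peer))   # degrees spread across the peers
```

Per-vertex statistics of exactly this kind, along with richer measures, form the vertex profiles developed later in this work.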

2.2 Computer Network Reference Models

Application graphs are created using information from several layers of the network communication process. Data goes through a series of transformations before being sent to its destination, including breaking the data into manageable fragment sizes, adding information, specifying how the data should be transmitted and converting it into the electrical pulses that traverse the wire. Three layers in particular are of interest: the network, transport and application layers, described in Sections 2.3–2.5.

There are two fundamental models referenced when describing network layers: the OSI model and the TCP/IP model. The protocols (rules that govern the syntax and meaning of data sent between entities) associated with the OSI model are rarely used, but the features described at each layer are still important. In contrast, the TCP/IP model is not as rigidly defined as the OSI model, but the protocols associated with it are widely used [2]. This section provides an overview of these models, depicted in Figure 2.2.

2.2.1 The OSI Model

The Open Systems Interconnection Basic Reference Model (OSI Model) was designed to promote international standardization of the protocols used in communication networks. There are seven layers in this model: the physical layer, data link layer, network layer, transport layer, session layer, presentation layer and application layer [19].

The physical layer deals with representing and transmitting raw bits over a communication channel. Well-known examples include Ethernet over twisted pair (10BASE-T, 100BASE-TX) and the 802.11a/b/g wireless standards. The task of the data link layer is to correct transmission errors from the physical layer and provide the means to enable point-to-point communication between hosts within a local area network. This layer arranges data into frames and also provides medium access control to share communication channels between multiple users.

The network layer determines how packets are routed from the source to the destination, allows the interconnection of heterogeneous networks and provides congestion control. The next layer in the model, the transport layer, provides logical communication between processes on the hosts and is the first true end-to-end layer in the model. The session and presentation layers are not generally used; their intent is to provide session management between hosts, synchronization, interruption recovery and “on the wire” management of abstract data structures. The final layer in the OSI model is the application layer. This is the layer at which a user directly interacts with the program (a web browser, for example) that sends network data.

Figure 2.2: The OSI and TCP/IP reference models [2]

2.2.2 The TCP/IP Model

First proposed in 1974, the TCP/IP model [20] presents a slightly different view of network communications with four layers that are not as strictly defined as those in the OSI model. Whereas the OSI model was developed before the associated protocols, the TCP/IP model was developed based on protocols that already existed, taking its name from its two key protocols. The host-to-network layer is somewhat ill-defined and does not specify the protocols necessary for a host to send packets to the network. It combines elements of the OSI model’s physical and data link layers. The internet layer is analogous to layer 3 of the OSI model. Familiar protocols like IP (Internet Protocol) and ICMP (Internet Control Message Protocol) are a part of this layer.

The third layer of the TCP/IP model is the transport layer, which maps directly to the transport layer of the OSI model. It allows for end-to-end communication of hosts on a network, using the TCP (Transmission Control) and UDP (User Datagram) protocols. A need for the session and presentation layers was not perceived, so the TCP/IP reference model does not contain them explicitly. The fourth layer, the application layer, will contain them if necessary. This layer contains all of the high level protocols such as HTTP, SMTP and DNS.

Although there are certainly similarities between several layers of the two reference models, this thesis will use OSI model terminology. This allows a finer distinction to be made between the network services offered at each layer. The important lower-level protocols for application graphs, however, are those originally associated with the TCP/IP model, namely TCP and UDP.

2.3 Layer 3: The Network Layer

The network layer is concerned primarily with delivering packets from one host to another through a series of routers. It attempts to maintain some quality of service for variables such as delay and transit time while forwarding packets along until the destination is reached.

Figure 2.3: An IP datagram header [2]

Figure 2.3 shows all of the fields contained in the header of an IP data packet. For modeling network communications, however, only two fields are of interest: the source address and the destination address. Each IP address identifies a unique node in an application graph. The protocol field tells the network layer which transport process to give the data to. Two common options are TCP and UDP, described next.
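The three fields of interest sit at fixed offsets in the IPv4 header, so they can be pulled straight out of the raw bytes. A minimal sketch, assuming a 20-byte header with no options (the function name and sample addresses are illustrative):

```python
import socket
import struct

def ip_fields(header: bytes):
    """Extract the protocol, source, and destination fields of an IPv4 header.

    Per RFC 791, the protocol number is byte 9, the source address
    occupies bytes 12-15, and the destination address bytes 16-19.
    """
    proto = header[9]
    src, dst = struct.unpack("!4s4s", header[12:20])
    return proto, socket.inet_ntoa(src), socket.inet_ntoa(dst)
```

A protocol value of 6 indicates TCP and 17 indicates UDP, telling the receiver which transport process should get the payload.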

2.4 Layer 4: The Transport Layer

The transport layer is responsible for getting data to and from applications running on the host machine, providing logical end-to-end communication between the applications. There are two types of service available to the upper layers: connectionless and connection-oriented. The simpler of the two is connectionless, implemented by UDP. The delivery and ordering of UDP packets is unreliable, but there is less connection overhead associated with the transfer. Connection-oriented service, provided by TCP, establishes several properties of the transmission ahead of time, such as data window sizes and congestion control mechanisms. TCP packets are given sequence numbers that are kept in order. Although IP networks are still only “best-effort”, as no resources are reserved ahead of time, TCP provides reliable communication between hosts.

(a) UDP header

(b) TCP header

Figure 2.4: UDP and TCP datagram headers [2]

TCP and UDP headers (Figure 2.4) contain fields for the source and destination port numbers. Port numbers serve as numerical identifiers for processes. They are 16 bits in length, resulting in 2^16 possible ports, numbered 0 through 65535. The Internet Assigned Numbers Authority (IANA) is responsible for maintaining assignments of port numbers for specific uses [8].
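Conveniently, both transport headers place the two 16-bit port fields first, so a single routine can read them regardless of protocol. A sketch (the function name is illustrative):

```python
import struct

def ports(l4_header: bytes):
    """Return (source port, destination port) from a TCP or UDP header.

    In both layouts (RFC 793 and RFC 768), the first four bytes hold the
    16-bit source and destination ports, in network byte order.
    """
    return struct.unpack("!HH", l4_header[:4])
```

For example, a client request to a web server might yield the pair (29985, 80).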

2.5 Layer 7: The Application Layer

The primary objective of this thesis is to identify application usage via communication patterns at the application layer. Although not 100% accurate, port numbers are used as the application labeling scheme for training the application classifier, described in Chapter 5. Some applications communicate on certain port numbers with a high degree of reliability. For example, when a user opens a web browser and requests a web page, a connection is established from a randomly assigned upper port number on the user’s computer to port 80 of the web server hosting the page. In this case, Hypertext Transfer Protocol (HTTP) is the layer 7 application protocol used, with the web server listening for connections on port 80, the IANA official port for the HTTP protocol. This process is depicted in Figure 2.5.

Source                     Destination
192.168.1.100:29985   →    208.122.19.56:80      User requests a web document
208.122.19.56:80      →    192.168.1.100:29985   Server responds to request

Figure 2.5: Example communication between a client and a web server

There is no shortage of application layer protocols. Common examples include SMTP or POP3 for email services, DNS for domain name resolution, peer-to-peer protocols like BitTorrent, and many others. This study focuses on seven applications that reflect a variety of application types and also have official port assignments from the IANA. Protocol selection is detailed in Chapter 4, while the steps taken to create application graphs based on the layer 3, 4 and 7 information are detailed in Section 5.3.

Chapter 3: Graph Analysis

Graphs are a well-studied concept in mathematics, dating back to Leonhard Euler’s 1736 analysis of the Seven Bridges of Königsberg, which laid many of the foundations of graph theory [21]. Simply put, graphs are a collection of objects with connections between them. These abstract structures model problems in a variety of areas, including logistics, communication systems, biological and chemical compounds and social-group structures [22]. The first part of this chapter reviews the basic concepts and terminology required by the study of application graphs and then introduces several “traditional” measures used to describe graphs. In the latter half of this chapter, network motifs are defined in terms of their graph characteristics and are related to application graphs.

3.1 Graph Terminology and Basic Properties

Unfortunately, some of the mathematical notation used in graph theory tends to differ from text to text. Many of the basic properties and definitions are standard, but for those that are not, this thesis borrows notation primarily from two sources: Chartrand and Zhang [23], and Busacker and Saaty [22]. Abbreviations and function-like syntax replace many Greek letters in this style of notation to avoid confusion. For example, x(G) indicates that x is a property of the entire graph, whereas y(v) indicates y is a property local to a particular vertex.

Vertices (or nodes, as they are often called in computer science) are the fundamental units in a graph. They can represent any object, such as a person, process, city, or computer. Vertices are linked together by edges, which show a relationship between the vertices they connect. Some examples include roads connecting cities, social interactions between people, or physical links between computers in a network. A graph is a collection of vertices and edges taken together. Formally, a graph G consists of a finite, non-empty set of vertices V, connected by a set of edges E, written as G = (V, E). This definition implies that a graph must have at least one vertex in it, but it does not necessarily have to contain any edges.

Figure 3.1: A graph with five nodes and five edges

The set of vertices V is written V = {v0, v1, ..., vk}. The cardinality of this set, |V|, is the order, or number of nodes in the graph. A graph’s edge set is defined as E ⊆ {{u, v} | u, v ∈ V}. For brevity, an edge can be written eij to mean an edge linking node i to node j. |E| is the number of edges in the graph, known as its size. The degree of a node, deg(v), is the number of nodes that v is adjacent to in the graph (those that can be reached by traversing one edge). This set of nodes is known as N(v), the neighborhood of v. In Figure 3.1, nodes 2 and 3 are adjacent to node 1, and N(1) = {2, 3}.
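These definitions translate directly into code. The sketch below uses a hypothetical five-node, five-edge set that is consistent with the text's description of Figure 3.1 (the actual figure may be drawn differently):

```python
# Hypothetical edge set consistent with the text: five nodes, five
# edges, with N(1) = {2, 3}.
V = {1, 2, 3, 4, 5}
E = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]}

def neighborhood(v):
    """N(v): every vertex sharing an edge with v."""
    return {u for e in E if v in e for u in e if u != v}

def degree(v):
    """deg(v) = |N(v)|."""
    return len(neighborhood(v))

order, size = len(V), len(E)  # |V| = 5, |E| = 5
```

With this edge set, neighborhood(1) returns {2, 3}, matching N(1) above.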

3.2 Types of Graphs

Modeling complex systems often requires more detail than just the nodes and edges described in the previous section. One possible approach is to orient the graph to show asymmetric relationships between objects. In an undirected graph, the edges are unordered pairs of vertices, that is, eij = eji. The edges in a directed graph, however, are ordered pairs, and eij ≠ eji. The degree measure can be extended to include the indegree, id(v), and outdegree, od(v), of a vertex, describing the number of vertices of G from which v is adjacent and the number of vertices in G to which v is adjacent, respectively. The associated undirected graph of a directed graph is obtained by disregarding the ordering of the end points of each edge.

The assembly line process for building an automobile can be modeled as a directed graph, where each stage of the process is represented by a node in the graph. The directionality of the edges indicates that each step follows in a specified order and that the process cannot happen in reverse. Edges of a graph can be weighted, usually with an integer or real number, to imply a “cost” associated with traversing an edge, or to further describe how the edge is used within the overall system. In the auto assembly line graph, an edge weight could represent the amount of time a particular step in the process takes.

3.3 Traditional Graph Measures

Several graph measures exist to describe the structure of a network, such as how connected a vertex is, its distance from other vertices, and how it is positioned in the graph. These measures have been used to characterize many different types of networks and describe their growth patterns [24]. The following sections define the measures selected for this study and provide examples of several of the concepts.

3.3.1 Distances and Path Lengths

The distance between two nodes u and v, written d(u, v), is the length of the shortest path between them. In an unweighted graph, this is equal to the number of edges in the path. In a weighted graph, the length of a path P is the sum of its edge weights, Σ_{e∈P} w(e). Dijkstra’s algorithm [25] is one common method for determining this path through a network.

For a vertex v in a connected graph, the eccentricity of v, e(v), is the distance between v and a vertex farthest from v in G. The radius of a graph is rad(G) = min{e(v) | v ∈ V} and the diameter is diam(G) = max{e(v) | v ∈ V}. A vertex is said to be central if e(v) = rad(G) and periphery if e(v) = diam(G).

In Figure 3.1 (reproduced above for convenience), e(1) = 3 because node 5 is the node farthest away from node 1 in the graph and requires traversing three edges to reach it. The radius of the graph rad(G) = 2 because e(1) = e(2) = e(5) = 3, but e(3) = e(4) = 2. Also, diam(G) = 3, the maximal eccentricity value of all nodes in the graph. According to the definitions above, nodes 3 and 4 are central, while nodes 1, 2 and 5 are said to be periphery nodes.
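The worked example can be reproduced with breadth-first search. The sketch below assumes a hypothetical five-node adjacency consistent with the eccentricities described in the text (the real Figure 3.1 may be drawn differently):

```python
from collections import deque

# Hypothetical adjacency consistent with the worked example:
# e(1) = e(2) = e(5) = 3 and e(3) = e(4) = 2.
ADJ = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def eccentricity(v):
    """e(v): distance to the farthest vertex, via breadth-first search."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in ADJ[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())

radius = min(eccentricity(v) for v in ADJ)    # rad(G) = 2
diameter = max(eccentricity(v) for v in ADJ)  # diam(G) = 3
central = {v for v in ADJ if eccentricity(v) == radius}  # nodes 3 and 4
```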

3.3.2 Centrality Measures

It is helpful to describe the centrality measures of a graph in terms of social networks in order to make an analogy: the centrality measures of a vertex indicate how important, prominent, or powerful the vertex is in a graph. The following is a brief examination of four common centrality measures proposed by Freeman and Bonacich [26, 27]. The most basic of these is degree centrality, defined as CD(v) = deg(v) / (|V| − 1). This equation can be modified for directed networks to produce CDin and CDout. In terms of social network analysis, indegree is interpreted as a measure of popularity, while outdegree is interpreted as gregariousness. In a dense adjacency matrix representation of a graph, the time required to calculate the degree centrality for all nodes is O(V^2), since all combinations of vertices must be considered.
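As a quick sketch of the normalization, degree centrality divides each degree count by the maximum possible degree, |V| − 1 (the adjacency-map input format is an assumption of this sketch):

```python
def degree_centrality(adj):
    """C_D(v) = deg(v) / (|V| - 1) for every vertex in an adjacency map."""
    n = len(adj)
    return {v: len(neighbors) / (n - 1) for v, neighbors in adj.items()}
```

On a five-node graph, a vertex of degree 3 receives centrality 3/4.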

Betweenness centrality is the fraction of shortest paths between all pairs of vertices that pass through a particular vertex v. This measure is given by the equation:

CB(v) = Σ_{s≠v≠t∈V} δst(v) / δst    (3.1)

where δst is the number of shortest paths from s to t, and δst(v) is the number of shortest paths from s to t that pass through v. A vertex with a higher betweenness centrality occurs on more shortest paths than one with a lower value. This measure can indicate how “powerful” a vertex is, because it influences the spread of information through a network. O(V^3) calculations are required to determine betweenness and closeness (described next) using the Floyd-Warshall algorithm to find all shortest paths.
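For small graphs, Equation 3.1 can be evaluated directly with Brandes-style shortest-path counting. The following is a sketch for undirected, unweighted graphs given as adjacency maps, not a production implementation:

```python
from collections import deque

def betweenness(adj):
    """Betweenness centrality via Brandes-style shortest-path counting.

    `adj` maps each vertex to its set of neighbors (undirected, unweighted).
    """
    cb = {v: 0.0 for v in adj}
    for s in adj:
        # Single-source shortest-path distances and path counts from s.
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
        # Accumulate dependencies from the farthest vertices back toward s.
        delta = {v: 0.0 for v in adj}
        for u in reversed(order):
            for w in adj[u]:
                if dist.get(w) == dist[u] + 1:
                    delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if u != s:
                cb[u] += delta[u]
    # Each unordered pair {s, t} was counted from both endpoints.
    return {v: score / 2 for v, score in cb.items()}
```

On a simple path 1-2-3, only the middle vertex lies on a shortest path between the other two, so its betweenness is 1.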

Closeness centrality is defined as the average shortest path length between a vertex v and all other vertices reachable from it. In network theory it is regarded as a measure of how long it will take information to spread from one vertex to the other reachable vertices in the graph. Closeness centrality is given by:

CC(v) = ( Σ_{t∈V} d(v, t) ) / (n − 1)    (3.2)

where n ≥ 2 is the number of vertices reachable from v. Those vertices in G that have shorter paths to other vertices will have a higher closeness centrality.
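Equation 3.2 can be computed with the same breadth-first search used for distances. A sketch for unweighted graphs, taking n to be v plus the vertices reachable from it (one common reading of the definition):

```python
from collections import deque

def closeness(v, adj):
    """C_C(v): average shortest-path distance from v to the vertices
    reachable from it (Equation 3.2)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    n = len(dist)  # v plus every vertex reachable from v
    if n < 2:
        return 0.0
    return sum(dist.values()) / (n - 1)
```

On a path 1-2-3, the middle vertex averages distance 1.0 while an endpoint averages 1.5.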

The eigenvector centrality is a more sophisticated version of the degree count of a vertex, acknowledging that not all connections within a network are equal. The eigenvector centrality score of a vertex i is proportional to the sum of the centrality scores of i’s neighbors. In social networks, this reflects the idea that people connected to influential people will themselves be more influential than if they were connected to less influential people [28]. If the graph is represented as an adjacency matrix A, where Aij = 1 if node i is connected to node j and Aij = 0 otherwise, eigenvector centrality can be written:

xi = (1/λ) Σ_{j=1}^{|V|} Aij xj,    (3.3)

where λ is a constant, xi is the centrality score of vertex i, and xj is the centrality score of vertex j. Defining the vector of centralities x = (x1, x2, ...), the previous equation can be rewritten as

λx = A · x    (3.4)

To force the centralities to be non-negative, it can be shown that λ must be the largest eigenvalue of A, and x the corresponding eigenvector [28].

3.3.3 Clustering Coefficient

The clustering coefficient measure begins to extract more information about the shape of structures within the graph, whereas many of the previous measures rely on information about paths and path lengths between nodes. The clustering coefficient of v measures the number of edges that exist among the neighbors of v, divided by the number of edges that could possibly exist among them. For an undirected graph, the clustering coefficient is defined by the following equation:

C(v) = 2 |{ejk : vj, vk ∈ N(v), ejk ∈ E}| / ( deg(v)(deg(v) − 1) )    (3.5)

Another way to view clustering is as the ratio of triangles (three nodes connected by three edges) to the number of triples (three nodes and two edges, both incident to v) that exist in the neighborhood of v. It has been shown in some types of networks that if v1 connects to v2 and v2 connects to v3, then there is a greater chance that v1 and v3 will be connected as well [29, 28].
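Equation 3.5 amounts to counting the edges among a vertex's neighbors. A sketch for undirected adjacency maps:

```python
def clustering_coefficient(v, adj):
    """C(v): fraction of possible edges among N(v) that actually exist."""
    neighbors = sorted(adj[v])
    k = len(neighbors)
    if k < 2:
        return 0.0  # no pair of neighbors to connect
    # Count edges e_jk whose endpoints both lie in N(v).
    links = sum(
        1
        for i, a in enumerate(neighbors)
        for b in neighbors[i + 1:]
        if b in adj[a]
    )
    return 2 * links / (k * (k - 1))
```

On the five-node example graph used earlier, vertex 1's two neighbors are themselves connected, giving C(1) = 1, while only one of the three possible edges among vertex 3's neighbors exists, giving C(3) = 1/3.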

3.3.4 Application of Traditional Graph Measures in Computer Networks

Past studies have looked at graph characteristics for the purpose of anomaly detection and traffic classification. Staniford et al.’s GrIDS system [30] generates graphs describing communications between IP addresses and can generate alerts based on a set of rules, such as a vertex degree count crossing some threshold value. The BLINC traffic profiling system developed by Karagiannis et al. examines the interactions between hosts to identify an application, and utilizes measures including degree counts and neighborhood information [31]. This thesis is similar to the BLINC study in that they both evaluate interactions among hosts at the functional and social levels in order to identify applications. The BLINC study, however, exploits additional information such as the transport protocol and average packet size and attempts to match network behavior to a library of empirically derived “graphlets”. In contrast, this study examines a wider variety of graph measures, and also proposes the unique approach of searching application graphs for motifs.

3.4 Network Motifs

A network motif is a pattern of interconnections that occurs in a graph significantly more often than it does in randomized networks. Studies performed by Milo et al. find motifs in several types of complex networks and show that a small number of network motifs occur repeatedly across network types. They describe motifs as fundamental building blocks of networks, capable of defining universal classes of networks [3, 16]. Research suggests that some motifs can be associated with a particular function, as discussed in Section 3.4.2. The work performed in this thesis extends this idea to application graphs to determine if particular motifs indicate what application protocol a host is using.

3.4.1 Definition of a Motif

In mathematical terms, a graph G′ = (V′, E′) is a subgraph of G if V′ ⊆ V and E′ ⊆ E. A motif, then, is any such subgraph that occurs significantly more often than in random networks. The level of significance required depends on the problem, but as an example, Milo et al. consider those patterns with a p-value of 0.01, meaning that there is only a 1% chance of seeing a particular pattern as many or more times in random networks than is observed in the original network [3]. Motif detection is depicted in Figure 3.2.

Figure 3.2: Schematic view of motif detection [3]

Generally speaking, motifs of order 3 or larger are considered when performing motif searches. However, searching for large motifs can be prohibitively expensive because of the computational complexity involved. Several algorithms [32, 33] have been developed to increase the efficiency of these searches and allow for the analysis of large networks containing thousands of edges and nodes. Figure 3.3 shows the thirteen possible directed edge combinations for motifs of order 3. In application graphs, the edge directionality indicates the flow of data between two hosts, such as a request from a client to a server, or the response from the server back to the client. Additional motif characteristics are described in Chapter 5.

Figure 3.3: All 13 configurations of order 3 connected subgraphs [3]
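A brute-force census of the connected order-3 subgraphs can be sketched as follows; real motif finders [32, 33] use far more efficient sampling and symmetry-breaking techniques, so this is illustrative only:

```python
from itertools import combinations, permutations

def triad_census(edges):
    """Count connected 3-node directed subgraphs, grouped by isomorphism class.

    Each class is keyed by the lexicographically smallest relabeling of its
    edge list over all 3! permutations of the node labels.
    """
    nodes = {v for e in edges for v in e}
    counts = {}
    for trio in combinations(sorted(nodes), 3):
        sub = [(a, b) for a, b in edges if a in trio and b in trio]
        # With only three nodes, the subgraph is connected iff at least two
        # distinct (undirected) pairs of them are linked.
        if len({frozenset(e) for e in sub}) < 2:
            continue
        key = min(
            tuple(sorted((p[trio.index(a)], p[trio.index(b)]) for a, b in sub))
            for p in permutations(range(3))
        )
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Comparing the resulting counts against those from randomized networks is what distinguishes a mere subgraph from a motif.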

3.4.2 Function of Motifs

Several studies suggest that motifs can be linked to specific functions within a network. Milo et al. analyze the motifs found in the direct transcriptional interactions in Escherichia coli and find three highly significant motifs [16]. Their study states that the appearance of network motifs at high frequencies suggests that they may have some specific functions in the information processing performed by the network.

A different study analyzes the feed-forward loop, or FFL (Figure 3.4). In an FFL, X regulates transcription factor Y, and both jointly regulate gene Z. Mangan et al. show that it acts as a sign-sensitive delay element, in that it responds rapidly to step-like stimuli in one direction (ON to OFF) and at a delay to steps in the opposite direction (OFF to ON). They argue that this type of control mechanism can filter out fluctuations in input stimuli [34].

Figure 3.4: A feed-forward loop (X → Y, X → Z, Y → Z)

3.5 Analysis of Application Graphs

The application graphs studied in this work are hybrid networks, reflecting a mix of social interactions and computer network architectures. Although there are no genes present that require precise regulation like in the biological networks discussed previously, network functions are carried out in a controlled environment that must follow a set of established protocols. For example, if a user wishes to talk to another user on a network via the AIM instant messaging service, each user must first authenticate and establish a connection to a central server; the computers do not simply send text back and forth between the two. Protocol behaviors are described in Chapter 4.

In terms of graph properties, application graphs are modeled with unweighted, directed edges and do not contain any self-loops. If a computer connects to a service running locally, the connection goes over the loopback interface, and is not visible on the network traces examined. The edge direction is set to match the observed traffic flow, which may be either unidirectional or bidirectional. If two computers communicate at any time during a period of monitoring, an edge is drawn between them. Edge weights are not used in this study, but may be considered in the future to provide further detail when determining the application type.

The traditional graph measures defined previously are appropriate for the study of application graphs because of the social aspect of the communications. Application graphs are formed through specific user actions, such as surfing the web, checking email, and sharing music. It is also for this reason that the study of motifs within application graphs is interesting. In systems biology, processes such as gene transcription and regulation are not voluntary tasks; cell survival depends on them. Chapter 5 details the methodology employed to describe application graphs based on their traditional and motif-based characteristics.

Chapter 4: Data Selection and Considerations

As is the case in any type of research, proper data selection is imperative for producing accurate results and analysis. This chapter examines several of the issues involved with the collection and sampling of computer network data in an effort to build a baseline measure for “normal” network behavior, and concludes with an overview of the seven application protocols selected for this study.

4.1 Network Trace Files

The pcap library provides the packet-capture and filtering engines of several popular network analysis and monitoring tools [35]. Some examples include tcpdump, nmap, Wireshark and the Snort IDS. Tcpdump in particular is a valuable tool for capturing packets as they come across a network interface card, a process known as “sniffing”, and logging them in a raw format which can then be analyzed by other tools, as shown in Figure 4.1. Although tcpdump is able to capture all of the data associated with each network packet, such as packet length, flags and checksum values, only a few of the fields specified by the IP, TCP, and UDP RFC documents [36, 37, 38] are needed to model application graphs: source IP, destination IP, source port and destination port. These four pieces of information are enough to uniquely identify a process running over a computer network between two hosts. The creation of application graphs is discussed in Chapter 3 and the implementation detailed in Chapter 5.
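As a sketch of how these four fields might be lifted from tcpdump's text output (the exact line format varies with tcpdump version and flags, so both the sample line and the regular expression here are illustrative):

```python
import re

# Matches lines of the form "... IP <src>.<sport> > <dst>.<dport>: ..."
FOUR_TUPLE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)

def parse_four_tuple(line):
    """Return (src IP, src port, dst IP, dst port), or None if no match."""
    m = FOUR_TUPLE.search(line)
    if m is None:
        return None
    src_ip, src_port, dst_ip, dst_port = m.groups()
    return src_ip, int(src_port), dst_ip, int(dst_port)
```

Everything else on the line (timestamp, flags, length) is discarded, since only the four-tuple is needed to build an application graph.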

4.2 Challenges Associated with Network Data Collection

Pang et al. identify three key goals of sharing network data with other researchers: verification of previous research, direct comparison of competing ideas on the same data, and a broader view than a single investigator can obtain on their own [39]. Unfortunately, there are several concerns that must be addressed, such as the amount of data collected, the accuracy of the data and the protection of users’ privacy. This section outlines a few of these issues.

Figure 4.1: Tcpdump output containing timestamp, protocol, source IP, source port, destination IP, destination port, packet length and packet flags

4.2.1 Data Capture

Increased utilization and line speeds of today’s high speed, high capacity networks present challenges for collecting network data in terms of data rate, storage and processing [40]. A packet sniffer can easily log hundreds of gigabytes of data in a single day, even on a moderately sized network. A study of traffic collected at Dartmouth College shows a significant increase in peer-to-peer, streaming multimedia and VoIP traffic, whereas initial network usage was dominated by web traffic [41]. Both static and streaming multimedia applications require significantly more bandwidth than simple web documents or other non-interactive file types. Research characterizing YouTube™ traffic found that 90% of videos requested by University of Calgary campus network users were larger than 21.9 MB [42], orders of magnitude larger than the file sizes of other content types.

In addition to requiring a great deal of storage space, high speed packet capturing also requires fast memory access and high disk speed so that packets can be written to disk before the capture buffer fills and loses packets. Although undesirable, this behavior does not affect the study of application graphs proposed by this study, which uses individual packets to establish a communication link instead of aggregated flows (all packets associated with a particular origin and destination pair). Two nodes in an application graph will be connected if any packets are sent between them, regardless of which part of the flow they come from: beginning, middle, or end. Therefore, partial flows are considered in these graphs.

Another advantage of using individual packets is that TCP and UDP sessions do not need to be defined. TCP connections are established by a three-way handshake between the client and server, and are terminated by a FIN and FIN-ACK sequence. The formal establishment or tear-down of a TCP session might not be correctly logged for several reasons: the sniffer could be turned on or off in the middle of the session, parts of the handshake could be dropped by the sniffer, or either the client or server could disconnect without following the closing protocol. UDP does not establish formal sessions like TCP does, so UDP flows are sometimes segregated by establishing a timeout value after which the flow is considered terminated if there is no activity. The edges in an application graph are binary in nature and only indicate whether or not host A communicated with host B.
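The resulting graph-construction step is therefore simple. A sketch, assuming packets have already been reduced to (source IP, source port, destination IP, destination port) tuples (the function name and the filter-by-port convention are assumptions of this sketch):

```python
def application_edges(packets, port):
    """Build the directed, unweighted edge set for one application.

    A packet contributes an edge whenever either endpoint uses `port`
    (the class label); session state is deliberately ignored, so packets
    from partial flows still produce edges.
    """
    edges = set()
    for src_ip, src_port, dst_ip, dst_port in packets:
        if port in (src_port, dst_port):
            edges.add((src_ip, dst_ip))
    return edges
```

Because the edge set holds at most one entry per ordered (source, destination) pair, repeated packets between the same hosts add nothing further, matching the binary nature of the edges.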

4.2.2 Privacy and Sanitization of Data

Monitoring network traffic may raise serious privacy concerns, as data sent in cleartext (i.e. not encrypted) is easily read by sniffing. Data such as usernames and passwords sent to web sites via the HTTP protocol instead of the encrypted HTTPS protocol can be effortlessly obtained by an attacker on the network. Even if sensitive information is not being sent, an attacker can log all text and images downloaded by a user as he or she surfs the web, and reassemble the browser sessions later.

Not only do researchers who collect this kind of data need to be sure to sanitize the resulting log files to ensure the privacy of users, but they must also disguise the IP addresses of machines on the network so that an attacker does not have a map of the network with which to launch an attack. Several methods and tools have been developed to accomplish these tasks, such as [43, 39, 44, 45].

Oftentimes a network administrator or developer does not need to log packet payloads to perform tasks such as verifying routes or debugging programs that utilize sockets. If this is the case, only the packet headers are logged and the rest of the packet is discarded. Storing only packet headers also helps alleviate the issue of storage space discussed in the previous section. This shortcut cannot be used in the case of signature-based intrusion detection systems, which rely on scanning the payload of a packet for known signatures that indicate an attack. The methods proposed in this thesis do not consider packet payloads, but only the information readily available in the packet header.

4.2.3 Network and Data View

Ideally, a “God’s eye view” of a computer network would reveal all communication links within the network as well as connections from within the network to other networks outside of it. Unfortunately, many sniffers are placed at gateway nodes at the edge of a network and only capture traffic leaving from and coming to the network. As a result, traffic originating from within a network and destined for internal servers (web and application servers, email servers, etc.) is not logged because it never reaches the gateway. Some data collection projects such as [46] attempt to address the lack of internal enterprise network traffic that is available for research.

One drawback of the research method proposed in this paper is that it currently assumes network activity for a particular application is limited to a single port. This is not true for out-of-band protocols such as FTP, which send authentication and control messages over one port but use another for data transfer. Even if provided with a complete view of the network, the data is segmented into individual port numbers for analysis. Therefore, network communications over multiple port numbers will not be visible. If a client connects to a web server on port 80, and that web server requests data from a MySQL™ server (default port 3306) or an IBM WebSphere® Application Server (default port 2809), only one part of the process is visible at a time: either client to web server, web server to database, or web server to application server. Seeing all components of a particular process would reveal interesting structural motifs, but the motif and node properties examined in isolation still hint at the function of the nodes. Possible techniques for aggregating data for different views are discussed in Chapter 7.

4.3 Data Sources

The data sets used in this study come from three different sources in an attempt to show measurable differences in protocols and behavior, even across networks with different underlying architectures and usage patterns. One data set often used in intrusion detection research is the 1998 & 1999 DARPA Intrusion Detection Evaluation Data Set [47]. The primary reason this data was not selected, however, is its age; as Henderson et al. point out, the type of traffic seen in computer networks has changed [41]. This is not to imply that the approach described in this thesis would not work with older data, but that newer network traces containing a wider variety of application use might prove more interesting to examine. Additionally, traffic for the DARPA initiative is synthetic, whereas the data sets described in this section contain real network data that reflects current trends in network and protocol use. Table 4.1 provides overview statistics for the traces examined.

4.3.1 Dartmouth College Wireless Traces

The CRAWDAD project at Dartmouth College provides an archive of wireless network data from several contributors around the globe. Included in the archive are 163 GB of packet headers captured from eighteen buildings on the campus during the Fall 2003 semester [48]. The data collected is representative of traffic in residential buildings, academic buildings, and the library. It has been sanitized in such a way that the IP addresses are consistent across traces, allowing for a more complete picture of network use. The campus wireless network contains several thousand users and over 450 wireless access points.

4.3.2 LBNL/ICSI Enterprise Tracing Program

The ICSI Enterprise Tracing Program hopes to provide a view into the internal traffic of an entire enterprise site [46]. These traces, taken from the Lawrence Berkeley National Laboratory (LBNL) in 2004 and 2005, span more than 100 hours of activity and include traffic from several thousand internal hosts. The data is sanitized in accordance with the methodologies described in [39]. Like the Dartmouth wireless traces, only packet headers were captured and the payload discarded.

                           Dartmouth      LBNL          OSDI
Capture length (seconds)   21818.575      600.079       193.348
Number of packets          2023527        2261261       324116
Avg. packets/sec           92.743         3768.274      1676.335
Number of bytes            1092602793     778659304     94814149
Avg. bytes/sec             50076.726      1297595.353   490380.757

Table 4.1: Summary statistics of three trace files examined

4.3.3 OSDI Conference Network Traces

The last source of data used for analysis in this paper also comes from the CRAWDAD archive, and includes traces from ten sniffers at the 2006 Operating Systems Design and Implementation (OSDI) Conference [49]. Researchers collected this data to enable the analysis of the behavior of a heavily used wireless LAN. The data was initially sanitized on-the-fly and then reprocessed off-line to further obfuscate the MAC addresses as necessary. Although this data set does not have the “enterprise” characteristics of the previous two, its inclusion helps to determine the generalizability of the methods proposed in this work to different networks and network points of view.

4.4 Protocol Selection

Several criteria were used to select the protocols examined in this paper, including availability, popularity and diversity. First and foremost, there must be enough data samples of a particular protocol in the trace files to be able to perform the graph characteristic and motif analysis. To achieve this goal, more well-known and widely used protocols were chosen. Also, protocols that have different architectures (client-server vs. peer-to-peer, for example) were selected in order to highlight the differences in node characteristics. Because packet payloads are not inspected, applications that operate on official IANA port numbers and are in-band protocols are used, so that reasonable assumptions can be made about the data and the port numbers accurately reflect the protocol being used. As a reminder, the port number is not used to classify applications, but only to provide class labels.

AOL Instant Messenger (AIM)

AOL’s instant messaging client has been a popular application for users around the world for over a decade. AIM uses a proprietary protocol called OSCAR to communicate with other clients [50]. Multicast architectures exist and are used by some chat programs such as IRC, but all AIM connections go through a centralized server. Users authenticate to the AIM login server on port 5190. Once the user’s session has been established, all chat communications also go through central AIM servers on port 5190. The exception to this is when a user attempts to establish a direct connection to another user (such as when sending pictures or other files), in which case the communication goes directly to the other user and bypasses the central AIM servers. Therefore, AIM is primarily a client-server application, with some peer-to-peer capabilities as well. This study restricts itself to communications on port 5190, so any direct file transfers are ignored.

HyperText Transfer Protocol (HTTP)

The HTTP protocol is used to retrieve hyper-linked text documents from the world wide web [51]. A client initiates an HTTP request by connecting to a web server, typically on port 80. The web server then responds with a status line, as well as another message including the contents requested, such as an HTML file or an image. HTTP is a stateless protocol, which means no information is retained between requests. This protocol falls directly into a client-server architecture model.

Domain Name System (DNS)

DNS is a hierarchical naming system that maps meaningful domain names to numerical IP addresses [52]. If a DNS server does not know the correct mapping for a given domain, it can instruct the DNS resolver on the client side of where to query next to attempt to resolve the address. DNS primarily communicates via UDP on port 53, and also follows a client-server architecture. Its hierarchical nature, however, makes it an interesting selection for analysis.

Kazaa

Kazaa is a peer-to-peer file sharing application built on the FastTrack protocol that operates on port 1214. This protocol employs supernodes for scalability: a supernode is any node on the network that also acts as a proxy and relay for the network, handling data flow and connections for other users. A peer-to-peer network should be more highly connected than a client-server model since all nodes in the network act as both clients and servers for each other.

Microsoft Active Directory (MSDS)

Microsoft Active Directory is a client-server protocol that provides a way to manage objects and relationships across a network. Objects can be resources such as printers, services such as email, or users (accounts and groups). It provides several services such as DNS-based naming, authentication methods and LDAP-like directory services. Active Directory Domain Services (MSDS) is the central location for configuration information, authentication requests and information about network objects [53]. It operates on port 445. Windows shares and Active Directory are commonly used in Windows-based networks, and the inclusion of MSDS for analysis provides an example of platform-dependent network traffic.

NetBIOS Name Service

NetBIOS (Network Basic Input/Output System) is used to allow applications on separate computers to communicate over a local area network. It provides three main services: (i) name service for name registration and resolution, (ii) session service for connection-oriented communication, and (iii) datagram distribution service for connectionless communication. The name service communicates over port 137 with either the TCP or UDP protocol. A computer, which has a unique host name, might have multiple NetBIOS names. The inclusion of NetBIOS for analysis is interesting because it often receives port scans and is frequently the target of malicious attacks. The architecture of NetBIOS communications is a bit unique in that it does not fall cleanly into a client-server model, nor does it fit the P2P model. It will occasionally use broadcast messages, and NetBIOS hosts can also be configured as peers.

Secure Shell (SSH)

Secure shell is a protocol that allows encrypted data to be sent between two computers on a network. It is often used for remote administration of other computers, creating secure tunnels for web browsing and securely copying files. SSH is primarily used in UNIX/Linux environments and runs on port 22. SSH utilizes a client-server architecture.

Chapter 5: Experimental Methodology

The analysis of application graphs involves several stages and requires the use of many different software tools. The major tasks include: parsing and storing network data, creation of graphs and vertex profiles, node property analysis, motif searching, and creating a classifier to predict application labels. Optimization of the classification process via feature weighting is also considered. This chapter describes the process as well as the tools used, which are open-source and freely available. For the reader’s convenience, a summary diagram is given in Figure 5.1.

[Figure 5.1 depicts the processing pipeline: parse and store data, then construct application graphs, which feed two parallel tracks — traditional profile creation and analysis, and motif-based profile creation and analysis — each followed by nearest neighbor classification and evolutionary attribute weighting. Tools used: Wireshark, Afterglow, Python, NetworkX, FANMOD, and RapidMiner.]

Figure 5.1: Overview of the proposed methodology and tools used

5.1 Hardware and Linux System

All tests were run on a multi-core system running the Linux kernel version 2.6.22. The system contains four dual-core AMD 64-bit processors running at 1.8 GHz each. It uses a shared-memory architecture with 8 GB of memory. Although most of the tools are not written to take advantage of multiple cores, the hardware architecture allows for analysis of multiple network traces to happen simultaneously.

5.2 Packet Capture and Storage

The network traces are in the pcap format as described in Chapter 4. Modified parsers based on those distributed as part of The Afterglow Project [54] were used to parse pcap files. Additionally, Wireshark [10] was used to extract basic information from the network trace files, including the source IP, destination IP, source port, destination port, timestamp, protocol and packet length.

tshark -t e -r input.pcap tcp or udp | python tshark2mysql.py t

Figure 5.2: Storing packets from a pcap file into a MySQL database

Once the packets have been parsed, they are stored into a MySQL™ database for later retrieval. This is done to facilitate later steps in the process so that packets can be selected based on their source or destination port numbers, protocol type, or other attributes. Figure 5.2 illustrates the process of parsing and storing information from input.pcap into a MySQL database table t. Each network trace file is stored in a unique table within the database.

5.3 Creation of Application Graphs

The next step in the process is to model the application graphs and analyze the traditional measures as described in the first half of Chapter 3. NetworkX is a package for the creation, manipulation and study of complex networks, written in the Python programming language [55]. Graphs are created by querying a MySQL database table for all entries for which either the source or destination port number matches the port number of one of the seven application protocols. Although port numbers do not always accurately reflect the application bound to them, they are generally a strong indicator, especially for the well known port numbers 0-1023 (e.g., HTTP servers typically listen on port 80 for connections). For the purposes of this work, the applications tied to each port number are assumed to be correct; however, verification is not possible because the packet payloads have been discarded.
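As a concrete illustration of this selection step, the sketch below builds an application edge set from parsed packet rows. The actual pipeline queries MySQL and constructs graphs with NetworkX; the row layout and function name here are hypothetical stand-ins, not the thesis code.

```python
def build_application_graph(rows, port):
    """Collect the directed edge set of an application graph from packet
    rows (src_ip, dst_ip, src_port, dst_port), keeping only rows where
    either port matches the application's port number."""
    edges = set()
    for src_ip, dst_ip, sp, dp in rows:
        if sp == port or dp == port:
            edges.add((src_ip, dst_ip))
    nodes = {n for edge in edges for n in edge}
    return nodes, edges

# Two HTTP packets and one SSH packet; only the HTTP hosts remain.
rows = [
    ("10.0.0.1", "10.0.0.2", 51000, 80),
    ("10.0.0.2", "10.0.0.1", 80, 51000),
    ("10.0.0.3", "10.0.0.4", 51001, 22),
]
nodes, edges = build_application_graph(rows, 80)
```

The resulting edge set would be handed to NetworkX as a directed graph for the measures described below.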

Graph Size

There are two possible approaches to consider when creating and comparing application graphs across different protocols. One approach is to collect network data for a constant amount of time and then study the resulting communications that occurred. For example, each application graph would represent ten hours of SSH communications, ten hours of HTTP communications, and so on. This approach is complicated for several reasons. The data collected for these experiments come from several different sources where the network monitors were run for variable lengths of time. Because certain applications are much more heavily used than others, there is no guarantee that there would be enough, or conversely, not too much, data for each protocol. Additional data pre-processing would be required to mitigate variables such as these.

Instead, this study attempts to analyze application graphs that have a similar number of participating nodes by allowing the network capture lengths to vary. By doing so, the number of hosts in each application graph (Table 5.1) can be more easily controlled, and the interaction patterns that form over a longer amount of time may be viewed. The order of each application graph is consistent within examples of a particular protocol, but not across protocols due to lack of availability.

Protocol  AIM  DNS  HTTP  Kazaa  MSDS  Netbios  SSH
Order      50   68    80     80    40       76   40

Table 5.1: Graph orders for each application protocol

The graph orders serve as an upper bound for the number of nodes considered in each application graph. For example, if ten network trace files are searched for

AIM traffic, the lowest number of hosts found communicating on port 5190 in any of the files becomes a limiting factor for the other trace files. If another file were to have 120 hosts using AIM, only the first fifty would be considered. However, all communications among those fifty hosts over the duration of the network trace would be added to the graph, not just the links created as each node is added.
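The two-pass capping logic described above can be sketched as follows; the packet tuple layout and function name are illustrative, not the thesis implementation. Hosts are admitted in order of first appearance until the cap is reached, and then every edge among the kept hosts over the whole trace is included.

```python
def cap_graph_order(packets, port, order):
    """First pass: admit hosts in order of first appearance on the
    application port, up to `order` hosts. Second pass: keep ALL edges
    among admitted hosts over the entire trace, not just the links
    seen while each node was being added."""
    kept = []
    for src, dst, sp, dp in packets:
        if sp != port and dp != port:
            continue
        for host in (src, dst):
            if host not in kept and len(kept) < order:
                kept.append(host)
    kept_set = set(kept)
    edges = {(src, dst) for src, dst, sp, dp in packets
             if (sp == port or dp == port)
             and src in kept_set and dst in kept_set}
    return kept_set, edges

packets = [
    ("A", "B", 40000, 5190),  # A and B admitted first
    ("C", "B", 40001, 5190),  # C exceeds the cap of 2
    ("B", "A", 5190, 40000),  # later B -> A traffic is still added
]
hosts, edges = cap_graph_order(packets, 5190, 2)
```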

Connected Components

It is natural to expect that not all nodes in an application graph are connected; groups of nodes exist that communicate with one another, but there is no communication that connects one group to the next. Scale-free networks, whose degree distributions follow a power law, usually contain one larger connected component and a few smaller connected components [24]. Several of the protocols exhibited this behavior of scale-free networks except for SSH, which showed a high number of small connected components. When calculating graph attributes for disconnected graphs, each connected component is treated as a separate graph. This ensures that measures such as distances between nodes, radius, diameter and others discussed in Chapter 3 are well defined and do not use infinite path lengths to represent nodes that are disconnected, as is done in some graph algorithms.
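Splitting a disconnected graph into its components can be sketched without NetworkX (which provides this directly) using a breadth-first search over the undirected adjacency:

```python
from collections import deque

def weakly_connected_components(nodes, edges):
    """Split a possibly disconnected application graph into weakly
    connected components (edge direction ignored), so each component
    can be analyzed as a separate graph with well-defined distances."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)  # treat the graph as undirected for connectivity
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            x = queue.popleft()
            if x in comp:
                continue
            comp.add(x)
            queue.extend(adj[x] - comp)
        seen |= comp
        components.append(comp)
    return components
```

Each returned component is then measured independently, so radius, diameter, and eccentricity never involve infinite path lengths.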

5.4 Traditional Graph Measures

After the application graphs have been created, NetworkX is again used to perform calculations on each connected component. There are eleven node characteristics (described in Chapter 3) examined in this approach: indegree, outdegree, total degree, clustering coefficient, betweenness centrality, degree centrality, closeness centrality, eigenvector centrality, eccentricity, whether or not the node is a center node, and whether or not the node is a periphery node. NetworkX provides functions to calculate all of these values except for eigenvector centrality, whose implementation is listed in Appendix B.
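As an illustration of the degree-related entries of a profile, the sketch below computes four of the eleven measures by hand. The thesis pipeline obtains these (and the remaining centrality measures) from NetworkX, so this is only a dependency-free stand-in; the (n − 1) normalization assumed for degree centrality follows NetworkX's convention.

```python
def degree_measures(nodes, edges):
    """Compute indegree, outdegree, total degree, and degree centrality
    (total degree divided by n - 1) for every vertex of a directed
    graph given as an edge set."""
    n = len(nodes)
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return {v: {
        "indegree": indeg[v],
        "outdegree": outdeg[v],
        "total": indeg[v] + outdeg[v],
        "degree_centrality": (indeg[v] + outdeg[v]) / (n - 1),
    } for v in nodes}

m = degree_measures({"A", "B", "C"}, {("A", "B"), ("A", "C")})
# "A" sends to both other hosts: outdegree 2, degree centrality 1.0
```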

5.5 Motif Analysis

For the task of motif searching, several network analysis tools were considered: FANMOD [32], mfinder [56], MAVisto [57] and Pajek [58]. FANMOD (Fast Network Motif Detection) was selected for its rich feature set, including support for a graphical user interface in addition to command line invocation, generation of motif images, the ability to export results in several different formats and support for node and edge colors. FANMOD employs algorithms [59] that allow it to search for motifs faster and with less memory usage than other motif searching tools such as MAVisto or mfinder.

FANMOD Parameters

FANMOD’s support for colored vertices and colored edges allows for the encoding of additional information into a motif structure beyond its shape and the directionality of its edges. This study assumes edges are directed, but further exploits the flow of information between nodes by defining three classes (colors) of hosts: client, server, and peer (see Figure 5.3). In computer networking these terms refer to nodes that are consumers of a service, providers of a service, or nodes that act as both consumers and providers of a service. Here, they take on a related meaning, but are defined somewhat more generally based on the source IP, source port and destination port of a packet.

Definition 1 Let φ be the port number associated with an application and v be a node in Gφ, the application graph of φ. Also, let P be a packet sent by v over the network, where Psp and Pdp are the source and destination ports of P, respectively. Client, server and peer are defined as follows:

• If Pdp = φ then v is a client node, labeled vc

• If Psp = φ then v is a server node, labeled vs

• If both the client and server conditions hold, then v is a peer node, labeled vp

As described in Chapter 2, a client computer will request a service by connecting from a random upper port on its own machine to a particular port φ on a server. Therefore, if the destination port of a packet sent by node v is the port number φ, then v is consuming the service provided on that port. For many protocols, the server will then send data back to the client from port φ. In this instance, the server becomes the source IP, sending data from φ, as specified in Definition 1. The third part of the definition describes the behavior of peers, computers that act as both “clients” and “servers”. Therefore any node that is found to both send and receive data on port φ is labeled as a peer. Edge colors are not currently used, but are considered for future work.

Figure 5.3: A motif with colored vertices
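Definition 1 can be sketched as a small labeling routine over packet tuples; the tuple layout is a hypothetical stand-in for the database rows, not the thesis code.

```python
def color_hosts(packets, phi):
    """Label senders per Definition 1: a host that sends packets TO
    port phi is a client, one that sends FROM port phi is a server,
    and one that does both is a peer."""
    to_phi, from_phi = set(), set()
    for src, dst, sp, dp in packets:
        if dp == phi:
            to_phi.add(src)    # candidate clients
        if sp == phi:
            from_phi.add(src)  # candidate servers
    colors = {}
    for host in to_phi | from_phi:
        if host in to_phi and host in from_phi:
            colors[host] = "peer"
        elif host in to_phi:
            colors[host] = "client"
        else:
            colors[host] = "server"
    return colors

packets = [
    ("C", "S", 40000, 80),  # C requests a service on port 80
    ("S", "C", 80, 40000),  # S answers from port 80
    ("P", "Q", 41000, 80),  # P consumes the service...
    ("R", "P", 42000, 80),  # (R acts as another client)
    ("P", "R", 80, 42000),  # ...and P also provides it: P is a peer
]
colors = color_hosts(packets, 80)
```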

Random Graphs and Statistical Significance

Milo et al. define “network motifs” as patterns of interconnections that occur in complex graphs at numbers that are significantly higher than those in randomized networks [3]. To determine which motifs are statistically significant, network data retrieved from the MySQL database for a particular application is converted into a FANMOD input file, which describes an application graph. First, the input graph is searched for all motifs of either order 3 or order 4. Next, a set of random graphs is generated and the motif search is repeated for each. The frequency at which motifs occur in the original input graph (the application graph) is compared to the frequency of those same motifs in the random graphs. Motifs that are found significantly more often in the original graph are then reported to the user.

Random graphs are created through a series of “edge switching operations” (Figure 5.4(a)), using the original input graph as a starting point. Several parameters exist to control the randomization process. In this study, the “local constant” model is selected, which means that unidirectional edges are only exchanged with other unidirectional edges. As a result, the number of bidirectional edges incident upon each vertex remains constant. Another option selected is to “regard vertex color” (Figure 5.4(b)), which indicates that edges should only be exchanged if their endpoints have the same color. These options were enabled to create randomized networks that are still structurally similar to the original network and allow for a more stringent comparison [3].

(a) Edge-switching operation (b) Regard vertex color

Figure 5.4: FANMOD edge-switching process for generating random networks [4]

There is some variability in defining the phrase “statistically significant”, as different thresholds can be used. The mfinder Tool Guide suggests using 5,000+ random graphs when searching for motifs of order 3, and 10,000+ random graphs when searching for motifs of order 4, and suggests that ten occurrences of any individual motif is a good starting point to measure the quality of a result [56]. For the motif analysis performed in this work, similar parameters were used. To keep the problem size reasonably small, 5,000 random networks were generated when searching for both order 3 and order 4 motifs; FANMOD supports sampling subgraphs for motif searching, but an exhaustive enumeration of all subgraphs is currently used. The FANMOD output files provide the user with several pieces of information, including the percentage of subgraphs each motif was found in (for the original networks as well as in random networks), and a p-value for each motif.

The p-value is a statistical measure that describes the probability of obtaining a result at least as extreme as the result observed, given that the null hypothesis, or expected outcome, is true [60]. If the p-value falls outside the range of the expected outcome and is less than some threshold value α, the result is said to be statistically significant at the α level. In practice, values of 5%, 2.5% and 1% for α are common. For this study a “significant motif” is any motif that occurs in at least 1% of subgraphs in the original graph and has a p-value of 0 — essentially those motifs that FANMOD determines are the most significant results. By setting the threshold at this p-value, the number of motifs considered for analysis can be limited to a more select group. Experimentally, this results in a list of 130 significant motifs.
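The thresholding rule can be sketched as a simple filter. The tuple layout below is illustrative only and does not reproduce FANMOD's actual output columns.

```python
def select_significant(motif_rows, min_freq=0.01, max_p=0.0):
    """Apply the significance rule used in this study: keep a motif if
    it appears in at least 1% of subgraphs of the original graph and
    its p-value is 0. Rows are (motif_id, original_frequency, p_value)
    tuples, a hypothetical stand-in for FANMOD's CSV fields."""
    return [mid for mid, freq, p in motif_rows
            if freq >= min_freq and p <= max_p]

rows = [
    (38, 0.052, 0.0),   # frequent and maximally significant: kept
    (46, 0.004, 0.0),   # too rare: dropped
    (14, 0.110, 0.02),  # p-value above the threshold: dropped
]
kept = select_significant(rows)
```

Applying this rule across all seven protocols is what yields the list of 130 significant motifs mentioned above.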

5.6 Vertex Profiles

The data collected through traditional graph analysis and motif analysis is used to create “profiles” of each node, used in the classification algorithm described in Section 5.7. Each profile is a data point in d-dimensional space, where d is the number of attributes in the profile. A list of n vertices labeled v1, ..., vn is written as follows:

v1 = [ a1, a2, a3, ..., ad ]
v2 = [ a1, a2, a3, ..., ad ]
 .
 .
 .
vn = [ a1, a2, a3, ..., ad ]

Figure 5.5: Arrays representing vertex profiles

The attributes a1 through ad can be any numerical data type or numerical representation of a data type. In the traditional graph analysis approach there are eleven attributes (degree counts, centrality measures, etc.), so d = 11. These attributes include integers, real numbers and boolean values represented as a 1 (true) or a 0 (false). The intent is to associate an application with a certain profile.

The idea of vertex profiles based on graph characteristics is adapted to the motif- based approach. Instead of considering the percentage of subgraphs a motif occurs in, however, a binary attribute is created that describes whether or not the vertex participates in the motif. One of the files output by FANMOD motif searches is a comma separated file with the following format:

adjacency matrix,

After the significant motifs have been determined, the script in Listing B.5 parses these files and creates the profiles for each node based on its participation in significant motifs. The dimensionality d of the motif profiles is 130: 42 of these are significant order 3 motifs, while the remaining 88 are significant order 4 motifs. The motif profiles were built putting both order 3 and order 4 motifs together because preliminary investigations indicated that the combination is more successful in separating and identifying protocols than either can do alone.
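Profile construction from motif participation can be sketched as follows; the motif ids and the `participation` mapping are illustrative, as the real data comes from parsing FANMOD's dump files (Listing B.5).

```python
def motif_profiles(hosts, participation, significant):
    """Build binary motif profiles: attribute i of a host's profile is
    1 if the host appears in at least one instance of the i-th
    significant motif, else 0. `participation` maps each motif id to
    the set of hosts seen inside instances of that motif."""
    return {h: [1 if h in participation.get(m, set()) else 0
                for m in significant]
            for h in hosts}

profiles = motif_profiles(
    hosts={"A", "B"},
    participation={"m1": {"A"}, "m2": {"A", "B"}},
    significant=["m1", "m2"],
)
# profiles["A"] == [1, 1]; profiles["B"] == [0, 1]
```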

5.7 K-Nearest Neighbor Classification

The tasks of node classification and feature weighting (Section 5.8) are handled by RapidMiner, an open source knowledge-discovery and data mining tool built on the Java™ platform [61]. RapidMiner allows for data mining experiments to be quickly constructed through the use of hundreds of modular operators that handle data pre-processing and post-processing, creation and storage of models, clustering and classification tasks as well as statistical analysis.

The k-nearest neighbor (k-NN) classification algorithm is a simple machine learning algorithm for classifying objects based on the closest training examples in a feature space. First, the data is broken into a training set and a test set. The proximity of a test point z to every point in the example set is then calculated.

Algorithm 1 The k-Nearest Neighbor classification algorithm [62]
1: Let k be the number of nearest neighbors and D be the set of training examples.
2: for each test example z = (x′, y′) do
3:    Compute d(x′, x), the distance between z and every example (x, y) ∈ D.
4:    Select Dz ⊆ D, the set of k closest training examples to z.
5:    y′ = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi)
6: end for

After the nearest-neighbor list is obtained, the test example z is classified based on a majority vote of the k nearest neighbors to z. In this study, k = 1, so a test point z is given the same label as the label of its closest neighbor. In line 5 above, yi is the class label for one of the nearest neighbors, and I() is an indicator function that returns the value 1 if its argument is true and 0 otherwise.
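A minimal 1-NN classifier over vertex profiles, using the Euclidean distance of Section 5.7.1, might look like the sketch below. RapidMiner performs this step in the thesis; note that `min` keeps the first of several equidistant training points, which mirrors the first-label tie behavior discussed in Chapter 6.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def classify_1nn(training, z):
    """k-NN with k = 1: the test profile z receives the label of its
    single closest training profile. `training` is a list of
    (profile, label) pairs."""
    _, label = min(training, key=lambda pair: euclidean(pair[0], z))
    return label

training = [([0.0, 0.0], "AIM"), ([5.0, 5.0], "SSH")]
label = classify_1nn(training, [4.0, 5.0])  # closest to the SSH profile
```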

5.7.1 Measuring Profile Separation

A number of similarity measures can be used to determine the distance from one point to another (line 3 of Algorithm 1), the selection of which depends on the type of data being examined and its application [62]. For example, there is Euclidean distance, Jaccard coefficient, cosine similarity and simple matching coefficient. Euclidean distance is often chosen for instances of dense continuous data such as that found in the profiles for traditional graph analysis. Although the simple matching coefficient is often applied to binary data such as the motif profiles, the Euclidean distance is also suitable, and is selected for use in this study. Equation 5.1 defines this distance,

where n is the number of dimensions and xk and yk are the k-th attributes of x and y.

d(x, y) = √( Σ_{k=1}^{n} (xk − yk)² )    (5.1)

5.7.2 Cross Validation of Classification Results

Cross validation is the process of partitioning a data set into n subsets, training a classifier with n − 1 subsets and using the remaining subset to test. The process is then repeated n times with a different subset left out each time. In 10-fold cross validation, for example, ten subsets are created, each containing 10% of the original data set. In each iteration, 90% of the data is used for training and 10% is used for testing. To avoid the possibility of a particular subset not containing any instances (or very few) of a particular label, stratified sampling is used so that each subset contains roughly the same proportion of labels.
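A stratified partition can be sketched by dealing each class's examples round-robin across the folds. RapidMiner handles this internally; the sketch is only illustrative.

```python
from collections import defaultdict

def stratified_folds(examples, n=10):
    """Partition (profile, label) examples into n folds so every fold
    holds roughly the same proportion of each label: the examples of
    each class are dealt round-robin across the folds."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    folds = [[] for _ in range(n)]
    for items in by_label.values():
        for i, example in enumerate(items):
            folds[i % n].append(example)
    return folds

# 10 AIM and 10 SSH examples split into 2 folds of 5 + 5 each.
data = [([i], "AIM") for i in range(10)] + [([i], "SSH") for i in range(10)]
folds = stratified_folds(data, 2)
```

Each fold then serves once as the test set while the remaining folds train the classifier.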

5.8 Genetic Algorithm Feature Weighting

Genetic algorithms provide a unique way to investigate which attributes in the vertex profiles more effectively classify application protocols, as well as increase the accuracy of the nearest-neighbor classifier. This study utilizes a genetic algorithm to perform evolutionary feature weighting, the results of which are applied to each profile and a new classifier is built using the nearest neighbor algorithm as before. Alternatively, a brute-force search of all attribute combinations (given by Equation 5.2) might be possible for a small attribute set such as in the case of traditional graph analysis, but is not feasible for motif analysis.

c = Σ_{n=1}^{d} C(d, n)    (5.2)

Given that d = 11 for traditional graph analysis, applying the equation above reveals that the number of possible attribute combinations c is 2,047. However, when d = 130 for motif analysis, c = 1.36 × 10^39. Genetic algorithms present one possible way to explore this problem space within a reasonable amount of time.
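The counts quoted above are easy to verify directly, since the sum in Equation 5.2 over all non-empty attribute subsets equals 2^d − 1:

```python
from math import comb

def attribute_combinations(d):
    """Equation 5.2: the number of non-empty attribute subsets,
    sum over n = 1..d of C(d, n), which equals 2**d - 1."""
    return sum(comb(d, n) for n in range(1, d + 1))

# d = 11 (traditional measures) -> 2,047 combinations
# d = 130 (motif attributes)    -> 2**130 - 1, about 1.36e39
```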

5.8.1 Overview of Genetic Algorithms

Genetic algorithms view learning as a competition among a population of evolving candidate problem solutions [63]. During each generation, a fitness function (line 4 of Algorithm 2 below) assesses each candidate to determine if it will contribute to the next generation of solutions. Those solutions found to be the most “fit” are selected for mating and mutation and shape the following generation of potential solutions. The algorithm repeats until some termination condition is met, such as convergence to a solution or a predefined number of generations have been tested.

Algorithm 2 General form of a genetic algorithm [63]
1: Set time t = 0
2: Initialize the population P(t)
3: while the termination condition is not met do
4:    Evaluate fitness of each member of the population P(t).
5:    Select members from population P(t) based on fitness.
6:    Produce the offspring of these pairs using genetic operators.
7:    Replace, based on fitness, candidates of P(t) with these offspring.
8:    Set time t = t + 1
9: end while

Before the algorithm can begin, candidate solutions must be transformed into an appropriate representation for the problem space. Examples include binary, real value, and tree encoding, the simplest and most studied of which is binary encoding [64]. Initial populations of candidate solutions are usually chosen at random. The population size depends on the problem space, but studies have shown a population size of 20-30 generally yields good results [65, 66]. At this point, the fitness function evaluates each member of the population, and selects the best candidates for mating.

Figure 5.6 shows what a simple crossover of two binary strings might look like.

Input Bit Strings        Output Bit Strings
    0011|0001                0011|1011
    0100|1011      =⇒        0100|0001

Figure 5.6: Single-point crossover of two binary strings

Just like in evolutionary biology, there is a small chance for random genetic mutation to occur. In a binary string, this would equate to one of the bits being flipped from a 0 to a 1 or vice versa, allowing the algorithm to explore more of the problem space and not settle on a local solution. Previous research suggests variable values for mutation probability, such as 0.0001 [65] or 0.005 - 0.01 [66].
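The crossover and mutation operators can be sketched in a few lines; the bit strings follow the Figure 5.6 example, and the function names are illustrative rather than RapidMiner's.

```python
import random

def crossover(a, b, point):
    """Single-point crossover of two bit strings: swap the tails
    after `point` (Figure 5.6 cuts after bit 4)."""
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(bits, rate, rng=random):
    """Flip each bit independently with probability `rate`."""
    return "".join(("1" if b == "0" else "0") if rng.random() < rate else b
                   for b in bits)

children = crossover("00110001", "01001011", 4)
# children == ("00111011", "01000001"), matching Figure 5.6
```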

5.8.2 Feature Weighting

The RapidMiner distribution contains a prewritten test for evolutionary feature weighting using genetic algorithms. In the context of application identification, the function used to determine the fitness of candidate solutions is based upon whether or not the potential solution increases the overall accuracy of the 1-NN classifier. Solutions that do not increase the performance of the classifier are not selected to contribute to the following generation of candidate solutions. The algorithm is run for thirty generations, by which time the system should stabilize and begin to converge to a solution set of attribute weights. The full test parameters, including crossover probabilities, mutation rates and candidate selection can be found in Appendix C.

Chapter 6: Results and Analysis

To test the accuracy and performance of the proposed approaches, several experiments were run using the method described in Chapter 5. In total, 65 application graphs were examined: ten AIM, ten DNS, ten HTTP, five Kazaa, ten MSDS, ten Netbios and ten SSH, with the discrepancy resulting from fewer examples of peer-to-peer Kazaa traffic being located in the data traces that were downloaded. Profiles were classified using both traditional graph attributes and motif-based attributes. Afterwards, profile attributes were weighted using a genetic algorithm. This step aims to provide two important functions: to increase the accuracy of the classifiers and to provide insight into which attributes are more effective for identifying network applications. Analysis of several key attributes is provided in this chapter, as well as a direct comparison between traditional and motif-based profiles.

6.1 Preliminary Investigations

Because motifs have not been applied in the realm of application identification, some preliminary classification work was required to vet this approach. Profiles for each of the 65 application graphs were created using a combination of significant order 3 and order 4 motifs, where each attribute represents the frequency of a particular motif within that graph. The results provided in Table 6.1 were encouraging (for the full classification results see Appendix D). Perhaps a more interesting question, however, is not whether an entire graph of communications can be correctly classified, but instead whether the activities of a particular host can be identified. It is on this question that the remainder of the chapter is focused.


Protocol  AIM  DNS  HTTP  Kazaa  MSDS  Netbios  SSH
Accuracy  80%  80%   90%    40%   60%     100%  80%

Table 6.1: Classification accuracy of 65 application graphs

6.2 Initial Results

Classification results are presented as confusion matrices; each row of the table represents a predicted class label (an application in this case), while the columns represent the true class label. The boldface numbers along the diagonal indicate correct classifications. Confusion matrices also show false positives and false negatives. Data points that are predicted to have a certain class label but are incorrect are known as false positives, found in the rows of the matrices. False negatives are examples of a particular class that are incorrectly labeled, shown in the columns. For example, given a set of data that is predicted to be hosts sharing files via Kazaa, true positives would be those hosts that are actually using the P2P application while false positives would be those hosts that are not. Conversely, given a set of data that is known to be file-sharing hosts, false negatives would be those that are not labeled as using the Kazaa application.

          True A   True B   True C   Precision
Pred. A        5        2        0       71.4%
Pred. B        3        3        2       37.5%
Pred. C        0        1       11       91.7%
Recall     62.5%    50.0%    84.6%     „ 70.4%

Table 6.2: An example confusion matrix with three classes

The performance of the nearest-neighbor classification models is described by three different accuracy measures. The overall accuracy of a model (denoted by „ next to the number in the bottom-right corner) is simply the number of correct classifications (true positives) over all classifications. Given a set of predictions of a particular label, class precision is a measure of the accuracy of those predicted labels.

It is the ratio of correct predictions of label l to all predictions of label l. It can be written:

precision = true positives / (true positives + false positives)    (6.1)

Class recall (also called sensitivity) measures the accuracy of predicted labels if provided a complete set of true labels. Recall is given by the following equation:

recall = true positives / (true positives + false negatives)    (6.2)

Table 6.2 displays the results of an example classification experiment, as well as the accuracy, precision and recall measures. This confusion matrix shows that while the classifier has some trouble distinguishing between class A and class B, it can effectively detect examples of class C.
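The three measures can be sketched directly from a predicted-by-true count matrix; the example below reuses the counts of the three-class illustration in Table 6.2.

```python
def confusion_stats(matrix, labels):
    """Compute per-class precision (Eq. 6.1), per-class recall
    (Eq. 6.2), and overall accuracy from a confusion matrix whose
    rows are predicted labels and columns are true labels."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    stats = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(matrix[i])              # tp + false positives
        actual = sum(row[i] for row in matrix)  # tp + false negatives
        stats[label] = {"precision": tp / predicted, "recall": tp / actual}
    return stats, correct / total

# The three-class example: accuracy (5 + 3 + 11) / 27, about 70.4%.
m = [[5, 2, 0],
     [3, 3, 2],
     [0, 1, 11]]
stats, accuracy = confusion_stats(m, ["A", "B", "C"])
```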

6.2.1 Traditional Graph Measure Profiles

To remind the reader, traditional graph measure profiles have eleven attributes including degree counts, centrality measures and clustering coefficient (see Section 5.4 for the full list). There are a total of 3,940 unique hosts found in the 65 application graphs. Each line of the input file for the nearest neighbor algorithm contains the true label assigned to the host and the eleven graph measures associated with that host. Not all protocols have an equal number of training examples due to the popularity and availability of certain applications in the trace files, but each protocol has 400-800 examples.

The computational load of a single test point for the nearest neighbor algorithm is O(nd) where n is the number of training samples and d is the number of attributes.

When using 10-fold cross validation, the test set has size n/10 and ten iterations are run, making the overall complexity of this process O(n²) if we absorb the constant d into the expression. Although other methods exist to reduce the number of computations necessary, RapidMiner is able to generate the model and accuracy measures for 3,940 data points in just a few seconds. Table 6.3 shows the resulting confusion matrix, where each row is the predicted label and each column is the actual label.

           AIM    DNS   HTTP  Kazaa   MSDS  Netbios    SSH  Precision
AIM        417     29     89     37     91       41    320     40.72%
DNS          2    612      6      1      7       20     10     93.01%
HTTP        49     11    658      2     20       32     11     84.04%
Kazaa        1      1      3    355      1        1      0     98.07%
MSDS        10      5     13      1    255       10     29     78.95%
Netbios     13     19     24      4      9      655      2     90.22%
SSH          8      3      7      0     17        1     28     43.75%
Recall  83.40% 90.00% 82.25% 88.75% 63.75%   86.18%  7.00%   „ 75.63%

Table 6.3: Confusion matrix of unweighted traditional graph measures

After this initial classification test there are five protocols that have greater than 80% of their profiles labeled correctly (class recall): AIM, DNS, HTTP, Kazaa and Netbios. The class recall of SSH is strikingly low at 7%. SSH is a particularly difficult application to classify because it is used for a variety of tasks, including remote management of hosts, application tunneling and file transfers using the secure copy program. The traditional measures of SSH application graphs often resemble those of other applications, resulting in a low class recall value.

Despite the fact that about half of the eleven traditional attributes are real-valued, several ties occur when the nearest-neighbor algorithm computes the Euclidean distance from a test point to the training points in the model. In this study, a tie situation is termed a profile collision (described in a moment). This behavior is desirable if the tie is between profiles of the same class, suggesting that certain profiles are strongly indicative of a particular application. However, many examples also tie with examples from several different classes. When this happens, RapidMiner naively assigns the first label in the list of ties to the test point. The order of this comparison is affected by the order of the input data. AIM is the first protocol in the list, so many multi-class ties are labeled as AIM, which explains why 80% of SSH traffic is classified as AIM. It also explains the low class precision for AIM, since any multi-class tie involving AIM receives that label. Regardless of RapidMiner's tie-breaking algorithm, classification inaccuracies are caused in part by the high rate of overlap among classes when using traditional graph measure profiles.

Profile Collisions

More often than not, the shared label of a single-class tie matches the true label of the test point. With the method used by RapidMiner to break ties, multi-class ties are more likely to be incorrectly labeled than single-class ties. Table 6.4 summarizes the single and multi-class ties for each protocol. The three protocols involved in the most multi-class ties (AIM, MSDS, SSH) also have the lowest class recalls.

                    AIM  DNS  HTTP  Kazaa  MSDS  Netbios  SSH
Single-class ties   182  567   422    343   219      530   17
Multi-class ties    110   29    49     46    83       33  331

Table 6.4: Number of single and multi-class ties for traditional graph measures

To explore the properties of vertex profiles in more detail, profile collisions are introduced. A profile collision can occur in one of two ways: the distance from a test point to a training point is zero, or a test point is equidistant from two or more training points. Written mathematically:

d(z, t_1) = 0, or

d(z, t_1) = d(z, t_2) = ... = d(z, t_n)

where z is a test point and t_1, ..., t_n are training points. Note that the first type of profile collision results from vertices having identical profiles. The collision graphs in Figure 6.1 show the total number of collisions with other profiles for each protocol. For example, Figure 6.1(a) shows that AIM profiles collide with SSH profiles more frequently than they do with other AIM profiles.
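The two collision conditions above can be checked directly. The following sketch is illustrative; the tolerance parameter is my own addition to cope with floating-point comparison and is not part of the original definition:

```python
import math

def profile_collisions(z, training_points, tol=1e-12):
    """Detect the two kinds of profile collision for a test point z:
    an exact match (distance zero, i.e. identical profiles) or a tie,
    where z is equidistant from two or more nearest training points."""
    dists = [math.dist(z, t) for t in training_points]
    d_min = min(dists)
    nearest = [i for i, d in enumerate(dists) if abs(d - d_min) <= tol]
    exact_match = d_min <= tol  # d(z, t_1) = 0
    tie = len(nearest) > 1      # d(z, t_1) = d(z, t_2) = ... = d(z, t_n)
    return exact_match, tie, nearest
```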

Figure 6.1: Profile collisions for traditional graph measures [(a) AIM, (b) DNS, (c) HTTP, (d) Kazaa, (e) MSDS, (f) Netbios, (g) SSH]

6.2.2 Motif-based Profiles

Although profiles based on traditional graph measures can identify applications with some success, there is certainly room for improvement. This section presents the results of utilizing network motifs in a new way to characterize application graphs. The result files from FANMOD were parsed for significant motifs (those with a p-value of 0 and that occur in at least 1% of subgraphs), finding 130 motifs to be used as profile attributes. Only those hosts that participate in at least one of the significant motifs were considered in this part of the study. As a result, the total number of profiles is 3,546 instead of 3,940 as in the traditional approach.

Table 6.5 presents the classifier results using motif-based profiles. Although the discussion comparing the performance of profile types is saved for later, the reader may notice that the class recall has improved for six of the seven protocols measured, all except AIM. Only four protocols score greater than 80% in the motif-based approach, but three of these four (DNS, Kazaa, Netbios) have improved well into the 90% range. Figure 6.2 provides the profile collisions for the motif approach, while the numbers of single and multi-class ties are given in Table 6.6. Again, the protocols involved in a higher percentage of multi-class ties generally score lower than those that are not.

            AIM    DNS   HTTP  Kazaa  MSDS  Netbios    SSH  Precision
AIM         277     10     61      0    21        0     33     68.91%
DNS           9    630      5      8     6        0      3     95.31%
HTTP        136     13    665      1    21        6     29     76.35%
Kazaa         5      0      0    370     4       34      2     89.16%
MSDS          4      4     15      1   256        1      4     89.82%
Netbios       2      1      0      0     4      699      0     99.01%
SSH          35      1     24      2    60        0     84     40.78%
Recall   59.19% 95.60% 86.36% 96.86% 68.82%  94.46% 54.19%  Overall: 84.07%

Table 6.5: Confusion matrix of unweighted motif-based profiles

                    AIM  DNS  HTTP  Kazaa  MSDS  Netbios  SSH
Single-class ties    93  576   446    195   231      611   17
Multi-class ties    283   32   223    181   119       47  123

Table 6.6: Number of single and multi-class ties for motif-based profiles

Synthesizing the accuracy results and collision information, it can be seen that the classifier confuses two protocol pairs in particular with one another: AIM with HTTP and Kazaa with Netbios. In the case of the Netbios name service, a broadcast message is sent to the local network to locate a particular machine that has a registered name. Somewhat similarly, if a Kazaa user wishes to locate a file, they contact an active supernode, which then communicates with the ordinary nodes attached to it to query for the desired file.

Even though AIM and HTTP are both classified more accurately with motifs than with traditional measures, the indistinctness of the boundary between the two protocols is a bit surprising. The arrangement of nodes in Figure 6.3(a) reflects the expected communication patterns given the functional and social characteristics of the HTTP protocol. Some web servers are more popular than others and would have a higher degree count. Additionally, web servers will often establish communications with other web servers to pull content from RSS feeds, ad servers, or other content providers. With the exception of direct connections for file transfers, all AIM communications go through a central server, so one would expect to see a stronger influence of a star topology in the application graphs. Figure 6.3(b) shows that this does not seem to be the case. There are several possible reasons for this. For example, the actual IP address of the central AIM server will be anonymized to a different random IP address across each of the network trace files. Given the popularity of instant messaging and the prevalence of the AIM client, it is possible that there are actually several servers that handle connections and are load-balanced as necessary. A more in-depth examination of some application protocols will be required in future work.

Figure 6.2: Profile collisions for motif-based profiles [(a) AIM, (b) DNS, (c) HTTP, (d) Kazaa, (e) MSDS, (f) Netbios, (g) SSH]

The other protocol that needs further explanation is SSH. Figure 6.2(g) shows that SSH has a high collision rate with other protocols and points out an important weakness in the current motif-based approach. The application graph in Figure 6.3(c) shows the tendency of SSH application graphs to be very disconnected, comprised of several much smaller components instead of one large connected component. Current motif profiles are based on order 3 and order 4 network motifs, so these pairs of connections are ignored. This is made evident by Table 6.7, which shows that less than 40% of SSH traffic is included in the motif model, significantly less than for the other six protocols. This difficulty presents another interesting area of work to be performed in the future.

Figure 6.3: Depiction of three application graphs: (a) HTTP, (b) AIM, (c) SSH

            AIM    DNS   HTTP  Kazaa  MSDS  Netbios    SSH
Data kept  93.6%  96.9%  96.3%  95.5%  93.0%   97.4%  38.8%

Table 6.7: Percentage of original data used in motif-based profiles

6.3 Weighted Profiles and Key Attributes

In an effort to improve the performance of the two types of classifiers, the attributes of each profile were weighted using a genetic algorithm. This process only increases the accuracy of each model slightly, but it also allows for the investigation of a problem space that might otherwise be too computationally expensive to explore. This section details the results of weighting the attributes and discusses some of the key characteristics.

6.3.1 Attribute Weights of Traditional Graph Measures

After running the evolutionary feature weighting experiment for thirty generations, a set of attribute weights was obtained that increased the overall accuracy of the traditional graph measure classifier by roughly 4%. The weights of the eleven attributes are provided in Table 6.8.

Attribute                 Weight
Indegree                   0.259
Outdegree                  0.000
Total degree               0.023
Clustering coefficient     0.172
Betweenness centrality     0.271
Degree centrality          1.000
Closeness centrality       0.257
Eigenvector centrality     0.633
Eccentricity               0.596
Center                     0.393
Periphery                  0.096

Table 6.8: Attribute weights for traditional graph measures

The weights of the attributes reflect the interaction of all eleven graph measures and are the values that maximize the accuracy of the classifier. They should therefore not be interpreted too literally in isolation. For example, degree centrality was weighted with a 1.000, the highest possible weight. This does not mean that an accurate classifier could be built on this attribute alone. Section 6.5 addresses this point further. However, the table does still provide some insight as to which attributes might be more useful when performing classification tasks. It is not surprising that the degree counts are not weighted especially high, as they are a very generic measure. The "periphery" attribute has a low weight because it is not a very unique measure; out of the 3,940 profiles, 2,132 of them are periphery nodes. In contrast, only 773 nodes are central nodes, and the "center" attribute accordingly receives a higher weight.
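One common way to incorporate such weights into the nearest-neighbor distance, and the interpretation assumed in this sketch (RapidMiner's internal handling may differ), is to scale each attribute's contribution to the squared Euclidean distance:

```python
import math

def weighted_distance(x, y, weights):
    """Euclidean distance in which each attribute's squared difference
    is scaled by its learned weight; a zero weight (e.g. outdegree in
    Table 6.8) removes the attribute from the comparison entirely."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, x, y)))
```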

Figure 6.4 shows the per-protocol accuracy of both unweighted and weighted profiles based on traditional graph measures. The confusion matrix for weighted attributes can be found in Appendix D. As one would hope, the weighted attribute profiles perform slightly better than their unweighted counterparts for each protocol. The class recall for SSH is again very low for the same reasons described previously.

Figure 6.4: Accuracy of unweighted vs. weighted traditional graph measure profiles

6.3.2 Attribute Weights of Motif-based Measures

Because of the high dimensionality of motif-based profiles, it becomes more important to take advantage of other methods such as genetic algorithms to explore the attributes. Figure 6.5 depicts the ten most heavily weighted motifs and their corresponding weight values. In the figure, green nodes represent clients, black nodes represent servers and red nodes represent peers, as specified in Definition 1. As with the weights of traditional graph measures, these weights reflect the combined information from all attributes.

Motif 6.5(a) is the most highly weighted of the significant motifs found in this study and only occurs in two application graphs: there are 24 instances of it in an MSDS graph and another 137 instances in a Netbios graph. Although weighted lower, motif 6.5(b) occurs overwhelmingly more frequently in Netbios (1,007 instances) than it does in MSDS (3 instances) or DNS (2 instances). If a node were to occur in these two motifs, there would probably be a good chance that the host was using the Netbios application.

Figure 6.5: The ten highest-weighted motifs and their corresponding weights: (a) 1.000, (b) 0.662, (c) 0.650, (d) 0.632, (e) 0.585, (f) 0.545, (g) 0.537, (h) 0.503, (i) 0.503, (j) 0.502

Unfortunately, the weights do not indicate which particular application(s) a motif helps to delineate, only which motifs successfully increase the overall accuracy of the classifier. Perusing the profile data reveals that instances of many motifs are found in several or all of the applications studied. This is not to say motif profiles are unsuitable for describing computer networks (they have already shown a great deal of promise), rather that no single motif is indicative of a particular application. Given the complexity of the highly dynamic interactions that occur in computer networks, this is not entirely surprising. It is possible that different types of motifs (described in Chapter 7) could be even more beneficial than the current generation of motifs and motif profiles.

One final point to address before moving on to a comparison of traditional and motif-based profiles is the performance of unweighted vs. weighted motif profiles, shown in Figure 6.6. There is a slight increase in classification accuracy for each of the protocols except Kazaa, which sees no additional gain from attribute weighting. The overall accuracy of the model increases to 85.70%, a difference of 1.63%. Appendix D contains the confusion matrix for weighted motif profiles.

Figure 6.6: Accuracy of unweighted vs. weighted motif-based profiles

6.4 Comparison of Profile Types

This section compares the two profile types side-by-side and discusses some of the advantages and disadvantages of each approach. The motif-based model generally outperforms traditional graph measures, though this is not always the case, as shown in Figure 6.7. Notably, the traditional profiles significantly outperform motif-based profiles for classifying AIM traffic, while the reverse is true for SSH (again, the SSH results should be taken with a grain of salt because slightly less than 40% of SSH traffic is classified by the second approach).

Weighting the profile attributes benefits traditional graph measures more than motif-based profiles. One reason for this might be the type of data used to describe each profile. Traditional profiles are comprised of a mixture of binary, real-valued and integer data. In addition to being purely binary, motif profiles are also sparse; most nodes participate in very few of the 130 significant motifs. As a result, many of the motif weights are multiplied by zero, resulting in no information gain. Regardless, weighting the attributes does not change which type of classifier performs better for a particular protocol, with the exception of HTTP: unweighted motifs have a 4% accuracy advantage over traditional measures, but fall to a 1% disadvantage when the profiles are weighted.

Figure 6.7: Accuracy comparison of profile types: (a) unweighted, (b) weighted

Advantages and Disadvantages of Profile Types

Motif-based profiles have a slight advantage over traditional measures in a few categories. The overall accuracy of the motif-based classifiers is higher than that of the traditional classifiers, both unweighted and weighted. Also, motif profiles result in more favorable overlap with other profiles. Only 10% of motif profiles do not match another profile, and 61% match profiles of a single label (note that "match" means a Euclidean distance of zero, not an identical profile). With traditional measures, on the other hand, 58% match a single label, and nearly 25% of profiles do not match any other profile.

Traditional graph measures are less demanding to compute than their motif counterparts. Even though some graph measures are O(n^3), where n is the order of the graph, calculations can be performed extremely quickly because n is small in the application graphs examined: 40 ≤ n ≤ 80. Motif searches are computationally expensive and can be prohibitively so when searching for large motifs. This study found that an exhaustive search of order 3 motifs could be completed in roughly 7-8 minutes, while an exhaustive search of order 4 motifs took 6-8 hours to complete.

6.5 Considerations for Optimizing Classifier Performance

There are several ways in which the performance of application classifiers may be improved. An “on the fly” traffic classification system would need to be as fast as possible so that network latency is minimized. One way to achieve increased classifier speed is to reduce the dimensionality of the data. Already the evolutionary feature weighting performed by the genetic algorithm has indicated which attributes are more valuable to the classifier. Attributes below a certain threshold value could be ignored, at the expense of a little bit of accuracy. Figure 6.8 demonstrates the accuracy of models based on a single traditional graph measure.

By far, eigenvector centrality, closeness centrality and degree centrality provide the most information to the classifier, each scoring better than 65% on its own. Most of the attributes score no better than a random guess with a 1/7 chance of being correct, shown as a vertical dotted line in the graph.

Figure 6.8: Accuracy of single-attribute classification

Recall that eigenvector centrality assigns a centrality score to a vertex proportional to that of its neighbors. This metric is more "social" in nature than some of the others in that the centrality scores of neighboring vertices are considered in the calculation. The idea of "distance" in an application graph is a bit tricky because it does not consider the number of hops data must go through to reach its final destination nor the physical distance between hosts. Therefore the "closeness" of closeness centrality describes the social usage of an application and suggests that the average shortest path length between nodes differs somewhat from application to application. The degree centrality is essentially a weighted degree count, which again suggests that the sizes of connected components within application graphs are important, influenced in part by the popularity of servers and services.
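The thresholding idea suggested above can be illustrated with the weights of Table 6.8; the dictionary keys and function name are my own:

```python
# Learned weights from Table 6.8 (traditional graph measures).
WEIGHTS = {
    "indegree": 0.259, "outdegree": 0.000, "total_degree": 0.023,
    "clustering_coefficient": 0.172, "betweenness_centrality": 0.271,
    "degree_centrality": 1.000, "closeness_centrality": 0.257,
    "eigenvector_centrality": 0.633, "eccentricity": 0.596,
    "center": 0.393, "periphery": 0.096,
}

def prune_attributes(weights, threshold):
    """Keep only attributes whose learned weight meets the threshold,
    shrinking the d in the classifier's O(nd) per-query cost."""
    return sorted(attr for attr, w in weights.items() if w >= threshold)
```

With a threshold of 0.5, for instance, only degree centrality, eccentricity and eigenvector centrality would survive.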

In addition to reducing the dimensionality of the attribute profiles, one can also consider reducing the number of data points used in the training phase of the nearest neighbor algorithm. Exploring the effectiveness of smaller classification models has two important implications. First of all, it suggests that a more lightweight classifier could be built when heading towards a real-time implementation. Secondly, it shows that the methods proposed in this study can be used for smaller networks and not just those containing thousands of nodes.

To test this hypothesis, several unweighted classifiers were built for each profile type with an increasing number of nodes in each model. The data was selected at random, while keeping the proportions of each class label the same as in the models previously discussed. All of the test parameters are as they were before, including the use of 10-fold cross validation to determine the accuracy. The results of this experiment are illustrated in Figure 6.9.
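The stratified subsampling just described (random selection that preserves class proportions) might be sketched as follows; the names are illustrative:

```python
import random
from collections import defaultdict

def stratified_sample(profiles, fraction, seed=0):
    """Randomly subsample (label, attributes) pairs while preserving
    the per-class proportions of the full data set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for label, attrs in profiles:
        by_label[label].append((label, attrs))
    sample = []
    for group in by_label.values():
        k = max(1, round(fraction * len(group)))  # at least one example per class
        sample.extend(rng.sample(group, k))
    rng.shuffle(sample)  # avoid grouping the output by class
    return sample
```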

Figure 6.9: Comparison of profile types as the size of the training set increases: (a) traditional graph measures, (b) motif-based profiles

The classifiers tend to perform slightly better as the number of training data points increases, but sometimes negligibly so. DNS, Kazaa and Netbios seem to benefit the least from having additional training examples, while AIM and MSDS fluctuate quite a bit more. It is interesting to note that applications that were previously classified more accurately also exhibit more stable behavior in Figure 6.9. This is true for both profile types. For example, AIM and MSDS were by far the least accurately classified protocols using a motif-based approach, and their trend lines exhibit the most volatility in Figure 6.9(b). In contrast, DNS, Kazaa and Netbios were the most accurately classified protocols, and their trend lines are nearly flat. This finding suggests that the protocols which can be clearly described by a profile (traditional or motif-based) can be learned with a relatively small number of training points. Further investigation into the AIM and MSDS protocols is needed to understand why the accuracy of AIM peaks at 2,500 nodes and then declines, while the accuracy of MSDS peaks at 1,000 nodes and then drops significantly.

6.6 Limitations of Current Approach

This chapter has demonstrated the promise of using vertex profiles to identify application usage across a computer network. A few of the shortcomings of the proposed methodology have been touched upon already, but are summarized here. Graph size is an important factor to consider, since more "interesting" vertex characterizations arise from the complex interactions of hosts. Motif-based profiles become more descriptive as hosts communicate with a larger number of other hosts. The current generation of classification models suffers when there is heavy overlap among profiles, resulting in a distance of zero. A more intelligent tie-breaking scheme could yield better performance for those protocols that share application graph characteristics. Currently, the motif-based approach only considers motifs of order 3 and order 4. This causes a problem for protocols like SSH that tend to have a large number of small connected components instead of fewer large connected components. Some of the stages in the process are computationally expensive. The genetic algorithm used for feature weighting is a very time-consuming endeavor and does not yield the desired increase in performance. On the other hand, once a network is learned and a classifier built, the attribute weights need only be computed once and can be applied in O(n) time to the attributes collected for the test points. Additionally, the analysis techniques put forth by this work require a view of the network that shows as many of the interactions as possible.

Chapter 7: Conclusions and Future Work

The tasks of managing and securing computer networks are becoming increasingly complicated due to the use of applications over non-standard port numbers as well as the use of data encryption techniques. These practices subvert a network administrator's ability to provide quality of service to legitimate users, ensure compliance with security policies, and prevent outside intruders from gaining access to a system. Intrusion detection systems and network monitoring tools that rely on deep packet inspection are ineffective when data transfers are encrypted. Several previous studies have attempted to classify network application usage by examining flow characteristics pertaining to a particular series of communications between two hosts, such as the size of the data packets being sent, packet inter-arrival times and session lengths.

This thesis has proposed an interdisciplinary approach to the study of networks through the characterization of application graphs. It is an "in the dark" methodology that relies on the communication patterns found in a network, rather than the contents of packet payloads or the port numbers used by the application. A wide variety of graph measures, borrowed heavily from social network analysis, are used to create vertex profiles that determine the application in which a host participates.

Furthermore, this work has uniquely applied motif-based analysis, used almost exclusively in systems biology, to the study of application graphs. This method of detecting significant subgraph patterns has shown a great deal of promise for modeling and classifying application protocols. It has been shown that motifs can not only be used to express communication patterns, but also to indicate the functional role of a host. In this study, nodes were labeled as either a client, server, or peer based upon their interactions at the transport layer. This information was used to generate motifs. A second type of vertex profile was defined, based upon a node's participation (or lack thereof) in the motifs that were found to be significant across all of the application protocols examined.

Through empirical testing, this study has shown that both types of profiles can determine what application a host is using with a reasonable amount of accuracy. Although some protocols like SSH and AIM present difficulties, many of the others can be classified with greater than 80% accuracy, and in the case of weighted motif profiles, as high as 96% for the peer-to-peer application Kazaa. In general, a motif-based approach outperforms traditional graph measures and seems to have more potential for related work in the future.

One issue to consider is how best to manage connected components in application graphs that contain only two nodes. This phenomenon was found to occur frequently in SSH, contributing to the fact that less than 40% of SSH hosts were classified by the motif-based approach. Ignoring vertex colors, there are three possible order 2 motifs: A → B, A ← B, and A ↔ B. Unfortunately, the edge-switching operations for creating random graphs will not provide sufficiently randomized graphs, so it is unlikely that any particular order 2 pattern would be found statistically significant.

Currently, the only information utilized in the creation of application graphs is the source and destination IP addresses and the source and destination port numbers. The motif-based approach provides some additional information by using vertex colors to represent node types, but other information could also be exploited to color the edges. For example, colors could be used to denote the amount of data transferred between two nodes. This would help create more detailed profiles that might be able to distinguish between applications that have similar connection patterns but use network bandwidth in ways that are distinct from one another. Also related to the creation of application graphs, it would be interesting to observe the data flow through all nodes involved in a particular activity and not just the flow on a particular port number. A web server might request content from an application or database server in response to a client's request for a web document; these back-end communications to other related services occur on a port other than 80, the usual HTTP port number.

Another area to explore is the different machine learning techniques that can be applied to vertex profiles for classification and feature weighting; nearest-neighbor and genetic algorithms are only two possibilities. The many parameters of these algorithms require further tuning to optimize the classification accuracy of the models built. This thesis describes a process which allows the substitution of particular algorithms. For example, a Bayes classifier or support vector machine could be used instead of nearest-neighbor, while principal component analysis could be used in place of the genetic algorithm [67, 62].

Although not used in the current approach, temporal information could also prove to be useful in classifying application protocols. One approach would be to encode information such as session lengths or packet inter-arrival times into the edge colors. Another use of time-based information would be to observe communication patterns over a much smaller time window (on the order of seconds or minutes instead of hours) and determine how a node’s participation in motifs changes over time.

Moving away from implementation details and algorithm decisions, this type of research can be expanded outside of application identification. Assuming that the process can be tweaked to allow high accuracy in protocol recognition, this approach could be used to detect anomalies in network behavior. Hosts that participate in activities that look similar to a known application but differ by more than an established threshold value would be considered anomalous for that particular application and trigger an alert. One final consideration is pushing this research further into the realm of social network analysis, applying it to the detection of communities and associations within a network, such as locating all hosts that are part of the same online gaming community.

References

[1] P. Dyson, Dictionary of Networking. Sybex, 1999.

[2] A. S. Tanenbaum, Computer Networks. Prentice Hall, 2003.

[3] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, "Network motifs: simple building blocks of complex networks," Science, vol. 298, no. 5594, pp. 824–827, October 2002. [Online]. Available: http://dx.doi.org/10.1126/science.298.5594.824

[4] F. Rasche and S. Wernicke, "FANMOD manual," 2006.

[5] INPUT federal IT market forecast 2008–2013. [Online]. Available: http://www.input.com/corp/library/detail.cfm?itemid=5437&cmp=OTC-fedinfosecfcst08

[6] The SANS security policy project. [Online]. Available: http://www.sans.org/resources/policies/

[7] S. Christey and R. A. Martin, "CVE - vulnerability type distributions in CVE," 2007 technical white paper on the distribution of vulnerabilities reported to CVE.

[8] Internet Assigned Numbers Authority: assigned port numbers. [Online]. Available: http://iana.org/assignments/port-numbers

[9] Netstat. [Online]. Available: http://www.netstat.net/

[10] Wireshark: a network protocol analyzer. [Online]. Available: http://www.wireshark.org/

[11] Snort - the de facto standard for intrusion detection/prevention. [Online]. Available: http://www.snort.org/

[12] M. E. J. Newman, "Coauthorship networks and patterns of scientific collaboration," in Proceedings of the National Academy of Science, 2004, pp. 5200–5205.

[13] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

[14] C. Yang and T. Ng, "Terrorism and crime related weblog social network: link, content analysis and information visualization," Intelligence and Security Informatics, 2007 IEEE, pp. 55–58, May 2007.

[15] E. Yeger-Lotem, S. Sattath, N. Kashtan, S. Itzkovitz, R. Milo, R. Y. Pinter, U. Alon, and H. Margalit, "Network motifs in integrated cellular networks of transcription-regulation and protein-protein interaction," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 16, pp. 5934–5939, April 2004. [Online]. Available: http://dx.doi.org/10.1073/pnas.0306752101

[16] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, "Network motifs in the transcriptional regulation network of Escherichia coli," Nat Genet, vol. 31, no. 1, pp. 64–68, May 2002. [Online]. Available: http://dx.doi.org/10.1038/ng881

[17] J. Grochow and M. Kellis, “Network motif discovery using subgraph enumeration and symmetry-breaking,” 2007, pp. 92–106.

[18] U. Alon, “Network motifs: Theory and experimental approaches,” Nature Reviews Genetics, vol. 8, no. 6, pp. 450–461, Jun. 2007.

[19] J. Day and H. Zimmermann, "The OSI reference model," Proceedings of the IEEE, vol. 71, no. 12, pp. 1334–1340, Dec. 1983.

[20] V. Cerf and R. Kahn, “A protocol for packet network intercommunication,” IEEE Transactions on Communications [legacy, pre-1988], vol. 22, no. 5, pp. 637–648, May 1974.

[21] L. Euler, “Solutio problematis ad geometriam situs pertinentis,” in Commentarii academiae scientiarum imperialis Petropolitanae. St. Petersburg Academy, 1736, vol. 8.

[22] R. G. Busacker and T. L. Saaty, Finite Graphs and Networks, ser. International Series in Pure and Applied Mathematics. McGraw-Hill, 1965.

[23] G. Chartrand and P. Zhang, Introduction to Graph Theory. McGraw-Hill, 2005.

[24] A. A. Nanavati, R. Singh, D. Chakraborty, K. Dasgupta, S. Mukherjea, G. Gurumurthy, and A. Joshi, “Analyzing the Structure and Evolution of Massive Telecom Graphs,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 703–718, March 2008.

[25] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, vol. 1, pp. 269–271, 1959.

[26] L. C. Freeman, “Centrality in social networks conceptual clarification,” Social Net- works, vol. 1, no. 3, pp. 215–239.

[27] P. Bonacich, “Technique for analyzing overlapping memberships,” Sociological Method- ology, 1972.

[28] M. E. J. Newman, Mathematics of Networks. Palgrave Macmillan, 2008.

[29] D. J. Watts and S. H. Strogatz, “Collective dynamics of ’small-world’ networks,” Na- ture, vol. 393, 1998.

[30] S. Cheung, R. Crawford, M. Dilger, J. Frank, J. Hoagland, K. Levitt, J. Rowe, S. Staniford-Chen, R. Yip, and D. Zerkle, “GrIDS – a graph-based intrusion detection system for large networks,” in Proceedings of the 19th National Information Systems Security Conference, 1996, pp. 361–370.

[31] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: multilevel traffic classification in the dark,” in SIGCOMM ’05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications. New York, NY, USA: ACM, 2005, pp. 229–240.

[32] S. Wernicke and F. Rasche, “FANMOD: a tool for fast network motif detection,” Bioinformatics, vol. 22, no. 9, pp. 1152–1153, 2006.

[33] R. Itzhack, Y. Mogilevski, and Y. Louzoun, “An optimal algorithm for counting net- work motifs,” Physica A, vol. 381, pp. 482–490, Jul. 2007.

[34] S. Mangan, A. Zaslaver, and U. Alon, “The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks,” Journal of Molecular Biology, vol. 334, no. 2, pp. 197–204, November 2003.

[35] Tcpdump/libpcap public repository. [Online]. Available: http://www.tcpdump.org/

[36] J. Postel, “Internet Protocol,” RFC 791 (Standard), Sep. 1981, updated by RFC 1349. [Online]. Available: http://www.ietf.org/rfc/rfc791.txt

[37] J. Postel, “Transmission Control Protocol,” RFC 793 (Standard), Sep. 1981, updated by RFC 3168. [Online]. Available: http://www.ietf.org/rfc/rfc793.txt

[38] J. Postel, “User Datagram Protocol,” RFC 768 (Standard), Aug. 1980. [Online]. Available: http://www.ietf.org/rfc/rfc768.txt

[39] R. Pang, M. Allman, V. Paxson, and J. Lee, “The devil and packet trace anonymization,” ACM Computer Communication Review, vol. 36, no. 1, pp. 29–38, January 2006. [Online]. Available: http://www.icir.org/mallman/papers/devil-ccr-jan06.pdf

[40] G. Iannaccone, C. Diot, I. Graham, and N. McKeown, “Monitoring very high speed links,” in IMW ’01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement. New York, NY, USA: ACM, 2001, pp. 267–271.

[41] T. Henderson, D. Kotz, and I. Abyzov, “The changing usage of a mature campus-wide wireless network,” Computer Networks, vol. In Press, Accepted Manuscript. [Online]. Available: http://dx.doi.org/10.1016/j.comnet.2008.05.003

[42] P. Gill, M. Arlitt, Z. Li, and A. Mahanti, “Youtube traffic characterization: a view from the edge,” in IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement. New York, NY, USA: ACM, 2007, pp. 15–28.

[43] E. Blanton. (2008, January) tcpurify. [Online]. Available: http://irg.cs.ohiou.edu/~eblanton/tcpurify/

[44] T. Gamer, C. P. Mayer, and M. Schöller, “PktAnon - A Generic Framework for Profile-based Traffic Anonymization,” PIK Praxis der Informationsverarbeitung und Kommunikation, vol. 2, pp. 67–81, Jun. 2008.

[45] D. Koukis, S. Antonatos, D. Antoniades, E. P. Markatos, P. Trimintzios, and M. Fukarakis, “CRAWDAD tool tools/sanitize/generic/anontool (v. 2006-09-26),” Downloaded from http://crawdad.cs.dartmouth.edu/tools/sanitize/generic/AnonTool, Sep. 2006.

[46] (2005) Lbnl enterprise trace repository. [Online]. Available: http://www.icir.org/enterprise-tracing/

[47] MIT Lincoln Laboratory: 1999 DARPA Intrusion Detection Evaluation Data Set. [Online]. Available: http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/1999data.html

[48] D. Kotz, T. Henderson, and I. Abyzov, “CRAWDAD trace dartmouth/campus/tcpdump/fall03 (v. 2004-11-09),” Downloaded from http://crawdad.cs.dartmouth.edu/dartmouth/campus/tcpdump/fall03, Nov. 2004.

[49] R. Chandra, R. Mahajan, V. Padmanabhan, and M. Zhang, “CRAWDAD data set microsoft/osdi2006 (v. 2007-05-23),” Downloaded from http://crawdad.cs.dartmouth.edu/microsoft/osdi2006, May 2007.

[50] OSCAR protocol. [Online]. Available: http://dev.aol.com/aim/oscar/

[51] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, “Hypertext Transfer Protocol - HTTP/1.1,” RFC 2616 (Standard), Jun. 1999. [Online]. Available: http://www.ietf.org/rfc/rfc2616.txt

[52] P. V. Mockapetris, “Domain names - implementation and specification,” RFC 1035 (Standard), United States, 1987. [Online]. Available: http://www.ietf.org/rfc/rfc1035.txt

[53] Active Directory. [Online]. Available: http://www.microsoft.com/windowsserver2008/en/us/active-directory.aspx

[54] R. Marty. Afterglow. [Online]. Available: http://www.afterglow.sourceforge.net/

[55] A. A. Hagberg, D. A. Schult, and P. J. Swart, “Exploring network structure, dynamics, and function using networkx,” in Proceedings of the 7th Python in Science Conference (SciPy2008), Pasadena, CA USA, Aug. 2008, pp. 11–15.

[56] N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon, “Mfinder tool guide,” 2002.

[57] F. Schreiber and H. Schwöbbermeyer, “MAVisto: a tool for the exploration of network motifs,” Bioinformatics, 2005.

[58] W. de Nooy, A. Mrvar, and V. Batagelj, Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.

[59] S. Wernicke, “Efficient detection of network motifs,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 4, pp. 347–359, 2006.

[60] W. Mendenhall and R. J. Beaver, Introduction to Probability and Statistics, 8th ed. PWS-Kent Publishing Company, 1991.

[61] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler, “Yale: rapid prototyping for complex data mining tasks.” New York, NY, USA: ACM, 2006, pp. 935–940.

[62] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison Wesley, 2006.

[63] G. F. Luger, Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 5th ed. Addison Wesley, 2005.

[64] M. Mitchell, An Introduction to Genetic Algorithms. MIT Press, 1998.

[65] J. J. Grefenstette, “Optimization of control parameters for genetic algorithms,” IEEE Transactions on Systems, Man and Cybernetics, vol. 16, no. 1, pp. 122–128, Jan. 1986.

[66] J. D. Schaffer, R. A. Caruana, L. J. Eshelman, and R. Das, “A study of control parameters affecting online performance of genetic algorithms for function optimization,” in Proceedings of the third international conference on Genetic algorithms. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989, pp. 51–60.

[67] D. Lay, Linear Algebra and Its Applications, 2nd ed. Addison Wesley, 2000.

Appendix A: Examples of Application Graphs

Figure A.1: Application graphs depicting AIM communications

Figure A.2: Application graphs depicting DNS communications

Figure A.3: Application graphs depicting HTTP communications


Figure A.4: Application graphs depicting Kazaa communications

Figure A.5: Application graphs depicting MSDS communications

Figure A.6: Application graphs depicting Netbios communications

Figure A.7: Application graphs depicting SSH communications

Appendix B: Code Listings

Listing B.1: tshark2mysql.py – stores pcap data into a MySQL database

#!/usr/bin/python

# This file parses tshark output read from stdin and inserts it into a
# MySQL database. It assumes the database has already been created and
# will create the necessary table.
#
# The tshark command is:
#   tshark -t e -r <capture file> tcp or udp

import sys
import MySQLdb

if len(sys.argv) != 2:
    sys.exit("Supply name of table to store data in\n")

try:
    conn = MySQLdb.connect(host="localhost",
                           user="root",
                           passwd="pass",
                           db="data")
except MySQLdb.Error, e:
    sys.exit("Error %d: %s" % (e.args[0], e.args[1]))

cursor = conn.cursor()
cursor.execute("DROP TABLE IF EXISTS %s" % sys.argv[1])

cursor.execute("""CREATE TABLE %s (
    id INT(11) NOT NULL AUTO_INCREMENT,
    ts DOUBLE NOT NULL DEFAULT '0.0',
    protocol VARCHAR(12) NOT NULL,
    sip VARCHAR(15) NOT NULL,
    sport INT(5) NOT NULL DEFAULT '0',
    dip VARCHAR(15) NOT NULL,
    dport INT(5) NOT NULL DEFAULT '0',
    length INT(11) NOT NULL DEFAULT '0',
    PRIMARY KEY id (id)
);""" % sys.argv[1])

rc = 0
while True:
    line = sys.stdin.readline()
    if not line:
        break
    v = line.split(' ')
    tmp = []
    for i in range(len(v)):
        if v[i] not in ('', '->'):
            tmp.append(v[i])
    v = tmp
    if len(v) == 8:
        try:
            ts = float(v[1])
            sip = v[2]
            dip = v[3]
            sport = int(v[4])
            dport = int(v[5])
            proto = v[6]
            length = int(v[7][:-2])  # strip off the newline character
            sql = ("INSERT INTO %s (ts, protocol, sip, sport, dip, dport, length) "
                   "VALUES (%f, '%s', '%s', %d, '%s', %d, %d)"
                   % (sys.argv[1], ts, proto, sip, sport, dip, dport, length))
            try:
                cursor.execute(sql)
                rc += cursor.rowcount
            except MySQLdb.Error, e:
                print "Error [%d]: %d: %s" % (rc, e.args[0], e.args[1])
        except Exception, e:
            print "ERROR:", v

cursor.close()
conn.commit()
conn.close()

print "\n%d rows inserted into %s\n" % (rc, sys.argv[1])

Listing B.2: graph_utils.py – implementation of adjacency matrix conversion and eigenvector centrality using the NetworkX API

import networkx as NX
import math

def adj_matrix(G):
    """
    Takes a networkx.Graph (undirected) as an argument and returns a
    list of lists representing the corresponding adjacency matrix.
    It can be referenced as you would a normal 2D matrix, A[i][j].

    Node IDs must be in [1, G.order()] (taken care of in
    eigenvector_centrality())
    """
    adj = []
    for n in G.nodes():
        row = []
        for m in range(len(G.nodes()) + 1):
            row.append(0)
        for m in NX.neighbors(G, n):
            row[m] = 1
        adj.append(row)
    # Get rid of first element of each row (nodes start at 1, adj is 0-based)
    for i in range(len(adj)):
        adj[i] = adj[i][1:]
    return adj

def eigenvector_centrality(G):
    """
    Takes an undirected graph (Graph or XGraph) and returns a dictionary
    of eigenvector centralities, keyed by node ID (similar to the
    centrality functions in networkx)

    Maps node labels to integers in [1, G.order()]

    Algorithm adapted from:
    http://www.analytictech.com/networks/centaids.htm
    """
    eigenvector_centralities = {}
    evCentrality = []
    evUpdate = []
    maxValue = -1.0

    for i in range(G.order()):
        evCentrality.append(1.0)
        evUpdate.append(0.0)

    H = NX.convert_node_labels_to_integers(G, first_label=1,
                                           discard_old_labels=False)
    labels = {}
    for k, v in H.node_labels.iteritems():
        labels[v] = k
    A = adj_matrix(H)

    # 30 iterations should be enough to converge to a solution
    for x in range(30):
        for i in range(G.order()):
            evUpdate[i] = 0.0
            for j in range(G.order()):
                if A[i][j] != 0:
                    evUpdate[i] += evCentrality[j]
        maxValue = 0
        for i in range(G.order()):
            maxValue += evUpdate[i] * evUpdate[i]
        maxValue = math.sqrt(maxValue)
        for i in range(G.order()):
            evCentrality[i] = evUpdate[i] / maxValue
    for i in range(1, G.order() + 1):
        eigenvector_centralities[labels[i]] = evCentrality[i - 1]

    return eigenvector_centralities
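The power iteration in Listing B.2 can be sanity-checked on a small graph. The sketch below (plain Python, independent of NetworkX; the toy graph and function signature are illustrative only) applies the same repeated multiply-and-renormalize update:

```python
import math

# Toy undirected graph: a triangle (0-1-2) with a pendant vertex 3
# attached to vertex 0.  A[i][j] = 1 iff vertices i and j share an edge.
A = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]

def eigenvector_centrality(A, iterations=30):
    """Power iteration: repeatedly multiply the centrality vector by A
    and renormalize to unit Euclidean length."""
    n = len(A)
    c = [1.0] * n
    for _ in range(iterations):
        update = [sum(A[i][j] * c[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(u * u for u in update))
        c = [u / norm for u in update]
    return c

c = eigenvector_centrality(A)
# Vertex 0 (connected to everything) scores highest, the pendant
# vertex 3 lowest, and the symmetric vertices 1 and 2 tie.
```

Note that the toy graph is deliberately non-bipartite: on a bipartite graph the largest and smallest eigenvalues have equal magnitude and plain power iteration oscillates instead of converging.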

Listing B.3: node_props_main.py – creates application graphs from MySQL database and computes traditional graph metrics using the NetworkX API

#!/usr/bin/python

"""
This program creates a DiGraph and calculates various graph metrics,
converting the DiGraph to a Graph as necessary for some metrics

Usage:
    arg1 = table name
    arg2 = port number
    arg3 = max # of nodes to consider
"""

import sys
import networkx as NX
import MySQLdb

from graph_utils import *

class Node:
    """Class to hold properties of nodes"""
    in_degree = 0
    out_degree = 0
    degree = 0
    clustering = 0
    betweenness_centrality = 0
    degree_centrality = 0
    closeness_centrality = 0
    eigenvector_centrality = 0
    eccentricity = 0
    is_center = 0
    is_periphery = 0

if len(sys.argv) != 4:
    sys.exit("Provide table name, port number, and # nodes at command line\n")
table = sys.argv[1]
port = sys.argv[2]
n_max = int(sys.argv[3])

# MySQL connection
try:
    conn = MySQLdb.connect(host="localhost", user="root",
                           passwd="pass", db="data")
except MySQLdb.Error, e:
    sys.exit("Error %d: %s" % (e.args[0], e.args[1]))
cursor = conn.cursor()

sql = ("SELECT sip, dip, sport, dport FROM %s WHERE sport=%s OR dport=%s"
       % (table, port, port))
try:
    cursor.execute(sql)
except MySQLdb.Error, e:
    sys.exit("Error %d: %s" % (e.args[0], e.args[1]))

# Create a directed graph from SQL results
G = NX.DiGraph(name="%s_%s" % (port, n_max))
for i in range(cursor.rowcount):
    r = cursor.fetchone()
    if r[0] in G and r[1] in G:
        G.add_edge(r[0], r[1])
    else:
        if G.order() < n_max:
            G.add_node(r[0])
        if G.order() < n_max:
            G.add_node(r[1])
        if r[0] in G and r[1] in G:
            G.add_edge(r[0], r[1])

# Calculate graph properties
myNodes = {}
for n in G.nodes():
    myN = Node()
    # Basic Properties
    myN.degree = G.degree(n)
    myN.out_degree = G.out_degree(n)
    myN.in_degree = G.in_degree(n)
    myNodes[n] = myN

# The following measures are all computed on the undirected
# connected components of the graph.
H = G.to_undirected()
CCS = NX.connected_component_subgraphs(H)
for i in range(len(CCS)):
    if CCS[i].order() >= 2:
        cl = NX.clustering(CCS[i], with_labels=True)
        for k, v in cl.iteritems():
            myNodes[k].clustering = v

        bc = NX.betweenness_centrality(CCS[i])
        for k, v in bc.iteritems():
            myNodes[k].betweenness_centrality = v

        dc = NX.degree_centrality(CCS[i])
        for k, v in dc.iteritems():
            myNodes[k].degree_centrality = v

        cc = NX.closeness_centrality(CCS[i])
        for k, v in cc.iteritems():
            myNodes[k].closeness_centrality = v

        ec = eigenvector_centrality(CCS[i])
        for k, v in ec.iteritems():
            myNodes[k].eigenvector_centrality = v

        d = NX.diameter(CCS[i])
        r = NX.radius(CCS[i])
        ecc = NX.eccentricity(CCS[i], with_labels=True)
        for k, v in ecc.iteritems():
            myNodes[k].eccentricity = v
            if v == d:
                myNodes[k].is_periphery = 1
            if v == r:
                myNodes[k].is_center = 1
    else:
        pass

# Print results
for k, v in myNodes.iteritems():
    s = "%s,%d,%d,%d,%f,%f,%f,%f,%f,%d,%d,%d" % (port,
        v.in_degree, v.out_degree, v.degree, v.clustering,
        v.betweenness_centrality, v.degree_centrality,
        v.closeness_centrality, v.eigenvector_centrality,
        v.eccentricity, v.is_periphery, v.is_center)
    print s

Listing B.4: motif_results.py – parses FANMOD results for significant motifs

#!/usr/bin/python

"""
This program reads FANMOD result files and looks for significant motifs.
It associates each motif with an identifying integer ID and pickles the
results for later use
"""

import pickle
import glob
import pprint

filedir = "/home/eddie/research/fanmod/res_csvs/"

files = glob.glob('/home/eddie/research/fanmod/res_csvs/*.txt')

size3 = {}      # mapping for size 3 motifs
id3 = 0         # first ID for size 3
size4 = {}      # mapping for size 4 motifs
id4 = 0         # first ID for size 4

p_thresh = 0.0  # get motifs with pvalue <= p_thresh
pct_occ = 1.0   # get motifs with frequency >= pct_occ

# Iterate through files and make ID associations
for i in range(len(files)):
    inFile = files[i]
    msize = int(inFile[-14])  # Motif size is stored in filename
    f = open(inFile, 'r')
    file = []
    for l in f:
        l = l[:-1]
        if len(l) > 1:
            file.append(l)

    # Ignore stuff at top of file
    file = file[24:]
    for j in range(0, len(file), msize):
        adjMatrix = ""
        l1 = file[j].split(',')
        if (float(l1[6]) <= p_thresh) and (float(l1[2][:-1]) >= pct_occ):
            # If this is a significant motif...
            adjMatrix += l1[1]
            for k in range(1, msize):
                adjMatrix += file[j + k].split(',')[1]
            if msize == 3 and adjMatrix not in size3.values():
                size3[id3] = adjMatrix
                id3 += 1
            if msize == 4 and adjMatrix not in size4.values():
                size4[id4] = adjMatrix
                id4 += 1

    f.close()  # Close file handle

# Pickle the resulting dictionaries
s3 = open('s3map.pkl', 'w')
pickle.dump(size3, s3)
s3.close()
s4 = open('s4map.pkl', 'w')
pickle.dump(size4, s4)
s4.close()

Listing B.5: motif_profiles.py – creates motif profiles from FANMOD dump files

#!/usr/bin/python

"""
This file reads the pickled s3 and s4 maps and creates the binary
motif participation profiles for the NN clustering
"""

import sys
import pickle
from string import split
from pprint import pprint
from glob import glob

class profile:
    """Instances of profiles"""
    def __init__(self, id, l):
        self.ID = id
        self.label = l
        self.a = []
        for i in range(len(s3map) + len(s4map)):
            self.a.append(0)

    def mark(self, m):
        try:
            self.a[adjM[m]] = 1
        except KeyError:
            pass  # insignificant motif, not in our dict

# Unpickle adjMatrix mapping
s3 = open('s3map_1pct.pkl', 'r')
s3map = pickle.load(s3)
s3.close()
s4 = open('s4map_1pct.pkl', 'r')
s4map = pickle.load(s4)
s4.close()

# Create dictionary for adjMatrix mapping
adjM = {}
idx = 0
for k, v in s3map.iteritems():
    adjM[v] = idx
    idx += 1
for k, v in s4map.iteritems():
    adjM[v] = idx
    idx += 1

seen = {}  # dict for nodes

files = glob('/home/eddie/research/fanmod/data_new_lc/dumpfiles/*')
for i in range(len(files)):
    # Open file for reading
    f = open(files[i], 'r')
    # need a unique prefix since we will have multiple node 0, 1, 2, etc...
    prefix = (files[i].split("/")[7]).split("_")[:-2]
    tmp = ""
    for j in range(len(prefix)):
        tmp += prefix[j] + "_"
    prefix = tmp
    label = prefix.split("_")[-2]
    for lines in f:
        l = lines.split(",")
        # ignore header lines in dump files
        if len(l) > 2:
            for j in range(1, len(l)):
                myNode = prefix + str(int(l[j]))
                if myNode not in seen:
                    seen[myNode] = profile(myNode, label)
                seen[myNode].mark(str(l[0]))

    f.close()  # close file handle

for k, v in seen.iteritems():
    print v.ID, v.label,
    for i in range(len(v.a)):
        if i < len(v.a) - 1:
            print v.a[i],
        else:
            print v.a[i]

Appendix C: Test Parameters

Parameter [default]                                              Value
subgraph (motif) size [default: 3]                               3 / 4
# of samples used to determine approx. # of subgraphs [100000]   100000
full enumeration? 1(yes)/0(no) [1]                               1 (yes)
directed? 1(yes)/0(no) [1]                                       1 (yes)
colored vertices? 1(yes)/0(no) [0]                               1 (yes)
colored edges? 1(yes)/0(no) [0]                                  0 (no)
random type: 0(no regard)/1(global const)/2(local const) [2]     2
regard vertex colors? 1(yes)/0(no) [0]                           1 (yes)
regard edge colors? 1(yes)/0(no) [0]                             0 (no)
reestimate subgraph number? 1(yes)/0(no) [0]                     0 (no)
# of random networks [1000]                                      5000
# of exchanges per edge [3]                                      5
# of exchange attempts per edge [3]                              5

Table C.1: FANMOD test parameters
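With these parameters, FANMOD judges a motif's significance by comparing its frequency in the original network against its frequencies in the 5000 randomized networks. A minimal sketch of the empirical p-value this implies, assuming significance means "occurs at least as often in a randomized network" (the function name and counts are hypothetical):

```python
def motif_pvalue(observed_count, random_counts):
    """Empirical p-value: fraction of randomized networks in which the
    motif occurs at least as often as in the original network."""
    return sum(1 for r in random_counts
               if r >= observed_count) / float(len(random_counts))

# Hypothetical counts: the motif appears 40 times in the real network,
# and only one of 8 randomized networks matches that frequency.
p = motif_pvalue(40, [12, 18, 9, 41, 15, 22, 11, 7])
# p == 1/8 == 0.125
```

The thresholds used in Listing B.4 (p_thresh = 0.0, pct_occ = 1.0) then retain only motifs that this p-value never flags as reproducible by chance.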


Listing C.1: GA weights.xml – RapidMiner process parameters for genetic algorithm and 1-NN classification

<list key="application parameters"> … </list>
<list key="prediction parameters"> … </list>
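Listing C.1 drives RapidMiner's genetic-algorithm weighting and 1-NN classification. Over binary motif profiles, 1-NN classification amounts to returning the label of the training profile at minimum distance; the sketch below uses Hamming distance for clarity (the labels, profiles, and function names are hypothetical, and RapidMiner's configured distance measure may differ):

```python
def hamming(a, b):
    """Number of positions where two equal-length binary profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def one_nn(query, training):
    """1-NN: return the label of the training profile closest to query."""
    best_label, best_dist = None, None
    for label, prof in training:
        d = hamming(query, prof)
        if best_dist is None or d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Hypothetical binary motif-participation profiles
training = [("HTTP", [1, 1, 0, 0]),
            ("DNS",  [0, 0, 1, 1]),
            ("SSH",  [1, 0, 1, 0])]
label = one_nn([1, 1, 0, 1], training)
# closest profile differs in one position, so label == "HTTP"
```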

Appendix D: Additional Classification Results

            AIM     DNS    HTTP   Kazaa    MSDS  Netbios     SSH  Precision
AIM           8       0       0       0       1        0       1     80.00%
DNS           0       8       0       2       2        0       0     66.67%
HTTP          2       0       9       0       1        0       1     69.23%
Kazaa         0       1       0       2       0        0       0     66.67%
MSDS          0       0       0       0       6        0       0    100.00%
Netbios       0       1       1       1       0       10       0     76.92%
SSH           0       0       0       0       0        0       8    100.00%
Recall:  80.00%  80.00%  90.00%  20.00%  60.00%  100.00%  80.00%

Overall accuracy: 78.46%

Table D.1: Confusion matrix of 65 application graphs using motif frequencies (rows: predicted class; columns: true class)

             True 5190  True 53  True 80  True 1214  True 445  True 137  True 22  Precision
Pred. 5190:        453       27       63         36        84        41       316     44.41%
Pred. 53:            1      621        3          0         4        15         8     95.25%
Pred. 80:           23        8      712          1         9        26         4     90.93%
Pred. 1214:          1        1        2        361         3         0         0     98.10%
Pred. 445:          13        3        2          2       282         8        36     81.50%
Pred. 137:           2       19       15          0         3       669         0     94.49%
Pred. 22:            7        1        3          0        15         1        36     57.14%
Recall:         90.60%   91.32%   89.00%     90.25%    70.50%    88.03%     9.00%

Overall accuracy: 79.54%

Table D.2: Confusion matrix of weighted traditional graph measures

             True 5190  True 53  True 80  True 1214  True 445  True 137  True 22  Precision
Pred. 5190:        298        8       56          0        18         0        32     72.33%
Pred. 53:            7      632        3          9         2         0         4     96.19%
Pred. 80:          120       14      676          0        19         3        23     79.06%
Pred. 1214:          5        0        1        370         5        34         1     88.94%
Pred. 445:           2        4       15          2       269         1         1     91.50%
Pred. 137:           0        1        0          0         2       700         0     99.57%
Pred. 22:           36        0       19          1        57         2        94     44.98%
Recall:         63.68%   95.90%   87.97%     96.86%    72.31%    94.59%    60.65%

Overall accuracy: 85.70%

Table D.3: Confusion matrix of weighted motif profiles
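The precision and recall figures in Tables D.1–D.3 follow the standard confusion-matrix definitions: each diagonal entry divided by its row (predicted-class) total gives precision, and divided by its column (true-class) total gives recall; overall accuracy is the diagonal sum over the grand total. A minimal sketch with a hypothetical two-class matrix:

```python
def precision_recall(M):
    """M[i][j]: number of class-j instances predicted as class i.
    Returns per-class precision (diagonal / row sum) and
    recall (diagonal / column sum)."""
    n = len(M)
    precision = [float(M[i][i]) / sum(M[i]) for i in range(n)]
    recall = [float(M[i][i]) / sum(M[j][i] for j in range(n))
              for i in range(n)]
    return precision, recall

# Hypothetical 2-class confusion matrix
M = [[8, 2],
     [4, 6]]
p, r = precision_recall(M)
# p == [0.8, 0.6]; r == [8/12, 0.75]
```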

Vita

Edward G. Allan, Jr.

Personal

• 4259 Cezanne Cir. Ellicott City, MD 21042 email: [email protected] phone: (443) 812-6232

Education

• Master of Science, Computer Science Wake Forest University, Winston-Salem, NC December 2008 Thesis: “Identifying Application Protocols in Computer Networks Using Vertex Profiles” GPA: 3.79

• Bachelor of Science, Computer Science Wake Forest University, Winston-Salem, NC December 2006 GPA: 3.51

Publication

• Allan, Edward G., Horvath, Michael R., Kopek, Christopher V., Lamb, Brian T., Whaples, Thomas S., and Berry, Michael W.: Anomaly Detection Using Nonnegative Matrix Factorization, Survey of Text Mining II, Springer, 203–217, 2008

Experience

• Research Assistant Wake Forest University, Winston-Salem, NC August 2007 – December 2008 Worked with Dr. Errin Fulp on various projects. Researched topics in computer networks leading to this master’s thesis. Assisted in classroom and lab duties for a networking class.

• Software Development Intern GreatWall Systems, Inc., Winston-Salem, NC June 2007 – August 2007 Designed and programmed a testing platform for new high-speed firewall product.


Implemented portions of firewall software in the Python programming language to allow firewall policies to be swapped in place with no gap in coverage.

• Intern – R&D team Tenable Network Security, Columbia, MD June 2006 – August 2006 Developed, implemented, and tested Nessus vulnerability scanner plugins. Implemented code for the Tenable Log Correlation Engine product using a proprietary language. Analyzed, assessed, and scored software vulnerabilities according to the Common Vulnerability Scoring System for use with the Nessus vulnerability scanner.

Honors

• Inducted into the Upsilon Pi Epsilon honor society in 2005

• Graduated cum laude from Wake Forest University in 2006

• 2nd place in the 2007 SIAM text mining competition