
University of Nevada, Reno

Network Security Monitoring and Analysis based on Big Data Technologies

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering

by

Bingdong Li

Dr. Mehmet Hadi Gunes / Dissertation Co-Advisor

Dr. George Bebis / Dissertation Co-Advisor

December, 2013

© Copyright by Bingdong Li 2013

All Rights Reserved

THE GRADUATE SCHOOL

We recommend that the dissertation prepared under our supervision by

BINGDONG LI

entitled

Network Security Monitoring and Analysis based on Big Data Technologies

be accepted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Mehmet Hadi Gunes, Ph.D., Co-Advisor

George Bebis, Ph.D., Co-Advisor

Murat Yuksel, Ph.D., Committee Member

Minggen Lu, Ph.D., Committee Member

Dong Yu, Ph.D., Committee Member

Yantao Shen, Ph.D., Graduate School Representative

Marsha H. Read, Ph.D., Dean, Graduate School

December, 2013


Abstract

Network flow data provide valuable information for understanding the network state and for being aware of network security threats. However, processing the large amount of data collected from the network and providing real time information remain big challenges. Big data technologies provide new approaches to collect, store, measure, and analyze large amounts of data. This dissertation aims to provide a system for network security monitoring and analysis based on big data technologies.

First, I present an extensive survey of network flow applications that covers past research perspectives, methodologies, and a discussion of challenges and future work. Then, I present the system design of the network security monitoring and analysis platform based on big data technologies. Components of this system include Flume and Kafka for real time distributed data collection; Storm for real time distributed stream processing; Cassandra for NoSQL data storage, data processing, and user interfaces. The system supports real time continuous network monitoring, interactive visualization, network measurement, and modeling to classify host roles based on host behaviors and to identify a particular user among the other users.

It is critical to continuously monitor the network status and network security threats in real time, but it is a challenge to process the large amount of data in real time. I demonstrate how the big data security system designed in this dissertation supports such features.

Another use of network flow data is to measure the contents of the network. I demonstrate how this big data system provides an understanding of the usage of anonymity technologies on the campus. Then, I present methods and the results of classification and identification of network objects based on the big data system designed in this dissertation. Finally, I use Decision Tree and Support Vector Machine to model host role behaviors and user behaviors. Sample results indicate very high accuracy of host role classification and user identification.

Dedication

This work is dedicated to my family: to my parents, who loved me unconditionally but with whom I could not be in their last minutes, which I regret the most in my life, and to my wife Zheng Chen and my son Daniel, who sweeten my heart day and night.

Acknowledgements

I would like to take this opportunity to thank my advisors Dr. Mehmet Gunes and Dr. George Bebis for their advice and encouragement. This work would not have been successful without their guidance.

Special thanks to my manager Jeff Springer, who supported me in many ways, encouraging me to pursue a PhD degree and providing the research environment. This work would not have been possible without his support.

Thanks to Dr. Murat Yuksel, Dr. Yantao Shen, Dr. Minggen Lu, and Dr. Dong Yu for agreeing to serve on my dissertation committee.

I would also like to thank all the people who have inspired me to reach this step.

Finally, thanks to my family, on whom my flesh and soul depend. My parents taught me the fundamental and most important things in my life: to be a strong, honest, and independent person, to work hard, and to keep tackling difficult things. My wife has been taking care of our home so that I can spend time on my research. My son has been such a great kid and always reminds me to rest at the right time.

During my Ph.D. study, I went through ups and downs. There were many times I thought of giving up. It is my faith that led me to the finish line.

Table of Contents

Abstract
Dedication
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivations
1.2 Objectives
1.3 Contributions
Chapter 2 Background
2.1 Network Flow
2.1.1 NetFlow
2.1.2 sFlow
2.1.3 IPFIX
2.1.4 Network Flow Analysis
2.2 Big Data and Related Technologies
2.2.1 File System
2.2.2 Distributed, Parallel and Concurrent Computing
2.2.3 Data Collection
2.2.4 Data Storage
2.3 Machine Learning
2.4 Web Technologies
2.4.1 AJAX
2.4.2 HTML5
2.4.3 Nodejs
2.4.4 Data Visualization on the Web
Chapter 3 A Survey of Network Flow Applications
3.1 Perspectives
3.1.1 Network Monitoring, Measurement and Analysis
3.1.2 Network Application Classification
3.1.3 User Identity Inferring
3.1.4 Security Awareness and Intrusion Detection
3.1.5 Issues of Data Error
3.2 Methodologies
3.2.1 Statistics
3.2.2 Machine Learning
3.2.3 Profiling
3.2.4 Behavior-based Approaches
3.2.5 Visualization
3.2.6 Anonymization
3.2.7 Analysis Systems
3.3 Discussion
3.3.1 Datasets
3.3.2 Research Perspectives
3.3.3 Methodologies
3.3.4 Challenges
3.3.5 Future Directions
Chapter 4 The Big Data Security System Design
4.1 Approach
4.2 Components of the Security Analysis System
4.2.1 Data Collection
4.2.2 Data Storage
4.2.3 Security Gateway
4.2.4 Data Processing
4.2.5 User Interfaces
4.3 Features
4.3.1 Real Time Continuous Network Security Monitoring and Interactive Visualization
4.3.2 Network Measurement
4.3.3 Advanced Network Modeling
4.4 Discussion
Chapter 5 Real Time Continuous Network Monitoring and Interactive Visualization
5.1 Real Time Network Host Querying
5.2 Real Time Continuous Network Monitoring
5.2.1 Network Flow Status
5.2.2 Top N Conversations
5.3 Interactive Network Security Awareness Visualization
5.4 Discussion
Chapter 6 A Case Study of Network Flow Measurement
6.1 Usage of Anonymity Network
6.1.1 Campus Network Traffic Flows
6.2 Related Work
Chapter 7 Classification and Identification of Network Objects
7.1 Methods
7.1.1 Data Set
7.1.2 Algorithm
7.1.3 Modelling
7.1.4 Ground Truth
7.2 Hosts Role Classification
7.2.1 Classification Features
7.2.2 Classification of Client versus Server
7.2.3 Classification of Web Server versus Web Non-email Server
7.2.4 Classification of Hosts from Personal Office versus Public Place
7.2.5 Classification of Hosts from Two Different Colleges
7.2.6 Feature Contributions
7.3 User Identification
7.3.1 Identification Features
7.3.2 User Identification Results
7.3.3 Feature Contributions
7.4 Discussion
7.4.1 Classification of Network Applications
7.4.2 Profiling the Host and Network
7.4.3 Machine Learning Approaches
7.4.4 Classifying and Clustering Host Roles
7.4.5 Identifying the User among Others based on Network Flow
Chapter 8 Conclusion
Chapter 9 Future Work
9.1 Improvement to the Current Work
9.2 Extensions to the Current Work
9.2.1 Background Traffic
9.2.2 Detection of Operating System Fingerprinting
9.2.3 Identity Anonymity
9.2.4 Fusion with Other Network Security Data
9.3 Vision
Bibliography

List of Figures

Figure 2.1 NetFlow Process
Figure 2.2 sFlow Process
Figure 2.3 Map Reduce Word Count Process
Figure 2.4 A Simple Storm Topology with Word Count
Figure 2.5 Flume Agent Data Flow Model
Figure 2.6 Kafka Architecture
Figure 3.1 Publications by Year
Figure 3.2 Analyzed Perspectives by Years
Figure 3.3 Methods by Years
Figure 3.4 Bipartite Graph (left) and One-mode Projection (right)
Figure 4.1 The Big Data Security System Architecture Design
Figure 5.1 Real Time Host Traffic Query
Figure 5.2 Analyzing Network Targets
Figure 5.3 Real Time Network Flow Status Monitor
Figure 5.4 Real Time Network Flow Top N Status
Figure 6.1 Geo-Location of Anonymity Usage on Campus (sFlow Data)
Figure 7.1 Average Accuracy of Classifying Client versus Server
Figure 7.2 Average Accuracy of Classifying Regular Web Email Server versus Web Non-email Server
Figure 7.3 Average Accuracy of Classifying Host from Personal Office Client versus Public Place Client
Figure 7.4 Average Accuracy of Classifying Host from Two Different Colleges
Figure 7.5 C5.0 Feature Contributions in Classifying Host Roles
Figure 7.6 Average Accuracy of Identifying the User
Figure 7.7 C5.0 Feature Contribution in Identifying the User

List of Tables

Table 3.1 Summary of Security Awareness and Intrusion Detection
Table 3.2 Summary of Machine Learning Approaches of Network Application Classification
Table 3.3 Summary of Machine Learning Approaches of Anomaly Detection
Table 3.4 Summary of NetFlow Visualization Applications
Table 6.1 Analyzed Anonymity Systems
Table 6.2 Usage of Anonymity Systems based on sFlow Data
Table 6.3 Application Usage of Anonymity Systems based on sFlow Data
Table 7.1 Formal Description of Network Objects
Table 7.2 Analyzed Systems

Chapter 1. Introduction

Understanding what information is flowing in the network is valuable for network administrators, accounting, network planning, network security, forensics, and counter-terrorism. Network flow data are supported by a wide range of products including Cisco NetFlow [91], Juniper “cflowd”, NetStream by 3COM and Huawei, and sFlow. Network flow¹ records high-level descriptions, i.e., meta-data, of Internet connections but not the actual data transferred. In general, collection and analysis of network flow information are more efficient than deep packet inspection. Flow-level analysis protects the privacy of the users and works even for encrypted traffic. The gathered information helps to uncover both external activities and internal activities such as network misconfiguration and policy violation.

Since Cisco's NetFlow patent in 1996, extensive research on network flow has been conducted and many applications have been developed [112]. Research interest and application development for network flow monitoring and analysis have risen in recent years [112]. Traditionally, network flow analysis has focused on network monitoring, planning, and billing. Recent research, however, has focused more on network security analysis. Behavior-based approaches, which are based on normal network host and user behaviors rather than static assumptions, have become the mainstream [69]. Advanced techniques such as machine learning have been employed as well. Analysis systems are moving toward distributed systems to provide more scalability, robustness, and computational power for real-time in-depth analysis.

However, many challenges remain, including the huge volume of network flow data, the small amount of relevant information in the network flow data, and the real time requirements of network security analysis. Additionally, network security is becoming more challenging as attacks are getting more sophisticated, organized, targeted, persistent, and dynamic. Attacks come not only from external sources but also from internal sources.

¹ Network flow is used to refer to these data in this dissertation to avoid these trademarks.

1.1 Motivations

Traditional network security systems such as firewalls and anti-virus software have been developed and deployed for a long time. These traditional security systems assume a static system. Many network flow monitoring and analysis systems have been developed and deployed. However, few of them have made network security the primary goal, and none can meet the real time performance requirement of network security monitoring. These systems store data in traditional databases (e.g., Oracle, Microsoft SQL, or MySQL) or flat files, which cannot handle real time performance requirements.

Computer hardware and software have been constantly improving, and so have hackers’ abilities. Hence, new intelligence-driven real time security systems that are more agile, dynamic, risk-aware, and contextual are needed. Big data [5], which is characterized by volume, velocity, variety, and veracity, has become a new trend in dealing with large amounts of data in real time. These new technologies provide innovative approaches to collect, store, measure in real time, and analyze a large amount of data. Among these technologies, Apache Hadoop [3] related technologies are open source and provide a valuable platform for the public and researchers. The Apache Hadoop project includes a collection of tools for reliable, scalable, and distributed data collection, storage, and processing.

Big data technologies have played an important role in global information processing, and have shown considerable promise for improving network security. A large amount of data is valuable to network security, including network flow data, network logs, application access records, and firewall logs. However, these data have not been used effectively because they are too large and variable, which makes it challenging not just to store the data but also to analyze them in meaningful ways. These data are growing hourly as the network keeps expanding, connecting, and mobilizing. In particular, big data technologies may provide valuable information to network security operations in a timely manner. Big data technologies may transform network security into a new stage where these additional data are utilized [6]. Additionally, analysis that adds the context of role-based behavior provides a new direction for accurate cyber security monitoring. Behavior-based user identification provides information for forensic investigation, and opens a possible area of network user identification based on the user’s network flow data.

Behavior-based approaches and machine learning have been valuable for network flow analysis. Machine learning represents a collection of methods for discovering knowledge by searching for patterns, and has been very successful in computer vision and speech recognition. Applying machine learning methods to the role-based behavior approach provides overall dynamic and contextual abilities to the network security analysis system. Additionally, behavior-based approaches first learn normal behaviors to build models and then make decisions based on the deviation from the normal behavior. They overcome the static assumptions made by traditional security systems. Different behavior-based studies have focused on application classification, anomaly detection, visualization of host behaviors, and host clustering. Most of the studies use basic statistics of network flow data, and focus on network and known attack patterns. Few studies use machine learning approaches, and none, as far as we are aware, has applied machine learning to classify and identify network objects based on roles.

1.2 Objectives

Given the challenges network security is facing and the new opportunities that big data technologies provide, this dissertation proposes a network security monitoring and analysis system based on big data technologies to collect, store, access, measure, and analyze network flow data. This system makes network security a primary goal, measures the campus network, provides real time continuous network monitoring and interactive visualization, and enables intelligent classification and identification of network objects, i.e., network hosts and users, based on role behavior as context.

In contrast to most network flow analysis approaches, which model abstract and unknown objects such as viruses, botnets, or anomalies with static signatures, in this dissertation I model concrete and specific network objects such as hosts and users based on their profiles and behaviors. The big data network security system is low cost and built on open source projects. By collecting online streaming data, monitoring continuously in real time, and analyzing role-based behavior, the network security system provides real time continuous monitoring, interactive visualization, intelligent analysis, and a low false rate in the classification and identification of security threats.

1.3 Contributions

The contributions of this work are relevant to network security monitoring, measurement, classification, identification, and intrusion detection. The experiment of processing network flow data based on big data technologies in this dissertation provides an example and insights for the use of big data technologies in cyber security. Big data technologies have opened new opportunities to design a better system for handling security related information flows and providing an agile response. The platform designed in this dissertation can be used to provide security as a service for customers. In addition, the distributed network flow monitoring and analysis system has been implemented and is being used by the network security administrators of our campus. The highlights of my contributions can be summarized as follows.

• A distributed system based on open source big data technologies for real time network flow monitoring, network measurement, and analysis.

• Use of the big data technologies to achieve real time continuous network monitoring and interactive visualization.

• A case study of network measurement by measuring usage of anonymity technologies on campus.

• Models for classifying clients versus servers, clients from public places versus personal offices, web email servers versus web servers, and clients from different colleges using sFlow data, and for identifying a particular user among the other users using NetFlow data.

Chapter 2. Background

This chapter provides background information related to the technologies used in this dissertation. These technologies are summarized in four categories:

• Network Flow: NetFlow, sFlow, IP Flow Information Export protocol (IPFIX), and Network Flow Analysis

• Big data and Related Technologies: Apache Hadoop open source distributed platform for collecting, storing, and analyzing large amount of data

• Machine Learning

• Web Development Technologies

2.1 Network Flow

A network flow is defined as a unidirectional or bidirectional sequence of packets between two endpoints (from server to client or from client to server) with some attributes in common. The most important key fields include source/destination IP address, source/destination port number, protocol type, type of service, bytes transferred, and input logical interface. Other fields may be included depending on the NetFlow version or the export configuration. These fields provide a rich set of traffic statistics, including user, protocol, port, and type of service, that can be used for a wide variety of purposes such as network security, network monitoring, traffic analysis, capacity planning, traffic classification, accounting, and billing. The general process of working with NetFlow includes capturing, sampling, generating, exporting, collecting, analyzing, and visualizing.
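The flow definition above can be illustrated with a minimal sketch that groups packets sharing the same key fields into unidirectional flow records. The `FlowKey` type, the `aggregate_flows` helper, and the sample addresses are hypothetical names chosen for illustration; they are not part of NetFlow or of the system described in this dissertation.

```python
from collections import defaultdict
from typing import NamedTuple


class FlowKey(NamedTuple):
    """Attributes that all packets of one unidirectional flow have in common."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def aggregate_flows(packets):
    """Group (FlowKey, byte_count) packets into per-flow packet and byte totals."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for key, size in packets:
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)


# Hypothetical traffic: two packets from a client to a web server, one reply.
c2s = FlowKey("10.0.0.5", "192.0.2.10", 51000, 80, "TCP")
s2c = FlowKey("192.0.2.10", "10.0.0.5", 80, 51000, "TCP")
packets = [(c2s, 60), (c2s, 500), (s2c, 1500)]

flows = aggregate_flows(packets)
print(flows[c2s])  # {'packets': 2, 'bytes': 560}
print(flows[s2c])  # {'packets': 1, 'bytes': 1500}
```

Because the key is directional, the request and the reply are counted as two separate flows; a bidirectional definition would normalize the key so that both directions map to the same record.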

There are various systems that capture network flow data from network links or devices, such as Ntop [46], NG-MON [77], NetFlow, NetFlow-lite, sFlow, cflowd, and NetStream. NetFlow and sFlow are widely used systems, and IPFIX [14] is the newly defined standard for the IP flow format. In this section, I give a brief introduction to NetFlow, sFlow, IPFIX, and traffic analysis.

2.1.1 NetFlow

NetFlow is a traffic monitoring technology developed by Darren Kerr and Barry Bruins in 1996 at Cisco [91]. It defines how a router exports information and statistics of routed sockets. As a de facto industry standard, it is a built-in feature of most routers and switches produced by Cisco, Juniper, and other vendors. Network devices look at the packets arriving on the interfaces and capture traffic statistics per flow based on the configuration for sampling or filtering; they then create a flow cache, aggregate the data, and export it through UDP or SCTP. A NetFlow cache entry is created by the first packet of a flow, maintained for packets with similar flow characteristics, and exported to collectors periodically based on flow timers or flow cache management. The export format was fixed before version 8. Starting with version 9, extensibility and flexibility were added to integrate with Multiprotocol Label Switching (MPLS), IPv6, Border Gateway Protocol (BGP), and user defined records. NetFlow versions 5 and 9 are the most popular versions.

Sampled NetFlow is a variant originally introduced by Cisco to reduce the computational burden by reducing the number of NetFlow records. The sampling rate can be configured as every predetermined nth packet or as a randomly selected interval. Figure 2.1 presents the basic process of NetFlow formation, exportation, storage, and analysis. Due to the large volume of network traffic and limited computational resources (i.e., memory, CPU, and bandwidth), caching, sampling, and UDP exportation are used. These can cause quality issues for the collected NetFlow data: (1) some new flows will not be counted when the cache is full; (2) sampling reduces the accuracy of flows, especially when the sampling rate is adjusted by the traffic rate; (3) exported flow records do not necessarily correspond to the order in which the flow traffic arrived at the router. A variety of NetFlow collectors and analysis tools are available from commercial vendors such as Cisco, as freeware, or developed in-house for special purposes [13, 15].

Figure 2.1: NetFlow Process (network packets create new flows in the cache; flow data are exported to the flow collector for storage and analysis)
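The two sampling modes mentioned above can be sketched as follows. This is an illustration only, assuming that "randomly selected interval" means picking one random packet out of every window of n consecutive packets; the function names are hypothetical and do not correspond to any router configuration command.

```python
import random


def sample_every_nth(packets, n):
    """Deterministic sampling: keep the n-th, 2n-th, 3n-th, ... packet."""
    return [p for i, p in enumerate(packets, start=1) if i % n == 0]


def sample_random_in_window(packets, n, seed=None):
    """Random sampling: keep one randomly chosen packet per window of n packets.

    A trailing partial window still contributes one packet, which slightly
    oversamples the end of the trace.
    """
    rng = random.Random(seed)
    sampled = []
    for start in range(0, len(packets), n):
        window = packets[start:start + n]
        sampled.append(rng.choice(window))
    return sampled


packets = list(range(1, 1001))  # packet sequence numbers 1..1000
deterministic = sample_every_nth(packets, 100)
randomized = sample_random_in_window(packets, 100, seed=1)

print(deterministic)    # [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(len(randomized))  # 10
```

Both modes examine the same fraction of packets. The random variant avoids locking onto periodic traffic patterns, but either way a short flow that falls entirely between samples is never recorded, which is the accuracy issue (2) noted above.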

2.1.2 sFlow

Packet sampling of traffic flow [52] was used even before NetFlow was developed. sFlow was developed by InMon Inc. and has become an industry standard defined in RFC 3176. It is a technology using simple random sampling and is supported by Alcatel, Extreme, Force10, HP, and Hitachi, which embed the sFlow agent within switches and routers. The sFlow agent is a software process that combines interface counters and flow samples into sFlow datagrams and immediately sends them to sFlow collectors via UDP. Immediate forwarding of data minimizes memory and CPU usage at the network device. Packets are typically sampled by Application-Specific Integrated Circuits (ASICs) to provide wire-speed performance. sFlow data contain the complete packet header and switching/routing information, and provide an up-to-the-minute view of the network traffic. sFlow is able to run at layer 2 and capture non-IP traffic as well. The sFlow collectors are servers that collect the sFlow datagrams. The official sFlow web site [18] provides a list of available sFlow collectors. Figure 2.2 presents the basic components and processes of sFlow analysis.

Figure 2.2: sFlow Process (in the switch/router, ASICs feed counters and samples to the sFlow agent; the sFlow analysis system consists of collector, data storage, and analysis)
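Because sFlow forwards only a random sample of packets, a collector has to scale the samples to approximate the total traffic. The sketch below shows this arithmetic under the assumption of a fixed 1-in-N sampling rate; `estimate_totals` and the sample sizes are hypothetical and not taken from the sFlow specification or from this dissertation's system.

```python
def estimate_totals(sampled_sizes, sampling_rate):
    """Scale sampled packet sizes (bytes) to estimated total packets and bytes.

    With 1-in-N random sampling, each sample stands for about N packets.
    """
    estimated_packets = len(sampled_sizes) * sampling_rate
    estimated_bytes = sum(sampled_sizes) * sampling_rate
    return estimated_packets, estimated_bytes


# Hypothetical sizes of four sampled packets with a 1-in-512 sampling rate.
samples = [64, 1500, 576, 1500]
est_packets, est_bytes = estimate_totals(samples, 512)
print(est_packets)  # 2048
print(est_bytes)    # 1863680
```

The estimate becomes more reliable as the number of samples grows, which is why sampled data suit aggregate measurements better than the reconstruction of individual short flows.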

2.1.3 IPFIX

IP Flow Information Export (IPFIX) is an IETF standard for exporting network flow data. It is based on NetFlow version 9 and is defined in RFC 5101 for the information transmission protocol, RFC 5102 for the information model, and RFC 5103 for exporting bidirectional flows. IPFIX was designed to meet the fast growing requirements of observing network traffic, to provide an extensible and flexible data model that can be customized, and to support reliable and secure data transfer through SCTP, TCP, and UDP. The IPFIX flow definition is less restrictive than the traditional flow definition.

2.1.4 Network Flow Analysis

Network flow analysis is the process of discovering useful information by using statistics or other sophisticated approaches. The basic process includes capturing, collecting, and storing data, aggregating the data for query and analysis, and analyzing the data for useful information. This information is mostly related to network monitoring, measurement, and network security. There are different ways to collect network flow data, including SNMP, NetFlow, raw packets, or auditing data from network infrastructure such as firewalls and VPN gateways. Typically, there are two strategies: (1) depth-first when there is known information and a clear purpose, or (2) breadth-first when looking for a general view of the network without a particular purpose.

In general, deep packet inspection needs packet level information and consumes more computational resources. Flow level analysis, such as NetFlow and sFlow, typically consumes less computational resources. There are many products and tools developed by industry or the open source community. Analysis of network flow information has become crucial as the Internet has become the lifeblood of our society and is expanding at a fast pace around the world. There are many challenges in analyzing network flow data, such as the huge amount of data due to networks becoming larger and faster, limited high-level information, and complex statistical properties. As discussed in sections 3.1 and 3.2, various perspectives have been analyzed and many algorithms have been developed.

2.2 Big Data and Related Technologies

Big data technologies are still in their inception stage. The concept is constantly growing as we try to understand its size, complexity, and methods for data collecting, storing, processing, analyzing, sharing, and visualizing. In general, big data [5] refers to four dimensions, or four Vs:

• Volume: data size is from terabytes to petabytes in a data set.

• Variety: data type can be structured (e.g., data in a traditional SQL database), unstructured (e.g., conversation between people), or semi-structured (e.g., emails are semi-structured, with structured data of ’from’, ’to’, ’subject’, and an unstructured email body).

• Velocity: processing the big data in a short time or real time to maximize the data value for users.

• Veracity: truthful intelligence has to be gained from data with more noise, less truth, and high uncertainty (e.g., the NetFlow data used to identify a user in this dissertation have high uncertainty).

Applications of big data technologies are numerous, including scientific data in physics, bioinformatics, computational social science, climate simulation, government, and more. In this dissertation, I monitor and analyze the security aspect of a large amount of data generated by a computer network.

Research related to big data technologies is very active and is supported by government and private sectors. Big data require technologies from many aspects of computer science, ranging from high performance computing to data collection and storage.

The Apache Hadoop project [3] is an open source project from Apache, which includes a collection of tools for reliable, scalable, and distributed data collecting, storing, and processing. Its core components include the Hadoop Distributed File System (HDFS), the parallel data processing framework map reduce, and other common core modules. In the following, I briefly introduce some open source projects of big data technologies related to this dissertation.

2.2.1 File System

Hadoop Distributed File System (HDFS) is a distributed file system that is designed to handle big data (in the order of petabytes), to provide high-throughput access to a large amount of data, to be highly fault-tolerant, and to run on low-cost hardware. HDFS is the basis for other Apache Hadoop related technologies.

2.2.2 Distributed, Parallel and Concurrent Computing

Big data demand more computing power than traditional data processing. Distributed, parallel, and concurrent computing provide this power. Parallel and concurrent programming, however, is a nightmare for many programmers. Fortunately, there are several open source projects that support distributed, parallel, and concurrent computing. The following technologies are experimented with in this dissertation.

2.2.2.1 Map Reduce

Map reduce is a programming model for distributed parallel computing, and is composed of a pipeline of map and reduce stages. A simple example is counting the words in lines of text, as shown in Figure 2.3. The map stage outputs each word as the key and 1 as the value; the reduce stage outputs each word as the key and the sum of its values as the value. In general, map reduce is a good fit for log processing, data mining, and machine learning. Map reduce has interfaces for Java and C++, and for the high level scripting language Pig Latin.
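The word count example can be sketched in a few lines of single-process Python that mimic the map, shuffle, and reduce stages. This is only an illustration of the programming model, not Hadoop code; the function names and the three input lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the values collected for one key."""
    return word, sum(counts)


# Assumed sample input; each line would be a separate split on a cluster.
lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(w, c) for w, c in shuffle_phase(mapped).items())
print(sorted(result.items()))
# [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```

On a real cluster the map calls run in parallel on different input splits and the reduce calls run in parallel on different keys; the sequential version above only shows what each stage computes.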

2.2.2.2 Apache Pig Latin

Apache Pig [141] is a platform for analyzing large data sets in parallel using map-reduce programs, and Pig Latin is its scripting language. It makes map reduce parallel programming easier.

Figure 2.3: Map Reduce Word Count Process (stages: input, splitting, mapping, shuffling, reducing, final result)

2.2.2.3 Storm

Storm is a reliable, fault-tolerant, real time distributed computing system developed by BackType (now part of Twitter) in 2011 [20]. It was designed for real time computing of streaming data. It is now also used for continuous computation and distributed remote procedure calls (RPC). Whereas map reduce performs batch processing of jobs, Storm uses topologies, which process messages forever. The two main components of a topology are spouts, which are sources of streams that accept streaming data, and bolts, which are single threads that process the stream data, as shown in Figure 2.4. Compared with traditional big data solutions such as Apache Hadoop, Storm supports continuous streaming data processing without termination.

Figure 2.4: A Simple Storm Topology with Word Count (continuous input feeds a spout, which feeds a bolt that keeps running word counts)
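The spout and bolt roles can be imitated with plain Python generators to show how a topology differs from a batch job. This sketch does not use the Storm API; `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical names, and everything runs in a single thread.

```python
from collections import Counter
from itertools import islice


def sentence_spout(sentences):
    """Spout: an endless source of tuples (here it cycles over sample input)."""
    while True:
        for sentence in sentences:
            yield sentence


def split_bolt(stream):
    """Bolt: split each incoming sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keep running counts and emit (word, count) after every word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


topology = count_bolt(split_bolt(sentence_spout(["Deer River", "Deer Bear"])))

# The stream never terminates, so only the first six updates are taken here.
updates = list(islice(topology, 6))
print(updates)
# [('Deer', 1), ('River', 1), ('Deer', 2), ('Bear', 1), ('Deer', 3), ('River', 2)]
```

Unlike the map reduce word count, there is no final result: the counts are updated for as long as input keeps arriving, which is the behavior needed for continuous monitoring.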

2.2.3 Data Collection

Big data collection requires distributed, reliable, scalable, extensible, and efficient data collection, aggregation, and migration. There are several distributed message collecting systems available, and some of them are open source projects. In general, the main considerations are configuration, fault-tolerance, and maintenance. Flume and Kafka stand out and are experimented with in this dissertation.

2.2.3.1 Flume

Flume [2] was built by Cloudera and is now an Apache project. It was designed as a highly reliable, distributed logging service for HDFS. It has two main components: agents and masters. Masters are responsible for keeping track of and managing all agents. Agents are responsible for collecting, aggregating, and moving data to the data store. Figure 2.5 presents the Flume agent data flow model. Multiple layers of agents can be added for different roles such as data collection and aggregation. In the same way, more masters can be deployed to make the system fault-tolerant, so that if any one of them fails, the system as a whole does not.

Flume-ng is the next generation of Flume, with many improvements that make it more configurable and provide more features.

[Figure: a Flume agent in which network flow data enters a Source and leaves through a Sink to Cassandra]

Figure 2.5: Flume Agent Data Flow Model

2.2.3.2 Kafka

Kafka [99] is a distributed data collecting system developed by LinkedIn that later became an open-source Apache project. It was designed as a distributed publish-subscribe messaging system for high throughput. Kafka has three basic roles: a producer, which produces messages; a broker, which handles messages and routes them to the appropriate topic queue; and a consumer, which consumes the messages in the queue of a specific topic. Figure 2.6 presents the Kafka architecture.

Kafka can act as a traditional distributed event logger that saves messages for offline analysis, as Flume does. Kafka can also support real-time streaming data analysis by the consumer. Flume and other distributed event log systems are typically push-based systems, whereas Kafka is a pull-based message processing system in which consumers fetch messages from the brokers. Kafka sends data to consumers over the network via the Linux sendfile() API, which is more efficient because it avoids extra copies between kernel and user space.

[Figure: Producers 1 to N publish to Brokers 1 to N, each holding partitions 1 to N of Topics 1 to N, from which Consumers 1 to N read]

Figure 2.6: Kafka Architecture
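The pull-based interaction between consumers and brokers can be illustrated with a small in-memory model. This is a conceptual sketch only and does not use any Kafka client library; the `Broker` and `Consumer` classes, their methods, and the topic name are hypothetical names chosen for the example.

```python
from collections import defaultdict


class Broker:
    """Keeps an append-only message log per topic."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        # Producers push messages to the broker.
        self._topics[topic].append(message)

    def fetch(self, topic, offset, max_messages=10):
        # The broker only answers fetch requests; it never pushes.
        return self._topics[topic][offset:offset + max_messages]


class Consumer:
    """Pulls messages at its own pace and tracks its own offset."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.broker.fetch(self.topic, self.offset, max_messages)
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    broker = Broker()
    for record in ("flow-1", "flow-2", "flow-3"):
        broker.publish("netflow", record)
    consumer = Consumer(broker, "netflow")
    print(consumer.poll(2))
    print(consumer.poll(2))
```

Because the consumer owns its offset, a slow consumer simply falls behind in the log instead of being overwhelmed, and the same log can serve both offline and real-time readers.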

2.2.4 Data Storage

The data storage of big data is typically NoSQL, i.e., Not Only SQL. NoSQL supports unstructured and structured data, multiple replicas, and large volumes of data. There are many NoSQL solutions, from commercial to open source. Each solution has its own characteristics, such as being Apache Hadoop based, a document store, a key-value store, or a graph store. In this dissertation, Cassandra is chosen as the data storage.

Apache Cassandra [1] is an open source, distributed NoSQL storage system. Started at Facebook, it combines architectural aspects of Amazon's Dynamo with Google's Bigtable data model. It provides scalability, reliability, durability, and high performance. It is one of the most widely used NoSQL solutions, and has been adopted by companies such as Netflix, Twitter, and Cisco. Eric Brewer's well-known "CAP theorem" [70] states that it is impossible for a web service to provide consistency, availability, and partition-tolerance at the same time. Cassandra is known for availability and partition-tolerance, but arguably also provides consistency by letting the client specify a consistency level. Compared with traditional relational databases, Cassandra has a simple schema comprising keyspaces (like a database in Microsoft SQL Server or a schema in an Oracle database), column families (like tables in relational databases), rows, and columns. Each row is stored in its entirety on a single machine (and on its replicas), and column size and name can be static or dynamic. Cassandra provides application programming interfaces for Java, C++, Python, R, and Pig for data processing.

2.3 Machine Learning

Machine learning is a collection of methods for discovering knowledge by searching for patterns in a data set. A machine learning approach refines and improves its knowledge base by learning from experience. The basic learning types are listed below:

• Classification: classify inputs to labeled outputs.

• Clustering: group inputs into clusters.

• Association: discover interesting relations between features.

• Prediction: predict outcome in terms of a numeric quantity.

Machine learning schemes include information theory, neural networks, support vector machines, genetic algorithms, and many more [165]. Machine learning applications require the collection of training and test data sets and depend on algorithms for feature extraction, feature selection, and learning. Initially, the system is trained using example data to learn specific data associations; then, the system is deployed in a similar environment where test data are used for classification.

The Decision Tree and Support Vector Machine are two supervised machine learning algorithms that are often applied to network flows and typically perform better than others [112]. Therefore, we chose both of these algorithms for our experiments.

2.3.0.1 Decision Tree

Decision Tree [65] uses inductive inference to produce a tree-like structure which includes a root node, internal nodes, and leaf nodes. Classification starts at the root node and moves down to a leaf node. The strength of Decision Tree algorithms comes from performing fast classification without requiring much computation, and from indicating which features are most important for classification. The weakness of Decision Tree is that the training process becomes computationally expensive as the tree grows bigger.
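One common way a decision tree ranks features is information gain, the criterion used by ID3-style learners. The text above does not state which criterion is used, so the following sketch is an assumption for illustration; the toy flow records and their field names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    # Entropy reduction obtained by splitting the rows on one feature.
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    return entropy(labels) - remainder


def best_feature(rows, labels):
    # The feature with the highest gain becomes the next tree node.
    return max(rows[0], key=lambda f: information_gain(rows, labels, f))


if __name__ == "__main__":
    rows = [
        {"proto": "tcp", "port": "80"},
        {"proto": "tcp", "port": "53"},
        {"proto": "udp", "port": "53"},
        {"proto": "udp", "port": "80"},
    ]
    labels = ["web", "dns", "dns", "web"]
    print(best_feature(rows, labels))
```

The same gain values indicate which features matter most: a feature whose gain is close to zero contributes little to separating the classes.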

2.3.0.2 Support Vector Machine

Support Vector Machine was first introduced in the 1960s [31]. It has evolved over time and has been successfully used for classification in different applications. A Support Vector Machine separates the classes with an optimal hyperplane, chosen so that the points of each class fall on the same side with the largest possible margin. The support vectors are the data points closest to the hyperplane. Different kernels are used to map data to feature spaces. The most common kernels are linear, polynomial, radial basis function, and sigmoid. Selecting a kernel is very important for the success of the classification. Support Vector Machines scale to high-dimensional data. Online Support Vector Machine [31] is an extended version that stores the model after each training, and then adds new examples while removing the least relevant examples for the new data set. It saves memory and provides flexibility for scenarios involving stream data.
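For reference, the four kernels named above are usually written as follows. The symbols are the conventional ones and are not taken from this dissertation: x and y are feature vectors, and gamma, r, and d are kernel hyperparameters chosen by the user.

```latex
\begin{aligned}
K_{\text{linear}}(x, y)  &= x^{\top} y \\
K_{\text{poly}}(x, y)    &= \left(\gamma\, x^{\top} y + r\right)^{d} \\
K_{\text{RBF}}(x, y)     &= \exp\!\left(-\gamma\, \lVert x - y \rVert^{2}\right) \\
K_{\text{sigmoid}}(x, y) &= \tanh\!\left(\gamma\, x^{\top} y + r\right)
\end{aligned}
```

The linear kernel corresponds to a hyperplane in the original feature space, while the other three implicitly map the data to a different feature space in which a separating hyperplane may be easier to find.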

2.4 Web Technologies

The Internet provides exponentially growing content and is moving toward cloud computing. Web-related technologies have improved significantly in recent years. The following are several web development technologies used in this dissertation.

2.4.1 AJAX

Asynchronous JavaScript and XML (Ajax) is a new model of web development. In the Ajax development model, a web application can send data to or retrieve data from a server synchronously or asynchronously, without reloading the page. Ajax provides fast performance and opens new doors for interactive user experiences and rich content.

2.4.2 HTML 5

HyperText Markup Language (HTML) 5 is built on the latest developments in web applications: multimedia, multi-platform support, more complex web applications, low-powered devices such as smartphones, and more. Server-sent events are a new feature of HTML 5 that lets a web page automatically receive updates from the server. In contrast to the traditional stateless request-response web page, WebSocket provides full-duplex communication over a single TCP connection.

2.4.3 Nodejs

Rapid web development is in demand because it allows quick prototyping, short development time, and low cost. Many new platforms and programming languages have been developed recently. Nodejs [17] is a server-side JavaScript platform for writing scalable Internet applications, and is a high-performance, concurrent network application framework. Nodejs runs on V8, an open source JavaScript engine developed by Google [21]. Nodejs relies heavily on an event-driven model to implement non-blocking operations, and application code inside Nodejs runs in a single thread, which keeps it lightweight.

2.4.4 Data Visualization on the Web

Data visualization on the web has gone through a breakthrough in recent years, especially with the new features of HTML 5 such as the canvas element and SVG images. WebGL is a JavaScript graphics API supported by most browsers. Faster hardware and high-performance JavaScript engines make interactive graphs on the web possible. Interactive visualization allows users to interact with the graph and provides dynamic information about the data in response to user input.

There are many open source JavaScript libraries that make data visualization on the web easier. D3 (d3js.org) and Flot (www.flotcharts.org) are used in this dissertation to visualize the network status.

Chapter 3. A Survey of Network Flow Applications

Research in network flow analysis has become very active in recent years, as observed in Figure 3.1, which shows the distribution of published papers with respect to publication year. In order to move forward, it is necessary to look back at which perspectives have been addressed, and which methods have been used and are more effective. This chapter presents a survey of studies on NetFlow-like applications that were published between 1998 and early 2012, with emphasis on network flow applications.

The rest of this chapter is organized as follows. Section 3.1 reviews the key perspectives which have been addressed in the literature, including network monitoring, analysis and management, application classification, inferring user identity, and network security awareness. Section 3.2 explains the key methodologies which have been employed for data analysis, including statistics, machine learning, profiling, behavior-based approaches, sampling, visualization, and computational infrastructures. Section 3.3 discusses the limitations of existing approaches, challenges, and future directions.

3.1 Perspectives

In this section, I survey the main research perspectives of network flow applications. In particular, it covers network monitoring, measurement and analysis, application classification, user identity inferring, security awareness and intrusion detection, and issues related to error and bias in NetFlow collection and analysis. Figure 3.2 shows the distribution of published papers with respect to five main perspectives: monitoring, classification, user identifying, security, and issues related to errors. As can be observed, network security has been the main research topic using NetFlow data.

[Figure: number of publications per year, 1998 to 2012; vertical axis from 0 to 45]

Figure 3.1: Publications by Year

3.1.1 Network Monitoring, Measurement and Analysis

Network monitoring and measurement provide valuable information to network administrators, as well as ISPs and content providers. Compared to other technologies, such as SNMP or Windows Management Instrumentation (WMI), network flow data contain additional information for further analysis. For example, they can support bandwidth analysis, monitoring of specific protocols, and system performance analysis. Monitoring based on NetFlow can be categorized as:

• Network monitoring: provides information about routers and switches as well as a network-wide view, and is used for problem detection along with efficient troubleshooting.

• Application monitoring: provides information about application usage over the network, and is used for planning and allocation of resources.

• Host monitoring: provides information about user utilization of network and applications, and is used for planning, network access control, and detecting security policy violations.

• Security monitoring: provides information about network behavior changes, and is used to identify DoS attacks, viruses and worms, and network anomalies.

• Accounting and Billing: provides network metering, and is used for billing.

[Figure: number of publications per year, 1998 to 2012, broken down by perspective (monitor, classification, identity, security, issues); vertical axis from 0 to 20]

Figure 3.2: Analyzed Perspectives by Years

In this section, we focus on network, application, user, and resource monitoring, while section 3.1.4 focuses on security-related monitoring. In the following, we discuss specific research perspectives.

3.1.1.1 Network Monitoring

Many aspects of networks have been studied using NetFlow data, including network performance based on round-trip time [170], delay measurement [97, 108], connectivity [157], misuse of bandwidth [121], traffic characterization [101], finding heavy hitters [178], monitoring for special purpose QoS [113], and diagnosis of troubleshooting [171].

3.1.1.2 Application Monitoring

Liu and Huebner [115] investigated the stochastic characteristics of distributions of flow length, packet size, throughput, etc. for the popular and bandwidth consuming applications. Kalafut et al. [86] proposed a heuristic method to differentiate wanted and unwanted traffic based on the sampled NetFlow data.

3.1.1.2.1 VoIP Voice over IP (VoIP) service is widely used; however, it introduces security threats that include Session Initiation Protocol (SIP) scanning, SIP flooding, and Real-time Transport Protocol (RTP) flooding. Lee [106] developed a system that can monitor VoIP service and detect VoIP network threats based on NetFlow statistics and behaviors. Kobayashi [96] presented a method for measuring VoIP traffic fluctuation by using NetFlow and sFlow based on the variance of the interval of the target RTP packets. Lucas [47] provided an open source VoIP monitoring system based on protocol characteristics.

3.1.1.2.2 Mobile Network Sinha et al. [163] analyzed the flow-level upstream traffic behavior from Broadband Fixed Wireless (BFW) and Digital Subscriber Line (DSL) to provide traffic characteristics of these networks. Moghaddam [128] used wireless NetFlow data to measure and simulate user behavior and provide information for future mobile network design.

3.1.1.2.3 IPv6 With the transition from IPv4, there is a need to understand IPv6 usage, including user behavior, traffic volume, transitional technologies, assignment of IPv6 addresses, and the IPv6 percentage of network traffic. NetFlow can provide this information, including application types, usage of IPv6-to-IPv4 transitional technologies, and interface identifier assignment schemes [161, 202].

3.1.1.3 Host Monitoring

Host profiles and relationships in the network can be used for resource planning as well as for network security analysis. Caracas et al. [34] proposed an algorithm based on NetFlow data to describe the dependencies among computer systems, software components, and services. Kind et al. [94] presented a method to uncover the relationships between IT infrastructures using NetFlow data. Chen et al. [39] developed novel heuristics to analyze characteristics and correlations between inter-data center and client traffic, providing insights into data center design and operation. Several methods have been proposed for profiling behaviors on the end hosts [189, 194, 195]. Behavior-based approaches will be discussed in detail in section 3.2.4.

3.1.2 Network Application Classification

Network application classification classifies network traffic into application categories, which can be coarse- or fine-grained. It is a challenging task because of obfuscation techniques such as content encryption, dynamic ports, and proprietary communication protocols. Classification approaches can be divided into four categories: port-based, payload-based, heuristic-based using transport layer statistics, and machine learning based. Port-based approaches are no longer reliable because certain applications assign ports randomly. Payload-based approaches do not work on encrypted traffic, are resource-intensive, and scale poorly with high bandwidth. Heuristic and machine learning approaches provide alternative methods.

There are many reasons for network traffic classification. Network administrators need information about the applications running on the network (e.g., file sharing), and whether the traffic comes from legitimate users or from worms. ISPs and content providers need the information for quality of service assurances. Research in this area has been conducted for over ten years, but it is still growing. There is a list of 68 published papers and 86 data sets collected on the CAIDA web pages [12]. There are several surveys on network application classification using traffic classification approaches [92] and machine learning approaches [139]. We discuss machine learning approaches in detail in section 3.2.2.1. It is worth mentioning that Perelman et al. proposed a method that investigates the application signatures of web browsers, mail clients, and media players in network flows [145]. Peer to peer networks have become a major security concern and the focus of most network classification studies. We discuss peer to peer classification in detail below.

3.1.2.1 Peer to Peer Network

Peer to Peer (P2P) networks have been widely utilized for file-sharing, video distribution, and voice communications. They consume more Internet traffic than traditional applications, and have been a concern for network administrators and a challenge for network security. There is interest from ISPs and network administrators in identifying and controlling P2P network traffic [72, 200]. NetFlow provides an alternative approach that is more efficient in terms of storage and processing than deep packet inspection (DPI). Recently, there has been considerable effort on NetFlow P2P analysis. These efforts include methods based on: (a) default P2P ports for heavy-hitters [183], (b) port usage patterns of specific P2P networks such as BitTorrent [30, 72], (c) flow statistic characteristics such as packet length and time interval [30, 148, 193], (d) TCP flags, where a host acting as both client and server sends and receives packets with both SYN and ACK set [85], and (e) machine learning using features such as IP address and port, packet size, and bytes exchanged. Among the machine learning approaches we discuss in section 3.2.2.1, six out of eight classify P2P traffic.

3.1.3 User Identity Inferring

Identifying a person based on extrinsic biometrics is not new; well-known examples include signatures and keystrokes. Inferring user identity based on network flow patterns, however, is a new field. Melnikov et al. [125] discussed the potential of inferring user identity using NetFlow feature distribution and cross-correlation of various trace parameters and relationships among packets. Even though the reported results were preliminary, additional research may yield more promising results.

3.1.4 Security Awareness and Intrusion Detection

In this section, we focus on security-related awareness, detection, and monitoring. Table 3.1 provides a list of studies that provide perspectives on security awareness and intrusion detection. Table 3.3 lists the studies that use machine learning approaches. IDSes can be categorized based on how they identify intrusions: anomaly-based, misuse-based (knowledge-based or signature-based), or a combination of both [167]. Alternatively, IDSes can be categorized based on what they target: host-based, network-based, or both.

Network anomaly detection, also known as anomaly-based IDS, refers to finding patterns that do not conform to expected user behavior. In contrast to misuse-based IDS, these patterns are previously unknown. Most content-oriented systems belong to knowledge-based detection, which looks for known signatures of malware by inspecting traffic packets. Most behavior-oriented systems belong to anomaly-based detection, which differentiates anomalous behavior from normal behavior. NetFlow-based IDSes use existing NetFlow data with limited information, and avoid the privacy issues of content-oriented approaches. However, intrusion detection with NetFlow is more difficult because of the limited information in the NetFlow data. Recent research has shown that machine learning approaches perform better than statistical and streaming methods. Sperotto et al. [167] conducted an overview of IP flow-based intrusion detection that focused on flow-based IDS, the concept of flows, classification of attacks, and defense techniques. In the following, we discuss perspectives of security awareness and intrusion detection that can be achieved using NetFlow data.

3.1.4.1 Top N

Top N refers to a set of statistics and models of NetFlow data that reflect the basic network status. It is relatively simple to compute with NetFlow analysis. It can be used to find the big talkers or heavy-hitters, and also for abnormal traffic detection [201].
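A Top N report reduces to aggregating a traffic measure per key and keeping the largest entries. The following is a minimal sketch under the assumption that each flow record is a dictionary; the field names `src` and `bytes` and the sample addresses are hypothetical.

```python
from collections import Counter


def top_n_talkers(flows, n=3, key="src", weight="bytes"):
    # Sum the chosen measure (bytes by default) per key, then keep
    # the n largest totals.
    totals = Counter()
    for flow in flows:
        totals[flow[key]] += flow[weight]
    return totals.most_common(n)


if __name__ == "__main__":
    flows = [
        {"src": "10.0.0.1", "bytes": 500},
        {"src": "10.0.0.2", "bytes": 100},
        {"src": "10.0.0.1", "bytes": 200},
        {"src": "10.0.0.3", "bytes": 300},
    ]
    print(top_n_talkers(flows, n=2))
```

Changing `key` to a port or protocol field, or `weight` to a packet count, yields the other usual Top N views from the same records.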

3.1.4.2 Port Scanning

Port scanning is the act of systematically scanning a computer’s ports, and is usually done by using small packets that probe the target machines. In most network attacks, port scanning is the first reconnaissance step. Port scanning can be classified into three categories: scanning many ports on a single host, scanning a single port on many hosts, and a combination of both. Detection of port scanning is addressed in most studies cited in Table 3.1. Approaches include host incoming/outgoing connections, probability of entropy, Bayesian logistic regression, distances from baseline models, and machine learning.
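The first two scan categories can be detected from flow records by counting distinct destinations. The sketch below is a simple threshold heuristic, not one of the cited methods; the tuple layout, the thresholds, and the conventional names "vertical" (many ports on one host) and "horizontal" (one port on many hosts) are assumptions made for the example.

```python
from collections import defaultdict


def detect_scans(flows, port_threshold=100, host_threshold=100):
    # flows: iterable of (src, dst, dst_port) tuples.
    ports_per_pair = defaultdict(set)   # (src, dst) -> distinct ports
    hosts_per_port = defaultdict(set)   # (src, dst_port) -> distinct hosts
    for src, dst, dst_port in flows:
        ports_per_pair[(src, dst)].add(dst_port)
        hosts_per_port[(src, dst_port)].add(dst)

    # Vertical scan: one source probes many ports on a single host.
    vertical = {src for (src, _), ports in ports_per_pair.items()
                if len(ports) >= port_threshold}
    # Horizontal scan: one source probes the same port on many hosts.
    horizontal = {src for (src, _), hosts in hosts_per_port.items()
                  if len(hosts) >= host_threshold}
    return vertical, horizontal


if __name__ == "__main__":
    flows = [("a", "victim", port) for port in range(1, 6)]
    flows += [("b", "host%d" % i, 22) for i in range(5)]
    flows += [("c", "victim", 80)]
    print(detect_scans(flows, port_threshold=5, host_threshold=5))
```

A source that appears in both sets matches the third category, a combination of both scan types.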

3.1.4.3 Denial of Service

A denial-of-service (DoS) or distributed denial-of-service (DDoS) attack is an attempt to make the target host or network resource unable to respond to legitimate requests. Detection of DoS or DDoS is addressed in most flow-based IDSes. Gao et al. [68] proposed a resilient DoS detection method based on sketch-based schemes that use a hash table for storing aggregated flow measurements. Kim et al. [93] described different DoS attacks based on traffic patterns and presented a network anomaly detection method that can detect flooding attacks. New developments include the use of a novel dynamic entropy to measure the anomaly [89], and an attack detection method based on statistic aggregation that can detect DDoS and port scanning [66]. Table 3.1 provides a list of related studies.
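Entropy-based measures compare how a flow field is distributed across time windows. The sketch below uses plain Shannon entropy of destination addresses per window, which is a generic illustration and not the dynamic entropy of [89]; the tuple layout, the window length, and the function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values):
    # Entropy in bits of the empirical distribution of the values.
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def dst_entropy_per_window(flows, window_seconds=60):
    # flows: iterable of (timestamp_seconds, dst_address) tuples.
    windows = defaultdict(list)
    for timestamp, dst in flows:
        windows[int(timestamp // window_seconds)].append(dst)
    return {index: shannon_entropy(dsts)
            for index, dsts in sorted(windows.items())}


if __name__ == "__main__":
    normal = [(0, "a"), (10, "b"), (20, "c"), (30, "d")]
    flood = [(60, "victim"), (61, "victim"), (62, "victim"), (63, "victim")]
    print(dst_entropy_per_window(normal + flood))
```

When a flood concentrates traffic on one target, the destination entropy of that window drops sharply relative to earlier windows, which is the kind of change such detectors look for.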

3.1.4.4 Worms

A worm is a standalone malicious program that replicates across networks by exploiting software vulnerabilities or by tricking users into executing it through social engineering. The effects of worms range from mild annoyance to damaged data or software, DoS, and data theft. Detection of worms can be categorized as trap-oriented, packet-oriented, and connection-oriented [37]. Detection of port scanning is one of the important steps for worm detection, and hence many similar approaches are used in both types of detection. NetFlow-like approaches are connection-oriented and include: analysis of host behavior on the basis of incoming and outgoing connections [50], correlation between NetFlow data and honeypot logs [49], and detection of hit-list worms using protocol graphs [44]. Chan et al. [37] proposed the FloWorM system, which includes a tracker, an analyzer, and a reporter based on NetFlow data. Abdulla et al. [23] presented a Support Vector Machine (SVM) method based on the fact that a scanning activity or email worm initiates a significant amount of traffic without DNS.

3.1.4.5 Botnet

Botnets are networks of malware-infected hosts that are controlled by a remote entity known as the botmaster. They have become one of the major security threats, credited with DDoS, spamming, phishing, identity theft, and other cyber crimes. Many botnets rely on communication channels varying from centralized IRC and HTTP to decentralized P2P networks. Detection of a botnet is relatively more difficult than detection of port scanning and worms. Zhu et al. [37] conducted a survey on understanding, detecting and tracking, and defending against botnets. Recent approaches use advanced methodologies and combine host and network level information. Zeng et al. [199] proposed a method that combined host and network-level information with protocol-independent detection. BotCloud's detection is based on MapReduce and on combining host and network approaches [63]. BotTrack's tracking is based on PageRank of NetFlow data and a host behavioral model [62]. Finally, Barsamian used a network statistical behavioral model for botnet detection [26], and Weststrate used heuristic methods to find botnet servers [191].

3.1.4.6 Policy Validation

Peer-to-peer networks can be used legitimately, misused by botnets, or used in violation of network usage policy. Section 3.1.2.1 details peer-to-peer classification using NetFlow data. Krmicek et al. [100] proposed an approach to detect the use of unauthorized Network Address Translation (NAT) via a heuristic method based on NetFlow data. NetFlow data can also provide information about legitimate flows denied by the security policy, and help network administrators with troubleshooting. Frias-Martinez [64] proposed a behavior-based network access control mechanism with a true rejection rate of 95%.

3.1.5 Issues of Data Error

NetFlow data is exported using UDP. Data can be lost due to overloaded segments between routers and collectors, overloading of collectors as benign traffic increases, the bursty nature of NetFlow traffic, or attacks in progress. Similarly, errors may happen in the process of sampling, transporting, and collecting. In order to address these problems, several methods have been proposed. Cohen et al. [43] proposed a framework for calculating confidence intervals to address the estimation errors in a multistage combination of sampling and aggregation. Trammell et al. [176] characterized, quantified, and corrected timing errors, which are a consequence of the Cisco NetFlow version 9 protocol design, which estimates the true base time from derived base time information. Rohmad et al. [154] proposed an enhanced NetFlow version 9 using nProbe GPL. Fioreze et al. [59] investigated the trustworthiness of NetFlow measurements and found that octets and packets are reliably reported, but that the reported flow durations of samples are shorter than the actual durations. Zhu et al. [205] studied the errors of utilized bandwidth measurement of NetFlow, and provided guidance for correctly estimating the utilized bandwidth. Finally, Ricciato et al. [153] described a methodology to estimate one-way packet loss from IPFIX or NetFlow flow records.

Table 3.1: Summary of Security Awareness and Intrusion Detection

Year  Ref.   Methodology                          Perspective
2001  [57]   Histogram and chart                  IDS
2001  [98]   Statistic                            DoS and DDoS
2004  [197]  Links between machines or domains    IDS
2004  [93]   Statistic patterns                   DoS and DDoS
2005  [50]   Host behavior based                  Worm outbreaks
2005  [51]   IP aggregation                       Detection and monitoring
2006  [152]  Flow aggregation                     IDS
2007  [151]  Trust and reputation model           IDS
2007  [49]   Flow signature and honeypot logs     Worm detection
2008  [37]   Heuristics                           Worm detection
2008  [204]  Statistic                            Anomaly detection
2009  [100]  Heuristics                           NAT detection
2009  [182]  Decision tree                        Dictionary attack
2009  [64]   K-means                              Behavior-based NAC
2009  [196]  Information theory                   Risk detection
2009  [201]  Statistic                            Top N detection
2009  [181]  Statistic                            Spam machines
2010  [180]  NBA                                  Malware
2010  [80]   Spatial-temporal aggregating         Malicious website detection
2011  [156]  Statistic of host behavior           Attack detection
2011  [89]   Dynamic entropy                      DoS
2011  [66]   Statistic                            DDoS and port scan
2011  [62]   Host behavior and PageRank           Botnets detection
2011  [168]  Time series                          IDS
2011  [63]   PageRank                             Botnets detection

3.2 Methodologies

In this section, we review various methodologies used to analyze NetFlow data. Figure 3.3 provides a chronological summary of the methodologies discussed in this section. As it can be observed, a considerable number of studies have focused on using machine learning algorithms and real time analysis.

3.2.1 Statistics

Statistical approaches are the most common methods in NetFlow analysis. In general, they are the basic step before applying heuristic-based approaches, machine learning, and visualization. NetFlow data contains statistics of network flow information generated and exported from routers. Duffield et al. [55] investigated the resource usage of NetFlow formation and exportation, as well as the inference of statistical properties of the original traffic from sampled traffic data. Proto et al. [147] proposed a statistical model for a network intrusion detection system. Sawaya et al. [156] proposed an approach to attack detection based on traffic flow statistics of hosts.

[Figure: number of studies per year, 1998 to 2012, broken down by methodology (statistic, ML, profiling, behavior-based, visualization, anonymization, system); vertical axis from 0 to 10]

Figure 3.3: Methods by Years

Barsamian [26] proposed a botnet detection method using statistical signatures. Liu et al. [29] proposed an analysis and monitoring system using NetFlow statistics, and an IDS based on variance similarity.

Compared to other approaches, statistical approaches are usually easier to implement, provide accurate results, and consume fewer resources. However, statistical approaches are good only for known cases and lack the ability to adapt to new cases.

3.2.2 Machine Learning

In this section, we provide a survey of machine learning approaches in NetFlow applications, which include traffic classification, anomaly detection, and security awareness.

Selecting an appropriate set of features for a specific problem is critical. Examples of features are shown in Tables 3.2 and 3.3, and are categorized as: (1) basic features, i.e., NetFlow data fields such as source and destination IP address and port, network interface, transport protocol, type of service, start and finish timestamps, cumulative TCP flags, number of bytes and packets transmitted, and MPLS labels; (2) derived features such as flow length (finish time - start time), average packet size (bytes / number of packets), average flow rate (bytes / length), average packet rate (number of packets / length), aggregation of IP subnet and traffic load bytes, percentage of traffic load on a node, and percentage of traffic load at the current sub-tree with time period and aggregation threshold [184, 186]; (3) application-specific heuristics such as those for webmail traffic [158], which has properties such as close service proximity, daily and weekly patterns, and duration of client sessions; and (4) advanced features such as abacus signature, degree distribution, self-similarity of flow interval, entropy, kernel function, mutual information and Hellinger distance [179], or data fusion with other log files such as Snort logs and DNS-related requests [23] (number of DNS requests, responses, normals, and anomalies for each host over a certain period of time). Methods for feature selection include symmetric uncertainty [84], information gain [169], subgroup, keyword selection, gradual reduction based on efficiency [116], and rough sets.
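The per-flow derived features in category (2) follow directly from the basic NetFlow fields. The sketch below computes the four ratios defined in the text; the dictionary keys `start`, `end`, `bytes`, and `packets` are hypothetical field names, and returning zero for empty or zero-length flows is a choice made for the example.

```python
def derived_features(flow):
    # flow: dict with start/end timestamps in seconds and the byte
    # and packet counters of a single flow record.
    length = flow["end"] - flow["start"]
    packets = flow["packets"]
    octets = flow["bytes"]
    return {
        # flow length = finish time - start time
        "flow_length": length,
        # average packet size = bytes / number of packets
        "avg_packet_size": octets / packets if packets else 0.0,
        # average flow rate = bytes / length
        "avg_flow_rate": octets / length if length else 0.0,
        # average packet rate = number of packets / length
        "avg_packet_rate": packets / length if length else 0.0,
    }


if __name__ == "__main__":
    record = {"start": 10.0, "end": 14.0, "bytes": 4000, "packets": 8}
    print(derived_features(record))
```

Single-packet flows often have identical start and finish timestamps, so the zero-length guard matters in practice; how such flows should be treated is a modeling decision left open here.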

The types of data sets and features being employed are very important for a successful machine learning approach. Typically, a large data set is necessary to cover the various relations in the data, including temporal and spatial relations. Training data has to be attack-free or attack-specific, both of which are difficult to obtain. The data sets in Table 3.3 can be categorized as (a) Internet backbone data covering more than one week, (b) Internet backbone data covering less than one week, (c) intranet data covering more than two weeks, and (d) simulated data or honeypot logs.

3.2.2.1 Application Classification

Nguyen et al. [139] surveyed the application of machine learning techniques for traffic classification from 2004 to 2007. Even though NetFlow was not specified as the analysis data set, the basic methodologies are applicable to NetFlow data. Kim et al. [92] conducted an evaluation of traffic classification using traces with collected payloads. Their evaluation included seven machine learning algorithms: Naive Bayes (NB), Naive Bayes Kernel Estimation (NBKE), Bayesian Network (BN), C4.5 Decision Tree (DT), k-Nearest Neighbors, Neural Networks, and Support Vector Machines (SVM). They concluded that SVM consistently achieved higher accuracy. Soysal et al. [166] conducted more specific evaluations and comparisons of BN, DT, and Multilayer Perceptrons on flow-based network traffic classification using flow trace data. They concluded that BN and DT are suitable for Internet traffic flow classification.

Nor et al. [129] evaluated a large number of machine learning algorithms in terms of their performance on NetFlow data with the objective of classifying HTTP, Gmail, and video streaming. The most accurate algorithms had an accuracy of more than 99.33%. Unfortunately, they did not provide information about the features used. Table 3.2 summarizes the algorithms, accuracy, features, and data types for traffic classification using NetFlow data. Since accuracy varies considerably, there is a need to evaluate these algorithms and features on the same data set.

3.2.2.2 Security Awareness and Anomaly Detection

Table 3.3 provides a summary of machine learning algorithms for anomaly detection in terms of algorithms, features, and research perspectives. The highest reported detection rate is 98% [192]. Sommer et al. [165] found that applying machine learning to network anomaly detection is harder than in other domains. This is mainly due to the great variety of traffic and the fundamental nature of machine learning approaches, which are better suited to finding similarities than to identifying relationships that are not present in the training data.

Table 3.2: Summary of Machine Learning Approaches of Network Application Classification

Year         Algorithm    Accu.(%)   Feature                    Application
2007 [84]    NBKE         91         Basic(a) and derived(b)    P2P, email, Multimedia
2009 [35]    DT           90         Basic                      P2P, VoIP, DNS, email, FTP
2010 [38]    Clustering   90         Application(c)             SNMP, email, DNS, IRC
2010 [155]   SVM          90         Advanced(d)                P2P
2010 [158]   SVM          94         Application                Webmail
2010 [25]    DT           90         Basic and derived          P2P, HTTP, VoIP, DNS, FTP, email, games
2011 [179]   SVM          70         Advanced                   P2P
2012 [114]   BN           95         Derived                    BULK, email, P2P

(a) Basic NetFlow data fields
(b) Calculation and aggregation of basic features
(c) Application specific properties from basic and derived features
(d) Abstract information from basic and derived features

Table 3.3: Summary of Machine Learning Approaches of Anomaly Detection

Year         Algorithm        Feature                   Dataset          Perspective
2005 [102]   Cluster          Advanced(a)               Internet         Anomaly
2007 [116]   Multiclass SVM   Advanced                  Internet         NSSA
2008 [187]   GA-based         Derived(b)                non-NetFlow(c)   DDoS
2010 [186]   Kernel           Derived                   Internet         Monitoring
2010 [169]   SVM              Derived                   Intranet         Masquerade
2011 [23]    SVM              Application(d)            non-NetFlow      Worm
2011 [184]   SVM              Derived                   Internet         Attacks
2011 [185]   SVM              Advanced                  Internet         Attacks
2011 [192]   SVM              Basic(e) and derived      non-NetFlow      IDS

(a) Abstract information from basic and derived features
(b) Calculation and aggregation of basic features
(c) Simulation or log data
(d) Application specific properties from basic and derived features
(e) Basic NetFlow data fields

3.2.3 Profiling

Network profiling is an important step for further analysis. Various profiling levels have been discussed in the literature including user, application, host, and network profiling.

• User profiling: There is limited work on user profiling based on NetFlow data. Melnikov et al. [125] proposed a set of correlations and distributions of user flow data related to time and packets to identify a user. Different user behavior based approaches have employed various features that are discussed in section 3.2.4. User profiling research, however, may provide helpful information for network security in the future.

• Application profiling: Liu and Huebner [115] discussed the stochastic characteristics of some of the most popular applications (i.e., FTP, HTTP, SNMP, NNTP, DNS, and Napster): flow length and time by probability density function and tail distribution, average packet size distribution, and average throughput distribution. Karagiannis et al. [87] proposed traffic patterns of social behavior, function (provider or consumer), and application ports that were used to classify traffic based on heuristic rules.

• Host profiling: Wei et al. [189] proposed an approach for Internet host profiling using a data structure that can be expressed in the XML-like format of Listing 3.1, where the communication similarity is the average of the Dice similarity values for the host. Kuai et al. [194] proposed an approach based on bipartite graphs to represent host communication and one-mode projection of bipartite graphs to capture the social-behavior similarity of end hosts, as in Figure 3.4. For networks with few hosts, we need more detailed information for further analysis. Minarik et al. [127] proposed host behavior profiling based on bi-directional NetFlow that uses communicating peers (number of servers contacted, clients answered, and single flows), amount of traffic (amount of requests, replies, and single flows), and traffic structure (number of client, server and single flows). Frias-Martinez et al. [64] defined a host behavior profile that contains seven features: the total number of flows, average flow size, average flow duration, total number of packets contained in all flows, average number of packets per flow, total number of unique IP addresses contained in all flows, and average packet size.

• Network profiling: Cho et al. proposed the Aguri tree [41], an aggregation-based traffic profile that aggregates small volume flows with a fixed number of nodes in an IP tree for spatial measurement. Jiang et al. [83] characterized network prefix-level traffic profiling as daily traffic volume, distributions (over time, direction, applications, and flow size), and ratio of upload-download. Lakhina et al. [103] described Origin-Destination flows using a routing metric, and further analyzed them using the Aguri tree to include time, features (i.e., source and destination address, source and destination port) and volume to represent both time and space attributes [102].
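The seven-feature host behavior profile described in the host profiling item above can be sketched as follows. This is a minimal illustration under stated assumptions: the Flow record and its field names are hypothetical, flows in either direction are attributed to the host, and the unique IP count includes the host's own address. The cited work may define these details differently.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    """Hypothetical flow record with only the fields needed for the profile."""
    src: str
    dst: str
    packets: int
    octets: int
    duration: float  # seconds


def host_profile(host, flows):
    """Compute the seven profile features for one host, or None if it has no flows."""
    own = [f for f in flows if host in (f.src, f.dst)]
    if not own:
        return None
    n = len(own)
    total_packets = sum(f.packets for f in own)
    total_octets = sum(f.octets for f in own)
    # Assumption: every address seen in the host's flows counts, including its own.
    unique_ips = {ip for f in own for ip in (f.src, f.dst)}
    return {
        "flow_count": n,
        "avg_flow_size": total_octets / n,
        "avg_flow_duration": sum(f.duration for f in own) / n,
        "total_packets": total_packets,
        "avg_packets_per_flow": total_packets / n,
        "unique_ips": len(unique_ips),
        "avg_packet_size": total_octets / total_packets if total_packets else 0.0,
    }


example_flows = [
    Flow("10.0.0.1", "10.0.0.2", 10, 1000, 2.0),
    Flow("10.0.0.3", "10.0.0.1", 30, 3000, 4.0),
    Flow("10.0.0.4", "10.0.0.5", 1, 40, 0.1),
]
print(host_profile("10.0.0.1", example_flows))
```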

Figure 3.4: Bipartite Graph (left) and One-mode Projection (right)
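The idea behind the one-mode projection can be sketched in a few lines. In this minimal illustration, hosts are linked in the projection when they contact at least one common destination, and a Dice similarity function compares the destination sets of two hosts. The contact sets below are hypothetical and do not reproduce the edges of Figure 3.4.

```python
from itertools import combinations


def one_mode_projection(contacts):
    """Project a host-to-destination bipartite graph onto the hosts.

    contacts maps each host to the set of destinations it contacted.
    Returns {(host_a, host_b): number_of_shared_destinations}.
    """
    edges = {}
    for a, b in combinations(sorted(contacts), 2):
        shared = len(contacts[a] & contacts[b])
        if shared:
            edges[(a, b)] = shared
    return edges


def dice_similarity(set_a, set_b):
    """Dice coefficient of two sets: 2|A and B| / (|A| + |B|)."""
    if not set_a and not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical bipartite graph: hosts A..E contacting destinations X, Y, Z.
contacts = {
    "A": {"X"},
    "B": {"X", "Y"},
    "C": {"Y"},
    "D": {"Y", "Z"},
    "E": {"Z"},
}
projection = one_mode_projection(contacts)
print(projection)
print(dice_similarity(contacts["A"], contacts["B"]))
```

Hosts that end up densely connected in the projection share many destinations, which is the kind of social-behavior similarity the projection is meant to expose.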

3.2.4 Behavior-based Approaches

Recently, behavior-based approaches to network security have received attention [69]. Compared to signature-based approaches, behavior-based approaches first learn normal behaviors, and then detect anomalies. This approach has been applied with many research perspectives: application classification, anomaly detection, zero day attack detection, network access control [64], and network design [163]. Types of behavior-based approaches include threshold-based, statistical, and learning-based. Levels of behavior-based approaches include ISP-based Internet backbone behavior [50, 194, 195], network behavior [50, 150, 181], user behavior [125], host behavior [50, 87, 120, 173, 180, 194] and application (or protocol) behavior [87, 106, 166].
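The two-step pattern described above (learn normal behavior, then detect deviations) can be illustrated with a simple statistical detector. This is a minimal sketch, not a method from the surveyed papers: the metric (flows per hour for one host), the history values, and the three-standard-deviation threshold are all assumptions chosen for illustration.

```python
import statistics


def learn_baseline(history):
    """Learn normal behavior as the mean and sample standard deviation of a metric."""
    return statistics.mean(history), statistics.stdev(history)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value that deviates from the baseline by more than `threshold` deviations."""
    mean, std = baseline
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold


# Hypothetical history: flows per hour observed for one host during normal operation.
history = [100, 110, 90, 105, 95]
baseline = learn_baseline(history)

print(is_anomalous(104, baseline))  # within the learned range
print(is_anomalous(400, baseline))  # far outside the learned range
```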

ip address
daily destination number
daily byte number
average TTL
port1, port2, ...
port1, port2, ...
destination address
daily byte num
daily connection num
average duration time
port1, port2, ...
destination address
daily byte num
daily packet number
port1, port2, ...
communication similarity

Listing 3.1: Internet Host Profile (Courtesy of [189])

3.2.5 Visualization

Network visualization provides interactive visual displays for exploration of network traffic. It is a challenging task to visualize a large amount of information and provide a sufficient level of detail to be meaningful and useful. Visualization can be at different levels of network abstraction (i.e., whole network, individual machine, and between whole network and individual machine), and described by different mechanisms (histogram, chart or glyph-based and 3D graph). Table 3.4 presents a chronological summary of related studies with their abstraction level, mechanism of data processing and visualization, and research issues. Most applications use statistics and chart methods, whereas few applications use advanced methodologies such as machine learning, graph theory and quad-tree. In terms of research perspectives, most of them focus on security detection while others provide network monitoring.

Besides the approaches summarized in Table 3.4, several other projects are worth mentioning. NFSen [16] is an open source, graphical web based front-end tool. It aggregates network traffic by protocols, direction or hosts using charts, and is used for network investigation. AURORA [4] is an IBM research project for traffic analysis and visualization. It was designed for large networks, supports multiple levels of abstraction, and uses charts and graphs to visualize traffic, anomaly detection or real time traffic flow. Finally, the Spinning Cube of Potential Doom [105] is a 3D display of network links for anomaly detection, visualized as a cube.

3.2.6 Anonymization

There is a need to anonymize NetFlow data to protect privacy when the data is shared among parties. There are several approaches for NetFlow specific anonymization. Slagell et al. [164] proposed an anonymization tool for sharing network logs for computer forensics.

Table 3.4: Summary of NetFlow Visualization Applications

Year         Abstract     Mechanism                                             Perspective
2000 [146]   Network      Aggregate traffic volume of protocols, chart          Network protocol and traffic amount
2001 [57]    Multiple     Histogram and chart                                   IDS
2004 [197]   Multiple     Links between machines or domains, graph              IDS
2004 [104]   Multiple     Activities of IP, histogram, glyph-based graph        Security situational awareness
2004 [24]    Multiple     Map between internal and external traffic, graph      Security
2004 [123]   Network      Aggregation based on port, chart and graph            Security event detection
2005 [51]    Network      IP aggregation of traffic bursts, chart & graph       Worm detection and backbone monitoring
2005 [144]   Network      Manifold learning, chart                              Monitoring, detection
2006 [140]   Individual   Extended the quad-tree, 3D                            Internet traffic navigation and playback
2006 [143]   Network      Network statistics of protocols, chart                Network statistics
2006 [152]   Multiple     Statistic, flow aggregation, chart & graph            IDS
2007 [120]   Multiple     Host behavior, Force-directed graph                   Host behavior
2008 [126]   Individual   Graph theory, graph                                   Network traffic
2008 [60]    Network      TreeMap with splines, chart and graph                 Network security monitor
2008 [173]   Multiple     Aggregate data per port, 3D graph                     Intrusive behavior
2009 [162]   Network      Based on Simple K-Means clustering, chart             Detect anomalies
2009 [42]    Network      Pattern of shape, graph                               Network attacks
2009 [174]   Multiple     Aggregate and Map, graph and chart                    Network monitoring
2009 [71]    Multiple     Aggregate, tree view, Geo-location, chart and graph   Network security
2012 [160]   Multiple     Sphere                                                Network traffic

Their tool can anonymize the common fields in multiple ways. Similarly, Foukaraki et al. [61] proposed an anonymization tool with flexible features and high performance.

3.2.7 Analysis Systems

As the traffic volume is very large, methodologies to improve performance of capturing, collection, and analysis are needed. There are three commonly used methods to reduce data size: aggregation, filtering, and sampling [52]. In the following, we will survey optimization, sampling, and distributed analysis systems.
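The three data reduction methods named above can be illustrated with a minimal sketch over a handful of flow records. The dictionary field names (src, dst_port, bytes) and the example values are hypothetical, and uniform random sampling is used only as the simplest possible sampling strategy.

```python
import random
from collections import defaultdict


def aggregate(flows, key):
    """Aggregation: collapse many flows into one byte total per key value."""
    totals = defaultdict(int)
    for flow in flows:
        totals[flow[key]] += flow["bytes"]
    return dict(totals)


def filter_flows(flows, predicate):
    """Filtering: keep only the flows that are relevant to the question asked."""
    return [f for f in flows if predicate(f)]


def sample_flows(flows, rate, rng):
    """Sampling: keep each flow independently with probability `rate`."""
    return [f for f in flows if rng.random() < rate]


# Hypothetical flow records.
flows = [
    {"src": "10.0.0.1", "dst_port": 53, "bytes": 120},
    {"src": "10.0.0.1", "dst_port": 80, "bytes": 5000},
    {"src": "10.0.0.2", "dst_port": 53, "bytes": 90},
]

print(aggregate(flows, "src"))
print(filter_flows(flows, lambda f: f["dst_port"] == 53))
print(len(sample_flows(flows, 0.5, random.Random(42))))
```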

3.2.7.1 Optimization

Optimization can be applied in many stages of the NetFlow analysis process: capturing, collecting and analyzing. Bouhtou and Klopfenstein [32] proposed mathematical models to select the NetFlow interfaces based on robust optimizations to deal with probabilistic constraints. Sagnol et al. [157] proposed a method of Successive c-Optimal Design to select NetFlow interfaces and find the optimal sampling rates. Hu et al. [81] proposed an entropy based adaptive flow aggregation algorithm to improve efficiency of storage and export, and improve the accuracy of legitimate flows. Zadnik et al. [198] proposed an architecture of a network flow monitoring adapter based on the hardware platform COMBO6, which is able to monitor one million simultaneous flows on a 2 Gbps link. Nagaraj et al. [138] proposed efficient aggregation techniques to speed up querying based on attributes and filter conditions of queries.

3.2.7.2 Sampling

Sampling network flows reduces the burden of handling massive volumes of flow data in collection, storage and analysis. Duffield [52] conducted a review of Internet measurement sampling in 2004, focusing on classical sampling methods, new applications and sampling methods, and application areas. In 2007, Haddadi et al. [75] revisited the issues of NetFlow sampling, focusing on data distortion and techniques for the compensation of data distortion.

Sampling methods, impact of sampling, integration of system-wide sampling, and recovering sampled data from distortion are addressed in the studies below. Duffield et al. [52, 54] developed a size-dependent sampling scheme suitable for billing purposes. Estan et al. [58] proposed an Adaptive NetFlow which dynamically adapts the sampling rate to achieve robustness without sacrificing accuracy. Brauckhoff et al. [33] evaluated the impact of sampling on anomaly detection metrics using flows with the Blaster worm, and found that entropy-based features are less affected. Barlet-Ros et al. [25] analyzed the impact of sampling on the accuracy of traffic classification using machine learning methods, and proposed a solution to reduce the impact. Cheng et al. [40] proposed a resource-efficient sampling system that combines three models: a pre-sampling model that records the estimated value rather than the measured value, a sampling and holding model that processes the sampled packets to update the cache, and a non-uniform sampling model that keeps the long flows in cache. Hao et al. [78] developed a sampling scheme based on sampling two-runs to improve time and memory efficiency. Han et al. [76] proposed a pFlours tool that fetches a packet and performs sampling to eliminate the synchronization problem during network traffic sampling. Duffield et al. [53] discussed trajectory sampling, methods to eliminate duplications, and methods to join incomplete trajectories. Sekar et al. [159] presented a system-wide approach that samples as a router primitive. To identify high-rate flows, Zhang et al. [203] developed two methods: a fixed sample size test which uses user specified accuracy, and a truncated sequential probability test through sequential sampling. Lee et al. [107] proposed a method for related sampling where flows from the same application session are given higher probability. Bartos et al. [27] proposed adaptive, feature-aware statistical sampling techniques to reduce the impact of sampling on anomaly detection.
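To make the idea of size-dependent sampling concrete, the sketch below keeps a flow of size x with probability min(1, x/z) for a threshold z and counts each kept flow as max(x, z) bytes, so that the expected estimated total equals the true total. This is a generic threshold-sampling illustration under those assumptions; the threshold value and flow sizes are hypothetical, and the cited schemes may differ in their details.

```python
import random


def sample_by_size(flow_sizes, threshold, rng):
    """Size-dependent sampling: large flows are always kept, small ones rarely.

    Returns (original_size, estimated_size) pairs for the kept flows. A kept flow
    smaller than the threshold is counted as `threshold` bytes, which compensates
    for the small flows that were dropped.
    """
    sampled = []
    for size in flow_sizes:
        keep_probability = min(1.0, size / threshold)
        if rng.random() < keep_probability:
            sampled.append((size, float(max(size, threshold))))
    return sampled


# Hypothetical workload: many small flows and one large flow.
sizes = [10] * 1000 + [5000]
kept = sample_by_size(sizes, threshold=1000, rng=random.Random(1))

true_total = sum(sizes)
estimated_total = sum(estimate for _, estimate in kept)
print(len(kept), true_total, estimated_total)
```

Because the large flow is always retained, the byte volume that matters most for billing is measured exactly, while the number of stored records drops sharply.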

3.2.7.3 Distributed Analysis System

More applications demand real time analysis, advanced detection and classification. Centralized analysis systems face difficulties of performance, scalability, and robustness. Although sampling provides an approach to reduce those burdens, there are tasks that cannot be based on sampled data. Distributed systems provide new mechanisms for capturing, accounting and monitoring [134]. Several distributed analysis systems are described below. Kitatuji et al. [95] proposed a real-time system with a bit-pattern based flow definition and a round-robin mechanism to balance packet streams. Sekar et al. [159] proposed cSamp, a monitoring tool based on a coordinating mechanism for flow sampling, hash-based packet selection, and workload distribution. Morariu et al. [135] proposed a distributed IP traffic analysis system. DiCAP [133], a flow capturing system, uses round-robin and a Distributed Hash Table (DHT) to distribute the workload and uses off-the-shelf hardware at network links. DIPStorage [131] is a distributed flow storage platform for IP flow records based on DHT. SCRIPT [132] is a distributed flow analysis framework that distributes flow records equally to multiple nodes. Others used peer-to-peer communication infrastructure [62, 67] and map-reduce for efficient computation. Recent studies use the existing Hadoop based clustering platform and the map-reduce framework. Lee et al. [109] proposed using Hadoop based map-reduce to process packet trace files. Francois et al. [62] proposed a botnet detection system based on Hadoop based clustering and PageRank. Morken [136] compared two map-reduce frameworks, Apache Hadoop and Nokia Disco, and concluded that Nokia Disco provides fast response time while Hadoop provides rich features, and that the map-reduce model is a very good approach for flow filtering and aggregation.
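The combination of hash-based workload distribution and map-reduce style aggregation mentioned above can be sketched in plain Python. This is an in-process illustration only: it does not use Hadoop, Disco, or any DHT implementation, and the node count and flow tuples are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_nodes):
    """Stable hash partitioning: the same key is always assigned to the same node."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def map_phase(flows, num_nodes):
    """Map: route each (source, bytes) record to the node responsible for its source."""
    shards = [defaultdict(list) for _ in range(num_nodes)]
    for src, nbytes in flows:
        shards[partition(src, num_nodes)][src].append(nbytes)
    return shards


def reduce_phase(shard):
    """Reduce: each node sums the byte counts of the sources it owns."""
    return {src: sum(values) for src, values in shard.items()}


def total_bytes_per_source(flows, num_nodes=3):
    """Run both phases and merge the per-node results."""
    result = {}
    for shard in map_phase(flows, num_nodes):
        result.update(reduce_phase(shard))
    return result


# Hypothetical (source address, bytes) flow records.
records = [("10.0.0.1", 100), ("10.0.0.2", 50), ("10.0.0.1", 25)]
print(total_bytes_per_source(records))
```

Since all records of one source land on the same node, each node can reduce its shard independently, which is what makes filtering and aggregation a natural fit for this model.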

3.3 Discussion

Even though a large body of research has focused on traffic flows, many issues remain open. In particular, NetFlow data analysis is challenging because of the difficulty in collecting real data, huge data sets with limited information, and lack of systematic methodologies. In this section, we discuss our view about datasets, research perspectives, methodologies, challenges, and possible future research directions.

3.3.1 Datasets

Because of privacy and other concerns, researchers lack effective traffic flow data sets. Simulated data and other log data have been used as alternatives. Even though some real data exists, it is either old or does not cover a large enough time period. Acquiring training data sets is another challenge for supervised machine learning. Moreover, there is no publicly available data for comparing different methodologies. Accurate analysis depends on real-time data collection, yet very few of the surveyed papers discuss a real time data collection solution [67].

Despite the popularity of sFlow and its wide deployment, few papers have focused on sFlow as their data source.

3.3.2 Research Perspectives

Current studies have covered most perspectives of network monitoring, measurement, and network security. Application of network flow data in network monitoring is more successful than in network security, while real time network security is in high demand for network management. Basic top N data is not enough to understand the current complex network security situations. More specific perspectives such as referring user identity will provide clear information for security and forensic purposes. New perspectives will probably come from network security because network security is becoming more important and challenging.

3.3.3 Methodologies

Heuristic approaches are easier to implement and seem more effective than machine learning approaches; however, practical experiences and findings are difficult to gain. Statistical approaches with heuristic methods give accurate results for known situations. For situations involving anomalies, more research is needed to develop advanced approaches that leverage information theory, machine learning and data mining. Much of the work has been limited to specific problems such as port-scan, DoS, or worms. A system that covers a wide range of network security situations is needed for network security administrators. Moreover, visualization needs to focus more on IT operations and provide helpful, easy to understand information. For machine learning approaches, feature selection is a very important step that needs to be specific to the problem. Currently, there is no study available for understanding and comparing the effect of feature selection in the context of NetFlow data. Integrating data from other IT infrastructures will provide more information. As there is no publicly available dataset for comparing different approaches, researchers use their own private data sets in their experiments.

3.3.4 Challenges

With the constantly changing nature of networks and with new applications and protocols being added to the Internet, network analysis will have to keep up with the speed of change. For example, IPv6 addresses can be randomly generated and may not be identified as a unique host or user. Since IPv6 over IPv4 packets can bypass firewalls [73], new approaches for IPv6 measurements are needed. New applications and protocols, faster Internet speeds with increased backbone bandwidth, and more complex content will make the analysis more difficult. In particular, cloud computing, which is based on moving content to cloud services, will make the analysis more complicated. In the following, we discuss specific challenges.

3.3.4.1 Feature Representation and Selection

Because NetFlow data only provides the header information, representing and selecting a set of appropriate features is challenging. For a specific task, the key questions are how to effectively represent and extract features, and how to select the right features for the problem. NetFlow version 9 provides additional information, and it would be important to leverage this new information effectively.

3.3.4.2 Real Time Analysis and Data Storage

Analysis results need to be available in real time or within some fairly short period of time as the traffic is flowing. Furthermore, data needs to be continuously stored for a certain amount of time for future needs. Real time data collection is a challenging task because of the data size and the nature of the network traffic. Real time analysis requires understanding the dynamic nature of network traffic. As David [190] pointed out, "that is the face of knowledge in the age of the Net: never fully settled, never fully written, never entirely done".

3.3.5 Future Directions

Despite significant work in the field, future research is needed to address the above-mentioned challenges.

3.3.5.1 Distributed Data Collection and Analysis

Real time analysis is in high demand in network security. Centralized analysis systems have difficulty dealing with huge data and real time analysis. Scalability and robustness are required to analyze data from multiple collectors. New technologies, such as Apache Hadoop related distributed data collection and analysis systems, open up more opportunities for re-thinking network traffic analysis. Distributed applications and the map-reduce model will provide more power and bring more insight and understanding.

3.3.5.2 Advanced Analysis Methodologies

Advanced methodologies using behavior-based features have the potential to mine helpful information. As Sommer et al. [165] pointed out, machine learning algorithms excel at finding similarity rather than at identifying anomalous behaviors. To make machine learning approaches more accurate and efficient, there is a need for better understanding of different types of features and heuristics for specific goals. In practice, selecting and understanding an effective set of features is challenging and labor-intensive.

3.3.5.3 Integration

Integrating with existing network infrastructures (e.g., IDS, firewall and VPN gateway), integrating with log file event activities, and integrating with host IDS (e.g., meta-events) all show a trend. NetFlow analysis can fill in the gap that IDS, firewall and host-based anti-virus tools leave. It can provide monitoring, reporting, security alerting, policy and configuration validation, assistance for forensic investigation, and a complementary approach for other network applications. Correlating with existing network infrastructures (e.g., a NIDS raises an alert for an attack and NetFlow data then validates the alert) can help remove false positives with high probability. Liu et al. [116] proposed a method using Snort logs and NetFlow data fusion with SVMs to create network security awareness. Integrated with other approaches (such as deep packet inspection), NetFlow-like approaches can provide a breadth-first approach for early investigation, and cover more hierarchies of network layers.
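The alert validation example above (a NIDS raises an alert and NetFlow data is used to confirm it) can be sketched as a simple join on addresses and time. This is a minimal illustration; the alert and flow field names, the 60 second window, and the example addresses are assumptions, and real alert formats such as Snort's carry more fields.

```python
def corroborating_flows(alert, flows, window=60):
    """Return flows between the alert's endpoints that start within `window` seconds."""
    return [
        f for f in flows
        if f["src"] == alert["src"]
        and f["dst"] == alert["dst"]
        and abs(f["start"] - alert["time"]) <= window
    ]


def validate_alert(alert, flows, window=60):
    """An alert is considered corroborated if at least one matching flow exists."""
    return len(corroborating_flows(alert, flows, window)) > 0


# Hypothetical alert and flow records (timestamps in seconds).
alert = {"src": "203.0.113.7", "dst": "10.0.0.5", "time": 1000}
flow_log = [
    {"src": "203.0.113.7", "dst": "10.0.0.5", "start": 990, "bytes": 4200},
    {"src": "203.0.113.7", "dst": "10.0.0.5", "start": 5000, "bytes": 60},
    {"src": "198.51.100.2", "dst": "10.0.0.5", "start": 1001, "bytes": 80},
]

print(validate_alert(alert, flow_log))
```

An alert with no corroborating flow is a candidate false positive and can be given lower priority during investigation.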

Chapter 4. The Big Data Security System Design

There are many network flow analysis tools, as surveyed in Chapter 3. As far as we know, and in contrast to existing systems, this design is the first network flow analysis system that is completely distributed and fully leverages big data technologies. In this chapter, I discuss the design approach, the components and features of the analysis system, and the experience I have gained.

4.1 Approach

The objective of this dissertation is to build a reliable distributed real time system for collecting, storing and analyzing network flow data, log data, and other network security related data. Achieving this objective requires:

• Scalability: The system needs to be scalable to meet future requirements for processing and analyzing large volumes of data, including network flow, data from the Simple Network Management Protocol (SNMP), security event logs, and others. More hardware and software components can be added in the future without downtime or rebuilding the system. The system uses clusters with multiple nodes, multiple processes, and multiple threads for distributed and parallel computing.

• Real Time Continuous Monitoring and Interactive Visualization: Traditional security monitoring systems cannot process this large amount of data in real time. Real time continuous monitoring is the future of network security monitoring and analysis. The National Institute of Standards and Technology (NIST) uses the term Continuous Monitoring to emphasize the ongoing monitoring and awareness of network security [10].

Interactive visualization is related to real time monitoring and requires responding to human input in a timely manner to illustrate the network security situation.

• Intelligent Analysis: The system needs to support more intelligent analysis. In particular, the system models network objects based on host profiling and user behavior to provide a platform for network security analysis, including classifying hosts and identifying a particular user among the other users.

The overall philosophy of the approach is

• Rather than looking for an unknown and constantly changing object of malware or asking an unclear question of normal or anomalous, I focus on specific objects, i.e., network hosts and users, and ask specific questions such as: "what is the role of the network object and does the object behave according to its role?"

• The system leverages the big data technologies to collect, store, and analyze these data.
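The role-based question above ("does the object behave according to its role?") can be illustrated with a deliberately simple check of the services a host is observed to offer against the services its role allows. The role names and port sets are hypothetical, and this rule-based sketch is not the classification method developed later in this dissertation.

```python
# Hypothetical mapping from a host role to the service ports expected for that role.
ROLE_SERVICE_PORTS = {
    "web_server": {80, 443},
    "dns_server": {53},
}


def unexpected_services(role, observed_ports):
    """Return the observed service ports that the given role does not account for."""
    expected = ROLE_SERVICE_PORTS.get(role, set())
    return sorted(set(observed_ports) - expected)


def behaves_as_role(role, observed_ports):
    """A host behaves according to its role if it offers no unexpected service."""
    return not unexpected_services(role, observed_ports)


print(behaves_as_role("web_server", [80, 443]))
print(unexpected_services("web_server", [80, 443, 6667]))
```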

4.2 Components of the Security Analysis System

As shown in Figure 4.1, the system is multi-layered. Because the layers are loosely coupled, a change in one component does not affect the other components. Similarly, new models can be added without affecting other modules. The system provides real time continuous network security monitoring and interactive visualization through web user interfaces, near real-time network measurement, and advanced analysis.

[Figure 4.1 layers, top to bottom: Data Sources; Collecting and Aggregating (Collector, Producers, Brokers, Consumers); Real Time Process (Spouts and Bolts, including Top N, Statistic, Online Classification, and Update Model); NoSQL Storage; Gateway and Data Gateway; Administrator & Use Case, Web User, and Off-line Analysis]

Figure 4.1: The Big Data Security System Architecture Design

4.2.1 Data Collection

I experimented with three distributed data collection systems: Flume, Flume-ng, and Kafka. The big data community is very active in innovation and moves quickly into new application domains, as there is strong demand for big data collection and processing.

I experimented with Flume first. It supports plug-in sources and sinks, and provides web user interfaces for configuration. Unfortunately, it offers very few configurable features. A Flume client can have only one channel. However, we have many data sources, and one data channel per Flume client is not sufficient. I then experimented with Flume-ng, the next generation of Flume, which provides more configurable features and multiple channels per client. Flume-ng provides the mechanisms for our need of multiple data channels and configuration.

Lastly, I experimented with Kafka [99]. Kafka is considered more stable since it has been used in production in large scale dynamic systems such as LinkedIn. Compared with Flume and Flume-ng, Kafka is more efficient in memory management and supports multiple source producers, brokers, and consumers. Kafka also provides the ability for real time analysis. I consider Kafka the right solution for data collection in this dissertation.

4.2.2 Data Storage

Cassandra is the NoSQL data store chosen for storing sFlow, NetFlow, and security event logs in this system. We selected Cassandra over other NoSQL solutions because:

• It is scalable, with clusters ranging from several nodes to hundreds.

• It has built-in index mechanisms: the primary index, the secondary index, and the composite column.

• It is optimized for fast writing and reading.

• There are many client programming interfaces, tools, and third party index tools.

These Cassandra features support our goals of data storage and data schema with:

• Query by source or destination IP

• Query by time, protocol, source or destination port number

• Top network consumers (in terms of packet size)

• Top talkers (in terms of connection count)

• On-line detection of port scanning

• Storing data for future off-line advanced analysis and scientific research
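Three of the goals listed above (top network consumers, top talkers, and port scan detection) can be expressed as small computations over flow records. The sketch below runs in memory on hypothetical records and is independent of Cassandra; the field names, the interpretation of consumption as total bytes, and the port threshold are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def top_consumers(flows, n):
    """Top sources ranked by total bytes sent."""
    volume = Counter()
    for f in flows:
        volume[f["src"]] += f["bytes"]
    return volume.most_common(n)


def top_talkers(flows, n):
    """Top sources ranked by number of connections (flows)."""
    return Counter(f["src"] for f in flows).most_common(n)


def port_scanners(flows, min_ports):
    """Sources that touched at least `min_ports` distinct ports on a single destination."""
    ports = defaultdict(set)
    for f in flows:
        ports[(f["src"], f["dst"])].add(f["dst_port"])
    return sorted({src for (src, _), seen in ports.items() if len(seen) >= min_ports})


# Hypothetical records: one host probing 20 ports, one host sending a large transfer.
flow_records = [
    {"src": "10.0.0.9", "dst": "10.0.0.1", "dst_port": port, "bytes": 40}
    for port in range(1, 21)
]
flow_records.append({"src": "10.0.0.2", "dst": "10.0.0.1", "dst_port": 80, "bytes": 5000})

print(top_consumers(flow_records, 1))
print(top_talkers(flow_records, 1))
print(port_scanners(flow_records, min_ports=10))
```

In a deployed system, counters like these would be maintained incrementally as flows arrive rather than recomputed over the full record set.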

The experience with Cassandra has not been very pleasant. We have gone through several release versions and re-organized the cluster structure to meet the Cassandra hardware requirements. At this moment, the system uses Cassandra version 1.2.5, which is more stable than the previous versions used in this system. Cassandra is a complex NoSQL solution, and it may take years to learn the best practices of running Cassandra clusters.

4.2.3 Security Gateway

Protecting the big data is a critical challenge. There are many security mechanisms inside the data clusters: node validation, client validation, data encryption, key management, and more. It is also important to protect the data from external adversaries, and the security gateway protects the big data from outside access.

4.2.4 Data Processing

Network flow data are processed at three levels: real time data aggregation in Kafka data collection, real time detection in Storm, and off-line advanced data analysis, as shown in Figure 4.1. Kafka works as data collector and basic counter aggregator. Storm provides the online real time computation of tasks that need more computational resources, such as network object online classification and identification. Finally, off-line advanced data processing is provided by Pig, R, Java, and C++ models.

4.2.5 User Interfaces

There are two types of user interfaces in the system. The data gateway is for administrators and for off-line advanced network object modeling using Pig, R, Java, C++, CQL, and Cassandra console command lines. The web interfaces are for network security administrators who are not necessarily familiar with these tools or systems. The web interfaces provide real time queries of the network status, network security awareness, and interactive visualization.

4.3 Features

In this section, I discuss the features that this system supports.

4.3.1 Real Time Continuous Network Security Monitoring and Interactive Visualization

Network security administrators want to know what is happening in the network continuously and in real time, and to be aware of any security events. There are two categories of network monitoring. One is normal network status such as trends of protocol usage, network throughput, top conversations, and top usage of sub-networks. The other is hacking activities such as port scanning and operating system fingerprinting. For better understanding and analysis, this information needs to be visualized in an interactive way. The details of real time continuous network security monitoring and interactive visualization are discussed in Chapter 5.

4.3.2 Network Measurement

There is a need to measure the network for statistical results or reports that do not exist in the data storage. One example is our measurement of anonymity network usage on the Internet. The details of the campus anonymity network usage measurement are discussed in Chapter 6.

4.3.3 Advanced Network Modeling

As network security becomes more complicated, advanced methods and modeling are applied to analyze the network. For instance, role-based network object classification and user identification provide a new approach to detecting malicious activities such as intrusion; they are discussed in Chapter 7.

4.4 Discussion

The general trend of network flow analysis is toward distributed systems. Big data technologies provide not just the computing power of distributed analysis but also methods to handle the large volume of data in real time. In this chapter, I present a system design that fully leverages the open source big data projects Kafka, Storm, and Cassandra. The system supports real time and continuous network flow analysis for network security. Compared with the systems discussed in Chapter 3, it is a complete network flow analysis system, more scalable, and low cost due to its use of open source projects. The system components are relatively independent, so the system is easy to adapt to different goals. With additional nodes, the system can be scaled to handle more data for bigger organizations.

Chapter 5. Real Time Continuous Network Monitoring and Interactive Visualization

Neither network monitoring nor interactive visualization is new. However, it is challenging to monitor network data continuously in real time and to visualize the information interactively in a web browser for network security purposes. Many factors drive the need for real time continuous network monitoring; above all, it is critical for network security administrators to have the information continuously and in real time. Traditional data warehouse and data processing approaches cannot access a large amount of data in real time. The big data security analysis system designed in Chapter 4 makes this feasible.

In this chapter, I present how real time continuous network monitoring and interactive visualization are achieved for this large volume of security data in a web user interface. I focus on methodology and theory rather than on polished user interfaces and rich features, which can be added at development time. I demonstrate: (1) querying the campus network host status continuously in real time, (2) monitoring the network trends in real time, and (3) visualizing this information interactively in real time. For security and privacy reasons, real identities are replaced by text descriptions in the following web user interfaces.

5.1 Real Time Network Host Querying

The network flow information of each host is sliced by time, such as hourly or daily, and is stored in the Cassandra data storage with the IP address as the key. Cassandra stores the whole row in one file on one disk, indexed by its key, so accessing this information is a direct O(1) lookup. It is still O(1) when the query spans several time slices. Querying a host across a large number of time slices is rare, but it is the limitation of this design, so it is important to choose a time slice that fits the most frequently used scenarios.

Figure 5.1 shows that querying 24 hours of network flow information for a host took 56 milliseconds overall.
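The time-sliced lookup described above can be sketched as follows. This is only an illustration under stated assumptions: an in-memory dictionary stands in for the Cassandra data storage, the slice is hourly, and the row key is taken to be the pair (IP address, slice). The names `slice_key`, `store_flow`, and `query_host` are hypothetical and are not part of the system.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Stand-in for the Cassandra store (assumption): one row per (host IP, hourly slice).
# A dictionary lookup by key is O(1) on average, mirroring keyed row access.
rows = defaultdict(list)

def slice_key(ip, ts):
    """Build the row key from the host IP and the hour that contains ts."""
    return (ip, ts.strftime("%Y-%m-%d-%H"))

def store_flow(ip, ts, flow):
    """Append one flow record to the row of its host and time slice."""
    rows[slice_key(ip, ts)].append(flow)

def query_host(ip, start, hours):
    """Fetch the flows of one host across `hours` consecutive hourly slices.

    Each slice costs one keyed lookup, so the cost grows with the number
    of slices requested, not with the size of the whole data set.
    """
    result = []
    for h in range(hours):
        key = slice_key(ip, start + timedelta(hours=h))
        result.extend(rows.get(key, []))
    return result

# Hypothetical sample records.
t0 = datetime(2012, 9, 30, 0, 0)
store_flow("10.0.0.5", t0, {"dst": "10.0.0.9", "bytes": 1200})
store_flow("10.0.0.5", t0 + timedelta(hours=3), {"dst": "10.0.0.7", "bytes": 300})
store_flow("10.0.0.6", t0, {"dst": "10.0.0.5", "bytes": 50})

day = query_host("10.0.0.5", t0, 24)
print(len(day), "flows found for the host in 24 hourly slices")
```

A 24-hour query touches 24 keys here; with a daily slice it would touch one, which illustrates why the slice size should match the most common query.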

Once accessed in real time, the network flow information can be presented at many levels. Figure 5.1 presents an example with three levels: raw data, statistics, and network. Raw data are used by experienced network security administrators to investigate detailed network flow information. The statistical summary provides brief information and is computed at the web server before being presented in the web browser. The summary items are configurable; Listing 5.1 is an example of the items in a summary. The network graph allows users to interactively investigate and track user connections with other hosts. Figure 5.2 shows the network for the target, its sources, and its destinations, which is the most important information for security administrators. It provides drag-and-drop interaction so that the user can continuously trace the network. Interactive network visualization is discussed in more detail in Section 5.3.

Object ID
Object Type
Defined Roles
Classified Roles
Distinguished Destination Number
Most Frequent Source
Most Frequent Destination
Upload Byte
Download Byte
TCP Destination Ports
TCP Source Ports
UDP Destination Ports
UDP Source Ports
Alert

Listing 5.1: An Example of Items in Object Summary

5.2 Real Time Continuous Network Monitoring

Real time continuous network monitoring includes monitoring network status by network flow, significant network traffic, and top N trends. These results are computed in real time in the Kafka and Storm clusters, as shown in Figure 4.1, and are then stored in the Cassandra data storage keyed by time series. The client requests the information by time series and presents it to users. Similarly, the web server continuously retrieves the information by its time series key and returns it to the web browser. Because results are pre-computed in real time and indexed by time series, retrieving the information is O(1).

Figure 5.1: Real Time Host Traffic Query

5.2.1 Network Flow Status

Figure 5.3 demonstrates the real time continuous network status: time, flow number, bytes, top port, packets by protocol (i.e., ICMP, TCP, UDP, and others), total packets, total bytes, and missed flows. Users can interact with the graph by placing the mouse over a point of interest to view more detail about that point. By placing the mouse over a TCP or UDP traffic point, users can view the top N details by port number, flow count, bytes, and packets. Users can choose to poll data continuously to view live updates. If users do not give a specific time, the interface polls the current real time data and updates it continuously.

Figure 5.2: Analyzing Network Targets

5.2.2 Top N Conversations

Figure 5.4 demonstrates the top N hosts of internal and external network flow conversations. Users can view the top N hosts by traffic frequency (i.e., flow number) or by resource (i.e., bytes). By clicking on a target of interest, users can investigate who connected with the target and what network applications the conversation used.
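The two rankings mentioned above, by flow number and by bytes, can be illustrated with a short sketch. It assumes flow records are dictionaries with `src`, `dst`, and `bytes` fields and that both endpoints of a conversation are credited; the function name `top_n_hosts` and the sample records are hypothetical, and the system itself computes these results in the Kafka and Storm clusters rather than in a single process.

```python
from collections import Counter

def top_n_hosts(flows, n, by="flows"):
    """Rank hosts by flow count (by="flows") or by total bytes (by="bytes")."""
    if by not in ("flows", "bytes"):
        raise ValueError("by must be 'flows' or 'bytes'")
    totals = Counter()
    for f in flows:
        weight = 1 if by == "flows" else f["bytes"]
        # Both endpoints take part in the conversation (assumption).
        totals[f["src"]] += weight
        totals[f["dst"]] += weight
    return totals.most_common(n)

# Hypothetical sample flows.
flows = [
    {"src": "a", "dst": "b", "bytes": 100},
    {"src": "a", "dst": "c", "bytes": 200},
    {"src": "d", "dst": "a", "bytes": 5000},
    {"src": "b", "dst": "c", "bytes": 50},
]

print(top_n_hosts(flows, 1, by="flows"))
print(top_n_hosts(flows, 2, by="bytes"))
```

The two rankings can differ: a host with few but large flows ranks high by bytes and low by flow number.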

5.3 Interactive Network Security Awareness Visualization

It is desirable for network security administrators to interact with these network flows and security information by keyboard, mouse hover-over, or mouse click. Recent HTML 5, AJAX, browser, and graphics library technologies make interactive data visualization feasible in the browser. Interactive visualization requires retrieving data in real time, which becomes challenging with a large amount of data such as network flow and security data. In this chapter, I demonstrate in Figures 5.2, 5.3, and 5.4 that users can interact with the network security information continuously and in real time.

Figure 5.3: Real Time Network Flow Status Monitor

Figure 5.4: Real Time Network Flow Top N Status

5.4 Discussion

This chapter demonstrates how real time and continuous network security monitoring is supported in the system, together with methods of interactive visualization. Compared with the systems discussed in Chapter 3, the focus of this system is on monitoring network security and hosts continuously and interactively. More approaches to intuitive and interactive visualization of large scale data are needed. Moreover, feedback from security administrators would improve the intuitiveness of the visualization.

Chapter 6. A Case Study of Network Flow Measurement

Traditionally, network flow measurement has focused on application and network measurement. Recently, network flow has been used as a sample to measure contents of the whole network, such as social networks and cloud storage. In this chapter, I present a case study that measures the usage of anonymity technologies on the campus Internet.

6.1 Usage of Anonymity Network

In this section, we investigate the usage of anonymity technology from the application perspective. For this, we compared the observed IP addresses of applications to the collected IP addresses of the anonymity servers. Table 6.1 provides an overview of the anonymity systems we considered.¹ We identified the originating countries of these IP addresses from the IP address geo-location database provided by ipinfodb.com.
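The comparison of observed IP addresses with collected anonymity server addresses amounts to a set intersection per anonymity system. The following is a minimal sketch of that idea, not the Pig Latin program used in the study; the function name `match_anonymity_ips` is hypothetical and the addresses are made-up samples from documentation ranges.

```python
def match_anonymity_ips(observed_ips, known_servers):
    """Map each anonymity system to the observed IPs found in its server list."""
    observed = set(observed_ips)
    matches = {}
    for system, server_ips in known_servers.items():
        matches[system] = observed & set(server_ips)
    return matches

# Hypothetical sample data (documentation address ranges).
known_servers = {
    "Tor": ["192.0.2.10", "192.0.2.11"],
    "Proxies": ["198.51.100.7"],
}
observed_ips = ["192.0.2.10", "203.0.113.5", "198.51.100.7", "192.0.2.10"]

matches = match_anonymity_ips(observed_ips, known_servers)
for system, ips in matches.items():
    print(system, len(ips), "of", len(known_servers[system]), "known servers observed")
```

Using sets removes duplicate observations of the same address, so each anonymity server is counted once regardless of how many sampled packets it appears in.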

Table 6.1: Analyzed Anonymity Systems

Network       Servers                  Service
Tor           61,798                   General
I2P           2,267                    peer-to-peer
JAP           44                       General
Remailers     52                       email
Proxies       7,246                    General
Commercial    Anonymizer, GoTrusted    General

¹ These data were a combined effort of the authors of the published paper "An Overview of Anonymity Technology Usage" [111].

[Figure 6.1 plot: horizontal bar chart of the number of IPs per country on a logarithmic axis (1, 10, 100, 1000). Bars: US, Germany, China, Netherlands, Russia, UK, Brazil, Canada, Sweden, Switzerland, Four, Three, Two, One.
One instance: Bahamas, Belarus, Belgium, Bulgaria, Cambodia, Chile, Colombia, Estonia, Ghana, Greece, Hungary, Ireland, Israel, Jamaica, Jordan, Korea, Mongolia, Namibia, Nigeria, Pakistan, Panama, Philippines, Slovakia, Turkey, Ukraine, Vietnam, Zimbabwe.
Two instances: Chad, Czech Republic, Denmark, Hong Kong, Iran, Japan, Kazakhstan, Poland, Romania, Spain.
Three instances: Austria, France, Singapore.
Four instances: Australia, Indonesia, Taiwan, Thailand.]

Figure 6.1: Geo-Location of Anonymity Usage on Campus (sFlow Data)

6.1.1 Campus Network Traffic Flows

Network traffic data provide information about network usage, including application protocols, users, traffic accounting, and profiling. We are interested in the anonymity technology usage on our campus. Traditionally, network flow data processing is challenging because of the huge volume of data. Hence, sFlow samples the traffic to monitor the network at a higher performance [19]. I use the Hadoop platform [9] and Pig Latin [142] as a distributed map-reduce computing platform to process these big data.

Table 6.2: Usage of Anonymity Systems based on sFlow Data

System        Packets^a (%)     Traffic (MB %)     Observed IPs^b (%)
Proxies       5,580 (62.65)     8.13 (43.53)       234 (3.229)
Tor           3,129 (35.13)     9.04 (48.37)       152 (0.246)
I2P           190 (2.13)        1.50 (8.02)        23 (1.014)
Commercial    7 (0.08)          0.016 (0.08)       2 (not-available)
Total         8,906 (100)       18.69 (100)        411

a The sampling rate is 1/8192.
b Number of anonymity IPs observed (percentage of known IPs).

We collected one month of data from two main gateways on our campus, from September 5 to October 5, 2011, using the InMotion sFlow toolkit [11]. We analyzed the data stored in a Cassandra cluster by creating a mapreduce program and using the Pig Latin script language [142]. Our sample logged 205 million packets with a total data size of 1.44 TB. The sampling rate is 1/8192, i.e., one packet is sampled out of every 8192 received. Figure 6.1 presents the usage distribution and geo-location distribution of the identified anonymity technologies and servers for the sFlow data.² The results are consistent with other results indicating that proxies and Tor are widely used. Table 6.2 provides the overall usage of anonymity technologies and Table 6.3 provides the detailed application usage.

Table 6.2 indicates that proxies were the most utilized anonymity technology on campus, where 62.65% of logged anonymizer packets (i.e., 3.23% of observed ∼173M IPs) belonged to a proxy system. Tor is the second most frequently used in terms of number of packets (i.e., 35.13% of anonymizer packets) and has the largest traffic volume (i.e., 48.37% of anonymous bandwidth). Additionally, we observed some traffic from I2P and commercial anonymizer systems.

Even though port-based application classification is not accurate, it provides some valuable information about the anonymity usage, as sFlow did not contain all application level packet headers and payload. Table 6.3 provides the port numbers of the packets that were from/to an anonymity network. We observe that Tor flows have the largest number of unique ports. Additionally, port 9001 has been observed very often (i.e., 35.6% of the packets and 19.6% of the bytes) as it is the Tor default port. Even though 9001/tcp is registered by ETL Service Manager at IANA [22], we believe the traffic we observe is Tor traffic related to directory fetches or internal communications. Among I2P packets, we observe non-common port numbers (i.e., 43.16% of I2P packets and 51.44% of I2P bandwidth), as I2P selects a random port for its configuration. Finally, across different anonymizer systems, port 80 (i.e., HTTP) and 443 (i.e., SSL) are seen very often.

² As sFlow data is used in the analysis, we are more likely to miss shorter flows that are typical of Tor communications [111].
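The percentages quoted from Table 6.2 can be re-derived from the counts in Tables 6.1 and 6.2. The short check below is a worked example, not part of the original analysis; it assumes that the 2,267-server peer-to-peer row of Table 6.1 is I2P, and the variable names are illustrative.

```python
# Known server counts from Table 6.1 (assumption: the 2,267 peer-to-peer row is I2P).
known_servers = {"Proxies": 7246, "Tor": 61798, "I2P": 2267}

# Sampled packet counts and observed anonymity IPs from Table 6.2.
packets = {"Proxies": 5580, "Tor": 3129, "I2P": 190, "Commercial": 7}
observed_ips = {"Proxies": 234, "Tor": 152, "I2P": 23}

total_packets = sum(packets.values())

# Share of anonymizer packets per system, in percent.
packet_share = {
    name: round(100 * count / total_packets, 2) for name, count in packets.items()
}

# Observed IPs as a percentage of the known IPs of each system.
ip_share = {
    name: round(100 * observed_ips[name] / known_servers[name], 3)
    for name in observed_ips
}

print("total packets:", total_packets)
print("packet share (%):", packet_share)
print("observed share of known IPs (%):", ip_share)
```

The packet shares agree with the Packets column of Table 6.2, and the IP shares agree with the Observed IPs column up to rounding in the last digit.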

6.2 Related Work

There have been many studies on anonymity services and anonymity systems [8, 45, 56, 90], while some studies have analyzed anonymity technology usage [36, 82, 110, 117, 119, 122, 137]. Most of the anonymity usage studies focused on Tor due to its popularity. Studying anonymity network usage is difficult due to the nature of anonymity networks and the ethical and legal issues related to privacy. An important step in such a measurement study is ensuring the privacy of users and minimizing stored information. Moreover, as analysis results depend on the data collection mechanism, measurements, in general, should be performed over multiple geo-diverse network domains for a sufficiently long time.

Loesing et al. provided guidelines for a statistical analysis of the Tor data focusing on the countries of connected clients and exit traffic by port [118]. Pointing to privacy issues, the authors derived guidelines for measuring sensitive data in anonymity networks. Moreover, they pointed to interesting cases such as the increase in Tor usage by the Iranian IP space in June 2009 after the Iranian elections, and Tor blocking by China and the consequent increase in bridge usage by Chinese IP addresses. Earlier studies on Tor have: used controlled Tor exit nodes [119, 122]; analyzed exit node traffic using OpenDPI [36, 110]; analyzed the network using the TorStatus script tns_update.pl [137]; collected URLs of HTTP traffic using the urlsnarf tool [82]; and collected traffic data from a controlled network [110]. Likewise, to analyze the I2P network, Timpanaro et al. designed approaches to retrieve information from distributed hash tables [175].

Below, we briefly survey earlier anonymity usage studies by data and analysis methods, and by the usage aspects of the anonymity networks they studied.

• Application: McCoy et al. analyzed application-level protocols via controlled Tor exit nodes [122]. According to their measurements, interactive protocols such as HTTP made up 92% of the connections and consumed 58% of bandwidth. Similarly, SSL consumed 4% of the connections and 1.55% of bandwidth, and BitTorrent accounted for 3.3% of the connections and 40% of bandwidth. Chaabane et al. reported similar results that indicated HTTP, SSL, and BitTorrent as the top 3 consumers of connections and bandwidth at their exit nodes [36]. Li et al. reported similar results regarding HTTP and SSL, but BitTorrent was no longer a popular application [110]. Timpanaro et al. also reported I2P application usage, where web surfing and file-sharing were popular [175].

• Geo-location: Geo-location distribution of clients and servers is the most investigated aspect of anonymizer networks [36, 110, 118, 119, 122]. China, Germany, and the United States are the top 3 countries for Tor servers, while slightly different usage is observed for Tor clients [36, 110, 117, 122, 137]. Manils et al. [119] reported clients of BitTorrent on top of Tor per country and per AS.

• HTTP Usage: Chaabane et al. [36] and Huber et al. [82] analyzed HTTP usage of Tor to inspect content types and web categories. Chaabane et al. reported that content types included images (31.7%), text/HTML (27.9%), applications (18%), flash (11.11%), videos (2.4%), and others (8.89%). The HTTP contents ranked as follows: search engines/portals, pornography, computers/Internet, social networking, blogs/web communications, streaming media/MP3, software downloads, hacking, political, illegal/questionable, and illegal/drugs. Moreover, Huber et al. reported similar details about URL ratios and exchanged file formats.

• P2P Usage: Chaabane et al. [36] reported that most BitTorrent users (73%) used Tor to hide from trackers and not to distribute content, and that most downloaded contents were copyright protected. Manils et al. [119] reported similar details regarding the proportion of users connected to a tracker and distributing content.

• Malicious Routers: McCoy et al. [122] developed a method to detect malicious logging at Tor exit nodes. Huber et al. [82] further discussed Tor exit server attack scenarios and provided a solution to avoid these attacks.

• Misbehaving Clients: McCoy et al. [122] reported misbehaving clients of Tor involved in distributing illegal contents, hacking, IRC bots, etc. Chaabane et al. [36] also reported misuse of the Tor network by exploiting the Tor exit nodes as 1-hop SOCKS proxies.

• Statistics: The Tor network status web site, metrics.torproject.org, gives a historical view of the Tor network. Loesing [117] observed the Tor relay server trends (flags, versions, dynamic IP, and bandwidth by country) by analyzing the public directory information. Mulazzani et al. [137] reported information and patterns of overall network size, total numbers of servers, and geo-location per country.

Table 6.3: Application Usage of Anonymity Systems based on sFlow Data

Rows, Port (Protocol)^a: Total, Tor^c - 9001, Other^b, SSH - 22, DNS - 53, SSL - 443, NTP - 123, HTTP - 80, SMTP - 25, DLS - 2047, Gopher - 70, Ipcd3 - 1209, MS-DS - 445, Acp-port - 2071, Seagull-ais - 1208, Ideafarm-door - 902, Rjcdb-vcards - 9208

Total row:

System        Packet (%)       Traffic (byte %)
I2P           190 (100)        196,287 (100)
Tor           3,129 (100)      1,184,532 (100)
Proxies       5,580 (100)      1,066,141 (100)
Commercial    7 (100)          2,055 (100)

Per-port entries (the row of each entry is not recoverable, except for Tor port 9001 with 1,114 (35.60) packets and 232,515 (19.63) bytes, and I2P Other with 82 (43.16) packets and 100,965 (51.44) bytes):

I2P, Packet (%): 1 (0.52), 22 (11.58), 54 (28.42), 31 (16.32), 82 (43.16)
I2P, Traffic (byte %): 64 (0.0032), 16,603 (8.46), 40,588 (20.68), 38,067 (19.39), 100,965 (51.44)
Tor, Packet (%): 2 (0.06), 1 (0.03), 4 (0.13), 2 (0.06), 16 (0.51), 12 (0.38), 10 (0.32), 11 (0.35), 12 (0.38), 19 (0.61), 109 (3.48), 552 (17.64), 437 (13.97), 335 (10.71), 493 (15.76), 1,114 (35.60)
Tor, Traffic (byte %): 82 (0.0), 36 (0.30), 246 (0.02), 164 (0.01), 1,128 (0.10), 3,150 (0.27), 9,370 (0.79), 8,568 (0.72), 12,765 (1.08), 12,418 (1.04), 15,766 (1.33), 108,192 (9.13), 124,776 (10.53), 278,293 (23.49), 232,515 (19.63), 373,533 (31.53)
Proxies, Packet (%): 81 (1.45), 44 (0.79), 30 (0.54), 790 (14.16), 4,635 (83.06)
Proxies, Traffic (byte %): 7,547 (0.71), 3,469 (0.33), 3,167 (0.30), 887,505 (83.24), 164,453 (15.42)
Commercial, Packet (%): 7 (100)
Commercial, Traffic (byte %): 2,055 (100)

a The port number and protocol are based on [22].
b The port number was not registered.
c The Tor default port number is 9001, whereas 9001 is assigned as Etlservicemgr by IANA [22].

Chapter 7. Classification and Identification of Network Objects

Network objects refer to users and hosts in a local area network as defined in Table 7.1.

Classification and identification of internal network objects are critical for securing a network, and are valuable for building behavior-based intrusion detection, botnet detection, anti-virus, and security policy violation detection, and for building host fingerprints to identify a host or its user for forensic purposes. For example, if a client suddenly behaves like a server, it may indicate that the client has been compromised or is violating security policies. Such classification provides information for network visibility and identifies threats and anomalies in the network.

Classifying network objects based on network flow patterns is challenging due to the huge volume of network traffic data and the overlap among host roles [172]. Tan et al. [172] discussed the role classification problem, and presented an algorithm that groups hosts based on network connection similarity and a correlation algorithm that correlates hosts between groups. Berthier et al. [28] developed and evaluated 6 server identification heuristics that used flow timing, port, IP address, and protocol. Meiss et al. [124] described heuristic methods to differentiate clients and servers based on the assumption that servers use the most common or well-known ports. Similar to our work of classifying regular web non-email servers versus web email servers, Schatzmann et al. [158] used a support vector machine to classify HTTPS traffic of web mail versus web non-mail using the features of IP proximity, session duration, and flow inter-arrival times. Himura et al. [48] used a minimum spanning tree to cluster hosts into groups of servers, clients, and peer-to-peer based on host behaviors. The use of port numbers in classification is often unreliable. Computer networks are very dynamic and dedicated port numbers are not always used, especially by malicious adversaries. Approaches that rely on network flow statistics are also not effective. Behavior-based network security analysis has advantages over traditional approaches based on code patterns or signatures, and has become the mainstream approach to securing a network [69]. We have also achieved better accuracy using behavior-based approaches compared with the previous approaches.

In this chapter, I classify network objects based on behaviors with context of roles using network flow data with Decision Tree and Support Vector Machine. These roles are, namely, clients versus servers, regular web non-email servers versus web email servers, clients at personal offices versus public places of laboratories and libraries, and personal office clients from two different colleges. I identify a particular user among other users.

The overall goal is to develop a behavior based network security awareness, detection and analysis system.

The term role has been used in many aspects of computer and network security, and mainly used to control resource access. We define a server as a network device that provides services, a client as a network device that requests services. A client usually uses multiple network applications. An office client host is used only by one individual user. Public place clients are at public places such as laboratories or libraries and are used by different users. Web email server hosts are web servers that provide only email service through the web. Web server hosts are regular web non-email servers that provide web contents but not emails. Users are people who use the client hosts. Our approach is end user and host centric.

In the rest of the chapter, I describe the data set, algorithms, features, and the ground truth. Then, I present classification results of host roles and user identification.

Table 7.1: Formal Description of Network Objects

Component                          Description
Network Object (O)                 A host or a user
Network Object Data (D_O)          NetFlow, sFlow, Event Log data
Network Object Profiles (P_O)      Profiles of O from network object data
Network Object Model (M_O)         Model for network object
Network Object Features (F_OM)     Features for each specific object and model

7.1 Methods

7.1.1 Data Set

The data used in this chapter are about three months of sFlow and NetFlow data, from September 30, 2012 to December 31, 2012, from a large campus. About 8,402 hosts were analyzed. Table 7.2 presents the detailed information about these hosts and their roles.¹

7.1.2 Algorithm

In this experiment, we utilized both the Decision Tree and Support Vector Machine algorithms. Since host behaviors can change over time, we use an on-line support vector machine to update the models over time. We use the implementation of On-line Support Vector Machine [31] and the Decision Tree implementation of [7]. As our purpose is not to develop a new machine learning algorithm, we use these algorithms directly.

¹ User data were never directly analyzed and all user meta-data was anonymized in the analysis.

7.1.3 Modelling

After the models have been initially built off-line, they are loaded to the Storm cluster, and the Storm cluster classifies network objects online. These models are updated when the classification accuracy is lower than a certain threshold.
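The update rule described above, where a model is refreshed once its classification accuracy falls below a threshold, can be sketched as follows. The threshold value of 0.90 and the names `accuracy` and `needs_update` are hypothetical; the source does not state the threshold or how the accuracy is evaluated in the Storm cluster.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that agree with the ground-truth labels."""
    if not labels:
        raise ValueError("no labeled samples to evaluate")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def needs_update(predictions, labels, threshold=0.90):
    """True when the accuracy has dropped below the (hypothetical) threshold."""
    return accuracy(predictions, labels) < threshold

# Hypothetical recent predictions compared with ground-truth roles.
labels = ["client", "server", "client", "client", "server"]
predictions = ["client", "server", "server", "client", "client"]

print(accuracy(predictions, labels), needs_update(predictions, labels))
```

In this sketch the decision is made on a batch of recently labeled objects; a deployment would also need to decide how large that batch is and how often it is evaluated.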

7.1.4 Ground Truth

I developed a crawler to collect host information from the Microsoft Active Directory server, in order to get actual information about the hosts' roles. The crawler collects each host's data and validates its current active status daily for the period of data collection. Based on the collected information, we built the ground truth, which is reported in Table 7.2.

7.2 Hosts Role Classification

In general, there are many network devices with an IP address in the network, including printers, routers, wireless access points, mobile devices, and servers. In the context of our interests, we define a server as a network device with an IP address that provides a service, and a client as a network device that requests service from servers. It is desirable to classify whether the network flow pattern of a device on the network is that of a client or a server. Applications include behavior-based anti-virus, abnormality detection, and botnet detection. If we see clients talking to other clients, excluding P2P traffic, it is an indication of a botnet. At this moment, there are two methods to classify whether a host is a client or a server: (1) port-based [124, 188] and (2) heuristics-based [28]. Section 7.4 discusses related work.

Table 7.2: Analyzed Systems

Role                                     Count
Client                                   5,494
Server                                   1,920
Host at Public Places                    784
Host at Personal Offices                 416
Host at Personal Office in College1      163
Host at Personal Office in College2      253
Regular Web Server                       56
Web Email Server                         25

The Decision Tree and the On-line Support Vector Machine are used to classify host roles: client versus server, web email server versus web non-email server, hosts at personal offices versus hosts at public places, and hosts at personal offices from two different colleges. As we have enough samples for training and testing, we simply split the data into 80% for training and 20% for testing, and calculate the features as described in Section 7.2.1.
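The 80%/20% split can be illustrated with a few lines of code. This sketch assumes a random shuffle with a fixed seed before splitting, which the source does not specify; the function name `split_train_test` and the placeholder samples are hypothetical.

```python
import random

def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples reproducibly and split them into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed (assumption) for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labeled samples: (feature vector id, role).
samples = [(i, "client" if i % 4 else "server") for i in range(100)]
train, test = split_train_test(samples)

print(len(train), "training samples,", len(test), "test samples")
```

Shuffling before the split avoids a test set that contains only the hosts collected last; a stratified split by role would be a further refinement when some roles have few samples.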

Features are constructed based on domain knowledge [74]. The Decision Tree provides the contribution of each feature to the trained classifier; these contributions are reported in Section 7.2.6. We compared four kernel types of the On-line Support Vector Machine (namely, linear, polynomial, radial basis function, and sigmoid) in classifying these roles.

The default parameters used in the experiments are as follows. For the On-line Support Vector Machine: online optimizer with finishing step, random selection, number of candidates to search C=50, tolerance of the termination criterion τ=0.001, and cache memory of 256 MB. For Decision Tree C5.0: ignoring cost, no cross validation, global pruning, and a confidence level of 0.25 for pruning.

7.2.1 Classification Features

Many features are discussed in [130]. As we have a clear purpose and domain knowledge, features are constructed based on the domain knowledge [74]. We carefully select a set of features for the purpose of host role classification in this chapter, rather than collecting as many features as possible with little understanding. For online classification and distributed computation, I use aggregations of these features. The utilized features are aggregations of N sFlows (where N is 10, 20, 30, 40, 50, 60, or 70), as listed below. Note that the system port range is 0-1023, the user port range is 1024-49151, and the dynamic port range is 49152-65535 [RFC 6335]. Moreover, host is the target host we are interested in, and its communication peer is called the other party. The following is the list of the features used in the classification of hosts.

1. count of unique host system ports
2. standard deviation of host system ports
3. count of most often used host system port
4. count of unique host user ports
5. standard deviation of host user ports
6. count of most often used host user port
7. count of host dynamic ports
8. count of other party unique system ports
9. standard deviation of other party system ports
10. count of most often used other party system port
11. count of unique other party user ports
12. standard deviation of other party user ports
13. count of most often used other party user ports
14. count of other party dynamic ports
15. count of host port numbers smaller than other party
16. average time to live
17. average upload byte size per flow
18. average download byte size per flow
19. count of unique protocols
20. count of the most often used protocol
21. sum of first byte of other party’s IP addresses
22. sum of second byte of other party’s IP addresses
23. sum of third byte of other party’s IP addresses
24. sum of fourth byte of other party’s IP addresses

All feature data are scaled to a continuous value between 0 and 1, inclusive, as follows.

• Counts are normalized by the total number of sFlows.

• Standard deviations are normalized by the maximum value of the standard deviation, $\frac{\mathrm{Maximum} - \mathrm{Minimum}}{2}$, where Maximum is the maximum value and Minimum is the minimum value for the data range.

• Average values are normalized by the maximum value of that field.

• IP address bytes are normalized by 255 ∗ N.
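The normalization rules above can be made concrete with a small sketch that computes four of the listed features for one aggregation of N flows. It is an illustration only: the record fields `host_port` and `peer_ip`, the function name `features`, and the use of the population standard deviation are assumptions, and the sample flows are made up.

```python
import statistics

SYSTEM = (0, 1023)        # system port range
DYNAMIC = (49152, 65535)  # dynamic port range

def in_range(port, bounds):
    """True when port lies inside the inclusive range bounds."""
    return bounds[0] <= port <= bounds[1]

def features(flows):
    """Compute a few normalized features from one aggregation of N flows."""
    n = len(flows)
    if n == 0:
        raise ValueError("at least one flow is required")
    sys_ports = [f["host_port"] for f in flows if in_range(f["host_port"], SYSTEM)]
    dyn_count = sum(1 for f in flows if in_range(f["host_port"], DYNAMIC))
    # Population standard deviation (assumption): for values inside a bounded
    # range it never exceeds (Maximum - Minimum) / 2, so the ratio stays in [0, 1].
    std = statistics.pstdev(sys_ports) if sys_ports else 0.0
    max_std = (SYSTEM[1] - SYSTEM[0]) / 2
    first_bytes = sum(int(f["peer_ip"].split(".")[0]) for f in flows)
    return {
        "unique_host_system_ports": len(set(sys_ports)) / n,  # count / N
        "std_host_system_ports": std / max_std,               # std / max std
        "host_dynamic_ports": dyn_count / n,                  # count / N
        "sum_first_byte_peer_ip": first_bytes / (255 * n),    # sum / (255 * N)
    }

# Hypothetical aggregation of N = 4 flows.
flows = [
    {"host_port": 80, "peer_ip": "10.1.2.3"},
    {"host_port": 443, "peer_ip": "20.1.2.3"},
    {"host_port": 80, "peer_ip": "30.1.2.3"},
    {"host_port": 50000, "peer_ip": "40.1.2.3"},
]
feats = features(flows)
for name, value in feats.items():
    print(name, round(value, 4))
```

The sample standard deviation would not work with this normalization, since for very small samples it can exceed half the range and push the scaled value above 1.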

7.2.2 Classification of Client versus Server

In classifying client versus server hosts, we experimented with both Decision Tree and On-line Support Vector Machine algorithms along with four kernel types: linear, polynomial, radial basis function, and sigmoid. Figure 7.1 shows the average accuracy versus the number of sFlows. The best average accuracy achieved by the On-line Support Vector Machine is 96.5%, with 70 sFlows, for both the radial basis function and linear kernels. The best overall average accuracy, 99.3%, is achieved by the Decision Tree (C5.0) algorithm with 50 sFlows. As the C5.0 algorithm provides the contribution of each individual feature to the classifier, we found that the average download bytes per sFlow contributed the most overall. For the best accuracy with 50 sFlows, the features contributing most to the classifier are, in order: average download byte size per flow, count of most often used host user port, sum of fourth byte of other party’s IP addresses, count of host dynamic ports, count of other party dynamic ports, count of the most often used protocol, count of unique host user ports, sum of second byte of other party’s IP addresses, count of unique other party user ports, count of most often used other party user ports, sum of first byte of other party’s IP addresses, and standard deviation of other party user ports.

In order to identify a server, Berthier et al. [28] showed that the port number is not an efficient feature by itself, but that the number of distinct tuples {IP address, IP protocol, Port number} is the most efficient. They used NetFlow data and relied on bidirectional flows. In our work of classifying clients and servers using sFlow, the average download byte size contributes the most to the classification. From the role perspective of a client and a server, a client generally has fewer distinct tuples than a server, and a server’s role usually is not to download.

7.2.3 Classification of Web Email Server versus Web Non-email Server

In classifying web email servers versus web non-email servers, we experimented with both Decision Tree and On-line Support Vector Machine algorithms along with four kernel types of linear, polynomial, radial basis function, and sigmoid. Figure 7.2 presents the results of average accuracy versus number of sFlows. The best accuracy of on-line Support Vector Machine was 94.7% with 60 sFlows and linear kernel type. The accuracies of kernel type of linear, radial basis function, and sigmoid are very close. The Decision Tree (C5.0) algorithm achieved highest average accuracy 100% with 50, 60, and 70 sFlows.

[Figure 7.1 plot: Accuracy (%) from 0 to 100 versus Number of sFlow, with series C5.0, SVM Linear, SVM Polynomial, SVM RBF, and SVM Sigmoid; error bars show +/- STD.]

Figure 7.1: Average Accuracy of Classifying Client versus Server

For the Decision Tree algorithm with 70 sFlows, the features contributing most to the classifier are, in order: sum of second byte of other party’s IP addresses, count of most often used host system port, sum of third byte of other party’s IP addresses, count of unique protocols, count of most often used host user port, count of unique other party user ports, and sum of first byte of other party’s IP addresses. Schatzmann et al. [158] used the features of service IP address proximity, session duration, and flow inter-arrival times in classifying HTTPS traffic of web email versus web non-email using NetFlow. They reached an overall accuracy of 93.2%.

7.2.4 Classification of Hosts from Personal Office versus Public Place

In classifying host roles of personal office and public place, we experimented with both the Decision Tree and On-line Support Vector Machine algorithms, the latter with four kernel types: linear, polynomial, radial basis function, and sigmoid. Figure 7.3 presents the average accuracy versus the number of sFlows. The best average accuracy of the On-line Support Vector Machine is 92.0% with 20 sFlows and the linear, radial basis function, and sigmoid kernels. The best overall average accuracy is 93.3%, achieved by the Decision Tree (C5.0) algorithm with 40 sFlows. For the Decision Tree algorithm with 40 sFlows, the features contributing most to the classifier are, in order: count of host dynamic ports, count of unique protocols, sum of first byte of other party’s IP addresses, and standard deviation of other party system ports. There is no comparison to related work since there is no similar classification.

Figure 7.2: Average Accuracy of Classifying Regular Web Email Server versus Web Non-email Server. [Plot: accuracy (%) versus number of sFlows (20 to 70) for C5.0 and for SVM with linear, polynomial, RBF, and sigmoid kernels; error bars show +/- one standard deviation.]

Figure 7.3: Average Accuracy of Classifying Host from Personal Office Client versus Public Place Client. [Plot: accuracy (%) versus number of sFlows (20 to 70) for C5.0 and for SVM with linear, polynomial, RBF, and sigmoid kernels; error bars show +/- one standard deviation.]

7.2.5 Classification of Hosts from Two Different Colleges

In classifying host roles of personal office from two different colleges, we experimented with both the Decision Tree and On-line Support Vector Machine algorithms, the latter with four kernel types: linear, polynomial, radial basis function, and sigmoid. Figure 7.4 presents the average accuracy versus the number of sFlows. The best accuracy of the On-line Support Vector Machine, 85.5% as reported in Chapter 8, is obtained with 10 sFlows and the linear kernel. The accuracies of the linear kernel are generally higher than those of the other kernels.

The best overall average accuracy, 93.3% as reported in Chapter 8, is achieved by the Decision Tree (C5.0) algorithm with 50 and 60 sFlows. For 60 sFlows, the features contributing most to the classifier are, in order: count of most often used host user port, standard deviation of other party system ports, count of unique other party user ports, count of unique protocols, and sum of first byte of other party’s IP addresses.

Figure 7.4: Average Accuracy of Classifying Host from Two Different Colleges. [Plot: accuracy (%) versus number of sFlows (20 to 70) for C5.0 and for SVM with linear, polynomial, RBF, and sigmoid kernels; error bars show +/- one standard deviation.]

This is the most difficult classifier because the difference between the two classes is not obvious. Both classes are hosts from personal offices, and the only difference is that the users are from two different colleges. Hence, this classification is more about the patterns of users’ network usage behavior. Compared with the previous classifications in Sections 7.2.2, 7.2.3, and 7.2.4, the accuracies of classifying hosts from two different colleges fluctuate dramatically. This may indicate that more optimization needs to be done on the features or other parameters. For instance, as NetFlow provides information such as the inter-duration of network flows, new features related to user behaviors may provide better results. In the study of Tan et al. [172], hosts of an enterprise internal network were grouped into servers, department of engineering, department of sales or administrators, etc. Their grouping is based on connectivity and similarity between hosts of internal connections.

7.2.6 Feature Contributions

Selecting features or discriminators plays an important role in network traffic classification. Moore et al. [130] gave a list of about 249 features in flow-based classification. There are many methods to select features [74]. In this work, features are constructed based on our domain knowledge of the host roles. The Decision Tree algorithm provides each individual feature’s contribution to the classifier during training. Understanding these contributions may help us choose better features in future applications.

Figure 7.5: C5.0 Feature Contributions in Classifying Host Roles. [Plot: percentage of training classification versus feature number for the client-server, web-email, personal-public, and college1-college2 models.]

Figure 7.5 shows each individual feature’s contribution to the four models discussed earlier. Overall, the user port features play an important role in the classification of client versus server and of hosts from different colleges, which indicates that these classifications are more about the user. The features of host dynamic ports, count of unique protocols, and sum of the first byte of the other party’s IP addresses play an important role in the classification of hosts from a personal office versus public places, which indicates that hosts in public places are used to connect with more third party addresses and with more varied applications. Overall, average download bytes and the other party’s IP addresses are the features that contributed the most to these four models. Schatzmann et al. [158] used the feature of service IP address proximity. In this section, the second and third bytes of the other party’s IP addresses contribute more to the classification of web email servers versus web non-email servers because a web email server is accessed by IP addresses from the local physical area, which is not necessarily the case for web non-email servers.

7.3 User Identification

Researchers have long been trying to identify a person using behavioral biometrics such as signatures or handwriting. Identifying a user based on network flow is very desirable in the computer security world. Network traffic data are rich, plentiful, and available now. However, there is very little research effort to identify a person among other users based on network communication behavior [112]. In this dissertation, I use machine learning approaches to model user network behaviors. The purpose of the models is to be applied to anomaly detection, botnet detection, anti-virus, and identifying users of the network.

I use approaches similar to those in Section 7.2 to identify a particular user. The Decision Tree [7] and the On-line Support Vector Machine [31] are used to identify a particular user from a set of users at personal offices. Data are simply split into 80% for training and 20% for testing. Features are constructed based on domain knowledge and the purpose of the models [74]. The details of the features are described in Section 7.3.1, and their contributions are reported in Section 7.3.3. I compare four different kernel types of the On-line Support Vector Machine (namely, linear, polynomial, radial basis function, and sigmoid) in identifying users.
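The 80%/20% split and the accuracy measure used in this section can be sketched as below. This is only an illustration under stated assumptions: the text does not say whether samples are shuffled before splitting, so the seeded shuffle here is an assumption, and the function names are hypothetical.

```python
import random


def split_80_20(samples, seed=0):
    """Return (training, testing) lists holding 80% and 20% of the samples.

    The seeded shuffle is an assumption; the source only states the ratio.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 8) // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of test samples whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

A fixed seed keeps the split reproducible, which makes accuracies from different kernels comparable on the same test set.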

The default parameters used in the experiments are as follows. For the On-line Support Vector Machine: online optimizer with finishing step, random selection, number of candidates to search C=50, tolerance of the termination criterion τ=0.001, and cache memory of 256 MB. For the Decision Tree C5.0: costs are ignored, no cross validation, global pruning, and a confidence level of 0.25 for pruning.

7.3.1 Identification Features

Many features are discussed in [130]. I carefully select a set of features for the purpose of identifying the user in this section. For online classification and distributed computation, aggregating these features is favorable. The utilized features are aggregations of N NetFlow records (where N is 5, 10, 20, 30, 40, or 50) and are listed below. Simple statistics such as the mean and standard deviation of these features lose a lot of information and are difficult to explain [149]. I chose to use the discrete probability distribution function (pdf) proposed by Ramalho [149]. I improve the method by emphasizing important values and by not restricting all bins to the same slice size. The term outlier refers to the percentage of values that fall outside the most common values; it is used to prevent the pdf from being heavily skewed. As an example, consider the data set of system ports {6, 8, 9, 11, 14, 30, 80, 1020}. If the values are assigned to 4 equal bins with bin slice size 1024/4 = 256, the bins are { [6, 8, 9, 11, 14, 30, 80], [], [], [1020] } and the pdf is { 0.875, 0, 0, 0.125 }, so most of the information in the data set is lost. With the outlier defined as 1% and port 80 as the value of interest, the data set size 8 multiplied by 99% gives 7 (rounded down), so the 7th value, 80, is the boundary at which the last bin starts, and the bin slice size is 80 / (4 - 1) ≈ 26.7. The bins then are { [6, 8, 9, 11, 14], [30], [80], [1020] } and the pdf is { 0.625, 0.125, 0.125, 0.125 }. The pdf is constructed as follows.

1. Define the number of bins R for the feature data of size N, the important bins for special interest values S, and the outlier value P.

2. Calculate the slice value of each bin based on R, S, N, and P as described above.

3. For each sample in the feature set, assign the sample to its bin and increase the counter $C_r$ of the bin r.

4. Compute $\mathrm{pdf}_r = C_r / N$ for each bin r, which is in the range [0, 1].

The following is the list of the features used in identifying the user, together with the index numbers referred to in Section 7.3.3.

1. Feature indexes [1-5], host system ports with resolution R = 5, P = 1%.

2. Feature indexes [6-10], host user ports with R = 5, P = 1%.

3. Feature indexes [11-13], host dynamic ports with R = 3, P = 1%.

4. Feature indexes [14-33], other party application ports with R = 20. Individual bins are reserved for often used Internet application ports; the bins are predefined as bin index, protocol, [port values] as follows: 0[0-19, 24, 26-52], 1 Ftps[20, 21, 69, 115], 2 SSH[22], 3 Telnet[23], 4 Mail[SMTP[25], POPs[109, 110]], 5 DNS[53], 6[54-66], 7 DHCP[67, 68], 8 HTTP[80, 8080], 9[70-79, 81-108], 10[111-118, 111-142], 11 NNTP[119], 12 IMAP[143, 993], 13 IRC[194], 14[144-193, 195-992], 15[994-1023], 16 RADIUS[1812], 17 MSNP[1863], 18 AIM[5190], 19 Tor[9100].

5. Feature indexes [34-38], other party user ports with resolution R = 5, P = 1%.

6. Feature indexes [39-41], other party dynamic ports with R = 3.

7. Feature indexes [42-51], upload byte size per flow with R = 10, P = 1%.

8. Feature indexes [52-61], download byte size per flow with R = 10, P = 1%.

9. Feature indexes [62-66], flow duration with R = 5, P = 1%.

10. Feature indexes [67-71], flow inter-arrival time with R = 5, P = 1%.

11. Feature indexes [72-85], protocols with R = 14, considering the important protocols ICMP(1), TCP(6), and UDP(17); most protocols are in the range 0-142. The bins are predefined as bin index [protocol values] as follows: 0[0, 2, 3, 4, 5], 1[1], 2[6], 3[7-16], 4[17], 5[18-34], 6[34-57], 7[57-60], 8[61-76], 9[77-93], 10[94-110], 11[111-127], 12[128-142], 13[143-255].

12. Feature indexes [86-117], other party IP addresses with resolution R = 32 (32 bins).

13. Feature indexes [118-122], ingress TCP flags, considering the following important flags. URG (1 bit) indicates that the Urgent pointer field is significant. ACK (1 bit) indicates that the Acknowledgment field is significant; all packets after the initial SYN packet sent by the client should have this flag set. PSH (1 bit) asks to push the buffered data to the receiving application. RST (1 bit) resets the connection. SYN (1 bit) synchronizes sequence numbers; only the first packet sent from each end should have this flag set. Some other flags change meaning based on this flag, and some are only valid when it is set, and others when it is clear. The bin size is 5 for each flag.

14. Feature indexes [123-127], egress TCP flags (6 bins), with the same logic as above.

15. Feature index 128, count of ingress.

16. Feature index 129, count of egress.
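The binning procedure described at the start of this section can be sketched as follows, reproducing the system port example {6, 8, 9, 11, 14, 30, 80, 1020}. This is one possible reading of the description, not the dissertation's implementation: it assumes that the value ranked floor(N x (1 - P)) is the cutoff, that the first R - 1 bins split the range up to and including the cutoff into equal slices, and that the last bin collects every value above the cutoff. The special interest values S and all function names are omitted or hypothetical.

```python
import math


def equal_width_pdf(values, num_bins, value_range):
    """Baseline: bins of identical width covering [0, value_range)."""
    if not values or num_bins < 1:
        raise ValueError("need values and at least one bin")
    width = value_range / num_bins
    counts = [0] * num_bins
    for v in values:
        counts[min(int(v // width), num_bins - 1)] += 1
    return [c / len(values) for c in counts]


def outlier_aware_pdf(values, num_bins, outlier_fraction):
    """Discrete pdf whose last bin is reserved for outliers (assumed reading)."""
    if not values or num_bins < 2:
        raise ValueError("need values and at least two bins")
    ordered = sorted(values)
    n = len(ordered)
    # Rank of the last value treated as normal, e.g. floor(8 * 0.99) = 7.
    rank = max(1, math.floor(n * (1.0 - outlier_fraction)))
    cutoff = ordered[rank - 1]
    width = cutoff / (num_bins - 1)  # slice size of the regular bins
    counts = [0] * num_bins
    for v in ordered:
        if v > cutoff:
            index = num_bins - 1  # outlier bin
        elif width == 0:
            index = 0
        else:
            # Right-closed slices, clamped so the cutoff stays in bin R - 2.
            index = min(max(math.ceil(v / width) - 1, 0), num_bins - 2)
        counts[index] += 1
    return [c / n for c in counts]
```

With the example data, four bins, and P = 1%, `equal_width_pdf(ports, 4, 1024)` gives [0.875, 0.0, 0.0, 0.125] and `outlier_aware_pdf(ports, 4, 0.01)` gives [0.625, 0.125, 0.125, 0.125], matching the two pdfs in the text.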

7.3.2 User Identification Results

In identifying the user network traffic pattern among many others, I experimented with the Decision Tree and On-line Support Vector Machine algorithms, the latter with four kernel types: linear, polynomial, radial basis function, and sigmoid. Figure 7.6 presents the average accuracy versus the number of NetFlows. The best average accuracy of the On-line Support Vector Machine is 78.6% with 10 NetFlows and the linear kernel. The best overall average accuracy, 83.3%, is achieved by the Decision Tree (C5.0) algorithm with 50 NetFlows.

The identification accuracy is lower than the host role classification accuracies in Section 7.2. In general, it is difficult to identify a user based only on network flow data patterns. This dissertation is the first project to tackle the problem, and better solutions will require more understanding of user behavior modeling.

Figure 7.6: Average Accuracy of Identifying the User. [Plot: accuracy (%) versus number of NetFlows (5 to 50) for C5.0 and for SVM with linear, polynomial, RBF, and sigmoid kernels; error bars show +/- one standard deviation.]

7.3.3 Feature Contributions

Figure 7.7 shows each individual feature’s contribution for the six models of aggregated NetFlow (5, 10, 20, 30, 40, and 50 flows) in the decision tree algorithm. Overall, the port number features contribute the most to the identification of the user among other users. Port numbers represent network applications. Hence, this indicates that user applications are important features for differentiating users from each other. Protocols, which like port numbers represent network applications, are the second largest contributors. After port numbers and protocols, the most important features are the upload and download sizes and the IP addresses. These results provide guidance for designing better models in the future.

Figure 7.7: C5.0 Feature Contribution in Identifying the User. [Plot: percentage of training classification versus feature number for the 5, 10, 20, 30, 40, and 50 flow models.]

7.4 Discussion

Many aspects of network traffic classification and clustering are interrelated [112]. Many studies focus on classifying applications such as web, peer-to-peer, FTP, and DNS traffic [112, 87, 12]. Some emphasize profiling the hosts and network [112, 88, 194, 79]. Machine learning approaches have been applied to classifying applications, security awareness, and anomaly detection [112]. There are some works on clustering and classifying host roles [172, 28, 124, 48]. Our aim in this chapter, different from those studies, is to identify the user from other users and to classify the host roles of a network from an end-host-centric approach rather than from the perspective of applications or protocols. Our purpose is to build a network security awareness system based on models of the traffic flow of internal hosts. These models are built online and can be updated over time when host roles change. The data are collected online through a low cost distributed platform built with open source components.

7.4.1 Classification of Network Applications

There are many studies on classifying applications [112]. CAIDA has developed several methods based on port numbers, statistical patterns, heuristics, and machine learning [12]. Karagiannis et al. [87] classified network traffic of web, peer-to-peer, ftp, mail, chat, and network management using heuristics developed at three levels: (1) the social level, i.e., the number of other hosts the particular host is connected with; (2) the functional level, i.e., whether the host provides or consumes a service; and (3) the application level, i.e., represented as a graphlet for the 4-tuples of {source IP, destination IP, source port, destination port}.

In this chapter, instead of classifying network applications, we classify host roles in a campus network. For example, a server may support many applications, and we classify a host as a server without considering the type of applications it supports. A client host typically uses many network applications. We design features based on the host role we are interested in. For example, we separate the port numbers into system, user, and dynamic ranges to obtain some application level information. Likewise, we separate the four bytes of IP addresses to capture sub-network or spatial level information.
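The two feature-design ideas above, separating ports into system, user, and dynamic ranges and separating the four bytes of an IP address, can be sketched as below. The range boundaries follow the IANA registry (system 0-1023, user 1024-49151, dynamic 49152-65535); that the dissertation uses exactly these boundaries is an assumption, and the function names are hypothetical.

```python
def port_range(port):
    """Map a port number to its IANA range name."""
    if not 0 <= port <= 65535:
        raise ValueError("port out of range: %r" % (port,))
    if port <= 1023:
        return "system"
    if port <= 49151:
        return "user"
    return "dynamic"


def octet_sums(ip_addresses):
    """Sum each of the four bytes over a list of dotted-quad IPv4 addresses."""
    sums = [0, 0, 0, 0]
    for address in ip_addresses:
        parts = address.split(".")
        if len(parts) != 4:
            raise ValueError("not a dotted-quad IPv4 address: %r" % (address,))
        for position, part in enumerate(parts):
            sums[position] += int(part)
    return sums
```

Keeping the bytes separate means the first bytes carry coarse network information while the last byte varies mostly with the individual host, which is the spatial information the text refers to.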

7.4.2 Profiling the Host and Network

Profiling the host can be done at four levels, i.e., user, application, host, and network [112]. Graph-based approaches grasp the nature of the network. Karagiannis et al. [88] extended graph-based host profiling and provided mechanisms to summarize the information over time. Xu et al. [194] built bipartite graphs of host communication and clustered hosts with common social-level behaviors in the same prefixes. Himura et al. [79] presented a synoptic graphlet approach by mapping from a cluster and coupled supervised and unsupervised profiling by using graphlets. Trestian et al. [177] introduced an approach for profiling endpoint behavior by querying the Google search engine and combining the results with keywords and domain names. Moore et al. [130] analyzed a large number of features for different network profiling studies.

In our system, we have not used graph-based features because building a graph for a large network is computationally expensive compared with sFlow data aggregation. We want online classification and fast feature computation, but have not found an efficient graph-based representation for host classification/fingerprinting.

7.4.3 Machine Learning Approaches

Machine learning algorithms such as Naive Bayes, Naive Bayes Kernel Estimation, Bayesian Network, Decision Tree, k-Nearest Neighbors, Neural Networks, and Support Vector Machines have been utilized for traffic anomaly analysis, intrusion detection, and network traffic classification and clustering [112]. In general, Decision Tree and Support Vector Machine perform better than other approaches [112].

In this chapter, we choose supervised machine learning approaches because we model the hosts of a campus network. We monitor the hosts for their roles in the network and have the ground truth about the analyzed hosts.

7.4.4 Classifying and Clustering Host Roles

Tan et al. [172] define the role classification problem and its challenges within enterprise networks, and introduce algorithms to group hosts based on connection patterns and to correlate these hosts. Their approach can be classified as end-host centric [177]. Similar to the authors, we focus on the internal network hosts. In our system, we build host role models for the internal network as part of a network security awareness system that uses online data collection and analysis. We used on-line support vector machines with features selected according to our purpose. We classify host roles of client versus server, regular web email servers versus web non-email servers, clients at personal offices versus public places such as laboratories and libraries, and personal office clients from two different colleges.

Himura et al. [48] used a minimum spanning tree to cluster hosts based on host behaviors. Their clustering is based on Internet traffic or network traffic applications without knowledge about the hosts. Their approach can be classified as a closed box approach [177]. That is the main difference compared with our work. Our classification uses network traffic data but focuses on the internal network hosts. Our classification is not just for network applications (e.g., web email servers versus web non-email servers) or network endpoints (e.g., client versus server), but also for the host functions of users (e.g., hosts from a personal office or public place and hosts from different colleges).

Similarly, [158, 28, 124] make the same assumption that a host is defined by a specific IP address without knowledge about hosts. Schatzmann et al. [158] used a support vector machine to classify HTTPS traffic of web mail versus web non-mail using bi-directional NetFlow with an overall accuracy of 93.2%. In our experiments, the on-line support vector machine algorithm reached 94.7% average accuracy with 60 sFlows and the linear kernel, and the decision tree algorithm reached 100% average accuracy. Berthier et al. [28] developed 6 server identification heuristics that depended on flow timing, port, IP address, and protocol, and used a Bayesian inference algorithm with an overall accuracy of 79.2% in identifying a server. In our study, we achieved 99.3% average accuracy using the Decision Tree algorithm with 50 sFlows, and 96.5% average accuracy with the On-line Support Vector Machine with 70 sFlows for both the radial basis function and linear kernels. Meiss et al. [124] described heuristic methods to differentiate client and server based on the assumption that servers use the most common or well-known ports, but no result was reported.

7.4.5 Identifying the User among Others based on Network Flow

To the best of our knowledge, there is no publication on identifying a user among other users based on network flow data. There is limited work on user profiling and on the correlation and distribution of user flow data related to time and packets [125]. Research on user behaviors mostly focuses on detecting anomalies. However, user behavior research can provide helpful information for identifying a user among other users. In this dissertation, I build decision tree and on-line SVM models with defined features and present their average accuracy. These approaches provide helpful information for forensic investigation and intrusion detection.

Chapter 8. Conclusion

The goal of this dissertation is to provide a network security monitoring and analysis system based on big data technologies. The system can collect, store, access, measure, and analyze network flow data, and it applies machine learning approaches in the context of host roles and user behaviors. The presented results show that big data technologies allow the development of a real time continuous network security monitoring and analysis system using machine learning algorithms and behavioral models.

The major contributions of this dissertation are:

• A distributed system based on big data technologies.

• Real time continuous network monitoring and interactive visualization.

• A case study of network measurement by measuring anonymity technology usage on campus.

• Models of classification of host roles and identification of users.

The system designed in this dissertation has been used by the network security department of campus information technology. The system design leverages the latest big data technologies from Amazon, Facebook, Google, Twitter, and Yahoo. The system components are relatively mature and are used by high tech companies to handle large scale data. Hence, the presented system is scalable and has high performance. Additionally, I demonstrate how to monitor the network for security awareness and how to interact with the continuous information in real time. As the system provides access to these data with minimal delay, interactive visualization of the information is possible. An administrator can continuously view the network status, the top N conversations, and target network traffic in real time.
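A top N conversations view of the kind mentioned above can be sketched as follows. The definition used here is an assumption made for illustration: a conversation is an unordered pair of IP addresses ranked by total bytes exchanged in both directions. The tuple layout and function name are hypothetical and do not describe the deployed system.

```python
from collections import Counter


def top_n_conversations(flows, n):
    """Rank host pairs by total bytes.

    flows: iterable of (source_ip, destination_ip, byte_count) tuples.
    Returns a list of ((ip_a, ip_b), total_bytes), largest first.
    """
    totals = Counter()
    for source, destination, byte_count in flows:
        # Sort the pair so both directions count as the same conversation.
        pair = tuple(sorted((source, destination)))
        totals[pair] += byte_count
    return totals.most_common(n)
```

Because the totals are simple sums, they can be updated incrementally as new flow records arrive, which suits a continuously refreshed display.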

Moreover, I demonstrate a case study that measures the usage of anonymity technologies on a campus and reports the usage of proxies, Tor, I2P, and commercial anonymity servers. Proxies and Tor were among the most used anonymity technologies, accounting for the majority of anonymous connections and traffic bandwidth. The US, Germany, and China are the top 3 countries where anonymity servers are used to connect with this campus. Ports 80 (i.e., HTTP) and 443 (i.e., SSL) are seen very often in the anonymity traffic, which indicates that web browsing is the most common activity in this anonymity network traffic.

Finally, I present the classification of hosts based on the host roles of client, server, web email server, web non-email server, personal or public hosts, and clients from two different colleges. Host role classification accuracies were very high. Specifically, in terms of average accuracy, I obtained 96.5% by On-Line SVM and 99.3% by Decision Tree (C5.0) in classifying clients versus servers, 94.7% by On-Line SVM and 100% by Decision Tree (C5.0) in classifying web email servers versus web non-email servers, 92.0% by On-Line SVM and 93.3% by Decision Tree (C5.0) in classifying clients in personal offices versus public places, and 85.5% by On-Line SVM and 93.3% by Decision Tree (C5.0) in classifying clients at personal offices from two different colleges. Overall, the Decision Tree algorithm performs better than the On-line Support Vector Machine. Furthermore, I presented the identification of a user from other users. The best average accuracy of the On-Line Support Vector Machine is 78.6% with 10 NetFlows and the linear kernel. The best overall average accuracy, 83.3%, is achieved by the Decision Tree (C5.0) algorithm with 50 NetFlows. Overall, these results show a promising future for using big data technologies to monitor and analyze network security data and to provide real time continuous network security awareness.

Chapter 9. Future Work

There are many ways to continue and extend the work presented in this dissertation.

9.1 Improvement to the Current Work

The results of real time network monitoring in Chapter 5, network measurement in Chapter 6, and classification and identification of network objects in Chapter 7 are promising. More understanding of user network behaviors and of how to model these behaviors is necessary to better identify a user among other users. Additional research can define more features related to user network behavior, strategies for feature selection, or algorithms to learn these features, such as deep learning approaches. More interactive features and better web user interfaces can bring the monitoring system of Chapter 5 closer to a commercial product for network security administrators.

9.2 Extensions to the Current Work

With the big data system designed in Chapter 4, more work can be done to extend the work.

9.2.1 Background Traffic

It is necessary to remove noise in order to improve the accuracy of classifying network host roles and identifying the user among many other users. Filtering out noisy background data before storing the data can also save storage space and improve performance. First, it is necessary to determine what noise traffic is. For instance, when identifying user behavior, the network traffic of advertisements is background noise. Second, it is challenging to differentiate the traffic of interest from noise traffic based on network flow information alone.

9.2.2 Detection of Operating System Fingerprinting

Before attacking a network, hackers typically collect information about the target. Port scanning and operating system (OS) fingerprinting are very common approaches to obtain the target’s information. There have been considerable research and applications on detecting port scanning [112]. The big data system in this dissertation is able to detect port scanning. However, there is not much research on OS fingerprinting using network flow data. It is possible to explore the data for fingerprinting with the big data system presented in this dissertation.
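As an illustration of how port scanning can show up in flow data, the sketch below flags a source that contacts many distinct ports on one target. This is a generic threshold heuristic chosen for illustration, not the detection method used by the system in this dissertation; the tuple layout, the function name, and the threshold of 100 ports are all assumptions.

```python
from collections import defaultdict


def suspected_port_scanners(flows, min_ports=100):
    """Return sorted (source_ip, target_ip) pairs that look like port scans.

    flows: iterable of (source_ip, destination_ip, destination_port) tuples.
    A pair is flagged when the source touched at least min_ports distinct
    ports on the same target (hypothetical threshold).
    """
    ports_seen = defaultdict(set)
    for source, destination, destination_port in flows:
        ports_seen[(source, destination)].add(destination_port)
    return sorted(
        pair for pair, ports in ports_seen.items() if len(ports) >= min_ports
    )
```

An analogous flow-level signature for OS fingerprinting is less obvious, which is consistent with the open question raised above.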

9.2.3 Identity Anonymity

There is a gap between researchers in academia and network administrators in information technology. Research on network traffic usually lacks data because network administrators are not willing to provide them due to security and privacy concerns [112]. In order to open the current system and data to researchers, careful anonymization has to be done. The identity anonymity should be validated to avoid mistakenly exposing any security or privacy data, but it also should not affect performance or prevent network administrators from exploring more information about the target hosts and users. Determining a balance and integrating these two goals is a future challenge.

9.2.4 Fusion with Other Network Security Data

There are many kinds of network security data, including network flows, firewall logs, application logs, system logs, and SNMP. All these data are related to the network objects of hosts and users. Fusing these data together will provide better network security information and improve the accuracy of host classification and user identification.

9.3 Vision

Nowadays, network security has become a very important aspect of our daily life, and security attacks are in the news almost daily. Individuals and small businesses cannot afford a big information technology team. Even some government offices cannot afford a complete expert team because of budget limitations. Cloud computing and big data technologies open the door to providing network security as a service. I believe the system presented in this dissertation can provide a network security service for individuals, small businesses, and government offices to help them continuously monitor and analyze their network traffic in real time through interactive web user interfaces.

Bibliography

[1] Apache Cassandra. http://cassandra.apache.org/, Retrieved June 31, 2013.
[2] Apache Flume. http://flume.apache.org/, Retrieved June 31, 2013.
[3] Apache Hadoop Project. http://hadoop.apache.org, Retrieved November 3, 2012.
[4] AURORA: traffic analysis and visualization. http://www.zurich.ibm.com/aurora/, Retrieved September 13, 2012.
[5] Big Data. http://www-01.ibm.com/software/data/bigdata/, Retrieved June 31, 2013.
[6] Big data fuels intelligence-driven security. http://www.emc.com/collateral/industry-overview/big-data-fuels-intelligence-driven-security-io.pdf, Retrieved February 18, 2013.
[7] Decision Tree C5.0. http://www.rulequest.com/see5-info.html, Retrieved December 31, 2012.
[8] Free haven selected papers in anonymity. http://www.freehaven.net/anonbib/date.html, Retrieved August 3, 2012.
[9] Hadoop home page. http://hadoop.apache.org, Retrieved August 3, 2012.
[10] Information security continuous monitoring for federal information systems and organization. http://csrc.nist.gov/publications/nistpubs/800-137/SP800-137-Final.pdf, Retrieved June 31, 2013.
[11] InMotion sFlow toolkit. http://www.inmon.com/technology/sflowTools.php, Retrieved August 3, 2012.
[12] Internet traffic classification. http://www.caida.org/research/traffic-analysis/classification-overview/, Retrieved June 3, 2012.
[13] Introduction to Cisco IOS NetFlow - A Technical Overview. http://www.cisco.com/en/US/prod/collateral/iosswrel/ps6537/ps6555/ps6601/prod_white_paper0900aecd80406232.html, Retrieved June 3, 2012.

[14] IPFIX. http://datatracker.ietf.org/wg/ipfix/, Retrieved Septem- ber 13, 2012. [15] NetFlow applications. http://netflow.caligare.com/ applications.htm, Retrieved September 13, 2012. [16] NFSen - Netflow sensor. http://nfsen.sourceforge.net/, Retrieved September 13, 2012. [17] nodejs . http://nodejs.org/, Retrieved June 31, 2013. [18] sFlow Collectors. http://www.sflow.org/products/collectors. php, Retrieved September 13, 2012. [19] sFlow forum. http://www.sflow.org, Retrieved August 3, 2012. [20] Storm distributed and fault-tolerant realtime computation. http://storm- project.net/, Retrieved June 31, 2013. [21] V8 JavaScript Engine. https://code.google.com/p/v8/, Retrieved June 31, 2013. [22] Service name and transport protocol port number registry. http://www.iana. org/assignments/service-names-port-numbers/service- names-port-numbers.xml, Retrieved December 31, 2012, 2001. [23] Shubair A Abdulla, Sureswara Ramadass, Altyeb Altaher, and Amer Al Nassiri. Setting a worm attack warning by using machine learning to classify NetFlow data. International Journal of Computer Applications, 36(2):49–56, December 2011. [24] Robert Ball, Glenn A Fink, and Chris North. Home-centric visualization of network traffic for security administration. In Proceedings of the 2004 ACM workshop on visualization and data mining for computer security, pages 55–64, New York, NY, USA, 2004. ACM. [25] Pere Barlet-ros and Albert Cabellos-aparicio. Analysis of the impact of sampling on NetFlow traffic classification. Methodology, 55(5):1083–1099, 2010. [26] Alexander V. Barsamian. Network characterization for botnet detection using statistical-behavioral methods. Master’s thesis, Thayer School of Engineering, Dart- mouth College, USA, 2009. [27] Karel Bartos, Martin Rehak, and Vojtech Krmicek. Optimizing flow sampling for network anomaly detection. In Wireless Communications and Mobile Computing Conference, 7th International, pages 1304–1309, July 2011. 
[28] Robin Berthier, Michel Cukier, Matti Hiltunen, Dave Kormann, Gregg Vesonder, Dan Sheleheda, Park Ave, and Florham Park. Nfsight: NetFlow-based network awareness tool. Architecture, pages 1–8, 2010.
[29] Liu Bin, Lin Chuang, Qiao Jian, He Jianping, and Peter Ungsunan. A NetFlow based flow analysis and monitoring system in enterprise networks. Computer Networks, 52(5):1074–1092, 2008.

[30] Xu Bo, Chen Ming, Lan Fei, and Wang Na. P2P flows identification method based on listening port. In Broadband Network Multimedia Technology, 2009. IC-BNMT ’09. 2nd IEEE International Conference on, pages 296–300, 2009.
[31] Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579–1619, September 2005.
[32] Mustapha Bouhtou and Olivier Klopfenstein. Robust optimization for selecting NetFlow points of measurement in an IP network. 2007 IEEE Global Telecommunications Conference, pages 2581–2585.
[33] Daniela Brauckhoff, Bernhard Tellenbach, Arno Wagner, Martin May, and Anukool Lakhina. Impact of packet sampling on anomaly detection metrics. Proceedings of the 6th ACM SIGCOMM on Internet measurement, page 159, 2006.
[34] Alexandra Caracas, Andreas Kind, Dieter Gantenbein, Stefan Fussenegger, and Dimitrios Dechouniotis. Mining semantic relations using NetFlow. In 3rd IEEE/IFIP International Workshop on Business-driven IT Management, pages 110–111, April 2008.
[35] Valentin Carela-Espanol, Pere Barlet-Ros, and Josep Solé-Pareta. Traffic classification with sampled netflow. Technical report, Universitat Politecnica de Catalunya, 2009.
[36] Abdelberi Chaabane, Pere Manils, and Mohamed Ali Kaafar. Digging into anonymous traffic: a deep analysis of the tor anonymizing network. In 4th International Conference on Network and System Security, pages 167–174, September 2010.
[37] Yin-Tung F. Chan, Charles A. Shoniregun, and Galyna A. Akmayeva. A NetFlow based Internet-worm detecting system in large network. In Third International Conference on Digital Information Management, pages 581–586, 2008.
[38] Umang K Chaudhary, Ioannis Papapanagiotou, and Michael Devetsikiotis. Flow classification using clustering and association rule mining. In 15th IEEE International Workshop on Computer Aided Modeling, Analysis and Design of Communication Links and Networks, pages 76–80, 2010.
[39] Yingying Chen, S. Jain, V.K. Adhikari, Zhi-Li Zhang, and Kuai Xu. A first look at inter-data center traffic characteristics via yahoo! datasets. In Proceedings IEEE INFOCOM, pages 1620–1628, April 2011.
[40] Guang Cheng and Jian Gong. A resource-efficient flow monitoring system. Communications Letters, IEEE, 11(6):558–560, June 2007.
[41] Kenjiro Cho, Ryo Kaizaki, and Akira Kato. Aguri: an aggregation-based traffic profiler. In Proceedings of the Second International Workshop on Quality of Future Internet Services, COST 263, pages 222–242, London, UK, 2001. Springer-Verlag.
[42] Hyunsang Choi, Heejo Lee, and Hyogon Kim. Fast detection and visualization of network attacks on parallel coordinates. Computers & Security, 28(5):276–288, 2009.

[43] Edith Cohen, Nick Duffield, Carsten Lund, and Mikkel Thorup. Confident estimation for multistage measurement sampling and aggregation. Proceedings of the 2008 ACM SIGMETRICS international conference on measurement and modeling of computer systems SIGMETRICS 08, (i):109, 2008.
[44] M. Patrick Collins and Michael K. Reiter. Hit-list worm detection and bot identification in large networks using protocol graphs. In Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection, pages 276–295, Berlin, Heidelberg, 2007. Springer-Verlag.
[45] George Danezis and Claudia Diaz. A survey of anonymous communication channels. Technical Report MSR-TR-2008-35, Microsoft Research, January 2008.
[46] Luca Deri. Ntop. http://www.ntop.org, Retrieved June 3, 2012.
[47] Luca Deri. Open source voip traffic monitoring. http://131.114.21.22/VoIP.pdf, Retrieved June 3, 2012.
[48] Guillaume Dewaele, Yosuke Himura, Pierre Borgnat, Kensuke Fukuda, Patrice Abry, Olivier Michel, Romain Fontugne, Kenjiro Cho, and Hiroshi Esaki. Unsupervised host behavior classification from connection patterns. International Journal of Network Management, 20(5):317–337, September 2010.
[49] Falko Dressler, Wolfgang Jaegers, and Reinhard German. Flow-based worm detection using correlated honeypot logs. In Proceedings of 15th GI/ITG Fachtagung Kommunikation in Verteilten Systemen, pages 181–186, 2007.
[50] Thomas Dubendorfer and Bernhard Plattner. Host behaviour based early detection of worm outbreaks in internet backbones. 14th IEEE International Workshops on Enabling Technologies Infrastructure for Collaborative Enterprise, (c):166–171, 2005.
[51] Thomas Dubendorfer, Arno Wagner, and Bernhard Plattner. A framework for real-time worm attack detection and backbone monitoring. First IEEE International Workshop on Critical Infrastructure Protection, pages 3–12, 2005.
[52] Nick Duffield. Sampling for passive internet measurement: A review. Statistical Science, 19:472–498, 2004.
[53] Nick Duffield and Matthias Grossglauser. Trajectory sampling with unreliable reporting. IEEE/ACM Transactions on Networking, 16(1):37–50, February 2008.
[54] Nick Duffield, Carsten Lund, and Mikkel Thorup. Charging from sampled network usage. In Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, pages 245–256, New York, NY, USA, 2001. ACM.
[55] Nick Duffield, Carsten Lund, and Mikkel Thorup. Properties and prediction of flow statistics from sampled packet streams. In Proceedings of the 2nd ACM SIGCOMM workshop on internet measurement, pages 159–171, New York, NY, USA, 2002. ACM.
[56] Matthew Edman and Bülent Yener. On anonymity in an electronic society: A survey of anonymous communication systems. ACM Computing Surveys, 42(1):1–35, 2009.

[57] Robert F Erbacher. Visual behavior characterization for intrusion detection in large scale systems. In Proceedings of the IASTED International Conference On Visualization, Imaging, and Image Processing, pages 54–59, 2001.
[58] Cristian Estan, Ken Keys, David Moore, and George Varghese. Building a better NetFlow. ACM SIGCOMM Computer Communication Review, 34(4):245, 2004.
[59] Tiago Fioreze, Lisandro Zambenedetti Granville, Aiko Pras, Anna Sperotto, and Ramin Sadre. Self-management of hybrid networks: Can we trust netflow data? Integrated Network Management 2009 IM09 IFIP/IEEE International Symposium on, pages 577–584, 2009.
[60] Fabian Fischer, Florian Mansmann, Daniel A Keim, and Stephan Pietzko. Large-scale network monitoring for visual analysis of attacks. Visualization for computer security 5th international workshop, 72(1-3):1–8, 2008.
[61] Michalis Foukarakis, Demetres Antoniades, Spiros Antonatos, and Evangelos P Markatos. Flexible and high-performance anonymization of NetFlow records using anontool. Third international conference on security and privacy in communications networks and the workshops, pages 33–38, 2007.
[62] Jérôme François, Shaonan Wang, Radu State, and Thomas Engel. BotTrack: tracking botnets using NetFlow and PageRank. Networking, 6640:1–14, 2011.
[63] Jerome Francois, Shaonan Wang, Walter Bronzi, Radu State, and Thomas Engel. BotCloud: Detecting botnets using mapreduce. In 2011 IEEE International Workshop on Information Forensics and Security, pages 1–6.
[64] Vanessa Frias-Martinez, Joseph Sherrick, Salvatore J. Stolfo, and Angelos D. Keromytis. A network access control mechanism based on behavior profiles. In Computer Security Applications Conference, 2009, pages 3–12.
[65] A. S. Galathiya, A. P. Ganatra, and C. K. Bhensdadia. Article: Classification with an improved decision tree algorithm. International Journal of Computer Applications, 46(23):1–6, May 2012. Published by Foundation of Computer Science, New York, USA.
[66] Aleksey A Galtsev and Andrei M Sukhov. Network attack detection at flow level. Aerospace, 2011.
[67] Lei Gao, Jiahai Yang, Hui Zhang, Bin Zhang, and Donghong Qin. FlowInfra: A fault-resilient scalable infrastructure for network-wide flow level measurement. In Network Operations and Management Symposium, 2011 13th Asia-Pacific, pages 1–8.
[68] Yan Gao, Zhichun Li, and Yan Chen. A DoS resilient flow-level intrusion detection approach for high-speed networks. In 26th IEEE International Conference on Distributed Computing Systems, page 39, 2006.
[69] David Geer. Behavior-based network security goes mainstream. Computer, 39(3):14–17, March 2006.

[70] Seth Gilbert and Nancy Lynch. Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, June 2002.
[71] John R. Goodall and Mark Sowul. VIAssist: visual analytics for cyber defense. In IEEE Conference on Technologies for Homeland Security, pages 143–150, May 2009.
[72] Andrew M Gossett, Ioannis Papapanagiotou, and Michael Devetsikiotis. An apparatus for P2P classification in Netflow traces. In GLOBECOM Workshops (GC Wkshps), pages 1361–1366, 2010.

[73] Matěj Grégr, Petr Matoušek, Miroslav Švéda, and Tomáš Podermański. Practical IPv6 monitoring-challenges and techniques. In IFIP/IEEE International Symposium on Integrated Network Management, pages 650–653, May 2011.
[74] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, March 2003.
[75] Hamed Haddadi, Raul Landa, Andrew W. Moore, Saleem Bhatti, Miguel Rio, and Xianhui Che. Revisiting the issues on netflow sample and export performance. In Third International Conference on Communications and Networking in China, pages 442–446, 2008.
[76] Byung-Jin Han, Jong-Hyouk Lee, Seon-Gyoung Sohn, Jong-Ho Ryu, and Tai-Myoung Chung. pFlours: A new packet and flow gathering tool. In 10th International Conference on Advanced Communication Technology, volume 1, pages 731–736, 2008.
[77] Se-hee Han, Myung-sup Kim, Hong-taek Ju, and James Won-ki Hong. The Architecture of NG-MON: A Passive Network Monitoring System. In Proceedings of 13th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management, pages 16–27, 2002.
[78] Fang Hao, M Kodialam, T V Lakshman, and S Mohanty. Fast, memory efficient flow rate estimation using runs. Networking, IEEE/ACM Transactions on, 15(6):1467–1477, 2007.
[79] Yosuke Himura, Kensuke Fukuda, Kenjiro Cho, Patrice Abry, Pierre Borgnat, and Hiroshi Esaki. Synoptic graphlet: Bridging the gap between supervised and unsupervised profiling of host-level network traffic. Networking, IEEE/ACM Transactions on, PP(99):1, 2012.
[80] Han-Wei Hsiao, Deng-Neng Chen, and Tsung Ju Wu. Detecting hiding malicious website using network traffic mining approach. In 2nd International Conference on Education Technology and Computer, volume 5, pages V5–276–V5–280, June 2010.
[81] Yan Hu, Dah-ming Chiu, John C S Lui, and Senior Member. Entropy based adaptive flow aggregation. IEEE/ACM Transactions on Networking, 17(3):698–711, 2009.

[82] Markus Huber, Martin Mulazzani, and Edgar Weippl. Tor HTTP Usage and Information Leakage. In Proceedings of the 11th International Conference on Communications and Multimedia Security, volume 6109 of LNCS, pages 245–255. Springer, May 2010.
[83] Hongbo Jiang, Zihui Ge, Shudong Jin, and Jia Wang. Network prefix-level traffic profiling: characterizing, modeling, and evaluation. Computer Networks, 54(18):3327–3340, December 2010.
[84] Hongbo Jiang, Andrew W Moore, Zihui Ge, Shudong Jin, and Jia Wang. Lightweight application classification for network management. Proceedings of the 2007 SIGCOMM Workshop on Internet Network Management, page 299.
[85] Wang Jinsong, Liu Weiwei, Zhang Yan, Liu Tao, and Wang Zilong. P2P traffic identification based on netflow tcp flag. In International Conference on Future Computer and Communication, pages 700–703, April 2009.
[86] Andrew J Kalafut, Jacobus Van Der Merwe, and Minaxi Gupta. Communities of interest for internet traffic prioritization. IEEE INFOCOM Workshops 2009, pages 1–6.
[87] Thomas Karagiannis, Konstantina Papagiannaki, and Michalis Faloutsos. BLINC: multilevel traffic classification in the dark. SIGCOMM Computer Communication Review, 35(4):229–240, August 2005.
[88] Thomas Karagiannis, Konstantina Papagiannaki, Nina Taft, and Michalis Faloutsos. Profiling the end host. In Proceedings of the 8th International Conference on Passive and Active Network Measurement, pages 186–196, Berlin, Heidelberg, 2007. Springer-Verlag.
[89] Yin Ke-xin and Zhu Jian-qi. A novel DoS detection mechanism. In International Conference on Mechatronic Science, Electric Engineering and Computer, pages 296–298, 2011.
[90] Douglas Kelly. A taxonomy for and analysis of anonymous communications networks. Technical report, Air Force Institute of Technology, March 2009.
[91] Darren R. Kerr and Barry L. Bruins. Network flow switching and flow data export. US Patent Number US 6243667, 2001.
[92] Hyunchul Kim, K C Claffy, Marina Fomenkov, Dhiman Barman, Michalis Faloutsos, and KiYoung Lee. Internet traffic classification demystified: myths, caveats, and the best practices. In Proceedings of the 2008 ACM CoNEXT Conference, pages 11:1–11:12, New York, NY, USA. ACM.
[93] Myung-Sup Kim, Hun-Jeong Kong, Seong-Cheol Hong, Seung-Hwa Chung, and J W Hong. A flow-based method for abnormal network traffic detection. In Network Operations and Management Symposium, volume 1, pages 599–612, April 2004.

[94] Andreas Kind, Dieter Gantenbein, and Hiroaki Etoh. Relationship discovery with netflow to enable business-driven IT management. In The First IEEE/IFIP International Workshop on Business-Driven IT Management, pages 63–70, April 2006.
[95] Yoshinori Kitatsuji and Katsuyuki Yamazaki. A distributed real-time tool for IP-flow measurement. In 2004 International Symposium on Applications and the Internet, pages 91–98, 2004.
[96] Atsushi Kobayashi and Katsuyasu Toyama. Method of measuring VoIP traffic fluctuation with selective sFlow. In 2007 International Symposium on Applications and the Internet Workshops, pages 89–89. IEEE.
[97] Jochen Kögel. One-way delay measurement based on flow data: Quantification and compensation of errors by exporter profiling. In 2011 International Conference on Information Networking, pages 25–30.
[98] Constantinos Kotsokalis, Demetrios Kalogeras, and Basil Maglaris. Router-based detection of DoS and DDoS attacks. In Proceedings of HP OpenView University Association HPOVUA 8th Annual Workshop, 2001.
[99] Jay Kreps, Neha Narkhede, and Jun Rao. Kafka: A distributed messaging system for log processing. In Proceedings of 6th International Workshop on Networking Meets Databases (NetDB), Athens, Greece, 2011.
[100] Vojtěch Krmíček, Jan Vykopal, and R Krejci. Netflow based system for NAT detection. In Proceedings of the 5th international student workshop on emerging networking experiments and technologies, pages 23–24. ACM, 2009.
[101] Sumantra R. Kundu, Sourav Pal, Kalyan Basu, and Sajal K. Das. An architectural framework for accurate characterization of network traffic. IEEE Transactions on Parallel and Distributed Systems, 20(1):111–123, 2009.
[102] Anukool Lakhina, Mark Crovella, and Christophe Diot. Mining anomalies using traffic feature distributions. SIGCOMM Computer Communication Review, 35(4):217–228, August 2005.
[103] Anukool Lakhina, Konstantina Papagiannaki, Mark Crovella, Christophe Diot, Eric D Kolaczyk, and Nina Taft. Structural analysis of network traffic flows. SIGMETRICS Performance Evaluation Review, 32(1):61–72, June 2004.
[104] Kiran Lakkaraju, William Yurcik, and Adam J Lee. NVisionIP: netflow visualizations of system state for security situational awareness. Proceedings of the 2004 ACM workshop on Visualization and data mining for computer security, 29:65–72.
[105] Stephen Lau. The spinning cube of potential doom. Communications of the ACM, 47(6):25–26, June 2004.
[106] Chang-yong Lee, Hwan-kuk Kim, Kyoung-hee Ko, and Jeong-wook Kim. A VoIP traffic monitoring system based on NetFlow v9. International Journal of Advanced Science and Technology, 4:1–8, 2009.

[107] Myungjin Lee, M Hajjat, R R Kompella, and Sanjay Rao. RelSamp: preserving application structure in sampled flow measurements. In INFOCOM, 2011 Proceedings IEEE, pages 2354–2362.
[108] Myungjin Lee, Student Member, and Nick Duffield. Opportunistic flow-level latency estimation using consistent NetFlow. Networking IEEE/ACM Transaction On, pages 1–14, 2011.
[109] Yeonhee Lee, Wonchul Kang, and Youngseok Lee. A hadoop-based packet trace processing tool. In Proceedings of the third international conference on traffic monitoring and analysis, pages 51–63, Berlin, Heidelberg, 2011. Springer-Verlag.
[110] Bingdong Li, Esra Erdin, Mehmet Hadi Güneş, George Bebis, and Todd Shipley. An analysis of anonymity technology usage. In Proceedings of the Third international conference on Traffic monitoring and analysis, pages 108–121, Berlin, Heidelberg, 2011. Springer-Verlag.
[111] Bingdong Li, Esra Erdin, Mehmet Hadi Güneş, George Bebis, and Todd Shipley. An overview of anonymity technology usage. Computer Communications, 36(12):1269–1283, 2013.
[112] Bingdong Li, Jeff Springer, George Bebis, and Mehmet Hadi Güneş. Review: A survey of network flow applications. Journal of Network and Computer Applications, 36(2):567–581, March 2013.
[113] Yuanyuan Li. Study of the monitoring model for securities trading network Quality of Service. In 2nd International Conference on Information Science and Engineering, pages 1–4, 2010.
[114] Chen Liang and Gong Jian. Fast application-level traffic classification using NetFlow records. Journal on Communications, 33(1):145–152, 2012.
[115] Danielle Liu and Frank Huebner. Application profiling of IP traffic. In 27th Annual IEEE Conference on Local Computer Networks, pages 220–229, 2002.
[116] Xiao-Wu Liu, Hui-Qiang Wang, Ying Liang, and Ji-Bao Lai. Heterogeneous multi-sensor data fusion with multi-class support vector machines: Creating network security situation awareness. In Machine Learning and Cybernetics, 2007 International Conference on, volume 5, pages 2689–2694.
[117] Karsten Loesing. Measuring the tor network from public directory information. Technical report, The Tor Project, 2nd Hot Topics in Privacy Enhancing Technologies, Seattle, WA, USA, 2009.
[118] Karsten Loesing, Steven J. Murdoch, and Roger Dingledine. A case study on measuring statistical data in the tor anonymity network. In Proceedings of the 14th International Conference on Financial Cryptography and Data Security, pages 203–215, Berlin, Heidelberg, 2010. Springer-Verlag.

[119] Pere Manils, Chaabane Abdelberi, Stevens Le-Blond, Mohamed Ali Kaafar, Claude Castelluccia, Arnaud Legout, and Walid Dabbous. Compromising tor anonymity exploiting P2P information leakage. Computer Research Repository, abs/1004.1461, 2010.
[120] Florian Mansmann, Lorenz Meier, and Daniel A. Keim. Visualization of host behavior for network security. Network Security, pages 187–202, 2007.
[121] Florian Mansmann, Fabian Fischer, Daniel A. Keim, Stephan Pietzko, and Marcel Waldvogel. Interactive analysis of netflows for misuse detection in large IP networks. In Paul Müller, Bernhard Neumair, and Gabi Dreo Rodosek, editors, DFN-Forum Kommunikationstechnologien, volume 149 of LNI, pages 115–124. GI, 2009.
[122] Damon McCoy, Kevin Bauer, Dirk Grunwald, Tadayoshi Kohno, and Douglas Sicker. Shining light in dark places: Understanding the tor network. In Proceedings of the 8th international symposium on Privacy Enhancing Technologies, PETS ’08, pages 63–76, Berlin, Heidelberg, 2008. Springer-Verlag.
[123] Jonathan McPherson, Kwan-Liu Ma, Paul Krystosk, Tony Bartoletti, and Marvin Christensen. PortVis: a tool for port-based detection of security events. In Proceedings of the 2004 ACM workshop on visualization and data mining for computer security, pages 73–81, New York, NY, USA. ACM.
[124] Mark Meiss, Filippo Menczer, and Alessandro Vespignani. Properties and evolution of internet traffic networks from anonymized flow data. ACM Transactions Internet Technology, 10(4):15:1–15:23, March 2011.
[125] Nikolay Melnikov and Jürgen Schönwälder. Cybermetrics: User identification through network flow analysis. In Burkhard Stiller and Filip De Turck, editors, Mechanisms for Autonomous Management of Networks and Services, volume 6155 of Lecture Notes in Computer Science, pages 167–170. Springer Berlin / Heidelberg.
[126] Pavel Minarik and Tomas Dymacek. NetFlow data visualization based on graphs. Visualization for Computer Security, pages 144–151, 2008.
[127] Pavel Minarik, Jan Vykopal, and Vojtech Krmicek. Improving host profiling with bidirectional flows. In International Conference on Computational Science and Engineering, volume 3, pages 231–237, 2009.
[128] Saeed Moghaddam and Ahmed Helmy. SPIRIT: A simulation paradigm for realistic design of mature mobile societies. In 7th International Wireless Communications and Mobile Computing Conference, pages 232–237, July 2011.
[129] Abuagla Babiker Mohd and Sulaiman bin Mohd Nor. Towards a flow-based internet traffic classification for bandwidth optimization. International Journal of Computer Science and Security, 3(3):146, 2009.
[130] Andrew Moore, Michael Crogan, Andrew W. Moore, Queen Mary, Denis Zuev, Denis Zuev, and Michael L. Crogan. Discriminators for use in flow-based classification. Technical Report RR-05-13, Department of Computer Science, Queen Mary University of London, August 2005.

[131] Cristian Morariu, Thierry Kramis, and Burkhard Stiller. DIPStorage: Distributed storage of ip flow records. In 16th IEEE Workshop on Local and Metropolitan Area Networks, pages 108–113, 2008.
[132] Cristian Morariu, Peter Racz, and Burkhard Stiller. SCRIPT: A framework for scalable real-time ip flow record analysis. In 2010 IEEE Network Operations and Management Symposium, pages 278–285, April.
[133] Cristian Morariu and Burkhard Stiller. DiCAP: Distributed packet capturing architecture for high-speed network links. In 33rd IEEE Conference on Local Computer Networks, pages 168–175, 2008.
[134] Cristian Morariu and Burkhard Stiller. Distributed architecture for real-time traffic analysis. In Proceedings of the mechanisms for autonomous management of networks and services, and 4th international conference on autonomous infrastructure, management and security, pages 171–174, Berlin, Heidelberg, 2010. Springer-Verlag.
[135] Cristian Morariu and Burkhard Stiller. An open architecture for distributed IP traffic analysis (DITA). In IFIP/IEEE International Symposium on Integrated Network Management, pages 952–957, May 2011.
[136] Jan Tore Morken. Distributed NetFlow processing using the map-reduce model. Master’s thesis, Norwegian University of Science and Technology, 2010.
[137] Martin Mulazzani, Markus Huber, and Edgar Weippl. Anonymity and monitoring: How to monitor the infrastructure of an anonymity system. IEEE Transactions on Systems, Man, and Cybernetics, 40(5):539–546, September 2010.
[138] Kanthi Nagaraj, K. V. M. Naidu, Rajeev Rastogi, and Scott Satkin. Efficient aggregate computation over data streams. In IEEE 24th International Conference on Data Engineering, pages 1382–1384, April 2008.
[139] T. T. Thuy Nguyen and Grenville Armitage. A survey of techniques for Internet traffic classification using machine learning. Communications Surveys Tutorials, IEEE, 10(4):56–76, 2008.
[140] Jon Oberheide, Michael Goff, and Manish Karir. Flamingo: Visualizing internet traffic. In 10th IEEE/IFIP Network Operations and Management Symposium, pages 150–161, April 2006.
[141] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1099–1110, New York, NY, USA. ACM.
[142] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD ’08, pages 1099–1110, New York, NY, USA, 2008. ACM.

[143] Arne Oslebo. Stager: a web based application for presenting network statistics. In 10th IEEE/IFIP Network Operations and Management Symposium, pages 1–15, April 2006.
[144] Neal Patwari, Alfred O. Hero, and Adam Pacholski. Manifold learning visualization of network traffic data. Proceeding of the 2005 ACM SIGCOMM workshop on mining network data, page 191.
[145] Vladislav Perelman, Nikolay Melnikov, and Jürgen Schönwälder. Flow signatures of popular applications. In Nazim Agoulmine, Claudio Bartolini, Tom Pfeifer, and Declan O’Sullivan, editors, Integrated Network Management, pages 9–16. IEEE, 2011.
[146] Dave Plonka. FlowScan: A Network Traffic Flow Reporting and Visualization Tool. In Proceedings of the 14th USENIX conference on system administration, pages 305–318, Berkeley, CA, USA, 2000. USENIX Association.
[147] André Proto, Leandro A Alexandre, Maira L Batista, Isabela L Oliveira, and Adriano M Cansian. Statistical model applied to netflow for network intrusion detection. Transactions on Computational Science, 11:179–191, 2010.
[148] Wang Qun, Dai Xiuyue, and Huang Lu. Novelty P2P flow analysis system. In 7th International Conference on Wireless Communications, Networking and Mobile Computing, pages 1–4, 2011.
[149] Ricardo Gonçalves Ramalho. Role based behavior analysis. Master’s thesis, Segurança Informática, Universidade de Lisboa, 2009.
[150] Martin Rehak, Michal Pechoucek, Pavel Celeda, Vojtech Krmicek, Martin Grill, and Karel Bartos. Multi-agent approach to network intrusion detection. In Proceedings of the 7th international joint conference on autonomous agents and multiagent systems demo papers, pages 1695–1696. International Foundation for Autonomous Agents and Multiagent Systems, 2008.
[151] Martin Rehak, Michal Pechoucek, and Pavel Minarik. Collaborative attack detection in high-speed networks. Analysis, 4696 LNAI:73–82, 2007.
[152] Pin Ren, Yan Gao, Zhichun Li, Yan Chen, and B Watson. IDGraphs: intrusion detection and analysis using stream compositing. Computer Graphics and Applications, IEEE, 26(2):28–39, 2006.
[153] Fabio Ricciato, Felix Strohmeier, Peter Dorfinger, and Angelo Coluccia. One-way loss measurements from IPFIX records. In Measurements and Networking Proceedings, 2011 IEEE International Workshop on, pages 158–163.
[154] Mohd Saufy Rohmad, Farok Azmat, Mazani Manaf, and Jamalul-lail Abdul Manan. Enhanced Netflow version 9 (e-Netflow v9) for network mediation: Structure, experiment and analysis. In International Symposium on Information Technology, volume 3, pages 1–6, 2008.

[155] Dario Rossi and Silvio Valenti. Fine-grained traffic classification with netflow data. Proceedings of the 6th International Wireless Communications and Mobile Computing, page 479, 2010.
[156] Yukiko Sawaya, Ayumu Kubota, and Yutaka Miyake. Detection of attackers in services using anomalous host behavior based on traffic flow statistics. In 2011 IEEE/IPSJ 11th International Symposium on Applications and the Internet, pages 353–359, July.
[157] Dominik Schatzmann, Simon Leinen, Jochen Kögel, and Wolfgang Mühlbauer. FACT: Flow-based approach for connectivity tracking. In Neil Spring and George F Riley, editors, PAM, volume 6579 of Lecture Notes in Computer Science, pages 214–223. Springer, 2011.
[158] Dominik Schatzmann, Wolfgang Mühlbauer, Thrasyvoulos Spyropoulos, and Xenofontas Dimitropoulos. Digging into HTTPS: flow-based classification of webmail traffic. In Proceedings of the 10th annual conference on Internet measurement, pages 322–327, New York, NY, USA, 2010. ACM.
[159] Vyas Sekar, Michael K Reiter, Walter Willinger, Hui Zhang, Ramana Rao Kompella, and David G Andersen. cSamp: a system for network-wide flow monitoring. In Proceeding of the 5th USENIX NSDI, San Francisco, CA, April 2008.
[160] David S. Shelley and Mehmet Hadi Güneş. Gerbilsphere: Inner sphere network visualization. Computer Networks, 56(3):1016–1028, 2012.
[161] Wenchao Shen, Yanjiao Chen, Qianli Zhang, Yang Chen, Beixing Deng, Xing Li, and Guohan Lv. Observations of IPv6 traffic. In International Colloquium on Computing, Communication, Control, and Management, volume 2, pages 278–282, 2009.
[162] M P Singh, N Subramanian, and Rajamenakshi. Visualization of flow data based on clustering technique for identifying network anomalies. In IEEE Symposium on Industrial Electronics Applications, volume 2, pages 973–978, 2009.
[163] A Sinha, K Mitchell, and D Medhi. Flow-level upstream traffic behavior in broadband access networks: DSL versus broadband fixed wireless. In 3rd IEEE Workshop on IP Operations and Management, pages 135–141, 2003.
[164] A J Slagell and K Luo. Sharing network logs for computer forensics: a new tool for the anonymization of netflow records. Workshop of the 1st international conference on security and privacy for emerging areas in communication networks 2005, pages 37–42, 2005.
[165] Robin Sommer and Vern Paxson. Outside the closed world: On using machine learning for network intrusion detection. 2010 IEEE Symposium on Security and Privacy, pages 305–316, 2010.
[166] Murat Soysal and Ece Guran Schmidt. Machine learning algorithms for accurate flow-based network traffic classification: Evaluation and comparison. Performance Evaluation, 67(6):451–467, June 2010.

[167] Anna Sperotto, Gregor Schaffrath, Ramin Sadre, Cristian Morariu, Aiko Pras, and Burkhard Stiller. An overview of IP flow-based intrusion detection. Communications Surveys Tutorials, IEEE, 12(3):343–356, 2010.
[168] Anna Sperotto and Aiko Pras. Flow-based intrusion detection. In Nazim Agoulmine, Claudio Bartolini, Tom Pfeifer, and Declan O’Sullivan, editors, Integrated Network Management, pages 958–963. IEEE, 2011.
[169] C Strasburg, S Krishnan, K Dorman, S Basu, and J S Wong. Masquerade detection in network environments. In 10th IEEE/IPSJ International Symposium on Applications and the Internet, pages 38–44, July 2010.
[170] Felix Strohmeier, Peter Dorfinger, and Brian Trammell. evaluation based on flow data. In 7th International Wireless Communications and Mobile Computing Conference, pages 1585–1589, July 2011.
[171] Andrei M Sukhov, D. I. Sidelnikov, Aleksey A. Galtsev, A. P. Platonov, and M. V. Strizhov. Active flows in diagnostic of troubleshooting on backbone links. Computer Research Repository, abs/0911.2, 2009.
[172] Godfrey Tan, Massimiliano Poletto, John Guttag, and Frans Kaashoek. Role classification of hosts within enterprise networks based on connection patterns. In Proceedings of the annual conference on USENIX Annual Technical Conference, pages 2–2, Berkeley, CA, USA, 2003. USENIX Association.
[173] Teryl Taylor, Stephen Brooks, and John McHugh. NetBytes viewer: An entity-based netflow visualization utility for identifying intrusive behavior. Proceedings of the 2007 Workshop on Visualization for Computer Security, pages 101–114, 2008.
[174] Teryl Taylor, Dianna Paterson, Joel Glanfield, Carrie Gates, Stephen Brooks, and John McHugh. FloVis: Flow visualization system. In Conference For Homeland Security, pages 186–198, March 2009.
[175] Juan Pablo Timpanaro, Isabelle Chrisment, and Olivier Festor. I2P’s usage characterization. In Proceedings of the 4th international conference on Traffic Monitoring and Analysis, pages 48–51, Berlin, Heidelberg, 2012. Springer-Verlag.
[176] Brian Trammell, Bernhard Tellenbach, Dominik Schatzmann, and Martin Burkhart. Peeling away timing error in netflow data. In Neil Spring and George F Riley, editors, PAM, volume 6579 of Lecture Notes in Computer Science, pages 194–203. Springer, 2011.
[177] Ionut Trestian, Supranamaya Ranjan, Aleksandar Kuzmanovic, and Antonio Nucci. Unconstrained endpoint profiling (googling the internet). SIGCOMM Computer Communication Review, 38(4):279–290, August 2008.
[178] P Truong and F Guillemin. A heuristic method of finding heavy hitter prefix pairs in IP traffic. Communications Letters, IEEE, 13(10):803–805, October 2009.
[179] Silvio Valenti and Dario Rossi. Identifying key features for P2P traffic classification. In 2011 IEEE International Conference on Communications ICC, pages 1–6. IEEE.

[180] Pavel Čeleda, Jan Vykopal, Tomas Plesník, Michal Trunečka, and Vojtěch Krmíček. Malware detection from the network perspective using netflow data. In 3rd NMRG Workshop on NetFlow/IPFIX Usage in Network Management, 2010.

[181] Gert Vliek. Detecting spam machines, a netflow-data based approach. Master’s thesis, University of Twente, Netherlands, February 2009.

[182] Jan Vykopal, Tomas Plesnik, and Pavel Minarik. Network-based dictionary attack detection. In International Conference on Future Networks, pages 23–27, March 2009.

[183] Arno Wagner, Thomas Dubendorfer, Lukas Hammerle, and Bernhard Plattner. Flow-based identification of P2P heavy-hitters. International Conference on Internet Surveillance and Protection, 00(c):15–15, 2006.

[184] Cynthia Wagner, Jérôme François, Radu State, and Thomas Engel. DANAK: Finding the odd! In 5th International Conference on Network and System Security, pages 161–168, 2011.

[185] Cynthia Wagner, Jérôme François, Radu State, and Thomas Engel. Machine learning approach for IP-flow record anomaly detection. In Proceedings of the 10th international IFIP TC 6 conference on Networking - Volume Part I, pages 28–39, Berlin, Heidelberg, 2011. Springer-Verlag.

[186] Cynthia Wagner, Gérard Wagener, Radu State, Thomas Engel, and Alexandre Dulaunoy. Game theory driven monitoring of spatial-aggregated IP-Flow records. In International Conference on Network and Service Management, pages 463–468, 2010.

[187] Shen Wang and Rui Guo. GA-based filtering algorithm to defend against DDoS attack in high speed network. In Fourth International Conference on Natural Computation, pages 601–607.

[188] Wuzuo Wang and Weidong Wu. Online detection of network traffic anomalies using degree distributions. International Journal of Communications, Network and System Science, 3(2):177–182, February 2010.

[189] Songjie Wei, Jelena Mirkovic, and Ezra Kissel. In The 2006 International Conference on Data Mining Part of the World Congress in Computer Sciences, pages 269–275, Las Vegas, Nevada, USA.

[190] David Weinberger. The machine that would predict the future. Scientific American, 305(6):52–57, 2011.

[191] Herwin Weststrate. Botnet detection using netflow information: Finding new botnets based on client connections. Structure, 2009.

[192] Philipp Winter, Eckehard Hermann, and Markus Zeilinger. Inductive intrusion detection in flow-based network data using one-class support vector machines. In 4th IFIP International Conference on New Technologies, Mobility and Security, pages 1–5. 2011.

[193] Bo Xu, Ming Chen, and Chao Hu. DEAPFI: A distributed extensible architecture for P2P flows identification. In IEEE International Conference on Network Infrastructure and Digital Content, pages 59–64, 2009.

[194] Kuai Xu, Feng Wang, and Lin Gu. Network-aware behavior clustering of Internet end hosts. In INFOCOM, 2011 Proceedings IEEE, pages 2078–2086, April.

[195] Kuai Xu, Zhi-Li Zhang, and Supratik Bhattacharyya. Profiling Internet backbone traffic: behavior models and applications. SIGCOMM Computer Communication Review, 35(4):169–180, August 2005.

[196] Kuang Yin and Tang Nianqing. Study on the risk detection about network security based on grey theory. In International Forum on Information Technology and Applications, volume 1, pages 411–413, May 2009.

[197] Xiaoxin Yin, William Yurcik, Michael Treaster, Yifan Li, and Kiran Lakkaraju. VisFlowConnect: netflow visualizations of link relationships for security situational awareness. In Proceedings of the 2004 ACM workshop on Visualization and Data Mining for Computer Security, pages 26–34. ACM.

[198] Martin Zadnik, Tomáš Pečenka, and Jan Kořenek. Netflow probe intended for high-speed networks. In International Conference on Field Programmable Logic and Applications, pages 695–698, 2005.

[199] Yuanyuan Zeng, Xin Hu, and Kang G Shin. Detection of botnets using combined host- and network-level information. Symposium A Quarterly Journal In Modern Foreign Literatures, pages 291–300, 2010.

[200] Wei Zha and Jinyuan He. On campus network P2P and its link control. In International Conference on Consumer Electronics, Communications and Networks, pages 5086–5089, April 2011.

[201] Hongzhuo Zhang. Study on the TOPN Abnormal Detection Based on the NetFlow Data Set. Computer and Information Science, 2(3):103–108, 2009.

[202] Ji Zhang and Shaoqing Meng. A design of NetFlow traffic statistic and analysis system for process of the transition of commercialization of IPV6. In International Conference on Computer Science and Service System, pages 963–965, June 2011.

[203] Yu Zhang, Binxing Fang, and Hao Luo. Identifying high-rate flows based on sequential sampling. IEICE Transactions on Information and Systems, E93-D(5):1162–1174, 2010.

[204] Wang Zhenqi and Wang Xinyu. NetFlow based intrusion detection system. 2008 international conference on multiMedia and information technology, pages 825–828.

[205] Haiting Zhu, Xiaoguo Zhang, and Wei Ding. Research on errors of utilized bandwidth measured by NetFlow. In Second International Conference on Networking and Distributed Computing, pages 45–49, 2011.