Visual Analysis of Network Traffic – Interactive Monitoring, Detection, and Interpretation of Security Threats

Dissertation for the attainment of the academic degree of Doctor of Natural Sciences at the Universität Konstanz, Department of Computer and Information Science

Universität Konstanz


submitted by Florian Mansmann


Abstract

The Internet has become a dangerous place: malicious code gets spread on personal computers across the world, creating botnets ready to attack the network infrastructure at any time. Monitoring network traffic and keeping track of the vast number of security incidents or other anomalies in the network are challenging tasks. While monitoring and intrusion detection systems are widely used to collect operational data in real time, attempts to manually analyze their output at a fine-granular level are often tedious, require exhaustive human resources, or completely fail to provide the necessary insight due to the complexity and the volume of the underlying data.

This dissertation represents an effort to complement automatic monitoring and intrusion detection systems with visual exploration interfaces that empower human analysts to gain deeper insight into large, complex, and dynamically changing data sets. In this context, one key aspect of visual analysis is the refinement of existing visualization methods to improve their scalability with respect to a) data volume, b) visual limitations of computer screens, and c) human perception capacities. In addition, the development of innovative visualization metaphors for viewing network data is a further key aspect of this thesis.

In particular, this dissertation deals with scalable visualization techniques for detailed analysis of large network time series. By grouping time series according to their logical intervals in pixel visualizations and by coloring them for better discrimination, our methods enable accurate comparisons of temporal aspects in network security data sets. In order to reveal the peculiarities of network traffic and distributed attacks with regard to the distribution of the participating hosts, a hierarchical map of the IP address space, which takes both geographical and topological aspects of the Internet into account, is proposed.
Since visual clutter becomes an issue when naively connecting the major communication partners on top of this map, hierarchical edge bundles are used for grouping traffic links based on the map's hierarchy, thereby facilitating a more scalable analysis of communication partners. Furthermore, the map is complemented by multivariate analysis techniques for visually studying the multidimensional nature of network traffic and security event data. Especially the interaction of the implemented prototypes reveals the ability of the proposed visualization methods to provide an overview, to relate communication partners, to zoom into regions of interest, and to retrieve detailed information.

For an even more detailed analysis of hosts in the network, we introduce a graph-based approach to tracking behavioral changes of hosts and higher-level network entities. This information is particularly useful for detecting misbehaving computers within the local network infrastructure, which can otherwise substantially compromise the security of the network. To complete the comprehensive view on network traffic, a Self-Organizing Map was used to demonstrate the usefulness of visualization methods for analyzing not only structured network protocol data, but also unstructured information, e.g., the textual content of email messages. By extracting features from the emails, the neural network algorithm clusters similar emails and is capable of distinguishing between spam and legitimate emails to a certain extent.

In the scope of this dissertation, the presented prototypes demonstrate the applicability of the proposed visualization methods in numerous case studies and reveal the exhaustless potential of their usage in combination with automatic detection methods.
We are therefore confident that, in the fields of network monitoring and security, visual analytics applications will quickly find their way from research into practice by combining human background knowledge and intelligence with the speed and accuracy of computers.

Zusammenfassung

The Internet has become a dangerous place: malicious code spreads on personal computers around the world, creating so-called botnets that are ready to attack the network infrastructure at any time. Monitoring network traffic and keeping track of the enormous number of security-relevant incidents or anomalies in the network are difficult tasks. While monitoring and intrusion detection systems are widely used to collect operational data in real time, attempts to manually analyze their output at a detailed level are often tedious, require substantial personnel, or fail completely to deliver the necessary insights due to the complexity and volume of the underlying data.

This dissertation represents an effort to complement automatic monitoring and intrusion detection systems with visual exploration interfaces that enable human analysts to gain deeper insights into huge, complex, and dynamically changing data sets. In this context, one central concern of visual analysis is to refine existing visualization methods in order to improve their scalability with respect to a) the data volume, b) the visual limitations of computer screens, and c) the capacity of human perception. Beyond that, the development of innovative visualization metaphors is a further central concern of this thesis.

In particular, this dissertation is concerned with scalable visualization techniques for the detailed analysis of huge network time series.
By grouping time series in pixel visualizations according to their logical intervals and coloring them for better discrimination, our methods permit accurate comparisons of temporal aspects in network security data sets.

To reveal the peculiarities of network traffic and distributed attacks with respect to the distribution of the participating hosts, a hierarchical map of the IP address space is proposed that takes both geographical and topological aspects of the Internet into account. Since naively connecting the most important communication partners on the map would lead to distracting visual artifacts, Hierarchical Edge Bundles can be used to group the traffic links according to the map's hierarchy, thereby enabling a more scalable analysis of the communication partners.

Furthermore, the map is complemented by a multivariate analysis technique for visually studying the multidimensional nature of network traffic and security event data. In particular, the interaction of the implemented prototypes reveals the ability of the proposed visualization methods to provide an overview, to relate communication partners, to zoom into regions of interest, and to retrieve detailed information.

For an even more detailed analysis of the hosts in the network, we introduce a graph-based approach to observing behavioral changes of hosts and higher-level network entities.
This kind of information is particularly useful for uncovering misbehavior of computers within the local network infrastructure, which could otherwise substantially endanger the security of the network.

To round off the comprehensive view of network traffic, a Self-Organizing Map was used to demonstrate the suitability of the visualization methods for analyzing not only structured network protocol data but also unstructured information, such as the textual content of email messages. By extracting characteristic features from the emails, the neural network algorithm clusters similar emails and is able to distinguish between spam and legitimate emails to a certain extent.

Within the scope of this dissertation, the presented prototypes demonstrate the broad applicability of the proposed visualization methods in numerous case studies and reveal their inexhaustible potential for use in combination with automatic intrusion detection methods. We are therefore confident that visual analytics applications in the fields of network monitoring and security will quickly find their way from research into practice by combining human background knowledge and intelligence with the speed and accuracy of computers.

Parts of this thesis were published in:

[1] Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, and Tobias Schreck. Monitoring network traffic with radial traffic analyzer. In Proceedings of IEEE Symposium on Visual Analytics Science and Technology, pages 123–128, 2006.

[2] Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, Jim Thomas, and Hartmut Ziegler. Visual Data Mining: Theory, Techniques and Tools for Visual Analytics, chapter Visual Analytics: Scope and Challenges. Springer, 2008. Lecture Notes in Computer Science (LNCS), to appear.

[3] Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, and Hartmut Ziegler. Challenges in visual data analysis. In Information Visualization. IEEE Press, 2006.

[4] Daniel A. Keim, Florian Mansmann, and Tobias Schreck. Mailsom – visual exploration of electronic mail archives using self-organizing maps. In Conference on Email and Anti-Spam, 2005.

[5] Florian Mansmann, Fabian Fischer, Daniel A. Keim, and Stephen C. North. Visualizing large-scale IP traffic flows. In Proceedings of the 12th International Workshop on Vision, Modeling, and Visualization, pages 23–30. mpn, Saarbrücken, Germany, November 2007.

[6] Florian Mansmann, Daniel A. Keim, Stephen C. North, Brian Rexroad, and Daniel Shelehedal. Visual analysis of network traffic for resource planning, interactive monitoring, and interpretation of security threats. IEEE Transactions on Visualization and Computer Graphics, 13(6):1105–1112, 2007. Proceedings of the IEEE Conference on Information Visualization.

[7] Florian Mansmann, Lorenz Meier, and Daniel A. Keim. Visualization of host behavior for network security. In VizSec 2007 – Workshop on Visualization for Computer Security. Springer, 2008. To appear.

[8] Florian Mansmann and Svetlana Vinnik. Interactive exploration of data traffic with hierarchical network maps. IEEE Transactions on Visualization and Computer Graphics, 12(6):1440–1449, 2006.

[9] Svetlana Vinnik and Florian Mansmann. From analysis to interactive exploration: Building visual hierarchies from OLAP cubes. In Proceedings of the 10th International Conference on Extending Database Technology, pages 496–514, 2006.

Contents

1 Introduction
  1.1 Monitoring network traffic
  1.2 Intrusion detection and security threat prevention
  1.3 Visual analysis for network security
  1.4 Thesis outline and contribution

2 Networks, intrusion detection, and data management
  2.1 Network fundamentals
  2.2 Capturing network traffic
  2.3 Intrusion detection
  2.4 Building a data warehouse for network traffic and events

3 Foundations of information visualization for network security
  3.1 Information visualization
  3.2 Visual Analytics
  3.3 Related work on visualization for network monitoring and security

4 Temporal analysis of network traffic
  4.1 Related work on time series visualization
  4.2 Extending the recursive pattern method for time series visualization
  4.3 Comparing time series using the recursive pattern
  4.4 Case study: temporal pattern analysis of network traffic
  4.5 Summary

5 A hierarchical approach to visualizing IP network traffic
  5.1 Related work on hierarchical visualization methods
  5.2 The Hierarchical Network Map
  5.3 Space-filling layouts for diverse data characteristics
  5.4 Evaluation of data-driven layout adaptation
  5.5 User-driven data exploration
  5.6 Case studies: analysis of traffic distributions in the IPv4 address space
  5.7 Summary

6 An end-to-end view of IP network traffic
  6.1 Related work on network visualizations based on node-link diagrams
  6.2 Linking network traffic through hierarchical edge bundles
  6.3 Case study: visual analysis of traffic connections
  6.4 Discussion
  6.5 Summary

7 Multivariate analysis of network traffic
  7.1 Related work on multivariate and radial information representations
  7.2 Radial Traffic Analyzer
  7.3 Temporal analysis with RTA
  7.4 Integrating RTA into HNMap
  7.5 Case study: intrusion detection with RTA
  7.6 Discussion
  7.7 Summary

8 Visual analysis of network behavior
  8.1 Related work on dimension reduction
  8.2 Graph-based monitoring of network behavior
  8.3 Integration of the behavior graph in HNMap
  8.4 Automatic accentuation of highly variable traffic
  8.5 Case studies: monitoring and threat analysis with behavior graph
  8.6 Evaluation
  8.7 Summary

9 Content-based visual analysis of network traffic
  9.1 Related work on visual analysis of email communication
  9.2 Self-organizing maps for content-based retrieval
  9.3 Case study: SOMs for email classification
  9.4 Summary

10 Thesis conclusions
  10.1 Conclusions
  10.2 Outlook

List of Figures

2.1 The layers of the OSI, TCP/IP and hybrid reference models
2.2 Advertised IPv4 address count – daily average [75]
2.3 IP datagram
2.4 Concept of an SSH tunnel
2.5 Port scan using NMap
2.6 Concept of a demilitarized zone (DMZ)
2.7 A multilayered data warehousing system architecture
2.8 Example netflow cube with three dimensions and sample data
2.9 Modeling network traffic as an OLAP cube
2.10 Navigating in the hierarchical dimension IP address

3.1 Mapping values to color using different normalization schemes
3.2 The Scope of Visual Analytics
3.3 The traffic visualization tool TNV

4.1 Line charts of 5 time series of mail traffic over a time span of 1440 minutes
4.2 Recursive pattern example configuration: 30 days in each of the 12 months
4.3 Recursive pattern parametrization showing a weekly reappearing pattern
4.4 Multi-resolution recursive pattern with empty fields for normalizing irregularities in the time dimension
4.5 Enhancing the recursive pattern with spacing
4.6 Different coloring options for distinguishing between time series
4.7 Combination of two time series at different hierarchy levels
4.8 Recursive pattern in small multiples mode
4.9 Recursive pattern in parallel mode
4.10 Recursive pattern in mixed mode
4.11 Case study showing the number of SSH flows per minute over one week
4.12 Visualizing different characteristics of SSH traffic

5.1 Example of a hierarchical data visualization using a rectangle-packing algorithm
5.2 Density histogram: distribution of sessions over the IP address space
5.3 Multi-resolution approach: Hierarchical Network Map
5.4 Scaling effects in the HNMap demonstrated on some IP prefixes in Germany
5.5 HNMap on the powerwall
5.6 Border coloring scheme
5.7 Geographic HistoMap layout
5.8 HistoMap 1D layout

5.9 Strip Treemap layout
5.10 Anonymized outgoing traffic connections from the university gateway
5.11 Average position change
5.12 Average side change
5.13 HNMap interactions
5.14 Recursive pattern pixel visualization showing individual hosts
5.15 Multiple map instances facilitate comparison of traffic of several time spans
5.16 Interface for configuring the animated display of a series of map instances
5.17 Visual exploration process for resource location planning
5.18 Monitoring traffic changes
5.19 Employing Radial Traffic Analyzer to find dependencies between dimensions
5.20 Rapid spread of botnet computers in China in August 2006

6.1 Comparison of different strategies for drawing adjacency relationships
6.2 The IP/AS hierarchy determines the control polygon for the B-spline
6.3 HNMap with edge bundles showing the 500 most important connections
6.4 Assessing major traffic connections through edge bundles

7.1 Scatter plot matrix
7.2 Design rationale of RTA
7.3 Continuous refinement of RTA by adding new dimension rings
7.4 RTA display with network traffic distribution at a local computer
7.5 Animation over time in RTA in time frame mode
7.6 Invocation of the RTA interface from within the HNMap display
7.7 Security alerts from Snort in RTA

8.1 Normalized traffic measurements of two hosts
8.2 Coordinate calculation of the host position
8.3 Host behavior graph of 33 prefixes over a time span of 1 hour
8.4 Fine-tuning the graph layout through cohesion forces
8.5 Integration of the behavior graph view into HNMap
8.6 Automatic accentuation of highly variable '/24' prefixes
8.7 Overview of network traffic between 12 and 18 hours
8.8 Nightly backups and Oracle DB traffic in the early morning
8.9 Investigating suspicious host behavior through accentuation
8.10 Evaluating 525 000 SNORT alerts recorded from Jan. 3 to Feb. 26, 2008
8.11 Splitting the analysis into internal and external alerts reveals different clusters
8.12 Analysis of 63 562 SNORT alerts recorded from January 21 to 27, 2008
8.13 Performance analysis of the layout algorithm

9.1 tf-idf feature extraction on a collection of 100 emails
9.2 The learning phase of the SOM
9.3 Spam histogram of a sample email archive
9.4 Component plane for the term "work"

1 Introduction

"Computers are incredibly fast, accurate and stupid: humans are incredibly slow, inaccurate and brilliant; together they are powerful beyond imagination."

Albert Einstein

Contents
  1.1 Monitoring network traffic
  1.2 Intrusion detection and security threat prevention
  1.3 Visual analysis for network security
  1.4 Thesis outline and contribution

It is a fact that digital communication and sharing of data have proven to be cheap, efficient, and effective. Over the years, this has turned the Internet into an indispensable resource in our everyday life: in the modern information society, not only private communication, but also education, administration, and business largely depend on the availability of the information infrastructure. To ensure the health of the network infrastructure, the following three aspects play critical roles:

1. Effective monitoring of the network to detect failures and react to overload situations in a timely manner.

2. Detection of intrusions and attacks that aim at stealing confidential information, misusing hijacked computers for malicious activities, and paralyzing business or services in the Internet.

3. Human capability to react to unforeseen threats to the network infrastructure.

Network monitoring is an essential task to keep the information infrastructure up and running. It is usually executed through a system that constantly monitors the hardware and software components crucial for the vitality of the network infrastructure and informs the network administrators in case of outages. Through so-called activity profiling, the monitoring system tries to distinguish between normal and abnormal usage and network behavior. In most cases, it is easy to handle failures that have previously occurred. However, recognition of abnormal network behavior often involves unnoticed misuse of the network and many false alarms, which eventually lead to information overload for the involved system administrators.

Network security has become a constant race against time: so-called 0-day exploits, which are security vulnerabilities that are still unknown to the public, have become a valuable good in the hands of hackers. These exploits are used to develop malicious code, which infiltrates various computers in the Internet even before virus scanners and firewalls are capable of offering effective countermeasures. Often, this malicious code communicates with a botnet server and only waits to receive commands to execute code on the hijacked computer. If many of these infected computers are interlinked, they form a botnet and are a mighty weapon to harvest websites for email addresses, to send out spam messages, or to jointly conduct a denial-of-service attack against commercial or governmental webservers.

Today, signature-based and anomaly-based intrusion detection are considered the state of the art of network security. However, fine-tuning parameters and analyzing the output of these intrusion detection methods can be complex, tedious, and even impossible when done manually. Furthermore, current malware trends suggest an increase in security incidents and a diversification of malware for the foreseeable future [60].
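The activity-profiling idea described above can be reduced to a simple statistical baseline. The following Python fragment is a minimal sketch, not the method of any particular monitoring system; the function name, threshold, and sample counts are all invented for illustration. It flags time slots whose flow counts deviate strongly from the mean of the series:

```python
from statistics import mean, stdev

def flag_anomalies(counts, threshold=2.5):
    """Return indices of measurements deviating strongly from the baseline.

    `counts` holds per-minute flow counts; an index is flagged when its
    z-score against the whole series exceeds `threshold`.
    """
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:  # perfectly flat series: nothing can stand out
        return []
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]

# A quiet baseline with one burst (e.g., a scan) in minute 5:
print(flag_anomalies([100, 98, 103, 97, 101, 950, 99, 102, 100, 98]))  # [5]
```

Such a rigid global threshold illustrates exactly the dilemma mentioned above: set too low, it drowns the administrator in false alarms; set too high, misuse evolves unnoticed.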
In general, it is noticeable that systems become more and more sophisticated and make decisions on their own to a certain degree. As soon as unforeseen events occur, the system administrator or security expert has to intervene to handle the situation. The network monitoring and security fields seem to have profited a lot from automatic detection methods in recent years. However, there is still a large potential for visual approaches to foster a better understanding of the complex information through visualization and interaction. In addition, a very promising research field for solving many of today's information overload problems in network security is visual analytics, which aims at bridging the gap between automatic and visual analysis methods.

In the remainder of this chapter, the need for network monitoring will be explained. We proceed by discussing how intrusion detection systems are used to prevent security threats. The need for visual analysis for network security is then motivated through its potential to bridge the gap between the human analyst and the automatic analysis methods. The last section gives an outline of this thesis.

1.1 Monitoring network traffic

The computer network infrastructure forms the technical core of the information society. It transports increasing amounts of arbitrary kinds of information across arbitrary geographic distances. To date, the Internet is the most successful computer network. It has fostered the implementation of all kinds of productive information systems unimaginable at the time it was originally designed.

While the wealth of applications that can be built on top of the Internet infrastructure is virtually unlimited, there are fundamental protocol elements which rule the way information is transmitted between the nodes on the network. It is an interesting problem to devise tools for visual analysis of key network characteristics based on these well-defined protocol elements, thereby supporting the network monitoring application domain. Network monitoring in general is concerned with the surveillance of important network performance metrics to a) supervise network functionality, b) detect and prevent potential problems, and c) develop effective countermeasures for networking anomalies and sabotage as they occur. One may distinguish between unintentional defects due to human failures or other malfunctions, referred to as flaws, and intentional misuses of the system, known as intrusions.

The main focus of most network monitoring systems is to collect operational data from countless network connections, switches, routers, and firewalls. These data need to be stored in central repositories in a timely manner, where network operations staff can conveniently access and use them to tackle failures within the network. However, one major drawback of these systems is that they only employ simple business charts to visualize their data, without taking into account the interlinked properties of the multi-dimensional network data. In the course of this dissertation, it will be demonstrated how information visualization techniques can be applied to gain insight into the complex nature of large network data sets.

1.2 Intrusion detection and security threat prevention

Since the Internet has de facto become the information medium of first resort, each host on the network is forced to face the danger of being continuously exposed to a hostile environment. What started as proof-of-concept implementations by a few experts for unveiling security vulnerabilities has become a sport among script kiddies and has drawn the attention of criminals. Network security has therefore turned into one of the central and most challenging issues in network communication for practitioners as well as for researchers. Security vulnerabilities in a system are exploited with the intention to infect computers with worms and viruses, to "hack" company networks and steal confidential information, to run criminal activities through the compromised network infrastructure, or to paralyze online services through denial-of-service attacks. The frequency and intensity of the attacks prohibit any laxity in monitoring the network behavior of a system.

One of the most famous infections to date was the SQL Slammer worm in January 2003. Due to a vulnerability in the Microsoft SQL Server, the worm was able to install itself on Microsoft servers and started to wildly scan the network in order to propagate itself. It was not the unavailability of the Microsoft SQL servers, but the traffic generated by the extensive scans, which caused packet loss or, in some instances, completely saturated circuits. Several large Internet transit providers and end-user ISPs were completely shut down. As a result, Bank of America's debit and credit card operations were impacted, denying customers the opportunity to make any transactions using their bank cards [166].

Economies of scale have made usage of the network infrastructure very efficient and extremely cheap. While this allowed the Internet to experience unprecedented growth, it brought about the pitfall that almost every Internet user is exposed to unwanted advertisement e-mail messages, so-called spam.
In the last few years, more and more of these relatively harmless spam messages have turned into phishing mails, aimed at stealing online banking and e-commerce codes and passwords from naive users. Various automatic methods, such as virus scanners, spam filters, online surfing controls, firewalls, and intrusion detection systems, have emerged in response to the need to protect systems from harmful network traffic. However, as human and machine failures will always exist, no fully automated method can provide absolute protection.

Intrusion detection is the major preventive mechanism for timely recognition of malicious use of a system that endangers its integrity and stability. There exist two different concepts to detect intrusions: a) anomaly-based intrusion detection systems (IDS), which offer a higher potential for discovering novel attacks, and b) signature-based IDS, which target already known attack patterns. Anomaly-based detection is carried out by defining the normal state and behavior of the system, with alerts sent out whenever that state is violated. It is a rather complicated task to define the normal behavior precisely enough to minimize false alerts on the one hand and not to let attacks evolve unnoticed on the other.
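The signature-based concept can be sketched in a few lines. The fragment below is only an illustration of the principle; the signature names and payload patterns are invented, and real rule languages such as Snort's also match on ports, protocol flags, and flow state rather than payload bytes alone:

```python
import re

# Hypothetical signatures in the spirit of IDS rules: a name plus a
# byte pattern that is considered evidence of a known attack.
SIGNATURES = {
    "shell-download": re.compile(rb"wget\s+http://"),
    "dir-traversal":  re.compile(rb"\.\./\.\./"),
}

def match_signatures(payload: bytes):
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in SIGNATURES.items() if pattern.search(payload)]

print(match_signatures(b"GET /../../etc/passwd HTTP/1.0"))  # ['dir-traversal']
print(match_signatures(b"GET /index.html HTTP/1.0"))        # []
```

The limitation discussed above is visible here: only attacks whose pattern is already in the signature set can ever be reported, which is why anomaly-based detection is needed as a complement.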

1.3 Visual analysis for network security

The roots of the field of exploratory data analysis date back to the eighties, when John Tukey articulated the important distinction between confirmatory and exploratory data analysis [165], out of the realization that the field of statistics was strongly driven by hypothesis testing at the time. Today, a lot of research deals with an increasing amount of digitally collected data, in the hope that it contains valuable information that can eventually bring a competitive advantage to its owner. Visual data exploration, which can be seen as a hypothesis generation process, is especially valuable because a) it can deal with highly non-homogeneous and noisy data, and b) it is intuitive and requires no understanding of complex mathematical methods [86]. Visualization can thus provide a qualitative overview of the data, allowing data phenomena to be isolated for further quantitative analysis.

The emergence of visual analytics research suggests that more and more visual methods will be closely linked with automatic analysis methods. The goal of visual analytics is to turn the information overload into the opportunity of the decade [156, 157]. Decision-makers should be enabled to examine this massive, multi-dimensional, multi-source, time-varying information stream to make effective decisions in time-critical situations. For informed decisions, it is indispensable to include humans in the data analysis process to combine flexibility, creativity, and background knowledge with the enormous storage capacity and the computational power of today's computers. The specific advantage of visual analytics is that decision makers may fully focus their cognitive and perceptual capabilities on the analytical process, while applying advanced computational capabilities to augment the discovery process.

Networked computers have become so ubiquitous and easy to access that they are also vulnerable [108].
While extensive efforts are made to build and maintain trustworthy systems, hackers often manage to circumvent the security mechanisms and thereby find ways to infiltrate systems and steal confidential information, to compromise networked computers, and in some cases even to take over control of these systems. In practice, large networks consisting of hundreds of thousands of hosts are monitored by integrating logs from gateway routers, firewalls, and intrusion detection systems, using statistical and signature-based methods to detect changes, anomalies, and attacks. Due to economic and technical trends, networks have experienced rapid growth in the last decade, which has resulted in more legitimate as well as malicious traffic than ever before. As a consequence, the number of detected anomalies and security incidents has become too large to cope with manually, which justifies the pressing need for more sophisticated tools.

Our objective is to show how visual analysis can foster deep insight into the large data sets describing IP network activity. The difficult task of detecting various kinds of system vulnerabilities, for example, can be successfully solved by applying visual analytics methods.

Whenever machine learning algorithms become insufficient for recognizing malicious patterns, advanced visualization and interaction techniques encourage expert users to explore the relevant data by taking advantage of human perception, intuition, and background knowledge. Through a feedback loop, the knowledge acquired in the process of human involvement can be used as input for advancing automatic detection mechanisms.

1.4 Thesis outline and contribution

The overall goal of this thesis is to show how visual analysis methods can contribute to the fields of network monitoring and security. In many cases, the large amount of data available from network monitoring processes renders many visualization techniques inapplicable. Therefore, a careful selection and extension of current visualization techniques is needed.

While the first three chapters motivate our work and introduce the necessary foundations of networking, intrusion detection, data modeling, information visualization, and visual analytics, Chapters 4 to 9 deal with efforts to appropriately represent and interact with the available information in order to gain valuable insight for timely reactions in case of failures and intrusions.

Chapter 2 details basic concepts in networking and intrusion detection that are necessary to comprehend the data sets which will be analyzed throughout the application chapters. Network protocols are discussed at an abstract level along with various tools for monitoring, intrusion detection, and threat prevention. Since in some cases one has to deal with extremely large data sets, the performance requirements of the database management system play an important role in our network research efforts. The underlying data model for storing network traffic and event data was inspired by the OLAP (online analytical processing) approach used in building data warehouses for efficiently managing huge data volumes and computing aggregates under high performance requirements.

In Chapter 3, the research fields of information visualization and visual analytics are discussed. Using Shneiderman's data type by task taxonomy, the visualization methods of this thesis are systematically put into context. Furthermore, we show how colors are mapped to values using different scaling functions and propose some literature for further reading.
Next, the relatively young field of visual analytics is defined and its potential for network monitoring and security is pointed out. Based on an extensive review of scientific publications in the field, an overview of visual analysis systems and prototypes for network monitoring and security is presented to the reader.

Starting from low-dimensional input data such as time series, the input data increase in dimensionality as we proceed from Chapter 4 to Chapter 9. All these chapters follow the same methodology: after a short motivation, related visualization methods are reviewed, and the respective visualization approach is introduced, discussed, and evaluated where applicable. Finally, each method's applicability is demonstrated in at least one case study.

Chapter 4 describes the enhanced recursive pattern technique as an alternative to traditional line and bar charts for the comparison of several granular time series. In this visualization technique, each pixel represents, through its color attribute, the aggregated value of a time series for the finest displayed granularity level. Long time series are subdivided into groups of logical units, for example, several hours each consisting of 60 minutes. By allowing empty pixels in the recursive patterns, the technique can better cope with irregularities in time series, such as the irregular number of days or weeks in a month. In order to be able to compare several time series, three coloring schemes and three alternative arrangements are proposed. Finally, the applicability of the extended recursive pattern visualization technique is demonstrated on real data of large-scale SSH flows in our network.

In Chapter 5, we propose the Hierarchical Network Map (HNMap), which is a space-filling map of the IP address space for visualizing aggregated IP traffic.
Within the map, the positions of network entities are defined through a containment-based hierarchy by rendering child nodes as rectangles within the bounds of their parent node in a space-filling way. While the upper continent and country levels require a space-filling geographic mapping method to preserve geographical neighborhood, node placement in the lower two levels depends on the IP addresses contained within the respective autonomous system or network prefix. Since there exist two alternative layout methods and their combination for these lower two levels, we evaluate their applicability according to a) visibility, b) average rectangle aspect ratio, and c) layout preservation. Visual analysis of network traffic and events essentially involves exploration of the data. Therefore, various means of interaction are implemented within our prototype. Finally, three case studies involving resource location planning, traffic monitoring, and botnet spread propagation are conducted and show how the tool enables insightful analyses of large data sets.

In the scope of Chapter 6, the HNMap is extended through hierarchical edge bundles to convey source-destination relationships of the most important network traffic links. In contrast to straight connection lines, these bundles avoid visual clutter while at the same time grouping traffic with similar properties in the IP/AS hierarchy of the map. In order to communicate the intensity of a connection, we consider both coloring and width of the splines. The case study then assesses changes of the major traffic connections throughout a day of network traffic.

Chapter 7 describes the Radial Traffic Analyzer (RTA), which is a visualization tool for multivariate analysis of network traffic.
In the visualization, network traffic is grouped according to joint attribute values in a hierarchical fashion: starting from the inside, each ring represents one dimension of the data set (e.g., source/destination IP or port). While inner rings show the high-level aggregates, outer rings display more detailed information. By interactively rearranging the rings, the aggregation function of the data is changed. By animating the display, it is demonstrated that the RTA can be used for temporal analysis of network traffic. The case study then demonstrates how the tool is applied to the analysis of event data of an intrusion detection system.

In Chapter 8, we propose a novel network traffic visualization metaphor for monitoring the behavior of network hosts. Each host is represented through a number of nodes in a graph, whose positions correspond to the traffic proportions of that particular host within a specific time interval. Subsequent nodes of the same host are then connected through straight lines to denote behavioral changes over time. In an attempt to reduce overdrawing of nodes with the same projected position, we then apply a force-directed graph layout to obtain compact traces for hosts with unchanged traffic proportions and large extensions of the traces that represent hosts with highly variable traffic proportions. Two case studies show how the tool can be used to gain insight into large data sets by analyzing the behavior of hosts in real network monitoring and security scenarios.

Chapter 9 details the analysis of content-based characteristics of network traffic using the well-known Self-Organizing Map (SOM) visualization technique. This neural network approach orders high-dimensional feature vectors on a map according to their distances.
We create text descriptors by extracting the most popular terms from the subject and text fields of 9 400 email messages in an archive and by applying the tf-idf information retrieval scheme. Within the case study, it is demonstrated that the SOM trained on these feature vectors can be used for classification tasks by distinguishing between spam and regular emails based on the position of an email's feature vector on the map.

Chapter 10 concludes the dissertation by summarizing the contributions and giving an outlook on future work.

2 Networks, intrusion detection, and data management for network traffic and events

"Not everything that is counted counts, and not everything that counts can be counted."

Albert Einstein

Contents
2.1 Network fundamentals ...... 10
  2.1.1 Network protocols ...... 10
  2.1.2 The Internet Protocol ...... 13
  2.1.3 Routing ...... 14
  2.1.4 UDP and TCP ...... 15
  2.1.5 Domain Name System ...... 16
2.2 Capturing network traffic ...... 17
  2.2.1 Network sniffers ...... 17
  2.2.2 Encryption, tunneling, and anonymization ...... 18
2.3 Intrusion detection ...... 20
  2.3.1 Network and port scans ...... 21
  2.3.2 Computer viruses, worms, and trojan programs ...... 22
  2.3.3 Countermeasures against intrusions and attacks ...... 23
  2.3.4 Threat models ...... 25
2.4 Building a data warehouse for network traffic and events ...... 26
  2.4.1 Cube definitions ...... 28
  2.4.2 OLAP Operations and Queries ...... 31
  2.4.3 Summary tables ...... 32
  2.4.4 Visual navigation in OLAP cubes ...... 33

Computer networks have become an integral part of the IT infrastructure we use daily. It is therefore worth devoting some time to introducing networking concepts and terminology in order to foster the understanding of the data analysis challenges within this dissertation. In particular, methods for capturing network traffic, methods for intrusion detection, as well as data modeling issues are discussed.

2.1 Network fundamentals

Within this dissertation, networking concepts are only explained briefly. Kurose et al. [95] and Tanenbaum [153] have written excellent books on computer networking for a more thorough discussion.

In general, one can distinguish between network hardware and software. Today's network hardware has diversified into various technologies: cable, fiber links, wireless, and satellite communication work together seamlessly. This is due to the fact that flexible network communication protocols – the software part – introduce the necessary abstraction to facilitate communication among various machines running diverse operating systems and widespread applications.

Many innovations of today's network communication were first proposed in a Request For Comments (RFC). The RFC series was intended as an informal way to quickly distribute and share ideas with other network researchers. It was hosted at the Stanford Research Institute (SRI), one of the first nodes of the ARPANET, which was the predecessor of the Internet [103].

2.1.1 Network protocols

Computers in a network are called hosts. As mentioned above, different technologies exist to connect the hosts of a network. From a structural point of view, one often distinguishes between Local Area Networks (LANs) and Wide Area Networks (WANs). The main difference between them is the communication distance, not necessarily the size, resulting in the use of different communication hardware. Since long-distance links are more expensive than wiring a few local hosts, there is obviously a consolidation effect resulting in one strong link connecting two networks instead of several low-bandwidth links. However, due to availability concerns, many networks are connected through several links.

The Internet is a giant network which consists of countless interconnected networks. Many people confuse the term World Wide Web (WWW) with the term Internet. In fact, the World Wide Web can be seen as a huge collection of interlinked documents accessed via the Internet. Countless webservers provide access to interconnected dynamic and static websites for arbitrary hosts on the Internet. The Internet is thus the name of the network, whereas WWW refers to a particular service running on top of this network infrastructure. Emails, for example, are sent through the Internet and are another service besides the WWW.

There exist two well-known reference models for network protocols: TCP/IP (Transmission Control Protocol/Internet Protocol) and OSI (Open Systems Interconnection). Neither the OSI nor the TCP/IP model and their respective protocols are perfect. The OSI reference model was proposed at a time when the competing TCP/IP protocols were already in widespread use, and no vendor wanted to be the first one to implement and support the OSI protocols.
The second reason OSI never caught on is that the seven proposed layers were unnecessarily complex: two of the layers are almost empty (session and presentation), whereas two others (data link and network) are overloaded. In addition, some functions such as addressing, flow control, and error control reappear in each layer. Third, the early implementations of OSI protocols were flawed and were therefore associated with bad quality, as opposed to TCP/IP, which was supported by a large user community. Finally, people thought of OSI as the creation of some European telecommunication ministries, the European Union, and later

OSI              TCP/IP             hybrid
Application      Application        5 Application
Presentation
Session
Transport        Transport          4 Transport
Network          Network            3 Network
Data Link        Host-to-network    2 Data Link
Physical                            1 Physical

Figure 2.1: The layers of the OSI, TCP/IP, and hybrid reference models

as the creation of the U.S. government, and thus preferred TCP/IP as a solution coming out of innovative research rather than bureaucracy [153]. OSI protocols are rarely used nowadays; the focus is therefore on TCP/IP protocols, but a hybrid reference model combining TCP/IP and OSI is considered throughout this dissertation.

Networking protocols are organized in so-called layers. These layers can be implemented in software (highest flexibility), in hardware (high speed), or in a combination of the two. The hybrid model used here adopts the upper three layers of TCP/IP, but splits the host-to-network layer into the data link and physical layers, as illustrated in Figure 2.1. The five layers of the hybrid model are as follows:

1. The physical layer is concerned with transmitting raw bits over a communication chan- nel and ensures that if one party sends a 1 bit, the other party actually receives it as a 1 and not as a 0.

2. The data link layer, sometimes also called link layer or network interface layer, is composed of the device driver in the operating system and the corresponding network interface card. Together, these two components handle the hardware details of interfacing with the cable.

3. The network layer, or internet layer, handles the movement of packets in the network and routes them from one computer via several hops to their destination. In the TCP/IP protocol suite, this layer is provided by IP (Internet Protocol), ICMP (Internet Control Message Protocol), and IGMP (Internet Group Management Protocol).

4. The transport layer delivers the service of data flows for the application layer above it. The protocol suite includes two conceptually different protocols: TCP (Transmission Control Protocol) offers a reliable flow of data between two hosts for the application layer by acknowledging received packets and retransmitting erroneous ones. UDP (User Datagram Protocol), on the contrary, only sends packets of data from one host to the other, with no guarantee about the arrival of the packets at the other end. It is often used in real-time applications like voice, music, and video streaming, where the loss of some packets is acceptable to a certain degree.

Protocol  Layer  Name                                Purpose / applications
FTP       AL     File Transfer Protocol              file transfer
HTTP      AL     HyperText Transfer Protocol         hypertext transfer
IMAP      AL     Internet Message Access Protocol    electronic mailbox with folders, etc.
POP3      AL     Post Office Protocol version 3      electronic mailbox
SMTP      AL     Simple Mail Transfer Protocol       e-mail transmission across the Internet
SNMP      AL     Simple Network Management Protocol  network management
SSH       AL     Secure Shell                        secure remote login (UNIX, Linux)
TELNET    AL     TELetype NETwork                    remote login (UNIX, Linux)
TCP       TL     Transmission Control Protocol       lossless transmission
UDP       TL     User Datagram Protocol              transmission of simple datagrams (packets might be lost); music, voice, video
ICMP      NL     Internet Control Message Protocol   error messages
IGMP      NL     Internet Group Management Protocol  manages IP multicast groups
IP        NL     Internet Protocol                   global addressing amongst computers

Table 2.1: Common protocols that build upon the TCP/IP reference model (NL = Network Layer, TL = Transport Layer, AL = Application Layer)

5. The communication details of diverse applications are handled in the application layer. Common application layer protocols are Telnet for remote login, File Transfer Protocol (FTP) for file transfer, SMTP for electronic mail, SNMP for network management, etc.
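The contrast between the two transport protocols described in item 4 can be illustrated with a minimal UDP exchange over the loopback interface. This is a sketch using only the Python standard library; the port is chosen by the operating system, and the payloads are arbitrary. Note how each datagram stands alone and how the reply simply targets the sender's source port, with no connection setup as TCP would require.

```python
# Minimal UDP loopback exchange: independent datagrams between two ports,
# no connection establishment, no delivery guarantee.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))          # port 0 = let the OS pick a free port
server_port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"hello", ("127.0.0.1", server_port))

data, (addr, client_port) = server.recvfrom(1024)
print(data, addr)                      # b'hello' 127.0.0.1

# Reply: the sender's source port becomes the reply's destination port.
server.sendto(b"ack", (addr, client_port))
reply, _ = client.recvfrom(1024)
print(reply)                           # b'ack'

server.close()
client.close()
```

On an unreliable network, the `sendto` call could silently fail to deliver; a TCP socket pair would instead negotiate a connection and retransmit on loss.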

An important contribution of the OSI model is the distinction between services, interfaces, and protocols. Each layer performs some services for the layer above. A layer's interface, in turn, tells the processes above how to access it. The protocols used inside the layers can thus be seen independently, and exchanging them will not affect other layers' protocols. Since the upper-layer protocols build upon the services provided by the lower layers, intermediate routers do not necessarily need to understand the protocols of the application layer; it suffices that they communicate data using lower-level protocols and that only the respective source and destination computers are capable of interpreting the used application protocols. A few commonly used protocols are listed in Table 2.1 to convey an intuition about what is done in which layer.

Let us consider a short example: after requesting a web page, the HTTP header (application layer) specifies the status code, modification date, size, content type, and encoding of the document, among other technical details. The TCP protocol of the transport layer then subdivides the document into multiple segments and specifies the source (HTTP = port 80) and destination ports on which the requesting host already listens. This protocol

Figure 2.2: Advertised IPv4 address count – daily average [75]

guarantees reliable and in-order delivery of data from sender to receiver by sending requests and acknowledgments. In case of timeouts, TCP retransmits the lost segments and correctly reassembles them. The IP protocol (network layer) then provides global addressing and takes care of routing the packets from the source to the destination host. Often, this involves many routers, each of which transfers the packet to the next host. Normally, this host is closer to the destination with respect to the network topology. Finally, the communication between two machines in this chain of involved routers and hosts is controlled using the data link layer. This might be handled by Ethernet or other data link and physical layer standards.

2.1.2 The Internet Protocol

As demonstrated in the example above, many upper-layer protocols depend upon the Internet Protocol with its global addressing and routing capabilities. Nowadays, version 4 is most commonly used. The Internet's growth has been documented in several studies [107, 120, 75] by means of estimating its network traffic, its users, and the advertised IP addresses. Figure 2.2 illustrates this growth, stating that currently about 40 % of the approximately 4 billion IPv4 addresses are used. Predictions suggest that IANA (Internet Assigned Numbers Authority) will run out of IP addresses in 2010. However, Network Address Translation (NAT) and IPv6 technology can compensate for the need for more IPv4 addresses. Both technologies are already in use and ready for broader deployment.

Figure 2.3 shows an IP datagram. The IP header normally consists of at least five 32-bit words (one for each of the first five rows in the figure). It specifies the IP version used (mostly 4), the IP header length (IHL), the type of service, the size of the datagram (header + data), the identification number, which in combination with the source address uniquely identifies a packet, several flags, the fragmentation offset (the offset of the fragment's data within the original packet, set by

Version | IHL | Type of Service | Total Length
Identification | Flags | Fragment Offset
Time to Live | Protocol | Header Checksum
Source Address
Destination Address
Options (optional)
Data

Figure 2.3: IP datagram

the routers which perform IP packet fragmentation), the time to live (the maximum number of hops over which the packet may still be routed), the protocol (e.g., 1 = ICMP, 6 = TCP, 17 = UDP), the header checksum, the source address, and the destination address. In some cases, additional options are used by specifying a number greater than five in the IHL field. The rest of the datagram consists of the actual data, the so-called payload.

The IP addressing and routing scheme builds upon two components, namely the IP address and network prefixes:

• An IP address is a 32-bit number (in IPv4) which uniquely identifies a host interface in the Internet. For example, 134.34.240.69 is the IP address of a webserver at the University of Konstanz in dot-decimal notation.

• A prefix is a range of IP addresses and corresponds to one or more networks [65]. For instance, the prefix 134.34.0.0/16 defines the 65 536 IP addresses assigned to the University of Konstanz, Germany. Each prefix consists of an IP address and a prefix length, which specifies the number of leftmost bits that should be considered when matching an IP address to prefixes.
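The fixed 20-byte header of Figure 2.3 can be unpacked directly with the standard library. The following sketch assumes a header without options (IHL = 5) and uses hand-crafted, hypothetical field values; the field layout follows the figure above.

```python
# Unpack the fixed 20-byte IPv4 header (fields as in Figure 2.3).
import struct
import socket

def parse_ipv4_header(raw: bytes) -> dict:
    """Parse the first 20 bytes of an IPv4 datagram (options ignored)."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "ihl": ver_ihl & 0x0F,      # header length in 32-bit words
        "total_length": total_len,
        "ttl": ttl,
        "protocol": proto,          # 1 = ICMP, 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# A hand-crafted example header (hypothetical values, protocol 6 = TCP):
hdr = struct.pack("!BBHHHBBH4s4s", (4 << 4) | 5, 0, 40, 1, 0, 64, 6, 0,
                  socket.inet_aton("134.34.240.69"),
                  socket.inet_aton("10.0.0.1"))
print(parse_ipv4_header(hdr))
# {'version': 4, 'ihl': 5, 'total_length': 40, 'ttl': 64, 'protocol': 6,
#  'src': '134.34.240.69', 'dst': '10.0.0.1'}
```

The `!` in the format string selects network (big-endian) byte order, which is how all header fields arrive on the wire.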

2.1.3 Routing

When traffic is sent from a source to a destination host, several repeaters, hubs, bridges, switches, routers, or gateways might be involved. To clarify these terms, we need to reconsider the layers of the reference model, since these devices operate on different layers. Repeaters were designed to amplify the incoming signal and send it out again in order to extend the maximum cable length (Ethernet: ca. 500 m). Hubs work in a very similar way, but send out the incoming signal on all their other network links. These two devices operate on the physical layer, since they do not understand frames, packets, or headers.

Next, we consider switches and bridges, which both operate on the data link layer. Switches are used to connect several computers, similar to hubs, whereas bridges connect two or more networks. When a frame arrives, the software inside the switch extracts the destination address from the frame, looks it up in a table, and sends the frame out on the respective network link.

When a packet enters a router, the header and the trailer are stripped off and the routing software determines, based on the destination address in the header, to which output line the packet should be forwarded. For an IPv4 packet, this address is a 32-bit number (IPv6: 128 bits) rather than the 48-bit hardware address (also called MAC address or Ethernet Hardware Address (EHA)). The term gateway is often used interchangeably with router. However, transport and application gateways operate one or two layers higher.

Since each network is independently managed, it is often referred to as an Autonomous System (AS). An AS is a connected group of one or more IP prefixes (networks) run by one or more network operators, and has a single and clearly defined routing policy [65]. AS's are indexed by a 16-bit Autonomous System Number (ASN).
Usually, an AS belongs to a local, regional, or global service provider, or to a large customer that subscribes to multiple IP service providers. The border gateway routers, which connect different AS's, base their routing decisions upon their so-called routing tables. Such a table contains a list of IP prefixes, the next router, and the number of hops to the destination.

Prefixes underlie Classless Inter-Domain Routing (CIDR) [54], which was preceded by classful addressing. Classful addressing only allowed 128 class A networks (/8), each consisting of 16 777 214 addresses, 16 384 class B networks (/16) with 65 534 addresses each, and 2 097 152 class C networks (/24) of size 254. Note that the number of available addresses is always 2^N − 2, where N is the number of host bits (32 minus the prefix length); the −2 adjusts for the first and last addresses, which are reserved for special use. Since many mid-size companies required more than 254 addresses, the fear arose that the class B networks would soon be depleted. CIDR introduced variable prefix lengths and thus offered more flexibility to vary network sizes for both internal and external routing decisions.

Through its bitwise address assignment and aggregation strategy, routing tables are kept small and efficient. Continuous ranges of IP addresses which are all forwarded to the identical next hop are aggregated in the routing tables. For example, traffic to the prefixes 134.34.52.0/24 and 134.34.53.0/24, both destined for AS 553, can be aggregated to 134.34.52.0/23. Each time a packet arrives at an intermediate hop, it is forwarded to the router with the most specific prefix entry matching the destination IP address. This is done by checking whether the initial prefix bits are identical. Note that routing is usually more specific within an AS, whereas external routing is highly aggregated due to the fact that all traffic from a particular source to a destination AS needs to pass through the same border gateway router.
Further details of exterior routing, such as policies, costs, and the announcement and withdrawal of prefixes, are dealt with in the Border Gateway Protocol (BGP) [106].
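The two routing ideas above, longest-prefix matching and CIDR aggregation, can be sketched with Python's standard `ipaddress` module. The routing table below is a toy example with hypothetical next-hop names; the two /24 prefixes are the ones from the text.

```python
# Longest-prefix matching and CIDR aggregation with the stdlib.
import ipaddress

# Toy routing table (prefix -> next hop); next-hop names are hypothetical.
table = {
    "134.34.0.0/16":  "router-A",
    "134.34.52.0/23": "router-B",   # more specific entry wins
    "0.0.0.0/0":      "default-gw",
}

def next_hop(ip: str) -> str:
    """Forward to the most specific (longest) matching prefix."""
    addr = ipaddress.ip_address(ip)
    matches = [ipaddress.ip_network(p) for p in table
               if addr in ipaddress.ip_network(p)]
    best = max(matches, key=lambda n: n.prefixlen)
    return table[str(best)]

print(next_hop("134.34.52.17"))   # router-B
print(next_hop("134.34.240.69"))  # router-A
print(next_hop("8.8.8.8"))        # default-gw

# CIDR aggregation: the two /24s from the text collapse into one /23.
nets = [ipaddress.ip_network("134.34.52.0/24"),
        ipaddress.ip_network("134.34.53.0/24")]
print(list(ipaddress.collapse_addresses(nets)))
# [IPv4Network('134.34.52.0/23')]
```

Real routers implement the same longest-match rule with specialized data structures (tries, TCAMs) rather than a linear scan, but the forwarding semantics are identical.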

2.1.4 UDP and TCP

As mentioned previously, UDP and TCP operate on the transport layer and provide end-to-end communication over an unreliable internetwork. The connectionless protocol UDP provides applications with the service of sending IP datagrams with a short additional header. This is done by adding source and destination port fields to the IP header, thus enabling the transport layer to determine which process on the destination machine is responsible for handling the packet. The destination port specifies which process on the target machine is to handle the packet, whereas the source port details on which port the reply to the request should arrive. In the reply, the former source port is simply copied into the destination port so that the requesting machine knows how to handle the answer.

In contrast to UDP, the connection-oriented protocol TCP provides a reliable service for sending byte streams over an unreliable internetwork. This is done by creating so-called sockets, which are simply communication endpoints, and by binding ports local to the host to these sockets. TCP can then establish a connection between a socket on the source and a socket on the target machine. The IANA [74] is responsible for the assignment of application port numbers for both TCP and UDP. Conceptually, there are three ranges of port numbers:

1. On many systems, well-known port numbers ranging from 0 to 1023 can only be used by system (or root) processes or by programs executed by privileged users. They are assigned by the IANA in a standardization effort.

2. Registered port numbers ranging from 1024 to 49 151 can be used by ordinary user processes or programs executed by ordinary users. The IANA registers uses of these ports as a convenience to the community.

3. Dynamic and/or private port numbers ranging from 49 152 to 65 535 can be used by any process and are not available for registration.
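The three IANA ranges above can be encoded in a small helper, which is the kind of lookup our traffic analysis tools perform when labeling ports. The function name and labels are our own; the boundaries are the IANA values listed above.

```python
# Classify a port number into the three IANA ranges described above.
def port_range(port: int) -> str:
    if not 0 <= port <= 65535:
        raise ValueError("port must fit in 16 bits")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic/private"

print(port_range(80))     # well-known (HTTP)
print(port_range(8080))   # registered
print(port_range(50000))  # dynamic/private
```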

In the analysis types presented in this thesis, application port numbers are often used as an indication of which applications are using the network. Since application port numbers can be extracted from the packet headers, this kind of analysis does not require looking at the packet content, which might otherwise raise additional privacy concerns. Although the used application ports are a rather good estimate for regular traffic, application port numbers can be used by processes other than the ones they were originally assigned to. The peer-to-peer Internet telephony system Skype, for example, uses ports 80 and 443, which are registered for web traffic (HTTP) and secure web traffic (HTTPS). This is done in order to bypass application firewalls, which are an effort to protect the network infrastructure from malicious traffic by blocking unused ports. Naturally, this is only possible if the application ports on that particular machine have not already been bound to a webserver process.

2.1.5 Domain Name System

So far, we have discussed various protocols which all rely on some sort of network address (e.g., the MAC address or the IP address). Whereas machines can perfectly deal with these kinds of addresses, humans find them hard to remember. For this reason, ASCII names were introduced to decouple machine names from machine addresses. However, the network itself only understands numerical addresses. Therefore, a mapping mechanism is required to convert the ASCII strings to network addresses. Since the previously used host files could not keep pace with the fast-growing Internet, the Domain Name System (DNS) was invented.

The Internet is conceptually divided into a set of roughly 250 top-level domains. Each one of these domains is further partitioned into subdomains, which in turn can have subdomains of their own, and so on. This scheme spans a hierarchy. One distinguishes between so-called generic (e.g., com, edu, org) and country top-level domains (e.g., de, ch, us). Subdomains can then be registered at the responsible registrar. Each domain is named by the path upward from it to the (unnamed) root, where the components are separated by periods (pronounced "dots"). For example, www.uni-konstanz.de specifies the subdomain www (a common convention for naming a webserver), which is a subdomain of uni-konstanz, which in turn is registered below the top-level domain de. This naming scheme usually follows organizational boundaries rather than the physical network.

When a domain name is passed to the DNS system, the latter returns all resource records associated with that name. For simplicity, we restrict ourselves to address records, which might look like this:

Domain               TTL    Class  Type  Value
www.uni-konstanz.de  86400  IN     A     134.34.240.69

Resource records contain five values, namely the domain name, the time to live (TTL), class, type, and value. In the record above, the TTL value of 86 400 (the number of seconds in one day) indicates that the record is rather stable, since highly volatile information is assigned a small value. IN specifies that the record contains Internet information, and A that it is an address record. The final field specifies the actual IP address, 134.34.240.69, which is mapped to the domain www.uni-konstanz.de. Other resource records hold information such as the start of authority, the responsible mail and name servers for a particular domain, pointers, canonical names, host descriptions, or other informative text. For more details refer to [153].
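The hierarchical structure of a domain name, the path upward to the unnamed root described above, can be made explicit with a few lines of Python. This is a sketch of the name decomposition only, not an actual DNS resolution; the function name is our own.

```python
# Walk a domain name up the DNS hierarchy toward the (unnamed) root.
def ancestors(domain: str) -> list[str]:
    """Return the domain and every enclosing domain, most specific first."""
    labels = domain.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]

print(ancestors("www.uni-konstanz.de"))
# ['www.uni-konstanz.de', 'uni-konstanz.de', 'de']
```

A resolver effectively traverses this list in reverse: it asks the root servers about de, the de servers about uni-konstanz.de, and so on, until it reaches a name server holding the address record.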

2.2 Capturing network traffic

To conduct data analysis of network traffic, details of this traffic need to be obtained from hosts, routers, firewalls, and intrusion detection systems. Often, collecting this data turns out to be a practical challenge, since some network packets pass several routers and are thus recorded several times. Furthermore, the export interfaces of routers might return so-called netflows, detailed records of the size, time, source, and destination of the transferred network traffic, in different formats. For a more detailed description of problems and solutions in measuring network traffic, we suggest the book "Network Algorithmics: an interdisciplinary approach to designing fast networked devices" by George Varghese [172].
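As an example of the export formats mentioned above, the widely deployed NetFlow version 5 prefixes each export packet with a fixed 24-byte header. The following sketch unpacks that header with the standard library; the field layout follows Cisco's published NetFlow v5 description, and the example values are hypothetical.

```python
# Unpack the 24-byte NetFlow v5 export header (a common router export format).
import struct

def parse_nf5_header(raw: bytes) -> dict:
    (version, count, sys_uptime, unix_secs, unix_nsecs,
     flow_sequence, engine_type, engine_id,
     sampling) = struct.unpack("!HHIIIIBBH", raw[:24])
    return {"version": version,          # 5 for NetFlow v5
            "count": count,              # number of flow records that follow
            "unix_secs": unix_secs,      # export timestamp (seconds)
            "flow_sequence": flow_sequence}

# A hand-crafted example header (hypothetical values: 30 flow records):
hdr = struct.pack("!HHIIIIBBH", 5, 30, 1000, 1199145600, 0, 42, 0, 0, 0)
print(parse_nf5_header(hdr))
# {'version': 5, 'count': 30, 'unix_secs': 1199145600, 'flow_sequence': 42}
```

After the header, `count` fixed-size flow records follow, each carrying the source/destination addresses, ports, byte and packet counts that our analyses aggregate.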

2.2.1 Network sniffers

There exists an alternative way of monitoring network traffic when access to the export interface of a router is not given. The network card of almost any computer can be set into promiscuous mode, which instructs the network card to pass all traffic it receives to the CPU rather than just the packets addressed to itself. In the next step, the packets are passed to programs extracting application-level data. Depending on the network infrastructure, packet sniffing can be very effective or not effective at all: hubs forward all traffic to each of their network interfaces (except for the one where it came in), whereas switches only forward incoming traffic to one network link as long as it is not a broadcast packet. Despite this fact, ARP Poison Routing (APR) can be used to fool switches by misleadingly announcing the MAC addresses of other hosts in the network. Today, network administrators and hackers can choose from a wide variety of packet sniffers. A few commonly used freeware tools are listed here:

• libpcap is a system-independent interface for user-level packet capturing, which runs on POSIX systems (Linux, BSD, and UNIX-like OSes).

• tcpdump is a command line tool that prints out the headers of packets on a network interface matching a boolean expression. It runs on POSIX systems and is built upon libpcap.

• WinPcap contains the Windows version of the libpcap API.

• JPcap is a Java wrapper for libpcap and WinPcap, which provides a Java interface for network capturing.

• Wireshark (formerly Ethereal) is a free graphical packet capture and protocol analysis tool, which runs on POSIX systems, MS Windows, and Mac OS X and uses libpcap or WinPcap depending on the OS.

• The OSU Flow-tools are a set of tools for recording, filtering, printing and analyzing flow logs derived from exports of CISCO NetFlow accounting records [55].

Since there has always been a need to monitor network traffic and debug network protocols, many more freeware and commercial products have emerged. An overview can be found in [72]. By simply instructing a router to duplicate outgoing packets and emit them on an additional network interface, it is possible to sniff all network traffic using a machine with a promiscuous network interface that is connected to the router. This avoids the effort of having to deal with the various formats of the export interface. Surprisingly, many commonly used applications and protocols, such as POP3, FTP, IMAP, htaccess, and some webstores, do not encrypt transferred data, usernames, or passwords. For hackers, it therefore often suffices to have a network sniffer installed which listens to the communication between their victims and the servers they use. The sniffed Ethernet frames are simply passed to an application which searches their payload for passwords. Note that in high-load situations not all packets can be captured due to capacity limits of the machine where the sniffer runs. Routers are built and configured to forward large numbers of packets, whereas sniffers need to analyze the packets at a higher level, which requires more processing power.
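To illustrate what a sniffer does with captured data, the following sketch parses the 14-byte header of a raw Ethernet frame, the first step before application-level payloads can be searched; the frame bytes and MAC addresses are invented for the example:

```python
import struct

ETHERTYPE_IPV4 = 0x0800  # payload is an IP packet

def parse_ethernet_frame(frame: bytes):
    """Split a raw frame into (dst MAC, src MAC, EtherType, payload)."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    as_mac = lambda raw: ":".join(f"{b:02x}" for b in raw)
    return as_mac(dst), as_mac(src), ethertype, frame[14:]

# A made-up captured frame: broadcast destination, IPv4 EtherType.
frame = bytes.fromhex("ffffffffffff" "00115a0a0b0c" "0800") + b"<ip packet>"
dst, src, etype, payload = parse_ethernet_frame(frame)
print(dst, src, hex(etype))  # ff:ff:ff:ff:ff:ff 00:11:5a:0a:0b:0c 0x800
```

In a real tool, frames would arrive from libpcap instead of a hard-coded byte string, and the EtherType would dispatch the payload to an IP, ARP, or other protocol parser.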

2.2.2 Encryption, tunneling, and anonymization

Encryption can be used to prevent passwords from being stolen and to guarantee the privacy of the communication. More formally, the desirable properties of secure communication can be identified as follows:

• Confidentiality is often perceived as the only component of secure communication. However, it only ensures that a message is communicated from the sender to the receiver without anybody else being able to obtain its content.

• Authentication implies that both the sender and the receiver should be able to confirm the identity of one another. Naturally, face-to-face communication easily solves that problem, but other forms of communication make authentication a technical challenge.

• Message integrity and nonrepudiation deal with the cryptographic property that a message cannot be altered by the so-called "man in the middle" and with the fact that it can be proven that a certain message was written by the sender and nobody else. In the physical world, we often use signatures, but since it is trivial to make identical copies in the digital world, more sophisticated methods are needed.

• Availability and access control. Experiences with denial-of-service (DoS) attacks over the last few years have shown that availability is vital for ensuring secure communication. To further extend this concept, access control to the communication infrastructure can first and foremost prevent the communication from being intercepted.

Confidentiality, authentication, message integrity, and nonrepudiation have been considered key components of secure communication for some time [117]. More recently, availability and access control were added [111, 15]. There exist two conceptually different encryption methodologies, namely the symmetric and the asymmetric one. Symmetric encryption relies on the fact that sender and receiver share a secret, the so-called key. This key can be used both to encrypt and to decrypt messages. Asymmetric encryption, also known as public key cryptography, employs two keys per identity: the public key can be obtained from a publicly accessible key repository, whereas the private key should be kept secret. The public key of the recipient suffices to encrypt a message addressed to that recipient. However, once encrypted, this message can no longer be read by the sender, since it can only be decrypted using the private key of the recipient. Public keys can also be used to prove the sender's authenticity, the message integrity, and its nonrepudiation. A hash code is generated from the message and encrypted using the sender's private key. This so-called signature is then attached to the original message and encrypted using the recipient's public key. Decryption is likewise twofold: first, the message and signature are decrypted using the recipient's private key; thereupon, the signature is decrypted a second time using the sender's public key and compared to the hash code of the original message. This only works because the encryption and decryption operations commute, which means that encrypting and then decrypting a message yields the same result as decrypting and then encrypting it. Since asymmetric encryption is considerably slower than most symmetric encryption methods, it is often used only to exchange the keys for a symmetric method. Once this is done, fast symmetric encryption can be used to encrypt large volumes of data.
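The sign-then-verify round trip described above can be made concrete with textbook RSA; the parameters below (p = 61, q = 53) are deliberately tiny toy values chosen only to make the arithmetic visible, and offer no security whatsoever:

```python
import hashlib

# Toy RSA key pair (textbook sizes; real keys use 2048-bit or larger moduli).
p, q = 61, 53
n = p * q                # public modulus (3233)
phi = (p - 1) * (q - 1)
e = 17                   # public exponent
d = pow(e, -1, phi)      # private exponent (modular inverse, Python 3.8+)

def digest(msg: bytes) -> int:
    # Reduce the SHA-256 hash modulo n so it fits the toy modulus.
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % n

def sign(msg: bytes) -> int:
    # "Encrypt" the hash with the PRIVATE key: only the key owner can do this.
    return pow(digest(msg), d, n)

def verify(msg: bytes, sig: int) -> bool:
    # Anyone can invert the signature with the PUBLIC key and compare hashes.
    return pow(sig, e, n) == digest(msg)

sig = sign(b"transfer 100 EUR")
print(verify(b"transfer 100 EUR", sig))            # True
print(verify(b"transfer 100 EUR", (sig + 1) % n))  # False: tampered signature
```

Verification succeeds only if decrypting the signature with the public key reproduces the message hash, which is exactly the commutativity property mentioned above.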
In many network security applications, the SSH protocol is used to establish a secure channel over which traffic from other applications can be tunneled, as sketched in Figure 2.4. In other words, the sender creates a local socket, which is bound to the SSH application and which encrypts all incoming

Figure 2.4: Concept of an SSH tunnel – network traffic from diverse applications is automatically encrypted and decrypted, thus preventing eavesdropping by "bad guys"

traffic and sends the encrypted payload over the Internet until it arrives at the recipient, where it is decrypted and forwarded to the port either on the recipient's machine or on a host in the network specified by the SSH tunnel. Note that if the tunnel only exists from the sender's to the recipient's machine, the traffic the recipient forwards is unencrypted. Naturally, the response also travels through this secured channel. Tunneling has proven to be very powerful for hiding private communication, since network administrators only see how much encrypted data is transferred from a source to a destination host using SSH. Even so, it is still possible for the network administrator or a malicious man in the middle to analyze the source and destination IP addresses. However, today many service providers offer anonymization services, which forward traffic to conceal one communication partner. Once the log files of such an anonymization service are deleted, it becomes impossible to trace back the communication, provided that enough people use the service simultaneously. While most users employ encryption, tunneling, and anonymization to guard their privacy in an open network infrastructure like the Internet, these mechanisms can just as easily be misused by criminals to hide their misdeeds. Enforcing availability and access control as the final property of secure communication in an open network infrastructure can turn out to be a challenge. The next section explains how today's technology is used to distinguish between good and bad guys in order to protect the network infrastructure.

2.3 Intrusion detection

The Internet was originally designed so that each host can communicate with any other host in the network. As the number of hosts in the Internet has grown immensely, hosts have become more and more anonymous, and some started misbehaving. Although it is easily traceable which company owns the IP address of a misbehaving host, it can still take several days until that host is cut off. Due to the lack of international laws, a misbehaving host might even

Figure 2.5: Port scan using NMap

be placed in a country where it does not violate any law, whereas its misbehavior is liable to prosecution in the victim's country. Therefore, Internet folks often prefer technical solutions, which can be put in place much faster than law enforcement. This section briefly describes different kinds of misbehavior and details methods to deal with them. Interested readers are referred to [115, 171] for more details.

2.3.1 Network and port scans

In most attack scenarios, a host or a network is first scanned for vulnerabilities using two principal methods. The first method is called a port scan and checks a particular range, or sometimes even all ports, of the target host for listening applications. As soon as an application answers the attacker's request, the attacker tries to guess the kind of application from its protocol and port. Sometimes it is even possible to infer which version of that particular application is running on the probed port. This knowledge gives the hacker meaningful hints about unpatched security vulnerabilities to be exploited when launching an attack. Figure 2.5 shows the output of a port scan with NMap [71], detailing open ports and the respective listening applications of host 192.168.0.41. Note that this quick scan only took 0.162 seconds. Network scans are the second scanning method, usually targeting a whole network or a range of addresses. Commonly, the whole network is scanned on a particular port where a known security vulnerability might still be unfixed. Since sequential scans are easily detectable and can be blocked, tool developers have implemented various strategies to obfuscate scans by permuting the scanned addresses and by artificially spreading the scan over a larger time interval. Still, common Intrusion Detection Systems (IDS) easily manage to detect normal and obfuscated scans.
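The port scan described above can be sketched in a few lines; this is a simplified stand-in for what a TCP connect scan in a tool like NMap does (host, port range, and timeout below are illustrative, and one should of course only probe hosts one is authorized to scan):

```python
import socket

def connect_scan(host: str, ports, timeout: float = 0.5):
    """Return the subset of `ports` on `host` that accept a TCP connection,
    i.e., on which some application is listening."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            # connect_ex returns 0 when the three-way handshake succeeds.
            if s.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports

print(connect_scan("127.0.0.1", range(20, 30)))
```

Unlike NMap, this sketch completes the full handshake for every port and neither guesses service versions nor obfuscates the scan order.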

2.3.2 Computer viruses, worms, and trojan programs

Terms such as computer virus, worm, and trojan program are commonly used as buzzwords in the media, yet their meanings are frequently confused. Therefore, these terms, among other specialized vocabulary, are briefly reviewed in this section.

Computer virus A piece of software that copies itself into other programs. Possibly, it also performs other tasks like spying on users of the infected computer or deleting files. Commonly, computer viruses are passed on by infected files on discs, by Internet downloads, or by email attachments.

Computer worm Worms are viruses that spawn running copies of themselves. In the past, worms have repeatedly demonstrated extremely fast world-wide propagation through known bugs in unpatched email clients or other application software.

Trojan A trojan (derived from the trojan horse of Greek mythology) is a program which is supposed to perform a certain function, but secretly performs another, usually diabolical one.

Botnet The term comes from roBOT NETwork and refers to a large number of compromised computers. Normally, hosts are added to botnets through trojan programs or worms. Many of these bot computers communicate with an Internet Relay Chat (IRC) server without the knowledge of their actual owners and receive remote control commands from there. It is common knowledge that bot computers are misused to "harvest" email addresses in the Internet, send out spam, and participate in DDoS attacks (see below).

0-Day-Exploit Most of today's known malware exploits one or another known security hole and infects unpatched computers or applications with slow update cycles. 0-Day-Exploits, on the contrary, are security holes in software which are yet unknown to both the public and the software producers. On the black market, these exploits are actively traded, since patches do not yet exist and the malware spread is expected to be greater. Furthermore, 0-Day-Exploits can be secretly used to spy on particular target hosts while running little risk of being detected.

Denial of Service attack (DoS) This class of attacks is characterized by the goal of making a computer resource unavailable to its intended users, either by consuming computational resources, by disrupting configuration information (e.g., routing information), or by disrupting physical network components. A SYN flood, for example, is a common DoS attack targeting webservers or other hosts in the Internet. The concept is to send TCP/SYN packets with spoofed sender addresses to the victim host. Because each of these packets is handled like a connection request, the attacked host sends back a TCP/SYN-ACK packet and waits for a TCP/ACK packet in response. However, since the sender addresses were spoofed, the acknowledgment packets never arrive, and the number of available connections on the attacked host is soon depleted. At this point, the victim is no longer reachable by its intended users.

Distributed Denial of Service attack (DDoS) DDoS attacks extend the above concept of DoS attacks by being conducted from multiple compromised hosts. One can easily imagine the challenge of distinguishing normal traffic from the hostile traffic generated by 100 000 remotely controlled hosts of a botnet. DDoS attacks have made companies whose business model depends on the availability of their e-commerce or e-banking platform especially vulnerable to blackmail.

Spam The term spam ("SPiced HAm") has become a synonym for unwanted advertisement emails; it originates from a Monty Python sketch in which the word was used so extensively that no communication was possible any more. The countless unwanted advertisement emails are still a major concern in network operations, probably less because of the waste of network and storage resources than because end users spend a lot of time sorting them out of their email inboxes. The Symantec Internet Security Threat Report [151] stresses that between July 1 and December 31, 2006, spam made up 59 percent of all monitored email traffic. Economies of scale have made it possible for a single spam server with a good network connection to send out several million emails per day at almost no cost. Much effort has gone into enhancing spam detection software, ranging from centralized spam signature repositories to individually trained artificial intelligence filters for automatically removing as many spam emails as possible. However, these methods have to deal with the fact that a single falsely removed ham message (the opposite of a spam message) can be very expensive, as opposed to manually sorting a few spam messages out of the inbox.

Phishing is a rather recently introduced term for fraudulent email messages. Cyber criminals create these messages claiming that they were sent from well-known companies. Links to faked online banking and e-commerce platforms within these emails aim at stealing user names, passwords, and online banking numbers from credulous victims.
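The SYN flood mechanism sketched in the DoS entry above can be illustrated with a back-of-the-envelope model: by Little's law, the number of occupied half-open connection slots in steady state is roughly the spoofed-SYN arrival rate times the time each entry is held. The backlog size and timeout below are illustrative values, not measurements:

```python
def backlog_exhausted(backlog: int, syn_rate: float, hold_time: float) -> bool:
    """True if spoofed SYNs arriving at `syn_rate` (packets/s), each held for
    `hold_time` seconds before timing out, keep the half-open connection
    table (capacity `backlog`) full, so legitimate requests are dropped."""
    steady_state_half_open = syn_rate * hold_time  # Little's law: L = lambda * W
    return steady_state_half_open >= backlog

# A 128-entry backlog with a 75 s SYN-RECEIVED timeout is kept full by
# merely 2 spoofed SYNs per second.
print(backlog_exhausted(backlog=128, syn_rate=2.0, hold_time=75.0))  # True
```

This also shows why defenses either shorten the holding time, enlarge or bypass the table (e.g., SYN cookies), or filter the arrival rate.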

Although this list of definitions is by no means complete, it gives a general idea about the current hot topics in the field of computer security. The next section deals with automatic methods for coping with the above mentioned issues.

2.3.3 Countermeasures against intrusions and attacks

To date, a lot of effort has been put into developing intrusion prevention and detection technology. Modern personal computers are commonly equipped with antivirus software, which ensures that malicious code is detected, quarantined, and removed. Furthermore, many operating systems have considerably improved their security concept, for example by putting in place a software firewall which only allows traffic from registered applications to pass. Network firewalls serve the same purpose by controlling access to the internal computers of the network. Normally, only internal hosts are allowed to open connections to the "wild" Internet. Connection requests from the outside are automatically blocked, with only a few exceptions. In some scenarios, reaching an internal host in the network is crucial for its designated function. Web, mail, and DNS servers, for example, should be reachable from the outside and are thus often placed in a so-called demilitarized zone (DMZ), as depicted in Figure 2.6.

Figure 2.6: Concept of a demilitarized zone (DMZ)

Obviously, hosts within this zone are less protected than the rest of the network. Nevertheless, they are integrated into the overall security concept of the network and are thus better protected than hosts in the "wilderness". Many commercial networks employ Intrusion Detection Systems (IDS), which scan the network traffic for external and internal threats. Note that countless security holes in unpatched operating systems might allow malicious code to nest itself in an internal host. This compromised host can then infect other hosts in the internal network, since the firewall only blocks external traffic. Signature-based IDS filter network traffic according to a set of already known attacks. Commonly, these systems are updated on a regular basis, but by design they cannot detect yet unknown attack patterns. More recently, a second class of IDS has evolved, the so-called anomaly-based IDS. These systems maintain profiles of the network traffic of subnets and hosts. If, all of a sudden, there are significant changes in the type of traffic of a particular entity, an alert is generated to inform the responsible network administrator. Some installations even automatically initiate countermeasures, such as isolating misbehaving hosts to minimize harm to the rest of the healthy IT infrastructure. Although IDS have proven able to detect and block many threats, hackers and cyber criminals will always find new ways to circumvent them. Honey pots were invented to fight these bad guys with their own methods: a honey pot is a system that simulates a vulnerable host in the network and carefully logs attacks conducted against it. This helps to keep track of novel threats and attacks, as well as of the actions undertaken by an attacker to break into the computer. Naturally, all these systems produce alerts, and manually keeping track of all of them is only possible up to a certain network size and activity level.
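The anomaly-based IDS idea can be sketched as follows: maintain a per-host traffic profile and alert when a new observation deviates strongly from it. The profile values, the flows-per-minute metric, and the three-sigma threshold are illustrative choices, not those of any particular IDS product:

```python
import statistics

def is_anomalous(history, observation, threshold=3.0):
    """Alert when the observation lies more than `threshold` standard
    deviations away from the host's historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # guard against zero variance
    return abs(observation - mean) / stdev > threshold

# Profile: a host normally emits around 100 flows per minute.
profile = [95, 102, 99, 110, 90, 105, 98, 101]
print(is_anomalous(profile, 500))  # True  -> raise an alert
print(is_anomalous(profile, 104))  # False -> within the normal profile
```

Real anomaly-based systems track many such features per host and subnet simultaneously, which is precisely where false positives from ambiguous deviations become a problem.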
The next section deals with data modeling as well as storage of network traffic and event data in order to quickly find the right data or to calculate aggregates for the visualization module.

2.3.4 Threat models

In the scope of this thesis, the presented visualization approaches support the analyst in the tasks of detecting misuses and threats, filtering out irrelevant alarms, correlating alarms, monitoring the behavior of hosts, and inspecting the payload of network packets. None of the presented approaches is suitable for supporting all of these tasks, since each proposed visualization is designed to solve a specific problem with a particular set of threats in mind. Therefore, we consider a particular threat model for each of the visualizations. In Chapter 4, the threat model assumes that high traffic values, correlations of high values in several traffic parameters, or regularly reappearing patterns indicate an attack or an intrusion. While peaks in time series can be easily spotted and referenced to their exact time of occurrence, not all reappearing temporal patterns can be identified with our proposed visualization. However, the visualization tool can supply the analyst with valuable information for developing algorithms or signatures aimed at detecting more threats of this model. The next chapter deals with threats that are correlated with respect to their source or destination. For example, an attack could be launched from several computers within the same prefix or AS. The proposed visualization therefore offers an interface for exploring network traffic aggregated at several levels, which correspond to the network infrastructure. This threat model is further refined for the analysis in Chapter 6 in order to take the relationship between the source and the destination of traffic flows into account. Since the proposed edge bundles build upon the map introduced in Chapter 5, they also consider the hierarchical structure of the IP address dimension. The visualization therefore enables analysis of threats based on relating senders and receivers.
Chapter 7 considers a different threat model by assuming that attacks or misuses can be detected by considering packet details and aggregates. While some threats of this model can be detected using the proposed visualization technique, it is infeasible for human analysts to consider all possible aggregation paths of complex data sets. To explicitly consider temporal changes, the threat model in Chapter 8 defines host behavior based on the respective network traffic, and the proposed graph visualization is used to illustrate these behavioral changes. Under the assumption that hacked computers change their behavior with respect to the quantity or type of network traffic, the proposed visualization enables detection of an enlarged set of threats. Chapter 9 deals with a completely different type of threat model, which considers the textual content of network traffic in order to filter out harmful traffic. It is demonstrated on the basis of a Self-Organizing Map how a feature-based approach can be used to filter out harmful traffic such as spam messages. At this point we have to remark that each of the threat models comprises a large set of threats. However, when visualization is applied to detect these threats, it is often up to the user to find peculiarities within the data. Therefore, the human analyst cannot be guaranteed to detect all threats within the considered threat model using the proposed techniques. Furthermore, so-called false positives are likely to occur due to misinterpretation of ambivalent alerts.

2.4 Building a data warehouse for network traffic and events

When a network is thoroughly monitored over a longer period of time, a lot of data accumulates in log files of different formats originating from gateways, firewalls, honey pots, and intrusion detection systems. In order to obtain an overview of all these different data sources and to enable the analysis of these large data sets in the visualization tools presented within the scope of this dissertation, it was necessary to build a data warehouse by identifying common data characteristics and proposing a data modeling scheme suitable for all data sources. Data warehousing is a field that has grown out of the integration of several technologies and experiences over the last two decades. The term "data warehouse" was originally coined by Inmon in 1990 with the following definition:

"A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions. The data warehouse contains granular corporate data." [70]

The traditional functional and performance requirements of On-Line Transaction Processing (OLTP) applications do not match the tasks at hand: rather than processing many transactions in real time, we need a system that supports fast calculation of aggregates for analysis tasks while taking cross-dimensional conditions into account. After careful consideration, the OLAP (On-Line Analytical Processing) architecture was found to be an appropriate option for processing huge amounts of network data in a multitude of scenarios. The term OLAP was coined by Codd et al. as follows:

"OLAP is the name given to the dynamic enterprise analysis required to create, manipulate, animate, and synthesize information from exegetical, contemplative, and formulaic data analysis models (...). This includes the ability to discern new or unanticipated relationships between variables, the ability to identify the parameters necessary to handle large amounts of data, to create an unlimited number of dimensions (consolidation paths), and to specify cross-dimensional conditions and expressions." [30]

OLAP-based solutions, initially adopted by traditional business applications, have proven to be beneficial and extendable to a much wider range of application domains, such as health care, biology, government, education, and network security, to name a few. Figure 2.7 places OLAP in the context of the data warehousing system architecture. Each layer of the depicted 5-layer model encapsulates a different data flow in the system [137]. The Data Sources Layer comprises various data sources, which all store data in their proprietary formats. The ETL Layer (Extract, Transform, Load) therefore has to integrate the data from these heterogeneous sources into a consistent state within the target schemas. Next, the transformed data is passed on to the Data Warehouse Layer, where it is stored and archived in a special-purpose database. Data analysis methodologies and techniques, such as OLAP and Data Mining, form the Analysis Layer. Finally, the Presentation Layer completes the data warehouse system architecture with its frontend analytical applications – also known

Figure 2.7: A multilayered data warehousing system architecture.

as BI tools – for presentation and exploration of the data. Since data warehouses have different performance requirements than operational database management systems optimized for OLTP, and because of the need to consolidate data from many heterogeneous sources, they are normally implemented separately from operational databases.

"OLAP technology draws its analytical power from the underlying multidimensional data model [131]. The data is modeled as cubes of uniformly structured facts, consisting of analytical values, referred to as measures, uniquely determined by descriptive values drawn from a set of dimensions. Each dimension forms an axis of a cube, with dimension members as coordinates of the cube cells storing the respective measure values." [113]

Figure 2.8(a) shows a strongly simplified example of a three-dimensional data cube, storing the number of IP addresses and the number of flows as measures determined by the dimensions

Figure 2.8: Example netflow cube with three dimensions and sample data

Sender, Time, and DstPort. The pivot table in Figure 2.8(b) illustrates the actual values contained in the cells of the data cube along with some chosen aggregates. Note that in many cases these data cubes already store pre-aggregated values, such as the number of flows per prefix, month, and destination port number in our example. Furthermore, depending on the complexity of the data, the cubes' dimensionality can be significantly higher, which results in far more possible analysis questions.

In relational OLAP, each cube is stored as a fact table with tuples containing one or more measure values and the values of their dimensional characteristics. Most data warehouses model dimension hierarchies using the denormalized star schema. The database then consists of a fact table and a single table for each dimension. One entry in the fact table stores the multidimensional coordinates and the associated numerical values of the measures.

Because star schemas are limited in their capabilities to provide semantic support for attribute hierarchies, normalized snowflake schemas were introduced as a refinement. In this modeling, a separate table is created for every level of the respective dimension. Each of these tables stores the dimensional values of one hierarchy level for a particular dimension as well as a reference to the parent level. Index structures are then used to speed up join operations of the fact table with the dimension tables of various granularities and to efficiently calculate aggregates. For further reading, refer to [21, 22, 131].
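As a sketch of the relational realization just described, the following snippet builds a minimal snowflake-style schema in SQLite and rolls a measure up along the hierarchy with standard SQL. All table names and values are illustrative, not the thesis' actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Snowflake-style schema: one dimension table per hierarchy level,
# each level referencing its parent level.
cur.executescript("""
CREATE TABLE country (country_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE asystem (asn INTEGER PRIMARY KEY,
                      country_id INTEGER REFERENCES country(country_id));
CREATE TABLE prefix  (prefix TEXT PRIMARY KEY,
                      asn INTEGER REFERENCES asystem(asn));
-- Fact table: dimensional coordinates plus the numeric measures.
CREATE TABLE netflows (src_prefix TEXT REFERENCES prefix(prefix),
                       dst_port INTEGER, month TEXT,
                       num_conn INTEGER, num_bytes INTEGER);
""")
cur.executemany("INSERT INTO country VALUES (?, ?)",
                [(1, "Germany"), (2, "United States")])
cur.executemany("INSERT INTO asystem VALUES (?, ?)", [(553, 1), (15169, 2)])
cur.executemany("INSERT INTO prefix VALUES (?, ?)",
                [("134.34.0.0/16", 553), ("64.233.0.0/16", 15169)])
cur.executemany("INSERT INTO netflows VALUES (?, ?, ?, ?, ?)", [
    ("134.34.0.0/16", 25, "Jan", 114, 50000),
    ("134.34.0.0/16", 80, "Jan", 889, 900000),
    ("64.233.0.0/16", 25, "Feb", 6, 3000),
])

# Roll-up from prefix to country level by joining along the hierarchy.
rows = cur.execute("""
    SELECT c.name, SUM(f.num_conn)
      FROM netflows f
      JOIN prefix  p ON f.src_prefix = p.prefix
      JOIN asystem a ON p.asn = a.asn
      JOIN country c ON a.country_id = c.country_id
     GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Germany', 1003), ('United States', 6)]
```

The joins along the level tables are exactly the operations that the index structures mentioned above are meant to accelerate.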

2.4.1 Cube definitions

For our data warehouse, we used the relational database technology of the open source database PostgreSQL [136] because it offers a robust, reliable, and efficient approach for storing and managing large volumes of data. While collaborating with network administrators of our university and staff of the network security team at AT&T, we were able to design four types of facts and populate them with data: the netflows cube for storing network traffic extracted from the university gateway, the webserver cube consisting of log entries of the webserver of our working group, the botnets cube containing data from a signature-based botnet detection mechanism, and the snort cube, which is defined through alerts generated by the IDS Snort
2.4. Building a data warehouse for network traffic and events

                Measures                               Dimensions
Cube        #   #flows  #requests  #IPs  #alerts  |  IP address  Time  Port  Browser  OS  Alert
netflows    ∑     ∑        -         ∑      -     |      2         1     2      -      -    -
webserver   ∑     -        ∑         ∑      -     |      1         1     -      1      1    -
botnets     -     ∑        -         ∑      -     |      2         1     -      -      -    -
snort       -     -        -         ∑      ∑     |      2         1     2      -      -    1

Table 2.2: Cubes of the network security data warehouse with their associated measures and dimensions. Inside the table we denoted the aggregation functions for the applicable measures and the number of dimensions of a particular type for the applicable dimensions.

that was set up for test purposes. Table 2.2 describes these four cubes, their dimensions, and measures. Transactional facts are periodically retrieved from network devices, the webserver log, or the IDS and stored in the respective fact table. For network traffic, for instance, an entry describes a single network packet consisting of source and target IP addresses and ports, the timestamp, and the size of the payload. When measuring network traffic using several sensors, special precaution has to be taken not to count traffic redundantly when it passes multiple hosts on its way to its destination.

To reduce the volume of facts to be stored, all packets and connections referring to the same source, destination, and time interval are aggregated into a single fact, with the number of sessions and their total size in bytes as its two measures. A single day of network traffic measured on the main gateway of our mid-size university, for example, yields about 10 million connections, which are aggregated on hourly intervals to approximately 2 million facts.

An overview of the logical database design of the netflow cube is depicted in Figure 2.9. Using the snowflake schema, dimensional tables (e.g., IPAddress, Port) were normalized into subtables for each granularity level. The time dimension was left denormalized as in the star schema, since data warehouse systems provide their own routines for handling temporal characteristics.
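The hourly aggregation into facts described above can be sketched in a few lines; the record layout and values below are invented for illustration:

```python
from collections import defaultdict

# Raw connection records: (src_ip, dst_ip, timestamp_seconds, bytes)
raw = [
    ("10.0.0.1", "134.34.1.5", 3600 * 2 + 12, 500),
    ("10.0.0.1", "134.34.1.5", 3600 * 2 + 90, 700),   # same hour, same pair
    ("10.0.0.2", "134.34.1.5", 3600 * 5 + 3,  200),
]

# Key facts by (source, destination, hourly interval); each fact stores
# the two measures: number of sessions and their total size in bytes.
facts = defaultdict(lambda: [0, 0])
for src, dst, ts, size in raw:
    key = (src, dst, ts // 3600)   # coarsen time to hourly granularity
    facts[key][0] += 1             # NumConn
    facts[key][1] += size          # NumBytes

for key, (conn, nbytes) in sorted(facts.items()):
    print(key, conn, nbytes)
```

The first two records collapse into a single fact with two sessions and 1200 bytes, which is how 10 million connections shrink to roughly 2 million facts per day.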

IP address dimension

The IP address dimension is common to all of our cubes; some cubes even have two dimensions of this type, for example, the source and the destination IP addresses in the netflow cube. A balanced hierarchy is defined upon the IP address dimension using the following consolidation path: IP address → IP prefix → autonomous system → country → continent, and we thus

[Figure 2.9 diagram: the fact table netflows (measures NumConn and NumBytes; dimensional references Timestamp, SourceIP, DestinationIP, SourcePort, DestinationPort) linked to snowflaked dimension tables: Port with PortCategory and PortDomain; IPAddress → Prefix → Auton_System → Country → Continent; and a denormalized Time table]

Figure 2.9: Modeling network traffic as an OLAP cube. Each entry of the fact table netflows is linked to its dimensional values and stores the measures NumConn and NumBytes.

obtain the following hierarchy with the number of entries at each level: 7 continents, 190 countries, 23054 autonomous systems, and 197427 prefixes.
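The consolidation path IP address → prefix → AS → country → continent amounts to a longest-prefix match followed by a chain of lookups. A minimal sketch, with all routing and geolocation mappings being toy values for illustration:

```python
import ipaddress

# Toy routing and geolocation tables (invented for illustration).
PREFIX_TO_AS = {"134.34.0.0/16": 553, "64.233.160.0/19": 15169}
AS_TO_COUNTRY = {553: "DE", 15169: "US"}
COUNTRY_TO_CONTINENT = {"DE": "Europe", "US": "North America"}

def consolidate(ip: str):
    """Map an IP address onto (prefix, ASN, country, continent)."""
    addr = ipaddress.ip_address(ip)
    # Longest-prefix match against the known prefixes.
    best = max((p for p in PREFIX_TO_AS
                if addr in ipaddress.ip_network(p)),
               key=lambda p: ipaddress.ip_network(p).prefixlen,
               default=None)
    if best is None:
        return None
    asn = PREFIX_TO_AS[best]
    country = AS_TO_COUNTRY[asn]
    return best, asn, country, COUNTRY_TO_CONTINENT[country]

print(consolidate("134.34.58.65"))  # ('134.34.0.0/16', 553, 'DE', 'Europe')
```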

The hosting country of an autonomous system is determined by looking up the geographical positions of the IP addresses of all the networks it contains and choosing the prevailing country. For this, we rely on the GeoIP database of Maxmind, Ltd. [116] (approx. 99% accuracy).

A global map from IP prefixes to ultimate AS names and numbers can be somewhat complicated to obtain. Though a local map can be extracted from any border gateway router, due to route aggregation it is unlikely to list many terminal or leaf-level AS's. Therefore, maps should be obtained from multiple vantage points in the Internet and aggregated, raising the problem of consistency and completeness, especially in the presence of intentional efforts to spoof AS identifiers. This problem has been studied, and there are useful heuristic methods based on dynamic programming [114] as well as public prefix-to-ASN tables available. Unfortunately, these tables date back to 2004¹, and we therefore reverted to the data we extracted from a single routing table in September 2006 [7].
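The majority vote over an AS's prefixes might be sketched as follows; the prefix-to-country data is invented for illustration:

```python
from collections import Counter

# Hypothetical geolocation of the prefixes announced by one AS.
prefix_countries = {
    "134.34.0.0/16": "DE",
    "134.35.0.0/16": "DE",
    "198.51.100.0/24": "US",
}

def hosting_country(prefix_to_country):
    """Assign the AS to the country hosting most of its prefixes."""
    counts = Counter(prefix_to_country.values())
    country, _ = counts.most_common(1)[0]
    return country

print(hosting_country(prefix_countries))  # DE
```

A refinement would be to weight each prefix by its address count rather than counting prefixes uniformly.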

Time dimension

Time is a very common and important dimension in almost all data analysis scenarios. In the data cubes presented here, timestamps are aggregated by millisecond → second → minute →

hour → day → month → year. Database systems provide support for this unique dimension in the form of functions for extracting and manipulating temporal properties of interest. However, because these functions differ from product to product, we modeled time as an OLAP dimension to provide the full OLAP analysis capabilities.

¹A possible side-effect could be that previously large AS's contain fewer prefixes, which makes the AS level easier to render in the visualizations.
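Decomposing a timestamp onto this consolidation path is straightforward; a sketch using Python's standard datetime:

```python
from datetime import datetime

def time_coordinates(ts: datetime):
    """Map a timestamp onto the hierarchy millisecond -> second -> minute
    -> hour -> day -> month -> year used by the time dimension."""
    return {
        "millisecond": ts.microsecond // 1000,
        "second": ts.second,
        "minute": ts.minute,
        "hour": ts.hour,
        "day": ts.day,
        "month": ts.month,
        "year": ts.year,
    }

print(time_coordinates(datetime(2006, 9, 14, 13, 37, 5, 250000)))
```

Precomputing these coordinates once per fact is what makes the time dimension queryable independently of product-specific date functions.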

Port dimension

The netflows and snort cubes both share the port dimension. TCP and UDP application ports may be grouped into categories (i.e., well-known, registered, and dynamic ports, as mentioned in Section 2.1.4) on the one hand, and into domains (e.g., web, email, database, etc.) on the other hand. These two consolidation paths allow the calculation of two different upper-level aggregates. Note furthermore that the netflows cube contains two port dimensions, namely SourcePort and DestinationPort.
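The two consolidation paths might be sketched as follows. The category ranges follow the standard IANA convention referred to in Section 2.1.4, while the domain assignments below are an illustrative subset:

```python
def port_category(port: int) -> str:
    """First consolidation path: IANA port ranges."""
    if port < 0 or port > 65535:
        raise ValueError("not a valid TCP/UDP port")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"

# Second consolidation path: application domains (illustrative subset).
PORT_DOMAIN = {25: "email", 110: "email", 80: "web", 443: "web",
               3306: "database", 5432: "database"}

def port_domain(port: int) -> str:
    return PORT_DOMAIN.get(port, "other")

print(port_category(25), port_domain(25))      # well-known email
print(port_category(5432), port_domain(5432))  # registered database
```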

Other dimensions Naturally, each application domain has its own dimensions characterizing the facts. In the scenarios discussed in this thesis, the hierarchical dimension alert (consolidation path: alert → alert category) as well as non-hierarchical dimensions browser and operation system are contained in the snort and netflows cubes. Note that the dimensions presented here can by no means be considered complete or perfectly modeled since complicated analysis tasks might pose novel requirements (e.g., disk space efficiency, performance, finer granularities, etc.) to the model.

Measures

As mentioned before, measures are defined through a numerical attribute as well as an aggregation function or a set of aggregation functions applicable to it. For all of the measures presented here, the sum aggregation function was used. Evidently, other aggregation functions, such as average and bounds, or derived aggregation functions (e.g., alerts per IP address) can easily be defined. As shown in Table 2.2, the measures number of bytes, connections, requests, flows, IPs, and alerts were used in the four presented cubes of the network security data warehouse.

2.4.2 OLAP Operations and Queries

The fact table contains a huge volume of “raw” or slightly aggregated data. OLAP operations are used for computing measure aggregates for a chosen combination of dimensions and their granularity levels. The most common operations are:

• Roll-up (aggregating) and drill-down (disaggregating) change the granularity level.

• Slice-and-dice defines a subcube of interest or even reduces the dimensionality by “flattening” single dimensions to one selected value.

• Ranking can be applied as a dynamic filter showing the specified number of marginal (top or bottom) measure aggregates.
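Over a toy in-memory fact list (layout and values invented), the three operations can be sketched as:

```python
from collections import defaultdict

# Facts: (country, port_domain, month, num_flows)
facts = [
    ("DE", "web", "Jan", 120), ("DE", "mail", "Jan", 40),
    ("DE", "web", "Feb", 90),  ("US", "web", "Jan", 300),
    ("US", "mail", "Feb", 10),
]

def roll_up(facts, dims):
    """Roll-up: aggregate the measure over a chosen subset of dimensions."""
    out = defaultdict(int)
    for country, domain, month, flows in facts:
        coord = tuple({"country": country, "domain": domain,
                       "month": month}[d] for d in dims)
        out[coord] += flows
    return dict(out)

def slice_(facts, month):
    """Slice: flatten the time dimension to one selected value."""
    return [f for f in facts if f[2] == month]

def top_k(aggregates, k):
    """Ranking: keep only the k largest measure aggregates."""
    return sorted(aggregates.items(), key=lambda kv: -kv[1])[:k]

by_country = roll_up(facts, ["country"])   # {('DE',): 250, ('US',): 310}
print(top_k(by_country, 1))                # [(('US',), 310)]
```

Drill-down is simply a roll-up to a larger dimension subset, e.g. `roll_up(facts, ["country", "month"])`.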

Since the visualization approaches in this dissertation restrict themselves to a rather limited set of OLAP operations (drill-across, rotating, etc. were not used), some of the classical OLAP restrictions on hierarchy properties, such as balancedness, covering, and strictness, could be relaxed. For instance, one port may belong to multiple domains or to none at all.

2.4.3 Summary tables

Unfortunately, storing the entire data in a network traffic monitoring or network security scenario might become unfeasible, or managing the entire data “as it is” in the database can turn out to be simply impossible. Besides, the older the data entries get, the less interesting they are for the analysis, since anomaly detection is basically concerned with recent and especially current traffic flows and events.

In order to efficiently query upper-level aggregates, many data warehouses store summary tables containing fact entries, pre-aggregated to a specific subset of dimensions and/or a specified granularity level within a dimension. Summary data can be stored either in a separate fact table, or in the same fact table extended to contain the references to the respective upper-level dimension tables. In the latter case, null values are inserted into the columns of the dimension levels inapplicable to the summary facts. While reducing the number of tables, this method has proven error-prone in practice, since correctly querying the data becomes more complicated [21].

When the huge data volume in a fact table results in performance degradation, the table can be partitioned into multiple subtables corresponding to different levels of detail. For instance, the netflows cube data can be managed using the following three fact tables:

• ShortTermFlows stores the most recent transactions (e.g., for the last one hour) “as they are”, i.e. without any aggregation. This table will be used for the run-time network load visualization.

• MiddleTermFlows stores the network data of the current day with the granularity coarsened from millisecond to minute. This table still offers rather detailed information for offline exploration of the current day’s network behavior.

• LongTermFlows aggregates the transactions even further (e.g., by hour or day) to store them as historical data for less detailed analysis.

Other granularity levels along the time dimension (e.g., month → quarter → year) can be defined depending on the application needs. Furthermore, aggregation along the IP address dimension might also result in a considerable speed-up. Often, these summary tables are realized through so-called materialized views, which get automatically updated once new fact entries are appended to the underlying fact tables.
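Deriving a minute-granularity summary table from raw flows can be expressed in standard SQL as sketched below; the schema and values are invented, and SQLite is used only for illustration (a materialized view would keep such a table updated automatically):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE short_term_flows
               (src TEXT, dst TEXT, ts_ms INTEGER, num_bytes INTEGER)""")
cur.executemany("INSERT INTO short_term_flows VALUES (?, ?, ?, ?)", [
    ("10.0.0.1", "134.34.1.5", 61250, 500),   # 00:01:01.250
    ("10.0.0.1", "134.34.1.5", 61900, 700),   # same minute
    ("10.0.0.1", "134.34.1.5", 125000, 200),  # next minute
])

# Summary table at minute granularity, in the spirit of MiddleTermFlows:
# integer division coarsens the millisecond timestamps to minutes.
cur.execute("""
    CREATE TABLE middle_term_flows AS
    SELECT src, dst, ts_ms / 60000 AS minute,
           COUNT(*) AS num_conn, SUM(num_bytes) AS num_bytes
      FROM short_term_flows
     GROUP BY src, dst, minute
""")
print(cur.execute(
    "SELECT * FROM middle_term_flows ORDER BY minute").fetchall())
# [('10.0.0.1', '134.34.1.5', 1, 2, 1200), ('10.0.0.1', '134.34.1.5', 2, 1, 200)]
```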

A recent study in visual analytics proposes an alternative data management strategy, coined Smart Aggregation [155]. By combining automatic data aggregation with user-defined con- trols on what, how, and when data should be aggregated, this approach ensures that a system stays usable in terms of system resources and human perceptual resources. By automatically determining if aggregation is required, the system proposes candidate fields to the user for effective aggregation based on the cardinality of the data. After replacing detailed values through their aggregates, specialized index structures are used to speed up the query process.

2.4.4 Visual navigation in OLAP cubes

There is an abundance of tools and interfaces for exploring multidimensional data. We limit ourselves to naming a few products which offer distinctive features relevant to our work. The Polaris system [150] extends the pivot table interface by allowing a variety of displays and tools for visual specification of analysis tasks to be combined. Polaris is a predecessor of a commercial business intelligence product called Tableau Software [152]. ProClarity was the first to enhance business intelligence with Decomposition Trees, a technique for iterative visual disaggregation of data cubes. XMLA enriches the idea of hierarchical disaggregation by arranging the decomposed subtotals of each parent value into a nested chart (bar- and pie-chart trees) in its Report Portal OLAP client [187]. Visual Insights has developed a family of tools, called ADVIZOR, with an intuitive framework for parallel exploration of multiple measures [43].

Probably the most popular paradigm underlying the OLAP navigation structure is that of a file browser, with each cube as a folder containing the list of top-level dimensions and the list of available measures, as found in Cognos PowerPlay [31], BusinessObjects [17], CNS DataWarehouse Explorer [29], and many other commercial OLAP tools. Each hierarchical dimension is itself a folder containing its child entities. Hierarchical entities can be recursively expanded to show the subtrees of their descendants. The entities of the highest granularity (i.e., the leaf nodes) are represented as files and are non-expandable.

Standard OLAP interfaces allow users to navigate directly in the dimensional data rather than in a dimensional hierarchy scheme. Our approach [176], however, pursues a clear distinction between the dimension’s structure and its instances.
Therefore, expansion of a dimension folder reveals solely the nested folders of its subdimensions, contrary to the standard OLAP navigation displaying the child-level data. The instances of any subdimension can be retrieved on demand. Figures 2.10(a) and 2.10(b) demonstrate the differences between the standard “show-data” interface and our proposed “show-structure” interface, respectively, using the example of the hierarchical dimension IP address. Notice that expanding the top-level dimension IP address in Figure 2.10(b) reveals its entire descendant hierarchy, thus enabling the user to jump right to the desired granularity level. The data view is available on explicit demand by clicking the preview button of the respective category. Figure 2.10(c) shows the activated preview of continents with the option to drill down into any continent’s descendant subtree. The advantages of our proposed navigation structure for building hierarchies can be summarized as follows:

1. Clear distinction between the dimension’s structure and its contents.

(a) “show data” approach; (b) “show structure” approach; (c) on-demand preview in the “show structure” approach

Figure 2.10: Navigating in the hierarchical dimension IP address

2. Immediate overview of all granularity levels in a hierarchical dimension.

3. The ability to drill-through directly to any descendant subdimension.

4. On-demand preview of the data as well as any data node’s descendant entities.

5. Compactness on the display due to moderate expansion at most steps.

6. The entire navigation is built from a single meta table.

7. The actual data is retrieved only if explicitly requested.

8. It is easier to find the entries of interest even somewhere deep in the hierarchy without knowing the data (e.g., any country can be accessed directly through the preview of countries without searching for and drilling through its ancestor in continents).

Evidently, finding entries deep in the hierarchy can become a challenge since the deepest level of the IP dimension might contain up to 2 billion items. In this case, character- or byte-wise navigation can make the content accessible.

3 Foundations of information visualization for network security

“Discovery consists of seeing what everybody has seen and thinking what nobody has thought.”

Albert von Szent-Györgyi

Contents
3.1 Information visualization
  3.1.1 The task by data type taxonomy
  3.1.2 Mapping values to color
  3.1.3 Further reading
3.2 Visual Analytics
  3.2.1 Scope of Visual Analytics
  3.2.2 Visual Analytics challenges in network monitoring and security
3.3 Related work on visualization for network monitoring and security
  3.3.1 Monitoring of network traffic between hosts, prefixes, and ASes
  3.3.2 Analysis of firewall and IDS logs for virus, worm and attack detection
  3.3.3 Detection of errors and attacks in the BGP routing system
  3.3.4 Towards Visual Analytics for network security

This chapter will briefly introduce the field of information visualization, which originally started with static information displays and currently deals with interactive tools for visual data exploration. Based on Shneiderman’s task by data type taxonomy, the visualization techniques of this dissertation will be discussed. Since many of these visualizations map values to colors, we will present different normalization strategies for obtaining an adequate mapping. Afterwards, the emerging field of visual analytics and its influence on network monitoring and security will be sketched. At the end of this section, previous visualization studies in the fields of network monitoring and security will be reviewed.

3.1 Information visualization

Visualization can be considered a relatively young research field since the first IEEE conference on visualization was held in 1990. Nevertheless, some fields such as cartography have a history of more than 2000 years. Likewise, static diagrams for data visualization had already been in use for some time, as documented in Edward R. Tufte’s books [161, 162, 163, 164].

Some of the case studies listed in these books, for example, revert to material from the 18th century. In the late 19th century, statistical graphics started to become common in government, industry, and science. However, not only data analysis was a major concern in the early 20th century, but also human factors. Psychological aspects of human perception, for example, were studied when the Gestalt School of Psychology was founded in Berlin in 1912. The pioneering work of Max Wertheimer, Kurt Koffka, and Wolfgang Köhler resulted in a set of Gestalt “laws” of pattern perception, which easily translate into a set of design principles (i.e., proximity, similarity, symmetry, closure, continuity) for information displays, as described in [178].

The field was revived some decades later in the 1960s when Jacques Bertin, a French cartographer, published his famous book “Sémiologie graphique – les diagrammes, les réseaux, les cartes” (first in French and 16 years later in English), in which he carefully explains how effective and expressive diagrams are constructed [12, 14]. He identifies the eight visual variables, namely x & y position, size, brightness, texture, color, orientation, and form, in order to systematically illustrate how they can be used to convey information in matrices, diagrams, and maps. At approximately the same time, John Tukey, a statistician at Bell Labs, widened the scope of statistics from confirmatory data analysis to exploratory data analysis [165], thereby foreseeing the enormous potential of yet unavailable computer graphics and algorithms to advance the young field.

In the mid-80s and 90s, raster displays and computer graphics became available, facilitating novel research in a variety of areas. The graph drawing and computational geometry communities, for example, emerged to focus on some of the deeper theoretical problems in creating geometric representations of “abstract information”.
In the same time frame, researchers at Xerox PARC (among them Card, Mackinlay, and Robertson) recognized the great importance of user interfaces for accessing vast stores of information, although details of their vision such as 3D graphical metaphors have not been adopted so far. These developments jointly led to the formation of the term information visualization, which nowadays offers enough space for creativity to host its own research community. The term is defined as follows:

“Information visualization: The use of computer-supported, interactive, visual representations of data to amplify cognition.” [18]

3.1.1 The task by data type taxonomy

In the last 20 years, many visualization systems have been developed all around the world, and several taxonomies for information visualization have been proposed to systematically classify and discuss the techniques. Based on the task by data type taxonomy for information visualization (TTT) of Ben Shneiderman [145], the concepts proposed in this thesis will be systematically discussed and put into context. Seven tasks and seven data types were identified at a high level of abstraction and are detailed in Table 3.1. Naturally, those analysis tasks and data types can be further refined or extended for more detailed analysis when necessary. Gaining an overview over a collection of items is a very common task for today’s knowledge workers. There exist several strategies to achieve this goal by means of information visualiza-

Tasks
Overview: Gain an overview of the entire collection.
Zoom: Zoom in on items of interest.
Filter: Filter out uninteresting items.
Details-on-demand: Select an item or group and get details when needed.
Relate: View relationships among items.
History: Keep a history of actions to support undo, replay, and progressive refinement.
Extract: Allow extraction of sub-collections and of the query parameters.

Data types
1-dimensional: Linear data types are texts, program source code, or lists which are organized in a sequential manner.
2-dimensional: Mostly geographical data, such as maps, floorplans, or abstract 2D text layouts.
3-dimensional: Real-world objects like buildings, the human body, or molecules.
Temporal: Time-related items with a start and finish time, possibly overlapping.
Multi-dimensional: Data sets with many dimensions.
Tree: Hierarchies or tree structures are defined by each item having a link to one parent item (except for the root).
Network: Items can be linked to an arbitrary number of other items.

Table 3.1: Tasks and data types of Shneiderman’s taxonomy (adapted from [145])

tion techniques. One very common approach is to have a zoomed-out view with a field-of-view box that supports the user in matching it with the adjoined detail view. Another completely different approach is the fisheye view, which magnifies one or more areas of the display and thus presents overview and details in the same view [56].

Certainly, zooming can be seen as another basic task, provided that the user is interested in some elements of a collection. The smoother the zoom, the easier it is for the user to preserve his sense of orientation in the information space at hand. Zooming can be implemented through mouse interaction and is very natural on one- or two-dimensional data, or on two-dimensional presentations of other kinds of data, such as the embedding of a graph onto a map.

In many situations, we are faced with information overload. There is simply too much information on the screen to grasp it all at once. In this case, filtering helps to remove unwanted items and to focus on the essential ones.

As soon as the information display has been reduced to a few dozen items, users request details on demand to compare and understand the issues at hand. Usually, this kind of interaction is done by simply clicking on an item or hovering over it, which triggers a pop-up window with values of the item’s attributes.

Relating items to each other is a challenging task, since viewing all relationships at once might be too confusing, and viewing only the relationships of a particular item with others requires

Visualization                Overview  Zoom  Filter  Details  Relate  History  Extract  Chapter
Enhanced Recursive Pattern      x       x      x       x        -       -        x        4
Hierarchical Network Map        x       x      x       x        -       -        x        5
Edge Bundles                    -       -      -       x        x       -        -        6
Radial Traffic Analyzer         -       -      x       x        x       x        x        7
Behavior Graph                  x       -      x       x        x       -        x        8
Self-Organizing Map             x       -      -       x        -       -        -        9

Table 3.2: Visualization techniques of this thesis by tasks

some a priori knowledge about which item to pick. Focusing the analysis on a particular attribute value of an item can also trigger filtering operations; e.g., selecting the director’s name of a particular movie in the FilmFinder [2] results in showing all his movies.

Since a single user interaction rarely produces the desired outcome, complex explorative tasks consist of several refining and generalizing steps. It is therefore essential that a history of the previous interactions is kept and made available to the user in order to jump back or reapply a previous interaction. A model for recording the history of user explorations in visualization environments, augmented with the capability for users to annotate their explorations, can be found in [61].

After the “needle in the haystack” has been found, it would be inappropriate to just throw it away and restart the search from scratch the next time. Users might rather want to extract the found items, store them, or use the drag-and-drop capabilities of the operating system to drag them into the next application window. Although this feature is often desired, many applications still come without it or only support a very limited set of extraction capabilities.

Application

As shown in Table 3.2, the visualization techniques presented in this dissertation are introduced with increasing dimensionality of the input data. The first technique is the Enhanced Recursive Pattern, used to present one-dimensional temporal data in a space-efficient way in order to gain an overview, filter out unimportant time spans, and retrieve more details so as to extract data according to the time dimension. Due to the multi-resolution properties of the Enhanced Recursive Pattern, it can also be used in a zooming context. Note that rather than representing overlapping time spans, we reduce the analyzed data to one dimension by aggregating all events of a particular time interval. However, when comparing several time series, we move from a one-dimensional to a multi-dimensional analysis.

In Chapter 5, we present the Hierarchical Network Map (HNMap), a technique to support the tasks overview, zoom, filter, details, and extraction. The space-filling visualization maps the tree data structure to containment relationships within rectangles on the screen. By visualizing the whole IP address space with traffic volumes in each part of the network, the analyst gets an overview of how much traffic is transferred between his network and other networks in the Internet. The visualization then supports zooming on the rectangles which represent continents, countries, or autonomous systems, enables filtering, and offers several options for representing details. At any given time of the analysis, the user can save the current map to a graphics file for documentation, presentation, and dissemination of security incidents.

The Edge Bundles technique builds upon the rectangles of HNMap by representing the communication network of traffic measurements through curved lines drawn on top to encode its structural properties. Thus, more details of the traffic flows rather than only the traffic volume per entity can be shown.
Though causing occlusion due to overdrawn rectangles, these lines make it possible to relate network entities with one another.

The next visualization technique is used in the Radial Traffic Analyzer and is very strong when it comes to the exploration history. While in each analysis step another attribute of the multi-dimensional data set is added as a further ring revealing more details of the network traffic, the results of the previous steps remain visible on the inner rings. Furthermore, basic tasks like filtering, details-on-demand, relating, and extracting are also supported.

The behavior graph, which can be seen as a projection of multi-dimensional data onto a two-dimensional plane, is meant to give an overview of the behavior of several hosts in the network over a previously defined time span. By filtering the presented data with the help of sliders and check boxes, the user can focus the analysis on the interesting aspects of the data sets. Additionally, detailed bar charts of the underlying data can be displayed on demand. The gained insights about misbehaving hosts can then be used to reconfigure hosts within the administrated network or to initiate countermeasures against attacks from external hosts.

Finally, the Self-Organizing Map in Chapter 9 can be used to cope with the high-dimensional feature vectors of e-mail messages. Details of each SOM node can be retrieved through mouse interaction. Note that all the visualization techniques presented above could be further extended to support almost all basic tasks, but due to limited time resources, we restricted our research prototypes to the most important ones.

Other taxonomies

Shneiderman’s task by data type taxonomy is not the only taxonomy for visualization. The one which comes closest to his taxonomy is the classification of visual data analysis techniques proposed by Keim and Ward [86]. It classifies information visualization techniques based on three dimensions: a) the data types to be visualized, b) the interaction and distortion techniques, and c) the visualization techniques (standard 2D/3D display, geometrically transformed display, iconic display, dense pixel display, and stacked display).

Keller and Keller presented their taxonomy of visualization goals as early as 1993 [87]. This fundamental work arranged visualization techniques according to nine actions (identify, locate, distinguish, categorize, cluster, rank, compare, associate, and correlate) and seven data types (scalar, nominal, direction, shape, position, spatially extended region or object, and structure).

Chuah and Roth later classified the semantics of interactive visualizations [27], focusing on the semantics of basic visualization interaction by characterizing inputs, outputs, operations, and compositions of these primitives. Card and Mackinlay further extended the early work of Bertin [13, 14] and assessed each elementary visual presentation according to a set of marks (e.g., points, lines, areas, surfaces, or volumes), their retinal properties (i.e., color and size), and their position in space and time (x, y, z, t) [19].

3.1.2 Mapping values to color

Since the visualizations to be presented in Chapters 4, 5, 6, and 9 use the visual variable color in order to convey some kind of numerical measurement to the analyst, we briefly explain how this mapping from values to colors and vice versa is done within the scope of this dissertation. Although color is not the most effective visual variable to convey quantitative measurements [109], we often excluded other visual variables due to overplotting and perceptual issues, which occur due to the high number of data elements displayed simultaneously. In astronomy, medical imaging, geography, and many other scientific applications, the term pseudocoloring is commonly used for representing continuously varying map values using a sequence of colors. While physicists often use a color sequence that approximates the physical spectrum, some perceptual problems occur since there is no inherent perceptual ordering of colors. For example, in an experiment where test persons were prompted to order a set of paint chips with the colors red, green, yellow, and blue, the outcomes varied due to the missing order of colors. However, if the same experiment is repeated with a series of gray paint chips, the subjects choose either a dark-to-light ordering or vice versa [178]. Therefore, the color scales used in our work all employ monotonically increasing or decreasing brightness of the color values in order to convey quantitative measurements. A more detailed description of how these Hue Saturation Intensity (HSI) color scales are created can be found in [79]. For a more holistic view on color usage, the number of data classes, the nature of the data (i.e., sequential, diverging, or quantitative), and the end-user environment (e.g., CRT, LCD, printed, projected, photocopied) have to be considered [64]. However, in the scope of this dissertation, we focus on color schemes for interactive data visualization on computer screens.
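To make the ordering principle concrete, the following sketch builds a color scale whose brightness increases monotonically for a fixed hue. It is a hypothetical illustration using the standard HSV model from Python's `colorsys` module, not the actual HSI implementation described in [79]; the hue, saturation, and scale length are arbitrary choices.

```python
import colorsys

def brightness_ordered_scale(hue: float, n: int = 256):
    """Build an n-entry RGB color scale with a fixed hue whose brightness
    (HSV value) rises monotonically, so viewers can order the colors."""
    scale = []
    for i in range(n):
        v = i / (n - 1)                      # brightness rises from 0 to 1
        r, g, b = colorsys.hsv_to_rgb(hue, 0.8, v)
        scale.append((round(r * 255), round(g * 255), round(b * 255)))
    return scale

blues = brightness_ordered_scale(hue=0.6)    # dark-to-light blue ramp
assert blues[0] == (0, 0, 0)                 # darkest entry is black
assert all(sum(blues[i]) <= sum(blues[i + 1]) for i in range(255))
```

Because every RGB channel in the HSV model scales linearly with the value component, fixing hue and saturation and increasing only the value guarantees the monotonic brightness ordering discussed above.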
Besides the above discussed choice of an appropriate color scale, normalization is one of the most important aspects of color mapping. Numerical measurements in this dissertation usually refer to network traffic, such as a total number of bytes, packets, flows, or alerts within a specified time frame, and can be formalized as a set of statistical values $X = (x_i)_{i=1,\dots,n}$ with $x_i \geq 0$, $x_i \in \mathbb{R}$, and $x_{\max} > 0$. These values are represented by the filling color of some visual entity such as a rectangle. We provide several fixed color scales, and the analyst is free to choose from the following three normalization schemes:

[Figure: (a) the normalization functions y = x/1000, y = sqrt(x)/sqrt(1000), and y = log(x+1)/log(1001); (b) the resulting color scales]

Figure 3.1: Mapping values to color using different normalization schemes

$$\mathrm{color}_{\mathrm{lin}}(x_i) = \frac{x_i}{x_{\max}} \qquad (3.1)$$
$$\mathrm{color}_{\mathrm{sqrt}}(x_i) = \frac{\sqrt{x_i}}{\sqrt{x_{\max}}} \qquad (3.2)$$
$$\mathrm{color}_{\mathrm{log}}(x_i) = \frac{\log(x_i + 1)}{\log(x_{\max} + 1)} \qquad (3.3)$$

The output of the chosen function is then mapped to the index positions within the color scale. Figure 3.1 illustrates the normalization functions along with five different color scales. Due to its improved discrimination of low values and its smoothing effect on outliers, we often used the logarithmic color scale in our studies. However, if high values need to be better distinguished, the square root normalization is the better choice.
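The mapping from a measurement to a color index can be sketched as follows; the function name, the 256-entry scale, and the scheme labels are illustrative assumptions, but the three formulas are exactly Equations 3.1 to 3.3.

```python
import math

def color_index(x: float, x_max: float, scheme: str = "log",
                n_colors: int = 256) -> int:
    """Map a non-negative measurement x (0 <= x <= x_max) to an index in
    a color scale with n_colors entries, using one of the three
    normalization schemes of Equations 3.1-3.3."""
    if scheme == "lin":
        y = x / x_max                                   # Eq. 3.1
    elif scheme == "sqrt":
        y = math.sqrt(x) / math.sqrt(x_max)             # Eq. 3.2
    else:
        y = math.log(x + 1) / math.log(x_max + 1)       # Eq. 3.3
    return min(int(y * n_colors), n_colors - 1)

# Logarithmic normalization assigns small values a larger share of the
# color scale than linear normalization does:
print(color_index(10, 1000, "lin"), color_index(10, 1000, "log"))
```

This also makes the trade-off in the text visible: with `lin`, the value 10 out of 1000 collapses onto one of the first few indices, while `log` spreads it much further up the scale.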

3.1.3 Further reading

Apart from the already mentioned references, Readings in Information Visualization, with its 700 references, is a good starting point for developing an overview of the field up to 1999 [18]. Moreover, the book The Grammar of Graphics [181] illuminates visualization from a statistical background, whereas the German book Visualisierung - Grundlagen und allgemeine Methoden [144] is a general introduction to information visualization as well as scientific visualization. Bob Spence's book Information Visualization - Design for Interaction enriched the field through its countless case studies on building effective interactive visualization systems [147]. Likewise, perceptual aspects play an important role within visualization systems, and Colin Ware's Information Visualization - Perception for Design is an excellent starting point to deepen one's knowledge about the field [178].

3.2 Visual Analytics

Visual analytics is the science of analytical reasoning supported by interactive visual interfaces [157]. Over the last decades, data has been produced at an incredible rate. However, the ability to collect and store this data is increasing at a faster rate than the ability to analyze it. While purely automatic and purely visual analysis methods have been developed in the past, the complex nature of many problems makes it indispensable to include humans at an early stage in the data analysis process. Visual analytics methods allow decision makers to combine their flexibility, creativity, and background knowledge with the enormous storage and processing capacities of today's computers to gain insight into complex problems. The goal of visual analytics research is thus to turn the information overload into an opportunity: decision makers should be enabled to examine this massive, multi-dimensional, multi-source, time-varying, and often conflicting information stream through interactive visual representations in order to make effective decisions in critical situations.

3.2.1 Scope of Visual Analytics

Visual analytics is an iterative process that involves information gathering, data preprocessing, knowledge representation, interaction, and decision making. The ultimate goal is to gain insight into the problem at hand, which is described by vast amounts of scientific, forensic, or business data from heterogeneous sources. To fulfill this goal, visual analytics combines the strengths of machines with those of humans. On the one hand, methods from knowledge discovery in databases (KDD), statistics, and mathematics are the driving force on the part of automatic analysis, while human capabilities to perceive, relate, and conclude, on the other hand, turn visual analytics into a very promising field of research. Historically, visual analytics has evolved out of the fields of information and scientific visualization. According to Colin Ware, the term visualization is meanwhile understood as “a graphical representation of data or concepts” [178], while the term formerly referred to forming a mental image. Nowadays, fast computers and sophisticated output devices create meaningful visualizations and allow users not only to mentally visualize data and concepts, but also to see and explore an exact representation of the data under consideration on a computer screen. However, the transformation of data into meaningful visualizations is not a trivial task and will not automatically improve through steadily growing computational resources. Very often, there are many different ways to represent the data, and it is unclear which representation is the best one. State-of-the-art concepts of representation, perception, interaction, and decision-making need to be applied and extended to be suitable for visual data analysis. The fields of information and scientific visualization deal with visual representations of data.
The main difference between the two fields is that the latter examines potentially huge amounts of scientific data obtained from sensors, simulations, or laboratory tests, while the former is defined more generally as the communication of abstract data relevant in terms of action through the use of interactive visual interfaces.

[Figure: the scope of visual analytics, spanning information analytics, geospatial analytics, scientific analytics, statistical analytics, knowledge discovery, data management & knowledge representation, presentation, production, and dissemination, as well as interaction and cognitive and perceptual science]

Figure 3.2: The Scope of Visual Analytics

Typical scientific visualization applications are flow visualization, volume rendering, and slicing techniques for medical illustrations. In most cases, some aspects of the data can be directly mapped onto geographic coordinates or into virtual 3D environments. There are three major goals of visualization, namely a) presentation, b) confirmatory analysis, and c) exploratory analysis. For presentation purposes, the facts to be presented are fixed a priori, and the choice of the appropriate presentation technique depends largely on the user. The aim is to efficiently and effectively communicate the results of an analysis. For confirmatory analysis, one or more hypotheses about the data serve as a starting point. The process can be described as a goal-oriented examination of these hypotheses. As a result, visualization either confirms these hypotheses or rejects them. Exploratory data analysis, as the process of searching and analyzing databases to find implicit but potentially useful information, is a difficult task, as the analyst has no initial hypothesis about the data. According to John Tukey, tools as well as understanding are needed for the interactive and usually undirected search for structures and trends [165]. Visual analytics is more than just visualization. It can rather be seen as an integral approach combining visualization, human factors, and data analysis. Figure 3.2 illustrates the detailed scope of visual analytics. On the visualization side, visual analytics integrates methodologies from information analytics, geospatial analytics, and scientific analytics. Especially human factors (e.g., interaction, cognition, perception, collaboration, presentation, and dissemination) play a key role in the communication between human and computer, as well as in the decision-making process.
In this context, production is defined as the creation of materials that summarize the results of an analytical effort, presentation as the packaging of those materials in a way that helps the audience understand the analytical results in context using terms that are meaningful to them, and dissemination as the process of sharing that information with the intended audience [158]. In matters of data analysis, visual analytics furthermore profits from methodologies developed in the fields of data management & knowledge representation, knowledge discovery, and statistical analytics. Note that visual analytics is not likely to become a separate field of study [184], but its influence will spread over the research areas it comprises. According to Jarke J. van Wijk, “visualization is not ’good’ by definition, developers of new methods have to make clear why the information sought cannot be extracted automatically” [169]. From this statement, we immediately see the need for the visual analytics approach, using automatic methods from statistics, mathematics, and knowledge discovery in databases (KDD) wherever they are applicable. Visualization is used as a means to efficiently communicate and explore the information space when automatic methods fail. In this context, human background knowledge, intuition, and decision-making either cannot be automated or serve as input for the future development of automated processes. The fields of visualization and visual analytics are both built upon methods from scientific analytics, geospatial analytics, and information analytics. They both profit from knowledge out of the fields of interaction as well as cognitive and perceptual science. In contrast to visualization, visual analytics explicitly integrates methodology from the fields of statistical analytics, knowledge discovery, data management & knowledge representation, and presentation, production & dissemination.

3.2.2 Visual Analytics challenges in network monitoring and security

The fields of network monitoring and security largely rely on automatic analysis methods for detecting failures and intrusions. However, with the steadily increasing amount and diversification of threats, these methods often produce enormous amounts of alerts or fail to detect novel attacks. Purely visual approaches to network monitoring suffer from the same shortcomings. However, visual analytics, as a combination of automatic analysis techniques with the background knowledge and intuition of human experts through interactive visual displays, appears to be a promising research area for solving some of the information overload problems in these fields. Through an appropriate visual communication of the analysis results, security experts can make a better and faster assessment of the current threat situation, which enables them to initiate countermeasures in time. The fact that these kinds of systems have not yet been widely used in practice by large network service providers can be explained by the following as yet unsolved challenges:

• Large networks produce data streams at enormous rates, which aggravates their real-time analysis. However, in many cases, only an immediate reaction can save the network resources from a major breakdown. Fast-paced analysis of large amounts of traffic logs and alerts from heterogeneous sources therefore needs to be improved through visual analytics applications in the near future.

• Scalability of both automatic analysis methods and visualizations is another major issue. While detailed traffic analysis is computationally infeasible on large traffic links, many data visualization methods are also incapable of visualizing large amounts of data. Since the health of the network largely depends on the capability to analyze its behavior, both scalability issues need to be approached.

• In some analysis scenarios, it is not only the amount of data that places a burden on the analyst, but also interpretability issues due to the complexity of the available information. This motivates innovative research on both automatic methods to abstract data as well as novel visual representations to gain an overview of complex analysis scenarios.

• Research on semantics is expected to considerably improve analysis tasks by transform- ing raw data into information useful for the analyst.

• While many newly proposed visualization systems facilitate analysis tasks, their usage by the intended audience is not guaranteed. User acceptance often becomes a challenge since routinized workflows are substituted, and the new tools often do not offer the same features as the old systems from the first release on.

Although these challenges have to be mastered first, the multitude of proposed systems for visual analysis of network traffic and events presented in the next section already indicates a transformation from automatic analysis systems towards visual analytics systems, which integrate the human expert into the analysis process.

3.3 Related work on visualization for network monitoring and security

Visual support for network security has recently gained momentum, as documented by the CCS Workshop on Visualization and Data Mining for Computer Security in 2004 (VizSEC/DMSEC 2004) and by the Workshops on Visualization for Computer Security in 2005, 2006, and 2007. First results were presented there, but it remains an intriguing endeavor to design visual analysis tools for network monitoring and intrusion detection. While reviewing previous work in the field of visualization for network security, we realized that almost all proposed visualization systems aim at tackling one or more of these three major problems:

1. Monitoring network traffic between hosts, prefixes, and ASes;

2. Analysis of firewall and IDS logs to detect viruses, worms, and attacks;

3. Detection of errors and attacks in the BGP routing system.

Ultimately, all previously proposed methods support the administrators in their task to gain insight into the causes of unusual traffic, malfunctions, or threat situations. Besides automatic analysis means, network operators have often relied on simple statistical graphics like scatter plots, pair plots, parallel coordinates, and color histograms to analyze their data [115]. However, to generate meaningful graphics, the netflow data and the countless alerts generated by IDSs and firewalls need to be intelligently pre-processed, filtered, and transformed, since their sheer amount raises scalability issues in both manual and visual analysis. Although traditional statistical graphics suffer from overplotting problems, they often form the basic metaphor of newly proposed visualization systems since analysts are familiar with their interpretation. Scatter plots, for example, are prone to overplotting when many data points are assigned the same position within the plot. Additional interaction features can then be used to enhance the user's capabilities to discover novel attacks and to quickly analyze threat situations under enormous time pressure. Recently, the book “Security Data Visualization: Graphical Techniques for Network Analysis” by Greg Conti [33] appeared. The book aims at teaching the reader to design a visualization system for network security, reviews state-of-the-art visualization techniques for network security, and demonstrates in practice how large amounts of traffic, firewall logs, and IDS alerts can be visually analyzed. Furthermore, an outlook on subareas of network security and beyond that could potentially benefit from visual analysis is given.

3.3.1 Monitoring of network traffic between hosts, prefixes, and ASes

Some of the earlier work in this field is the study by Erbacher et al. on intrusion and misuse detection [48]. Their proposed glyph visualization animates characteristics of the connections of a single host with other hosts in the network. In the center, the current workload of the monitored host is shown through the thickness of the center circle. Spokes extending from its perimeter represent the number of users in multiples of 10. Other hosts, which initiated connections to the monitored system, are then placed on several concentric rings according to their distance to the monitored system in the IP address space. These hosts are connected with the monitored system through different types of lines, thereby encoding the application and the connection status. At a more detailed level, port numbers give an indication of the running network applications that cause the traffic. For example, Lau presented the Spinning Cube of Potential Doom [99], a visualization based on a rotating cube used as a 3D scatterplot. Variables such as local IP address space, port number, and global IP address space are assigned to its axes. Each measured traffic packet is mapped to a point in the 3D scatterplot. The cube is capable of intuitively showing network scans due to emerging patterns such as horizontal and vertical lines or areas. However, 3D scatterplots may be difficult to interpret on a 2D screen due to overlay. InetVis reimplements Stephen Lau's spinning cube and is capable of maintaining interactive frame rates with a high number of displayed points in order to detect scan activity [168]. By comparing the visualized scans with the alert output of Snort and Bro sensors, the authors assess the sensors' effectiveness. Another port analysis tool is PortVis, described by McPherson et al. in [119].
It implements scatterplots (e.g., port/time or source/port) with zooming capabilities, port activity charts, and various means of interaction to visualize and detect port scans as well as suspicious behavior on certain ports.
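The scan patterns that such scatterplot tools reveal visually can also be approximated analytically. The following is a hypothetical sketch, not part of any of the cited systems: it flags sources that contact unusually many distinct destination ports on one host, which corresponds to the vertical-line pattern of a port scan in a port/time scatterplot. The function name, tuple format, and threshold are illustrative assumptions.

```python
from collections import defaultdict

def flag_port_scans(flows, threshold=100):
    """flows: iterable of (src_ip, dst_ip, dst_port) tuples.
    A source touching more than `threshold` distinct destination ports
    on a single host is reported as a likely vertical port scan."""
    ports_seen = defaultdict(set)
    for src, dst, port in flows:
        ports_seen[(src, dst)].add(port)
    return {pair: len(p) for pair, p in ports_seen.items()
            if len(p) > threshold}

# A scanner probing ports 1..1024 on one target stands out,
# while a client repeatedly fetching port 80 does not:
flows = [("10.0.0.5", "192.168.1.9", p) for p in range(1, 1025)]
flows += [("10.0.0.7", "192.168.1.9", 80)] * 50
print(flag_port_scans(flows))   # → {('10.0.0.5', '192.168.1.9'): 1024}
```

The same counting strategy, with sources and destinations swapped, would highlight horizontal scans across many hosts on one port.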

Figure 3.3: Computer network traffic visualization tool TNV

This work was continued in [122] and resulted in a tool for automatic classification of network scans according to their characteristics, ultimately leading to a better distinction between friendly scans (e.g., search engine webcrawlers) and hostile scans. Wavelet scalograms are used to abstract the scan information at several levels to make scans comparable. Subsequently, these wavelets are clustered and visualized as graphs to provide an intuition about the clustering result. For an even more detailed analysis of application processes, Fink et al. proposed a system called Portall that allows end-to-end visualization of the communication between distributed processes across the network [50]. While previous tools either showed host-level activities or network activities, this system enables the administrator to correlate network traffic with the running processes on the monitored machines. The system mainly uses hierarchical diagrams, which are linked with each other through straight connecting lines. A similar linking is also utilized in other applications, for instance, in TNV [59]. As shown in Figure 3.3, the main matrix links local hosts to external hosts through straight and curved lines. On the bottom, there is a time histogram detailing the amount of traffic per time interval. Through interactive selection, the interval of interest can be chosen, and a bifocal lens enlarges the focused area while the reduced context remains visible. Color is used to show the activity level for each local host. Details of the used protocols and the traffic direction are presented through colored arrowheads. Furthermore, details of the packets can be retrieved for each matrix cell via a popup menu. On the right-hand side, port activity can be visualized through a parallel coordinates view linking source and destination ports.
While this open source tool is excellent for monitoring a small local network, its limitation to displaying approximately 100 hosts at a time might cause scalability issues when monitoring medium or large size networks. Linking detailed data to other visualizations through connecting lines is not the only possibility to enhance scatterplots. The IDGraphs system, for example, maps point density within a scatterplot to brightness [138]. Plotting time on the x-axis versus the number of received SYN and SYN/ACK packets on the y-axis can reveal SYN flooding, IP, port, or hybrid scans when using the respectively appropriate aggregation strategy (e.g., the number of SYN and SYN/ACK packets aggregated by destination IP and port can show SYN flooding attacks). The system comprises a correlation matrix, which is interactively linked with the scatterplot by means of brushing. Rumint is another tool for visual analysis of network packets and consists of a text display, a parallel coordinates plot, a glyph-based animation display, a thumbnail toolbar, a byte frequency display, and the binary rainfall visualization [34]. This binary rainfall visualization depicts one network packet per line, detailing its foremost bits through green and black pixels. Alternatively, a 256-level gray-scale configuration for displaying the information byte-wise or a 24-bit RGB configuration for groups of three bytes can be chosen. Whereas textual approaches are limited to displaying the content of about 40 packets per screen, the binary rainfall visualization enables comparison of up to 1000 packets per screen. The tool can be beneficial for comparing packet lengths and identifying equal values between packets, thus supporting signature development for network-based malicious software. The byte frequency display is very similar, but rather than directly visualizing the bytes, it shows aggregated statistics of the byte values per packet. The VIAssist (Visual Assistant for Information Assurance Analysis) application is another approach to discovering new patterns in large volumes of network security data [37]. On the data side, the tool consists of an expression builder for highlighting qualifying data instances and smart aggregation features, already introduced in Section 2.4.3. The visualization front-end is composed of bar charts, a parallel coordinates display, a Table Lens, and a Star Tree graph visualization. Note that the latter two visualizations are adaptations of products from Inxight Software. Furthermore, the tool offers several mechanisms for collaboration and reporting, including sharing of annotations and items of interest as well as communication of hypotheses and analytical findings. Starlight is a software product developed at Pacific Northwest National Laboratory, originally targeting the intelligence community [139]. It allows users to analyze relationships within large data sets. The resulting shapes form clusters on the system's 3D graph display, which integrates structured, unstructured, spatial, and multimedia data and offers comparisons of information at multiple levels of abstraction, simultaneously and in near real-time. Network security is only one of several application areas of the product. In 1996, Lamm et al. published a study about access patterns of WWW traffic [98]. Their proposed Avatar system can be used for real-time analysis and for mapping WWW server accesses to the points of their geographic origin on various projections of the earth, using 3D bars on top of a globe and 3D scatter plots in a virtual reality context. Xiao et al.
start their analysis in the opposite direction [186] by first visualizing network traffic using scatterplots, Gantt charts, or parallel plots and then allowing the user to interactively specify a pattern to be abstracted and stored using a declarative knowledge representation. A related system, NVisionIP [97], employs visually specified rules and comes with the capability to store them for reuse in a modified form of the tcpdump filter language. The visual analytics feedback loop implemented in both approaches allows the analyst to build upon previous discoveries in order to explore and analyze more complex and subtle patterns.
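The SYN versus SYN/ACK aggregation strategy used by IDGraphs above can be sketched in a few lines. This is a simplified, hypothetical illustration rather than IDGraphs itself: it assumes packets are already keyed by the monitored host and that TCP flags arrive as the strings "S" and "SA".

```python
from collections import Counter

def syn_synack_profile(packets):
    """packets: iterable of (host_ip, flags) pairs, keyed by the monitored
    host.  Returns per-host counts of half-open connection attempts:
    SYNs observed minus SYN/ACK replies observed."""
    syn, synack = Counter(), Counter()
    for host, flags in packets:
        if flags == "S":
            syn[host] += 1
        elif flags == "SA":
            synack[host] += 1
    return {h: syn[h] - synack[h] for h in syn}

# A flooded host accumulates SYNs that are never answered, while a
# normal handshake balances out:
pkts = [("10.0.0.1", "S")] * 500 + [("10.0.0.1", "SA")] * 3
pkts += [("10.0.0.2", "S"), ("10.0.0.2", "SA")]
profile = syn_synack_profile(pkts)
print(profile["10.0.0.1"])   # → 497, a large gap suggesting a SYN flood
```

Plotted over time, this gap is exactly the quantity whose bright ridges IDGraphs maps to the y-axis of its density scatterplot.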

To effectively identify cyber threats and respond to them, computer security analysts must understand the scale, motivation, methods, source, and target of an attack. Pike et al. developed a visualization approach called Nuance for this purpose. Nuance creates evolving behavioral models of network actors at organizational and regional levels, continuously monitors external textual information sources for themes that indicate security threats, and automatically determines if behavior indicative of those threats is present in the network [134]. The visual interface of the tool consists of a monitoring dashboard that links events to their geographical reference and a detail view with bar and line charts that are annotated with related real-world information.

3.3.2 Analysis of firewall and IDS logs for virus, worm and attack detection

One of the key challenges of visual analytics is to deal with the vast amount of data from heterogeneous sources. In the field of network security, large amounts of events and traffic are collected in log files originating from traffic sensors, firewalls, and intrusion detection systems. As demonstrated in the application Visual Firewall, consolidation and analysis of these heterogeneous data can be vital for proper system monitoring in real-time threat situations [102]. Because gaining insight into complex statistical models and analytical scenarios is a challenge for both statistical and networking experts, the need for visual analytics as a means to combine automatic and visual analysis methods steadily grows along with increasing network traffic and escalating alerts. A study on firewall logs by Girardin and Brodbeck proposes to use Self-Organizing Maps for automatic classification of log entries [58]. By merging colors, shapes, and textures, multiple attributes are encoded into a single integrated iconic representation on an 8 by 8 grid. Selecting a cell in the grid triggers a list of the contained log events. The proposed tool further comprises a spring embedder graph layout consisting of 500 events, a parallel coordinates display, and a dynamic querying interface. In many networks, additional intrusion detection sensors are installed behind firewall systems in order to monitor suspicious traffic that attempts to bypass the firewall. One study to visually evaluate the output of such firewalls and IDSs was conducted by Alex Wood in 2003 [185], in which destination port vs. time scatterplots with zooming capabilities are created and color is used to encode the source IP. Other statistical visualizations such as boxplots with whiskers, bar charts, and pie charts are used in order to visually represent the distribution of the additional variables of the log data.
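The SOM-based classification of log entries described above can be illustrated with a minimal, self-contained sketch. This is a toy implementation with made-up parameters (grid size, learning rate, neighborhood schedule), not Girardin and Brodbeck's actual system; it only shows the core idea of arranging prototype vectors on a grid so that similar inputs land on nearby nodes.

```python
import math
import random

def best_match(nodes, v):
    """Grid coordinates of the node whose prototype is closest to v."""
    g = len(nodes)
    return min(((i, j) for i in range(g) for j in range(g)),
               key=lambda p: sum((a - b) ** 2
                                 for a, b in zip(nodes[p[0]][p[1]], v)))

def train_som(vectors, grid=8, epochs=20, seed=0):
    """Minimal Self-Organizing Map: fit a grid x grid map of prototype
    vectors to the given feature vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    nodes = [[[rng.random() for _ in range(dim)] for _ in range(grid)]
             for _ in range(grid)]
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)              # decaying learning rate
        radius = (grid / 2) * (1 - epoch / epochs) + 1  # shrinking neighborhood
        for v in vectors:
            bi, bj = best_match(nodes, v)            # best-matching unit
            for i in range(grid):
                for j in range(grid):
                    d = math.hypot(i - bi, j - bj)
                    if d <= radius:                  # pull neighbors toward v
                        h = lr * math.exp(-(d * d) / (2 * radius * radius))
                        nodes[i][j] = [w + h * (x - w)
                                       for w, x in zip(nodes[i][j], v)]
    return nodes

# Two artificial "log entry" feature clusters settle on different nodes:
data = [[0.1, 0.1], [0.12, 0.08], [0.9, 0.9], [0.88, 0.92]] * 10
som = train_som(data, grid=4, epochs=10)
```

In a log-analysis setting, each node would then be rendered as one grid cell of the visualization, and clicking a cell would list the log events whose best-matching unit it is.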
IDS Rainstorm also attempts to bridge the gap between large data sets and human perception [1]. A scatterplot-like visualization of all local IP addresses versus time is provided for analyzing the thousands of security events generated daily by the IDS. After zooming into regions of interest, lines appear and link the pictured incidents to other characteristics of the data set. Koike and Ohno argue that many IDS signatures are not detailed enough to avoid false positives and propose to use visualization to distinguish between false positive and true positive alerts [92]. For near real-time monitoring, their system SnortView reads system logs and Snort alerts every two minutes. Its basic visualization framework is a traditional 2D source IP vs. time matrix overlaid with statistical information, such as the number of alerts per time interval, the number of alerts per source IP, and colored glyphs encoding details of the used protocols and the alert priority. Furthermore, this time diagram is extended through a source-destination matrix, which interactively highlights the temporal distribution of alerts for a particular source IP and the associated destination IP. The Starmine system focuses its analysis of IDS alerts on the combination of geographical, temporal, and logical views, since some cyber threats can reveal distinct patterns in any of these views. The geographical components, map view and globe view, visualize the geographical source-destination relationships of attacks through arcs on the map and straight lines on the 3D globe. The map view is furthermore capable of showing the amount of attacks for each location as bars in a 3D scene on top of the map. In the integrated view, hosts on the map view are linked through straight lines with the respective hosts' positions in the IP matrix, which assigns a position to each IP address using its first and second bytes.
Furthermore, a line chart on the right of the 3D scene in the integrated view details the aggregated number of alerts per time interval. Although the system suffers from occlusion, the authors demonstrated in the course of analyzing several network viruses that correlations between geographical properties and IP address ranges can be found with their tool. The Starmine system was further extended in another study in order to deal with the geographical and logical properties of a local area network: the map now details the geographical positions of hosts within the university campus, and the IP matrix visualizes the third and the fourth octets of the local IP addresses [123]. The temporal view was also extended and now consists of several colored line charts, each displaying the aggregated amount of alerts per subnet. Furthermore, horizontal bars and a zooming interface show port activity. In order to solve some of the occlusion issues in the integrated 3D view, the analysts can choose the expansion view, which plots the logical, temporal, and geographical views next to each other in a two-dimensional arrangement. It is worth mentioning that some visualization techniques such as parallel coordinates and graphs have meanwhile found their way into commercial products, e.g., the RNA Visualization Module of SourceFire [146]. In this context, the techniques are used to display correlations in the multidimensional characteristics of network traffic and the connectivity of network nodes. However, the major drawbacks of the parallel coordinates technique are that it produces visual clutter due to overplotting lines and that only correlations between neighboring axes can be identified.

3.3.3 Detection of errors and attacks in the BGP routing system

As described earlier in Section 2.1, the whole Internet builds upon the BGP routing system. If a network link fails, alternative routes to the destination will be used to route the traffic, unless it was the only link connecting the respective AS with the Internet. Due to the huge scale of the Internet, routing updates are frequent and need to be quickly propagated to ensure reachability. Failures within the routing system can thus easily affect large shares of the Internet's network traffic, and it is crucial to detect and resolve them in a timely manner. Link Rank is a tool for gaining insight into large amounts of routing changes by converting them into visual indications of the number of routes carried over individual links in an embedded graph [96]. Its filtering mechanism helps to extract a reduced topological graph capturing the most important or most relevant changes. The tool enables network operators to discover otherwise unknown routing problems, understand the scope of impact of a topological event, and detect root causes of observed routing changes. Cortese et al. offer an alternative view on changes in the routing system using a topographic map metaphor [35]. Coloring and contour lines are used to confine AS's at the same level of the ISP hierarchy. These lines are taken into account when calculating the layout for the AS connectivity graph, thus placing top level AS's onto the peak of the fictitious mountain surrounded by gradually less important AS's bounded by contour lines. This kind of metaphor permits an effective visualization of time-animated routing paths within the BGPlay system [32].

3.3.4 Towards Visual Analytics for network security

In general, a trend can be observed from rather static displays that visualize the outcome of an automatic analysis technique towards highly interactive Visual Analytics systems for network security. These novel systems try to tightly integrate the human expert into the analysis process by allowing them to interact with automatic and visual analysis techniques in order to verify old or come up with new hypotheses, to test different views and algorithms on the data, and to reuse gained knowledge to further improve the analysis techniques. Note that while this related work section covered tools and systems for visual analysis of network traffic, the related work sections within the application chapters will focus on reviewing relevant visualization techniques from a broader spectrum of fields: Section 4.1 considers contributions on time series visualization, Section 5.1 reviews hierarchical visualization techniques, Section 6.1 discusses work on node-link diagrams, Section 7.1 deals with work on radial and multivariate representations, Section 8.1 presents some dimension reduction techniques, and Section 9.1 lists related work on visual analysis of email communication.

4 Temporal analysis of network traffic

"Everything happens to everybody sooner or later if there is time enough."

George Bernard Shaw

Contents

4.1 Related work on time series visualization . . . . . . 54
4.2 Extending the recursive pattern method for time series visualization . . . . 55
    4.2.1 The overall algorithm . . . . . . 56
    4.2.2 Empty fields to compensate for irregular time intervals . . . . . . 58
    4.2.3 Spacing . . . . . . 60
4.3 Comparing time series using the recursive pattern . . . . . . 61
    4.3.1 Small multiples mode . . . . . . 63
    4.3.2 Parallel mode . . . . . . 64
    4.3.3 Mixed mode . . . . . . 64
4.4 Case study: temporal pattern analysis of network traffic . . . . . . 64
4.5 Summary . . . . . . 67
    4.5.1 Future work . . . . . . 68

Time is one of the most important properties of network traffic. It is thus often used to correlate traffic loads and events of one host or higher-level network entity with others. When monitoring large networks, the task of comparing time series becomes difficult, especially due to the following two circumstances:

1. Network time series data is recorded at a very fine granularity – aggregation can be used for consolidating the data volume, but important details might get lost.

2. There is a multitude of potentially comparable time series, especially considering the total number (65536) of application ports of the TCP and UDP protocols or the number of individual hosts within the network.

Unfortunately, traditional visualization techniques from statistics (e.g., line charts or bar charts) scale neither to the resolution of large time series nor to the increasing number of time series compared within the same plot. As demonstrated in Figure 4.1, general trends, such as the increase in mail traffic in the morning, are visible, but the details of the exact timing of high or low traffic events get lost. Furthermore, overplotting becomes a serious problem as it makes the time series indistinguishable in some areas of the plot. As a countermeasure, in this chapter we present an overlap-free recursive pattern visualization technique and adapt it to the needs of our analysis.

Figure 4.1: Line charts of 5 time series of mail traffic (SMTP port 25, POP3 port 110, IMAP port 143, IMAPS port 993, POP3S port 995; packets per minute, logarithmic scale) over a time span of 1440 minutes as monitored at the university gateway on November 20, 2007

The rest of this chapter is structured as follows. First, we discuss related work in the field of time series visualization. Next, details of the recursive pattern algorithm are given and the technique is augmented with empty fields that compensate for irregular time intervals. With spaces added between groups of different levels, the recursive pattern technique becomes both more structured and more readable. Thereafter, we present enhancements for comparing several time series with each other, such as different coloring schemes as well as three configuration modes, namely, small multiples, parallel mode, and mixed mode. A case study is used to demonstrate the capabilities of the extended recursive pattern. Finally, the contributions are summarized in the last section, followed by a brief outlook on future work in this field.

4.1 Related work on time series visualization

Time series are an important type of data encountered in almost every application domain. The field has been intensely studied and has received a lot of research attention, especially from the financial sector. When it comes to information visualization, not only highlighting particular patterns is an important aspect, but also the arrangement of multiple time series to support comparison between several monitored items, as studied in [5, 63]. The study in [5] compares stock prices of 50 companies over time and demonstrates that with only 4 time series the line graph might reach its scalability limit, whereas the pixel-based circle segments technique is capable of showing all 50 companies and their stock price development over a long time period. The authors of [63] propose a set of layout masks for subdividing the available display space in a regularity-preserving fashion. The instances of the time series are then arranged according to their inter-time-series importance relationships by means of the masks. This work was continued and extended towards the extraction of local patterns within multidimensional data sets in [62]. A local pattern can be interactively selected, and so-called "intelligent queries" find similar or inverse patterns within the data set and order them according to their relevance. The technique has proven to be useful for the identification of correlated load situations of database servers and for finding suitable load balancing schemes using the available hardware resources. Hochheiser and Shneiderman use traditional line graphs in their Time Searcher system [68]. However, the tool's focus is set on the dynamic query interface. Through rectangular boxes, the user can simultaneously specify ranges of values and time intervals to find matching time series.
Furthermore, a query-by-example interface, support for queries over multiple time-varying attributes, query manipulation, pattern inversion, similarity search capabilities, and graphical bookmarks characterize the innovative user interface of the tool. Naturally, insight is mostly achieved through an iterative exploration process where previous results are progressively refined, as emphasized in a study by Phan et al. [132]. The authors demonstrate how they use progressive multiples of timelines and event plots to organize their findings in the process of investigating network intrusions. Another interactive approach is the LiveRAC system, which is designed to address limitations of traditional network monitoring systems by offering an exploration interface [118]. The system is based on a reorderable matrix where each matrix row represents a monitored system and each matrix column a group of one or more monitored parameters. Thereby, comparisons of time series at various levels of detail are enabled through the semantic zoom, which adapts each chart's visual representation to the available display space. In contrast to this work, the approach presented in this chapter is based on a pixel visualization technique to better use the scarce display space. Other application scenarios deal with the problem of finding usage patterns and visualizing them on larger time scales. Van Wijk and van Selow, for example, combined a clustering algorithm and a calendar view to identify daily energy consumption patterns [170]. Their presented methods provide insight into the data set and are suitable for fine-tuning their model's parameters to predict future energy consumption. Since calendars are a well-known metaphor for most users, they provide a good basis for extensions. DateLens is probably the most prominent novel calendar interface for PDAs [10].
It employs fisheye distortion to simultaneously show overview and details of calendar entries in a compact yet detailed fashion. In contrast to these studies, the overall goal of this chapter is to present a highly scalable method for visualizing large time series and making visual comparisons between them.

4.2 Extending the recursive pattern method for time series visualization

We start off by considering different tasks that typically need to be carried out when analyzing time series in a network monitoring context:

1. Find overall trends,

2. Spot repetitive events,

3. Reference an event to the exact time of occurrence,

4. Find co-occurring patterns in several time series.

Since commonly used line charts are capable of supporting the user in solving the above tasks only to a certain extent, we sought alternative, more scalable possibilities of representing temporal data. We draw our inspiration from the recursive pattern visualization [4], which arranges pixels or small rectangles in a line-wise back-and-forth fashion. This arrangement scheme has the property of rendering neighboring data elements right next to each other. When this scheme is repeated through a recursive placement of higher-order groups, the property is in most cases lost for neighboring data points which do not belong to the same upper-level group, such as the last day of a month and the first day of the subsequent month. Nevertheless, other regularly reappearing patterns can be made visible when using an appropriate configuration of the recursive pattern.

4.2.1 The overall algorithm

The basic recursive pattern algorithm is specified in terms of two array parameters, widths and heights, that define the pattern's rectangle placement scheme. To adapt the recursive pattern to our needs, we extended it by introducing three new parameters, namely direction, spacingx, and spacingy, resulting in the following set of parameters in the extended recursive pattern:

widths – Integer array of size n; specifies the horizontal partitioning of the recursive pattern, with elements ordered in descending priority.

heights – Integer array of size n; specifies the vertical partitioning of the recursive pattern, with elements ordered in descending priority.

direction – Boolean array of size n; specifies the start direction of the pattern for each level (0 for horizontal, 1 for vertical). Note that this parameter was not introduced in the original publication [4], but it is necessary for handling top-down alternating recursive patterns. In case this parameter is not set, a horizontal start direction is assumed for each level.

spacingx – Floating point array of size n; specifies the horizontal space between elements of each level.

spacingy – Floating point array of size n; specifies the vertical space between elements of each level.

Our implementation of the extended recursive pattern consists of three functions: a) recPos recursively finds the position relative to the pattern hierarchy as specified through widths, heights, and direction, b) absPos determines the absolute position, and c) addSpacing introduces spaces between elements and groups of elements of different levels. Assume a pattern representing the days of one year with 30 days in each of the 12 months (30 × 12 = 360). We could define widths = (4,6) and heights = (3,5), which would create a 4 × 3 arrangement of the months and a 6 × 5 arrangement of the days. This leads to the recursive pattern shown in Figure 4.2: the line-wise forward and backward arrangement positions subsequent days within a month next to each other.

Figure 4.2: Recursive pattern example configuration: 30 days in each of the 12 months with parameters widths = (4,6) and heights = (3,5)

The recPos function defined in Algorithm 4.1 takes parameters pos, level and size as input and returns an array with the position relative to the pattern hierarchy. Querying position 50 with recPos(50, 0, 360) in the time series of the previous example would then return (1,20) (2nd month and 21st day). Note that we always count from 0 for easier array handling while using the mod and div functions. Array position 50, equivalent to (1,20) in the recursive pattern hierarchy, is marked white in Figure 4.2 for illustrative purposes.

Algorithm 4.1: Recursive pattern algorithm – calculation of the position relative to the recursive pattern hierarchy

 1  procedure recPos(pos, level, size)
 2  begin
 3      size /= (widths[level] ∗ heights[level])
 4      if level < n − 1 then
 5          return (pos div size) ∪ recPos(pos mod size, level + 1, size)
 6      else
 7          return pos
 8  end
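The recursive position computation can be sketched in Python as follows; this is a minimal sketch assuming the pattern configuration of the running example (widths = (4, 6), heights = (3, 5), 360 elements).

```python
# Sketch of recPos (Algorithm 4.1) for the running example:
# 12 months of 30 days each, i.e., 360 elements.
widths = [4, 6]    # horizontal partitioning, highest level first
heights = [3, 5]   # vertical partitioning, highest level first
n = len(widths)

def rec_pos(pos, level, size):
    """Return the position of element `pos` relative to the pattern hierarchy."""
    size //= widths[level] * heights[level]
    if level < n - 1:
        return [pos // size] + rec_pos(pos % size, level + 1, size)
    return [pos]

print(rec_pos(50, 0, 360))  # [1, 20]: 2nd month, 21st day (counting from 0)
```
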

The absPos function in Algorithm 4.2 is somewhat more complex since it has to transform the relative position array into absolute positions that can be easily mapped on the screen. The function proceeds by retrieving the relative position array and adding the offset of the considered item from the top left corner for each level. Assuming the horizontal start direction, array position 50 in the above example is calculated as yrel = 1 div 4 = 0 and xrel = 1 mod 4 = 1

Algorithm 4.2: Recursive pattern algorithm – calculation of the absolute position from the relative position within the pattern

 1  procedure absPos(pos)
 2  begin
 3      dx = ∏_{i=0}^{n−1} widths[i], dy = ∏_{i=0}^{n−1} heights[i]
 4      rel = recPos(pos, 0, dx ∗ dy)
 5      xabs = 0, yabs = 0
 6      for i = 0 to n − 1 do
 7          dx /= widths[i]
 8          dy /= heights[i]
 9          xrel = 0, yrel = 0
10          if direction[i] == 0 then
11              yrel = rel[i] div widths[i]
12              if yrel mod 2 == 0 then
13                  xrel = rel[i] mod widths[i]
14              else
15                  xrel = widths[i] − 1 − rel[i] mod widths[i]
16          else
17              xrel = rel[i] div heights[i]
18              if xrel mod 2 == 0 then
19                  yrel = rel[i] mod heights[i]
20              else
21                  yrel = heights[i] − 1 − rel[i] mod heights[i]
22          xabs += xrel ∗ dx
23          yabs += yrel ∗ dy
24      return addSpacing(xabs, yabs)
25  end

(lines 11 and 13 in the algorithm) in the first run of the loop, resulting in xabs = 1 ∗ 6 = 6 and yabs = 0 (lines 22 and 23); the second run of the loop calculates yrel = 20 div 6 = 3 and xrel = 6 − 1 − 20 mod 6 = 3 (lines 11 and 15), resulting in xabs += 3 ∗ 1 = 9 and yabs += 3 ∗ 1 = 3, which yields the final position (9, 3). The addSpacing function is explained in Section 4.2.3.
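The worked example can be reproduced with a self-contained Python sketch of Algorithm 4.2, again assuming the running example configuration; the final addSpacing step is replaced by an identity mapping here.

```python
# Sketch of absPos (Algorithm 4.2); spacing is omitted in this sketch.
widths = [4, 6]
heights = [3, 5]
direction = [0, 0]   # horizontal start direction on every level
n = len(widths)

def rec_pos(pos, level, size):
    size //= widths[level] * heights[level]
    if level < n - 1:
        return [pos // size] + rec_pos(pos % size, level + 1, size)
    return [pos]

def abs_pos(pos):
    dx = dy = 1
    for i in range(n):
        dx *= widths[i]
        dy *= heights[i]
    rel = rec_pos(pos, 0, dx * dy)
    x_abs = y_abs = 0
    for i in range(n):
        dx //= widths[i]
        dy //= heights[i]
        if direction[i] == 0:                  # horizontal start direction
            y_rel = rel[i] // widths[i]
            if y_rel % 2 == 0:                 # forward (left-to-right) line
                x_rel = rel[i] % widths[i]
            else:                              # backward (right-to-left) line
                x_rel = widths[i] - 1 - rel[i] % widths[i]
        else:                                  # vertical start direction
            x_rel = rel[i] // heights[i]
            if x_rel % 2 == 0:
                y_rel = rel[i] % heights[i]
            else:
                y_rel = heights[i] - 1 - rel[i] % heights[i]
        x_abs += x_rel * dx
        y_abs += y_rel * dy
    return x_abs, y_abs

print(abs_pos(50))  # (9, 3), matching the worked example
```
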

4.2.2 Empty fields to compensate for irregular time intervals

Figure 4.3 demonstrates different configurations of the recursive pattern algorithm for visualizing the days of one month. In (a) we assigned 30 rectangles to one month in a 6 × 5 matrix under the assumption of a space-filling display utilization. Each month consists of 28 to 31 days and, therefore, the value represented by one square in this visualization varies between 23.2 and 25.7 hours. This unconventional aggregation strategy might lead to confusion since we cannot tell for sure whether the high value of a cell is due to the actual data


(a) 6 × 5 configuration (b) 7 × 5 configuration (c) 7 × 1, 1 × 5 configuration

Figure 4.3: Recursive pattern parametrization showing a weekly reappearing pattern

distribution or whether it was introduced artificially. Apart from this drawback, the layout destroys the weekly patterns because the Friday squares end up at different horizontal positions in each row. When considered in a multi-resolution context, this layout inevitably complicates both the understanding and the implementation of drill-down operations from months to weeks and days.

The second possibility is a 7 × 5 configuration, which maintains daily patterns by filling the pattern with empty rectangles at the beginning and at the end, as demonstrated in Figure 4.3(b). Although it is now possible to compare traffic of subsequent days, the capability of easily comparing patterns occurring on the same weekday is still lacking. Figure 4.3(c) shows our final configuration with groups of 7 rectangles per line (first parameter), vertically arranged into 5 lines (second parameter).

Note that when analyzing daily-grained time series spanning several months, one has to reserve space for up to six weeks per month because the first week could be almost empty, which creates the need for a sixth week at the end of the month. This layout has two drawbacks: 1) it wastes up to 33 percent of screen space for the month of February (42 rectangles for 28 days), and 2) the property of having subsequent days always appear next to each other within a month is lost due to the line-wise left-to-right arrangement. However, these drawbacks are compensated by the gained advantage of easily spotting weekly reappearing patterns.
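As a small illustration of this space requirement, the number of week rows a month actually occupies in a 7-per-line layout follows from the weekday of its first day. The sketch below uses Python's standard calendar module (Monday is weekday 0); the helper name is our own, not part of the presented system.

```python
import calendar

def week_rows(year, month):
    """Number of 7-day rows needed when days are laid out
    line-wise with one column per weekday (Monday first)."""
    first_weekday, days = calendar.monthrange(year, month)
    return -(-(first_weekday + days) // 7)   # ceiling division

print(week_rows(2007, 12))  # 6: December 2007 starts on a Saturday
print(week_rows(2010, 2))   # 4: February 2010 starts on a Monday
```

Reserving a fixed six rows per month thus leaves 14 of the 42 cells empty for a 28-day February, which is the 33 percent waste mentioned above.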

Technically, empty fields are simply rendered in a color that does not appear in the colormap representing valid entries. This is demonstrated in Figure 4.4 using the example of mapping the number of emails per day for one year of communication to a recursive pattern. As the information can be displayed at several granularities without changing the positions of the underlying data items, this technique has proven to be suitable for interactive data exploration due to its multi-resolution capabilities. Since the resolution of the displayed data is only limited by the number of pixels on the screen, the number of artificially introduced empty fields, and the separating borders between the displayed data measurements, this technique turns out to be very scalable and can easily display one million data elements on an ordinary computer screen.

Figure 4.4: Multi-resolution recursive pattern with empty fields for normalizing irregularities in the time dimension

4.2.3 Spacing

Having realized the difficulty of interpreting recursive patterns that either contain many elements or are highly nested, we introduced spacing as the second innovation aimed at enhancing the recursive pattern technique. With gaps added between the elements of different logical time intervals (e.g., hours, days, weeks, months, years, etc.), the user is able to mentally map the rectangles at arbitrary positions to a time reference. Figure 4.5 exemplifies the difference between the traditional recursive pattern, which in this case might be very useful for comparing hourly reappearing patterns, and a recursive pattern with spacing, both visualizations referring to the same email packet data sent on November 20, 2007. Note that this data set only contains the POP3 packets that entered or left the university network. As a result, the traffic from students or employees who check their email at the university server from within the internal network is not included. In both configurations, it is evident that most employees start checking their email at approximately 8:20 h and the intensity of usage drops at approximately 18:00 h. While the standard recursive pattern configuration enables the analyst to spot hourly repeating events, it remains difficult to estimate the exact time an event occurred due to the large number of elements (60 × 24) without visual reference points. In the recursive pattern with spacing configuration, one group of rectangles represents the measurements of one hour (6 times 10 minutes). Several patterns can be found, such as horizontal bars within a group that indicate low or high traffic in subsequent minutes, whereas vertical bars represent traffic events in 10-minute intervals. In the 6th and 7th hour (5:00 h to 6:59 h), for example, one can nicely spot traffic patterns in 5-minute intervals, which are probably caused by a client machine repetitively fetching emails from the server.
An analyst interested in two subsequent red cells in the evening would have a hard time finding out at what time exactly they occurred based only on the visualization depicted in Figure 4.5(a). Figure 4.5(b), in contrast, enables fast and intuitive estimation of the hours and minutes of the pattern of interest, which, in our example, maps to the time interval from 21:21 h to 21:22 h.

Figure 4.5: Enhancing the recursive pattern with spacing: (a) standard recursive pattern with an hour-per-line arrangement; (b) recursive pattern with spacing separating 10 minute intervals (showing logarithmically scaled one-day traffic volume in packets per minute over port 995 (POP3 over TLS/SSL) at the gateway of a university network)

In our implementation, the width of the spacing for each level is stored in the arrays spacingx and spacingy. Algorithm 4.3 adds the spacing level-wise for each column to the left of and for each row above the current element. Note that we refrain from adding spacing if the previous factor in the array widths or heights was equal to 1, since this denotes a line-wise (row-wise) arrangement of the elements rather than an alternating back-and-forth (top-down) pattern.
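The level-wise spacing of Algorithm 4.3 can be sketched in Python as follows; the configuration values are illustrative, and cell coordinates are assumed to be in units of the smallest element.

```python
# Sketch of the addSpacing step; spacing values are illustrative,
# not taken from the described system.
widths = [4, 6]
heights = [3, 5]
spacing_x = [3.0, 1.0]   # horizontal gaps per level, highest level first
spacing_y = [2.0, 0.5]   # vertical gaps per level, highest level first
n = len(widths)

def add_spacing(x_abs, y_abs):
    """Shift an absolute cell position by the accumulated level-wise gaps."""
    dx = dy = 1
    x_f, y_f = float(x_abs), float(y_abs)
    dx_old, dy_old = dx + 1, dy + 1
    for i in range(n):
        # No gap is added when the previous level's factor was 1 (plain
        # line-wise arrangement instead of a back-and-forth pattern).
        if dx != dx_old:
            x_f += (x_abs // dx) * spacing_x[n - i - 1]
        if dy != dy_old:
            y_f += (y_abs // dy) * spacing_y[n - i - 1]
        dx_old, dy_old = dx, dy
        dx *= widths[n - i - 1]
        dy *= heights[n - i - 1]
    return x_f, y_f

print(add_spacing(9, 3))  # (21.0, 4.5) for this configuration
```

For the cell at (9, 3), one completed 6-wide group to the left adds one top-level gap, and each of the nine unit columns adds a low-level gap, which is where the shifted x coordinate comes from.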

4.3 Comparing time series using the recursive pattern

Prior to presenting various layout strategies for the combined visualization of several time series, let us consider according to which criteria time series can be distinguished from one another. First of all, the elements of each time series are made distinguishable through their position. However, when comparisons of detailed elements originating from different time series are made, it should be clear which time series each element belongs to. Besides changing the element position through a different layout function, we propose several visual separation options using the visual attribute color: a) element colors, b) border colors, or c) background colors for each time series, as shown in Figure 4.6.

Algorithm 4.3: Recursive pattern algorithm – adding spacing between different levels

 1  procedure addSpacing(xabs, yabs)
 2  begin
 3      dx = 1, dy = 1
 4      xf = xabs, yf = yabs
 5      dxold = dx + 1, dyold = dy + 1
 6      for i = 0 to n − 1 do
 7          if dx ≠ dxold then
 8              xf += (xabs / dx) ∗ spacingx[n − i − 1]
 9          if dy ≠ dyold then
10              yf += (yabs / dy) ∗ spacingy[n − i − 1]
11          dxold = dx, dyold = dy
12          dx ∗= widths[n − i − 1]
13          dy ∗= heights[n − i − 1]
14      return {xf, yf}
15  end

Figure 4.6: Different coloring options for distinguishing between time series: (a) element color, (b) border color, (c) background color

Depending on the analysis task, different data normalization methods, such as linear, square-root, or logarithmic normalization, could be used. Furthermore, the analysis task also determines whether each of the time series should be normalized separately or whether the colormap should be adjusted accordingly. When the analyst searches for a pattern with identical absolute values in several time series, the same minimum and maximum should be used for the normalization of all time series. However, when comparing numerical values of several time series that differ in their value characteristics (e.g., comparing the number of hosts with the number of packets), normalizing each time series with its own minimum and maximum might be the more appropriate option.

We define the following three configuration modes of the recursive pattern for time series comparison: a) the small multiples mode to investigate several time series one by one, b) the parallel mode to facilitate the task of finding parallel developments within two or more time series, and c) the mixed mode, which combines recursive patterns at an intermediate level. As exemplified in Figure 4.7, time series can be combined at different levels of their intrinsic (e.g., days or weeks) or artificially introduced (e.g., 10 min intervals) hierarchy. Thereby, this combination level indirectly determines the recursive pattern configuration mode. Note that combining time series at the lowest level (e.g., rendering day 1 of time series A, B, and C right of each other, followed by day 2 and so on) makes little sense since the original time series in such a view can be identified only with great cognitive effort. Combination at the second lowest level results in the parallel mode, combination at intermediate levels results in the mixed mode, and combination at the highest level corresponds to the small multiples mode.

Figure 4.7: Combination of two time series (A and B) at different hierarchy levels: (a) weeks, (b) months, (c) years
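The two normalization strategies discussed in this section can be contrasted in a short sketch. The helper below is our own illustration, not part of the presented system; its scale options mirror the linear, square-root, and logarithmic normalizations mentioned above.

```python
import math

def normalize(series, lo, hi, scale="linear"):
    """Map values to [0, 1] for colormap lookup using a linear,
    square-root, or logarithmic scale."""
    f = {"linear": float, "sqrt": math.sqrt, "log": math.log1p}[scale]
    t_lo, t_hi = f(lo), f(hi)
    span = (t_hi - t_lo) or 1.0
    return [(f(v) - t_lo) / span for v in series]

packets = [10, 400, 90]   # e.g., packets per minute (illustrative values)
hosts = [1, 2, 8]         # e.g., distinct hosts per minute

# Shared minimum/maximum: identical absolute values get identical colors,
# so the low-volume hosts series is squeezed into the darkest colors.
lo, hi = min(packets + hosts), max(packets + hosts)
shared = [normalize(s, lo, hi) for s in (packets, hosts)]

# Per-series minimum/maximum: appropriate when value ranges differ widely,
# as both series now use the full colormap.
separate = [normalize(s, min(s), max(s)) for s in (packets, hosts)]
```
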

4.3.1 Small multiples mode

Small multiples have been extensively used in information visualization as a straightforward way of comparing two or more visualizations. Once the visualization of a single time series has been understood, it is only a small step to transfer this knowledge to the analysis of several time series placed next to each other, as shown in Figure 4.8. Human perception is capable of recognizing patterns co-occurring in several of the displayed visualizations at either exactly the same time or shifted in time. However, due to the long eye movement distances, we suspect that there exist other recursive pattern layouts that are more efficient for spotting co-occurring events across multiple time series.

Figure 4.8: Recursive pattern in small multiples mode (from left to right: the number of packets per minute on mail ports 25 (SMTP), 110 (POP3), 143 (IMAP), 993 (IMAPS), 995 (POP3S) on November 20, 2007 with square-root normalization)

Figure 4.9: Recursive pattern in parallel mode

4.3.2 Parallel mode

Parallel mode aligns parallel elements of distinct time series underneath each other, while the recursive placement scheme for the upper level hierarchies remains unchanged. In order to distinguish between different time series, we use several intensity-varying colormaps, as depicted in Figure 4.9. The major advantage of this approach is that only small eye movements are necessary for comparative analysis of time series since the elements referring to the same point in time are aligned underneath each other. However, it might be perceptually difficult to mentally map the values to the correct time series.

4.3.3 Mixed mode

So far, we have considered the comparison of time series at the highest level with the small multiples mode and at the lowest level with the parallel mode. However, there is another possibility: combining a pair of time series at an intermediate level, namely, by using what we call the mixed mode. Figure 4.10 demonstrates the mixed mode approach using the example of rendering groups of size 10 × 6 underneath each other, whereby each group represents one hour in a single time series. Different background colors are used to mark different time series. The mixed mode layout represents a compromise between the parallel and the small multiples modes. We believe that the user can better follow the individual time series with this approach than with the parallel mode and that the eye movements for comparisons between the time series are considerably smaller than in the small multiples mode. As shown for treemaps in the work by Tu and Shen [160], there is another comparison mode applicable to the recursive pattern visualization. The authors suggest splitting the rectangles into two and then visualizing changes via the color and the position of the diagonal split line. For comparison of several attributes, vertical bars visualize the respective changes.

4.4 Case study: temporal pattern analysis of network traffic

One of the current network security problems is the detection of botnet servers within the internal network. While botnet clients are mostly Windows machines that have been hacked via well-known exploits, the servers commonly run Linux or Unix systems because those systems provide a rich set of tools, good connectivity, and uninterrupted availability. One way of hacking such a Linux or Unix system is to use a brute-force approach by trying out password lists on common user names. While this approach rarely produces a lot of network traffic in absolute terms, it is noticeable through an increased number of flows with destination port 22.

Figure 4.10: Recursive pattern in mixed mode

In this case study, we captured the incoming SSH traffic from November 21 to 27, 2007 and stored it in the database. The acquired data comprises 1.4 million flows, or 162 GB, of network traffic. A single so-called flow of this netflow data is characterized by the source and the destination IP addresses and port numbers, the number of packets, and the number of transferred bytes, where the latter two measures are aggregated over the flow interval. We realized that in our data set the flow interval commonly does not exceed 72 seconds. A longer SSH session therefore results in approximately one flow per minute. In most configurations, the SSH server interrupts the connection after three subsequently failed login attempts, which forces the attacker's machine to reconnect in order to continue trying out further user name and password combinations. Such an interrupt results in another flow due to the changed source port number. In order to aggregate the flows into minute intervals, all fact table entries that map to the same timestamp value at minute-wise precision are consolidated into one entry. Figure 4.11 shows the number of SSH flows per minute over a time frame of one week (November 21 to 27, 2007). The extended recursive pattern visualization here places 12 hours horizontally, each with 6 × 10 minutes. The corresponding 12 hour intervals of the subsequent days are then placed underneath.
Below the middle, the reappearing color marks the second half of each day.

Figure 4.11: Case study showing the number of SSH flows per minute captured at the university gateway from November 21 to 27, 2007 (time hierarchy: 1 min, 10 min, 1 hour, 12 hours)

Since values are scaled logarithmically prior to their mapping to the grayscale colormap, dark areas in this plot represent massive amounts of SSH connections caused by attacking machines. In this case, the same normalization is used for each of the seven days, since their measurements are directly comparable. While these massive peaks would also be visible in line charts or bar plots, those layouts are less supportive in revealing the exact time of their occurrence. The extended recursive pattern visualization, on the contrary, allows the analyst to see the patterns and to associate them with the exact moments in time by counting the element's position from the top left corner of each hierarchy level. For example, a large-scale attack seen in Figure 4.11 (dark area in the red time series) took place from 13:51 h to approximately 16:29 h on November 21, 2007.

Besides those extreme peaks, the visualization reveals other increases in the number of SSH flows, such as the one on November 22 (blue) from approximately 10:30 h to 21:09 h, which we want to investigate more deeply. Figure 4.12 shows the recursive pattern visualization of the detailed netflow data from November 22 as the number of flows (blue), distinct source hosts (green), distinct destination hosts (violet), network packets (orange), and transferred bytes (yellow). Note that the blue and the turquoise variables are scaled using their common maximum, since they convey the same type of information from the source and the destination perspective, respectively. It is now interesting to observe that the attacks from 22:25 h to 22:45 h and from 23:33 h to 23:59 h are visible in four out of five of the data dimensions. Both attacks generated many flows to a large number of destination hosts, thereby sending many network packets with heavy payload.
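The position counting used to read off exact times can be sketched as simple index arithmetic, under the assumption that each 12-hour block arranges 12 hours side by side, each hour as 6 ten-minute columns of 10 one-minute cells; the actual cell ordering in the figures may differ in detail:

```python
# Hedged sketch of the recursive pattern position arithmetic.
def cell_position(minute_of_halfday):
    """Map a minute (0..719) within a 12-hour block to (column, row)."""
    hour, m = divmod(minute_of_halfday, 60)
    group, element = divmod(m, 10)       # ten-minute group, minute in group
    return hour * 6 + group, element     # 72 columns, 10 rows

def cell_to_minute(col, row):
    """Inverse lookup: read off the time from a cell position."""
    hour, group = divmod(col, 6)
    return hour * 60 + group * 10 + row

# 13:51 h lies in the second 12-hour block of the day: minute 111 of
# that block, i.e. column 11, row 1.
col, row = cell_position(111)
```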
The gray flow pattern from 10:30 h to 21:07 h can be clearly seen in the number of packets and payload, but it is not reflected in the number of destination hosts (turquoise), which suggests that it is either an exhaustive attack targeting a single host or, more likely, an ordinary SSH session with massive data transfers.

Figure 4.12: Details on the number of flows (blue), external source hosts (green), internal destination hosts (violet), packets (orange), and payload (yellow) of SSH traffic captured at the university gateway on November 22, 2007

In a nutshell, the scalability of the extended recursive pattern visualization technique was demonstrated in Figure 4.11 by presenting the details of more than 10 000 measurements. Despite this large amount of data, the precision of the technique made it possible to visually determine the exact timing of certain events. Moreover, Figure 4.12 demonstrated that the technique enables the analyst to precisely correlate temporal patterns of several time series.

4.5 Summary

Analysts in the field of network monitoring and security have to deal with time series that either contain thousands of measurement values or consist of a multitude of different measurements. In order to properly assess critical network events, the analyst must be able to compare several time series at different levels of detail and to correlate the events contained in the time series. In our opinion, conventional line or bar charts do not scale to the large time series characteristic of network monitoring and security, especially with respect to the tasks of spotting repetitive events, determining precise timings of events, and correlating these events across multiple time series. To support the above tasks, we introduced in this chapter the extended recursive pattern visualization for analyzing large or multi-dimensional time series. The technique has been adapted to match the needs of the analysis by incorporating the following innovative features:

• Empty fields compensate for irregular time intervals to use the technique in a multi-resolution context.

• Spaces are inserted between different hierarchy levels within the time dimension to facilitate the interpretation of large time series.

• Three coloring methods are presented to distinguish between different time series in a comparative view.

• Three layout configuration modes (small multiples, parallel, and mixed mode) enable visual comparisons of time series with the extended recursive pattern.

The case study exemplified the usage of the extended recursive pattern visualization on real network monitoring data sets and demonstrated how SSH attacks can be recognized within large amounts of netflows captured at the university gateway. The detailed analysis of the number of flows, source hosts, destination hosts, packets, and payload showed how a rather short attack targeting many destination hosts can be visually distinguished from an intensive SSH session.

4.5.1 Future work

Although we have extended the recursive pattern technique, no user studies have been conducted to determine and quantify the pros and cons of our approach. A user study could be helpful in estimating whether our technique outperforms bar and line charts on certain tasks, whether users would prefer our technique, and which coloring technique for visual separation of the time series (element, border, or background color) is the most effective. Furthermore, the question of which comparison mode of the recursive pattern is most effective for common analysis tasks remains unanswered. However, due to time limitations, these issues could not be investigated within this work.

5 A hierarchical approach to visualizing IP network traffic

“What really makes it an invention is that someone decides not to change the solution to a known problem, but to change the question.”

Dean Kamen

Contents

5.1 Related work on hierarchical visualization methods
5.2 The Hierarchical Network Map
    5.2.1 Mapping values to colors
    5.2.2 Netflow exploration
    5.2.3 Scaling node sizes
    5.2.4 Visual Scalability
5.3 Space-filling layouts for diverse data characteristics
    5.3.1 Requirement analysis
    5.3.2 Geographic HistoMap layout
    5.3.3 One-dimensional HistoMap layout
    5.3.4 Strip Treemap layout
5.4 Evaluation of data-driven layout adaptation
    5.4.1 Visibility
    5.4.2 Average rectangle aspect ratio
    5.4.3 Layout preservation
    5.4.4 Summary
5.5 User-driven data exploration
    5.5.1 Filtering
    5.5.2 Navigation within the map
    5.5.3 Colormap interactions
    5.5.4 Interactions with multiple map instances
    5.5.5 Triggering of other analysis tools
5.6 Case studies: analysis of traffic distributions in the IPv4 address space
    5.6.1 Case study I: Resource location planning
    5.6.2 Case study II: Monitoring large-scale traffic changes
    5.6.3 Case study III: Botnet spread propagation
    5.6.4 Expert feedback
5.7 Summary
    5.7.1 Future work

The main focus of this chapter is to propose an interactive Hierarchical Network Map (HNMap) technique that supports a mental representation of global Internet measurements. HNMap is applied to depict a hierarchy of 7 continents, 190 countries, 23 054 autonomous systems, and 197 427 IP prefixes. We adopt the Treemap approach [77]: each node in the hierarchy is drawn as a box placed inside its parent. Node sizes are proportional to the number of items contained. Popup and fixed-text labels are used to describe the nodes, and an adjustable color scale encodes the values of the measurement attribute.

An important aspect of this study is to assess the layout stability of Treemap algorithms under large adjustments in the displayed set of nodes. We compare two layout algorithms for ordered data, split-by-middle and Strip Treemaps, with respect to squareness, locality preservation, and run time. Filtering and navigation for interactive exploration of the proposed hierarchy are provided. These operations may force re-layouting of the affected part of the hierarchy. In addition to that, we implemented interactions with the color scale and with multiple map instances for gaining insight into the traffic distribution over time. Furthermore, we show how other analysis tools can be triggered from within the HNMap interface to study additional dimensions of the data sets.

Three small case studies conducted jointly with security experts are presented as evidence of the suitability of HNMap for visual analysis of security-related data. To the best of our knowledge, this is the first proposed method for visualizing large-scale network data aggregated by prefix, autonomous system, country, and continent.

The chapter is structured as follows. We start by discussing the related work in the field of hierarchical visualization methods in Section 5.1 and then introduce our multi-resolution HNMap approach in Section 5.2.
Section 5.3 presents the involved visualization algorithms, followed by their evaluation with respect to visibility, average rectangle aspect ratio, and layout preservation in Section 5.4. Section 5.5 discusses the user interaction techniques that constitute our exploration tool. Different application fields of HNMap are presented as small case studies with real-world data sets in Section 5.6. Section 5.7 concludes the chapter and summarizes its contributions.

5.1 Related work on hierarchical visualization methods

Viewing our work in a broader context, one quickly realizes that data sets encountered in a multitude of application fields have immanent classification hierarchies imposed onto the data entries at the finest granularity level. Since hierarchical structures play a vital role in the process of analyzing and understanding these data sets, information visualization researchers have sought different ways of presenting and interacting with such data. To put the emphasis on the hierarchical structure, Robertson et al. proposed Cone Trees [140], in which the hierarchy is presented in 3D to effectively use the available screen space and enable visualization of the whole structure. Through interactive animation, some of the user's cognitive load is shifted to the human perceptual system. As an alternative 3D layout,

Kleinberg et al. presented a rather unusual approach by encoding hierarchy information in the stem and branch structure of a botanical tree [89]. A common way of displaying hierarchical data is given by layouts that place child nodes inside the boundaries of their parent nodes. Such displays provide spatial locality for nodes under the same parent and visually emphasize the sizes of sets at all hierarchy levels. Usually, leaf nodes may have labels or additional statistical attributes that may be encoded graphically as relative object size or color.

The most prominent layout of this type is the Treemap – a space-filling layout of nested rectangles – available in a number of variants. The earliest variant was the Slice-and-dice Treemap [77]. Here, display space is partitioned into slices sized proportionally to the total area of the nodes contained therein. This procedure is repeated recursively at each hierarchy level, rendering child nodes inside parent rectangles while alternating between horizontal and vertical partitioning. The technique is easy to implement and runs efficiently, but it suffers from producing long, thin rectangles, which are hard to perceive and to compare visually. Squarified Treemaps [16] remedy this deficiency by using rectangles with controlled aspect ratios. Rectangles are prioritized by size, so that the large ones are treated as the most critical ones for layout. This improves the appearance of the Treemap, but does not preserve the input node order, which may be an undesirable effect in some applications.

This drawback was addressed by the Ordered Treemaps [11]. Most Ordered Treemap variants are pivot-based. Unfortunately, our application often deals with rectangles of highly varying sizes, so that the resulting pivot rectangles can become highly deformed. We therefore investigated an adaptation of pivot-based Ordered Treemaps and compared it to Strip Treemaps.
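The Slice-and-dice recursion can be sketched minimally as follows; the tuple-based tree encoding and function name are illustrative, not the layout code used in this work:

```python
# Minimal Slice-and-dice Treemap sketch (after Shneiderman [77]):
# partition a rectangle among weighted children, alternating the split
# axis per hierarchy level.
def slice_and_dice(node, x, y, w, h, vertical=True, out=None):
    """node = (name, weight, children); returns a list of (name, rect)."""
    if out is None:
        out = []
    name, weight, children = node
    out.append((name, (x, y, w, h)))
    total = sum(c[1] for c in children)
    offset = 0.0
    for child in children:
        frac = child[1] / total if total else 0.0
        if vertical:   # split the horizontal extent
            slice_and_dice(child, x + offset * w, y, frac * w, h,
                           not vertical, out)
        else:          # split the vertical extent
            slice_and_dice(child, x, y + offset * h, w, frac * h,
                           not vertical, out)
        offset += frac
    return out

tree = ("root", 4, [("a", 1, []), ("b", 3, [])])
rects = dict(slice_and_dice(tree, 0, 0, 100, 100))
# rects["a"] == (0, 0, 25.0, 100); rects["b"] == (25.0, 0, 75.0, 100)
```

The alternating split axis is exactly what produces the long, thin rectangles criticized above when sibling weights differ strongly.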
Similarly to the Million Items Treemap [49], the goal of our technique is to handle a large hierarchy (more than 200,000 nodes). Unlike the latter study, we do not intend to animate size changes, but rather focus on maintaining a stable layout when redistributing the gained display space after leaving out substantial parts of the hierarchy.

An alternative non-Treemap and non-space-filling layout algorithm was applied by Itoh et al. to visualize computer security data [76]. Their proposed rectangle packing algorithm maps hosts to rectangles and subnets to larger enclosing rectangles, as demonstrated in Figure 5.1. Yet another possible layout of the IPv4 address space is the QuadTree decomposition of subsequent pairs of bits of the IP address [154], which was demonstrated in the context of analyzing routing changes. For each event, the point representing the IP address is connected by lines with the two involved autonomous systems (placed outside of the QuadTree). Depending on the security events occurring in the network, characteristic patterns appear. The IP Matrix approach [93] uses a similar placement schema, but splits each IP address into two 256 × 256 matrix representations. This system is supplemented by stacking multiple IP matrices in 3D, one for each parameter of the attack data.

Furthermore, space-filling curves have been applied to display the IP address space. In contrast to Treemaps, some of these curves depict continuous ranges of the IP address space through non-rectangular regions while preserving geometric continuity. Munroe, for example, drew a comic map of the registry information of all /8 subnets in the IPv4 address space using the fractal mapping of a Hilbert Curve [124]. Wattenberg's work on jigsaw maps [179] also uses this layout function as well as the H-Curve. Other researchers focused on surveying usage of the IP addresses, ranging from aggregated data on images [135] to wall-sized 600 dpi printouts of all hosts in the Internet [66] by means of space-filling curves.

Figure 5.1: Example of a hierarchical data visualization using a rectangle-packing algorithm [76]

The major advantage of our Hierarchical Network Map versus the rectangle packing algorithm, the QuadTree approach, the IP Matrix, and the space-filling curves is a more meaningful repositioning of the IP addresses on the display. We combine network topology with geographic information for an improved orientation and understanding of the analyzed data sets, while at the same time scaling up to the entire IPv4 address space. However, due to the uncertainty involved in the geographic placement of autonomous systems, a few misplacements are to be taken into account.

Our work has been inspired by geographical distortion methods, such as PixelMap [84] and HistoScale [83], which reposition points and polygons. Recent work on cartographic layouts, particularly the rectangular cartograms [67] that optimize the layout of rectangles with respect to area, shape, topology, relative position, and display space utilization, also influenced our work. In the latter work, a genetic algorithm has been applied to find a good compromise between these objectives; the algorithm renders the layout offline, not interactively, and does not yet exploit hierarchical structures. An overview of further rectangular cartograms can be found in [167].

Our proposal combines several layout techniques to cope with the large, multilevel AS/IP hierarchy in a tool that runs fast enough for interactive response. A geographic layout method denoted HistoMap is proposed for continents and countries; a fast one-dimensional version of HistoMap for autonomous systems aims at preserving neighborhoods in 1D; and, finally, a Strip Treemap [11] approach that preserves the input order of the data is especially appropriate for IP prefixes at the lowest level.

5.2 The Hierarchical Network Map

The Hierarchical Network Map is a visual exploration approach aimed at displaying the distribution of source and target data traffic of network hosts in a more expressive way than a simple density diagram like the one shown in Figure 5.2.

Figure 5.2: Density histogram showing the distribution of sessions over the IP address space (number of connections from source per 8-bit IP address prefix)

The intention to provide an overview of the entire Internet on the screen has forced us to treat every pixel as a valuable asset and to use a space-filling technique. Compared to node-link diagrams, our technique has three major advantages:

1. No display space is wasted.

2. Larger data sets can be visualized by mapping network size to the node's area.

3. Better support for multi-resolution exploration, since drill-downs have only local effects on the layout.

The whole network structure can be viewed as a hierarchy with the IP addresses of single hosts as the bottom-level nodes. Each host belongs to a local IP network (or prefix), which, in turn, is a member of an autonomous system. The IP address hierarchy is extended by two additional geographic classes on top of the two just mentioned network levels, namely, countries and continents. The display rationale is to recursively nest child node rectangles in their parent rectangles, from single hosts up to local networks, AS's, countries, and continents.

Figure 5.3 illustrates this multi-resolution approach by investigating the outgoing traffic of a chosen local network within the University of Konstanz in terms of the number of packets sent from August 27 to September 4, 2005. In the world view with a drill-down on the European continent, shown in Figure 5.3 (a), high network traffic targeting Germany can be observed. Figure 5.3 (b), showing a continent view of Europe with a drill-down on Germany, reveals that the volume at a single autonomous system (AS 553, BelWue) in Germany exceeds the aggregated traffic to Switzerland. In the view of Germany shown in Figure 5.3 (c), a drill-down on BelWue (the research network of the federal state Baden-Württemberg, Germany) confirms the expectation that our university network (134.34.0.0/16) receives most data packets.

Figure 5.3: Multi-resolution approach using the Hierarchical Network Map: (a) world view, (b) continent view, (c) country view

In many space-filling hierarchical layouts such as Treemaps [77], the area is partitioned according to a variable statistical measure so that the sizes of the resulting rectangles correspond to the respective measure values. In our scenario, choosing the per-node traffic volume as a measure would lead to constantly changing sizes and positions of node rectangles on the display, thus aggravating the user's orientation. Since the success of our technique is measured by its applicability for continuous network monitoring and analysis, recognition and familiarity turn into critical requirements. Therefore, we map the screen area and its partitioning to a rather non-volatile measure (at least in the short run), namely, the total size of the network and its components. Given a constant ratio of the display space, this design results in an almost static map layout, as network architectures normally experience little change in the short term.

While the containment relationships within rectangles show the IP address hierarchy and class membership, the size of each hierarchical region, from a single IP address all the way up to a whole continent, is proportional to the number of IP addresses that region contains. Furthermore, the geographical nodes, i.e., those at country and continent level, preserve their relative geographical position (similar to a cartogram). It is this geographic awareness of the nodes that allows us to classify our technique as a map. At the level of autonomous systems and local networks, the contained rectangles appear sorted by IP address in a left-to-right and top-down fashion, thus accelerating the visual lookup of any sub-node.
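The five-level containment hierarchy can be illustrated with a toy lookup; the tiny dictionaries here are stand-ins for the real prefix, AS, and geolocation databases, and the function name is illustrative:

```python
# Sketch of the HNMap containment chain:
# host -> prefix -> autonomous system -> country -> continent.
import ipaddress

PREFIX_TO_AS = {ipaddress.ip_network("134.34.0.0/16"): 553}   # BelWue
AS_TO_COUNTRY = {553: "DE"}
COUNTRY_TO_CONTINENT = {"DE": "Europe"}

def hierarchy_path(ip_str):
    """Return the chain of enclosing nodes for a host, outermost first."""
    ip = ipaddress.ip_address(ip_str)
    for prefix, asn in PREFIX_TO_AS.items():
        if ip in prefix:
            country = AS_TO_COUNTRY[asn]
            return [COUNTRY_TO_CONTINENT[country], country,
                    f"AS{asn}", str(prefix), ip_str]
    return None   # host not covered by the (toy) prefix table

# hierarchy_path("134.34.1.2") ->
#   ["Europe", "DE", "AS553", "134.34.0.0/16", "134.34.1.2"]
```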

5.2.1 Mapping values to colors

Since we use the visual variables area and position for mapping the network size and the respective node's position within the IP address space, respectively, we decided to use color to convey per-node traffic loads measured as a total number of bytes, packets, flows, or sessions. For this color mapping, we rely on methods already discussed in Section 3.1.2. In the HNMap, the output of the chosen normalization function is mapped to the index positions within the red-to-blue color scale for each displayed rectangle. We favor the logarithmic color scale, as demonstrated in Figure 5.3, due to its improved discrimination of low values and its smoothing effect on outliers.

In some cases, analyzing an absolute measure of network traffic (e.g., number of sessions or transferred bytes) hardly provides any insight into the time-varying data dynamics. It may be more informative to run a backend database query to calculate the first or second derivative over time, depending on the needed level of abstraction for the analysis task, with subsequent visualization of the query results. Further details on derived measurements are given in Section 5.6.2.

Processing the entire network data the way it is protocolled by the system logs has turned out to be infeasible for the analysis. The generated amount of “raw” operations is simply too large to store and analyze even on high-performance workstations. For example, an attempt to apply our approach to exploring traffic at the gateway of a middle-sized university would result in storing several gigabytes of log data per hour. To avoid input data overflow, we store aggregated operations, in which the packets are grouped into sessions and the sessions of related flows, i.e., those referring to the same source and target and the same port numbers within close timestamps, are summed up.
Therefore, the load recorded by such an aggregated entry is described by the number of sessions, the number of transferred packets, and the payload measured in bytes.
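The logarithmic normalization onto color-scale indices can be sketched as follows; the number of colors and the clamping behavior are assumptions for illustration, not the exact HNMap implementation:

```python
# Sketch: map a traffic measure onto indices of a color scale using
# logarithmic normalization (cf. Section 3.1.2).
import math

def log_color_index(value, max_value, n_colors=256):
    """Map value in [0, max_value] to a color index in [0, n_colors - 1]."""
    if max_value <= 0 or value <= 0:
        return 0
    frac = math.log(value + 1) / math.log(max_value + 1)
    return min(n_colors - 1, int(frac * (n_colors - 1)))
```

The `+ 1` shift keeps zero-valued measures at index 0, while the logarithm compresses the heavy-tailed traffic distribution so that low-volume nodes remain distinguishable.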

5.2.2 Netflow exploration

Our proposed visualization can be applied both in online mode for live network traffic monitoring and in offline mode for exploring the network's behavior in a selected time and space window. Let us consider each operation mode to reveal its constraints and implications in terms of the required input data and performance.

Offline Mode

In offline mode, historical traffic data is loaded from the database. The user specifies the type of load and the time window as compulsory filters. Optional filters can be specified and/or added in the process of interaction.

Suppose that a network administrator is concerned about the degrading performance of UDP traffic in the network as packets continue to get dropped at an increasing rate. He starts the investigation by choosing the target load filter to trace where the traffic was sent to, and proceeds by selecting a time period starting at the moment the problem was discovered. Finally, the protocol filter is activated to show only UDP traffic. Running the HNMap helps him to formulate a hypothesis about the target distribution of the analyzed data flows. Interactively, he discovers that most of the traffic is sent to Germany and selects another date to check whether such large amounts of traffic are typical for that target node. To verify whether the spotted traffic pattern is the result of continuous growth or appeared all of a sudden, the network administrator runs an animated view of the daily traffic from the previous month. Observing a continuous upward trend makes it obvious that there is a need for better connectivity to the target network.
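Such a combination of compulsory and optional filters can be sketched as a small parameterized query builder; the table and column names are invented for illustration and do not reflect the actual HNMap database schema:

```python
# Sketch: compose compulsory (measure, time window) and optional
# (protocol) filters into one parameterized SQL query.
def build_flow_query(measure, start, end, protocol=None):
    sql = (f"SELECT target_prefix, SUM({measure}) AS load "
           "FROM flows WHERE ts BETWEEN ? AND ?")
    params = [start, end]
    if protocol is not None:          # optional filter
        sql += " AND protocol = ?"
        params.append(protocol)
    sql += " GROUP BY target_prefix"
    return sql, params

sql, params = build_flow_query("bytes", "2005-11-29 00:00",
                               "2005-11-29 23:59", protocol="UDP")
```

Using placeholders for the filter values keeps the query safe and lets the same statement serve repeated interactive refinements.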

Online Mode (Monitoring)

Monitoring network flows in online mode is a challenging task due to the high data arrival rate and run-time requirements. It can be performed by refreshing the display either continuously or periodically. Continuous re-rendering imposes extreme costs, since each incoming event affects the visualization. Therefore, it is rather imperative to define a reasonably short interval at which the visualization gets updated with newly arrived data.

In many networks, daily operations are monitored by measuring the performance of the gateway routers. We believe that extending the mere observation of traffic statistics to also take into account the source and target distribution in terms of the IP addresses involved may result in improved adjustment of the routing policies of large networks. In the long term, considerable cost savings can be achieved if the same network performance is provided with fewer hardware resources due to more intelligent routing policies.

5.2.3 Scaling node sizes

Big differences in size between IP prefixes, AS's, countries, and even continents turn visual comparison of the respective rectangles into a challenge, especially when dealing with ordinary computer displays as opposed to wall-sized displays. We opted for a compromise by scaling the IP prefix sizes (number of contained IP addresses), thus indirectly affecting the upper-level aggregates, using square-root, logarithmic, or uniform scaling:

f_sqrt(w_i) = √(w_i)         (5.1)
f_log(w_i)  = log(w_i + 1)   (5.2)
f_uni(w_i)  = 1              (5.3)

The effects of node size scaling are illustrated in Figure 5.4. Square-root scaling considerably reduces the size of the larger rectangles, while the latter still remain considerably larger than the previously smaller rectangles. While logarithmic scaling displays minimal size differences, uniform scaling assigns identical sizes to all rectangles. Note that the size of any upper-level rectangle (with gray borders) is determined by the sum of its scaled child nodes. This can lead to the effect that an AS containing many small prefixes becomes bigger than an AS with only a few large prefixes, although the latter contains more IP addresses.

Figure 5.4: Scaling effects in the HNMap demonstrated on some IP prefixes in Germany: (a) linear, (b) square-root, (c) logarithmic, (d) uniform
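Equations 5.1 to 5.3 and the resulting parent sizes can be transcribed directly; the example prefix sizes are illustrative and reproduce the effect just described:

```python
# The three node-size scalings (Equations 5.1-5.3); a parent's size is
# the sum of its scaled children.
import math

def f_sqrt(w): return math.sqrt(w)
def f_log(w):  return math.log(w + 1)
def f_uni(w):  return 1.0

def parent_size(child_weights, f):
    return sum(f(w) for w in child_weights)

# An AS with many small prefixes can outgrow an AS with one large
# prefix under log scaling, although the latter contains far more
# IP addresses (2048 vs. 65536 here):
many_small = [256] * 8
few_large = [65536]
assert parent_size(many_small, f_log) > parent_size(few_large, f_log)
```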

5.2.4 Visual Scalability

The term visual scalability describes the capability of visualization tools to display large data sets in terms of the number or the dimensionality of data elements [45]. We designed the HNMap to be highly scalable in order to support a large tree structure (7 continents, 190 countries, 23 054 autonomous systems, and 197 427 networks) in an effort to facilitate the overview. To what extent this explorative potential can actually be exploited depends on the available display size. Technological advancements such as large display walls enhance scalability while maintaining the overview (see Figure 5.5). The query interface on the top left shows the traffic distribution over time and specifies the selected data, in this case the traffic entering the gateway of the University of Konstanz on well-known ports (0-1023) on November 29, 2005, using “transferred bytes” as a measure with logarithmic color mapping. One recognizes a heavy traffic load from AS 3320 (red) of Deutsche Telekom as well as from neighboring AS's in Germany. A port histogram reveals high activity on the web ports 80 and 443. For security and privacy reasons, the data was aggregated and sanitized.

Figure 5.5: HNMap showing all 19 731 AS's (data from 2005) on a large display wall (5.20 m × 2.15 m, 8.9 megapixels, powered by 8 projectors)

To separate adjacent rectangles from each other, we implemented two border schemes. In the first scheme, rectangles are visually separated solely by one-pixel-thin borders, irrespective of the rectangles' resolution. No additional borders are placed around parent rectangles in order to maximize the effective display area. To improve visual perception of the hierarchical structure represented by the borders and of the depth of rectangle partitions, the borders are highlighted with a distinct color for each granularity level (see Figure 5.6). The second scheme does not draw borders between nodes at the lowest hierarchy level (prefixes) at all, but uses the saved space to separate nodes at upper levels, incrementing the border thickness by 1 pixel per level (AS's: 1 px, countries: 2 px, continents: 3 px). Labels are displayed in black or white, depending on which provides the better contrast to the respective rectangle's background color. If labels are too big to fit into the rectangle boundaries, they are either shortened or not drawn at all in case there is not enough space for at least three characters.

continent country autonomous system network

Figure 5.6: Improving the visibility of hierarchy levels through borders with a distinct color per level (borders are omitted for rectangles that are less than 2 pixels wide or high)

5.3 Space-filling layouts for diverse data characteristics

The hierarchy of the IP address space exhibits a number of particularities: the upper two hierarchy levels should be placed in a space-filling way according to geographic coordinates, and nodes with similar children need to be positioned close to each other at the AS level. The latter goal can be achieved by calculating the middle IP address of each AS and arranging the nodes according to this linear order. At the lowest hierarchy level, the linear order becomes even more meaningful, as many subsequent IP prefixes share common owners or routing policies, or are semantically linked (e.g., two universities that joined the Internet at roughly the same time). To visualize the large hierarchy at hand in an interpretable way, we chose three different layout algorithms, namely the geographic HistoMap, a variant of this layout called HistoMap 1D, and the Strip Treemap. We discovered that the hierarchy levels exhibit different properties and impose different requirements on the space-filling layout. Therefore, we propose a combination of the above-mentioned layout algorithms. However, the preservation of the space-filling criteria comes at the expense of sacrificing geographic neighborhood, the absolute order of the nodes, the visibility of a few small nodes, and layout stability when discarding nodes on purpose, as there is no ideal layout algorithm to date.
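The middle-IP ordering of AS nodes can be sketched as follows. The helper name and the prefix sets are hypothetical, and since the exact definition of the "middle" address is not spelled out in the text, the midpoint of the covered address range is used here as one plausible reading:

```python
import ipaddress

def as_sort_key(prefixes):
    """Middle IP address (as an integer) over all prefixes advertised
    by one AS; used to order AS nodes linearly (midpoint reading)."""
    lo = min(int(ipaddress.ip_network(p).network_address) for p in prefixes)
    hi = max(int(ipaddress.ip_network(p).broadcast_address) for p in prefixes)
    return (lo + hi) // 2

# Hypothetical AS prefix sets; AS's with low prefixes sort first.
as_prefixes = {
    "AS-A": ["4.0.0.0/8"],
    "AS-B": ["239.254.0.0/16"],
    "AS-C": ["128.0.0.0/16", "130.0.0.0/16"],
}
ordered = sorted(as_prefixes, key=lambda a: as_sort_key(as_prefixes[a]))
print(ordered)  # ['AS-A', 'AS-C', 'AS-B']
```

Sorting AS's by this key places networks with numerically low address space near the start of the linear order, as described for the HistoMap 1D layout below.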

5.3.1 Requirement analysis

The starting point of our visualization technique is a squarified treemap [16], which optimizes the aspect ratio of the resulting rectangles to produce nearly square partitions, thus improving the visibility of the mapped hierarchical structure and the comparability of rectangle areas. We extend this approach by treating the rectangle's position with respect to its neighbors as another optimization criterion and use the color attribute to express the statistical value of each rectangle representing a network or a geographic entity. During our preliminary analysis, we found that the following four criteria have to be taken into account:

1. Full space utilization: A good solution for space-filling is achieved by recursively splitting the display space into two rectangles.

2. Preserving proportions across rectangles: This criterion is fulfilled by setting the proportions of the two resulting rectangles according to the number of IP addresses they represent.

3. Geographic awareness (position): The optimal positions are used to initialize the layout, but are not considered in the further layout optimization process.

4. Aspect ratio: The aspect ratio of the rectangles is optimized by evaluating alternative split directions.

Some of these criteria conflict (see Table 5.1); for instance, aspect ratio conflicts with full space utilization, since the latter cannot avoid rendering rather “stretched” screen partitions. The rest of this section presents the layout algorithms used.

                    Space utilization   Area   Position   Aspect ratio
Space utilization                                 x            x
Area                                              x            x
Position                   x             x                     x
Aspect ratio               x             x        x

Table 5.1: Conflicting optimization criteria

5.3.2 Geographic HistoMap layout

The upper two levels of the AS/IP hierarchy consist of geographic entities. In general, geographic visualization is often very compelling – two-dimensional maps are familiar to most people as a convention for representing three-dimensional reality. Remarkably enough, mental representations derived from maps are effective for many tasks even when extreme scales and nonlinear transformations are involved. Many approaches have been investigated for showing geographically-related as well as more abstract information on maps [40].

To meet the challenging visibility goals of our application, the proposed geographic maps rely on several kinds of abstraction: a) oceans are omitted, b) countries are represented as rectangles, c) these rectangles are sized proportionally to the number of IP addresses in the AS's assigned to those countries, and d) these geographic entities are repositioned while seeking to preserve neighborhood relationships. Figure 5.7 shows the result of applying these abstractions using the HistoMap algorithm. Note that the unscaled number of IP addresses is used for the size of each country's rectangle.

Figure 5.7: Geographic HistoMap of the upper two levels of the IP hierarchy (size represents the number of IP addresses assigned to each country)

For the geographic layout, the underlying HistoMap algorithm operates on four arrays containing latitudes, longitudes, weights, and an index of the geographic entities. In the initial setting for the partitioning algorithm sketched in Algorithm 5.1, there is one spatial point p_i for each continent, country, autonomous system, or network; P = {p_1, ..., p_n} where (p_i)_{i=1,...,n} ∈ R². A weight w_i ∈ R is attached to each rectangle and is equal to the number of IP addresses x_i represented by p_i. Alternatively, the outcome of the scaling functions described in Section 5.2.3 applied to x_i can be used for w_i. The function area determines the area of each rectangle on the screen as follows (dx is the width and dy the height of the display):

area(r_i) = (w_i / ∑_{j=1}^{n} w_j) × dx × dy        (5.4)

The algorithm searches for a partitioning of the display space into |P| = n rectangles R = {r_1, ..., r_n}. Each time a point set P is split into P_1 and P_2, the representative rectangle r is split first horizontally into two rectangles r_1 and r_2 and then vertically into r_3 and r_4. In the next step, the quality function determines the better split, depending on the average squareness of

Algorithm 5.1: Partitioning algorithm.

procedure partition(P)
begin
    if |P| > 1 then
        (P1, P2) ← splitRect(P)
        partition(P1)
        partition(P2)
    else
        drawRect(P)
end

procedure splitRect(P)
begin
    (P1, P2) = splitHorizontal(P)
    (P3, P4) = splitVertical(P)
    if quality(P1, P2) > quality(P3, P4) then
        return (P1, P2)
    else
        return (P3, P4)
end

(r_1, r_2) and (r_3, r_4). The position of the split line between the rectangles is determined by the sum of weights associated with each of the two point sets. To avoid deeply unbalanced recursive calls while operating on these large arrays, we split the arrays in a way similar to the Pivot-by-middle Ordered Treemap [11]. The result of a split is that all elements on one side of the pivot element are less than or equal to the pivot element, and all elements on the other side are greater than or equal to it. This can be achieved by choosing an arbitrary pivot from the array and applying a modified version of Quicksort. In our algorithm, each successive recursive call of Quicksort is limited to the part of the array that includes the middle position, in order to obtain partitions of the same size ±1.

The quality of horizontal and vertical splits in each call of splitRect can be tested by first semi-sorting the data according to longitude and then latitude. Obviously, the second sorting destroys the order of the first one. To guarantee the reproducibility of the partitioning, we define an absolute ordering on the data set through the use of the index array. In case of equality, the higher index determines which element is greater. For speed, our quality function assesses the squareness of the two rectangles in a greedy fashion without testing the effects of further splits through look-aheads. In general, this approach is not limited to a two-way split, but can readily be extended to three-or-more-way splits. For simplicity, we used the 2-bin variant in our experiments.

With respect to the use of geographic information, we realized two things: a) in some cases, neighborhood preservation for countries fails, and b) assigning autonomous systems to countries involves uncertainty; thus erroneous information can be communicated. However, until now we have found only one misplaced autonomous system. The technique's strength lies in the possibility to show network traffic at clearly defined granularity levels, so that detailed information can be retrieved at the continent, country, autonomous system, or network level. Although mapping a network or an AS to multiple countries (either by splitting it partially or by rendering it twice) is an option, we refrained from it in order to keep the system interpretable.
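Algorithm 5.1 can be sketched in Python as follows. The semi-sorting by longitude/latitude and the index-based tie-breaking are simplified to a plain sort on a one-dimensional key, so this sketch is closer to the HistoMap 1D variant than to the full geographic algorithm; all names are ours:

```python
def partition(points, rect, out):
    """Recursively split rect = (x, y, w, h) among weighted points,
    following the structure of Algorithm 5.1; points is a list of
    (key, weight) pairs."""
    if len(points) == 1:
        out[points[0][0]] = rect
        return
    points = sorted(points, key=lambda p: p[0])   # stand-in for semi-sorting
    mid = len(points) // 2                        # pivot-by-middle split
    left, right = points[:mid], points[mid:]
    frac = sum(w for _, w in left) / sum(w for _, w in points)
    x, y, w, h = rect
    # candidate horizontal and vertical splits, sized by weight fraction
    hsplit = ((x, y, w * frac, h), (x + w * frac, y, w * (1 - frac), h))
    vsplit = ((x, y, w, h * frac), (x, y + h * frac, w, h * (1 - frac)))

    def quality(rects):                           # average squareness
        return sum(min(rw, rh) / max(rw, rh) for _, _, rw, rh in rects) / 2

    best = hsplit if quality(hsplit) >= quality(vsplit) else vsplit
    partition(left, best[0], out)
    partition(right, best[1], out)

layout = {}
partition([("a", 1), ("b", 1), ("c", 2)], (0, 0, 100, 100), layout)
```

Each leaf receives a screen area proportional to its weight, and the split direction with the squarer sub-rectangles is chosen greedily, without look-ahead, as described above.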

5.3.3 One-dimensional HistoMap layout

To place autonomous system nodes in a spatially meaningful way, we first attempted to use AS numbers as a sorting criterion, but soon discovered that this ordering hardly reveals interesting patterns. As an alternative, we calculate the median IP address of all prefixes advertised in an AS. When applying the HistoMap 1D layout, AS's containing predominantly low IP prefixes (e.g., 4.0.0.0/8) are placed near the upper left corner, and those containing high prefixes (e.g., 239.254.0.0/16) are placed near the lower right corner. Figure 5.8 demonstrates the outcome of this layout algorithm run on the set of all autonomous systems within Germany.

The one-dimensional HistoMap layout is a simplification of the geographic HistoMap algorithm. As in the geographic version, the effects of vertical and horizontal splits of the available display space are assessed, but the split along the single one-dimensional data array is the same for both directions. A speed-up factor of 2.5 is obtained by avoiding the need to re-sort. It is also possible to apply splitting to multidimensional data. Each split then represents a hyperplane orthogonal to the chosen split dimension. Furthermore, a look-ahead function could be applied to find better layouts, but is not considered here.

Figure 5.8: HistoMap 1D layout of all AS's in Germany (the measure (number of incoming connections) of each item is expressed through color)

5.3.4 Strip Treemap layout

The Strip Treemap method was chosen as the bottom-level layout for four reasons: a) its linear layout (see Figure 5.9) preserves the order of the rectangles, leading to improved readability, b) its local optimization is efficient when processing large arrays, c) it yields good aspect ratios, and d) it is relatively robust against changes [11]. In this algorithm, the items of the input list are first sorted according to an index. Then, input items are iteratively added to the current strip. If the average aspect ratio of the rectangles within this strip increases after recalculating their sizes, the candidate rectangle is removed from the strip, all other rectangles are finalized on the screen, and a new strip is initialized and made current. The algorithm terminates after processing all rectangles. To obtain better aspect ratios, especially to avoid long, skinny rectangles in the final strip, we apply an optimization proposed by Bederson et al. [11], which consists of maintaining a look-ahead strip and moving items from this strip to the current strip if the combined aspect ratio improves.
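The strip-filling loop described above can be sketched as follows. This is a minimal version without the look-ahead optimization of Bederson et al.; the function and variable names are ours:

```python
def strip_treemap(weights, width, height):
    """Simplified Strip Treemap sketch: items fill horizontal strips in
    input order; a new strip is started whenever adding the next item
    would worsen the strip's average aspect ratio."""
    scale = width * height / sum(weights)       # screen area per unit weight

    def avg_ratio(items):
        h = sum(items) * scale / width          # height of a full-width strip
        ratios = []
        for w in items:
            rw = w * scale / h                  # rectangle width in the strip
            ratios.append(max(rw / h, h / rw))
        return sum(ratios) / len(ratios)

    def flush(strip, y):
        h = sum(strip) * scale / width
        x = 0.0
        for w in strip:
            rects.append((x, y, w * scale / h, h))
            x += w * scale / h
        return y + h

    rects, strip, y = [], [], 0.0
    for w in weights:
        if strip and avg_ratio(strip + [w]) > avg_ratio(strip):
            y = flush(strip, y)                 # finalize the current strip
            strip = []
        strip.append(w)
    flush(strip, y)
    return rects

layout = strip_treemap([4, 4, 4, 4], 4, 4)
print(layout)  # four 2 x 2 squares in two strips
```

Because items are processed strictly in order, the resulting rectangles preserve the prefix order line by line, which is the property exploited in Figure 5.9.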

Figure 5.9: Strip Treemap layout emphasizing the prefix order in the AS ATT-INTERNET3

5.4 Evaluation of data-driven layout adaptation

For the experiments, we used the previously described geographic HistoMap layout for the upper two hierarchy levels to position continent and country nodes. The one-dimensional data type of the next two levels makes it possible to use a wide variety of layout algorithms, but we limited our research to HistoMap 1D, Strip Treemap, and their combinations due to their efficiency and consistent ordering properties. With regard to runtime performance, we measured 1.5 seconds to render the complete IP hierarchy using HistoMap 1D, 1.9 seconds for Strip / HistoMap 1D, 3.1 seconds for HistoMap 1D / Strip, and 4.6 seconds for the Strip layout. However, we still see some optimization potential in applying an array-based sorting algorithm to speed up our implementation of the Strip layout. The remainder of this section is dedicated to assessing the rectangles' visibility at the third and fourth hierarchy level, the average aspect ratio, and layout preservation when applying HistoMap 1D, the Strip layout, or their combinations.

5.4.1 Visibility

Showing all data in a large hierarchy at once is difficult, especially in the presence of a large variance in size between data elements. On the one hand, we would like the display to accurately convey the sizes of the various parts of the hierarchy; on the other hand, we want to show as many items as possible, including small ones. An obvious approach is to simply allocate the screen space according to the number of IP addresses in each prefix, which yields meaningful sizes for nodes at all hierarchy levels. However, this approach does not cope well with the large variance in prefix sizes, ranging from 2^0 addresses (subnet mask /32) to 2^24 addresses (subnet mask /8).

One improvement can be achieved by realizing that some IP prefixes (networks) are contained in others. Borders for deeply nested hierarchy levels are costly in terms of display space [143], so we render overlapping prefixes adjacent to each other, assigning values (such as packet counts) to the most specific prefix, which is a common approach when analyzing routing data. If display space has to be assigned to approximately 2 billion IP addresses, this still works out to half a pixel per 1000 IP addresses on a conventional megapixel display, or about 5 pixels on a 9-megapixel powerwall.

To further reduce the visibility problem, the size of the nodes corresponding to IP prefixes at the lowest hierarchy level can be normalized using square root, logarithmic, or uniform scaling. In practice, with square root normalization we were unable to show 657 prefixes using the optimal layout at all levels on a 1920 × 1200 pixel screen (net resolution: 1856 × 1132 pixels). Logarithmic normalization f_log(w_i) = log2(w_i + 1) combined with two heuristics can show all the prefixes. These heuristics are a) assign a minimum width and height of 1 pixel to each rectangle, and b) give precedence to the borders of larger rectangles but omit the borders of smaller rectangles when there is insufficient space.
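The logarithmic normalization and the 1-pixel minimum-size heuristic can be combined in a few lines. The function name is ours, and for illustration the minimum is folded directly into the weight, whereas the text applies it to the rendered rectangle:

```python
import math

def scaled_size(num_ips, min_px=1.0):
    """Logarithmic normalization f_log(w) = log2(w + 1), clamped to a
    minimum size so that every prefix stays visible (sketch)."""
    return max(math.log2(num_ips + 1), min_px)

# A /8 network (2**24 addresses) no longer dwarfs a /32 host:
print(scaled_size(2**24), scaled_size(1))  # approx. 24.0 vs 1.0
```

This compresses the 2^0-to-2^24 variance in prefix sizes down to a factor of roughly 24, which is what makes it feasible to show all 197 427 prefixes on one screen.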
As labeling is not possible for many rectangles on the screen, detailed information, such as the values of statistical attributes and the path to the root of the hierarchy, is shown when the mouse hovers over a rectangle. Figure 5.10 demonstrates the outcome of the algorithms applied to the whole IPv4 address space with logarithmically scaled network sizes, using the HistoMap 1D layout for the 3rd and 4th hierarchy level. Note that large parts of the screen appear blue partly because there is no

Figure 5.10: Anonymized outgoing traffic connections from the university gateway on November 29, 2005 showing all 197 427 IP prefixes

traffic to these networks and partly because we abstained from drawing borders on the lowest level. Table 5.2 shows the outcome of our experiments.

Algorithm (3rd/4th level)   AS (23 054)   Prefixes (197 427)
HM 1D                       0.00 %        0.00 %
Strip                       0.03 %        0.46 %
Strip / HM 1D               –             0.02 %
HM 1D / Strip               –             0.16 %

Table 5.2: Percentage of invisible rectangles of the IP hierarchy

Finally, it is admittedly very hard to analyze many randomly placed one-pixel rectangles. By removing insignificant nodes with little or no traffic from the hierarchy, more space can be allocated to the remaining items.

5.4.2 Average rectangle aspect ratio

As a means to evaluate the squareness of the rectangles, we measure the unweighted arithmetic average of the aspect ratios:

aspect ratio = (1/N) ∑_i max(w_i, h_i) / min(w_i, h_i)        (5.5)

Table 5.3 shows that the AS-level hierarchy is more difficult to render with respect to squareness. This is explainable through the large variances in size: although the size of bottom-level rectangles was scaled logarithmically, the size of each upper-level node is calculated by summing up the sizes of all its child nodes. Better results are generally achieved by the HistoMap 1D layout, but the combination of Strip and HistoMap 1D is also promising due to its good aspect ratios. Here, the better results are determined by the better layout (HistoMap 1D) for the lowest level, since the latter contains by far more nodes than the AS level.

Algorithm (3rd/4th level)   AS (23 054)   Prefixes (197 427)
HM 1D                       3.077         1.749
Strip                       5.754         3.106
Strip / HM 1D               –             1.960
HM 1D / Strip               –             2.218

Table 5.3: Average aspect ratio of rectangles of the IP hierarchy
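Equation 5.5 is straightforward to compute; a minimal sketch (the function name is ours):

```python
def avg_aspect_ratio(rects):
    """Unweighted arithmetic mean of max(w, h) / min(w, h) over all
    rectangles, as in Equation 5.5; rects is a list of (w, h) pairs."""
    return sum(max(w, h) / min(w, h) for w, h in rects) / len(rects)

print(avg_aspect_ratio([(10, 10), (40, 10)]))  # (1.0 + 4.0) / 2 = 2.5
```

A value of 1.0 would mean all rectangles are perfect squares; the values in Table 5.3 show how far each layout deviates from that ideal.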

5.4.3 Layout preservation

Showing all details at once is not always advisable from a perceptual point of view. We thus load a one-day data set of network traffic from a particular gateway into the hierarchy and proceed by continuously removing nodes in the order of least traffic, the ratio between the remaining empty nodes and all nodes within the parent node, and the lowest index. This procedure helps to stay close to the real data distribution, to avoid a randomized sampling function, and to create a well-defined node-removal order that is tested with all layouts. Figure 5.17(c) demonstrates the effect of removing nodes whose traffic falls below a certain threshold, compared to the original state in Figure 5.17(b). We discovered that the layout distance change metric proposed by Bederson et al. [11] obscures the effects that can be seen in the absolute rectangle side changes due to larger positional changes. To properly assess layout preservation, we therefore split the evaluation into two parts: 1) evaluation of positional changes and 2) evaluation of absolute rectangle side changes.

Positional changes

In the first part, we evaluate the positional change as measured from the center of each original rectangle r_1 to the center of its corresponding rectangle r_2 in the projected layout of the reduced hierarchy.

Figure 5.11: Average position change (x-axis: fraction of original nodes; y-axis: average position change; curves for HistoMap 1D, Strip, and their combinations at the AS (23 054) and prefix (197 427) levels)

d_pos(r_1, r_2) = sqrt( ((x_1 + w_1/2) − (x_2 + w_2/2))² + ((y_1 + h_1/2) − (y_2 + h_2/2))² )        (5.6)

The strong correlation between the red and the blue lines as well as between the green and the black dashed lines in Figure 5.11 clearly shows the higher significance of the layout algorithm at the third hierarchy level compared to the fourth. The HistoMap 1D layout is thus preferred over the Strip layout for the third level due to its better preservation of the original rectangle positions. For the fourth hierarchy level, we could not find any significant difference between the two layouts with respect to positional changes. The unexpectedly good results for 50 % of the prefix nodes (dashed lines) under all layout algorithms can be explained through a major change towards 40 % in the geographic layout of the first and the second level due to the unavoidable recalculation of the weights of the upper-level nodes.

Absolute rectangle side changes

Since layout change is expressed not only through positional changes of the nodes, but also through the changing width and height of the rectangles, we decided to evaluate the absolute rectangle side change:

d_side(r_1, r_2) = sqrt( (w_1 − w_2)² + (h_1 − h_2)² )        (5.7)
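Both layout-preservation metrics (Equations 5.6 and 5.7) can be expressed compactly; rectangles are given as (x, y, w, h) tuples and the function names are ours:

```python
from math import hypot

def d_pos(r1, r2):
    """Distance between rectangle centers (Equation 5.6)."""
    x1, y1, w1, h1 = r1
    x2, y2, w2, h2 = r2
    return hypot((x1 + w1 / 2) - (x2 + w2 / 2),
                 (y1 + h1 / 2) - (y2 + h2 / 2))

def d_side(r1, r2):
    """Change of rectangle side lengths (Equation 5.7)."""
    return hypot(r1[2] - r2[2], r1[3] - r2[3])

# A rectangle shifted by (3, 4) but not resized:
r_before, r_after = (0, 0, 10, 10), (3, 4, 10, 10)
print(d_pos(r_before, r_after), d_side(r_before, r_after))  # 5.0 0.0
```

The example illustrates why the two metrics are evaluated separately: a pure translation yields a large d_pos but a d_side of zero, so one metric alone would obscure the other kind of change.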

The results of calculating the average side change for all layouts, shown in Figure 5.12, suggest that rectangle sides change drastically, especially in those cases when only a small fraction of the original nodes is rendered. On the one hand, this effect is explainable through

Figure 5.12: Average side change (x-axis: fraction of original nodes; y-axis: average side change; curves for HistoMap 1D, Strip, and their combinations at the AS (23 054) and prefix (197 427) levels)

the fact that more space is allocated to the remaining rectangles, whose sides therefore become considerably bigger. On the other hand, the evaluation shows that the HistoMap 1D layout (black) is superior to the Strip layout (red) both at the AS (solid lines) and at the prefix (dashed lines) level. The other two layouts, which are combinations of HistoMap 1D and Strip layout and vice versa, lie between HistoMap 1D and Strip layout at the prefix level. With more than 60 % of the original nodes, these two layouts are hardly distinguishable from HistoMap 1D with respect to the proposed quality measure of average side change.

5.4.4 Summary

In this evaluation section, we demonstrated the benefits of HistoMap 1D over the Strip Treemap layout in several ways. First, the latter is not capable of showing all nodes of our test data hierarchy at the target screen resolution due to its sequential display space partitioning. Second, HistoMap 1D yields a better average rectangle aspect ratio at both the AS and the prefix hierarchy level. Third, when filtering out parts of the hierarchy to focus the view on an area of interest, HistoMap 1D does a considerably better job of preserving the original layout. We also demonstrated that a combination of the two algorithms is promising, especially due to the line-wise reading order of the Strip layout. In case of a semantic ordering, like the prefix order, the layout might reveal sequential scanning patterns or failures (continuous or discontinuous values in subsequent prefixes).

5.5 User-driven data exploration

Interactivity is the key to many visual analytics applications, and our HNMap is by no means an exception. The analyst chooses which region of the network traffic data set should be investigated in more detail and then continues the explorative search for meaningful patterns depending on the results of the previous steps.

5.5.1 Filtering

Further characteristics of network flows are implemented as compulsory and optional filters:

1. Compulsory filters:

   a) Type of load: Since each IP address functions both as source (sending packets) and as target (receiving packets), the network load to display can refer to either the packets sent, the packets received, or their total.

   b) Time window: Each visualization instance displays the traffic within a specified time interval.

2. Optional filters:

   a) Port or port cluster: Single ports or groupings thereof can be used as a filter to display only the portion of traffic occurring at those ports (for instance, showing the outgoing email load by selecting the source port 25 (SMTP) or the cluster of all common email ports 25, 110, 143, 465, 993, 995).

   b) Protocol: It might be useful to filter the traffic of a particular protocol, for instance to separate UDP from TCP traffic.

   c) Node relevance: A relevance slider introduces a visibility threshold by removing low-traffic nodes up to the specified threshold. Empty nodes can be removed using a checkbox.

In the current implementation, most of these filters are entered as text and form part of an SQL query. In the future, these filters might be integrated in a more user-friendly way through appropriate GUI components. However, the current implementation has proven to be very flexible when dealing with different data sets that have the IP address as the only dimension common to them all.
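The way such filters could be assembled into an SQL query is sketched below. The table and column names (`flows`, `ts`, `src_port`, `protocol`) are hypothetical, since the actual database schema is not given in the text:

```python
# Common email port cluster mentioned in the optional filters above.
EMAIL_PORTS = (25, 110, 143, 465, 993, 995)

def flow_query(start, end, ports=None, protocol=None):
    """Combine the compulsory time-window filter with the optional
    port-cluster and protocol filters into one SQL query (sketch;
    a real implementation would use parameterized queries)."""
    clauses = [f"ts BETWEEN '{start}' AND '{end}'"]       # compulsory
    if ports:                                             # optional; needs len(ports) > 1
        clauses.append(f"src_port IN {ports}")
    if protocol:                                          # optional
        clauses.append(f"protocol = '{protocol}'")
    return "SELECT * FROM flows WHERE " + " AND ".join(clauses)

q = flow_query("2005-11-29 00:00", "2005-11-29 23:59",
               ports=EMAIL_PORTS, protocol="TCP")
print(q)
```

This text-based composition is what makes the filter mechanism flexible across data sets, at the cost of the less user-friendly input noted above.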

5.5.2 Navigation within the map

Explorative tasks are enabled through typical OLAP database operations such as drill-down (disaggregation), roll-up (aggregation), and slice & dice (filtering). All these basic interactions are mapped to mouse events: rolling the mouse wheel performs drill-down and roll-up, a double click triggers a slice & dice operation, and a single click opens a context-sensitive popup menu. This popup menu offers support for untrained users as well as access to some expert features, such as multiple selection or multiple drill-down (roll-up) operations on nodes of the same level.

(a) Before (left) and after (right) drill-down (b) Before (left) and after (right) roll-up

(c) Before (left) and after (right) drill-down siblings (d) Before (left) and after (right) roll-up siblings

(e) Before (left) and after (right) slice & dice (f) Before (left) and after (right) pruning empty nodes

Figure 5.13: HNMap interactions

A drill-down substitutes a node rectangle with its child rectangles, as shown for the red-colored rectangle in the top-left corner of Figure 5.13(a). This interaction is especially useful when exploring traffic distribution details and the need to compare nodes at different granularity levels comes up. The inverse operation of a drill-down is roll-up, which substitutes a group of nodes with their common parent node at an upper level (see Figure 5.13(b)). For faster exploration, these two operations are also available on all sibling nodes (i.e., nodes belonging to the same hierarchy level), as demonstrated in Figures 5.13(c) and 5.13(d). Similarly to a zoom function, a double-click on a rectangle triggers the slice & dice interaction, which drills down within the selected node while removing all other nodes from the display, as depicted in Figure 5.13(e). Another elegant way of allocating more screen space to relevant nodes is to simply exclude empty nodes from the presentation, as shown in Figure 5.13(f) (available via the popup menu and a keyboard shortcut). For fast navigation, HNMap offers predefined views of all continents, countries, AS's, and prefixes, which can also be applied to a previously defined subset of network entities (e.g., showing all prefixes within Germany). In addition, the analyst is free to export the current HNMap view as a PNG or PDF image for inclusion in security reports.

Descending to the pixel level

The technical limit of a visualization is reached when every pixel is used to display a distinct value, with the values mapped to the pixels' colors. Further attributes of the data points can be mapped to the pixels' coordinates. Pixel-based visualization builds the finest granularity view of the Hierarchical Network Map, namely, the behavior of single hosts. For instance, the pixel visualization can be employed to determine whether network traffic comes from just a limited number of IP addresses or is scattered over the considered network. Pixels are arranged according to the recursive pattern already introduced in Chapter 4, starting at the top-left corner and descending row by row with alternating forward-backward direction for better cluster preservation. Thereby, the displayed IP addresses appear sorted in descending order. A large display wall with 8.9 megapixels is potentially capable of showing up to 8.9 million IP addresses by mapping each IP address to one pixel. If the number of pixels is insufficient, distinct IP addresses have to be replaced by groups of neighboring IP addresses by aggregating their traffic.

Figure 5.14 shows the active hosts within a network in the recursive pattern pixel visualization, thereby revealing their distribution behavior. The image on the left shows the pattern of a simulated network scan affecting every 10th IP address, whereas the image on the right reveals a pattern with only very few target hosts. Moving the cursor over a pixel (or its surrounding region) triggers the appearance of the represented IP address' label.
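The row-by-row arrangement with alternating direction can be sketched as a simple coordinate mapping. This is a flat "snake" ordering rather than the full recursive pattern of Chapter 4, and the function name is ours:

```python
def pixel_position(rank, width):
    """Map the rank of a sorted IP address to (col, row) pixel
    coordinates, reversing direction on every other row so that
    neighboring addresses stay spatially adjacent (sketch)."""
    row, col = divmod(rank, width)
    if row % 2 == 1:              # odd rows run right-to-left
        col = width - 1 - col
    return col, row

# First four ranks on a display 3 pixels wide:
print([pixel_position(r, 3) for r in range(4)])
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

The direction reversal is what preserves clusters: address rank 2 at (2, 0) and rank 3 at (2, 1) remain neighbors on screen, whereas a plain row-major layout would place rank 3 back at the left edge.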

Figure 5.14: Recursive pattern pixel visualization showing individual hosts

5.5.3 Colormap interactions

In our technique, the visual variable color is used to convey the values of the specified measure for each visible node. Employing a bipolar color scale (blue to red with white as the transition point) supports the observation that not only the existence of high load values (dark red) can imply interesting findings, but also the non-existence of any traffic (dark blue). The use of square root and logarithmic color scales helps to make the visualization more resistant to outliers, while taking into account the difficulty of exactly comparing quantitative values under those color scales. For this purpose, a linear mapping is also available. Through mouse interaction, the white transition point of the bipolar colormap can be moved in order to facilitate comparative analysis [3]. It is furthermore possible to select a particular node on the map: white color is then assigned to this rectangle and the transition point of the color scale is adapted accordingly. Nodes with higher values appear in shades of white and red, while nodes with lower values appear in shades of blue and white.

Figure 5.15: Multiple map instances facilitate the comparison of traffic over several time spans
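A bipolar colormap with a movable white transition point can be sketched as follows. This is a linear variant only; the square root and logarithmic scales mentioned above would transform the value before mapping, and the function name is ours:

```python
def bipolar_color(value, vmin, vmax, white_point):
    """Blue -> white -> red mapping with a movable white transition
    point; returns an (r, g, b) tuple (linear sketch)."""
    if value <= white_point:
        t = (value - vmin) / (white_point - vmin)     # 0 -> blue, 1 -> white
        return (int(255 * t), int(255 * t), 255)
    t = (value - white_point) / (vmax - white_point)  # 0 -> white, 1 -> red
    return (255, int(255 * (1 - t)), int(255 * (1 - t)))

print(bipolar_color(0, 0, 100, 50))    # (0, 0, 255): no traffic, dark blue
print(bipolar_color(50, 0, 100, 50))   # (255, 255, 255): transition point
print(bipolar_color(100, 0, 100, 50))  # (255, 0, 0): heavy load, red
```

Setting `white_point` to a selected node's value reproduces the described interaction: that node turns white, and all other nodes fall into the red or blue half of the scale relative to it.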

5.5.4 Interactions with multiple map instances

In many analysis scenarios, it is not only the traffic volume targeting or leaving a particular network entity that matters, but also its gradual change. To visualize changes in traffic, multiple map instances can be used, as demonstrated in Figure 5.15. Moving the mouse over one of the instances causes a label with the actual traffic values to be displayed in all other instances, facilitating the interpretation of the absolute values of that particular entity in the IP hierarchy. Underneath each instance, there is a label indicating the covered time span. Note that the label position gives an additional hint about the chronology of the map instances.

An alternative approach is to animate the HNMap display using the animation interface shown in Figure 5.16. The analyst can simply click through the HNMap instances in forward and reverse chronological order or play them as an animation. Obviously, more screen space is allocated to each map instance in this second visualization approach. While instances of the same rectangle are always placed at the identical spot on the map, it is very difficult to simultaneously monitor several rectangles scattered all over the map.

Note that in order to make the maps comparable to each other, the same color mapping scheme needs to be applied to all of them. This is done by calculating the global maximum value over all nodes in all map instances prior to the invocation of the specified scaling function. Similarly, only those empty nodes that remain free of traffic throughout all map instances may be removed from the series of map instances.

Figure 5.16: Interface for configuring the animated display of a series of map instances

5.5.5 Triggering of other analysis tools

HNMap has proven to be a powerful tool for analyzing the IP dimension of network traffic with regard to a previously defined measure. However, network traffic consists of several dimensions, which can be enriched with additional information from intrusion detection systems or other sources. Therefore, a combination of various visualization tools is necessary for exploring network traffic particularities. One possibility to combine HNMap with other analysis tools is to further analyze the traffic of a particular node on the HNMap using bar charts of time, host, and port activity, the Radial Traffic Analyzer (Section 7.2), and the Behavior Graph (Section 8.2), accessing the respective tool via a pop-up menu opened in the context of the node underneath the mouse cursor.

5.6 Case studies: analysis of traffic distributions in the IPv4 address space

To demonstrate the adequacy of our proposed technique, we conducted a series of case studies with the input data obtained from a production web server, a university gateway router, and a large service provider. In all scenarios, the HNMap was the central exploratory tool. Other graphical representations were generated from the data specified on the map.

5.6.1 Case study I: Resource location planning

A positive customer experience is crucial to the success of most businesses. Today, off-the-shelf web analytics software can show the geo-locations of customers who visit a web site. This information can be helpful for inferring customer demographics to optimize logistics or marketing strategies. Logically, we would also be interested in the AS's from which the customers access a web server, in order to study the consequences of the placement of customer-facing web servers.

(a) Specifying visualization parameters in the HNMap loading interface

(b) HistoMap 1D showing all AS's

(c) HistoMap 1D showing the AS's above threshold

(d) Interactive bar chart with a detail slider for hiding low value entries

Figure 5.17: Visual exploration process for resource location planning

To conduct this analysis, we processed the Apache log files of our web server and loaded the IP address, a timestamp, and the transferred byte count of each web request into a data warehouse. The system connects to a database within a server and probes the set of available tables, which are prejoined with the IP network tables to save time during interactive exploration. Choosing the most appropriate measure is the key to success for any analysis, and in this case we used the sum of transferred bytes to weight IP addresses. An optional filter can be invoked to ignore large transactions such as huge multimedia downloads that might otherwise skew the results. A checkbox enables removal of AS's without significant traffic. Other potentially distracting nodes with little traffic can be reduced with the detail slider (see Figure 5.17(a)). Figures 5.17(b) and 5.17(c) show IP addresses of clients who visited the web server between November 28, 2006 and March 16, 2007, aggregated by AS. The visualization clearly highlights the significance of the BELWUE and the DTAG systems, both within the green country rectangle representing Germany. At the same time, the high volume requested by webcrawlers such as Google and Yahoo (Inktomi), as identified by their AS's, is obvious. A right-click on any node brings up a context menu to either navigate within the hierarchy or trigger other graphical representations of detailed data. Figure 5.17(d) shows a logarithmically scaled bar chart visualizing the previously defined measure built upon the data set selected on the map. This kind of analysis may be conducted in many other resource planning scenarios, such as choosing an optimal Internet service provider (ISP) for the intranet of a widely-spread company or placing a shared database server at a favorable location.
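The aggregation step of this case study can be sketched in a few lines. This is an illustrative sketch only; the function name `bytes_per_as`, the `(ip, byte_count)` input shape, and the `ip_to_as` lookup table are assumptions, not the thesis's data warehouse implementation.

```python
from collections import defaultdict

def bytes_per_as(requests, ip_to_as, max_transfer=None, min_share=0.0):
    """Aggregate transferred bytes by autonomous system.

    `requests` yields (ip, byte_count) pairs from the web server log;
    `ip_to_as` maps an IP address to its AS number.  `max_transfer`
    drops huge single transactions (e.g. multimedia downloads) and
    `min_share` mimics the detail slider by hiding low-value entries.
    """
    totals = defaultdict(int)
    for ip, nbytes in requests:
        if max_transfer is not None and nbytes > max_transfer:
            continue  # optional filter against skew by large downloads
        totals[ip_to_as[ip]] += nbytes
    grand_total = sum(totals.values()) or 1
    return {asn: b for asn, b in totals.items()
            if b / grand_total >= min_share}
```

The `min_share` threshold plays the role of the interactive detail slider: raising it progressively hides AS's whose share of the total traffic is insignificant.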

5.6.2 Case study II: Monitoring large-scale traffic changes

Another HNMap application is concerned with network traffic monitoring. The two main matters of interest here are 1) to track and predict traffic volumes within one or several interoperating networks to keep the network infrastructure running well and 2) to react quickly in order to recover from failures. For this case study, we used one day of netflows captured at our university gateway and anonymized by removing the internal IP addresses to ensure privacy. We started by specifying the relevant data, such as the number of failed connections (including those that did not contain any data transfer). This gives an indication of malfunctions or scanning activities – many IP addresses within the probed network are unassigned or protected by a firewall and thus do not reply to the packets sent by scanning computers. Figure 5.18 shows the first derivative of failed connections over time (four successive map instances) aggregated by country. The intense background color of the rectangles in Asia (upper right) attracts attention and raises the question of what might have caused the high number of failed connections in the early morning (first strip). In this scenario, the color was chosen to show change: white for little or no change, blue for a decrease, and red for an increase of traffic. Opening a bar chart for each of the suspicious countries reveals a conspicuously large number of different destination ports characterizing the traffic from China, which would under normal circumstances imply the invocation of a large variety of application programs. For a more thorough analysis, we opened the Radial Traffic Analyzer (RTA) to correlate the source IP addresses as the origin of this traffic with the used ports.
With the IP addresses contributing relatively small shares of traffic removed through a slider, the resulting view shown in Figure 5.19 stresses that two IP addresses (218.56.57.58 and 202.102.134.68) were involved in port scanning activities, as seen in the colorful patterns on the outer ring. The RTA interface is capable of drawing one ring for each analyzed dimension (e.g., source & destination IP, application port, event type, etc.) of the data set. All rings are grouped by the dimensions of the inner rings and sorted according to their own dimensional values. Interactive rearrangements of these rings help the analyst to explore the data and gain deeper insight into it. For more details on RTA, refer to Section 7.2.

Figure 5.18: Monitoring traffic changes: failed incoming connections at the university gateway on November 29, 2005 (square root normalization of the colormap alleviates the visual impact of outliers)

Figure 5.19: Employing Radial Traffic Analyzer to find dependencies between dimensions

This scenario indicates that HNMap is capable of showing the aggregates of the selected measure or, alternatively, its first or second derivative over time. Navigational operations within the map, such as drilling down to deeper hierarchy nodes or rolling up to higher-level aggregates, are limited to the IP address dimension. Navigating along this dimension is often not adequate or not sufficient for finding a "needle in the haystack". We therefore supplement HNMap with graphical data representations such as bar charts and the RTA.
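The first and second derivatives mentioned above are simply discrete differences over the successive map instances. A minimal sketch (the function name `traffic_derivatives` is an assumption for illustration):

```python
def traffic_derivatives(series):
    """First and second discrete derivatives of a traffic time series.

    `series` holds one aggregate value per map instance (e.g. failed
    connections per country and time strip).  In the change display,
    positive first-derivative values are shown in red, negative ones
    in blue, and values near zero in white.
    """
    first = [b - a for a, b in zip(series, series[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return first, second
```

Note that a series of n map instances yields n-1 first-derivative values, which is why a change display needs one more capture interval than the number of strips it shows.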

5.6.3 Case study III: Botnet spread propagation

Between July and December 2006, Symantec observed an average of 63,912 active bot-infected computers per day [151]. Compared with the total number of computers on the Internet, this number is low. However, one should not forget the potential damage these hijacked computers might inflict when remotely controlled to collectively attack a commercial or governmental web server. For this case study, we used a signature-based detection algorithm, which exploits the knowledge about known botnet servers to collect bot-infected IP addresses. Without the visualization, we could only see that the list of IP addresses was continually changing, but we could neither map this change to the network infrastructure nor build a mental representation of what was going on. Therefore, the goal of our analysis was to identify the IP prefixes, AS's, or countries with a high infection rate and to investigate whether they build a cluster at a higher level of the hierarchy. The captured data was stored in a data warehouse and loaded into HNMap. The HNMap view disclosed the severity of the spread, drawing attention to the red rectangle representing China in the country view. Animating the map over time (one image per day) helps to confirm the dynamics of the developments in China. Figure 5.20 shows the results of zooming in to detailed IP address ranges grouped by AS (yellow borders) to identify which prefixes and AS's played a major role in the infection. We can observe that the infection was widely spread in the large AS on the upper left and in the smaller and thinner AS on the lower left. Furthermore, a few red colored prefixes outside these AS's probably contributed considerably to the spread of the worm. With this knowledge, network operators can adapt firewall configurations to block or filter traffic from the affected IP prefixes.

5.6.4 Expert feedback

The HNMap visualization was presented to security experts at our university and at AT&T. From the university expert's perspective, it was very important to show detailed host activities within a network, an autonomous system, or a country, for the task of comparing the number of connections with the transferred bytes of active hosts to find suspicious patterns. After receiving this valuable feedback, we integrated bar charts as well as the Radial Traffic Analyzer showing host and port activity into our application. An informal evaluation by AT&T's visualization and domain experts who tested HNMap yielded numerous improvements, such as an interactive feature to move the transition point in a bi-color scale, time-series animation, linked parallel instances of the map, and intelligent label placement, to name a few. The experts showed particular interest in the botnet spread and worm propagation as demonstrated in Case Study III. They see the potential to explore their vast amount of IDS events to gain deeper insight into malicious network traffic. Further use of the tool in AT&T's global network operations center (GNOC) is currently under discussion.

5.7 Summary

In this chapter, we presented the Hierarchical Network Map as a hierarchical space-filling mapping method for visualizing traffic of IP hosts. Our work can be seen as a novel application in the field of computer security visualization. To the best of our knowledge, such large amounts of IP-related network traffic data have never been shown in a geographically aware hierarchical and space-filling context. The proposed algorithm is an application of the space-filling variant of the RecMap [67] algorithm, which was modified to fit the analysis needs with respect to one-dimensional (IP addresses) and hierarchical (network hierarchies) data. The approach aims at finding a compromise between four optimization criteria: using display area to represent network size, full utilization of the screen space, position awareness, and aspect ratio. The performance challenges are mastered by employing OLAP cube operations commonly used in data warehouses to efficiently aggregate over large volumes of detailed data. To adapt the visualization to the large hierarchy at hand, we used one-pixel-thin colored borders between the nodes of each level or, alternatively, a zero to three pixel thin border, incrementing the border's thickness at each upper hierarchy level. In our experiments, we compared two layout algorithms and their combinations against each other with respect to visibility, average rectangle aspect ratio, and layout preservation. While our proposed algorithm HistoMap 1D was superior to the Strip layout, we still see valid use cases that justify usage of the latter due to its intuitive sorting order. Furthermore, combinations of the two layouts – a different one for each level – have shown that some desirable features of HistoMap 1D can be retained to a certain degree.
Apart from generating insightful graphics, the HNMap tool offers a variety of user interactions to explore large netflow and IDS data sets. Mouse interactions are implemented to facilitate explorative tasks on the map, to adapt the color mapping to better highlight relevant patterns, and to trigger other analysis tools on the data specified on the map. In order to show the tool's applicability, three case studies were presented with traffic data from the university's gateway router, from a web server, and IDS data of botnet IPs. These case studies demonstrated that visual exploration of large IP-related data sets can reveal valuable insights into network security issues.

5.7.1 Future work

In the future, we hope to improve the proposed layouts at the AS level by exploiting information about border gateway router connectivity. Methods for avoiding deformation of small rectangles, which are placed next to large ones in the Geographic HistoMap and HistoMap 1D layouts, may also be beneficial. Because this study emphasized the importance of the upper hierarchy levels to layout stability, we might also take a closer look at the extent to which sacrificing squareness in the geographic hierarchy levels improves layout stability. Furthermore, a search feature is foreseen in order to find countries, AS's, prefixes, and hosts on the map.

(a) Day 1 (b) Day 3

(c) Day 5 (d) Day 7

(e) Day 9

Figure 5.20: Rapid spread of botnet computers in China in August 2006 as seen from the perspective of a large service provider with yellow boxes grouping prefixes by AS (prefix labels were anonymized for privacy reasons)

6 An end-to-end view of IP network traffic

"By three methods we may learn wisdom: First, by reflection, which is noblest; second, by imitation, which is easiest; and third, by experience, which is the bitterest."

Confucius

Contents
6.1 Related work on network visualizations based on node-link diagrams . 102
6.1.1 Abstract graph visualizations ...... 102
6.1.2 Geographic graph visualizations ...... 103
6.2 Linking network traffic through hierarchical edge bundles ...... 104
6.2.1 Hierarchical edge bundles ...... 104
6.2.2 Edge Coloring ...... 105
6.2.3 Interaction design ...... 106
6.2.4 Data Simulation ...... 106
6.3 Case study: Visual analysis of traffic connections ...... 108
6.4 Discussion ...... 109
6.5 Summary ...... 110
6.5.1 Future Work ...... 110

As discussed in the previous chapter, Hierarchical Network Maps are an approach to the presentation of IP-related measurements on the global Internet. This pixel-conservative approach is appropriate considering the sparsity of the display space when showing about 200 000 IP prefixes at once. However, in the previous chapter, the network statistics were observed at a single vantage point (arriving at or leaving from a particular gateway) and displayed according either to source or to destination in the IP address space. In this study, we consider traffic being transferred through multiple routers, such as in service provider networks. The use of edge bundles enables visual display and detection of patterns through accumulative effects of the bundles while avoiding visual clutter. The novelty of this approach lies in its capability to support the analyst in the formation of a mental representation of the network traffic situation through visualization. Each autonomous system or IP prefix is placed on a map while nodes are linked according to their network traffic relationships. The remainder of this chapter is structured as follows. We briefly discuss related work dealing with node-link diagrams and proceed by extending our previously introduced HNMap approach with so-called edge bundles for linking traffic sources and destinations on the map. Thereupon, we present a case study to demonstrate the technique and conclude by assessing the overall contribution in the summary.

6.1 Related work on network visualizations based on node-link diagrams

Node-link diagrams are the most prominent representations of graph data structures. These diagrams typically consist of nodes, which represent the vertices, and line segments representing edges, whereby each edge connects two vertices. Note that in large layouts nodes are usually not labeled due to display space limitations. Instead, labels are shown interactively when the user moves the mouse pointer over a node. While node-link diagrams have been extensively used in information visualization and other research fields, their usage in network monitoring and security has so far been limited to expressing connectivity and traffic intensity between hosts or higher-level elements of the network infrastructure. To gain an overview of the contributions related to node-link diagrams, we recommend Manuel Lima's weblog "visual complexity – a visual exploration on mapping complex networks" [105]. It is an extensive effort to collect network visualization studies from various fields and currently features about 500 projects including both historic and state-of-the-art research and design.

6.1.1 Abstract graph visualizations

While geographic nodes have clearly defined node positions, one of the key issues of abstract graph representations is the positioning of the nodes. A multitude of graph layout methods have been proposed with different optimization goals, ranging from minimizing edge crossings to emphasizing structural properties of graphs [8]. Abstract graph representations normally seek a way to effectively use the available screen space. This implies that linked nodes are rendered close to each other in order to avoid visual clutter caused by crossing edges. Cheswick et al., for example, mapped a graph of about 88 000 networks as nodes having more than 100 000 connecting edges [24], obtained by measuring the quality of network connections in the Internet from different vantage points. Visualizing this information in graphs is challenging in terms of the layout calculation as well as in terms of visibility of nodes and links of such a graph. For better visualization, the authors reduced the graph by calculating a minimal distance spanning tree and then mapped the nodes and edges onto 2D using a spring-embedder layout, which places adjacent nodes in the vicinity of each other. Because large-scale graphs can neither be perceived in all details nor have their layouts calculated on-the-fly, Gansner et al. have studied a method for reducing the number of displayed objects while preserving structural information essential for understanding the graph [57]. Their method is based on the precomputation of a hierarchy of coarsened graphs, which are combined into concrete renderings. Another related topic are attack graphs, which aim at showing how an attacker might combine individual vulnerabilities to seriously compromise a network. Early attack graphs represented transitions of a state machine, with network security attributes as states and the exploits of attackers as their transitions. Following a path through the graph generates a possible attack path. In fact, these graphs had serious scalability issues due to the large state space, resulting in the development of simplified graphs showing dependencies among exploits and security conditions, which increase only quadratically with the number of exploits. Noel and Jajodia presented a method for visual hierarchical aggregation of attack graphs, in which non-overlapping attack subgraphs are recursively collapsed to single vertices to make attack graph analysis scalable both computationally and cognitively [127]. Williams et al. focused on a different visual approach to attack graphs [182]. Having figured out that mapping the attacks back on the network topology was a cognitively difficult task, they introduced a combined attack graph and reachability display using a treemap visualization in combination with a user-defined semantic abstraction. Since human creativity is not exclusively limited to abstract or geographic graph layouts, there exist hybrid approaches that partly take geographic information into account while calculating the graph layout on the screen. One such approach is the visualization interface of the Skitter application that uses polar coordinates to visualize the Internet infrastructure [28]. Each AS node's polar coordinate is determined by the geographical longitude of its headquarters and by the hierarchical connectivity information.

6.1.2 Geographic graph visualizations

Early network mapping projects, such as the visualization study of the NSFNET [36], put their focus on geographic visualization where each network node had a clearly defined position on a map. This principle was also applied in a study to map the multicast backbone of the Internet in 1996 [125], in which the authors used a 3-dimensional representation of the world and drew curved edges on top of it to show the global network topology. Other research efforts focused on visual scalability issues in 2-dimensional representations, ranging from matrices to embeddings of the network topology in a plane, as reviewed in [44]. In [9], for example, the concept of line shortening was introduced to prevent visual clutter from obfuscating the whole image. This technique only draws a short fragment of each connection line at the start and the end point of the line. The Atlas of Cyberspace is another attempt to collect all kinds of maps that were made from cyberspace, including geographic, 3D, and abstract graph layouts to represent the network infrastructure [40]. At this point, it is worth mentioning that the study presented in [35] (already discussed in Section 3.3.3) also uses a geographic metaphor, but visualizes abstract AS connectivity information without real spatial reference. Flow maps are another approach to showing the movement of objects from one location to another. Traditionally, flow maps were hand drawn to reduce the visual clutter introduced by overlapping flows. Phan et al. presented a method for generating well-drawn flow maps, which allows users to see the differences in magnitude among the flows while minimizing the amount of clutter [133]. However, interpretation of flow maps with multiple vantage points remains a challenge.
A recent work about hierarchical edge bundles handles the problem of visual clutter elegantly: a hierarchical classification of nodes is exploited to draw the bundling lines that connect the leaf nodes, thereby visually emphasizing correlations [69]. Inspired by this work and the decision to stick to the node positions fixed by the HNMap layout, we describe an approach to drawing edge bundles on top of HNMaps in this chapter. Such edge bundles visualize end-to-end relationships in network data sets, rather than limiting the analysis to the outgoing or the incoming traffic of a single vantage point.

(a) Straight connection lines (b) Exploiting hierarchical structure (c) Compromise through edge bundles

Figure 6.1: Comparison of different strategies for drawing adjacency relationships among the IP/AS hierarchy nodes on top of the HNMap

6.2 Linking network traffic through hierarchical edge bundles

With the HNMap technique described in the previous chapter, we faced the problem that drawing straight connecting lines between the most prominent sources and destinations on the map introduces heavy visual clutter, which makes it difficult to even follow a single straight line. The question arose how the hierarchical structure of the map could be exploited to organize the connecting lines without losing too much detailed information. Figure 6.1 shows three approaches to drawing adjacency relationships among the nodes of the IP/AS hierarchy. The problem of visual clutter becomes obvious when straight lines are drawn as shown in (a). In (b), one observes that connecting with the center of the upper-level hierarchy nodes (rectangles) removes visual clutter, but adds a lot of ambiguity, e.g., the lines connecting the left star (U.S.) and the stars in Europe (middle) – one for each country – are overdrawn more than a hundred times. An interesting compromise is to use so-called hierarchical edge bundles and transparency effects to combine the advantages of the latter two approaches, as shown in (c).

6.2.1 Hierarchical edge bundles

A recent contribution proposed a way of using spline curves to draw adjacency relationships among nodes organized in a hierarchy [69]. Figure 6.2 illustrates how we use this technique to exploit the hierarchical structure of the AS/IP hierarchy in order to draw a spline curve between two leaf nodes. Pstart (green) is the center of the source rectangle representing a node in the IP/AS hierarchy and Pend (red) is the center of the destination rectangle. All points Pi (green, blue, and red) from Pstart over LCA(Pstart, Pend) (least common ancestor) to Pend form the cubic B-spline's control polygon.

Figure 6.2: The IP/AS hierarchy determines the control polygon for the B-spline – (a) IP/AS hierarchy, (b) spline (red) with control polygon (blue)

Because many splines share the same LCA, which is in this case the center point of the world rectangle, we do not use it for the control path. Note that the hierarchy in Figure 6.2(a) does not exhibit the prefix level, which would be placed beneath the AS level and could possibly add two more points to the control polygon. However, the user is free to drill down or roll up arbitrary nodes on the HNMap and thereby to modify the number of available control points for the splines. Drilling down to prefixes in one area of the map while keeping another area summarized as a continent complicates spline rendering: the result is a set of splines with varying numbers of control points. Essentially, HNMap represents an IP world map, in which traffic from the United States to Asia or Oceania would probably not make a detour over Europe. We thus draw these splines going out on the left side and coming in again on the right side, or vice versa. Since all control points have an influence on the course of the line, we need to duplicate all control points and then shift these duplicates by 360 degrees either to the left or to the right depending on their original position. Since all of these duplicated points lie outside of the drawing area, they are not visible on the rendered image but guarantee the same curvature at the borders of the image. In our implementation, the degree of the B-spline is used to control the bundling strength. A degree of six experimentally proved to be a good choice for cases with a sufficient number of control points; otherwise, the degree is reduced to the number of available control points. When drawing many of these splines, the bundling effects, which are caused by a number of common control points, reveal high-level information about the traffic intensity on top of the HNMap.
Having this information right in the visualization is more useful than presenting a separate visualization, since mentally combining the information on the map with these abstract traffic relationships would otherwise be a difficult task for the analyst.
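Deriving the control polygon from the hierarchy path through the least common ancestor can be sketched as follows. The sketch assumes a parent-pointer representation of the IP/AS hierarchy; the function names (`path_to_root`, `control_polygon`) and the node labels used in the example are illustrative, not taken from the thesis implementation.

```python
def path_to_root(parent, node):
    """Return the node list from `node` up to the root (inclusive)."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def control_polygon(parent, src, dst, root="World"):
    """Nodes whose rectangle centers form the spline's control polygon:
    src up to the least common ancestor (LCA), then down to dst.
    The root is skipped when it is the LCA, as in the HNMap design.
    """
    up = path_to_root(parent, src)      # src .. root
    down = path_to_root(parent, dst)    # dst .. root
    ancestors = set(up)
    lca = next(n for n in down if n in ancestors)
    head = up[:up.index(lca) + 1]                   # src .. lca
    tail = list(reversed(down[:down.index(lca)]))   # child-of-lca .. dst
    poly = head + tail
    return [n for n in poly if n != root] if lca == root else poly
```

For the hierarchy of Figure 6.2(a), a spline from AS553 (under the United States) to AS123 (under Germany) would obtain the control points AS553, United States, North America, Europe, Germany, AS123, with the world center excluded because it is the shared LCA.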

6.2.2 Edge Coloring

In general, we considered two options for coloring and sizing the edges, namely (a) the use of color to distinguish between the edges and of width to judge the edge's relevance, and (b) the use of both color and edge width to convey the relevance of the edge.

For the first case (see Figure 6.3(a)), we chose an HSI color map with constant saturation and intensity, which is silhouetted against the background. While largely varying colors make tracing of single splines easier, users tend to feel strongly distracted by the colorful display. For the second case (see Figure 6.3(b)), we employed a heat map color scale from yellow to red using 50 to 0 percent alpha blending to visually weight the high-traffic links. Naturally, less important splines were drawn first. Because only the most important traffic connections are displayed, we used square-root normalization for both width and color of the splines in order to assign a larger spectrum of colors to these high values. An alternative option would be to adjust the minimum of the colormap to the smallest of the top traffic connections and then to apply log-scaled mapping.
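The second coloring option can be sketched as a small mapping function. This is a hedged illustration of the idea, not the thesis code: the function name `heat_color`, the RGBA tuple output, and the linear yellow-to-red ramp are assumptions made for the example.

```python
import math

def heat_color(value, vmax):
    """Map a traffic value to an RGBA tuple on a yellow-to-red ramp.

    Square-root normalization stretches the color spectrum over the
    (uniformly high) values of the top traffic connections; alpha runs
    from 50% for the weakest to 100% for the strongest spline.
    """
    t = math.sqrt(value) / math.sqrt(vmax) if vmax > 0 else 0.0
    green = int(255 * (1.0 - t))     # yellow (255,255,0) -> red (255,0,0)
    alpha = int(255 * (0.5 + 0.5 * t))
    return (255, green, 0, alpha)
```

Drawing splines in ascending order of `value` then naturally places the opaque, red, high-traffic splines on top of the translucent yellow ones.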

6.2.3 Interaction design

In our opinion, the task of tracing splines becomes unsolvable by means of coloring once their number exceeds about one hundred. We therefore attempted to tackle the problem through interaction. Basically, there are two possible interaction scenarios. The first scenario is to select, refine, and visually highlight a particular spline or region. Selecting a spline or a region by the start and the end points of the splines is relatively intuitive to implement using the spatial data structure of the HNMap application. However, the second scenario of marking a whole bundle of splines was discarded due to its tedious implementation, which would essentially boil down to a point-in-polygon test for each spline against the selected pixels. After specifying the start or the end points of the splines via a mouse interaction, we redraw all splines using their previous RGB color values – non-selected splines with higher alpha blending and the selected ones without transparency effects on top. This allows the user to easily trace the few highlighted splines to their ends. Alternatively, non-selected splines can be completely removed.

6.2.4 Data Simulation

Since our university network is not a very central node in the Internet backbone, we only route traffic from or to our own network and a few other nodes in the BelWü autonomous system. As a consequence, most incoming traffic at our gateway is headed towards the internal prefix 134.34.0.0/16 and most outgoing traffic comes from there. If we visualized this traffic on the HNMap, all splines would run into this AS or another prefix node in Germany. Because traffic information is generally considered confidential, it was impossible to obtain real traffic data from service providers for publication. We therefore settled on using real netflow data from our university gateway and substituted the internal source IP prefix with a randomized one according to Algorithm 6.1. The goal of this randomization schema is to distribute the measured traffic of the top n hosts over the connections among them in such a way that the sum of the weights of the connections for each host is equal to its total traffic. Implicitly, nodes with more traffic are more likely to communicate with several other nodes due to their higher probability of being repeatedly selected in the loop, whereas the opposite applies to low-traffic nodes.

(a) Randomly colored edge bundles make splines more distinguishable (the amount of traffic is expressed only through spline width)

(b) Color combined with transparency and line width to show the traffic amount of a spline

Figure 6.3: HNMap with edge bundles showing the 500 most important connections of one- day incoming and outgoing traffic at the university gateway with semi-randomly selected anonymized destination/source 108 Chapter 6. An end-to-end view of IP network traffic

Algorithm 6.1: Randomization schema to simulate backbone provider traffic

1  begin
2      SortedList ← all pairs of unprocessed nodes and weights
3      while |SortedList| > 0 do
4          (nmin, wmin) ← SortedList.removeMin()
5          (nrand, wrand) ← SortedList.randomSample()
6          drawSpline(nmin, nrand, wmin)
7          SortedList.update(nrand, wrand − wmin)
8  end
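A minimal Java sketch of this schema could look as follows. The pseudocode operations removeMin, randomSample, drawSpline, and update are realized inline; as one assumption, the loop stops once a single node remains, so that a random partner always exists (the pseudocode's |SortedList| > 0 condition would leave the last node without a sampling partner).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Sketch of Algorithm 6.1: redistributes each node's measured total traffic
 *  over randomly chosen partners so that, for every processed node, the sum of
 *  its spline weights equals its original total traffic. */
public class TrafficRandomizer {

    /** One drawn spline: a (source, destination, weight) triple. */
    public record Spline(String from, String to, long weight) {}

    private static final class Entry {
        final String node;
        long weight;
        Entry(String node, long weight) { this.node = node; this.weight = weight; }
    }

    public static List<Spline> randomize(Map<String, Long> totals, Random rnd) {
        List<Entry> sortedList = new ArrayList<>();
        totals.forEach((node, w) -> sortedList.add(new Entry(node, w)));
        List<Spline> splines = new ArrayList<>();
        while (sortedList.size() > 1) {
            // removeMin(): extract the node with the smallest remaining weight.
            sortedList.sort(Comparator.comparingLong(e -> e.weight));
            Entry min = sortedList.remove(0);
            // randomSample(): uniformly pick a partner among the remaining nodes;
            // high-traffic nodes stay in the list longer and are picked more often.
            Entry partner = sortedList.get(rnd.nextInt(sortedList.size()));
            // drawSpline(nmin, nrand, wmin): record the connection.
            splines.add(new Spline(min.node, partner.node, min.weight));
            // update(nrand, wrand - wmin): the partner absorbs the drawn weight.
            partner.weight -= min.weight;
        }
        return splines;
    }
}
```

Since the minimum weight is drawn in each step, the partner's remaining weight never becomes negative, and every node removed from the list ends up with spline weights summing exactly to its original total.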

6.3 Case study: Visual analysis of traffic connections

To determine the changes of the most important traffic connections over a period of one day, we loaded traffic data from our university gateway into the HNMap visualization and ran the randomization algorithm to simulate a destination or a source AS outside of our network. Figures 6.3(a) and 6.3(b) demonstrate the outcome of our randomization schema based upon a one-day traffic load of our university gateway by displaying the top 500 traffic connections. The images display the high-level connectivity information while still conveying the low-level relations to a large extent. Three major bundles of traffic from the United States to Europe are identifiable, namely to Southern Europe, to Germany, and to Northern Europe. Furthermore, major traffic bundles from Europe to Asia and from the United States to China as well as to South Korea characterize this traffic. Because of overdrawing effects, shorter splines, and the decreased structure of the bundles due to an insufficient number of control points, traffic bundles within a single continent are harder to analyze on the map. The strong bundling effects in North America (center left) are explained by the proximity of the two control points of North America and the United States to each other. Figure 6.4 shows the result of analyzing the same data with four HNMap instances, each representing an equally sized time frame. To enlarge the important rectangles, we removed nodes with no traffic in all map instances. For better visibility of the splines, the colormap of the underlying HNMaps was flipped, resulting in better contrast with the dark background. Because each map instance displays 100 splines, it is hard to estimate whether there is more or less traffic when comparing two instances. However, the color of the red splines gives an indication of the time frame in which the traffic connections with the highest load occurred.
The largest connection in this image is in the third frame between the Deutsche Telekom in Germany and China-Backbone No. 3 (detailed labels could only be obtained through mouse interaction due to space limitations). Notably, the fourth map instance shows an absence of thick red splines. In general, there are more connections with Europe, in particular with Germany (top right of the center), than in the other images. In the first frame, it is noticeable that more connections with Canada exist than in all the other frames. Furthermore, interaction makes it possible to better trace individual connections or, alternatively, to monitor the changing connections of a particular continent, country, AS, or prefix in all four map instances.

Figure 6.4: Assessing major traffic connections through edge-bundles (100 splines / 4 HNMap instances)

6.4 Discussion

Edge bundles offer an opportunity to communicate source-destination relationships on top of the HNMap, thereby elevating the application from a purely measurement-based analytical approach to a more complete view of large-scale network traffic. Since tracing single splines becomes very challenging once a certain number of traffic links is placed on the map, we experimented with the RGB and alpha values of the spline colors. Mouse interaction is used to select splines at their start or end points in order to set them apart from the remaining splines. The most obvious drawback of our approach is that edge bundles occlude important information in the background of the map, thus making it inaccessible to the analyst. Furthermore, conveying the significance of splines or distinguishing them through color disqualifies the visual variable color from being used to indicate the direction of traffic flows. So far, we have not found an acceptable solution for communicating traffic directions through the splines. When monitoring larger networks, focusing the analysis on a particular type of traffic, for example, the communication of hijacked computers that belong to a botnet, helps to significantly reduce the number of nodes and connections to be displayed. Limiting the number of visible splines as well as the HNMap option to remove empty nodes make the visualization more usable by focusing on the important nodes and traffic links.

6.5 Summary

HNMaps visually represent network traffic aggregated on IP prefixes, ASes, countries, or continents and can be used for the exploration of large IP-related data sets. In this chapter, we extended HNMaps through edge bundles to emphasize the source-destination relationships of network traffic. Rather than presenting new visualization or analysis techniques, we combined two existing methods and applied them to large-scale network traffic to gain insight into communication patterns. In order to facilitate the tracing of splines, which represent source-destination relationships between networks or abstract nodes of the IP hierarchy, we compared two alternative coloring schemes. Mouse interaction can be used to highlight network traffic links of interest and offers the analyst a tool to explore details of large data sets while still being able to see other potentially relevant links and high-level patterns.

6.5.1 Future Work We plan to enhance the HNMap edge bundles concept by displaying detailed traffic information for a prefix of interest. By considerably enlarging this prefix and by mapping single IP addresses within it, we could show details about traffic connections to and from internal hosts. However, since almost all traffic links will share the nodes from our prefix up to the European continent in the IP/AS hierarchy, these points need to be removed from the control polygon to create more meaningful bundling effects. To avoid overdrawing issues in dense spline regions, we considered the possibility of introducing empty nodes into the HNMap layout and thereby repositioning the otherwise overdrawn rectangles. However, due to time constraints, we leave the implementation and further elaboration of this idea open at this point.

7 Multivariate analysis of network traffic

"Above all else show the data."

Tufte

Contents
7.1 Related work on multivariate and radial information representations ...... 112
7.1.1 Multivariate information representations ...... 112
7.1.2 Radial visualization techniques ...... 114
7.2 Radial Traffic Analyzer ...... 115
7.3 Temporal analysis with RTA ...... 118
7.4 Integrating RTA into HNMap ...... 120
7.5 Case study: Intrusion detection with RTA ...... 120
7.6 Discussion ...... 121
7.7 Summary ...... 122

IN many network security analysis scenarios, it is not only the relations at various levels of the IP/AS hierarchy or between the sender and the recipient that play an important role, but also other dimensions, such as port numbers, alerts, time, protocols, etc. Since a number of techniques and tools already exist for analyzing local and global correlations within the attributes of a multivariate data set, we focused our effort on implementing and integrating the most promising analysis options into our HNMap tool. After careful consideration, we chose a radial hierarchical visualization technique because we found it appropriate for interpreting detailed information along with aggregated information in one display. The chosen radial layout has the property of allocating progressively more screen space to the outer rings, which represent more granular data than the inner rings. Therefore, these details can be explored while an overview of the data is maintained through the neighborhood relations with the inner ring segments, which represent high-level information. Interaction capabilities make the Radial Traffic Analyzer (RTA) a powerful tool for dynamic filtering and aggregation in network data sets. The chapter is structured as follows. We start by reviewing related work on multivariate and radial information representations and proceed by introducing the Radial Traffic Analyzer. After demonstrating how RTA can be used for temporal analysis, its integration into HNMap is discussed. Subsequently, a short case study applies the approach to an intrusion detection data set. Finally, the last section summarizes the chapter's contributions.

7.1 Related work on multivariate and radial information representations

Since this chapter analyzes the multivariate nature of network data sets, the related work discussed here addresses multivariate information representations. In addition, due to our choice of a radial visualization technique for the implementation, we also review related radial layouts with a focus on the network monitoring and security application fields.

7.1.1 Multivariate information representations

Although the design space for multivariate information representations seems almost unlimited at first glance, only a few visualization techniques have managed to gain a foothold in everyday use. Some of these techniques are already quite old, such as scatterplots, bar charts, or line graphs. A multitude of applications extend these standard plots (e.g., 3D or animated scatterplots). One prominent example is GapMinder [141], a visualization tool that has become quite popular recently. Although the scatterplot-like visualization metaphor behind it is not new, its animation mode and the underlying data from the World Health Organization make it a fascinating web-based exploration tool. Another possible way to extend these visualization techniques is the use of small multiples, as demonstrated in the scatterplot matrix in Figure 7.1. It is a powerful technique, but comparing all possible dimension pairs results in a huge number of plots for data sets with more than a few dimensions. Likewise, when considering large real-world data sets, a large extent of overlap makes scatterplot visualizations hard to interpret and sometimes even misleading. A further extension of traditional visualization techniques for dealing with multivariate data is dimensional stacking [101], which involves recursive embedding of images defined by a pair of dimensions within the area of a higher-level image.

[Figure 7.1 is a scatterplot matrix over the dimensions time, srcaddr, dstaddr, and app_port; the plotted points are omitted here.]

Figure 7.1: Scatter plot matrix: color represents the number of flows on a logarithmic scale

Glyphs are another popular technique for expressing multivariate data and comparing several objects with one another. By mapping attribute values to shape, color, size, orientation, or position, correlations can be found by searching for similar characteristics in these glyphs. Probably the most famous research on glyph visualization is Chernoff Faces [23]. In this work, points with up to 18 dimensions were mapped onto the features of a face, such as nose length, mouth curvature, etc. In the network security field, the idea of using glyphs for visual comparison of network events was demonstrated in the Intrusion Detection toolkit [94]. Its visual interface assigns variables of network statistics and intrusion detection data sets to visualization attributes, producing a glyph visualization of past or current events. On the one hand, this approach is very flexible; on the other hand, the many possible parameter settings make it difficult to choose a good visualization. One of the most famous geometric visualization techniques is parallel coordinates [73]. The basic idea is to represent each dimension of the data set through a vertical axis and to connect the coordinates of one multidimensional data point with each other through straight lines. Parallel coordinates have become a popular analysis technique when dealing with network data. VisFlowConnect, for example, uses the parallel axis view to display netflow records as incoming and outgoing links between two machines or domains [190]. This tool allows the analyst to discover a variety of interesting network traffic patterns, such as virus outbreaks, denial-of-service attacks, or network traffic of grid computing applications. Correlations between groups of data tuples can be easily spotted on neighboring axes. Therefore, the neighborhood configuration of these dimension axes strongly affects the correlations seen by the analyst.
However, in high-dimensional data sets, there is a huge number of possible configurations, since any permutation of these axes is valid. Furthermore, overplotting becomes a serious issue when dealing with medium-sized or large data sets. For visual analytics, not only the capability of pattern discovery is important, but also the means for refining and storing the insights gained in the process of analysis. The XmdvTool, well known for its multivariate visualizations, including glyphs, parallel coordinates, and dimensional stacking, as well as for interaction techniques like brushing and linking [177], recently introduced the Nugget Management System, which supports refinement, storage, and sharing of valuable insights [188]. In pixel-oriented visualization techniques, each pixel represents one data value through its color, thus enabling the visualization of large amounts of data [80]. The positional attribute commonly sets the data values into context by grouping them according to either dimensions or data tuples. Similarly to parallel coordinates, many pixel visualization techniques, such as Pixel Bar Charts [81], either can be run on a subset of dimensions or provide several parameters for fine-tuning the visualization, which results in a large number of possible pixel visualizations. Smart visual analytics techniques such as Pixnostics can be used in this context to automatically filter out uninteresting plots in order to focus on those plots with potentially interesting features [142]. In business applications, the pivot table interface is especially popular. The Polaris system, for example, is a visual interface for large multidimensional databases that extends the pivot table [149]. It allows the analyst to interactively construct visual presentations of table-based graphical displays from visually specified database queries. The experiences from the Polaris prototype led to the development of the commercial analysis tool Tableau Software [152]. Its recent innovations feature an integrated set of interface commands and defaults that incorporate automatic presentation into the software suite [110]. In contrast to these general-purpose visualization techniques, the aim of our visualization approach is to simultaneously show the aggregates computed at different granularity levels along with the detailed information of the network monitoring and security data sets. Note that this goal can be achieved through many possible layouts. In our case, we found the radial approach useful since the outer rings offer more display space for the detailed information at the low levels. Thereby, we take into account that circular layouts waste a lot of space in rectangular displays.

7.1.2 Radial visualization techniques

In the last few years, radial visualization techniques have become popular, including in the network security field. Since we also decided to use a radial layout for our technique, some of the related visualization approaches are discussed here. Fink and North introduced the Root Polar Layout of the IP address space [51]. Based on the assumption that classifying the IP address space into several trust levels can reveal useful information to the analyst, the tool visually separates these trust levels through concentric rings. IP addresses of the highest trust level are drawn in the inner circle, with IP addresses of progressively lower trust levels arranged on concentric rings around this circle. While this approach is certainly capable of displaying many different IP addresses on a single screen, a major drawback is that visual correlation of those IP addresses with other dimensions of the data becomes difficult. The work by Erbacher et al. details the analysis of multivariate network traffic through a visualization that is conceptually close to parallel coordinates [47]. Circles around the middle encode time intervals, from the latest events on the outer rings to historical events on the inner ones. The angle from the middle determines the internal IP address, and the cutting points with the circles are then connected through straight lines with the external IP address (top and bottom side) and the application port (left and right side), which are mapped as axes onto the sides of the display. Note that the redundant encoding on the respective opposite side is used to declutter the display by excluding lines that would cross more than 50 % of the display in the respective direction. VisAware focuses on situational awareness and also uses concentric circles to express time intervals [52]. It was built upon the w3 premise that every incident has at least three attributes, namely, what, when, and where.
In this display, the location attribute is placed on a map in the center, the time attribute is indicated on concentric circles around this map (with the current time spans inside and the historical ones outside), and the classification of the incident is represented through the angle around the circle. For each incident, the attributes are linked through lines that connect at its location on the map. While the first approach used concentric rings for trust levels of IP addresses and the latter two used the rings to map the time attribute, our approach uses the rings in a hierarchical fashion by showing highly aggregated information on the inner rings and detailed information on the outer rings. The visualization metaphor of our Radial Traffic Analyzer (RTA) consists of several concentric rings subdivided into sectors and is inspired by the space-filling hierarchical visualization techniques Solar Plot [25], Sunburst [148], InterRing [189], and FP-Viz [85].

7.2 Radial Traffic Analyzer

Our analysis focuses on the network layer of the Internet protocol stack. The network layer provides source and destination IP addresses, whereas the transport layer provides source and destination ports. Additionally, we collect information about the protocols used (mostly TCP and UDP) and the payload (transferred bytes). In short, we store a tuple t = (time, ipsrc, ipdst, portsrc, portdst, protocol, payload) for each transferred packet. For simplicity, we restrict ourselves to UDP (used by connectionless services) and TCP (used by connection-oriented services) packets. In the visualization presented in this chapter, we try to bring the complementary pieces of information together through extensive use of the visualization attribute position. Different variables of the data set are mapped to rings (see Figure 7.2), and the positioning scheme facilitates the analysis of a single data item (green) by following a straight line (red) from its position on one of the rings to the circle in the middle. Each ring segment that crosses this line (blue) is a higher-level aggregate that contains the data item. Furthermore, sorting and grouping operations are applied to bring similar data tuples close to each other. Color is used to visually link identical data characteristics, such as a host that appears once as the destination of incoming traffic and once as the source of outgoing traffic. We assume that a radial layout provides better support for the task of finding suspicious

[Figure 7.2 annotations: the inner ring is grouped and ordered by attribute 1, the outer ring by attributes 1 and 2 (e.g., country or continent); a data item on an outer ring lies on a straight line through its higher-level aggregates on the inner rings.]

Figure 7.2: Design ratio of RTA

Figure 7.3: Continuous refinement of RTA by adding new dimension rings

patterns, because the user does not misinterpret the item's position (left or right) as an indication of that item's importance, whereas in linear layouts the natural reading order from left to right might create such false impressions [26]. As users tend to minimize eye movements, the cost of sampling is reduced if items are spatially close [178]. Therefore, in the radial layout of RTA, the most important attribute, as chosen by the user, is placed in the inner circle, with the values arranged in ascending order to allow better comparisons of close and distant items. The subdivision of this ring is conducted according to the proportions of the measurement (i.e., number of packets or connections) using an aggregation function over all tuples with identical values for this attribute. Each further ring displays another attribute and uses the attributes of the rings further inside for grouping and sorting, prioritized by the order of the rings from inside to outside. RTA displays can also be created interactively. By adding one dimension after another, as demonstrated in Figure 7.3, the analyst gets a better feel for the granularity of the data. If the granularity is too fine, there is still the option to remove or reorder rings. While each added ring triggers a new database query, reordering the rings influences the aggregates of the affected ring as well as of all the outer rings, thus triggering a series of queries. The default configuration consists of four rings. The visualization is read from the center outwards, with the innermost ring for the source IP addresses, the second ring for the destination IP addresses, and the remaining two rings for the source and the destination ports, respectively.

Starting on the right and proceeding counter-clockwise, the fractions of the payloads for each group of network traffic are mapped to the ring segments while sorting the groups according to ipsrc, ipdst, portsrc, and portdst. Having grouped the traffic according to ipsrc, we add the next grouping criterion for each outer ring, which results in a finer subdivision of each sector compared to the inner rings. Figure 7.4 shows the RTA display with the network traffic distribution at a local computer. An overview is maintained by grouping the packets from inside to outside: the inner two circles represent the source and the destination IP addresses, the outer two circles the source and the destination ports. Traffic originating from the local computer can be recognized by the lavender-colored circle segment in the inner ring, and traffic to this host is shown in the same color in the second ring. Normally, ports reveal the application type of the respective traffic. This display is evidently dominated by web traffic (port 80, colored green), remote desktop and login applications (ports 3389 and 22, colored dark red and red, respectively), and email traffic (ports 993 and 25, colored blue and dark blue, respectively).

[Figure 7.4 legend: IP addresses – localhost, others; application ports – web, secure web, mail, secure mail, Telnet/MS Remote Desktop, SSH, FTP/Netbios, others]

Figure 7.4: RTA display with network traffic distribution at a local computer

To facilitate the understanding and correct interpretation of the rings, sectors representing identical IP addresses (inner two rings) are drawn in the same color. A common coloring scheme also applies to the ports (outer two rings). To further enhance the coloring concept, we created a mapping function for ordinal attributes that translates a given port or IP address number x to the index of an appropriate color scale: c(x) = x mod n (where n is the total number of distinct colors used). Prominent ports (e.g., HTTP = 80, SMTP = 25) are mapped to reserved colors not included in the cyclic color scale so that they are easier to identify. The mapping function facilitates relating close IP addresses or ports to each other. To distinguish between traffic transferred over an unsecured and over a secured channel, we modify the brightness of the respective color (i.e., HTTP/80 = green, HTTPS/443 = light green, etc.). For mapping numeric attributes (e.g., number of connections or time) to color, it makes more sense to normalize the data values and then map them to a color scale ranging from light to dark colors or vice versa. Different color scales were used for the different attributes to clarify which rings are comparable: an IP address appearing as a sending host in the innermost circle and reappearing as a receiving host in the second circle is colored identically, whereas this color is then exempted from being used for mapping a port number in the outer rings. Mapping hierarchical data to a radial layout results in a generally favorable display allocation: each outer ring shows more detailed information and also consumes more display space.
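The per-ring aggregation described above – each ring grouping the tuples by one additional attribute and sizing segments by an aggregated measure such as the payload – could be sketched as follows. The Flow record and the dimension accessors are our own illustrative stand-ins, not RTA's actual classes.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Sketch of RTA's ring aggregation over flow tuples. */
public class RingAggregator {

    /** Illustrative stand-in for the tuple t = (ipsrc, ipdst, portsrc, portdst, payload). */
    public record Flow(String ipSrc, String ipDst, int portSrc, int portDst, long payload) {}

    /** Ring r groups the flows by the first r dimensions (inner rings group
     *  coarser); the summed payload per group determines each segment's size. */
    public static Map<List<Object>, Long> ring(List<Flow> flows,
                                               List<Function<Flow, Object>> dims, int r) {
        return flows.stream().collect(Collectors.groupingBy(
                f -> dims.subList(0, r).stream().map(d -> d.apply(f))
                         .collect(Collectors.toList()),
                Collectors.summingLong(Flow::payload)));
    }
}
```

Reordering the rings then simply corresponds to permuting the dimension list and recomputing the aggregates of the affected ring and all rings further outside.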
Depending on the analysis task at hand, different dimension combinations are useful. The grouping is specified by assigning the chosen dimension (i.e., source IP, destination IP, source port, destination port) to one of the rings. A grouping according to the hosts might be useful when determining high-load hosts communicating on different ports, while a grouping according to the destination ports clearly reveals the load of each traffic type. To compensate for the strict importance imposed by the choice of the inner circle, the positioning and, thus, the importance within the sorting order can be interactively changed by dragging the circles into the desired positions. With the increasing number of circle segments in a ring, some of them may become too small to plot labels into them. Therefore, we cut long labels and employ Java tooltip pop-ups showing the complete label and additional information, such as the host name for a given IP address and the possible applications corresponding to the respective ports¹. As filtering is a common task, we implemented a filter for discarding all traffic with the chosen attribute values, which can be triggered with a simple mouse click. Detailed information about the data items represented through a particular circle segment is accessible via a popup menu. The payload is not the only available measure for network traffic analysis. To investigate failed connections, for example, the measure "transferred bytes" would not show the data entries of interest on the ring, as all failed connections have the value of 0 for this measure. In this situation, "number of connections" would be a more appropriate measure for sizing the circle segments. Experts often compare the number of transferred bytes to the number of sessions on a set of active hosts.
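The mapping function c(x) = x mod n with reserved colors for prominent ports, as described above, can be sketched as follows; the palette size and the reserved color indices are assumptions chosen for illustration, not the tool's actual values.

```java
import java.util.Map;

/** Sketch of the ordinal color mapping c(x) = x mod n used for ports and
 *  IP address numbers. */
public class PortColorMapper {

    static final int N = 12;  // assumed number of distinct colors in the cyclic scale

    // Prominent ports get reserved colors outside the cyclic scale
    // (indices >= N here) so they are easier to identify.
    static final Map<Integer, Integer> PROMINENT =
            Map.of(80, 12, 443, 13, 25, 14, 993, 15);

    /** Palette index for a port or IP address number x. */
    public static int colorIndex(int x) {
        return PROMINENT.getOrDefault(x, Math.floorMod(x, N));
    }
}
```

Because the mapping is cyclic for ordinary values, numerically close ports or addresses receive related palette positions, which is exactly what makes them easy to relate visually.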
High traffic with only a few sessions is considered to indicate a download resource, whereas medium traffic distributed over many sessions is typical for medium-bandwidth activities like surfing the world wide web. Plotting two RTA displays using different measures and the same ring order would enable this comparison at both granular and detailed levels. Normally, the data to be examined is abundant, and commonly known patterns dominate over exceptional traffic patterns. Therefore, filters are crucial for the task of finding malfunctions and threats within large network security data sets. In our tool, we implemented rules to discard “ordinary” traffic (e.g., web traffic), but also to select just certain subsets of the traffic (e.g., traffic on ports used by known security exploits). The user interactively applies, combines, and refines these filters to confirm or disprove his hypotheses about the data. One obvious disadvantage of radial layouts is that a large portion of the available display is unavoidably wasted as background, owing to the circular shape of the visualization and the rectangular shape of the screen.

7.3 Temporal analysis with RTA

For real-time capturing of network packets on a local PC, we used the packet capturing libraries libpcap [100] and WinPcap [39], as well as the Java wrapper JPcap [20] for accessing those libraries through a Java interface. To meet the high performance requirements of monitoring large networks, we decided to manage the data in the popular open-source relational database PostgreSQL [136]. For the analysis of large data sets, a more intelligent preprocessing is employed by merging individual packets into sessions, thus significantly reducing the data size. One way of doing this preprocessing is to take advantage of the export features implemented in commercial routers (e.g., the Cisco NetFlow protocol). Thereby, packets with the same source and destination IPs, the same port numbers, and very close timestamps are grouped into one flow record. We found our tool to be useful for observing network traffic characteristics over time using three different real-time modes:

¹A list of port numbers with their applications is provided by the Internet Assigned Numbers Authority [74].

1. In the aggregate mode, all traffic is aggregated and new traffic is continuously added.

2. In the time frame mode, a constant time frame is specified and the old traffic is continuously dropped.

3. In the constant payload mode, the same amount of traffic is displayed by specifying a flexible time frame, which always includes the latest packets.
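A minimal sketch of the packet selection behind the three modes, assuming a time-sorted list of (timestamp, bytes) tuples; the function and parameter names are hypothetical:

```python
def visible_packets(packets, mode, now, frame=60.0, payload_limit=10_000):
    """Select the packets shown by each real-time mode; packets is a
    time-sorted list of (timestamp, nbytes) tuples (illustrative names)."""
    if mode == "aggregate":          # everything captured so far
        return list(packets)
    if mode == "time_frame":         # constant window, old traffic dropped
        return [p for p in packets if now - p[0] <= frame]
    if mode == "constant_payload":   # flexible window over the latest packets
        shown, total = [], 0
        for p in reversed(packets):
            if total + p[1] > payload_limit and shown:
                break
            shown.append(p)
            total += p[1]
        return list(reversed(shown))
    raise ValueError(mode)
```

In the constant payload mode, the window grows or shrinks automatically so that roughly the same amount of traffic stays visible regardless of the current load.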

By grouping the captured packets into a time frame reaching up to the current moment, we can display a smooth transition by continuously updating the screen. Figure 7.5 contains a series of RTA displays for observing the evolution of the network traffic: (a) the user checks her email (blue) on two different mail servers and sends out one email using an unsecured channel (dark blue); (b) the user surfs on some web pages (port 80, dark green), whereas the blue mail traffic is still visible in the bottom left corner; (c) the user logs into her online banking account using HTTPS (bright green); (d) finally, a large file is accessed on the local file server using the NetBIOS protocol (orange). Another possibility is to scale the radius of the circles according to the traffic load they represent. In this way, the network monitoring analyst gets a visual clue about the load situation. However, the major drawback of this option is that the display might become too small or too large to analyze because of strong variations in the network traffic.

(a) time frame 1 (b) time frame 2 (c) time frame 3 (d) time frame 4

Figure 7.5: Animation over time in RTA in time frame mode

(a) Triggering RTA from a node on the HNMap (b) RTA user interface for specifying the rings

Figure 7.6: Invocation of the RTA interface from within the HNMap display

7.4 Integrating RTA into HNMap

The RTA display can be triggered from any rectangle in HNMap through a popup menu, as demonstrated in Figure 7.6(a). This allows the analyst to visualize more details about the hosts of the selected node or additional attributes of the data sets. After this interaction, a blank RTA display appears, and the user can interactively add attributes of the data set as rings and remove them again. Figure 7.6(b) shows the user interface triggered on China’s traffic node within the data set from our university gateway and details the accumulated number of failed inbound connections on 11/29/2005. A port scan from host 218.56.57.58 is visible at the upper left, and a large amount of failed attempts to open SMTP connections (email delivery) from host 218.79.183.81 can be observed at the lower left. Due to the sorting order, the whole spectrum of colors from the color scale appears several times on the second ring. This pattern visualizes the probing of a continuous range of ports, which is typical for a port scan. Network traffic of “normal” applications varies the used source ports only infrequently, and normally only a few target ports are employed.

7.5 Case study: Intrusion detection with RTA

RTA provides a flexible display for many different data sets and can be adjusted to a particular data set on the fly. Figure 7.7 shows a sample configuration with the inner two rings representing the source and the target IP addresses and the outer ring showing security alerts generated by an IDS. Having loaded 19 000 IDS alerts, we discarded ICMP Router Advertisement, ping, and echo alerts. The resulting view clearly revealed that host 134.34.53.28 (yellow) was attacked by host 84.154.163.59 (green) using various methods, indicated by a large number of small slices on the outermost ring. Each of the outer slices represents a distinct class of alerts, and its width corresponds to the number of these alerts.

Figure 7.7: Security alerts from Snort IDS in RTA

In the RTA display, the IP address dimension can be consolidated by replacing single hosts with the associated higher-level network entities (e.g., IP prefix or AS), for example, to investigate whether a denial-of-service attack originates from a particular AS or to assess the danger of a virus spreading from neighboring AS’s.

7.6 Discussion

The feedback provided by a number of undergraduate students was encouraging. Mapping network data to a radial layout seems to present an effective overview of the composition of network communication at the level of network packets. It was recognized that the technique is applicable to small data sets captured on a local computer, as well as to traffic monitored at the university gateway, preceded by intelligent preprocessing (we obtained anonymous, cumulated statistics). However, the technique cannot show all details due to the visual limitations inherent in radial layouts. We can compensate for this shortcoming by discarding some obvious traffic, such as web and mail traffic, and by offering fast interactive filtering capabilities.

7.7 Summary

The main contribution of this chapter lies in the development of an analytical visual interface based on the adaptation and application of radial layout techniques to the analysis needs of the network traffic monitoring and network security domains. We presented the Radial Traffic Analyzer, a technique capable of visually monitoring network traffic, relating communication partners, and identifying the type of traffic being transferred. The data about the network traffic was collected, stored, and aggregated in order to be presented in a meaningful way. The RTA display is suitable for showing aggregated information on the inner rings while presenting related information in more detail on the outer rings. It is complemented by appropriate interaction techniques, such as fading in hints on mouse-over, drag & drop to adapt the order of the rings, filtering by clicks, and details accessible via a popup menu. By specifying a time frame for refreshing the view, the user is capable of continuously monitoring network traffic. Due to the applied grouping characteristics, changes within the visualization are smooth in many realistic scenarios. The integration of RTA into HNMap enables the analyst to retrieve additional information about arbitrary nodes in the IP/AS hierarchy. In a short case study, we demonstrated how IDS alerts can be intelligently grouped in order to see the types of alerts in proportion to their number. The Radial Traffic Analyzer is not only suitable for monitoring purposes, but can also be used in an educational context for teaching and learning networking concepts, such as the flipped use of source and destination ports (TCP and UDP protocols) in the replies to network requests demonstrated in Figure 7.4.

8 Visual analysis of network behavior

“Nothing has such power to broaden the mind as the ability to investigate systematically and truly all that comes under thy observation in life.”

Marcus Aurelius

Contents
8.1 Related work on dimension reduction . . . . . . 124
8.2 Graph-based monitoring of network behavior . . . . . . 125
8.2.1 Layout calculation . . . . . . 127
8.2.2 Implementation . . . . . . 127
8.2.3 User interaction . . . . . . 127
8.3 Integration of the behavior graph in HNMap . . . . . . 129
8.4 Automatic accentuation of highly variable traffic . . . . . . 131
8.5 Case studies: monitoring and threat analysis with behavior graph . . . . . . 131
8.5.1 Case study I: Monitoring network traffic . . . . . . 132
8.5.2 Case study II: Analysis of IDS alerts . . . . . . 135
8.6 Evaluation . . . . . . 138
8.7 Summary . . . . . . 139
8.7.1 Future Work . . . . . . 139

This chapter focuses on tracking behavioral changes in the traffic of hosts or higher-level network entities, one of the most essential tasks in the domains of network monitoring and network security. We propose a new visualization metaphor for monitoring time-referenced host behavior. Our method is based on a force-directed graph layout approach, which allows for a multi-dimensional representation of several hosts in the same view. The proposed metaphor emphasizes changes in traffic over time and is therefore perfectly suited for detecting atypical system behavior. We use the visual variable position to give an indication of the traffic proportions at a host at a particular moment in time: high traffic proportions of a particular protocol attract the observation nodes, resulting in clusters of similar nodes. So-called traces are used to connect these host snapshots in chronological order. Various interaction capabilities allow fine-tuning of the layout, highlighting of hosts of interest, and retrieval of traffic details. As a contribution in the area of visual analytics, an automatic accentuation of hosts with high variations in the used application protocols of their network traffic was implemented in order to guide the interactive exploration process. In our work, we abstract from actual date and time values by aggregating network data into intervals. Each monitored network host (or, more abstractly, a network entity) is represented through a group of connected nodes in a graph, and the position of each one of these nodes represents the network traffic for the particular host in the specified time interval. The brightness of the traces is used to convey the nodes’ time references. The chapter is structured as follows.
We start by discussing related work in the field of dimension reduction and proceed by introducing the behavior graph approach, detailing layout calculation, implementation, user interaction, integration in HNMap, and automatic accentuation of highly variable hosts and networks. The applicability of the presented methods for network monitoring and intrusion detection is demonstrated in two case studies, followed by a discussion of the performance as well as the scalability limitations of the approach in the evaluation section. The last section concludes the chapter by summarizing its contributions.

8.1 Related work on dimension reduction

In this chapter, a graph-based approach to monitoring the behavior of network hosts is proposed. In essence, this approach can be described as a dimension reduction technique since it projects high-dimensional points onto the two-dimensional display space. We therefore briefly review some related work in this field. Rather than visualizing all dimensions of a data set, dimension reduction techniques reduce the high-dimensional data points to two or three dimensions, which can then be plotted in a visual layout. One such technique is Principal Component Analysis (PCA) [130], which displays the data as a linear projection onto a subspace of the original data in a way that best preserves the variance in the data. However, since the data is described in terms of a linear subspace, this technique cannot handle arbitrarily shaped clusters. Multidimensional Scaling (MDS) [159] is an alternative approach, which operates on a matrix of pairwise dissimilarities of entries. The goal is to optimize the representation in such a way that the distances between the items in low-dimensional space are as close as possible to the original distances. Note that some MDS approaches can also be applied to non-metric spaces. One problem with nonlinear MDS methods is their computational expensiveness for large data sets. However, Morrison et al. presented a subquadratic MDS algorithm in 2002 [121]. Since even subquadratic algorithms are computationally too expensive for interactive analysis of large data sets, the MDSteer algorithm, which refines selected subregions of the MDS plot, was presented as a compromise [183]. In contrast to PCA and MDS methods, the goal of the method presented in this chapter is not only to plot high-dimensional data points in such a way that high-dimensional clusters are preserved in the two-dimensional visualization, but also to convey information about the characteristics of dimensional values of particular data points.
This goal is achieved by introducing a few fixed nodes in a force-directed spring embedder graph layout [53]. In the proposed layout, each dimension is modeled as a vertex called “dimension node” and each host in a particular time interval as a vertex called “observation node”. Thereby, dimension nodes are fixed a priori, and the position of each observation node is determined by traffic measurements, which are expressed through attraction forces between the respective node and the dimension nodes. Furthermore, repulsion forces between nodes in high density regions indirectly result in the assignment of more display space to these interesting data regions.

Figure 8.1: Normalized traffic measurements defining the states of each network entity (host A or host B) for the intervals 1 to 6

8.2 Graph-based monitoring of network behavior

In this section we deal with the problem of monitoring the behavior of hosts or higher-level network entities through their network traffic. The goal of the proposed visualization is to effectively discover changes in the network entities’ behavior by comparing their states over time. This kind of analysis is mostly centered on a set of monitored hosts within the administrated local area network. While there are many attacks from outside the network, it is difficult to see when these attacks are successful. Our threat model therefore assumes that hosts which all of a sudden experience drastic changes in their network traffic could have been hacked and therefore pose a serious threat to the network infrastructure. In particular, the internal network might be at risk since a hacked computer can be used to bypass security measures such as a central firewall. Figure 8.1 shows the states of host A and host B at time intervals 1 to 6, calculated as the normalized traffic proportions for each type of traffic within the interval. We interpret these states as points in a high-dimensional space (one dimension per traffic type). Although the charts show all the relevant information, their scalability is limited since displaying such detailed information for many hosts and time intervals would make it difficult to keep the overview. To account for the above deficiency, we represent each network entity in a two-dimensional plot through several connected points, which all together compose the entity’s trace. Both color and shape are used to make entities distinguishable from one another.
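The per-interval states of Figure 8.1 amount to normalizing each host's traffic to proportions; a small sketch under the assumption that traffic is given as byte counts per protocol (names are illustrative):

```python
def interval_state(traffic):
    """Normalize the per-protocol byte counts of one host in one time
    interval to proportions summing to 1; traffic maps protocol -> bytes."""
    total = sum(traffic.values())
    if total == 0:
        return {proto: 0.0 for proto in traffic}
    return {proto: b / total for proto, b in traffic.items()}

# A host whose interval contained only SMTP traffic yields a pure SMTP state.
state = interval_state({"SMTP": 4200, "HTTP": 0, "SSH": 0})
```

Each such state is then treated as a point in a space with one dimension per traffic type, exactly as described above.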
Each node represents the state of a single network entity for a specific time interval, and its position is calculated from the entity’s state during that interval. We basically map a high-dimensional space onto a distorted two-dimensional space. If the nodes for one entity are not in the same place, the entity’s state has changed from one time interval to the other. This leads to some intended effects, which help to visually filter the image. Entities that do not change form small clusters or might even be visible only as a single point, whereas entities that have changed reveal visible trails, either locally or across the view. These long and prominent lines, which represent considerable changes in the behavior of a network entity, are more likely to catch the user’s attention.

Figure 8.2: Coordinate calculation of the host position at a particular point of time (the final graph layout is calculated using a force-based method)

To be able to visualize more than two dimensions in a two-dimensional plot, we use a force-directed layout approach to approximate distance relationships from the high-dimensional space in 2D. Each data dimension is represented by a dimension node. In the first step, the layout of these nodes is calculated. Although arbitrary layouts are possible for positioning these dimension nodes, the current implementation uses a circular force-directed layout to distribute the nodes over the available space. This chosen layout now defines the distortion of the projected space. After fixing the positions of the dimension nodes, the observation nodes are placed in the plane and connected to their corresponding dimension nodes through virtual springs. All observation nodes of the same entity are also tied together with virtual springs. The forces are calculated in an iterative fashion until an equilibrium is approximated. Figure 8.2 sketches the layout calculation for the two hosts from the previous figure. The analyst can now trace the state changes of each host at all intervals. Fine-tuning the graph layout with respect to trace visibility is done by attaching additional attraction forces to the trace edges, which are then taken into consideration during layout calculation. To visually highlight the time dependency of the nodes, we mapped time to the alpha value of the connecting traces: older traces gradually fade out while newer ones are clearly visible. For many analysis scenarios, not only traffic proportions matter, but also absolute traffic measures. In other words, the graph layout will assign almost the same position to two nodes each having 50 % IMAP and SMTP traffic, ignoring the fact that one of them has transferred several megabytes whereas the other one transferred only a few bytes.
Therefore, we decided to vary node size according to the absolute value of the traffic measure (normally, the sum of the transferred bytes) using logarithmic scaling to compensate for large variations in traffic volumes.
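Such logarithmic node sizing might be sketched as follows; the radius constants are illustrative assumptions, not values from the implementation:

```python
import math

def node_radius(nbytes, r_min=2.0, scale=1.5):
    """Logarithmic node sizing: compresses the large variations in
    transferred bytes into a usable radius range (constants illustrative)."""
    return r_min + scale * math.log10(1 + nbytes)
```

A host transferring a few bytes and one transferring megabytes thus remain distinguishable without the latter dominating the entire display.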

8.2.1 Layout calculation

The weights of the attraction edges of each observation node represent the proportions of the employed application protocols within the network traffic in a particular time interval. The first node of host B in Figure 8.2, for example, is only connected to the SMTP dimension node. Since node positions are calculated step-wise using a spring-embedder graph layout, and since all observation and dimension nodes push each other away due to additional repulsion forces, a consistent graph layout is generated in which each node has a unique position. We used the spring embedder algorithm to calculate the forces between the nodes [53]. The calculation of the attracting forces follows the idea of a physical model of atomic particles, which exert attractive and repulsive forces depending on their distance. While every node repels other nodes, only those nodes that are connected by an edge attract each other. It is important to note that the forces calculated by this algorithm result in speed, not acceleration as in physical systems. The reason is that the algorithm seeks a static equilibrium, not a dynamic one. There are several other algorithms that could solve our layout problem, such as the force-directed algorithm in [42], its variant in [78], and the simulated annealing approach in [38]. The reason for choosing the Fruchterman-Reingold algorithm is its efficiency, speed, and the robustness of the force and iteration parameters. As weighted edges were needed, we extended the Fruchterman-Reingold implementation of the JUNG [129] graph drawing library to support additional factors on the forces.
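One iteration of such a weighted spring-embedder step might look as follows. This is a simplified sketch in the spirit of Fruchterman-Reingold, not the extended JUNG implementation; force constants and function names are assumptions:

```python
import math

def step(pos, edges, fixed, k=1.0, dt=0.05):
    """One iteration of a simplified weighted spring-embedder step.
    pos: node -> [x, y]; edges: (u, v, weight) attraction edges;
    fixed: dimension nodes whose positions stay pinned."""
    disp = {v: [0.0, 0.0] for v in pos}
    nodes = list(pos)
    for i, u in enumerate(nodes):            # repulsion between all pairs
        for v in nodes[i + 1:]:
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k * k / d                    # repulsive force ~ k^2 / d
            disp[u][0] += f * dx / d; disp[u][1] += f * dy / d
            disp[v][0] -= f * dx / d; disp[v][1] -= f * dy / d
    for u, v, w in edges:                    # weighted attraction on edges
        dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
        d = math.hypot(dx, dy) or 1e-9
        f = w * d * d / k                    # attractive force ~ w * d^2 / k
        disp[u][0] -= f * dx / d; disp[u][1] -= f * dy / d
        disp[v][0] += f * dx / d; disp[v][1] += f * dy / d
    for v in pos:                            # displacement acts as speed
        if v not in fixed:
            pos[v][0] += dt * disp[v][0]
            pos[v][1] += dt * disp[v][1]
    return pos
```

Iterating this step until the displacements become negligible approximates the static equilibrium described above; an observation node connected only to the SMTP dimension node, for instance, drifts toward that node.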

8.2.2 Implementation

To build a flexible and fast analysis system, we relied on the database technology provided by the relational database PostgreSQL [136]. Data loading scripts extract the involved IP addresses along with port numbers, the transferred bytes, and a timestamp from tcpdump files and store them in the database. To speed up query execution, traffic with identical IPs and ports can be aggregated into 10-minute intervals in a new fact table. The actual behavior graph application is implemented in Java.

8.2.3 User interaction

Since node positions depend on the traffic occurring in the respective time interval and the repulsion forces of nearby nodes, only an approximation of the actual load situation is given in our visualization. Furthermore, estimating traffic proportions from node positions becomes difficult or even impossible due to the ambiguity introduced by projecting multi-dimensional data points into 2D. Figure 8.2, for example, shows that hosts A and B have almost the same position in the sixth interval although their traffic load is completely different (see Figure 8.1). This might happen because several sets of traffic loads are mapped to the same 2D location. We resolve this ambiguity through user interaction: moving the mouse over a node triggers a detail view. Alternatively, dimension nodes can be moved by dragging to estimate their influence on a particular node or a group of nodes. Figure 8.3 shows a behavior graph of 33 prefixes over a time span of one hour, in which the user selected three prefixes to trace their behavior. Interaction is used as a means to retrieve traffic details for a particular node (bar chart in the middle), and the configuration panel on the right allows for fine-tuning the layout.

Figure 8.3: Host behavior graph showing the behavior of 33 prefixes over a timespan of 1 hour

A simple click on a dimension node results in highlighting all observation nodes containing the respective traffic. This highlighting is realized by coloring all normal nodes in grayscale while showing the highlighted ones in color. Using the configuration panel, further dimension nodes and observation node groups can be added to or removed from the visualization. Because our application was designed carefully to ensure its usefulness for a variety of analysis scenarios, the user can flexibly choose the attributes representing attraction nodes and observation node groups depending on the available attributes in the considered data set. To abstract from the technical details, the user can simply select from the available data attributes in the two drop-down menus shown in Figure 8.3. In addition to the above manipulation options, the configuration panel provides four sliders:

1. A movement accentuation slider highlighting suspicious hosts with highly variable traffic. Further details of this feature are given in Section 8.4.

2. A slider controlling the number of observation nodes by increasing or decreasing the time intervals for aggregating traffic. Changing the granularity of the time intervals is a powerful means to remove clutter (fewer nodes due to larger time intervals) or to show more details (more nodes) in order to understand traffic situations better.

3. A slider for fine-tuning the strength of inter-object forces. Since each distinct node represents the state of a particular host during a time interval, we use edges to enable the user to trace a node's behavior over time. However, following these edges can become a challenge since nodes can end up in widely varying places. In order to make these observation node groups more compact, additional attraction forces can be defined on neighboring nodes of a chain and adjusted with the given slider. Figure 8.4 demonstrates the effect of changing the forces.

(a) Low inter-object forces (b) High inter-object forces

Figure 8.4: Fine-tuning the graph layout through cohesion forces between the nodes of each host for improving the compactness of traces

4. Last, but not least, the attraction forces between observation and dimension nodes play an important role in ensuring the graph's interpretability. Too strong attraction forces result in dense clusters around the dimension nodes, whereas too weak attraction forces produce ambiguity in interpreting traffic proportions, since the repulsion forces between observation nodes push some nodes closer to unrelated dimension nodes.

While adjusting the first slider results in a mere re-coloring of the observation nodes and their traces, changing the lower three sliders causes changes in the calculation of the graph layout. In particular, the time interval slider affects the number of observation nodes and therefore triggers a database query to determine the dimensional values of each observation node. Since the relationships between the old and the new observation nodes are difficult to figure out, the layout is recomputed from scratch in this case, whereas changes of the inter-object and dimension forces can be calculated by using the current graph, adjusting the forces, and realigning the node positions in a few iterations of the spring embedder layout.

8.3 Integration of the behavior graph in HNMap

In Section 5 we presented the HNMap as a hierarchical view of the IP address space. Hosts are grouped by prefixes, autonomous systems, countries, and continents using a space-filling hierarchical visualization. This scalable approach enables the analyst to retrieve the details of a quantitative measure of network traffic leaving and entering the hosts in the visualization using the above mentioned aggregation levels. While this approach takes the IP address dimension as well as a user-defined quantitative traffic measurement into account, the behavior graph focuses on behavioral changes, which are usually reflected in the dimensions of the data sets orthogonal to the IP dimension, e.g., application port or type of IDS alert. The combination of these two visualization modules is promising for providing a more comprehensive view of these complex data sets.

Figure 8.5: Integration of the behavior graph view into HNMap

Figure 8.5 shows the HNMap view drilled down to the AS level. Through a pop-up menu, a behavior graph for any one of the shown AS rectangles can be displayed. Since the detailed information necessary for building the behavior graph is available for all hierarchy levels, the user is free to choose the appropriate one. Thereby, the behavior of the selected node can be analyzed at any grain by showing the node's constituent child or descendant nodes in the graph. Note that only the two lowest hierarchy levels are available in the menu because the selected node (red node at the upper left corner of the pop-up menu) is an AS node. Higher level behavior graphs can be triggered from coarsely grained nodes, e.g., in country or continent level HNMap views. While the behavior graph on prefixes, AS's, countries, or continents represents less detailed information about particular substructures of the Internet, it has proven to be beneficial since the provided aggregated views significantly reduce the information overload a network administrator faces when dealing with large-scale network traffic monitoring. Hence, finding the relevant subset using HNMap in combination with aggregation in detail views can be seen as a possible solution to the scalability problem. For example, visualizing the behavior of all prefixes within a particular AS can be understood better than visualizing the behavior of all 200 000 available prefixes at once.

8.4 Automatic accentuation of highly variable traffic

In the behavior graph, clusters immediately stand out. However, to minimize the threat of misbehaving hosts within the administrated network, the analyst is in many scenarios interested in nodes with highly variable traffic or, in other words, in nodes that jump from one position to another in the visualization. This corresponds to the assumption in our threat model that we can see changes in the behavior of hacked computers by analyzing their network traffic. Since our visualization spans an n-dimensional metric space, it is possible to calculate the normalized positional change pc_norm over all subsequent observations \vec{o}_t (1 ≤ t ≤ t_max) of a host in this metric space:

\vec{o}^{\,r} = \frac{\vec{o}}{|\vec{o}|} \qquad (8.1)

pc_{norm} = \frac{\sum_{t=1}^{t_{max}-1} \left| \vec{o}^{\,r}_t - \vec{o}^{\,r}_{t+1} \right|}{t_{max} - 1}, \qquad 0 \leq pc_{norm} \leq 2 \qquad (8.2)

Note that the relative position \vec{o}^{\,r}_t of an observation node needs to be calculated first, as the graph layout tries to place nodes with identical relative positions close to each other. After calculating pc_norm for each observation node group, the groups with the highest values are accentuated. The bounds of pc_norm can be explained by the fact that the maximum difference between two vectors \vec{o}^{\,r}_t of length 1 is 2. To demonstrate the capabilities of our tool in a reproducible way, we used traffic data from our university network. In particular, we loaded all netflows passing the university gateway into the database and aggregated their traffic on the internal ‘/24’-prefixes. A database query calculates the data for each node and loads it into the visualization. Figure 8.6 shows the behavior of 96 ‘/24’-prefixes in the data set with time intervals of 12 minutes. Note that the nodes with the highest variations in their traffic are automatically accentuated in accordance with the outcome of our calculations. Likewise, these calculations can be used as a filter to remove nodes with no significant changes over time. Since this filtering step reduces the number of observation groups, less computational power is needed for the layout calculations of the remaining nodes.
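Equations 8.1 and 8.2 translate directly into code; a sketch assuming each observation is a plain list of per-protocol traffic values:

```python
import math

def pc_norm(observations):
    """Normalized positional change of one host (Eqs. 8.1 and 8.2);
    observations is a list of traffic vectors, one per time interval."""
    def rel(o):                        # Eq. 8.1: project onto the unit sphere
        norm = math.sqrt(sum(x * x for x in o)) or 1.0
        return [x / norm for x in o]
    rs = [rel(o) for o in observations]
    total = sum(                       # Eq. 8.2: sum of consecutive distances
        math.sqrt(sum((a - b) ** 2 for a, b in zip(rs[t], rs[t + 1])))
        for t in range(len(rs) - 1)
    )
    return total / (len(rs) - 1)       # bounded: 0 <= pc_norm <= 2
```

A host with constant behavior yields 0, while a host flipping between orthogonal traffic profiles approaches the upper bound.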

8.5 Case studies: monitoring and threat analysis with behavior graph

In this section we present two case studies validating the design of the behavior graph: the first case study deals with gaining an overview of network traffic, thereby identifying groups of hosts with similar behavior as well as abnormal behavior of individual hosts; in the second case study, alerts from an IDS are visualized in order to further explore, relate, and disseminate security-relevant data.

Figure 8.6: Automatic accentuation of highly variable ‘/24’-prefixes using 1 hour traffic from the university network

8.5.1 Case study I: Monitoring network traffic

In this case study, we used a data set containing the traffic of the most active hosts of our university network within a time frame of one day. The IP addresses of both the internal hosts and their external communication partners were anonymized to protect the privacy of the network users. In the absence of a packet inspection sensor, we relied on the application ports used in order to infer the application program from the network traffic. Since most registered applications use small port numbers, we restricted our analysis to application ports below 10 000. Due to the difficulty of determining, for each packet, which of the two hosts initiated the connection, the application port cannot always be identified with absolute confidence. Although many applications use dynamic or private ports ranging from 49 152 to 65 535 as the source port of their network communication, on which they will later wait for the replies, we did not automatically determine the application port by choosing the smaller of the source and destination port numbers, since this would introduce additional uncertainty into the behavior graph. As a result, many observation nodes are strongly attracted by the “unspecified” node. Figure 8.7 shows the 50 most active nodes in the data set from 12 to 18 hours, consolidated into 10-minute intervals, resulting in 1715 observation nodes, which comes close to the scalability limit of our visualization since both layout calculation and interactions causing layout changes become slow and tedious. Despite these drawbacks, observation node groups can be interactively highlighted in order to track their behavioral changes. The generated visualization displays several prominent clusters, such as the cluster of university employees who use the Network File System (NFS) protocol or the cluster of Oracle database hosts.
The largest cluster represents network traffic of standard applications, such as DNS, HTTP, or the “unspecified” dimension nodes. Apart from these clusters, host details can be easily tracked, e.g., the

Standard: DNS, HTTP and Unspecified

NFS of employees

Oracle DBs

Figure 8.7: Overview of network traffic from the University of Konstanz between 12 and 18 hours (showing 50 hosts with the highest traffic volume)

green host strongly attracted by the SMTP node is probably the central outgoing mail server. The same behavior analysis can be conducted for the night time. As shown in Figure 8.8, fewer hosts are active between 0 and 6 hours of the same day; 10 of the top 50 daily hosts are not active at all. There are some behavioral changes during night time, such as the nightly backups, which result in movements towards the backup node and back again, as seen on some hosts from the large cluster as well as on the green nodes of the DNS server. Furthermore, there is an increase of Oracle database traffic of both the green and the blue server. In general, specialized servers have less variant traffic since they predominantly serve a very limited range of application ports. Because of these high behavioral changes of the Oracle servers in the early morning, these two hosts are among the three hosts with the highest change according to their $pc_{norm}$ value, as demonstrated in Figure 8.9(a). In general, there are two different approaches to investigating suspicious traffic with the behavior graph. The first one is to use the automatic accentuation feature presented in Section 8.4, which is shown in Figure 8.9(a). A slider is used to adjust the number of colored hosts to be inspected in detail. This approach is based on the behavioral changes defined in the high-dimensional space, whereas the second approach is based on the observations in the low-dimensional visualization. As demonstrated in Figure 8.9(b), clicking on a particular node causes the associated observation node group to become colored, whereas the remaining nodes and their traces are shown in gray. In this exploration approach, the analyst is more likely to

Nightly backup of the DNS server

Increasing Oracle DB traffic at 6 h AM

Nightly backup

Return to normal operations

Figure 8.8: Nightly backups and Oracle DB traffic in the early morning

(a) Three hosts with the largest behavioral changes (b) Two employee machines active at night time

Figure 8.9: Investigating suspicious host behavior through accentuation

Figure 8.10: Evaluating 525 000 SNORT alerts recorded from January 3 to February 26, 2008

pick out hosts that visually stand out not only due to their highly variant traffic, but also due to their prominent position outside of common clusters, such as the blue and the green hosts, which correspond to two employees working at night time. A double click on the background either colors all nodes or sets them back to gray. A single click colors a node group or sets it back to gray, depending on the state of the group before the click. Having demonstrated the capability of the behavior graph to visualize host behavior of raw network traffic in a monitoring scenario, we proceed to the next case study, demonstrating this technique’s applicability for the analysis of large amounts of IDS alerts.

8.5.2 Case study II: Analysis of IDS alerts

For this case study, we evaluated the 525 000 alerts generated by a SNORT intrusion detection sensor within a subnet of the university network from January 3 to February 26, 2008. The alerts referred to 684 hosts that scanned and attacked the network or generated suspicious network traffic. The attraction nodes in this case were initialized not with the application port numbers, but rather with the 15 most prominent SNORT alerts of our data set and an “unspecified” traffic node for the remaining 19 rarely occurring alerts. Figure 8.10 shows the outcome of the behavior graph, in which each node represents the aggregated observations for a particular host within 3 days. Larger nodes indicate a higher number of alerts, thus helping to quickly identify the most active hosts. Colors and shapes

(a) Behavior of external hosts (b) Behavior of internal hosts

Figure 8.11: Splitting the analysis into internal and external alerts reveals different clusters

make nodes of different observation groups more distinguishable. For explanatory purposes, some observation groups were highlighted manually. While some clusters can be clearly seen, long colored straight lines immediately stand out. These lines can be explained either through alternating scanning or attack patterns, such as the top left alternating pattern between ICMP PING CYBERKIT 2.2 WIN and ICMP PING, or through a transition from one attack pattern to the other. For example, it is quite common to first scan the target network with various pings or ICMP echo replies in order to infer the operating systems of the victims from the results. Afterwards, more specialized attacks that exploit security leaks of the victims’ operating systems can be launched. In general, our visualization reveals correlations of alerts to the security analyst through these connecting lines and thereby supports him in his task to gain an overview of thousands of security alerts. During interactive exploration, we discovered that the clusters were distinguishable by internal and external hosts since these generated quite dissimilar alert patterns. We therefore created an internal and an external view as shown in Figure 8.11. Especially router advertisements seem to be an internal problem, which might be partially caused by the fact that the SNORT sensor does not explicitly exclude the internal routers and therefore logs any advertisement as a possible intrusion. Furthermore, the size of some internal nodes (see Figure 8.11(b)) is considerably larger since they trigger more alerts. In particular, the large green host on the left triggered more than 4000 alerts in each of the 3-day intervals. A closer examination revealed that the host is monitored externally and replies to three ICMP echo requests every 3 minutes. But not only large amounts of alerts are interesting to observe.
Small clusters, like the one on the left near the ICMP DESTINATION UNREACHABLE PORT alert, might give an indication of hosts that silently scan the internal network infrastructure, especially when a host is represented in more than one time interval, like the red node group.

(a) Behavior of attacking hosts (b) Behavior of victim hosts

Figure 8.12: Analysis of 63 562 SNORT alerts recorded from January 21 to 27, 2008

To further evaluate the behavior graph, we loaded a subset of a large data set containing 63 562 SNORT alerts, which were generated in the week from January 21 to 27, 2008. Figure 8.12(a) shows the behavior of the attacking hosts by focusing the analysis on the source addresses. There are relatively few traces due to the choice of rather long 24-hour time intervals. Traces are therefore only drawn if a host generated alerts on two different days. Figure 8.12(b) shows more traces since this behavior graph monitors local hosts within the network and these hosts get attacked over and over again. This second plot is strongly biased by the ICMP PING node on the upper right because almost all hosts in the network are scanned daily.

While the attacking hosts in Figure 8.12(a) form relatively homogeneous clusters, Figure 8.12(b) is dominated by the large cluster, which is linked to several outlier nodes. Having detected potential victims via a network scan using an ICMP PING, the attackers employ more specialized methods to find the victims’ vulnerabilities. It is interesting to observe that, for example, the large red node in (b) is scanned between 2800 and 4300 times per day using up to five different ping or echo reply methods. Due to the similarity of these attacks, this attack pattern also forms a cluster in (a) (left of the center), which consists of large heterogeneous nodes. Having inspected these nodes in detail, we realized that they originate from the same IP address range, which is an indication of a monitoring script that checks those hosts’ availability on a regular basis. Note that some providers do not assign static IP addresses, thus forcing their clients to change the IP address every 24 hours. Due to the long time intervals of 24 hours, we observe no traces within this cluster.

8.6 Evaluation

To evaluate our tool, we measured the performance of laying out 100 to 1000 nodes on a dual 2 GHz PowerPC. While conducting the performance measurements, we realized that the second CPU stayed almost completely unused, which indicates that additional performance can be gained by parallelizing the algorithm. Since we used the Fruchterman-Reingold algorithm to calculate the graph layout, the time complexity is $\Theta(|V|^2)$ for each of the 100 iterations. Additional computational resources are used for drawing the preliminary results after each iteration. The resulting performance charts in Figure 8.13 show the expected quadratic runtime. Although real-world data was used in the experiments, the number of edges was proportional to the number of nodes by a factor of approximately 2.5. At this point we have to acknowledge the existence of faster algorithms for calculating the graph layout, for example, in the GraphViz software [46]. However, for fast implementation of our prototype we preferred an algorithm from the JUNG framework [129], which was written completely in Java. Our tool can calculate a stable layout within approximately seven seconds for 100 observation nodes and within 2 minutes for 1000 observation nodes. The number of actual observation nodes depends on the number of monitored network entities, the time interval over which the data is aggregated, and the monitored time span. Each one of these can be seen as a factor impacting the estimate of the number of observation nodes (e.g., monitoring 50 hosts over six 10-minute intervals results in 300 or fewer observation nodes). Above 1000 observation nodes, layout calculation becomes tedious, and fine-tuning layout parameters such as the host cohesion force, the repulsion force, and the dimension force turns into a challenge of its own. In our case study, we used up to 15 data dimensions and one dimension to aggregate the remaining sparsely populated dimensions of the data sets.
Although this might be appropriate for many data sets, there is definitely a limit to the number of used dimensions. With less sparse high-dimensional data, interpretation of the projected data points becomes more challenging. However, there are various possibilities to aggregate the data prior to analyzing it with the plot: 1) several correlated dimensions, such as different kinds of ping, router advertisements, etc., could be aggregated into a single dimension; 2) the time interval can be adjusted to reduce the number of observation nodes in order to fit within the scalability limits of our visualization, as small time intervals generate many observation nodes, whereas the opposite applies to long intervals; 3) user-defined filters can remove considerable amounts of well-studied parts of the data.

nodes          100   200   300   400   500   600   700   800   900   1000
edges          247   503   748   991   1247  1495  1733  1984  2211  2465
runtime (sec)  7.2   12.9  20.6  29.3  39.4  54.7  65.8  80.9  102.1 122.1

Figure 8.13: Performance analysis of the layout algorithm
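For illustration, the quadratic per-iteration cost stems from the all-pairs repulsion of the Fruchterman-Reingold scheme. The following sketch is a generic textbook variant of one iteration, not the JUNG implementation we actually used; it makes the two cost terms explicit:

```python
import math

def fr_iteration(pos, edges, k=0.1, temp=0.05):
    """One Fruchterman-Reingold iteration: O(|V|^2) repulsion between
    all node pairs plus O(|E|) attraction along the edges."""
    n = len(pos)
    disp = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):                     # the quadratic part
        for j in range(i + 1, n):
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d = math.hypot(dx, dy) or 1e-9
            f = k * k / d                  # repulsive force
            disp[i][0] += dx / d * f; disp[i][1] += dy / d * f
            disp[j][0] -= dx / d * f; disp[j][1] -= dy / d * f
    for a, b in edges:                     # linear in the edge count
        dx = pos[a][0] - pos[b][0]
        dy = pos[a][1] - pos[b][1]
        d = math.hypot(dx, dy) or 1e-9
        f = d * d / k                      # attractive force
        disp[a][0] -= dx / d * f; disp[a][1] -= dy / d * f
        disp[b][0] += dx / d * f; disp[b][1] += dy / d * f
    for i in range(n):                     # cap movement by temperature
        d = math.hypot(*disp[i]) or 1e-9
        pos[i][0] += disp[i][0] / d * min(d, temp)
        pos[i][1] += disp[i][1] / d * min(d, temp)
    return pos
```

With 100 such iterations over |V| = 1000 nodes, on the order of 100·|V|²/2 pair interactions are performed, which matches the measured quadratic runtime; parallelizing the outer repulsion loop is the obvious route to exploiting the idle second CPU.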

8.7 Summary

In the scope of this chapter, we discussed a novel network traffic visualization metaphor for monitoring host behavior. The technique, termed the behavior graph, uses an adaptation of the force-directed Fruchterman-Reingold graph layout to place host observation nodes with similar traffic proportions close to each other in a two-dimensional node-link diagram. Various means of interaction with the graph make the tool suitable for exploratory data analysis. Nodes with specific traffic or the temporal behavior of a few chosen hosts can be highlighted. Furthermore, the analyst can move the dimension nodes and thereby sort the observation nodes according to his own criteria. Since the behavior graph can be used to analyze both low-level host behavior and the behavior of more abstract network entities, we integrated it into the HNMap tool, from which the behavior graph can be triggered through a pop-up menu on network entities of various granularity levels (e.g., hosts, prefixes, ASes). To enhance our tool with a visual analytics feature, we introduced a normalized measure for positional changes in n-dimensional metric space to automatically accentuate suspicious hosts with highly variant traffic. The usefulness of the tool was demonstrated in two case studies using traffic measurements from our university network and IDS alerts from a SNORT sensor. Thereby, it was shown that the tool is useful for typical traffic monitoring scenarios as well as for the analysis of intrusion detection alerts. In the evaluation section, the scalability limitations of the visualization with respect to runtime performance were discussed.

8.7.1 Future Work

We noticed that enabling the user to select a certain point in time for visualization would provide additional functionality to the tool. One possibility would be to use a histogram of the amount of traffic over time. The user could then select an interval on this histogram to view the traffic. Another interesting possibility would be the option to visualize network flows in real time with a sliding time window starting at the present and extending to some time in the past. As our layout is calculated iteratively, real-time visualization should be possible with a standard processor. Another direction for further work is the integration of an option to automatically select interesting dimensions. For data sets with very high dimensionality, the view gets cluttered. As our technique focuses on a general view, it would make sense to use methods like PCA to eliminate dimensions that have only a minor effect on the resulting visualization layout.

9 Content-based visual analysis of network traffic

“Exploration really is the essence of the human spirit, and to pause, to falter, to turn our back on the quest for knowledge, is to perish.” – Frank Borman

Contents
9.1 Related work on visual analysis of email communication ...... 142
9.2 Self-organizing maps for content-based retrieval ...... 143
9.2.1 Use cases ...... 143
9.2.2 Feature Extraction ...... 144
9.2.3 SOM generation ...... 146
9.3 Case study: SOMs for email classification ...... 147
9.4 Summary ...... 148
9.4.1 Future work ...... 149

While the previous chapters mainly dealt with the analysis of network traffic meta data, this chapter’s focus is on the analysis of the actual content of network communication. For this purpose, we consider electronic mail, which has become one of the most widely-used means of communication. While mailing volumes have shown high growth rates since the introduction of email as an Internet service and considerable work has been done to improve the efficiency of email management, the effectiveness of email management from a user perspective has not received a comparable amount of research attention. Typically, users are given little means to intelligently explore the wealth of accumulated information in their email archives. In this chapter, we extend our framework with a visualization module based on Self-Organizing Maps (SOMs) [90] generated from a term occurrence email descriptor. We apply this module to an email archive and enhance the functionality of an email management system by offering powerful visual analysis features. The rest of this chapter is structured as follows. In Section 9.1, related efforts on enhancing visual analysis of email communication are discussed. The concept of self-organizing maps is introduced in Section 9.2 by presenting use cases, demonstrating tf-idf feature extraction on emails, and giving an intuition about how SOMs are generated. Section 9.3 shows in a case study how the SOM can be used to explore emails classified as spam and ordinary email. Finally, we sum up our contributions and possible directions for future work in the concluding section.

9.1 Related work on visual analysis of email communication

The main reason for improving the graphical user interface of email applications is that the success and popularity of email communication have led to high daily volumes of messages being sent and received, resulting in email overload situations where important messages get overlooked or “lost” in archives [180]. Although usage of email has changed significantly over the years, since nowadays many people use it to manage their appointments and tasks, for file exchange, and as a personal archive, email clients have stayed very much the same [41]. Several research groups felt motivated to propose novel features for email applications to make email clients more adequate for these tasks. Mandic and Kerne, for example, recognized that email communication is actually “a diary we were never aware we were keeping” [112] and, therefore, regard personal email archives as a potential source of valuable insight into the structure and dynamics of one’s social network. Acknowledging the fact that traditional email interfaces have undergone little evolution despite their intensive usage, they propose faMailiar, a visualization interface for revealing intimacy and rhythm of personal email communication to the user. The system combines user-defined categorization of contact intimacy with message intimacy, computed using the presence of intimate and anti-intimate syntagmata, in order to visualize communication patterns of email messages through glyphs in a calendar view. The IBM Remail project also focused on chronological aspects of email communication with Thread Arcs, which visualize the reply patterns of communication threads in a compact way [88]. Apart from this key feature, the prototype was designed to integrate several sources, such as chat communication, news, calendar events, reminders, etc., into one communication platform. Through the scatter-and-gather feature, the inbox can be quickly reduced to the latest messages of on-going communication threads.
So-called collections can be used to order the contents of email communication according to user-defined preferences while not moving the messages out of the inbox. Especially the pivoting interface allows the user to rapidly switch between different views of the inbox, collections, and the to-do list without losing context. Microsoft also started an attempt to innovate email interfaces with the Social Network and Relationship Finder (short: SNARF) [126]. The prototype aggregates social meta data about email correspondents to aid email triage, which is the process of viewing unhandled email and deciding what to do with it. In the prototype, emails can be sorted according to several social metrics that capture the nature and the strength of the relationship between the user and each correspondent. In addition to exploring relationships between users and groups of users, the Email Mining Toolkit (EMT) can be used to investigate chronological flows of emails for detection of misuses, such as virus propagation and spam spread [104]. The tool comes with a clique panel for visualizing the relationships in a circular node-link diagram, which is extended through concentric rings to depict the time dimension. Besides a node-link diagram for social network exploration called Social Network Fragment, the tool by Viégas et al. also offers a temporal visualization of email communication named PostHistory, which uses a calendar metaphor to visualize overall email volumes as well as highlighted communication from user-selected contacts [174]. To facilitate interactive management of large volumes of email, we investigated techniques for visualizing temporal and geo-related attributes of email archives in previous work [82].
These techniques were based on a recursive pattern pixel visualization for displaying temporal aspects and on a map distortion technique to visualize the distribution of emails according to their geographic origin. In contrast to the methods reviewed above, which all visualize structured attributes from the content headers (e.g., sender, date, and in-reply-to fields) or other meta data (e.g., geography), this chapter will focus on the analysis of unstructured text content of email messages. Themail, for example, is a typographic visualization of an individual’s email content over time [175]. The tool displays the most frequently used terms as “yearly words” in a large font in the background and a more detailed selection of “monthly words” in a small font in the foreground, where those terms were chosen according to their frequency and distinctiveness using an adapted tf-idf feature extraction approach. A recent approach to visualizing emails in a self-organizing map [128] is probably closest to our own research. The authors propose an externally growing SOM with the aim of providing an intuitive visual profile of considered mailing lists and a navigation tool where similar emails are located close to each other. While both their and our approach use tf-idf feature vectors, the focus of the former is to adapt the map to the distribution of the underlying data by a growing process in order to avoid the time-consuming retraining process. Our approach, on the contrary, deals with the issue of how well additional classification attributes, such as the classification into spam and ordinary email, are preserved within the SOM.

9.2 Self-organizing maps for content-based retrieval

The Self-Organizing Map (SOM) [90] is a neural network algorithm that is capable of projecting a distribution of high-dimensional input data onto a regular grid of map nodes in low-dimensional (usually, 2-dimensional) output space. This projection is capable of (a) clustering the data and (b) approximately preserving the input data topology. The algorithm is therefore especially useful for data visualization and exploration purposes. Attached to each node on the output SOM grid is a reference (codebook) vector. The SOM algorithm learns the reference vectors by iteratively adjusting them to the input data by means of a competitive learning process. SOMs have previously been applied in various data analysis tasks. An example of the application on a large collection of text documents is the well-known WebSom project. Several visualization techniques supporting different SOM-based data analysis tasks exist [173]. The U-matrix, for example, visualizes the distribution of inter-node dissimilarity, supporting cluster analysis. Component planes are useful for visualizing the distribution of individual components in the reference vectors in order to support correlation analysis. If the input data points are mapped to their respective best matching map nodes, histograms of map population, such as the distribution of object classes on the map, are possible.

9.2.1 Use cases

Conceptually, we identify several interesting use cases for SOM-based visualization support in an email client:

• Classification. Using either automatic or manual methods, the SOM can be partitioned into regions representing different types of email, e.g., spam and non-spam email, business and private mail, and so on. For incoming email, the best matching region can then be identified and the mail can be classified as belonging to the label of that region.

• Retrieval. The user can search for email messages by mapping a query to the SOM node that best matches the query, followed by exploring the emails mapped to the neighborhood of that node using a technique like the U-matrix or a histogram-based visualization to guide the search.

• Organization. The user can employ the SOM generated from his/her email archive to learn about the overall structure of the emails contained in the archive. The user might then create a directory hierarchy for organizing emails reflecting the SOM structure information.

9.2.2 Feature Extraction

To obtain feature vectors from email data, we employ a well-known scheme from information retrieval. The n most frequent terms from the subject fields of all emails in the archive are determined after having filtered out irrelevant terms using a stop-word list in order to avoid the inclusion of non-discriminating terms in the description. Then, the tf-idf document indexing model [6] is applied, considering each email to be a document titled by its subject field. The model assigns to each document and each of the terms a weight indicating the relevance of the term in the given document with respect to the document collection. By concatenating the term weights for a given document, we obtain a feature vector (descriptor) for that document. The tf-idf vectors can be calculated by counting the frequencies of the terms. Usually, the term frequency count is normalized to prevent a bias towards longer documents, which naturally contain terms with higher frequencies due to the overall document length. Therefore, the following formula is used to calculate the term frequency $tf_{i,j}$ of term $t_i$ in email $e_j$:

$$tf_{i,j} = \frac{n_{i,j}}{\max_k(n_{k,j})} \quad (9.1)$$

where $n_{i,j}$ is the number of occurrences of the considered term in email $e_j$ and the denominator is the maximum occurrence count of any single term in $e_j$. Note that there exist variations of the normalized term frequency, for example, using the sum instead of the maximum function in the denominator. The inverse document frequency $idf_i$ measures the general importance of term $t_i$ with respect to the whole email collection $E$:

$$idf_i = \log \frac{|E|}{|\{e_j : t_i \in e_j\}|} \quad (9.2)$$

Thereby, terms that rarely occur in the whole document collection get high idf values since those terms are characteristic for that collection, whereas terms that occur in almost every document only result in small idf values.

Figure 9.1 illustrates the feature extraction on two sample emails from a collection of 100: e1 (“Buy a new car! Why don’t you buy a new car today? …”) and e2 advertising an internet flatrate (“Get cheap internet …”):

term       n(i,1)  n(i,2)  |{e_j : t_i ∈ e_j}|  tf-idf(i,1)  tf-idf(i,2)
buy        2       0       13                   0.89         0
new        2       0       14                   0.85         0
car        2       0       4                    1.40         0
today      1       0       25                   0.30         0
internet   0       1       7                    0            1.15
toy        0       0       11                   0            0
max        2       1

Figure 9.1: tf-idf feature extraction on a collection of 100 emails

The tf-idf value of term $t_i$ in email $e_j$ is then calculated as the product of its term frequency and its inverse document frequency:

$$\text{tf-idf}_{i,j} = tf_{i,j} \times idf_i \quad (9.3)$$

The tf-idf vector of a document is composed of the respective tf-idf values of all terms. For the two sample emails in Figure 9.1, we first count the term frequencies of the terms “buy”, “new”, “car”, “today”, “internet”, and “toy”. From these values, we can calculate the normalized term frequencies, such as $tf_{buy,1} = 2/2 = 1$ and $tf_{internet,2} = 1/1 = 1$. Next, the resulting inverse document frequencies are $idf_{buy} = \log(100/13)$ and $idf_{internet} = \log(100/7)$, given the collection size of 100 emails. Therefore, the relevance of the term “buy” is 0.89 for e1 and that of the term “internet” is 1.15 for e2. Note that the term “internet” is more important for the second email than the term “buy” is for the first email since the former term only occurs in 7 emails whereas the latter appears in 13 emails. On the resulting vectors, several distance measures for calculating the similarity between documents can be used. Baeza-Yates and Ribeiro-Neto [6], for example, propose to use the cosine of the angle between a document vector $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ and a query vector $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where $|\vec{d}_j|$ and $|\vec{q}|$ are the norms of the document and query vector, respectively:

$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|}, \quad 0 \le sim(d_j, q) \le 1 \quad (9.4)$$

$$= \frac{\sum_{i=1}^{|t|} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{|t|} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{|t|} w_{i,q}^2}} \quad (9.5)$$
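This similarity can be transcribed directly for sparse vectors; the following sketch stores each tf-idf vector as a term-to-weight dictionary (an illustrative representation, not prescribed by the model):

```python
import math

def cosine_sim(d, q):
    """Cosine similarity (Eqs. 9.4/9.5) between two sparse tf-idf
    vectors represented as {term: weight} dictionaries."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_d * norm_q)

# The two toy emails of Figure 9.1 share no terms, so their
# similarity is 0; a vector compared with itself yields 1.
e1 = {"buy": 0.89, "new": 0.85, "car": 1.40, "today": 0.30}
e2 = {"internet": 1.15}
print(cosine_sim(e1, e2))  # 0.0
print(cosine_sim(e1, e1))
```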

The set of all feature vectors of the collection serves as input for the SOM generation pro- cess. We acknowledge that more sophisticated email descriptors can be thought of. Specifi- cally, in addition to body text, email data usually contains a wealth of meta data and attributes which are candidates for inclusion in the description. In this section, we chose to start with a basic feature extractor, leaving the design of more complex descriptors for future work.
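The descriptor pipeline of Equations 9.1-9.3 fits in a short sketch (the helper name `tfidf_vectors` is hypothetical; the base-10 logarithm is an assumption that matches the values in Figure 9.1):

```python
import math

def tfidf_vectors(docs):
    """tf-idf with max-normalized term frequency (Eqs. 9.1-9.3).

    `docs` maps an email id to its (stop-word filtered) list of
    subject terms; returns {email id: {term: tf-idf weight}}.
    """
    df = {}                                   # document frequencies
    for terms in docs.values():
        for t in set(terms):
            df[t] = df.get(t, 0) + 1
    n = len(docs)
    vectors = {}
    for doc_id, terms in docs.items():
        counts = {}
        for t in terms:
            counts[t] = counts.get(t, 0) + 1
        peak = max(counts.values())           # denominator of Eq. 9.1
        vectors[doc_id] = {t: (c / peak) * math.log10(n / df[t])
                           for t, c in counts.items()}
    return vectors

# Reproducing two values of the toy example directly: "buy" occurs
# twice in e1 (peak count 2) and in 13 of the 100 emails; "internet"
# occurs once in e2 and in 7 of the 100 emails.
print(round((2 / 2) * math.log10(100 / 13), 2))  # 0.89
print(round((1 / 1) * math.log10(100 / 7), 2))   # 1.15
```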

9.2.3 SOM generation

Self-Organizing Maps can be seen as a nonlinear projection of feature vectors of any dimensionality onto a usually two-dimensionally arranged set of reference vectors. Assume a set of 16 randomly initialized reference vectors. During the learning phase, the best matching reference vector $\vec{m}_i$, also denoted the winner neuron (“winner”), is chosen and adapted to the input vector $\vec{x}$. In the diagrams in Figure 9.2, the angles of the reference nodes represent the values of the reference vectors. The winner neuron was chosen according to the best match with the input vector $\vec{x}$. In this example, the input vector is identical to the vector inside the dark gray circle in the second diagram. Note that the original reference vector can be found in

Figure 9.2: The learning phase of the SOM with the input vector adapting to its most similar reference vector (dark gray circle) and influencing its neighborhood (light gray). Six frames (top-down, left to right) show the process of transforming the unordered features (first frame) into the ordered features (last frame). 9.3. Case study: SOMs for email classification 147 the respectively previous frame. Only minor changes need to be made to the chosen reference vectors since the input vectors were already close to them. In addition to above vector adaptation, the neighborhood function on the topology deter- mines the adjacent map nodes and the degree of their adaptation to ~x. In Figure 9.2, this neighborhood function is represented through gray lines connecting each of the neurons with up to eight neighbors. Note that other more complicated neighborhood functions are possi- ble. The neurons along with their interconnections form a neuronal network. In the example, all reference vectors in light-gray circles were adjusted from their original position half-way towards ~x. This process is repeated several times with all input vectors and results in smooth and ordered looking m~ i values like the ones shown in the last frame of Figure 9.2. In general, this competitive learning algorithm can be analogously applied to high-dimensi- onal input and reference vectors resulting in an ordering of the high-dimensional input data. For the generation of the SOM in our experiments, we relied on the SOM PAK software and rules-of-thumb for good parameter choices found in the literature [91, 90].
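One competitive learning step can be sketched as follows (a simplified illustration, not the SOM PAK implementation: the adaptation factor of 0.5 mirrors the half-way adjustment in the example, and the fixed radius-1 rectangular neighborhood stands in for the decaying learning-rate and neighborhood schedules that a real SOM uses):

```python
import numpy as np

rng = np.random.default_rng(0)

# 4x4 grid of randomly initialized 2-D reference vectors (as in the 16-node example)
grid_h, grid_w, dim = 4, 4, 2
ref = rng.random((grid_h, grid_w, dim))

def train_step(ref, x, learning_rate=0.5, radius=1):
    """One competitive-learning step: find the winner neuron and pull it and
    its grid neighbors part of the way towards the input vector x."""
    # winner = reference vector closest to x (Euclidean distance)
    dist = np.linalg.norm(ref - x, axis=2)
    wi, wj = np.unravel_index(np.argmin(dist), dist.shape)
    for i in range(grid_h):
        for j in range(grid_w):
            # rectangular neighborhood of up to eight neighbors when radius=1
            if max(abs(i - wi), abs(j - wj)) <= radius:
                ref[i, j] += learning_rate * (x - ref[i, j])
    return (wi, wj)

x = np.array([0.1, 0.9])
for _ in range(20):          # repeated presentation of the same input
    winner = train_step(ref, x)
```

After repeated presentations, the winner's reference vector converges towards the input, while its neighbors are dragged along, which produces the ordered layout shown in the last frame of Figure 9.2.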

9.3 Case study: SOMs for email classification

As a proof of concept for the proposed technique, we present the results of two experiments. We generated a SOM from an archive of 9 400 emails using the 500 most frequent terms in the subject field for the tf-idf descriptor. All emails were labeled as belonging to either the spam or the non-spam class, as judged by a spam filter in combination with manual classification. By applying the competitive learning algorithm on 108 reference vectors using the 9 400 tf-idf vectors of the emails as input, we obtained a map layout for an ordering of the high-dimensional input data.

Figure 9.3: Spam histogram of a sample email archive (shades of red indicate spam)

Figure 9.4: Component plane for term “work” (shades of yellow indicate high term weights)

Figure 9.3 shows a spam histogram on the generated SOM. For each map node (gray), the coloring indicates the fraction of spam emails within all emails mapped to the respective node, with shades of red indicating high degrees of spam and shades of blue indicating low degrees of spam (the latter are the “good” email regions on the SOM). The coloring of the fields between the nodes is determined by interpolating the values of the adjacent nodes. Clearly, the SOM, which was learned from our basic tf-idf descriptor, is capable of discriminating spam from non-spam emails to a certain degree. Dark red or dark blue nodes indicate good classification results, whereas light red, light blue, or white nodes contain both spam and non-spam emails. The image in Figure 9.4 illustrates the component plane for the tf-idf term “work”, with shades of yellow indicating high weight magnitude. Combining both images, we learn that this specific term occurs in emails both of type spam and non-spam. The rightmost “work” cluster belongs to the “good” region and comprises university-related emails from a PhD student in our working group. Interaction capabilities help the user to explore different regions on the SOM. A click on a gray node returns a list view of all emails that were assigned to that particular reference vector. For explanatory purposes, an additional list, containing the tf-idf values of the terms of the reference vector $\vec{m_i}$ sorted in descending order, can be displayed.
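The per-node spam fraction that drives the histogram coloring can be sketched as follows (node positions, email vectors, and labels are hypothetical; the assignment by minimal Euclidean distance is an assumption consistent with the winner selection described in Section 9.2.3):

```python
import math

def best_matching_node(nodes, email_vec):
    """Index of the reference vector with minimal Euclidean distance."""
    def dist(ref):
        return math.sqrt(sum((r - e) ** 2 for r, e in zip(ref, email_vec)))
    return min(range(len(nodes)), key=lambda i: dist(nodes[i]))

def spam_fractions(nodes, emails):
    """Fraction of spam among the emails mapped to each node; this value
    determines the red/blue shading of the node in the histogram."""
    counts = [[0, 0] for _ in nodes]   # [spam, total] per node
    for vec, is_spam in emails:
        i = best_matching_node(nodes, vec)
        counts[i][0] += int(is_spam)
        counts[i][1] += 1
    return [s / t if t else None for s, t in counts]

nodes = [(0.0, 0.0), (1.0, 1.0)]
emails = [((0.1, 0.0), True), ((0.0, 0.2), True), ((0.9, 1.0), False)]
print(spam_fractions(nodes, emails))  # one fraction per map node
```

A fraction near 1 would be rendered dark red, a fraction near 0 dark blue, and intermediate values in the light, ambiguous shades.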

9.4 Summary

In this chapter, we presented a method for extracting text-based tf-idf features from email messages in order to be able to apply content-based visualization techniques. In particular, a self-organizing map (SOM) was created on top of the email feature vectors to enable visual exploration of large email collections based on the tf-idf similarity metric from information retrieval. While SOMs can be used for classification, retrieval, and organization, our case study only gave a proof of concept of how the map can be used for classification into spam and ordinary email messages. Since feature extraction and SOM learning are expensive processes, we refrained from considering the proposed technique for classifying network traffic on a large scale. However, we see the potential of the technique in the scope of personal information management.

9.4.1 Future work

Until now, we have only used ordinary tf-idf feature vectors for the training process of the SOM. Since emails also contain a wealth of information in their headers, attributes such as the traversed email servers, their time zones, spam ratings, attachments, etc., could be used to improve the classification results on the SOM. However, weighting this meta data in an appropriate way is still an open issue. Because we only considered a static email archive, i.e., one not subject to change over time, we did not deal with re-training issues of the SOM based on incoming emails. For the actual usage of the technique in an email client, this becomes an important aspect since, for example, the user might become involved in new projects whose characteristic terms had not been considered in the original scheme of selected terms. Likewise, without re-training efforts, some SOM nodes might contain large numbers of emails whereas other nodes remain only sparsely populated.

10 Thesis conclusions

“Do not say a little in many words but a great deal in a few.” (Pythagoras)

The work in this dissertation dealt with designing methods for visual analysis of network traffic to enable interactive monitoring, detection, and prevention of security threats. This thesis can therefore be seen as an effort to complement automatic monitoring and intrusion detection systems with visual exploration interfaces that empower human analysts to gain deeper insight into large, complex, and dynamically changing data sets.

10.1 Conclusions

The first three chapters motivated the need for visual analysis methods for network traffic, introduced basic concepts of networking as well as the underlying data modeling and maintenance, and discussed how information visualization and visual analytics methods can be used to solve current problems in network security:

• Chapter 1 presented a general motivation of our work and put it into the context of current threats, which endanger the network infrastructure of the Internet. While sophisticated intrusion detection systems generate enormous amounts of detailed alerts, handling these alerts without any visual support becomes tedious or even impossible for human analysts.

• Chapter 2 introduced basic networking and intrusion detection concepts, which explain the network data sets used in the course of this dissertation, as well as data warehousing concepts at the back-end of the data management and retrieval process. When dealing with large data sets, retrieval performance becomes the major bottleneck and is therefore an important precondition to enable visual analysis for network security.

• Chapter 3 put our work and other proposed visual analysis systems for network monitoring and security into the context of information visualization and visual analytics research. Using the data type by task taxonomy, we explained how the proposed methods satisfy different criteria of the taxonomy and how visual analytics methods can be used to improve both visual and automatic analysis methods.

While the first three chapters did not describe novel research, they highlighted methods and concepts which play an important role in understanding the research presented in the following chapters. Furthermore, database aspects are an integral part of the implemented systems and contribute enormously to their performance. Chapters 4 to 9, which form the core of this thesis, proposed a series of visualization techniques offering a comprehensive and highly interactive view on network traffic and network security events.

Visual analysis of temporal aspects (Chapter 4)

(Figure: Enhanced Recursive Pattern)

To overcome scalability constraints of common visualization techniques for time series data, we extended the recursive pattern technique as follows: 1) empty fields were inserted to deal with irregular time intervals, 2) spaces between high-level time units made the visualization more readable, and 3) to compare several time series at once, three coloring modes and three layout configuration modes for detailed analysis were proposed. These innovations made the technique especially suitable for analyzing the multitude of data-intensive time series occurring in network monitoring.

Hierarchical analysis of network traffic in the IP address space (Chapter 5)

(Figure: Hierarchical Network Map (HNMap) views of four consecutive time intervals on 2005-11-23, from 00:00:00 to 09:00:00)

In Chapter 5, the Hierarchical Network Map (HNMap) was proposed as a space-filling hierarchical mapping method for visualizing aggregated traffic of IP hosts. The approach aimed at finding a compromise between such optimization criteria as using display area to represent network size, full utilization of the screen space, position awareness, and rectangle aspect ratio. In the experiments, two layout variants and their combination at different levels of the hierarchy were evaluated with respect to visibility, average rectangle aspect ratio, and layout preservation. The applicability of the tool for resource planning, network monitoring, and network security was demonstrated with three case studies.

Visualization of traffic links on top of HNMap (Chapter 6)

(Figure: Edge bundles on HNMap)

In Chapter 6, hierarchical edge bundles were combined with HNMap to visualize source-destination relationships in network traffic. We used splines rather than straight connecting links to avoid visual clutter by grouping links according to their joint ancestors in the IP/AS hierarchy. Mouse interactions provide deeper insights into network traffic communication patterns.

Multivariate visual analysis of network traffic (Chapter 7)

(Figure: RTA)

In Chapter 7, a radial visualization technique was adapted to the needs of network security analysts. The presented Radial Traffic Analyzer (RTA) enables visual traffic monitoring, is capable of relating communication partners, and indicates the used application ports. The visual layout groups network traffic data according to a set of attributes in a hierarchical fashion: inner rings represent the high-level aggregates while the outer rings display more detailed information. The technique was complemented by interaction options such as hints on mouse-over events, drag-&-drop to manipulate the order of the rings, filtering with mouse clicks, and prompting for more details accessible through a popup menu. The tool's potential for temporal analysis was demonstrated by creating smooth animations of the RTA display.

Visualization of host behavior using a force-directed graph layout (Chapter 8)

(Figure: Behavior graph)

In the scope of Chapter 8, we discussed a novel network traffic visualization metaphor for monitoring the behavior of hosts. A force-directed graph layout is used to place host observation nodes with proportionally similar traffic or alerts close to each other in a two-dimensional node-link diagram. Through a visual analytics feature, nodes with highly variant traffic in the high-dimensional space can be automatically highlighted. Furthermore, the tool comprises interaction features such as capabilities to interactively mark suspicious nodes or to fine-tune the graph layout. Two case studies were presented to demonstrate the tool's applicability for monitoring host behavior.

Content-based analysis of network traffic using SOMs (Chapter 9)

(Figure: Self-Organizing Map)

While the analysis in the previous chapters focused on protocol-based meta data, such as IP addresses, application ports, etc., Chapter 9 analyzed the actual payload of network communication. In particular, tf-idf feature vectors were extracted from a sample email archive and ordered through the SOM technique. The experiments showed that the map, which was trained with low-level features, was capable of separating spam from ordinary emails. Furthermore, SOMs can also be used for similarity-based retrieval and organization.

In conclusion, this dissertation gave an extensive overview of the findings in the field of visualization for network monitoring and security. Through the implementation of several prototypes and their application to real-world data, it was demonstrated that visualization techniques can significantly contribute to solving challenging analysis tasks, such as the discovery of previously unknown patterns in large-scale data sets. Since the scale of the problems and the uncertainty of automatic analysis methods often leave those methods unable to provide enough insight to analysts, we are confident that more sophisticated visualization systems will be used for network monitoring and security in the future, and we hope that the methods presented in this thesis will inspire the software engineers responsible for designing such applications.

10.2 Outlook

While we discussed concrete possible improvements for each of the visualization techniques in the respective chapters, this section gives a more general view on future developments in the field. Due to increased bandwidth and speed, analysis of network data will become increasingly challenging in the future. In addition, the shift from IPv4 (2^32 addresses) to IPv6 (2^128 addresses) will have substantial consequences not only for routing, but also for security issues. Scanning, for example, would make little sense if a single end-user is to be found among trillions of addresses. In the end, these trends will impose huge demands upon the scalability of the utilized routing, capturing, storage, data mining, and visualization methods, in particular for real-time analysis. Despite the fact that purely automatic or purely visual methods could be successfully applied in the past, some problems will only be manageable through smart visual analytics methods, which combine human intelligence and background knowledge with the speed and accuracy of computers. We therefore think that in the future these challenges will lead to a paradigm shift from automatic monitoring and intrusion detection systems towards interactive visual analytics applications.

Bibliography

[1] Kulsoom Abdullah, Chris Lee, Gregory Conti, John A. Copeland, and John Stasko. IDS RainStorm: Visualizing IDS alerts. In Proceedings of the IEEE Workshop on Visualization for Computer Security (VizSEC), Minneapolis, U.S.A., October 2005.

[2] Christopher Ahlberg and Ben Shneiderman. Visual information seeking: tight coupling of dynamic query filters with starfield displays. In CHI ’94: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 313–317, New York, NY, USA, 1994. ACM Press.

[3] Natalia Andrienko and Gennady Andrienko. Exploratory Analysis of Spatial and Temporal Data - A Systematic Approach. Springer, 2006.

[4] Mihael Ankerst, Daniel A. Keim, and Hans-Peter Kriegel. Recursive pattern: A technique for visualizing very large amounts of data. In Proceedings of Visualization ’95, Atlanta, GA, pages 279–286, 1995.

[5] Mihael Ankerst, Daniel A. Keim, and Hans-Peter Kriegel. Circle segments: A technique for visually exploring large multidimensional data sets. In Visualization ’96, Hot Topic Session, San Francisco, CA, 1996.

[6] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.

[7] Tony Bates, Philip Smith, and Geoff Huston. CIDR Report, September 2006. http: //bgp.potaroo.net/cidr/ as retrieved on 30/12/2006.

[8] Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis G. Tollis. Graph Drawing: Algorithms for the Visualization of Graphs. Prentice-Hall, 1999.

[9] Richard A. Becker, Stephen G. Eick, and Allan R. Wilks. Visualizing network data. IEEE Transactions on Visualization and Computer Graphics, 1(1):16–21, March 1995.

[10] Benjamin B. Bederson, Aaron Clamage, Mary P. Czerwinski, and George G. Robertson. DateLens: A fisheye calendar interface for PDAs. ACM Trans. Comput.-Hum. Interact., 11(1):90–119, 2004.

[11] Benjamin B. Bederson, Ben Shneiderman, and Martin Wattenberg. Ordered and quantum treemaps: Making effective use of 2d space to display hierarchies. ACM Trans. Graph., 21(4):833–854, 2002.

[12] Jacques Bertin. Sémiologie graphique – les diagrammes, les réseaux, les cartes. Mouton, 1967.

[13] Jacques Bertin. Graphics and Information-Processing. Walter de Gruyter, Berlin, 1981.

[14] Jacques Bertin. Semiology of graphics. University of Wisconsin Press, 1983.

[15] Matthew A. Bishop. Computer Security: Art and Science. Addison-Wesley, 2003.

[16] Mark Bruls, Kees Huizing, and Jarke J. Van Wijk. Squarified treemaps. In Proceedings of the Joint Eurographics and IEEE TCVG Symposium on Visualization, 2000.

[17] Business Objects SA. BusinessObjects OLAP Intelligence, 2005. http:// www.businessobjects.com/products/queryanalysis/olapi.asp as retrieved on 31/12/2005.

[18] Stuart K. Card, Jock D. Mackinlay, and Ben Shneiderman, editors. Readings in information visualization: using vision to think. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999.

[19] Stuart K. Card and Jock D. Mackinlay. The structure of the information visualization design space. In Proceedings of the IEEE Symposium on Information Visualization, page 92, 1997.

[20] Patrick Charles. JPcap, 2007. http://jpcap.sourceforge.net/ as retrieved on 06/03/2008.

[21] Surajit Chaudhuri and Umeshwar Dayal. An overview of data warehousing and OLAP technology. ACM SIGMOD Record, 26(1):65–74, 1997.

[22] Surajit Chaudhuri, Umeshwar Dayal, and Venkatesh Ganti. Database technology for decision support systems. IEEE Computer, 34(12):48–55, 2001.

[23] Herman Chernoff. The Use of Faces to Represent Points in K-Dimensional Space Graphically. Journal of the American Statistical Association, 68(342):361–368, 1973.

[24] Bill Cheswick, H. Burch, and S. Branigan. Mapping and visualizing the internet. In Proceedings of the USENIX Annual Technical Conference, 2000.

[25] Mei C. Chuah. Dynamic aggregation with circular visual designs. In 1998 IEEE Symposium on Information Visualization, pages 35–43, 1998.

[26] Mei C. Chuah and Stephen G. Eick. Information rich glyphs for software management data. IEEE Comput. Graph. Appl., 18(4):24–29, 1998.

[27] Mei C. Chuah and Steven F. Roth. On the semantics of interactive visualizations. In Proceedings of the IEEE Symposium on Information Visualization, pages 29–36, Los Alamitos, CA, USA, 1996. IEEE Computer Society.

[28] K.C. Claffy. Caida: Visualizing the internet. IEEE Internet Computing, 05(1), 2001.

[29] CNS International. DataWarehouse Explorer, 2005. http://www.dwexplorer.com/ products/producttour as retrieved on 23/04/2008.

[30] E.F. Codd, S.B. Codd, and C.T. Salley. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. White paper, E.F. Codd & Associates, 1993.

[31] Cognos Software Corporation. Cognos PowerPlay: Overview –OLAP Software, 2005. http://www.cognos.com/powerplay as retrieved on 23/04/2008.

[32] L. Colitti, G. Di Battista, F. Mariani, M. Patrignani, and M. Pizzonia. Visualizing Interdomain Routing with BGPlay. Journal of Graph Algorithms and Applications, 9(1):117–148, 2005.

[33] Greg Conti. Security Data Visualization: Graphical Techniques for Network Analysis. No Starch Press, 2007.

[34] Gregory Conti, Kulsoom Abdullah, Julian Grizzard, John Stasko, John A. Copeland, Mustaque Ahamad, Henry L. Owen, and Chris Lee. Countering security information overload through alert and packet visualization. IEEE Computer Graphics and Applications, 26(2):60–70, 2006.

[35] Pier Francesco Cortese, Giuseppe Di Battista, Antonello Moneta, Maurizio Patrignani, and Maurizio Pizzonia. Topographic visualization of prefix propagation in the internet. IEEE Transactions on Visualization and Computer Graphics, 12(5):725–732, 2006.

[36] Donna Cox and Robert Patterson. Nsfnet visualization, 1995. National Center for Supercomputing, http://virdir.ncsa.uiuc.edu/virdir/raw-material/networking/nsfnet/NSFNET_1.htm as retrieved on 03/03/2008.

[37] Anita D. D’Amico, John R. Goodall, Daniel R. Tesone, and Jason K. Kopylec. Visual discovery in computer network defense. IEEE Computer Graphics and Applications, 27(5):20–27, 2007.

[38] Ron Davidson and David Harel. Drawing graphs nicely using simulated annealing. ACM Trans. Graph., 15(4):301–331, 1996.

[39] Loris Degioanni, Gianluca Varenni, Fulvio Risso, and John Bruno. WinPcap, 2008. http://www.winpcap.org as retrieved on 06/03/2008.

[40] Martin Dodge and Rob Kitchin. Atlas of Cyberspace. Addison-Wesley, 2001.

[41] Nicolas Ducheneaut and Victoria Bellotti. E-mail as habitat: an exploration of embed- ded personal information management. interactions, 8(5):30–38, 2001.

[42] Peter Eades. A heuristic for graph drawing. In Congressus Numerantium, volume 42, pages 149–160, 1984.

[43] Stephen G. Eick. Visualizing multi-dimensional data. SIGGRAPH Comput. Graph., 34(1):61–67, 2000.

[44] Stephen G. Eick. The Visualization Handbook, chapter Scalable Network Visualization, pages 819–829. Elsevier, 2005.

[45] Steven G. Eick. Visual scalability. Journal of Computational & Graphical Statistics, March 2002.

[46] J. Ellson, E.R. Gansner, E. Koutsofios, S.C. North, and G. Woodhull. Graph Drawing Software, chapter Graphviz and dynagraph–static and dynamic graph drawing tools, pages 127–148. Springer, 2003.

[47] Robert F. Erbacher, Kim Christensen, and Amanda Sundberg. Designing visualization capabilities for IDS challenges. In Proceedings IEEE Workshop on Visualization for Computer Security (VizSEC), pages 121–127, 2005.

[48] Robert F. Erbacher, Kenneth L. Walker, and Deborah A. Frincke. Intrusion and misuse detection in large-scale systems. IEEE Computer Graphics and Applications, 22(1):38– 48, 2002.

[49] Jean-Daniel Fekete and Catherine Plaisant. Interactive information visualization of a million items. In Proceedings of the IEEE Symposium on Information Visualization, Los Alamitos, CA, USA, 2002. IEEE Computer Society.

[50] Glenn A. Fink, Paul Muessig, and Chris North. Visual correlation of host processes and network traffic. In Proceedings IEEE Workshop on Visualization for Computer Security (VizSEC), pages 11–19, 2005.

[51] Glenn A. Fink and Chris North. Root polar layout of internet address data for security administration. In Proceedings of the IEEE Workshop on Visualization for Computer Security (VizSEC), Minneapolis, U.S.A., October 2005.

[52] Stefano Foresti, James Agutter, Yarden Livnat, and Shaun Moon. Visual correlation of network alerts. IEEE Computer Graphics and Applications, 26(2):48–59, March/April 2006.

[53] Thomas M. J. Fruchterman and Edward M. Reingold. Graph drawing by force-directed placement. Software - Practice and Experience, 21(11):1129–1164, 1991.

[54] V. Fuller, T. Li, J. Yu, and K. Varadhan. RFC 1519 Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy, September 1993. http: //tools.ietf.org/html/rfc1519 as retrieved on 23/04/2008.

[55] Mark Fullmer and Steve Romig. The OSU Flow-tools Package and Cisco NetFlow Logs. In Proceedings of the 14th Conference on Large Installation System Administration (LISA), 2000.

[56] George W. Furnas. Generalized fisheye views. In CHI ’86: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 16–23, New York, NY, USA, 1986. ACM Press.

[57] Emden Gansner, Yehuda Koren, and Stephen North. Topological fisheye views for visualizing large graphs. IEEE Transactions on Visualization and Computer Graphics, 11(4), 2005.

[58] Luc Girardin and Dominique Brodbeck. A Visual Approach for Monitoring Logs. In Proceedings of the Twelfth Systems Administration Conference (LISA ’98), pages 299– 308, 1998.

[59] John R. Goodall, Wayne G. Lutters, Penny Rheingans, and Anita Komlodi. Focusing on context in network traffic analysis. IEEE Computer Graphics and Applications, 26(2):72–80, 2006.

[60] Alexander Gostev. Malware evolution: January - march 2007. Technical report, Kasper- sky Lab, 2007. http://www.viruslist.com/en/analysis?pubid=204791938 as retrieved on 06/03/2008.

[61] Dennis P. Groth and Kristy Streefkerk. Provenance and annotation for visual ex- ploration systems. IEEE Transactions on Visualization and Computer Graphics, 12(6):1500–1510, 2006.

[62] Ming C. Hao, Umeshwar Dayal, Daniel A. Keim, Dominik Morent, and Jorn¨ Schnei- dewind. Intelligent visual analytics queries. In IEEE Symposium on Visual Analytics Science and Technology, pages 91–98, 2007.

[63] Ming C. Hao, Umeshwar Dayal, Daniel A. Keim, and Tobias Schreck. Importance driven visualization layouts for large time-series data. In Proceedings of the IEEE Symposium on Information Visualization. IEEE Computer Society, 2005.

[64] Mark A. Harrower and Cynthia A. Brewer. Colorbrewer.org: An online tool for select- ing color schemes for maps. The Cartographic Journal, 40(1):27–37, 2003.

[65] J. Hawkinson and T. Bates. RFC 1930 Guidelines for creation, selection, and registra- tion of an Autonomous System (AS), March 1996. http://tools.ietf.org/html/ rfc1930 as retrieved on 23/04/2008.

[66] John Heidemann, Yuri Pradkin, Ramesh Govindan, Christos Papadopoulos and- Genevive Bartlett, and Joseph Bannister. Census and survey of the visible internet (extended), February 2008. Technical Report ISI-TR-2008-649.

[67] Roland Heilmann, Daniel A. Keim, Christian Panse, and Mike Sips. RecMap: Rect- angular Map Approximations. In Proceedings of the IEEE Symposium on Information Visualization, pages 33–40, October 2004. 160 Bibliography

[68] Harry Hochheiser and Ben Shneiderman. Dynamic query tools for time series data sets: Timebox widgets for interactive exploration. Information Visualization, 3(1):1– 18, 2004. [69] Danny Holten. Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data. IEEE Transactions on Visualization and Computer Graphics, 12(5):741–748, 2006. [70] William H. Inmon. Building the Data Warehouse. John Wiley & Sons, Inc., New York, NY, USA, 3 edition, 2002. [71] insecure.org. NMap – Free Security Scanner For Network Exploration & Security Au- dits, 2008. http://nmap.org/ as retrieved on 06/03/20008. [72] insecure.org. Top 11 packet sniffers, 2008. http://sectools.org/sniffers.html as retrieved on 06/03/2008. [73] Alfred Inselberg and Bernard Dimsdale. Parallel coordinates: a tool for visualizing multi-dimensional geometry. In VIS ’90: Proceedings of the 1st conference on Visu- alization ’90, pages 361–378, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press. [74] Internet Assigned Numbers Authority. TCP and UDP port numbers, 2008. http: //www.iana.org/assignments/port-numbers as retrieved on 06/01/2008. [75] Ipv4 address report, May 2007. http://www.potaroo.net/tools/ipv4/ as retrieved on 01/05/2007. [76] Takayuki Itoh, Hiroki Takakura, Atsushi Sawada, and Koji Koyamada. Hierarchical visualization of network intrusion detection data. IEEE Computer Graphics and Appli- cations, 26(02):40–47, 2006. [77] Brian Johnson and Ben Shneiderman. Tree maps: A space-filling approach to the visu- alization of hierarchical information structures. In Proceedings of IEEE Visualization, pages 284–291, 1991. [78] Tomio Kamada and Satoru Kawai. An algorithm for drawing general undirected graphs. Inf. Process. Lett., 31(1):7–15, 1989. [79] Daniel A. Keim. Visual Support for Query Specification and Data Mining. Shaker Verlag, 1995. [80] Daniel A. Keim. Pixel-oriented visualization techniques for exploring very large databases. 
Journal of Computational and Graphical Statistics, 5(1):58–77, March 1996. [81] Daniel A. Keim, Ming C. Hao, Umeshwar Dayal, and Meichun Hsu. Pixel bar charts: A visualization technique for very large multi-attribute data sets. Information Visual- ization Journal, 1(2), 2002. Bibliography 161

[82] Daniel A. Keim, Florian Mansmann, Christian Panse, Jorn¨ Schneidewind, and Mike Sips. Mail explorer - spatial and temporal exploration of electronic mail. In Proceedings of the Eurographics/IEEE-VGTC Symposium on Visualization (EuroVis 2005), Leeds, United Kingdom June 1st-3rd, 2005.

[83] Daniel A. Keim, Stephen C. North, Christian Panse, Matthias Schafer,¨ and Mike Sips. HistoScale: An efficient approach for computing pseudo-cartograms. In IEEE Confer- ence on Visualization, pages 28–29, October 2003.

[84] Daniel A. Keim, Stephen C. North, Christian Panse, and Mike Sips. Pixelmaps: A new visual data mining approach for analyzing large spatial data sets. In ICDM 2003, The Third IEEE International Conference on Data Mining, 19-22 November 2003, Mel- bourne, Florida, USA. IEEE Computer Society, November 2003.

[85] Daniel A. Keim, Jorn¨ Schneidewind, and Mike Sips. FP-Viz: Visual Frequent Pattern Mining. In IEEE Symposium on Information Visualization, 2005. Poster Paper.

[86] Daniel A. Keim and Matthew O. Ward. Intelligent Data Analysis, chapter Visual Data Mining Techniques, pages 403–427. Springer, 2 edition, 2002.

[87] Peter R. Keller and Marry M. Keller. Visual Cues - Practical Data Visualization. IEEE Press, 1993.

[88] Bernard Kerr and Eric Wilcox. Designing remail: reinventing the email client through innovation and integration. Conference on Human Factors in Computing Systems, pages 837–852, 2004.

[89] Ernst Kleiberg, Huub van de Wetering, and Jarke J. Van Wijk. Botanical visualization of huge hierarchies. In Proceedings of the IEEE Symposium on Information Visualization, page 87, Washington, DC, USA, 2001. IEEE Computer Society.

[90] Teuvo Kohonen. Self-Organizing Maps. Springer, Berlin, 3rd edition, 2001.

[91] Teuvo Kohonen, Jussi Hynninen, Jari Kangas, and Jorma Laaksonen. Som pak: The self-organizing map program package. Technical Report A31, Helsinki University of Technology, Laboratory of Computer and Information Science, 1996.

[92] Hideki Koike and Kazuhiro Ohno. Snortview: visualization system of snort logs. In VizSEC/DMSEC, pages 143–147, 2004.

[93] Hideki Koike, Kazuhiro Ohno, and Kanba Koizumi. Visualizing cyber attacks using ip matrix. In Visualization for Computer Security, pages 91–98, Los Alamitos, CA, USA, 2005. IEEE Computer Society.

[94] Anita Komlodi, Penny Rheingans, Utkarsha Ayachit, and John R. Goodall. A user- centered look at glyph-based security visualization. In Proceedings of the IEEE Work- shop on Visualization for Computer Security (VizSEC), pages 21–28, October 2005. 162 Bibliography

[95] Jim Kurose and Keith Ross. Computer Networking: A top-down approach featuring the Internet. Pearson Education, Inc., 2005.

[96] Mohit Lad, Dan Massey, and Lixia Zhang. Visualizing internet routing changes. IEEE Transactions on Visualization and Computer Graphics, 12(6):1450–1460, 2006.

[97] Kiran Lakkaraju, Ratna Bearavolu, Adam Slagell, William Yurcik, and Stephen North. Closing-the-loop in nvisionip: Integrating discovery and search in security visualiza- tions. In Proceedings of the IEEE Workshop on Visualization for Computer Security (VizSEC), October 2005.

[98] Stephen E. Lamm, Daniel A. Reed, and Will H. Scullin. Real-time geographic visual- ization of world wide web traffic. In Proceedings of the fifth international World Wide Web conference on Computer networks and ISDN systems, pages 1457–1468, Amster- dam, The Netherlands, The Netherlands, 1996. Elsevier Science Publishers B. V.

[99] Stephen Lau. The spinning cube of potential doom. Communications of the ACM, 47(6), 2004.

[100] Lawrence Berkeley National Laboratory. tcpdump and libpcap, 2008. http://www. tcpdump.org/ as retrieved on 06/03/2008.

[101] Jeffrey LeBlanc, Matthew O. Ward, and Norman Wittels. Exploring n-dimensional databases. In Proceedings of the First IEEE Conference on Visualization, pages 230– 237, 23-26 Oct 1990.

[102] Christopher P. Lee, Jason Trost, Nicholas Gibbs, Raheem Beyah, and John A. Copeland. Visual firewall: Real-time network security monito. In Proceedings of the IEEE Workshop on Visualization for Computer Security (VizSEC), pages 129–136, 2005.

[103] Barry M. Leiner, Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard Kleinrock, Daniel C. Lynch, Jon Postel, Larry G. Roberts, and Stephen Wolff. A brief history of the internet, December 2003. Version 3.32, http://www.isoc.org/internet/history/ brief.shtml.

[104] Wei-Jen Li, Shlomo Hershkop, and Salvatore J. Stolfo. Email archive analysis through graphical visualization. In VizSEC/DMSEC, pages 128–132, 2004.

[105] Manuel Lima. Visual complexity – a visual exploration on mapping complex networks. Online, 2007. http://www.visualcomplexity.com/vc/ retrieved on 10/10/2007.

[106] Kirk Lougheed and Yakov Rekhter. RFC 1163 A Border Gateway Protocol (BGP), June 1990. http://tools.ietf.org/html/rfc1163 as retrieved on 23/04/2008.

[107] Peter Lyman and Hal R. Varian. How much information, 2003. Retrieved from http: //www.sims.berkeley.edu/how-much-info-2003/ on 22/02/08. Bibliography 163

[108] Kwan-Liu Ma. Guest editor’s introduction: Visualization for cybersecurity. IEEE Com- puter Graphics and Applications, 26(2):26–27, 2006.

[109] Jock Mackinlay. Automating the Design of Graphical Presentations of Relational In- formation, chapter 2, pages 66–81. Morgan Kaufmann Publishers, 1999.

[110] Jock Mackinlay, Pat Hanrahan, and Chris Stolte. Show me: Automatic presentation for visual analysis. IEEE Transactions on Visualization and Computer Graphics, 13:1137– 1144, November 2007.

[111] W. Victor Maconachy, Corey D. Schou, Daniel Ragsdale, and Don Welch. A model for Information Assurance: an Integrated Approach. In Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, (West Point, NY), 2001.

[112] Mirko Mandic and Andruid Kerne. Using intimacy, chronology and zooming to visual- ize rhythms in email experience. Conference on Human Factors in Computing Systems, pages 1617–1620, 2005.

[113] Svetlana Mansmann and Marc H. Scholl. Visual OLAP: a New Paradigm for Exploring Multidimensional Aggregates. In Proceedings of the IADIS International Conference on Computer Graphics and Visualization (CGV), July 2008.

[114] Zhuoqing Morley Mao, David Johnson, Jennifer Rexford, Jia Wang, and Randy H. Katz. Scalable and accurate identification of as-level forwarding paths. In INFOCOM, 2004.

[115] David J. Marchette. Computer Intrusion Detection and Network Monitoring - A Statis- tical Viewpoint. Statistics for Engineering and Information Science. Springer, 2001.

[116] Maxmind, Ltd. Geoip database, 2007. http://www.maxmind.com as retrieved on 30/12/2007.

[117] John McCumber. Information Systems Security: A Comprehensive Model. In Pro- ceedings of the 14th National Computer Security Conference, (Baltimore, MD), 1991.

[118] Peter McLachlan, Tamara Munzner, Eleftherios Koutsofios, and Stephen North. Liv- erac: interactive visual exploration of system management time-series data. In CHI ’08: Proceeding of the twenty-sixth annual SIGCHI conference on Human factors in computing systems, pages 1483–1492, New York, NY, USA, 2008. ACM.

[119] Jonathan McPherson, Kwan-Liu Ma, Paul Krystosk, Tony Bartoletti, and Marvin Chris- tensen. Portvis: a tool for port-based detection of security events. In Proceedings of the ACM workshop on visualization and data mining for computer security, pages 73–81, New York, NY, USA, 2004. ACM Press.

[120] Miniwatts Marketing Group. Internet usage statistics - the big picture, March 2007. http://www.internetworldstats.com/stats.htm. 164 Bibliography

[121] Alistair Morrison, Greg Ross, and Matthew Chalmers. A hybrid layout algorithm for sub-quadratic multidimensional scaling. In Proceedings of the IEEE Symposium on Information Visualization, 2002.

[122] Chris Muelder, Kwan-Liu Ma, and Tony Bartoletti. A visualization methodology for characterization of network scans. In Proceedings of the IEEE Workshop on Visualiza- tion for Computer Security (VizSEC), Minneapolis, U.S.A., October 2005.

[123] S. Mukosaka and H. Koike. Integrated visualization system for monitoring security in large-scale local area network. In APVis ’07: Proceedings of the 2007 Asia-Pacific Symposium on Information Visualisation, pages 41–44, Los Alamitos, CA, USA, 2007. IEEE Computer Society.

[124] Randall Munroe. Map of the internet, the IPv4 space, 2006. http://www.xkcd.com/ 195/asretrievedon24/04/2008.

[125] Tamara Munzner, Eric Hoffman, K. C. Claffy, and Bill Fenner. Visualizing the global topology of the mbone. In Proceedings of the IEEE Symposium Information Visualiza- tion, Los Alamitos, CA, USA, 1996. IEEE Computer Society.

[126] Carman Neustaedter, A.J. Bernheim Brush, Marc A. Smith, and Danyel Fisher. The Social Network and Relationship Finder: Social Sorting for Email Triage. Proceedings of the Conference on Email and Anti-Spam, 2005.

[127] Steven Noel and Sushil Jajodia. Managing attack graph complexity through visual hierarchical aggregation. In VizSEC/DMSEC, pages 109–118, 2004.

[128] Andreas Nurnberger¨ and Marcin Detyniecki. Externally growing self-organizing maps and its application to e-mail database visualization and exploration. Applied Soft Com- puting Journal, 6(4):357–371, 2006.

[129] Joshua O’Madadhain, Danyel Fisher, Padhraic Smyth, Scott White, and Yan-Biao Boey. Analysis and visualization of network data using jung. Journal of Statistical Software, to appear.

[130] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosoph- ical Magazine, 2(6):559–572, 1901.

[131] Torben Bach Pedersen and Christian S. Jensen. Multidimensional database technology. IEEE Computer, 34(12):40–46, 2001.

[132] Doantam Phan, John Gerth, Marcia Lee, Andreas Paepcke, and Terry Winograd. Vi- sual analysis of network flow data with timelines and event plots. In VizSEC 2007 – Workshop on Visualization for Computer Security. Springer, 2008. to appear.

[133] Doantam Phan, Ling Xiao, Ron Yeh, Pat Hanrahan, and Terry Winograd. Flow map layout. In Proceedings of the IEEE Symposium on Information Visualization, pages 219–224, Washington, DC, USA, 2005. IEEE Computer Society. Bibliography 165

[134] William A. Pike, Chad Scherrer, and Sean Zabriskie. Putting security in context: Vi- sual correlation of network activity with real-world information. In VizSEC 2007 – Workshop on Visualization for Computer Security. Springer, 2008. to appear.

[135] Joshua Polterock. Ipv4 census map, mapping the address space. http://www. caida.org/research/id-consumption/census-map/.

[136] PostgreSQL Global Development Group. PostgreSQL, 2008. http://www. postgresql.org/ as retrieved on 06/03/2008.

[137] J. Propach and S. Reuse. Data Warehouse: ein 5-Schichten-Modell. WISU - das Wirtschaftsstudium, 32(1):98–106, January 2003.

[138] Pin Ren, Yan Gao, Zhichun Li, Yan Chen, and Benjamin Watson. Idgraphs: Intru- sion detection and analysis using stream compositing. IEEE Computer Graphics and Applications, 26(2):28–39, 2006.

[139] J. S. Risch, D. B. Rex, S. T. Dowson, T. B. Walters, R. A. May, and B. D. Moon. The starlight information visualization system. In IV ’97: Proceedings of the IEEE Conference on Information Visualisation, page 42, Washington, DC, USA, 1997. IEEE Computer Society. http://starlight.pnl.gov.

[140] George G. Robertson, Jock D. Mackinlay, and Stuart K. Card. Cone trees: animated 3d visualizations of hierarchical information. In CHI ’91: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 189–194, New York, NY, USA, 1991. ACM Press.

[141] Hans Rosling. Gapminder, 2007. http://www.gapminder.org/.

[142]J orn¨ Schneidewind, Mike Sips, and Daniel A. Keim. Pixnostics: Towards measuring the value of visualization. In IEEE Symposium on Visual Analytics and Technology, Baltimore, Maryland, USA, October 29 - November 3, 2006.

[143] Tobias Schreck, Daniel A. Keim, and Florian Mansmann. Regular treemap layouts for visual analysis of hierarchical data. In Spring Conference on Computer Graphics (SCCG’2006), April 20-22, Casta Papiernicka, Slovak Republic. ACM Siggraph, 2006.

[144] Heidrun Schumann and Wolfgang Muller.¨ Visualisierung - Grundlagen und allgemeine Methoden. Springer, 2000.

[145] Ben Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In IEEE Symposium on Visual Languages, pages 336–343, 1996.

[146] Sourcefire. Real-time network awareness, 2005. http://www.sourcefire.com/ products/rna.html as retrieved on 11/11/2005.

[147] Robert Spence. Information Visualization - Design for Interaction. Pearson Education Limited, 2nd edition, 2006. 166 Bibliography

[148] John T. Stasko and Eugene Zhang. Focus + context display and navigation techniques for enhancing radial, space-filling hierarchy visualizations. In Proceedings of the IEEE Symposium on Information Visualization, 2000.

[149] Chris Stolte, Diane Tang, , and Pat Hanrahan. Polaris: A system for query, analy- sis, and visualization of multidimensional relational databases. IEEE Transactions on Visualization and Computer Graphics (TVCG), 8(1):52–55, January–March 2002.

[150] Chris Stolte, Diane Tang, and Pat Hanrahan. Polaris: A system for query, analysis, and visualization of multidimensional relational databases. IEEE Transactions on Visual- ization and Computer Graphics, 8(1):52–65, 2002.

[151] Symantec. Symantec Internet Security Threat Report: Trends for July-December 06, March 2007. Volume XI, http://eval.symantec.com/mktginfo/enterprise/ white_papers/ent-whitepaper_internet_security_threat_report_xi_03_ 2007.en-us.pdf as retrieved on 01/03/2007.

[152] Tableau Software, Inc. Tableau professional edition, 2008. http://www. tableausoftware.com/ as retrieved on 06/03/2008.

[153] Andrew S. Tanenbaum. Computer networks. Pearson Education, Inc., 4th edition, 2002.

[154] Soon Tee Teoh, Kwan-Liu Ma, Shyhtsun Felix Wu, and T.J. Jankun-Kelly. Detecting flaws and intruders with visual data analysis. IEEE Computer Graphics and Applica- tions, 24(5):27–35, 2004.

[155] Daniel R. Tesone and John R. Goodall. Balancing interactive data management of massive data with situational awareness through smart aggregation. In IEEE Symposium on Visual Analytics Science and Technology, pages 67–75, 2007.

[156] Jim Thomas. Visual analytics: a grand challenge in science - turning information over- load into the opportunity of the decade. In Proceedings IEEE Symposium on Informa- tion Visualization, page xii. IEEE Computer Society, 2005. Keynote address.

[157] Jim Thomas and Kristin A. Cook. Illuminating the Path: Research and Development Agenda for Visual Analytics. IEEE-Press, 2005.

[158] Jim Thomas and Kristin A. Cook. A Visual Analytics Agenda. IEEE Transactions on Computer Graphics and Applications, 26(1):12–19, January/February 2006.

[159] W.S. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 17(4):401–419, 1952.

[160] Ying Tu and Han-Wei Shen. Visualizing changes of hierarchical data using treemaps. IEEE Transactions on Visualization and Computer Graphics, 13(6):1286–1293, 2007.

[161] Edward R. Tufte. The visual display of quantitative information. Graphics Press, Cheshire, CT, USA, 1986. Bibliography 167

[162] Edward R. Tufte. Envisioning information. Graphics Press, Cheshire, CT, USA, 1990.

[163] Edward R. Tufte. Visual explanations: images and quantities, evidence and narrative. Graphics Press, Cheshire, CT, USA, 1997.

[164] Edward R. Tufte. Beautiful Evidence. Graphics Press, Cheshire, CT, USA, 2006.

[165] John W. Tukey. Exploratory Data Analysis. Addison-Wesley, Reading MA, 1977.

[166] Aaron Turner. U.s. critical infrastructure in serious jeopardy, May 2007. http:// www2.csoonline.com/exclusives/column.html?CID=32893.

[167] Marc van Krevelda and Bettina Speckmann. On rectangular cartograms. Computational Geometry, 37(3):175–187, 2007.

[168] Jean-Pierre van Riel and Barry Irwin. InetVis, a visual tool for network telescope traffic analysis. Proceedings of the 4th international conference on Computer graphics, virtual reality, visualisation and interaction in Africa, pages 85–89, 2006.

[169] Jarke J. van Wijk. The value of visualization. In IEEE Conference on Visualization, 2005.

[170] Jarke J. Van Wijk and Edward R. Van Selow. Cluster and calendar based visualization of time series data. In Proceedings of the IEEE Symposium on Information Visualization, pages 4–9. IEEE Computer Society, 1999.

[171] V. Rao Vemuri, editor. Enhancing Computer Security with Smart Technology. Auerbach Publications, 2006.

[172] George Verghese. Network Algorithmics – an interdisciplinary approach to designing fast networked devides. Elsevier/Morgan Kaufmann, 2005.

[173] Juha Vesanto. SOM-based data visualization methods. Intelligent Data Analysis, 3(2):111–126, 1999.

[174] Fernanda B. Viegas, Danah Boyd, David H. Nguyen, Jeffrey Potter, and Judith Donath. Digital artifacts for remembering and storytelling: Posthistory and social network frag- ments. 37th Annula Hawaii International Conference on System Sciences (HICSS’04) - Track 4, 2004.

[175] Fernanda B. Viegas, Scott A. Golder, and Judith S. Donath. Visualizing email content: portraying relationships from conversational histories. Proceedings of the SIGCHI con- ference on Human Factors in computing systems, pages 979–988, 2006.

[176] Svetlana Vinnik and Florian Mansmann. From analysis to interactive exploration: Building visual hierarchies from OLAP cubes. In Proceedings of the 10th International Conference on Extending Database Technology, pages 496–514, 2006. 168 Bibliography

[177] Matthew O. Ward. Creating and manipulating n-dimensional brushes. In Proceedings of the Joint Statistical Meeting, 1997.

[178] Colin Ware. Information Visualization, Perception for Design. Morgan Kaufmann Publishers, 1st edition, 2000.

[179] Martin Wattenberg. A Note on Space-Filling Visualizations and Space-Filling Curves. In Proceedings of the IEEE Symposium on Information Visualization, 2005.

[180] S. Whittaker and C. Sidner. Email overload: exploring personal information manage- ment of email. In The 1996 Conference on Human Factors in Computing Systems, CHI 96, pages 276–283, 1996.

[181] Leland Wilkinson. The Grammar of Graphics. Springer, 1999.

[182] Leevar Williams, Richard Lippmann, and Kyle Ingols. An interactive attack graph cascade and reachability display. In Proceedings of the Workshop on Visualization of Computer Security 2007. Springer, 2008. to appear.

[183] Matt Williams and Tamara Munzner. MDSteer: Steerable, progressive multidimen- sional scaling. In Proceedings IEEE Symposium on Information Visualization, 2004.

[184] Pak Chung Wong and Jim Thomas. Visual analytics - guest editors’ introduction. IEEE Transactions on Computer Graphics and Applications, September/October 2004.

[185] A. Wood. Intrusion detection: Visualizing attacks in ids data. Giac gcia practical, SANS Institute, February, 2003.

[186] Ling Xiao, John Gerth, and Pat Hanrahan. Enhancing visual analysis of network traffic using a knowledge representation. In Visual Analytics Science and Technology (VAST), pages 107–114, October 2006.

[187] XMLA. Report portal 2.0, 2005. http://www.reportportal.com as retrieved on 30/12/2005.

[188] Di Yang, Elke A. Rundensteiner, and Matthew O. Ward. Analysis guided visual ex- ploration of multivariate data. In IEEE Symposium on Visual Analytics Science and Technology, pages 83–90, 2007.

[189] Jing Yang, Matthew O. Ward, Elke A. Rundensteiner, and Anilkumar Patro. Interring: a visual interface for navigating and manipulating hierarchies. Information Visualization, 2(1):16–30, 2003.

[190] Xiaoxin Yin, William Yurcik, Michael Treaster, Yifan Li, and Kiran Lakkaraju. Vis- flowconnect: netflow visualizations of link relationships for security situational aware- ness. In VizSEC/DMSEC, pages 26–34, 2004. Index

Symbols D 0-Day-Exploit, 22 data link layer, 11 data warehouse, 25 A demilitarized zone, 23 access control, 19 Denial of Service, 22 ADVIZOR, 32 – Distributed, 23 anonymization, 20 dimension, 26 anti virus software, 23 dimension node, 126 application layer, 12 Domain Name System, 16 Arp Poison Routing, 17 ARPANET, 10 E authentication, 19 Email Mining Toolkit, 142 Autonomous System, 15 email overload, 142 Autonomous System Number, 15 encryption, 18 availability, 19 – asymmetric, 19 – symmetric, 19 B Ethernet Hardware Address, 15 BI tools, 26 ETL, 25 Border Gateway Protocol, 15 botanical tree, 71 F botnet, 22 fact – table, 27 C facts, 26 Chernoff Faces, 113 faMailiar, 142 Classful addressing, 15 firewall, 23 Classless Inter-Domain Routing, 15 flow map, 103 codebook vector, see reference vector FP-Viz, 115 color mapping, 40, 74, 105 Fruchterman-Reingold algorithm, see sring color scale, see color mapping embedder127 competitive learning algorithm, 147 component plane, 143 G computer virus, see virus gateway, 15 computer worm, see worm glyphs, 113 Cone Tree, 70 graph visualization confidentiality, 19 – abstract, 102 cube, see hypercube – geographic, 103 170 Index

H MDS, 124 H-Curve, 71 – subquadratic, 124 harvest, 22 MDSteer, 124 hierarchical edge bundles, 103, 104 measure, 26 Hierarchical Network Map, 69 message integrity, 19 Hilbert Curve, 71 multidimensional data model, 26 HistoMap 1D, 82 HistoMap layout, 79 N HistoScale, 72 neighborhood function, 147 HNMap, see Hierarchical Network Map netflows, 17 honey pot, 24 Network Address Translation, 13 host, 10 network interface layer, see data link layer – files, 16 network layer, 11 HSI, 40 network scan, 21 hypercube, 26 neuronal network, 147 node-link diagrams, 102 I nonrepudiation, 19 information visualization, 36 normalization, 40 interaction, 89, 106, 116, 118, 127 O Internet, 10 observation node, 126 Internet Assigned Number Authority, 13 On-Line Analytical Processing, 25 Internet Assigned Numbers Authority, 16 On-Line Transactional Processing, 25 Internet Control Message Protocol, 11 Open System Interconnection, see OSI ref- Internet Group Management Protocol, 11 erence model internet layer, see network layer OSI reference model, 10 Internet Protocol, 11, 13 internet relay chat, 22 P Interring, 115 packet sniffing, 17 Intrusion Detection System, 21, 24 parallel coordinates, 113 IP address, 14 PCA, 124 phishing, 23 J physical layer, 11 jigsaw map, 71 pivot table, 27, 113 JPcap, 18 Pixel Bar Charts, 113 L pixel visualization techniques, 113 layout preservation, 86 PixelMap, 72 learning phase, 146 Pixnostics, 113 libpcap, 18 Polaris, 32, 113 link layer, see data link layer POP3, 12 Local Area Network, 10 port numbers – dynamic, 16 M – private, see port numbers, dynamic MAC address, 15 – registered, 16 materialized views, 31 – well-known, 16 Index 171 port scan, 21 – visualization goals, 39 PostHistory, 142 TCP, see Transmission Control Protocol prefix, 14 TCP/IP, 10 ProClarity, 32 tcpdump, 18 promiscuous mode, 17 Themail, 143 pseudocoloring, 40 Thread Arcs, 142 public key cryptography, 19 Time Searcher, 55 time series, 54 R top-level domain, 16 Radial Traffic Analyzer, 115 
Transaction Control Protocol, 15 rectangle packing algorithm, 71 Transmission Control Protocol, 11 recursive pattern, 55 transport layer, 11 – extended, 56 Treemap, 71 – host, 91 – Million Items, 71 reference vector, 143 – Ordered, 71 Remail, 142 – Slice-and-dice, 71 Request For Comment, 10 – Squarified, 71 resource records, 17 – Strip, 71 routing, 14 Trojan, 22 – table, 15

S U secure communication, 18 U-matrix, 143 signature, 19 User Datagram Protocol, 11, 15 Skype, 16 V smart aggregation, 32 virus, 22 SNARF, 142 visual analytics, 42 snowflake schema, 27 – scope, 42 Social Network Fragment, 142 visual scalability, 77 Solar Plot, 115 vizsec tools space-filling curves, 71 – Avatar, 48 spam, 23 – BGPlay, 51 spring embedder, 127 – IDGraphs, 47 SSH tunnel, 20 – IDS Rainstorm, 49 star schema, 27 Strip Treemap layout, 83 – InetVis, 46 summary tables, 31 – Intrusion Detection toolkit, 113 Sunburst, 115 – IP matrix, 71 – knowledge representation, 48 T – Link Rank, 50 Tableau Software, 113 – LiveRAC, 55 taxonomy, 36 – Nuance, 49 – semantics of interactive visualizations, 39 – NVisionIP, 48 – task by data type, 36 – Portall, 47 – visual data analysis techniques, 39 – PortVis, 46 172 Index

– RNA visualization module, 50 – Root Polar Layout, 114 – Rumint, 48 – Skitter, 103 – SnortView, 49 – SourceFire, 50 – Spinning Cube of Potential Doom, 46 – Starlight, 48 – Starmine, 50 – TNV, 47 – VIAssist, 48 – VisAware, 114 – VisFlowConnect, 113 – Visual Firewall, 49

W WebSom, 143 Wide Area Network, 10 winner neuron, 146 WinPcap, 18 Wireshark, 18 World Wide Web, 10 worm, 22

X XmdvTool, 113 XMLA, 32 Acronyms

ARPANET  Advanced Research Projects Agency Network
APR  Arp Poison Routing
AS  Autonomous System
ASN  Autonomous System Number
BGP  Border Gateway Protocol
CIDR  Classless Inter-Domain Routing
DARPA  Defense Advanced Research Projects Agency
DDoS  Distributed Denial of Service
DMZ  DeMilitarized Zone
DoS  Denial of Service
EHA  Ethernet Hardware Address
ETL  Extract, Transform, Load
FTP  File Transfer Protocol
GUI  Graphical User Interface
HNMap  Hierarchical Network Map
HTTP  HyperText Transfer Protocol
IANA  Internet Assigned Numbers Authority
ICMP  Internet Control Message Protocol
IDS  Intrusion Detection System
IGMP  Internet Group Management Protocol
IRC  Internet Relay Chat
IMAP  Internet Message Access Protocol
IP  Internet Protocol
LAN  Local Area Network
MAC  Media Access Control
MDS  Multi-Dimensional Scaling
NAT  Network Address Translation
NFS  Network File System
OLAP  On-Line Analytical Processing
OLTP  On-Line Transactional Processing
OS  Operating System
OSI  Open System Interconnection
PCA  Principal Component Analysis
POP  Post Office Protocol
POSIX  Portable Operating System Interface
RFC  Request For Comment

RIP  Routing Information Protocol
RNA  Real-time Network Awareness
RTA  Radial Traffic Analyzer
SANS  System Administration and Network Security
SMTP  Simple Mail Transfer Protocol
SOM  Self-Organizing Map
SRI  Stanford Research Institute
SSH  Secure Shell
TCP  Transmission Control Protocol
TTT  Data Type by Task Taxonomy
TELNET  TELetype NETwork
UDP  User Datagram Protocol
WAN  Wide Area Network
WWW  World Wide Web