qflow: a fast customer-oriented NetFlow database for accounting and data retention

Hallgrímur H. Gunnarsson

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science
University of Iceland
2014

QFLOW: A FAST CUSTOMER-ORIENTED NETFLOW DATABASE FOR ACCOUNTING AND DATA RETENTION

Hallgrímur H. Gunnarsson

60 ECTS thesis submitted in partial fulfillment of a Magister Scientiarum degree in Computer Science

Advisors
Snorri Agnarsson
Helmut Neukirchen

Faculty Representative
Jón Ingi Einarsson

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science
School of Engineering and Natural Sciences
University of Iceland
Reykjavik, September 2014

qflow: a fast customer-oriented NetFlow database for accounting and data retention

60 ECTS thesis submitted in partial fulfillment of an M.Sc. degree in Computer Science

Copyright © 2014 Hallgrímur H. Gunnarsson
All rights reserved

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science
School of Engineering and Natural Sciences
University of Iceland
Hjarðarhagi 2-6
107 Reykjavík, Iceland

Telephone: 525 4000

Bibliographic information: Hallgrímur H. Gunnarsson, 2014, qflow: a fast customer-oriented NetFlow database for accounting and data retention, M.Sc. thesis, Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland.

Printing: Háskólaprent, Fálkagata 2, 107 Reykjavík
Reykjavik, Iceland, September 2014

Abstract

Internet service providers in Iceland must manage large databases of network flow data in order to charge customers and comply with data retention laws. The databases need to efficiently handle large volumes of data, often billions or trillions of records, and they must support fast queries of traffic volume per customer over time and extraction of raw flow data for given customers.

Popular open-source tools for storing flow data, such as nfdump and flow-tools, are backed by flat binary files. They do not provide any type of indexing or summaries of customer traffic. As a result, flow queries for a given customer need to linearly scan through all the flow records in a given time period.

We present a high-performance customer-oriented flow database that provides fast customer queries and compressed flow storage. The database is backed by indexed flow tablets that allow for fast extraction of customer flows and traffic volume per customer.

Útdráttur

Internet service providers in Iceland need to store large volumes of network measurement data in order to bill for network usage and comply with data retention laws. Database systems for such data must handle very large numbers of records, often billions or trillions, and they must support fast lookups of traffic volume and of raw flow records per customer.

Popular open-source tools, e.g. nfdump and flow-tools, store the records in flat files. They offer neither indexes for fast search nor summaries of per-customer traffic volume. Consequently, every record must be read in order to answer a query.

In this thesis we present a new high-performance, customer-oriented database system for flow data. The system stores the data in a compressed format while still supporting fast per-customer queries. The database is built on a collection of small table shards (tablets) that together form a single whole. Each tablet is accompanied by an index that enables fast queries for a customer's flow records and traffic volume.


Contents

List of Figures

List of Tables

1. Introduction
   1.1. Motivation
   1.2. Requirements
   1.3. Contribution
   1.4. Related work
   1.5. Structure of thesis

2. Flow-based monitoring
   2.1. Network monitoring
   2.2. Flow probes
      2.2.1. Overview
      2.2.2. Flow export
      2.2.3. Packet sampling
   2.3. Cisco NetFlow
      2.3.1. History
      2.3.2. Version 5
      2.3.3. Version 9
      2.3.4. Storage requirements
   2.4. Observation points
      2.4.1. Edge deployment
      2.4.2. Ingress/egress monitoring
      2.4.3. Deployment strategies
      2.4.4. Customer traffic

3. Design and implementation
   3.1. Architecture
   3.2. Collector
      3.2.1. Design
      3.2.2. Flow format
      3.2.3. Configuration
      3.2.4. Backend protocol
   3.3. Database
      3.3.1. Design
      3.3.2. Table queue
      3.3.3. Record format
      3.3.4. Indexer
      3.3.5. Tablets
      3.3.6. Materialized views
   3.4. Filtering
      3.4.1. Language
      3.4.2. Implementation
   3.5. Reports
      3.5.1. Flow extraction
      3.5.2. Flow summary
      3.5.3. Flow filter
      3.5.4. Time-based reports
      3.5.5. Customer reports

4. Evaluation
   4.1. Environment
   4.2. Collector
      4.2.1. Preparation
      4.2.2. Results
   4.3. Indexer
   4.4. Flow storage
   4.5. Flow extraction
      4.5.1. Preparation
      4.5.2. Results
   4.6. Materialized views

5. Conclusions
   5.1. Summary
   5.2. Future work

Bibliography

A. Flow protobuf

B. Collector configuration protobuf

C. Grammar for the filter language

List of Figures

2.1. Flow probe internals
2.2. Relative sampling error
2.3. NetFlow v5 export packet
2.4. Structure of NetFlow v9 export packet
2.5. NetFlow v9 template flowset
2.6. NetFlow v9 data flowset
2.7. NetFlow edge deployment
2.8. Example with both ingress and egress monitoring enabled
2.9. Provider network with both ingress and egress monitoring enabled
3.1. An overview of the qflow system
3.2. Collector overview
3.3. Directory layout of the flow database
3.4. Flow capture pipeline
3.5. Structure of a block
3.6. Directory layout for flow tablets
3.7. Internal layout of a flow tablet
3.8. View file format
3.9. Parse tree for example filter expression
4.1. File size of materialized view
4.2. Update time for materialized view
4.3. Query time for materialized view
4.4. Export time for materialized view

List of Tables

2.1. Format of NetFlow v5 header
2.2. Format of NetFlow v5 record
2.3. Format of NetFlow v9 header
4.1. Collector performance results for NetFlow v5
4.2. Collector performance results for NetFlow v9
4.3. Indexer performance for 10M records
4.4. Indexer performance for 20M records
4.5. Indexer performance for 30M records
4.6. Indexer performance for 40M records
4.7. Storage efficiency of qflow vs. flow-tools
4.8. Flow extraction performance for a single IP
4.9. Flow extraction performance for a network


1. Introduction

1.1. Motivation

Internet service providers (ISPs) in Iceland must store and query large volumes of network traffic monitoring data in order to charge customers and comply with data retention laws. A typical ISP might need to store and process up to 50K flow records per second, which translates into more than 8 GB per hour of monitoring data [9]. Data retention laws require the data to be stored for six months, and for 50K records per second it would take 34 TB to store the resulting 770 billion records.
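To make the arithmetic explicit, the short Python sketch below roughly reproduces these figures; the 44-byte average record size is an assumption back-derived from the 34 TB / 770 billion records estimate, not a measured value.

RECORDS_PER_SECOND = 50_000
BYTES_PER_RECORD = 44              # assumed average on-disk record size
SECONDS_PER_MONTH = 30 * 24 * 3600
RETENTION_MONTHS = 6

records = RECORDS_PER_SECOND * SECONDS_PER_MONTH * RETENTION_MONTHS
terabytes = records * BYTES_PER_RECORD / 1e12

print(f"{records / 1e9:.0f} billion records")   # roughly 780 billion
print(f"{terabytes:.0f} TB of raw flow data")   # roughly 34 TB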

A number of commercial and open source solutions exist for storing and managing network flow data, e.g. Cisco NetFlow collector, pmacct, flow-tools and nfdump. They usually store flow records in either a relational database or raw binary files. In general, relational databases offer flexibility and a powerful query language, but they are known to be slower, especially in terms of insertion rate, and consume more disk space when compared to specialized tools that use raw binary files [6].

Furthermore, scaling relational databases to billions or trillions of records can be a real challenge. For example, the iiBench 1B row insert benchmark for MySQL shows that the insert rate is highly dependent on the table size. As the table grows and the data no longer fits in memory, performance degrades dramatically. At the beginning of the benchmark, MySQL can sustain around 40,000 inserts per second, but after 200M rows have been inserted, the rate has fallen to around 5,000 inserts per second. Close to 1B rows, the rate is down to 876 inserts per second. At 50K records per second, the table would reach 200M rows in about an hour, and 1B rows in less than six hours.

In contrast, tools based on raw binary files can handle a high insertion rate, e.g. nfdump can store over 250K records per second on a dual-core machine [6]. The binary files are flat, without any sort of index, and the insertion rate does not degrade over time. The flow records are written in the order they are received and files are usually rotated regularly, e.g. every 5, 10 or 15 minutes.

Although they support fast insertion, such tools present two problems when they are used for data retention and billing. First, extracting flows for a given customer requires sequentially scanning all the flow records within the given time period. For a typical ISP, this could mean scanning hundreds of gigabytes or terabytes to locate a relatively small number of records.

Secondly, fetching hourly/daily/monthly traffic summaries per customer also re- quires sequentially scanning all the flow records. Typically, ISPs write scripts to precompute summaries and store them in a relational database. After each rotation of a raw binary file, a script will compute the volume per IP and update summaries in a relational database. This can also be problematic to scale when the number of files and IPs grows large.

In this thesis we present a new system that stores flow records in custom sorted flow tables that allow for fast extraction of customer flows. In addition, the system maintains an index for each flow table that can be used to answer queries about traffic volume per customer, and it also maintains higher-level aggregate summaries (hourly, daily, monthly) in an efficient manner.

1.2. Requirements

This section describes the key requirements of the system. The requirements are based on our experience working with Icelandic ISPs on flow accounting and data retention.

1. Data retention fulfillment

The EU data retention directive, adopted into article 42 of the Icelandic Electronic Communications Act no. 81/2003 [1], requires that every Internet service provider retain a log of Internet traffic metadata for the purpose of law enforcement. The log must contain customer IP addresses, all connections that were established, the time of each connection, and the IP address of both connection endpoints. The log must be retained for six months, after which any older records are to be removed. The ISP must be able to deliver customer metadata to law enforcement upon request.

2. Compressed storage

Given the large volume of data and the requirement to store it for six months, the system should store the data in a compressed format.

Typical zlib compression ratios are on the order of 2:1 to 5:1 [13]. Given 50K records per second and 34 TB of data over six months, zlib compression could be expected to reduce the storage requirement to somewhere between 6 and 17 TB.

3. Traffic summaries

The system needs to maintain time-based aggregates of traffic volume per customer IP, e.g. hourly, daily and monthly aggregates. The time periods should be configurable. Updates and queries must be fast.

This requirement is motivated by the needs of both ISPs and their customers. First, for usage-based charging of network traffic, an ISP needs to determine the traffic volume per customer during a given billing period (usually 1 month). This data is then imported into a billing system at the end of the billing period.

Secondly, according to the Icelandic regulation no. 526/2011 on electronic telecommunications billing [18], customers are entitled to a detailed itemization of bills for telecommunications charges. In particular, they are entitled to data usage details broken down by time period with hourly precision. Internet service providers usually provide a web interface where customers can view their traffic volumes broken down by different time periods (monthly, daily, hourly).

4. Flow extraction

When necessary, the system must be able to drill down and extract raw records for a given customer. This is necessary both for data retention fulfillment, and to settle any billing disputes when customers complain about usage-based charges.

The system should respond to extraction queries in a reasonable amount of time. This condition precludes the possibility of sequentially scanning the records to filter out customer records, since it would be prohibitively expensive to scan gigabytes or terabytes for each extraction query.

5. Scalability

Ideally, the system should scale to any volume of data given that enough hardware is available to store and process it.

To support large installations that exceed the capacity of a single machine, the system should be able to run on multiple machines in a distributed fashion.


6. Customer tagging

Internet service providers usually use some sort of an IPAM (IP address management) system to manage and allocate IP blocks within their IP space. At a given point in time, every active IP address should be allocated to a specific customer. IP allocations can change over time, and some IP allocations can be dynamic, e.g. with RADIUS, where users are assigned an IP for a limited period of time from a shared pool of IPs.

To accurately track and account for traffic volume per customer, every flow record should be associated with the appropriate customer according to the IPAM system. Customer allocations in the IPAM system are usually associated with some type of unique identifier that is understood by the billing system. By “tagging” flows with this identifier, the correct customer can be charged when the records are summarized and imported into the billing system.

The system must support live tagging of flows as they arrive. The tag should follow the flow throughout the system and also be part of any aggregate, such that it is possible to separate the traffic volume of two customers that shared the same IP during a given billing period. It must be possible to push updated tags to the system.

7. Destination-sensitive billing

In Iceland, the dominant ISP billing model is to charge customers only for foreign traffic that originates or terminates outside of Iceland. Domestic traffic, exchanged within Iceland, is free. This is because submarine cables are expensive and foreign bandwidth is still scarce relative to domestic bandwidth.

The system must provide features to distinguish and separately account foreign and domestic traffic. It must work both for smaller ISPs that do not have their own foreign links and purchase foreign transit from a larger upstream provider, and for larger ISPs that have their own dedicated foreign links.
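To illustrate requirement 6 (customer tagging) above, a tagger can be thought of as a longest-prefix match from IP prefixes to customer identifiers. The sketch below is purely illustrative — the prefixes and tag names are made up — and it is not the qflow implementation.

import ipaddress

# Hypothetical tag table exported from the IPAM system: prefix -> customer identifier.
TAGS = {
    ipaddress.ip_network("192.0.2.0/25"): "CUST-1001",
    ipaddress.ip_network("192.0.2.0/24"): "CUST-1000",
}

def tag_for_ip(ip_str):
    """Return the tag of the most specific matching prefix, or None for unknown IPs."""
    ip = ipaddress.ip_address(ip_str)
    best = None
    for prefix, tag in TAGS.items():
        if ip in prefix and (best is None or prefix.prefixlen > best[0].prefixlen):
            best = (prefix, tag)
    return best[1] if best else None

print(tag_for_ip("192.0.2.10"))    # CUST-1001 (the more specific /25 wins)
print(tag_for_ip("192.0.2.200"))   # CUST-1000
print(tag_for_ip("198.51.100.1"))  # None (not a customer IP)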

1.3. Contribution

In this thesis, we present a new flexible high-performance NetFlow accounting system. It meets all the key requirements laid out in the previous section and provides several key features that are not present in other flow accounting systems.


Our first contribution is a flow collector that can distribute flows to multiple backends in a configurable way. Unlike other systems (e.g. flow-tools and flowd), our system separates the collection and the capture (storage) of flows. This separation allows us to distribute the processing of flows over multiple machines. Furthermore, the collector is stateless, and a large ISP could run multiple redundant collector replicas for increased reliability and scalability. Finally, the collector can pre-process the flows and apply common business logic, such as filtering and tagging.

Our second contribution is a compressed customer-oriented database for storing flows. The database is designed to sustain high insertion rates and provide fast access to customer flows and traffic summaries without a sequential scan of all the flows. Other systems, e.g. flow-tools and flowd, store flows in flat binary files, and extracting customer flows requires scanning all the flows.

1.4. Related work

nfdump [15], silk [11] and flow-tools [8] are the most popular tools for storing flows. When it comes to data retention and flow accounting, these systems have several shortcomings. First, all of them store flow records in compressed flat binary files. They do not provide any sort of indexing. As a result, the entire set of flows needs to be sequentially scanned (and decompressed) to extract the flows of a given customer. Secondly, they do not provide any facilities for maintaining higher-level summaries of customer traffic. Finally, these tools are designed as monolithic collectors that dump flows to disk. They provide limited options for distributing work across multiple machines.

Column-oriented databases using bitmap indexing have been proposed for indexing flow data [6, 9, 19]. A column-oriented database stores its data vertically in columns rather than rows. The values in a given column are stored contiguously, which often improves compression ratios due to data similarity. Furthermore, a query only needs to read the columns that appear in the query, leading to less I/O when compared to row-oriented databases, especially when dealing with rows with a large number of attributes, such as flow records.

A bitmap index can be constructed for a given column, and queries can be answered by performing bitwise logical operations on the bitmaps. Bitmap indexes provide good performance for large read-only datasets [19], which makes them especially convenient for flow data.

The proposed column-oriented flow systems offer a more general solution to the flow extraction problem than our system, but none of them meet all of our other requirements. The column-oriented systems are primarily designed for fast offline analysis of flow data, and they lack essential flow accounting features such as NetFlow v9 support, an extensible storage format, online customer identification/tagging and destination-sensitive billing.

1.5. Structure of thesis

The remainder of the thesis is organized as follows. Chapter 2 reviews the necessary background information on flow-based monitoring and the NetFlow technology. Chapter 3 describes the architecture, design and implementation of the qflow system. Chapter 4 presents our evaluation and experimental results. Finally, Chapter 5 concludes and outlines future work.

2. Flow-based monitoring

This chapter describes flow-based monitoring and the NetFlow technology.

2.1. Network monitoring

The three most common approaches to network monitoring are SNMP, packet traces, and network flows. Here we briefly review each one and discuss its applicability to data retention and traffic billing.

Simple Network Management Protocol (SNMP) is the de facto standard protocol used for managing and monitoring network devices such as routers and switches. Devices run an SNMP agent (server) which exposes various knobs and metrics, such as CPU load, memory usage, and interface counters. Clients collect metrics by polling SNMP servers at a regular frequency.

Within ISPs, SNMP is used to monitor link utilization and, in some cases, it is also used for billing, e.g. billing based on 95th percentile bandwidth is common for wholesale and large links. However, SNMP is not suitable for data retention or destination-sensitive billing because it only provides a coarse-grained overview of link utilization and traffic volume. It does not provide any information about the traffic itself, such as the source/destination IP.

In contrast to SNMP, packet traces provide a complete view of the traffic. Packet headers are collected by passively tapping links and feeding the packets to a nearby server for processing. Packet traces are popular within the network research community because they provide the greatest level of detail for studying traffic behavior. However, using packet traces for network accounting and data retention is impractical because it is expensive, both in terms of the hardware for collecting traces on high-speed links and the cost of storing traces over a long period of time. Moreover, the method is unnecessarily expensive because packet traces provide much more detail than is required for accounting and data retention.

Network flows provide a good compromise in terms of cost and detail. Most network devices support exporting network flow information to an external collector (server). Flows contain less detail than packet-level traces, since they do not provide information on individual packets. Instead, flows provide a “connection-level” view of the traffic. A network flow record summarizes a set of packets that share common header attributes and keeps track of aggregate statistics, such as the total bytes and packets belonging to each flow.

The rest of this chapter will discuss network flows and flow-based monitoring in greater detail.

2.2. Flow probes

2.2.1. Overview

Flow-based traffic measurement methods are based on monitoring network traffic in terms of flows. Packets are observed at an observation point by a flow probe which aggregates packets into flows and exports flow records to an external flow collector. An observation point is a location in the network where packets can be observed, e.g. an interface on a router. Flow probes are usually embedded into flow-capable network devices, such as Cisco routers, but it is also possible to run a standalone software probe on a server, e.g. nProbe [5], that observes packets using a network tap or port mirroring on a switch.

A flow is defined as a set of IP packets passing through an observation point during a certain time interval and sharing a set of common properties (the flow key), e.g. source/destination IP, source/destination port, input/output interface, protocol [4]. The flow key can be configurable or fixed, depending on the underlying flow export protocol. Older flow protocols, such as Cisco’s NetFlow versions 1 to 8, use a fixed flow key and some of the NetFlow versions differ only in which flow key they use. Newer protocols, such as Cisco’s NetFlow version 9 and IETF’s IPFIX, are designed to be extensible. They use a flexible template-based approach that allows the user to specify flow keys (“templates”) and multiple flow keys can exist within the same flow probe.


Figure 2.1: Flow probe internals (the router inspects each packet arriving on an interface, derives the flow key — source/destination IP address, source/destination port, protocol, etc. — and updates the matching flow record's packet and byte counters in the flow cache)

The flow key specifies the granularity of the aggregation that happens in the flow probe. For each observed flow, the probe maintains a flow record that includes counters such as the total number of packets/bytes in the flow and the time of the first and last packet in the flow. Active flow records are maintained in a flow cache and the probe updates the appropriate flow record for each observed packet. A small flow key will result in fewer flows being tracked and less memory usage, but it will also provide less detail on the observed network traffic.
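The update loop of a flow cache can be sketched as follows. This is a simplified model for illustration, not an actual probe implementation, and the flow key shown is just one possible choice.

import time
from collections import namedtuple

# One possible flow key: the classic 5-tuple plus the input interface.
FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port proto in_if")

class FlowRecord:
    def __init__(self, now):
        self.packets = 0
        self.bytes = 0
        self.first = now          # time of the first packet in the flow
        self.last = now           # time of the most recent packet

flow_cache = {}                   # FlowKey -> FlowRecord

def observe_packet(key, length, now=None):
    """Update (or create) the flow record that matches this packet."""
    now = time.time() if now is None else now
    record = flow_cache.get(key)
    if record is None:
        record = flow_cache[key] = FlowRecord(now)
    record.packets += 1
    record.bytes += length
    record.last = now

# Two packets of the same TCP connection land in a single flow record.
key = FlowKey("192.0.2.10", "198.51.100.5", 40000, 80, 6, 1)
observe_packet(key, 1500)
observe_packet(key, 40)
print(flow_cache[key].packets, flow_cache[key].bytes)   # 2 1540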

Flow records have a limited lifetime at the probe. A flow record is created when the first packet belonging to that flow is observed, and it expires when one of five events occurs:

1. The idle timeout of the flow expires, i.e. there has been no activity on the flow for some time.

2. The natural end of the flow, e.g. TCP connection is closed. This only applies to protocols that have connection-oriented semantics, such as TCP.

3. The active timeout of the flow. This ensures that long-lived active flows are regularly exported.


4. The flow cache is full. The details depend on the probe implementation, but it might, for example, evict the oldest flow record to make room for new entries.

5. Overflow protection. The flow record counters have a fixed size and a record is expired if an update would cause an overflow to occur.

An expired flow record is queued for export to an external collector. For the sake of efficiency, the queued records are usually batched into flow export packets according to the semantics of the underlying export protocol, e.g. each NetFlow version 5 export packet can carry up to 30 flow records.

2.2.2. Flow export

Each flow probe (also known as a flow exporter) is configured to export flows to one or more collectors at a given IP and port. Most flow probes, especially those embedded in routers, use a simple exporting strategy: each packet is sent to every configured collector. However, some devices, such as Cisco routers, only support two collector destinations. If more destinations are required, the fan-out must be implemented on the collector side.

Export packets are usually transported from probe to collector using either UDP or SCTP. Originally, NetFlow only supported UDP, and therefore many older devices still only support UDP. However, exporting via UDP has several major disadvantages [2]:

1. UDP is congestion unaware. The exporter sends packets as fast as it can generate them, without regard to available bandwidth or how fast the collector is consuming them.

2. UDP is unreliable. Packets can be lost, duplicated, or delivered out of order. The collector must be robust and tolerate such events.

3. UDP is vulnerable to spoofing and packet insertion. Due to the lack of a handshake, an attacker could blindly spoof packets from the exporter.

SCTP is a reliable message-oriented transport layer protocol. It has many advantages over UDP, including:

1. SCTP is congestion aware and provides congestion control similar to that of TCP. Messages are buffered until they can be sent.


2. SCTP is reliable. Messages are buffered until they have been acknowledged by the collector. Lost messages are retransmitted.

3. SCTP uses a 4-way handshake with signed cookies, which prevents spoofing.

Furthermore, SCTP opens the door for more advanced features in the flow probe. For example, it is possible to configure a backup collector in Cisco routers. The router sends SCTP heartbeat messages to the primary and backup collectors. If the primary goes down, it will start exporting to the backup collector. This is not possible when exporting via UDP because it is a “fire-and-forget” protocol. It is connectionless and the probe communication is only in one direction: the flow probe talks to the collector, but the collector doesn’t talk back.

As a result, SCTP is generally preferred when it is available. This is especially true when flow collection serves a critical business function, such as when customers are charged for usage based on flow data. In that case, lost records equal lost revenue.

2.2.3. Packet sampling

Aside from the tradeoff between flow detail and memory, there is also a tradeoff between flow accuracy and CPU cost. As the packet rate grows, it can be expensive for routers to examine every packet. Most flow probes support packet sampling to reduce measurement overhead. Instead of processing every packet, the probe will sample packets with a probability p, i.e. the flow cache is updated only for a fraction p of all packets. Later, the external flow collector can renormalize (invert) the flow counters by multiplying by 1/p to obtain an unbiased estimate of the original traffic [7].

The 95% confidence interval of the estimated percent error can be approximated by [16]:

\[ \%\,\text{error} \leq 196 \sqrt{\frac{1}{c}} \tag{2.1} \]

where $c$ is the number of samples that belong to the traffic class, e.g. a single customer.
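Both the renormalization and the error bound are easy to evaluate numerically; the sketch below uses illustrative values only.

import math

def renormalize(sampled_bytes, p):
    """Unbiased estimate of the true byte count from counters sampled with probability p."""
    return sampled_bytes / p

def error_bound_percent(samples):
    """95% bound on the relative error, in percent, per equation (2.1)."""
    return 196.0 * math.sqrt(1.0 / samples)

print(renormalize(10_000_000, 1 / 100))          # 1e9: estimated bytes before sampling
for c in (100, 1_000, 10_000):
    print(c, round(error_bound_percent(c), 1))   # 19.6, 6.2, 2.0 (percent)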

Since the error is a function of the number of samples used to make the estimate, the accuracy can be increased by increasing the number of samples. This can be done in two ways:

1. Increasing the sampling probability

2. Sampling over a longer period of time


Figure 2.2: Relative sampling error (relative error in percent as a function of the number of samples in the class, from 1 to 10,000 samples)

Figure 2.2 shows the number of samples required to obtain a given % error.

In a typical billing application, where a service provider charges customers, as identified by IP address, for byte usage, the objective is to determine from all the packets traversing the network during the billing period (usually 1 month) how many bytes belong to a particular customer. In that case, the class in the error equation corresponds to a customer.

A common ISP billing strategy is to include an allowance, say 10 GB, and charge users for each additional GB. If the sampling probability is 1/100 and the observed interface has an MTU of 1500 bytes, then at least 70,000 samples will be taken for a customer that consumes 10 GB. Using the equation above, the error would be within about 0.75%. This lower bound on the number of samples assumes that the average packet size equals the MTU, which is the worst case, since it results in the fewest packets. For real traffic, the packet size distribution would most likely be bimodal, with very small (40 byte) and very large (1500 byte) packets being most common [21]. Consequently, the expected error would be even lower.

Finally, the customer charge can incorporate the confidence interval to counter the sampling error. Billing by the lower bound of the confidence interval would ensure that no customer is overcharged.


2.3. Cisco NetFlow

2.3.1. History

NetFlow is a set of protocols developed by Cisco for flow-based monitoring. Although originally developed by Cisco, NetFlow has become the de facto industry standard and is currently supported by a wide range of network devices from different vendors, e.g. Cisco, Alcatel and Juniper. Cisco has released nine versions of the NetFlow protocol, designated NetFlow v1 to v9. In recent years, the IETF, the Internet standards body, has defined a new protocol called IPFIX in an attempt to unify and standardize on a common flow protocol. IPFIX is heavily based on NetFlow v9 but also includes several new extensions. Informally, IPFIX is also known as NetFlow v10 because its export format is compatible with the NetFlow versioning scheme and it uses the NetFlow version identifier 10.

Currently, the most widely used and supported versions of NetFlow are v5 and v9. The versions before v5 are deprecated and the versions between v5 and v9 are mostly small variations of v5 that are seldom used or supported. Each variation required a new version because all NetFlow versions before v9 used a fixed export format and flow key. NetFlow v9 is designed to be future-proof. It is flexible and extensible: record formats are defined using templates and new fields can be added without changing the protocol itself.

NetFlow v5 is still the most popular format, but it is slowly being replaced by NetFlow v9. The main issue with NetFlow v5 is that it is not extensible and it lacks support for IPv6. As Internet service providers move to adopt IPv6, they need to be able to monitor and charge for IPv6 traffic as well.

In the sections below, we will describe the inner workings of NetFlow v5 and v9.

2.3.2. Version 5

NetFlow v5 defines a flow as a unidirectional sequence of packets that share the following predefined key fields [3]:

1. Source IP address.

2. Destination IP address.

3. Source port for UDP/TCP, otherwise 0.


4. Destination port for UDP/TCP, type/code for ICMP, otherwise 0.

5. IP protocol type, e.g. 6 for TCP, 17 for UDP.

6. IP Type of Service (ToS).

7. Ingress interface, using the SNMP index of the interface.

The values of the key fields differentiate one flow from another. Flow records also contain non-key fields that provide additional information about the flow, e.g. the nexthop IP address, but a change in the value of a non-key field does not create a new flow. It depends on the implementation, but in most cases, the values of non-key fields are decided by the first packet in the flow.

The NetFlow v5 export format is based on a fixed-length binary format with a fixed set of fields. Figure 2.3 shows the structure of a typical NetFlow v5 packet exported over UDP. Each packet contains a common header followed by a number of flow records. All fields are encoded using big-endian byte order. Table 2.1 shows the fields present in a NetFlow v5 header. Table 2.2 shows the fields present in a NetFlow v5 flow record.

Figure 2.3: NetFlow v5 export packet (IP and UDP headers, followed by the flow header — version, number of records, probe uptime, probe clock, residual nanoseconds, sequence number — and a sequence of flow records carrying the source, destination and nexthop IPs, input/output interfaces, packet and byte counters, and other fields)


Name               Type    Description
version            uint16  NetFlow export format version number, always 5
count              uint16  Number of flow records in this export packet (1-30)
uptime_ms          uint32  Time in milliseconds since this device booted
unix_secs          uint32  Time in seconds since epoch
unix_nsecs         uint32  Residual nanoseconds since epoch
flow_sequence      uint32  Sequence counter of total flows exported
engine_type        uint8   Type of flow-switching engine
engine_id          uint8   Slot number of the flow-switching engine
sampling_interval  uint16  Sampling mode (2 bits) and interval (14 bits)

Table 2.1: Format of NetFlow v5 header

When exporting over UDP, the sequence counter can be used to detect lost flows and duplicates. However, the detection must also take into account that packets can be received out of order and the flow probe might restart, which would reset the sequence counter.

Name       Type    Description
srcaddr    uint32  Source IP address
dstaddr    uint32  Destination IP address
nexthop    uint32  IP address of next hop router
input      uint16  SNMP index of the input interface
output     uint16  SNMP index of the output interface
dPkts      uint32  Total number of packets in the flow
dOctets    uint32  Total number of Layer 3 bytes in the packets of the flow
first      uint32  Uptime at start of flow
last       uint32  Uptime at the time the last packet of the flow was received
srcport    uint16  TCP/UDP source port number or equivalent
dstport    uint16  TCP/UDP destination port number or equivalent
pad1       uint8   Unused (zero) bytes
tcp_flags  uint8   Cumulative OR of TCP flags
prot       uint8   IP protocol type (for example, TCP = 6; UDP = 17)
tos        uint8   IP type of service (ToS)
src_as     uint16  Autonomous system number of the source
dst_as     uint16  Autonomous system number of the destination
src_mask   uint8   Source address prefix mask bits
dst_mask   uint8   Destination address prefix mask bits
pad2       uint16  Unused (zero) bytes

Table 2.2: Format of NetFlow v5 record
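As a concrete illustration of the fixed big-endian layout in Table 2.1, the sketch below decodes the 24-byte v5 header from the start of an export packet. It is a minimal example, not the qflow collector code.

import struct

# Big-endian NetFlow v5 header: version, count, uptime_ms, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval (24 bytes in total).
V5_HEADER = struct.Struct(">HHIIIIBBH")

def parse_v5_header(packet):
    (version, count, uptime_ms, unix_secs, unix_nsecs,
     flow_sequence, engine_type, engine_id, sampling) = V5_HEADER.unpack_from(packet, 0)
    if version != 5:
        raise ValueError("not a NetFlow v5 packet")
    return {
        "count": count,
        "uptime_ms": uptime_ms,
        "unix_secs": unix_secs,
        "flow_sequence": flow_sequence,
        "sampling_mode": sampling >> 14,          # upper 2 bits
        "sampling_interval": sampling & 0x3FFF,   # lower 14 bits
    }

# A header announcing 2 flow records with sequence number 1000.
pkt = V5_HEADER.pack(5, 2, 123456, 1_400_000_000, 0, 1000, 0, 0, 100)
print(parse_v5_header(pkt))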


2.3.3. Version 9

NetFlow v9 is the successor to NetFlow v5. It abandons the fixed record format in favor of a flexible template-based system that allows new record types to be defined.

Flow header

Each NetFlow v9 message starts with a common flow header. It is similar to the one used for NetFlow v5 but has been slightly refined. Table 2.3 shows the fields present in the NetFlow v9 header.

The source ID field is new, but serves a similar purpose to the engine type/ID fields found in the NetFlow v5 header. It identifies the NetFlow process within the exporter device. For instance, when a router has multiple line cards that are running separate NetFlow processes, the collector can use the source ID to separate different export streams coming from the same device/source IP. This is especially important in NetFlow v9 because the sequence number and template state is scoped to the observation domain, i.e. the uniqueness of template IDs is local to the observation domain. Therefore, the collector must maintain separate state for each observation domain (source IP + source ID).

Name             Type    Description
version          uint16  NetFlow export format version number, always 9
count            uint16  Total number of records in the export packet
uptime_ms        uint32  Time in milliseconds since this device booted
unix_secs        uint32  Time in seconds since epoch
sequence_number  uint32  Incremental sequence counter of all export packets
source_id        uint32  Identifies the Observation Domain

Table 2.3: Format of NetFlow v9 header


Flowsets

The flow header is followed by one or more flowsets, as shown in Figure 2.4.

Figure 2.4: Structure of a NetFlow v9 export packet (a flow header followed by a sequence of flowsets)

There are four types of flowsets:

1. Template flowset: Contains one or more template records. Each template record describes the type and length of individual fields within subsequent flow records that match the template.

2. Data flowset: Contains a template ID and one or more flow records. The records cannot be decoded without the right template.

3. Options template: Special type of template flowset that describes the format of options data records.

4. Options data: Special type of data flowset that contains options data records. Rather than supplying information about flows, these records describe metadata about the NetFlow process itself, e.g. the sampling interval.

Different types of flowsets can be interleaved in the same packet in any given order.

Every flowset starts with a common flowset header. It contains two fields: the flowset ID, which identifies the flowset type, and the flowset length, which contains the total size of the flowset including the flowset header itself. The remainder of the flowset depends on its type. The following types are supported:

• 0: Reserved for template flowset.

• 1: Reserved for options template.

• 2-255: Reserved for future use.

• 256-65535: Data flowsets.


Template flowset

The template flowset contains a set of template records. Each template record describes the format (template) of subsequent flow records in data flowsets with the given template ID. Figure 2.5 shows the structure of a typical template flowset.

The NetFlow v9 standard (RFC3954) defines 79 field types. The standard types include all the fields of NetFlow v5 and new fields such as MAC addresses, VLAN IDs, MPLS labels, and IPv6 addresses. The field type attribute in a template record is a 16-bit short, so there is plenty of room for future types.

When the collector receives a template record, it needs to store the template to be able to decode future data flowsets that match the given template ID. Flow probes will generally send templates periodically to refresh the collector. Templates are not persistent across flow probe restarts. Consequently, if the collector receives a new template definition for an already existing template ID, it must override the previous definition.
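Template handling in a collector can be sketched as a small cache keyed by observation domain and template ID: newer definitions overwrite older ones, and data records can only be decoded once the matching template is known. The code below is a simplified illustration, not the actual collector implementation.

# (exporter IP, source ID, template ID) -> list of (field type, field length) pairs.
templates = {}

def handle_template(source_ip, source_id, template_id, fields):
    """Store or overwrite a template definition received from an exporter."""
    templates[(source_ip, source_id, template_id)] = list(fields)

def decode_data_record(source_ip, source_id, template_id, payload):
    """Slice one data record according to its template; None if the template is unknown."""
    fields = templates.get((source_ip, source_id, template_id))
    if fields is None:
        return None                    # cannot decode yet, wait for a template refresh
    record, offset = {}, 0
    for field_type, length in fields:
        record[field_type] = payload[offset:offset + length]
        offset += length
    return record

# Template 256 with two fields: type 8 (IPv4 source address) and type 12 (IPv4 destination).
handle_template("10.0.0.1", 0, 256, [(8, 4), (12, 4)])
print(decode_data_record("10.0.0.1", 0, 256, bytes([192, 0, 2, 1, 198, 51, 100, 7])))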

Figure 2.5: NetFlow v9 template flowset (a flowset header with flowset ID 0 and length, followed by template records; each template record contains a template ID, a field count and a list of field specifications)

Data flowset

After sending templates to the collector, the flow probe can transmit flow records using the template. The flowset ID in a data flowset header identifies the template ID required to decode the flow records contained within the flowset. Figure 2.6 shows the structure of a typical data flowset.


Figure 2.6: NetFlow v9 data flowset (a flowset header carrying the template ID and length, followed by flow records whose field values are laid out according to the referenced template)

2.3.4. Storage requirements

According to Cisco, the volume of NetFlow export data is estimated at roughly 1.5% of the actual traffic observed. The average customer in Iceland uses 60 GB per month, which amounts to roughly 1 GB of NetFlow data per month.

Another way to estimate the NetFlow volume is to use the flow export rate. NetFlow v5 uses a fixed-length export format with a 24-byte flow header, followed by up to 30 flow records of 48 bytes each. Assuming that a flow probe waits for 30 records before sending an export packet, we can calculate the NetFlow v5 volume in bytes per second for a flow rate $R$ (flows per second) as follows:

\[ \frac{24R}{30} + 48R \tag{2.2} \]

For example, given a flow rate R of 50K flows per second, the resulting NetFlow v5 volume would be roughly 2.3 MB per second, or roughly 8 GB per hour.
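The calculation is easy to verify with a few lines of code (a quick sketch of equation (2.2)):

HEADER_BYTES = 24          # NetFlow v5 flow header
RECORD_BYTES = 48          # NetFlow v5 flow record
RECORDS_PER_PACKET = 30    # assumed full export packets

def v5_bytes_per_second(flow_rate):
    return HEADER_BYTES * flow_rate / RECORDS_PER_PACKET + RECORD_BYTES * flow_rate

rate = v5_bytes_per_second(50_000)
print(round(rate / 2**20, 2), "MB per second")       # ~2.33
print(round(rate * 3600 / 2**30, 1), "GB per hour")  # ~8.2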

2.4. Observation points

2.4.1. Edge deployment

Internet service providers that charge customers for network usage usually deploy NetFlow on the edges of the network, where traffic originates or terminates, as opposed to deploying NetFlow on backbone/core routers. The deployment plan needs to be comprehensive enough to capture all traffic required for billing and data retention, but it must also ensure that the same traffic isn't counted twice, which can happen, for example, when a packet traverses a path through the network that contains two active NetFlow probes, resulting in two separate flows being exported for the same traffic. The next section describes such scenarios in more detail.

Figure 2.7 shows an example service provider network with NetFlow deployment on the edges. The circles marked R represent routers and the smaller filled circles represent active NetFlow monitoring on the given network links.

Figure 2.7: NetFlow edge deployment

2.4.2. Ingress/egress monitoring

In general, NetFlow supports both ingress (inbound) and egress (outbound) monitoring. On routers, NetFlow is configured per interface in ingress and/or egress mode. Ingress monitoring accounts for all packets entering an interface, usually before any packet operations are performed by the device, such as ACLs or NAT. Egress monitoring accounts for all packets leaving an interface.

Internet service providers that charge for network usage usually need to measure both the download and upload traffic of customers, but enabling both ingress and egress monitoring on the same device can result in duplication. Figure 2.8 depicts a router with six interfaces. Ingress monitoring is enabled on interface 4, and egress monitoring on interface 3. A packet that enters on interface 4 and is switched out on interface 3 will be examined twice. Depending on the flow probe and environment, this might result in two flows with the same information or a single flow with double the traffic. NetFlow v9 contains a direction field that allows the collector to distinguish the two flows (one will be marked as ingress, the other as egress), but NetFlow v5 has no such field. As a result, it is recommended not to mix ingress and egress monitoring on the same device when using NetFlow v5.

Figure 2.8: Example router with both ingress and egress monitoring enabled (ingress monitoring on interface 4, egress monitoring on interface 3)

Figure 2.9: Provider network with both ingress and egress monitoring enabled (devices A-D, with ingress monitoring where traffic enters the network and egress monitoring where it exits)

The same general principle also applies to the network as a whole. In Figure 2.9, enabling ingress monitoring for packets that enter device B and egress monitoring for packets that exit device C would result in duplicate flows for packets traversing that path.


2.4.3. Deployment strategies

For usage-based charging, the objective is that every customer packet is accounted for once and only once. This effectively means that for any path through the network, there must only be one interface that counts the packet. Of course, it is also possible to filter duplicate flows at the collector, but it is easier and less error prone to deploy NetFlow in a way that avoids duplicate flows.

A simple deployment strategy is to enable ingress monitoring on all customer-facing and outward-facing interfaces of edge routers. A customer packet will be examined at the customer interface where it enters the service provider network, but not at the edge interface where it leaves the network, e.g. heading for the Internet. Likewise, a return packet will be examined when it enters the service provider network, but not at the customer edge interface. The two unidirectional flows resulting from a TCP connection could be exported from different devices, but together they provide a full view of any traffic going to or from customers.

2.4.4. Customer traffic

With ingress monitoring on all customer-facing and outward-facing interfaces, all flows have the same “direction”, regardless of whether they are in fact inbound or outbound customer traffic. It is up to the collector to determine the true direction with respect to the customer.

This is often accomplished by maintaining a list of customer IPs (local prefixes) at the collector. When a flow is received, the collector will look up the source and destination IPs and determine which endpoint belongs to the customer. If the destination IP address belongs to the customer, then it is inbound (download) traffic with respect to that customer. Otherwise, it is outbound.

In the case of local customer-to-customer traffic, both endpoints will be customer IPs. For one customer the flow will be classified as inbound traffic, but as outbound traffic for the other customer. Flow accounting systems would typically produce two accounting records for such a flow, one for each customer.
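A minimal sketch of this direction lookup is shown below; the prefixes are illustrative and the code is not the qflow implementation.

import ipaddress

# Local customer prefixes, e.g. loaded from a prefix list maintained at the collector.
LOCAL_PREFIXES = [ipaddress.ip_network("192.0.2.0/24"),
                  ipaddress.ip_network("203.0.113.0/24")]

def is_local(ip_str):
    ip = ipaddress.ip_address(ip_str)
    return any(ip in prefix for prefix in LOCAL_PREFIXES)

def classify(src_ip, dst_ip):
    """Return (customer IP, direction) accounting entries for one flow."""
    entries = []
    if is_local(dst_ip):
        entries.append((dst_ip, "inbound"))    # download for the destination customer
    if is_local(src_ip):
        entries.append((src_ip, "outbound"))   # upload for the source customer
    return entries

print(classify("198.51.100.9", "192.0.2.10"))  # [('192.0.2.10', 'inbound')]
print(classify("192.0.2.10", "203.0.113.5"))   # customer-to-customer: two entries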

3. Design and implementation

This chapter presents the architecture, design and implementation of the qflow system.

3.1. Architecture

The system is organized as a pipeline that receives and processes flows. The flows are tagged, sorted, indexed, aggregated and stored in a customer-oriented flow database. Conceptually, the database contains a set of flow tables. Each flow table contains a set of flows, organized by customer, along with an index that summarizes the traffic volume per customer and points to the flows in the flow table.

Figure 3.1: An overview of the qflow system (NetFlow is received by the collector and passes through the capture, indexing and aggregation stages; the IPAM system supplies customer tags, and the reporting stage feeds the billing system, customer reports, network planning, traffic analysis and lawful intercept)


The system has five parts:

1. Collector: receives flows from exporters, translates them into an internal format, applies tags and filters, and then sends the flows to a number of backends for further processing.

2. Capture: receives flows from collectors and dumps them to disk in a temporary spool directory. Rotates files periodically.

3. Indexing: picks up finalized flow files from the spool directory, transforms them into flow tablets and appends them to a given flow table.

4. Aggregation: maintains higher level aggregates (e.g. daily/monthly) of traffic volume per customer for each flow table.

5. Reporting: provides flow data to external systems, such as billing, reports for customers, etc.

Each part will be covered in more detail in the sections that follow.

3.2. Collector

3.2.1. Design

The collector acts as a broker between flow exporters, such as Cisco routers, and flow backends that consume flows for use in a wide variety of applications. The collector receives NetFlow packets from exporters, extracts the flows, applies common logic such as filters, tags, and other business rules, and then exports the flows to backends for further processing. An overview of the collector is shown in Figure 3.2.

The collector supports both NetFlow v5 and v9. It translates received flow records into a common protobuf-based [12] intermediate format that it uses internally during processing and externally when talking to backends. The backends are organized into groups and the collector distributes flows across the backends in a group. The collector configuration decides which flows to send to a given backend group.

This design has multiple benefits:

1. It provides loose coupling. New applications can be developed, tested and deployed without re-configuring or changing the NetFlow devices that export flows.

2. It solves the problem with NetFlow devices that only support two export destinations.

3. It abstracts away the complexity of dealing with multiple NetFlow versions and transport protocols.

4. It uses TCP for publishing flows to backends instead of unreliable protocols such as UDP.

5. It allows common functionality, such as flow filtering and tagging, to be encapsulated in one place.

6. It enables flows to be distributed across multiple backends that don’t need to run on the same machine as the collector.

Figure 3.2: Collector overview (exporters send NetFlow v5/v9 to the collector, which translates the flows into the flow protobuf format and, according to its configuration, distributes them to backends organized into backend groups)

3.2.2. Flow format

The collector transmits flows in a common intermediate format using Google Protocol Buffers [12], also known simply as protobuf. The protobuf library provides a typed language for defining protobuf messages, often called protos, and tools to generate serialization code for multiple programming languages, including C++, Java and Python. An example proto definition for a flow record:

message Flow {
  // Number of incoming bytes
  optional uint64 bytes = 1;

  // Number of incoming packets
  optional uint64 packets = 2;

  // Source IP address
  optional string src_ip = 3;

  // Destination IP address
  optional string dst_ip = 4;

  // TCP/UDP source port number
  optional uint32 src_port = 5;

  // TCP/UDP destination port number
  optional uint32 dst_port = 6;
  // ...
}

The full definition of the flow protobuf record that we use in our system is given in Appendix A.

A protobuf message consists of a sequence of fields. Each field has a name, an associated type, a unique numbered tag (used to identify the field in the serialized binary format) and a field marker such as optional/required/repeated.

Using protocol buffers for flows is a good choice for several reasons:

1. The protobuf language is flexible and allows protobuf messages to be extended with new fields without breaking backwards compatibility. Old binaries simply ignore unknown new fields. This is an important feature because NetFlow v9 is itself extensible, so naturally the intermediate flow format should also be extensible to accommodate future changes. If NetFlow v9 introduces a useful field that is missing from the protobuf, it can simply be added along with the necessary parsing code without breaking any existing code.

2. The protobuf binary format is compact because it only encodes the fields that are in use and employs tricks such as variable-length encoding. A compact representation is important for NetFlow v9 because many more fields might be supported than are actually used, and the unused fields shouldn't take up space in the serialized format.

3. The protobuf library is available for a wide range of programming languages and platforms. Third party developers can easily write new backends that consume flow protobuf messages. Other flow systems, such as flow-tools and nfdump, define their own custom binary format, which makes interoperability harder.

4. The protobuf library is fast. It can encode/decode messages on the order of a hundred nanoseconds per message [12].

During protobuf translation, the collector flattens and denormalizes the flow records into a stream of flow protos, where each flow proto contains the complete information and can be processed independently of other flow protos. The flat stream makes processing easy, since the flows can be filtered and processed in any order. This also enables flows to be distributed across backends without worrying about issues such as template management.

The same protobuf type is used for both NetFlow v5 and v9. As a result, backends should not care whether a flow originally came from NetFlow v5 or v9. The protobuf is currently capable of storing a superset of NetFlow v5 along with the most commonly used fields from NetFlow v9. Furthermore, the protobuf has been augmented to contain extra fields that are specific to our system, such as the customer IP, tag, and direction.

Within the collector, the flow protobuf also makes flow operations more generic. Code for filtering and tagging just needs to operate on the protobuf; it doesn't need special cases for handling different versions of NetFlow. Moreover, this separation of concerns makes unit testing easier, since the code for filtering and tagging can be tested separately and independently of the code that deals with NetFlow.

3.2.3. Configuration

The collector configuration is also defined using a protobuf message. See Appendix B for the full protobuf definition.

The configuration contains five elements:

1. Taggers. A tagger contains a map of tag definitions (IP prefix → tag).


2. Prefix groups. A prefix group contains a list of IP prefixes.

3. Backend groups. A backend group contains a list of backends. The collector will maintain a persistent TCP connection to each backend.

4. Matchers. A matcher contains two backend groups and a filter expression that will be applied to incoming flows. If the filter matches a flow, it will be sent to the first backend group. If it doesn't match, it will be sent to the second group. Both backend groups are optional.

5. Exporters. The collector will drop packets from unknown exporters. Each exporter has a definition that specifies how to handle flows received from that exporter. It includes which taggers and matchers to run, and how to determine the customer IP and direction.

A typical configuration for an Icelandic ISP without its own dedicated foreign links is shown below. The configuration separates foreign and domestic flows and sends them to separate backends for processing.

exporter { name: "router1" ip: "10.20.30.40" customers: "local" tagger: "customers" matcher: "innlent" } matcher { name: "innlent" // This will match the peer IP against the Icelandic prefixes. // The peer IP is the IP address of the endpoint opposite the // customer IP. pattern: "peer_ip in isroutes" backend_group: "dumper_innlent" backend_group_complement: "dumper_erlent" } backend_group { name: "dumper_innlent" // A large ISP could define multiple backends. Flows would be // distributed across the backends in round-robin fashion. backend { name: "backend1" host: "127.0.0.1:9100" }

28 3.2. Collector

} backend_group { name: "dumper_erlent" backend { name: "backend2" host: "127.0.0.1:9200" } } tagger { // List of customer allocations from IPAM system. name: "customers" path: "/flow/networks/tags.txt" } prefix_group { // Contains a list of local customer prefixes. name: "local" path: "/flow/networks/local.txt" } prefix_group { // Contains a list of Icelandic prefixes. name: "isroutes" path: "/flow/networks/is-net.txt" }

An ISP with its own dedicated foreign links would export ingress flows from the edge routers that receive foreign traffic. The collector configuration for such a scenario would contain separate exporter entries for routers handling foreign and domestic traffic. An example configuration is shown below:

exporter { name: "farice" ip: "10.20.30.40" customers: "local" tagger: "customers" matcher: "innlent" } exporter { name: "rix" ip: "10.20.30.50" customers: "local" tagger: "customers" matcher: "erlent" }

29 3. Design and implementation

matcher { // Matcher with no pattern matches everything. name: "innlent" backend_group: "innlent" } matcher { // Matcher with no pattern matches everything. name: "erlent" backend_group: "erlent" } // ...

3.2.4. Backend protocol

The collector communicates with backends using a simple message-based wire protocol. TCP provides a reliable byte-oriented stream, but the inherent lack of message boundaries requires the application to do its own message framing. The backend protocol uses length-prefix framing, where each message is prefixed with its length.

Messages can be exchanged in both directions between collector and backend. Every message starts with a single-byte character that denotes the message type, followed by the actual message contents. The following message types are defined:

• H: Heartbeat message. The collector periodically sends heartbeat messages to check backend health. The backend should respond back with a heartbeat message.

• F: Flow record message. Contains a single flow proto.
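As an illustration of the framing, the sketch below sends a single framed message over an established backend connection. The 4-byte big-endian length prefix, and the decision to include the type byte in the length, are illustrative assumptions; the exact wire layout used by qflow is not reproduced here.

#include <arpa/inet.h>   // htonl
#include <sys/socket.h>  // send
#include <cstdint>
#include <string>

// Send one framed message: [4-byte length][type byte][payload].
// In this sketch the length covers the type byte plus the payload.
bool SendMessage(int fd, char type, const std::string& payload) {
  uint32_t len = htonl(static_cast<uint32_t>(payload.size() + 1));
  std::string frame(reinterpret_cast<const char*>(&len), sizeof(len));
  frame.push_back(type);
  frame.append(payload);
  return send(fd, frame.data(), frame.size(), 0) ==
         static_cast<ssize_t>(frame.size());
}

A flow record would then be forwarded by serializing the flow proto to a string and passing it to SendMessage with type 'F'.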

The qflow code provides a FlowListener class for writing collector backends. The backend code simply creates an instance of the listener and provides a callback that is invoked whenever a flow arrives. The FlowListener class encapsulates the backend protocol.
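A backend built on this class might look roughly like the following sketch. The header name, the constructor argument and the callback signature are assumptions made for illustration, since the exact FlowListener interface is not reproduced in this text.

#include <functional>
#include "qflow/flow_listener.h"  // hypothetical header name

int main() {
  // Listen on the address configured for this backend in the collector
  // config (e.g. 127.0.0.1:9100) and invoke the callback once per flow.
  FlowListener listener("127.0.0.1:9100");
  listener.Run([](const qflow::netflow& flow) {
    // Handle the flow, e.g. append it to the table's spool directory.
  });
  return 0;
}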


(Diagram: a database directory with subdirectories for tablets, views, and the queue.)

Figure 3.3: Directory layout of the flow database

3.3. Database

3.3.1. Design

The flow database contains a set of flow tables. The number of flow tables and their purpose are left up to the service provider, but there is usually one flow table for each accounting dimension, e.g. separate flow tables for foreign and domestic flows. Having separate flow tables allows the service provider to account for each traffic volume separately. Figure 3.3 shows the directory layout for the flow database.

Each flow table is backed by a set of flow tablets. Each tablet contains a part of the flows that make up the table. The flows in a tablet are sorted by customer. All flows belonging to the same customer appear consecutively within the tablet.

Each flow tablet is accompanied by a flow tablet index. The index contains one entry per customer. It summarizes the flow volume for that customer and points to the customer’s flow records within the tablet.

The tablet index enables fast extraction of customer flows, since it is possible to look up exactly where the customer flows are located. Furthermore, many queries can be answered directly from the index without reading the underlying flow records, e.g. queries about customer traffic volume, which are the most common queries in flow accounting systems.

Although the index can answer simple queries about customer traffic volume, the information is scoped to the given tablet. For queries about traffic volume over a longer period of time, e.g. a whole day or month, it is more convenient to maintain precomputed aggregates that are updated every time a new tablet is added to the flow table than to query every tablet index within the given time interval. These precomputed aggregates are similar to materialized views within relational databases.

A flow table can have one or more materialized views associated with it. It is common to compute hourly, daily and monthly customer summaries for each flow table.

3.3.2. Table queue

Each flow table has a temporary spool directory on disk for incoming flows. The dumper is a collector backend that receives flows and dumps them to disk in the spool directory. The spool directory serves as a queue for the next stage in the processing pipeline – the indexer, which transforms dumped files into flow tablets.

The dumper writes flows into flow files. A flow file contains a sequence of records and each record contains a serialized flow record protobuf. The record format will be described in more detail in the next section.

The flow files are named YYYY-MM-DD.HHMMSS, e.g. 2014-08-24.131500, which denotes when the given file was created. The dumper rotates files based on file size and time interval, e.g. 250 MB or 15 minutes, whichever one comes first.
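The rotation check itself is a small piece of logic; the sketch below is illustrative, with the 250 MB / 15 minute thresholds taken from the example above.

#include <cstddef>
#include <ctime>

// Rotate when either limit is reached, whichever comes first.
bool ShouldRotate(std::size_t bytes_written, std::time_t opened_at,
                  std::size_t max_bytes = 250 * 1024 * 1024,
                  int max_seconds = 15 * 60) {
  return bytes_written >= max_bytes ||
         std::time(nullptr) - opened_at >= max_seconds;
}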

(Diagram: the collector feeds two dumper processes; each dumper writes into its table's queue, and an indexer turns the queued files into tablets. One dumper/queue/indexer pipeline ends in the domestic flow table and the other in the foreign flow table.)

Figure 3.4: Flow capture pipeline


Many dumper processes can be running at the same time, dumping different types of flows to disk. One scenario, particularly common in Iceland, is running two dumper processes, one for domestic traffic and the other for international traffic. The flow collector is configured to separate the traffic and send it to separate dumper backends. The flows are then dumped into separate spool directories and will end up in separate flow tables. Figure 3.4 shows an example.

3.3.3. Record format

We use a simple block-compressed record-oriented file format for storing flows on disk. The same record format is used both for dumped flow files, pending in the spool directory, and the final flow tablets.

Flows can be written using the RecordWriter class:

RecordWriter writer("2014-08-20.131000.131500");

string s;
qflow::netflow flow;

flow.SerializeToString(&s);
writer.Write(s);

And they can be read back using the RecordReader class:

RecordReader reader("2014-08-20.131000.131500");

qflow::netflow flow;
string s;
while (!reader.Eof()) {
  if (!reader.Read(&s))
    break;
  if (!flow.ParseFromString(s))
    break;
  // do something with flow
}


Each record holds a variable-length arbitrary binary blob. The classes are general in nature, but in our case the record blob always contains a serialized flow proto.

(Diagram: each block starts with a header holding a magic number, the block size, a checksum, the number of records, and the compression scheme, followed by the records, each prefixed with its size.)

Figure 3.5: Structure of a block

Internally, each flow file contains a sequence of compressed blocks, and each block contains a sequence of records. Figure 3.5 shows the block format.

Currently, the record format supports both zlib [10] and snappy [14] compression. Snappy is a compression library from Google that offers very high speeds and rea- sonable compression. It can compress at 250 MB/s and decompress at 500 MB/s on a single core [14]. Zlib is slower but offers better compression ratios.

As each block is compressed independently, we can seek within the file and randomly access blocks. This is an important property for indexing, as we will discuss further in the next section.

3.3.4. Indexer

The indexer program watches the spool directory for each flow table and picks up flow files after they are rotated. It transforms the flow files into flow tablets and adds them to the corresponding flow table.

The transformation consists of five main steps:

1. The flow records are sorted by customer key (customer IP/mask/tag) and written into a new flow tablet file. The flow tablet is given the same name as the original flow file, i.e. the file start time.

2. An index is constructed for the newly created flow tablet.

3. The new tablet is added to the flow table.


4. The index is used to update any materialized views associated with the table.

5. The original file is deleted.

The flow records are sorted using a parallel merge sort algorithm, which allows the indexer to take advantage of multi-core machines. The number of threads is configurable and can be specified using a command line flag (--sort_threads=N).
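The sketch below illustrates the general idea behind such a parallel sort, not the qflow implementation: the input is split into one chunk per thread, each chunk is sorted concurrently, and the sorted chunks are then merged.

#include <algorithm>
#include <thread>
#include <vector>

// Sort `records` using `nthreads` worker threads (cf. --sort_threads=N).
// The comparator would order flows by customer key (IP, tag, mask).
template <typename T, typename Cmp>
void ParallelSort(std::vector<T>* records, Cmp cmp, int nthreads) {
  const std::size_t n = records->size();
  const std::size_t chunk = (n + nthreads - 1) / nthreads;

  // Phase 1: sort each chunk in its own thread.
  std::vector<std::thread> workers;
  for (int i = 0; i < nthreads; ++i) {
    std::size_t lo = std::min(n, i * chunk);
    std::size_t hi = std::min(n, lo + chunk);
    workers.emplace_back([=] {
      std::sort(records->begin() + lo, records->begin() + hi, cmp);
    });
  }
  for (auto& t : workers) t.join();

  // Phase 2: merge adjacent sorted runs pairwise until one run remains.
  for (std::size_t width = chunk; width < n; width *= 2) {
    for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
      std::size_t mid = lo + width;
      std::size_t hi = std::min(n, lo + 2 * width);
      std::inplace_merge(records->begin() + lo, records->begin() + mid,
                         records->begin() + hi, cmp);
    }
  }
}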

The tablets are organized into a hierarchy of time intervals as shown in Figure 3.6. At the top level, there is one directory for each month. Under a particular month, there is a directory for each day, and it contains all the flow tablets, plus their indexes, that started during that day.

2014-06/
2014-07/
2014-08/
    2014-08-01/
    2014-08-02/
        2014-08-02.000000   (+ index file)
        2014-08-02.001500   (+ index file)
        2014-08-02.003000   (+ index file)

Figure 3.6: Directory layout for flow tablets

3.3.5. Tablets

As previously mentioned, tablets use the same record format as normal flow files. The difference lies in the record order and block organization.

Flow tablets have two invariants:

1. The flow records are always sorted by customer. More specifically, by customer IP, tag and routing mask. These three fields form the key that is used during sorting and for the customer entry in the tablet index.

2. A block only contains records for a single customer.


(Diagram: the tablet's blocks in order; blocks 0-2 hold customer A's records, block 3 holds customer B's, and blocks 4-5 hold customer C's, so no block mixes records from two customers.)

Figure 3.7: Internal layout of a flow tablet

The second condition is preserved during indexing by flushing a block to disk when crossing from one customer’s records to another. Figure 3.7 demonstrates the typical block layout within a flow tablet.
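In code, the loop that maintains this invariant can be sketched as follows; the TabletWriter interface and the CustomerKey helper are hypothetical names used only for illustration.

#include <string>
#include <vector>

// Hypothetical interfaces for illustration; the real qflow classes
// may differ. CustomerKey() builds the (customer IP, tag, mask) sort
// key and TabletWriter wraps the block-compressed record writer.
struct TabletWriter {
  void Write(const std::string& record);
  void FlushBlock();  // forces the current block to disk
};
std::string CustomerKey(const qflow::netflow& flow);

void WriteTablet(TabletWriter* writer,
                 const std::vector<qflow::netflow>& sorted_flows) {
  std::string prev_key;
  for (const qflow::netflow& flow : sorted_flows) {
    const std::string key = CustomerKey(flow);
    // Crossing a customer boundary: flush so that no block ever holds
    // records from two different customers.
    if (!prev_key.empty() && key != prev_key) {
      writer->FlushBlock();
    }
    std::string s;
    flow.SerializeToString(&s);
    writer->Write(s);
    prev_key = key;
  }
}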

The tablet index contains a sequence of fixed-length index records. There is one index record for every unique customer key that appears in the flow tablet. The index record holds a pointer to the first tablet block for the given customer key. It also contains the total number of inbound and outbound flows and bytes contained within the tablet for that customer key.

The tablet index also has an order invariant. The index records are always sorted by customer key. The tablet index is designed to be mapped directly into memory, which makes it easy to locate a particular index record by using binary search.

Because the flow tablet and its index are sorted by customer key, consecutive cus- tomer IPs will be placed next to each other. When querying for a range of IPs, e.g. a CIDR, it is possible to locate the first IP and then simply walk over each adjacent IP until the range is exhausted.
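The following sketch shows how such a memory-mapped index can be searched with a binary search. The field layout of the index record is illustrative; the real qflow record layout is not reproduced here.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative fixed-length index record.
struct IndexRecord {
  uint8_t  customer_ip[16];  // IPv4-mapped or IPv6 address
  uint8_t  prefix_length;
  char     tag[16];
  uint64_t block_offset;     // offset of the customer's first block
  uint64_t flows_in, bytes_in;
  uint64_t flows_out, bytes_out;
};

// `base` points at the mmap'ed index holding `count` records sorted by
// customer key; binary search locates a key in O(log n) comparisons.
const IndexRecord* FindCustomer(const IndexRecord* base, std::size_t count,
                                const IndexRecord& key) {
  auto less = [](const IndexRecord& a, const IndexRecord& b) {
    int c = std::memcmp(a.customer_ip, b.customer_ip, sizeof(a.customer_ip));
    if (c != 0) return c < 0;
    if (a.prefix_length != b.prefix_length)
      return a.prefix_length < b.prefix_length;
    return std::memcmp(a.tag, b.tag, sizeof(a.tag)) < 0;
  };
  const IndexRecord* end = base + count;
  const IndexRecord* it = std::lower_bound(base, end, key, less);
  if (it != end && !less(key, *it)) return it;
  return nullptr;  // key not present in this tablet
}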

3.3.6. Materialized views

The tablet indexes provide a high-resolution view of the traffic volume per customer, but traffic billing is usually done on a monthly basis. It is also common to give a daily breakdown of volume to customers so they can keep an eye on their usage. For this reason, the qflow system provides a feature to store and maintain precomputed time-based views for each flow table. The view is updated every time a new tablet is added to the flow table.

The granularity of views is configurable by the user, but it is common to compute hourly, daily and monthly customer summaries. Each flow table has a configuration file that specifies the views, e.g.

view {
  name: "month"
  pattern: "%Y-%m"
}
view {
  name: "date"
  pattern: "%Y-%m-%d"
}
view {
  name: "hour"
  pattern: "%Y-%m-%d.%H"
}

When a file named “2014-08-24.153000” is added to the flow table, the indexer will also update the following views:

1. views/month/2014-08

2. views/date/2014-08-24

3. views/hour/2014-08-24.15
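Deriving the view names amounts to parsing the timestamp embedded in the tablet name and reformatting it with each view pattern. A minimal sketch under that assumption, using the POSIX strptime/strftime functions:

#include <time.h>
#include <string>

// Map a tablet name such as "2014-08-24.153000" and a view pattern such
// as "%Y-%m-%d.%H" to the corresponding view name ("2014-08-24.15").
std::string ViewName(const std::string& tablet, const std::string& pattern) {
  struct tm tm = {};
  strptime(tablet.c_str(), "%Y-%m-%d.%H%M%S", &tm);
  char buf[64];
  strftime(buf, sizeof(buf), pattern.c_str(), &tm);
  return buf;
}

For example, ViewName("2014-08-24.153000", "%Y-%m") yields "2014-08".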

The views are derived from tablet indexes. Each view holds within it a list of all tablet indexes that have been added to it. This prevents the same tablet from being added twice and allows the system to inspect whether a tablet is missing from the view.

The view file format consists of a sequence of aggregation records. There is one record for every unique customer key that appears in the tablet indexes that the view is based on. The record contains the total number of inbound and outbound flows and bytes. Figure 3.8 shows the file structure.

The view has the same order invariant as the tablet index. The records are always sorted by customer key and the file format is designed to be memory mapped directly to allow for easy binary searching.


(Diagram: a view file consists of a header, a list of the contributing tablet files (e.g. 2014-01-02.131500, 2014-01-02.133000), a section of IPv4 entries, and a section of IPv6 entries; each entry holds the ip, prefix_length and tag together with the inbound/outbound flow and byte counters.)

Figure 3.8: View file format

The view format is optimized for fast lookups at the expense of updates. To update a view, a new shadow copy is created and then it is atomically renamed to replace the old copy. The view files are usually small, e.g. 10 MB for 200K keys, so doing updates via shadow copy should be cheap.
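The update itself follows the standard write-then-rename idiom, sketched below; the actual serialization of the view contents is omitted, and a production version would also fsync the shadow copy before renaming it.

#include <cstdio>
#include <fstream>
#include <string>

// Atomically replace a view file: write the new contents to a shadow
// copy and rename it over the old file. rename(2) is atomic on POSIX
// filesystems, so readers see either the old or the new view, never a mix.
bool ReplaceView(const std::string& path, const std::string& contents) {
  const std::string tmp = path + ".tmp";
  std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
  out << contents;
  out.close();
  if (!out) return false;
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}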

3.4. Filtering

3.4.1. Language

The classical UNIX tcpdump tool provides a user-friendly language based on Berkeley Packet Filter (BPF)[17] to filter a stream of packets. The filtering language is based on boolean expressions that operate on protocol fields, e.g. IP addresses, TCP/UDP port numbers, protocol type, etc. A filtering expression is evaluated for each packet to determine whether it should pass the filter. An example BPF filter expression is shown below:

dst net 192.168.2.0/24 and dst port 22

The qflow system includes a filtering language inspired by BPF, but instead of operating on packets and protocol fields, it operates on flows and flow record fields. The language supports all the fields present in our flow protobuf.

The same expression would be expressed in the qflow filtering language as:


dst_ip in 192.168.2.0/24 and dst_port == 22

The language supports three field types: integer, string, and IP address. For all types, it provides the usual comparison operators, including <, <=, >, >=, ==, and !=. Additionally, it provides several ways of matching IP addresses:

1. Match a single IP:

dst_ip == 192.168.10.20
dst_ip < 192.168.10.20
dst_ip > 192.168.10.20

2. Match a CIDR range:

dst_ip in 192.168.2.0/24

3. Match a list of CIDR ranges:

dst_ip in {192.168.2.0/24, 10.15.10.0/24}

4. Match a list of CIDR ranges from a file:

dst_ip in "/flow/networks/local.txt"

5. Match a list of CIDR ranges from a predefined prefix group within the environment:

dst_ip in isroutes

The full grammar of the filtering language is described in Appendix C.

3.4.2. Implementation

The filtering language is provided as a library that can be linked into different components of the overall system, e.g. collector, capture, and query tools. The library provides a class that takes a filter expression as a string and returns a compiled filter expression:


FilterBuilder builder;
FilterExpression *e = builder.Build(
    "dst_ip in 192.168.2.0/24 and dst_port == 22");

The compiled expression can then be evaluated in the context of a flow protobuf to determine if it matches or not:

if (e->Matches(&proto)) {
  // ...
}

Internally, the filter library builds a parse tree which is evaluated for each flow. Figure 3.9 shows the parse tree for the example expression used earlier.

(Diagram: the root of the tree is an AND node; its left child is the test "dst_ip in 192.168.2.0/24" and its right child is the test "dst_port == 22".)

Figure 3.9: Parse tree for example filter expression

The open-source flex tool is used to tokenize the filter expression and then the parse tree is constructed using a custom LL(1) parser.
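The parse tree can be represented by a small class hierarchy, as in the illustrative sketch below; the real qflow FilterExpression classes are not reproduced here.

#include <cstdint>
#include <memory>

// Illustrative parse-tree nodes.
struct Node {
  virtual ~Node() = default;
  virtual bool Matches(const qflow::netflow& flow) const = 0;
};

struct AndNode : Node {
  std::unique_ptr<Node> left, right;
  bool Matches(const qflow::netflow& flow) const override {
    return left->Matches(flow) && right->Matches(flow);
  }
};

struct DstPortEquals : Node {
  uint32_t port;
  explicit DstPortEquals(uint32_t p) : port(p) {}
  bool Matches(const qflow::netflow& flow) const override {
    return flow.dst_port() == port;
  }
};

// A CIDR test node ("dst_ip in 192.168.2.0/24") would similarly parse the
// address field of the flow and compare it against the prefix.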

3.5. Reports

This section demonstrates how common queries can be answered using the qflow command line tools. In the examples, the command is usually shown on the first line, prefixed by a shell prompt $, followed by the command output.


3.5.1. Flow extraction

Customer flows can be extracted by using the qflow-extract program.

For example, the following command will extract all flows in the erlent flow table during 2014-08-21 with a customer IP within the range 10.20.30.0/24:

$ qflow-extract erlent date:2014-08-21 -s 10.20.30.0/24

The extracted flows are written to standard output, where they can be piped into another program for further processing or redirected to a file.

3.5.2. Flow summary

Flows can be summarized by using the qflow-sum program.

It reads flows from standard input and writes a summary table to standard output. The summary key can be given as an option. For example, the following command will produce a summary for every input/output interface pair:

$ qflow-sum -k input_if,output_if < flows
input_if  output_if  flows-in  bytes-in    flows-out  bytes-out
339       128        42450     3487183741  0          0
...

The summary key can be any field present in the flow protobuf.

3.5.3. Flow filter

Flows can be filtered by using the qflow-filter program.

It takes a filter expression as an argument, reads flows from standard input and writes filtered flows to standard output. For example:

$ qflow-filter 'protocol == 16 && customer_ip in 192.168.20.0/24'


3.5.4. Time-based reports

Traffic volume over time can be queried by using the qflow-aggr program.

To get a list of overall traffic per month for the erlent flow table:

$ qflow-aggr erlent month
         Size      Compressed  Flows-in  Bytes-in  Flows-out  Bytes-out
2014-01  208.9 GB  81.7 GB     1.6 G     383.7 TB  1.5 G      16.9 TB
2014-02  178.7 GB  71.0 GB     1.4 G     393.8 TB  1.7 G      14.0 TB
...

The month argument in the command selects that particular view type.

We can get the same report but in the scope of a given customer IP or network by using the -s option:

$ qflow-aggr erlent month -s 10.20.30.0/24
         Size      Compressed  Flows-in  Bytes-in  Flows-out  Bytes-out
2014-01  307.4 MB  114.6 MB    1.8 G     329.6 GB  1.9 M      405.5 GB
2014-02  207.8 MB  80.0 MB     1.5 G     401.9 GB  1.0 M      292.4 GB
...

We can also filter by a given month by appending a selector to the view:

$ qflow-aggr erlent month:2014-02 -s 10.20.30.0/24
         Size      Compressed  Flows-in  Bytes-in  Flows-out  Bytes-out
2014-02  207.8 MB  80.0 MB     1.5 G     401.9 GB  1.0 M      292.4 GB

The selector fetches views that begin with the given string. For example, we could get the same overview for every day in 2014-02 by using the following command:

$ qflow-aggr erlent date:2014-02 -s 10.20.30.0/24
            Size    Compressed  Flows-in  Bytes-in  Flows-out  Bytes-out
2014-02-01  9.1 MB  3.6 MB      55.1 K    29.1 GB   55.0 K     16.3 GB
...

There is also a special view for the tablets. For example:


$ qflow-aggr erlent tablet:2014-02 -s 10.20.30.0/24
                   Size     Compressed  Flows-in  Bytes-in  Flows-out  ..
2014-02-01.000000  40.1 KB  17.5 KB     215       129.6 MB  257        ..
...

3.5.5. Customer reports

Customer traffic volume can be queried by using the qflow-customer program.

For example:

$ qflow-customer erlent month:2014-08
               Size      Compressed  Flows-in  Bytes-in  ..  Tag
10.20.30.1/32  636.4 KB  314.7 KB    7.5 K     6.9 GB    ..  abcd
...

The command supports several options. The -s option filters by customer IP. The -t 1G option excludes any customer entry with less than 1G of inbound bytes. The -O option excludes entries without a tag. Customer traffic is usually tagged using some kind of identifier that makes sense to the billing system, so an entry without a tag cannot be billed. The -m option enables a machine-friendly output format.

A monthly billing run would typically use a command like this:

$ qflow-customer erlent month:2014-05 -m -t 1G -O
#ip,size,csize,flows-in,bytes-in,flows-out,bytes-out,tag
10.101.32.2/21,1129521,603002,12983,6018147900,0,0,customer1
10.101.32.3/21,9012852,3703883,103596,33392658000,0,0,customer2
10.101.32.4/21,573417,358298,6591,2676368400,0,0,customer1
10.101.32.5/21,328773,164965,3779,3104398500,0,0,customer3
10.101.32.6/21,4263348,1771798,49004,21885276300,0,0,customer4

Conversely, the -o option can be used to find IPs without a tag, i.e. orphaned entries that need to be fixed.


4. Evaluation

This chapter presents experimental results that demonstrate how the system performs and behaves under various workloads.

4.1. Environment

All experiments were performed on a dedicated 64-bit x86 Ubuntu 12.04 server running Linux kernel version 3.5.0. The hardware specifications are shown below:

1. Architecture: Sandybridge

2. CPUs: 2x Intel Xeon CPUs @ 2.0GHz, each with 6 hyperthreaded cores.

3. Cache: 384 KB L1 cache, 1.5 MB L2 cache, 15 MB L3 cache.

4. RAM: 32 GB DDR3 SDRAM

5. Disk: 4x Western Digital WD4000FYYZ 4 TB disks in a software RAID-5 configuration

6. Network: Intel I350 Gigabit Ethernet

4.2. Collector

This section evaluates the maximum flow throughput of the collector.


4.2.1. Preparation

Configuration

We used a collector configuration that is typical for an ISP that purchases upstream international connectivity and needs to separate foreign/domestic traffic based on peer IP address.

The full configuration is shown below:

exporter { name: "replay" ip: "127.0.0.1" customers: "local" matcher: "innlent" tagger: "customers" } matcher { name: "innlent" pattern: "peer_ip in isroutes" backend_group: "innlent" backend_group_complement: "erlent" } tagger { name: "customers" path: "/flow/networks/tags.txt" } backend_group { name: "innlent" backend { name: "b1" host: "127.0.0.1:9100" } } backend_group { name: "erlent" backend { name: "b2" host: "127.0.0.1:9200" } } prefix_group {

46 4.2. Collector

name: "local" path: "/flow/networks/local.txt" } prefix_group { name: "isroutes" path: "/flow/networks/is-net.txt" }

To simulate a real workload, the tags and prefix groups were populated with real data from a medium sized ISP in Iceland.

1. is-net.txt contains 78 prefixes

2. local.txt contains 36 prefixes

3. tags.txt contains 2230 prefixes

Instrumentation

The collector was instrumented to output a statistic line every 30 seconds containing three fields:

1. The current timestamp (seconds since epoch)

2. The total number of flows processed

3. The total number of CPU seconds used for the process (user+kernel time)

The CPU time for the process was retrieved using the getrusage system call. For example:

struct rusage ru;
if (getrusage(RUSAGE_SELF, &ru) != 0)
  return;

double cpusec = static_cast<double>(ru.ru_utime.tv_sec) +
                static_cast<double>(ru.ru_utime.tv_usec) * 1e-6 +
                static_cast<double>(ru.ru_stime.tv_sec) +
                static_cast<double>(ru.ru_stime.tv_usec) * 1e-6;


The instrumentation allows us to calculate the flow throughput per core – the number of flows processed per CPU second consumed.

Traffic generator

To model a realistic workload, we captured around 3 million NetFlow v5 and v9 packets being exported over UDP from a router handling real user traffic. The NetFlow v9 export contained a single template with the same fields as NetFlow v5.

The packets were captured using tcpdump and written to separate pcap files, e.g.

tcpdump -n -s 2000 -w v5.pcap port 8888
tcpdump -n -s 2000 -w v9.pcap port 8889

We wrote a custom replay tool that reads a pcap file containing NetFlow packets and sends them as fast as possible to a target collector, e.g.

./replay -f ./v9.pcap -t 127.0.0.1:8888

4.2.2. Results

We ran a separate experiment for each NetFlow version. The steps are shown below:

1. Fake backends: start two netcat processes with output redirected to /dev/null.

2. Start collector, wait until it has connected to both backends and is ready to start processing flows.

3. Start the traffic generator using either the NetFlow v5 or the v9 pcap input.

4. Wait for 15 minutes to collect 30 statistic samples from the collector.


Time elapsed (s)  Δ Flows   Δ CPU seconds  Flows per CPU second
30                1762917    9.02456       195347
60                5785206   29.9539        193137
90                5784757   29.9539        193122
120               5784986   29.9499        193156
150               5784833   29.9659        193047
180               5784921   29.9659        193050
210               5784539   29.9579        193089
240               5784640   29.9499        193144
270               5785286   29.9539        193140
300               5793788   29.9739        193295
330               5784309   29.9419        193185
360               5785144   29.9459        193187
390               5784848   29.9499        193151
420               5784557   29.9539        193116
450               5785104   29.9539        193134
480               5784984   29.9579        193104
510               5785345   29.9499        193168
540               5785287   29.9579        193114
570               5785292   29.9499        193166
600               5784882   29.9419        193204
630               5785101   29.9499        193159
660               5785093   29.9499        193159
690               5785399   29.9579        193118
720               5784550   29.9499        193141
750               5784740   29.9499        193147
780               5785150   29.9499        193161
810               5784680   29.9539        193120
840               5784468   29.9539        193113
870               5794443   29.9699        193342
900               5785022   29.9579        193105

Table 4.1: Collector performance results for NetFlow v5


Time elapsed (s)  Δ Flows   Δ CPU seconds  Flows per CPU second
30                1238764    6.56441       188709
60                5571828   29.9099        186287
90                5553710   29.8499        186055
120               5562957   29.8779        186190
150               5561174   29.8739        186155
180               5543514   29.7459        186363
210               5549906   29.7299        186678
240               5601589   29.9859        186808
270               5569883   29.8579        186547
300               6876027   29.7579        231066
330               6953267   29.8939        232598
360               6584359   29.8819        220346
390               5570445   29.8539        186590
420               5571983   29.9099        186292
450               5584396   29.9339        186558
480               5581606   29.9099        186614
510               5561041   29.8539        186275
540               5561534   29.8739        186167
570               5559198   29.8779        186064
600               5543595   29.8979        185418
630               5582370   29.7499        187644
660               5584250   29.9339        186553
690               5561732   29.9139        185925
720               5564407   29.9379        185865
750               5532652   29.6379        186675
780               5543466   29.8099        185961
810               5551856   29.8659        185893
840               5560750   29.8819        186091
870               5561726   29.8979        186024
900               5419689   29.7419        182224

Table 4.2: Collector performance results for NetFlow v9

According to the results shown in Tables 4.1 and 4.2, the collector can handle roughly 180K flows per second. NetFlow v9 appears to be slightly more expensive than v5 in terms of CPU time, which is not surprising given the added complexity of parsing NetFlow v9.

Furthermore, the collector is clearly CPU bound and limited by the fact that it is single threaded. It consumes nearly 30 CPU seconds during every 30 second period. That is, it manages to fully utilize a single core but stays capped there. This suggests that the performance could be increased by making the collector multithreaded.

4.3. Indexer

This section evaluates the performance of the flow indexer.

We instrumented the indexer to output the same type of CPU statistics as the collector for every flow file that it processes.

We prepared four flow files of different sizes based on real traffic. The files were compressed using snappy.

1. 10M records, file size: 432 MB.

2. 20M records, file size: 864 MB.

3. 30M records, file size: 1.3 GB.

4. 40M records, file size: 1.7 GB.

For each file, we measured the indexer performance when using 1, 2, 4, 8 and 16 sorting threads. The results are shown in Tables 4.3, 4.4, 4.5, and 4.6.

N    Avg. time (s)  Avg. CPU seconds  Flows per second
1    148.1          147.5             67521
2     91.3          132.6             107411
4     62.0          123.1             161290
8     53.1          130.6             188323
16    48.1          142.4             207900

Table 4.3: Indexer performance for 10M records


N    Avg. time (s)  Avg. CPU seconds  Flows per second
1    279.4          278.4             71581
2    186.2          281.9             107296
4    144.1          283.1             138792
8    117.8          286.6             169779
16   107.9          304.3             185536

Table 4.4: Indexer performance for 20M records

N    Avg. time (s)  Avg. CPU seconds  Flows per second
1    430.6          429.4             69670
2    283.1          428.7             106007
4    210.5          430.8             142517
8    181.4          444.5             165380
16   156.5          459.4             191693

Table 4.5: Indexer performance for 30M records

N    Avg. time (s)  Avg. CPU seconds  Flows per second
1    551.5          549.639           72529
2    384.9          587.849           103923
4    281.2          576.233           142247
8    239.9          595.282           166736
16   214.8          628.884           186219

Table 4.6: Indexer performance for 40M records

The results show that the parallel mergesort algorithm makes a huge difference. Adding more threads considerably lowers the total time required to process a file. In all cases, using 16 threads takes less than half the time compared to using only one thread. But that also implies that there is considerable overhead and the results show diminishing returns for every thread that is added.

We implemented a basic version of parallel mergesort for use in qflow. It has not been profiled and optimized, so there is probably considerable room for improvement.


4.4. Flow storage

This section evaluates the storage efficiency of our system.

We imported production NetFlow v5 data collected over a single day into both qflow and flow-tools. The total number of imported flows was 638.6M. Using the formula from Section 2.3.4, the total size of the exported flows was at least 28.5 GB. Both systems were configured to use the same compression scheme: zlib with compression level 5.

                            qflow     flow-tools
Uncompressed size           79.3 GB   38.1 GB
Compressed size              9.2 GB    9.8 GB
Compression ratio            8.6:1     3.8:1
NetFlow compression ratio    3.1:1     2.9:1

Table 4.7: Storage efficiency of qflow vs. flow-tools

Table 4.7 shows the results. The NetFlow compression ratio shows the ratio with respect to the estimated size of the NetFlow v5 input data, i.e. the 28.5 GB figure.

The uncompressed size of qflow was roughly 2x the size of flow-tools. This can be explained by differences in the way the two systems encode flows for storage on disk.

First, flow-tools uses a fixed-length record format that closely mirrors the actual NetFlow v5 packet format. The uncompressed size should therefore be close to the actual NetFlow v5 input. The extra overhead (38.1 GB vs. 28.5 GB) is because the records are flattened out and contain extra information from the v5 header, such as exporter IP, router uptime, etc. The flattened records use 64 bytes, whereas NetFlow v5 uses 23 and 47 bytes for the header and record, respectively.

In contrast, qflow uses protobufs to encode the flows. Although protocol buffers employ techniques such as variable-length encoding of integers to save space, it is hard to beat a fixed binary format. Protocol buffers, however, bring other benefits that flow-tools lacks, such as extensibility and optional fields, both of which are required for a good NetFlow v9 implementation, which flow-tools does not support.

In addition, qflow stores IP addresses as strings within the protocol buffer. This probably accounts for a large percentage of the total uncompressed size, and also explains why qflow achieves such a good compression ratio. The rationale behind using strings for IPs was that it is simple and we can use the same fields for both IPv4 and IPv6 addresses. We also expected the strings to compress very well, which appears to be the case. However, there may be a case for using separate fixed-size integer fields for IP addresses to improve performance, since parsing IP addresses from strings is more expensive and it happens frequently for operations such as flow sorting and filtering.

Despite the large difference in uncompressed size, and the fact that the qflow figure also includes the indexes, qflow ends up with a slightly smaller compressed size. This might be explained by better compression due to data similarity in flow tablets, since qflow clusters all customer flows together and those flows are more likely to have similarities than two random flows. The customer IP is one example that compresses very well when the customer flows are clustered together.

4.5. Flow extraction

This section evaluates the flow extraction performance of our system and the flow-tools system.

4.5.1. Preparation

We imported production NetFlow v5 data collected over a single day at an Icelandic ISP into both qflow and flow-tools. The number of collected flows was roughly 630M. The flows were split into 96 files, each one containing a 15 minute slice of the day. The same flows were imported into both systems, and both systems were configured to use the same compression scheme (zlib with compression level 5).

We performed two extraction tests:

1. Extracting a single IP. We picked a single IP at random from all the customer IPs. The IP had 10260 flows that day, containing 6 GB of observed traffic.

2. Extracting a network (CIDR range). We picked a /24 network range at random. The range had roughly 1.6M flows that day, containing 185 GB of observed traffic.

Each extraction was repeated 10 times and the results averaged. To make sure that the systems were actually getting the data from disk, we cleared the disk/page cache before each extraction using the following command:


echo 1 > /proc/sys/vm/drop_caches

Extraction in flow-tools requires a filter definition. We used the following filter configuration:

filter-primitive customer-ip
  type ip-address-prefix
  permit aa.bb.cc.dd/32
  default deny

filter-primitive customer-network
  type ip-address-prefix
  permit ee.ff.gg.hh/24
  default deny

filter-definition extract-ip
  match ip-destination-address customer-ip
  or
  match ip-source-address customer-ip

filter-definition extract-network
  match ip-destination-address customer-network
  or
  match ip-source-address customer-network

The commands used for the extraction in flow-tools were:

flow-cat 2014-08-21 | flow-nfilter -F ./filter.conf -f extract-ip
flow-cat 2014-08-21 | flow-nfilter -F ./filter.conf -f extract-network

For qflow extraction, we used the following commands:

qflow-extract erlent date:2014-08-21 -s aa.bb.cc.dd/32
qflow-extract erlent date:2014-08-21 -s ee.ff.gg.hh/24


4.5.2. Results

Tables 4.8 and 4.9 show the timing results for a single IP and a network, respectively. The elapsed time for flow-tools was almost the same for both tests, even though we were extracting a good deal more flows in the second test. This is expected because flow-tools needs to sequentially scan all records to extract the customer flows. As a result, flow-tools needs to do the same amount of I/O in both tests and it is clearly I/O bound.

System       Average wall time (s)   Average CPU time (s)
flow-tools   386.5                   482.5
qflow          2.6                     0.04

Table 4.8: Flow extraction performance for a single IP

System       Average wall time (s)   Average CPU time (s)
flow-tools   387.3                   616.9
qflow          3.9                     0.1

Table 4.9: Flow extraction performance for a network

Extraction using qflow only took a few seconds and was two orders of magnitude faster than flow-tools. This is not surprising because qflow uses the indexes to locate the customer flows and then only reads those flows. As a result, the extraction time will be dominated by the disk seek time required to search the files. Using back-of-the-envelope calculations, we can estimate the worst-case time bound for qflow extraction and verify that the experiment was within the expected bound.

The index files contained an average of around 10K keys, so in the worst case, it will take log2(10000) ≈ 13 seeks to locate the customer within the index using binary search. The server has enterprise-grade drives, so we can assume a seek time of around 5 ms. This brings the total seek time for a single index file to 65 ms. We had 96 index files, which means 96 times 65 ms, a total of 6.24 seconds. According to hdparm benchmark results, the RAID array can sustain 400 MB/s sequential data reads. The size of the extracted flows was 230 KB in the first test and 29 MB in the second. This adds 0.9 microseconds to the first test, and 97 ms to the second test. Finally, at worst the customer had flows in every 15-minute file, which would require an extra seek per file to get to the records. This brings the total to 11.04 seconds and 11.13 seconds, respectively.


4.6. Materialized views

This section evaluates the performance of materialized views.

First, we created 96 tablet index files with 10K to 1M random keys and then measured the time it takes to update a materialized view of a given size using an index file of the same size. Each update was repeated 10 times and the result was averaged.

Secondly, we created 100 materialized views with 10K to 1M random keys and then measured the query and export (full dump) time as a function of the size. The query targets were picked randomly from the set of existing keys.

We flushed the disk/page cache before each test.

(Plot: view file size versus number of keys, from 0 to 1M keys; file sizes range up to roughly 70 MB.)

Figure 4.1: File size of materialized view

Figure 4.1 shows that the materialized view format is efficient in terms of disk space. It can handle 1M customer keys using only 60MB.


(Plot: update wall time in milliseconds versus number of keys, from 0 to 1M keys; update times range up to roughly 2.5 seconds.)

Figure 4.2: Update time for materialized view

Figure 4.2 shows that the shadow copy technique is quite fast. Even at 1M keys, updates only take a few seconds. Considering that the total number of Icelandic IPv4 addresses is around 800K [20], and that updates only occur when flow tablets are added, which is usually every 5-15 minutes, the update performance is far from being a problem.

(Plot: query wall time in milliseconds versus number of keys, from 0 to 1M keys; query times range up to roughly 35 ms.)

Figure 4.3: Query time for materialized view


Figure 4.3 shows that our materialized views provide interactive query performance. Even at 1M keys, queries can be answered well within 100 milliseconds. Also note that the query performance is based on a cold cache, whereas in reality it is likely that commonly queried views would already be in the page cache.

(Plot: export wall time in seconds versus number of keys, from 0 to 1M keys; export times range up to roughly 6 seconds.)

Figure 4.4: Export time for materialized view

Figure 4.4 shows the export performance. The export operation is mostly used to export data for billing, discover orphaned IPs, and locate customers that are exceeding a given threshold, e.g. their transfer allowance. As such, the export performance is not critical. Nonetheless, the export operation performs adequately and finishes in a few seconds.


5. Conclusions

5.1. Summary

The use of qflow for flow accounting provides major advantages when compared with flow systems based on relational databases or flat binary files. The system is engineered for high performance and scales to billions of records on a single machine, but it also supports distributing work over a set of machines to scale beyond that.

Furthermore, the system can store flows compactly in a compressed format without sacrificing the ability to extract customer flows quickly. Extraction from a billion flows can be completed within a few seconds. Aggregate queries about customer traffic volumes can be answered within a hundred milliseconds.

Finally, the system is highly configurable and supports different types of deployment scenarios for both small and large ISPs.

Availability: This work is distributed under an open source license and is available at: http://hhg.is/qflow/

5.2. Future work

In this section, we discuss several ideas for improving the qflow system which, due to time constraints, could not be addressed in this thesis.

1. Making the collector multithreaded. Currently, the collector can only utilize one core because it is single threaded. As we saw in the experimental results, the collector is CPU bound, which means that more threads are needed to increase the maximum capacity.

It would be trivial to parallelize NetFlow v5 parsing because the packets can be processed independently and in any order, but the NetFlow v9 parsing would be more tricky since the template management requires the packets to be processed in a particular order and the template state needs to be shared between threads.

2. Distributed query service. When running qflow in a distributed environment, the individual databases need to be queried separately and the results merged. It would be useful to have a query mixer that acts as a frontend to a set of backend databases.

3. Duplicate detection in collector. With UDP, it is possible to get duplicate packets. The collector should protect against this. It can be tricky to implement duplicate detection because packets can also arrive out of order. Currently, this feature is also missing from other flow accounting systems, such as flow-tools and flowd.

4. Alternative load balancing strategies in the collector. Currently, the collector only supports one strategy for balancing load between a set of backends: round-robin. It might be interesting to explore alternative strategies, such as sharding by consistently hashing flows.

5. Column-oriented storage of flows. Currently, our flow database offers fast access in one dimension – the customer IP address. It would be interesting to explore alternative storage backends that support fast queries in multiple dimensions. Using column-oriented databases with bitmap indexes in the field of flow monitoring has proven promising [6].

Bibliography

[1] Althingi. Electronic Communications Act no. 81/2003, 2003.

[2] Cisco. NetFlow Reliable Export with SCTP. http://cisco.com/c/en/us/td/docs/ios/netflow/configuration/guide/15_1s/nf_15_1s_book/nflow_export_sctp.pdf, 2006. [Online; accessed 21-Aug-2014].

[3] Cisco. NetFlow Export Datagram Format. http://www.cisco.com/en/US/docs/net_mgmt/netflow_collection_engine/3.6/user/guide/format.html, 2014. [Online; accessed 20-Aug-2014].

[4] B. Claise. RFC 3954: NetFlow Services Export Version 9 (2004). http://www.ietf.org/rfc/rfc3954.txt, 2014. [Online; accessed 20-Aug-2014].

[5] L. Deri. nprobe: an open source netflow probe for gigabit networks. In Pro- ceedings of Terena TNC, 2003.

[6] L. Deri, V. Lorenzetti, and S. Mortimer. Collection and exploration of large data monitoring sets using bitmap databases. In Proceedings of the Second International Conference on Traffic Monitoring and Analysis, TMA’10, pages 73–86, Berlin, Heidelberg, 2010. Springer-Verlag.

[7] N. Duffield. Sampling for passive internet measurement: A review. Statistical Science, pages 472–498, 2004.

[8] M. Fullmer and S. Romig. The OSU flow-tools package and Cisco NetFlow logs. In Proceedings of the 2000 USENIX LISA Conference, 2000.

[9] F. Fusco, M. P. Stoecklin, and M. Vlachos. Net-fli: on-the-fly compression, archiving and indexing of streaming network traffic. Proceedings of the VLDB Endowment, 3(1-2):1382–1393, 2010.

[10] J.-l. Gailly and M. Adler. Zlib compression library, 2004.

[11] C. Gates, M. P. Collins, M. Duggan, A. Kompanek, and M. Thomas. More netflow tools for performance and security. In LISA, volume 4, pages 121–132, 2004.


[12] Google. Protocol Buffers: Google's Data Interchange Format. http://code.google.com/p/protobuf, 2014. [Online; accessed 19-Aug-2014].

[13] J.-l. Gailly, G. Roelofs, and M. Adler. zlib: Technical Details. http://www.zlib.net/zlib_tech.html, 2014. [Online; accessed 22-Aug-2014].

[14] S. Gunderson. Snappy. http://code.google.com/p/snappy, 2014. [Online; accessed 15-Aug-2014].

[15] P. Haag. Nfdump. Available from World Wide Web: http://nfdump.sourceforge.net, 2010.

[16] InMon Corporation. sFlow accuracy and billing. http://inmon.com/pdf/sFlowBilling.pdf, 2004. [Online; accessed 21-Aug-2014].

[17] S. McCanne and V. Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In Proceedings of the USENIX Winter 1993 Conference, pages 2–2. USENIX Association, 1993.

[18] Ministry of the Interior. Regulation no. 526/2011 on electronic telecommuni- cations billing, 2011.

[19] F. Reiss, K. Stockinger, K. Wu, A. Shoshani, and J. M. Hellerstein. Enabling real-time querying of live and historical stream data. In Scientific and Statistical Database Management, 2007. SSDBM'07. 19th International Conference on, pages 28–28. IEEE, 2007.

[20] Reykjavik Internet Exchange. List of Icelandic IPv4 prefixes. http://www.rix.is/is-net.txt, 2014. [Online; accessed 21-Aug-2014].

[21] R. Sinha, C. Papadopoulos, and J. Heidemann. Internet packet size distributions: Some observations. Technical Report ISI-TR-2007-643, USC/Information Sciences Institute, May 2007.

A. Flow protobuf

message DataFlowset { repeated netflow flow = 1; }

message netflow {
  // Number of incoming bytes
  optional uint64 bytes = 1;

  // Number of incoming packets
  optional uint64 packets = 2;

  // IP protocol
  optional uint32 protocol = 3;

  // Type of Service setting when entering incoming interface
  optional uint32 src_tos = 4;

  // Cumulative OR of all the TCP flags seen for this flow
  optional uint32 tcp_flags = 5;

  // Source IP address
  optional string src_ip = 6;

  // Source address subnet mask (slash notation)
  optional uint32 src_mask = 7;

  // Destination IP address
  optional string dst_ip = 8;

  // Destination address subnet mask (slash notation)
  optional uint32 dst_mask = 9;


  // TCP/UDP source port number
  optional uint32 src_port = 10;

  // TCP/UDP destination port number
  optional uint32 dst_port = 11;

  // Input interface index
  optional uint32 input_if = 12;

  // Output interface index
  optional uint32 output_if = 13;

  // IP address of next-hop router
  optional string nexthop_ip = 14;

  // Source BGP autonomous system number
  optional uint32 src_as = 15;

  // Destination BGP autonomous system number
  optional uint32 dst_as = 16;

  // Timestamp of first packet in flow
  optional uint32 first_switched = 17;

  // Timestamp of last packet in flow
  optional uint32 last_switched = 18;

  // Type of Service byte setting when exiting outgoing interface
  optional uint32 dst_tos = 19;

  enum IpVersion {
    IPV4 = 0;
    IPV6 = 1;
  }

  optional IpVersion ip_version = 20 [default = IPV4];

  enum Direction {
    INGRESS = 0;
    EGRESS = 1;
  }

  // Flow direction as indicated in flow export. This field is only
  // available in Netflow v9 and is based on interface
  // ingress/egress exporting.
  optional Direction direction = 21 [default = INGRESS];

  // The rate at which packets are sampled, i.e. a value of 100
  // indicates that one of every 100 packets is sampled. If this
  // field is set, then the flow counters (bytes, packets) have
  // been scaled up accordingly.
  optional uint32 sampling_interval = 22;

  // Customer IP address.
  optional string customer_ip = 23;
  // Customer routing mask (taken from the src/dst mask).
  optional uint32 customer_mask = 24;
  // Flow direction from the customer point of view.
  optional Direction customer_direction = 25 [default = INGRESS];
  // Customer tag, arbitrary string identifier.
  optional string customer_tag = 26;

  // Exporter IP address.
  optional string exporter_ip = 27;
  // Exporter source port.
  optional int32 exporter_port = 28;
}


B. Collector configuration protobuf

message Backend {
  // Name of backend.
  optional string name = 1;
  // Backend specification in the form of "ip:port".
  optional string host = 2;
};

message BackendGroup {
  // Name of backend group.
  optional string name = 1;
  repeated Backend backend = 2;
};

message Matcher {
  // Name of matcher.
  optional string name = 1;
  // Name of backend group for flows that match the pattern.
  optional string backend_group = 2;
  // Name of backend group for flows that don't match the pattern.
  optional string backend_group_complement = 3;
  // Match expression. Uses the qflow filter language.
  // The expression can also reference prefix groups defined in
  // this config, e.g. "peer_ip in isroutes".
  optional string pattern = 4;
};

message Tagger {
  // Name of tagger.
  optional string name = 1;
  // Path to a file containing a list of tag definitions, one line
  // per tag. The line format is: "prefix|tag".
  optional string path = 2;
};

message PrefixGroup {
  // Name of prefix group.
  optional string name = 1;
  // List of prefixes.
  repeated string prefix = 2;
  // Import prefixes from file.
  repeated string path = 3;
};

message Exporter {
  // Exporter name.
  optional string name = 1;
  // Exporter IP address.
  optional string ip = 2;
  // List of taggers to run.
  repeated string tagger = 3;
  // Name of a prefix group containing customer prefixes.
  optional string customers = 4;
  // List of matchers to run.
  repeated string matcher = 5;
};

message FlowConfig {
  repeated Exporter exporter = 1;
  repeated Matcher matcher = 2;
  repeated BackendGroup backend_group = 3;
  repeated Tagger tagger = 4;
  repeated PrefixGroup prefix_group = 5;
};

C. Grammar for the filter language

<filter-expression> ::= <and-expression> { "||" <and-expression> }

<and-expression>    ::= <term> { "&&" <term> }

<term>              ::= "true"
                      | "false"
                      | "not" <term>
                      | "inbound"
                      | "outbound"
                      | "(" <filter-expression> ")"
                      | <field-id> <operation> <value>

<field-id>          ::= "src_ip" | "dst_ip" | "src_port" | "dst_port" | ...

<operation>         ::= "==" | "!=" | "<" | ">" | "<=" | ">=" | "in"

<value>             ::= <int> | <string> | <ip-address> | <cidr>
                      | <cidr-list> | <name>

<cidr-list>         ::= "{" <cidr> { "," <cidr> } "}"
