- Market Survey - Detection & Filtering Solutions to Identify File Transfer of Copyright Protected Content for Warner Bros. and movielabs

Version 1.5 14.3.2011

Thomas Sladek, Eduard Bröse EANTC AG

Copyright (C) 2011 EANTC European Advanced Networking Test Center Aktiengesellschaft

This document is copyrighted by EANTC AG. It may not, in whole or in part, be repro- duced, transmitted by any means or stored in any web site or electronic retrieval sys- tem without the prior written permission of EANTC AG. EANTC AG grants the receiving party of this test plan a non-transferable right to use this document for internal purposes with regards to projects with EANTC. All copies must retain and reproduce this copyright notice and all other copyright notices contained within the original material.

Einsteinufer 17 D–10587 Berlin Germany

Tel. +49. (0)30. 318 05 95–0 Fax +49. (0)30. 318 05 95–10 E-Mail [email protected] WWW http://www.eantc.de/ Table of Contents

Introduction ...... 6 Motivation of this document ...... 6 Basic definitions ...... 7 Terminology...... 7 Abbreviations ...... 9 Contacts ...... 12

Technology Overview ...... 13 File distribution techniques ...... 13 HTTP and FTP ...... 13 Direct Downloads...... 15 Centralized P2P Architecture...... 15 P2P with Decentralized Architecture...... 17 P2P-based Streaming ...... 17 Anonymized Distributed Architectures ...... 18 Steganographic Protocols...... 19 Detection Techniques...... 19 Payload-agnostic Filtering...... 19 DPI-based Protocol Detection...... 20 DPI-based Content Detection ...... 21 Content Analysis ...... 23 Blocking Techniques...... 23 Traffic throttling techniques...... 26 Solutions Based on HTTP Proxy...... 27 Device Classification...... 27 Principle of Operation...... 27 Network Connection...... 28 Conventional Proxy ...... 28 Transparent Proxy...... 30 Conclusion...... 30

Service Provider Challenges...... 32

3 Table of Contents

Network Technology Perspective ...... 32 Integration into Service Provide (SP) networks ...... 32 Resiliency ...... 33 Network performance considerations...... 34 Network security considerations ...... 35 Copyright database handling ...... 37 Potential service provider design ...... 39 Encapsulation ...... 40 Link Aggregation...... 41 Asymmetric Traffic ...... 42 Monitoring in Impaired Traffic Flows...... 42 User Perspective ...... 43

Protocol-oriented solutions...... 44 Procera PacketLogic ...... 44 Device classification ...... 44 Hardware/software platform ...... 44 Principle of Operation...... 45 Network Connection...... 46 Supported Protocols...... 47 Additional potential advantages for the service provider ...... 47 ipoque PRX ...... 48 Purpose...... 48 Platform ...... 48 Provider Network Integration ...... 48 Principle of Operation...... 49 Additional potential advantages for the service provider ...... 50

Content-Oriented Solutions ...... 51 Vedicis V-Content Smart Switch ...... 51 Device Classification...... 51 Platform ...... 52 Provider Network Integration ...... 53 Principle of operation...... 54 Advanced Features: Protocol Decryption ...... 55 Additional potential advantages for the service provider ...... 55

Web Content Filtering...... 57 Blue Coat ...... 57 Device classification ...... 57 Hardware/software platform ...... 58 Network connection...... 58 Principle of Operation...... 58 Supported Protocols...... 59 Additional features ...... 59 Additional potential advantages for service provider ...... 59 Cisco IronPort...... 60 Device classification ...... 60 Platform ...... 60 Network Connection...... 60 Principle of Operation...... 60 Supported Protocols...... 61

4 Table of Contents

Additional Capabilities ...... 61 Additional potential advantages for the service provider ...... 61 SafeNet eSafe...... 62 Device Classification...... 62 Platform ...... 62 Network Connection...... 62 Performance ...... 62 Supported Protocols...... 62 Additional potential advantages for the service provider ...... 63

Subscriber Notification...... 64 Front Porch ...... 64 Purpose...... 64 Platform ...... 64 Network Capabilities...... 64 Principle of Operation...... 65 Supported Protocols...... 66

Executive Summary...... 67 Solutions Overview ...... 67 Vendor Comparison...... 68

5 1 Introduction

1.1 Motivation of this document

In the last 10-15 years, the average bandwidth available to common users grew enormously, from 14-64 KBit/s of the dial-up and ISDN connections to 25-100 MBit/s of the modern VDSL connections. The steadily increasing transfer and ever decreasing storage capacity gave Internet users the possibility to perform a leap from viewing tiny pictures and plain text to downloading large , digitally distributed software, using voice over IP communication and streaming video. While this capacity has opened numerous new possibilities of doing busi- ness by distributing multimedia and other data over Internet instead of phys- ical media, it also allowed users to illegally distribute copyrighted material. File eventually became one of the main contributors of the ever- increasing traffic volume transferred over the Internet and on the other end quickly displaced other, conventional methods of distributing illegal copies of copyright-protected works. could create bandwidth starvation for Internet service providers due to high traffic consumption. File sharing also deprives copyright holders from potential revenues. At the same time, file sharing technologies devel- oped innovations in terms of efficient file distribution mechanisms, resiliency and security. File sharing technologies are currently used in commercial products for content distribution. In this survey we attempt to analyze the features of various file sharing tech- niques currently widespread on the Internet as well as the technologies and solutions designed to detect and police such traffic. We analyze how well such solutions can be integrated into provider networks, their potential accu- racy, performance, functionality and pitfalls that can be expected. We also analyze how suitable various solutions are for different types of file sharing. Legal aspects will not be covered by this document.

6 Introduction

1.2 Basic definitions

Before we can continue with the description for various technologies of file sharing and the filtering techniques, it is important to clarify the terminology used in this survey. File sharing can be done in various ways and has many aspects, and it is important to avoid ambiguous definitions which may lead to incorrect understanding regarding the technologies that are actually applicable in each case.

Terminology

FILE SHARING Throughout our survey, we will often use the term “file sharing” to describe the entirety of the ways Internet user may exchange data on the Internet. This term is supposed to be understood as the broadest definition of this activity. We intend to use this term independently from the actual method of the content distribution and the copyright status of the content itself. With “illegal file sharing”, accordingly we explicitly define the file sharing of copyright-protected content. In the following chapters we describe filtering techniques aimed either at file sharing in general, or on illegal filesharing specifically. From the technical standpoint, file sharing is not limited to the peer-to-peer (P2P) protocols only, as we will see in the next chapter, and therefore should not be viewed as synonymous with it. In recent years we saw a steady shift of file sharing from P2P to other methods of distribution, specifically so called direct services, that use conventional HTTP. File sharing, as the name implies, is typically a process of exchanging static data files between Internet users. The sharing may occur directly between users, as in peer-to-peer (P2P) networks, or via intermediate storage, as in case of static servers and direct download services. Specifically video or audio content may also be exchanged between users in form of live streams. Although strictly speaking, this type of content distri- bution does not involve files, it can be included into definition of “file- sharing”, as the underlying methods and protocols, as well as methods of detection and analysis are similar from the technological point of view. File sharing occurs in the Internet using a variety of methods with significant differences in the way the files are uploaded, downloaded and searched. We describe these different file distribution architectures in the next chapter in detail. Many of the filtering techniques and solutions are designed only to handle specific file sharing techniques. When describing these, we will use the more narrowed down terms to describe the class of traffic in question. This document does not cover P2P live streaming.

ROLES IN FILE Different ways of distributing files also may impose different challenges. In SHARING PROCESS the following sections we describe four roles Internet hosts may play in the filesharing process • Static central servers that can provide data storage and coordination between individual users. • Internet forums that provide announcements of new releases and also useful auxiliary information and search capabilities for the users. • Internet users involved in the file sharing by downloading content. In most cases hosts on a broadband connection, which implies relatively low and asymmetric bandwidth, and volatile addresses.

7 Introduction

• Internet users providing the initial data source for the file sharing networks. In various terminologies related to file sharing they often called as “uploaders” or “seeders”. On the other hand we have several other parties involved in the process indirectly, or capable of observing it: • Internet service providers - companies, organizations or divisions of large ISP companies specializing in access to the Internet for users. They are most likely to encounter filesharing traffic in their network and are able to utilize file sharing filtering techniques. They also have the aim of keeping the total traffic flow in acceptable limits in order to be able to serve a large number of subscribers or users on their network. • Carriers serve as the large-scale providers and transport Internet traffic from different sources in their networks, from other providers, broadband users and businesses alike. Due to large quantities and different types of aggregated traffic transported in their networks and usually no direct connection to the individual Internet users, monitoring and filtering of the file sharing traffic is difficult. • Internet hosting providers, companies that provide web-based services that can be involved in the different types of file sharing process, including the file hosting services and Internet websites and forums. • Companies interested in enforcing their copyright, or companies acting on behalf of copyright owners, in order to perform analysis of file sharing traffic, or investigate specific cases of illegal file sharing. Depending on the actual type of such company or organization, the legal aspects of the investigation may vary extremely. Regarding the analysis and investigation of specific users, one should keep in mind that the term “Internet user” when mentioned in the following sections, mostly refers not to a person, but to a network entity represented by a single IP address of the host involved in the file sharing process. For most parties except the user’s immediate Internet service provider, it is typically not possible to link an IP address to a specific broadband account or person. Internet service providers can be of different kinds. When discussing file sharing scenarios, one should not only assume that the discussion centers around broadband ISPs and private users. Mobile Service Providers (MSPs), carriers, as well as large companies, organizations and education institu- tions can play a similar role. These types of service providers have different interests, abilities and responsibilities. Not every legal and technical aspect can be applied to different types in the same way. For example, a company or institution may enforce more strict policies for Internet access than a broadband service provider, but at the same have less capability to asso- ciate observed IP addresses with specific persons.

8 Introduction

Abbreviations List of abbreviations

Abbr. Meaning Explanation AAA Authentication, Authoriza- Protocols and associated infrastructure of the providers respon- tion, Accounting sible for the authentication of the dialup, broadband, wireless or mobile internet users and collection of accounting data (e.g. used up traffic volume) ADN Application Delivery Network technologies designed to improve networking application Network performance, security or collaboration in companies or organizations. ADSL Asymmetric Digital Most widespread form of DSL access for private subscribers. Charac- Subscriber Line terized by significantly lower than download bandwidth. DSL specifications described as “ADSL” provide access speeds of up to 24 MBit/s downstream (in most practical cases limited to 12 or 16 MBit/ s) and up to 1.4 MBit/s upstream. API Application Programming Definitions of data structures and functions that can be used by third interface party applications to use specific functionality in existing software. ATM Asynchronous Transfer High-bandwidth optical transport network, increasingly deprecated Mode by Ethernet, but still widely utilized in legacy networks. BNG Broadband Network Gateway device that terminates the immediate connection to a broad- Gateway band user’s equipment and routes the traffic to the Internet. For the user it usually appears as the nearest router. BRAS Broadband Remote Access Gateway device that terminates the local connections from the broad- Server band users and forwards their traffic to internet. Typically aggregates traffic from few to dozens of DSLAMs and thousands of broadband users. This term is deprecated by the more generic “BNG”, but still frequently used. CLI Command Line Interface Interface to devices, software or operating systems where control is performed by entering string commands. This interface is easiest to implement on both server and side and is best suited for auto- mation. DDoS Distributed DoS A type of DoS attack performed simultaneously from many hosts in order to increase efficiency or exhaust target’s resources. DHCP Dynamic Host Configura- Widely used protocol for automatic IP configuration and other param- tion Protocol eters (e.g. DNS servers) for computers attaching to a network. DNS Domain Name System Worldwide network of servers and the associated protocols that primarily perform resolution of domain and host names to IP addresses. DOCSIS Data Over Cable Service Colloquial: “Cable Internet”. Family of standards specifying broad- Interface Specification band access method which uses available frequency ranges in televi- sion cable for the last mile connection. Another widespread broad- band access method for private subscribers alongside DSL. DoS Denial of Service Malicious attack on a device or service aimed to disrupt its normal operation. DPI Deep Packet Inspection The entirety of network traffic analysis techniques that inspect not only the headers, but also payload of the packets DRDL Datastream Recognition A markup/programming language internally used by Procera Definition Language Networks to define the recognition rules for their DPI devices. DSCP Differential Service Code A field in IPv4 packet header that specifies the priority of the packet. Point DSCP-aware routers are capable of priorizing transmission of certain packets in order to ensure transmission quality requirements (i.e. latency, loss ratio) of specific protocols or services. DSL Digital Subscriber Line Broadband access method that utilized copper pairs of telephone cables as the last mile connection DSLAM DSL Access Multiplexer Device that terminates the DSL link from the user’s modem and relays traffic to conventional ATM or Ethernet links.

9 Introduction

Abbr. Meaning Explanation FP Flow Processor Hardware unit in Procera Networks devices that performs DPI analysis on packet data. FTP Application protocol primarily aimed to transfer of large files between clients and servers. IDS Intrusion Detection Systems Firewall-like devices equipped with techniques to intercept malicious traffic and payloads. ISP Internet Service Provider Company responsible for provision of Internet access to private and corporate users. GGSN GPRS Core Network Part of a mobile networks infrastructure that serves as the gateway to IP network GNU GNU is Not Unix Mass collaboration project responsible for development of numerous free and open source applications, and in general providing support, promotion and guidelines for free software development and usage. GPRS General Packet Radio Access to IP network (i.e. the Internet) for 2G and 3G mobile devices. Service GRE Generic Routing Encapsu- An encapsulation protocol capable of transporting various Layer3 lation network protocols over IP tunnel. Can be used by service providers to transport subscriber traffic from access equipment to provider network over Internet. ICAP Internet Content Adapta- A protocol supported by some traffic analysis devices (e.g. firewalls, tion Protocol DPI devices, proxies) to pass some of the traffic to another devices for additional analysis. For example, a firewall without mail processing capabilities may recognize SMTP traffic and pass it to a spam/virus filter. LAN Local Area Network A relatively small network usually managed by a single authority such as private person, company or organization and usually consisting of a single layer 2 switched network. HTTP HyperText Transfer Most widespread layer 7 (application) protocol in the Internet. Protocol Intended for delivery of web page content, but today also serves as a basis for many other protocols including video streaming. HTTPS HTTP Secure Encrypted version of HTTP. Works by encapsulation of HTTP in SSL. L2TP Layer 2 Tunneling Protocol Encapsulation protocol to carry layer 2 (e.g. Ethernet) traffic transpar- ently over IP network MD4, Message-Digest algorithm Family of cryptographic digest algorithms developed by Ron Rivest MD5 4, 5 (released versions: MD2, MD4, MD5 and MD6). The algorithm computes a fixed-size signature of a binary data block without prac- tical possibility of reverse computation. The MD2/4/5 algorithms are currently considered not sufficiently secure for some applications. MPLS Multi-Protocol Label A versatile and efficient routing architecture primarily used in core Switching networks of providers and carriers. P2P Peer-to-Peer Class of protocols, usually in file sharing area, where data transfers primarily occur between clients, as opposed to conventional client- server communication. PAC Proxy Auto-Configuration File containing proxy auto-configuration information. The file can be supplied by network operator and retrieved by browsers supporting one of the mechanisms. PADE Protocol and Application A DPI analysis engine internally used by ipoque in their series of DPI Decoding Engine products. PIC Procera’s PacketLogic Intel- A component of Procera PacketLogic solution. ligence Center PLR PacketLogic Real-Time A component of Procera PacketLogic solution. Enforcement PLS Procera’s PacketLogic A component of Procera PacketLogic solution. Subscriber Manager

10 Introduction

Abbr. Meaning Explanation POP Point of Presence location Location of an ISP’s equipment providing interface to the Internet, as opposed to the access network providing the connection between the subscriber and the nearest POP. PPP Point-to-Point Protocol Protocol primarily used to authenticate and transport the traffic of DSL subscribers to the BRAS PPPoE PPP over Ethernet PPP transport over Ethernet links PPPoA PPP over ATM PPP transport over ATM links Q-in-Q 802.1q - in - 802.1q Introduction of a second layer of VLAN segregation by adding a second VLAN tag SFTP Secure FTP SSH-based protocol for transferring files over encrypted SSH tunnel SHA1, Secure Hash Algorithm Family of cryptographic digest algorithms developed by National SHA2 Security Agency (released versions: SHA0, SHA1, SHA2, SHA3 with variants). SHA1 is currently considered insufficiently secure for some applications and a move to SHA2 algorithms is urged. STP Spanning Tree Protocol Protocol used in the switched networks in order to prevent loop connections SMTP Simple Mail Transfer Main protocol used to transfer e-Mail messages between mail servers Protocol in the Internet. SNMP Simple Network Manage- A protocol and associated specifications that is used for setting and ment Protocol retrieving of configuration and statistics, as well for asynchronous noti- fications/alarms from devices and services. SOCKS (no specific acronym, but Networking protocol for transparent proxying of TCP connections. written in capital letters) Works transparently compared to HTTP, thus allowing proxying of any TCP-based protocols. Versions 4, 4a and 5 are widespread and supported in numerous software, e.g. web browsers. SSL Secure Socket Layer /Authentication protocol capable of encapsulating other application layer protocols, most notably HTTP SSH Secure SHell Encrypted protocol for accessing remote hosts. Can be used to estab- lish an encrypted tunnel for transport of any other TCP-based protocol TCP Transmission Control The most widespread layer 4 (transport) protocol in the Internet. The Protocol vast majority of application protocols uses it for transmission of their data. Characterized by being connection-oriented, reliable data delivery and automatic adjustment of traffic rate to network conditions (flow control). TCP RST TCP Reset A value in TCP packet header indicating that connection is being closed by the sending party. UDP User Datagram Protocol Second widespread layer 4 protocol. Primarily utilized by real-time application protocols, such as video/audio streaming and gaming. Characterized by being connectionless and unreliable data delivery. URL Uniform Resource Locator A string uniquely identifying location of a file or resource on the Internet. Consists of VDSL Very-high-bitrate Digital A new DSL specification with higher data rates than ADSL. Various Subscriber Line VDLS variants are capable of reaching data rates of up to 200 MBit/ s downstream. VLAN Virtual LAN Method of segregation of the same switched network into many parallel virtual ones. Packets of each virtual network are identified by the VLAN tag added to the packets. VoIP Voice over Internet Protocol VPN A network implemented through tunneling protocols over public Internet, but appears as a local Layer 2 or Layer 3 network to the users. WPAD Web Proxy Auto-Discovery A method of automatic proxy configuration supported by some of the web browsers

11 Introduction

1.3 Contacts

Warner Bros. Entertainment GmbH, Humboldtstrasse 62, 22083 Hamburg Christian Sommer, Director EMEA Anti-Piracy Operations, [email protected] +49.40.22650 366, +49.172.453 71 59

Motion Pictures Laboratories Inc., 130 Lytton Avenue, Suite 120, Palo Alto, CA 9430, United States of America Raymond Drewry, VP EMEA Operations, Principal Scientist, [email protected] +44.149.481 42 36

EANTC AG, Einsteinufer 17, 10587 Berlin Thomas Sladek, Project Manager, [email protected], +49.30.3180595-32, +49.178.458 32 04 Eduard Bröse, Test Engineer, [email protected], +49.30.3180595-34, +49.179.13 17 875

12 2 Technology Overview

2.1 File distribution techniques

This section of the report offers a catalogue of the current file sharing tech- niques commonly used in the Internet. Given the nature of new techniques developments no such list can ever be 100% complete - protocols and new file sharing solutions are quickly developed as soon as a blocking mecha- nism exist for a legacy file sharing system.

HTTP and FTP downloads Files are located at a conventional HTTP or FTP server and may be down- loaded using any browser without a need for additional software. The users search for these files mostly by following links posted in Internet forums or in chat rooms. Unless indexing is explicitly forbidden by the server adminis- trator, the files may also be found using search engines like . The upload to the server may be done by the server administrator, by users explicitly entitled with upload rights, or in some cases by anyone if the server allows public upload of files. Server-based illegal file sharing that are open to the public are seldom used these days. Such server has a specific location that is easy to determine and therefore prone to be shutdown by authorities. On the other hand, this method is common for first-stage distribution of the content in closed non- public groups. In this case, the server most probably will be secured against public access. It should be noted that in some cases legitimate, but poorly maintained servers, could be hacked and used for distribution of content.

13 Technology Overview

FIGURE 1. HTTP/FTP downloads Internet Forums File Server (HTTP/FTP)

Uploader

Downloaders Announcements, information Content downloads Content

POSSIBILITY OF ANAL- Traffic to and from servers such may be detected by traffic monitoring in the YSIS Internet core if the transmissionunencrypted. is By intercepting the down- load or upload requests, it is possible to determine the file names, sizes and the advertised type of the content.certain Under circumstances it is also possible to determine the website from which the user has accessed the file. Finally, the complete payload or fragments of it can be captured for content analysis, e.g. for automatic detection of copyrighted content.informa- This tion is accessibleunencrypted in transmissions, regardless ifserver the uses authentication or not. In some cases, the actual data transfers may occur out of reach for the moni- toring device. So, for example, someservers FTPsupport so called FTX tech- nique that allows an FTP client toserver instruct to aretrieve and store a file from another FTPserver. In this case, the client avoids the transmission of file data to and fromservers the and only maintains a control connection. This connection can still be monitored for filenamesdirectory and information.

ENCRYPTED TRAFFIC When secure hyper text transmission protocol (HTTPS) is used to access a web server, and server the certificates are correctly configured, no feasible methods exist to eavesdrop on the connectiondetermine and the content of the transferred files. Ifserver the does not use certificatesproperly, the connection may be monitored, but this requires an intrusive man-in-the- middle cyber-attack, which could be mounted by a device located in the traffic path. Similar considerations are valid for Secure FTP / SSH access. When monitoring HTTPS or SFTP/SSH traffic, only the IP address of the server is known. For large websites that use dedicated IPs or IP ranges, it is easily possible determine to the website domain/host through reverse DNS lookup, it is not possible however to decrypting tell without traffic, which exact URLs/files are requested, information as this is concealed in the encrypted data. In case of co-hostedservers, where multiple small websites are hosted on the server same and under sameIP, it is also not possible to tell which of the hosted websites is visited using necessary HTTPS, as the information (“Host” HTTP header value) wouldencrypted. be also

14 Technology Overview

Direct Downloads The filesharing trends in the last few years showpeer-to-peer that while the (P2P) protocols traffic is stagnating or in some regions even declining in rela- tive terms as a percentage ofinternet overall traffic, the use of direct- down load services (such as Rapidshare, ) is steadily increasing. While the uploader to the static servers HTTP/FTP described in the previous section also usually plays administrative role, direct services download are administered by unrelated companies and provide public, and in many cases anonymous access for both uploaders and downloaders.- A registra tion is not required on most such services in order to use them, although non- paying users often meet restrictions for the traffic amount and speed. Such services are also usually limit the maximum size of the files, forcing the uploaders to split large files into several fragmentsindividually. uploaded

FIGURE 2. Direct Downloads Internet Forums File hosting Service A File hosting Service B

Downloaders

Announcements, information Uploader Content downloads Content uploads

The monitoring and filtering of illegal files sharedservices over is such similar to the HTTPservers. Compared to arbitrary the HTTP traffic moni- toring, suchservers are located at well known IP ranges and- HTTP transmis sions in the Internet and could therefore be easily identified as access to the known direct download sites. The direct download sites also seldom allow HTTPS for transmissions.

OBFUSCATION In order to conceal the identity of the content, the uploaders often use featureless file names encrypted and archives. This prevents an automatic detection of illegal files by party the third or by the direct download providers. The nature of the content in this casedetermined can be by only manual search of such linksInternet in forums dealing with filesharing. In thisform of file sharing some of the users downloading the content may also spread further it to other filesharingservices or reupload parts file that were deleted.

Centralized P2P Architecture Several popular P2P protocols including the conventional , eDonkey, and Direct Connect, use a central server in order to search for files

15 Technology Overview

and to locate suitable peers for transmission. Such architecture usually allows for simple detection of hosts sharing a specific content. In such protocols, a user wishing to download a specific file will send a request to server the containing unique identifier of the file and receive a list of known hosts offering this file. This infrastructure can be - exploited to auto matically locate the users sharing illegal querying content by the central server.

FIGURE 3. Centralized Architecture Internet Forums Central Server(s) Central Server (alternative)

P2P Client (Uploader)

P2P Clients Announcements, information (downloaders) Peer search, content search

P2P data transfers

ENCRYPTED PROTO- Some of these P2P protocols alsoencrypted have versions, such as COLS encrypted BitTorrent orencrypted eDonkey. Theencryption does not provide protection against the aforementioned searchingquerying by the central server and is used only for the purpose of concealing the traffic between peers from Deep Packet Inspection (DPI) devices. Using HTTPS or other encrypted protocols query to the centralserver also does not provide such protection.

CONTENT IDENTIFICA- For an automatic monitoring and filtering device locatedInternet in and the TION designated to monitor traffic of such P2P protocols for illegal content, only limited information is usually available. The traffic exchanged between two peers sharing a file usually does not contain the fileinforma name- or other tion. However, the traffic exchange may contain the unique ID of the file. Such ID in most protocolscryptographic is a hash (e.g.MD4 ineDonkey, SHA1 inBitTorrent) calculated over the file contents information or similar (e.g. inBitTorrent - over some fragments of the.torrent file). This ID allows an unambiguous identification of a specific file in the P2P traffic, but must first be identified as an illegal content.This could be achieved manually, or using a semi-automatic searchInternet of forums. The use encrypted of variants of the P2P protocols will informa conceal- this tion and a feasible method to extract it from monitored traffic may require a similar complexity as for the monitoringHTTPS/SFTP, of i.e. may require a man-in-the-middle attack on the conversation.

16 Technology Overview

P2P with Decentralized Architecture Some P2P protocols (including BitTorrent) have introduced decentralized peer search that does not require server a central to find nodes sharing a specific file. Decentralized P2P architecture is usually capable- of reorga nizing itself dynamically by building tree-like search networks- and by auto matically selecting nodes with higher network bandwidth as “hubs”. This feature is useful against the failure or the blockadeserver, of the central and also may prevent the centralized search for the users sharing illegal contents. Nonetheless, similarinformation can be automatically gathered from the distributed network, albeit witheffort. more The decentralized matchmaking has no effect on the actual data transfers between clients, so the same characteristics as described in the previous section apply.

FIGURE 4. Decentralized Architecture

P2P Client (Uploader)

Internet Forums

P2P Clients (Downloaders) Announcements, information

Peer search, content search

P2P data transfers

P2P-based Streaming The conventional P2P protocols are intended to transmit static files and often transfer data blocks not in sequence. P2P distributionhowever, principles, can also be utilized for streaming audio and video media. Instead of using a central streamingserver farm or multicast routing for media distribution, both methods generally inaccessible to generalInternet users, streaming data can be transported from user to user in a manner similar to P2P downloads. This kind of video distribution gained popularity primarily in China, with the most prominent applications being PPlive/PPstream. The client supports both presentation of static movies and live transmissions, where the media is

17 Technology Overview

sourced in real time from TV channels. PPlive also gained popularity in the western countries, mostly due to broadcastssport for events live that were not available in free TV broadcast in Europe or America.

FIGURE 5. P2P-based video streaming

Streaming Client (Stream Source)

... Streaming Clients

...

Peer search, content search

P2P data transfers

Anonymized Distributed Architectures

WINNY, , In the last years, several P2P protocols have emerged that allow for PERFECT DARK complete anonymity of the users when exchanging content. In Japan, strict copyright laws and their rigorous enforcement gave rise to- several anony mizing P2Pprotocols-, Share and Perfect Dark which now dominate Japanese P2P traffic. far, So the efforts of the police and copyright holders to uncover the anonymity of the users were only possible via side channel attacks such as exploiting security holes in the client software or discovering users via web forum posts. All traffic of such protocolsencrypted is and impossible to analyze in the network. In addition, data transfers could be led through multiple nodes and stored inencrypted form in the caches. A singlemay, therefore, be unable to determine which content it forwards for other nodes, or be able to tell the originating source of the data it downloads. Stochastic data transfers instead of persistent connections may be used to conceal the data transfer behavior of the nodes.

ONION, In the western hemisphere, the multi-purpose anonymizingTor networks and Freenet were developed for the purpose of combating censorship laws and in order to provideinformation free exchange for Internet the users under totalitarian regimes. network allows creation of so called “hidden services”, typically webservers only reachable throughTor the network via special IDs resembling domain names. There are no feasible methods to determine the actual physical location of suchservice hidden server. Tradi- tional P2P protocols and other communication can be proxiedTor over the network, which makes it impossibledetermine to the physical location of a node. For this purpose the software (e.g. a BitTorrent client) only needs to support SOCKS proxyinterface, which is providedTor by daemon.

18 Technology Overview

However according to Wiki the page of Tor the project (https:// trac.torproject.org/projects/tor/wiki/TheOnionRouter/TorFAQ) file sharing is widely unwanted inTor the network and exit nodes are configured to block file sharing traffic by default. Freenet is another anonymized network with possibility of hidden content hosting and anonymized access. Freenet primarilyserves access to hidden web content, but also can be used to distribute files.

DISADVANTAGES The disadvantage of anonymous networks is a significantly lower throughput as the data is retransmitted through a chain of peers. This disadvantage is likely to be resolved over time as residential users are receiving more and more upstream bandwidth from their providers (e.g. VDSL standard is capable of up to 16 Mbit/s). A new node may also require considerable time before the connection to the network can be established, for example started a freshly Freenet node will reach its full connectivity and speed only after Tor several is usuallyhours. capable of near-instantconnectivity, but in some cases may still need to spend up to a minute or two to find suitable neighbor nodes.

Steganographic Protocols Although no practical examples for filesharing networks currently exist, it is conceivable and expected that with the increased suppression of P2P traffic through DPI filtering solution, new P2P protocols can be developed that mimic other traditional protocols and transfer data in their payload.- DPI solu tions would fail to correctly classify this type of traffic or would require much more extensive analysis. The steganographic techniques may have a drawback of - increased over head in the transmissions, which again will be mitigated by- growing band width available to the broadband users.

2.2 Detection Techniques

Individual solutions deploy a variety of different techniques to analyze the traffic and the transferred contentsvarying with degree flexibility, of reli- ability and coverage of existing networking Fundamentally, protocols. the automatic detection and filtering devices can be separated into the following three major classes: payload-agnostic filtering, protocol-based DPI devices, content recognition and content analysis devices.

Payload-agnostic Filtering These devices provide basic filtering mechanisms for Internet traffic that rely exclusively on information the available in the packets transport for up to layer and do perform not analysis of the payload.

FIGURE 6. Analyzed areas in payload-agnostic solutions

Ethernet IPv4/6 TCP/UDP Payload

Analysis

19 Technology Overview

The typical application area of such devices is the protection of local networks from malicious activities from the Internet by limiting the access to specific services and/or addresses, i.e. the function of a firewall. Firewall filtering can provide only basic function of blocking filesharing: • Ports associated with popular P2P applications can be blocked. This technique no longer provides any significant protection against file- sharing, as all modern P2P applications can use arbitrary ports. • The number of concurrent connections for each distinct subscriber IP address could be monitored and limited. • Traffic bandwidth for each distinct subscriber IP address could be moni- tored and limited. • IP addresses associated with popular P2P servers (e.g. BitTorrent trackers and eDonkey servers), direct download services (e.g. Rapidshare servers) and related filesharing forums or search engines can be blocked, thus limiting the connectivity of P2P protocols with centralized architecture and direct download services. In any case, such devices are not able to perform content-dependent filtering, they will affect transmissions of any content, including legitimate, and possibly other traffic not related with filesharing. Applying such payload agnostic filtering techniques to Internet traffic is akin to amputating a patient‘s leg when only the toe is suffering. On the other hand, this detec- tion/control method can be used to perform a very coarse heuristic detec- tion of filesharing-like user behavior, including also obfuscated and encrypted protocols, for example by monitoring or limiting the number of concurrent connections. Payload-agnostic detection, therefore, can be used as a preliminary stage for identifying possible file-sharing, but then require more sophisticated detection methods are applied. Benefits • Most modern routers have this functionality built in. • Well integrated into existing infrastructure. • Most modern routers perform well with such filters active. Drawbacks • Lack of intelligence in the system forces the network administrator to completely block services and treats all downloads as illegal. • Management of such filters can require high effort in some cases. • The solution is crude and handles all data traffic the same way - legal usage of direct download sites or P2P networks can not be exempt from blocking.

DPI-based Protocol Detection The more sophisticated class of monitoring and filtering devices are Deep Packet Inspection (DPI) devices. These devices are able to analyze the payload of packets and recognize various application layer protocols. These devices are capable of accurately detecting and filtering specific application protocols, but are usually agnostic to the data transmitted therein. The analysis of the payload contents (e.g. recognition of the application protocol) is performed by various methods:

20 Technology Overview

FIGURE 7. Analyzed Areas in Protocol-oriented DPI Solutions

Ethernet IPv4/6 TCP/UDP Payload

Assists analysis Analysis

SIGNATURE MATCHING Many protocolscarry distinct stringsbinary or data structures in their packets that can be recognizedpattern by matching. Most DPI solutions use a database of such signatures to analyze each packet of a conversation between two peers.

CROSS-REFERENCING In many P2P protocols, peers perform separate conversations with a central server or other peers in order to select peers for download or establish a P2P network structure. By detecting such conversations and extracting addresses and other data, a DPI device may associate a - following connec tion to these addresses with the same protocol. For example, a BitTorrent client will firstperform a request to a tracker to retrieve a list of candidate peer carrying specific content. Following connections to these peer are likely for the purpose of data transfer.

HEURISTIC ANALYSIS Signature matching will likely fail inencrypted case of protocols. However, the specificpattern in which a client establishes connections, the typical amount of data transferred in requests and responses and other behavioral parameters can be detected and associated with a protocol. Different types of analysis can be used as a fallback to another method that did not deliver a confident detection result, or used together to improve the detection confidence and to eliminate possible false positives. The detection of the protocol occurs in the early stage of the TCP or UDP conversation. Once successfully recognized, the flows are no longer analyzed and only tracked until the connection is closed. - For many proto cols, this principle greatly improvesperformance, as usually only few packets actually need to be matched against a signature database.

DPI-based Content Detection The previously described techniques are only able to recognize specific protocols, but cannotdetermine whether transferred data is legitimate or not. An extension of the protocol-based DPI detection is the content-aware detection. Such devices must possess the same capabilities to- identify proto cols, and in addition must are able to extract or generate the identity of the transferred data.

CONTENT IDENTIFICA- As previously described, many P2P protocols use unique identifiers for each TION shared file, usuallycryptographic a hash of the file contents or other kind of digital signature.Other, less sophisticated protocols may identify files by their name. In any case, these identifiers are usually included in the requests from the clients to the servers, central and in communication between the clients. Similarly, content available via HTTP/FTP can be identified by the URL. A content-aware DPI device must be able to extract these IDs from a monitored conversations and use themdetermine to whether the data is a

21 Technology Overview

legitimate transfer. This decision is made by a lookup in a database of known illegitimate IDs. This database is, in most cases, maintained- exter nally and regularly updated on the device much in the same way that virus signature databases in antivirus scanners are updated.

FIGURE 8. Analyzed Areas in Content-aware solutions

Ethernet IPv4/6 TCP/UDP Header ID Data

Assists analysis Analysis

DATABASE MAINTE- The maintainer of the database may scan popular filesharing sites for new NANCE P2P or direct downloads and verify their legitimacy either manually or using an automated method described in the nextAlternatively, section. live traffic can be scanned for shared files IDs that are unknown in the database in order to locate files not appearing on manually scanned public filesharing forums. A content-based DPI solution must provide ID extraction methods for most popular file-sharing protocols in order to stay effective. The extraction method must be implemented individually for each protocol and- thus funda mentally differs from the signature-matching methods of the protocol-based detection which usually can be easily or even semi-automatically created by the vendors.The ID extraction may be problematic in many cases: • The ID of the file may not be present in a flow used for the actual data transfer. The detection solution might need to match a request made in a separate conversation, e.g. implement a cross-referencing functionality which may not necessarily be needed for simple protocol detection. • Some protocols including HTTP and FTP file transfers do not provide a secure file identification, only name. In some cases an unambiguous file identification may not be possible, for example if the uploading user has chosen avery generic filename for the upload. This may - limits the detec tion accuracy or produce false positives. • Encrypted and obfuscated protocols in most cases make a passive detec- tion impossible. Simplerencryption schemes may require the DPI device to perform a man-in-the-middle attack on the protocol in order to gain access to the data transmitted between two peers. A more sophisticated encryption schemes provide sufficient security against such attacks making identification of data impossible.

ADVANTAGES AND The advantages of the content-based detection can be summarized as DISADVANTAGES follows: • The detection is able to distinguish between files deemed illegal for distri- bution and files that are in the public domain or distributed under creative commons GNU or licences. • Content-based detection produces a high level of detection confidence, with very low probability of false positives. The accuracy of detection is mostly dependent on the quality of the file ID database, which allows quick elimination of false positives. • A homogeneously structured database can be maintained for many different filesharing protocols.

22 Technology Overview

• Due to naturecryptographic of hashing, a new file appearing on one P2P network can be automatically blocked on other networks even before it appears there, simply by recalculating the checksum. The disadvantages are: • The implementation of content-based detection is much more complex than protocol-based detection. This can impactperformance the and stability of such solutions. A solution must be tested for performance and stability for anyparticular deployment. • The device must maintain and efficientlyquery a much larger database than the signature database of the protocol-based detection solutions. Such databases also cannot be converted directly to executable code to improve performance. • Encrypted filesharing protocols requirefurther a increase of the complexity or prevent detection completely. encrypted While protocols may be accurately recognized by the protocol-based solutions, the content-based detection will fail to identify the transferred data. • Although smaller scale solutions across multiple vendors and rights holders exist with a proven record of success on a smaller scale, the maintenance of a signature and/or fingerprint data-base- is a chal lenging task for rights holders and requires close collaboration between rights holders, solution providers and vendorsparticular in on a larger scale. Keeping such a database up to date and also - requires the defini tion of standards for scanning and verification process.

Content Analysis Additional technologies were developed in order to assist recognition of copyrighted content. Unlike the content-based detection described above, these techniques are aimed at actual analysis of audiovisual content data and are able to recognize different versions of the same material. Usage of such technology on live traffic is not practicalperfor - due to high mance demand and difficulty to extract data from traffic. - Instead, the anal ysis is performed offlinedetermine to which files offered on filesharing forums andservers contain copyrighted material. The analysis can be used to automatically maintain and create- file ID data bases for use with the content-based filtering solutions. The existing solutions very are specialized to specific types of content. Typi- cally, only analysis of audio and videosupported. files is

FIGURE 9. Analyzed Areas in Content Analysis solutions

Ethernet IPv4/6 TCP/UDP Header ID Data

Assists analysis Analysis

2.3 Blocking Techniques

The goal of the technologies described in the previous sections is to automat- ically classify the data transmittedInternet in traffic. In addition to statistics collection, many such devices are capable of controlling the traffic

23 Technology Overview

according to the classification and the policies established by the service provider. The techniques described in this section can be used to block the undesirable flows.

PORT FILTERING This blocking technique requires a stateful or stateless non-DPI packet filtering (please see "Payload-agnostic Filtering" on page 19). The filtering device blocks the known ports used by P2P clients. The use of this technique today is infeasible, as most P2P applications currently allow setting of arbitrary port numbers and encourage the users to do so. Blocking of conventional ports like HTTP or FTP will interfere with many legal applications, therefore this technique cannot be used against direct-download-based filesharing as a matter of principle.

IP FILTERING Requires stateful or stateless non-DPI packet filtering. The filtering device blocks packets directed to hosts known as P2P trackers, filesharing forums or direct download servers. This method is infeasible in most cases. Technically prepared users will be able to use one of the numerous proxy services to circumvent a IP-based blockade of the central server. For many P2P protocols, only a tiny amount of traffic needs to be exchanged with the tracker, in order to search for peers or files. Most modern P2P clients offer the possibility to automatically use a . For direct download services, an IP-based blockade may be somewhat more feasible. On one hand, proxy servers may forbid transmission of large files through them. On the other hand, most direct download services impose download limits per client IP and so a proxy used by many users is most likely to have exhausted it. The latter limit however only applies to free users; paying users of most direct download services usually have no limitations on the number of downloads. Blocking popular P2P trackers and direct download services will also impact legitimate traffic. IP blocking of a specific server may lead to simultaneous blockade of other, completely unrelated web sites. This situation is possible if web sites with different domain names and from different customers are hosted under the same IP and separated by the web server through the “virtual host” tech- nique.

DNS FILTERING Requires a DNS server configuration of each specific provider. The DNS entries for the popular P2P trackers, forums and direct download services are replaced with a bogus address of harmless sites or to a site containing a warning notice to the users. This blocking method is the most inexpensive to realize for providers and usually does not require any additional equipment. This blocking method is easily circumvented even by unexperienced users by configuring a different DNS server instead of the one supplied by the provider. However, using an alternate DNS server is something most consumers may not be able to do. Even if half of them can do it, decreasing traffic to those bad sites by half for this cheap cost is well worth it. Similarly to the IP filtering method, this method also affects legitimate traffic. Moreover, the DNS blocking method affects entire websites and does not discriminate between individual sections or content items stored on it. So, for example, a shared hosting service may contain numerous user accounts

24 Technology Overview

under the same domain name, a hosting solution typical for many free hosting and services.

TCP CONNECTION This method requires a protocol- or content-based DPI detection. The connec- RESET tions of the P2P protocols are forcefully closed by sending a forged TCP Reset packet. This clearly requires a device, sitting in the middle of the network, able to generate such packets. These packets can be send by the endpoints of a TCP connection in order to force a disconnect. This technique can be utilized to forcefully terminate TCP connections identified as filesharing traffic by the provider. In practice, this technique can be easily identified at the client side. The affected clients can choose to block the reset packets completely in order to neutralize this tech- nique, which also does not impede the normal operation of the P2P trans- fers, as the P2P software can close the connection locally using P2P-specific signaling between the clients. However, blocking TCP RESETS can not be circumvented by the average users, they need to get a more sophisticated download tool which takes over this job. All of the techniques listed above share similar traits - they are simple to implement, but can just as easily be circumvented even by a novice user. Since these techniques affect legitimate traffic, they are likely to lead to complaints from users. None of the presented techniques is able to differen- tiate between legal and illegal file sharing.

PROXY-BASED Proxies can be used for filtering of specific protocols, most widespread of FILTERING them being HTTP and SMTP (e-mail). A proxy server has the entire control over the content as it completely separates the communication between the clients and the servers and terminates both segments. The location of the proxy server also allows for decryption of communication, e.g. in case of HTTPS. A proxy server is technically capable of performing many types of content analysis, filtering and modification. The proxy-based solutions with filtering functions presented in the following chapters provide fine-grained control over filtering, which includes filtering by the domains, individual URL, and even by external filtering solutions (e.g. antivirus software). The downside of the proxy-based solutions is the lack of traffic transparency, limitation to specific protocols, need for additional configuration and low performance.

DPI-BASED FILTERING With DPI-based detection solutions, it is also possible to selectively terminate the flows identified as filesharing traffic by dropping packets. In this case, the connection is usually allowed to be opened and to transmit some of the traffic until the traffic nature can definitely be identified. After this point, the device can stop the traffic analysis and simply drop all following packets associated with the flow without major performance demand. DPI-based filtering provides a more fine-grained control over traffic blocking compared to the IP- or Layer4-based filtering of the conventional firewalls, as DPI analysis makes it usually possible to recognize the actual transported protocols instead of trusting the TCP/UDP port numbers. If the DPI analysis is extended by the content and/or URL recognition, it provides even more fine- grained control over filtering, at the same time handling traffic transparently, unlike proxy-based solutions.

MULTI-STAGE SOLU- Some filtering solutions available on the market employ multiple techniques TIONS to optimize and narrow blocking. As an example, the Cleanfeed content

25 Technology Overview

blocking system is capable of blocking individual elements or subsections matching URLs on a blacklist. For this purpose, the IP addresses related to the URLs on the black list are matched in the first stage of analysis by a high-performance IP filter. Instead of blocking the traffic completely, it is forwarded to the second stage for a more precise analysis. The second stage works as a transparent HTTP proxy capable of matching the URL against the blacklist. The matched elements are blocked or redi- rected to warning pages, while unmatched requests are forwarded to the desired destination. The solutions like Cleanfeed intend to provide a solution capable of blocking HTTP traffic by URL blacklist, that is more cost-efficient, but less flexible than full-fledged DPI filtering devices.

2.4 Traffic throttling techniques

An alternative to completely preventing file sharing traffic is throttling of the traffic to a fair amount (as deemed by the service provider). Throttling allows the providers to prevent massive bandwidth consumption and ensure the unaffected operation of conventional protocols by limiting and deprioritizing the filesharing traffic. At the same time it does not impede with the ability of customers to use filesharing in general. The throttling mechanism can be configured to adjust to the changing amount of the used bandwidth during the day. This way, the filesharing traffic can be throttled more during the peak hours allowing other protocols to function normally, and allowed in the nightly hours when the conventional traffic is lower. For a viable solution, the throttling should be combined with a protocol- based detection solution. Combination of throttling with any non-DPI-based detection is unreliable, as it can be easily circumvented and on the other hand can easily affect legitimate traffic. When combined with a content- based detection however, throttling is not a desired function, as the illegal content should be completely filtered. Similarly to filtering, the device must identify the type of traffic transmitted in a flow or a conversation between two hosts and decide whether the throt- tling function should be applied to this flow. The throttling itself may be performed in different ways:

MARKING The device does not impede with the packets, but instead sets the DSCP field of the packets. The actual throttling function may then efficiently occur in the core network or by the peered carrier network. The marking of the traffic serves in this case the purpose of prioritizing the filesharing traffic below the conventional. This way, the flow of conventional traffic is likely to be preserved in a congestion situation, while the filesharing traffic will more likely suffer drops.

SHAPING The device impedes with the filesharing traffic by partially dropping the packets to a specific rate. For this purpose, a specific bandwidth may be configured per flow (single transmission from one user to another), per source or destination IP (limit for a specific user, e.g. a broadband customer) or per interface (all traffic flowing through the detection device from many customers). In most cases, the devices utilize a simple “token bucket” algo- rithm to enforce a specific average bandwidth maximum independently from the packet sizes, and at the same time allow and control small traffic bursts. As the most transmissions use TCP or some other type of flow control, their

26 Technology Overview

bandwidth will automatically adjust to the rate enforced by the traffic shaping.

JITTER GENERATION This type of traffic impediment is mostly used by the providers to prevent use of VoIP in their networks. The device can add jitter to the packet flows iden- tified as VoIP audio streams and so negatively affect the quality of the call. This kind of impairment usually has no effect on the filesharing transfers. Non-interactive video and audio streams are also mostly unaffected, as a larger fragment of the stream can be buffered in order to cancel out the effects of the jitter.

2.5 Solutions Based on HTTP Proxy

Three of the solutions evaluated in one of the following chapters of this survey utilize a very specific method of network attachment and content analysis that we would like to evaluate in detail. The aforementioned solu- tions act as a proxy for few widespread protocols, primarily HTTP, but in many cases support HTTPS, FTP and may also act as a mail gateway. The primary use of such devices is within networks of companies and orga- nizations, where they may serve as a security enhancement measure. These devices can also enforce the acceptable usage policies. The operators of such corporate networks may easily enforce policies and perform necessary client configuration. Further, the solution is very likely to be assisted by existing firewall. The use of proxies in Internet Service Providers (ISPs) is not widespread, due to high administrative efforts required to maintain such a proxy, high perfor- mance requirements that serving a large number of subscribers has, and lesser ability to enforce specific usage policies on their customers. Small ISPs sometimes use proxies where content caching is performed by the proxy to mitigate the effects of a poor connection to the Internet.

Device Classification An HTTP proxy serves as an active network component that actively termi- nates TCP connections from clients and servers. HTTP proxies fall into same category as Intrusion Detection Systems (IDS), with primary specialization in HTTP protocol. Support for other application protocols (e.g. various-P2P flavours, Instant Messaging and streaming) may also be offered on the same device.

Principle of Operation The main principle of HTTP proxy operation is to accept an HTTP request from the subscriber (with optional authentication), and either to forward the request to the actual web server, or to serve the HTTP content locally from a previously cached version of the content. The proxy may also modify parts of the HTTP request or the delivered content. When used voluntarily, a proxy is usually utilized to improve the subscriber’s experience, either by improving the web browsing experience through local caching of the content, or by providing useful filtering func- tions, such as virus scanning, ad removal or content optimization, relevant both from security and from performance perspective. If the proxy is to be used for explicit traffic policing, the network operator must take further precautions to ensure that all user traffic will be forwarded through the proxy. The proxy device must either assume the Internet

27 Technology Overview

gateway role, or any possibility to bypass the proxy server must be prevented by a firewall configuration. In the case where the proxy is used as a content policing device, compared to the conventional pass-through DPI devices, a proxy has many distinctive features. On one hand, the explicit termination of connections allows for more precise and reliable control of the traffic, on another hand, such solu- tions suffer from performance issues. Below, we describe the relevant char- acteristics more specifically:

ADVANTAGES • The accuracy and effectiveness of proxy solutions is not affected by impaired traffic (e.g packet reordering), as the direct termination of TCP connections by the proxy will actively mitigate the effects of lost or misor- dered packets. • The proxy device does not need to forward the packets as soon as possible and may collect larger portions of traffic for more precise or complex analysis. • The proxy device may rewrite parts of requests and responses in order to assist analysis and/or blocking of the traffic. • A proxy may directly deliver notifications to the user without needing additional mechanisms. Moreover, this information may seamlessly be included into content of returned web pages. • The proxy location and session termination facilities provides a perfect possibility for performing a Man-in-the-middle attack on SSL authentica- tion and so allows the proxy to gain access to the cleartext data trans- mitted within encrypted connections such as HTTPS. • Proxy may provide another optional level of authentication for the users, requiring them to enter their user name and password for the proxy use.

DISADVANTAGES • Proxy servers usually offer much lower performance than DPI solutions running on the comparable hardware. • Proxy servers may interfere with custom authentication and encryption mechanisms between clients and servers. • Proxy servers break the end-to-end Internet principle. The communication model used in the Internet trusts that the client is in direct contact with the server. Interfering with such fundamental operation of the Internet is likely to cause protocol incompatibilities and upset users that feel their privacy infringed upon.

Network Connection For the purposes of compulsory traffic filtering, proxy solutions can be oper- ated in two different modes - as a conventional and as a transparent proxy. Some of the proxy-based solutions in the market are capable of selecting the appropriate operational mode suitable for a specific network environment.

Conventional Proxy The conventional HTTP proxy can be placed at any position in the protected network or even outside of the network. It does not need to actually separate the controlled network and the Internet in a way similar to the firewalls or pass-through DPI devices. The only configuration needed is for the end-user to point the web browser to the proxy.

28 Technology Overview

FIREWALL CONFIGURA- The proxy server should either be directly reachable for the clients, or in the TION case that it is placed at an external location, the firewall must be configured to allow users’ access. In order to enforce users to use the proxy, the firewall must be configured to block all HTTP traffic from the network, except from the proxy server itself, or other hosts that require direct access. This type of proxy operation requires clients to explicitly configure the proxy address in all applications using HTTP, primarily the web browser, but also other applications that may need to download content from the Internet. In most cases, such applications can rely on the system-wide proxy configura- tion and do not need to be explicitly configured separately. In a networking environment of a company or organization, where worksta- tions can be controlled by central system administration, and is typically the property of the organization, system-wide proxy configuration is easily achieved. However, there are cases where administration is more relaxed and the users maintain their own workstations, for example, in educational institutions that provide Internet access for students. In many cases, a proxy auto-configuration is desirable.

PROXY AUTO- Several Proxy auto-config (“PAC”) techniques exist, however, they do not CONFIGURATION provide a reliable method for all environments and clients.

MANUAL AUTO- Semi-manual configuration method is implemented in most web browsers CONFIGURATION and requires the user to enter a URL of a file containing proxy configuration information. Such file may be placed at a company’s internal web server, so that the client will have direct access to it. The autoconfig URL needs to be manually entered on all workstations. With this method, the change of proxy server location or exclusion rules, do not require any reconfiguration by the clients. This auto-configuration could also be utilized for a simple load-balancing mechanism by returning autoconf file containing different proxy server IPs to the clients. When on proxy is too busy handling users’ requests, another proxy could be chosen to facilitate web access.

WPAD Another widespread method is Web Proxy Auto-Discovery Protocol (WPAD), which contains two auto-discovery methods - by DNS or via DHCP. The WPAD never emerged as a complete standard, but the methods described below may be supported by some browsers.

DISCOVERY BY DNS The DNS-based discovery method is supported by many popular web browsers such as Firefox and Internet Explorer. The client will attempt to derive the location of the auto-configuration information from the domain the client currently resides in. The client will attempt to guess a possible web server location within its network by removing parts of its own domain name until the minimal form such as domain.com is reached. If the web server responds, the client will attempt to download a file called wpad. from it. If the client’s access to the network is done via PPP or DHCP, the operator may, and is likely to, attach an attribute with the address of the preferred DNS server to the PPP or DHCP response. The client will then be able to determine its host name and the domain of the network by performing reverse DNS lookup on its own IP address. This method requires the network operator to maintain a web server with the appropriate auto-configuration file within the network and ensure that the discovery process does not cause any adverse effects or can be exploited.

29 Technology Overview

The clients must chose the auto-discovery method in their proxy configura- tion.

DISCOVERY BY DHCP If the clients obtain their IP via DHCP upon connection to the network, the location of the proxy auto-configuration information can be supplied in the form of a non-standard DHCP attribute. DHCP method takes precedence and if no appropriate attribute was found in the DHCP response, DNS method is attempted as fallback. Currently, only few browsers support this method. Use of WPAD methods is unreliable due to lack of standardization, poor support by many HTTP clients and possible configuration issues.

Transparent Proxy Transparent proxies represent a second variant of HTTP proxy operation mode. As opposed to the conventional proxy methods we discussed above, the clients do not make explicit proxy requests, instead traffic is intercepted and processed by the proxy transparently. In order to accomplish this, all user traffic must pass through the proxy, requiring it to be placed similarly as a gateway or a firewall. A transparent proxy must posses basic DPI capabilities to recognize HTTP protocol independently from the port. It should be able to accept IP packets promiscuously, as the clients will not direct them to the proxy server itself, but to some web server’s IP address on the Internet. In the opposite direction, the proxy must be able to transmit packets with spoofed IP address, so they appear as coming from the web server directly, otherwise they cannot be associated to correct connection by the client. Proxy auto-discovery or configuration is no longer necessary and therefore the clients do not require explicit configuration in this case.

PROXYING HTTPS In order to proxy HTTPS traffic, the proxy must act as a man-in-the-middle, TRAFFIC masquerading as the target website. In so doing, it decrypts traffic from the client and re-encrypts it for transmission to the website. It performs the same for traffic flowing in the reverse direction. In order to masquerade as an HTTPS server to the client, the proxy needs to provide it with a certificate containing different keys than those in the official site certificate. Since this new certificate is not signed by a trusted certificate authority, many browsers and secure applications will not trust it and will pop up a warning to the user. Typically, the solution is to install an additional trusted root certificate in the browser, which can be done either manually or through centralized corporate IT management systems. The management of HTTPS proxying is further complicated by the fact that some applications and devices do not allow the user to click through a warning or to configure an additional root of trust. Background software update programs are a common example. While in many cases this can be mitigated by white listing trusted sites to bypass the man-in-the-middle decryption, doing so adds additional administrative burdens.

2.6 Conclusion

We presented various options for illegal file sharing suppression. Based on the information presented in this chapter we provide an overview of the various solutions‘ effectiveness on the different types of traffic analysis. The color coding used in the table is as follows:

30 Technology Overview

• The solution is effective for this type of filesharing and is also unlikely to affect legitimateservices. • The solution partially is effective, but has many drawbacks, such as increased effort. May affect legitimateservices. • The solution very is ineffective due to infeasibleeffort, or lowaccuracy. May affect legitimateservices.

TABLE 1. Overview of Technology Effectiveness

Solution Type File- sharing Payload- Protocol Content- Content Type agnostic Detection aware Analysis (online) HTTP/FTP Affects many legal Affects many legal Filenames provide Relatively simple to services services only ambiguous way access content of content identifica- tion HTTPS Affects many legal Affects many legal Requires man-in-the- Requires man-in-the- services services middle attacks middle attacks Direct Down- Only possible to blockOnly possible to blockEasily possible to May be impossible load DD sitesaltogether, large downloads alto-identify the specific for encrypted content, affects all legal mategether.- Likely to affectcontent by URL requires password rial as well many legal services scooping. P2P Central- Possible to block Only possible to blockEasily possible to Content isvery diffi- ized (unen- major trackers and P2P protocols alto- identify the specific cult to scoop from crypted) forums, but can be gether. content by IDs/hashestraffic alone. circumvented. used in the protocol. P2P Decen- P2P mostly unaf- Only possible to blockEasily possible to Content isvery diffi-- tralized fected by blockade ofP2P protocols alto-- identify the specific cult to scoop from (unen-- central servers. gether. content by IDs/hashestraffic alone. crypted) used in the protocol. P2P P2P traffic easy to Reduced detection Normally requires Content usually not Encrypted conceal. accuracy. man-in-the-middle possible to scoop effi- attacks. Only in fewciently. protocols crypto- graphic weaknesses can be exploited to reconstruct the key without MITM. Anonymized P2P traffic easy to Difficult to detect. Practically impossible Practically impossible conceal. to analyze. to analyze. Stegano- Impossible to distin- Difficult to distinguish Practically impossible Practically impossible graphic guish. from other traffic. to analyze. to analyze.

31 3 Service Provider Challenges

As we discussed in the previous chapter most solutions that are aimed to address file-sharing must be installed in the network. The use of a moni- toring/filtering device in a live network is a cause of concern to network operators. The limitations and the effects such a device might have on the healthy operations of a network must be thoroughly analyzed before a suit- able device is selected. In the following sections we present various aspects and considerations applicability for different network scenarios as well as common practices amongst DPI device vendors.

3.1 Network Technology Perspective

Integration into Service Provide (SP) networks The first important question is whether the monitoring/filtering device acts as an active component in the network and may require additional planning and configuration and potentially affect the behavior of the network.

MONITORING-ONLY In some cases, a DPI device is not intended to be used for actual filtering, OPERATION but only for the analysis of traffic. The data to be analyzed could be used not only for user activities monitoring, but also for billing of individual users without the need to affect the users‘ traffic directly by blocking or throttling it. Most DPI solutions designed with filtering/throttling functionality can be easily used in monitoring-only mode as well. In this operational mode an existing switch or router only need to provide a copy of the traffic (commonly referred to as mirroring) to the DPI device. This methods is in essence passive - the act of monitoring can not adversely affect the traffic being monitored. Most DPI solutions are also able to operate in monitoring-only mode by transparently passing traffic between two interfaces.This methods renders the DPI solution active within the data path in the network. Depending on the implementation, traffic could still be negatively affected by the device

32 Service Provider Challenges

even if no filtering or throttling is performed. Upon reaching the DPI perfor- mance capacity, depending on the implementation and configuration, the device may start dropping excess frames, or pass them through unpro- cessed.

FRAME REORDERING Other negative effect of such solutions may be reordering of the frames, or AND DELAY variable delays being introduced to the traffic. In most cases, the processing of the traffic must be parallelized and spread across multiple DPI processors. The distribution process should occur in such a way that the frames of the same bidirectional conversations are always processed by the same DPI units. Incorrect implementations may lead to reduced detection accuracy, and due to small differences in processing time, to reordering of the frames within a flow.This will negatively affect the user‘s traffic. Even without reor- dering effects, some packets may experience higher forwarding delays than other due to more complex processing they require. It is recommended to measure the reordering and delay variation issues in a multi-protocol mix when evaluating a pass-through DPI device. For example, an increase in forwarding delay or delay variation could cause voice over IP (VoIP) calls to drop or to add echo effects or clicks to the conversation. Clearly, at an age that many services providers are trying to convince customers to switch to VoIP such negative effects should be avoided. Compared to DPI processing, non-DPI filtering solutions are less likely to produce similar issues. The processing delay per frame is usually constant and the frames are likely to be processed sequentially.

ACTIVE SESSION A third class of monitoring devices is known from the area of Intrusion Detec- TERMINATION tion Systems (IDS). This class of devices actively terminates TCP connections and UDP conversations and is therefore no longer fully transparent to the traffic. This type of traffic monitoring is likely to have high impact on the throughput and latency of the network and also most likely to have high performance demand. Such devices can be adapted for file sharing preven- tion, but are only suitable for small installations such as protecting a local network with high security requirement. Nevertheless, the technique of inter- cepting the TCP connections may be used in some solutions in order to perform man-in-the-middle attacks on encrypted traffic of some protocols as described in the previous chapter. Typically such solutions do not fit large installations at service provider networks and should only be considered for small to medium company networks.

INTERFERENCE WITH Another important issue that may arise from some monitoring/filtering solu- NETWORK INFRA- tions is the solution‘s unintended role as an active component in the network. STRUCTURE Some DPI devices may utilize built-in switches as load-balancers for their multiple processing modules (splitting the traffic for efficient processing). In practice this design might result in an active Ethernet switch physically connected to multiple ports of the elements already deployed in the network. Without precautions and careful considerations, such as correctly config- ured Spanning Tree Protocol (STP), Ethernet loops may appear in a previ- ously healthy network after addition of monitoring/filtering devices. The solutions, therefore, must be evaluated for presence of such active compo- nents.

Resiliency All modern networks, be it residential, mobile or business, place high value on the ability to recover from failure quickly - without the users realizing that a failure occured. This concept is referred to as resiliency. For illustration

33 Service Provider Challenges

purpose we point out that voice networks typically are designed to recover from failure within 50 milliseconds. Such requirements find themselves, sometimes with even higher standards (e.g. 16 ms for video traffic), into triple play, mobile and business networks.

TRAFFIC BYPASS From a resiliency standpoint, DPI solutions that operate in pass-through mode represent isolated network components and must be able to protect the traffic against failure of the device itself. A failure may be a result of a hardware or software problem and in general would result in failure to forward the traffic between two interfaces. Most DPI solutions implement a bypass mechanism triggered by interruption of traffic flows in order to allow traffic to flow through the device even when the device stopped functioning. The bypass may be implemented internally by disabling the DPI processing and directly interconnecting the input and output ports of the device. Even such solution still represents a single point of failure.A more reliable mecha- nism is an external passive optical bypass. In this case, the input and output ports are bypassed completely by physically redirecting the passage of light impulses from the input to the output. In case of IDS-based solutions, a failure of the device will instead lead to complete interruption of traffic, as the connections are terminated locally and the traffic flow cannot be restored by simply bypassing the frames. This is also not the intended function of such devices, as by default they should block any unrecognized traffic.

INTERFERENCE WITH A failure situation for a transparent pass-through device from the point of OTHER MECHANISMS view of the surrounding network infrastructure may appear as a link failure. If the surrounding infrastructure implements its own resiliency mechanism, it can also be triggered. In case of a correct failover procedure on the DPI device, for example by means of an activated optical bypass, the connec- tivity will be restored after a short delay. In some cases this may lead to conflicts and undesirable effects through interaction with the higher level resiliency mechanism. The interaction of the two separate resiliency mecha- nisms therefore should be evaluated in each concrete network setup. Since networks‘ surviveability and reliability are a premium concern for operators, the introduction of a device, one that is not required for the oper- ation of the network, that might fail and with its failure cause service disrup- tion to a potential large number of customers, is clearly undesired. DPI solu- tions must first prove their ability to withstand failure before they can be accepted by network operators.

Network performance considerations The inclusion of a pass-through or an IDS-type device on a network link could lead to network performance degradation depending on the device‘s implementation efficiency. The devices operating in an out-of-line monitoring mode cannot directly influence the traffic flow, however, they still require a network component to provide a copy of the traffic. If this is realized by port mirroring on an active network component like a switch or a router, mirroring function may negatively influence the performance of this device. The only fully performance-neutral solution option is out-of-line monitoring where optical splitter is used for mirroring. The potential network performance degradation can be parameterized by the following negative effects: decrease in throughput, increased forwarding delay and packet delay variations, packet loss, and concurrent flows limita- tion.

34 Service Provider Challenges

HANDLING OF UNREC- An in-line, transparent DPI or IDS device may exceed its processing capacity OGNIZED TRAFFIC at high traffic load. DPI devices may react differently to high traffic load. Some solutions will react by allowing unprocessed frames to pass through unanalyzed, or drop them, others might stop processing traffic all together, while less savory implementations could block traffic from passing through the device. The exact reaction is implementation-dependent and may also be configurable on some devices. On the other hand, IDS-based devices are designed to strictly block any unidentified traffic from passing through and will always drop packets after reaching their performance limit. When packets are allowed to pass, the accuracy of detection and filtering may be reduced under high load preserving the throughput performance of the network. Nonetheless, traffic can still be influenced negatively through higher delay and packet delay variations. Packet loss could also occur. If the unrecognized traffic is dropped, the throughput of the device will be reduced. As most Internet traffic consists of Transmission Control Protocol (TCP), the affected hosts will automatically reduce their transmission rate through flow control mechanism. This should lead to an equilibrium state where the analysis device limits the capacity of the link through its perfor- mance, but the traffic roughly maintains its other performance aspects. The end user will identify this behavior as decrease in available bandwidth and is likely to complain to the operator.

DPI PROCESSING The throughput performance of DPI devices should be tested with realistically PERFORMANCE simulated TCP traffic mix. Unlike routers‘ and switches‘ forwarding perfor- mance, primarily handling each packet separately, DPI devices performance is dependent on the number of flows and their connection establishment rate. Moreover, small packet loss is a normal occurrence for dynamically controlled TCP flows. The throughput rate therefore cannot be determined as the point where no loss occurs, but must be established through proper simu- lation of TCP protocol. The testing of DPI devices is similar to performance tests on the firewalls, that have many similar characteristics as the Layer 4 type devices, rater than the switches and routers (Layer 2 and 3 processing devices). In addition, the performance demand of DPI devices may be affected by the payload contents. It can be expected, that detection of some application protocols may have higher processing demand than the others. Simple and widespread application protocols like HTTP can be easily recognized by the signatures found in the headers, while complex P2P protocols may have obfuscated packet format and therefore require more sophisticated recogni- tion process.

DELAY AND PACKETS Increased forwarding delay and packet delay variations (commonly referred DELAY VARIATION to as jitter) are also one of the expected side effects of an overloaded device. Interactive and real-time traffic such as VoIP and online gaming requires optimal network conditions - low delay and minimal jitter. Providers typically are very careful to add any elements to the network that might increase network delay and delay variations.

Network security considerations A DPI device installed in a network may also effect the network’s security and safety.

35 Service Provider Challenges

ACCESS TO TRAFFIC As devices with rich traffic analysis functionality, DPI devices may serve as a tempting target for hackers wishing to collect information from the users. Many DPI solutions allow monitoring of specific users and even individual flows. A hacker, who is able to gain management access to the device will be able to collect sensitive information transmitted over the Internet. Solu- tions capable of protocol decryption may provide access to even more sensi- tive data. HTTPS, as the most prominent and most widely supported encrypted protocol will be a very attractive protocol to eavesdrop on as it is most likely to carry sensitive data such as online banking, shopping and other secure web services. Since most DPI solutions utilize an independent network interface for management access and work transparently for the traffic, it would typically not be possible for attackers to gain direct access to the device from the public Internet. The security of sensitive subscribers data, therefore, relies on proper security design of the management network and hardening the DPI solutions own security stance.

DENIAL OF SERVICE DPI devices could also serve as an attractive Denial of Service (DoS) attack ATTACKS targets. Compared to the conventional network devices, DPI analysis requires complex code which consequently is expected to contain more bugs. A mistake, or poor optimization in such code may lead to abnormally decreased performance or even crash of the device. The ability to crash or „own” such DPI solutions in a service providers network in an attractive proposition to hackers. Every DPI solution provider does its best to harden and test the software on the devices. A DPI device is usually designed to handle numerous different application protocols, and to support a variety of network conditions. This leads to a high number of very specialized code fragments handling specific protocols, or aspects of traffic.As even the layman sees in common off-the- shelf operating systems, no vendor is able to test all possible combinations of code and to locate errors in a seldom used code segment. An attacker may attempt to exploit possible bugs in the software of DPI devices by transmitting traffic with elements atypical to normal Internet traffic. A handful of such examples could be: • Unusual or incorrect encapsulation formats. For example MPLS encapsu- lation only typically used in core networks, but may also be included in plain IP traffic. Another option would be for an attacker to create complex nested encapsulations expecting the DPI solution to decode these encapsulations and eventually fail. • Impaired, damaged or incorrect packets. As example could be frag- mented IP or incorrect TCP sequences. • Malformed data elements in the application traffic, for example incorrect length fields, or unterminated stings It should be noted that even network devices from well known vendors are at times compromised or are identified to have security holes. These devices are, however, essential to the operations of the Internet and are the bread and butter of network operators and service providers. DPI solutions, on the other hand, are not required to the operations of a network and therefore are treated with suspicion by operators that are forced to use them. At the same time, DPI solutions operating in pass-through or proxy mode represent a single point of failure for a large number of customers. Unlike typical online services where a denial of service attack may be mitigated by load-balancing mechanisms, firewalls or redirection to a different location, a

36 Service Provider Challenges

DPI device stays exposed to malicious traffic and can only avoid the effects of an attack by enabling bypass mechanisms. When choosing suitable DPI devices for the installation in provider networks, the security and stability aspects, as well as built-in failover mech- anisms should be also thoroughly tested.

Copyright database handling Protocol-based DPI recognition solutions can mostly operate autono- mously.Software updates are primarily directed to ensure support for new application protocols, improve accuracy, performance and stability of the system. All these aspects are the responsibility of the solution vendors and can be handled by them without extended interaction with the operator of the device or other companies. The vendor announce that a new code is available and the network operator, on his or her own time, validate that the new software is not harmful to the network and then performs the installa- tion. Content-based detection solutions, however, open a fully new aspect of device updates – handling of the file ID database. This database should be maintained separately from the other software components of the device such as firmware and DPI signature definitions due to its very different nature. The following sections discuss the less-technical, yet very real concerns such ID Databases bring to the discussion.

RESPONSIBILITIES On the one hand, the content of ID databases is defined by the legality of the files and is not a technical decision. DPI systems vendors cannot be held responsible for the correct maintenance of the database content. Hence, a legal entity charged with maintaining such a database is typically respon- sible for the content of ID databases. The vendors are, however, responsible for the data import process from the external sources which may require processing of the data sets in order to make them compatible with the internal database of the devices and the entire platform. On the other hand, the same database of legal/illegal file IDs, or black- listed/whitelisted URLs may be used by different providers and even by different content-based DPI solutions, which again may require appropriate data conversion. These considerations make it clear that the contents of such databases should be handled by an external entity, and the DPI solution vendor should only provide a necessary interfaces to import and manage this data and other supporting functions. Decisions about which content should or should not be blocked must be done by an authoritative entity such as a trade group, association of rights holders, or an administrative body; this entity may need to be specific to individual legal jurisdictions. The database must be unequivocally reliable, secure, and regularly updated and maintained. The vendor should only be responsible for the technical aspects of the solution, such as importing the database and accepting updates from it.Supporting Functionality The location of the analysis device and direct access to the necessary data elements could serve as a basis for optional, but useful features the DPI solu- tions could provide in parallel to its main purpose. Such features could assist investigators with detailed analysis of content items currently observed on the network. The following list presents some examples: • Extraction of file IDs from the live P2P traffic. The content shared on the file sharing networks is constantly updated and thousands of new items

37 Service Provider Challenges

may appear for share every day. The announcements of the new releases are usually made on various Internet forums, and usually in many different forms and languages. Manual or semi-automatized scooping of such information in order to detect illegal content requires high effort. Often such forums cannot provide sufficient information on the volume of the exchanged content. As the content-based DPI solutions already extract file IDs from traffic, they can provide the functionality of collecting the file IDs not known in their database as potential new files that need analysis of their legality. The device can also collect statistics on the traffic volume and the number of users which can be used to quickly identify the most popular items currently shared on the P2P networks. This system is however reactive - only once files gain popularity and have been downloaded by a large number of users will they be identified. • Extraction of direct download URLs. Same system can be utilized, to some extent, to collect the direct download links from the HTTP traffic, as long as they maintain easily parseable naming format. • Extraction of content data. This functionality is imaginable for some protocols, but may be difficult to achieve in many cases. The device could perform extraction of binary data from the payload of P2P protocol packets and could even technically reconstruct the transmitted file, even if only partially. This way, the content could be analyzed by offline tools. For example, audio recognition could be used to automatically deter- mine whether an audio file contained material under copyright. In practice, such functionality may be much easier realized by additional software that implements specific P2P protocols and uses the extracted file IDs to automatically download the file from the P2P network. Similarly, files downloaded from direct download services are much easier to download manually using the extracted URL instead of attempting to extract data from live traffic. • Access portal for copyright holders. The database can be maintained semi-automatically by the vendor, or by a specialized company dedi- cated to file ID database administration using a portal for copyright holders to present the currently observed file sharing items and allowing quick analysis of the content for its copyright status.

OFFLINE CONTENT The maintenance of the file ID database can be, at least in part, automa- ANALYSIS tized by content analysis solutions. In the past, EANTC performed tests of such systems that were designed to work in-line, similarly to the other DPI solutions. The tests have shown extremely low performance and accuracy of such solutions. In fact, in-line analysis of the content (e.g. audio analysis) is counter-productive due to various reasons: • Difficulty of content data extraction in live P2P traffic • Inability to access compressed content (i.e. music albums distributed in archives) • Inability to prevent the distribution early - the solution may require a large portion of content to be transferred before analysis is complete • Repeated analysis of the same content being transmitted multiple times The conclusion of these shortcomings is that the content analysis solutions are best utilized in off-line mode in order to provide automatized support for file ID database maintenance. It should be also noted that due to limited accuracy of such solutions and difficulties determining exact legal status, all content items recognized by

38 Service Provider Challenges

such automatized solutions still need to be verified manually, for each terri- tory, before being entered into a blacklist database. Although online analysis is the only solution witch can immediate react on new content distribution, it can have performance and reliability issues. Whereas offline analysis does not have such performance issues and is more reliable with the disadvantage of having new content distributed for hours before it can be detected.

Potential service provider design From experience gained with installation of DPI devices in provider networks, vendors of such solutions have outlined several points in the provider networks where the device should be typically installed.

PEERING/TRANSIT Providers can filter the traffic transmitted to or from other providers’ POINTS networks. The typical distribution of P2P traffic for a single content item often shows that most users sharing the same file are located in different network areas and not within a single provider‘s network. Although the device will be unable to prevent sharing of content between locally close users, i.e. subscribers located in the same network, it would be likely able to disrupt the entire distribution. Of course, such peering points (typically referred to as Autonomous Systems Borders) are also the points in the network where the highest amount of data is being exchanged. Currently many service providers are discussing inserting 100 Gigabit Ethernet to these network areas. Expecting DPI solu- tions to be able to deal with such amount of traffic is, at this point in time, not possible.

AGGREGATION Typical for most broadband access networks, as well as for mobile architec- POINTS tures, is the aggregation of subscribers behind a single routing device, such as BRAS or GGSN. Normally, all subscriber traffic towards the Internet flows through this single point where a filtering device can be placed. At this point in the network traffic is also unlikely to be mixed with other traffic carried by the same provider such as transit and business services, and therefore allows for more cost- and performance-efficient filtering.

CORE NETWORK Deployment in the core network is impractical and has several disadvanta- geous: • High traffic volume requires a filtering device with an adequately high performance, and many solutions may prove incapable. • The core network may carry traffic of many types, including residential, broadband traffic, business, transit traffic and other unrelated services that can be negatively affected by the filtering. • A failure of the filtering device may affect a larger number of network users than the potential target group that is being monitored. • Core network is likely to be organized in a mesh structure, allowing traffic flows over multiple paths. This circumstance may negatively influ- ence the detection accuracy and blocking efficiency and require a large number of DPI devices to be deployed. • The presence of a device capable of discarding packets on what the surrounding infrastructure considers a direct physical link may interfere with the existing resiliency and performance monitoring mechanisms which may consider the link faulty.

39 Service Provider Challenges

FIGURE 10. Potential Placement of Filtering Devices in Provider Networks

Mobile Users

Core Network GGSN

Broadband Users

DSLAM

BRAS

Companies or Institutions

Gateway Peering Connection

DSL, Cable, Backhaul

Aggregation DPI device Peering Provider Backbone IP Routers ...

...

Encapsulation Internet traffic in provider networks is likely to be form. in encapsulated A DPI device utilized within the infrastructureInternet where traffic was aggre- gated must be ablesupport to the complete stack of encapsulation protocols used by provider. the The most widespread examples are: • VLAN: used in Ethernet-based the access aggregation infrastructures to identify traffic for specificports user • Q-in-Q or 802.1ad double-tagged VLAN frames: additional tag may be added when trafficfurther is concentrated in the provider infrastructure • PPPoE, PPPoA, PPPoEoA: EthernetATM and based encapsulation typical for the ADSL access • L2TP: tunneling protocol used to transport broadband subscriber traffic to the provider’s Point of Presence location (POP) • MPLS: transport of the tunneled traffic in the network backbones or as a leased service

40 Service Provider Challenges

Depending on the position where a DPI device is to be installed, the device must support one of these encapsulations, or a mix thereof. While the config- uration can be easily adjusted for each specific provider, users are also able to transport encapsulated traffic. In some cases, file sharing can be performed within a VPN, or traffic transported over a tunnel to a proxy located elsewhere in the Internet in order to conceal the file sharing traffic from monitoring and filtering. The DPI solution therefore must be able to use adaptive traffic decapsulation, i.e. must be able to recognize the start of the encapsulated IP packet in any frame, regardless of static encapsulation used by the provider itself. Performance-wise, processing of encapsulated traffic should not cause a significant performance impact, as the regular expression driven recognition of signatures typically used in DPI solutions should work transparently on data blocks with arbitrary prefix (i.e. added encapsulation protocol header in the packet). The monitoring and filtering performance of a DPI solution must be evaluated in comparison with unencapsulated IP traffic in order to prevent unexpected performance regressions in a real network. Exact detail on supported encapsulation types in various products is difficult to obtain, however it can be easily assumed that a solution that supports some types of encapsulations requiring header removal is likely to support other similar encapsulation types, or is easily to adapt. Note that the firewall- or IDS-type devices are unlikely to be suited for moni- toring of the encapsulated traffic. As active network components, they typi- cally expect plain IP traffic and would require decapsulation of traffic to performed for them by surrounding network components. This circumstance makes it difficult for such device types to be deployed in core networks, where encapsulated traffic (VLAN or MPLS) is common.

Link Aggregation In the provider networks, aggregated traffic from the broadband customers and in the backbones is often transported over multiple Ethernet links using link aggregation. Most pass-through monitoring and filtering solutions should be able to operate on such aggregated link without issues as long as the link aggrega- tion balancing is performed correctly by the network infrastructure. The main requirement for the correct operation of DPI is the ability to correctly and completely identify the packets belonging to the same bidirectional flow between two hosts, e.g. a TCP connection. Link aggregation specification mandates that the implementation should transmit the frames of a single conversation (i.e. a TCP connection or a bidi- rectional UDP flow) to the same link. The reason is to prevent accidental reordering of the frames within same connection, which may lead to slow- down of the traffic or even corruption of communication in case of UDP. In order to fulfill this requirement, the implementations usually compute a hash value from the relevant fields of the packet, such as source and destina- tion IP addresses and ports, and which produces same result for each packet of a specific flow. It should be noted that the exact algorithm for link selection is not defined and may differ from one implementation to another. Moreover, the opposite side of the link may use a different algorithm and may align the opposite flow direction to a different link. A DPI device inserted into such connection may face difficulties recognizing and policing the traffic, if the opposite flow directions of a TCP or UDP conversation appear on different Ethernet links. The effect is dependent on the internal architecture of the DPI device. Some implementations equipped

41 Service Provider Challenges

with multiple Ethernetinterfaces and fully interconnected architecture may be able to process the flows detected on interfaces any of1 , the while devices with loose modular design, e.g. on a basisserver of 2, a or blade multiple individual units may only see one direction of the traffic. before utilizing a DPI device in a link aggregation scenario, the impact of bidirectional misalignment on the detection accuracy should be tested, even if the link aggregation itself is known to work without issues.

FIGURE 11. Example of Misaligned Flows in Link Aggregation Link 1

DPI Device

Link 2

Figure 11 shows an example situation where both switches distribute packets of the unidirectional flows correctly in accordance with- the specifi cation, but the DPI device in between does not always see both directions of a bidirectional flow on the same channel. Under circumstances, the load-balancing between aggregated links may be suboptimal, which leads to one or some of the links to carry more traffic than the other(s). When a DPI implementation is evaluated, it should be verified that the device is ableperform to additionalinternal load-balancing on its processing modules in order to optimizeperformance. the

Asymmetric Traffic In some cases, asymmetric routing is used in provider networks, for example, if a part of the network is organizedtopology. in On a ring some links, Internet traffic will flow only in one direction, while the opposite flows are being transmitted through other network segments. Utilization of DPI devices on such links is problematic and provedvery ineffective. to be Detection of most P2P protocols is unreliable, if only one direction of the traffic can be seen by the monitoring device. Many DPI solutions are known to explicitly notsupport asymmetric traffic detection. Filtering of unidirectional traffic is however generally possible, as a blockade of just one direction of a TCP flow will completely disrupt the opposite direction as well, due to TCP’s flow control procedures.

Monitoring in Impaired Traffic Flows Under circumstances, traffic processed by the monitoring and filtering devices may arrive with various impairments introduced on the user side, or in provider networks. impairment The may be inform the of packet loss, reordering and IP packet fragmentation.

1. An example beingProcera’s PacketLogic architecture described in one of the following sections. 2. The counterexample being ipoque’s the PRX solution on basis of the- IBM Blade Server.

42 Service Provider Challenges

In case of firewall- and IDS-based solutions, packet loss and reordering generally should not lead to significant problems. The firewall will filter packets by the IP or transport layer header and the IDS-based solutions are able to recover the correct data flow through default TCP mechanisms. A DPI solution’s accuracy may be impacted if the packet loss occurs at the begin- ning of conversation. Most protocols can be recognized by the data located in the first packets of the conversation. After the protocol is recognized, the residual flow can be tracked just by the information from the IP and transport layer headers, which is sufficient to perform statistics collection or filtering/ throttling. Packet loss or reordering occuring in this phase is unlikely to have negative effect. IP packet fragmentation maybe be more problematic for all types of devices, as it considerably increases the processing overhead. For many operations that a networking devices have to perform, the frame rate plays a more important role than the data rate and so fragmentation of IP packets can easily double the required performance for the same amount of transferred data. The occurrence of packet loss, reordering and fragmentation in modern networks in low, so it is unlikely to cause performance issues with the DPI devices. Nevertheless, the evaluated devices should be tested for stability in these conditions. Many networks of large organizations already utilize firewalls to protect their internal network. The firewalls can be used in parallel to implement simple non-DPI filtering against file sharing. In broadband access networks however, there is usually no such device to control the traffic from the broad- band users to the Internet, so additional planning would be necessary. Considering a poor protection of such devices against the current P2P proto- cols, utilization of firewalls as a countermeasure against file sharing traffic is only feasible as a supplement. While some networks are protected by fire- wall systems, in broadband access networks there is however usually no such device to control the traffic from the broadband users to the Internet, so additional planning will be necessary here as well.

3.2 User Perspective

From a network user perspective, complete blocking of file sharing traffic is quickly noticed. Users intending to use file sharing, upon encountering blocking techniques, could resort to conceal the traffic using encrypted protocols and proxies. A fair bandwidth shaping on the file sharing proto- cols in peak hours, when bandwidth starvation occurs, will receive much better acceptance by the network users. It should be also noted that DPI-based detection techniques are not perfect and may interfere with legitimate applications of the same or other users. Most notable example is the interference of the BitTorrent blocking with Bliz- zard content distribution system, which is based on BitTorrent protocol. The use of Peer-to-Peer file distribution systems can not automatically be labeled as illegal - free software is often distributed using P2P systems. Blocking of file sharing traffic does not produce better traffic conditions for other subscribers of the same provider not involved in file sharing. The necessary condition for worsening of the traffic propagation would be a congestion in the provider’s network, which occurs rarely due to high phys- ical capacity in the network and limited bandwidth per single user connec- tion.

43 4 Protocol-oriented solutions

In this section we evaluate two similar solutions, Procera’s PacketLogic and ipoque’s PRX. Both solutions operate in pass-through mode and designed for carrier-grade performance. These solutions are primarily oriented to detection of a wide range of application layer protocols and have no, or very limited capabilities for content recognition. Other solutions fitting in the same functionality and performance category are Cisco CSE, Allot SG Sigma, Sandvine PST and CloudShield Blade Center PN41.

4.1 Procera PacketLogic

Device classification Procera Networks’ product, the PacketLogic series of devices represents a high-performance, scalable and extensible DPI solution. PacketLogic is primarily designed for detection, filtering and throttling of specific protocols, e.g. P2P, therefore should be considered a protocol-based DPI device. However, the flexible architecture of the software also allows content- oriented classification of the traffic to a certain degree.

Hardware/software platform The PacketLogic series products, specifically the PacketLogic Real-Time Enforcement (PLR) are available in various hardware configurations. The PL5600 model is the entry-level device suitable for small organizations and capable of handling bandwidths of up to 100 Mbit/s. The midrange models PL7720 and PL8720 are suitable for large organizations like univer- sity campus networks. Finally, PL10000 represents the high-end perfor- mance level model suitable for large ISPs and capable of handling up to 80 Gbit/s of traffic.

44 Protocol-oriented solutions

Regardless of hardware type, all PacketLogic models have the same set of software features and use the same firmware. This allows for easy upgrade management in a network with different PacketLogic devices used simultane- ously.

PL10000 The high-end PL10000 model is available in two base configurations differing in size and performance. Both configurations have modular design and are capable of using same type of modules. The complete configuration has two or more network interface modules which can carry gigabit or ten gigabit Ethernet ports, management modules, and multiple flow processor units (FPs). A distinctive feature of PacketLogic platform is the ability to utilize varying number of processing modules suitable for the expected perfor- mance. The device is able to automatically distribute the flows to the avail- able processing modules. In our previous tests with the device, we could show that the processing performance scales linearly with the number of installed modules.

CONTROL INTERFACES The platform provides several software components to manage and monitor the operation of the device described in detail in the following section. All software components are integrated and used through the same user inter- face. In addition, the platform can be used via CLI and SNMP interfaces and also provides Python API that makes it possible for the operators to develop their own scripts and applications for automation purposes.

OTHER COMPONENTS Procera’s PacketLogic solution is supplemented by additional components Subscriber Manager (PLS) and Intelligence Center (PIC). Subscriber Manager is able to integrate the PacketLogic devices with provider’s AAA architecture and so provides correlation of the IP addresses detected in the traffic with specific user accounts. It also makes it possible for the platform to apply account-dependent policies for traffic monitoring, filtering and shaping. The Intelligence Center component serves aggregation of the statistic reports from multiple PLR units and provides extensive tools for statis- tical analysis and report generation.

Principle of Operation The PacketLogic platform is able to classify the traffic through various methods. Specifically, it is able to utilize pattern matching and behavioral analysis techniques.

PATTERN MATCHING The pattern matching functionality in PacketLogic is provided by Procera’s advanced identification engine DRDL (Datastream Recognition Definition Language). The main principle in this concept is the definition of recognition rules for each protocol or a specific aspect of it by the programmer, which is then compiled to a highly optimized pattern matching algorithm that can be executed on the hardware. The pattern-matching process is not only able to recognize specific applica- tion protocols, but is also able to extract some of the data in form of attri- butes, specific for each protocol. So, for example, attributes such as URL, User Agent and so on can be extracted the HTTP flows, attributes like user name and basic statistics can be extracted from some gaming protocols. Finally the analysis can also theoretically extract information identifying the transferred content from some of the P2P protocols. While not all attributes are currently extractable from the protocols, the software can be easily extended by Procera if need arises.

45 Protocol-oriented solutions

BEHAVIORAL AND In addition to the pattern recognition, PacketLogic is able to classify traffic HEURISTIC ANALYSIS by its behavior. Most protocols have very specific pattern of data transmis- sion. Typical aspects are direction of the traffic, typical data rate, periodicity and burstiness of the traffic. For example, unidirectional traffic is character- istic for many protocols designed for file transfer, while bidirectional constant traffic with relatively low bandwidth is characteristic for VoIP appli- cation. If such traffic is transmitted through an encrypted connection, the analysis by pattern matching will not be possible, but the common form of the traffic may give hints on the protocol transmitted therein. Finally, the PacketLogic software provides some generic classification methods, such as randomness of the data, which is an indicator for encrypted or compressed data.

INFORMATIONAL The flexibility of the PacketLogic platform is based on its ability to use any of ELEMENTS EXTRAC- the classification results, extracted data (e.g. URLs), auxiliary data (such as TION port numbers, IP ranges), or any combination of them to produce control rules of the flow. Technically, PacketLogic can be used to block specific P2P and web content, as long as the identifiers can be extracted and appro- priate rules are configured. However since integration of specific matches is not performed by a generic database but in global configuration and firm- ware, this type of filtering is not the primary task of this solution and we expect that it won’t be able to scale for a large number (e.g. thousands) of blocked items. Therefore we strongly suggest to verify the scalability of solu- tion, should it be considered for content-oriented filtering.

POLICING According to its classification, each flow can be logged, filtered, or shaped to desired maximum bandwidth. Normally, the classification of the flow can be done after few first packets, after which the flow is either allowed to pass through, shaped, or filtered without need for further analysis. When a flow is filtered by the device, few initial packets will usually pass through, however this will still effectively prevent the data exchange through blocked P2P protocols. Logging and statistics collection can be performed locally on the device to the extent of available capacity, or redirected to a separate logging/statistics server.

Network Connection

TRANSPARENT OPERA- The PacketLogic devices operate in pass-through mode. The PL10000 solu- TION tion is equipped with up to 8 ten gigabit Ethernet ports, organized in 4 transparent “channels” for passing data between two ports in both direc- tions. The device does not have any switching function, so the frames arriving on one port are always passed through to its counterpart and do not leak to another ports. The device therefore can be used as a transparent component on a connection using link aggregation with up to 4 ten gigabit links, or on four completely unrelated ten gigabit links. No additional config- uration is necessary to utilize the device in an environment with Ethernet link aggregation. Alternatively, the device can be operated for monitoring-only purpose and fed with traffic from a mirrored port.

PLACEMENT IN In a provider network, Procera PacketLogic devices can be attached in the PROVIDER NETWORKS same way as all DPI devices that operate in transparent mode, and in accor- dance with the supported performance. Many organizations and institutions that utilize Procera’s solution in their network installed an appropriately dimensioned PacketLogic model on the link between the access gateway to

46 Protocol-oriented solutions

their networks and the service provider. For broadband service provider networks, low- and midrange devices can be utilized on the aggregation nodes and the high range models on the peering points.

RESILIENCY Since the PacketLogic devices act as fully transparent elements, they can be easily integrated into any resilient architecture of the provider’s network without need to adjust the surrounding infrastructure. In addition, Procera provides an active bypass switch, that is able to detect the main unit’s failure and switch traffic optically to a bypass connection within 10 ms.

TRAFFIC ENCAPSULA- The device is able to automatically recognize encapsulation of traffic without TION need of explicit configuration. In our tests, PacketLogic device showed no performance or accuracy issues when analyzing encapsulated traffic.

UNIDIRECTIONAL The device is not capable of analyzing unidirectional traffic. In our asymmet- TRAFFIC rical routing test, all such traffic was put to generic “Unidirectional” class and no further analysis was performed.

Supported Protocols As of 2010, Procera firmware had over 1,000 signature definitions for many application layer protocols and variants. In the tests conducted at EANTC, PacketLogic was successfully able to recognize all widespread P2P protocols, including the encrypted variants, and also many other application protocols from other areas, like gaming, instant messaging, video streaming etc. The platform also allowed a fine-grained recognition for the services based on HTTP, by classifying for example interactive, download and video streaming HTTP sessions. Procera was also able to show the ability to quickly integrate recognition of new protocol signatures into firmware.

Additional potential advantages for the service provider From the perspective of the broadband service providers and large organi- zations, the PacketLogic solution could significantly reduce the amount of P2P traffic in the network. The traffic shaping capabilities provide a good compromise suitable for maintaining a high quality of service level for the subscribers or users. The utilization of the platform against direct download services is problem- atic. While the platform has a capability to tell apart regular, download and online video HTTP traffic, a blanket blocking of such traffic is likely to have negative consequences for the subscribers as it will interfere with many legal downloadable items and web services.

47 Protocol-oriented solutions

4.2 ipoque PRX

Purpose

ipoque1 PRX-10G presents a high-performance protocol-based DPI solution that is implemented on the basis of relatively inexpensive hardware. ipoque’s platform provides multiple hardware options suitable for different performance demands. The high-end model has modular design and load- balancing capabilities that makes possible for the operator to smoothly and simply scale the performance and the price with the number of installed modules.

Platform

OVERVIEW ipoque offers a wide range of models of their filtering solution, with varying performance from ~40 Mbit/s on the entry level device, up to 75 Gbit/s detection performance on the high-end model PRX-10G. The high-end variant is based on stock IBM BladeCenter hardware. By default, the server chassis can fit up to 14 blades each equipped with a dual AMD Opteron CPUs and the network connectivity is implemented through built-in load balancing switches.The PRX series of devices provides protocol-oriented DPI- and behavioral analysis of traffic. Filtering and shaping can be applied to selected traffic according to the protocol policies. The devices can be coupled with the provider’s AAA infrastructure in order to provide subscriber group policies.

EXTENSIBILITY Third party value-added services can be integrated into the platform with the PRX devices used to transparently redirect specific protocols and portions of user traffic to it. Examples of such services can be on-the-fly virus and spam protection, parental control or data optimization for low-bandwidth connec- tions.

Provider Network Integration

OPERATION MODE The PRX10G operates as most DPI solutions in pass-through mode. The device has in total 12 ten gigabit Ethernet interfaces located on the separate network interface modules. Two built-in load balancer switches distribute the traffic to the processing planes over the backplane connections. The distribu- tion by default occurs by a hash value calculated from the source and desti- nation IP addresses. Load balancing is configured such way that the packets of the same IP-flow are always distributed to the same processing blade, guaranteeing optimal performance. At the same time, it guarantees that the pairs of network interfaces on both sides of the device act as transparent channels and the frames transmitted over one channel do not leak to another interfaces. The distribution can be configured in full-mesh, or as partial mesh between groups of interfaces and blades. Each processing blade has an internal ten gigabit connection to each of the switches and has an estimated processing capacity of approximately 5 Gbit/s, which was confirmed during our tests in 2009 for P2P and HTTP traffic mix.

LOAD BALANCING In the tests performed at EANTC in 2009, we were able to determine a slight deficiency of the load balancing mechanism, which led to slightly

1. The company name ‘ipoque’ is correctly written uncapitalized

48 Protocol-oriented solutions

uneven distribution of traffic when the links were saturated. In practice however, it should not lead to problems. The current version of the device needs to be re-tested to determine whether the problem was eliminated by the vendor. Additionally, we monitored that the load-balancing switches may lead to problems when integrating the device into existing switched network, as they act as active components and will require a mechanism to prevent loops.

PROVIDER NETWORK Similarly to other pass-through DPI solution, ipoque PRX-10G is suitable for SUITABILITY use with link aggregation consisting of multiple ten gigabit Ethernet links, as well as for processing of traffic on multiple unrelated links. The PRX-10G device was tested by EANTC in 2009 for its suitability for provider networks. PRX-10G was successfully able to handle encapsulated traffic and showed no reduction in accuracy or performance.

UNIDIRECTIONAL In the asymmetric routing scenario test, the device demonstrated the ability TRAFFIC to detect some of the application protocols in unidirectional traffic, however the detection accuracy was too low to be practical.

PLACEMENT IN The ipoque PRX devices have similar placement possibilities in provider PROVIDER NETWORKS networks like the previously described Procera’s PacketLogic platform. The high-end model PRX-10G performance can be flexibly adjusted by config- uring different number of modules to match the required performance.

Principle of Operation ipoque PRX platform is a DPI classification device primarily designed for the recognition of application layer protocols. The recognition is primarily performed through pattern-matching techniques, but can also utilize behav- ioral analysis for the protocols able to evade the pattern analysis, such as encrypted protocols.

RECOGNITION ENGINE Unlike most other solutions, the recognition is performed entirely in software that is able to run on common x86-architecture platforms. This way, ipoque is able to create a wide range of products, not only for the purpose of P2P traffic interception in provider networks, but also probes for lawful intercep- tion and added value services platform. All these solutions are based on the same recognition engine software, and are able to directly use the common set of signature definitions. This way, all products can be kept updated for new demands that may arise with the appearance of new networking protocols. The detection engine PADE (Protocol and Application Decoding Engine) can be also licensed separately for integration into other networking products, or for creating network services that require accurate and real-time protocol analysis. The vendor describes the operation of the recognition engine as a cascade of classification mechanisms: • The traffic is separated by the individual flows and transport protocols • Data streams are reassembled • The protocols are recognized and necessary information extracted • Protocol events are analyzed for their behavior

URL FILTERING ipoques PRX platform has capability for URL filtering based on URL string pattern matching. The PRX-10G device is capable of holding millions of URL

49 Protocol-oriented solutions

filter entries and technically serve as an engine for specific direct download filtering. The platform however does not offer supporting functions, compo- nents or work processes to assist the maintenance of the URL database beside raw data import. However according to ipoque, the URL filtering functionality was only used to minimal extent by some providers for filtering dozens to hundreds of URLs for special purposes.

POTENTIAL FOR According to ipoque, the functionality of the platform can be extended for CONTENT RECOGNI- basic content recognition in P2P traffic. This functionality was not developed TION further due to lack of interest from the side of service providers.

Additional potential advantages for the service provider ipoque’s solution has similar advantages and drawbacks from the service provider’s perspective as the Procera’s solution described in the previous section. The PRX platform, however, has potential to be extended in the future to include basic content recognition possibilities and already provides HTTP filtering by URL matching.

50 5 Content-Oriented Solutions

In this section, we evaluate to date unique DPI solution from Vedicis, that is capable not only of recognizing various protocols, but also the content transmitted therein. The detection and filtering device, as well as the supporting framework are more interesting for the copyright holders as a tool to selectively block illegal content distribution without blindly blocking all P2P traffic. During our research we did not encounter other products in similar function- ality and performance class.

5.1 Vedicis V-Content Smart Switch

Device Classification The Vedicis filtering solution is a multi-purpose DPI device designed to be easily integrated and managed in provider networks without significant effort. Unlike most other DPI solutions designed to monitor or suppress the P2P and other file sharing traffic, Vedicis solution not only recognizes the file sharing protocols, but also the actual content transferred therein. Content recognition can be performed for a variety of P2P protocols, but also for the traditional HTTP traffic. Specific content is recognized using meta-information contained in many protocols relevant to file sharing to unambiguously identify transferred files. These could be a binary hash value in P2P protocols like BitTorrent or eDonkey, or URL strings in HTTP traffic. This allows for a highly efficient real- time filtering of Internet traffic restricted to specific content, instead of just protocol. The vendor claims that the file-ID based content filtering is more versatile and efficient in contrast to other content-based filtering solutions., For example, the system can use audio analysis of exchanged data in order to identify illegally shared music. As the shared content is usually transferred

51 Content-Oriented Solutions

many times over the same channels, there is no particular need to perform automated content analysis on every transfer. Instead, the audio- and video- based content analysis solutions can be utilized for precise content analysis in ideal “offline” environment and can support the file-ID-based filtering solu- tion by maintaining the database of such file IDs. Moreover, this type of filtering is agnostic to the type of transferred content, and is suitable not only for filtering of illegally shared music or videos, but also for any kinds of data, including software, images and documents. Many conceptually different techniques of content search and recognition, each specializing on tight specific area of content can be utilized together for file-ID database maintenance, while the file-ID based solution will perform the actual filtering.

Platform

OVERVIEW OF Vedicis’ content detection and filtering solution contains different elements, COMPONENTS including optional components. In a minimal form it can be operated as a standalone filtering device, V-Content Smart Switch VP10G, with a separate PC used for database updates, configuration and statistics. In a larger setup, multiple filtering devices can be automatically controlled from a central server called V-Director. New items to be blocked can be added to the hash and URL database and will automatically be synchro- nized to all VP10G devices. The central server will also automatically collect the detailed traffic statistics and alarms.

DATABASE MAINTE- At the very least, the database can be updated manually, or data could be NANCE imported from externally submitted information, for example collected by the copyright holders or from separate companies specialized in P2P content investigation. Vedicis’ claims that the platform allows for easy cooperation with content providers and copyright holders in order to quickly identify illegally shared files. The V-Content Smart Switch devices are able to continuously collect statistics on new content items currently observed in the network and supply a periodic report to the central V-Director system. From this information it is possible to generate a list of popular items shared in the last hours or days. It is obvious that the latest illegal releases of popular movies, music or soft- ware are very likely to be encountered on this top list. The system therefore automatically locates likely copyright infringing files with the potentially highest transfer volume. The potentially infringing files can be automatically or semi-automatically collected for manual review. The items can be uploaded to the global Vedicis Media Services Portal accessible for affiliated copyright holders for review. The copyright holders can then provide an authoritative answer whether a content item is in fact infringing copyrights and therefore should be filtered. The information about the newly identified content then can be distributed to the V-Director and Smart Switch units in order to update the database in real time. In addition, the Vedicis platform can be equipped with additional compo- nents for the purpose of automatized data collection and analysis. The filtering device is able to collect file identifiers not found in the database. This way, new files distributed in file sharing networks can be quickly identi- fied in order to perform analysis of their legality. This data can be used in various analysis components offered and used by the platform. The analysis is performed without direct involvement of the filtering modules and can be done by external companies. The analysis of content legality should involve not only the technical aspects, but also legal verification of each item with

52 Content-Oriented Solutions

the potential copyright holders in order to prevent inclusion of false positives into the database. From the technical perspective, the following activities are possible using Vedicis’ platform: • Automatic collection of new content being exchanged in P2P networks or via direct download sites. The hashes and URLs not found in the data- base can be automatically collected by the filtering modules and aggre- gated on the V-Director server. • Audio and video analysis of the content. Audio and video fingerprinting and analysis technologies are being developed by several companies and have been proven to be infeasible for a real time high-performance online traffic analysis. Instead, this technology can be utilized “offline” for precise analysis of new P2P items in an attempt to determine the copyright status of the content. Vedicis platform can be used to semi-auto- mate this process by detecting new content items on the Internet. • Automatic collection of auxiliary information from Internet forums involved in file sharing. Automatic analysis of some files, especially those posted on the direct download sites is not always possible, as the files are often posted as encrypted archives. Similarly to the problem of encrypted P2P transfers, the encryption primarily serves the obfuscation of the content, and not a protection from public access. Similarly, the passwords to the archives are posted in the file sharing forums along with the links to download pages. Vedicis’ platform is able to detect new direct download links in HTTP traffic, and use the HTTP Referrer string to detect the origin of the link. The resulting webpage could be scooped for potential password strings, which allows for automated archive decryp- tion with a high probability of success. However, relying on HTTP referrer makes sense to some extent but in some territories, like Germany, almost every linking site makes use of “intermediary” sites (like linksave.in or other “redirectors”) that obfuscate the origin of the link. However, the fact that a link encryptor is used can be also seen as an indicator and treated like a known forum site.

Provider Network Integration

LINK AGGREGATION The link aggregation is not directly supported, however, the variant of the SUPPORT device equipped with 4 links can be utilized on an aggregated link. In this case, each frame will be passed only through specific pair of ports and can not be placed on another link. However, the device expects the frames of each bidirectional flow to be transmitted over the same link, which should be the normal behavior of link aggregation implementations.

UNIDIRECTIONAL The device is not efficiently capable of analysis of unidirectional traffic TRAFFIC flows. In our previous tests with the vendor, the device was able to detect only a small portion of traffic in an asymmetrical routing scenario correctly (less than 5%).

ENCAPSULATION Vedicis’ V-Content Smart Switch is capable of processing encapsulated SUPPORT traffic in providers’ networks in a variety of encapsulation protocols including VLAN, MPLS and tunneling protocols such as L2TP and GRE. In our previous tests, we could verify the correctness of detection in these conditions, however, the encapsulated traffic produced a significant impact on performance. We recommend re-evaluation of this feature with the current state of the Vedicis’ platform.

53 Content-Oriented Solutions

RESILIENCY The device does not support resiliency mechanisms directly. For a resilient setup in case of the device failure, the provider should utilize an optical bypass with the capability of detecting traffic failure on the managed link.

Principle of operation

MODE OF OPERATION The Vedicis network component, the V-Content Smart Switch VP10G, is utilized in a provider network in the pass-through mode. The device itself does not act as an active network component and will appear to the surrounding infrastructure as a direct physical link. The device can also be used for passive monitoring of the traffic. For this purpose traffic should be mirrored by the external means such as optical splitter or a pair of mirroring ports. A single VP10G device is equipped with two or four 10 Gigabit Ethernet ports and so could be utilized on one or two 10 Gigabit Ethernet links.

SPECIFIC CONTENT The principle of traffic analysis and filtering on Vedicis’ platform differs IDENTIFICATION significantly from the DPI-based traffic management solutions of other vendors. Instead of classifying the traffic flows only by the protocols, the Vedicis’ solution is able to determine whether the content is known as ille- gally shared. For this purpose, the DPI solution identifies the protocol used in the flow, and extracts protocol-specific information identifying the content. In many simple and conventional protocols used for file transmission, such identifying information can be a filename or in case of HTTP, a URL. State-of-the-art P2P protocols however often identify files unambiguously using a hash identifier. The calculation of this identifier is specific for each protocol and usually involves calculation of a cryptographic hash (using algorithms such as MD4, MD5, SHA1 etc.) over the contents of the file, or (as in case of BitTorrent for example), over the file and additional meta-data such as file names, size etc. This identifier will be well-known to clients searching for specific content and can be used in requests and data trans- missions to unambiguously specify the requested file. Vedicis’ solution is able to extract such identifiers, including filenames, URLs and hash IDs from the streams of various protocols. Further, it maintains a database of known identifiers classified by the legality of the content. By checking the extracted IDs against this database, the filtering solution is able to decide whether this transmission should be allowed or not.

IDENTIFICATION ACCU- The accuracy of the solution mainly depends on the extraction method and RACY on how ambiguous the extracted identifiers are. While the filenames-based identification can be very ambiguous, URLs in HTTP mostly and hash ID in P2P protocols always provide precise classification of the content, as long as the identifier could be successfully extracted. According to Vedicis’ own experience in provider networks, the solution was able to successfully clas- sify about 80% of the traffic. In our own tests in 2008-2009, the solution was able to successfully classify all simulated P2P traffic of protocols BitTor- rent (unencrypted), eDonkey (unencrypted) and . The practical accu- racy of the solution in regard of false positives or negatives is dependent on the quality of the identifier database, which is maintained externally.

POLICING Unlike other solutions, content-based classification clearly defines the legality of each individual flow. The solution therefore does not support traffic throttling and is only able to either passively analyze the traffic, or apply a strict decision to block the flow by dropping their packets or let them

54 Content-Oriented Solutions pass through depending on the detection result (illegal or legal content respectively). In our tests, the solution was able to filter out the illegal content without any successful transmission attempt. At the same time, all transmis- sions for the legal content were successful. As an alternative to filtering and throttling of illegal content flows, the current solution also offers the possibility to mark packets based on their classifica- tion. The marking may be performed using variety of protocol formats, such as: • adding specific MPLS labels • setting VLAN ID • setting DSCP field in IP packets The provider network could use this identification to perform actual filtering, throttling or deprioritization of the traffic. In this case, Vedicis’ solution can work in tandem with the conventional traffic policing mechanisms.

Advanced Features: Protocol Decryption Several modern P2P protocols as well as variants of traditional P2P proto- cols now are able to utilize traffic encryption. Unlike other authenticated and encrypted protocols the encryption usually does not serve the security of the file transfers, but mostly as an obfuscation mechanism. Individual peers on a P2P network in most cases do not have mutual trust, therefore they have to explicitly exchange encryption keys in order to perform an encrypted data transfer. Moreover, there are usually no mechanisms, nor the necessary information to perform any kind of secure authentication. This circumstance allows a monitoring/filtering device located in the network path between two communicating peers to perform a man-in-the-middle attack. It is possible to intercept the key exchange phase and, therefore, be able to access the data transferred over the encrypted connection. Vedicis filtering solution supports algorithms to automatically intercept the encrypted communication of some encrypted protocols including eDonkey, BitTorrent and Ares. In most cases, a man-in-the-middle attack is required, however in case of encrypted eDonkey, the flaws in the protocol’s cryptog- raphy allow for a recovery of the key through passive monitoring of commu- nication. The only goal of this interception is to extract the hash ID of the file being transferred between two peers. This way, encrypted file transfers can be equally verified for the presence of illegally shared content and blocked. While the solution is technically viable and available as an optional feature, this kind of connection interception has legal restrictions in many countries, e.g. as an unsanctioned attack on a secure communication channel.

Additional potential advantages for the service provider Vedicis platform provides service providers an extensive tool to collect the statistics data on content distribution in the network. The statistics collection not only tracks the most popular content items, but also identifies the IP addresses of the users most involved in file sharing. The preventive blocking of the illegal file sharing may be advantageous for providers in order to protect themselves, to some extent, from subpoena requests by the copyright holders and affiliated companies that aim to reveal the identity of the filesharers. Unlike protocol-based solutions, content-based filtering is unlikely to reduce file sharing traffic volume considerably. Therefore, it will have a much

55 Content-Oriented Solutions smaller, albeit positive effect on the overall available bandwidth in network and so improve the overall quality of service available for business and private users. Content-based detection, such as Vedicis’ solution, is capable of blocking only the content explicitly marked as illegally shared and will generally allow unrecognized content. Although several surveys showed that the majority of the P2P traffic carries illegally distributed content, only the popular items are likely to be identified as the “top100 items” and entered to the database. Many content items are shared without reaching very high transfer volumes and are likely to stay under the radar. On the other hand, such items can be under a copyright of numerous small companies and publishers that are difficult to locate and establish contact with.

56 6Web Content Filtering

A separate class of traffic policing solutions emerged as a supporting devices to existing security infrastructure in many companies and organiza- tions. In addition to conventional firewalls, network operator is given addi- tional capability to analyze and police web traffic on the application layer and in regard to content. The presented solutions act as a HTTP/FTP proxy servers are capable of filtering web traffic for malicious and illicit material. In this chapter we evaluate solutions from Blue Coat, Cisco IronPort and eSafe from SafeNet (formerly developed by Aladdin Knowledge Systems). In our market analysis, we also encountered other similar solutions, for example from Exinda.

6.1 Blue Coat

Device classification Blue Coat product palette consists of several appliance types primarily oriented at providing additional protection, performance and enforcement of a service provider’s Internet usage policies. The products are designed for use in the networks of companies and organizations, within their Appli- cation Delivery Network (ADN) infrastructure concept. They can also be utilized by small-scale Internet providers. Some of the solutions and software modules are oriented to analysis and control of application traffic. Specifi- cally, the Blue Coat PacketShaper solution is capable of DPI detection of numerous protocols and shaping of the traffic, while Blue Coat ProxySG is able to analyze and classify content transmitted over widespread protocols HTTP, HTTPS and FTP.

57 Web Content Filtering

Hardware/software platform

DIFFERENT ROLES Both Blue Coat PacketShaper and proxySG solutions come in several vari- ants and classes suitable for different loads and network sizes. The entry- level solution of PacketShaper is capable of handling an estimated 2 Mbit/s of traffic and up to 30 users, while the high-end model designed for use in ISP networks is estimated to handle 300-400 Mbit/s of traffic and up to 20,000 users. The ProxySG solutions are aimed at corporate networks with sizes ranging from just 10 users at the entry level to several thousands in the high-end models. Both presented devices pursuit different goals of optimization. While Packet- Shaper is designed to control protocol traffic and apply policing to it, ProxySG solution primarily serves as an HTTP/FTP proxy. ProxySG is capable of optimizing network performance by caching content and DNS queries, or perform on-the-fly image and HTML compression, which are attractive functionalities for the mobile subscribers Internet access. Blue Coat WebFilter is an additional software component for the ProxySG solution that allows extensive filtering of web content.

OTHER COMPONENTS Additional solutions are available within the platform concept, that allow: • central management of multiple network devices such as PacketShaper, proxying and filtering solutions • e-mail and web filtering appliances with built-in virus and malware detec- tion • traffic analysis solutions capable of monitoring traffic behavior of sepa- rate users and detect frequent violators of Internet usage policies in a company • data leak protection systems capable of detecting when a transfer of sensitive information is done to the Internet

Network connection ProxySG units can be utilized in the network in two different ways. They can serve either as an active HTTP/FTP proxy, or work in transparent mode. When utilized as proxy, the device relies on the network’s firewall to block all traffic not explicitly handled by the ProxySG device, including standard protocols like HTTP. It also requires all users to have the ProxySG device address to be configured in all HTTP/FTP clients in order to obtain access to the Internet. The device is capable of recognizing circumvention of the proxy use by detecting the protocols that are tunneled over HTTP. In transparent mode, the device is able to intercept and classify traffic trans- parently for the users and according to Blue Coat maintains the same filtering functionalities. In this case, explicit configuration of the proxy server is not required at the client machines.

Principle of Operation PacketShaper is a transparent pass-through DPI device. It is capable of recognizing approximately 600 application protocols including many P2P protocols and HTTP traffic types. The solution is capable of blocking unde- sired traffic, or shaping it to a given bandwidth with the legitimate traffic taking precedence. The ProxySG solution however operates primarily as a proxy, and therefore terminates client connections on the device. The HTTP, HTTPS and FTP are

58 Web Content Filtering

supported and may be affected by both optimization and filtering imple- mented on the device.

FILTERING PARAME- HTTP filtering function can operate on the basis of many parameters: TERS • URL strings or match pattern • IP and DNS names of web servers • file types • file sizes • web page category, determined through database lookup • on-the-fly content keyword analysis to determine the approximate content category, if the page is not registered in the database. With the help of the Blue Coat platform administration system, all units can be kept updated in short intervals.

Supported Protocols Blue Coat lists over 600 application protocols from various areas for the PacketShaper solution. The supported protocol areas include standard Internet protocols, P2P, games, instant messaging, multimedia streaming, VoIP and many others, allowing the network administration to selectively suppress or enhance specific activities on the network. The ProxySG solution explicitly supports HTTP, HTTPS and FTP.

Additional features

HTTPS SUPPORT ProxySG solution is capable of handling encrypted HTTPS protocol. The device terminates both segments of the session to the user and to the server and is capable of analyzing cleartext traffic. Similarly to other solutions, the HTTPS interception is done by means of a man-in-the-middle decryption, and requires the clients to trust the certificate presented by the proxy. If clients have already been deployed with trust for a local, e.g. corporate, certificate authority, that authority can be used when proxying HTTPS requests thus avoiding the need to install an additional root certificate in client browsers. ProxySG also supports a whitelist to exclude certain sites (IP addresses) from man-in-the-middle decryption, if additional security and privacy is required.

TRAFFIC OPTIMIZA- ProxySG is capable of web traffic optimization aimed at the low-bandwidth TION user connections. This could be relevant for mobile or dialup users. The solu- tion is capable of cleaning up the HTML code and re-compressing images in order to reach more efficient bandwidth usage.

Additional potential advantages for service provider As already mentioned, the platform is mostly suited for companies, organi- zations and institutions interested in web content filtering for their network. Broadband service providers already utilizing HTTP proxy servers for their subscribers can extend this functionality by web content filtering rules.

59 Web Content Filtering

6.2 Cisco IronPort

Device classification Cisco’s IronPort solutions primarily provide additional protection to the corporate or campus networks against spam, malware and illicit material. The solutions are designed to perform automatic and real-time scan of e-mail and web traffic. The solutions are extendable through various software modules which provide filtering and detection functions. In this section we will primarily describe the web filtering solutions, named “S-Series” appli- ances. The e-mail filtering appliances are principally different in their func- tionality and the utilization within networks and will be only touched briefly.

Platform Cisco IronPort S-series devices are designed to be integrated into existing security architecture of corporate networks. They can be placed in the network immediately behind existing firewall. Their goal is not to replace the firewall functions of controlling the traffic according to IP- and port-based rules, but to enhance these with the filtering of traffic on application layer. The hardware platform comes in several variants suitable for different traffic load and number of clients, and in some cases optimized for specific func- tionalities. The actual filtering functionality is performed in software. The high--end solution IronPort S670 is able to handle up to 5 GBit/s of traffic, according to the vendor. Multiple devices can be managed through a centralized administration system that allows quick application of policies and analysis of usage and violations across the platform.

Network Connection IronPort devices act as active networking components and must be config- ured as the HTTP and FTP proxy on the clients within network. The filtering functionality can only be ensured when the devices are used in combination with a conventional firewall configured to block all HTTP/HTTPS traffic by default and only allow web usage through proxy device. This kind of configuration makes the device less suitable for use in provider networks, as it would require the subscribers to explicitly configure the device as proxy and would make it necessary to block conventional HTTP traffic (i.e. by blocking port 80) in order to enforce its use. In a corporate environment however, this configuration would be relatively simple to enforce.

Principle of Operation

PROTOCOL FILTERING Natively, the S-series supports handling of the HTTP, HTTPS and FTP proto- cols by acting as a proxy device. In addition, the device is capable of redi- recting this traffic to third-party filtering solutions that support ICAP interface.

URL FILTERING URL filtering is performed by matching the website addresses against a data- base of over 20 million websites and assigning them into one of over 50 content categories. The classification database is provided and regularly updated by the vendor and allows for easy implementation of company‘s Internet usage rules by choosing categories allowed or disallowed on the network. The category list includes general website classes, for example

60 Web Content Filtering

„business“, „education“, „technology“, „news“, „social networking“, „gambling“, „pornography“, etc. Network administrators can then define the acceptable Internet usage policy by blocking or allowing specific categories globally, on per-user or per- group basis. In addition they may add specific URLs or domains to the black- or whitelists to further refine the policy. The platform is also able to recog- nize traffic being tunneled over HTTP in order to circumvent firewall blocking policies.

CONTENT FILTERING In addition to plain URL matching rules, the S-series IronPort devices allow definition of rules on the content exchanged via HTTP or FTP. A policy may be defined to block transmission of specific file types, or files that exceed specific sizes. So for example, this may serve the prevention of uploading company-internal documents to the Internet, i.e. preventing information leaks. The files can also be verified using the built-in or external anti-virus solution to prevent the employees from downloading viruses and other malware.

Supported Protocols Cisco IronPort S-series is designed to support only few protocols, specifically HTTP, HTTPS and FTP. The HTTPS support is enhanced through built-in hard- ware encryption/decryption capabilities.

Additional Capabilities

MAIL FILTERING The other IronPort series device types, such as C- and X-series, are designed to handle e-mail. They provide rich filtering functionalities against following treats: • Spam - Mail messages containing spam can be recognized according to various methods, including analysis of the message itself, and by veri- fying the message delivery path. • Viruses - mail attachments can be analyzed by an integrated virus scanner • Phishing and malware - obfuscated links in the mail messages attempting to lure the user to fake websites for the purpose of stealing the credentials or installing the malware can be detected and removed • Illicit images - images found in mail attachments can be automatically scanned by a specialized software module in order to detect porno- graphic images exchanged over mail.

Additional potential advantages for the service provider Similarly to the Blue Coat solution described in the previous section, Cisco IronPort is primarily aimed at corporate and institutional networks interested in web filtering and spam and virus protection for the users. Utilization in large service provider networks would require the subscribers to be forced to use the web proxy provided by the platform for all HTTP traffic. The plat- form is therefore only suitable for cases where web proxy solutions are already in use or considered. One of the examples could be service providers oriented for low-bandwidth access where optimization of IP traffic is desirable, such as 2G/3G mobile service providers.

61 Web Content Filtering

6.3 SafeNet eSafe

Device Classification SafeNet eSafe platform represents a flexible and extendable software plat- form for security enhancement and Internet usage policing in company or organization networks. The eSafe platform was developed by Aladdin Knowledge System, which was later acquired by Vector Capital and merged into SafeNet.

Platform Unlike most other similar solutions, eSafe content filtering platform can be optionally bought as a software-only package suitable for installing on customer’s own PC-like hardware. The software package includes a complete hardened Linux-based OS and other necessary software compo- nents. Alternatively, the platform is available in a conventional way, preloaded on a hardware appliance from SafeNet or one of the partners. Some of the appliance solutions offer high-availability option and can be interconnected in a cluster with up to 8 units, for purpose of resiliency or load sharing. A separate software, eSafe Delivery, provides centralized management of multiple units across the network. It is responsible for controlling high avail- ability configurations, collection of statistics, application of policies and update of signature databases. SafeNet provides a constantly updated data- base of URL classification and threat signatures, which can be used by the eSafe solution automatically.

Network Connection The eSafe solution is utilized in company networks similarly to the other web filtering solutions. It belongs to the class of IDS-based systems and designed to actively terminate all network connections to the Internet in order to inter- cept and analyze the traffic. For some protocols, eSafe solution serves as a transparent proxy capable of detailed analysis and on-the-fly modification of content. Other protocols can be classified and either transparently forwarded or blocked in accordance with the configured policies. As an option, eSafe solution can be used in tandem with third-party filtering solution to handle specific analysis needs via ICAP interface. For example, another DPI solution may recognize a specific application protocol and forward it to the eSafe appliance for additional content analysis.

Performance According to the vendor, the content processing engine running on their appliances is capable of handling up to 38 Mbit/s of HTTP traffic and 1500 concurrent connections. In a 8-unit cluster configuration, the platform was able to handle up to 200 Mbit/s of traffic.

Supported Protocols The solution is primarily oriented for analyzing of HTTP, HTTPS and FTP traffic and optionally may also act as a mail (SMTP) gateway. The support of other protocols is mostly limited to the recognition of the specific protocols and a variety of network messages produced by known viruses and

62 Web Content Filtering

malware. The solution is also capable of recognizing the attempts to circum- vent the filtering policy through use of tunnels or foreign proxies.

HTTP eSafe platform provides an extensive recognition for the various types and aspects of HTTP protocol. The policies are tailored not only for web traffic in general, but take the nature of the web service into account. This way, the operator can easily define policies for popular web services like Google, Gmail, Facebook, etc. and combat the security threats specific to these services. So, for example the policies defined for the company-internal mail service can be equally applied to web-based public mail services, or generic HTTP/FTP file transfers, such as automatic virus scanning or preven- tion of document leaks. Furthermore, the eSafe solution provides a URL classification for web access using a categorization database from SafeNet, or own white- or blacklists.

HTTPS eSafe enables the inspection of HTTPS/SSL traffic by means of a man-in-the- middle interception of encrypted communication. The proxy terminates two separate encrypted connections to the client and to the server and is able to observe the contents in cleartext. IP addresses of specific sites that require privacy and security can be whitelisted, so the traffic will be forwarded transparently and not intercepted. The proxy authenticates itself using a valid certificate that can be signed by a local authority, so that the interception of communication will not cause a warning on a client configured to trust that authority. This is easily possible in a corporate environment where software on the workstations is deployed with strict guidelines, and where existing local certificate authority, e.g. the company’s own certificate can be used as such trusted certificate. In addition, the proxy is capable of verifying the certificates of the servers the clients attempt to connect and may enforce strict policies in regard of expired or incorrectly signed certificates, thus lowering the possibility of security violations by careless users.

Additional potential advantages for the service provider As an IDS-based system with relatively low performance, the eSafe platform is designed for use in company networks and will be unsuitable for the use by Internet Service Providers.

63 7 Subscriber Notification

In this section we evaluate a different class of solutions that are not directed to detect and control the file sharing, but instead serve as a reaction tool. The idea is to notify the users with a warning notice instead of directly blocking their traffic. Similar functionality is also available as on some other platforms such as Cisco IronPort described in the previous chapter.

7.1 Front Porch

Purpose Front Porch is a solution for automatic web user notification that works through interception of HTTP requests. Front Porch does not serve the purpose of regulating users traffic or preventing filesharing. Instead, it provides reaction functionality that needs to be triggered by external anal- ysis tools. Front Porch is able to issue warnings and notifications to the users online.

Platform This solution does not make decisions from the analysis of user traffic, and relies on database information provided by external traffic analysis solu- tions, which can be a DPI solution with content recognition capabilities utilized in provider’s network. So, the notification can be a reaction not only to a user’s web use, but also can reflect his or her actions over other proto- cols, e.g. P2P downloads he does currently or did in the past. The notifica- tions can be configured for one-time, periodical or continuous delivery.

Network Capabilities Front Porch relies on one of the networking components in the provider network, such as a switch or a router, to provide it with traffic on a mirrored port. The solution requires monitoring of both directions of the traffic in

64 Subscriber Notification

order to receive the complete TCP establishment sequence when a client tries to access the Internet. Unlike most DPI filtering solutions, Front Porch device does not directly affect traffic. It also can be placed at any network component involved in traffic forwarding from the subscribers and capable of port mirroring with adequate performance. The only requirement is that the Front Porch device has a significantly lower latency to the subscribers than they have to the Internet.

Principle of Operation

TCP SESSION INTER- The Front Porch solution is equipped with limited DPI functionality and is only CEPTION able to recognize and process HTTP traffic. The traffic is not directly modi- fied in any way and therefore the solution can be fed with mirrored traffic. The detection component of the solution primarily serves the goal of detecting establishment of HTTP sessions from the subscribers currently flagged in the database for the delivery of a personalized message. The detection engine recognizes IP addresses of the flagged users and the protocol, and extracts the HTTP header in order to be able to redirect the user back to the intended site later. Once a suitable establishing connection was detected, the Front Porch device will intercept it by sending spoofed traffic back to the client, containing a fake HTTP response to redirect the HTTP request to the notifica- tion server. This does not prevent the HTTP traffic from the original server reaching the client, but due to a smaller delay, the fake response from Front Porch is usually able to reach the clients faster and so successfully redirect them. The interception of the TCP connection is possible through monitoring the TCP connection process, as the faked response requires correct initializa- tion of TCP sequence numbers in order to be accepted by the client.

REDIRECTION In order to redirect the subscriber’s browser to the notification page, Front Porch sends a HTTP “302 Found” response. This type of response instructs the HTTP client to temporarily request the resource under a different URL. This type of redirect is often used in conventional web services, for example for the purpose of sending an unauthenticated user trying to access restricted content to a login page. All browser and other HTTP clients are required to support it. The client is redirected to Front Porch’s own web server address and to a page containing a personalized notification message for this user. The specific contents are defined through a database and set by the external means, Front Porch does not contain any decision functionality about what concrete information or message should be shown. The contents and presen- tation of the message are highly customizable. The notification message can be shown as a stand-alone page, and contain a link to the URL the user initially intended to access. Alternatively, Front Porch is also able to retrieve the content of the accessed web page, and inject the notification message as a pop-up window, or as a text block within the page. The latter case allows the message to be seen even if the subscriber uses pop-up blockers increasingly popular this day. This mechanism can be used to notify the subscribers about detected viola- tions of copyright through filesharing and also can utilized to indicate legiti- mate possibilities to obtain content the user showed interest in.

65 Subscriber Notification

FIGURE 12. HTTP Request Interception User Web Server TCP Session Establishment Extraction of TCP Sequence Numbers

FrontPorch probe

User Web Server HTTP Request Extraction of HTTP Request Header

FrontPorch probe

Fake HTTP Redirection User Response Web Server Real response from the server ignored

FrontPorch Web Server User is redirected to FrontPorchServer

The notification is displayed User The user must acknowledge the message to continue Web Server HTTP Session

The user directed to the intendedserver afterwards

Supported Protocols Front Porch exclusivelysupports plain HTTP connections,performed directly or viaproxy. Support for HTTPS traffic is not available and not intended, as HTTPS would require a more complex access schema and is also associated with access to sensitiveinformation that should not be disrupted.

66 8 Executive Summary

8.1 Solutions Overview

We analyzed 4 different classes of file sharing-related solutions. As we could see, they differ not only in their performance and protocol support, but often have fundamentally different purpose.

PROTOCOL-BASED The solutions from Procera and ipoque primarily aim at detection and classi- DPI DETECTION fication of many types of traffic, including numerous P2P protocols, but also many other applications. These solutions are designed to work in a non-intrusive manner, by transpar- ently forwarding the traffic through their “channels”. This way, presence of a DPI device does not require any configuration on the users’ side and should take little effort for the provider to integrate them into their networks.

CONTENT-BASED DPI The solution from Vedicis takes the Protocol-based DPI analysis as a basis DETECTION and extends it with recognition of individual content items in several popular P2P protocols like BitTorrent, eDonkey and Ares, as well as URL-based clas- sification for HTTP. Although the previously described solutions from Procera and ipoque are not primarily designed for this kind of detection, they still posses a limited functionality to filter the traffic depending on the content. Mostly, this is limited to HTTP only, but in principle, this functionality can be extended in the future.

WEB PROXY SOLU- Blue Coat and Cisco IronPort present a different approach by implementing TIONS a web proxy in order to intercept and filter the web traffic. These kind of solutions are more intrusive for both users and network operators, as they require specific configuration and also network design. Such solutions are best deployed not by ISPs, but by companies and organizations where necessary policy can be easily deployed.

67 Executive Summary

They are also limited in regard of protocol support and therefore are not suit- able for explicit P2P traffic limiting, instead, it is expected that such devices operate in a firewall-restricted environment, that does not allow other types of traffic per default. Operating as proxy however gives them more versatility and control over HTTP traffic. The surveyed devices are easily capable of filtering web content based on URLs and other parameters and are also easily extendable with other analysis plugins for specific needs.

SUBSCRIBER NOTIFI- Finally, the FrontProch solution presented in this study falls into a class of its CATION own. This is not a solution for detection or blocking of traffic, but for submit- ting notifications to the subscribers in real-time. This system obviously needs to be used in a combination with other tools that collect and prepare infor- mation, the FrontPorch solution is only responsible for presenting it to the users by the means of intercepting their web traffic.

8.2 Vendor Comparison

In the following table, we summarize the functionality supported by various solutions analyzed in this study. It should be noted that individual solutions were designed with different intents and so may have different range of supported functionality, but also different interpretation of the support. For example Vedicis’ solution provides support for detection of numerous protocols, similarly to DPI solutions from Procera and ipoque, but only a limited number of protocols are also suitable for content-based filtering.

TABLE 2. Overview of Technology Effectiveness

Solution/Platform Content User Protocol-based -based Web Proxy w/ Content Notifi- Class DPI DPI Filtering Function cation

Functionality Procera Packet Logic ipoque PRX Vedicis Blue Coat ProxySG Cisco IronPort SafeNet eSafe FrontPorch Analysis Type Protocol Detection yes yes yes HTTP/FTP HTTP/FTP yes N/A only only Behavioral Analysis yes yes no informa- no no no N/A tion Content Recognition no noa yesb no no no N/A (P2P only) URL filtering limited yes yes yes yes yes N/A Information Extraction yes no/ no no/ no/ no/ N/A limited unknown unknown unknown Content Extraction no no no no/ no/ no/ N/A unknown unknown unknown Content Analysis no no no no no no N/A

68 Executive Summary

Solution/Platform Content User Protocol-based -based Web Proxy w/ Content Notifi- Class DPI DPI Filtering Function cation

Functionality Procera Packet Logic ipoque PRX Vedicis Blue Coat ProxySG Cisco IronPort SafeNet eSafe FrontPorch Actions Per-user statistics/ yes no infor- yes yes yes yes yes actions mation User notification no no no no yes no/ yes unknown Traffic Blocking yes yes yes yes yes yes N/A Traffic Throttling yes yes no no no no N/A Protocol Support HTTP/FTP yes yes yes yes yes yes HTTP only HTTPS yes yes protocol yes yes yes N/A only HTTP Downloads yes yes yes yes yes yes N/A Online Videoc yes yes protocol yes yes yes N/A only Plain P2P yes yes yes no no yes N/A Encrypted/ heuristic heuristic somed no no no/ N/A ObfuscatedP2P unknown Anonimised P2P heuristic heuristic no no no no N/A P2P Streaming yes yes no no no no N/A Performance Class Large ISPs/Carriers yes yes no no no no no Medium/Small ISPs yes yes yes no no no yes Company/Org. LAN yes yes yes yes yes yes yes Adv. Encapsulation yes yes yes no no no no Throughput, Gbit/s 1-120 1-80 1-10 0.002- 1-5e 0.038f 1 0.3 a. Implementation technically possible in the future b. few selected P2P protocols c. primarily Flash-based video d. Supports encrypted eDonkey, compressed Gnutella e. Estimated from the number of interfaces f. Measured HTTP throughput performance

69