Sneaking Past the Firewall: Quantifying the Unexpected Traffic on Major TCP and UDP Ports

Shane Alcock, University of Waikato, Hamilton, New Zealand, [email protected]
Jean-Pierre Möller, University of Waikato, Hamilton, New Zealand, [email protected]
Richard Nelson, University of Waikato, Hamilton, New Zealand, [email protected]

ABSTRACT

This study aims to identify and quantify applications that are making use of port numbers that are typically associated with other major Internet applications (i.e. port 53, 80, 123, 443, 8000 and 8080) to bypass port-based traffic controls such as firewalls. We use lightweight packet inspection to examine each flow observed using these ports on our campus network over the course of a week in September 2015 and identify applications that are producing network traffic that does not match the expected application for each port. We find that there are numerous programs that co-opt the port numbers of major Internet applications on our campus, many of which are Chinese in origin and are not recognized by existing traffic classification tools. As a result of our investigation, new rules for identifying over 20 new applications have been made available to the research community.

Keywords

traffic classification; firewalls; application protocols

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
IMC ’16, November 14–16, Santa Monica, CA, USA.
© 2016 ACM. ISBN 978-1-4503-4526-2/16/11. $15.00
DOI: http://dx.doi.org/10.1145/2987443.2987447

1. INTRODUCTION

Practical network operations are still heavily dependent on port numbers from transport layer headers (i.e. TCP and UDP) for traffic control purposes [6]; firewall rules being one of the more obvious examples. Port-based controls are popular because they are easy, and therefore cheap, to implement in hardware and because they are simple for operators to comprehend. The problem with port-based approaches is that port numbers cannot reliably identify all applications. Peer-to-peer (P2P) applications, in particular, have long been using random port numbers to avoid having their traffic dropped or rate-limited by network operators [11]. In response, standard operating practice has moved towards a ‘default deny’ approach to firewalling, particularly within corporate or campus networks, where only traffic on ports associated with essential services is allowed to traverse protected segments of the network.

However, this ignores another problem: what if other applications start using the ports that are typically associated with the essential services? Previous research has already identified that not all TCP port 80 traffic is HTTP [5] [8] [10] [15], although in each study the ‘unexpected’ traffic on TCP port 80 was not analyzed in depth or associated with particular applications. Other ports used by essential services have been ignored in past research, such as TCP port 443 (HTTPS), UDP port 53 (DNS) and the alternative HTTP ports (TCP port 8080 and 8000). As a result, we know that these ports may sometimes be used by other applications but little about which applications are doing so and the amount of traffic that these applications are ‘sneaking’ past port-based traffic controls.

In this paper, we investigate and catalogue unexpected traffic on the set of ports that are most likely to be open on many, if not all, firewalls: port 53, 80, 443, 8000, 8080 and 123. Our aim is to not just demonstrate that unexpected traffic exists on these ports (as this has already been established in previous literature), but to also investigate further and try to correctly identify the applications responsible for as much of the unexpected traffic as possible.

Using a week-long packet capture from our campus network and the libprotoident [1] traffic classification library, we have successfully identified many of the applications that were using these port numbers and can therefore bypass a typical firewall configuration undetected. We also quantify the effect of these applications in terms of the number of flows and amount of bytes that each application was contributing. Many of the applications that we identified originate from China and, as far as we are aware, were not able to be recognized by existing open-source traffic classifiers prior to this research being conducted. Rules for 22 new application protocols have been added to libprotoident and made freely available through the latest software release.

Because our analysis is from a single campus network with a high proportion of Chinese and Taiwanese overseas students (8.4% of students are from China / Taiwan), we recognize that it would not be appropriate to generalize our observations to the Internet as a whole. Rather, we present this study as a single data point that can be compared against by people with access to datasets from other networks with different demographics. To aid in the continuation of study in this area, the new protocol matching rules that we developed have been included in the latest libprotoident release and are therefore freely available to other researchers.

2. METHODOLOGY

2.1 Packet Traces

The traceset used in this analysis consists of seven days and seven hours of packet capture taken in September 2015 at the University of Waikato, during a teaching semester at the University. During this period, 8.4% of enrolled students (1,048 out of 12,502 enrolments) were international students from either China or Taiwan; the relevance of this statistic will become apparent when we present our results in Section 3.

The packets were captured using an Endace Probe that was installed at the edge of the campus network, outside the campus firewall. In this location, the Probe was able to observe all traffic entering and exiting the University. The packets captured by the Probe were forwarded to an instance of our custom WDCap trace capture software [17] that was running in a virtual machine on the probe. WDCap was configured to anonymize the local IP address (i.e. the University address) using the standard prefix-preserving CryptoPAn method [9] and to truncate each packet to contain no more than four bytes of post-transport header payload. Both of these modifications were to preserve the privacy of the network users. The remote IP address was left unmodified (with permission from the University); this was primarily to aid us in identifying the source of new applications encountered during our analysis.

2.2 Libprotoident

Libprotoident [1] is a software library for lightweight [...]

The analysis presented in this paper was conducted using the lpi_protoident tool that is included with the libprotoident library. This tool prints a single line of output for each bidirectional flow found in the input traffic dataset containing the flow 5-tuple, the time that the flow began, the number of bytes sent in each direction and the application protocol that libprotoident has assigned to that flow. TCP flows where a complete handshake was not observed are ignored, as we cannot be certain that the first payload-bearing packets for those flows will have been observed. For UDP, we already have to accept that the first payload-bearing packets may be missing, even if the flow begins and ends within the time period covered by our dataset, so all observed flows are reported. Flows are expired after either a period of inactivity (2 minutes for UDP flows and unestablished TCP flows, 2 hours for established TCP flows) or the observation of a connection termination signal (e.g. a TCP RST or a FIN in both directions).

To prevent scans, backscatter and other unsolicited traffic from affecting our results, we ignored all flows where both endpoints did not participate in the flow. For TCP flows, the handshake must have been completed successfully and at least one endpoint must have sent a packet bearing payload. For UDP flows, only flows where both endpoints had sent a payload-bearing packet have been included in the analysis.

Our initial attempt at analysis with the most recent libprotoident release resulted in a significant quantity of unidentified traffic. Much of the research effort in this study therefore went into developing new libprotoident rules to reduce the amount of unidentified traffic in our dataset as much as possible. As a result, we have been able to add rules to libprotoident for 22 new applications, as well as improve the quality of the rules for a further 10 applications that were already supported. Most of the new applications are Chinese in origin and include Kankan, Kakao, Weibo, Kugou and Xiami.

Space limitations mean that we are unable to fully describe the changes to libprotoident here; instead, we have created a companion webpage¹ for interested readers with more detail on the new applications that we have identified as a result of this work and the rules that we have written to match them. The new and improved protocol rules have also been made available publicly in the 2.0.8 release of libprotoident, which is open-sourced under the GPL.
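
The flow expiry and flow filtering rules described in Section 2.2 can be illustrated with a short sketch. This is not code from libprotoident or lpi_protoident: the `Flow` record, its field names and the three helper functions are hypothetical and exist only to restate the rules from the text in executable form. The only values taken from the paper are the two idle timeouts and the participation criteria; treating a flow as expired once the idle time reaches the timeout exactly is an assumption.

```python
from dataclasses import dataclass

# Idle timeouts quoted in Section 2.2, expressed in seconds.
SHORT_IDLE_TIMEOUT = 2 * 60        # UDP flows and unestablished TCP flows
LONG_IDLE_TIMEOUT = 2 * 60 * 60    # established TCP flows


@dataclass
class Flow:
    """Hypothetical per-flow summary; not libprotoident's own structure."""
    transport: str               # "tcp" or "udp"
    handshake_complete: bool     # only meaningful for TCP
    payload_pkts_out: int        # payload-bearing packets, local -> remote
    payload_pkts_in: int         # payload-bearing packets, remote -> local
    last_seen: float             # timestamp (seconds) of the latest packet
    rst_seen: bool = False       # a TCP RST was observed in either direction
    fin_out: bool = False        # a FIN was observed local -> remote
    fin_in: bool = False         # a FIN was observed remote -> local


def idle_timeout(flow: Flow) -> int:
    """Pick the inactivity timeout that applies to this flow."""
    if flow.transport == "tcp" and flow.handshake_complete:
        return LONG_IDLE_TIMEOUT
    return SHORT_IDLE_TIMEOUT


def is_expired(flow: Flow, now: float) -> bool:
    """Expire on a termination signal or after the idle timeout.

    Assumption: the boundary case (idle time equal to the timeout)
    counts as expired; the paper does not specify this.
    """
    if flow.transport == "tcp":
        if flow.rst_seen or (flow.fin_out and flow.fin_in):
            return True
    return now - flow.last_seen >= idle_timeout(flow)


def include_in_analysis(flow: Flow) -> bool:
    """Keep only flows in which both endpoints participated."""
    if flow.transport == "tcp":
        # Completed handshake plus payload from at least one endpoint.
        any_payload = flow.payload_pkts_out > 0 or flow.payload_pkts_in > 0
        return flow.handshake_complete and any_payload
    if flow.transport == "udp":
        # Payload-bearing packets must have been sent in both directions.
        return flow.payload_pkts_out > 0 and flow.payload_pkts_in > 0
    return False
```

Under these assumptions, a TCP SYN scan that never completes the handshake is rejected by `include_in_analysis`, as is a UDP probe that receives no reply, while an established TCP flow carrying payload in only one direction is kept.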