FreeLab: A Free Experimentation Platform
Matteo Varvello♣, Diego Perino⋆
♣AT&T Labs – Research, ⋆Telefónica Research

ABSTRACT

As researchers, we are aware of how hard it is to obtain access to vantage points in the Internet. Experimentation platforms are useful tools, but they are also: 1) paid, either via a membership fee or by resource sharing; 2) unreliable, since nodes come and go; 3) outdated, since they often still run on their original hardware and OS. While one could build yet another platform with up-to-date and reliable hardware and software, it is hard to imagine one which is free. This is the goal of this paper: we set out to build FreeLab, a free experimentation platform which also aims to be reliable and up-to-date. The key idea behind FreeLab is that experiments run directly at its users' machines, while traffic is relayed by free vantage points in the Internet (web and SOCKS proxies, and DNS resolvers). FreeLab is thus free of access by design and up-to-date as long as its users maintain their experimenting machines. Reliability is a key challenge due to the volatile nature of free resources, and the introduction of errors (path inflation, header manipulation, bandwidth shrinkage) caused by traffic relays.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
HotNets-XVI, November 30-December 1, 2017, Palo Alto, CA, USA
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-5569-8/17/11. . . $15.00
https://doi.org/10.1145/3152434.3152436

1 INTRODUCTION

A large number of measurement platforms [3, 21, 37] and experimentation test-beds [21, 27, 39] exist today, each with different access rules and capabilities. The rise of cheap(er) cloud computing has also motivated researchers to build small-scale dedicated test-beds, e.g., using Amazon EC2 [22, 23], for their experiments. Luminati/Hola [4, 41] have also been used to gather a large set of distributed vantage points.

These experimentation platforms share a few limitations. First, they are paid, either via a membership fee or by donating hardware resources. Second, they are often unreliable, in the sense that nodes are frequently down and exhibit unpredictable behaviors. Third, they are outdated, as nodes were deployed at some point and both their hardware and software (OS) have not been updated ever since. As an example, our measurements show that only 10% of the PlanetLab (Europe) nodes are active; even worse, active nodes still run Fedora 8, which will be 10 years old in November 2017.

In this work, we set out to build a free experimentation platform which can also be reliable and up-to-date. In classic experimentation platforms, applications run directly at vantage points; we invert this rationale by proposing to use vantage points as traffic relays while running the application at the experimenter's machine(s). By leveraging free Internet relays as vantage points, we can make such an experimentation platform free. The drawback of this approach is the introduction of extra errors (path inflation, header manipulation, bandwidth shrinkage) which need to be carefully corrected.

This paper presents FreeLab, a free experimentation platform built atop thousands of free HTTP(S) and SOCKS(5) Internet proxies [38], which enable experiments based on TCP, UDP, and HTTP(S), and thousands of free DNS resolvers [34], which enable DNS-based experiments. The key challenges of building FreeLab are twofold. First, free Internet resources need to be discovered and constantly monitored to provide reliable vantage points. Second, relaying an application's traffic through a vantage point is not the same as running the application at the vantage point. Additional delays and header manipulations are a few of the issues that can cause the two methodologies to diverge. FreeLab handles these issues by constructing correction filters before each experiment and applying such corrections both on the fly (headers) and after the experiment is completed (packet timestamps).

While FreeLab is not perfect, in this paper we discuss and demonstrate a fair set of experiments that can already be run on it. FreeLab's code is not yet ready for prime time, but our ongoing work consists in open sourcing it as soon as possible. We hope that FreeLab's novel design will stimulate research on how to improve the accuracy of the correction filters and augment the set of experiments it can run.

2 MOTIVATION AND RELATED WORK

We are not aware of any system like FreeLab. However, many works have investigated the design and deployment of experimentation platforms; a complete survey can be found in [1]. Recently, a few works have leveraged novel ideas to gather vantage points all around the world, and are thus great motivations for our work. We briefly discuss them here.

Hola [12] is a popular Chrome plugin that provides access to (private) proxies from many locations around the world. Hola users can either pay for the service or get it for free if they contribute by proxying traffic for other users. A commercial version of Hola is also available under the name Luminati [20]. Tyson et al. [41] use Hola to gather 143k vantage points in 3,818 Autonomous Systems (ASes) and study header manipulations by middleboxes deployed in the Internet.
In [4], the authors use Luminati to detect violations of application-level end-to-end connectivity on the Internet, i.e., NXDOMAIN hijacking, HTTP content manipulation, and HTTPS certificate replacement. FreeLab shares a similar rationale with [4, 41], in the sense that it aims at using multiple vantage points without access to such nodes. However, FreeLab generalizes the idea with the goal of building a free and full-fledged experimentation platform.

Perhaps the best motivation for FreeLab comes from our recent work [25] where we built ProxyTorrent, a distributed system to study and efficiently use free web proxies, i.e., free-of-charge intermediary boxes enabling HTTP(S) connections between a client and a server. We made our system publicly available in March 2017 through a Chrome plugin which emulates Hola functionalities across free web proxies [7, 26]. Since its release, our plugin has attracted ∼1,500 users and has facilitated the download of ∼2TB of traffic. Figure 1 shows the plugin's number of weekly users, installations (total users), and downloads, i.e., how many URLs have been requested to be proxied, during the months after its release. The number of downloads peaked at 8,000 and 12,000 on 05/10 and 06/05, respectively. Manual inspection reveals that these extra downloads are due to subsequent sessions that retrieve the exact same amount of bytes via different proxies. This activity does not resemble human behavior but rather some automated tool.

Figure 1: Adoption and usage of our Chrome plugin. (Plot of Count [#], from 1 to 10k, over dates from 03/17 to 06/15; series: Downloads, Total users, Avg weekly users, Automated Downloads.)

Based on the heuristic above, we augment Figure 1 with the number of automated downloads, i.e., plugin usage that we suspect is due to a script rather than a human. In addition to confirming the two peaks discussed above, about 21% of the downloads are labeled as automated. While identifying the goal behind these automated downloads is hard, this observation suggests that several users are already exploiting free web proxies as distributed vantage points.

3 FREELAB

This section describes FreeLab, a free experimentation platform with a worldwide footprint. We define an experimentation platform as a collection of accessible computers (vantage points) available as a testbed for building distributed systems, performing measurements, etc.

In order to be a free platform, FreeLab relies on three key ideas. First, it shifts the experiment logic from the vantage points to the machine(s) of an experimenter, i.e., a researcher aiming to run an experiment. Second, it leverages free resources on the Internet as vantage points. Third, it uses correction filters to remove the bias caused by not running an experiment directly at a vantage point.

Like most experimentation platforms, FreeLab is composed of vantage points, a control server, and a client. Figure 2 shows FreeLab and how its components are interconnected. We now describe each component in detail, and then discuss what we envisage FreeLab can be used for.

Figure 2: FreeLab, a free experimentation platform. (Diagram labels: Legacy Applications, Control Server, FCA, Monitoring Module, Repository, Client, Testing Module, Manager Module, Vantage Points.)

3.1 Vantage Points

In a classic experimentation platform, the vantage points define the footprint, i.e., from how many locations experiments can be run. The software and hardware characteristics of a vantage point define which experiments it can run, e.g., some vantage points might run old OS versions where recent libraries cannot be installed. FreeLab's vantage points also define its footprint, but the experiments they can run solely depend on which traffic type they can forward. We have identified three sets of free resources in the Internet which can forward, respectively, HTTP(S), TCP/UDP, and DNS traffic, allowing FreeLab to support a great number of experiments.

Free Web Proxies: These are, in theory, hundreds of thousands of hosts [38] that forward HTTP(S) traffic at no cost. In reality, our previous work (see Section 2) shows that most hosts are down, many are volatile, and a few even compromise users' privacy or attempt to install malware. On average, only ∼2 thousand safe and reliable free web proxies are available daily (see Figure 3(a)), out of which 46% can also forward HTTPS traffic. FreeLab uses reliable and safe free web proxies to support experiments based on HTTP/HTTPS.

Free Socks Proxies: These are publicly available hosts on the Internet that support the SOCKS protocol [18], i.e., a mechanism to transfer TCP/UDP traffic via a proxy. TCP forwarding is supported by all SOCKS versions and proxies; UDP forwarding is instead supported only by free SOCKS5 proxies. FreeLab leverages free SOCKS proxies to enable experiments based on generic UDP and TCP traffic.

Free Public DNSs: These are public DNS resolvers that are reachable by IPv4 or IPv6. Lists of public DNS resolvers are easily found online; for example, [34] contains 44,875 DNS resolvers located in 222 countries. FreeLab leverages free DNS resolvers to forward a client's DNS request with the goal of obtaining DNS responses as if the request originated from the vantage point. Accordingly, free DNS resolvers with an anycast IP cannot be used, since DNS requests would be redirected to the machine closest to the testing client. Similarly, DNS resolvers supporting the EDNS Client Subnet (ECS) [5] option might "leak" the subnet part of the originating client's IP address, regardless of a client's indication, and should thus be avoided.

3.2 Control Server

The role of the control server in an experimentation platform is to maintain up-to-date information about the vantage points, handle user access and subscriptions, etc. A control server is usually accessible via an API, such as the PlanetLab Central API (PLCA) [30]. FreeLab also includes a control server with the same logical role, apart from user access and billing since FreeLab is free! However, FreeLab's control server requires a bigger effort when monitoring its vantage points due to the volatile nature of free resources.

FreeLab's control server runs a monitoring module which monitors the vantage points and stores their availability in a repository. ProxyTorrent [25] is used to discover and maintain up-to-date web proxy statistics, and will soon also integrate SOCKS(5) proxies. At a high level, this requires scanning several free proxy aggregator websites (www.gatherproxy.com, www.samair.ru, etc.) and fetching through each proxy a "bait" webpage we have crafted as well as a few popular websites. We then call a proxy safe if the content received via the proxy is equivalent to the one received when no proxy is used, HTTP headers were not modified, etc.

For public DNS resolvers, we plan to bootstrap from the list in [34] and periodically probe these resolvers to verify their state and settings (e.g., enabled optimizations). To detect DNS resolvers using anycast, we can run traceroute from a few network locations and compare the paths returned. If the last AS prior to the destination is different across paths, this suggests that the DNS resolver uses an anycast IP address, and it cannot be used by FreeLab. For ECS, we resolve a domain name under our control via each free DNS resolver. By observing DNS traffic received at our authoritative nameserver, we can detect which (if any) free DNS resolvers add the client IP address via ECS, and discard them.

The data collected by the monitoring module is accessible via the FreeLab Central API (FCA). FCA allows a client to query for vantage points with some preferences, e.g., country of origin, and to report statistics about the vantage points used for an experiment. Such user-generated data complements the statistics collected by the monitoring module. FreeLab aims to be free, but these operations have a cost. However, this cost is minimal compared to deploying and maintaining thousands of vantage points around the globe. Further, these operations could be distributed to FreeLab's clients once a critical user base is reached.

3.3 Client

An experimentation platform also needs a client component which is responsible for the actual experiments. At a high level, this consists of an application to be run, e.g., an instrumented browser or user code, and a manager module, e.g., a tool to install an application with its dependencies or schedule when and where the application should run. This module can be provided by the platform owners [21], by third parties [31], or left to be implemented by the user.

Despite sharing the same high-level goals, FreeLab's client departs from the classic platform clients we hinted at above. FreeLab does not distribute N instances of the application logic across the vantage points; instead, such instances run locally and only their traffic is relayed by the vantage points. In addition, the traffic has to be corrected to account for the introduction of traffic relays. For these reasons, FreeLab's client also contains a correction module and a testing module.

3.3.1 Manager Module. This module is responsible for managing an experiment. Accordingly, it is instructed on how to run an application along with a set of parameters, e.g., target vantage points and number of repetitions. Table 1 provides a description of the parameters currently supported by FreeLab.

Table 1
Name          Type    Description
COMMAND       String  Command/application to be run
NUM_REPS      Int     Number of repetitions
NUM_VANTAGE   Int     Number of vantage points
NUM_CONT      Int     Number of containers to be run in parallel
COUNTRIES[]   List    [...]

Next, the manager module selects the vantage points based on user input (NUM_VANTAGE, COUNTRIES[], PROTOCOL). This is achieved in two steps. First, by querying the control server via FCA and passing the desired properties. Second, by testing each vantage point via the testing module. These tests (see below) are used to confirm the reachability and behavior of the vantage points as well as to build correction filters for the correction module.

Once vantage points and respective correction filters are identified, the manager module sets up NUM_CONT isolated environments where the application instances run. This is achieved via Docker [8], but alternative technologies such as Linux containers [19] and unikernels [42] are also suitable. Each Docker container is configured such that the application traffic is forwarded to the vantage point indicated by the manager module. This is realized in different ways depending on the PROTOCOL specified (Table 1). For TCP/UDP we use redsocks [35], a transparent SOCKS redirector; for HTTP(S) we use a transparent Squid proxy [40] running on localhost. For DNS, we set the local recursive resolver to the free DNS resolver. Once the container setup completes, the manager module schedules (TIME[]) each application instance (COMMAND) at its container. Meanwhile, tcpdump is started in the background since it is needed by the correction module (see below).

3.3.2 Testing Module. This module tests vantage points returned by the control server. These tests verify that the vantage points provided are reachable and safe, and allow [...]

[...] RTT) plus the cost of returning a data stream to the client (half RTT). The client can easily estimate the network delay, e.g., via a simple TCP handshake. The processing delay is attributed to operations like packet forwarding, cache insertion, and HTTP header manipulation. As measured in [23], packet forwarding is negligible compared to the other operations. In addition, caching and header manipulation are not possible [...]
Figure 3: Preliminary results when comparing FreeLab with PlanetLab (Europe).

[...] via curl [6] a simple 1KB page we host. 95% of the nodes complete the task requested, and we find that the majority of them (60) still run Fedora 8 (2007); only 10 machines run modern OSs, i.e., Fedora 24 and 25 (2016).

Using the above methodology, we monitored PlanetLab every 5 hours for 10 days. Figure 3(a) shows the time evolution of the number of PlanetLab nodes in boot state, the ones for which SSH succeeds, and the ones that can retrieve a simple 1KB webpage. Figure 3(a) also shows the availability of FreeLab's vantage nodes. For now, we only report on free web proxies, but we are currently extending FreeLab to also include SOCKS(5) proxies and DNS resolvers.

Figure 3(a) shows that there are ∼95 stable working vantage points in PlanetLab, but the remainder are unusable although PLCA reports them in boot state. Further, the working machines run Fedora 8, where installing recent applications is challenging. In comparison, FreeLab has a larger but less stable footprint (between 1,700 and 2,400 working vantage points), which is expected from free Internet resources.

4.2 Demonstration

We demonstrate FreeLab's functioning by testing the downloads of variably sized web objects from [13], an HTTP request and response service. For validation, we extend FreeLab with 10 working PlanetLab machines where we installed polipo [32], a lightweight caching web proxy. In this way, we obtain a set of vantage points where an experiment can run directly (à la PlanetLab) and that can also relay our traffic (à la FreeLab). Web objects are downloaded using (legacy) curl, and the metric analyzed is download time. For PlanetLab, we use the download time as directly reported by curl; for FreeLab, we compute the download time from the adjusted pcap trace as the time between the beginning of the DNS request and the end of the HTTP transaction.

Figure 3(b) shows boxplots of the download time (100 downloads) of a 100KB object retrieved using 10 vantage points à la PlanetLab and à la FreeLab ("FreeLab over PL"). The figure also shows boxplots of the download time for classic FreeLab when using web proxies (A-D) chosen across four different continents. Note that the x-axis is sorted by distance (Km) from the server, i.e., 1-10 and A-D are each sorted by distance. The figure shows that the median download time measured via FreeLab over PL always falls within the 25-75th percentile range of the download time measured via PlanetLab. This result is encouraging since it indicates the effectiveness of the correction filters. Further, the download time measured via FreeLab mostly falls within the range of values measured over PlanetLab, with the exception of vantage point D. We argue that this is an artifact of PlanetLab, where the vantage points are generally connected with very fast links. Indeed, while for FreeLab we observe that the download time increases as the distance from the server grows, a similar trend is less observable for PlanetLab.

Figure 3(c) shows boxplots of the download times of variably sized objects (1KB, 10KB, 100KB, 1MB) via a single vantage point used à la PlanetLab and à la FreeLab. For FreeLab, we also vary the location (Spain and Italy) of the client where curl runs. The figure further confirms the previous results, i.e., consistent download times across object sizes, experimentation platforms, and client locations.

5 CONCLUSION AND FUTURE WORK

This paper has presented FreeLab, a free experimentation platform with a worldwide footprint. FreeLab is founded on the idea that an experiment does not have to run at the vantage points of an experimentation platform. Instead, FreeLab moves the experiment logic to the user's machine and leverages vantage points only as traffic relays. We use this idea to build FreeLab atop free web proxies and plan to expand to free SOCKS(5) proxies and free DNS resolvers as well. The paper also discusses the traffic corrections (packet timestamps and headers) needed to ensure the correctness of an experiment over FreeLab. Preliminary results are encouraging and motivate us to finalize FreeLab and release it to the public. Our long-term goal is to build a community around FreeLab which will contribute ideas and code to further increase the set of experiments it can run.

REFERENCES

[1] V. Bajpai and J. Schonwalder. 2015. A Survey on Internet Performance Measurement Platforms and Related Standardization Efforts. IEEE Communications Surveys and Tutorials 17, 3 (Apr 2015), 1313–1341.
[2] Mike Belshe, Roberto Peon, and Martin Thomson. 2015. Hypertext Transfer Protocol Version 2 (HTTP/2). RFC 7540. (May 2015). https://doi.org/10.17487/RFC7540
[3] BISmark. 2017. The Broadband Internet Service Benchmark. (2017). http://projectbismark.net/.
[4] Taejoong Chung, David Choffnes, and Alan Mislove. 2016. Tunneling for Transparency: A Large-Scale Analysis of End-to-End Violations in the Internet. In Proc. ACM IMC. Santa Monica, California, USA.
[5] Carlo Contavalli, Wilmer van der Gaast, David C Lawrence, and Warren Kumari. 2016. Client Subnet in DNS Queries. RFC 7871. (May 2016). https://doi.org/10.17487/RFC7871
[6] CURL. 1997. Command line tool and library for transferring data with URLs. (1997). https://curl.haxx.se/.
[7] Diego Perino, Matteo Varvello, and Claudio Soriente. 2017. Ciao code. (2017). https://github.com/ciao-dev/CIAO.
[8] Docker. 2013. The world's leading software container platform. (2013). https://www.docker.com.
[9] Constantinos Dovrolis, Parameswaran Ramanathan, and David Moore. 2004. Packet-dispersion techniques and a capacity-estimation methodology. IEEE/ACM Transactions On Networking 12, 6 (2004), 963–977.
[10] Taoufik En Najjary and Guillaume Urvoy Keller. 2008. Passive Capacity Estimation: Comparison of Existing Tools. In SPECTS. Edinburgh, United Kingdom.
[11] Google. 2015. QUIC: A UDP-Based Secure and Reliable Transport for HTTP/2. (2015). https://tools.ietf.org/html/draft-tsvwg-quic-protocol-00.
[12] Hola. 2008. Free VPN, Secure Browsing, Unrestricted Access. (2008). http://hola.org/.
[13] httpbin(1). 2017. HTTP Request and Response Service. (2017). https://httpbin.org/.
[14] httping. 2017. Measure the latency of a webserver and network. (2017). https://www.vanheusden.com/httping/.
[15] Iperf. 2003. The ultimate speed test tool for TCP, UDP and SCTP. (2003). https://iperf.fr/.
[16] Van Jacobson, Diana K Smetters, James D Thornton, Michael F Plass, Nicholas H Briggs, and Rebecca L Braynard. 2009. Networking named content. In Proc. ACM CoNEXT. Rome, Italy.
[17] K. Lai and M. Baker. 2001. Nettimer: A Tool for Measuring Bottleneck Link Bandwidth. In Proc. USENIX. Boston, MA, USA.
[18] M. Leech, M. Ganis, Y. Lee, R. Kuris, D. Koblas, and L. Jones. 1996. SOCKS Protocol Version 5. RFC 1928 (Proposed Standard). (March 1996). http://www.ietf.org/rfc/rfc1928.txt
[19] LinuxContainers.org. 2017. The umbrella project behind LXC, LXD and LXCFS. (2017). https://linuxcontainers.org.
[20] Luminati. 2017. The largest proxy network. (2017). https://luminati.io/.
[21] MONROE. 2017. Measuring Mobile Broadband Networks in Europe. (2017). https://www.monroe-project.eu/.
[22] Matthew K Mukerjee, David Naylor, Junchen Jiang, Dongsu Han, Srinivasan Seshan, and Hui Zhang. 2015. Practical, real-time centralized control for cdn-based live video delivery. ACM SIGCOMM Computer Communication Review 45, 4 (2015), 311–324.
[23] David Naylor, Kyle Schomp, Matteo Varvello, Ilias Leontiadis, Jeremy Blackburn, Diego R López, Konstantina Papagiannaki, Pablo Rodriguez Rodriguez, and Peter Steenkiste. 2015. Multi-context TLS (mcTLS): Enabling secure in-network functionality in TLS. In Proc. ACM SIGCOMM, Vol. 45. ACM, 199–212.
[24] Netfilter. 1999. The software of the packet filtering framework inside the Linux 2.4.x and later kernel series. (1999). https://www.netfilter.org/.
[25] Diego Perino, Claudio Soriente, and Matteo Varvello. 2017. ProxyTorrent: Untangling the Free HTTP(S) Proxy Ecosystem. CoRR abs/1612.06126 (2017). arXiv:1612.06126 http://arxiv.org/abs/1612.06126
[26] Diego Perino, Matteo Varvello, and Claudio Soriente. 2017. CIAO: Automated free proxies discovery/usage. (2017). https://goo.gl/NgJmLE.
[27] Planetlab. 2017. An open platform for developing, deploying, and accessing planetary-scale services. (2017). https://www.planet-lab.org/.
[28] Planetlab. 2017. Project. (2017). https://www.planet-lab.org/db/pub/slices.php.
[29] Planetlab Europe. 2017. The European arm of the global PlanetLab system. (2017). https://www.fed4fire.eu/planetlab-europe/.
[30] PLCA. 2017. PlanetLab Central API Documentation. (2017). https://www.planet-lab.org/doc/plc_api.
[31] PlMan. 2017. PlanetLab experiment manager. (2017). http://research.cs.washington.edu/networking/cplane/.
[32] Polipo. 2015. A caching web proxy. (2015). https://www.irif.fr/~jch//software/polipo/.
[33] proxychain. 2017. TCP and DNS through proxy server. (2017). http://proxychains.sourceforge.net/.
[34] PUB-DNS. 2017. Public DNS Server List. (2017). https://public-dns.info//.
[35] Redsocks. 2017. Transparent socks redirector. (2017). http://darkk.net.ru/redsocks/.
[36] Charles Reis, Steven D. Gribble, Tadayoshi Kohno, and Nicholas C. Weaver. 2008. Detecting In-Flight Page Changes with Web Tripwires. In NSDI. 31–44.
[37] RIPE Atlas. 2017. The largest Internet measurement network ever made. (2017). https://atlas.ripe.net/.
[38] Will Scott, Ravi Bhoraskar, and Arvind Krishnamurthy. 2015. Understanding Open Proxies in the Wild. In Chaos Communication Camp.
[39] N. Spring, D. Wetherall, and T. Anderson. 2003. Scriptroute: A public Internet measurement facility. In USITS. Berkeley, CA, USA.
[40] Squid. 2017. Optimising Web Delivery. (2017). http://www.squid-cache.org/.
[41] Gareth Tyson, Shan Huang, Félix Cuadrado, Ignacio Castro, Vasile Claudiu Perta, Arjuna Sathiaseelan, and Steve Uhlig. 2017. Exploring HTTP Header Manipulation In-The-Wild. In Proc. of WWW. 451–458.
[42] Unikernels. 2017. Rethinking Cloud Infrastructure. (2017). http://unikernel.org/.
[43] Wen-li ZHOU and Xiao-fei WU. 2006. Survey of P2P technologies. Computer Engineering and Design 1 (2006), 022.