Towards Geolocation of Millions of IP Addresses∗ USC/ISI Technical Report ISI-TR-680, May 2012

Towards Geolocation of Millions of IP Addresses∗ USC/ISI Technical Report ISI-TR-680, May 2012 Zi Hu John Heidemann Yuri Pradkin USC/Information Sciences Institute fzihu, johnh, [email protected] ABSTRACT geolocation such as Skyhook [14]. We focus on measurement- Previous measurement-based IP geolocation algorithms have based systems here, since they provide better coverage focused on accuracy, studying a few targets with increas- and accuracy than database approaches and are inde- ingly sophisticated algorithms taking measurements from tens pendent of the target. Measurement-based algorithms of vantage points (VPs). In this paper, we study how to scale all depend on vantage points (VPs) to actively probe up existing measurement-based geolocation algorithms like the geolocation targets. We study both Geoping- and Shortest Ping and CBG to cover the whole Internet. We CBG-like algorithms. show that with many vantage points, VP proximity to the Our goal is not to invent a new geolocation algorithm, target is the most important factor affecting accuracy. This but to understand how existing algorithms can scale up observation suggests our new algorithm that selects the best to many millions of targets and the entire IPv4 address few VPs for each target from many candidates. This ap- space. We encounter several problems in scaling-up ex- proach addresses the main bottleneck to geolocation scala- isting algorithms to the whole Internet. First, all exist- bility: minimizing traffic into each target (and also out of ing work uses a relatively small set of VPs, typically tens each VP) while maintaining accuracy. Using this approach of VPs [5, 8]. Second, existing work is tested on a rela- we have currently geolocated about 24% of the allocated, tively small set of targets, typically hundreds of targets. unicast, IPv4 address-space (about 55% of the addresses in Typical targets are selected with known ground truth the Internet that can be directly geolocated). to evaluate algorithm accuracy. With a dozens of VPs and hundreds of targets, current algorithms each have all VPs send many probes to each target. While this 1. INTRODUCTION product is reasonable when both are small, with hun- IP geolocation is the process of finding geographic lo- dreds of VPs and a billion targets, the product is large. cations of Internet Protocol addresses. IP geolocation is The result is a huge amount of traffic out of each VP, widely used today. For example, companies use IP ge- burdensome traffic into each target, where hundreds of olocation to limit the content to certain countries (for probes arrive to each IP address in a target block, and example, television and movies that are often licensed a heavy load to bring this data together. differently by the viewer's country) and to customize ad- To scale geolocation to the entire Internet, our first vertising based on location. Internet researchers use IP contribution is to study what factors affect geolocation geolocation to relate network phenomena to countries, scalability and accuracy for the measurement-based ge- such as studying the cultural impacts on social network- olocation protocols. We show that traffic, both out- ing, or rates of computer crime by country and policy. bound from VPs and inbound to targets, is a signifi- Moreover, IP geolocation is essential in law enforcement cant limitation to full-Internet geolocation, and show to identify the appropriate jurisdiction to handle en- that fewer VPs can make inbound traffic manageable. forcement of computer crime statues. We then show that most VPs provide little benefit to Several research and commercial geolocation systems geolocation, suggesting that one can select only a few exist, exploring many different approaches. They can VPs to geolocate each IP address, getting reasonable ac- be roughly divided into three categories: systems driven curacy while greatly reducing traffic. We develop three by databases (such as NetGeo [9] and GeoCluster [12]), conjectures factors that affect accuracy and show that measurement based geolocation such as Geoping [12], good accuracy with a few VPs is possible (Section 4.1). Shortest Ping [8], CBG [5], TBG [8], and target-assisted Our second contribution is to define new algorithms to choose the right few VPs (Section 4.2). Our idea ∗This work partially supported by the United States Depart- ment of Homeland Security contract number D08PC75599 is to select the closest VPs to the target, since closer All conclusions of this work are those of the authors and do VPs provide stronger constraints on location. We show not necessarily reflect the views of DHS. 1 9 Unfortunately, it is intractable to probe billions of 9 9 addresses from hundreds of VPs. Assume we have 500 / 9 / VPs and each VP probes every IP address 10 times to 9 / 9 get the minimum round-trip time (RTT). The number 9 9 of all allocated, unicast IPv4 addresses is about 3:7 bil- / 9 9 lion. Simple probing of all addresses 10 times by all VPs 9 9 therefore would generate 1:8 × 1013 records, too many to process centrally a large amount of traffic. Cover- Figure 1: Vantage points (VP), targets and landmarks ing this number of targets is difficult because of the the in Geoping (left), with RTT vectors for Geoping (right). amount and diversity of incoming traffic at the targets, and the amount of traffic required from the VPs, com- that VP selection using trial measurements to each /24 binined with the need to probe relatively quickly so that address block works well (Section 4.2). Our experimen- observations can be combined. We expore each of these tal results show that representatives can identify close constraints below. VPs and provide accuracy almost as good as many VPs. The primary challenge is incoming traffic to the tar- For Shortest Ping, the median error is the same with 10 gets, partially because of its size and moreso because it close VPs compared to all 400 VPs, and for CBG me- can be misinterpreted. The amount of traffic for geolo- dian error is only 11% worse. cation is not huge, but can be noticed. Ten probes from 500 VPs is only 320 kB of traffic, equal to about a 10 2. PROBLEM STATEMENT seconds of a Skype video call (at 250 kb/s). While not a huge amount of traffic by itself, the network admin- Our goal is to geolocate every allocated, unicast IPv4 istrator of a /24 network (or her users) can easily be address, with similar accuracy as basic Shortest Ping alarmed at this traffic rate as it covers all 256 addresses and CBG. While geolocation algorithms are well known, over much of an hour. Even a few percent of very vig- the main constraint in scaling them up to cover the en- ilant network operators can result in abuse complaints tire Internet is probing traffic. We next review geoloca- due to concerns of denial-of-service attacks; we must tion elements and these constraints. minimize complaints in our use of a shared measure- 2.1 Geolocation terminology ment infrastructure. The challenge of outgoing traffic at the VPs is that it Measurement-based geolocation systems send probes requires sustained, high-rate traffic. While sustaining from vantage points to geolocate targets in the Internet. 1 Gb/s of traffic for 5 hours is not impossible, geoloca- Some systems also probe landmarks at known reference tion requires geographically distributed VPs. Current location. Figure 1 shows those entities: V1, V2 and V3 public measurement infrastructure, such as PlanetLab, are VPs, T is the target, L1, L2 are landmarks. caps sustained outgoing traffic to less than 10 Mb/s, Targets are IP addresses on the Internet that we wish thus pushing measurement time to more than 500 hours, to geolocate. In this paper we assume targets have fixed or much longer when the nodes are shared. A simple so- physical locations (for example, if associated with a mo- lution to traffic problems is to spread probes out in time. bile phone), although with mobile IP and wireless net- Slowing probing can reduce traffic arbitrarily, poten- works they can have multiple locations. tially to the same level as \background radiation" [16]. Vantage points (VPs) are hosts in the Internet, al- Potentially probing traffic could be spread out in time ways with known locations. They send messages (probes) to reduce the negative effects on VPs and targets. How- to targets to determine the targets locations. ever, measurements must be coherent, that is consistent Landmarks are IP addresses with known locations in the face of path or target chagnes. Probes cannot used by some IP-geolocation algorithms such as Geop- easily be combined if they cross changes in routing or ing. Unlike VPs, landmarks do not actively send mes- target movement. According to Paxon, most paths have sages. Since VPs are usually in known locations, they routes that are stable for days [13]. In our experiment, often also serve as landmarks. we assume that most paths are stable for more than Address blocks, (or blocks), are groups of consecutive two days (48 hours), so we keep measurements of each addresses. Block size is usually identified as the num- address to 48 hours or less. ber of leading bits in common, so all addresses in a /24 block have the same leading 24 bits. Because of IP address allocation policies, blocks are usually managed by 2.3 Our Approach the same organization and often have similar character- To make our goal tractable, we must reduce the work- istics [1], presumably including physical location [12]. load by using fewer targets or fewer VPs. Prior work such as GeoCluster selects a single target to represent 2.2 Problem Constraints an entire cluster of IP addresses [12], which can signifi- 2 cantly reduce the number of targets.

Towards Geolocation of Millions of IP Addresses∗ USC/ISI Technical Report ISI-TR-680, May 2012

Using Whois Based Geolocation and Google Maps API for Support Cybercrime Investigations

The Case of Shared Public Ips at Hotspots

IP Geolocation Through Reverse DNS

Towards IP Geolocation with Intermediate Routers Based on Topology Discovery Zhihao Wang1,2,Hongli1,2*,Qiangli3,Weili4, Hongsong Zhu1,2 and Limin Sun1,2

Sumarul Revistelor Străine Abonate În Anul 2008

Geographically Restricted Streaming Content and Evasion of Geolocation

Identification of IP Addresses Using Fraudulent Geolocation Data

Secure Client and Server Geolocation Over the Internet

Secure Client and Server Geolocation Over the Internet (PDF)

Smartphone Data Transfer Protection According to Jurisdiction Regulations

The Future of Participatory Approaches Using Geographic Information: Developing a Research Agenda for the 21St Century Steve Carver

How to Catch When Proxies Lie Verifying the Physical Locations of Network Proxies with Active Geolocation