Geocompare: a Comparison of Public and Commercial Geolocation Databases
CAIDA Tech Report, May 16, 2011

Bradley Huffaker, Marina Fomenkov, kc claffy
fbradley,marina,[email protected] ∗
CAIDA, University of California, San Diego

∗ Support for this work is provided by DHS N66001-08-C-2029 and NSF 05-51542.

ABSTRACT

We attempt a systematic quantitative comparison of currently available geolocation service providers. We add depth to previous contributions by analyzing inconsistencies across databases for different geographic (RIR) regions and organization (Autonomous Systems) types. We compare results on a country granularity, using a methodology that compares each database against the majority vote across all databases with answers for a given IP address. On a finer granularity than country, rigorous formal comparison gets trickier. Unlike the discrete country labels, coordinates can have nominally different values yet still represent approximately the same location. We compare the databases at a lat-long granularity using an 80 km threshold for two lat-long coordinates to be in the same geographic region. We describe our process for selecting this threshold, and our centroid-based algorithm for comparing database lat-long results against a majority of responses from the set of databases we evaluated. While not a foolproof methodology (the databases could all be converging to the same wrong answers over time), it assumes that database providers successfully work toward improving the accuracy of their databases over time. In the absence of substantial ground truth, our method offers a systematic way to study the geolocation databases to reveal insights, summarized at the end of the paper. We intend to re-run the comparison experiment using additional databases later in 2011; we welcome constructive feedback on the methodology so we can further improve our next experiment.

1. INTRODUCTION

Governments, researchers, and commercial entities share an interest in mapping Internet resources to physical locations, a process termed geolocation. For example, governments use this data to prepare and plan for adverse events as well as to tax and regulate. Academics use this data to more accurately capture the geographic deployment and utilization of Internet resources. Commercial interests use this data to provide better localized services, target pricing or ads, enforce digital rights management restrictions or data privacy requirements, and assign incoming requests for content to the nearest data center storing it.

One can broadly categorize geolocation techniques based on the main source of knowledge driving the geolocation: delay, database, or topology. Delay-based methods typically gather delay data from a collection of known geographic landmarks and use that knowledge to triangulate a target IP address [29, 34, 31, 37, 40, 30]. Database-driven methods collect and aggregate static mapping information from public and private databases [36, 32]. Topology-driven methods infer geolocations by assuming topologically close addresses are physically near each other [33, 34].

These techniques involve non-trivial hurdles that make building one's own service prohibitively difficult for the average user. Private databases, such as a web site's listing of its users' IP addresses and contact information, are generally inaccessible. Public databases, such as the WHOIS [27] and DNS [35], are difficult to parse, hard to keep current, and also may have access controls. Topology and delay data require extensive measurement infrastructure, and are challenging to collect, process, and interpret. Therefore, the majority of users rely upon third-party geolocation services.

Most geolocation databases, both publicly [14, 16, 19, 25] and commercially available [6, 1, 11, 7, 8, 17, 18, 9, 10], map blocks of consecutive IP addresses to geographic locations, usually at the country level. Some providers support city mappings and/or resolve to latitude and longitude coordinates. Most services offer little if any documentation on which techniques they employ in the creation of their geolocation databases, thus complicating systematic attempts at evaluation and comparison. Previous comparison attempts [38, 39] have also noted that the lack of a large and diverse set of ground truth further challenges rigorous comparison.

In this study we attempt a systematic quantitative comparison of currently available geolocation service providers. We compare only the geographic components of the participating databases; most providers offer additional features that we did not evaluate. We add depth to previous contributions by analyzing inconsistencies across databases for different geographic regions and organization (Autonomous Systems) types. We use Regional Internet Registries (RIR) delegation data to classify addresses by the RIR to which the address block was first delegated. We infer the organization type corresponding to an IP address based on the characteristics of its origin AS in the AS graph, building on the approach in [28]. Section 2 describes previous comparative geolocation studies. Section 3 discusses the datasets we used in our work. We present data analysis and results in Section 4 and conclude in Section 5.

2. RELATED WORK

Despite the unavoidable obstacles to systematic comparison, several research groups have made efforts to evaluate geolocation services. Siwpersad et al. [39] examined the geographic resolution of MaxMind GeoLite [19] and Hexasoft [8]. They calculated the distance between locations as determined by each service and found that for 50% of addresses the difference was smaller than 100 km. They also compared the database locations with inferences from active measurement data collected by PlanetLab nodes probing 39 landmarks. The authors inferred geographic locations from the collected data using a constraint-based approach with a series of confidence regions. They found that for 90% of probed IP addresses, their location fell outside the confidence regions estimated by active measurements.

Shavitt et al. [38] examined HostIP [14], IP2Location [8], IPInfoDB [16], MaxMind GeoIP [18], and Software77 [25]. They evaluated these databases through the lens of the DIMES Project's [26] Points of Presence (PoP) level map. The authors attributed IP addresses to PoPs using their own interface-graph-based inference algorithm [38], and assumed that IP addresses in the same PoP should map to the same geographical location. They found that MaxMind GeoIP, GeoBytes, and Digital Envoy placed between 74% and 82% of a PoP's IP addresses within 1 km of each other while for HostIP the percentage was slightly less (57%). To compare across databases, Shavitt et al. defined PoP coordinates as the median latitude and longitude of all the coordinate values found in all databases for all IP addresses at the given PoP. They considered two levels of proximity: a "city" range (40 km), and a "region" range (500 km). For IPligence, MaxMind GeoIP, and IP2Location, the probability of identifying the location of an IP address within the "city" range of its PoP ranged between 62% and 73%, while for GeoIP, HostIP, and Digital Envoy it was between 33% and 47%. MaxMind GeoIP placed over 80% of IP addresses, while GeoBytes, HostIP, and Digital Envoy placed 48% to almost 60% of IP addresses into the PoP "region" range.

about. GeoBytes [7] and IP2Location [8] did not respond to our inquiries. Quova [9] and Akamai [6] were prohibitively expensive, requesting more than $10,000 a year for their services. We used free services from Software77 [25], MaxMind GeoLiteCity [19], HostIP [14], and IPInfoDB [16] and purchased access to Cyscape [1]. CAIDA has an ongoing research agreement with Digital Envoy [11] as our primary geolocation data provider.

In this paper we refer to the database with the name of the organization providing it, e.g., NetAcuity is labeled Digital Envoy. Table 1 lists free and commercial databases evaluated in this study and their basic statistics: the date of the database snapshot; the cost of service; the percentage of the IPv4 address space covered (relative to the range delegated in the RIR delegation files, described in Section 3.1.1); the number of unique address blocks; and the numbers of countries, cities, and latitude/longitude values found in each database.

RIR delegation files, though never intended as a geolocation service, are included in Table 1 as a baseline. Software77's database is essentially a processed version of the RIR delegation files. Of the databases examined, HostIP covers the smallest fraction of the IPv4 address space because this free, open database is populated by volunteers submitting their geographic locations. IPligence is the least expensive commercial database among those of our study.

We did not receive a full dump of Cyscape's geolocation database, so we sampled its database to infer the full table using the blocks from the largest two other databases: IPligence [17], and MaxMind GeoIP [18]. If geographic answers for two contiguous blocks were identical, we merged these blocks into a single larger block; if the answers for the first and the last address of a particular block differed, then we subdivided the block further until every address in a block had the same geolocation mapping.

IPligence's database had a larger number of unique cities than unique lat-long coordinate pairs, due to some typographical variance in city names. For example, in addition to Upplands Vasby we also found Upplandsvasby and Upplands-Vasby. In a few cases many suburb names shared lat-long coordinates, e.g., Kungsholmen, Stockhom, Bandhagen, Johannehov, Johanneshov, Stochholm, and Stockholm are all part of Stockholm City and share 59.33, 18.05.

MaxMind GeoLite is a publicly available, less accurate version of MaxMind GeoIP. As described on their web page, IPInfoDB derives its results largely from the MaxMind GeoLite dataset [15], and the two databases performed indistinguishably for most of our metrics.
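The merge-and-subdivide procedure used to infer Cyscape's table from sampled answers can be illustrated with a short Python sketch. This is not the code used in the study: the tuple layout, the function names `subdivide` and `merge_contiguous`, the halving strategy, and the toy lookup are all assumptions made for illustration. Addresses are represented as plain integers.

```python
from typing import Callable, List, Tuple

# A block is (first_address, last_address, answer); addresses are plain
# integers here, and this tuple layout is an assumption for illustration.
Block = Tuple[int, int, str]


def subdivide(first: int, last: int, lookup: Callable[[int], str]) -> List[Block]:
    """Split [first, last] until both endpoints of each piece agree.

    Only the endpoints are probed, so a differing answer strictly inside
    a piece would go unnoticed; that is a limitation of this sketch.
    """
    answer = lookup(first)
    if first == last or answer == lookup(last):
        return [(first, last, answer)]
    mid = (first + last) // 2  # halving is an assumed splitting strategy
    return subdivide(first, mid, lookup) + subdivide(mid + 1, last, lookup)


def merge_contiguous(blocks: List[Block]) -> List[Block]:
    """Merge adjacent blocks that carry the same answer."""
    merged: List[Block] = []
    for first, last, answer in sorted(blocks):
        if merged and merged[-1][1] + 1 == first and merged[-1][2] == answer:
            merged[-1] = (merged[-1][0], last, answer)
        else:
            merged.append((first, last, answer))
    return merged


def toy_lookup(address: int) -> str:
    # Hypothetical stand-in for querying the sampled database.
    return "SE" if address < 5 else "US"


pieces = subdivide(0, 7, toy_lookup)
table = merge_contiguous(pieces)
```

In this toy run, the eight-address block is split wherever its endpoints disagree, and the resulting pieces are then merged back into one block per contiguous run of identical answers.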