Distributed Reverse DNS Geolocation

Ovidiu Dan∗                    Vaibhav Parikh               Brian D. Davison
Lehigh University              Microsoft Bing               Lehigh University
Bethlehem, PA, USA, 18015      Redmond, WA, USA, 98052      Bethlehem, PA, USA, 18015
[email protected]              [email protected]            [email protected]

∗Author is also affiliated with Microsoft Bing.

Abstract—IP geolocation databases map IP addresses to their geographical locations. These databases are used in a variety of online services to serve local content to users. Here we present methods for extracting locations from the reverse DNS hostnames assigned to IP addresses. We summarize a machine learning based approach which, given a hostname, aims to extract and rank potential location candidates, which can then potentially be fused with other geolocation signals. We show that this approach significantly outperforms a state-of-the-art academic baseline, and that it is competitive with and complementary to commercial baselines. Since extracting locations from more than a billion reverse DNS hostnames at once poses a significant computational challenge, we develop a distributed version of our algorithm. We perform experiments on a cluster of 2,000 machines to demonstrate that our distributed implementation can scale. We show that compared to the single machine version, our distributed approach can achieve a speedup of more than 150X.

Index Terms—IP geolocation; distributed algorithms; MapReduce; geographic targeting; geographic personalization; reverse DNS; hostname geolocation

I. INTRODUCTION

IP geolocation databases are used by online services to determine the geographical location of users based solely on their IP address. They map IP ranges, typically 256 IP addresses in size, to locations at city-level granularity. Companies such as MaxMind, Neustar IP Intelligence, and IP2Location provide commercial geolocation databases that are considered state of the art. They combine multiple data sources, such as IP WHOIS information, network delay measurements, triangulation through ping and traceroute, network topology analysis from BGP routing table dumps, reverse DNS hostnames, and others [1]. However, the exact methods they use to compile these databases are proprietary. Prior research has shown that these databases can sometimes be inaccurate or incomplete [2]. Therefore, it is vital for these types of services to use complete and accurate geolocation databases.

In this work we use reverse DNS hostnames assigned to IP addresses as a source of geolocation information. Figure 1 shows the type of location information that can be derived from a hostname. In this example we can determine that the IP address behind this reverse DNS hostname may be located in Yokohama, a city in Japan. Our task is to determine the location of IP addresses based on their reverse DNS hostnames, at city-level granularity.

Fig. 1: Example information extracted from a reverse DNS hostname. For ip3801yokohama.kanagawa.ocn.ne.jp, the figure highlights an exact city name match (yokohama), an Admin 1 region match (kanagawa), and the Japan top-level domain (.jp).

This task poses several challenges. First, naming schemes used by ISPs are heterogeneous and often contain abbreviations or ambiguous names. If a hostname contains the substring mant, does that refer to Manteo, North Carolina, or to Mantorville, Minnesota? Or if it contains the term snbr, does that mean the location is San Bruno, CA, or San Bernardino, CA, or San Bartolomeo, Italy, or any of the other tens of locations around the world that could match this abbreviation? These types of ambiguous names are often difficult for humans as well. Second, reverse DNS hostnames sometimes contain strings that are not directly derived from the names of locations. For example, Verizon uses nycmny to refer to locations in New York, and GTE uses miasfl to refer to Miami, Florida. Third, extracting locations from reverse DNS hostnames requires a scalable solution. There are more than 1.24 billion valid reverse DNS hostnames for IPv4 addresses alone. For each of these hostnames we have to evaluate and disambiguate potentially tens of candidate locations. Our proposed solution generates on average 48 potential location candidates per hostname, which yields close to 60 billion classifier decisions in total.

To solve these problems, we propose a distributed algorithm for extracting locations from reverse DNS hostnames. More specifically, our contributions are:

1) We present a distributed approach to reverse DNS geolocation. We build on our ongoing single-node reverse DNS geolocation work [3] and present a scalable distributed version. Our approach efficiently distributes computation steps among a cluster of machines.

2) We evaluate both versions of the algorithm using the largest ground truth set reported in the geolocation literature. First, we show that our method significantly outperforms a state-of-the-art academic baseline, and that it is complementary to and competitive with two commercial baselines. Second, we perform large-scale experiments on a cluster of 2,000 machines to test the distributed version of our algorithm.

II. RELATED WORK

The majority of IP geolocation research relies on active network delay measurements to locate addresses. These approaches issue pings or traceroutes from multiple geographically distributed landmark servers to a target IP, then triangulate the location [4], [5], [6], [7]. Network delay and topology methods have significant limitations. First, they have scalability issues, as they require access to landmark servers distributed around the world and each target IP needs separate measurements. Second, not all networks allow ping and traceroute. Third, routes on the Internet do not necessarily map to geographic distances. These problems often lead to lackluster results, with error distances on the order of hundreds of kilometers. Some of the earlier research is also plagued by extremely small ground truth datasets, often focusing on a handful of IP addresses in a few US universities [4], [5]. Our work addresses several of these limitations. First, using reverse DNS hostnames for geolocation does not require any network delay measurements. Second, our ground truth dataset is several orders of magnitude larger than the ones used in previous work, and it spans the entire planet. Third, our approach can be performed offline.

In this work we propose extracting IP locations from their reverse DNS hostnames. Undns is the most well-known and widely used reverse DNS geolocation approach [8]. It consists of manual rules which are expressed as regular expressions at the domain level. The disadvantage of this approach is that each domain requires manually generated and potentially error-prone rules. Our approach is more scalable, since it does not require human input. It also handles unique situations better, since it considers the terms of each hostname individually, without requiring domain-specific training.

DRoP, another state-of-the-art reverse DNS based approach, aims to geolocate hostnames using rules generated automatically by finding patterns across all the hostname terms of a domain [9]. For example, it may find that for the domain cogentco.com, the second term from the right often contains airport codes. These rules are then validated using network delay information. DRoP places 99% of IPs in 5 test domains within 40 km of their actual location. However, it uses network delay measurements, and its method of splitting hostnames is rudimentary.

Finally, DDec [10] combines undns and DRoP rules by giving precedence to undns and using DRoP as a fallback.

III. APPROACH

We present two versions of our approach. First, we summarize our single machine version, where given a hostname we output a ranked list of potential location candidates [3]. Second, we propose a distributed version where the computation steps are performed in parallel on a cluster of machines. This allows our algorithm to efficiently cover the entire IP space.

A. Datasets

Our ground truth set contains 67 million IP addresses with known geographic location. We compiled the ground truth set in March 2018 by randomly sampling the query logs of a large-scale commercial search engine.

In addition to the ground truth set, we use multiple publicly available datasets for feature generation, some of which we summarize here. GeoNames is a free database with geographical information [11]. The CLLI dataset contains codes used by the North American telecommunications industry to designate names of locations and functions of telecommunications equipment. UN/LOCODE is a worldwide geographic coding scheme developed and maintained by the UN. It assigns codes to locations used in trade and transport, such as rail yards, sea ports, and airports. Rapid7 Reverse DNS consists of reverse DNS hostnames across the entire IPv4 address space.
B. Single node classification

We begin by splitting hostnames into their constituent terms. Next, we iterate over each extracted term to find potential location candidates, and we also compute the primary and secondary features discussed below. We define the list of location candidates as the union of all locations which match any of the primary features. For example, since the term hrbg matches the location Harrisburg, PA using our Abbreviations feature, we select this city among the initial location candidates.

For each given hostname, we generate primary feature categories, which are based on city names, abbreviations, concatenations of city names and administrative regions, CLLI codes, UN/LOCODE codes, etc. Secondary feature categories are then generated in the context of a hostname and location candidate pair, and their aim is to support the current candidate. For example, if a hostname contains the term allentown, the candidates found in the first phase include Allentown, PA and Allentown, WA. Secondary features are then computed to help disambiguate between these locations. If the hostname also contains the term wa, then during the evaluation of the Allentown, WA candidate the secondary feature Admin1 matches, which increases the confidence of this candidate. For a detailed list of features, please consult our paper draft, which describes the single-node approach in detail [3].

We train a binary classifier where the input is a set of features describing a hostname and location candidate pair, and the output indicates whether the hostname is likely to be located in the candidate location. For this paper we have chosen to train the classifier using C4.5 decision trees. While this classifier might not yield the best overall results, it is extremely efficient, which suits our goal of parsing all 1.24 billion reverse DNS hostnames in the Rapid7 dataset.
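To make the single-node flow concrete, the C# sketch below walks a hostname's terms through a toy version of candidate generation and disambiguation. The Abbreviations dictionary, the feature names, and the keep/reject rule are hypothetical stand-ins for our full primary and secondary feature set and the trained C4.5 model; the sketch only illustrates the path from terms to scored candidates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SingleNodeSketch
{
    // Hypothetical primary feature: maps abbreviation terms to candidate cities.
    static readonly Dictionary<string, string[]> Abbreviations = new Dictionary<string, string[]>
    {
        { "allentown", new[] { "Allentown,PA,US", "Allentown,WA,US" } },
        { "hrbg",      new[] { "Harrisburg,PA,US" } },
    };

    static void Main()
    {
        // Terms produced by the hostname splitter (public suffix removed).
        string[] terms = { "gi-0-1", "allentown", "pa", "example" };

        // Primary features propose candidates; the candidate list is their union.
        var candidates = terms.Where(t => Abbreviations.ContainsKey(t))
                              .SelectMany(t => Abbreviations[t])
                              .Distinct();

        foreach (string candidate in candidates)
        {
            // Secondary feature: does another term match the candidate's
            // administrative region (Admin1)? E.g. "pa" supports Allentown, PA.
            string admin1 = candidate.Split(',')[1].ToLowerInvariant();
            bool admin1Match = terms.Contains(admin1);

            // Stand-in for the trained C4.5 classifier: accept a candidate
            // only when a secondary feature supports it.
            Console.WriteLine($"{candidate}: Admin1Match={admin1Match} -> " +
                              (admin1Match ? "keep" : "reject"));
        }
    }
}
```

Note how the Admin1 secondary feature is what separates Allentown, PA from Allentown, WA, mirroring the disambiguation example above.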
C. Distributed classification

Not all IP addresses have a corresponding reverse DNS hostname, and not all valid reverse DNS hostnames contain location information. The Rapid7 Reverse DNS dataset contains 1.24 billion valid hostnames, which covers 33% of the IPv4 space. While this accounts for only a third of IP addresses, it is still a very large number of hostnames. Furthermore, each hostname can generate tens of location candidates, and each of those candidates needs to be classified and ranked to find the most likely location. As we will show in Section IV, this results in run times on the order of tens of days to cover all reverse DNS hostnames on a single machine. We therefore propose a distributed version of our approach, which we use for the testing phase. We still perform the training phase on a single machine, since it requires a relatively small amount of training data.

We implemented our algorithm using the SCOPE distributed programming language [12]. A newer version called U-SQL is available to researchers using the Azure cloud. It combines declarative SQL-like statements with modules written in C#. The language is essentially a superset of the MapReduce paradigm, and it has primitives that expand on the concept of Map and Reduce stages. For the evaluation section we used a version of SCOPE which allows precise allocation of resources. For storage we used the COSMOS distributed filesystem [12]. The public equivalent is called Azure Data Lake. The filesystem allows storing extremely large files, on the order of tens of terabytes or more. During processing, SCOPE reads and writes data to the COSMOS file system. In this work we treat the files as tables, where each data row corresponds to a row in a table.

Fig. 2: Feature generation, feature aggregation, and classification, each run distributed on a cluster of machines. For an example hostname gi-0-6-0-0-sur01.wallingford.ct.bigisp.us, the pipeline (1) splits it into hostname parts, (2) distributes the parts to the feature generator clusters, (3) aggregates the features for each city candidate, (4) fills in missing feature values with defaults, and (5) runs distributed classification before ranking the remaining candidates.

Figure 2 describes the steps taken by the distributed algorithm. The first step is to split each hostname into its constituent terms. We achieve this using a PROCESSOR, which combines the concepts of Map and Reduce into a single statement. This primitive partitions data into multiple smaller batches. The batches are each processed on the available machines, until all pending batches are depleted. In the final stage the output from all machines is combined into a single output file, which is equivalent to a reduce phase that in this case simply aggregates and concatenates the outputs from all machines. Each batch is processed by a C# module which receives input rows in sequence. The module processes each input row and can emit zero, one, or more output rows. We packaged the hostname splitter into such a module, where the input is a hostname and the output is a data structure which contains the list of subdomain terms, as well as the public suffix and the top-level domain. If the input row does not contain a valid hostname, we return zero results. If however the input is valid, we return one row containing the original input, with an additional new column holding the serialized hostname splitter output.
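The sketch below mimics this first step locally, assuming a simplified Row shape and a hard-coded stand-in for the public suffix list; SCOPE's actual processor interface and our full splitter are more involved. It preserves the key contract: each input row yields zero output rows when invalid, or one row carrying the serialized splitter output.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class SplitterSketch
{
    // Simplified row: the original hostname plus the serialized splitter output.
    record Row(string Hostname, string SplitterOutput);

    // Hypothetical stand-in for a real public suffix list lookup.
    static readonly string[] PublicSuffixes = { "ne.jp", "com", "net" };

    // PROCESSOR-style module: consumes rows in sequence, emits zero or one
    // output rows per input row.
    static IEnumerable<Row> Process(IEnumerable<string> hostnames)
    {
        foreach (string hostname in hostnames)
        {
            if (string.IsNullOrEmpty(hostname) || !hostname.Contains('.'))
                continue; // invalid input: emit zero rows

            string suffix = PublicSuffixes.FirstOrDefault(s =>
                hostname.EndsWith("." + s, StringComparison.Ordinal));
            if (suffix == null)
                continue; // no recognized public suffix: emit zero rows

            // Drop the public suffix, then treat the last remaining label as
            // the domain; everything before it becomes the subdomain terms.
            string[] labels = hostname[..^(suffix.Length + 1)].Split('.');
            string[] terms = labels[..^1];
            yield return new Row(hostname, string.Join("|", terms));
        }
    }

    static void Main()
    {
        var inputs = new[] { "ip3801yokohama.kanagawa.ocn.ne.jp", "bad_input" };
        foreach (var row in Process(inputs))
            Console.WriteLine($"{row.Hostname} -> {row.SplitterOutput}");
        // Prints: ip3801yokohama.kanagawa.ocn.ne.jp -> ip3801yokohama|kanagawa
    }
}
```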
In the second step we find location candidates and compute both primary and secondary features. Here we again use a PROCESSOR to partition the dataset into smaller chunks, which get distributed to the available machines. For each primary feature and hostname input, we find the location candidates and generate the value of the feature. We then also compute secondary features in the context of each hostname and location candidate pair. To make more efficient use of memory, we generate each primary feature category independently of the others. This allows us to use the entire cluster to compute each primary feature separately, one by one. It also allows using multiple clusters in parallel, each generating a different feature. In the Evaluation section we used the same cluster to compute all features separately and in sequence. For each hostname input, this step outputs one feature value per location candidate. For example, when processing the First Letters feature for the hostname 97-88-57-240.dhcp.roch.mn.charter.com, this module outputs one row for each potential location candidate: one row for the Rochester, Minnesota candidate, another for the Rochester, New York candidate, and so on. In this step, for a given feature, we read the datasets summarized in Section III-A into memory. We perform this pre-initialization step per data batch, before we read any input rows. In the Evaluation section we speculate that this extra step can contribute to the startup overhead. In a single-node implementation we need to read this data once, while in a distributed scenario we read the data multiple times per machine, once for each data batch.

After the partial results for each feature are generated, in the third step we UNION the results from all features together. This operator is similar to the UNION operator in SQL in that it allows concatenating results from multiple "tables", in our case one for each primary feature category.

The fourth step aggregates the individual features currently stored in individual rows back into a combined feature vector for each hostname and location candidate pair. To achieve this we use a REDUCER, which aggregates rows on certain dimensions. In our case we reduce on the combination of hostname and location candidate. For each distinct hostname and candidate pair, the output contains all the aggregated features. The example in Figure 2 shows that in the partial results for City Candidate 1 of a particular hostname, both features A and B matched. After aggregation, these two partial feature vectors are combined into a single feature vector containing both features A and B. This reducer also has the role of filling in missing feature defaults. In Figure 2, City Candidate 1 matches only two feature categories, while our entire feature set is larger. To obtain a complete feature vector we fill in the values for the missing features with defaults. The result is a full feature vector which is ready to be fed to the pre-trained classifier.

Finally, in the last step we run the binary classifier on the feature vectors. We package the classifier into a C# module and apply it to the input feature vectors using the PROCESSOR primitive. In the interest of fairness, in both the single machine version and the distributed version we used the same classifier, pre-trained on a single machine. The location candidates of each hostname are evaluated individually, so the input is a hostname and location candidate pair along with its feature vector, and the output is the binary classification result. In essence, the classifier decides for each hostname which of its candidates are not suitable. We finally rank the remaining candidates by classifier confidence, and we break ties by using the population of cities as a proxy for popularity.
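As a minimal local analogue of steps three through five, the following sketch takes hypothetical per-feature partial rows, reduces them on (hostname, candidate) while filling missing features with defaults, and then applies a stand-in classifier before ranking by confidence with population as the tie-breaker. The feature names, the confidence rule and threshold, and the city populations are illustrative, not our production values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class ReduceAndRankSketch
{
    // One partial result: a single feature value for one location candidate.
    record PartialRow(string Hostname, string Candidate, string Feature, double Value);

    static readonly string[] AllFeatures = { "FeatureA", "FeatureB", "FeatureC" };

    // Illustrative, approximate city populations used to break ties.
    static readonly Dictionary<string, long> Population = new()
    {
        ["Rochester,MN"] = 118_000,
        ["Rochester,NY"] = 205_000,
    };

    static void Main()
    {
        var partials = new[]
        {
            new PartialRow("h1", "Rochester,MN", "FeatureA", 1.0),
            new PartialRow("h1", "Rochester,MN", "FeatureB", 0.8),
            new PartialRow("h1", "Rochester,NY", "FeatureB", 0.8),
        };

        // REDUCER analogue: group on (hostname, candidate), aggregate the
        // per-feature rows, and fill missing features with a default of 0.
        var ranked = partials
            .GroupBy(p => (p.Hostname, p.Candidate))
            .Select(g => new
            {
                g.Key.Candidate,
                Vector = AllFeatures.Select(f =>
                    g.FirstOrDefault(p => p.Feature == f)?.Value ?? 0.0).ToArray(),
            })
            // Stand-in classifier: confidence = mean of the feature vector.
            .Select(c => new { c.Candidate, Confidence = c.Vector.Average() })
            .Where(c => c.Confidence > 0.2) // drop unsuitable candidates
            .OrderByDescending(c => c.Confidence)
            .ThenByDescending(c => Population.GetValueOrDefault(c.Candidate));

        foreach (var c in ranked)
            Console.WriteLine($"{c.Candidate} confidence={c.Confidence:F2}");
    }
}
```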

IV. EVALUATION

We first evaluate our approach against a state-of-the-art academic baseline and two commercial geolocation databases. We show that our method significantly outperforms the academic baseline, and that it is complementary and competitive to commercial databases. We then perform multiple scalability experiments on the distributed version of our algorithm. We compare the run times of the single-node and multi-node approaches. We also study how the run time of the distributed version changes as we increase the number of machines used.

A. Single-node evaluation

Our ground truth dataset contains 67 million IP addresses with known IP location, of which we used 40 million for training and 27 million for testing. We first evaluate our approach RDNS-DT against a state-of-the-art academic baseline called DNS Decoded, or DDec for short [10]. Similar to our method, this baseline receives a hostname as input and attempts to extract its location. DDec is based on two other approaches: undns [8] and DRoP [9].

The authors of DDec have not open sourced their code, and this baseline is only available by querying a web endpoint. Out of politeness, we restricted the number of requests we made to a more manageable size. We selected eight domains that are covered by the baseline. From our testing set we selected all data points that intersected these eight domains. This resulted in a subset of 1.6 million hostnames. We then issued requests to the DDec endpoint for each of these hostnames. For each result, we selected the location from the top DDec rule that matched the hostname.

After running our single-node reverse DNS geolocation method over the same testing subset, we calculated the median error distance and coverage. We define error distance as the distance between an extracted location for a hostname and the real geographical location of the underlying IP address for that hostname. We define coverage as the number of hostnames where a geolocation method made a decision, over the total number of hostnames in the testing subset.
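To make these two definitions concrete, the sketch below computes median error distance and coverage for a toy result set. The haversine great-circle formula is our assumption for the distance computation; the text does not prescribe a specific formula.

```csharp
using System;
using System.Linq;

class MetricsSketch
{
    // Great-circle distance in km between two (lat, lon) points (haversine).
    static double DistanceKm(double lat1, double lon1, double lat2, double lon2)
    {
        double Rad(double d) => d * Math.PI / 180.0;
        double dLat = Rad(lat2 - lat1), dLon = Rad(lon2 - lon1);
        double a = Math.Sin(dLat / 2) * Math.Sin(dLat / 2) +
                   Math.Cos(Rad(lat1)) * Math.Cos(Rad(lat2)) *
                   Math.Sin(dLon / 2) * Math.Sin(dLon / 2);
        return 6371.0 * 2 * Math.Atan2(Math.Sqrt(a), Math.Sqrt(1 - a));
    }

    static void Main()
    {
        // (predicted, actual) pairs; a null prediction means no decision.
        var results = new ((double Lat, double Lon)? Pred, (double Lat, double Lon) Actual)[]
        {
            ((40.60, -75.47), (40.55, -75.52)),  // near Allentown, PA
            (null,            (47.61, -122.33)), // method made no decision
            ((44.02, -92.47), (43.15, -77.60)),  // wrong Rochester
        };

        double[] errors = results.Where(r => r.Pred.HasValue)
            .Select(r => DistanceKm(r.Pred.Value.Lat, r.Pred.Value.Lon,
                                    r.Actual.Lat, r.Actual.Lon))
            .OrderBy(e => e).ToArray();

        // Upper median, kept simple for this illustration.
        double median = errors[errors.Length / 2];
        double coverage = (double)errors.Length / results.Length;
        Console.WriteLine($"median error = {median:F1} km, coverage = {coverage:P0}");
    }
}
```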

TABLE I: Comparison between DDec, a state-of-the-art academic baseline, and RDNS-DT, which is our approach that uses C4.5 decision tree classification

                                 Median Error in km        Coverage
Domain                   #       DDec      RDNS-DT      DDec     RDNS-DT
163data.com.cn         166K    1,517.5       10.6       100%      94.5%
bell.ca                200K    5,875.2        5.8       2.3%      95.2%
brasiltelecom.net.br    32K      808.7       15.8       100%      76.6%
charter.com            580K       60.8       88         78.0%     89.0%
frontiernet.net         67K       36.5       27.7       3.6%      99.2%
nttpc.ne.jp            0.9K       16.2        9.1       16.2%     57.6%
optusnet.com.au        100K      704.4       15.7       100%      98.9%
qwest.net              408K    8,038.7       20.2       4.1%      97.6%
Overall                1.6M      177.9       42.4       49.7%     93.3%

Table I displays the comparison between our method and the academic baseline. A lower median error signifies better results. Overall, our approach significantly outperforms the academic baseline. In the single case where the baseline had a better median error distance, our approach had higher coverage. And whenever the baseline had higher coverage for a domain, our median error distance was significantly better.

Although reverse DNS geolocation on its own cannot match commercial databases, we evaluate our approach to show that it can complement and potentially improve existing databases. We tested our approach against two state-of-the-art commercial databases using our entire test dataset of 27 million hostnames. Figure 3 displays cumulative error distance in kilometers. The X axis represents the maximum distance between the real location and the predicted location. The Y axis shows how many hostnames and their IP addresses fall within the error distance. For instance, the <20 km column in the first graph shows that our method, labeled RDNS-DT, places approximately 68.9% of hostnames in the ground truth set within 20 kilometers of their actual location. The first two graphs in the figure show that on certain domains our approach outperforms, and thus could be used to improve, commercial databases. However, as expected, the third graph shows that overall the commercial databases still outperform our method. Results show that the median error is 11.1, 16.7, and 75.8 kilometers for Provider A, Provider B, and our approach RDNS-DT, respectively.

Fig. 3: Commercial Evaluation Error Distance. Three panels, for ocn (JP), qwest (US), and all testing IPs, plot the percentage of ground truth IPs whose predicted location falls within each distance bucket (<10 km through <60 km) of the real location.

V. DISTRIBUTED EVALUATION

The purpose of this evaluation section is not to determine the accuracy of the algorithm, but to determine how well the distributed version of the algorithm scales. We use a version of SCOPE that can be customized to precisely allocate resources. Although the machines we use are multi-core, we ran the algorithm on a single core on each machine. We made this decision to obtain a fairer comparison to our single-node version, which is also single threaded. When eventually deployed in production, our distributed algorithm will therefore likely run faster than presented here, yielding an even larger difference to the single-node version. The machines are each connected to a 10 Gbps network. We used similar hardware configurations when running both the single-node and distributed versions.

We begin by evaluating how well our approach scales when generating individual features as we vary the number of machines in the cluster. This corresponds to step 2 in Figure 2. Due to space constraints, for this first experiment we only present the results for generating the city name match feature, but we have verified that the algorithm scales similarly for the other features as well. We compute the feature on the entire set of 1.24 billion hostnames multiple times, in each run varying the number of machines. We generated the feature in increments of 100 machines, up to 1,000 machines. We also ran the feature on the full cluster of 2,000 machines.

Fig. 4: Wall clock run time in minutes for the City Name feature when varying the number of machines, from 100 machines (114 minutes) through 1,000 machines (19 minutes), plus the full cluster of 2,000 machines (24 minutes).

Figure 4 shows how well our feature generation step scales when we increase the number of machines. We make the following observations. First, as we increase the number of machines gradually from 100 to 400, we observe a significant reduction in run time. For this feature, the wall clock run time drops from almost two hours to about half an hour. Second, as we continue increasing the number of machines to 500 and beyond, we see diminishing returns. The run time slowly tapers from 32 minutes down to about 19 minutes when we reach 1,000 machines. Third, we observe that, interestingly, when we run the algorithm on 2,000 machines the run time actually increases, to 24 minutes.
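A quick back-of-the-envelope computation using the Figure 4 numbers shows what diminishing returns means here: going from 100 to 1,000 machines is a 10X resource increase but only about a 6X speedup, roughly 60% scaling efficiency.

```csharp
using System;

class ScalingEfficiencySketch
{
    static void Main()
    {
        // Numbers from Figure 4: ~114 min on 100 machines, ~19 min on 1,000.
        double t100 = 114, t1000 = 19;
        double speedup = t100 / t1000;   // observed: ~6x
        double ideal = 1000.0 / 100.0;   // perfect scaling would be 10x
        Console.WriteLine($"efficiency = {speedup / ideal:P0}"); // ~60%
    }
}
```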

There are multiple potential reasons for the diminishing returns when we use more than 500 machines. One possibility is that using a large number of machines forces data transfer when accessing the COSMOS distributed filesystem. SCOPE and COSMOS generally try to schedule work close to, or on, the machines that physically store the input data. It is possible that using this many machines forces the input data to be transferred across machines or across racks. Another potential explanation is that, as we mentioned in Section III-C, the pre-initialization stage where the nodes read each dataset into memory could have an impact on run time. Finally, based on our preliminary investigation, the most likely explanation for the changes in run time is that using a large number of machines runs the risk of the task being slowed down accidentally by unhealthy nodes. We ran the experiment multiple times on all 2,000 machines and noted the status of the machines during each run. We found that most machines completed each batch within minutes, but invariably some machines took longer or failed completely. In the first case, the machine might be thermally throttled, or it may have randomly received a batch that simply took longer to process. In the second case, SCOPE has a feature where it monitors each batch, and if a machine fails to process it within a certain amount of time, it reschedules the task on a different machine. This results in a few laggard machines that impede the progress of the overall task and increase run time.

Next, we measure the total compute time for each primary feature. While we could generate all 11 primary features at the same time on different clusters, in this paper we generate them in sequence using the same cluster, to make sure we are measuring on the same underlying hardware. We split compute time into two categories. The first category represents data transfer and data aggregation overhead. It measures how much compute time was used for transferring or aggregating data. The second category measures compute time used to actually generate the features. In both cases we obtained the numbers by summing compute time across all 2,000 machines in the cluster.

Fig. 5: Compute time per feature using the distributed version. For each primary feature (City Name, Alternate Names, Abbreviations, City + Admin1, City + Country, No Vowels Name, First Letters, Airport Code, CLLI Code, UN/LOCODE, Host Patterns), the figure shows total compute time in hours across all 2,000 machines, split into data transfer and aggregation overhead versus feature generation.

Figure 5 shows the total time in hours for each feature run. We observe that for the majority of the features, compute time is mostly spent on the feature generation task itself. However, three of the features report excessive overhead. Upon investigating, we determined that these features generate the most location candidates per input hostname, and thus the most output data that needs to be aggregated, transferred, and stored.

Take for instance the No Vowels Name feature. The reason this feature generates so many candidates is that it can match short combinations of letters based on the names of cities. The feature allows partial matches using the first 3 or more letters of the names. For example, it matches gnvl to Greenville, SC and rvrs to Riverside, CA. Furthermore, we extended this feature with more complex variations. We select the first and last letters of each word in the name, even if the letters are vowels. We then generate combinations of letters from this list, in order. Examples matched by this variation include oxfr for Oxford, MA, and ftmy for Fort Myers, FL. The increased complexity of this feature is reflected in high compute hours, both for data overhead and for feature generation. The other two features that exhibit this pattern have similar characteristics.
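As an illustration of why such loose matching explodes the candidate count, the sketch below accepts an abbreviation whenever it is an in-order subsequence of the city name that starts with the same letter and has at least three characters. This is our own simplified approximation, not the exact variant-generation procedure described above, but it accepts all of the examples in the text (gnvl, rvrs, oxfr, ftmy) and, like the real feature, it also accepts many spurious candidates.

```csharp
using System;

class NoVowelsSketch
{
    // Illustrative check: does the abbreviation appear as an in-order
    // subsequence of the city name (spaces removed, case-insensitive),
    // starting with the same letter and at least 3 characters long?
    static bool Matches(string abbrev, string city)
    {
        string target = city.Replace(" ", "").ToLowerInvariant();
        if (abbrev.Length < 3 || target.Length == 0 || abbrev[0] != target[0])
            return false;
        int i = 0;
        foreach (char c in target)
            if (i < abbrev.Length && c == abbrev[i]) i++;
        return i == abbrev.Length;
    }

    static void Main()
    {
        Console.WriteLine(Matches("gnvl", "Greenville")); // True
        Console.WriteLine(Matches("ftmy", "Fort Myers")); // True
        Console.WriteLine(Matches("gnvl", "Riverside"));  // False
    }
}
```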
On average, our algorithm generates 48 location candidates per hostname. Since the Rapid7 dataset contains 1.24 billion hostnames, the classifier needs to evaluate close to 60 billion data points. Our final experiment compares the run times of the single-node and distributed variants across all 1.24 billion hostnames. In both cases we measure the full run time of the algorithms. One of the differences between the variants is that the single machine version does not need to aggregate features, because it fills the final feature vector in local memory. Since the computation is distributed over multiple machines, the distributed version requires transferring and combining the results into the final vector before the classification step.

TABLE II: Wall clock run time for a single machine compared to the distributed version using 2,000 machines (156X speedup, from 75.5 days down to 0.48 days)

                                       Single          Distributed
Candidate and feature generation       1805.4 hours    6.4 hours
Feature aggregation                    Not required    2.8 hours
C4.5 decision tree classification      6.8 hours       2.4 hours
Total                                  1812.2 hours    11.6 hours

Table II shows the stark difference between the total run times, and it highlights the need to run this process in a distributed manner. The run time decreases from a whopping 75.5 days on a single machine to only half a day on the distributed cluster. The result is a speedup of 156X for the distributed version. Looking at the detailed results, we observe that, interestingly, the classification times of the two variants are not so different. The reason is that a decision tree can be compiled into if statements, which run very fast.
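The last observation is easy to see in code: a trained decision tree can be lowered into nested branches, so each classification costs only a handful of comparisons. The two-feature tree below is hypothetical, purely to illustrate the compiled form; it is not our actual model.

```csharp
using System;

class CompiledTreeSketch
{
    // A hypothetical C4.5-style tree lowered to plain branches. Each call
    // costs a few comparisons, which is why classifying ~60 billion
    // (hostname, candidate) pairs remains tractable.
    static bool IsLikelyLocation(bool exactCityMatch, double admin1Score)
    {
        if (exactCityMatch) return true;    // leaf: accept
        if (admin1Score > 0.5) return true; // leaf: accept
        return false;                       // leaf: reject
    }

    static void Main()
    {
        Console.WriteLine(IsLikelyLocation(false, 0.8)); // True
        Console.WriteLine(IsLikelyLocation(false, 0.1)); // False
    }
}
```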
VI. CONCLUSIONS AND FUTURE WORK

We presented a distributed machine learning approach to geolocating reverse DNS hostnames. Our method significantly outperforms a state-of-the-art academic baseline, and it is competitive and complementary with commercial baselines. We have also shown that the distributed version of our algorithm achieves a 156X speedup over our single-node version.

One direction for future work would be to focus on combining reverse DNS hostname information with WHOIS databases and network delay measurements to form a geolocation database across the entire IP space.

REFERENCES

[1] J. A. Muir and P. C. Van Oorschot, "Internet geolocation: Evasion and counterevasion," ACM Computing Surveys (CSUR), vol. 42, no. 1, p. 4, 2009.
[2] M. Gharaibeh, A. Shah, B. Huffaker, H. Zhang, R. Ensafi, and C. Papadopoulos, "A look at router geolocation in public and commercial databases," in Proceedings of the 2017 Internet Measurement Conference. ACM, 2017, pp. 463–469.
[3] O. Dan, V. Parikh, and B. D. Davison, "IP Geolocation through Reverse DNS," 2019, preprint. [Online]. Available: https://www.ovidiudan.com/papers/2019/reverse-dns/
[4] V. N. Padmanabhan and L. Subramanian, "An Investigation of Geographic Mapping Techniques for Internet Hosts," in SIGCOMM 2001. San Diego, California, USA: ACM, 2001, pp. 173–185. [Online]. Available: http://doi.acm.org/10.1145/383059.383073
[5] B. Gueye, A. Ziviani, M. Crovella, and S. Fdida, "Constraint-Based Geolocation of Internet Hosts," IEEE/ACM Transactions on Networking, vol. 14, no. 6, pp. 1219–1232, Dec 2006.
[6] S. Laki, P. Mátray, P. Hága, I. Csabai, and G. Vattay, "A model based approach for improving router geolocation," Computer Networks, vol. 54, no. 9, pp. 1490–1501, 2010.
[7] G. Ciavarrini, M. S. Greco, and A. Vecchio, "Geolocation of Internet hosts: Accuracy limits through Cramér–Rao lower bound," Computer Networks, vol. 135, pp. 70–80, 2018.
[8] N. Spring, R. Mahajan, and D. Wetherall, "Measuring ISP topologies with Rocketfuel," ACM SIGCOMM Computer Communication Review, vol. 32, no. 4, pp. 133–145, 2002.
[9] B. Huffaker, M. Fomenkov, and k. claffy, "DRoP: DNS-based Router Positioning," ACM SIGCOMM Computer Communication Review (CCR), vol. 44, no. 3, pp. 6–13, Jul 2014.
[10] Center for Applied Internet Data Analysis, "DDec - DNS Decoded - CAIDA's public DNS Decoding database," 2018, accessed: 2018-07-31. [Online]. Available: http://ddec.caida.org/help.pl
[11] M. Wick, "GeoNames," 2018, accessed: 2018-06-27. [Online]. Available: http://download.geonames.org/export/dump/
[12] R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, "SCOPE: easy and efficient parallel processing of massive data sets," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1265–1276, 2008.