The Landscape of Domain Name Typosquatting
Total Page:16
File Type:pdf, Size:1020Kb
The Landscape of Domain Name Typosquatting: Techniques and Countermeasures Jeffrey Spaulding Shambhu Upadhyaya Aziz Mohaisen SUNY Buffalo SUNY Buffalo SUNY Buffalo Buffalo, NY, USA Buffalo, NY, USA Buffalo, NY, USA Email: [email protected] Email: [email protected] Email: [email protected] Abstract—With more than 294 million registered domain over their domain names and pay Facebook up to $1.34 million names as of late 2015, the domain name ecosystem has evolved in damages. to become a cornerstone for the operation of the Internet. As we will discuss in subsequent sections, other forms of Domain names today serve everyone, from individuals for their online presence to big brands for their business operations. Such domain squatting have emerged that not only uses common ty- ecosystem that facilitated legitimate business and personal uses pographical mistakes, but employs the use of (among others): has also fostered “creative” cases of misuse, including phishing, visually-similar letters [18], similar-sounding words [26] or the spam, hit and traffic stealing, online scams, among others. As a exploitation of hardware errors that store domain names [28]. first step towards this misuse, the registration of a legitimately- In addition, it has been shown that not only are the most looking domain is often required. For that, domain typosquatting provides a great avenue to cybercriminals to conduct their crimes. popular domain names targeted by typosquatters, the “long In this paper, we review the landscape of domain name tail” of the popularity distribution has also come under their typosquatting, highlighting models and advanced techniques sights as potential targets for exploitation [30]. By providing for typosquatted domain names generation, models for their a comprehensive treatment of typosquatting, we hope that this monetization, and the existing literature on countermeasures. paper will catapult research on mitigating this problem. We further highlight potential fruitful directions on technical countermeasures that are lacking in the literature. Organization. In §II we review the anatomy of typosquatting. Keywords. Domain Names, Typosquatting, Defenses. In §III we review the monetization techniques of typosquat- ting. In §IV we review the countermeasures to typosquatting. I. INTRODUCTION In §V we provide concluding remarks and open directions. Ever since the process of the domain name registration be- II. TYPOSQUATTING ANATOMY gan in the 1990’s, cybercriminals have seized the opportunity to profit on the backs of others by misusing such process in While typosquatting as a phenomenon is perhaps known so many ways [24], [33], [31]. As Internet commerce quickly for many years, the term itself has been in use for almost two rose and more companies began registering domain names to decades. Several studies have been conducted to understand get a foothold on the action, certain individuals realized that models of typosquatting, including advanced techniques and they could preemptively register these domain names on a first- squatted domains features. In the following, we briefly review come first-serve basis. These so called “cybersquatters” would the historical background of typosquatting, and follow it by a purchase domain names in the hopes of selling them back technical anatomy of models, techniques and features. arXiv:1603.02767v1 [cs.CR] 9 Mar 2016 to companies and trademark owners for a substantial profit. As these popular domain names attracted more users to their A. Historical Background websites, it was not long before cybercriminals recognized The term typosquatter may have been coined as far back that people often made mistakes when typing URLs into as 1998 by R. C. Cumbow in The New York Law Journal their browsers–thus sparking a new form of domain name (NYLJ) [17], who was one of the first to write about this new exploitation called typosquatting. trend of cybersquatting. One of the first large-scale studies In this paper, we survey the landscape of typosquatting, on typosquatting was conducted in 2003 by Edelman [15], which is the deliberate registration of a domain name that uses who located more than 8,800 registered domains that were typographical variants of other target domain names. Typically, minor typographical variations of popular domain names. these variant domain names are generated in such a way as Surprisingly, most of these domain names were traced back to exploit common typographical errors made by users that to one individual, John Zuccarini, who often redirected users manually type URLs into web browsers. For example, the to sexually-explicit content and even used nefarious tactics to popular social networking site Facebook was the target of “mousetrap” these users from leaving these sites (e.g. blocking several typosquatters who registered domain names such as the ordinary operation of a browser’s Back and Close com- www.fagebook.com and www.facewbook.com [29]. Unfortu- mands). Some of these typosquatted domain names went so nately for these typosquatters, they were ordered to transfer far as to target websites frequentDly visited by children, such as disenystore.com (a typo on disneystore.com) 8) 1-mod-inflate: this typo happens when a typosquatter which redirected to a website with sexually-explicit content. increases the length of a domain name (or URL) by one character. Unlike in [32] characters are added based B. Identifying Typo Domains: Models on distance (e.g., using a keyboard layout), this work considers all characters as potential candidates. Prior experiments conducted on the subject of typosquatting Certain aspects of the techniques proposed in [9] can be typically began their data collection phase by first identifying a viewed as generalization of the techniques proposed in [32]. set of domain names and then generating a list of possible typo For example, rather than substituting adjacent characters on a variations on those domain names. Often these experiments keyboard as shown by Wang et al.’s fourth model, Banerjee et used a subset of the top-ranking domain names according to al. substituted all possible alphabet characters when generating some domain ranking websites, such as Alexa. The rationale of typo domains. In addition, they also experimented with two using such domains is that typosquatters will naturally target and three character modifications for their inplace, inflate the most popular domain names to increase the chances of and deflate schemes thereby generating roughly three million obtaining unsuspecting visitors. Table I summarizes these sev- possible typo domain names starting with a corpus of 900 eral approaches of which authoritative domains they studied, original domain names. the number of possible typosquatted domains they generated, After probing for the existence of a possible typo do- and what percentage of them were active (i.e. resolved to main, Banerjee et al. observed that approximately 99% of an IP address hosting a website). In the following section, the “phony” typosquatted sites they identified utilized a one- we describe the models that generated typos variations of an character modification of the popular domain names they authoritative domain. targeted. Essentially, these are domain names that have a Typo-Generation Models. One of the first and widely cited Damerau-Levenshtein distance [22] of one from the domains approaches in this area was introduced by Wang et al. [32] they target. The Damerau-Levenshtein distance is the mini- where given a target domain (e.g. www.example.com), the mum number of operations needed to transform one string into following five typo-generation models are commonly used: another, where an operation is defined as an insertion, deletion, 1) Missing-dot typos: this typo happens when the dot or substitution of a single character, or a transposition of two following “www” is forgotten, e.g., wwwexample.com adjacent characters (a generalization of Hamming distance). 2) Character-omission typos: this typo happens when one C. Advanced Squatting Techniques character in the original domain name is omitted, e.g., While the two representative studies discussed in §II-B www.exmple.com present examples of systematic typosquatting techniques, other 3) Character-permutation typos: this typo happens when techniques that exploit visual, hardware, and sound similarities two consecutive characters are swapped in the original have been explored as well. In the following, we review those domain name, e.g., www.examlpe.com techniques and their use. 4) Character-substitution typos: this typo happens when 1) Homograph Attacks: Per Holgers et al. [18], the ho- characters are replaced in the original domain name by mograph attack relies on the visual similarity of letters or their adjacent ones on a specific keyboard layout, e.g., strings that might be confused with one another. For example, www.ezample.com , where “x” was replaced by the an attacker can exploit the fact that the lower-case letter ‘L’ QWERTY-adjacent character “z”. (l) is visually confusable with the upper-case letter ‘i’ (I) in 5) Character-duplication typos: this typo happens when sans-serif fonts and register www.paypai.com which targets characters are mistakenly typed twice (where they the popular payment site PayPal. The end result, in san-serif appear once in the original domain name), e.g., font, looks very similar to the original: www.paypal.com vs www.exaample.com www.paypaI.com. Alarming as it may seem, the measure- While this previous study presented the first attempt to ment results of Holgers et al. shows that these homograph systematically understand techniques for typosquatting that are attacks are rare and not severe in nature. However, these most prevalent based on certain usage aspects, later studies types of attacks may continue to be an attractive choice looked at exhaustively generating typo domains using other for would-be cyber-criminals since it can fool most users– methods. For example, a similar approach in 2008 by Banerjee as demonstrated in the user study “Why Phishing Works” et al. [9] suggested the following methods for generating by Dhamija et al.