<<

Case of Identity: Detection of Suspicious IDN Homograph Domains Using Active DNS Measurements

Ramin Yazdani Olivier van der Toorn Anna Sperotto University of Twente University of Twente University of Twente Enschede, The Netherlands Enschede, The Netherlands Enschede, The Netherlands [email protected] .[email protected] [email protected]

Abstract—The possibility to include characters in problem in the case of Unicode – ASCII homograph domain names allows users to deal with domains in their domains. propose a low-cost method for proactively regional languages. This is done by introducing Internation- detecting suspicious IDNs. Since our proactive approach alized Domain Names (IDN). However, the visual similarity is based not on use, but on characteristics of the domain between different Unicode characters - called name itself and its associated DNS records, we are able to - is a potential security threat, as visually similar domain provide an early alert for both domain owners as well as names are often used in phishing attacks. Timely detection security researchers to further investigate these domains of suspicious homograph domain names is an important before they are involved in malicious activities. The main step towards preventing sophisticated attacks, since this can contributions of this paper are that we: prevent unaware users to access those homograph domains • propose an improved Unicode Confusion table able that actually carry malicious content. We therefore propose to detect 2.97 times homograph domains compared a structured approach to identify suspicious homograph to the state-of-the-art confusion tables; domain names based not on use, but on characteristics • combine active DNS measurements and Unicode of the domain name itself and its associated DNS records. confusion tables to detect suspicious To achieve this, we leverage the OpenINTEL active DNS IDN homograph domains. In doing so we introduce measurement platform, which performs a daily snapshot of an early alert for further investigation of these more than 65% of the DNS namespace. In this paper, we domains before they are actively used in attacks. first extend the existing Unicode homoglyph tables (confusion tables). This allows us to detect on average 2.97 times homo- The remainder of this paper is organized as follows. In graph domains compared to existing tables. Our proactive Section 2, the background of IDNs in DNS and homograph detection of suspicious IDN homograph domains provides an domains are given. Section 3 discusses the related works early alert that would help both domain owners as well as in the literature. Section 4 introduces the data sets used in security researchers in preventing IDN homograph abuse. this research. In Section 5, the proposed methodology is presented. Results of our study are presented in Section 6. Terms—homoglyph, IDN, homograph attacks, suspi- Ethical considerations are discussed in Section 7. A discus- cious domains, active DNS measurements sion around drawbacks which still need to be addressed is given in Section 8. Finally, Section 9 concludes the paper. 1. Introduction 2. IDN Primer Domain names in the Domain Name System (DNS) are encoded using the American Standard Code for In- The Unicode system incorporates numerous writing formation Interchange (ASCII). The standard uses 8 bits systems and languages, in which many homoglyph char- to encode alphanumeric characters. The Unicode standard acters exist, such as the Greek capital omicron “Ο” uses a maximum of four bytes to perform the encoding, (+039F), capital letter “O” (U+004F), and Cyrillic allowing for a much larger set to be encoded, capital letter “О” (U+041E). These letters are assigned .., Greek, Cyrillic, , or Chinese characters. The to different code points, but visually appear to be indis- Internationalized Domain Name (IDN) [1] provides a tinguishable, or very similar. The DNS is designed with method for using Unicode characters in domain names, ASCII in mind. In order to keep backwards compatibility allowing the usage of regional in domain names. and avoid the need to upgrade existing infrastructure, IDNs However, a major security risk is introduced along with are converted into an ASCII-Compatible Encoding (ACE) IDNs. The Unicode system contains characters that are string, which is done using the ‘Punycode’ [3] algorithm. visually similar to other Unicode or ASCII characters, This algorithm keeps all ASCII characters and encodes called homoglyphs. An attacker can register a domain the non-ASCII characters alongside their position in the visually indistinguishable from an ASCII counterpart using original string using a generalized variable-length integer homoglyph characters, for example to perform a phishing for each non-ASCII character. Finally, an ‘xn--’ prefix attack [2]. These look-alike domains are called homograph is added to indicate the use of Punycode. This process domains. In this paper we investigate the size of the allows the DNS to accept IDNs without any upgrade and is typically reversed before the domain name is presented in their study. However this is possible when dealing to a user, by a browser for example. with a limited number of domains and is computationally expensive otherwise. Qiu et al. [14] propose a Bayesian 3. Related Work framework to calculate the likelihood a character in a domain name is suspicious (visual spoofing). Existing research about IDN homographs can be di- A group of studies [15]–[17] investigate homograph vided in two main groups: the studies trying to construct IDNs targeting top brand domains by processing the simi- Unicode confusion tables (tables that map homoglyph larity of the domain name images. Although image-based characters), and the ones on detecting homograph domains. methods bypass the problem of needing a homoglyph table, this approach is limited to protecting a limited number of 3.1. Unicode Confusion Tables domains (typically brand domains) and is computationally expensive to be extended to the entrie name . El- Fu et al. [4] have constructed a Unicode Character sayed et al. [18] extract newly registered Unicode domains Similarity List (UC-SimList) using a visual similarity from DNS zone files for ‘.com’ and ‘.net’ TLDs and formula based on pixel overlap, covering English, Chinese replace the Unicode characters by their ASCII homograph and Japanese scripts. A similarity threshold is considered counterparts based on the ‘Unicode Confusables’ list, to to select characters considered as homoglyphs. Roshan- determine possible phishing domains. They also make use bin et al. [5] propose a comparable method to create a of the WHOIS data to differentiate between malicious similarity list using the Normalized Compression Distance and protective domains. Quinkert et al. [19] extract IDN (NCD) metric to determine the similarity between Unicode homographs targeting top 10K from the Majestic top 1 characters. Suzuki et al. [6] build a Unicode homoglyph million domains [20] using the ‘Unicode Confusables’ table called ‘SimChar’ using the pixel overlap of the list. Our method differs from this last group of studies characters. They apply this table to IDNs in ‘.com’ Top because it does not make assumption on the nature of the Level Domain (TLD) to extract homograph domains of name, but considers all existing domain names. Besides, we the top-10K Alexa list. To the best of authors’ knowledge replace the use of WHOIS data with DNS measurement, as there is no prior work evaluating the quality of existing WHOIS crawling is notoriously error-prone and sometimes Unicode confusion tables when applied to domain names. not even feasible due to WHOIS privacy protection.

3.2. Homograph Detection 4. Datasets This section discusses the details of the dataset used in Liu and Stamm [7] use the UC-SimList to detect this work. Our dataset comes from the OpenINTEL, which Unicode Obfuscated spam messages. Alvi et al. [8] focus is an active DNS measurement platform, measuring more on detecting plagiarism where Unicode characters are than 65% of the entire DNS namespace on a daily basis. used for obfuscation. Their method uses the ‘Unicode The platform queries domains for their ‘A’, ‘AAAA’, ‘MX’, Confusables’ list1 and the normalized hamming distance. ‘’ records and more. In this paper we use the data from A tool called REGAP is proposed in [9], where a keyword 2018-01-01 through 2019-11-30 for the ‘.com’, ‘.net’ and level Non-deterministic Finite Automaton (NFA) is used to ‘.org’ TLDs and the ‘.se’, ‘.nu’, ‘.ca’, ‘.fi’, ‘.at’, ‘.dk’ and identify potential IDN-based phishing patterns. Contrary ‘.рф’ country-code Top Level Domain (ccTLD). For com- to our work, this approach requires manual intervention parison purposes we use data from publicly available black- limiting the number of investigated domains. lists which is measured by OpenINTEL, namely ‘Hostfile’, Krammer et al. [10] and Al Helou et al. [11] propose ‘hpHosts’, ‘Ransomwaretracker’, ‘Openphish’, ‘Malware- improved user interfaces for browsers to defend against domainlist’, ‘Joewein’, ‘Threatexpert’, ‘Zeustracker’ and phishing attacks. The client-side anti-phishing browser ‘Malcode’2. In the rest of the paper we refer to this set extension prints characters of the Unicode subsets in as ‘RBL’. The date range of blacklist data is from 2018- multiple colors in the address . Although browser based 01-01 till 2019-12-15 to give domains registered at the solutions are helpful, they do not prevent naive users from end of our dataset a chance of appearing on a blacklist. clicking on a malicious homograph domain link on a web As Unicode Confusion tables, the ‘Unicode Confusables’ or an . list, published by the (version 12.0.0) Shirazi et al. [12] propose a phishing domain classifica- and the ‘UC-SimList0.8’ [4], together with an improved tion strategy which uses seven domain name based features, confusion table based on these two tables are used. Details modeling the relation between the domain name and the of the improved confusion table are given in Section 5.2. visible content of a web page. Considering the fact that not all homograph domains have a web page, this method 5. Methodology cannot be applied at scale. Holgers et al. [13] perform a measurement study by first passively collecting a nine- In this section we present our methodology. A high- day-long trace of domain names accessed by users in their level view of the proposed detection mechanism for suspi- department and then generating corresponding homograph cious IDN homographs is shown in Fig. 1. The approach domains. The subsequent step in their work was to perform is divided in five major steps, (1) through (5). We have active measurements against the confusable domains to applied the above process on each day of our dataset. determine if they are registered and active. Both ASCII Details of these steps are elaborated in the following and Unicode homoglyphs of characters were investigated subsections. 1. https://unicode.org/Public/security 2. https://www.tide-project.nl/blog/unicode homoglyphs TABLE 1. U NICODEHOMOGLYPHCHARACTERTABLES

Unicode UC- Proposed

OpenINTEL DB Unicode Homoglyphs DB Confusables SimList0.8 table Enrichment Data Ground Truth Total character pairs 6296 29880 2627 Characters with an ASCII 2236 536 2627 IDN Extraction Homograph Domain Pairs Existing Pairs Suspicious Domains Blacklisted? homoglyph Character to string mapping    (1) (2) (3) (4) (5)

Figure 1. High-level overview of the proposed method that can not be replaced by a single character. On the other hand, the ‘UC-SimList0.8’ does provide multiple 5.1. IDN Extraction homoglyphs for each Unicode character, if the homoglyphs exist, ordered by the degree of similarity. In this paper we All IDNs start with a ‘xn--’ prefix. In the first step we use the ASCII homoglyph with the highest similarity score. filter out the new registered IDNs on each day from our We realize that a Unicode character may have Unicode dataset. This filtering gives us a first indication on how homoglyphs, but due to computation restraints we focus large the problem of suspicious IDN homograph domains on ASCII homoglyphs in this paper. We compare the could be at a maximum. According to the 2019 IDN performance of the three confusion tables, with respect report provided by EURid [21], there were approximately to our goal, in Section 6.2. 7.5 and 9 million IDNs by the end of 2017 and 2018, respectively, which accounts for approximately 2% of the entire DNS namespace. In 2018 there were 7 million 5.3. Existing Homograph Pairs (78%) IDNs registered under a ccTLD, confirming the importance of including ccTLD in the investigation of At this stage we have a list of IDNs paired with their suspicious homograph domains. ASCII counterparts. Since these ASCII counterparts are ‘fabricated’ domains, we need to determine if it concerns 5.2. Homograph Domains registered domain names. We do so by querying our dataset for these ASCII domains. If the ASCII homograph domain In this step we replace the Unicode characters present is not present in our dataset we discard the entire pair. in the IDNs extracted in the previous step with their that in this research we assume an ASCII domain to be corresponding ASCII homoglyph using the Confusion authoritative since in our setup it has been registered before tables mentioned in Section 4. If no ASCII homoglyph its IDN counterpart. character exists for one or more Unicode characters in an IDN, that domain is not further analyzed. We have noticed some irregularities in the “UC-SimList0.8”. For example, 5.4. Suspicious Homograph Detection the Latin small letter dotless I ‘ı’ (U+0131) is considered a homoglyph of ‘!’ (U+0021). However, we argue that the Latin small letter I ‘i’ (U+0069) would be In this step we investigate if the IDN homograph a better choice for our purpose, since the exclamation mark domain has the same origin as the ASCII counterpart, as is an illegal character in DNS labels. This finding urged us the Unicode domain may be a protective registration. For to explore the quality of the confusion tables, with regard this purpose we use the ‘AS’ number, and the ‘A’, ‘AAAA’, to our goal. We introduce a third table, which mixes the ‘NS’ and ‘MX’ records of the two domains retrieved from ‘Unicode Confusables’ and ‘UC-SimList0.8’, but replaces the OpenINTEL platform. We use these record types since irregularities and adds missing characters. Addition of these are frequently used by domains, and will likely the missing characters was done by manually inspecting to the same addresses in the case of a protective characters present in the IDNs that were not covered by the registration. In order to validate this assumption we have existing two tables. Note that our goal was not to create a compared these records for homograph domains which are flawless homoglyph table, as this is a standalone research already on public blacklists in Section 6.4. A normalized area (Section 3.1) and out of the scope of this paper. The ‘suspiciousness’ score is calculated by counting the number proposed table is publicly available online3. Specifications of differences between the existing parameters divided of each Unicode confusion table is given in Table 1. by the number of existing records. However, a record is Comparing the two existing homoglyph tables, we observe counted as existing only if both sides have an entry for that while the ‘UC-SimList0.8’ contains more characters it. Hence, the normalized suspiciousness score will be a in total, it covers much fewer characters with an ASCII discrete real value in [0,1], where 0 means the Unicode homoglyph than the ‘Unicode Confusables’ table. This domain is likely a protective registration and a score of 1 is because the ‘UC-SimList0.8’ covers many characters suggests suspiciousness. If the score exceeds a threshold from the Chinese and Japanese alphabet for which an we mark the Unicode domain as suspicious. We determine ASCII homoglyph does not exist. Another major difference the threshold by calculating the suspiciousness score for between these two tables is that the ‘Unicode Confusables’ the detected domains which have appeared on the blacklists. table provides homoglyph strings for Unicode characters Based on the distribution we choose a cut-off point, the threshold, which captures more than 75% of the blacklisted 3. https://www.tide-project.nl/blog/unicode homoglyphs domains. edtc upcoshmgahdmisadwe these when and domains homograph suspicious detect we iia eut r civdcnieigtenme of number the considering achieved characters. missing are additional results with Similar tables proposed the both since covers ‘UC-SimList0.8’ on the list times that, 1.5 Confusables’ to ‘Unicode shows up and to times 3 compared 6 to domains Fig. up homograph extracts -axis. table confusion proposed the the average, log-scaled average moving the a and applied in improve have filter To we per characters figures respectively. added the Unicode TLD, of IDNs readability ‘.com’ of of in number number domains total the added the depict and 4 Fig. day and 5.2. Section 3 in discussed Fig. tables confusion Unicode three the Tables Confusion of Comparison 6.2. suspicious domains. of homograph detection dataset IDN performing our Han for making use exists, valuable homoglyph IDNs extremely ASCII of no 48% which that for Additionally, 2017 domains. in reported Unicode EURid all contains of dataset our 30% on EURid, approximately IDNs by reported million as 7.5 2017 the December IDNs Considering 2019 for the TLDs. 13% with of generic line growth under in negative is a This showing to period. report, shrinking the EURid of by end descent and the steepest beginning at the at the 83.7% have to 92.7% ‘.net’ compared in to period IDNs shrink study the and the decrease of end IDNs least Specifically, the the domains. have of sets ‘.com’ four for in the the seen visible of in is any are trend errors in growth IDNs deeps measurement negative to Two A due platform. 2. OpenINTEL are Fig. which ccTLDs in for depicted are ccTLDs Growth IDN 6.1. Results paper. 6. this existed), of (if scope domains the these beyond of is pages this this web do however to early the way studying an One activity. is investigated gives malicious further possible it be discover to domains, to need that malicious domains does detect on method notification to our Although aim blacklists. the not of any when on dates appear the considering data, compare blacklist and data historic OpenINTEL to historic this on method our run Advantage Time 5.5. iue2 eaiegot fIN ‘cm,‘nt,‘og n ccTLDs and ‘.org’ ‘.net’, ‘.com’, in IDNs of growth Relative 2. Figure xrcino oorp oanpisi oeusing done is pairs domain homograph of Extraction and ‘.org’ ‘.net’, ‘.com’, in IDNs of growth The we advantage time achievable potential the explore To Growth of the IDNs (%) 100 85 90 95

Jan 2018

Apr 2018

Jul 2018

Measurement Date Oct 2018

Jan 2019

Apr 2019

Jul 2019 ccTLDs .org TLD .net TLD .com TLD Oct 2019 UioeCnuals al hc aeyapa na in appear rarely which table Confusables’ ‘Unicode characters more contains list the since ‘UC-SimList0.8’ ehv hsnfrti praht ro ntesd of side the on error to approach this for chosen have We iha SI oolp.Hwvr ‘UC-SimList0.8’ the However, homoglyph. ASCII an with hnt akadmi sssiiu h oorp do- homograph the suspicious as domain a mark to when genbr)tedsrbto fsoe fteedmisis domains these of scores of distribution the bars) (green of score normalized The details). for 5.4 Section (see h ancuei h ucuto hrcesfo the from characters the is cause main The TLDs. eeto a iia feto u ehddeto due method our on effect minimal has selection value cr odfeetaebtenpoetv eitain and registrations suspiciousness protective a between compute differentiate we to suspicious, score is it mean ically Selection Value Threshold 6.3. IDNs for TLD. ‘.com’ table and in each characters day by Unicode per covered total added tables. characters of the Unicode number by the the covered plots characters com- 4 of day Fig. number per the IDNs to observed Unicode pared newly be of in can amount present the observation characters investigating This by tables. are quantified domains, existing further homoglyphs extracted by Unicode of makes covered used number not table frequently the proposed that in our implying difference to large characters a missing of set name. domain case. this in table Confusables’ ‘Unicode the the outperforms than domains more extract to able is table fusables’ ‘.org’ and ‘.net’ to corresponding characters and domains epnigt odfeec nrcrso cr fone cor- of Besides, different. score zero being a records of the or score of records all a in to corresponding difference achieving the no range, that to score two seen responding in the is concentrated of it mainly ends glance the are pairs first by domain detected the homograph pairs In domain method. homograph proposed the for scores Results Detection 6.4. mark as to method 0.9 proposed of our threshold suspicious. by a detected score. used domains have suspiciousness homograph we our following of the distribution In specific threshold the the the that selects showing 0.1) (e.g., domains, zero blacklisted hand, to other of close the 93% value we On threshold 0.9), domains. a (e.g., blacklisted selecting one the to of close 90% value select threshold 5 a the Fig. With of In shown. one period. observation on the appeared during have RBL 376 in domains blacklists 2019-12-15. these and of compared 2018-01-01 Out were between dataset approach RBL proposed against the by of detected threshold mains the makes determine score To higher suspicious. A more records. IDN missing score. the of the case of type in calculation record caution two the particular the in a ignored of for is one value type If a record sides. have that both not on does exist number sides the which of by records divided number of values the record summing between by mismatches calculated is suspiciousness records additional ex- the the for queried pairs we domain purpose homograph tracted this For domains. suspicious ic ncd oorp oande o automat- not does domain homograph Unicode a Since modest a of addition the that observe we Furthermore, Con- ‘Unicode the that expectation the draws 1 Table i.5sostedsrbto ftesuspiciousness the of distribution the shows 5 Fig. idw(ndy)btendtcino upcosdomains suspicious of detection between days) (in window hssget eaieyhge upcosesfrIDNs for suspiciousness higher relatively a suggests This rpsdmto obndwt ute oananalysis domain 6 was further the of methods. with using (0.3%) median achievable combined (a domain potentially method days is proposed single detection 21 early average a of On and days) year. year, a 79 after a month, detected between a and difference in month a most with a at detected and were registration (23.4%) least domains their at their after detected 78 as day were day domains, a (53.1%) same detected domains the 337 179 on registration, the detected were of (23.2%) Out appear domains domains blacklists. these which the by time on the and method our the by as advantage time the calculate black- We existing investigated. against is with domains) lists combined abusive (if detect to approach methods detection proposed our using Advantage Time Potential 6.5. spoofing). (web are domains intent homograph malicious IDN a which the [13] for of research used previous 1.3% a only with that line shows in- in malicious is a This detected for yet. of used tent actively majority not a that are considering domains suggests suspicious period, which 0.9, observation of on the threshold appeared during not different. have blacklist are IDNs a records suspicious of DNS (99.37%) their 52986 while resemble visually closely pairs other domain each ASCII investigated, and further that Unicode be strongly these to as need feel and we many suspicious blacklisted, remain are these not there are While which 0.9. sus- domains of normalized threshold the score exceeding piciousness for suspicious as domains ccTLDs. the to compared (61.5%). one ‘.com’ of from score extracted a with has pairs ‘.com’ domain the while of (66.4%), from group zero large pairs of a domain score homograph a achieve the ccTLDs of portion large a nti eto h oeta civbetm advantage time achievable potential the section this In 53323 marked have we period detection our During Count of the IDNs

Count of the Unicode Characters 10 10 10 10 10 10 10 iue4 hrceswt oolp o de IDNs added for homoglyph a with Characters 4. Figure 2 3 4 1 2 3 4 Jan 2018 Jan 2018 iue3 xrce oorpsfraddIDNs added for homographs Extracted 3. Figure

Apr 2018 Apr 2018

Jul 2018 Jul 2018

Oct 2018 Measurement Date Oct 2018 Measurement Date

Jan 2019 Jan 2019 Homographs extractedusing"UC-SimList0.8" Homographs extractedusing"UnicodeConfusables" Homographs extractedusingproposedtable Added IDNdomains Characters withahomoglyphinthe"UC-SimList0.8" Characters withahomoglyphinthe"UnicodeConfusables" Characters withahomoglyphintheproposedtable Total Unicodecharacters

Apr 2019 Apr 2019

Jul 2019 Jul 2019

Oct 2019 Oct 2019 0ACIdmis(33)wt vrl 1 homo- IDN 710 overall with (33.3%) domains ASCII 10 hr ahACIdmi a adu fcorresponding of handful a has domain ASCII each where The domains. homograph IDN corresponding 1969 with ASCII this targeting homographs IDN added 1250 with 1.% ihoeal34INhmgahdmis A domains. homograph IDN 354 overall with (16.7%) IDN corresponding of count highest the with (domains hscaatrsi sntse n‘og n ccTLDs and TLD ‘.org’ in seen not is characteristic This ..TpTree Domains Targeted Top 6.6. n-sr ihasml ehdt epso malicious stop help to IDNs. method and of simple authorities usage is a both paper with among This do awareness end-users domains. we provide these since to behind the specially meant intent to domains, than the acceptable these study ethically source of not not list different is the a it publish highly domain, having are ASCII paper to existing this in due method suspicious proposed the by detected Considerations Ethical 7. homographs. IDN domain. ASCII per comparably domains with homograph TLD of ‘.net’ domains number in ASCII lower seen is 5 behaviour of similar consisting rank providers service third IT the and media achieve social The domains. graph targeting rank second the in are platforms 10 providers crypto-currency that service financial observe to we related ‘.com’, are (33.3%) in 30 domains these domains Considering targeted 2019. highly and service 2018 financial throughout a name domain to domain belongs The ‘.com’, TLD. in ‘.com’ industry most in per the 30 summed targeted top domains this of of count category category. the industry shows the 6 Fig. determined have we homographs), domains a targeted most targeting 30 the homographs From we IDN domain. homographs, of ASCII specific number IDN the suspicious counted by have targeted highly are iue5 upcosessoefretatdadbakitddomains blacklisted and extracted for score Suspiciousness 5. Figure nodrt e etrisgto h oan which domains the on insight better a get to order In lhuhw eiv httenwrgsedIDNs registerd new the that believe we Although Count of domains Percentage of homograph domain pairs 100 10 10 10 10 20 40 60 80 0 0 1 2 3 ...... 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 iue6 o agtddmisi .o”TLD “.com” in domains targeted Top 6. Figure iac rpocrec oilmda&I Other Socialmedia&IT Crypto-currency Finance Suspiciousness score Industry RBL ccTLDs .org TLD .net TLD .com TLD IDN homographs ASCII domains 8. Discussion Internet-Draft draft-masinter--i18n-08, Nov. 2001, work in Progress. [Online]. Available: https://datatracker.ietf.org/doc/html/ The measurement data OpenINTEL provides for IDNs draft-masinter-url-i18n-08 clearly shows the problem of IDN homograph phishing [2] E. Gabrilovich and A. Gontmakher, “The Homograph Attack,” attacks. However, the used dataset shows only the tip of Communications of the ACM, vol. 45, no. 2, p. 128, 2002. the iceberg. As of December 2018 there were 85 ccTLDs [3] A. Costello, “Punycode: A Bootstring encoding of Unicode for supporting IDNs [21]. In this work we have investigated Internationalized Domain Names in Applications (IDNA),” Internet Requests for Comments, RFC Editor, RFC 3492, March 2003. 7 out of the 85 ccTLDs which support IDNs. This makes [4] A. Y. Fu, . Deng, L. Wenyin, and G. Little, “The Methodology and for a good start since we show the proposed methodology an Application to Fight against Unicode Attacks,” in Proceedings works in detecting homograph IDNs, but it does not show of the second symposium on Usable privacy and security. ACM, the complete picture of homograph IDNs. Extending the 2006, pp. 91–101. number of measured ccTLDs which support IDNs would [5] N. Roshanbin and . Miller, “Finding Homoglyphs - A Step towards increase the grasp we have of the problem. Additionally, Detecting Unicode-Based Visual Spoofing Attacks,” in International our proposed Unicode confusion table only covers Unicode Conference on Web Information Systems Engineering. Springer, to single ASCII mappings that is relatively computation- 2011, pp. 1–14. ally inexpensive. With multiple homoglyphs for a single [6] H. Suzuki, D. Chiba, Y. Yoneya, . Mori, and S. Goto, “ShamFinder: An Automated Framework for Detecting IDN Homographs,” in character, either Unicode or ASCII, the Unicode confusable Proceedings of the 19th ACM Internet Measurement Conference table may be further improved and detect more suspicious (IMC 2019), 2019. homograph domains. [7] . Liu and S. Stamm, “Fighting Unicode-Obfuscated Spam,” in We are aware that there might be cases where a Proceedings of the anti-phishing working groups 2nd annual eCrime protective registration does not point to the same records researchers summit. ACM, 2007, pp. 45–59. as the original domain, e.g. because it uses the domain [8] . Alvi, . Stevenson, and P. Clough, “Plagiarism Detection in parking service of the Registrar. However, we believe that Texts Obfuscated with Homoglyphs,” in European Conference on operators as well as domain owners would benefit from Information Retrieval. Springer, 2017, pp. 669–675. our approach by further investigating detected domains [9] A. Y. Fu, X. Deng, and L. Wenyin, “REGAP: A Tool for Unicode- and taking necessary countermeasures in case of a mali- Based Web Identity Fraud Detection,” Journal of Digital Forensic Practice, vol. 1, no. 2, pp. 83–97, 2006. cious registration and simply neglecting the alert if those domains are protectively registered. Besides, since we are [10] . Krammer, “Phishing Defense against IDN Address Spoofing Attacks,” in Proceedings of the 2006 International Conference not proposing to immediately block the detected suspicious on Privacy, Security and Trust: Bridge the Gap Between PST domains, our approach wouldn’t cause any disruptions to Technologies and Business Services. ACM, 2006, p. 32. those domains. [11] J. Al Helou and S. Tilley, “Multilingual Web Sites: Internation- alized Domain Name Homograph Attacks,” in 2010 12th IEEE 9. Conclusions International Symposium on Web Systems Evolution (WSE). IEEE, 2010, pp. 89–92. In this paper we investigate how suspicious domains [12] H. Shirazi, . Bezawada, and I. Ray, “Kn0w Thy Doma1n Name: Unbiased Phishing Detection Using Domain Name Based Features,” using Unicode homoglyph characters can be detected using in Proceedings of the 23nd ACM on Symposium on Access Control active DNS measurements. Combining the unique Open- Models and Technologies. ACM, 2018, pp. 69–75. INTEL dataset with our improved Unicode Confusables [13] T. Holgers, D. E. Watson, and S. D. Gribble, “Cutting through table we are able to detect 53323 domains from ‘.com’, the Confusion: A Measurement Study of Homograph Attacks.” in ‘.net’ and ‘.org’ TLDs and seven ccTLDs which exceed our USENIX Annual Technical Conference, General Track, 2006, pp. ‘suspiciousness’ score threshold. The proposed method can 261–266. be combined with other detection methods such as web [14] B. Qiu, N. Fang, and L. Wenyin, “Detect Visual Spoofing in page analysis and network traffic measurements to detect Unicode-based text,” in 2010 20th International Conference on Pattern Recognition. IEEE, 2010, pp. 1949–1952. abusive IDN homograph domains, with the potential to do so earlier than public blacklists since we rely on DNS [15] D. Chiba, A. A. Hasegawa, T. Koide, Y. Sawabe, S. Goto, and M. Akiyama, “DomainScouter: Understanding the Risks of Decep- characteristics of the domain rather than on use. Further- tive IDNs,” 2019, pp. 413–426. more, we show that the suspicious IDNs frequently target [16] B. Liu, C. Lu, . Li, Y. Liu, H.-X. Duan, S. Hao, and Z. Zhang, domains in the finance and crypto-currency industries, “A Reexamination of Internationalized Domain Names: The Good, followed by domains in social media and IT sectors. the Bad and the Ugly.” in DSN, 2018, pp. 654–665. [17] Y. Sawabe, D. Chiba, M. Akiyama, and S. Goto, “Detection Method Acknowledgments of Homograph Internationalized Domain Names with OCR,” Jour- nal of Information Processing, vol. 27, pp. 536–544, 2019. We would like to thank the OpenINTEL project for [18] Y. Elsayed and A. Shosha, “Large Scale Detection of IDN Domain providing us with valuable DNS data. This work has Name Masquerading,” in 2018 APWG Symposium on Electronic Crime Research (eCrime). IEEE, 2018, pp. 1–11. been funded by SIDN fonds, an independent fund on the initiative of SIDN, the registrar for ‘.nl’ domains. This [19] F. Quinkert, T. Lauinger, . Robertson, E. Kirda, and T. Holz, “It’s Not what It Looks Like: Measuring Attacks and Defensive work has partially been funded by the EU H2020 project Registrations of Homograph Domains,” in 2019 IEEE Conference CONCORDIA (#830927). on Communications and Network Security (CNS). IEEE, 2019, pp. 259–267. References [20] “The Majestic Million.” [Online]. Available: https://majestic.com/ reports/majestic-million [1] L. M. Masinter, M. J. Durst,¨ and M. Suignard, “Internationalized [21] EURid, “IDN World Report.” [Online]. Available: https:// Resource Identifiers (IRI),” Internet Engineering Task Force, idnworldreport.eu/