<<

Case of Identity: Detection of Suspicious IDN Homoglyph Domains Using Active DNS Measurements

Ramin Yazdani Olivier van der Toorn Anna Sperotto University of Twente University of Twente University of Twente Enschede, The Netherlands Enschede, The Netherlands Enschede, The Netherlands [email protected] [email protected] [email protected]

Abstract—The possibility to include characters in Since our approach is based not on use, but on - domain names allows users to deal with domains in their istics of the domain name itself and its associated DNS regional languages. This is done by introducing Internation- records, are able to detect such domains before they are alized Domain Names (IDN). However, the visual similarity involved in malicious activities. The main contributions of between different Unicode characters - called homoglyphs this paper are that we: - is a potential security threat, as visually similar domain names are often used in phishing attacks. Timely detection • propose an improved Unicode Confusion table able of suspicious homoglyph domain names is an important to detect 2.97 times homoglyph domains compared step towards preventing sophisticated attacks, since this can to the state-of-the-art confusion tables; prevent unaware users to access those homoglyph domain • combine active DNS measurements and Unicode that actually carry malicious content. We therefore propose homoglyph confusion tables to detect suspicious a structured approach to identify homoglyph domain names IDN homoglyphs. In doing so we introduce a based not on use, but on characteristics of the domain name potential time advantage between the moment of itself and its associated DNS records. To achieve this, we detecting suspicious domains and being listed by leverage the OpenINTEL active DNS measurement platform, publicly available blacklists. which performs a daily snapshot of more than 65% of the The remainder of this paper is organized as follows. In DNS namespace. In this paper, we first extend the existing Section 2, the background of IDNs in DNS and homoglyph Unicode homoglyph tables (confusion tables). This allows us domains are given. Section 3 discusses the related works to detect on average 2.97 times homoglyph domains compared in the literature. Section 4 introduces the datasets used in to existing tables. Our results show that we are able to identify this research. In Section 5, the proposed methodology is suspicious domains on average 21 days before those appear presented. Results of our study are presented in Section 6. in blacklists. Ethical considerations are discussed in Section 7. A discus- Index Terms—homoglyph, IDN, homograph attacks, suspi- sion around drawbacks which still need to be addressed is cious domains, active DNS measurements given in Section 8. Finally, Section 9 concludes the paper.

1. Introduction 2. IDN Primer

Domain names in the Domain Name System (DNS) The Unicode system incorporates numerous writing are encoded using the American Standard Code for In- systems and languages, in which many homoglyph char- formation Interchange (ASCII). The standard uses 8 bits acters exist, such as the Greek capital omicron “Ο” to encode alphanumeric characters. The Unicode standard (+039F), Latin capital letter “O” (U+004F), and Cyrillic uses a maximum of four bytes to perform the encoding, capital letter “О” (U+041E). These letters are assigned allowing for a much larger character set to be encoded, to different code points, but visually appear to be indis- e.g., Greek, Cyrillic, Arabic, or Chinese characters. The tinguishable, or very similar. The DNS is designed with Internationalized Domain Name (IDN) [1] provides a ASCII in mind. In order to keep backwards compatibility method for using Unicode characters in domain names, and avoid the need to upgrade existing infrastructure, IDNs allowing the usage of regional alphabet in domain names. are converted into an ASCII-Compatible Encoding (ACE) However, a major security risk is introduced along with string, which is done using the ‘Punycode’ [3] algorithm. IDNs. The Unicode system contains characters that are This algorithm keeps all ASCII characters and encodes visually similar to other Unicode or ASCII characters, the non-ASCII characters alongside their position in the called homoglyphs. An attacker can register a domain original string using a generalized variable-length integer visually indistinguishable from an ASCII counterpart using for each non-ASCII character. Finally, an ‘xn--’ prefix homoglyphs, for example to perform a phishing attack [2]. is added to indicate the use of Punycode. This process In this paper we investigate the size of the problem in allows the DNS to accept IDNs without any upgrade and the case of Unicode – ASCII homoglyphs. We propose a is typically reversed before the domain name is presented simple method for proactively detecting suspicious IDNs. to a user, by a browser for example. 3. Related Work framework to calculate the likelihood a character in a domain name is suspicious (visual spoofing). Existing research about IDN homoglyphs can be di- A group of studies [15]–[17] investigate homoglyph vided in two main groups: the studies trying to construct IDNs targeting top brand domains by processing the simi- Unicode confusion tables, and the ones on detecting ho- larity of the domain name images. Although image-based moglyph domains. methods bypass the problem of needing a homoglyph table, this approach is limited to protecting a number of 3.1. Unicode Confusion Tables brand domains. Elsayed et al. [18] extract newly registered Unicode domains from DNS zone files for ‘.com’ and Fu et al. [4] have constructed a Unicode Character ‘.net’ TLDs and replace the Unicode characters by their Similarity List (UC-SimList) using a visual similarity ASCII homoglyph counterparts based on the ‘Unicode formula based on pixel overlap, covering English, Chinese Confusables’ list, to determine possible phishing domains. and Japanese scripts. A similarity threshold is considered They also make use of the WHOIS data to differentiate to select characters considered as homoglyphs. Roshan- between malicious and protective domains. Quinkert et bin et al. [5] propose a comparable method to create a al. [19] extract IDN homoglyphs targeting top 10K from similarity list using the Normalized Compression Distance the Majestic top 1 million domains [20] using the ‘Unicode (NCD) metric to determine the similarity between Unicode Confusables’ list. Our method differs from this last group characters. Suzuki et al. [6] build a Unicode homoglyph of studies because it does not make assumption on the na- table called ‘SimChar’ using the pixel overlap of the ture of the name, but considers all existing domain names. characters. They apply this table to IDNs in ‘.com’ Top Besides, we replace the use of WHOIS data with DNS Level Domain (TLD) to extract homoglyph domains of measurement, as WHOIS crawling is notoriously error- the top-10K Alexa list. To the best of authors’ knowledge prone and sometimes not even feasible due to WHOIS there is no prior work evaluating the quality of existing privacy protection. Unicode confusion tables when applied to domain names. 4. Datasets 3.2. Homoglyph Detection This section discusses the details of the dataset used in Liu and Stamm [7] use the UC-SimList to detect this work. Our dataset comes from the OpenINTEL, which Unicode Obfuscated spam messages. Alvi et al. [8] focus is an active DNS measurement platform, measuring more on detecting plagiarism where Unicode characters are than 65% of the entire DNS namespace on a daily basis. used for obfuscation. Their method uses the ‘Unicode The platform queries domains for their ‘A’, ‘AAAA’, ‘MX’, Confusables’ list1 and the normalized hamming distance. ‘NS’ records and more. In this paper we use the data from A tool called REGAP is proposed in [9], where a keyword 2018-01-01 through 2019-11-30 for the ‘.com’, ‘.net’ and level Non-deterministic Finite Automaton (NFA) is used to ‘.org’ TLDs and the ‘.se’, ‘.nu’, ‘.ca’, ‘.fi’, ‘.at’, ‘.dk’ and рф identify potential IDN-based phishing patterns. Contrary ‘. ’ country-code Top Level Domain (ccTLD)s. For com- to our work, this approach requires manual intervention parison purposes we use data from publicly available black- limiting the number of investigated domains. lists which is measured by OpenINTEL, namely ‘Hostfile’, Krammer et al. [10] and Al Helou et al. [11] propose ‘hpHosts’, ‘Ransomwaretracker’, ‘Openphish’, ‘Malware- domainlist’, ‘Joewein’, ‘Threatexpert’, ‘Zeustracker’ and improved user interfaces for browsers to defend against 2 phishing attacks. The client-side anti-phishing browser ‘Malcode’ . In the rest of the paper we refer to this set extension prints characters of the Unicode subsets in as ‘RBL’. The date range of blacklist data is from 2018- multiple colors in the address bar. Although browser based 01-01 till 2019-12-15 to give domains registered at the solutions are helpful, they do not prevent naive users from end of our dataset a chance of appearing on a blacklist. clicking on a malicious homoglyph domain link on a As Unicode Confusion tables, the ‘Unicode Confusables’ webpage or an email. list, published by the (version 12.0.0) Shirazi et al. [12] propose a phishing domain classifica- and the ‘UC-SimList0.8’ [4], together with an improved tion strategy which uses seven domain name based features, confusion table based on these two tables are used. modeling the relation between the domain name and the visible content of a Web page. Considering the fact that 5. Methodology not all homoglyph domains have a Web page, this method In this section we present our methodology. A high- cannot be applied at scale. Holgers et al. [13] perform level view of the proposed detection mechanism for suspi- a measurement study by first passively collecting a nine- cious IDN homoglyphs is shown in Fig. 1. The approach day-long trace of domain names accessed by users in their is divided in five major steps, (1) through (5). We have department and then generating corresponding homoglyph applied the above process on each day of our dataset. domains. The subsequent step in their work was to perform Details of these steps are elaborated in the following active measurements against the confusable domains to subsections. determine if they are registered and active. Both ASCII and Unicode homoglyphs of characters were investigated 5.1. IDN Extraction in their study. However this is possible when dealing with a limited number of domains and is computationally All IDNs start with a ‘xn--’ prefix. In the first step we expensive otherwise. Qiu et al. [14] propose a Bayesian filter out the domains from our dataset that contain this 1. https://unicode.org/Public/security 2. https://www.tide-project.nl/blog/unicode homoglyphs TABLE 1. U NICODEHOMOGLYPHCHARACTERTABLES

Unicode UC- Proposed

OpenINTEL DB Unicode Homoglyphs DB Confusables SimList0.8 table Enrichment Data Ground Truth Total character pairs 6296 29880 2627 Characters with an ASCII 2236 536 2627 IDN Extraction Homoglyph Domain Pairs Existing Pairs Suspicious Domains Early Detection? homoglyph Character to string mapping    (1) (2) (3) (4) (5)

Figure 1. High-level overview of the proposed method 5.3. Existing Homoglyph Pairs At this stage we have a list of Unicode domains with prefix. This filtering gives us a first indication on how large their ASCII counterparts. Since these ASCII counterparts the problem of suspicious Unicode homoglyph domains are ‘fabricated’ domains, we need to determine if it con- could be at a maximum. According to the 2019 IDN cerns registered domain names. We do so by querying our report provided by EURid [21], there were approximately dataset for these ASCII homoglyph domains. If the ASCII 7.5 and 9 million IDNs by the end of 2017 and 2018, homoglyph domain is not present in our dataset we discard respectively, which accounts for approximately 2% of the the entire pair. entire DNS namespace. In 2018 there were 7 million (78%) IDNs registered under a ccTLD, confirming the 5.4. Suspicious Homoglyph Detection importance of including ccTLD in the investigation of suspicious homoglyph domains. In this step we investigate if the Unicode homoglyph domain has the same owner as the ASCII counterpart, as the Unicode domain may be a protective registration. For 5.2. Homoglyph Domains this purpose we use the ‘AS’ number, and the ‘A’, ‘AAAA’, ‘NS’ and ‘MX’ records of the two domains retrieved from In this step we filter the Unicode domains containing the OpenINTEL platform. We use these record types since homoglyph characters from the set of Unicode domains these are frequently used by domains, and will likely point extracted in the previous step using the Confusion tables to the same addresses in the case of a protective registra- mentioned in Section 4. We have noticed some irregulari- tion. A normalized ‘suspiciousness’ score is calculated by ties in the “UC-SimList0.8”. For example, the Latin small counting the number of differences between the existing letter dotless I ‘ı’ (U+0131) is considered a homoglyph of parameters divided by the number of existing records. exclamation mark ‘!’ (U+0021). However, we argue that However, a record is counted as existing only if both sides the Latin small letter I ‘i’ (U+0069) would be a better have an entry for it. Hence, the normalized suspiciousness choice for our purpose, since the exclamation mark is an score will be a discrete real value in [0,1], where 0 means illegal character in DNS labels. This finding urged us to the Unicode domain is likely a protective registration and explore the quality of the confusion tables, with regard a score of 1 suggests suspiciousness. If the score exceeds a to our goal. We introduce a third table, which mixes the threshold we mark the Unicode domain as suspicious. We ‘Unicode Confusables’ and ‘UC-SimList0.8’, but replaces determine the threshold by calculating the suspiciousness irregularities and adds missing characters. The proposed score for the detected domains which have appeared on the table is publicly available online3. Specifications of each blacklists. Based on the distribution we choose a cut-off Unicode confusion table is given in Table 1. Comparing point, the threshold, which captures more than 75% of the the two existing homoglyph tables, we observe that while blacklisted domains. the ‘UC-SimList0.8’ contains more characters in total, it covers much fewer characters with an ASCII homoglyph 5.5. Time Advantage than the ‘Unicode Confusables’ table. This is because the ‘UC-SimList0.8’ covers many characters from the Chinese To explore the potential achievable time advantage we and Japanese alphabet for which an ASCII homoglyph run our method on historic OpenINTEL data and compare does not exist. Another major difference between these this to historic blacklist data, considering the dates when two tables is that the ‘Unicode Confusables’ table provides we detect suspicious homoglyph domains and when these homoglyph strings for Unicode characters that can not be appear on any of the blacklists. Although our method does replaced by a single character. On the other hand, the ‘UC- not aim to detect malicious domains, it gives an early SimList0.8’ does provide multiple homoglyphs for each notification on domains that need to be further investigated Unicode character, if the homoglyphs exist, ordered by to discover possible malicious activity. One way to do this the degree of similarity. In this paper we use the ASCII is studying the web pages of these domains (if existed), homoglyph with the highest similarity score. We realize however this is beyond the scope of this paper. that a Unicode character may have Unicode homoglyphs, but due to computation restraints we focus on ASCII 6. Results homoglyphs in this paper. We compare the performance of the three confusion tables, with respect to our goal, in 6.1. IDN growth Section 6.2. The growth of IDNs in ‘.com’, ‘.net’, ‘.org’ and 3. https://www.tide-project.nl/blog/unicode homoglyphs ccTLDs are depicted in Fig. 2. Two deeps are visible UioeCnuals al hc aeyapa na in appear rarely which table Confusables’ ‘Unicode characters more contains list the since ‘UC-SimList0.8’ iha SI oolp.Hwvr h ‘UC-SimList0.8’ the However, homoglyph. ASCII an with From 2019-11-30. till 2016-01-22 from million 1 top Alexa h ancuei h ucuto hrcesfo the from characters the is cause main The TLDs. rank of change the is behaviour this behind reason The oanname. domain case. this in table Confusables’ ‘Unicode the the outperforms than domains more extract to able is table fusables’ ‘.org’ and of ‘.net’ number to the corresponding characters considering and achieved characters. domains missing are additional results with Similar tables proposed times the both since 6 covers ‘UC-SimList0.8’ the list to times 1.5 Confusables’ up to ‘Unicode up extracts and to average, propose compared on domains we that, homoglyph shows table 4 confusion Fig. the y-axis. filter average the the moving log-scaled a in improve applied and To we per characters figures respectively. added the Unicode TLD, of IDNs readability ‘.com’ of of in number number domains total the added the depict and 5 Fig. day and 5.2. Section 4 in discussed Fig. tables confusion Unicode three the Tables Confusion of Comparison 6.2. significantly has 2019. domains. Alexa July ASCII on from as IDNs increased popular of as number conclude not the to However, are us Alexa domains leads on IDN which IDN that 0.3%, of one to number a contributes average to roughly The month [22]. one from average Alexa day by used policy calculation observed. be can oscillations extreme onwards 2018-02-01 suspicious domains. of homoglyph detection dataset IDN performing our Han for making use exists, valuable homoglyph IDNs extremely ASCII of no 48% which that for Additionally, 2017 domains. in reported Unicode EURid all contains of dataset our 30% on EURid, approximately IDNs by reported million as 7.5 2017 the December IDNs Considering 2019 for the TLDs. 13% with of generic line growth under in negative is a This showing to period. report, shrinking the EURid of by end descent and the steepest beginning at the at the 83.7% have to 92.7% ‘.net’ compared in to period IDNs shrink study the and the decrease of end IDNs least Specifically, the the domains. have of sets ‘.com’ four for in the the seen of in is any trend errors in growth IDNs measurement negative to A due platform. OpenINTEL are which ccTLDs for iue2 eaiegot fIN n‘cm,‘nt,‘og n ccTLDs and ‘.org’ ‘.net’, ‘.com’, in IDNs of growth Relative 2. Figure al rw h xetto htte‘ncd Con- ‘Unicode the that expectation the draws 1 Table using done is pairs domain homoglyph of Extraction the on domains IDN of number the depicts 3 Fig. Growth of the Unicode Domains (%) 100 85 90 95

Jan 2018

Apr 2018

Jul 2018

Measurement Date Oct 2018

Jan 2019

Apr 2019

Jul 2019 ccTLDs .org TLD .net TLD .com TLD Oct 2019 ehv hsnfrti praht ro ntesd of side the on error to approach this for chosen have We seScin54frdtis.Tenraie cr of score normalized The details). for 5.4 Section (see au eeto a iia feto u method. our on effect minimal has selection value ..Dtcinresults Detection 6.3. IDNs for TLD. ‘.com’ table and in each characters day by Unicode per covered total added tables. characters of the Unicode number by the the covered plots characters com- 5 of day Fig. number per the IDNs to observed Unicode pared newly be of in can amount present the observation characters investigating This by tables. are quantified domains, existing further homoglyphs extracted by Unicode of makes covered used number not table frequently the proposed that in our implying difference to large characters a missing of set 3 fbakitddmis hwn httethreshold the that selects showing 0.1 suspicious. domains, of more blacklisted threshold a domain of selecting 93% the intuition that our makes note with we of score However, line 90% higher We in selects a shown. this while that is because domains 0.9, blacklisted domains the of the these 6 threshold Fig. of a In selected scores score blacklists. of normalized on the appearing distribution IDNs calculated detected we for suspicious as main ccTLDs. the to IDNs compared behind a ‘.com’ intent suggests from malicious This extracted of of (61.5%). group one chance large of higher a score relatively has a ‘.com’ with achieve while pairs ccTLDs (66.4%), domain the zero portion of from large score pairs a a to domain Besides, homoglyph corresponding different. the one being of of records score the no of a to all or corresponding the records zero of in of ends score difference two a homoglyph in achieving the range, concentrated that score seen mainly is are it pairs glance do- first domain the homoglyph the shows detected In the 6 pairs. for main Fig. makes scores score these suspicious. of higher more distribution A domain records. Unicode missing score. the of the case of type in calculation record caution two the particular the in a ignored of for is one value type If a record sides. have that both not on does exist number sides the which of by records divided number of values the record summing between by mismatches calculated is suspiciousness records additional ex- the the for queried pairs we domain purpose homoglyph tracted this and For registrations suspiciousness domains. protective a suspicious between compute differentiate we to suspicious, score is it mean ically utemr,w bev htteadto famodest a of addition the that observe we Furthermore, odtrietetrsodo hnt akado- a mark to when of threshold the determine To automat- not does domain homoglyph Unicode a Since Count of the Unicode Domains 2000 3000 4000 5000 6000 7000 8000 iue3 Dsi lx o Mdomains 1M top Alexa in IDNs 3. Figure Jul 2016

Jan 2017

Jul 2017 Measurement Date

Jan 2018

Jul 2018

Jan 2019

Jul 2019 ih15 de D oolpstreigti ASCII this targeting homoglyphs IDN added 1250 with these that strongly feel we blacklisted, not are which dmiswt h ihs on fcrepnighomo- corresponding of count highest the with (domains agtdtems n‘cm,blnst nnilservice financial a name to domain belongs The ‘.com’, TLD. in ‘.com’ industry most in per the 30 summed targeted top domains this of category. of count industry category the the shows determined 7 have Fig. we IDNs), domains a targeted most targeting 30 the IDNs From we homoglyph domain. homoglyphs, of ASCII specific number IDN the suspicious counted by have targeted highly are domains Targeted Top are 6.4. the IDNs with TLD inline generic of is 81% This pages. that parking yet. report intent 2019 actively malicious EURid not a are for domains that used suspicious suggests observation detected which the 0.9, of during of majority threshold blacklist IDNs a a suspicious considering on period, of appeared (99.37%) compared not when 52986 have are blacklists. detections these how our to discuss of we cover some 6.5 doesn’t earlier Section and much In TLD proportionally. ‘.com’ TLDs RBL towards other the that more suggests originate targeted This domains from is ccTLDs. However, suspicious and ‘.org’ of ‘.org’ ccTLDs. ‘.net’, 18% from from the detections one our from of ‘.net’, 320 domain all listings from single RBL a their 15 the and ‘.com’, while From from visually different. originate other are each domain records ASCII DNS resemble and closely Unicode these pairs domains during as many suspicious, RBL are remain in there (0.63%) blacklists While 337 the period. of domains observation one these the on Of appeared 0.9. have suspiciousness of normalized as threshold domains the score 53323 exceeding marked for have suspicious we During period 2019-12-15. detection and our 2018-01-01 between RBL from nodrt e etrisgto h oan which domains the on insight better a get to order In domains against domains detected compared have We Count of the Unicode Domains

Count of the Unicode Characters 10 10 10 10 10 10 10 iue5 hrceswt oolp o de IDNs added for homoglyph a with Characters 5. Figure 2 3 4 1 2 3 4 Jan 2018 Jan 2018 iue4 xrce oolpsfraddIDNs added for homoglyphs Extracted 4. Figure

Apr 2018 Apr 2018

Jul 2018 Jul 2018

Oct 2018 Measurement Date Oct 2018 Measurement Date

Jan 2019 Jan 2019 Homoglyphs extractedusing"UC-SimList0.8" Homoglyphs extractedusing"UnicodeConfusables" Homoglyphs extractedusingproposedtable Added IDNdomains Chars withhomoglyphin"UC-SimList0.8" Chars withahomoglyphin"UnicodeConfusables" Chars withahomoglyphinproposedtable Total Unicodecharacters

Apr 2019 Apr 2019

Jul 2019 Jul 2019

Oct 2019 Oct 2019 0ACIdmis(33)wt vrl 1 D homo- IDN 710 overall with (33.3%) domains ASCII 10 hr ahACIdmi a adu fcorresponding of handful a has domain ASCII each where The domains. homoglyph IDN corresponding 1969 with 03)wsdtce fe naeae2 asof days 21 average On year. a after detected was (0.3%) A domains. homoglyph IDN 354 overall with (16.7%) hscaatrsi sntse n‘og n ccTLDs and TLD ‘.org’ in seen not is characteristic This ihytree oan n‘cm,w bev ht10 providers that service financial observe to we related ‘.com’, are (33.3%) in 30 domains these domains Considering targeted 2019. highly and 2018 throughout domain upcosdet aigadfeetsuc hnthe to than acceptable ethically source not different is a it highly domain, having are ASCII paper to existing this in due method suspicious proposed the by detected Considerations Ethical 7. blacklisted. already method are proposed that the domains the using achievable considering is a detection with early domain detected single most a were and at year, (23.4%) a and and month domains registration a 79 between their difference after month, de- day were a (53.1%) a in domains least same 179 at the these registration, tected on their detected which detected 337 as were by the day (23.2%) of time domains Out 78 the blacklists. domains, the and on suspicious method appear of domains our detection between by days) domains advantage (in time window existing the the calculate against as We approach investigated. is detection blacklists proposed our using advantage Time 6.5. homoglyphs. IDN domain. ASCII per comparably domains with homoglyph TLD of ‘.net’ domains number in ASCII lower seen is 5 behaviour of similar consisting rank providers service third IT the and media achieve social The domains. glyph targeting rank second the in are platforms crypto-currency iue6 omlzdsoe o xrce n lclse domains blacklisted and extracted for scores Normalized 6. Figure lhuhw eiv httenwrgsedIDNs registerd new the that believe we Although advantage time achievable potential the section this In Count of domains Percentage of homoglyph domain pairs 100 10 10 10 10 20 40 60 80 0 0 1 2 3 ...... 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 iue7 o agtddmiso .o”TLD “.com” on domains targeted Top 7. Figure iac rpocrec oilmda&I Other Socialmedia&IT Crypto-currency Finance Score ofdifferentrecords Industry IDN homoglyphdomains ASCII domains RBL ccTLDs .org TLD .net TLD .com TLD publish the list of these domains, specially since we do [2] E. Gabrilovich and A. Gontmakher, “The Homograph Attack,” not study the intent behind these domains. This paper is Communications of the ACM, vol. 45, no. 2, p. 128, 2002. meant to provide awareness among both authorities and [3] A. Costello, “Punycode: A Bootstring encoding of Unicode for end-users with a simple method to help stop malicious Internationalized Domain Names in Applications (IDNA),” Internet usage of IDNs. Requests for Comments, RFC Editor, RFC 3492, March 2003. [4] A. Y. Fu, X. Deng, L. Wenyin, and G. Little, “The Methodology and an Application to Fight against Unicode Attacks,” in Proceedings 8. Discussion of the second symposium on Usable privacy and security. ACM, 2006, pp. 91–101. The measurement data OpenINTEL provides for IDNs [5] N. Roshanbin and J. Miller, “Finding Homoglyphs - A Step towards clearly shows the problem of IDN homoglyph phishing Detecting Unicode-Based Visual Spoofing Attacks,” in International attacks. However, the used dataset shows only the tip of Conference on Web Information Systems Engineering. Springer, 2011, pp. 1–14. the iceberg. As of December 2018 there were 85 ccTLDs [6] H. Suzuki, D. Chiba, Y. Yoneya, T. Mori, and S. Goto, “ShamFinder: supporting IDNs [21]. In this work we have investigated An Automated Framework for Detecting IDN Homographs,” in 7 out of the 85 ccTLDs which support IDNs. This makes Proceedings of the 19th ACM Internet Measurement Conference for a good start since we show the proposed methodology (IMC 2019), 2019. works in detecting homoglyph IDNs, but it does not show [7] . Liu and S. Stamm, “Fighting Unicode-Obfuscated Spam,” in the complete picture of homoglyph IDNs. Extending the Proceedings of the anti-phishing working groups 2nd annual eCrime number of measured ccTLDs which support IDNs would researchers summit. ACM, 2007, pp. 45–59. increase the grasp we have of the problem. Additionally, [8] F. Alvi, M. Stevenson, and P. Clough, “Plagiarism Detection in our proposed Unicode confusion table only covers Unicode Texts Obfuscated with Homoglyphs,” in European Conference on Information Retrieval. Springer, 2017, pp. 669–675. to single ASCII mappings that is relatively computation- [9] A. Y. Fu, X. Deng, and L. Wenyin, “REGAP: A Tool for Unicode- ally inexpensive. With multiple homoglyphs for a single Based Web Identity Fraud Detection,” Journal of Digital Forensic character, either Unicode or ASCII, the Unicode confusable Practice, vol. 1, no. 2, pp. 83–97, 2006. table may be further improved and detect more suspicious [10] V. Krammer, “Phishing Defense against IDN Address Spoofing homoglyph domains. We noticed, during this work, that the Attacks,” in Proceedings of the 2006 International Conference blacklists we have used do not focus on IDNs. Creating on Privacy, Security and Trust: Bridge the Gap Between PST our own blacklist, which specifically focuses on malicious Technologies and Business Services. ACM, 2006, p. 32. IDNs, may be beneficial for the security community at [11] J. Al Helou and S. Tilley, “Multilingual Web Sites: Internation- large. Since we have good indication that the domains we alized Domain Name Homograph Attacks,” in 2010 12th IEEE International Symposium on Web Systems Evolution (WSE). IEEE, detect, are at least suspicious. 2010, pp. 89–92. [12] H. Shirazi, B. Bezawada, and I. Ray, “Kn0w Thy Doma1n Name: 9. Conclusions Unbiased Phishing Detection Using Domain Name Based Features,” in Proceedings of the 23nd ACM on Symposium on Access Control In this paper we investigate how suspicious domains Models and Technologies. ACM, 2018, pp. 69–75. using Unicode homoglyph characters can be detected [13] T. Holgers, D. E. Watson, and S. D. Gribble, “Cutting through the Confusion: A Measurement Study of Homograph Attacks.” in using active DNS measurements. Combining the unique USENIX Annual Technical Conference, General Track, 2006, pp. OpenINTEL dataset with our improved Unicode Confus- 261–266. ables table we are able to detect 53323 domains from [14] B. Qiu, N. Fang, and L. Wenyin, “Detect Visual Spoofing in ‘.com’, ‘.net’ and ‘.org’ TLDs and seven ccTLDs which Unicode-based text,” in 2010 20th International Conference on exceed our ‘suspiciousness’ score threshold. Additionally, Pattern Recognition. IEEE, 2010, pp. 1949–1952. we have shown that our method can potentially detect [15] D. Chiba, A. A. Hasegawa, T. Koide, Y. Sawabe, S. Goto, and suspicious domains on average 21 days earlier when M. Akiyama, “DomainScouter: Understanding the Risks of Decep- compared to blacklists. Furthermore, we show that the tive IDNs,” 2019, pp. 413–426. suspicious IDNs frequently target domains in the finance [16] B. Liu, C. Lu, Z. Li, Y. Liu, H.-X. Duan, S. Hao, and Z. Zhang, “A Reexamination of Internationalized Domain Names: The Good, and crypto-currency industries, followed by domains in the Bad and the Ugly.” in DSN, 2018, pp. 654–665. social media and IT sectors. [17] Y. Sawabe, D. Chiba, M. Akiyama, and S. Goto, “Detection Method of Homograph Internationalized Domain Names with OCR,” Jour- Acknowledgments nal of Information Processing, vol. 27, pp. 536–544, 2019. [18] Y. Elsayed and A. Shosha, “Large Scale Detection of IDN Domain We would like to thank the OpenINTEL project for Name Masquerading,” in 2018 APWG Symposium on Electronic providing us with valuable DNS data. This work has Crime Research (eCrime). IEEE, 2018, pp. 1–11. been funded by SIDN fonds, an independent fund on the [19] F. Quinkert, T. Lauinger, W. Robertson, E. Kirda, and T. Holz, “It’s Not what It Looks Like: Measuring Attacks and Defensive initiative of SIDN, the registrar for ‘.nl’ domains. This Registrations of Homograph Domains,” in 2019 IEEE Conference work has partially been funded by the EU H2020 project on Communications and Network Security (CNS). IEEE, 2019, CONCORDIA (#830927). pp. 259–267. [20] “The Majestic Million.” [Online]. Available: https://majestic.com/ References reports/majestic-million [21] EURid, “IDN World Report.” [Online]. Available: https:// idnworldreport.eu/ [1] L. M. Masinter, M. J. Durst,¨ and M. Suignard, “Internationalized Resource Identifiers (IRI),” Internet Engineering Task Force, [22] V. L. Pochat, T. Van Goethem, S. Tajalizadehkhoob, M. Korczynski,´ Internet-Draft draft-masinter-url-i18n-08, Nov. 2001, work in and W. Joosen, “TRANCO: A Research-Oriented Top Sites Ranking Progress. [Online]. Available: https://datatracker.ietf.org/doc/html/ Hardened Against Manipulation,” arXiv preprint arXiv:1806.01156, draft-masinter-url-i18n-08 2018.