2010 International Conference on Pattern Recognition

Detect Visual Spoofing in Unicode-based Text*

Bite Qiu, Ning Fang, Liu Wenyin
Department of Computer Science, City University of Hong Kong

Abstract

Visual spoofing in Unicode-based text is anticipated to become a severe web security problem in the near future, as more and more Unicode-based web documents come into use. In this paper, to detect whether a suspicious Unicode character in a word is visual spoofing or not, the context of the suspicious character is utilized by employing a Bayesian framework. Specifically, two contexts are taken into consideration: simple context and general context. The simple context of a suspicious character is the word where the character exists, while the general context consists of all homoglyphs of the character within the Universal Character Set (UCS). Three decision rules are designed and used jointly for convicting a suspicious character. Preliminary evaluations and a user study show that the proposed approach can detect Unicode-based visual spoofing with high effectiveness and efficiency.

1. Introduction

There are many similar-looking characters in the Universal Character Set (UCS), which can cause severe web security problems. Unicode-based web homoglyph fraud (a.k.a. homograph attack [1]) is just one such example. A homoglyph is one of two or more characters with shapes that are either identical or cannot be differentiated by instant visual inspection [2]. Homoglyphs are therefore useful in many content-based attacks, such as phishing attacks and spam attacks. In a real case, a faked “paypаl.com”, in which the second ‘а’ is a Cyrillic letter (U-0430) instead of the Latin ‘a’ (U-0061), was successfully registered in 2005.

We classify visual spoofing defenses in IDN into two categories: transformation and restriction. Punycode, as proposed in [3][4], is a widely adopted transformation technique. It allows non-ASCII characters to be transformed uniquely and reversibly into ASCII characters. For example, Cyrillic “а” is mapped to the string “xn--80a”. Restriction techniques are usually deployed at the domain name registration level. A top-level-domain registry may apply policies [8] to restrict the usage of homoglyphs in IDN and specify methods to monitor homographic domains. In addition, Liu et al. [5] proposed to color suspicious characters with a fixed color palette or an adaptive color palette when mixed scripts are found within a single word. Finally, browser vendors may integrate the above defenses into their browsers and provide options for users to customize the desired security level [7].

The above defenses take security actions for IDN without knowing whether a real attack is taking place. Even though homoglyphs can be exploited for deception, it is a mistake to conclude that all homoglyphs are malicious or spoofing. For example, homoglyphs of character ‘’ should be considered as spoofing if they appear in the context of “PayPal”, but may not be considered as spoofing in another, semantic-less context, such as “-l”. Both transformation and restriction techniques inconvenience end users and domain name owners. Moreover, these IDN defenses leave plain text in web content unprotected. Therefore, we propose a context-aware method to detect malicious homoglyphs, which allows browsers to take smarter security actions that tackle real attacks and minimize the disturbance to end users.

A Unicode visual spoofing string is usually produced by replacing one or more characters of the legitimate string with their homoglyphs in the characters’ general context. We assume, as a kind of prior knowledge, that a legitimate string occurs on the Web more frequently than its spoofing string. Taking this prior knowledge into consideration, we employ a Bayesian framework to detect a suspicious string. The Bayesian framework also includes the similarities of homoglyphs, which are adopted from Fu et al. [6]. A series of evaluations shows that the model is effective in identifying suspicious characters as spoofing or not.
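As a small illustration of the Punycode transformation mentioned in the introduction, the sketch below uses Python's built-in `idna` codec (which follows the older IDNA 2003 rules) to map the Cyrillic letter from the example to its ASCII-compatible form and back. The helper names `to_ace` and `from_ace` are illustrative assumptions and do not come from the paper.

```python
def to_ace(name: str) -> str:
    """Encode a Unicode domain name or label into its ASCII-compatible form."""
    return name.encode("idna").decode("ascii")


def from_ace(name: str) -> str:
    """Decode an ASCII-compatible name back into Unicode."""
    return name.encode("ascii").decode("idna")


cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A, a homoglyph of Latin 'a'

print(to_ace(cyrillic_a))                  # xn--80a, the mapping quoted in the text
print(from_ace("xn--80a") == cyrillic_a)   # True: the transformation is reversible

# A pure-ASCII name passes through unchanged, while the spoofed name,
# which looks the same on screen, becomes a visibly different "xn--" label.
print(to_ace("paypal.com"))
print(to_ace("payp\u0430l.com"))
```

The transformation makes the two names distinguishable in their ASCII form, but it says nothing about whether the non-ASCII name is malicious, which is the gap the context-aware method targets.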

* The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 117907].

1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.480

2. The approach

2.1. Definitions

The proposed approach takes full advantage of the context of a suspicious character/string with a probabilistic Bayesian model. It can mark (e.g., color) spoofing characters in a given string and so prevent end users from being deceived. Specifically, we define and distinguish two types of contexts: simple context and general context. The simple context of a character is defined as the set of Unicode characters in the word where the character exists. The general context of a character is defined as the set of homoglyphs of the character in UCS. They are denoted as follows:

$$ SC(c) := \{\, c_s \mid c_s \in w,\ c \in w,\ c_s \neq c \,\} \tag{1} $$

$$ GC(c) := \{\, c_g \mid c_g \sim c,\ c_g \in UCS,\ c \in UCS \,\} \tag{2} $$

where $c$ denotes a suspicious character; $w$ denotes the word where character $c$ exists; $c_s$ denotes a neighboring character around character $c$ in $w$, and $c_g$ denotes a homoglyph of $c$; $SC(c)$ denotes the set of neighboring characters of $c$ in $w$; $GC(c)$ is the set of homoglyphs of $c$. The symbol ‘$\sim$’ denotes the visually-similar relation under a specified threshold.

The simple context and the general context of a suspicious character are used as prior knowledge to calculate the prior distribution. The prior distribution of $c$ is the proportion of the occurrence frequency of the original word $w$ (where $c$ exists) in the total occurrence frequency of all the similar words (referred to as new words in the rest of this paper), in which $c$ is replaced by its homoglyphs.

$$ \text{New words: } \{\, w_{w[i]=c_g} \mid \forall c_g \in GC(c) \,\} \tag{3} $$

$$ \text{Prior distribution: } p(w) = \frac{f(w)}{\sum_{c_g \in GC(c)} f(w_{w[i]=c_g})} \tag{4} $$

where $w_{w[i]=c_g}$ denotes a new word in which the $i$-th character of $w$ is replaced with $c_g$; $f(\cdot)$ denotes the occurrence frequency of the corresponding word; $p(\cdot)$ is the probability function, that is, the normalized occurrence frequency.

2.2. Bayesian inference

For a given suspicious word, we iteratively check each character in the word. If a character is determined to be spoofing, it is highlighted to warn users. Each character $SC[i]$ (i.e., the $i$-th character in the simple context/word) is examined in the following way: for each homoglyph $c_g$ in the general context of $SC[i]$, the probability of $c_g$ conditioned on the simple context of $SC[i]$, represented as $P(c_g \mid SC)$, is calculated; if the resulting probability of $SC[i]$, denoted as $P(SC[i] \mid SC)$, is larger than that of every homoglyph in its general context, $SC[i]$ is considered a legitimate character, otherwise it is a spoofing character.

$$ SC[i] = \begin{cases} \text{legitimate}, & \forall c_g \in GC(SC[i]),\ P(SC[i] \mid SC) > P(c_g \mid SC) \\ \text{spoofing}, & \text{else} \end{cases} \tag{5} $$

where $P(SC[i] \mid SC)$ and $P(c_g \mid SC)$ can be derived with the following formula:

$$ P(x \mid SC) = \frac{P(SC \mid x) \cdot P(x)}{P(SC)} = A \cdot P(SC \mid x) \cdot P(x) \tag{6} $$

where $x \in GC(SC[i])$, $i$ denotes the position of the suspicious character, and $A$ is the constant $1/P(SC)$. $P(SC \mid x)$ can be derived as:

$$ P(SC \mid x) = \frac{f(w_{w[i]=x}) \cdot Sim(x, SC[i])}{\sum_{c_g \in GC(SC[i])} f(w_{w[i]=c_g}) \cdot Sim(c_g, SC[i])} \tag{7} $$

where $f(w_{w[i]=x})$ is the frequency of $w_{w[i]=x}$, valued as the number of results returned by Google with $w_{w[i]=x}$ as the query, and $Sim(\cdot, \cdot)$ is the visual similarity between two characters, derived with an application proposed by Fu et al. [6].

The value of $P(x)$ can be obtained as follows:

$$ P(x) = \frac{f(x)}{\sum_{c \in UCS} f(c)} \tag{8} $$

2.3. Decision rules based on context

In practice, heuristics make it unnecessary to examine all possible words. Hence, three heuristic decision rules are developed to prune away unnecessary computations.

A regular text (e.g., URL/webpage/e-mail) generally includes only one or a few languages; its simple context is thus limited to a small set of Unicode groups/subgroups. For example, English usually adopts Latin scripts, and Chinese uses CJK scripts. Therefore, we define the first rule for verifying a suspicious character as follows:

Rule 1: A legitimate string tends to involve a limited number of Unicode groups/subgroups in UCS, usually only one group/subgroup. If the Unicode group of a character is different from that of its neighboring characters in its simple context, the character is judged as spoofing, or legitimate otherwise. The rule is denoted as follows,

$$ c = \begin{cases} \text{legitimate}, & UG(c) = UG(c_s) \\ \text{spoofing}, & \text{else} \end{cases} \tag{9} $$

where the function $UG(c)$ denotes the Unicode group/subgroup of $c$.

The above rule is not sufficient to detect spoofing, because there are many legitimate usages of mixed scripts. In particular, it is quite common to mix English words (with Latin characters) with other languages, including languages using non-Latin scripts. Even in English, legitimate product/organization names may contain non-Latin characters, such as Ωmega, Teχ, Toys-Я-Us, and HλLF-LIFE. Moreover, English also adopts some words from other languages, e.g., “résumé” and “naïve”. Based on the above analysis, we define the second rule as follows, which is also a complement of Rule 1.

Rule 2: Although a suspicious character belongs to a different Unicode group from its neighboring characters, if there is no character in the latter group that is visually similar to the suspicious character, the suspicious character is judged as legitimate, otherwise as spoofing. The rule can be described as follows.

$$ c = \begin{cases} \text{legitimate}, & UG(c) \neq UG(c_s) \text{ and } \forall c' \in UG(c_s),\ c \propto c' \\ \text{spoofing}, & \text{else} \end{cases} \tag{10} $$

where $c'$ denotes a character in a Unicode group and $\propto$ represents a dissimilar relationship.

Homoglyphs may also be found within the same Unicode group/subgroup. For example, the lowercase letter ‘l’ and the digit ‘1’ in ASCII are visually confusable, which makes it difficult to distinguish a legitimate “paypal” from a faked string “paypa1” (letter ‘l’ is replaced by digit ‘1’). However, if we search for the legitimate word “paypal” and its faked word “paypa1” in Google, the numbers of returned results are significantly different (326,000,000 and 2,480, respectively). Therefore, in this case, the faked word can be detected based on the above Bayesian inference (in Section 2.2), which forms the third rule as follows.

Rule 3: Usually a legitimate Unicode-based word can be discovered in many more web pages than its visual spoofing words. For a suspicious character $c$ in a word $SC$, if $P(c \mid SC)$ is larger than a threshold or is the largest among all $P(c_g \mid SC)$, $c$ is judged as legitimate, otherwise as spoofing. The rule is denoted as follows,

$$ c = \begin{cases} \text{legitimate}, & UG(c) = UG(c_s) \text{ and } \forall c_g \in GC(c),\ P(c \mid SC) > P(c_g \mid SC) \\ \text{spoofing}, & \text{else} \end{cases} \tag{11} $$

where $P(c \mid SC)$ denotes the posterior probability of the suspicious character $c$; $UG(c) = UG(c_s)$ indicates that the Unicode group/subgroup of the suspicious character $c$ is the same as that of most neighboring characters $c_s$ included in its simple context.

To detect a suspicious word, the above three rules are used jointly to improve the performance of Unicode visual spoofing detection. Moreover, new rules can be generated and added to this approach according to the user’s experience and knowledge. In addition, a larger simple context is sometimes necessary, such as the phrase or the whole sentence where the suspicious character exists.

3. Experiments and Evaluation

In this paper, we frequently need to know the visual similarity of two readable characters. This can be done by an image similarity assessment algorithm. As the dataset of UCS is quite large, consisting of tens of thousands of readable characters, and considering the overall performance (less than 1 second to process a short paragraph, not discussed in this paper due to space limits), we adopt the simple but fast and effective pixel-overlapping algorithm proposed by Fu et al. [6]. We set the threshold of visual similarity to 0.9. That is, two Unicode characters with visual similarity over 0.9 are considered homoglyphs of each other, and each is thus a member of the other’s general context. For example, under the threshold of 0.9, the Latin character ‘a’ (U-0061) has four members in its general context: ‘a’ (U-0061), ‘а’ (U-0430), ‘ａ’ (U-FF41) and ‘ạ’ (U-1EA1).

Figures 1-3 show the detection results of our prototype system on a sentence at level 1, 2, and 3, respectively. We substitute certain characters in the sentence with five different Unicode visual spoofing characters respectively. In the result column, the characters judged as spoofing are marked in red. Level 1 indicates that only Rule 1 is used to judge a suspicious character; level 2 indicates that both Rule 1 and Rule 2 are used; level 3 indicates that all three rules are used. In Figure 1, three spoofing characters, i.e., ‘a’, ‘m’, ‘b’, ‘е’, are detected correctly, and one spoofing character, i.e., ‘I’, is missed. Meanwhile, there are also a number of false alarms, i.e., “中国银行”. In Figure 2, three spoofing characters, i.e., ‘е’ (Unicode 0x0435), ‘ｍ’ (Unicode 0xFF4D), ‘ｂ’ (Unicode 0xFF42), are detected correctly, and one spoofing character, i.e., ‘I’ (Unicode 0x0049), is missed. In Figure 3, all four spoofing characters are detected correctly. The precision and the recall of level 3 are always higher than or equal to those of level 2, and those of level 2 are higher than or equal to those of level 1.

Figure 1: Results of detection at level 1
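The three detection levels can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the prototype system: the similarity scores in `SIM` are made-up stand-ins for the tool of Fu et al. [6], `ug` approximates the Unicode group by a handful of Unicode block ranges, P(x) in Eq. (6) is taken as uniform, words missing from `FREQ` are treated as never seen, and the way the rules are chained in `judge` is one plausible reading of "used jointly". Only the two hit counts for “paypal” and “paypa1” come from the text.

```python
from collections import Counter

THRESHOLD = 0.9  # similarity threshold adopted in the paper

# Hypothetical similarity scores (stand-ins for the tool of Fu et al. [6]).
SIM = {
    frozenset(("a", "\u0430")): 0.95,  # Latin a / Cyrillic а
    frozenset(("l", "1")): 0.92,       # Latin l / digit 1
}

# Hit counts quoted in Section 2.3; unknown words count as 0 (assumption).
FREQ = {"paypal": 326_000_000, "paypa1": 2_480}

# A few Unicode block ranges as a coarse stand-in for UG(c).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]


def ug(ch):
    """Unicode group of ch, approximated by its block."""
    cp = ord(ch)
    return next((name for lo, hi, name in BLOCKS if lo <= cp <= hi), "Other")


def sim(x, y):
    return 1.0 if x == y else SIM.get(frozenset((x, y)), 0.0)


def general_context(c):
    """GC(c): c itself plus every known character with similarity over THRESHOLD."""
    known = {ch for pair in SIM for ch in pair}
    return {c} | {x for x in known if sim(x, c) > THRESHOLD}


def neighbour_group(word, i):
    """Most common Unicode group among the neighbours of word[i]."""
    groups = Counter(ug(ch) for j, ch in enumerate(word) if j != i)
    return groups.most_common(1)[0][0]


def posterior(word, i):
    """P(x | SC) for x in GC(word[i]), per Eqs. (6)-(7) with uniform P(x)."""
    c = word[i]
    scores = {x: FREQ.get(word[:i] + x + word[i + 1:], 0) * sim(x, c)
              for x in general_context(c)}
    total = sum(scores.values())
    if total == 0:  # no frequency evidence: fall back to a uniform posterior
        return {x: 1.0 / len(scores) for x in scores}
    return {x: s / total for x, s in scores.items()}


def judge(word, i, level):
    """Classify word[i] using Rule 1 (level 1), Rules 1-2 (level 2) or Rules 1-3 (level 3)."""
    c, target = word[i], neighbour_group(word, i)
    if ug(c) != target:
        if level == 1:
            return "spoofing"  # Rule 1: mixed groups are suspicious
        lookalike = any(ug(x) == target for x in general_context(c) if x != c)
        return "spoofing" if lookalike else "legitimate"  # Rule 2
    if level < 3:
        return "legitimate"  # same group passes Rules 1 and 2
    post = posterior(word, i)  # Rule 3: compare posteriors (ties count as legitimate)
    return "legitimate" if post[c] >= max(post.values()) else "spoofing"


def spoofed_positions(word, level):
    return [i for i in range(len(word)) if judge(word, i, level) == "spoofing"]


print(spoofed_positions("payp\u0430l", 1))  # [4]: Cyrillic а among Latin letters
print(spoofed_positions("\u03a9mega", 1))   # [0]: false alarm under Rule 1 alone
print(spoofed_positions("\u03a9mega", 2))   # []:  Rule 2 finds no Latin look-alike
print(spoofed_positions("paypa1", 2))       # []:  same group, so the digit is missed
print(spoofed_positions("paypa1", 3))       # [5]: Rule 3 catches it via frequencies
```

With these toy inputs the sketch reproduces the qualitative behaviour reported for the levels: level 2 removes a mixed-script false alarm that level 1 raises, and only level 3 catches a same-group substitution such as ‘1’ for ‘l’.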

Figure 2: Results of detection at level 2

Figure 3: Results of detection at level 3

In another case, we select the top 20 domain names from http://www.alexa.com/topsites. All words generated by replacing characters with their homoglyphs are exhaustively tested at level 1, level 2, and level 3, respectively. The numbers of false alarms at the different levels are also recorded. Table 1 lists only 5 of them.

Table 1: Number of false alarms of top 5 domain names after replacements of all similar characters in terms of different levels

Domain name     Replacement number   Level 1   Level 2   Level 3
google.com      22000                2000      2000      0
yahoo.com       2000                 0         0         0
facebook.com    187500               0         0         0
youtube.com     16000                0         0         0
live.com        2310                 210       210       0

In Table 1, replacement number denotes the number of words after replacements of all homoglyphs; Level X denotes the number of false alarms at the corresponding level. Since letter ‘I’ (Unicode 0x0049) is visually similar to letter ‘l’ (Unicode 0x006C) at threshold 0.9, and both letters belong to the same Unicode group, i.e., Latin, two words, namely “googIe” and “Iive”, are judged as legitimate at both Level 1 and Level 2, but they are judged correctly as spoofing at Level 3.

In addition, we conducted a user study based on two datasets to examine how well our method helps users improve effectiveness and efficiency in visual spoofing detection. One dataset is preprocessed with coloring hints generated by our method. Seven participants are required to find spoofing characters in both datasets. The results show 100% precision for both datasets; however, the average recall improved from 49% to 93% with the help of the colored hints. Based on the results of the user study, we conclude that machine-generated hints can assist end users to notably improve effectiveness and efficiency in detecting text-based homoglyph attacks.

4. Conclusion

In this paper, we first define the simple context and the general context of a suspicious character, based on which the probability of the character occurring in its context can be calculated. A Bayesian framework is then used to calculate the posterior distribution of the suspicious character. If the probability of the suspicious character is above a threshold or maximal among all the probabilities of its homoglyphs, the character is detected as a legitimate character, otherwise as spoofing. We also use three decision rules to improve the performance of spoofing detection in a practical prototype system.

The proposed context-based approach can easily be applied as a browser plug-in to benefit end users. It differs from existing solutions, which either impose restrictions on users, agents, programmers, and registrar organizations, or map the Unicode scripts into a uniform format but lose some of the original semantics of characters. Our approach places no restriction on users and loses no semantics. Preliminary evaluations and a user study show that the proposed approach can improve the accuracy of Unicode visual spoofing detection and assist human judgment in Unicode visual spoofing detection effectively and efficiently.

References

[1] E. Gabrilovich, A. Gontmakher. The Homograph Attack. Communications of the ACM, 45(2), 2002.
[2] http://en.wikipedia.org/wiki/Homoglyph
[3] http://www.unicode.org/reports/tr39/
[4] http://www.unicode.org/reports/tr36/
[5] W. Liu, Y. Fu, X. Deng. Expose Homograph Obfuscation Intentions by Coloring Unicode Strings. APWeb 2008, pp. 275-286.
[6] Y. Fu, X. Deng, W. Liu. REGAP: A Tool for Unicode-based Web Identity Fraud Detection. Journal of Digital Forensic Practice (JDFP), Vol. 1, No. 2, 2006.
[7] http://en.wikipedia.org/wiki/IDN_homograph_attack
[8] http://www.faqs.org/rfcs/rfc3743.html
