Detect Visual Spoofing in Unicode-Based Text*

2010 International Conference on Pattern Recognition

Bite Qiu, Ning Fang, Liu Wenyin
Department of Computer Science, City University of Hong Kong

* The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 117907].

1051-4651/10 $26.00 © 2010 IEEE. DOI 10.1109/ICPR.2010.480

Abstract

Visual spoofing in Unicode-based text is anticipated to become a severe web security problem in the near future as more and more Unicode-based web documents are used. In this paper, to detect whether a suspicious Unicode character in a word is a visual spoof, the context of the suspicious character is utilized by employing a Bayesian framework. Specifically, two contexts are taken into consideration: simple context and general context. The simple context of a suspicious character is the word in which the character occurs, while the general context consists of all homoglyphs of the character within the Universal Character Set (UCS). Three decision rules are designed and used jointly for judging a suspicious character. Preliminary evaluations and a user study show that the proposed approach can detect Unicode-based visual spoofing with high effectiveness and efficiency.

1. Introduction

There are many similar-looking characters in the Universal Character Set (UCS), which can cause severe web security problems. Unicode-based web homoglyph fraud (a.k.a. homograph attack [1]) is just one such example. A homoglyph is one of two or more characters with shapes that are either identical or cannot be differentiated by instant visual inspection [2]. Therefore, homoglyphs are useful in many content-based attacks, such as phishing attacks and spam attacks. In one real case, a fake “paypаl.com”, in which the second ‘а’ is a Cyrillic letter (U+0430) instead of the Latin ‘a’ (U+0061), was successfully registered in 2005.

We classify visual spoofing defenses for Internationalized Domain Names (IDN) into two categories: transformation and restriction. Punycode, as proposed by the Unicode Consortium [3][4], is a widely adopted transformation technique. It allows non-ASCII characters to be transformed uniquely and reversibly into ASCII characters. For example, the Cyrillic “а” is mapped to the string “xn--80a”. Restriction techniques are usually deployed at the domain name registration level. A top-level-domain registry may apply policies [8] to restrict the usage of homoglyphs in IDN and specify methods to monitor homographic domains. In addition, Liu et al. [5] proposed to color suspicious characters with a fixed color palette or an adaptive color palette when mixed scripts are found within a single word. Finally, browser vendors may integrate the above defenses into their browsers and provide options for users to customize the desired security level [7].

The above defenses take security actions for IDN without knowing whether a real attack is taking place. Even though homoglyphs are potentially exploited for deceptive usage, it is a mistake to conclude that all homoglyphs are malicious or spoofing. For example, a homoglyph of the character ‘l’ should be considered spoofing if it occurs in the context of “PayPal”, but may not be considered spoofing in another, semantic-less context, such as “letter-l”. Both transformation and restriction techniques inconvenience end users and domain name owners. Moreover, these IDN defenses leave plain text in web content unprotected. Therefore, we propose a context-aware method to detect malicious homoglyphs, which allows browsers to take smarter security actions that tackle real attacks and minimize the disturbance to end users.

A Unicode visual spoofing string is usually produced by replacing one or more characters of the legitimate string with their homoglyphs in the characters’ general context. We assume, as a kind of prior knowledge, that a legitimate string occurs on the Web more frequently than its spoofing string. By taking this prior knowledge into consideration, we employ a Bayesian framework to detect a suspicious string. The Bayesian framework also includes the similarities of homoglyphs, which are adopted from Fu et al. [6]. Through a series of evaluations, the model is shown to be effective in identifying whether suspicious characters are spoofing or not.

2. The approach

2.1. Definitions

The proposed approach takes full advantage of the context of a suspicious character/string with a probabilistic Bayesian model. It can mark (e.g., color) spoofing characters in a given string and prevent end users from being deceived. Specifically, we define and distinguish two types of contexts: simple context and general context. The simple context of a character is defined as the set of Unicode characters in the word that contains the character. The general context of a character is defined as the set of homoglyphs of the character in UCS. They are denoted as follows:

\[ SC(c) := \{\, c_s \mid c_s \in w,\ c \in w,\ c_s \neq c \,\} \tag{1} \]

\[ GC(c) := \{\, c_g \mid c_g \sim c,\ c_g \in \mathrm{UCS},\ c \in \mathrm{UCS} \,\} \tag{2} \]

where \(c\) denotes a suspicious character; \(w\) denotes the word in which character \(c\) occurs; \(c_s\) denotes a neighboring character of \(c\) in \(w\) and \(c_g\) denotes a homoglyph of \(c\); \(SC(c)\) denotes the set of neighboring characters of \(c\) in \(w\); \(GC(c)\) is the set of homoglyphs of \(c\). The symbol ‘\(\sim\)’ denotes the visually-similar relation under a specified threshold.

The simple context and the general context of a suspicious character are used as prior knowledge to calculate the prior distribution. The prior distribution of \(c\) is the proportion of the occurring frequency of the original word \(w\) (in which \(c\) occurs) in the total occurring frequencies of all the similar words in which \(c\) is replaced by its homoglyphs (these are referred to as new words in the rest of this paper).

\[ \text{New words} := \{\, w_{w[i]=c_g} \mid \forall c_g \in GC(c) \,\} \tag{3} \]

\[ \text{Prior distribution} := p\big(f(w),\ f(w_{w[i]=c_g})\big) \tag{4} \]

where \(w_{w[i]=c_g}\) denotes a new word in which the i-th character of \(w\) is replaced with \(c_g\); \(f(\cdot)\) denotes the occurring frequency of the corresponding word; \(p(\cdot)\) is the probability density function, that is, the normalized occurring frequency.

2.2. Bayesian inference

For a given suspicious word, we iteratively check each character in the word. If a character is determined to be spoofing, it is highlighted to warn users. Each character \(SC[i]\) (i.e., the i-th character in the simple context/word) is examined in the following way: for each homoglyph \(c_g\) in the general context of \(SC[i]\), the probability of \(c_g\) conditioned on the simple context of \(SC[i]\), represented as \(P(c_g \mid SC)\), is calculated; if the resulting probability of \(SC[i]\), denoted as \(P(SC[i] \mid SC)\), is larger than that of every homoglyph in its general context, \(SC[i]\) is considered a legitimate character; otherwise it is a spoofing character.

\[ SC[i] = \begin{cases} \text{legitimate}, & \forall c_g \in GC(SC[i]):\ P(SC[i] \mid SC) > P(c_g \mid SC) \\ \text{spoofing}, & \text{else} \end{cases} \tag{5} \]

where \(P(SC[i] \mid SC)\) and \(P(c_g \mid SC)\) can be derived with the following formula:

\[ P(x \mid SC) = \frac{P(SC \mid x) \cdot P(x)}{P(SC)} = A \cdot P(SC \mid x) \cdot P(x) \tag{6} \]

where \(x \in GC(SC[i])\), \(i\) denotes the position of the suspicious character, and \(A\) is the constant \(1/P(SC)\). \(P(SC \mid x)\) can be derived as:

\[ P(SC \mid x) = \frac{f(w_{w[i]=x}) \cdot Sim(x, SC[i])}{\sum_{c_g \in GC(SC[i])} f(w_{w[i]=c_g}) \cdot Sim(c_g, SC[i])} \tag{7} \]

where \(f(w_{w[i]=x})\) is the frequency of \(w_{w[i]=x}\), which is taken as the number of results returned by Google with \(w_{w[i]=x}\) as the query, and \(Sim(x, c_g)\) is the visual similarity between \(x\) and \(c_g\), derived with an application proposed by Fu et al. [6].

The value of \(P(x)\) can be obtained as follows:

\[ P(x) = \frac{f(x)}{\sum_{c \in \mathrm{UCS}} f(c)} \tag{8} \]

2.3. Decision rules based on context

In practice, it is not necessary to examine all possible words exhaustively. Hence, three heuristic decision rules are developed to prune away unnecessary computations.

A regular text (e.g., URL/webpage/e-mail) generally includes only one or a few languages; its simple context is thus limited to a small set of Unicode groups/subgroups. For example, English usually adopts Latin scripts, and Chinese uses CJK scripts. Therefore, we define the first rule for verifying a suspicious character as follows:

Rule 1: A legitimate string tends to involve a limited number of Unicode groups/subgroups in UCS, usually only one group/subgroup. If the Unicode group of a character is different from that of its neighboring characters in its simple context, the character is judged as spoofing; otherwise it is judged as legitimate. The rule is denoted as follows:

\[ c = \begin{cases} \text{spoofing}, & UG(c) \neq UG(c_s) \\ \text{legitimate}, & \text{else} \end{cases} \tag{9} \]

where the function \(UG(c)\) denotes the Unicode group/subgroup of \(c\).

The above rule is not sufficient to detect spoofing, because there are many legitimate usages of mixed scripts. In particular, it is quite common to mix English words (with Latin characters) with other languages, including languages using non-Latin scripts. Even in English, legitimate product/organization names may contain non-Latin characters, such as Ωmega, Teχ, Toys-Я-Us, and HλLF-LIFE. Moreover, English also adopts some words from other languages, e.g., “résumé” [...]

[...] Unicode visual spoofing detection. Moreover, any new rule can also be generated and added to this approach according to the user’s experience and knowledge. In addition, a larger simple context is sometimes necessary, such as the phrase or the whole sentence in which the suspicious character occurs.

3. Experiments and Evaluation

In this paper, we frequently need to know the visual similarity of two readable characters.
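
As a small check of the Punycode transformation mentioned in the Introduction (Cyrillic “а”, U+0430, mapped to “xn--80a”), the mapping can be reproduced with Python's built-in `punycode` and `idna` codecs. This is only an illustration; the helper name `to_ascii_label` is hypothetical and does not come from the paper.

```python
# Illustrative only: reproduce the "xn--80a" example with Python's built-in codecs.
CYRILLIC_A = "\u0430"  # Cyrillic small letter a, a homoglyph of Latin "a"


def to_ascii_label(label: str) -> str:
    """Encode one domain label with Python's built-in IDNA (2003) codec."""
    return label.encode("idna").decode("ascii")


# Raw Punycode, without the "xn--" ACE prefix used in domain names.
raw = CYRILLIC_A.encode("punycode").decode("ascii")

# Full ASCII-compatible form of the label, as it would appear in a domain name.
ace = to_ascii_label(CYRILLIC_A)

# The transformation is reversible: decoding restores the Cyrillic letter.
restored = ace.encode("ascii").decode("idna")
```

The round trip shows why Punycode alone does not tell a user whether a label is an attack: it only changes how the label is displayed.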
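
The Bayesian decision of Section 2.2 (Eqs. 5 to 8) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the word frequencies are a toy dictionary standing in for Google result counts, the character frequencies stand in for f(c) over UCS, and the similarity function is a placeholder for the measure of Fu et al. [6]. The sketch also assumes that the character itself is compared against its homoglyphs as one of the candidates.

```python
from typing import Callable, Dict, Iterable


def posterior_scores(
    word: str,
    i: int,
    homoglyphs: Iterable[str],
    word_freq: Dict[str, int],
    char_freq: Dict[str, int],
    sim: Callable[[str, str], float],
) -> Dict[str, float]:
    """Unnormalized P(x | SC) for word[i] and each of its homoglyphs."""
    original = word[i]
    # Assumption: the original character is itself one of the candidates.
    candidates = list(dict.fromkeys([original, *homoglyphs]))

    def variant(x: str) -> str:
        # The "new word" in which the i-th character is replaced by x.
        return word[:i] + x + word[i + 1:]

    # Eq. (7): word frequency weighted by visual similarity, then normalized.
    weights = {x: word_freq.get(variant(x), 0) * sim(x, original) for x in candidates}
    weight_total = sum(weights.values())
    # Eq. (8): character prior; the toy table stands in for a sum over UCS.
    char_total = sum(char_freq.values())

    scores = {}
    for x in candidates:
        likelihood = weights[x] / weight_total if weight_total else 0.0
        prior = char_freq.get(x, 0) / char_total if char_total else 0.0
        # Eq. (6): the constant A = 1/P(SC) is the same for all x, so it is dropped.
        scores[x] = likelihood * prior
    return scores


def classify_char(word, i, homoglyphs, word_freq, char_freq, sim) -> str:
    """Eq. (5): legitimate only if word[i] strictly beats every homoglyph."""
    scores = posterior_scores(word, i, homoglyphs, word_freq, char_freq, sim)
    own = scores[word[i]]
    others = [s for x, s in scores.items() if x != word[i]]
    return "legitimate" if all(own > s for s in others) else "spoofing"


# Toy data (invented for illustration, not measured values).
CYR_A = "\u0430"
toy_word_freq = {"paypal": 1_000_000, "payp" + CYR_A + "l": 10}
toy_char_freq = {"a": 8_000, CYR_A: 2_000}


def toy_sim(x: str, y: str) -> float:
    return 1.0 if x == y else 0.9


spoofed = classify_char("payp" + CYR_A + "l", 4, ["a"], toy_word_freq, toy_char_freq, toy_sim)
genuine = classify_char("paypal", 4, [CYR_A], toy_word_freq, toy_char_freq, toy_sim)
```

Because Eq. (5) uses a strict inequality, a tie (for example, when no candidate word has any recorded frequency) is reported as spoofing in this sketch.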
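
Rule 1 of Section 2.3 (Eq. 9) can be sketched in the same spirit. Python's standard library does not expose Unicode script or group properties directly, so this sketch uses the first word of the Unicode character name (e.g., "LATIN", "CYRILLIC") as a rough stand-in for UG(c); that proxy, the function names, and the choice to flag a character only when none of its neighbors shares its group are all assumptions made for illustration.

```python
import unicodedata


def unicode_group(ch: str) -> str:
    """Rough stand-in for UG(c): the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def rule1(word: str, i: int) -> str:
    """Flag word[i] as spoofing if no neighboring letter shares its group."""
    c = word[i]
    # Simple context per Eq. (1): characters of the word other than c.
    # Assumption: only alphabetic neighbors are compared, so digits and
    # punctuation do not count as a separate "group".
    neighbors = {s for s in word if s != c and s.isalpha()}
    if not neighbors:
        return "legitimate"
    group = unicode_group(c)
    if any(unicode_group(s) == group for s in neighbors):
        return "legitimate"
    return "spoofing"


cyrillic_in_latin = rule1("payp\u0430l", 4)   # Cyrillic letter among Latin letters
plain_latin = rule1("paypal", 4)              # all letters are Latin
latin_next_to_cyrillic = rule1("payp\u0430l", 0)
greek_brand = rule1("\u03a9mega", 0)          # Greek capital omega + "mega"
```

The last call illustrates the limitation discussed in Section 2.3: a legitimate mixed-script name such as Ωmega is also flagged by this rule alone.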
