
Weaponizing Unicode with Deep Learning - Identifying Homoglyphs with Weakly Labeled Data

Perry Deng§, Cooper Linsky§, Matthew Wright
Global Cybersecurity Institute, Rochester Institute of Technology, Rochester, USA
[email protected], [email protected], [email protected]

Abstract—Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand the capabilities of an attacker to identify homoglyphs – particularly ones that have not been previously spotted – and leverage them in attacks. We investigate a deep-learning model using embedding learning, transfer learning, and data augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normalized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which is more efficient than pairwise information for security practitioners to quickly look up homoglyphs or to normalize confusable string encodings. To measure clustering performance, we propose a metric (mBIOU) building on the classic Intersection-Over-Union (IOU) metric. Our clustering method achieves 0.592 mBIOU, compared to 0.430 for the naive baseline. We also use our model to predict over 8,000 previously unknown homoglyphs, and find good early indications that many of these may be true positives. Source code and the list of predicted homoglyphs are uploaded to Github: https://github.com/PerryXDeng/weaponizing_unicode

Index Terms—homoglyphs, unicode, cybersecurity

arXiv:2010.04382v4 [cs.CR] 22 Dec 2020

This material is based upon work supported by the National Science Foundation under Award No. 1816851. §These authors contributed equally to this work. ©2020 IEEE

I. INTRODUCTION

Identical-looking characters, known as homoglyphs, can be used to substitute characters in strings to create visually similar strings. These strings can then be used to trick users into clicking on malicious domains passing as legitimate ones [1], or to evade spam and plagiarism detectors [2]. While the security risks of homoglyphs have been known since the turn of the century [1], the problem has gotten more attention as Unicode becomes the predominant standard for text processing. Unlike ASCII, which contains only 128 characters, the latest Unicode standard encapsulates more than 140 thousand characters. Among the Unicode characters, many have been identified to be homoglyphs. This allows scenarios where the entire string of a domain name can be substituted with homoglyphs that come from different characters and appear visually identical (see Figures 1 and 2).

Fig. 1. A screenshot of apple.com, which uses the Latin ...
Fig. 2. A screenshot of [3]'s domain, which uses the ...

A complete database of all homoglyphs within the Unicode standard would be helpful for us to assess the risks of systems utilizing Unicode. Due to the sheer size of the Unicode standard, however, manual identification of all homoglyphs is infeasible.

Prior works [4]–[6] on automated homoglyph identification have a few critical shortcomings: 1) they are only used to identify homoglyphs for few characters, and/or 2) they do not cluster homoglyphs into distinct sets, and/or 3) they do not present quantifiable evidence on the accuracy of their approach. We seek to address these shortcomings with empirical evidence on a method based on deep learning, which has performed tremendously well on computer vision problems such as object classification [7] and medical imaging [8].

Unlike many works in computer vision, we do not have a large number of human-labeled examples for each visual class of data, because we do not know the complete sets of homoglyphs. We only know that the overwhelming majority of characters are not homoglyphs. To address this challenge, we propose a novel model that adapts data augmentation, transfer learning, and embedding learning from similarly challenging deep-learning problems such as facial recognition. We fine-tune our neural network on weakly labeled data by exploiting the fact that, in general, different characters do not look alike.

We compare our deep learning model to the approach of Roshanbin and Miller [4] on the pairwise identification of homoglyphs. We show that our method significantly outperforms theirs in both accuracy (0.91 vs 0.64) and average precision (0.97 vs 0.75). We also attempt clustering of homoglyphs into distinct sets, and propose to measure the performance and generalizability of our clustering model with mBIOU, a metric based on Intersection Over Union. Our approach achieves an mBIOU of 0.592 between the predicted sets of homoglyphs and the known homoglyphs from the Unicode Consortium [9]. In addition, we show that our learned deep embeddings can be used to find many previously unidentified homoglyphs.

In summary, our contributions are as follows:
• We propose a novel deep learning model for identifying homoglyphs that overcomes the lack of training data by leveraging embedding learning and transfer learning, together with data augmentation methods.
• We are the first to attempt clustering of homoglyphs into distinct sets, and we propose mBIOU, a way to systematically measure the accuracy of clustered sets.
• We show that our approach outperforms the prior work on the pairwise identification task, with an average precision of 0.97 compared to 0.79. Our approach also achieves 0.592 mBIOU on the clustering task, compared to a naive baseline of 0.430.
• We apply our model to find new homoglyphs, and visualize a random sample of predicted sets. 8,452 unrecorded homoglyphs were predicted with our approach, and visual inspection of our random sample suggests that some of these are likely to be true positives.

II. BACKGROUND

A. Unicode

Unicode [10] is the predominant technology standard for the encoding, representation, and processing of text. It is maintained by the Unicode Consortium, and is backward-compatible with the legacy ASCII. Unicode defines a set of integers, or codepoints, ranging from 0 to 0x10FFFF, which is hexadecimal for slightly more than one million. The entire set of codepoints is called the code space. The codepoints can be mapped to characters, so that characters can be encoded and processed as unique numbers by computers. As of Unicode Version 13.0, released in March 2020, there are 143,859 characters mapped to codepoints in the standard implemented for public use. For human readability, a codepoint can either be rendered into an image with a font, or printed in hex format with "U+". For example, the lowercase first letter of the Latin alphabet can be 'a' or U+0061.

B. Homoglyphs, Homographs, and Confusables

A homoglyph is a character that is visually similar to another character. For example, U+0049, or 'I', can appear similar to U+006C, or 'l'. Homoglyphs can be used to substitute characters in strings of one or more characters to form homographs, or confusables in Unicode terminology.

Confusables pose significant security risks. They can fool users into unintended actions, such as malicious domains masquerading as legitimate websites [1]. Confusables can also cause text to slip past mechanical gatekeepers, such as a spam or plagiarism detector [2].

The Unicode Consortium maintains a database [9] of confusables to help mitigate security risks. The incomplete database maps visually similar strings of one or more codepoints to equivalence classes, identified by their respective "prototype" string. With this mapping, security practitioners can quickly look up homoglyphs for a given string, or normalize confusable strings into a single encoding [9]. Formally, given the set of possible Unicode strings U*, and the equivalence relation "is visually similar to", ≡, on U*, the confusable equivalence class S of a string ~u is:

S(~u) = {~s ∈ U* | ~s ≡ ~u, ~u ∈ U*}    (1)

Since there are infinitely many strings, developing an exhaustive confusable database is impossible. We can instead limit the scope to just single codepoints. More formally, we can modify Equation 1 to obtain the equivalence class (or simply class) H of a codepoint c:

H(c) = {h ∈ U | h ≡ c, c ∈ U}    (2)

where U is the set of all renderable Unicode codepoints. When |H(c)| ≥ 2, we get non-trivial sets of two or more homoglyphs (as each character is always in an equivalence class containing at least itself). With these homoglyph sets, we can form longer confusables through character substitution and assess risks in vulnerable systems.

C. Automated Homoglyph Identification

Homoglyph identification can be understood as the problem of finding all sets of equivalence classes that satisfy Equation 2. An alternative problem formulation, used in all previous work, is to identify pairs of homoglyphs rather than sets. Pairs are less useful than sets, because for us to store all of the homoglyph associations of all given characters, pairwise lookup would require drastically more storage. To the best of our knowledge, prior work on the automated identification of homoglyphs within Unicode includes:

• Roshanbin and Miller [4] develop an approach based on Normalized Compression Distance (NCD) that identifies pairs of homoglyphs from a set of around 6,200 Unicode characters. They claim that their study expands the set of known homoglyphs at the time of their work.
• Ginsberg and Yu [5] aim to identify potential homoglyph pairs by decreasing their granularity and normalizing their sizes and locations. They present empirical results on a select few letters, and mainly focus on the speed of their method. Like Roshanbin and Miller, they did not attempt to map codepoints into equivalence sets.
• Sawabe et al. [6] use optical-character-recognition (OCR) software to produce a dynamic mapping from Unicode characters to ASCII characters. They assess the OCR homoglyph detection scheme as part of a larger model, evaluating its utility to detect confusable URLs against existing Unicode-ASCII mappings. Although OCR technologies have become more sophisticated and effective in recent years, they are typically designed to recognize a small subset of Unicode. This makes it impossible to identify more equivalence classes than the number of visually distinct characters supported by the OCR software, which is fewer than 128.

No prior works quantify the accuracy of their approach.

D. Few-Shot Learning

Deep-learning approaches typically perform poorly without a lot of data. Recent work has attempted to overcome this by addressing the so-called few-shot learning problem, where just a few examples are given for a class. Few-shot classification or clustering problems can be described as "i-shot, j-way", where i stands for the number of examples given for a class, and j is the number of classes. Many approaches [11] have been
proposed to improve the performance of deep-learning methods in few-shot learning, with the most promising ones being data augmentation, embedding learning, and multitask/transfer learning. We leverage some of these techniques in our work.

III. DEEP LEARNING APPROACH

A. Learning an Embedding Function

Classifying visually similar characters into distinct sets can be considered as a computer vision task. There has been prior work in computer vision tackling similar problems, such as using Convolutional Neural Networks (CNNs) for 1-shot, 20-way classification on the Omniglot dataset [12]. The Omniglot dataset is similar to our problem, in that the problem is glyph classification. It contains 1,623 distinct characters, and during testing, the model is asked to predict between 20 character classes. There are, however, many more than 20 equivalence classes (Equation 2) in Unicode. For instance, if (conservatively) 80% of codepoints do not look similar, then we would have more than 115,000 classes. In that aspect, our problem is more like mass-scale facial recognition than 20-way character classification.

To tackle this, we utilize several techniques from the few-shot learning problem domain, including data augmentation, transfer learning, and embedding learning.

Data Augmentation: One issue with our data is not knowing the equivalence class membership of a character in general. In other words, we lack labels for nearly all characters. While there are public databases of known homoglyphs, there is no guarantee that they are representative of the entire codespace. The confusable database maintained by the Unicode Consortium [9], for example, underrepresents Chinese-Japanese-Korean characters, leading us to suspect that most homoglyphs are still unrecorded. In addition, training on the few known class labels makes it harder to assess the model's ability to generalize. We therefore forego the use of known homoglyph labels for training. Instead, we generate weak labels by exploiting the fact that, in general, characters are not homoglyphs, and belong to their own trivial equivalence classes. This allows us to sample characters from different classes by random uniform choice, with reasonable reliability.

Another issue is that we only have a few example images for each codepoint, from various fonts' implementations of the same character. This means that we only have a few datapoints for each weakly labeled class. To alleviate this problem, we augment our data by randomizing the location of a glyph within an image and performing random affine transformations on images during training.

Embedding Learning: Embedding learning is a type of modeling where, instead of directly learning a classification or regression model, we learn an embedding function that outputs relevant latent variables. Embedding learning excels in unsupervised and semi-supervised few-shot learning problems [11]. Neural network architectures are good candidates for the embedding functions, because they can approximate highly non-linear functions.

To train our embedding function, we utilize a modified version of the Triplet Loss function from Facenet [13], a work on facial recognition. In Triplet Loss, a triplet consists of an anchor datapoint, a positive datapoint from the same class as the anchor, and a negative datapoint from a different class. Having computed the embeddings for all three samples, Triplet Loss is optimized when the distance between the anchor and positive embeddings is low compared to the distance between the anchor and negative embeddings.

To produce our triplets, we use the weak labels to generate an anchor and a positive sample from the same codepoint, with a negative sample from a different codepoint. For computing Triplet Loss, though a Euclidean distance between embeddings might be used, we instead use one minus the cosine similarity. We also did not implement the complicated triplet mining algorithm proposed by Schroff et al. [13].

Transfer Learning: For our embedding function, we use EfficientNet [7] without its softmax classification layer. It is a CNN, and the feature maps of its penultimate layer are flattened to become our embedding vectors. EfficientNet is pretrained on ImageNet, and has achieved state-of-the-art transfer performance on various datasets [7]. We apply a regularization loss by penalizing the L2 distance between our learned weights and the pretrained weights.

B. Clustering Homoglyphs into Equivalence Classes

Given the learned embedding function, we can separate the embeddings into equivalence classes by assigning them to distinct clusters. While many methods exist for unsupervised clustering problems, most are designed with a fixed number of clusters in mind. They are thus not suited for our problem, where a huge unknown number of single-datapoint clusters (non-homoglyph characters) exist.
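The modified Triplet Loss of Section III-A, with one minus cosine similarity in place of Euclidean distance, can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation; in particular, the margin value is our own placeholder, since no margin is reported above.

```python
import math

def cosine_distance(u, v):
    """One minus cosine similarity, used in place of Euclidean distance."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return 1.0 - dot / (norm_u * norm_v)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors: the anchor should be
    closer to the positive (same codepoint, different font or augmentation)
    than to the negative (a different codepoint), by at least `margin`.
    The margin of 0.2 is hypothetical, not a value from the paper."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)
```

Under the weak labels, the anchor and positive are two renderings of the same codepoint, and the negative is a rendering of a codepoint sampled uniformly at random, which is a different character with high probability.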
We propose an ad-hoc heuristic, where each character is iterated through and assigned either to an existing cluster, in which all characters have a similarity with the candidate character greater than a particular threshold, or to a new cluster of its own. Afterwards, we iterate through all proposed clusters Ci in a list, comparing them to the merge candidates Ck, k > i, that are clusters further down the list. We calculate the mean and variance of the cosine similarities between the characters in each Ck and the characters in Ci, and merge Ck with Ci if certain thresholds apply to the mean and variance of the cosine similarities.

C. Rendering the Codepoints

To render U, we use a collection of fonts, including the Windows and MacOS default fonts. For consistency, we convert each font to the TrueType format, the most common format in the set. Each TrueType font contains a list of supported Unicode codepoints and a graphical representation of how they are displayed. Although there are 143,859 characters mapped to codepoints by the Unicode Consortium, a codepoint can only be utilized if it is renderable by at least one font.
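This renderability filter can be sketched as follows. The cmap dictionaries here are toy stand-ins for the character maps that would, in practice, be read out of TrueType files; ".notdef" is TrueType's name for the replacement glyph.

```python
REPLACEMENT = ".notdef"  # TrueType's replacement glyph name

def renderable_codepoints(font_cmaps):
    """Collect codepoints that at least one font both lists in its
    character map and does not draw as the replacement glyph.
    `font_cmaps` is a list of {codepoint: glyph_name} dicts, one per
    font (in practice extracted from TrueType files)."""
    usable = set()
    for cmap in font_cmaps:
        for codepoint, glyph_name in cmap.items():
            if glyph_name != REPLACEMENT:
                usable.add(codepoint)
    return usable
```

A codepoint mapped only to the replacement glyph, or absent from every cmap, is excluded, mirroring the two criteria below.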
For a codepoint to be renderable by a font, it must meet the following criteria:
• The character must be on the font's list of supported characters.
• The font does not render the character as a replacement character, which occurs when a font is unable to represent a Unicode character graphically.

116,294 characters meet these criteria and are used for training and testing. During training, we treat multiple font implementations of the same character as multiple datapoints of the same weakly labeled class, and the specific font used to render a character is selected at random. In our experiments for model assessment and comparison, we follow prior work and treat each character as a single datapoint. As Roshanbin and Miller point out [4], ideally, the same font would render every codepoint. However, no such font supports rendering every Unicode character. We therefore use the minimum possible number of fonts, as [4] does, in our experiments.

IV. EXPERIMENTS

A. Metrics: Clustering

To assess the efficacy of our approach at clustering Unicode codepoints into homoglyph equivalence classes, we conduct several experiments on our trained model to show that our approach predicts equivalence classes that are close to ground truth. Since the ground truth equivalence classes for Equation 2 are unknown over the set U, it is impossible for us to directly measure the fit of predicted equivalence classes. Luckily, the Unicode Consortium [9] maintains a database of known confusable strings. About 5,000 of these confusable strings are length-one strings, or homoglyph characters, which are useful to compare our predictions against. These homoglyphs are grouped into distinct sets, which satisfy our definition of equivalence classes. 4,666 of these homoglyphs can be rendered by the fonts built into our operating systems. This gives us a total of 1,313 non-trivial equivalence classes. For the rest of this section, we refer to these as the ground truth equivalence classes on U′. Formally, let us define U′ as the set of 4,666 known homoglyphs from the Unicode Consortium; we can then substitute U with U′ in Equation 2, which yields the 1,313 ground truth equivalence classes satisfying the definition:

H′(c) = {h ∈ U′ | h ≡ c, c ∈ U′}    (3)

Most of the 1,313 equivalence classes have a cardinality of no more than 10, with the largest being 71 (Table I). We refer to these ground truth individual equivalence classes as H′_1, H′_2, ..., H′_1313. Likewise, we refer to each of our k predicted equivalence classes as H^p_1, H^p_2, ..., H^p_k, where k can be any number of equivalence classes predicted, and p simply indicates "predicted."

TABLE I
DISTRIBUTION OF EQUIVALENCE CLASS CARDINALITY IN U′

Cardinality   Number of Equivalence Classes
2 - 10        1237
11 - 20       45
21 - 30       23
31 - 40       5
41 - 50       1
51 - 60       0
61 - 70       1
71 - 80       1

Therefore, we estimate the fit by measuring the distance between our equivalence class predictions (predicted sets) and the ground truth sets over U′ rather than U. The metric we use is mean Best Intersection Over Union (mBIOU). Intersection Over Union is a way to measure the distance between two sets, commonly used in object detection problems. mBIOU first finds the best matching between the predicted sets and the ground truth sets. For example, if there is a predicted set of characters that look like the letter 'a', it will be matched with a similar set of ground truth characters, if such a set exists. Given this matching, mBIOU computes the average IOU between the matched pairs. Formally, we define

mBIOU = (1/n) · Σ_{i=1}^{n} |H′_i ∩ H^best_i| / |H′_i ∪ H^best_i|    (4)

where n = 1313 is the total number of ground truth equivalence classes in U′, and

H^best_i = argmax over H^p_j, 1 ≤ j ≤ k, of |H′_i ∩ H^p_j|    (5)

To show that the proposed mBIOU metric is a good measure of the fitness of our equivalence class predictions, let us consider its behavior in several thought experiments:
• The predicted sets align perfectly with the ground truth: mBIOU = 1.0, because the intersections are exactly the same as the unions.
• The predicted sets have only one character of overlap with the ground truth, the smallest possible intersection: mBIOU will be low, because the intersection cardinality will not exceed one, which suppresses IOU for non-trivial equivalence classes.
• The predicted sets contain many false positives for any class, i.e., low precision: mBIOU would be low due to having relatively large unions.
• The predicted sets contain few true positives for any class, i.e., low recall: mBIOU would be low due to having relatively small intersections.
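Equations 4 and 5 amount to the following computation over Python sets (a minimal sketch; in practice each set would contain codepoints):

```python
def mbiou(truth_classes, predicted_classes):
    """Mean Best Intersection Over Union. For each ground truth class,
    find the predicted class with the largest intersection (Equation 5),
    then average the IOU of the matched pairs (Equation 4)."""
    total = 0.0
    for truth in truth_classes:
        best = max(predicted_classes, key=lambda pred: len(truth & pred))
        total += len(truth & best) / len(truth | best)
    return total / len(truth_classes)
```

Note that for the naive baseline of singleton predictions, each ground truth class H′_i is matched with one of its own characters and contributes 1/|H′_i| to the average, which is how a baseline value well below 1.0 arises.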
From these scenarios, we can see that mBIOU is a reasonably good measure of how close our predictions over U′ are to ground truth.

It is important to note, however, that U′ has some different properties than U, which is the entire set of more than 100,000 renderable Unicode codepoints. Critically, assuming that the distribution of cardinality for ground truth equivalence classes remains similar, the number of negatives for each of the classes will be much, much larger. It is therefore crucial to have high precision. To assess the ability of our proposed approach on U, we add new random codepoints to U′, and obtain U″ and U‴ for 1,000 and 3,000 additions, respectively.

Importantly, the predicted equivalence classes are measured against H′_1, H′_2, ..., H′_1313, which do not include the new random additions of characters. This is because: 1) we do not know which equivalence classes the new additions belong to; 2) we know that they very likely do not belong to the known equivalence classes on U′, otherwise they would have been identified easily by the humans who already found the original classes; and 3) if we modify our ground truth classes to include these trivial sets of non-homoglyphs, then the mBIOU can be inflated if the model is good at clustering non-homoglyphs into their own sets, which will not give us a meaningful way of telling whether our model can generalize to U with high precision.

We also directly provide a measure of precision, mean best precision (mBP), defined as

mBP = (1/n) · Σ_{i=1}^{n} |H′_i ∩ H^best_i| / |H′_i|    (6)

where H^best_i is defined in Equation 5. Assuming that these random additions of characters do not belong to any of the ground truth equivalence classes on U′, for our model to generalize to U, we expect mBIOU and mBP to not fall off meaningfully on these random additions of negatives. We also provide mBIOU and mBP for a naive baseline approach, where each codepoint belongs to its own predicted equivalence class of size one.

B. Metrics: Comparing to Prior Work

We also compare our approach to prior work, which focused on identifying homoglyph pairs rather than clusters of equivalence classes. Note that Ginsberg and Yu's work [5] is an attempt at rapid homoglyph detection, and presents little evidence that their approach is also accurate. Sawabe et al.'s OCR approach [6] can only detect homoglyphs for ASCII characters, and thus it is not applicable to more than 99% of Unicode characters. As such, we solely focus on Roshanbin and Miller's work [4] in our comparison. To this end, we sample pairs of characters from U′, which is labeled data. We then calculate the accuracy, precision-recall curve (PR curve), and average precision for our approach and theirs. To be more specific, let us define all possible homoglyph pairs from U′ as the set

P = {(a, b) | a ∈ U′, b ∈ U′, a ≠ b, a ≡ b}    (7)

and all possible non-homoglyph pairs from U′ as the set

N = {(a, b) | a ∈ U′, b ∈ U′, a ≠ b, a ≢ b}    (8)

In one trial of n pairs (we use n = 2000), we sample with replacement n/2 pairs from P and n/2 pairs from N. For each pair, we sample characters a and b uniformly at random, where a ≡ b for a P datapoint and a ≢ b for an N datapoint. We then obtain the cosine similarity between a and b under our learned embedding for each sampled pair. We also calculate the NCD with the LZMA compressor [4]. This allows us to train two linear-kernel support vector machines, one for the cosine similarities and one for the Normalized Compression Distances, to make large-margin classifications of whether a pair belongs to P (positive) or N (negative). We use the compression function in the LZMA library in Python 3 with default parameters. We train the SVMs, calculate the accuracy and average precision, and plot the precision-recall curve using the sklearn library.

Fig. 3. PR curve, accuracy (Acc.), and average precision (AP) of our method and NCD on identifying homoglyph and non-homoglyph pairs (n = 2000)

C. Results

1) Clustering Homoglyphs: Table II contains the clustering results. It might be confusing to see that the baseline approach achieves precisely the same mBIOU and mBP across all datasets, even the ones augmented with negative examples. This is because the ground truth sets are the same across the three datasets, and thus for each H′_i, we get an H^best_i that is composed of one character from H′_i. This results in identical mBIOU and mBP across all three datasets. Apart from this observation, we see that our approach is better than the baseline by a

TABLE II
mBIOU AND mBP OF PREDICTED HOMOGLYPH SETS

Data              U′      U″      U‴
mBIOU             0.592   0.580   0.566
mBP               0.697   0.696   0.696
mBIOU (Baseline)  0.430   0.430   0.430
mBP (Baseline)    0.430   0.430   0.430

significant margin. We also see some drop in mBIOU as more negative examples are added, without a corresponding drop in mBP, suggesting that the approach needs work on recall before it can generalize well to U.

2) Comparing to Prior Work: With an overall accuracy of 0.90 and average precision of 0.97, our method greatly outperforms Roshanbin and Miller's method [4] (Figure 3) for identifying homoglyph pairs. Given the same number of false positives, our method can potentially identify many more homoglyphs. This also hints that, if applied to clustering homoglyphs into equivalence classes, NCD will likely fare worse than our deep-learning approach.

V. FINDING PREVIOUSLY UNKNOWN HOMOGLYPHS

We also test whether our approach can find previously unknown homoglyphs. To do this, we utilize the pairwise character cosine similarity matrix A of size N², where N = 116,294 is the total number of renderable codepoints. We apply a handpicked threshold α and look for rows in A with more than one element greater than the threshold. Note that each row will always have at least one element passing the threshold due to self-comparison (a cosine similarity of 1.0). We sampled thresholds in the range −1 < α < 1 due to the range of values for cosine similarity. Using an arbitrary value of α = 0.93, our model predicts 8,452 homoglyphs previously unrecorded by the Unicode Consortium, which is more than two times the recorded number. We show a random sample of four such sets of predicted homoglyphs (Figure 4). Due to the relevance of Latin scripts in WWW use today, we also present a set of previously unidentified Latin-confusable homoglyphs (Figure 5). While some of these are false positives, many appear to be genuine homoglyphs. On the other hand, it is arguable whether the characters look visually confusing to native users of the scripts. The full list of predicted homoglyphs is on Github.

Fig. 4. Four random sets of predicted homoglyphs previously unidentified
Fig. 5. A Latin-confusable set of predicted homoglyphs previously unidentified

VI. DISCUSSION

Outperforming prior work on homoglyph identification and achieving promising results on clustering Unicode homoglyphs, our work shows that deep learning can effectively generate embeddings for Unicode characters, which can not only be used by attackers to find homoglyphs (as in our case), but also possibly by defenders to detect phishing attacks using homograph strings (see [14] for an example of defense). Further research is still needed, because we did not address composite characters of multiple Unicode codepoints, or composite homoglyphs of multiple characters. Our approach has a far from perfect mBIOU, and can likely be improved by utilizing existing state-of-the-art algorithms that cluster embeddings into an unknown number of classes (e.g., hierarchical clustering). While we predicted 8,452 homoglyphs, domain experts or surveys are needed to determine the prediction quality.

REFERENCES

[1] E. Gabrilovich and A. Gontmakher, "The homograph attack," Communications of the ACM, vol. 45, no. 2, 2002.
[2] D. Weber-Wulff, C. Möller, J. Touras, E. Zincke, and H. Berlin, "Plagiarism detection software test 2013," Abgerufen am, vol. 12, 2013.
[3] X. Zheng, "Phishing with Unicode domains – Xudong Zheng," 2017. [Online]. Available: https://www.xudongz.com/blog/2017/idn-phishing/
[4] N. Roshanbin and J. Miller, "Finding homoglyphs – a step towards detecting Unicode-based visual spoofing attacks," in WISE, 2011.
[5] A. Ginsberg and C. Yu, "Rapid homoglyph prediction and detection," in ICDIS, 2018.
[6] Y. Sawabe, D. Chiba, M. Akiyama, and S. Goto, "Detection method of homograph internationalized domain names with OCR," Journal of Information Processing, vol. 27, 2019.
[7] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," arXiv preprint arXiv:1905.11946, 2019.
[8] E. D. Pisano, "AI shows promise for breast cancer screening," 2020.
[9] M. Davis and M. Suignard, "UTS #39: Unicode security mechanisms," 2020. [Online]. Available: https://unicode.org/reports/tr39/#confusables
[10] "About the Unicode® standard," 2017. [Online]. Available: https://www.unicode.org/standard/standard.html
[11] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, "Generalizing from a few examples: A survey on few-shot learning," ACM Computing Surveys (CSUR), vol. 53, no. 3, 2020.
[12] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, "Human-level concept learning through probabilistic program induction," Science, vol. 350, no. 6266, 2015. [Online]. Available: https://science.sciencemag.org/content/350/6266/1332
[13] F. Schroff, D. Kalenichenko, and J. Philbin, "Facenet: A unified embedding for face recognition and clustering," in CVPR, 2015.
[14] J. Woodbridge, H. S. Anderson, A. Ahuja, and D. Grant, "Detecting homoglyph attacks with a siamese neural network," in DLS, 2018.