Weaponizing Unicodes with Deep Learning - Identifying Homoglyphs with Weakly Labeled Data


Weaponizing Unicodes with Deep Learning - Identifying Homoglyphs with Weakly Labeled Data

Perry Deng§, Cooper Linsky§, Matthew Wright
Global Cybersecurity Institute, Rochester Institute of Technology, Rochester, USA
[email protected] [email protected] [email protected]

arXiv:2010.04382v4 [cs.CR] 22 Dec 2020. ©2020 IEEE. This material is based upon work supported by the National Science Foundation under Award No. 1816851. §These authors contributed equally to this work.

Abstract—Visually similar characters, or homoglyphs, can be used to perform social engineering attacks or to evade spam and plagiarism detectors. It is thus important to understand the capabilities of an attacker to identify homoglyphs – particularly ones that have not been previously spotted – and leverage them in attacks. We investigate a deep-learning model using embedding learning, transfer learning, and augmentation to determine the visual similarity of characters and thereby identify potential homoglyphs. Our approach uniquely takes advantage of weak labels that arise from the fact that most characters are not homoglyphs. Our model drastically outperforms the Normalized Compression Distance approach on pairwise homoglyph identification, for which we achieve an average precision of 0.97. We also present the first attempt at clustering homoglyphs into sets of equivalence classes, which is more efficient than pairwise information for security practitioners to quickly look up homoglyphs or to normalize confusable string encodings. To measure clustering performance, we propose a metric (mBIOU) building on the classic Intersection-Over-Union (IOU) metric. Our clustering method achieves 0.592 mBIOU, compared to 0.430 for the naive baseline. We also use our model to predict over 8,000 previously unknown homoglyphs, and find good early indications that many of these may be true positives. Source code and the list of predicted homoglyphs are available on GitHub: https://github.com/PerryXDeng/weaponizing_unicode

Index Terms—homoglyphs, unicode, cybersecurity

[Fig. 1. A screenshot of apple.com, which uses the Latin script]
[Fig. 2. A screenshot of [3]'s domain, which uses the Cyrillic script]

I. INTRODUCTION

Identical-looking characters, known as homoglyphs, can be used to substitute characters in strings to create visually similar strings. These strings can then be used to trick users into clicking on malicious domains passing as legitimate ones [1], or to evade spam and plagiarism detectors [2]. While the security risks of homoglyphs have been known since the turn of the century [1], the problem has received more attention as Unicode has become the predominant standard for text processing.

Unlike ASCII, which contains only 128 characters, the latest Unicode standard encapsulates more than 140 thousand characters. Among the Unicode characters, many have been identified to be homoglyphs. This allows scenarios where the entire string of a domain name can be substituted with different Unicode characters and appear visually identical (see Figures 1 and 2). A complete database of all homoglyphs within the Unicode standard would be helpful for assessing the risks of systems utilizing Unicode. Due to the sheer size of the Unicode standard, however, manual identification of all homoglyphs is infeasible.
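The risk is easy to demonstrate. Below is a minimal Python sketch (ours, not from the paper) contrasting a Latin string with a Cyrillic look-alike of the kind shown in Figures 1 and 2; the specific codepoints are the well-known "apple" spoof and are chosen purely for illustration:

    # Minimal illustration: a Latin string and a Cyrillic look-alike built
    # from homoglyphs. The two strings render near-identically in many
    # fonts yet share no codepoints.
    latin = "apple"
    lookalike = "\u0430\u0440\u0440\u04cf\u0435"  # Cyrillic а, р, р, ӏ (palochka), е

    print(latin, lookalike)    # visually similar in many fonts
    print(latin == lookalike)  # False: entirely different codepoints
    for ch in lookalike:
        print(f"U+{ord(ch):04X}", ch)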
Prior works [4]–[6] on automated homoglyph identification have a few critical shortcomings: 1) they are only used to identify homoglyphs for a few characters, and/or 2) they do not cluster homoglyphs into distinct sets, and/or 3) they do not present quantifiable evidence on the accuracy of their approach. We seek to address these shortcomings with empirical evidence on a method based on deep learning, which has performed tremendously well on computer vision problems such as object classification [7] and medical imaging [8].

Unlike many works in computer vision, we do not have a large number of human-labeled examples for each visual class of data, because we do not know the complete sets of homoglyphs. We only know that the overwhelming majority of characters are not homoglyphs. To address this challenge, we propose a novel model that adapts data augmentation, transfer learning, and embedding learning from similarly challenging deep-learning problems such as facial recognition. We fine-tune our neural network on weakly labeled data by exploiting the fact that, in general, different characters do not look alike. We compare our deep learning model to the approach of Roshanbin and Miller [4] on the pairwise identification of homoglyphs, and show that our method significantly outperforms theirs in both accuracy (0.91 vs 0.64) and average precision (0.97 vs 0.75). We also attempt clustering of homoglyphs into distinct sets, and propose to measure the performance and generalizability of our clustering model with mBIOU, a metric based on Intersection Over Union. Our approach achieves an mBIOU of 0.592 between the predicted sets of homoglyphs and the known homoglyphs from the Unicode Consortium [9]. In addition, we show that our learned deep embeddings can be used to find many previously unidentified homoglyphs.

In summary, our contributions are as follows:
• We propose a novel deep learning model for identifying homoglyphs that overcomes the lack of training data by leveraging embedding learning and transfer learning, together with data augmentation methods.
• We are the first to attempt clustering of homoglyphs into distinct sets, and propose mBIOU, a way to systematically measure the accuracy of the clustered sets.
• We show that our approach outperforms the prior work on the pairwise identification task, with an average precision of 0.97 compared to 0.79. Our approach also achieves 0.592 mBIOU on the clustering task, compared to a naive baseline of 0.430.
• We apply our model to find new homoglyphs, and visualize a random sample of the predicted sets. 8,452 unrecorded homoglyphs were predicted with our approach, and visual inspection of our random sample suggests that some of these are likely to be true positives.

II. BACKGROUND

A. Unicode

Unicode [10] is the predominant technology standard for the encoding, representation, and processing of text. It is maintained by the Unicode Consortium, and is backward-compatible with legacy ASCII. Unicode defines a space of integers, or codepoints, ranging from 0 to 0x10FFFF, which is hexadecimal for slightly more than one million. The entire set of codepoints is called the code space. Codepoints can be mapped to characters, so that characters can be encoded and processed as unique numbers by computers. As of Unicode Version 13.0, released in March 2020, there are 143,859 characters mapped to codepoints in the standard implemented for public use. For human readability, a codepoint can either be rendered into an image with a font, or printed in hex format as "U+<hex code>". For example, the lowercase first letter of the Latin alphabet can be written as 'a' or U+0061.

B. Homoglyphs, Homographs, and Confusables

A homoglyph is a character that is visually similar to another character. For example, U+0049, or 'I', can appear similar to U+006C, or 'l'. Homoglyphs can be used to substitute characters in strings of one or more characters to form homographs, or confusables in Unicode terminology. Confusables pose significant security risks. They can fool users into unintended actions, such as malicious domains masquerading as legitimate websites [1]. Confusables can also cause text to slip past mechanical gatekeepers, such as a spam or plagiarism detector [2].

The Unicode Consortium maintains a database [9] of confusables to help mitigate these security risks. The (incomplete) database maps visually similar strings of one or more codepoints to equivalence classes, identified by their respective "prototype" string. With this mapping, security practitioners can quickly look up homoglyphs for a given character or normalize confusable strings into a single encoding [9]. Formally, given the set of possible Unicode strings U*, and the equivalence relation "is visually similar to", ≡, on U*, the confusable equivalence class S of a string u is:

    S(u) = { s ∈ U* | s ≡ u, u ∈ U* }   (1)

Since there are infinitely many strings, developing an exhaustive confusable database is impossible. We can instead limit the scope to single codepoints. More formally, we can modify Equation 1 to obtain the equivalence class (or simply class) H of a codepoint c:

    H(c) = { h ∈ U | h ≡ c, c ∈ U }   (2)

where U is the set of all renderable Unicode codepoints. When |H(c)| ≥ 2, we get non-trivial sets of two or more homoglyphs (as each character is always in an equivalence class containing at least itself). With these homoglyph sets, we can form longer confusables through character substitution and assess risks in vulnerable systems.
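To make Equation 2 concrete, the sketch below groups single codepoints into classes keyed by their prototype string using the Unicode Consortium's confusables data [9]. This is our illustration, not the paper's code; the parsing assumes the published confusables.txt format (semicolon-separated hex codepoint fields, '#' comments):

    # Sketch: build equivalence classes H(c) from Unicode's confusables.txt.
    # Assumed line format: "<source hex seq> ; <prototype hex seq> ; MA # comment"
    from collections import defaultdict

    def load_confusable_classes(path="confusables.txt"):
        classes = defaultdict(set)  # prototype string -> set of confusable chars
        with open(path, encoding="utf-8-sig") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                source, target = [field.strip() for field in line.split(";")[:2]]
                src = "".join(chr(int(cp, 16)) for cp in source.split())
                proto = "".join(chr(int(cp, 16)) for cp in target.split())
                if len(src) == 1:  # restrict to single codepoints, as in Eq. 2
                    classes[proto].add(src)
                    if len(proto) == 1:
                        classes[proto].add(proto)  # a class contains its prototype
        return classes

    classes = load_confusable_classes()
    print(classes.get("l"))  # characters confusable with 'l', per the data file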
C. Automated Homoglyph Identification

Homoglyph identification can be understood as the problem of finding all equivalence classes that satisfy Equation 2. An alternative problem formulation, used in all previous work, is to identify pairs of homoglyphs rather than sets. Pairs are less useful than sets, because storing all the homoglyph associations of all given characters pairwise would require drastically more storage. To the best of our knowledge, prior work on the automated identification of homoglyphs within Unicode includes:
• Roshanbin and Miller [4] develop an approach based on Normalized Compression Distance (NCD) that identifies pairs of homoglyphs from a set of around 6,200 Unicode characters (a sketch of NCD follows this list). They claim that their study expands the set of known homoglyphs at the time of their work.
• Ginsberg and Yu [5] aim to identify potential homoglyph pairs by decreasing their granularity and normalizing their sizes and locations. They present empirical results on a select few letters, and mainly focus on the speed of their method. Like Roshanbin and Miller, they did not attempt to map codepoints into equivalence sets.
• Sawabe et al.
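For reference, the NCD measure underlying Roshanbin and Miller's approach can be sketched as follows. This is our generic illustration with zlib; their exact rendering and compressor choices may differ, and the step of rendering each glyph to bytes (e.g. a bitmap) is omitted:

    # Sketch of Normalized Compression Distance (NCD) between byte strings.
    # Lower NCD suggests higher visual similarity; thresholding it yields a
    # pairwise homoglyph predictor.
    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        cx = len(zlib.compress(x))
        cy = len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    # placeholder inputs; real use would compare rendered glyph bitmaps
    print(ncd(b"bitmap-of-I", b"bitmap-of-l"))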
Recommended publications
  • A Method to Accommodate Backward Compatibility on the Learning Application-Based Transliteration to the Balinese Script
    (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 6, 2021

    A Method to Accommodate Backward Compatibility on the Learning Application-based Transliteration to the Balinese Script

    Gede Indrawan¹, I Gede Nurhayata³, Sariyasa⁴ (Department of Computer Science); I Ketut Paramarta² (Department of Balinese Language Education), Universitas Pendidikan Ganesha (Undiksha), Singaraja, Indonesia

    Abstract—This research proposed a method to accommodate backward compatibility on the learning application-based transliteration to the Balinese script. The objective is to accommodate the standard transliteration rules (for short, the standard rules) from the Balinese Language, Script, and Literature Advisory Agency [7]. This is considered the main contribution, since there has not been a workaround in this research area. This multi-discipline collaboration work is one of the efforts to digitally preserve the endangered Balinese local language knowledge in Indonesia. The proposed method covers two aspects: (1) its backward compatibility allows interoperability at a certain level with the older transliteration rules (for short, the older rules) from The Balinese Alphabet document; and (2) breaking backward compatibility at a certain level is unavoidable since, for the same aspect, there is a contradictory treatment between the standard rules and the older rules. The Advisory Agency, a Bali Province government agency [4], carries out guidance and formulates programs for the maintenance, study, development, and preservation of the Balinese language, script, and literature. This study was conducted on the developed web-based transliteration learning application, BaliScript, for further ubiquitous Balinese language learning, since the proposed method is reusable for the mobile application [8], [9].
  • ISO Basic Latin Alphabet
    The ISO basic Latin alphabet is a Latin-script alphabet and consists of two sets of 26 letters, codified in various national and international standards[1] and used widely in international communication. The two sets contain the following 26 letters each:[1][2]

    Uppercase: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    Lowercase: a b c d e f g h i j k l m n o p q r s t u v w x y z

    History. By the 1960s it became apparent to the computer and telecommunications industries in the First World that a non-proprietary method of encoding characters was needed. The International Organization for Standardization (ISO) encapsulated the Latin script in their (ISO/IEC 646) 7-bit character-encoding standard. To achieve widespread acceptance, this encapsulation was based on popular usage. The standard was based on the already published American Standard Code for Information Interchange, better known as ASCII, which included in the character set the 26 × 2 letters of the English alphabet. Later standards issued by the ISO, for example ISO/IEC 8859 (8-bit character encoding) and ISO/IEC 10646 (Unicode Latin), have continued to define the 26 × 2 letters of the English alphabet as the basic Latin script with extensions to handle other letters in other languages.[1]

    Terminology. The Unicode block that contains the alphabet is called "C0 Controls and Basic Latin".
  • The Unicode Cookbook for Linguists: Managing Writing Systems Using Orthography Profiles
    Zurich Open Repository and Archive, University of Zurich, Main Library, Strickhofstrasse 39, CH-8057 Zurich, www.zora.uzh.ch. Year: 2017

    The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles. Moran, Steven; Cysouw, Michael. DOI: https://doi.org/10.5281/zenodo.290662. Posted at the Zurich Open Repository and Archive, University of Zurich. ZORA URL: https://doi.org/10.5167/uzh-135400. Monograph, licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) License. Originally published at: Moran, Steven; Cysouw, Michael (2017). The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles. CERN Data Centre: Zenodo. DOI: https://doi.org/10.5281/zenodo.290662

    Preface. This text is meant as a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together. The intersection of the Unicode Standard and the International Phonetic Alphabet is often not met without frustration by users. Nevertheless, the two standards have provided language researchers with a consistent computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA. Our research uses quantitative methods to compare languages and uncover and clarify their phylogenetic relations. However, the majority of lexical data available from the world's languages is in author- or document-specific orthographies.
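    One pitfall of the kind the cookbook warns about can be shown in a few lines; this example (ours, not from the book) contrasts precomposed and decomposed encodings of the same visible string:

        # The same visible character, encoded two ways: precomposed (NFC-style)
        # vs. base letter plus combining mark (NFD-style).
        import unicodedata

        precomposed = "\u1E7D"  # 'ṽ' LATIN SMALL LETTER V WITH TILDE
        decomposed = "v\u0303"  # 'v' + COMBINING TILDE

        print(precomposed == decomposed)                                # False
        print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True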
  • Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
    Assessment of Options for Handling Full Unicode Character Encodings in MARC21: A Study for the Library of Congress. Part 1: New Scripts. Jack Cain, Senior Consultant, Trylus Computing, Toronto.

    1 Purpose. This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format.

    1.1 "Encoding Scheme" vs. "Repertoire". An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the "repertoire" of characters in the given encoding scheme. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. "A", "B", & "C" are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal).

    1.2 MARC8. "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The '8' refers to the fact that, unlike Unicode, which is a multi-byte-per-character code set, the MARC8 encoding scheme is principally made up of multiple one-byte tables in which each character is encoded using a single 8-bit byte. (It also includes the EACC set, which actually uses fixed-length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only.
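    The scheme-vs-repertoire distinction is easy to see in code; this small example (ours, not from the study) encodes the same repertoire characters under different schemes:

        # One repertoire character, several encoded forms.
        print("A".encode("ascii").hex())      # '41'     - one byte in ASCII
        print("A".encode("utf-16-be").hex())  # '0041'   - two bytes in UTF-16
        print("€".encode("utf-8").hex())      # 'e282ac' - not in ASCII's repertoire at all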
  • Detection of Suspicious IDN Homograph Domains Using Active DNS Measurements
    A Case of Identity: Detection of Suspicious IDN Homograph Domains Using Active DNS Measurements. Ramin Yazdani, Olivier van der Toorn, Anna Sperotto — University of Twente, Enschede, The Netherlands ([email protected], [email protected], [email protected])

    Abstract—The possibility to include Unicode characters in domain names allows users to deal with domains in their regional languages. This is done by introducing Internationalized Domain Names (IDN). However, the visual similarity between different Unicode characters - called homoglyphs - is a potential security threat, as visually similar domain names are often used in phishing attacks. Timely detection of suspicious homograph domain names is an important step towards preventing sophisticated attacks, since this can prevent unaware users from accessing homograph domains that actually carry malicious content. We therefore propose a structured approach to identify suspicious homograph domain names based not on use, but on characteristics of the domain name itself and its associated DNS records.

    … problem in the case of Unicode - ASCII homograph domains. We propose a low-cost method for proactively detecting suspicious IDNs. Since our proactive approach is based not on use, but on characteristics of the domain name itself and its associated DNS records, we are able to provide an early alert for both domain owners as well as security researchers to further investigate these domains before they are involved in malicious activities. The main contributions of this paper are that we:
    • propose an improved Unicode Confusion table able to detect 2.97 times as many homograph domains as the state-of-the-art confusion tables;
    • combine active DNS measurements and Unicode
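    A toy version of the confusion-table idea reads as follows (our sketch, not the paper's implementation; the four-entry mapping stands in for a full confusion table):

        # Flag a label whose confusable "skeleton" collides with a brand name.
        CONFUSABLE = {"\u0430": "a", "\u0440": "p", "\u04cf": "l", "\u0435": "e"}

        def skeleton(label: str) -> str:
            return "".join(CONFUSABLE.get(ch, ch) for ch in label)

        label = b"80ak6aa92e".decode("punycode")  # "xn--80ak6aa92e" without its prefix
        print(label)                              # Cyrillic look-alike of "apple"
        print(label != "apple" and skeleton(label) == "apple")  # True: suspicious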
  • Speaker Biographies
    SPEAKER BIOGRAPHIES

    Dr. Deborah Anderson, Researcher, Dept. of Linguistics, UC Berkeley. Deborah (Debbie) Anderson is a Researcher in the Department of Linguistics at UC Berkeley. Since 2002, she has run the Script Encoding Initiative project, which helps get scripts and characters into the Unicode Standard. She also represents UC Berkeley in the Unicode Technical Committee meetings and is a member of the US National Body in ISO/IEC JTC1/SC2. In addition, she is a Unicode Technical Director.

    Zibi Braniecki, Mozilla, Sr. Staff Platform Engineer. Zibi Braniecki is a Sr. Staff Platform Engineer at Mozilla working on internationalization and localization of Gecko and Firefox. Zibi represents Mozilla at the TC39 committee and is championing multiple ECMA402 proposals. When not in front of the keyboard, he's captaining the Polish National Team in Ultimate Frisbee.

    Shane Carr, Senior Software Engineer, Internationalization, Google, Inc. Shane Carr is a Senior Software Engineer on Google's i18n Engineering team. He is chair of the ECMA 402 subcommittee for JavaScript i18n standards and is a core contributor to the International Components for Unicode (ICU) project. His work on ICU has focused on locale data, number formatting, and performance optimization. Shane has previously presented on Zawgyi and on ICU number formatting at the 41st and 42nd Internationalization & Unicode Conference (IUC). He has also presented at the 33rd International Conference on Machine Learning (ICML) and the 2015 Annual Meeting of the American Institute of Chemical Engineers (AIChE). He holds an MS and BS in Computer Science and a BS in Chemical Engineering summa cum laude from Washington University in St.
  • 4.4. Is Unicode Font Available? 14 4.4.1
    Contents
    1. Introduction 3
      1.1. Who we are 3
      1.2. The UNESCO and IYIL initiative 4
    2. Process overview 4
      2.1. Language Status Workflow 5
      2.2. Technology Implementation Workflow 6
    3. Language Status 7
      3.1. Is language currently used by a community? 7
      3.2. Is language intended for active community use? 7
        3.2.1. Revitalize language 7
      3.3. Is language in a public registry? 8
      3.4. Is language written? 8
        3.4.1. Develop written form 8
        3.4.2. Document language 8
          3.4.2.1. Language is documented 8
          3.4.2.2. Language is not documented 9
      3.5. Does language use a consistent writing system? 9
        3.5.1. Are the characters used already supported? 9
      3.6. Is writing supported by a standard? 10
        3.6.1. Submit character proposals 10
        3.6.2. Develop standard 11
      3.7. Proceed to implementation 11
    4. Language Technology Implementation Workflow 11
      4.1. Note on technology for text in digital systems 11
      4.2. Definitions for implementing digital support 12
      4.3. Standard language code available? 13
        4.3.1. Apply for language code 13
      4.4. Is Unicode font available? 14
        4.4.1. Create font 14
      4.5. Is font available on devices? 14
        4.5.1. Manual install or ask vendors for support 14
      4.6. Does device have input support? 15
      4.7. Is input supported by third party apps or devices? 15
        4.7.1. Develop input method 15
      4.8. Does device have Unicode data support? 16
    This work is licensed under a Creative Commons Attribution 4.0 International License
  • Unicode Encoding the TITUS Project
    ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Unicode Encoding and Online Data Access Ralf Gehrke / Jost Gippert The TITUS Project („Thesaurus indogermanischer Text- und Sprachmaterialien“) (since 1987/1993) www.ala.org/alcts 1 ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Scope of the TITUS project: • Electronic retrieval engine covering the textual heritage of all ancient Indo-European languages • Present retrieval task: – Documentation of the usage of all word forms occurring in the texts, in their resp. contexts • Survey of the parts of the text database: – http://titus.uni-frankfurt.de/texte/texte2.htm Data formats (since 1995): • Text formats: – WordCruncher Text format (8-Bit) – HTML (UTF-8 Unicode 4.0) – (Plain 7-bit ASCII format) • Database format: – MS Access (relational, Unicode-based) – Retrieval via SQL www.ala.org/alcts 2 ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Original Scripts Covered: • Latin (with all kinds of diacritics), incl. variants* • Greek • Slavic (Cyrillic and Glagolitic*) • Armenian • Georgian • Devangar • Other Brhm scripts (Tocharian, Khotanese)* • Avestan* • Middle Persian (Pahlav)* • Manichean* • Arabic (incl. Persian) • Runic • Ogham • and many more * not yet encodable (as such) in Unicode Example 1a: Donelaitis (Lithuanian: formatted text incl. diacritics: 8-bit version) www.ala.org/alcts 3 ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Example 1b: Donelaitis (Lithuanian: formatted text incl. diacritics: Unicode version) Example 2a: Catechism (Old Prussian: formatted text incl. diacritics: 8-bit version, special TITUS font) www.ala.org/alcts 4 ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Example 2b: Catechism (Old Prussian: formatted text incl.
  • Name Synopsis Description
    Perl version 5.10.0 documentation - Locale::Script

    NAME
    Locale::Script - ISO codes for script identification (ISO 15924)

    SYNOPSIS
      use Locale::Script;
      use Locale::Constants;

      $script = code2script('ph');                                  # 'Phoenician'
      $code   = script2code('Tibetan');                             # 'bo'
      $code3  = script2code('Tibetan', LOCALE_CODE_ALPHA_3);        # 'bod'
      $codeN  = script2code('Tibetan', LOCALE_CODE_ALPHA_NUMERIC);  # 330

      @codes   = all_script_codes();
      @scripts = all_script_names();

    DESCRIPTION
    The Locale::Script module provides access to the ISO codes for identifying scripts, as defined in ISO 15924. For example, Egyptian hieroglyphs are denoted by the two-letter code 'eg', the three-letter code 'egy', and the numeric code 050. You can either access the codes via the conversion routines (described below), or with the two functions which return lists of all script codes or all script names.

    There are three different code sets you can use for identifying scripts:
    • alpha-2: Two-letter codes, such as 'bo' for Tibetan. This code set is identified with the symbol LOCALE_CODE_ALPHA_2.
    • alpha-3: Three-letter codes, such as 'ell' for Greek. This code set is identified with the symbol LOCALE_CODE_ALPHA_3.
    • numeric: Numeric codes, such as 410 for Hiragana. This code set is identified with the symbol LOCALE_CODE_NUMERIC.

    All of the routines take an optional additional argument which specifies the code set to use. If not specified, it defaults to the two-letter codes. This is partly for backwards compatibility (previous versions of Locale modules only supported the alpha-2 codes), and partly because they are the most widely used codes. The alpha-2 and alpha-3 codes are not case-dependent, so you can use 'BO', 'Bo', 'bO' or 'bo' for Tibetan. When a code is returned by one of the functions in this module, it will always be lower-case.
  • Referência Debian I
    Referência Debian (Debian Reference), by Osamu Aoki. Copyright © 2013-2021 Osamu Aoki. This Debian Reference (version 2.85) (2021-09-17 09:11:56 UTC) is intended to provide a general overview of the Debian system as a post-installation user guide. It covers many aspects of system administration through shell-command examples for non-programmers. COLLABORATORS: written by Osamu Aoki, 17 September 2021.

    Contents
    1 GNU/Linux manuals 1
      1.1 Console basics 1
        1.1.1 The shell command line 1
        1.1.2 The shell prompt under GUI 2
        1.1.3 The root account 2
        1.1.4 The root shell command line 3
        1.1.5 GUI system administration tools 3
        1.1.6 Virtual consoles 3
        1.1.7 How to leave the command line 3
        1.1.8 How to shut down the system 4
        1.1.9 Recovering a sane console 4
        1.1.10 Additional package suggestions for the novice 4
        1.1.11 An extra user account 5
        1.1.12 Configuration
  • Fun with Unicode - an Overview About Unicode Dangers
    Fun with Unicode - an overview about Unicode dangers, by Thomas Skora

    Overview
    • Short introduction to Unicode/UTF-8
    • Fooling charset detection
    • Ambiguous encoding
    • Ambiguous characters
    • Normalization overflows your buffer
    • Casing breaks your XSS filter
    • Unicode in domain names – how to shorten payloads
    • Text direction

    Unicode/UTF-8
    • Unicode = character set
    • Encodings: UTF-8 (common standard on the web, …), UTF-16 (often used as internal representation), UTF-7 (if the 8th bit is not safe), UTF-32 (yes, it exists...)

    UTF-8
    • Often used in Internet communication, e.g. the web.
    • Efficient: minimum length 1 byte.
    • Variable length, up to 7 bytes (theoretical).
    • Downwards-compatible: first 127 chars use ASCII encoding.
    • 1 byte: 0xxxxxxx
    • 2 bytes: 110xxxxx 10xxxxxx
    • 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    • ...got it? ;-)

    UTF-16
    • Often used for internal representation: Java, .NET, Windows, …
    • Inefficient: minimum length per char is 2 bytes.
    • Byte order? Byte Order Mark! → U+FEFF. A BOM at the beginning of an HTML document overrides the character set definition in IE.
    • Y\x00o\x00u\x00 \x00k\x00n\x00o\x00w\x00 \x00t\x00h\x00i\x00s\x00?\x00

    UTF-7
    • Unicode chars in environments that are not 8-bit-safe. Used in SMTP, NNTP, …
    • Personal opinion: browser support was an inside job of the security industry. Why? Because: <script>alert(1)</script> == +Adw-script+AD4-alert(1)+ADw-/script+AD4-
    • Fortunately (for the defender), support has been dropped by browser vendors.

    Byte Order Mark
    • U+FEFF, appears as: 
    • W3C says: the BOM has priority over the declaration. IE 10+11 just dropped this insecure behavior; we should expect that it comes back.
      http://www.w3.org/International/tests/html-css/character-encoding/results-basics#precedence
      http://www.w3.org/International/questions/qa-byte-order-mark.en#bomhow
    • If you control the first character of an HTML document, then you also control its character set.
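    The byte patterns listed above are easy to verify (our quick check):

        # 1-, 2-, and 3-byte UTF-8 sequences for increasingly high codepoints.
        for ch in ("A", "é", "€"):
            print(ch, f"U+{ord(ch):04X}", [f"{b:08b}" for b in ch.encode("utf-8")])
        # A U+0041 ['01000001']                            0xxxxxxx
        # é U+00E9 ['11000011', '10101001']                110xxxxx 10xxxxxx
        # € U+20AC ['11100010', '10000010', '10101100']    1110xxxx 10xxxxxx 10xxxxxx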
  • Exploiting Unicode-Enabled Software
    Unraveling Unicode: A Bag of Tricks for Bug Hunting. Black Hat USA, July 2009. Chris Weber, www.lookout.net, [email protected], Casaba Security (www.casabasecurity.com). © 2009 Chris Weber

    Can you tell the difference? How about now?

    The Transformers: when good input turns bad, <scrİpt> becomes <script>

    Agenda
    • Unicode crash course
    • Root causes
    • Attack vectors
    • Tools: finding Unicode issues in web-testing; visual spoofing detection

    Unicode Crash Course
    • The Unicode attack surface: end users, applications, databases, programming languages, operating systems
    • Unthink it
    • A large and complex standard: code points, canonical mappings, encodings, decomposition types, categorization, case folding, normalization, best-fit mapping, binary properties, 17 planes, case mapping, private use ranges, conversion tables, script blocks, bi-directional properties, escapings
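    The "<scrİpt> becomes <script>" transformation above can be reproduced with standard case mapping and normalization (our illustration):

        # Lowercasing U+0130 yields 'i' + COMBINING DOT ABOVE; stripping
        # combining marks after decomposition lets the payload match "script".
        import unicodedata

        payload = "<scr\u0130pt>"
        lowered = payload.lower()   # '<scri̇pt>' - 'i' plus U+0307
        print("script" in lowered)  # False: the combining mark is in the way
        stripped = "".join(c for c in unicodedata.normalize("NFD", lowered)
                           if not unicodedata.combining(c))
        print(stripped, "script" in stripped)  # '<script>' True: filter bypassed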