Detect Visual Spoofing in Unicode-Based Text*


2010 International Conference on Pattern Recognition

Detect Visual Spoofing in Unicode-based Text*

Bite Qiu, Ning Fang, Liu Wenyin
Department of Computer Science, City University of Hong Kong
[email protected], [email protected], [email protected]

Abstract

Visual spoofing in Unicode-based text is anticipated to become a severe web security problem in the near future, as more and more Unicode-based web documents come into use. In this paper, to detect whether a suspicious Unicode character in a word is visual spoofing or not, the context of the suspicious character is utilized by employing a Bayesian framework. Specifically, two kinds of context are taken into consideration: the simple context and the general context. The simple context of a suspicious character is the word in which the character occurs, while the general context consists of all homoglyphs of the character within the Universal Character Set (UCS). Three decision rules are designed and used jointly for convicting a suspicious character. Preliminary evaluations and a user study show that the proposed approach can detect Unicode-based visual spoofing with high effectiveness and efficiency.

1. Introduction

There are many similar-looking characters in the Universal Character Set (UCS), which can cause severe web security problems. Unicode-based web homoglyph fraud (a.k.a. the homograph attack [1]) is just one such example. A homoglyph is one of two or more characters whose shapes are either identical or cannot be differentiated by instant visual inspection [2]. The homoglyph is therefore useful in many content-based attacks, such as phishing attacks and spam attacks. A real case is a faked "paypаl.com", in which the second 'а' is a Cyrillic letter (U+0430) instead of the Latin 'a' (U+0061), which was successfully registered in 2005.

We classify visual spoofing defenses in IDN into two categories: transformation and restriction. Punycode, as proposed by the Unicode Consortium [3][4], is a widely adopted transformation technique. It allows non-ASCII characters to be transformed uniquely and reversibly into ASCII characters. For example, the Cyrillic "а" is mapped to the string "xn--80a". Restriction techniques are usually deployed at the domain-name registration level. A top-level-domain registry may apply policies [8] to restrict the usage of homoglyphs in IDNs and specify methods to monitor homographic domains. In addition, Liu et al. [5] proposed to color suspicious characters with a fixed or an adaptive color palette when mixed scripts are found within a single word. Finally, browser vendors may integrate the above defenses into their browsers and provide options for users to customize the desired security level [7].

The above defenses take security actions for IDNs without knowing whether a real attack is present. Even though homoglyphs are potentially exploited for deceptive usage, it is a mistake to conclude that all homoglyphs are malicious or spoofing. For example, homoglyphs of the character 'l' should be considered spoofing when they occur in the context of "PayPal", but may not be considered spoofing in another, semantics-free context such as "letter-l". Both transformation and restriction techniques inconvenience end users and domain name owners. Moreover, these IDN defenses leave plain text in web content unprotected. Therefore, we propose a context-aware method to detect malicious homoglyphs, which allows browsers to take smarter security actions that tackle real attacks and minimize the disturbance to end users.

A Unicode visual spoofing string is usually produced by replacing one or more characters of the legitimate string with homoglyphs from the characters' general context. We assume, as a kind of prior knowledge, that a legitimate string occurs more frequently on the Web than its spoofing string. Taking this prior knowledge into consideration, we employ a Bayesian framework to detect a suspicious string. The framework also incorporates the visual similarities of homoglyphs, which are adopted from Fu et al. [6]. Through a series of evaluations, the model is shown to be effective in identifying suspicious characters as spoofing or not.

* The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 117907].

1051-4651/10 $26.00 © 2010 IEEE. DOI 10.1109/ICPR.2010.480

2. The approach

2.1. Definitions

The proposed approach takes full advantage of the context of a suspicious character/string within a probabilistic Bayesian model; it can mark (e.g., color) spoofing characters in a given string and prevent end users from being deceived. Specifically, we define and distinguish two types of contexts: the simple context and the general context. The simple context of a character is defined as the set of Unicode characters of the word in which the character occurs. The general context of a character is defined as the set of homoglyphs of the character in UCS. They are denoted as follows:

SC(c) := \{\, c_s \mid c_s \in w,\ c \in w,\ c_s \neq c \,\}, \quad (1)

GC(c) := \{\, c_g \mid c_g \sim c,\ c_g \in \mathrm{UCS},\ c \in \mathrm{UCS} \,\}, \quad (2)

where c denotes a suspicious character; w denotes the word in which c occurs; c_s denotes a neighboring character of c in w; c_g denotes a homoglyph of c; SC(c) denotes the set of neighboring characters of c in w; and GC(c) is the set of homoglyphs of c. The symbol '\sim' denotes the visually-similar relation under a specified threshold.

The simple context and the general context of a suspicious character are used as prior knowledge to calculate the prior distribution. The prior distribution of c is the proportion of the occurring frequency of the original word w (in which c occurs) to the total occurring frequencies of all similar words (referred to as new words in the rest of this paper) in which c is replaced by its homoglyphs:

\text{New words} := \{\, w_{w[i]=c_g} \mid \forall c_g \in GC(c) \,\}, \quad (3)

\text{Prior distribution} := p\big(f(w_{w[i]=c_g})\big), \quad (4)

where w_{w[i]=c_g} denotes a new word in which w's i-th character is replaced with c_g; f(\cdot) denotes the occurring frequency of the corresponding word; and p(\cdot) is the probability density function, i.e., the normalized occurring frequency.

2.2. Bayesian inference

For a given suspicious word, we iteratively check each character in the word. If a certain character is determined to be spoofing, it is highlighted to warn users. Each character SC[i] (i.e., the i-th character in the simple context/word) is examined in the following way: for each homoglyph c_g in the general context of SC[i], the probability of c_g conditioned on the simple context of SC[i], represented as P(c_g|SC), is calculated; if the resulting probability of SC[i], denoted as P(SC[i]|SC), is larger than that of any homoglyph in its general context, SC[i] is considered a legitimate character; otherwise it is a spoofing character:

SC[i] = \begin{cases} \text{legitimate} & \text{if } \forall c_g \in GC(SC[i]):\ P(SC[i] \mid SC) > P(c_g \mid SC) \\ \text{spoofing} & \text{otherwise} \end{cases} \quad (5)

where P(SC[i]|SC) and P(c_g|SC) can be derived with the following formula:

P(x \mid SC) = \frac{P(SC \mid x) \cdot P(x)}{P(SC)} = A \cdot P(SC \mid x) \cdot P(x), \quad (6)

where x \in GC(SC[i]), i denotes the position of the suspicious character, and A is the constant 1/P(SC). P(SC|x) can be derived as:

P(SC \mid x) = \frac{f(w_{w[i]=x}) \cdot Sim(x, SC[i])}{\sum_{c_g \in GC(SC[i])} f(w_{w[i]=c_g}) \cdot Sim(c_g, SC[i])}, \quad (7)

where f(w_{w[i]=x}) is the frequency of w_{w[i]=x}, valued as the number of results returned from Google with w_{w[i]=x} as the query, and Sim(x, c_g) is the visual similarity between x and c_g, derived with an application proposed by Fu et al. [6]. The value of P(x) can be obtained as follows:

P(x) = \frac{f(x)}{\sum_{c \in \mathrm{UCS}} f(c)}. \quad (8)

2.3. Decision rules based on context

In fact, it is unnecessary to examine all possible words. Hence, three heuristic decision rules are developed to prune away the unnecessary computations.

A regular text (e.g., a URL, webpage, or e-mail) generally involves only one or a few languages; its simple context is thus limited to a small set of Unicode groups/subgroups. For example, English usually adopts Latin scripts, and Chinese uses CJK scripts. Therefore, we define the first rule for verifying a suspicious character as follows:

Rule 1: A legitimate string tends to involve a limited number of Unicode groups/subgroups in UCS, usually only one group/subgroup. If the Unicode group of a character is different from that of its neighboring characters in its simple context, the character is judged as spoofing, and as legitimate otherwise. The rule is denoted as follows:

c = \begin{cases} \text{spoofing} & \text{if } UG(c) \neq UG(c_s) \\ \text{legitimate} & \text{otherwise} \end{cases} \quad (9)

where the function UG(c) denotes the Unicode group/subgroup of c.

The above rule is not sufficient to detect spoofing, because there are many legitimate usages of mixed scripts. In particular, it is quite common to mix English words (with Latin characters) with other languages, including languages using non-Latin scripts. Even in English, legitimate product/organization names may contain non-Latin characters, such as Ωmega, Teχ, Toys-Я-Us, and HλLF-LIFE. Moreover, English also adopts some words from other languages, e.g., "résumé" […]

[…] Unicode visual spoofing detection. Moreover, any new rule can also be generated and added to this approach according to the user's experience and knowledge. In addition, a larger simple context is sometimes necessary, such as the phrase or the whole sentence in which the suspicious character occurs.

3. Experiments and Evaluation

In this paper, we frequently need to know the visual similarity of two readable characters. […]
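The Bayesian decision of Section 2.2 can be sketched in a few lines of code. This is a minimal illustration only: the homoglyph sets, word frequencies, and similarity scores below are invented stand-ins for the general-context table, the Google hit counts f(·), and the Sim(·,·) scores of Fu et al. that the paper actually uses.

```python
# Sketch of the paper's Bayesian decision rule (Eqs. 5-8).
# All data below are hypothetical stand-ins: a real deployment would use
# web hit counts for f(.) and the visual-similarity scores of Fu et al.

# General context: hypothetical homoglyph sets (the character itself is
# included as a candidate so it can compete with its look-alikes).
GC = {
    'a': ['a', '\u0430'],          # Latin 'a' vs Cyrillic 'а'
    '\u0430': ['a', '\u0430'],
}

# Toy word frequencies f(w), standing in for Google result counts.
FREQ = {'paypal': 1_000_000, 'payp\u0430l': 50}

# Hypothetical visual similarity Sim(x, c): 1.0 for identical glyph shapes.
def sim(x, c):
    return 1.0

def replace_at(word, i, ch):
    return word[:i] + ch + word[i + 1:]

def posterior(word, i, x):
    """P(x | SC) up to the constant A = 1/P(SC), per Eqs. (6)-(7)."""
    candidates = GC.get(word[i], [word[i]])
    denom = sum(FREQ.get(replace_at(word, i, c), 0) * sim(c, word[i])
                for c in candidates)
    if denom == 0:
        return 0.0
    # P(x) from Eq. (8) would multiply in here; with a flat toy prior it
    # does not change the comparison, so it is omitted in this sketch.
    return FREQ.get(replace_at(word, i, x), 0) * sim(x, word[i]) / denom

def judge(word, i):
    """Eq. (5): legitimate iff the observed character beats every homoglyph."""
    observed = posterior(word, i, word[i])
    rivals = [posterior(word, i, c) for c in GC.get(word[i], []) if c != word[i]]
    return 'legitimate' if all(observed > r for r in rivals) else 'spoofing'

print(judge('payp\u0430l', 4))  # Cyrillic 'а' in the "paypal" context -> spoofing
print(judge('paypal', 4))       # Latin 'a' -> legitimate
```

Because "paypal" vastly outnumbers "paypаl" in the toy frequency table, the Cyrillic look-alike loses the posterior comparison, which is exactly the prior-knowledge assumption the paper builds on.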
Recommended publications
  • The Unicode Cookbook for Linguists: Managing Writing Systems Using Orthography Profiles
    Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 2017 The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles Moran, Steven ; Cysouw, Michael DOI: https://doi.org/10.5281/zenodo.290662 Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-135400 Monograph The following work is licensed under a Creative Commons: Attribution 4.0 International (CC BY 4.0) License. Originally published at: Moran, Steven; Cysouw, Michael (2017). The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles. CERN Data Centre: Zenodo. DOI: https://doi.org/10.5281/zenodo.290662 The Unicode Cookbook for Linguists Managing writing systems using orthography profiles Steven Moran & Michael Cysouw Preface This text is meant as a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together. The intersection of the Unicode Standard and the International Phonetic Alphabet is often not met without frustration by users. Nevertheless, the two standards have provided language researchers with a consistent computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA. Our research uses quantitative methods to compare languages and uncover and clarify their phylogenetic relations. However, the majority of lexical data available from the world's languages is in author- or document-specific orthographies.
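One of the Unicode pitfalls this kind of guide covers can be shown in a few lines: visually identical strings may differ at the code-point level until normalized. This is a generic illustration, not an example taken from the book.

```python
import unicodedata

# 'é' as one precomposed code point vs. 'e' plus a combining acute accent.
composed = '\u00e9'      # é (U+00E9)
decomposed = 'e\u0301'   # e + U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)  # False: same glyph, different code points

# NFC normalization composes the sequence, making the comparison succeed.
print(unicodedata.normalize('NFC', decomposed) == composed)  # True
```

Any pipeline comparing orthographic or IPA strings therefore has to pick a normalization form and apply it consistently before comparison.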
  • Detection of Suspicious IDN Homograph Domains Using Active DNS Measurements
    A Case of Identity: Detection of Suspicious IDN Homograph Domains Using Active DNS Measurements Ramin Yazdani, Olivier van der Toorn, Anna Sperotto (University of Twente, Enschede, The Netherlands) [email protected] [email protected] [email protected] Abstract—The possibility to include Unicode characters in domain names allows users to deal with domains in their regional languages. This is done by introducing Internationalized Domain Names (IDN). However, the visual similarity between different Unicode characters - called homoglyphs - is a potential security threat, as visually similar domain names are often used in phishing attacks. Timely detection of suspicious homograph domain names is an important step towards preventing sophisticated attacks, since this can prevent unaware users from accessing those homograph domains that actually carry malicious content. We therefore propose a structured approach to identify suspicious homograph domain names based not on use, but on characteristics of the domain name itself and its associated DNS records. […] problem in the case of Unicode–ASCII homograph domains. We propose a low-cost method for proactively detecting suspicious IDNs. Since our proactive approach is based not on use, but on characteristics of the domain name itself and its associated DNS records, we are able to provide an early alert for both domain owners as well as security researchers to further investigate these domains before they are involved in malicious activities. The main contributions of this paper are that we: • propose an improved Unicode Confusion table able to detect 2.97 times as many homograph domains as the state-of-the-art confusion tables; • combine active DNS measurements and Unicode
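A simple building block behind such homograph detection is a mixed-script check on a domain label. The sketch below approximates a character's script by the first word of its Unicode character name; real confusion tables (and the improved table this work proposes) are far more thorough.

```python
import unicodedata

def scripts(label):
    """Approximate the set of scripts in a domain label via character names."""
    found = set()
    for ch in label:
        if ch in '-.0123456789':
            continue  # digits, dots, and hyphens are script-neutral
        name = unicodedata.name(ch, '')
        found.add(name.split(' ')[0])  # e.g. 'LATIN', 'CYRILLIC'
    return found

def looks_mixed(label):
    """Flag labels mixing scripts, a common trait of homograph domains."""
    return len(scripts(label)) > 1

print(looks_mixed('paypal'))        # False: pure Latin
print(looks_mixed('payp\u0430l'))   # True: Latin plus Cyrillic 'а'
```

A mixed-script flag alone over-reports (many languages legitimately mix scripts), which is why DNS-record features are combined with it in this line of work.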
  • Imperceptible NLP Attacks
    Bad Characters: Imperceptible NLP Attacks Nicholas Boucher Ilia Shumailov University of Cambridge University of Cambridge Cambridge, United Kingdom Vector Institute, University of Toronto [email protected] [email protected] Ross Anderson Nicolas Papernot University of Cambridge Vector Institute, University of Toronto University of Edinburgh Toronto, Canada [email protected] [email protected] Abstract—Several years of research have shown that machine- […] Indeed, the title of this paper contains 1000 invisible characters
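Invisible characters of the kind hidden in that title can be surfaced programmatically: Unicode assigns them to the "format" general category (Cf). A short, generic sketch:

```python
import unicodedata

def invisible_chars(text):
    """List the format-category (Cf) code points hiding in a string."""
    return [f'U+{ord(c):04X}' for c in text if unicodedata.category(c) == 'Cf']

visible = 'Bad Characters'
attacked = 'Bad\u200b Characters\u200d'  # zero-width space and joiner inserted

print(invisible_chars(visible))   # []
print(invisible_chars(attacked))  # ['U+200B', 'U+200D']
```

The two strings render (near-)identically yet tokenize differently, which is precisely what makes such perturbations effective against NLP systems.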
  • The IDN Variant Issues Project: A Study of Issues Related to the Management of IDN Variant TLDs (Integrated Issues Report)
    The IDN Variant Issues Project: A Study of Issues Related to the Management of IDN Variant TLDs (Integrated Issues Report), 20 February 2012. Contents: Executive Summary; 1 Overview of this Report; 1.1 Fundamental Assumptions; 1.2 Variants and the Current Environment; 2 Project Overview; 2.1 The Variant Issues Project; 2.2 Objectives of the Integrated Issues Report; 2.3 Scope of the Integrated Issues Report; 3 Range of Possible Variant Cases Identified; 3.1 Classification of Variants as Discovered; 3.1.1 Code Point Variants; 3.1.2 Whole-String Variants; 3.2 Taxonomy of Identified Variant Cases; 3.3 Discussion of Variant Classes; 3.4 Visual Similarity Cases; 3.4.1 Treatment of Visual Similarity Cases; 3.4.2 Cross-Script Visual Similarity; 3.4.3 Terminology concerning Visual Similarity; 3.5 Whole-String Issues; 3.6 Synopsis of Issues
  • Those Troublesome Characters: A Registry of Unicode Code Points Needing Special Consideration When Used in Network Identifiers
    IETF Internet-Draft draft-freytag-troublesome-characters-02. A. Freytag (ASMUS, Inc.), J. Klensin, A. Sullivan (Oracle Corp.). Intended status: Standards Track. June 29, 2018; expires December 31, 2018. Those Troublesome Characters: A Registry of Unicode Code Points Needing Special Consideration When Used in Network Identifiers. Abstract: Unicode's design goal is to be the universal character set for all applications. The goal entails the inclusion of very large numbers of characters. It is also focused on written language in general; special provisions have always been needed for identifiers. The sheer size of the repertoire increases the possibility of accidental or intentional use of characters that can cause confusion among users, particularly where linguistic context is ambiguous, unavailable, or impossible to determine. A registry of code points that can be sometimes especially problematic may be useful to guide system administrators in setting parameters for allowable code points or combinations in an identifier system, and to aid applications in creating security aids for users. Status of This Memo: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." This Internet-Draft will expire on December 31, 2018.
  • MSR-4: Annotated Repertoire Tables, Non-CJK
    Maximal Starting Repertoire - MSR-4 Annotated Repertoire Tables, Non-CJK Integration Panel Date: 2019-01-25 How to read this file: This file shows all non-CJK characters that are included in the MSR-4 with a yellow background. The set of these code points matches the repertoire specified in the XML format of the MSR. Where present, annotations on individual code points indicate some or all of the languages a code point is used for. This file lists only those Unicode blocks containing non-CJK code points included in the MSR. Code points listed in this document, which are PVALID in IDNA2008 but excluded from the MSR for various reasons are shown with pinkish annotations indicating the primary rationale for excluding the code points, together with other information about usage background, where present. Code points shown with a white background are not PVALID in IDNA2008. Repertoire corresponding to the CJK Unified Ideographs: Main (4E00-9FFF), Extension A (3400-4DBF), Extension B (20000-2A6DF), and Hangul Syllables (AC00-D7A3) are included in separate files. For links to these files see "Maximal Starting Repertoire - MSR-4: Overview and Rationale". How the repertoire was chosen: This file only provides a brief categorization of code points that are PVALID in IDNA2008 but excluded from the MSR. For a complete discussion of the principles and guidelines followed by the Integration Panel in creating the MSR, as well as links to the other files, please see "Maximal Starting Repertoire - MSR-4: Overview and Rationale". Brief description of exclusion
  • Maximal Starting Repertoire — MSR-4: Overview and Rationale
    Integration Panel: Maximal Starting Repertoire — MSR-4, Overview and Rationale. REVISION – November 09, 2018. Table of Contents: 1 Overview; 2 Maximal Starting Repertoire (MSR-4); 2.1 Files (2.1.1 Overview; 2.1.2 Normative Definition; 2.1.3 Code Charts); 2.2 Determining the Contents of the MSR; 2.3 Process of Deciding the MSR; 3 Scripts; 3.1 Comprehensiveness and Staging; 3.2 What Defines a Related Script?; 3.3 Separable Scripts; 3.4 Deferred Scripts; 3.5 Historical and Obsolete Scripts; 3.6 Selecting Scripts and Code Points for the MSR; 3.7 Scripts Appropriate for Use in Identifiers; 3.8 Modern Use Scripts (3.8.1 Common and Inherited; 3.8.2 Scripts included in MSR-1; 3.8.3 Scripts added in MSR-2; 3.8.4 Scripts added in MSR-3 or MSR-4; 3.8.5 Modern Scripts Ineligible for the Root Zone); 3.9 Scripts for Possible Future MSRs; 3.10 Scripts Identified in UAX#31 as Not Suitable for Identifiers; 4 Exclusions of Individual Code Points or Ranges; 4.1 Historic and Phonetic Extensions to Modern Scripts; 4.2 Code Points That Pose Special Risks; 4.3 Code Points with Strong Justification to Exclude; 4.4 Code Points That May or May Not be Excludable from the Root Zone LGR; 4.5 Non-spacing Combining Marks; 5 Discussion of Particular Code Points; 5.1 Digits and Hyphen; 5.2 CONTEXT O Code Points; 5.3 CONTEXT J Code Points; 5.4 Code Points Restricted for Identifiers; 5.5 Compatibility with IDNA2003; 5.6 Code Points for Which the […]
  • A Comparative Analysis of Information Hiding Techniques for Copyright Protection of Text Documents
    Hindawi Security and Communication Networks Volume 2018, Article ID 5325040, 22 pages https://doi.org/10.1155/2018/5325040 Review Article A Comparative Analysis of Information Hiding Techniques for Copyright Protection of Text Documents Milad Taleby Ahvanooey,1 Qianmu Li,1 Hiuk Jae Shim,2 and Yanyan Huang3 1School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China 2School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China 3School of Automation, Nanjing University of Science and Technology, Nanjing, China Correspondence should be addressed to Milad Taleby Ahvanooey; [email protected] and Qianmu Li; [email protected] Received 6 October 2017; Revised 28 December 2017; Accepted 16 January 2018; Published 17 April 2018 Academic Editor: Pino Caballero-Gil Copyright © 2018 Milad Taleby Ahvanooey et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. With the ceaseless usage of web and other online services, it has turned out that copying, sharing, and transmitting digital media over the Internet are amazingly simple. Since the text is one of the main available data sources and most widely used digital media on the Internet, the significant part of websites, books, articles, daily papers, and so on is just the plain text. Therefore, copyrights protection of plain texts is still a remaining issue that must be improved in order to provide proof of ownership and obtain the desired accuracy. During the last decade, digital watermarking and steganography techniques have been used as alternatives to prevent tampering, distortion, and media forgery and also to protect both copyright and authentication.
  • Attacking Neural Text Detectors
    ATTACKING NEURAL TEXT DETECTORS Max Wolff∗ Stuart Wolff Viewpoint School Calabasas, CA 91302, USA [email protected] ABSTRACT Machine learning based language models have recently made significant progress, which introduces a danger to spread misinformation. To combat this potential danger, several methods have been proposed for detecting text written by these language models. This paper presents two classes of black-box attacks on these detectors: one which randomly replaces characters with homoglyphs, and the other a simple scheme to purposefully misspell words. The homoglyph and misspelling attacks decrease a popular neural text detector's recall on neural text from 97.44% to 0.26% and 22.68%, respectively. Results also indicate that the attacks are transferable to other neural text detectors. 1 Introduction Contemporary state-of-the-art language models such as GPT-2 [1] are rapidly improving, as they are being trained on increasingly large datasets and defined using billions of parameters. Language models are currently able to generate coherent text that humans can identify as machine-written text (neural text) with approximately 54% accuracy [2] – close to random guessing. With this increasing power, language models provide bad actors with the potential to spread misinformation on an unprecedented scale [3] and undermine clear authorship. To reduce the spread of misinformation via language models and give readers a better sense of what entity (machine or human) may have actually written a piece of text, multiple neural text detection methods have been proposed. Two automatic neural text detectors are considered in this work, RoBERTa [3, 4] and GROVER [5], which are 95% and 92% accurate in discriminating neural text from human-written text, respectively.
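The homoglyph attack described here amounts to swapping characters for visual twins so that the rendered text is unchanged while the underlying code points (and hence tokenization) differ. A minimal sketch with a small, hand-picked mapping; the pairs and replacement rate are illustrative choices, not the paper's actual attack parameters.

```python
import random

# Hand-picked Latin -> Cyrillic look-alike pairs (a tiny illustrative subset).
HOMOGLYPHS = {'a': '\u0430', 'e': '\u0435', 'o': '\u043e', 'c': '\u0441'}

def homoglyph_attack(text, rate=0.5, seed=0):
    """Randomly replace eligible characters with visually similar twins."""
    rng = random.Random(seed)  # fixed seed keeps the sketch deterministic
    out = []
    for ch in text:
        if ch in HOMOGLYPHS and rng.random() < rate:
            out.append(HOMOGLYPHS[ch])
        else:
            out.append(ch)
    return ''.join(out)

original = 'a language model wrote this sentence'
attacked = homoglyph_attack(original)

print(attacked)              # renders (near-)identically to the original
print(attacked == original)  # False: the underlying code points differ
```

A detector that tokenizes on code points sees entirely different tokens for the attacked string, which is why such perturbations collapse recall so dramatically.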
  • The Unicode Cookbook for Linguists
    The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles. Steven Moran & Michael Cysouw. Language Science Press, series: Translation and Multilingual Natural Language Processing, vol. 10. Series editors: Oliver Czulo (Universität Leipzig), Silvia Hansen-Schirra (Johannes Gutenberg-Universität Mainz), Reinhard Rapp (Johannes Gutenberg-Universität Mainz). In this series: 1. Fantinuoli, Claudio & Federico Zanettin (eds.). New directions in corpus-based translation studies. 2. Hansen-Schirra, Silvia & Sambor Grucza (eds.). Eyetracking and Applied Linguistics. 3. Neumann, Stella, Oliver Čulo & Silvia Hansen-Schirra (eds.). Annotation, exploitation and evaluation of parallel corpora: TC3 I. 4. Czulo, Oliver & Silvia Hansen-Schirra (eds.). Crossroads between Contrastive Linguistics, Translation Studies and Machine Translation: TC3 II. 5. Rehm, Georg, Felix Sasaki, Daniel Stein & Andreas Witt (eds.). Language technologies for a multilingual Europe: TC3 III. 6. Menzel, Katrin, Ekaterina Lapshinova-Koltunski & Kerstin Anna Kunz (eds.). New perspectives on cohesion and coherence: Implications for translation. 7. Hansen-Schirra, Silvia, Oliver Czulo & Sascha Hofmann (eds.). Empirical modelling of translation and interpreting. 8. Svoboda, Tomáš, Łucja Biel & Krzysztof Łoboda (eds.). Quality aspects in institutional translation. 9. Fox, Wendy. Can integrated titles improve the viewing experience? Investigating the impact of subtitling on the reception and enjoyment of film using eye tracking and questionnaire data. 10. Moran, Steven & Michael Cysouw. The Unicode cookbook for linguists: Managing writing systems using orthography profiles. ISSN: 2364-8899. Citation: Steven Moran & Michael Cysouw. 2018. The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles (Translation and Multilingual Natural Language Processing 10).
  • Detection of Malicious IDN Homoglyph Domains Using Active DNS Measurements
    Faculty of Electrical Engineering, Mathematics & Computer Science Detection of Malicious IDN Homoglyph Domains Using Active DNS Measurements Ramin Yazdani Master of Science Thesis August 2019 Supervisors: dr. Anna Sperotto Olivier van der Toorn, MSc Graduation Committee: prof.dr.ir. Aiko Pras dr. Anna Sperotto dr.ir. Roland van Rijswijk-Deij dr. Doina Bucur Olivier van der Toorn, MSc DACS Research Group Faculty of Electrical Engineering, Mathematics and Computer Science University of Twente P.O. Box 217 7500 AE Enschede The Netherlands Preface Throughout conducting this research, I received great support from several people. I would like to express my appreciation to my supervisor, dr. Anna Sperotto for her great support in formulating my research. I received great critical and encouraging feedback from you during our meetings. Thanks for always being concerned about my progress as well as whether I liked the topic. I would also like to thank Olivier van der Toorn, MSc for always being enthusiastic and willing to help me when I had difficulties in figuring out some aspects of my research. I interrupted you every now and then, but you always kindly helped me. In addition, I would like to thank the other members of my graduation committee, prof.dr.ir. Aiko Pras, dr.ir. Roland van Rijswijk-Deij, dr. Doina Bucur and members of the DACS research group for their precious remarks during this research which helped me to better steer my thesis. Last but not least, a special thanks to my family. Words cannot express how grateful I am to them for always being supportive during different stages of my studies.
  • The 2011 IDN Homograph Attack Mitigation Survey
    The 2011 IDN Homograph Attack Mitigation Survey P. Hannay1 and G. Baatard1 1ECUSRI, Edith Cowan University, Perth, WA, Australia Abstract - The advent of internationalized domain names (IDNs) has introduced a new threat, with non-English character sets allowing for visual mimicry of domain names. Whilst the potential for this form of attack has been well recognized, many applications such as Internet browsers and e-mail clients have been slow to adopt successful mitigation strategies and countermeasures. This research examines those strategies and countermeasures, identifying areas of weakness that allow for homograph attacks. As well as examining the presentation of IDNs in e-mail clients and Internet browser URL bars, this year's study examines the presentation of IDNs in browser-based security certificates and requests for locational data access. Keywords: IDN, homograph, network security. […] domains in their respective languages. In 1998 the initial work on Internationalized Domain Names (IDN) began. This work and subsequent work culminated in 2003 with the publication of RFC3454, RFC3490, RFC3491 and RFC3492, a set of documents outlining the function and proposed implementation of IDN (Bell-ATL 2011). The proposed IDN solution made use of UTF-8 character encoding to allow for non-Latin characters to be displayed. In order to enable existing DNS infrastructure to handle UTF-8 domains, a system known as Punycode was developed (Faltstrom, Hoffman et al. 2003). Punycode provides a facility to represent IDNs as regular ASCII domain names; as such, no changes are required for the majority of infrastructure (Costello 2003). An example of an IDN would be the domain name ☃.com, which would be represented as xn--n3h.com when converted to Punycode.
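The Punycode mapping described above is available directly in Python's codec machinery, which makes the examples from this page easy to reproduce: the snowman label from this survey, and the Cyrillic 'а' that the ICPR paper notes maps to "xn--80a".

```python
# Punycode maps Unicode labels to ASCII; the IDN form prefixes 'xn--'.
snowman = '\u2603'       # ☃
cyrillic_a = '\u0430'    # Cyrillic 'а', a homoglyph of Latin 'a'

print('xn--' + snowman.encode('punycode').decode('ascii'))     # xn--n3h
print('xn--' + cyrillic_a.encode('punycode').decode('ascii'))  # xn--80a

# The mapping is reversible, as required of a transformation technique.
print(b'n3h'.decode('punycode'))  # ☃
```

Reversibility is the key property here: registries and resolvers can work entirely in ASCII while browsers render the original Unicode label.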