Detection of Malicious IDN Homoglyph Domains Using Active DNS Measurements
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Sig Process Book
A Æ B C D E F G H I J IJ K L M N O Ø Œ P Þ Q R S T U V W X Ethan Cohen Type & Media 2018–19 SigY Z А Б В Г Ґ Д Е Ж З И К Л М Н О П Р С Т У Ф Х Ч Ц Ш Щ Џ Ь Ъ Ы Љ Њ Ѕ Є Э І Ј Ћ Ю Я Ђ Α Β Γ Δ SIG: A Revival of Rudolf Koch’s Wallau Type & Media 2018–19 ЯREthan Cohen ‡ Submitted as part of Paul van der Laan’s Revival class for the Master of Arts in Type & Media course at Koninklijke Academie von Beeldende Kunsten (Royal Academy of Art, The Hague) INTRODUCTION “I feel such a closeness to William Project Overview Morris that I always have the feeling Sig is a revival of Rudolf Koch’s Wallau Halbfette. My primary source that he cannot be an Englishman, material was the Klingspor Kalender für das Jahr 1933 (Klingspor Calen- dar for the Year 1933), a 17.5 × 9.6 cm book set in various cuts of Wallau. he must be a German.” The Klingspor Kalender was an annual promotional keepsake printed by the Klingspor Type Foundry in Offenbach am Main that featured different Klingspor typefaces every year. This edition has a daily cal- endar set in Magere Wallau (Wallau Light) and an 18-page collection RUDOLF KOCH of fables set in 9 pt Wallau Halbfette (Wallau Semibold) with woodcut illustrations by Willi Harwerth, who worked as a draftsman at the Klingspor Type Foundry. -
Package Mathfont V. 1.6 User Guide Conrad Kosowsky December 2019 [email protected]
Package mathfont v. 1.6 User Guide Conrad Kosowsky December 2019 [email protected] For easy, off-the-shelf use, type the following in your docu- ment preamble and compile using X LE ATEX or LuaLATEX: \usepackage[hfont namei]{mathfont} Abstract The mathfont package provides a flexible interface for changing the font of math- mode characters. The package allows the user to specify a default unicode font for each of six basic classes of Latin and Greek characters, and it provides additional support for unicode math and alphanumeric symbols, including punctuation. Crucially, mathfont is compatible with both X LE ATEX and LuaLATEX, and it provides several font-loading commands that allow the user to change fonts locally or for individual characters within math mode. Handling fonts in TEX and LATEX is a notoriously difficult task. Donald Knuth origi- nally designed TEX to support fonts created with Metafont, and while subsequent versions of TEX extended this functionality to postscript fonts, Plain TEX's font-loading capabilities remain limited. Many, if not most, LATEX users are unfamiliar with the fd files that must be used in font declaration, and the minutiae of TEX's \font primitive can be esoteric and confusing. LATEX 2"'s New Font Selection System (nfss) implemented a straightforward syn- tax for loading and managing fonts, but LATEX macros overlaying a TEX core face the same versatility issues as Plain TEX itself. Fonts in math mode present a double challenge: after loading a font either in Plain TEX or through the nfss, defining math symbols can be unin- tuitive for users who are unfamiliar with TEX's \mathcode primitive. -
ISO Basic Latin Alphabet
ISO basic Latin alphabet The ISO basic Latin alphabet is a Latin-script alphabet and consists of two sets of 26 letters, codified in[1] various national and international standards and used widely in international communication. The two sets contain the following 26 letters each:[1][2] ISO basic Latin alphabet Uppercase Latin A B C D E F G H I J K L M N O P Q R S T U V W X Y Z alphabet Lowercase Latin a b c d e f g h i j k l m n o p q r s t u v w x y z alphabet Contents History Terminology Name for Unicode block that contains all letters Names for the two subsets Names for the letters Timeline for encoding standards Timeline for widely used computer codes supporting the alphabet Representation Usage Alphabets containing the same set of letters Column numbering See also References History By the 1960s it became apparent to thecomputer and telecommunications industries in the First World that a non-proprietary method of encoding characters was needed. The International Organization for Standardization (ISO) encapsulated the Latin script in their (ISO/IEC 646) 7-bit character-encoding standard. To achieve widespread acceptance, this encapsulation was based on popular usage. The standard was based on the already published American Standard Code for Information Interchange, better known as ASCII, which included in the character set the 26 × 2 letters of the English alphabet. Later standards issued by the ISO, for example ISO/IEC 8859 (8-bit character encoding) and ISO/IEC 10646 (Unicode Latin), have continued to define the 26 × 2 letters of the English alphabet as the basic Latin script with extensions to handle other letters in other languages.[1] Terminology Name for Unicode block that contains all letters The Unicode block that contains the alphabet is called "C0 Controls and Basic Latin". -
The Unicode Cookbook for Linguists: Managing Writing Systems Using Orthography Profiles
Zurich Open Repository and Archive University of Zurich Main Library Strickhofstrasse 39 CH-8057 Zurich www.zora.uzh.ch Year: 2017 The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles Moran, Steven ; Cysouw, Michael DOI: https://doi.org/10.5281/zenodo.290662 Posted at the Zurich Open Repository and Archive, University of Zurich ZORA URL: https://doi.org/10.5167/uzh-135400 Monograph The following work is licensed under a Creative Commons: Attribution 4.0 International (CC BY 4.0) License. Originally published at: Moran, Steven; Cysouw, Michael (2017). The Unicode Cookbook for Linguists: Managing writing systems using orthography profiles. CERN Data Centre: Zenodo. DOI: https://doi.org/10.5281/zenodo.290662 The Unicode Cookbook for Linguists Managing writing systems using orthography profiles Steven Moran & Michael Cysouw Change dedication in localmetadata.tex Preface This text is meant as a practical guide for linguists, and programmers, whowork with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together. The intersection of the Unicode Standard and the International Phonetic Al- phabet is often not met without frustration by users. Nevertheless, thetwo standards have provided language researchers with a consistent computational architecture needed to process, publish and analyze data from many different languages. We bring to light common, but not always transparent, pitfalls that researchers face when working with Unicode and IPA. Our research uses quantitative methods to compare languages and uncover and clarify their phylogenetic relations. However, the majority of lexical data available from the world’s languages is in author- or document-specific orthogra- phies. -
Assessment of Options for Handling Full Unicode Character Encodings in MARC21 a Study for the Library of Congress
1 Assessment of Options for Handling Full Unicode Character Encodings in MARC21 A Study for the Library of Congress Part 1: New Scripts Jack Cain Senior Consultant Trylus Computing, Toronto 1 Purpose This assessment intends to study the issues and make recommendations on the possible expansion of the character set repertoire for bibliographic records in MARC21 format. 1.1 “Encoding Scheme” vs. “Repertoire” An encoding scheme contains codes by which characters are represented in computer memory. These codes are organized according to a certain methodology called an encoding scheme. The list of all characters so encoded is referred to as the “repertoire” of characters in the given encoding schemes. For example, ASCII is one encoding scheme, perhaps the one best known to the average non-technical person in North America. “A”, “B”, & “C” are three characters in the repertoire of this encoding scheme. These three characters are assigned encodings 41, 42 & 43 in ASCII (expressed here in hexadecimal). 1.2 MARC8 "MARC8" is the term commonly used to refer both to the encoding scheme and its repertoire as used in MARC records up to 1998. The ‘8’ refers to the fact that, unlike Unicode which is a multi-byte per character code set, the MARC8 encoding scheme is principally made up of multiple one byte tables in which each character is encoded using a single 8 bit byte. (It also includes the EACC set which actually uses fixed length 3 bytes per character.) (For details on MARC8 and its specifications see: http://www.loc.gov/marc/.) MARC8 was introduced around 1968 and was initially limited to essentially Latin script only. -
Invisible-Punctuation.Pdf
... ' I •e •e •4 I •e •e •4 •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • ••••• • • •• • • • • • I •e •e •4 In/visible Punctuation • • • • •• • • • •• • • • •• • • • • • •• • • • • • •• • • • • • • •• • • • • • •• • • • • • ' •• • • • • • John Lennard •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • •• • • • • • I •e •e •4 I •e •e •4 I •e •e •4 I •e •e •4 I •e •e •4 I ••• • • 4 I.e• • • 4 I ••• • • 4 I ••• • • 4 I ••• • • 4 I ••• • • 4 • • •' .•. • . • .•. •. • ' .. ' • • •' .•. • . • .•. • . • ' . ' . UNIVERSITY OF THE WEST INDIES- LENNARD, 121-138- VISIBLE LANGUAGE 45.1/ 2 I •e •e' • • • • © VISIBLE LANGUAGE, 2011 -RHODE ISLAND SCHOOL OF DESIGN- PROVIDENCE, RHODE ISLAND 02903 .. ' ABSTRACT The article offers two approaches to the question of 'invisible punctuation,' theoretical and critical. The first is a taxonomy of modes of punctuational invisibility, · identifying denial, repression, habituation, error and absence. Each is briefly discussed and some relations with technologies of reading are considered. The second considers the paragraphing, or lack of it, in Sir Philip Sidney's Apology for Poetry: one of the two early printed editions and at least one of the two MSS are mono paragraphic, a feature always silently eliminated by editors as a supposed carelessness. It is argued that this is improbable -
Iso/Iec Jtc1/Sc2/Wg2 N4907 L2/17-359
ISO/IEC JTC1/SC2/WG2 N4907 L2/17-359 2017-10-20 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal to add six Latin Tironian letters to the UCS Source: Michael Everson and Andrew West Status: Individual Contribution Date: 2017-10-20 Replaces: N4841 (L2/17-300) 0. Summary. This proposal requests the encoding of three new casing letters used in medieval European texts. If this proposal is accepted, the following characters will exist: A7F0 LATIN CAPITAL LETTER TIRONIAN ET A7F1 LATIN SMALL LETTER TIRONIAN ET • Old and Middle English, … → 204A tironian sign et ꟲ A7F2 LATIN CAPITAL LETTER TIRONIAN ET WITH HOOK ꟳ A7F3 LATIN SMALL LETTER TIRONIAN ET WITH HOOK • Middle English, Latin, … ꟴ A7F4 LATIN CAPITAL LETTER TIRONIAN ET WITH HOOK AND STROKE Ꟶ A7F5 LATIN SMALL LETTER TIRONIAN ET WITH HOOK AND STROKE • Middle English, Latin, French, … 1. Background. A punctuation character chiefly for Irish use was added in Unicode 3.0 in 1999. Its current entry in the standard reads: ⁊ 204A TIRONIAN SIGN ET • Irish Gaelic, Old English, … → 0026 & ampersand → 1F670 script ligature et ornament A variety of medieval English manuscripts across a number of centuries treat the the Tironian sign as an actual letter of the alphabet, and case it when in sentence-initial position. Modern transcribers of documents containing these letters have distinguished them as casing, sometimes using the digit 7 as a font workaround. The original simple two-stroke shape of the character ⁊ as used in the Insular tradition (for Irish and Old English) was replaced by Carolingian character with a hooked base ꟳ, which had a range of glyph variants including a long extension of the top-bar descending to the left Page 1 and sometimes even encircling the glyph. -
UTF-8-Plan9-Paper
Hello World or Kαληµε´ρα κο´σµε or Rob Pike Ken Thompson AT&T Bell Laboratories Murray Hill, New Jersey 07974 ABSTRACT Plan 9 from Bell Labs has recently been converted from ASCII to an ASCII- compatible variant of Unicode, a 16-bit character set. In this paper we explain the rea- sons for the change, describe the character set and representation we chose, and present the programming models and software changes that support the new text format. Although we stopped short of full internationalizationÐfor example, system error mes- sages are in Unixese, not JapaneseÐwe believe Plan 9 is the first system to treat the rep- resentation of all major languages on a uniform, equal footing throughout all its software. Introduction The world is multilingual but most computer systems are based on English and ASCII. The release of Plan 9 [Pike90], a new distributed operating system from Bell Laboratories, seemed a good occasion to correct this chauvinism. It is easier to make such deep changes when building new systems than by refit- ting old ones. The ANSI C standard [ANSIC] contains some guidance on the matter of ‘wide’ and ‘multi-byte’ characters but falls far short of solving the myriad associated problems. We could find no literature on how to convert a system to larger character sets, although some individual programs had been converted. This paper reports what we discovered as we explored the problem of representing multilingual text at all levels of an operating system, from the file system and kernel through the applications and up to the window sys- tem and display. -
Part 1: Introduction to The
PREVIEW OF THE IPA HANDBOOK Handbook of the International Phonetic Association: A guide to the use of the International Phonetic Alphabet PARTI Introduction to the IPA 1. What is the International Phonetic Alphabet? The aim of the International Phonetic Association is to promote the scientific study of phonetics and the various practical applications of that science. For both these it is necessary to have a consistent way of representing the sounds of language in written form. From its foundation in 1886 the Association has been concerned to develop a system of notation which would be convenient to use, but comprehensive enough to cope with the wide variety of sounds found in the languages of the world; and to encourage the use of thjs notation as widely as possible among those concerned with language. The system is generally known as the International Phonetic Alphabet. Both the Association and its Alphabet are widely referred to by the abbreviation IPA, but here 'IPA' will be used only for the Alphabet. The IPA is based on the Roman alphabet, which has the advantage of being widely familiar, but also includes letters and additional symbols from a variety of other sources. These additions are necessary because the variety of sounds in languages is much greater than the number of letters in the Roman alphabet. The use of sequences of phonetic symbols to represent speech is known as transcription. The IPA can be used for many different purposes. For instance, it can be used as a way to show pronunciation in a dictionary, to record a language in linguistic fieldwork, to form the basis of a writing system for a language, or to annotate acoustic and other displays in the analysis of speech. -
Detection of Suspicious IDN Homograph Domains Using Active DNS Measurements
A Case of Identity: Detection of Suspicious IDN Homograph Domains Using Active DNS Measurements Ramin Yazdani Olivier van der Toorn Anna Sperotto University of Twente University of Twente University of Twente Enschede, The Netherlands Enschede, The Netherlands Enschede, The Netherlands [email protected] [email protected] [email protected] Abstract—The possibility to include Unicode characters in problem in the case of Unicode – ASCII homograph domain names allows users to deal with domains in their domains. We propose a low-cost method for proactively regional languages. This is done by introducing Internation- detecting suspicious IDNs. Since our proactive approach alized Domain Names (IDN). However, the visual similarity is based not on use, but on characteristics of the domain between different Unicode characters - called homoglyphs name itself and its associated DNS records, we are able to - is a potential security threat, as visually similar domain provide an early alert for both domain owners as well as names are often used in phishing attacks. Timely detection security researchers to further investigate these domains of suspicious homograph domain names is an important before they are involved in malicious activities. The main step towards preventing sophisticated attacks, since this can contributions of this paper are that we: prevent unaware users to access those homograph domains • propose an improved Unicode Confusion table able that actually carry malicious content. We therefore propose to detect 2.97 times homograph domains compared a structured approach to identify suspicious homograph to the state-of-the-art confusion tables; domain names based not on use, but on characteristics • combine active DNS measurements and Unicode of the domain name itself and its associated DNS records. -
Unicode Encoding the TITUS Project
ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Unicode Encoding and Online Data Access Ralf Gehrke / Jost Gippert The TITUS Project („Thesaurus indogermanischer Text- und Sprachmaterialien“) (since 1987/1993) www.ala.org/alcts 1 ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Scope of the TITUS project: • Electronic retrieval engine covering the textual heritage of all ancient Indo-European languages • Present retrieval task: – Documentation of the usage of all word forms occurring in the texts, in their resp. contexts • Survey of the parts of the text database: – http://titus.uni-frankfurt.de/texte/texte2.htm Data formats (since 1995): • Text formats: – WordCruncher Text format (8-Bit) – HTML (UTF-8 Unicode 4.0) – (Plain 7-bit ASCII format) • Database format: – MS Access (relational, Unicode-based) – Retrieval via SQL www.ala.org/alcts 2 ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Original Scripts Covered: • Latin (with all kinds of diacritics), incl. variants* • Greek • Slavic (Cyrillic and Glagolitic*) • Armenian • Georgian • Devangar • Other Brhm scripts (Tocharian, Khotanese)* • Avestan* • Middle Persian (Pahlav)* • Manichean* • Arabic (incl. Persian) • Runic • Ogham • and many more * not yet encodable (as such) in Unicode Example 1a: Donelaitis (Lithuanian: formatted text incl. diacritics: 8-bit version) www.ala.org/alcts 3 ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Example 1b: Donelaitis (Lithuanian: formatted text incl. diacritics: Unicode version) Example 2a: Catechism (Old Prussian: formatted text incl. diacritics: 8-bit version, special TITUS font) www.ala.org/alcts 4 ALCTS "Library Catalogs and Non- 2004 ALA Annual Conference Orlando Roman Scripts" Program Example 2b: Catechism (Old Prussian: formatted text incl. -
Name Synopsis Description
Perl version 5.10.0 documentation - Locale::Script NAME Locale::Script - ISO codes for script identification (ISO 15924) SYNOPSIS use Locale::Script; use Locale::Constants; $script = code2script('ph'); # 'Phoenician' $code = script2code('Tibetan'); # 'bo' $code3 = script2code('Tibetan', LOCALE_CODE_ALPHA_3); # 'bod' $codeN = script2code('Tibetan', LOCALE_CODE_ALPHA_NUMERIC); # 330 @codes = all_script_codes(); @scripts = all_script_names(); DESCRIPTION The Locale::Script module provides access to the ISOcodes for identifying scripts, as defined in ISO 15924.For example, Egyptian hieroglyphs are denoted by the two-lettercode 'eg', the three-letter code 'egy', and the numeric code 050. You can either access the codes via the conversion routines(described below), or with the two functions which return listsof all script codes or all script names. There are three different code sets you can use for identifyingscripts: alpha-2 Two letter codes, such as 'bo' for Tibetan.This code set is identified with the symbol LOCALE_CODE_ALPHA_2. alpha-3 Three letter codes, such as 'ell' for Greek.This code set is identified with the symbol LOCALE_CODE_ALPHA_3. numeric Numeric codes, such as 410 for Hiragana.This code set is identified with the symbol LOCALE_CODE_NUMERIC. All of the routines take an optional additional argumentwhich specifies the code set to use.If not specified, it defaults to the two-letter codes.This is partly for backwards compatibility (previous versionsof Locale modules only supported the alpha-2 codes), andpartly because they are the most widely used codes. The alpha-2 and alpha-3 codes are not case-dependent,so you can use 'BO', 'Bo', 'bO' or 'bo' for Tibetan.When a code is returned by one of the functions inthis module, it will always be lower-case.