Unicode & Emoji
Total Pages: 16
File Type: pdf, Size: 1020Kb
Recommended publications
ISO/IEC JTC1/SC22/WG20 N809R
ISO/IEC JTC1/SC22/WG20 N809R 2001-01-09 Internationalization International Organization for Standardization Organisation internationale de normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Ordering the Runic script Source: Michael Everson Status: Expert Contribution Date: 2001-01-09 On 2000-12-24 Olle Järnefors published on behalf of the ISORUNES Project in Sweden a proposal for ordering the Runes in the Common Tailorable Template (CTT) of ISO/IEC 14651. In my view this ordering is unsuitable for the CTT for a number of reasons. Runic ordering in ISO/IEC 10646. The Runes are encoded at U+16A0–U+16FF, in a unified set of characters encompassing the four major traditions of Runic use: Germanic, Anglo-Frisian, Danish, and Swedish-Norwegian, and Medieval. The Runes are arranged in the code table agreed by ISO/IEC JTC1/SC2/WG2 in an order based on the traditional positions of the Runes in abecedaries, namely, the fuþark order. This order is known from hundreds of primary sources which list the Runes in sequence, often with no other text. Most scholarly texts refer to the fuþark in one way or another. Nearly all secondary texts, whether popular introductions to the Runes or New-Age esoterica, give primacy to the traditional fuþark sequence. Runic names in ISO/IEC 10646. The names given to the Runes in the UCS may be a bit clumsy, but they are intended to serve the needs of scholars and amateurs alike; not everyone is familiar with Runic transliteration practices, and not everyone is conversant with the traditional names in Germanic, English, and Scandinavian usage. -
Unicode/IDN Security Mark Davis President, Unicode Consortium Chief SW Globalization Arch., IBM the Unicode Consortium
Unicode/IDN Security Mark Davis President, Unicode Consortium Chief SW Globalization Arch., IBM The Unicode Consortium Software globalization standards: define properties and behavior for every character in every script Unicode Standard: a unique code for every character Common Locale Data Repository: LDML format plus repository for required locale data Collation, line breaking, regex, charset mapping, … Used by every major modern operating system, browser, office software, email client,… Core of XML, HTML, Java, C#, C (with ICU), Javascript, … Security ~ Identity System A X = x System B X ≠ x IDN You get an email about your paypal.com account, click on the link… You carefully examine your browser's address box to make sure that it is actually going to http://paypal.com/ … But actually it is going to a spoof site: “paypal.com” with the Cyrillic letter “p”. You (System A) think that they are the same DNS (System B) thinks they are different Examples: Letters Cross-Script p in Latin vs p in Cyrillic In-Script Sequences rn may appear at display sizes like m ! + ा typically looks identical to # so̷s looks like søs Rendering Support ä with two umlauts may look the same as ä with one el is actually e + l Examples: Numbers Western 0 1 2 3 4 5 6 7 8 9 Bengali ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯ Oriya ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯ Thus ৪୨ = 42 Syntax Spoofing http://example.org/1234/not.mydomain.com http://example.org/1234/not.mydomain.com / = fraction-slash Also possible without Unicode: http://example.org--long-and-obscure-list-of-characters.mydomain.com UTR -
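The cross-script spoof in the excerpt above can be made concrete with a short sketch. This is a minimal illustration, not the detection method the slides describe: it treats the first word of each character's Unicode name (for example LATIN or CYRILLIC) as a rough stand-in for its script, which is an assumption, and the helper name `rough_scripts` is hypothetical.

```python
import unicodedata


def rough_scripts(text):
    """Return script-like name prefixes for the letters in text.

    Assumption: the first word of a character's Unicode name
    (e.g. 'LATIN', 'CYRILLIC') approximates its script.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


genuine = "paypal.com"
spoofed = "\u0440aypal.com"  # first letter is U+0440 CYRILLIC SMALL LETTER ER

# The two strings are different code point sequences even if they render alike.
print(genuine == spoofed)
# A label that mixes scripts is a hint that it deserves a closer look.
print(rough_scripts(genuine))
print(rough_scripts(spoofed))
```

A mixed-script result is only a warning sign under this heuristic; legitimate names can mix scripts, and same-script lookalikes such as "rn" versus "m" are not caught by it.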
Network Working Group D. Goldsmith Request for Comments: 1641 M
Network Working Group D. Goldsmith Request for Comments: 1641 M. Davis Category: Experimental Taligent, Inc. July 1994 Using Unicode with MIME Status of this Memo This memo defines an Experimental Protocol for the Internet community. This memo does not specify an Internet standard of any kind. Distribution of this memo is unlimited. Abstract The Unicode Standard, version 1.1, and ISO/IEC 10646-1:1993(E) jointly define a 16 bit character set (hereafter referred to as Unicode) which encompasses most of the world's writing systems. However, Internet mail (STD 11, RFC 822) currently supports only 7-bit US ASCII as a character set. MIME (RFC 1521 and RFC 1522) extends Internet mail to support different media types and character sets, and thus could support Unicode in mail messages. MIME neither defines Unicode as a permitted character set nor specifies how it would be encoded, although it does provide for the registration of additional character sets over time. This document specifies the usage of Unicode within MIME. Motivation Since Unicode is starting to see widespread commercial adoption, users will want a way to transmit information in this character set in mail messages and other Internet media. Since MIME was expressly designed to allow such extensions and is on the standards track for the Internet, it is the most appropriate means for encoding Unicode. RFC 1521 and RFC 1522 do not define Unicode as an allowed character set, but allow registration of additional character sets. In addition to allowing use of Unicode within MIME bodies, another goal is to specify a way of using Unicode that allows text which consists largely, but not entirely, of US-ASCII characters to be represented in a way that can be read by mail clients who do not understand Unicode. -
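The last goal in the excerpt above, keeping mostly-ASCII text readable over a 7-bit channel, is the idea behind UTF-7. The sketch below uses Python's built-in `utf-7` codec purely as an illustration; the sample message is made up, and the codec's output may differ in detail from the exact format the memo registers.

```python
message = "Hi Mom \u263a!"  # mostly ASCII, plus U+263A WHITE SMILING FACE

encoded = message.encode("utf-7")

# Every byte fits in 7 bits, so ASCII-only mail transport can carry it.
print(all(b < 128 for b in encoded))
# The ASCII run stays readable; only the non-ASCII character is escaped.
print(encoded)
# Decoding restores the original text.
print(encoded.decode("utf-7") == message)
```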
ISO/IEC JTC1/SC2/WG2 N2664 L2/03-393 A. Administrative B. Technical -- General
ISO/IEC JTC1/SC2/WG2 N2664 L2/03-393 2003-11-02 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation internationale de normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Preliminary proposal to encode the Cuneiform script in the SMP of the UCS Source: Michael Everson, Karljürgen Feuerherm, Steve Tinney Status: Individual Contribution Date: 2003-11-02 A. Administrative 1. Title Preliminary proposal to encode the Cuneiform script in the SMP of the UCS. 2. Requester’s name Michael Everson, Karljürgen Feuerherm, Steve Tinney 3. Requester type (Member body/Liaison/Individual contribution) Individual contribution. 4. Submission date 2003-11-02 5. Requester’s reference (if applicable) 6. Choose one of the following: 6a. This is a complete proposal No. This is a preliminary proposal 6b. More information will be provided later Yes. B. Technical -- General 1. Choose one of the following: 1a. This proposal is for a new script (set of characters) Yes. Proposed name of script Cuneiform and Cuneiform Numbers. 1b. The proposal is for addition of character(s) to an existing block No. 1b. Name of the existing block 2. Number of characters in proposal 952. 3. Proposed category (see section II, Character Categories) Category B. 4a. Proposed Level of Implementation (1, 2 or 3) (see clause 14, ISO/IEC 10646-1: 2000) Level 1. 4b. Is a rationale provided for the choice? Yes. 4c. If YES, reference Characters are ordinary spacing characters. 5a. Is a repertoire including character names provided? Yes. 5b. If YES, are the names in accordance with the character naming guidelines in Annex L of ISO/IEC 10646-1: 2000? Yes. -
JTC1/SC2/WG2 N3427 L2/08-132
JTC1/SC2/WG2 N3427 L2/08-132 2008-04-08 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal to encode 39 Unified Canadian Aboriginal Syllabics in the UCS Source: Michael Everson and Chris Harvey Status: Individual Contribution Action: For consideration by JTC1/SC2/WG2 and UTC Date: 2008-04-08 1. Summary. This document requests 39 additional characters to be added to the UCS and contains the proposal summary form. 1. Syllabics hyphen (U+1400). Many Aboriginal Canadian languages use the character U+1428 CANADIAN SYLLABICS FINAL SHORT HORIZONTAL STROKE, which looks like the Latin script hyphen. Algonquian languages like western dialects of Cree, Oji-Cree, western and northern dialects of Ojibway employ this character to represent /tʃ/, /c/, or /j/, as in Plains Cree ᐊᓄᐦᐨ /anohc/ ‘today’. In Athabaskan languages, like Chipewyan, the sound is /d/ or an alveolar onset, as in Sayisi Dene ᐨᕦᐣᐨᕤ /t’ąt’ú/ ‘how’. To avoid ambiguity between this character and a line-breaking hyphen, a SYLLABICS HYPHEN was developed which resembles an equals sign. Depending on the typeface, the width of the syllabics hyphen can range from a short ᐀ to a much longer ᐀. This hyphen is line-breaking punctuation, and should not be confused with the Blackfoot syllable internal-w final proposed for U+167F. See Figures 1 and 2. 2. DHW- additions for Woods Cree (U+1677..U+167D). ᙷᙸᙹᙺᙻᙼᙽ/ðwē/ /ðwi/ /ðwī/ /ðwo/ /ðwō/ /ðwa/ /ðwā/. The basic syllable structure in Cree is (C)(w)V(C)(C). -
Comments Received for ISO 639-3 Change Request 2015-046 Outcome
Comments received for ISO 639-3 Change Request 2015-046 Outcome: Accepted after appeal Effective date: May 27, 2016 SIL International ISO 639-3 Registration Authority 7500 W. Camp Wisdom Rd., Dallas, TX 75236 PHONE: (972) 708-7400 FAX: (972) 708-7380 (GMT-6) E-MAIL: [email protected] INTERNET: http://www.sil.org/iso639-3/ Registration Authority decision on Change Request no. 2015-046: to create the code element [ovd] Ӧvdalian . The request to create the code [ovd] Ӧvdalian has been reevaluated, based on additional information from the original requesters and extensive discussion from outside parties on the IETF list. The additional information has strengthened the case and changed the decision of the Registration Authority to accept the code request. In particular, the long bibliography submitted shows that Ӧvdalian has undergone significant language development, and now has close to 50 publications. In addition, it has been studied extensively, and the academic works should have a distinct code to distinguish them from publications on Swedish. One revision being added by the Registration Authority is the added English name “Elfdalian” which was used in most of the extensive discussion on the IETF list. Michael Everson [email protected] May 4, 2016 This is an appeal by the group responsible for the IETF language subtags to the ISO 639 RA to reconsider and revert their earlier decision and to assign an ISO 639-3 language code to Elfdalian. The undersigned members of the group responsible for the IETF language subtag are concerned about the rejection of the Elfdalian language. There is no doubt that its linguistic features are unique in the continuum of North Germanic languages. -
Unicode Compression: Does Size Really Matter? TR CS-2002-11
Unicode Compression: Does Size Really Matter? TR CS-2002-11 Steve Atkin IBM Globalization Center of Competency International Business Machines Austin, Texas USA 78758 [email protected] Ryan Stansifer Department of Computer Sciences Florida Institute of Technology Melbourne, Florida USA 32901 [email protected] July 2003 Abstract The Unicode standard provides several algorithms, techniques, and strategies for assigning, transmitting, and compressing Unicode characters. These techniques allow Unicode data to be represented in a concise format in several contexts. In this paper we examine several techniques and strategies for compressing Unicode data using the programs gzip and bzip. Unicode compression algorithms known as SCSU and BOCU are also examined. As far as size is concerned, algorithms designed specifically for Unicode may not be necessary. 1 Introduction Characters these days are more than one 8-bit byte. Hence, many are concerned about the space text files use, even in an age of cheap storage. Will storing and transmitting Unicode [18] take a lot more space? In this paper we ask how compression affects Unicode and how Unicode affects compression. Unicode is used to encode natural-language text as opposed to programs or binary data. Just what is natural-language text? The question seems simple, yet there are complications. In the information age we are accustomed to discretization of all kinds: music with, for instance, MP3; and pictures with, for instance, JPG. Also, a vast amount of text is stored and transmitted digitally. Yet discretizing text is not generally considered much of a problem. This may be because the English language, western society, and computer technology all evolved relatively smoothly together. -
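The question the abstract above raises can be explored with a small experiment. The sketch below is not the paper's methodology: the sample text is invented, Python's `gzip` and `bz2` modules stand in for the command-line programs the authors name, and the helper `sizes` is hypothetical. It simply reports raw and compressed byte counts for the same text in two Unicode encoding forms.

```python
import bz2
import gzip

# Hypothetical sample mixing Latin, Greek, and Cyrillic; not data from the paper.
sample = "Unicode text: αβγδε, абвгд, and plain ASCII. " * 200


def sizes(text, encoding):
    """Return (raw, gzip, bz2) byte counts for text in the given encoding."""
    raw = text.encode(encoding)
    return len(raw), len(gzip.compress(raw)), len(bz2.compress(raw))


for encoding in ("utf-8", "utf-16-le"):
    raw_len, gz_len, bz_len = sizes(sample, encoding)
    print(f"{encoding}: raw={raw_len} gzip={gz_len} bz2={bz_len}")
```

Because the sample is highly repetitive, the compressed sizes here say little about real corpora; the point is only how such a comparison can be set up.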
New Emoji Requests from Twitter Users: When, Where, Why, and What We Can Do About Them
New Emoji Requests from Twitter Users: When, Where, Why, and What We Can Do About Them YUNHE FENG, University of Tennessee, Knoxville, USA ZHENG LU, University of Tennessee, Knoxville, USA WENJUN ZHOU, University of Tennessee, Knoxville, USA ZHIBO WANG, Wuhan University, China QING CAO∗, University of Tennessee, Knoxville, USA As emojis become prevalent in personal communications, people are always looking for new, interesting emojis to express emotions, show attitudes, or simply visualize texts. In this study, we collected more than thirty million tweets mentioning the word “emoji” in a one-year period to study emoji requests on Twitter. First, we filtered out bot-generated tweets and extracted emoji requests from the raw tweets using a comprehensive list of linguistic patterns. To our surprise, some extant emojis, such as fire and hijab, were still frequently requested by many users. A large number of non-existing emojis were also requested, which were classified into one of eight emoji categories by Unicode Standard. We then examined patterns of new emoji requests by exploring their time, location, and context. Eagerness and frustration of not having these emojis were evidenced by our sentiment analysis, and we summarize users’ advocacy channels. Focusing on typical patterns of co-mentioned emojis, we also identified expressions of equity, diversity, and fairness issues due to unreleased but expected emojis, and summarized the significance of new emojis on society. Finally, time-continuity sensitive strategies at multiple time granularity levels were proposed to rank petitioned emojis by the eagerness, and a real-time monitoring system to track new emoji requests is implemented. -
Encoding of Tengwar Telcontar Version 0.08
Tengwar Telcontar This document discusses the encoding of Tengwar Telcontar version 0.08. The latest version of the font and this document can be downloaded from the Free Tengwar Font Project: http://freetengwar.sourceforge.net/. Changes from earlier proposals By creating a fully functional Unicode font for Tengwar, my intention is to promote the proposal to encode Tengwar in Unicode, and to spur the discussion on the best way to design such an encoding. What then constitutes the best way to encode Tengwar? This I can hardly decide on my own; indeed, it is of the highest importance that an encoding proposal is approved by a majority of the Tengwar user community. You are therefore cordially invited to discuss and to propose changes to Tengwar Telcontar. If you want to familiarize yourself with the Unicode standard in general, you can find much information at http://www.unicode.org/. In particular, I recommend that you read chapter 2 of The Unicode Standard, which defines many important concepts (e. g. terms like character, glyph, etc.): http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf. The current encoding of the characters in Tengwar Telcontar (shown in a table at the end of this document) is based on the encoding in Michael Everson’s latest discussion paper at the Conscript Unicode Registry: http://www.evertype.com/standards/iso10646/pdf/tengwar.pdf. However, I have on more than one occasion diverged from Everson’s table, adding some characters that I felt were missing and removed others that, to my opinion, either do not merit inclusion at all, or which possibly might be better represented in other ways. -
Manuscript Submission: Use of the Coptic Script Version 1.1, April 19, 2021, by Pim Rietbroek and Maaike Langerak
Manuscript Submission: Use of the Coptic Script Version 1.1, April 19, 2021, by Pim Rietbroek and Maaike Langerak Instructions for Authors These instructions cover the Sahidic dialect of the Coptic language; for any questions about glyph variants or diacritics found in Bohairic Coptic or other Coptic dialects, please contact your associate editor at Brill. 1 Word Processing Windows users should use MS Office Word 2016 or later, or 365. Documents should be saved in .docx format. Mac users should use either MS Office for Mac, Mellel, Nisus Writer Pro, Nisus Writer Express, or Pages. Save (or export) in .doc or .docx format, but also submit the files in their original format (.mellel, .pages, etc.). 2 Input Fonts Make sure you use a Unicode font, preferably Antinoou. Brill uses Antinoou as default font for typesetting Coptic. Antinoou has been developed by Michael Everson under the auspices of the International Association for Coptic Studies. The Institut français d’archéologie Orientale (IFAO) provides a couple of Coptic fonts: IFAO N Copte and IFAO- Grec Unicode. Please do not use these fonts: some of the rarer characters and glyphs in them are encoded in the Private Use Area, and these will be lost when the text is placed in the layout template by the typesetters, meaning they will not transfer to either print or online publications. Antinoou will render most of them using standard Unicode characters; should you need Coptic characters or glyphs that are not yet encoded in the Unicode Standard, please contact your associate editor at Brill. 3 Keying Coptic 3.1 Keyboards Out of the box Windows and macOS do not provide ‘keyboards’ (‘IMEs’ or ‘Input Methods’ or ‘Input Sources’) for Coptic, but fortunately there are few available as free downloads. -
Michael Everson, DDS, MS
Michael Everson, DDS, MS Notice of Privacy Practices Acknowledgement I understand that I have certain rights to privacy regarding my protected health information. These rights are given to me under the Health Insurance Portability and Accountability Act of 1996 (HIPAA). I understand that by signing this consent I authorize you to use and disclose my protected health information to carry out: - Conduct, plan, and direct my treatment and follow-up among the multiple healthcare providers who may be involved in that treatment directly and indirectly. - Obtaining payment from third party payers (e.g. insurance company) - Conduct normal healthcare operations such as quality assessments and physician certifications. I have also been informed of and given the right to review and secure a copy of your Notice of Privacy Practices, which contains a more complete description of the uses and disclosures of my protected health information and my rights under HIPAA. I understand that you reserve the right to change the terms of this notice from time to time and that I may contact you at any time to obtain the most current copy of this notice. I understand that I have the right to request restrictions on how my protected health information is used and disclosed to carry out treatment, payment and healthcare operations, but that you are not required to agree to these requested restrictions. However, if you do agree, you are then bound to comply with this restriction. I understand that I may revoke this consent in writing, at any time. However, any use or disclosure that occurred prior to the date I revoke this consent is not affected. -
ISO/IEC JTC1/SC2/WG2 N4178R L2/12-002
ISO/IEC JTC1/SC2/WG2 N4178R L2/12-002 2012-01-26 Universal Multiple-Octet Coded Character Set International Organization for Standardization Organisation Internationale de Normalisation Международная организация по стандартизации Doc Type: Working Group Document Title: Proposal for additions and corrections to Sumero-Akkadian Cuneiform Source: UC Berkeley Script Encoding Initiative (Universal Scripts Project) Authors: Michael Everson and Steve Tinney Status: Liaison Contribution Date: 2012-01-26 The encoding of Sumero-Akkadian Cuneiform in Unicode is a multi-phase process. In the first phase, the principal focus was signs from 2100 BCE onwards, with earlier material covered to the extent that it was clearly understood. This document adds signs that were inadvertently omitted from Phase 1, usually because they were not listed in the standard published sign lists. An exhaustive online sign-list is being prepared at http://oracc.museum.upenn.edu/ogsl/, and the characters listed here are included in that list. 1. Signs omitted from the initial Cuneiform repertoire. CUNEIFORM SIGN KAP ELAMITE CUNEIFORM SIGN GA2 TIMES ASH2 CUNEIFORM SIGN KA TIMES ANSHE CUNEIFORM SIGN KA TIMES GUD CUNEIFORM SIGN KA TIMES SHUL CUNEIFORM SIGN LU2 SHESHIG TIMES BAD CUNEIFORM SIGN NU11 ROTATED NINETY DEGREES CUNEIFORM SIGN U U CUNEIFORM SIGN UR2 INVERTED CUNEIFORM NUMERIC SIGN ONE QUARTER GUR CUNEIFORM NUMERIC SIGN ONE HALF GUR CUNEIFORM NUMERIC SIGN ELAMITE ONE THIRD CUNEIFORM NUMERIC SIGN ELAMITE TWO THIRDS CUNEIFORM NUMERIC SIGN ELAMITE FORTY CUNEIFORM NUMERIC SIGN ELAMITE FIFTY CUNEIFORM NUMERIC SIGN ELAMITE SIXTY CUNEIFORM NUMERIC SIGN FOUR U VARIANT FORM CUNEIFORM NUMERIC SIGN FIVE U VARIANT FORM CUNEIFORM NUMERIC SIGN SIX U VARIANT FORM CUNEIFORM NUMERIC SIGN SEVEN U VARIANT FORM CUNEIFORM NUMERIC SIGN EIGHT U VARIANT FORM CUNEIFORM NUMERIC SIGN NINE U VARIANT FORM CUNEIFORM PUNCTUATION SIGN DIAGONAL QUADCOLON Page 1 2.