Omega — Why Bother with Unicode?

Total Page:16

File Type:pdf, Size:1020Kb

Omega — Why Bother with Unicode? Omega — why Bother with Unicode? Robin Fairbairns University of Cambridge Computer Laboratory Pembroke Street Cambridge CB23QG UK Email: [email protected] Abstract Yannis Haralambous’ and John Plaice’s Omega system employs Unicode as its coding system. This short note (which previously appeared in the UKTUG mag- azine Baskerville volume 5, number 3) considers the rationale behind Unicode itself and behind its adoption for Omega. Introduction and written by Yannis Haralambous (Lille) and John Plaice (Universit´e Laval, Montr´eal). It follows on As almost all T X users who ‘listen to the networks’ E quite naturally from Yannis’ work on exotic lan- at all will know, the Francophone T X users’ group, E guages (see, among many examples, Haralambous, GUTenberg, arranged a meeting in March at CERN 1990; 1991; 1993; 1994), which have always seemed (Geneva) to ‘launch’ Ω. The UKTUG responded to to me to be bedevilled by problems of text encoding. GUTenberg’s plea for support to enable T Xusers E Simply, Ω (the program) is able to read scripts from impoverished countries to attend, by making that are encoded in Unicode (or in some other code the first disbursement from the UKTUG’s newly- that is readily transformable to Unicode), and then established Cathy Booth fund. The meeting was to process them in the same way that T Xdoes. well attended, with representatives from both East- E Parallel work has defined formats for fonts and other ern and Western Europe (including me; I also at- necessary files to deal with the demands arising from tended with UKTUG money), and one representa- AFONT Unicode input, and upgraded versions of MET , tive from Australia (though he is presently resident the virtual font utilities, and so on, have been writ- in Europe, too). ten. Ω itself is based on the normal Web2C distribu- The speakers at the meeting were Michel Gooss- tion that is at the base of most modern Unix imple- ens (the president of GUTenberg1), as host for GUT- mentations, and of at least one of the PC versions enberg and as an expert on background to, and the that is freely available. use of Unicode), and Yannis Haralambous and John Plaice, Ω’s two developers. Why Unicode? The meeting can be accounted a success; all that attended enjoyed themselves, and also learnt a There are something between 3000 and 6000 lan- lot. guages in use in the world, for which a writing sys- This paper is a minor revision of an article I tem exists. (The set of languages is shrinking all the wrote for the UKTUG magazine Baskerville,vol- time as the deadening effect of cultural intrusion, ume 5, number 3, and outlines some of my views of primarily through the electronic media, overwhelms (and support for) the choices that Haralambous and the desire to support existing cultures to the extent Plaice have made. I will consider the arguments for of teaching their language to the young.) The distri- using Unicode as a foundation for future text pro- bution of languages is by no means even throughout cessing, in particular (of course) TEX-related pro- the globe (Michel showed us a map), and there are cessing. many that have not been and will presumably now never be formally recorded. What is Ω? When we come to writing systems, we find al- most every variation imaginable in use somewhere Ω (Haralambous and Plaice, 1994) is an extension in the world. The Latin-like system (written left of T X and related programs that has been designed E to right with modest numbers of diacritics simply 1 And now (at the time of writing) president-elect of TUG arranged) has very wide penetration, not least be- itself. cause so many languages were first written down by TUGboat, Volume 16 (1995), No. 3 — Proceedings of the 1995 Annual Meeting 325 Robin Fairbairns Western European missionaries or other explorers. some of the characters that ASCII does specify are Languages such as Vietnamese are classified as ‘com- left “for national variation” in ISO 646; ASCII itself plex Latin-like’, with ≥ 2 diacritics per character; an then became the USA national variation of ISO 646. artificial example of the same effect is IPA (the In- An example of national variation is defined for the ternational Phonetic Alphabet) which has sub- and UK, which specifies that the code point that holds super-scripts and joining marks. Languages such as ‘#’inASCII should hold a pounds sign ($). There Hebrew and Arabic are written right to left, and con- are versions for various Nordic languages that in- stitute another class. Then there are the multiple- clude characters such as æ or ˚a in place of braces, a ligature writing systems typified by the Indic lan- version for French with acute, grave and circumflex- guages such as Devanagari (of which we had a fasci- accented letters, one for German that offers letters nating exposition at the 1993 UKTUG Easter meet- with umlauts and ‘sharp s’ (ß). ing on ‘non-American’ languages, from Dominik Wu- There were various attempts at mechanisms to jastyk), and finally the syllabic scripts (such as Ko- assign different character sets for use by those who rean Hangul and Japanese Hiragana and Katakana), need to use characters from several different sets (for and the ideographic scripts (Chinese and Japanese example, someone writing a Swedish-English Dictio- Kanji). nary); an example is ISO 2022, which defines escape Encodings are needed for computer operations sequences for such switches. These efforts proved on language of any sort. There are differences be- impractical (at least they seemed so to me), and 8- tween the coded representation and the written (or bit developments of ISO 646 arose, with the ability printed) representation. Everyone who’s read about (comfortably) to express more than one language. TEX at all will know about ligatures (the CM fonts, Thus were born the ISO 8859 character sets. and most PostScript fonts, implement ligatures so The commonest of these (at least in the ken of most that, for example, ‘fl’ typed appears as ‘fl’ printed). English speakers) is ISO Latin-1 (ISO 8859-1, that More significantly, almost all adults in Western cul- is, part one of the multi-part standard), which was tures write ‘joined-up’, which is in itself application designed for use by Western Europeans. As well of a form of ligature. All these ligatures are for pre- as the ‘basic ASCII set’ in the first 128 characters, sentation, not for information, and so it is unrea- it has diphthongs and vowels appropriate to most sonable for them to be represented in a character Western European languages. Oddly, it omits the set. Other ligatures, however, form real characters œ dipthong that French uses, and (perhaps less sur- in some languages (examples are æ in Danish and prisingly2) it omits some of the accent forms used by Norwegian, and œ in French). Welsh. ISO 8859 didn’t stop with part 1, though; Each encoding represents a ‘character set’ that there are variants that accommodate Cyrillic (for is to be used by the computer; the history of how Russian, Serbian, and several other languages of the these character sets have developed is long and sorry old Soviet Union), Arabic, Hebrew, and so on. indeed. In the ‘dark ages’ (in fact, as recently as the This is all well and good, but it doesn’t an- early 1960s, when I started computing), every make swer the needs of a writer preparing multilingual of computer system had its own character code, many documents, except in the case that the multiple lan- of them based on the 5-bit teleprinter codes used in guages are accommodated in the same part of ISO telex printers. Eventually, the rather more sophisti- 8859: it will happen some of the time, but most ‘in- cated teletypes appeared, which used seven bits of teresting’ combinations will require switches of char- an eight-bit code; this 7-bit codification was stan- acter set whenever the language changes. dardised as ASCII (the American Standard Code for So ISO (by this time, jointly with IEC) started Information Interchange), which was (in the area of development of an all-encompassing character set, to application it was designed for) an excellent code. It be numbered ISO/IEC 10646 (the difference of 10 000 had all the properties needed for many of the signif- is no accident). ISO/IEC 10646 was to accommodate icant development of computers in the 1960s, but it every possible language in the world by the simple had one serious flaw: it was not able to encode dia- expedient of allowing 32-bit characters. Of course, critics, which are used in almost every language (but no-one can comprehend a 32-bit character set, and which your all-American information interchanger so the set was to be structured, as a hypercube of would seldom have a need for). different repertoires; the (0, 0, 0, ∗) repertoire would To regularise the resulting mess, ISO adopted the ASCII standard as the basis for an international 7-bit character set, ISO 646. ISO 646 is identical to 2 Given that Wales would have been represented by the ASCII in the code points that it specifies; however, BSI in the standardisation process. 326 TUGboat, Volume 16 (1995), No. 3 — Proceedings of the 1995 Annual Meeting Omega — why Bother with Unicode? be the same ISO-Latin 1, but all the other sets could world’’, which would be transparently converted be accomodated, too. to “hello world”. Thus, the two grave accents and Independently, Apple and Microsoft got together the two single quotes constitute ‘programming’.
Recommended publications
  • Unicode and Code Page Support
    Natural for Mainframes Unicode and Code Page Support Version 4.2.6 for Mainframes October 2009 This document applies to Natural Version 4.2.6 for Mainframes and to all subsequent releases. Specifications contained herein are subject to change and these changes will be reported in subsequent release notes or new editions. Copyright © Software AG 1979-2009. All rights reserved. The name Software AG, webMethods and all Software AG product names are either trademarks or registered trademarks of Software AG and/or Software AG USA, Inc. Other company and product names mentioned herein may be trademarks of their respective owners. Table of Contents 1 Unicode and Code Page Support .................................................................................... 1 2 Introduction ..................................................................................................................... 3 About Code Pages and Unicode ................................................................................ 4 About Unicode and Code Page Support in Natural .................................................. 5 ICU on Mainframe Platforms ..................................................................................... 6 3 Unicode and Code Page Support in the Natural Programming Language .................... 7 Natural Data Format U for Unicode-Based Data ....................................................... 8 Statements .................................................................................................................. 9 Logical
    [Show full text]
  • DE-Tex-FAQ (Vers. 72
    Fragen und Antworten (FAQ) über das Textsatzsystem TEX und DANTE, Deutschsprachige Anwendervereinigung TEX e.V. Bernd Raichle, Rolf Niepraschk und Thomas Hafner Version 72 vom September 2003 Dieser Text enthält häufig gestellte Fragen und passende Antworten zum Textsatzsy- stem TEX und zu DANTE e.V. Er kann über beliebige Medien frei verteilt werden, solange er unverändert bleibt (in- klusive dieses Hinweises). Die Autoren bitten bei Verteilung über gedruckte Medien, über Datenträger wie CD-ROM u. ä. um Zusendung von mindestens drei Belegexem- plaren. Anregungen, Ergänzungen, Kommentare und Bemerkungen zur FAQ senden Sie bit- te per E-Mail an [email protected] 1 Inhalt Inhalt 1 Allgemeines 5 1.1 Über diese FAQ . 5 1.2 CTAN, das ‚Comprehensive TEX Archive Network‘ . 8 1.3 Newsgroups und Diskussionslisten . 10 2 Anwendervereinigungen, Tagungen, Literatur 17 2.1 DANTE e.V. 17 2.2 Anwendervereinigungen . 19 2.3 Tagungen »geändert« .................................... 21 2.4 Literatur »geändert« .................................... 22 3 Textsatzsystem TEX – Übersicht 32 3.1 Grundlegendes . 32 3.2 Welche TEX-Formate gibt es? Was ist LATEX? . 38 3.3 Welche TEX-Weiterentwicklungen gibt es? . 41 4 Textsatzsystem TEX – Bezugsquellen 45 4.1 Wie bekomme ich ein TEX-System? . 45 4.2 TEX-Implementierungen »geändert« ........................... 48 4.3 Editoren, Frontend-/GUI-Programme »geändert« .................... 54 5 TEX, LATEX, Makros etc. (I) 62 5.1 LATEX – Grundlegendes . 62 5.2 LATEX – Probleme beim Umstieg von LATEX 2.09 . 67 5.3 (Silben-)Trennung, Absatz-, Seitenumbruch . 68 5.4 Seitenlayout, Layout allgemein, Kopf- und Fußzeilen »geändert« . 72 6 TEX, LATEX, Makros etc. (II) 79 6.1 Abbildungen und Tafeln .
    [Show full text]
  • Miktex Manual Revision 2.0 (Miktex 2.0) December 2000
    MiKTEX Manual Revision 2.0 (MiKTEX 2.0) December 2000 Christian Schenk <[email protected]> Copyright c 2000 Christian Schenk Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the con- ditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another lan- guage, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Free Software Foundation. Chapter 1: What is MiKTEX? 1 1 What is MiKTEX? 1.1 MiKTEX Features MiKTEX is a TEX distribution for Windows (95/98/NT/2000). Its main features include: • Native Windows implementation with support for long file names. • On-the-fly generation of missing fonts. • TDS (TEX directory structure) compliant. • Open Source. • Advanced TEX compiler features: -TEX can insert source file information (aka source specials) into the DVI file. This feature improves Editor/Previewer interaction. -TEX is able to read compressed (gzipped) input files. - The input encoding can be changed via TCX tables. • Previewer features: - Supports graphics (PostScript, BMP, WMF, TPIC, . .) - Supports colored text (through color specials) - Supports PostScript fonts - Supports TrueType fonts - Understands HyperTEX(html:) specials - Understands source (src:) specials - Customizable magnifying glasses • MiKTEX is network friendly: - integrates into a heterogeneous TEX environment - supports UNC file names - supports multiple TEXMF directory trees - uses a file name database for efficient file access - Setup Wizard can be run unattended The MiKTEX distribution consists of the following components: • TEX: The traditional TEX compiler.
    [Show full text]
  • Introduction to Unicode
    Introduction to Unicode Presented by developerWorks, your source for great tutorials ibm.com/developerWorks Table of Contents If you're viewing this document online, you can click any of the topics below to link directly to that section. 1. Tutorial basics 2 2. Introduction 3 3. Unicode-based multilingual form development 5 4. Going interactive: Adding the Perl CGI script 10 5. Conclusions and resources 13 6. Feedback 15 Introduction to Unicode Page 1 Presented by developerWorks, your source for great tutorials ibm.com/developerWorks Section 1. Tutorial basics Is this tutorial right for you? This tutorial is for anyone who wants to understand the basics of Unicode-based multilingual Web page development. It is an introduction to how multilingual characters, the Unicode encoding standard, XML, and Perl CGI scripts can be integrated to produce a truly multilingual Web page. The tutorial explains the concepts behind this process and lays the groundwork for future development. Navigation Navigating through the tutorial is easy: * Use the Next and Previous buttons to move forward and backward. * Use the Menu button to return to the tutorial menu. * If you'd like to tell us what you think, use the Feedback button. * If you need help with the tutorial, use the Help button. What is required: Hardware and software needed No additional hardware is needed for this tutorial; however, in order to view some of the files online, Unicode fonts must be loaded. See Resources for information on Unicode fonts if you don't already have one loaded. (Note: Many Unicode fonts are still incomplete or in development, so one particular font may not contain every character; however, for the purposes of this tutorial, the Unicode font downloaded by the user will ideally contain all or most of the language characters used in our examples.
    [Show full text]
  • The TEX Live Guide TEX Live 2012
    The TEX Live Guide TEX Live 2012 Karl Berry, editor http://tug.org/texlive/ June 2012 Contents 1 Introduction 2 1.1 TEX Live and the TEX Collection...............................2 1.2 Operating system support...................................3 1.3 Basic installation of TEX Live.................................3 1.4 Security considerations.....................................3 1.5 Getting help...........................................3 2 Overview of TEX Live4 2.1 The TEX Collection: TEX Live, proTEXt, MacTEX.....................4 2.2 Top level TEX Live directories.................................4 2.3 Overview of the predefined texmf trees............................5 2.4 Extensions to TEX.......................................6 2.5 Other notable programs in TEX Live.............................6 2.6 Fonts in TEX Live.......................................7 3 Installation 7 3.1 Starting the installer......................................7 3.1.1 Unix...........................................7 3.1.2 MacOSX........................................8 3.1.3 Windows........................................8 3.1.4 Cygwin.........................................9 3.1.5 The text installer....................................9 3.1.6 The expert graphical installer.............................9 3.1.7 The simple wizard installer.............................. 10 3.2 Running the installer...................................... 10 3.2.1 Binary systems menu (Unix only).......................... 10 3.2.2 Selecting what is to be installed...........................
    [Show full text]
  • The Textgreek Package∗
    The textgreek package∗ Leonard Michlmayr leonard.michlmayr at gmail.com 2011/10/09 Abstract The LATEX package textgreek provides NFSS text symbols for Greek letters. This way the author can use Greek letters in text without changing to math mode. The usual font selection commands|e.g. \textbf|apply to these Greek letters as to usual text and the font is upright in an upright environment. Further, hyperref can use these symbols in PDF-strings such as PDF-bookmarks. Contents 1 Introduction2 2 Usage2 2.1 Package Options............................2 2.2 Advanced commands..........................3 3 Examples3 3.1 Use \b-decay" in a heading......................3 4 Compatibility4 5 Limitations4 6 Copyright4 7 Implementation4 7.1 Package Options............................4 7.2 Font selection..............................5 7.3 Greek letter definitions.........................7 7.3.1 Utility macro..........................7 7.3.2 List of Greek letters......................7 7.3.3 Variants.............................9 8 Change History 10 9 Index 11 ∗This document corresponds to textgreek v0.7, dated 2011/10/09. 1 1 Introduction The usual way to print Greek letters in LATEX uses the math mode. E.g. $\beta$ produces β. With the default math fonts, the Greek letters produced this way are italic. Generally, this is ok, since they represent variables and variables are typeset italic with the default math font settings. In some circumstances, however, Greek letters don't represent variables and should be typeset upright. E.g. in \b-decay" or \mA". The package upgreek provides commands to set upright Greek letters in math mode, but it does not provide text symbols.
    [Show full text]
  • Makor: Typesetting Hebrew with Omega
    Makor: Typesetting Hebrew with Omega Alan Hoenig City University of New York [email protected] Abstract It’s relatively easy to typeset a language using a keyboard customized for that language. A more interesting problem arises when trying to set a ‘foreign’ lan- guage (say, Russian, Hebrew, or Arabic) using a native keyboard (American, for example). This leads to the problem of transliteration: how to represent some language, call it A, using the conventions of a different language B? This paper concerns the author’s attempted solution to one such problem: to create quality Hebrew typesetting using the conventions of an English language keyboard. Apart from the different alphabet, which invokes a different set of sounds than does its English counterpart, Hebrew can involve as many as two distinct sets of diacritics, uses special glyph forms (sometimes) at word endings, and is, of course, typeset from right to left. The solution involves using the Omega extension of TEX. Makor, the name for this Hebrew typesetting system, consists of a user man- ual, fonts from seven distinct font families, and a special set of macros and con- ventions. Many examples of its use will be shown. All this software is publicly and freely available. 1 Introduction typesetter. This is not really a problem at all. Early on, T X was extended to handle RTL. These early We English speakers and readers are lucky — TEX or E its equivalent would have been quite different, and versions, TEX--XET and so on, early provided this arguably more difficult to create, had Don Knuth capability.
    [Show full text]
  • The Mathspec Package Font Selection for Mathematics with Xǝlatex Version 0.2B
    The mathspec package Font selection for mathematics with XƎLaTEX version 0.2b Andrew Gilbert Moschou* [email protected] thursday, 22 december 2016 table of contents 1 preamble 1 4.5 Shorthands ......... 6 4.6 A further example ..... 7 2 introduction 2 5 greek symbols 7 3 implementation 2 6 glyph bounds 9 4 setting fonts 3 7 compatability 11 4.1 Letters and Digits ..... 3 4.2 Symbols ........... 4 8 the package 12 4.3 Examples .......... 4 4.4 Declaring alphabets .... 5 9 license 33 1 preamble This document describes the mathspec package, a package that provides an interface to select ordinary text fonts for typesetting mathematics with XƎLaTEX. It relies on fontspec to work and familiarity with fontspec is advised. I thank Will Robertson for his useful advice and suggestions! The package is developmental and later versions might to be incompatible with this version. This version is incompatible with earlier versions. The package requires at least version 0.9995 of XƎTEX. *v0.2b update by Will Robertson ([email protected]). 1 Should you be using this package? If you are using another LaTEX package for some mathematics font, then you should not (unless you know what you are doing). If you want to use Asana Math or Cambria Math (or the final release version of the stix fonts) then you should be using unicode-math. Some paragraphs in this document are marked advanced. Such paragraphs may be safely ignored by basic users. 2 introduction Since Jonathan Kew released XƎTEX, an extension to TEX that permits the inclusion of system wide Unicode fonts and modern font technologies in TEX documents, users have been able to easily typeset documents using readily available fonts such as Hoefler Text and Times New Roman (This document is typeset using Sabon lt Std).
    [Show full text]
  • Section 18.1, Han
    The Unicode® Standard Version 13.0 – Core Specification To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/. Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trade- mark claim, the designations have been printed with initial capital letters or in all capitals. Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries. The authors and publisher have taken care in the preparation of this specification, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein. The Unicode Character Database and other files are provided as-is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. © 2020 Unicode, Inc. All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction. For information regarding permissions, inquire at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please see http://www.unicode.org/copyright.html. The Unicode Standard / the Unicode Consortium; edited by the Unicode Consortium. — Version 13.0. Includes index. ISBN 978-1-936213-26-9 (http://www.unicode.org/versions/Unicode13.0.0/) 1.
    [Show full text]
  • Math Symbol Tables
    APPENDIX A Math symbol tables A.1 Hebrew and Greek letters Hebrew letters Type Typeset \aleph ℵ \beth ℶ \daleth ℸ \gimel ℷ © Springer International Publishing AG 2016 481 G. Grätzer, More Math Into LATEX, DOI 10.1007/978-3-319-23796-1 482 Appendix A Math symbol tables Greek letters Lowercase Type Typeset Type Typeset Type Typeset \alpha \iota \sigma \beta \kappa \tau \gamma \lambda \upsilon \delta \mu \phi \epsilon \nu \chi \zeta \xi \psi \eta \pi \omega \theta \rho \varepsilon \varpi \varsigma \vartheta \varrho \varphi \digamma ϝ \varkappa Uppercase Type Typeset Type Typeset Type Typeset \Gamma Γ \Xi Ξ \Phi Φ \Delta Δ \Pi Π \Psi Ψ \Theta Θ \Sigma Σ \Omega Ω \Lambda Λ \Upsilon Υ \varGamma \varXi \varPhi \varDelta \varPi \varPsi \varTheta \varSigma \varOmega \varLambda \varUpsilon A.2 Binary relations 483 A.2 Binary relations Type Typeset Type Typeset < < > > = = : ∶ \in ∈ \ni or \owns ∋ \leq or \le ≤ \geq or \ge ≥ \ll ≪ \gg ≫ \prec ≺ \succ ≻ \preceq ⪯ \succeq ⪰ \sim ∼ \approx ≈ \simeq ≃ \cong ≅ \equiv ≡ \doteq ≐ \subset ⊂ \supset ⊃ \subseteq ⊆ \supseteq ⊇ \sqsubseteq ⊑ \sqsupseteq ⊒ \smile ⌣ \frown ⌢ \perp ⟂ \models ⊧ \mid ∣ \parallel ∥ \vdash ⊢ \dashv ⊣ \propto ∝ \asymp ≍ \bowtie ⋈ \sqsubset ⊏ \sqsupset ⊐ \Join ⨝ Note the \colon command used in ∶ → 2, typed as f \colon x \to x^2 484 Appendix A Math symbol tables More binary relations Type Typeset Type Typeset \leqq ≦ \geqq ≧ \leqslant ⩽ \geqslant ⩾ \eqslantless ⪕ \eqslantgtr ⪖ \lesssim ≲ \gtrsim ≳ \lessapprox ⪅ \gtrapprox ⪆ \approxeq ≊ \lessdot
    [Show full text]
  • Unicode Cree Syllabics for Windows and Macintosh
    Unicode Cree Syllabics for Windows and Macintosh 37th Algonquian Conference, Ottawa 2005 Bill Jancewicz SIL International and Naskapi Development Corporation ABSTRACT Submitted as an update to a presentation made at the 34th Algonquian Conference (Kingston). The ongoing development of the operating systems has included increased support for cross-platform use of Unicode syllabic script. Key improvements that were included in Macintosh's OS X.3 (Panther) operating system now allow direct keyboarding of Unicode characters by means of a user-defined input method. A summary and comparison of the available tools for handling Unicode on both Windows and Macintosh will be discussed. INTRODUCTION Since 1988 the author has been working in the Naskapi language at Kawawachikamach with the primary purpose of linguistic analysis and Bible translation, sponsored by Wycliffe Bible Translators and SIL International. Related work includes mother tongue translator training, Naskapi literature and curriculum development. Along with the language work the author also developed methods of production for Naskapi language materials in syllabic script by means of the computer. With the advent of high quality publishing capabilities in newer computers, procedures were updated to keep pace with the improving technology. With resources available from SIL International computer services department, a very satisfactory system of keyboarding syllabic texts in Naskapi was developed. At the urging of colleagues working in related languages, the system was expanded to include the wider inventory of standard Eastern and Western Cree syllabics. While this has pushed the limit of what is possible with the current technology, Unicode makes this practical. Note that the system originally developed for keyboarding Naskapi syllabics is similar but not identical to the Cree system, because of the unique local orthography in use at Kawawachikamach.
    [Show full text]
  • I18n, M17n, Unicode, and All That
    I18N, M17N, UNICODE, AND ALL THAT Tim Bray General-Purpose Web Geek Sun Microsystems /[a-zA-Z]+/ This is probably a bug. The Problems We Have To Solve Identifying characters Storage Byte⇔character mapping Transfer Good string API Published in 1996; it has 74 major sections, most of which discuss whole families of writing systems. www.w3.org/TR/charmod Identifying Characters 1,1 17 “Planes”14,1 each with 64k code points: U+0000 – U+10FFFF BMP 12 Unicode Code Points 0 0000 1 0000 Basic Multilingual Plane 2 0000 Dead Languages & Math 3 0000 Han Characters 4 0000 5 0000 Non-BMP 6 0000 7 0000 99,024 characters defined in Unicode 5.0 “Astral” Planes 8 0000 9 0000 A 0000 B 0000 C 0000 D 0000 E 0000 Language F 0000 10 0000 Private Use T ags The Basic Multilingual Plane (BMP) U+0000 – U+FFFF 0000 Alphabets 1000 2000 3000 Punctuation 4000 Asian-language Support 5000 Han Characters 6000 7000 8000 9000 A000 Y B000 i Hangul C000 D000 E000 (*: Legacy-Compatibility junk)Surrogates F000 Private Use * Unicode Character Database 00C8;LATIN CAPITAL LETTER E WITH GRAVE;Lu;0;L;0045 0300;;;;N;LATIN CAPITAL LETTER E GRAVE;;;00E8; “Character #200 is LATIN CAPITAL LETTER E WITH GRAVE, a lower-case letter, combining class 0, renders L-to-R, can be composed by U+0045/U+0300, had a differentÈ name in Unicode 1, isn’t a number, lowercase is U+00E8.” www.unicode.org/Public/Unidata $ U+0024 DOLLAR SIGN Ž U+017D LATIN CAPITAL LETTER Z WITH CARON ® U+00AE REGISTERED SIGN ή U+03AE GREEK SMALL LETTER ETA WITH TONOS Ж U+0416 CYRILLIC CAPITAL LETTER ZHE א U+05D0 HEBREW LETTER
    [Show full text]