
A Framework for Multilingual Information Processing

by

Steven Edward Atkin

Bachelor of Science in Physics, State University of New York, Stony Brook, 1989

Master of Science in Computer Science, Florida Institute of Technology, 1994

A dissertation submitted to the College of Engineering at Florida Institute of Technology in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science

Melbourne, Florida
December 2001

We the undersigned committee hereby recommend that the attached document be accepted as fulfilling in part the requirements for the degree of Doctor of Philosophy in Computer Science.

“A Framework for Multilingual Information Processing,” a dissertation by Steven Edward Atkin

______Ryan Stansifer, Ph.D. Associate Professor, Computer Science Dissertation Advisor

______Phil Bernhard, Ph.D. Associate Professor, Computer Science

______James Whittaker, Ph.D. Associate Professor, Computer Science

______Gary Howell, Ph.D. Professor,

______William Shoaff, Ph.D. Associate Professor and Department Head, Computer Science

Abstract

Title: A Framework for Multilingual Information Processing Author: Steven Edward Atkin Major Advisor: Ryan Stansifer, Ph.D.

Recent and continuing rapid increases in computing power now enable much of humankind's written communication to be represented as digital data. The most recent and obvious changes in multilingual information processing have been the introduction of larger character sets encompassing more writing systems. Yet the very richness of larger collections of characters has made the interpretation and processing of text more difficult. The many competing motivations (satisfying the needs of linguists, computer scientists, and typographers) for standardizing character sets threaten the purpose of information processing: accurate and facile manipulation of data. Existing character sets are constructed without a consistent strategy or architecture. Complex algorithms and reports are necessary now to understand raw streams of characters representing multilingual text.

We assert that information processing is an architectural problem and not just a character problem. We analyze several multilingual information processing algorithms (e.g., bidirectional reordering and character normalization) and we conclude that they are more dangerous than beneficial. The countless number of unexpected interactions suggests a lack of a coherent architecture. We introduce abstractions, novel mechanisms, and take the first steps towards organizing them into a new architecture for multilingual information processing. We propose a multi-layered architecture which we call Metacode where character sets appear in lower layers and protocols and algorithms in higher layers. We recast bidirectional reordering and character normalization in the Metacode framework.

Table of Contents

List of Figures ...... ix
List of Tables ...... xi
Acknowledgement ...... xiii
Dedication ...... xiv
Chapter 1 Introduction ...... 1

1.1 Architecture Overview ...... 3
1.2 Problem Statement ...... 5
1.3 Outline of Dissertation ...... 5
Chapter 2 Software Globalization ...... 7

2.1 Overview ...... 7
2.2 Translation ...... 9
2.3 International User Interfaces ...... 10
2.3.1 Metaphors ...... 10
2.3.2 Geometry ...... 11
2.3.3 Color ...... 11
2.3.4 Icons ...... 12
2.3.5 Sound ...... 12
2.4 Cultural and Linguistic Formatting ...... 13
2.4.1 Numeric Formatting ...... 13
2.4.2 Date and Time Formatting ...... 14
2.4.3 Calendar Systems ...... 14
2.4.4 Measurement ...... 15
2.4.5 Collating ...... 15
2.4.6 Character Classification ...... 16
2.4.7 Locales ...... 17
2.5 Keyboard Input ...... 17
2.6 Fonts ...... 18
2.7 Character Coding Systems ...... 20
Chapter 3 Character Coding Systems ...... 22

3.1 Terms ...... 22
3.2 Character Encoding Schemes ...... 24
3.3 European Encodings ...... 24

3.3.1 ISO 7-bit Character Encodings ...... 27
3.3.2 ISO 8-bit Character Encodings ...... 29
3.3.3 Vendor Specific Character Encodings ...... 33
3.4 Japanese Encodings ...... 40
3.4.1 Personal Computer Encoding Method ...... 43
3.4.2 Extended Unix Code Encoding Method ...... 45
3.4.3 Host Encoding Method ...... 46
3.4.4 Internet Exchange Method ...... 47
3.4.5 Vendor Specific Encodings ...... 49
3.5 Chinese Encodings ...... 50
3.5.1 People's Republic of China ...... 51
3.5.2 Republic of China ...... 53
3.5.3 Hong Kong Special Administrative Region ...... 54
3.6 Korean Encodings ...... 55
3.6.1 South Korea ...... 55
3.6.2 North Korea ...... 55
3.7 Vietnamese Encodings ...... 56
3.8 Multilingual Encodings ...... 57
3.8.1 Why Are Multilingual Encodings Necessary? ...... 57
3.8.2 Unicode and ISO-10646 ...... 59
3.8.2.1 History of Unicode ...... 60
3.8.2.2 Goals of Unicode ...... 60
3.8.2.3 Unicode's Principles ...... 61
3.8.2.4 Differences Between Unicode and ISO-10646 ...... 62
3.8.2.5 Unicode's Organization ...... 63
3.8.2.6 Unicode Transmission Forms ...... 67
3.8.3 Criticism of Unicode ...... 68
3.8.3.1 Problems With Character/Glyph Separation ...... 69
3.8.3.2 Problems With Han Unification ...... 69
3.8.3.3 ISO-8859-1 Compatibility ...... 70
3.8.3.4 Efficiency ...... 71
3.8.4 Mudawwar's Multicode ...... 71
3.8.4.1 Character Sets in Multicode ...... 71
3.8.4.2 Character Set Switching in Multicode ...... 72
3.8.4.3 Focus on Written Languages ...... 73
3.8.4.4 ASCII/Unicode Compatibility ...... 74
3.8.4.5 Association in Multicode ...... 74
3.8.5 TRON ...... 74
3.8.5.1 TRON Single Byte Character Code ...... 75
3.8.5.2 TRON Double Byte Character Code ...... 76
3.8.5.3 Japanese TRON Code ...... 76
3.8.6 EPICIST ...... 77

3.8.6.1 EPICIST Code Points ...... 78
3.8.6.2 EPICIST Character Code Space ...... 78
3.8.6.3 Compatibility With Unicode ...... 78
3.8.6.4 Epic Virtual Machine ...... 79
3.8.6.5 Using the Epic Virtual Machine for Ancient Symbols ...... 79
3.8.7 Current Direction of Multilingual Encodings ...... 80
Chapter 4 ...... 81

4.1 Non-Latin Scripts ...... 82
4.1.1 Arabic and Hebrew Scripts ...... 83
4.1.1.1 Cursive ...... 83
4.1.1.2 Position ...... 84
4.1.1.3 Ligatures ...... 84
4.1.1.4 Mirroring ...... 85
4.1.2 Mongolian Script ...... 85
4.2 Bidirectional Layout ...... 86
4.2.1 Logical and Display Order ...... 86
4.2.2 Contextual Problems ...... 87
4.2.3 Domain Names ...... 88
4.2.4 External Interactions ...... 89
4.2.4.1 Line Breaking ...... 89
4.2.4.2 Glyph Mapping ...... 89
4.2.4.3 Behavioral Overrides ...... 90
4.2.5 Bidirectional Editing ...... 90
4.2.6 Goals ...... 90
4.3 General Solutions to Bidirectional Layout ...... 91
4.3.1 Forced Display ...... 91
4.3.2 Explicit ...... 91
4.3.3 Implicit ...... 92
4.3.4 Implicit/Explicit ...... 92
4.4 Implicit/Explicit Bidirectional Algorithms ...... 92
4.4.1 Unicode Bidirectional Algorithm ...... 93
4.4.2 IBM Classes for Unicode (ICU) and Java ...... 93
4.4.3 Pretty Good Bidirectional Algorithm (PGBA) ...... 94
4.4.4 Free Implementation of the Bidirectional Algorithm (FriBidi) ...... 94
4.5 Evaluation of Bidirectional Layout Algorithms ...... 94
4.5.1 Testing Convention ...... 95
4.5.2 Test Cases ...... 96
4.5.3 Test Results ...... 100
4.6 Functional Approach to Bidirectional Layout ...... 102
4.6.1 Haskell Bidirectional Algorithm (HaBi) ...... 103
4.6.1.1 HaBi Source Code ...... 106

4.6.1.2 Benefits of HaBi ...... 107
4.7 Problems With Bidirectional Layout Algorithms ...... 108
4.7.1 Unicode Bidirectional Algorithm ...... 109
4.7.2 Reference Implementation ...... 110
4.7.3 HaBi ...... 110
4.8 Limitations of Strategies ...... 111
4.8.1 Metadata ...... 111
Chapter 5 Enhancing Plain Text ...... 112

5.1 Metadata ...... 113
5.1.1 Historical Perspective ...... 113
5.2 Unicode Character Model ...... 116
5.2.1 Transmission Layer ...... 117
5.2.2 Code Point Layer ...... 117
5.2.3 Character/Control Layer ...... 119
5.2.4 Character Property Layer ...... 120
5.3 Strategies for Capturing Semantics ...... 121
5.3.1 XML ...... 121
5.3.2 Language Tagging ...... 123
5.3.2.1 Directional Properties of Language Tags ...... 125
5.3.3 General Unicode Metadata ...... 126
5.4 Encoding XML ...... 130
5.5 New XML ...... 134
5.6 Text Element ...... 134
5.7 Metadata and Bidirectional Inferencing ...... 138
5.7.0.1 HTML and Bidirectional Tags ...... 142
5.8 New Architecture ...... 143
Chapter 6 Metacode ...... 144

6.1 Metacode Architecture ...... 144
6.2 Metacode Compared to Unicode ...... 147
6.2.1 Transmission Layer ...... 147
6.2.2 Code Point Layer ...... 147
6.2.3 Character Layer ...... 148
6.2.3.1 Combining Characters ...... 148
6.2.3.2 Glyph Variants ...... 152
6.2.3.3 Control Codes ...... 153
6.2.3.4 Metadata Tag Characters ...... 155
6.2.4 Character Property Layer ...... 155
6.2.5 Tag Definition Layer ...... 157
6.2.6 Metacode Conversion ...... 158

6.2.7 Content Layer ...... 162
6.3 Data Equivalence ...... 162
6.3.1 Unicode Normalization ...... 162
6.3.1.1 Unicode Normal Forms ...... 163
6.3.1.2 Unicode Normalization Algorithm ...... 166
6.3.1.3 Problems with Unicode Normalization ...... 168
6.3.2 Data Equivalence in Metacode ...... 171
6.3.3 Simulating Unicode in Metacode ...... 174
6.4 Code Points vs. Metadata ...... 175
6.4.1 Metacode Principles ...... 175
6.4.2 Applying Metacode Heuristics ...... 176
6.4.2.1 Natural language text ...... 176
6.4.2.2 Mathematics ...... 177
6.4.2.3 Dance notation ...... 178
6.5 Benefits of Metacode ...... 180
Chapter 7 Conclusions ...... 182

7.1 Summary ...... 182
7.2 Contributions ...... 184
7.3 Limitations ...... 188
7.4 Future Work ...... 188
References ...... 190
Appendix A ...... 201
Appendix B ...... 205
Appendix C ...... 206
Appendix D ...... 209

List of Figures

Figure 2-1. Using Metaphors ...... 11
Figure 2-2. Macintosh trash can icon ...... 12
Figure 2-3. British post box ...... 12
Figure 2-4. French in France and French in Canada ...... 16
Figure 2-5. Character to glyph mapping ...... 19
Figure 2-6. Contextual glyphs ...... 19
Figure 2-7. Japanese furigana characters ...... 19
Figure 3-1. ASCII encoding ...... 25
Figure 3-2. ISO-8859-1 encoding ...... 31
Figure 3-3. ISO-8859-7 encoding ...... 32
Figure 3-4. EBCDIC encoding ...... 35
Figure 3-5. IBM 850 ...... 37
Figure 3-6. Windows 1252 ...... 39
Figure 3-7. Japanese characters ...... 40
Figure 3-8. Japanese characters ...... 41
Figure 3-9. Japanese characters ...... 41
Figure 3-10. Mixed DBCS and SBCS characters ...... 44
Figure 3-11. JIS X0212 PC encoding ...... 45
Figure 3-12. EUC encoding ...... 46
Figure 3-13. ISO-2022 encoding ...... 48
Figure 3-14. ISO-2022-JP encoding ...... 49
Figure 3-15. Forms of UCS-4 and UCS-2 ...... 63
Figure 3-16. Unicode layout ...... 65
Figure 3-17. Surrogate conversion ...... 66
Figure 3-18. Character set switching in Multicode ...... 73
Figure 3-19. TRON single byte character code ...... 75
Figure 3-20. TRON double byte character code ...... 77
Figure 4-1. Tunisian newspaper ...... 81
Figure 4-2. Ambiguous layout ...... 87
Figure 4-3. Rendering numbers ...... 87
Figure 4-4. Using a minus in a domain name ...... 88
Figure 4-5. Using a full-stop in a domain name ...... 89
Figure 4-6. Input and output of Haskell Bidirectional Reference ...... 104
Figure 4-7. Data flow ...... 105
Figure 5-1. Using LTRS and FIGS in ...... 114
Figure 5-2. ISO-2022 escape sequences ...... 116
Figure 5-3. Unicode Character Model ...... 117
Figure 5-4. Compatibility Normalization ...... 123
Figure 5-5. Language tag ...... 124
Figure 5-6. Error in bidirectional processing ...... 126
Figure 5-7. Error in language tagging ...... 126
Figure 5-8. Regular expressions for tags ...... 129

Figure 5-9. for text stream ...... 129
Figure 5-10. Sample tag ...... 130
Figure 5-11. Alternative language tag ...... 130
Figure 5-12. Sample XML code ...... 131
Figure 5-13. Sample XML code encoded in metadata ...... 133
Figure 5-14. Combining characters ...... 135
Figure 5-15. Joiners ...... 137
Figure 5-16. ELM tag ...... 137
Figure 5-17. Mapping from display order to logical order ...... 139
Figure 5-18. Example output stream ...... 140
Figure 5-19. Mathematical expression ...... 141
Figure 5-20. BDO tag syntax ...... 142
Figure 5-21. Using HTML bidirectional tags ...... 142
Figure 6-1. New Text Framework ...... 145
Figure 6-2. protocol ...... 149
Figure 6-3. protocol ...... 153
Figure 6-4. Spacing protocol ...... 155
Figure 6-5. Interwoven protocols ...... 158
Figure 6-6. Metacode code protocol ...... 159
Figure 6-7. ISO-2022 in Metacode ...... 161
Figure 6-8. Non interacting ...... 163
Figure 6-9. Compatibility equivalence ...... 163
Figure 6-10. Conversion to NFKD ...... 167
Figure 6-11. Conversion to NFD ...... 168
Figure 6-12. Protocol interaction ...... 169
Figure 6-13. Data mangling ...... 170
Figure 6-14. ...... 174
Figure 6-15. Metadata Question Exclamation Mark ...... 174
Figure 6-16. Simulating Unicode normalization ...... 175
Figure 6-17. Egyptian hieroglyphic phonogram ...... 177
Figure 6-18. Egyptian hieroglyphic ideograph ...... 177
Figure 6-19. Mathematical characters ...... 178
Figure 6-20. Action Stroke Dance Notation ...... 179
Figure 6-21. Action Stroke Dance Notation with movement ...... 180
Figure 6-22. Metacode Action Stroke Dance Notation tag ...... 180

List of Tables

Table 3-1. ASCII control codes ...... 25
Table 3-2. ISO variant characters ...... 28
Table 3-3. French version of ISO-646 ...... 28
Table 3-4. ISO-8859 layout ...... 30
Table 3-5. ISO-8859 standards ...... 33
Table 3-6. EBCDIC layout ...... 34
Table 3-7. IBM PC code pages ...... 36
Table 3-8. Windows code pages ...... 38
Table 3-9. JIS character standards ...... 43
Table 3-10. ISO-2022-JP escape sequences ...... 49
Table 3-11. Vendor encodings ...... 49
Table 3-12. GB standards ...... 53
Table 3-13. Taiwanese standards ...... 54
Table 3-14. Unicode sections ...... 64
Table 3-15. UTF-8 ...... 68
Table 3-16. Unicode transformation formats ...... 68
Table 4-1. Bidirectional character mappings for testing ...... 95
Table 4-2. Arabic charmap tests ...... 96
Table 4-3. Hebrew charmap tests ...... 97
Table 4-4. Mixed charmap tests ...... 98
Table 4-5. Explicit override tests ...... 99
Table 4-6. Arabic test differences ...... 100
Table 4-7. Hebrew test differences ...... 101
Table 4-8. Mixed test differences ...... 101
Table 5-1. Problem Characters ...... 119
Table 5-2. Character Properties ...... 120
Table 5-3. Tag characters ...... 128
Table 5-4. Other text tags ...... 138
Table 6-1. Excluded Unicode combining characters ...... 150
Table 6-2. Redefined Unicode combining characters ...... 150
Table 6-3. Unicode glyph composites ...... 153
Table 6-4. Unicode spacing characters ...... 154
Table 6-5. Metacode tag characters ...... 155
Table 6-6. Metacode character properties ...... 156
Table 6-7. Metacode case property values ...... 156
Table 6-8. Metacode direction property values ...... 156
Table 6-9. Metacode code point value property ...... 156
Table 6-10. Metacode tag property values ...... 156
Table 6-11. Major protocols ...... 158
Table 6-12. Converting deprecated Unicode code points to Metacode ...... 159
Table 6-13. Normalization forms ...... 164
Table 6-14. The string “flambé” ...... 166

Table 6-15. The string “flambé” in Metacode ...... 173

Acknowledgement

There is one person above all others who deserves my sincerest thanks and respect for his continuous support during the writing of this dissertation: my advisor, Dr. Ryan Stansifer. I could not have completed it without him.

There are many other people who contributed to this dissertation in many ways. First, I would like to thank my colleague, Mr. Ray Venditti, for securing my funding at IBM. He made the impossible possible. Second, my committee, Dr. Phil Bernhard, Dr. James Whittaker, and Dr. Gary Howell. They fostered a stress-free working relationship which was critical to the completion of this dissertation. Third, my uncle, Dr. Jeffrey Title, for all the hours of consultation, comments, and late nights. Lastly, Dr. Phil Chan for his invaluable suggestions.

Dedication

To my wife Sevim

To my Parents

To my Aunt and Uncle

1 Introduction

Computers process information in many natural languages. We call multilingual information processing the manipulation and storage of natural language text. We exclude symbolic, numeric, and scientific information processing. Natural language text is composed of characters. We employ discrete and finite sets of characters in order to capture text. Assigning different kinds of characters yields different kinds of text. Current information processing methods, cobbled together over time, capture natural language text imperfectly in information systems due to an over-reliance on rigid character sets (such as ASCII and EBCDIC). This dissertation is directed towards identifying and correcting existing challenges and imperfections in multilingual information processing.

The need for extremely accurate and elegant multilingual information processing has become pressing as the preponderance of data processing tasks changes from almost exclusively numerical to increasingly text oriented. The world community increasingly depends on written natural language text. E-mail has become a standard for messaging within and between companies and governments and among millions (perhaps billions) of individuals throughout the world.

For the most part, the information stored in computers codifies mankind’s writing (as opposed to other communication like art, music, speech, etc.). In this dissertation we are concerned exclusively with text — the text that represents the writing systems of the world. Because natural language writing systems are diverse, we have information processing challenges and limitations. We are not likely to soon solve all the difficulties presented. This dissertation will address the challenges and

limitations inherent in the diverse and complex multilingual environment. We propose a comprehensive plan for fundamental improvements in the conceptualization and implementation of a more effective approach to multilingual information processing.

Multilingual information processing, to date, has largely been centered around expanding and consolidating character sets. This character set centric approach enables software developers to merge text files composed of disparate character sets, hopefully reducing the complexity and cost of text processing. This oversimplified strategy has added significant problems for the designers of multilingual software.

In the character set centric approach text processing is viewed as a display oriented activity. This has resulted in the creation of character sets which maintain distinctions only for the sake of appearance or convenience of presentation. This bias towards display makes it more difficult to use character data for purposes other than presentation. Digital information becomes hidden or lost without a clear and unique choice for representing multilingual text.

In this dissertation we demonstrate that the character set centric perspective is restrictive, myopic, inconsistent, and clumsy. Character sets play a role, but cannot by themselves solve multilingual information processing problems. A subtle paradigm shift which we delineate in this dissertation allows a more natural solution to information processing. In our view information processing is best approached as an architectural problem. We introduce abstractions, general algorithms, novel mechanisms, and take the first steps towards organizing them into a new architecture for multilingual information processing.

In our architecture roles and responsibilities are clearly separated, and fundamental abstractions are insulated from change. This separation goes to the heart of our approach. By using the correct abstractions the common tasks only need to be written

once — correctly and cleanly. Additional applications can be built from this well defined set and achieve more functionality. Writing systems that have been given little attention can be accommodated without harmful interactions. Improvements of fidelity can happen without endless tinkering. It has been and will continue to be a continuous struggle to capture language more perfectly, even English. It is the nature of living languages to grow and evolve new forms. Change will inevitably happen. With the appropriate abstractions we minimize the potential trouble of adapting to these changes.

Higher-level mechanisms, such as XML-like technologies, offer a solution to some of the problems we attack. In several cases we show how to use these mechanisms to solve multilingual information processing problems. We show that there are limitations, however. Higher-level (XML-like protocols) and lower-level (character sets) mechanisms often attack the same problem independently and from different directions. This leads to redundant, overlapping, conflicting, and ill-suited solutions. These mechanisms are twisted into roles for which they are not suited. This strategy is an uncoordinated attack on a general problem. Our architecture provides a clearer separation of roles and responsibilities.

We acknowledge that it is not easy to switch directions in the development of multilingual information processing. We believe that there is a practical migration path and from time to time we point out how this can be facilitated. In fact, the changes we recommend to existing standards to accomplish our new approach are relatively small.

1.1 Architecture Overview

In this dissertation we decompose multilingual information processing into its main components. We approach the design of an architecture for multilingual information processing from a ground-up strategy. In our architecture we develop stackable layers which are similar to the layers found in networking architectures.

Each layer builds upon the abstraction of the prior lower layer. The various aspects common to writing systems are dissected and layered. These layers are used to construct an architecture for multilingual information processing. The number of architectural layers is large enough to allow for a clean separation of responsibilities, but not so small as to force unrelated functions together out of necessity. Each layer in our architecture serves a well defined purpose with clear and simple interfaces. This dissertation describes these architectural layers and establishes relationships between them.

In our framework we incorporate character sets, protocols and algorithms. Just like the character set centric view, we also use characters for representing text. In the character set centric view great emphasis is placed on viewing and printing text. This has resulted in the creation of character sets maintaining distinctions that are founded not in meaning, but only in appearance, ultimately leading to solutions that favor presentation, rather than content. In contrast to the character set centric view, characters in our architecture are distinguished by their underlying meaning.

We acknowledge that at times it may be necessary and useful to maintain distinctions based on appearance. Our architecture allows a focus on content without sacrificing display. However, we depart from the usual mechanisms of the character set centric approach and rely instead on higher-order protocols to capture this information.

In some cases information processing algorithms require information over and above the individual characters, for example: language information, line breaking, and non-breaking spaces. In the character set centric model such control information is coerced into characters. In many cases this has occurred in an ad-hoc, haphazard way, leading to character sets with confusing semantics. In our architecture we fix this problem through the use of well defined protocols to capture more of the underlying structure of text than is possible with character sets alone.

In this dissertation we are not specifically concerned with the definition of text protocols, rather we aim to develop a flexible, open, and extensible mechanism for their definition. We propose a multi-layered architecture where character sets appear in lower layers, and protocols and algorithms are in the higher layers. Our architecture is flexible — it is unnecessary to make character set changes as new protocols are adopted. The architecture is open — there is no limit to the number of possible protocols. Protocols in our architecture are extensible — numerous protocols can be interwoven without ill-effect.

In the character set centric model general purpose information processing algorithms are difficult to write, because of confusing character semantics and the overall bias towards display. Our architecture has no such bias or confusing semantics. Therefore, it is easier to construct general purpose information processing algorithms. In our architecture we provide a core set of general purpose algorithms that we believe are crucial for multilingual information processing.

1.2 Problem Statement

We seek to define and organize the primary components for multilingual information processing that unambiguously separate content, display, and control information.

1.3 Outline of Dissertation

Chapter 2 describes the overall field of software globalization. We introduce the concepts of internationalization, localization, and translation. We concentrate on the problems encountered during the creation of multilingual software. In particular, we look at cultural formatting, keyboard input, and character set problems.

Chapter 3 presents an in depth analysis of character sets and character coding systems. We start the chapter by introducing and defining the relevant terms related

to character coding systems. We describe in detail several monolingual character sets that cover both ideographic and syllabic scripts. We conclude the chapter with an examination of multilingual character coding systems.

Chapter 4 considers multilingual character coding problems. We examine the trouble caused when Arabic and English text are mixed in an information processing environment. This mixed text is called bidirectional text — text with characters written left-to-right and right-to-left. We examine several strategies for processing and displaying bidirectional text. We find the existing strategies inadequate, presenting evidence that the underlying character set centric model is insufficient.

Chapter 5 explores several strategies for addressing the shortcomings of the character set centric model. In particular, we look at using metadata (data describing data) to describe more of the underlying structure of scripts. We look at XML as a metadata model for multilingual information processing, but find it unsuitable. We define our own general metadata model, presenting evidence of its suitability for multilingual information processing. In demonstrating the features of our metadata model we return to the bidirectional text problem. We find that the general metadata model allows for a general reorganization of multilingual information processing.

Chapter 6 introduces our multilingual information processing architecture. Our architecture incorporates character sets, metadata, and core protocols, providing an overall framework for multilingual information processing. We call this architecture Metacode. To demonstrate the features of our architecture we use several writing systems as examples. We conclude by summarizing the benefits our architecture provides.

Chapter 7 discusses our contributions, limitations, and future work.

2 Software Globalization

Software globalization is concerned with the application of practices and processes to make a software product usable throughout the world. The term globalization refers to the whole process starting from an internationalized version of an application through the production of multiple localized versions of it. The three terms internationalization, localization, and translation broadly define the various subdisciplines of software globalization. Software globalization builds upon internationalization, localization, and translation. In this dissertation we define internationalization as the process of creating cultural and language neutral software. The term localization refers to the process of adapting a software product to a specific culture and language. The term translation is defined as the process of converting human readable text from one language into another. [55]

Throughout this chapter we discuss the primary subdisciplines that make up the field of software globalization. In each subdiscipline we outline the principal issues and current trends. The software globalization area is a relatively new field of study, and as we will see the boundaries of the field are still open to debate.

2.1 Overview

Unlike other research areas of computer science, software globalization has arisen not from academia, but rather from industry. We argue that this industrial movement has occurred for the following reasons: increasing user expectations, proliferation of distributed computing, explosive software development costs, compounding maintenance outlays, and governmental requirements.

The market for computing in the 1950’s and early 1960’s was not home users, but rather large institutions (banking, industrial, governmental, and research). In many cases computing at these large institutions was primarily for the purposes of number crunching. Therefore, linguistic/cultural support was not a strong requirement. In fact, even English was not fully supported (lower case characters do not appear until 1967). Moreover, if a system did provide additional linguistic support it was usually a language that could be represented using the Latin script.

In the late 1970’s we see a dramatic shift in computing from number crunching to information processing. This shift, coupled with the advent of personal computing in the 1980’s, caused user expectations to rise. It was no longer possible to insist that users conform to the computer/software, but rather software must adapt to the user, making linguistic/cultural support a must. Most commercial software during this time period was developed within the United States, so development teams had little experience with linguistic/cultural issues. Support for other languages/cultures was viewed as customization/translation and not as part of main line development. This customization/translation activity was farmed out to another organization within the company or an outside localization firm. Therefore, software development/delivery became staged. By staged we mean, first the US version would be delivered followed by various language versions. This overall strategy caused software development costs to explode. [29],[61],[85]

The ever increasing costs can be attributed to two factors. First, customization, in reality, requires more than just translation, because in many cases the source code itself required modification. In some cases source modifications were incompatible with changes made for other languages. It became necessary to maintain separate source lines for the individual languages, further increasing development costs. Maintaining these separate source lines had a rippling effect throughout the entire software development process. Its effects were most strongly felt during maintenance, because in many cases corrective fixes could not be applied universally across

the various language versions. Multiple fixes would be created, further increasing the overall software development cost. [61],[85]

The multiple source line development strategy had a direct impact upon customers, in particular large multinational customers. Most multinational customers wanted the ability to simultaneously roll out multiple language software solutions across their entire corporation. This was difficult to achieve, because of the staged development/delivery method in effect at the time. Each specific language version was based upon a different source line, thereby requiring the customer to repeat the entire certification process for each language. By the late 1980’s this strategy became crippling as costs skyrocketed. [70]

A strong push from multiple source line development to single source line development was undertaken in the 1990’s. It is at this point in time that we see the beginnings of internationalization and localization. It was a move in the right direction, but the internationalization problem turned out to involve a lot more factors than simply translating and converging multiple source lines. In the sections below we explore these factors, which in turn form the six individual sub fields of software globalization: Translation, International User Interfaces, Cultural/Linguistic Formatting, Keyboard Input, Fonts, and Character Coding Systems.

2.2 Translation

Naturally, a software system must be translated into the user’s native language for it to be useful. At first it might seem that translating a program’s messages and user interface elements is a relatively simple task. There are subtleties, however.

Literal translations of text strings in applications without regard for human factors principles may serve as a source for confusing or misleading user interfaces. One such example comes from the Danish translation of MacPaint. One of the menus in MacPaint is called “Goodies” which is acceptable as the name of a menu in

English. The literal translation of goodies in Danish is the word godter. This is a proper translation, but has an entirely different connotation (mostly having to do with candy), leading users to confusion. [8]

Translation of user manuals must also be approached with caution. For example, the following is a sample of an English translation from the preface of a manual in Japanese:

“We sincerely expect that the PRINTER CI-600 will be appreciated more than ever, in the fields of ‘data-transformation’ by means of human-scale, and the subsequent result of ‘fluent metabolism’ as regards the artificial mammoth creature-systematized information within the up-to-date human society.”

Clearly this example demonstrates that the problem of translation is one of global importance. [8]

2.3 International User Interfaces

Software must fit into the cultural context of the user. Internationalized applications should appear as if they were custom designed for each individual market. This requires a deep understanding of not only languages, but also of cultural norms, taboos, and metaphors. [91],[8]

2.3.1 Metaphors

Just as metaphors permeate our everyday communication, so do they occur throughout the interfaces we use. It is the use of metaphors that makes our interaction with software seem more natural and intuitive. For example, the icons shown in Figure 2-1 are from Lotus 1-2-3. The second icon, with somebody running, is intended to let you “run” a macro [29]. The problem is that not everyone “runs” a macro. In France, for example, the metaphor isn’t running a macro, it’s “throwing” a macro. So using the metaphor of somebody running makes no sense for users in France. What may seem natural to users in the United States may not appear to be

obvious to users in other markets. Therefore, as software is internationalized and localized it is imperative that metaphors be tailored for each market just as language is. [25],[8]

Figure 2-1. Using Metaphors

2.3.2 Geometry

Even visual scanning patterns are culturally dependent. Studies with English and Hebrew readers have demonstrated the existence of these differences. English readers tend to start scanning an object from the left quadrant, while Hebrew readers tend to scan from the right quadrant. This is probably due to the fact that each language has a specific scanning direction (English left-to-right and Hebrew right-to-left). This implies that the formatting and positioning of windows in an application must also be tailored for each market. [1],[8]

2.3.3 Color

The varied use of color in everyday things is common. However, cultural differences can affect the meanings attributed to color. For example, in the United States the color red is often used to indicate danger, while in China the color red represents happiness. The color white in the United States represents hope and purity, however in Japan the color white represents death. By using color correctly, interfaces can be created where color can impart and reinforce information conveyed by other media, such as text. [83],[91],[8]

2.3.4 Icons

The abundant use of icons in today’s graphical interfaces aims to provide a more intuitive user interface by equating an underlying action or function to a symbol. The seemingly simple trash icon in the Macintosh interface seems at first to have only one underlying meaning, disposing of files. See Figure 2-2. Nevertheless, in the United Kingdom an entirely different meaning was ascribed to this icon. It turned out that the trash can icon resembled a British post box better than a British trash can, creating confusion. See Figure 2-3. Therefore, even the style of icons is culturally dependent. [37],[38],[81],[91],[8]

Figure 2-2. Macintosh trash can icon

Figure 2-3. British post box

2.3.5 Sound

Not only are icons culturally sensitive, but the use of sound is as well. The use of sound has improved the “user-friendliness” of applications by reinforcing visually displayed messages. At first appearance this seems to be culturally neutral, but Lotus corporation discovered that this was not true. Lotus spent considerable time and money to internationalize 1-2-3 for the Japanese market, but were dismayed to find that 1-2-3 did not receive a favorable reception. Lotus was unaware that the simple “error beep” present in 1-2-3 was a source of great discomfort for the Japanese. The

Japanese tend to work in highly crowded offices and are fearful of letting coworkers know they are making mistakes. Hence it was only natural that the Japanese avoided using 1-2-3 when they were at work. [91],[14]

The range of issues just described broadly defines the area of interest to the international user interface community. The design of international user interfaces is mostly viewed as a facet of localization. This is not a strict rule as there is some amount of overlap with some of the other sub fields of globalization.

2.4 Cultural and Linguistic Formatting

Differences in notational conventions provide yet another area that must undergo internationalization. For example, although all countries have some form of a monetary system, few countries agree on details such as the currency symbol and the formatting of currency. Similarly, most people agree that time is measured by the cycles of the earth and the moon, but the ubiquitous Gregorian calendar used in Europe and North America isn’t necessarily utilized all around the world. Such differing views require highly flexible software that can process and represent data for a wide variety of cultures and languages. In the following paragraphs we discuss some of the common issues surrounding cultural and linguistic formatting. [85]

2.4.1 Numeric Formatting

Most regions of the world adhere to Arabic numbers, where values are base 10. Problems arise when large and fractional numbers are used. In the United States, the “radix” point (the character which separates the whole part of the number from the fractional part) is the full-stop. Throughout most of Europe the radix point is a comma. In the United States the comma is used as a break for separating groups of numbers. Europeans use single quotes to separate groups of numbers. In addition to the problem of formatting numeric quantities, the interpretation of the values of numbers is also culturally sensitive. For example, in the United States a “billion” is

represented as a thousand million (1,000,000,000), but in Europe a billion is a million million (1,000,000,000,000), which is a substantially larger quantity. [91],[8]
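To make the radix point and grouping differences concrete, the following short Java sketch (illustrative only and not part of the original text; it assumes the standard java.text.NumberFormat class and the locale data shipped with the runtime) formats the same value under three locales. The exact separators produced depend on the runtime's locale tables.

import java.text.NumberFormat;
import java.util.Locale;

public class NumberFormatting {
    public static void main(String[] args) {
        double value = 1234567.89;
        // United States: full-stop as the radix point, comma as the grouping separator
        System.out.println(NumberFormat.getNumberInstance(Locale.US).format(value));
        // Germany: comma as the radix point, full-stop as the grouping separator
        System.out.println(NumberFormat.getNumberInstance(Locale.GERMANY).format(value));
        // France: comma as the radix point, a (non-breaking) space as the grouping separator
        System.out.println(NumberFormat.getNumberInstance(Locale.FRANCE).format(value));
    }
}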

2.4.2 Date and Time Formatting

As numbers are culturally dependent so are dates. For example the date 11/1/1993 is interpreted as November 1st, 1993 in the United States, but throughout most of Europe the date is interpreted as January 11th, 1993. On the surface this does not seem to be a difficult problem to handle, but countries that use non-western calendars (Japan, China, Israel, etc.) all have different ways of keeping track of the date. Just like dates the expression of time is also culturally dependent. For example, in the United States time is based on a 12 hour time table, while other countries prefer a 24 hour system. [91],[8]
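A similar locale-sensitive formatter exists for dates and times. The sketch below (again illustrative only, using the standard java.text.DateFormat class) shows the month/day/year versus day/month/year orderings and the 12 hour versus 24 hour clock; the actual output depends on the runtime's locale data.

import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;

public class DateFormatting {
    public static void main(String[] args) {
        Date now = new Date();
        // Short date style: month/day/year in the United States
        System.out.println(DateFormat.getDateInstance(DateFormat.SHORT, Locale.US).format(now));
        // Short date style: day/month/year in the United Kingdom
        System.out.println(DateFormat.getDateInstance(DateFormat.SHORT, Locale.UK).format(now));
        // Time: 12 hour clock with AM/PM in the United States
        System.out.println(DateFormat.getTimeInstance(DateFormat.SHORT, Locale.US).format(now));
        // Time: 24 hour clock in Germany
        System.out.println(DateFormat.getTimeInstance(DateFormat.SHORT, Locale.GERMANY).format(now));
    }
}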

2.4.3 Calendar Systems

Many countries use the 12 month, 365 day Gregorian calendar and define January 1 as the first day of a new year. Many Islamic countries (e.g., Saudi Arabia and Egypt) also use a calendar with 12 months, but only 354 or 355 days. Because this year is shorter than the Gregorian year, the first day of the Islamic year changes from year to year on the Gregorian calendar. Therefore, the Islamic calendar has no connection to the physical seasons. [75]

Israelis use the Hebrew calendar that has either 12 or 13 months, based upon whether it is a leap year. Non leap years have 12 months and between 353 and 355 days. In a leap year, however, an extra thirteenth month is added. This extra month allows the Hebrew calendar to stay synchronized with the seasons of the year. [75]

Determining the year also depends on the calendar system. For example, in Japan two calendar systems are used: Gregorian and Imperial. In the Imperial system the year is based on the number of years the current emperor has been reigning. When the emperor dies and a new emperor ascends the throne, a new era begins, and the

year count for that era begins. Furthermore, in Japan it is culturally unacceptable to create a software calendar system that allows a user to reset the year to the beginning, since this implies the imminent demise of the current emperor. [75],[14]
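Modern class libraries expose some of these calendar systems directly. The following Java sketch (illustrative only; it uses the java.time.chrono package, which postdates this dissertation) converts one Gregorian date into the Islamic (Hijrah) calendar and into the Japanese Imperial calendar, the latter expressed as an era plus a year of reign.

import java.time.LocalDate;
import java.time.chrono.HijrahDate;
import java.time.chrono.JapaneseDate;

public class CalendarSystems {
    public static void main(String[] args) {
        // A fixed date in the Gregorian calendar
        LocalDate gregorian = LocalDate.of(2001, 12, 1);
        // The same day in the Islamic (Hijrah) calendar
        System.out.println(HijrahDate.from(gregorian));
        // The same day in the Japanese Imperial calendar (era + year of the reign)
        System.out.println(JapaneseDate.from(gregorian));
    }
}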

2.4.4 Measurement

Systems of measurement also vary throughout the world. For example, most of Europe uses units of measurement that are based upon the MKS system (meters, kilograms, and seconds), while in the United States measurement is based upon customary English units (inches, pounds, and seconds). Such differences need to be accounted for within software systems. For example, a word processor should display its ruler in inches when used in the United States, but should display its units in centimeters when used in Europe. [91],[8]

2.4.5 Collating

Naturally, numeric data is not the only kind of data subject to cultural and linguistic influences. The ordering of words (character data) is dependent upon both language and culture. Even in languages that use the same script system, collation orders vary. In Spanish the character sequence “cho” comes after “co”, because “ch” is collated as a single character, while in English “cho” comes before “co”.

In many cases the sorting conventions used by Asian languages are more complex than the sorting conventions of western languages. In Asian languages ideographic characters occur more frequently than alphabetic characters. The concepts used in sorting alphabetic scripts are not necessarily applicable to ideographic scripts. Ideographic characters represent concepts and thus can be sorted in a variety of ways (e.g., by phonetic representation, by the character’s base building block, or by the number of strokes used to write the character).
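Locale-sensitive comparison is usually packaged as a collator object. The Java sketch below (illustrative only, using java.text.Collator) sorts the same words under English and Spanish collators; whether the Spanish collator applies the traditional rule that treats "ch" as a single collating element, or the modern post-1994 rule, depends on the collation data shipped with the runtime.

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class Collation {
    public static void main(String[] args) {
        String[] words = { "co", "cho", "do" };

        // English collation: plain letter-by-letter comparison, so "cho" sorts before "co"
        Arrays.sort(words, Collator.getInstance(Locale.ENGLISH));
        System.out.println(Arrays.toString(words));

        // Spanish collation: under the traditional rules "ch" is one collating
        // element and "cho" sorts after "co"
        Arrays.sort(words, Collator.getInstance(new Locale("es", "ES")));
        System.out.println(Arrays.toString(words));
    }
}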

2.4.6 Character Classification

In the development of internationalized software we often find it useful to classify characters (e.g., alphabetic, uppercase, lowercase, numeric, etc.) because it eases processing of character data. Different languages, however, may classify characters differently. For example, in English only 52 characters are classified as alphabetic. Danish classifies 112 characters as alphabetic. [85]

Many languages have no concept of case (lowercase and uppercase). Languages based on ideographic characters (e.g., Chinese, Japanese, and Korean) all have a single case. This single case property can also be found in some phonetic scripts (e.g., Arabic and Hebrew). [75]

Even languages that make case distinctions do not always use the same case conversion rules. For example, in German the Eszett character “ß” is a lowercase character that gets converted to two uppercase S’s. Therefore, not every lowercase character has a simple uppercase equivalent. [75]

In some situations even the same language may use different case conversion rules. For example, in French when characters are converted from lowercase to uppercase the diacritics are usually removed. On the other hand, in French Canadian diacritics are retained during case conversion. See Figure 2-4. Therefore, knowing only the language does not guarantee proper case conversion. [75]

Figure 2-4. French in France and French in Canada
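Case conversion libraries make both points visible. The Java sketch below (illustrative only, using the locale-aware String.toUpperCase method) shows the one-to-two mapping of the German Eszett; note that the library retains French diacritics when uppercasing, so stripping them, the convention described above for France, remains an application-level decision.

import java.util.Locale;

public class CaseConversion {
    public static void main(String[] args) {
        // German Eszett: one lowercase character becomes two uppercase letters,
        // so case conversion is not a one-to-one character substitution
        System.out.println("straße".toUpperCase(Locale.GERMAN));      // STRASSE

        // French: the library keeps the diacritics; removing them is a
        // separate, locale-specific typographic convention
        System.out.println("éventualité".toUpperCase(Locale.FRANCE)); // ÉVENTUALITÉ
    }
}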

2.4.7 Locales

The cultural and linguistic formatting information described above is generally captured as a collection of data items and algorithms. This repository of cultural and linguistic formatting information is called a locale. Application developers use locales to internationalize/localize software. Locales are usually supplied by operating systems and can be found in both PC systems (e.g., OS/2, Windows, MacOS) and host based systems (e.g., OS/400 and System 390). It is becoming increasingly popular to find locales not only in operating systems but in programming languages (e.g., Java, Python, and Haskell) as well. In general there is little variation between the cultural formatting information found in the numerous locale implementations. [9],[108]

Locales are named according to the country and language they support. The name of a locale is based on two ISO (International Standards Organization) standards: ISO-639 (code for the representation of language names), and ISO-3166 (codes for the representation of names of countries). For the most part the name of a locale is formed by concatenating a code from ISO-639 with a code from ISO-3166. [95]
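For example, in Java (one of the languages mentioned above) a locale object is built directly from an ISO-639 language code and an ISO-3166 country code; the sketch below is illustrative only.

import java.util.Locale;

public class LocaleNames {
    public static void main(String[] args) {
        // ISO-639 language code "fr" + ISO-3166 country code "CA"
        Locale canadianFrench = new Locale("fr", "CA");
        System.out.println(canadianFrench);                                // fr_CA
        System.out.println(canadianFrench.getDisplayName(Locale.ENGLISH)); // French (Canada)

        // The same language paired with a different country selects
        // different cultural conventions
        Locale metropolitanFrench = new Locale("fr", "FR");
        System.out.println(metropolitanFrench);                            // fr_FR
    }
}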

The setting of a locale is usually done by an individual user or a system administrator. When a system administrator configures a locale for a system, it is likely that they are selecting the default locale for an entire system. Depending on the system, a user may or may not have the ability to switch or modify a locale. [75]

2.5 Keyboard Input

Differences in writing systems also pose unique challenges. This is especially true for Chinese, Japanese, and Korean, because these languages have a large number of characters and thus require a special mechanism to input them from a keyboard. This special mechanism is known as an input method editor (IME). IMEs enable the

input of a large number of characters using a small set of physical keys on a keyboard. [75]

There are three types of IMEs: telegraph code, phonetic, and structural. A telegraph code IME allows the user to enter the decimal or hexadecimal value of a character rather than the actual character itself. This is a very simple to use IME, however, it requires the user to either memorize the decimal values of characters or carry around a book that lists all the values. [75]

A phonetic IME takes as input the phonetic representation of a character or sequence of characters rather than the characters themselves. The IME converts the phonetic representation into characters by using a dictionary. In some cases the phonetic representation may yield more than one result. When more than one result is obtained the user is given a list from which they make a selection for the appropriate conversion. The phonetic IME does not require a user to memorize decimal values, however it takes longer to determine the appropriate conversion due to dictionary lookup. [75]

In a structural IME users enter a character by selecting the character’s building blocks. In many cases the same building blocks could generate more than one result. When multiple results are obtained the user is shown a list of candidates and is asked to pick the most appropriate character. Structural based IMEs do not require dictionary lookup, however they usually require the user to search through longer lists of characters. [75]
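The core of a phonetic IME is a dictionary lookup that maps a reading to one or more candidate characters. The Java sketch below is a deliberately tiny, illustrative model of that lookup step only; the two readings and their candidates are hard-coded here and are not a real IME dictionary, and a real IME would also handle segmentation, ranking, and user selection.

import java.util.List;
import java.util.Map;

public class PhoneticIme {
    // Toy conversion dictionary: a romanized reading maps to candidate characters
    private static final Map<String, List<String>> DICTIONARY = Map.of(
            "nihon", List.of("日本"),
            "kanji", List.of("漢字", "感じ", "幹事"));

    // Return the candidates for a reading; with several candidates a real IME
    // would present the list and let the user pick the intended one
    static List<String> candidates(String reading) {
        return DICTIONARY.getOrDefault(reading, List.of(reading));
    }

    public static void main(String[] args) {
        System.out.println(candidates("kanji"));  // [漢字, 感じ, 幹事]
        System.out.println(candidates("nihon"));  // [日本]
    }
}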

2.6 Fonts

Users don’t view or print characters directly, rather a user views or prints glyphs (a graphic representation of a character). Fonts are collections of glyphs with some element of design consistency. Many glyphs do not have a one-to-one relationship to characters. Sometimes glyphs represent combinations of characters. For

example, a user might type two characters, which might be displayed using a single glyph. See Figure 2-5. In other cases the choice of a glyph may be context dependent. For example, a character may take different forms depending on its position within a word: a separate glyph for a character at the beginning, middle, and end of a word. See Figure 2-6. [29],[69],[7]

Figure 2-5. Character to glyph mapping

Figure 2-6. Contextual glyphs

As scripts vary, so do the fonts that can display them. Assumptions that fonts of the same size and style will display different scripts at approximately the same size are incorrect. This is especially true of fonts from Asia. Some scripts have constructs that go beyond simply placing one glyph after another in a row or column (e.g., Japanese furigana, which are used for annotation). The furigana usually appear above ideographic characters and are used as a pronunciation guide. See Figure 2-7. Fully internationalized systems must take such issues into consideration. [29]

Figure 2-7. Japanese furigana characters

Recently there has been a strong push towards fonts with large glyph repertoires covering a wide variety of scripts. These fonts aim to simplify the construction of international software by providing both a uniform and consistent strategy for the presentation of character data. Such examples include Apple’s and Microsoft’s TrueType technology, Adobe’s OpenType, and Apple’s Advanced Typography.

2.7 Character Coding Systems

One of the first steps in establishing an internationalized system is to modify programs so that they allow characters in a variety of languages and scripts. Software systems that support only a single language or script satisfy the needs of only a limited number of users. In computers scripts are captured through collections of discrete characters. Most people have an instinctive feeling for what a character is, rather than a precise definition. Naturally, many people would agree that the Latin alphabetic letters, Chinese ideographs, and digits are characters. The problems come when something looks like a single unit of a natural language but is actually composed of multiple subpieces and when something is not really part of a natural language yet is actually represented with a character.

For many years the ASCII (American Standard Code for Information Interchange) encoding served as the definitive mechanism for storing characters. ASCII only allows for the encoding of 128 unique characters, due to the fact that ASCII uses only 7 bits for encoding. Languages that contain more than 128 unique characters require customized encoding schemes. This prevents efficient document exchange from occurring. Currently work by the Unicode Consortium is underway on the adoption of a standard, one that uses 16 bits, greatly increasing the number of scripts that can be represented.
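The practical effect of a 128-character repertoire is easy to demonstrate. In the Java sketch below (illustrative only, using the standard java.nio.charset classes), encoding a plain ASCII string yields one byte per character, while a character outside ASCII is silently replaced by '?' (0x3F) and the distinction is lost.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiLimits {
    public static void main(String[] args) {
        // Every ASCII character fits in 7 bits; the encoded bytes are the code point values
        System.out.println(Arrays.toString("Cafe".getBytes(StandardCharsets.US_ASCII)));

        // "é" has no ASCII code point; the encoder substitutes '?' (0x3F),
        // silently losing information
        System.out.println(Arrays.toString("Café".getBytes(StandardCharsets.US_ASCII)));
    }
}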

We select the area of character coding systems for further study because character coding systems serve as the foundation for multilingual information processing. As we will demonstrate, the work in this area has not developed to a level of

sophistication for satisfying the needs of the multilingual information processing community. In this dissertation we recast character coding systems in a new light.

3 Character Coding Systems

In this chapter we explore the various approaches to encoding character data. We start the chapter by introducing terms relating to character coding systems followed by an examination of monolingual character coding systems, later turning our attention to multilingual coding systems. In this dissertation the term “monolingual encoding” refers to a character coding system that has strict limits on the number of scripts that can be represented, generally fewer than ten. We use the term “multilingual encoding” to refer to character coding systems that do not have strict limitations on the number of scripts that can be represented.

3.1 Terms

In this section, we define the terms related to character coding systems. The terminology used to describe character coding systems is often confusing, causing terms to be mistakenly used. The definitions of the terms are not formally standardized, therefore there can be some variation of terms across standards. In this dissertation we follow ISO (International Organization for Standardization) definitions where they exist. In cases where an ISO definition does not exist we use the definitions found in RFC 2130 and RFC 2277. RFCs (Requests For Comments) are published by the IETF (Internet Engineering Task Force). The IETF defines protocols that are used on the internet. Nevertheless, there are cases in which we choose to use alternative definitions over those found in RFCs. These alternative definitions are taken

from popular literature in the field of character encoding systems and are used when the popular definition is more widely accepted. [5],[64],[106]

The following terms and definitions are used in this dissertation:

• Character — “smallest component of written language with semantic value (including phonetic value).” [104]
• Control character — a character that affects the recording, processing, transmission or interpretation of data. [27]
• Graphic character — a character, other than a control character, that has a visual representation. [27]
• Combining character — a member of an identified subset of a coded character set, intended for combination with the preceding or following character. [27]
• Glyph or glyph image — the actual concrete shape or representation of a character. [96],[106]
• Ligature — a single glyph that is constructed from two or more glyphs. [57]
• Octet — “an 8-bit byte.” [104]
• Character set — “a complete group of characters for one or more writing systems.” [104]
• Code point — “a numerical value (or position) in an encoding table (coded character set) used for encoding characters.” [96]
• Escape sequence — a sequence of bit combinations that is used for control purposes in code extension procedures. [27]
• Coded character set — “a mapping from a set of abstract characters to a set of integers (code points).” [104],[106]
• Code extension — a technique for encoding characters that are not included in a coded character set. [27]
• Character encoding scheme — “a mapping from a coded character set (or several) to a set of octets.” [104],[106]
• Code page — “a coded character set and a character encoding scheme which is part of a related series.” [104]

• Single byte character set (SBCS) — A character set whose characters are represented by one byte. [43]

• Double byte character set (DBCS) — A character set whose characters are represented by two bytes. [43]

• Multiple byte character set (MBCS) — A character set whose characters are represented using a variable number of bytes. [43]

• Transport protocol — A data encoding solely for transmission purposes. [103]

3.2 Character Encoding Schemes

In general we divide character encoding schemes into two broad categories: monolingual and multilingual encodings. Additionally, character encoding schemes can be further divided into the following categories [64]:

• Fixed width encoding — in a fixed width encoding each character is represented using the same number of bytes.
• Modal encoding — in a modal encoding escape sequences are used to signal a switch between various character sets or modes.
• Non-modal or variable width encoding — in a non-modal encoding the code point values themselves are used to switch between character sets. Therefore, characters may be represented using a variable number of bytes. Characters in a non-modal encoding typically range from one to four bytes.
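The distinction between a coded character set and a character encoding scheme, and between fixed and variable width schemes, can be made concrete with a short Java sketch (illustrative only, using java.nio.charset): the coded character set assigns “é” the code point 0xE9, and each encoding scheme then maps that code point to a different sequence of octets.

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingSchemes {
    static void show(String text, Charset charset) {
        System.out.println(charset + ": " + Arrays.toString(text.getBytes(charset)));
    }

    public static void main(String[] args) {
        String text = "é";                        // code point 0xE9
        show(text, StandardCharsets.ISO_8859_1);  // fixed width: one octet, 0xE9
        show(text, StandardCharsets.UTF_16BE);    // fixed width (two octets per code unit): 0x00 0xE9
        show(text, StandardCharsets.UTF_8);       // variable width: two octets, 0xC3 0xA9
    }
}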

In the next section we study the character encoding schemes used predominantly in Western and Eastern Europe. We then turn our attention to the encoding systems used in Asia. In particular, we explore the encodings used to represent Japanese, Chinese, and Korean. Finally, we conclude the chapter with multilingual character encodings.

3.3 European Encodings

During the early 1960’s the industry took the Latin alphabet, European numerals, punctuation marks, and various hardware control codes, and assigned them to a set of integers (7-bits) and called the resulting mapping ASCII (American Standard Code for Information Interchange). ASCII is both a coded character set and an encoding. In ASCII each code point is represented by a fixed width 7-bit integer. It

is important to note that in ASCII characters are identified by their graphic shape and not by their meaning. Identifying characters by their shape simplifies character set construction. For example, in ASCII the code for an apostrophe is the same whether it is used as an accent mark, or as a single quotation mark. This simplification technique is certainly not a new idea. Typographers have been doing this for many years. For example, we no longer see Latin final character forms being used in English. [15],[26],[104],[87]

The ASCII encoding is divided into two broad sections: controls and graphic characters. ASCII controls are in the hex ranges 0x00-0x1F and 0x7F, while the ASCII graphic characters are in the hex range 0x20-0x7E. See Figure 3-1. ASCII’s control codes are listed in Table 3-1. [15],[26],[104]
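As a concrete check of these ranges, the short Python sketch below (the function name is ours, purely for illustration) classifies a 7-bit code point as a control or a graphic character using exactly the boundaries just given.

```python
def classify_ascii(code_point):
    """Classify a 7-bit ASCII code point using the ranges described above."""
    if not 0x00 <= code_point <= 0x7F:
        raise ValueError("not a 7-bit ASCII code point")
    if code_point <= 0x1F or code_point == 0x7F:
        return "control"   # 0x00-0x1F and 0x7F (DEL) are controls
    return "graphic"       # 0x20-0x7E are graphic characters

# 0x0A (LF) is a control; 0x41 ("A") is a graphic character.
assert classify_ascii(0x0A) == "control"
assert classify_ascii(0x41) == "graphic"
```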

Figure 3-1. ASCII encoding

Table 3-1. ASCII control codes

Value (hex)   Abbreviation   Name
0x00          NULL           null/idle
0x01          SOM            start of message
0x02          EOA            end of address
0x03          EOM            end of message
0x04          EOT            end of transmission
0x05          WRU            “who are you..?”
0x06          RU             “are you..?”
0x07          BELL           audible signal
0x08          FE0            format effector
0x09          HT/SK          horizontal tab/skip
0x0A          LF             line feed
0x0B          VTAB           vertical tabulation
0x0C          FF             form feed
0x0D          CR             carriage return
0x0E          SO             shift out
0x0F          SI             shift in
0x10          DC0            data link escape
0x11          DC1            device control
0x12          DC2            device control
0x13          DC3            device control
0x14          DC4            device control stop
0x15          ERR            error
0x16          SYNC           synchronous idle
0x17          LEM            logical end of media
0x18          S0             separator
0x19          S1             separator
0x1A          S2             separator
0x1B          S3             separator
0x1C          S4             separator
0x1D          S5             separator
0x1E          S6             separator
0x1F          S7             separator
0x7F          DEL            delete/idle

3.3.1 ISO 7-bit Character Encodings

Naturally, the number of languages that can be represented in ASCII is quite limited. Only English, Hawaiian, Indonesian, Swahili, Latin and some Native American languages can be represented. Even though ASCII is capable of representing English, ASCII at its core is uniquely American. For example, ASCII includes the dollar sign “$”, but no other currency symbols. So while British users can still use ASCII, they have no way of representing their currency symbol, the pound sterling “£”. [75],[104]

Seeing a need for representing other languages, system vendors created variations on ASCII that included characters unique to specific languages. These variations are standardized in ISO-646. ISO-646 defines rules for encoding graphic characters in the hex range 0x20-0x7E. See Figure 3-1. Standards bodies then apply the rules to create language/script specific versions of ISO-646 sets. The ISO-646 standard defines one specific version that it calls the IRV (International Reference Version), which is the same as ASCII. [75],[46]

In ISO-646 a specific set of characters is guaranteed to be in every language/script specific version of ISO-646. These characters are referred to as the invariant characters. The invariant characters include most of the ASCII characters. Nevertheless, ISO also defines a set of characters that it calls variant characters.

Variant characters may be replaced by other characters that are needed for specific languages/scripts. See Table 3-2. [75],[46]

Table 3-2. ISO variant characters

Value (hex)   Character   Name
0x23          #           number sign
0x24          $           dollar sign
0x40          @           commercial at
0x5B          [           left square bracket
0x5C          \           backslash
0x5D          ]           right square bracket
0x5E          ^           circumflex
0x60          `           grave accent
0x7B          {           left curly bracket
0x7C          |           vertical bar
0x7D          }           right curly bracket
0x7E          ~           tilde

There is no requirement that the variant characters be replaced in any of the language/script specific versions of ISO-646. For example, the French version of ISO-646 does not replace the circumflex “^” and the dollar sign “$”. See Table 3-3. Unfortunately, French ISO-646 is still not capable of representing all of French in spite of these changes. In French all vowels can take accents, however there is not enough room to represent them in ISO-646, hence the need for 8-bit encodings. [75],[46]

Table 3-3. French version of ISO-646

Value (hex)   Character   Name
0x23          £           pound sterling
0x24          $           dollar sign
0x40          à           a with grave
0x5B          °           ring above
0x5C          ç           c with cedilla
0x5D          §           section sign
0x5E          ^           circumflex
0x60          µ           micro sign
0x7B          é           e with acute
0x7C          ù           u with grave
0x7D          è           e with grave
0x7E          ¨           diaeresis

3.3.2 ISO 8-bit Character Encodings

Around the late 1980s, the European Computer Manufacturers Association (ECMA) began creating and, with ISO, issuing standards for encoding European scripts based on an 8-bit code point. These standards have been adopted by most major hardware and software vendors. Such companies include IBM, Microsoft, and DEC [50]. The most popular and widely used of these encodings is ISO-8859-1, commonly known as Latin 1. [27],[104],[47]

The ISO-8859-1 encoding contains the characters necessary to encode the Western European and Scandinavian languages. See Figure 3-2 [44]. ISO-8859-1 is just one of many ISO-8859-x character encodings. Most of the ISO-8859-x encodings existed as national or regional standards before ISO adopted them. Each of the ISO-8859-x encodings has some common characteristics. Moreover, all of them function as both coded character sets and character encoding schemes. [26],[104],[47]

The ISO-8859-x encoding is divided into four sections. See Table 3-4. In each ISO-8859 encoding the first 128 positions are the same as ASCII. However, unlike ASCII the ISO-8859-x encodings rely on an 8-bit code point. Characters that

are required for specific scripts/languages appear in the 0xA0-0xFF range. This allows for the addition of 96 graphic characters over what ISO-646 IRV provides. Graphic characters that appear in multiple ISO-8859-x encodings all have the same code point value. For example, “é” appears in both ISO-8859-1 and ISO-8859-2 at the same code point value 0xE9.

Table 3-4. ISO-8859 layout

Code point range (hex)   Abbreviation   Name
0x00-0x1F                C0             ASCII controls
0x20-0x7E                G0             ISO-646 IRV
0x7F-0x9F                C1             control characters
0xA0-0xFF                G1             additional graphic characters
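The layout in Table 3-4 can be expressed as a small amount of code. The Python sketch below is illustrative only (the function name is ours) and simply maps an 8-bit code point onto the four blocks as the table divides them.

```python
def iso8859_block(code_point):
    """Return the ISO-8859 block (per Table 3-4) containing an 8-bit code point."""
    if not 0x00 <= code_point <= 0xFF:
        raise ValueError("not an 8-bit code point")
    if code_point <= 0x1F:
        return "C0"   # ASCII controls
    if code_point <= 0x7E:
        return "G0"   # ISO-646 IRV graphic characters
    if code_point <= 0x9F:
        return "C1"   # control characters (assigned by ISO-6429)
    return "G1"       # additional graphic characters

# 0xE9 is "é" in both ISO-8859-1 and ISO-8859-2; it falls in the G1 range.
assert iso8859_block(0xE9) == "G1"
```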

Figure 3-2. ISO-8859-1 encoding

The C0 (0x00-0x1F) and C1 (0x7F-0x9F) ranges in ISO-8859-x are reserved for control codes. The C0 range contains the ASCII controls. The assignment of the control codes in the C1 range is not made in ISO-8859-x, but rather is part of ISO-6429 (control functions for 7-bit and 8-bit code sets). The motivation for the C1 controls comes from DEC’s VT220 terminals.

Not all of the upper ranges G1 (0xA0-0xFF) of the ISO-8859-x standards are based on Latin characters. For example, ISO-8859-7 has Greek characters in the upper range. See Figure 3-3. The scripts/languages that the ISO-8859-x standards support are summarized in Table 3-5. [11],[75],[104]

Figure 3-3. ISO-8859-7 encoding

Table 3-5. ISO-8859 standards

Standard name   Informal name    Languages supported
ISO-8859-1      Latin 1          Afrikaans, Albanian, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, French, Gaelic, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish
ISO-8859-2      Latin 2          Albanian, Croatian, Czech, English, German, Hungarian, Polish, Romanian, Slovak, Slovenian
ISO-8859-3      Latin 3          Afrikaans, Catalan, Dutch, English, Esperanto, Galician, German, Italian, Maltese, Spanish, Turkish
ISO-8859-4      Latin 4          Danish, Estonian, English, Finnish, German, Lappish, Latvian, Lithuanian, Norwegian, Swedish
ISO-8859-5      Latin/Cyrillic   Azerbaijani, Belorussian, Bulgarian, English, Kazakh, Kirghiz, Macedonian, Moldavian, Mongolian, Russian, Serbian, Tadzhik, Turkmen, Ukrainian, Uzbek
ISO-8859-6      Latin/Arabic     Arabic, Azerbaijani, Dari, English, Persian, Farsi, Kurdish, Malay, Pashto, Sindhi, Urdu
ISO-8859-7      Latin/Greek      English, Greek
ISO-8859-8      Latin/Hebrew     English, Hebrew, Yiddish
ISO-8859-9      Latin 5          Turkish
ISO-8859-10     Latin 6          Greenlandic, Lappish
ISO-8859-11     Latin/Thai       English, Thai
ISO-8859-13     Latin 7          Baltic rim (region)
ISO-8859-14     Latin 8          Celtic
ISO-8859-15     Latin 9          removes some non-character symbols from ISO-8859-1 and adds the “€” euro currency symbol

3.3.3 Vendor Specific Character Encodings

Prior to the introduction of the ISO and European Computer Manufacturers Association character encoding standards, vendors created their own character encodings. For the most part these character encodings were created by hardware and software providers, in particular IBM, Apple, Hewlett-Packard, and Microsoft. A large number of these vendor specific encodings are still in use. In this section we explore some of these encodings.

IBM has a number of code pages for its mainframe series of computers that are based upon EBCDIC (Extended Binary Coded Decimal Interchange Code). The IBM EBCDIC series of code pages use an 8-bit code point. In EBCDIC the number and type of printable characters are the same as ASCII, but the organization of EBCDIC differs greatly from ASCII. See Table 3-6 and Figure 3-4. [40],[57]

Table 3-6. EBCDIC layout

EBCDIC code point range (hex)   ASCII code point range (hex)   Characters
0x00-0x3F                       0x00-0x1F                      controls
0x40                            0x20                           space
0x41-0xF9                       0x20-0x7E                      graphic characters
0xFA-0xFE                       N/A                            undefined
0xFF                            N/A                            control

Figure 3-4. EBCDIC encoding

IBM also has a wide variety of IBM PC based code pages, sometimes referred to as DOS code pages. See Table 3-7 [35]. These code pages are also based upon an 8-bit code point. However, unlike EBCDIC, PC code pages are based upon ASCII. In particular, the hex range 0x00-0x7F is in direct correspondence with ASCII. For the most part the PC code pages are quite similar to the ISO-8859-x series. There are, however, differences that are worth noting. For example, the IBM 850 PC

is quite similar to Latin 1, however IBM 850 assigns graphic characters to the 0x80-0xFF hex range, while ISO-8859-1 assigns control characters to the 0x80-0x9F hex range. Additionally, IBM 850 contains a small set of graphic box characters that are used for drawing primitive graphics. See Figure 3-5. [75]

Table 3-7. IBM PC code pages

Code page number   Language group
437                English, French, German, Italian, Dutch
850                Western Europe, Americas, Oceania
852                Eastern Europe using Roman letters
855                Eastern Europe using Cyrillic letters
857                Western Europe and Turkish
861                Icelandic
863                Canadian-French
865                Nordic
866                Russian
869                Greek

Figure 3-5. IBM 850

In a similar fashion Microsoft has also created a series of code pages for use within Microsoft Windows. See Table 3-8 [35]. The Windows code pages are also

largely based upon ASCII and the ISO-8859-x series. The Windows code pages, however, define extra graphic characters in the hex range 0x80-0x9F. See Figure 3-6.

Table 3-8. Windows code pages

Code page number   Languages supported
1252               Danish, Dutch, English, Finnish, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish
1250               English, Czech, Hungarian, Polish, Slovak
1251               Russian
1253               Greek
1254               Turkish

Figure 3-6. Windows 1252

Hewlett-Packard has also codified a set of 8-bit code pages that are very much akin to the ISO-8859-x series, in particular Hewlett-Packard’s ROMAN8. In ROMAN8 the hex range 0x00-0x7F is in direct correspondence with ASCII. ROMAN8 provides some additional characters that are not in ISO-8859-1. Nevertheless, ROMAN8 does not have some characters that are in ISO-8859-1. For example, the “©” copyright sign is present in ISO-8859-1, but not in ROMAN8. [63],[75]

3.4 Japanese Encodings

Languages like Japanese, Chinese, and Korean have extensive writing systems that are based on both phonetic and ideographic characters. These systems require more than 8 bits per character, because they contain literally thousands of characters. In this section we describe the encoding methods used in Japan. We choose to use Japanese for two reasons: First, Japan was the first country to encode a large character set. Second, the Chinese and Korean encodings are largely based upon the methods used in the Japanese encodings. [57]

The Japanese writing system is comprised of four different script systems: Chinese ideographic characters (symbols that represent ideas or things, known as Kanji in Japan), Katakana (syllabary used for writing borrowed words from other languages), Hiragana (syllabary used for writing grammatical words and inflectional endings), and Romaji (Latin letters used for non-Japanese words). The Kanji script used in Japan includes about 7,000 characters. The Katakana and Hiragana scripts both represent the same set of 108 syllabic characters, and are collectively known as Kana. A sampling of Kanji, Katakana, and Hiragana characters are shown in Figure 3-7, Figure 3-8, and Figure 3-9. [57],[73],[96],[109]

Figure 3-7. Japanese Kanji characters

Figure 3-8. Japanese Katakana characters

Figure 3-9. Japanese Hiragana characters

When discussing systems for encoding Japanese, we must be careful to make the distinction between Japanese coded character sets and Japanese character encodings. In Japanese the coded character set is referred to as the Ward-Point or Kuten character classification system. This system is part of the JIS X (Japan Industrial

Standard) published by the Japanese Standards Association. Using the Ward-Point system requires knowledge of spoken and written Kanji. The system uses a two-code system for character identification. The two codes are commonly known as row and cell and are represented using two bytes. [86]

Using a single byte 128 positions can be referenced. Nevertheless, the Japanese only use the 94 printable ASCII characters. Therefore, in the Ward-Point system, Kanji characters are arranged in groups of 94 characters. Each character is assigned two values, a row number and a cell number. These values identify the character’s position within JIS. The row and cell numbers range from 1 to 94. Therefore, JIS can be thought of as a 94x94 character matrix. This allows 8,836 characters to be represented. This coded character set scheme was first formalized in 1978 and is known as JIS X 0208. [86]
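The arithmetic behind the Ward-Point (Kuten) system is straightforward. The sketch below assumes the conventional mapping in which each row and cell number (1-94) is offset by 0x20 so that both bytes land in the printable ASCII range; the function name is ours.

```python
def kuten_to_jis(row, cell):
    """Map a Ward-Point (row, cell) pair, each 1-94, to its two-byte JIS value.

    Adding 0x20 to each number keeps both bytes inside the 94 printable
    ASCII positions (0x21-0x7E), as described above.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in the range 1-94")
    return bytes([row + 0x20, cell + 0x20])

# A 94x94 matrix can address 8,836 characters.
assert 94 * 94 == 8836
# Row 20, cell 33 maps to the JIS bytes 0x34,0x41 (seen again in Figure 3-14).
assert kuten_to_jis(20, 33) == b"\x34\x41"
```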

In Japanese characters may be read in one of two ways, On and Kun. On is a reading that is based on a Chinese pronunciation, while Kun is a reading based on a Japanese pronunciation. Radicals in Japanese represent core components of Kanji characters. These radicals are indicative of a group of Kanji characters. [86]

In the Ward-Point system characters are grouped according to their reading and their radical makeup. Additionally, JIS specifies three levels of character groupings. Level 0 contains non-Kanji characters, while levels 1 and 2 are used exclusively for Kanji characters. Level 1 contains the 2,965 most frequently occurring characters, while level 2 contains an additional 3,388 characters. [86],[105]

The Japanese Standards Association has continually revised JIS, adding, removing and rearranging characters. Besides the Kanji characters, JIS also encodes the Latin, Greek, and Cyrillic scripts, because of business needs. The various JIS

standards along with the characters that they each encode are summarized in Table 3-9. [58],[86]

Table 3-9. JIS character standards

Standard name    Year adopted   Number of characters   Characters
JIS X0201-1976   1976           128                    Latin, Katakana
JIS X0208-1978   1978           6,879                  Kanji, Kana, Latin, Greek, Cyrillic
JIS X0208-1983   1983           6,974                  Kanji, Kana, Latin, Greek, Cyrillic
JIS X0208-1990   1990           6,976                  Kanji, Kana, Latin, Greek, Cyrillic
JIS X0212-1990   1990           6,067                  Kanji, Greek with diacritics, Eastern Europe, Latin
JIS X0213-2000   2000           4,344                  Kanji, Kana, Latin, Greek, Cyrillic

As we mentioned earlier the Japanese coded character sets are independent from the method used to encode them. In the next section we examine the four methods (Personal Computer, Extended Unix Code, Host, and Internet Exchange) used for encoding Japanese. The main difference between these encoding methods is in the way in which they switch between single byte character sets (SBCS) and double byte character sets (DBCS).

3.4.1 Personal Computer Encoding Method

The Personal Computer (PC) encoding method is a non-modal system. In a non-modal encoding the code point value of a character is used to switch between SBCS (ASCII characters) and DBCS (JIS Kanji). This encoding system is generally referred to as shift-JIS (SJIS), because the Kanji characters shift around the SBCS characters. In the SJIS encoding system DBCS mode is initiated when the code point value of the first character of a two byte sequence falls within a predefined range above hex 0x7F. Generally the range is hex 0x81-0xFF or 0xA1-0xFE. If a character falls in the two byte range, then the character is treated as the first half of a two byte sequence, otherwise the character is treated as a single byte sequence. See Figure 3-10. The code page illustrated in Figure 3-11 represents JIS X0212; in this code page

there are two double byte ranges, hex 0x81-0x9F and hex 0xE0-0xFC. Additionally, in JIS X0212 the Katakana characters are encoded as single byte sequences. See Figure 3-11, hex range 0xA1-0xDF. [43],[86]
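A byte-oriented scanner for shift-JIS therefore only needs to test the first byte of each sequence. The Python sketch below uses the ranges just cited (two double byte lead ranges plus the single byte Katakana range); the function names are ours and the sketch does not validate trailing bytes. Mixing one and two byte characters in this way is exactly the situation pictured in Figure 3-10.

```python
def sjis_classify(byte):
    """Classify the first byte of a shift-JIS sequence using the ranges above."""
    if 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC:
        return "DBCS lead byte"        # first half of a two byte Kanji sequence
    if 0xA1 <= byte <= 0xDF:
        return "single byte Katakana"  # Katakana encoded as one byte
    if byte <= 0x7F:
        return "single byte ASCII"
    return "unassigned"

def sjis_char_lengths(data):
    """Walk a shift-JIS byte string, yielding the length (1 or 2) of each character."""
    i = 0
    while i < len(data):
        step = 2 if sjis_classify(data[i]) == "DBCS lead byte" else 1
        yield step
        i += step
```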

Figure 3-10. Mixed DBCS and SBCS characters

Figure 3-11. JIS X0212 PC encoding

3.4.2 Extended Unix Code Encoding Method

The Extended Unix Code (EUC) system is used predominantly in Unix environments. EUC consists of four graphic character code sets: the primary graphic code set (G0) and three supplementary code sets (G1, G2, and G3). Code set G0 is always a single byte per character code set, and is usually ASCII. Code set G1, usually Kanji,

consists of two byte per character sequences that have their most significant bits set. Characters in code set G2 are required to have their byte sequence start with a special prefix, hex 0x8E. This special prefix is known as single-shift-two (SS2). Characters in G2 are often Katakana characters. Characters in code set G3 must also have their byte sequence start with a special prefix, however in the case of G3 the prefix is hex 0x8F, known as single-shift-three (SS3). The G3 code set is reserved for user defined or external characters.

In EUC Kanji mode is initiated when the value of the first character of a two byte sequence is between hex 0xA1-0xFE. This character is then treated as the first half of a two byte sequence. The second byte from this sequence must also be in the same range. The ASCII mode is initiated when the first character is less than hex 0x7F. Katakana mode is invoked when the first character is SS2. This character is subsequently treated as the first half of a two byte character sequence. Additionally, the second byte must be in the range hex 0xA1-0xDF. The user defined character mode is initiated when the first character is SS3. This character is subsequently treated as the first byte of a three byte character sequence. The second and third bytes must come from the range hex 0xA1-0xFE. The example in Figure 3-12 demonstrates EUC with one and two byte sequences. Line 1 on Figure 3-12 contains hex byte sequences, separated by commas, that correspond to the graphic characters on Figure 3-12. Additionally, the underlined byte sequences indicate double byte sequences.

Figure 3-12. EUC encoding

41,42,43,B4C1,BBFA,20,31,32,33 (1)
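To make the EUC mode rules concrete, the following Python sketch (illustrative only; the names are ours) classifies each character in an EUC byte string by inspecting its first byte, and checks itself against the byte sequence on line 1 of Figure 3-12.

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift prefixes for code sets G2 and G3

def euc_char_runs(data):
    """Yield (code_set, length) for each character in an EUC-encoded byte string."""
    i = 0
    while i < len(data):
        b = data[i]
        if b == SS2:                 # G2: prefix plus one byte (often Katakana)
            yield ("G2", 2)
            i += 2
        elif b == SS3:               # G3: prefix plus two bytes (user defined)
            yield ("G3", 3)
            i += 3
        elif 0xA1 <= b <= 0xFE:      # G1: two byte Kanji, both bytes 0xA1-0xFE
            yield ("G1", 2)
            i += 2
        else:                        # G0: single byte ASCII
            yield ("G0", 1)
            i += 1

# The byte sequence from Figure 3-12: "ABC", two Kanji, a space, then "123".
example = bytes.fromhex("414243B4C1BBFA20313233")
assert list(euc_char_runs(example)) == [
    ("G0", 1), ("G0", 1), ("G0", 1),             # A B C
    ("G1", 2), ("G1", 2),                        # the two double byte Kanji
    ("G0", 1), ("G0", 1), ("G0", 1), ("G0", 1),  # space 1 2 3
]
```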

3.4.3 Host Encoding Method

On host environments, such as the IBM 390 and AS/400, both SBCS and DBCS are used. In the host method special control codes are used to switch between

SBCS mode and DBCS mode. To switch into DBCS mode the shift-out (SO, 0x0E) control is used, while the shift-in (SI, 0x0F) control is used to switch back to SBCS mode. Unlike the PC encoding method the values of the characters themselves are not used to determine the mode, therefore the host method is a modal encoding system. On host systems SBCS is based on EBCDIC, while DBCS is based on JIS. [43]
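Because the host method is modal, a reader of a host byte stream must track the current mode rather than inspect byte values. The sketch below assumes the usual IBM convention in which SO (0x0E) introduces DBCS data and SI (0x0F) returns to SBCS; the function name and the sample bytes are hypothetical.

```python
SO, SI = 0x0E, 0x0F  # shift-out begins DBCS data, shift-in returns to SBCS

def split_host_stream(data):
    """Split an EBCDIC host byte string into (mode, bytes) runs using SO/SI."""
    runs, mode, start = [], "SBCS", 0
    for i, b in enumerate(data):
        if b in (SO, SI):
            if i > start:
                runs.append((mode, data[start:i]))
            mode, start = ("DBCS" if b == SO else "SBCS"), i + 1
    if start < len(data):
        runs.append((mode, data[start:]))
    return runs

# Hypothetical stream: two EBCDIC bytes, SO, two DBCS characters, SI, one more byte.
stream = b"\xC1\xC2" + bytes([SO]) + b"\x44\x45\x46\x47" + bytes([SI]) + b"\xC3"
assert split_host_stream(stream) == [
    ("SBCS", b"\xC1\xC2"), ("DBCS", b"\x44\x45\x46\x47"), ("SBCS", b"\xC3"),
]
```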

3.4.4 Internet Exchange Method

The internet exchange method, commonly known as ISO-2022, is a 7-bit/8-bit encoding method that enables character data to be passed through older systems. In some legacy systems the high order bit of an 8-bit byte gets stripped off, causing corruption of 8-bit character data. Therefore, all bytes in ISO-2022 must be in the hex range 0x21-0x7E (printable ASCII). Generally, ISO-2022 is never used as an internal character encoding. Nevertheless, Emacs processes character data in the ISO-2022 encoding [88]. The use of the term ISO-2022, however, is somewhat misleading, because ISO-2022 does not really specify character encodings, but rather an architecture for intermixing coded character sets. The actual individual encodings are specified in RFCs. [104]

In ISO-2022 escape sequences and shift states are used to switch between coded character sets. Therefore, ISO-2022 is a modal encoding system. ISO-2022 reserves the hex range 0x00-0x1F for 32 control codes and refers to this range as the C0 block. This is the same as the ISO-8859 standard. Additionally, another set of 32 controls are also reserved, designated as C1. These controls may be represented with escape sequences. The hex range 0x20-0x7F is reserved for up to four sets of graphic characters, designated G0-G3 (in some graphic sets, each character may require mul- tiple bytes). Most graphic sets only use the hex range 0x21-0x7E, in which case 0x20 (space), and 0x7F (delete) are reserved. Typically, the C0 and C1 blocks are taken from ISO-6429. See Table 3-1. The G0 block is generally taken from ISO-646 (Inter- national Reference Version). See Figure 3-1. In ISO-2022 an escape sequence starts

with an ESC control character (0x1B). The bytes following the ESC are a set of fixed values that are defined by the ISO-2022 standard. [24]

In many cases a single stream of characters can be encoded in more than one way in ISO-2022. For example, the mixed ASCII Greek character stream on line 1 on Figure 3-13 can be represented by switching graphic character sets or by escaping individual characters. Line 2 is the corresponding 7-bit byte sequence for the charac- ters on line 1. The double underlined characters on line 2 indicate escape sequences. By default ISO-2022 sets the G0 graphic character set to ASCII. Escape sequences are used to switch character sets. The first escape sequence “1B,2C,46” on line 2 switches the G0 graphic character set to 7-bit Greek, while the second escape sequence “1B,28,42” switches G0 back to ASCII. Line 3 is the corresponding 8-bit byte sequence for the characters on line 1. The double underlined characters on line 3 indicate an escape sequence, while the single underlined characters “8E” indicate a single-shift-two (SS2). The escape sequence “1B,2E,46” on line 3 assigns the 8-bit Greek character set to the G2 graphic character set, while the SS2 character signals a temporary switch into the G2 character set.

Figure 3-13. ISO-2022 encoding

Greek Ελληνικά (1)

47,72,65,65,6B,20,1B,2C,46,45,6B,6B,67,6D,69,6A,6C,1B,28,42 (2)

47,72,65,65,6B,20,1B,2E,46,8E,C5,8E,EB,8E,EB,8E,E7,8E,ED,8E,E9,8E,EA,8E,DC (3)

In the case of ISO-2022-JP (Japanese encoding of ISO-2022), a Kanji-in escape sequence directs the bytes that follow to be treated as two bytes per character. The first byte of the two byte sequence determines the character grouping, while the second byte indicates the character within the grouping. A JIS out or Kanji out escape sequence directs the bytes that follow to be treated as single byte characters. The escape sequences for ISO-2022-JP are specified in Table 3-10. The example, on

Figure 3-14 illustrates how the characters from Figure 3-12 would be represented in ISO-2022-JP. Line 1 contains hex byte sequences that correspond to the graphic characters on Figure 3-14. Double underlined bytes represent escape sequences, while underlined bytes indicate double byte character sequences. [86],[104]

Table 3-10. ISO-2022-JP escape sequences

Escape sequence (hex)   Escape sequence (graphic)   Coded character set
0x1B,0x28,0x42          ESC ( B                     ASCII
0x1B,0x28,0x4A          ESC ( J                     JIS X0201 (Roman)
0x1B,0x24,0x40          ESC $ @                     JIS X0208-1978
0x1B,0x24,0x42          ESC $ B                     JIS X0208-1983

Figure 3-14. ISO-2022-JP encoding

41,42,43,1B,24,42,3441,3B7A,1B,28,42,20,31,32,33 (1)
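The relationship between the EUC bytes of Figure 3-12 and the ISO-2022-JP bytes of Figure 3-14 is mechanical: each double byte EUC Kanji is the same JIS code with the high bit of both bytes cleared (i.e., 0x80 subtracted), and the Kanji runs are bracketed by the escape sequences of Table 3-10. The Python sketch below (names ours, handling ASCII and G1 Kanji only) reproduces the Figure 3-14 sequence from the Figure 3-12 bytes.

```python
ESC = b"\x1b"
KANJI_IN = ESC + b"$B"    # ESC $ B : switch to JIS X0208-1983 (Table 3-10)
KANJI_OUT = ESC + b"(B"   # ESC ( B : switch back to ASCII

def euc_to_iso2022jp(data):
    """Re-encode an EUC byte string (ASCII plus G1 Kanji only) as ISO-2022-JP."""
    out, i, in_kanji = bytearray(), 0, False
    while i < len(data):
        if 0xA1 <= data[i] <= 0xFE:          # double byte Kanji in EUC
            if not in_kanji:
                out += KANJI_IN
                in_kanji = True
            out += bytes([data[i] - 0x80, data[i + 1] - 0x80])
            i += 2
        else:                                # single byte ASCII
            if in_kanji:
                out += KANJI_OUT
                in_kanji = False
            out.append(data[i])
            i += 1
    if in_kanji:
        out += KANJI_OUT
    return bytes(out)

# The EUC bytes of Figure 3-12 become the ISO-2022-JP bytes of Figure 3-14.
euc = bytes.fromhex("414243B4C1BBFA20313233")
iso = bytes.fromhex("4142431B244234413B7A1B284220313233")
assert euc_to_iso2022jp(euc) == iso
```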

3.4.5 Vendor Specific Encodings

IBM, Microsoft, Apple, and DEC have provided a wide variety of Japanese character encodings for use within their respective platforms and operating systems. Unfortunately, in many cases these encodings are incompatible across vendors. Naturally, as national/international standards emerged some migration path from legacy encodings to standardized encodings became necessary. To satisfy this need vendors have created/modified encodings. In almost all cases these encodings are variations or supersets of existing encodings, although differences do exist. Some of the more frequently occurring encodings are listed on Table 3-11. [52],[42],[43]

Table 3-11. Vendor encodings

Vendor      Code page   Based on
IBM         952         DBCS EUC JIS X0208-1997
IBM         953         DBCS EUC JIS X0212-1990
IBM         932/942     MBCS PC JIS X0208-1978
IBM         943         MBCS PC JIS X0208-1990
Microsoft   MS 932      MBCS PC JIS X0208-1990
HP          HP EUC      DBCS EUC JIS X0212-1990
DEC         DEC Kanji   ISO-2022-JP JIS X0208-1983

3.5 Chinese Encodings

The traditional Chinese writing system is not a phonemic system. In phonemic systems sound is represented as units that are in a one-to-one correspondence with symbols. For example, a Latin based language can be written with only 26 unique letters. On the other hand, Chinese requires thousands of unique symbols (ideographic characters, known as Hanzi in China) to express itself. Moreover, it is difficult to determine just how many characters exist today. One of the classic Chinese dictionaries lists about 50,000 Hanzi characters, of which only 2,000-3,000 are in general use. [93]

In general Chinese Hanzi characters are difficult to write and print due to the large number of strokes in each character. Hanzi characters vary from one to over 30 strokes. Typically, each character requires seven to 17 strokes. The practical disad- vantages of such a system when compared to an alphabet are obvious. [93]

Over time, however, the Chinese script has taken on certain phonetic properties. In some cases, identical sounding but semantically remote characters would loan their shapes to indicate the sound of a character. Additionally, the Chinese script and language have been in a continuous state of flux. In particular, since the revolution of 1949 the Chinese government (People’s Republic of China) has actively pursued simplification of the Chinese script. [93]

In 1954 a committee was formed to reform the language. This committee sim- plified nearly 2,200 Hanzi characters. In some cases the radicals (base shape of a

Chinese Hanzi character) changed, while in others the number of strokes changed. This simplification, however, caused havoc in dictionaries, because Chinese dictionaries are organized by radical and stroke count. Some scholars believe that the simplification process has only made the Chinese script more difficult to understand. Specifically, the reduction in the number of strokes makes several characters look alike. While the government of the People’s Republic of China has continued to promote this simplification process, it has not been universally accepted, however, particularly in Taiwan and Hong Kong. [93]

The can also be represented through transliteration. In this context we refer to transliteration of Chinese into its phonetic equivalent in Latin let- ters. In general there are two Latin transliteration systems for Chinese, Wade-Giles and . The Wade-Giles system was invented by two british scholars during the 19th century. The Wade-Giles system is only used in Taiwan for representing place names, street names, and people’s names. In only Pinyin is used. [93],[77]

The Pinyin system was created during the Chinese Hanzi simplification pro- cess. The Pinyin system can be used with or without diacritics. Most systems opt for using Pinyin without diacritics, because diacritics require special fonts. [93]

In Taiwan the Bopomofo system is used for transliteration. The Bopomofo system gets its name from the first four Taiwanese phonetic characters. The Bopomofo characters represent consonants and vowels. Moreover, there is a one-to-one correspondence between Pinyin and Bopomofo. [93],[77]

3.5.1 People’s Republic of China

As we indicated above Japan was the first country to construct a large coded character set and encoding. Similarly, China has done the same in their Guojia Biaozhun (GB) standards; Guojia Biaozhun means National Standard. The Chinese use the GB 2312-80 standard to manage Hanzi (Simplified Chinese characters),

Bopomofo, Pinyin, Japanese Katakana, Japanese Hiragana, Latin, Greek, and Cyrillic in groups of 94x94 character matrices. This 94x94 matrix is the same design employed in the Japanese standards. The overall structure is similar to JIS, but the Chinese characters (Hanzi) are placed in different positions. The non-Hanzi characters are in the same locations as JIS. [104],[48],[105]

Just like JIS, the Hanzi characters are organized into two levels based upon their frequency of use. Two additional groups for even less frequently used Hanzi characters have been developed, for a total of three groups of Hanzi characters. Addi- tionally, GB 2312-80 may be encoded using the PC, EUC, ISO-2022, and Host encoding methods. [104],[48],[105]

After the construction of the GB 2312-80 standard, the People’s Republic of China expressed interest in supporting the efforts of both the Unicode Consortium and ISO through publishing a Chinese national standard that was code and character compatible with the evolving ISO-10646/Unicode standard, in particular version 2.1 of the Unicode standard. We delay a detailed discussion of ISO-10646/Unicode until later. For purposes of discussion we can think of Unicode as a superset of all coded character sets. [67]

This new Chinese standard was named GB 13000.1-93, which is commonly known as GB 13000. Whenever ISO/Unicode would change their standard, the Chinese would also update their standard. By adopting this strategy, GB 13000 was able to include Traditional Chinese Hanzi characters (ideographic characters used in Taiwan and Hong Kong), because these characters appeared in Unicode. Unfortunately, GB 13000’s character encoding was not compatible with GB 2312-80. In order to remain compatible with the GB 2312-80 encoding standard a new coded character set was created that contained all the characters from both GB 13000 and GB 2312-80, yet used an encoding that was compatible with GB 2312-80. This new character set is known as Guojia Biaozhun Kuozhan (GBK) and also uses groups of

94x94 character matrices. Thus, code and character compatibility between GB 2312-80 and GBK was ensured while, at the same time, remaining synchronized with Unicode’s character set. [67],[94]

Prior to the release of the Unicode 3.0 standard, GBK was regarded as the de facto coded character set and encoding for Mainland China. As the Unicode standard progressed GBK became full. Finally, when it came time to adopt the new characters in Unicode 3.0, GBK would have to expand to a three byte per character encoding. Thus, a new coded character set and character encoding was born. This new encoding is known as GB 18030. GB 18030 is a multi-byte encoding (one to four bytes). The one and two byte portions, however, are compatible with GBK. GB 18030 thus creates a one-to-one relationship between parts of GB 18030 and Unicode’s encoding space. We summarize the various GB standards in Table 3-12. [59],[67],[84]

Table 3-12. GB standards

Standard name   Year adopted   Number of characters   Characters
GB 2312-80      1981           7,445                  Simplified Hanzi, Traditional Hanzi (some), Pinyin, Bopomofo, Hiragana, Katakana, Latin, Greek, Cyrillic
GBK             1993           21,886                 Simplified Hanzi, Traditional Hanzi (some), Pinyin, Bopomofo, Hiragana, Katakana, Latin, Greek, Cyrillic
GB 18030-2000   2000           28,468                 Simplified Hanzi, Traditional Hanzi (some), Pinyin, Bopomofo, Hiragana, Katakana, Latin, Greek, Cyrillic

3.5.2 Republic of China

Taiwan has developed Chinese National Standard (CNS) 11643, which contains over 48,000 Traditional Chinese characters plus characters from the various other scripts. CNS 11643 is also organized into groups of 94x94 character matrices. Nevertheless, CNS 11643 is not the predominant coded character set within Taiwan; rather, BigFive is used. BigFive refers to the five companies that created it. BigFive

is grouped into 94x157 character matrices. BigFive encodes over 13,000 Traditional Chinese characters plus Latin, Greek, Bopomofo, and other symbols. Fortunately there are few differences between BigFive and CNS 11643. We summarize the Taiwanese standards in Table 3-13. [104]

Table 3-13. Taiwanese standards

Standard name   Year adopted   Number of characters   Characters
CNS 11643       1992           48,027                 Traditional Chinese Hanzi, Bopomofo, Latin, Greek
BigFive         1984           13,494                 Traditional Chinese Hanzi, Bopomofo, Latin, Greek

3.5.3 Hong Kong Special Administrative Region

Historically, computers lacked support for the special characters commonly used in Hong Kong and in the areas where Cantonese is spoken. Some of the missing Hanzi characters were of foreign origin, particularly deriving from Japanese. The use of these characters reflects Hong Kong’s role in the economics of Asia. Neither BigFive, GB 2312-80, nor GBK adequately supported the needs of Hong Kong or general Cantonese users. [67],[68]

Software vendors created various solutions to providing the missing charac- ters. Unfortunately, their efforts were uncoordinated resulting in solutions that were incompatible with each other. Concurrent to this activity the special administrative government of Hong Kong started developing the “Hong Kong Government Chinese Character Set”. This informal specification was initially used internally as a govern- mental standard. Soon after, it became a required feature for general computing sys- tems within Hong Kong. [67],[68]

In 1999 these special characters were officially published in the “Hong Kong Supplementary Character Set” (SCS). The SCS contains 4,072 characters, the major- ity of which are Hanzi. Additionally, the SCS was explicitly designed to fully

preserve the code point organization of BigFive, thus easing the problem of encoding. [67],[68]

3.6 Korean Encodings

Just like Japanese and Chinese, Korean also uses a set of ideographic characters in its writing system. These ideographic characters are known as Hanja in Korean. Additionally, Korean also uses a set of phonetic characters, referred to as Hangul. The Hangul script was created by royal decree in 1443 by a group of scholars. Each Hangul character is a grouping of two to five Hangul letters, known as Jamo (phonemes). Each Hangul block forms a square cluster representing a syllable in the language. Jamo can be simple or double consonants and vowels. The modern Hangul alphabet contains 24 basic Jamo elements (14 consonants and 10 vowels). Extended letters can be derived by doubling the basic letters. [6],[109]

3.6.1 South Korea

As in the other Asian encodings Korea’s standards (KS) follow a layout similar to JIS and GB. However, Cyrillic and Greek characters are not in the same positions as JIS and GB. In the KSX-1001 (formerly KSC-5601) characters are organized into 94x94 matrices. KSX-1001 encodes Jamos (Korean letters), Hangul, Hanja, Katakana, Hiragana, Latin, Greek, Cyrillic, as well as other symbols. Additionally, Korea also encodes their own version of the ISO-646 standard, replacing hex 0x5C (backslash) with the won sign (Korean currency symbol). KSX-1001 can be encoded using EUC and ISO-2022. [104]

3.6.2 North Korea

The North Korean government has also created a coded character set for Korean, known as KPS 9566-97. It is constructed in a similar fashion to South Korea’s KSX-1001. KPS 9566-97 encodes nearly 8,300 characters including: Jamos,

Hangul, Hanja, Latin, Cyrillic, Greek, Hiragana, and Katakana. It can be encoded by using EUC and ISO-2022. [22]

3.7 Vietnamese Encodings

Vietnamese was first written using the Chinese ideographic characters. This system was in use by scholars until a few decades ago. In Vietnamese two Chinese characters were usually combined, one character indicated the meaning, while the second assisted with pronunciation. This system, chu nom, never gained widespread adoption and was only used in literature. [92]

Around the 17th century Catholic missionaries arrived in Vietnam and began to translate prayer books. In doing their translations they developed a new Romanized script. This script is known as quoc ngu. Initially this new script was not met with mass appeal. Nevertheless, when Vietnam came under French control (1864-1945) quoc ngu was officially adopted. Thus, in modern Vietnam quoc ngu is used universally and forms the basis for all Vietnamese computing. [92]

It would appear that modern Vietnamese could easily be incorporated into one of the Latin based encodings, as Vietnamese is based upon a French model of Latin characters. Like French, an 8-bit encoding scheme should be sufficient for encoding Vietnamese. Nevertheless, in Vietnamese there are many frequently occurring accented letters. In addition to the alphabetic characters in the IRV (ASCII 0x00-0x7F), Vietnamese requires an additional 134 combinations of a letter and diacritical symbols. [92]

Obviously all such combinations can fit within the confines of an 8-bit encod- ing space. However, it is highly desirable to maintain compatibility with the IRV range. Requiring such compatibility does not leave enough room in the upper range (0x80-0xFF) to encode all the necessary diacritic combinations. Some people within the Vietnamese data processing community have argued that certain rarely used

precomposed Vietnamese characters could be dropped altogether or could be mapped into the C0 control space (0x00-0x1F). [92]

Until the introduction of Windows 95, no clear encoding standard had emerged. Microsoft and the Vietnam Committee on Information Technology created a new code page, known as Microsoft 1258 (CP 1258). CP 1258 can be made compatible with Latin 1 by dropping some precomposed characters. There are other competing standards emerging, however. Most notable is the VISCII (Vietnamese Standard Code for Information Interchange) standard, described in RFC 1456. The VISCII standard does preserve all the precomposed characters, and is becoming quite popular. [92]

In addition to the Microsoft and VISCII encoding schemes, there is a convention for exchanging Vietnamese across 7-bit systems. This 7-bit convention is known as VIQR (Vietnamese Quoted-Readable), and is described in RFC 1456. VIQR is not really an encoding scheme, but is rather a method for typing, reading, and exchanging Vietnamese data using ASCII. In VIQR precomposed characters are represented by the vowel followed by ASCII characters whose appearances resemble those of the corresponding Vietnamese diacritical marks. [92]

3.8 Multilingual Encodings

So far in our discussion of encoding schemes we have concentrated our efforts on monolingual encodings. In this section we turn our attention to multilingual encodings. We start this section with some motivation for the construction of multilingual encodings, later turning our attention to the various strategies for capturing multilingual data.

3.8.1 Why Are Multilingual Encodings Necessary?

Over the last twenty years the software industry has experienced incredible growth. Initially, the demand for software was limited to just the United States,

however as cheap computing proliferated this demand has spread the world over. For the most part the development of software has been restricted to the United States. Historically, most software development labs would produce an English version of a product, subsequently followed by multiple national language versions (NLV). The construction of these NLVs was generally not done in the United States, rather it was done by the overseas branch of the development lab or was contracted out to an independent software vendor (ISV). [29],[20],[85]

Once the source code was delivered to the overseas lab or ISV the source code would be modified to support the local encoding schemes for the language/country. Simultaneous to this effort, a lab in the United States would start development on the next version of the product. Therefore, the various NLVs always lagged behind the English version of the product. [29],[61],[85] This situation caused several problems:

• Overseas marketing organizations faced a difficult time selling older versions of products when newer versions were available in the United States.
• It became difficult to provide timely maintenance to a product because a fix would need to be generated across several source trees.
• In many cases common fixes could not be used across source lines because each source tree supported a different coded character set and encoding.
• Attempting to later merge support for all encoding schemes across all source trees dramatically increased both the size and the complexity of the product.

Differences in encoding approaches and text processing make merging source trees extremely difficult. In some situations merging source trees requires text processing algorithms to be rewritten. For example, random character access func- tions may need to be rewritten when a stateful encoding is merged with a stateless encoding. In a stateful encoding the meaning of a character is dependent on neigh- boring characters.

In some cases even memory management routines require modification. In fixed width encoding schemes developers often assume that a byte and a character are of the same size, or even worse the number of code units in a string represents the

number of characters in the string. These assumptions cause problems for merging source code based on a fixed width encoding scheme with source code based on a variable width encoding scheme.

Merging problems may also be caused by differences in encoding philoso- phies. In some cases entire text processing functions may need to be either removed or redesigned when source trees are merged. For example, a text searching function based on abstract characters would have to be redesigned when used with an encod- ing based on glyphs.

These problems caused the software industry to reach two important conclu- sions: First, that there was an imperative need to develop a single worldwide coded character set and encoding that would be required for all software. Second, that all NLVs of a product must be based on a single common source tree.

3.8.2 Unicode and ISO-10646

In discussing the creation of Unicode, we cannot avoid discussing ISO-10646 as well, as the two are intertwined, sharing both a common history and goals. In the 1980s, text encoding experts from around the world began work on two initially parallel projects to overcome character encoding obstacles. In 1984 ISO actively started work on a universal character encoding. ISO placed heavy emphasis on compatibility with existing ISO standards, in particular ISO-8859. In the spring of 1991 ISO published a draft international standard (DIS) 10646. By that time work on Unicode was nearing completion, and many in the industry were concerned that there would be great confusion from two competing standards. In the wake of opposition to DIS-10646 from several of the ISO national bodies, ISO and Unicode were asked to work together to design a common universal character code standard, which came under the umbrella of Unicode. [16]

3.8.2.1 History of Unicode

The Unicode standard first began at Xerox in 1985. The Xerox team (Huan-mei Liao, Nelson Ng, Dave Opstad, and Lee Collins) was working on a database to map the relationships between the identical ideographic characters in the Japanese and Chinese character sets, and was referred to as . Around the same time Apple, and in particular Mark Davis, also began development of a universal character set. [102]

In September of 1987, Joseph Becker from Xerox and Mark Davis from Apple began discussions on a universal character encoding standard for multilingual computing. In the summer of 1988 we see the first proposal for a universal character encoding, which Joseph Becker named Unicode¹. By 1989, several people from various software companies were meeting bimonthly, creating the first full review draft of the Unicode standard. These discussions led to the inclusion of all composite characters from the ISO-8859-x standards and Apple’s Han unification work. In 1991 the Unicode consortium was officially incorporated as a nonprofit organization, and is known as Unicode Inc. [102],[103]

Urged by public pressure from various industry representatives, the ISO- 10646 and Unicode design groups met in August of 1991. Together these two groups created a single universal character encoding. Naturally, compromises were made by both parties. This joint body officially published a standard in 1992, and is known as Unicode/ISO-10646. [102]

3.8.2.2 Goals of Unicode

The purpose of Unicode is to address the need for a simple and reliable worldwide text encoding. Unicode is sometimes referred to as “wide-body ASCII”, due to

1. The first detailed description of Unicode can be found in a reprint of Joseph Becker’s classic paper “Unicode 88”. This reprint was published by the Unicode Consortium in 1998 in celebration of Unicode’s ten year anniversary.

its use of 16 bits for encoding characters. Unicode is designed to encode all the major living languages. Unicode is a fixed width, easy to understand encoding scheme. Additionally, Unicode can support ready conversion from local or legacy encodings into Unicode, thereby easing migration. [13]

At its core Unicode can be thought of as an extension of ASCII for two reasons. First, Unicode like ASCII uses a fixed width character code. Second, Unicode like ASCII enforces a strict one-to-one correspondence with characters and code points. That is, each individual Unicode code point is an absolute and unambiguous assignment of a 16-bit integer to a distinct character. Since there are obviously more than 2⁸ (256) characters in the world, the 8-bit byte in international/multilingual encodings has become insufficient. The octet equals character strategy is both too limiting and simplistic. Therefore, the 8-bit byte plays no role in Unicode. Moreover, the name Unicode was chosen to suggest an encoding that is: unique, unified, and universal. [13]

3.8.2.3 Unicode’s Principles

The design of Unicode is based upon ten founding principles [13],[32],[103]:

• Fixed width encoding — In Unicode, each character is represented by a single 16-bit code point. Moreover, each character is never encoded more than once.
• Full encoding — In Unicode all code points are assigned, from 0x0000-0xFFFF; nothing is blocked out.
• Characters vs. glyphs — In Unicode a clear distinction is made between encoding characters vs. encoding glyphs. Unicode only encodes characters, which are abstract and express raw content. On the other hand, glyphs are specific visible graphic forms expressing more than content. This issue is discussed in greater detail in chapter 5.
• Semantics — In Unicode characters have properties.
• Ideographic unification — Having a clear separation between characters and glyphs permits unification of the commonly shared ideographic characters in Chinese, Japanese, and Korean.

• Plain vs. fancy text — A simple but important distinction is made between plain text, which is a pure sequence of Unicode code points, and fancy text, which is any text structure that bears additional information above the pure code points. We discuss this issue in chapter 5.
• Logical order — In Unicode characters are stored in the order in which they are read, which is not necessarily the same order in which they are displayed. The concept of logical order is examined in chapter 4.
• Dynamic composition — Instead of allowing only the well known accented characters, Unicode allows dynamic composition of accented forms where any base character plus any combining character can make an accented form. We discuss dynamic composition in greater detail in chapters 4 and 5.
• Equivalent sequences — In Unicode precomposed characters are semantically equivalent to their combining counterparts. We spend considerable time discussing this throughout the rest of the dissertation.
• Convertibility — Round trip conversion between Unicode and legacy encodings is possible since each character has a unique correspondence with a sequence of one or more Unicode characters.

Naturally, some of Unicode’s design goals are in direct conflict with one another. These conflicts have forced Unicode to make compromises from time to time. However, one important goal Unicode has worked hard at honoring is its ability to provide round trip conversions. In fact Joseph Becker in 1988 believed that Uni- code’s initial utility would be as a mechanism for interchange. This was the same reason why ASCII was constructed. Nevertheless, ASCII has transitioned from being an interchange mechanism to an outright native encoding. The same can also be said of Unicode. [13]

3.8.2.4 Differences Between Unicode and ISO-10646

By its nature, ISO-10646, officially known as the Universal Multiple-Octet Coded Character Set, or simply known as the UCS (Universal Character Set), only describes the technical details of the UCS encoding. Additionally, Unicode includes specifications that assist implementers. Unicode defines the semantics of characters more explicitly than ISO-10646 does. For example, Unicode provides algorithms for determining the display order of Unicode text. We explore the ordering of Unicode

text streams in the next chapter. Additionally, Unicode provides tables of character attributes and conversion mappings to other character encodings. Nevertheless, Unicode and ISO-10646 are in agreement with respect to the defined characters. That is, every character encoded in Unicode is also encoded in the same position as in ISO-10646. [16]

The Unicode standard was initially designed as a 16-bit character encoding allowing for 65,536 different code points. Unicode is actually comprised of a series of planes each having 65,536 code points. If you think of the values 0x0000-0xFFFF as constituting one plane, called plane 0, then you could imagine multiple planes each having 65,536 code points. Unicode refers to plane 0 as the BMP (Basic Multilingual Plane). On the other hand, ISO-10646 was designed as a 32-bit character encoding, with the most significant bit always set to 0. In ISO-10646 this 32-bit form is called UCS-4 (Universal Character Set four octet form), while the 16-bit Unicode form is called UCS-2 (Universal Character Set two octet form). [32]

Conceptually, the Universal Character Set is divided into 128 three-dimensional groups. Each group contains 256 planes containing 256 rows of 256 cells. The four octets of UCS-4, therefore, represent the group, plane, row, and cell of a code point in the Universal Character Set. See Figure 3-15 [43]. A UCS-2 code point can be transformed into a UCS-4 code point by simple zero extension. [32]

Figure 3-15. Forms of UCS-4 and UCS-2
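Because the four octets of UCS-4 correspond directly to group, plane, row, and cell, the decomposition is simple shifting and masking. The Python sketch below (function names ours) also shows the zero extension that turns a UCS-2 (BMP) value into a UCS-4 value in group 0, plane 0.

```python
def ucs4_to_gprc(code_point):
    """Split a UCS-4 value into its (group, plane, row, cell) octets."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 uses 32 bits with the most significant bit zero")
    group = (code_point >> 24) & 0x7F
    plane = (code_point >> 16) & 0xFF
    row = (code_point >> 8) & 0xFF
    cell = code_point & 0xFF
    return group, plane, row, cell

def ucs2_to_ucs4(value):
    """A UCS-2 (BMP) value zero-extends to UCS-4: group 0, plane 0."""
    return value & 0xFFFF

# "é" (0x00E9) lies in group 0, plane 0 (the BMP), row 0x00, cell 0xE9.
assert ucs4_to_gprc(ucs2_to_ucs4(0x00E9)) == (0, 0, 0x00, 0xE9)
```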

3.8.2.5 Unicode’s Organization

In this section we describe the layout of Unicode version 3.1, the latest version of the standard. There are two ways in Unicode to refer to code points. The first

method uses the hexadecimal value of the code point preceded by a capital letter “U” and plus sign “+”. The second method is the same as the first except the plus sign is omitted. The Unicode code space is divided up according to Table 3-14. A more detailed layout is shown on Figure 3-16 [96]. Additionally, the first 256 characters within the first range of Unicode are in direct agreement with ISO-8859-1, in such a way that the 8-bit values are extended to 16-bit values by simply using zero extension. [16],[103]

Table 3-14. Unicode code point sections

Code point range (hex)   Description
U0000-U1FFF              General scripts
U2000-U2FFF              Symbols
U3000-U33FF              Chinese, Japanese, Korean miscellaneous characters
U4E00-U9FFF              Chinese, Japanese, Korean ideographs
UAC00-UD7A3              Hangul
UD800-UDFFF              surrogates
UE000-UF8FF              private use
UF900-UFFFF              compatibility and special

Figure 3-16. Unicode layout

There are three ranges of Unicode (surrogates, private use, and compatibility) that deserve special attention. First, we discuss the surrogate range. The surrogates are a range of code points that enable the extension of Unicode. The surrogate range defines 1,024 lower half and 1,024 upper half code points. The sequence of a code

point from the upper half surrogate range followed by a code point from the lower half surrogate range identifies a character in planes 1-16 according to the algorithm defined on Figure 3-17. Line 1 on Figure 3-17 takes a surrogate pair (H stands for upper surrogate, L stands for lower surrogate) and returns a Unicode scalar value N. Line 2 on Figure 3-17 is the reverse of the algorithm on line 1. [32],[103],[60]

Figure 3-17. Surrogate conversion

N = (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000 (1)

H = (N - 0x10000) / 0x400 + 0xD800, L = (N - 0x10000) % 0x400 + 0xDC00 (2)
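The two lines of Figure 3-17 translate directly into code. The Python sketch below (function names ours) implements both directions and checks them against the code point U1D5A0, whose surrogate pair appears again in Table 3-16.

```python
def surrogate_pair_to_scalar(high, low):
    """Line 1 of Figure 3-17: combine an upper/lower surrogate pair into a scalar N."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000

def scalar_to_surrogate_pair(n):
    """Line 2 of Figure 3-17: split a scalar in planes 1-16 back into a surrogate pair."""
    if not 0x10000 <= n <= 0x10FFFF:
        raise ValueError("scalar is not in planes 1-16")
    high = (n - 0x10000) // 0x400 + 0xD800
    low = (n - 0x10000) % 0x400 + 0xDC00
    return high, low

# U1D5A0 round-trips through the surrogate pair D835,DDA0 (compare Table 3-16).
assert scalar_to_surrogate_pair(0x1D5A0) == (0xD835, 0xDDA0)
assert surrogate_pair_to_scalar(0xD835, 0xDDA0) == 0x1D5A0
```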

Up through Unicode version 3.0 all of Unicode’s characters were defined in the BMP. However, version 3.1 of the standard is the first version that makes assign- ments outside the BMP. Unicode 3.1 adds 44,946 new characters and when added to the existing 49,194 characters, the new total is 94,140 characters. For the most part the new characters are additional ideographic characters that provide complete cov- erage of the characters in the Hong Kong supplementary character set, which was dis- cussed earlier. [60]

The Unicode private use area is a range of Unicode that can be used by private parties for character definition. This range, for example could be used for the defini- tion of corporate logos or trademarks. It could also be used for protocol definition when agreement is made between the interested parties. [32]

As we stated earlier Unicode was envisioned as an interchange encoding. In order to guarantee that round trip conversion would always be possible Unicode defined a special block of characters, known as the compatibility range. The compat- ibility range contains alternative representations of characters from existing stan- dards. These duplicate characters are defined elsewhere within the standard. The primary purpose of these duplicates is to enable round trip mapping of Unicode and

the various national standards. Later in the dissertation we will examine several problems caused by the use of the compatibility range. [32]

3.8.2.6 Unicode Transmission Forms

There are four primary ways in which Unicode and ISO-10646 code points may be transmitted. Unfortunately, not all of the approaches are equivalent with respect to the code points that may be transmitted. The UCS-4 encoding is the only transmission mechanism that is capable of encoding all of the possible characters defined in ISO-10646. Within the semantics of Unicode, the UTF-32 (Universal Character Set Transformation Format 32-bit Form) is used to encode UCS-4. The difference is that UTF-32 is restricted to the range 0x0-0x10FFFF, which is precisely the range of code points defined in Unicode, while in UCS-4 all 32-bit values are valid. [32]

The UCS-2 encoding is capable of representing all of the code points defined within the BMP. Code points outside of the BMP are not represented, due to the group and plane numbers being fixed. [32]

The UTF-16 (Universal Character Set Transformation Format for Planes of Group 0) encoding permits code points defined in planes 0-16 of group 0 to be directly addressed. This is accomplished by combining individual 16-bit code points into single ISO-10646 code points using the previously mentioned algorithm on Figure 3-17. [32]

The UTF-8 (Universal Character Set Transformation Format 8-bit Form) encoding allows Unicode and ISO-10646 to be transmitted as a sequence of 8-bit bytes rather than as 16 or 32-bit units. It is a variable length encoding scheme requiring anywhere from one to six bytes per code point, however in the case of UTF-32 the maximum number of bytes would be limited to four. This is a common and useful transmission format, because UTF-8’s single-byte form directly corresponds to ASCII. Additionally, it is

safe to use in environments where code points are assumed to always be 8-bits [32],[43]. See Table 3-15. Converting from UTF-32 proceeds in three steps [110]:

• Determine the number of octets required for the character value by looking in the first column of Table 3-15.
• Prepare the high order bits of the octets as per the second through fifth columns in Table 3-15.
• Fill in the bits marked by x from the bits of the character value, starting from the low order bits, putting them first in the last octet of the sequence, then next to last, and so on until all x bits are filled in.

On Table 3-16 we show how the Unicode code point U1D5A0 (Mathematical Sans-Serif Capital A) would be represented in the various Unicode transformation formats.

Table 3-15. UTF-8

UTF-32 value (hex)      UTF-8 1st byte   UTF-8 2nd byte   UTF-8 3rd byte   UTF-8 4th byte
0000 0000 - 0000 007F   0xxxxxxx
0000 0080 - 0000 07FF   110xxxxx         10xxxxxx
0000 0800 - 0000 FFFF   1110xxxx         10xxxxxx         10xxxxxx
0001 0000 - 001F FFFF   11110xxx         10xxxxxx         10xxxxxx         10xxxxxx

Table 3-16. Unicode transformation formats

Form     Byte sequence (hex)
UTF-32   1D5A0
UTF-16   D835,DDA0
UTF-8    F0,9D,96,A0
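The three conversion steps and the bit patterns of Table 3-15 are easy to restate as code. The Python sketch below (function name ours) encodes a UTF-32 value as UTF-8 and reproduces the UTF-8 byte sequence for U1D5A0 shown in Table 3-16.

```python
def utf32_to_utf8(code_point):
    """Encode a UTF-32 value as UTF-8, following the three steps and Table 3-15."""
    if code_point <= 0x7F:                       # one octet: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # three octets
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x1FFFFF:                   # four octets
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("value outside the four-octet range of Table 3-15")

# U1D5A0 (Mathematical Sans-Serif Capital A) encodes as F0,9D,96,A0, as in Table 3-16.
assert utf32_to_utf8(0x1D5A0) == bytes([0xF0, 0x9D, 0x96, 0xA0])
```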

3.8.3 Criticism of Unicode

For the purposes of introducing the other multilingual encoding schemes we discuss some of Unicode's problems.

3.8.3.1 Problems With Character/Glyph Separation

Although Unicode can remedy a large number of problems encountered by multilingual applications, it also has numerous drawbacks. Some have argued that Unicode cannot be used as a text encoding system, because of Unicode's bias towards the presentation of text. One does not have to search hard to find examples of such biases. For example, Unicode encodes the fi ligature as a distinct character. Most text encoding specialists would argue that fi is not a character, but rather a glyph. Therefore, the fi glyph has no business being encoded as a character in a text encoding. In light of Unicode's orientation towards presentation, some authors have argued that Unicode should only be used as a glyph encoding. We will illustrate several other examples in later chapters that further support this argument. [36],[74]

3.8.3.2 Problems With Han Unification

As specified, Unicode's primary purpose is to encode all the major written scripts of the world, rather than all the world's written languages. This distinction is extremely important in Unicode. Nevertheless, most of the world's encoding systems actually encode written languages and not scripts. Recently, Unicode has provided a mechanism, known as surrogates, for encoding characters that are specific to certain written languages. Nevertheless, the vast number of characters that are tied to written languages, coupled with Unicode's surrogate gymnastics, hardly provide a satisfactory solution. [31]

Currently, Unicode encodes 49,194 characters in its BMP, using the Han unification process. At first this seemed more than sufficient; however, input from several nations (Japan, Mainland China, Taiwan, and Korea) was excluded during the creation of the BMP. Moreover, these were the groups that had the most characters to assign. Mainland China has responded by insisting that Unicode encode all of its official 6,000 characters in addition to the many simplified characters, plus the older classic set of some 40,000 characters. This alone would occupy nearly the entire BMP. Taiwan has responded in a similar fashion, insisting that it has the rights to its own complete set of classic characters. These Taiwanese characters represent an additional 50,000 characters, and Taiwan would not consider using the same characters encoded by Mainland China. These two groups alone required over 90,000 distinct characters. [31]

The Japanese also said they were entitled to have their own characters encoded in a distinct range. Naturally, once Korea got wind of these requests, they also asked for their characters. If each country gets its way, this could generate more than 170,000 characters. In an attempt to satisfy these groups Unicode has created surrogates. In the latest version of Unicode, 94,140 characters are encoded. This is still painfully short of the 170,000 characters needed. Obviously, 32 bits would be more than sufficient; however, Unicode does not provide a 32-bit contiguous block. Clearly, two separate 16-bit blocks do not solve the problem. In order to encode the required number of characters, Unicode must resort to special encoding forms that get piggybacked onto Unicode's 16-bit form, thereby making what would be a simple problem more complex. [31]

In many cases it is necessary to know which language a stream of characters represents for data processing operations, particularly sorting, spell checking, and grammar checking. In Unicode this can be difficult to ascertain, especially if a character is in the unified Han range. This causes difficulties in creating applications for a single language, such as natural language processing. It is much easier to create these applications if all the characters come from a single language block. This furthers the argument for an encoding system that separates character blocks for different languages. [74]

3.8.3.3 ISO-8859-1 Compatibility

Unicode is not really compatible with ISO-8859-1. Unicode streams are sequences of 16-bit code points, while ISO-8859-1 streams are sequences of 8-bit code points. Unicode's encoding system does not directly recognize ISO-8859-1 data. Unicode requires that ISO-8859-1 characters be first converted to Unicode by zero extension. However, to transmit Unicode on the Internet, you have to use the 8-bit safe Unicode transformation format (UTF-8). For Unicode characters that fall within the ASCII range the UTF-8 conversion simply removes the leading zero byte; characters in the upper half of ISO-8859-1 (0x80-0xFF), however, expand to two bytes in UTF-8. [74]
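As a small worked example (reusing the toUtf8 sketch from Section 3.8.2.6), zero extension followed by UTF-8 encoding shows that only the ASCII half of ISO-8859-1 survives unchanged:

    import Data.Word (Word8, Word32)

    -- Zero extension of ISO-8859-1 bytes to Unicode code points.
    latin1ToUnicode :: [Word8] -> [Word32]
    latin1ToUnicode = map fromIntegral

    -- With the toUtf8 sketch from Section 3.8.2.6:
    --   concatMap toUtf8 (latin1ToUnicode [0x41]) == [0x41]        -- 'A' is unchanged
    --   concatMap toUtf8 (latin1ToUnicode [0xE9]) == [0xC3, 0xA9]  -- 'é' becomes two bytes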

3.8.3.4 Efficiency

Most data are actually in a single language, and most languages can be encoded using 8-bit code points. Using a 16-bit encoding scheme doubles both the storage requirements of programs and the transmission time of character data. Compression schemes could be used to help alleviate this, but they are impractical due to their overhead.

3.8.4 Mudawwar's Multicode

Multicode is a character encoding system proposed in 1997 by Muhammad Mudawwar of the American University in Cairo; its goal is to address some of Unicode's drawbacks. Multicode's most important distinction is its use of multiple coded character sets. Multicode is not an extension to any coded character set, but rather is a collection of several coded character sets. In general, most coded character sets have strong ties to specific written languages. On the other hand, there are some characters that can be viewed as being language neutral, such as mathematical symbols. To take advantage of this, Multicode, unlike Unicode, is oriented towards written languages and not scripts. Each coded character set used in Multicode is designed to be independent and self sufficient, each having all necessary control characters, punctuation, and special symbols. [74]

3.8.4.1 Character Sets in Multicode

Instead of attempting to merge all written languages into a single 16-bit coded character set, Multicode defines separate 8-bit and 16-bit coded character sets. In Multicode there can be 256 separate coded character sets. Each coded character set is assigned a unique numeric identifier. In the case of ASCII the identifier is 0. [74]

In Multicode there may be substantial overlap between coded character sets; however, there will be cases where a coded character set has unique characters. For example, languages based on the Latin script share many common symbols; however, each has some unique letters. It is also possible to use more than one coded character set for a language. For example, the Azeri language could be written using either Cyrillic or Latin letters. [74],[111]

Multicode strives to define a unique coded character set for each written language, unlike Unicode, which merges scripts through a unification process. Additionally, Multicode supports the use of more than one coded character set standard for a given written language, in case different countries use different sets. [74]

3.8.4.2 Character Set Switching in Multicode

In order to support multilingual text, Multicode provides a mechanism for switching between coded character sets. Multicode defines a special character for this purpose, known as the switch character. In every 8-bit coded character set, Multicode reserves the last code point, 0xFF, as the switch character. To switch to a different coded character set, either 8-bit or 16-bit, a special two byte sequence is inserted into the text stream. The first byte is the switch character, and the second byte is a character set designator. For example, to switch from French (assumed to be coded character set 0x01) to a second coded character set (assumed to be 0x50), the two byte sequence 0xFF50 would be inserted into the text stream. [74]

In each 16-bit coded character set, Multicode reserves the range 0xFF00-0xFFFF as switch characters. The first byte of the switch sequence is always 0xFF and represents the switch character. The second byte of the switch sequence is the character set designator. Therefore, switching in a 16-bit coded character set is really the same as switching in an 8-bit character set. See Figure 3-18. In Figure 3-18, File 1 contains a stream of Arabic characters, File 2 contains ASCII and French characters, while File 3 contains Japanese and ASCII characters. In each case a switch sequence is used to switch out of Multicode's default ASCII mode. [74]

Figure 3-18. Character set switching in Multicode
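A rough Haskell model of this switching scheme is sketched below. Only the reserved switch character 0xFF, the default ASCII designator 0x00, and the two-byte switch sequence come from the description above; the other designator values and the sample byte contents are invented for illustration.

    import Data.Word (Word8)

    switchChar :: Word8
    switchChar = 0xFF          -- reserved last code point of every 8-bit set

    asciiSet, frenchSet :: Word8
    asciiSet  = 0x00           -- ASCII is the default set in Multicode
    frenchSet = 0x01           -- hypothetical designator, as assumed in the text

    -- The two-byte sequence that switches the stream to another coded set.
    switchTo :: Word8 -> [Word8]
    switchTo designator = [switchChar, designator]

    -- A stream that starts in ASCII, switches to the assumed French set for
    -- one word, and then switches back to ASCII.
    example :: [Word8]
    example = ascii "the word "
              ++ switchTo frenchSet ++ [0x63, 0x61, 0x66, 0xE9]   -- hypothetical French-set bytes
              ++ switchTo asciiSet  ++ ascii " follows"
      where ascii = map (fromIntegral . fromEnum)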

In Multicode, 16-bit coded character sets are only used in cases where a written language requires them. Multicode always uses the smallest coded character set for any given written language. When compared to Unicode, Multicode requires only half the storage for those written languages that can be represented using 8 bits. [74]

3.8.4.3 Focus on Written Languages

Multicode by design is oriented towards written languages. Each coded character set is designed to encode a particular written language. Furthermore, each coded character set provides a full set of control codes, thereby eliminating the need to switch to a special character set for control functions. Additionally, language information is implicitly encoded in Multicode via the character set switch sequences. In Multicode language-centric data processing is well defined because there is never any confusion over which written language a character comes from. [74]

3.8.4.4 ASCII/Unicode Compatibility

Multicode is directly compatible with ASCII. No conversion is necessary to use ASCII data in Multicode. ASCII is the default coded character set in Multicode; it is assigned the character set designator 0x00. Furthermore, Multicode is also compatible with Unicode. Multicode reserves the 0xFF character set designator for this purpose. There is never any misinterpretation of the switch sequence by Unicode, because 0xFFFF is an invalid Unicode character; hence it must be a switch. [74]

3.8.4.5 Glyph Association in Multicode

In Multicode several characters may share a common glyph, but have different code point values. For example, the letter a, which appears in both the French and ASCII coded character sets, could be encoded in two different positions. In Multicode characters would be associated with glyphs using either character set specific fonts or a single unified font. By using character set specific fonts, font sizes are kept to a minimum, as glyphs that are unnecessary for the display of a written language are not included in the font. Additionally, there is a one-to-one mapping between characters and glyphs. On the other hand, character set specific fonts duplicate glyphs that may be common across a number of character sets. A unified font would remove this redundancy, but would require a character to glyph index conversion, because the property of a one-to-one mapping between characters and glyphs would be lost. Most notably, Unicode could be used as a unified glyph index. This would allow the use of TrueType and OpenType fonts, as they use Unicode for indexing glyphs. [74]

3.8.5 TRON

TRON (The Real-Time Operating System Nucleus) is an open architecture that specifies interfaces and design guidelines for operating system kernels. The TRON Application Databus (TAD) is the standard for ensuring data compatibility across computers that support the TRON architecture. TAD supports multilingual data via multiple character sets. TAD provides both a uniform and efficient method for manipulating character sets. Additionally, applications based on TAD are independent of any particular coded character set. [82]

In TAD, language specifier codes are used to switch from one language to another. Characters in TRON may be single byte, double byte, or a combination of the two. At language boundaries language specifier codes are inserted, so that single byte and double byte codes can be intermixed within a single text stream. Therefore, TRON, like Multicode, always uses the most compact coded character set for a given written language. [82]

3.8.5.1 TRON Single Byte Character Code

In TRON, control codes, character codes, language specifier codes, and TRON escape codes are all based on a single byte code point. See Figure 3-19. The control codes in TRON are mostly the same as ASCII's. Nevertheless, code point 0x20 (ASCII space) is treated as a control code, and is called a separator in TRON. The separator is used to indicate both word and phrase divisions, as opposed to 0xA0 (blank). In TRON the handling of the separator is language specific, but in English the separator acts as an ASCII space. In other words, the separator is used as a gap when lines are broken, and it displays a variable width space for use in proportional spacing. [82]

Figure 3-19. TRON single byte character code

The character codes (0x21-0x7E, 0x80-0x9F, 0xA0, and 0xA1-0xFD) cover 220 characters. The 0xA0 character (blank) is handled as a fixed width space. In the case of English, the blank is called a required space. A required space is treated as an alphabetic character. However, the required space cannot be used together with punctuation for breaking a line. [82]

The language specifier code 0xFE is used for switching the language of the character codes (0x21-0x7E and 0x80-0xFD). Additionally, it can be expanded into multiple bytes through repeated application of the language specifier. For example, the double byte sequence 0xFEFE would expand the number of language specifiers by 220. [82]

In TRON 0xFF is used as an escape signal when the code point that follows it is in the 0x80-0xFE range. The TRON escape is used for punctuation in text and graphic segment data. Additionally, in TRON 0xFF is used to indicate a TRON special code when the code point that follows it is in the 0x21-0x7E range. TRON special codes are used by the TRON Application Control-Flow Language and are employed as special codes that can be embedded in text. [82]

3.8.5.2 TRON Double Byte Character Code

In TRON the double byte code is divided into four character zones (A, B, C, and D), language specifier codes, TRON escape codes, and TRON special codes. See Figure 3-20. Additionally, control codes appear as single byte code points inside two byte character codes. The language specifier codes, TRON special codes, and TRON escape codes are the same as their single byte analogs. Combined, the A, B, C, and D character blocks encode 48,400 characters.

3.8.5.3 Japanese TRON Code

Japanese TRON code is a double byte code system. The A block corresponds to the JIS X0208 standard. The B block contains those frequently occurring characters that are not in JIS X0208. In the C and D blocks are infrequently used characters. The set of Latin characters that are used in Japanese are treated as belonging to the Japanese group, rather than the Latin group. Therefore, in order to mix Japanese and English, it is necessary to switch in and out of Japanese. Generally, however, Latin characters are infrequently used in Japanese. When Latin characters are used, it is usually for the purpose of enumerating points in a preface using, for example, the letters A, B, C. For this reason, Latin characters are duplicated in the Japanese group. [82]

Figure 3-20. TRON double byte character code

3.8.6 EPICIST

The Efficient, Programmable, and Interchangeable Code Infrastructure for Symbols and Texts (EPICIST) is a multilingual character coding system. The creators of EPICIST believe that the existing character coding standards are inflexible, insufficient, and inefficient for addressing the needs of multilingual computing. In particular, the currently available character code standards intentionally avoid the handling of private or personal characters or symbols. They only specify small ranges of private characters. In the case of global digital libraries, which need to use non-standardized symbols, the existing approaches are woefully inadequate. A new framework is required in order to support more general or user specific symbols, since formal standardization is not practical. [79]

3.8.6.1 EPICIST Code Points

EPICIST is a dynamic symbol code infrastructure for multilingual computing. EPICIST can handle both general symbols and existing defined characters. EPICIST is a variable length character coding system, which is based upon a fixed width 16-bit code point. This 16-bit code point is called an EPIC Unit (EPICU). A symbol in EPICIST consists of one or more EPICUs. The most significant bit of an EPICU is bit 16, while the least significant bit is bit 1. In an EPICU the two most significant bits are used to indicate whether the unit is the head of a symbol or a tail of a symbol. If bit 16 is 0 then the unit is the tail of a symbol. However, if bit 15 is 0 then the unit is the head of a symbol. If both bits 15 and 16 are 0, then the unit is a symbol by itself. Thus, locating symbol boundaries is both simple and efficient. [79]
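A small Haskell sketch of the boundary test follows. The bit numbering matches the description above (bit 16 is the most significant); the fourth case, where both bits are set, is not spelled out in the text and is labelled here only as an assumed interior unit.

    import Data.Bits (testBit)
    import Data.Word (Word16)

    data UnitKind = Standalone | Head | Tail | Interior
      deriving (Show, Eq)

    -- Classify a single EPIC Unit from its two most significant bits.
    -- In the 0-indexed terms of Data.Bits, "bit 16" is testBit u 15 and
    -- "bit 15" is testBit u 14.
    classify :: Word16 -> UnitKind
    classify u = case (testBit u 15, testBit u 14) of
      (False, False) -> Standalone   -- both 0: the unit is a symbol by itself
      (True,  False) -> Head         -- bit 15 is 0: head of a multi-unit symbol
      (False, True ) -> Tail         -- bit 16 is 0: tail of a multi-unit symbol
      (True,  True ) -> Interior     -- assumed; this combination is not described in the text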

3.8.6.2 EPICIST Character Code Space

The code space of EPICIST is divided into subspaces. These subspaces fall into four categories: standardized character set subspaces, Epic VM (virtual machine) subspaces, user specific subspaces, and temporary subspaces. Symbol code values that consist of one or two EPICUs are predominantly used for encoding the standardized characters and Epic VM instructions. Sequences of three EPICUs are reserved for future standardized characters. Symbol code values that contain four or more EPICUs are set aside for user specific symbols. [79]

3.8.6.3 Compatibility With Unicode

Just like the Unicode standard, which uses a capital U to indicate a Unicode code point, EPICIST uses a capital letter P to indicate a code point. However, compound EPICIST symbols that are comprised of multiple EPICUs use a full-stop to delineate each unit. [79]

In EPICIST the lower code values are in direct correspondence with Unicode, except for the CJK characters. For example, the Unicode character range U0000-U2FFF maps directly to the EPICIST range P0000-P2FFF. On the other hand, the Unicode range U3000-U3FFF maps to the EPICIST range P8000.7000-P8000.7FFF. [79]
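The two ranges mentioned above can be captured in a few lines of Haskell; the sketch below handles only those ranges and leaves everything else unmapped.

    -- Map a Unicode code point to an EPICIST symbol (a list of EPICUs),
    -- covering only the two ranges stated above.
    unicodeToEpicist :: Int -> Maybe [Int]
    unicodeToEpicist u
      | u >= 0x0000 && u <= 0x2FFF = Just [u]                   -- P0000-P2FFF, identical values
      | u >= 0x3000 && u <= 0x3FFF = Just [0x8000, u + 0x4000]  -- P8000.7000-P8000.7FFF
      | otherwise                  = Nothing                    -- outside the ranges covered here

Note that the head unit 0x8000 has bit 16 set and bit 15 clear, and the tail unit falls below 0x8000, so the head/tail convention of the previous section is respected.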

In EPICIST combining characters are unnecessary, because every combination of combining characters can be assigned a unique code point in EPICIST. Unicode, on the other hand, must use combining characters, as the code space of Unicode would be insufficient if all combinations were to be defined. Therefore, Unicode uses an incomplete set of composite characters. [78]

3.8.6.4 Epic Virtual Machine

The code range P3000-P3FFF is set aside for Epic VM instructions and numerical representation. The code range P3E00-P3EFF contains the predefined Epic VM instructions, while the P3000-P3CFF range is marked for user defined Epic VM instructions. The code range P3F00-P3FFF is used to represent the block of integers between -128 and 127. The Epic VM decodes input symbols as instructions and executes them. Using the Epic VM one can define or modify instruction definitions, which may have been defined during runtime. In the Epic VM a user can define a code sequence at a code point. When a symbol is input, the specified code sequence is executed. Thus, it is possible to invoke instructions as functions. [79]

3.8.6.5 Using the Epic Virtual Machine for Ancient Symbols

It is frequently difficult to standardize ancient characters that are not currently being used, but are under study by scholars. If researchers have differing opinions about the identities of symbols, then standardization is not possible. If at some point scholars can come to agreement, then ancient symbols can be standardized. Nevertheless, academic study cannot wait for standardization. The EPICIST system allows researchers who have differing opinions about the identification of symbols to assign symbols to different code points and continue on with their investigations. Once the standardization process is complete, an Epic VM program can be embedded in EPICIST to map old code points to the new standardized ones. This is possible because an Epic VM program is nothing more than a set of symbols and is transmitted along with data encoded in EPICIST. [78]

3.8.7 Current Direction of Multilingual Encodings

It appears Unicode is the prominent multilingual encoding. Some of the reasons for this are based on sound technical arguments, while others are political and/or commercial. Technically, working with Unicode is actually no more difficult than working with ASCII, because of Unicode's fixed width stateless characters. On the other hand, Multicode, TRON, and EPICIST require either the manipulation of multi-byte characters, the manipulation of variable length code sequences, or maintaining stateful information.

Commercially, Unicode has been a major success. Unicode can be found in operating systems (, MacOS, OS/2, and Windows), programming languages (Java, , and Python) as well as web browsers (, Netscape, and Internet Explorer). Therefore, we use Unicode as a basis for illustrating information processing problems that arise from adopting a multilingual encoding.

Chapter 4 Bidirectional Text

Unicode's ability to mix the various script systems of the world makes the creation of multilingual documents no more difficult than the creation of monolingual documents. But this causes difficulties. An example of text using two different script systems is given in Figure 4-1. This text is an excerpt from a Tunisian newspaper, and tells of an upcoming international music festival. In this example we see an English phrase “(Tabarka World Music Festival)” embedded in a paragraph that is comprised of mostly Arabic text. The paragraph also contains European numerals for the date. The beginning of the paragraph starts in the upper right hand corner, and is read from right-to-left except when numerals or English phrases are encountered. We call such text streams “bidirectional text”. [4]

Figure 4-1. Tunisian newspaper

For the most part the layout of such bidirectional paragraphs is fairly straightforward. There are subtleties, however, that can make the layout non-trivial. Additionally, in some cases ambiguities may arise from the intermixing of script systems with conflicting directions. The goal of this chapter is to explore some of these subtleties and ambiguities. In particular, great attention is given to the intermixing of Latin based scripts (written left-to-right) with the Arabic and Hebrew script systems (written right-to-left). We demonstrate that the layout of multilingual text is non-trivial. This is followed by an investigation of the current techniques (algorithms) that are being used for layout of multilingual text. Lastly, the deficiencies in the current strategies are illustrated.

4.1 Non Latin Scripts

The inexact match between phoneme (a phonetic unit that represents a distinct sound in a language) and orthographic representation has made it possible for English to represent its intricate system of sounds without the need of diacritical marks (modifying marks that alter the phonetic value of a character). Each word in English can be encoded in ASCII. The remaining Latin script languages rely strongly on the use of diacritical marks and hence cannot be correctly encoded in ASCII. [62]

The addition of diacritical marks to an alphabet cannot help but complicate layout and editing. In some scripts the actual glyphs (the visual shape of a character or a sequence of characters) are altered dramatically. The reason for this lies in the history of literacy in the language. The glyphs for a set of alphabetic characters are strongly connected to the medium on (or in) which the glyphs are rendered. For example, the graphic shapes representing the syllabary of Sumerian were created by pressing a narrow triangular shaped stylus into clay, producing wedge-shaped marks, known as cuneiform, from which the script gets its name. [62]

The more recent Semitic scripts, of which Arabic is presently the most widespread, are pen and ink scripts. The development of Arabic as an efficient handwriting has made it relatively hard to work with in an automated environment. This difficulty comes not only from Arabic's cursive nature, but also from its bidirectional (an intermixing of text segments written right-to-left with segments written left-to-right) layout requirements. These challenges are discussed in later sections. [62]

4.1.1 Arabic and Hebrew Scripts

Arabic writing is alphabetical. Ideally alphabets consist of a few dozen letters, each of them representing only one unique sound. In modern Arabic there are 28 basic letters, 8 of them doublets differentiated by diacritics, and 6 optional letters for representing vowels. The letters are written from right-to-left, with words being separated by white space. The letters within a word are generally connected to each other. Numerals are read from left-to-right just as in the Latin based languages. From a strictly information processing perspective this is quite similar to Latin based scripts, disregarding the right-to-left writing direction and the interconnecting of letters. [71]

When the first attempts were made to construct a type font for Arabic, there was no model from which to construct glyphs other than handwriting. The Arabic language was not often inscribed on stone, so stonecutters were not given any incentive to create their own glyphs in spite of the popularity of stone monuments and inscriptions. Nevertheless, Arabic is difficult to capture in computers because Arabic is a hand written script requiring some amount of compromise for discrete characters. The compromise of using discrete characters to codify Arabic writing makes it difficult to express certain intrinsic properties (cursive, position, ligatures, and mirrors) of the Arabic script. We examine these properties below. [62]

4.1.1.1 Cursive

The finest Arabic inscriptions are imitations of handwriting, and are almost always cut in relief. A calligrapher would paint an inscription on a surface, from which a stonecutter would then chisel away the unpainted stone. This left the letters standing out against a background. Nevertheless, this fluid, connected nature of Arabic is difficult to adapt to the technology of movable type or matrix based glyph design. [62]

4.1.1.2 Position

In Arabic, and to some extent in Hebrew, the mapping of a glyph to a character is not one-to-one as in the Latin script. Instead the selection of a character's glyph is based upon its position within a word. Subsequently, each Arabic character may have up to four possible shapes: [6], [21], [89]

• Initial - Character appears at the beginning of a word.
• Final - Character appears at the end of a word.
• Medial - Character appears somewhere in the middle.
• Isolated - Character is surrounded by white space.

Furthermore, glyph selection must also take into consideration the linking abilities of the surrounding characters. For example, some glyphs may only link on their right side while others may permit links on either side. In Arabic each character belongs to one of the following joining classes (a sketch of form selection follows the list): [96]

• Right joining - Alef, Dal, Thal, Zain
• Left joining - None
• Dual joining - Beh, Teh, Theh, ...
• Join causing - Tatweel, Joiner (U200D)
• Non joining - Spacing characters, Non-joiner (U200C)
• Transparent - Combining marks
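The Haskell sketch below models how a shaping engine might pick one of the four positional forms from the joining classes of a character's neighbours. It is a deliberately simplified illustration: transparent (combining) marks and ligature formation are ignored, and the type and function names are ours.

    data JoinClass = RightJoining | LeftJoining | DualJoining
                   | JoinCausing  | NonJoining  | Transparent
      deriving (Show, Eq)

    data Form = Isolated | Initial | Medial | Final
      deriving (Show, Eq)

    -- Does a character of this class connect to the preceding character
    -- (its right-hand neighbour in right-to-left display), or to the
    -- following character?
    joinsToPrev, joinsToNext :: JoinClass -> Bool
    joinsToPrev c = c `elem` [RightJoining, DualJoining, JoinCausing]
    joinsToNext c = c `elem` [LeftJoining,  DualJoining, JoinCausing]

    -- Select the contextual form of a letter from the classes of its
    -- neighbours (Nothing at a word boundary).
    form :: Maybe JoinClass -> JoinClass -> Maybe JoinClass -> Form
    form prev me next =
      case (connectsBack, connectsForward) of
        (False, False) -> Isolated
        (False, True ) -> Initial
        (True,  True ) -> Medial
        (True,  False) -> Final
      where
        connectsBack    = joinsToPrev me && maybe False joinsToNext prev
        connectsForward = joinsToNext me && maybe False joinsToPrev next

For example, form (Just DualJoining) RightJoining (Just DualJoining) yields Final, reflecting the fact that a right-joining letter such as Alef connects only to the preceding letter and breaks the connection to the one that follows.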

In Hebrew some characters do have final forms even though Hebrew is not a cursive script. The idea of contextual shaping is certainly not limited to right-to-left scripts. For example, the Greek script provides a special final form for the sigma character. [96]

4.1.1.3 Ligatures

Occasionally two or more glyphs combine to form a new single glyph called a ligature. This resultant shape then replaces the individual glyphs from which it is comprised. In Arabic this occurs frequently and in Hebrew rarely. In particular, the Alef Lamed ligature (UFB4F) is used in liturgical books. The number of actual ligatures used in Arabic text is difficult to determine; however, Unicode devotes nearly 1,000 code points to them. Although infrequent, ligatures do occur even in English, specifically the fi ligature, where the letter f merges with the letter i. [6], [21], [89]

4.1.1.4 Mirroring

In some cases glyph selection may be based on a character's direction. These characters are known as mirrored characters (parentheses and brackets, for example). When mirrored characters are intermixed with Arabic or Hebrew characters, a complementary shape may need to be selected so as to preserve the correct meaning of an expression. For example, consider the text stream 1 < 2 in logical order (one less than two). If this stream is to be displayed in a right-to-left sequence it must be displayed as 2 > 1. In order to preserve the correct meaning the < is changed to >. This process is known as mirroring or symmetric swapping. [89]
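Symmetric swapping amounts to a character-level substitution. A minimal Haskell sketch, listing only a few representative pairs, is shown below.

    -- Replace a mirrored character with its counterpart; characters without
    -- a mirror are returned unchanged. Only a handful of pairs are listed.
    mirror :: Char -> Char
    mirror c = maybe c id (lookup c pairs)
      where
        pairs = [ ('(', ')'), (')', '(')
                , ('<', '>'), ('>', '<')
                , ('[', ']'), (']', '[')
                , ('{', '}'), ('}', '{') ]

    -- Example: map mirror "1 < 2" == "1 > 2"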

4.1.2 Mongolian

Mongolian writing is also alphabetic, like Arabic. In Mongolian there are 27 basic letters, and 8 letters for representing vowels. Words are separated by white space. Mongolian's ancestor, the classic Uigur script, belonged to the right-to-left Arabic script family. Like other Arabic based scripts, a character's shape is based upon its position within a word. This makes Mongolian and Arabic quite similar; however, Mongolian has more complicated orthographies. In some cases position information is not always enough to specify final glyphs, and there can even be some variation for the same form. Under Chinese influence Mongolian is now written vertically in columns from top to bottom, in a general left-to-right direction. Nevertheless, this script system brings yet another challenge to information processing. [53]

4.2 Bidirectional Layout

As computing power increases and as high quality laser printers become commonplace, user expectations rise. The computer must now be able to take sequences of intermixed characters (left-to-right and right-to-left) and place them in their proper position. We call this process “bidirectional layout”. In this section we explore some of the issues related to bidirectional layout.

4.2.1 Logical and Display Order

For the most part the order in which characters are stored in typesetting systems (logical order) is equivalent to the order in which they are visually presented (display order). The only exceptions are those scripts that are written from right-to-left. When the logical and display orders are not equivalent, an algorithm is required to convert the logical order to display order. At first this might seem trivial, given that a right-to-left script simply has its display order in reverse. Unfortunately this is not the case. Technically, Arabic and Hebrew are not simply right-to-left scripts; rather, they are bidirectional scripts. This bidirectional nature is exhibited when alphabetic and numeric data are intermixed. For example, the digits 2 and 9 in Figure 4-1 are the number 29 and not 92. Therefore, an algorithm that simply reverses characters is inadequate. [10]

Additionally, we must also consider text data that is composed of various script systems. As soon as any word or phrase from a non right-to-left script (English, German, etc.) is incorporated into a right-to-left script (Arabic, Hebrew, etc.), the same bidirectional problem arises. In certain cases the correct layout of a text stream may be ambiguous even when the directions of the scripts are known. Consider the following example in Figure 4-2, in which Arabic letters are represented by upper case Latin characters.

Figure 4-2. Ambiguous layout

fred does not believe TAHT YAS SYAWLA I

In the absence of context (a base or paragraph direction) there are two possible ways to read the sentence: when read from left-to-right it is (Fred does not believe I always say that), and when read from right-to-left it is (I always say that Fred does not believe). It thus becomes apparent that the problem is not only an algorithmic one but a contextual one. [41]

4.2.2 Contextual Problems

A logical to display conversion algorithm must also contend with the problem of context. In many cases an algorithm must consider the context in which a sequence of characters (alphabetic and numeric) appears. This can lead to cases in which an algorithm will yield inappropriate results when the context is not known or is misunderstood.

Consider a phone number appearing in a stream of Arabic letters, MY NUMBER IS (321)713-0261. In this example uppercase Latin letters represent Arabic letters, and the digits represent European numerals. This should not be rendered as a mathematical expression. In Arabic, mathematical expressions are read right-to-left, while phone numbers are read left-to-right. See Figure 4-3. [12],[96]

Figure 4-3. Rendering numbers

0261-713(321) SI REBMUN YM (incorrect)

(321)713-0261 SI REBMUN YM (correct)

Without understanding the context in which numbers appear, the correct display cannot be determined. There are numerous contextual and cultural factors (e.g., language and locale) that need to be given consideration when converting to display order.

4.2.3 Domain Names

In some situations a character changes meaning based on context. Consider the use of the hyphen-minus character in domain names. In domain names the predominant usage of the hyphen-minus is as white space and not as a mathematical operator or sign indicator. The example in Figure 4-4 illustrates the effect of European digits surrounding the hyphen-minus characters.

Line 1 of Figure 4-4 is a single domain name label in logical order. In this example uppercase Latin letters represent Hebrew letters, and the digits represent European numerals. Line 2 is the same label in display order; this is the output if the hyphen-minus characters are treated as mathematical operators. The text on Line 3 is also in display order; however, this output is obtained when the hyphen-minus characters are treated as white space characters.

Figure 4-4. Using a hyphen-minus in a domain name

NOP--123 (1)

--123PON (2)

123--PON (3)

Exploring domain names further, we see that even the full-stop character's semantics change based on context. The text on Line 1 of Figure 4-5 is a domain name in logical order; uppercase Latin letters represent Arabic letters. Line 2 is the resultant display order if the full-stop is treated as a sentence terminator (punctuation). In this example the presence of an Arabic character in the first label forces the entire domain name to take on an overall right-to-left reading. This is certainly correct behavior if this is the first sentence in a paragraph; however, it is inappropriate in the context of a domain name. This behavior unfortunately mangles the hierarchical structure of the domain name. We suggest that the output on Line 3 is more desirable, as this output is consistent with the current structure of domain names. In this case the full-stop characters are ignored.

Figure 4-5. Using a full-stop in a domain name

ABC.ibm.com (1)

ibm.com.CBA (2)

CBA.ibm.com (3)

4.2.4 External Interactions

The layout of bidirectional text is a complex process requiring the interaction of various systems. We discussed above the contextual problem in bidirectional layout and how it is solved by contextual analysis and character reordering. This is only one piece of the puzzle.

4.2.4.1 Line Breaking

When bidirectional text is displayed or printed, it is done so on a line by line basis for each paragraph. The lines, however, are not actually comprised of characters, but rather of glyphs. The process of constructing the lines, “line breaking”, requires that the widths of all the glyphs in the paragraph, along with the width of the display area, be known. It is inappropriate to assume that the number and width of the glyphs are the same as those of the characters when displayed. This requires a sophisticated mapping between characters and glyphs. [6], [21]

4.2.4.2 Glyph Mapping

Traditionally glyphs are selected and drawn by font rendering engines rather than via character replacement. The reason for this approach is centered around glyph availability. Some glyphs may simply not be available in a font (e.g., Hebrew and Greek final forms). If implementers were to replace sequences of characters with new character ligatures, there would be no guarantee that they would be present in a font either. Some ligatures cannot be constructed by using character replacement, as they are not present in Unicode. The choice of an appropriate glyph requires knowledge of the font and its available glyphs.

4.2.4.3 Behavioral Overrides

Putting aside glyph related problems, there are still other facets in a complete layout solution. For example, user supplied information may be required in order to determine where a paragraph begins and ends. Examining just the stream contents isn't always sufficient. This information could appear as control codes or be supplied externally. [96]

In some cases user preferences or locales can also force the stream contents to change. For example, the shapes used to display numeric characters could be controlled by a locale. In an Arabic locale numeric characters would be displayed with “Hindi” shapes, while a Western European locale would use “Arabic” shapes for numbers. [41]

4.2.5 Bidirectional Editing

There are also aspects of bidirectional layout that are outside the scope of overrides, in particular the caret and the mouse. Movement of the caret and hit testing of the mouse become more complex in bidirectional streams. If the caret is moving linearly within one of the (logical or visual) streams, then this movement needs to be translated to the other stream. Highlighting poses a similar problem as to which stream is being highlighted (logical or visual). [6], [21]

4.2.6 Goals

Unfortunately, the tasks that the developer would like to provide are not necessarily the same ones that can be provided. All of this depends on how the algorithm is intended to be used. If the intended use is to fit within some broader context, then it may be acceptable to leave some features out. If the intended use is to provide a complete layout framework, a set of features above and beyond the ones mentioned may be required. The specification of a bidirectional algorithm can only be implemented as a character stream reordering (what else can an implementer do?), yet the bidirectional layout problem can only be solved in a larger context.

4.3 General Solutions to Bidirectional Layout

There are four general ways in which the bidirectional display problem can be addressed. Three of these strategies are automated, while one requires user intervention:

• Forced Display
• Explicit
• Implicit
• Implicit/Explicit

4.3.1 Forced Display

The Forced Display algorithm requires users to enter characters in display order. So if a text stream contained Arabic (right-to-left) characters, the user would simply enter them backwards. This inelegant solution becomes cumbersome when scripts are intermixed. On the other hand, this approach has the advantage that the output (display order) is always correct and independent of the context. [12]

4.3.2 Explicit

Another potential solution to the bidirectional problem is to allow users to enter text in logical order but expect them to use explicit formatting codes (for example, U202B and U202A in Unicode) for segments of text that run contrary to the base text direction. But what does one do with the explicit control codes in tasks other than display? For example, what effect should these controls have on searching and data interchange? These explicit codes require specific code points to be set aside for them as well. In some encodings this may be unacceptable due to the fixed number of code points available and the number of code points required to represent the script itself. A less technical problem is the pain this process causes the users, requiring constant thought in terms of presentation, which is an unnatural way to think about text. [12], [41]

4.3.3 Implicit

Humans want to be able to enter text in the same way as one would read it aloud. Ideally, one would like to maintain the flexibility of entering characters in logical order while still achieving the correct visual appearance. Such “implicit layout algorithms” do exist. They require no explicit directional codes nor any higher-order protocols. These algorithms can automatically determine the correct visual layout by simply examining the logical text stream. Generally the implicit rules are sufficient for the layout of most text streams. Still, there are situations in which an implicit algorithm will not always yield an acceptable result, because it is difficult to design a set of heuristics for every situation. [41]

4.3.4 Implicit/Explicit

An implicit/explicit algorithm offers the greatest level of flexibility by providing a mechanism for unambiguously determining the visual representation of all raw streams of text. This type of algorithm combines the benefits of implicit layout algorithms with the flexibility of an explicit algorithm. Throughout the rest of this chapter we limit our discussion of bidirectional algorithms to this type of algorithm, because it shows the greatest potential for solving the bidirectional display problem. [96]

4.4 Implicit/Explicit Bidirectional Algorithms

The primary algorithm explored below is the Unicode Bidirectional Algorithm. This algorithm is in the implicit/explicit class of bidirectional algorithms. The other algorithms that are discussed in this section are variations of Unicode's algorithm.

4.4.1 Unicode Bidirectional Algorithm

The Unicode Bidirectional Algorithm is described in Unicode Technical Report #9. There are two reference implementations: one written in the programming language Java and one in C [100]. The Unicode algorithm is based upon existing implicit layout algorithms and explicit directional control codes that may be in the input stream.

The core of the Unicode Bidirectional Algorithm is centered around three aspects: resolving character types, reordering characters, and analyzing mirrors. The bidirectional algorithm is applied to each paragraph on a line by line basis. During resolution, characters that do not have a strong direction are assigned a direction based on the surrounding characters or directional overrides. In this context the term “strong” indicates a character that is either a left-to-right character or a right-to-left character. In the reordering phase, sequences of characters are reversed as necessary to obtain the correct visual ordering. Finally, each mirrored character (parentheses, brackets, braces, etc.) is examined to see if it needs to be replaced with its symmetric mirror. [100]

The Unicode Bidirectional Algorithm determines the general reading direction of a paragraph either explicitly or implicitly. In the explicit method the reading direction of a paragraph is communicated to the algorithm outside of and independent from the characters in the paragraph. The implicit method determines the reading direction of a paragraph by applying a set of heuristics to the characters in the paragraph. [100]

4.4.2 IBM Classes for Unicode (ICU) and Java

Java 1.2 provides a complete framework for creating multi-script applications. Java's TextLayout and LineBreakMeasurer classes facilitate the layout of complex text in a platform neutral manner. The underlying approach to reordering is based on the Unicode Bidirectional Algorithm. [21]

ICU's approach is very close to Java's, due in some respect to the fact that the overall internationalization architecture of Java is based on ICU. The key differences are centered around glyph management. In ICU glyph management routines are not necessary, because ICU is not designed to be a complete programming environment. The ICU components are designed to work in conjunction with other libraries. [45]

4.4.3 Pretty Good Bidirectional Algorithm (PGBA)

Mark Leisher's PGBA is another algorithm for bidirectional reordering. The algorithm takes an implicit approach to reordering. PGBA does not attempt to match Unicode's reordering algorithm in full. However, PGBA's implicit algorithm does match the implicit section of the Unicode Bidirectional Algorithm. At the moment it does not support the explicit bidirectional control codes (LRE, LRO, RLE, RLO, PDF). One should not infer that the lack of support for directional control codes results in an incomplete algorithm. Under most circumstances the implicit algorithm reorders a text stream correctly. Furthermore, these control codes are not always present in all encoding schemes. Of course support for them would be a nice feature, but certainly not a necessary one. [56]

4.4.4 Free Implementation of the Bidirectional Algorithm (FriBidi)

Dov Grobgeld's FriBidi follows the Unicode Bidirectional Reference more closely. Notably, there is support for integration with graphical user interfaces along with a collection of code page converters. However, as in PGBA, the explicit control codes are not currently supported. [34]

4.5 Evaluation of Bidirectional Layout Algorithms

In this section we report on the results of our independent evaluation of the output of the bidirectional algorithms. The primary goal we sought in evaluating the algorithms was to determine whether or not their output matched Unicode's reference algorithm. We have tested them on a large number of small, carefully crafted test cases of basic bidirectional text.

4.5.1 Testing Convention

To simulate Arabic and Hebrew input/output a simple set of rules is utilized. These rules make use of characters from the Latin 1 charset. The character mappings allow Latin 1 text to be used instead of real Unicode characters for Arabic, Hebrew, and control codes. This is an enormous convenience in writing, reading, running, and printing the test cases. This form is the same as the one used by the Unicode Bidirectional Reference Java implementation [100]. See Table 4-1. Unfortunately not all of the implementations adhere to these rules in their test cases. To compensate for this, changes were made to some of the implementations.

Table 4-1. Bidirectional character mappings for testing

Type   Arabic    Hebrew    Mixed    English
L      a-z       a-z       a-z      a-z
AL     A-Z                 A-M
R                A-Z       N-Z
AN     0-9                 5-9
EN               0-9       0-4      0-9
LRE    [         [         [        [
LRO    {         {         {        {
RLE    ]         ]         ]        ]
RLO    }         }         }        }
PDF    ^         ^         ^        ^
NSM    ~         ~         ~        ~

In the Unicode C reference implementation, additional character mapping tables were added to match those of the Unicode Java reference implementation. Also, the bidirectional control codes were remapped from the control range 0x00-0x1F to the printable range 0x20-0x7E. This remapping allows test results to be compared more equitably.

In PGBA and FriBidi the character attribute tables were modified to match the character mappings outlined in Table 4-1. The strategy for testing ICU and Java was slightly different from that for PGBA and FriBidi. In the ICU and Java test cases we used the character types rather than the character mappings.

4.5.2 Test Cases

The test cases are presented in Tables 4-2, 4-3, 4-4, and 4-5. The source column of each table shows the test input. The expected column is what we think the correct output should be. In all cases this is the output produced by our HaBi implementation. These test cases are taken from the following sources:

• Mark Leisher - His web page provides a suite of test cases as well as a table of results for other implementations [56]. See Tables 4-2 and 4-3.
• Unicode Technical Report #9 - Some of the examples are used for testing conformance [100]. See Table 4-2.
• Additional test cases for uncovering potential bugs in an implementation's handling of weak types and directional controls. See Tables 4-4 and 4-5.

Table 4-2. Arabic charmap tests

     Source                                        Expected
1    car is THE CAR in arabic                      car is RAC EHT in arabic
2    CAR IS the car IN ENGLISH                     HSILGNE the car SI RAC
3    he said “IT IS 123, 456, OK”                  he said “KO ,456 ,123 SI TI”
4    he said “IT IS (123, 456), OK”                he said “KO ,(456 ,123) SI TI”
5    he said “IT IS 123,456, OK”                   he said “KO ,123,456 SI TI”
6    he said “IT IS (123,456), OK”                 he said “KO ,(123,456) SI TI”
7    HE SAID “it is 123, 456, ok”                  “ok ,456 ,123 it is” DIAS EH
8    shalom <123H/>shalom<123H>
9    HE SAID “it is a car!” AND RAN                NAR DNA “!it is a car” DIAS EH
10   HE SAID “it is a car!x” AND RAN               NAR DNA “it is a car!x” DIAS EH
11   -2 CELSIUS IS COLD                            DLOC SI SUISLEC 2-
12   SOLVE 1*5 1-5 1/5 1+5                         5+1 5/1 5-1 5*1 EVLOS
13   THE RANGE IS 2.5..5                           5..2.5 SI EGNAR EHT
14   IOU $10                                       10$ UOI
15   CHANGE -10%                                   %10- EGNAHC
16   -10% CHANGE                                   EGNAHC %10-
17   he said “IT IS A CAR!”                        he said “RAC A SI TI!”
18   he said “IT IS A CAR!X”                       he said “X!RAC A SI TI”
19   (TEST) abc                                    abc (TSET)
20   abc (TEST)                                    abc (TSET)
21   #@$ TEST                                      TSET $@#
22   TEST 23 ONCE abc                              abc ECNO 23 TSET
23   he said “THE VALUES ARE 123, 456, 789, OK”    he said “KO ,789 ,456 ,123 ERA SEULAV EHT”.
24   he said “IT IS A bmw 500, OK.”                he said “A SI TI bmw KO ,500.”

Table 4-3. Hebrew charmap tests

     Source                             Expected
1    HE SAID “it is 123, 456, ok”.      .”it is 123, 456, ok” DIAS EH
2    shalom <123H/>shalom<123H>
3    SAALAM                             MALAAS
4    -2 CELSIUS IS COLD                 DLOC SI SUISLEC -2
5    -10% CHANGE                        EGNAHC -10%
6    TEST ~~~23%%% ONCE abc             abc ECNO 23%%%~~~ TSET
7    TEST abc ~~~23%%% ONCE abc         abc ECNO abc ~~~23%%% TSET
8    TEST abc@23@cde ONCE               ECNO abc@23@cde TSET
9    TEST abc 23 cde ONCE               ECNO abc 23 cde TSET
10   TEST abc 23 ONCE cde               cde ECNO abc 23 TSET
11   Xa 2 Z                             Z a 2X

Table 4-4. Mixed charmap tests

     Source      Expected
1    A~~         ~~A
2    A~a~        a~~A
3    A1          1A
4    A1          1A
5    A~1         1~A
6    1           1
7    a1          a1
8    N1          1N
9    A~~ 1       1 ~~A
10   A~a1        a1~A
11   N1          1N
12   a1          a1
13   A~N1        1N~A
14   NOa1        a1ON
15   1/2         1/2
16   1,2         1,2
17   5,6         5,6
18   A1/2        2/1A
19   A1,5        1,5A
20   A1,2        1,2A
21   1,.2        1,.2
22   1,A2        2A,1
23   A5,1        5,1A
24   +$1         +$1
25   1+$         1+$
26   5+1         5+1
27   A+$1        1$+A
28   A1+$        $+1A
29   1+/2        1+/2
30   5+          5+
31   +$          +$
32   N+$1        +$1N
33   +12$        +12$
34   a/1         a/1
35   1,5         1,5
36   +5          +5

Table 4-5. Explicit override tests

     Source        Expected
1    a}}}def       afed
2    a}}}DEF       aFED
3    a}}}defDEF    aFEDfed
4    a}}}DEFdef    afedFED
5    a{{{def       adef
6    a{{{DEF       aDEF
7    a{{{defDEF    adefDEF
8    a{{{DEFdef    aDEFdef
9    A}}}def       fedA
10   A}}}DEF       FEDA
11   A}}}defDEF    FEDfedA
12   A}}}DEFdef    fedFEDA
13   A{{{def       defA
14   A{{{DEF       DEFA
15   A{{{defDEF    defDEFA
16   A{{{DEFdef    DEFdefA
17   ^^abc         abc
18   ^^}abc        cba
19   }^abc         abc
20   ^}^abc        abc
21   }^}abc        cba
22   }^{abc        abc
23   }^^}abc       cba
24   }}abcDEF      FEDcba

4.5.3 Test Results

All implementations were tested by using the test cases from Tables 4-2, 4-3, and 4-4. The implementations that support the Unicode directional control codes (LRO, LRE, RLO, RLE, and PDF) were further tested using the test cases from Table 4-5. At this time, the directional control codes are only supported by HaBi, ICU, Java 1.2, the Unicode Java reference, and the Unicode C reference.

When the results of the test cases were compared, the placement of directional control codes and the choice of mirrors were ignored. This is permitted by Unicode, since the final placement of control codes is arbitrary, and mirroring may optionally be handled by a higher-order protocol.

Table 4-6. Arabic test differences

     PGBA 2.4    FriBidi 1.12
2    SI RAC the car NI ENGLISH
4    he said “KO ,)456 ,123( SI TI”
6    he said “KO ,)123,456( SI TI”
7    DIAS EH “it is 456 ,123, ok”
8    <123H>shalom
9    DIAS EH “it is a car!” DNA RAN
10   DIAS EH “it is a car!x” DNA RAN
11   -SI SUISLEC 2 COLD
12   1+5 1/5 1-5 5*1 EVLOS
14   $10 UOI
15   %-10 EGNAHC 10- EGNAHC%
16   EGNAHC %-10 -10% CHANGE
19   abc )TSET( (TSET) abc
21   #@$ TEST
22   ECNO 23 TSET abc
24   he said “A SI TI bmw 500, KO.”

Table 4-7. Hebrew test differences

     PGBA 2.4    FriBidi 1.12
5    EGNAHC %-10
6    abc ECNO %%%23~~~ TSET
7    abc ECNO abc ~~~23%%% TSET
11   Z 2 aX a 2X

Table 4-8. Mixed test differences

     PGBA 2.4    FriBidi 1.12
1    A~~
2    ~a~A~Aa~
10   1a~A ~Aa1
14   1aON
18   1/2A 1/2A
19   5,1A
21   2.,1
23   1,5A
27   +$1A
28   1+$A
32   1$+N
35   5,1

Tables 4-6, 4-7, and 4-8 detail the differences among the implementations with respect to the results obtained with the HaBi implementation. Only PGBA and FriBidi return results that are different from those of the HaBi implementation. The Unicode Java reference, the Unicode C reference, Java 1.2, and ICU pass all test cases.

In PGBA, types AL and R are treated as being equivalent [56]. This in itself does not present a problem as long as the data stream is free of AL and EN (European number) characters. However, a problem arises when an AL is followed by an EN, for example in test case 18 from Table 4-4. In this situation the ENs should be treated as ANs (Arabic numbers) and not left as ENs.

The handling of an NSM is also different in PGBA. PGBA treats an NSM as being equal to an ON (other neutral) [56]. This delays the handling of NSMs until the neutral type resolution phase rather than the weak type resolution phase. By delaying their handling, the wrong set of rules is used to resolve the NSM type. For example, in test case 2 from Table 4-4 the last NSM should be treated as type L instead of type R.

In FriBidi there are a few bugs in the implementation. Specifically, when an AL is followed by an EN, the EN is not being changed to type AN. See test case 18 in Table 4-4. This is the same symptom as was found in PGBA, but the root cause is different. In FriBidi, in step W2 (weak processing phase rule two), the wrong type is being examined; it should be type EN instead of type N.
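For reference, rule W2 of the Unicode Bidirectional Algorithm is small enough to sketch directly: each European number whose nearest preceding strong type is AL is retyped as an Arabic number. The Haskell below is a simplified, stand-alone rendering of just this rule; it is not the code from HaBi or from either of the implementations above.

    data BidiType = L | R | AL | EN | AN | ON | NSM
      deriving (Show, Eq)

    -- Rule W2: search backwards from each EN for the first strong type
    -- (L, R, AL, or the start-of-sequence direction); if it is AL, the EN
    -- becomes an AN.
    w2 :: BidiType -> [BidiType] -> [BidiType]
    w2 sos = go sos
      where
        go _    []       = []
        go prev (t : ts)
          | t == EN && prev == AL = AN : go prev ts   -- retype under an AL context
          | strong t              = t  : go t    ts   -- update the strong context
          | otherwise             = t  : go prev ts
        strong t = t `elem` [L, R, AL]

    -- Example: w2 L [AL, ON, EN, R, EN] == [AL, ON, AN, R, EN]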

There is also a bug in determining the first strong directional character. The only types that are recognized as having a strong direction are types R and L. Type AL should also be recognized as a strong directional character. For example, when test case 1 from Table 4-4 is examined, FriBidi incorrectly determines that there are no strong directional characters present. It then proceeds to default the base direction to type L when it should actually be of type R. This problem also causes test cases 2, 9, and 11 from Table 4-2 to fail.

4.6 Functional Approach to Bidirectional Layout

Having examined several bidirectional algorithms, we see numerous problems with the implementations. We believe these errors are not the fault of the implementations, but rather are the fault of the algorithm and its description. In this section we introduce the Haskell bidirectional algorithm (HaBi). Our goal is to apply functional programming techniques to the problem of bidirectional layout, so as to discover the essence of the Unicode Bidirectional Algorithm. A greater understanding of the algorithm is obtained by a clear functional description of its operations [41]. Without a clear description, implementers may encounter ambiguities that ultimately lead to divergent implementations, contrary to the primary goal of the Unicode Bidirectional Algorithm. During the construction of our functional implementation we excluded all references to the Java and C implementations of the Unicode Bidirectional Algorithm, so as to prevent any bias.

4.6.1 Haskell Bidirectional Algorithm (HaBi)

In this section the source code to HaBi is presented. The HaBi reference implementation uses the Hugs 98 version of Haskell 98 [51], as it is widely available (Linux, Windows, and Macintosh) and easily configurable.

Since the dominant concern in HaBi is comprehension and readability, our implementation closely follows the textual description as published in Unicode Technical Report #9. See Figure 4-7. HaBi comprises five phases, as in the Java Unicode Bidirectional Reference implementation:

• Resolution of explicit directional controls
• Resolution of weak types
• Resolution of neutral types
• Resolution of implicit levels
• Reordering of levels

Currently there is no direct support for Unicode in the Hugs 98 implementation of Haskell 98.¹ So we treat Unicode as lists of 16 or 32-bit integers. The authors provide two modules for Unicode manipulation. The first is used to create Unicode (UCS4, UCS2, and UTF-8) strings. The second is used for determining character types. Utility functions convert Haskell strings with optional Unicode character escapes to 16 or 32-bit integer lists. A Unicode escape takes the form \uhhhh, analogous to Java. This escape sequence is used for representing code points outside the range 0x00-0x7F. This format was chosen so as to permit easy comparison of results to other implementations.
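The escape convention is easy to illustrate. The sketch below converts a string with optional \uhhhh escapes to a list of code point integers; it mirrors the convention described above but is not the utility module from Appendix A.

    import Data.Char (digitToInt, isHexDigit, ord)

    -- Convert a Haskell string with optional \uhhhh escapes into a list of
    -- code point values; ordinary characters map to their own code points.
    decodeEscapes :: String -> [Int]
    decodeEscapes [] = []
    decodeEscapes ('\\' : 'u' : a : b : c : d : rest)
      | all isHexDigit [a, b, c, d] =
          foldl (\acc h -> acc * 16 + digitToInt h) 0 [a, b, c, d] : decodeEscapes rest
    decodeEscapes (ch : rest) = ord ch : decodeEscapes rest

    -- Example: decodeEscapes "a\\u05D0b" == [0x61, 0x05D0, 0x62]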

Internally HaBi manipulates Unicode as sequences of 32-bit integers. See Figure 4-6. HaBi is prepared to handle surrogates as soon as Unicode assigns them. The only change HaBi requires is an updated character attribute table. It would be more elegant to use the polymorphism of Haskell, since the algorithm does not really care about the type of a character, only its attribute.

Figure 4-6. Input and output of Haskell Bidirectional Reference (string to integer (Unicode) list, logical-to-display reordering, integer (Unicode) list back to string)
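That polymorphism point can be sketched with a type class; the names below (HasBidi, BidiClass) and the attribute table are illustrative only, not HaBi's real definitions.

-- Sketch: the algorithm only needs a character's bidirectional attribute,
-- so the character representation could be abstracted behind a class.
module BidiAttr where

data BidiClass = L | R | AL | EN | AN | ES | ET | CS | NSM | BN | B | S | WS | ON
  deriving (Eq, Show)

class HasBidi c where
  bidiClass :: c -> BidiClass

newtype Ucs4 = Ucs4 Int deriving (Eq, Show)

-- A toy attribute table; a real one would be generated from the Unicode
-- character database.
instance HasBidi Ucs4 where
  bidiClass (Ucs4 cp)
    | cp >= 0x0041 && cp <= 0x007A = L   -- Latin letters (roughly)
    | cp >= 0x0591 && cp <= 0x05F4 = R   -- Hebrew block (roughly)
    | cp >= 0x0030 && cp <= 0x0039 = EN  -- European digits
    | otherwise                    = ON

-- Later phases can be written against HasBidi instead of a concrete type:
isStrong :: HasBidi c => c -> Bool
isStrong c = bidiClass c `elem` [L, R, AL]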

Each Unicode character has an associated bidirectional attribute and level number. Figure 4-7 shows the general relationship of this information throughout the steps of the algorithm. The first step in our implementation is to look up and assign bidirectional attributes to the logical character stream. The attributes are obtained from the online character database as published in Unicode 3.0. At this point explicit processing assigns level numbers as well as honoring any directional overrides. Weak and neutral processing potentially causes attribute types to change based upon surrounding attribute types. Implicit processing assigns final level numbers to the streams, which control reordering. Reordering then produces a sequence of Unicode characters in display order.

1. The Haskell 98 Report defines the Char type as an enumeration consisting of 16-bit values conforming to the Unicode standard. The escape sequence used is consistent with that of Java (\uhhhh). Unicode is permitted in any identifier or any other place in a program. Currently the only Haskell implementation known to support Unicode directly is the Chalmers Haskell compiler.

Figure 4-7. Data flow: Unicode attribute and level lookup, explicit, weak, neutral, implicit, reorder

HaBi uses the following three internal types:

• type Attributed = (Ucs4, Bidi)
• type Level = (Int, Ucs4, Bidi)
• data Run = LL [Level] | LR [Level] | RR [Level] | RL [Level]

Wherever possible the implementation treats characters collectively as sequential runs rather than as individual characters [1]. By using one of data type Run's four possible type constructors, characters can then be grouped by level. These four constructors signify the possible combinations of starting and ending run directions. For example, the LL constructor signifies that the start of a run and the end of a run are both left-to-right. Therefore runs of LL followed by RL are not created.
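A rough sketch of this grouping idea, using our own helper names rather than HaBi's actual definitions from Appendix A:

-- Sketch: group a resolved stream into maximal runs of equal embedding
-- level. Names (Level, groupByLevel, runDirection) are illustrative.
module Runs where

import Data.List (groupBy)

type Ucs4  = Int
data Bidi  = L | R | AL | EN | AN | ON deriving (Eq, Show)
type Level = (Int, Ucs4, Bidi)          -- (embedding level, code point, type)

levelOf :: Level -> Int
levelOf (lvl, _, _) = lvl

-- Maximal sequences of characters sharing one embedding level.
groupByLevel :: [Level] -> [[Level]]
groupByLevel = groupBy (\a b -> levelOf a == levelOf b)

-- Even levels run left-to-right, odd levels right-to-left; this is what the
-- LL/LR/RR/RL constructors ultimately record about a run's boundaries.
runDirection :: [Level] -> Bidi
runDirection (x:_) | odd (levelOf x) = R
runDirection _                       = L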

Before the details of the source code are discussed, it is important to note the following concerning HaBi:

• The logical text stream is assumed to have already been separated into paragraphs and lines.
• Directional control codes are removed once processed.
• No limit is imposed on the number of allowable embeddings.
• Mirroring is accomplished by performing character replacement.

By separating those facets of layout dealing with reordering from those that are concerned with rendering (line breaking, glyph selection, and shaping) it becomes easier to understand the Haskell implementation.

4.6.1.1 HaBi Source Code

In the source code, functions are named so as to correspond to the appropriate section in the Unicode Bidirectional textual reference [100]. See Appendix A. For example, the function named weak refers to overall weak resolution, while the function named w1_7 (lines 46-72) specifically refers to steps 1 through 7 of weak resolution.

The function logicalToDisplay (lines 152-160 in Appendix A) is used to convert a stream in logical order to one in display order. First, calls to the functions explicit (lines 37-41), weak (lines 74-79), neutral (lines 95-100), and implicit (lines 115-120) form runs of fully resolved characters. Calls to reorder (lines 135-141) and mirror (lines 144-150) are then applied to the fully resolved runs, which in turn yield a stream in display order. This is discussed in greater detail in the next few paragraphs.
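The overall shape of that composition can be sketched as follows; the phase functions are stubbed out, so this is only a paraphrase of the structure, not the Appendix A code.

-- Sketch of the phase composition inside logicalToDisplay. The real phase
-- functions live in Appendix A; they are stubbed here so that only the
-- data flow is visible.
module Pipeline where

type Ucs4  = Int
type Bidi  = String                  -- stand-in for HaBi's attribute type
type Level = (Int, Ucs4, Bidi)       -- stand-in for HaBi's Level triple

explicit :: [Ucs4] -> [[Level]]      -- split into runs, honor overrides
explicit = undefined

weak, neutral, implicit :: [Level] -> [Level]
weak     = undefined                 -- resolve weak types (w1_7)
neutral  = undefined                 -- resolve neutral types (n1_2)
implicit = undefined                 -- assign final levels (i1_2)

reorder :: [[Level]] -> [Ucs4]       -- reverse odd-level runs, flatten
reorder = undefined

mirror :: [Ucs4] -> [Ucs4]           -- replace mirrored characters
mirror = undefined

logicalToDisplay :: [Ucs4] -> [Ucs4]
logicalToDisplay =
  mirror . reorder . map (implicit . neutral . weak) . explicit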

The explicit function breaks the logical text stream into logical runs via calls to p2_3 (lines 1-8), x2_9 (lines 10-27) and x10 (lines 29-35). The reference description suggests the use of stacks for keeping track of levels, overrides, and embeddings. In our implementation stacks are used as well, but they are implicit rather than explicit (function x2_9, arguments two, three, and four). The functions weak, neutral, and implicit are then mapped onto each individual run.

In weak steps 1 through 7 (lines 46-72) two pieces of information are carried forward (the second and third arguments of function w1_7): the current directional state and the last character's type. There are cases in the algorithm where a character's direction gets changed but the character's intrinsic type remains unchanged. For example, if a stream contained an AL followed by an EN, the AL would change to type R (step three in weak types resolution). However, the last character would need to remain AL so as to cause the EN to change to AN (step two in resolution of weak types).
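The following sketch isolates just this point for rules W2 and W3; it is our own simplified function, tracking only the last strong type, whereas the full w1_7 in Appendix A handles all seven rules and the proper initial state.

-- Sketch: rule W2 (EN becomes AN after an AL) must be driven by the
-- ORIGINAL type of the preceding strong character, even though rule W3 has
-- already rewritten that AL to R.
module WeakSketch where

data Bidi = L | R | AL | EN | AN | ON deriving (Eq, Show)

w2_3 :: [Bidi] -> [Bidi]
w2_3 = go L   -- placeholder initial state; the real algorithm uses the sor
  where
    go _      []       = []
    go strong (t:rest) =
      let resolved = case t of
            EN | strong == AL -> AN   -- W2: EN after AL becomes AN
            AL                -> R    -- W3: AL itself becomes R
            _                 -> t
          -- carry the ORIGINAL type when updating the last strong type
          strong' = if t `elem` [L, R, AL] then t else strong
      in resolved : go strong' rest

-- w2_3 [AL, EN] == [R, AN]; resolving AL to R first would lose this.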

The functions n1_2 (lines 81-93) and i1_2 (lines 103-113) resolve the neutral and implicit character types respectively. The details of these functions are not discussed as they are fairly straightforward. At this point runs are fully resolved and ready for reordering (function reorder).

Reordering occurs in two stages. In the first stage (function reverseRun, lines 122-127), a run is either completely reversed or left as is. This decision is based upon whether a run's level is even or odd; if it is odd (right-to-left) then it is reversed. In the second stage (function reverseLevels, lines 129-133), the list of runs is reordered. At first it may not be obvious that the list being folded is not the list of runs, but the list of levels (highest level down to the lowest odd level in the stream). Once reordering is finished the list of runs is collapsed into a single list of characters in display order.
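A compact sketch of the two-stage idea (helper names are ours; HaBi's reverseRun and reverseLevels appear in Appendix A):

-- Sketch of rule L2: fold from the highest level in the line down to its
-- lowest odd level, reversing every maximal subsequence whose level is at
-- least the current level.
module ReorderSketch where

import Data.List (groupBy)

type Leveled a = (Int, a)

reorderLine :: [Leveled a] -> [a]
reorderLine xs = map snd (foldl reverseAtLevel xs levels)
  where
    hi     = maximum (0 : map fst xs)
    lo     = minimum [ l | (l, _) <- xs, odd l ]
    levels = if any (odd . fst) xs then [hi, hi - 1 .. lo] else []

reverseAtLevel :: [Leveled a] -> Int -> [Leveled a]
reverseAtLevel xs lvl =
  concatMap flip' (groupBy (\a b -> atLvl a == atLvl b) xs)
  where
    atLvl (l, _) = l >= lvl
    flip' grp | atLvl (head grp) = reverse grp
              | otherwise        = grp

-- Example: "abc" at level 0 followed by "DEF" at level 1 (right-to-left):
-- reorderLine [(0,'a'),(0,'b'),(0,'c'),(1,'D'),(1,'E'),(1,'F')] == "abcFED"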

4.6.1.2 Benefits of HaBi

By using a functional language we are able to separate details that are not directly related to the algorithm. In HaBi reordering is completely independent from character encoding. It does not matter what character encoding one uses (UCS4, UCS2, or UTF8). The Haskell type system and the HaBi character attribute function allow the character encoding to change without impacting the reordering algorithm.

Other implementations (Java and C) may find this level of separation difficult to achieve. In C the sizes of types are not guaranteed to be portable, making C unsuitable as a reference. In the Java reference implementation the ramifications of moving to UCS4 are unclear. Our reference presents the steps as simple, easy to understand functions without side effects. This allows implementers to comprehend the true meaning of each step in the algorithm independently of the others while remaining free from language implementation details. The creation of test cases is thus more systematic.

4.7 Problems With Bidirectional Layout Algorithms

The biggest hindrance to the creation of a mechanism for converting logical data streams to display streams lies in the problem description. The problem of bidirectional layout is ill-defined with respect to the input(s) and output(s).

Certainly the most obvious input is the data stream itself. But several situations require additional input in order to correctly determine the output stream. For example, in Farsi, mathematical expressions are written left-to-right while in Arabic they are written right-to-left [41]. This may require a special sub-input (directional control code) to appear within the stream for proper handling to occur. If it becomes necessary to use control codes for obtaining the desired results, the purpose of an algorithm becomes unclear.

The problem of converting logical data streams to display streams is more confounding when one considers other possible inputs (paragraph levels, line breaks, shaping, directional overrides, numeric overrides, etc.). Are they to be treated as separate inputs? If they are treated as being distinct, when, where, and how should they be used?

Determining the output(s) is not simple either. The correct output(s) is largely based on the context in which an algorithm will be used. If an algorithm is used to render text, then appropriate outputs might be a glyph vector and a set of screen positions. On the other hand, if an algorithm is simply being used to determine character reordering, then an acceptable output might just be a reordered character stream.

4.7.1 Unicode Bidirectional Algorithm

The Unicode Bidirectional Algorithm has gone through several iterations over the years. The current textual reference has been greatly refined. Nevertheless, we believe that there is still room for improvement. Implementing a bidirectional layout algorithm is not a trivial matter even when one restricts an implementation to just reordering. Part of the difficulty can be attributed to the textual description of the algorithm. Additionally, there are areas that require further clarification.

As an example, consider step L2 of the Unicode Bidirectional Reference Algorithm. It states the following: “From the highest level found in the text to the lowest odd level on each line, reverse any contiguous sequence of characters that are at that level or higher.” [100] This has more than one possible interpretation. It could mean that once the highest level has been found and processed, the next level for processing should be one less than the current level. It could also be interpreted as meaning that the next level to be processed is the next lowest level actually present in the text, which may be greater than one less than the current level. It was only through an examination of Unicode's Java implementation that we were able to determine the answer. (The next level is one less than the current.)

There are also problems concerning the bounds of the Unicode Bidirectional Algorithm. In the absence of higher-order protocols it is not always possible to perform all the steps of the Unicode Bidirectional Algorithm. In particular, step L4 requires mirrored characters to be depicted by mirrored glyphs if their resolved directionality is R. However, glyph selection requires knowledge of fonts and glyph substitution tables. One possible mechanism for avoiding glyph substitutions is to perform mirroring via character substitutions. In this approach mirrored characters are replaced by their corresponding character mirrors. In most situations this approach yields the same results. The only drawback occurs when a mirrored character does not have its corresponding mirror encoded in Unicode. For example, the square root character (U221A) does not have its corresponding mirror encoded.

When the Unicode Bidirectional Algorithm performs contextual analysis on text it overrides the static properties assigned to some of the characters. This occurs during the processing of weak and neutral types. Separating this portion of the algorithm from resolving implicit levels and reordering levels greatly extends the applicability of the algorithm. Ideally the analysis of the text should be distinct from the actual determination of directional boundaries.

Domain names, mathematical expressions, phone numbers, and other higher-order data elements are detected during the analysis phase. Nevertheless, it is impossible to create an algorithm that can always correctly identify such elements. The real issue is whether or not it is possible to create an algorithm that identifies such elements within some reasonable range of error and under a set of acceptable constraints for the elements themselves.

4.7.2 Reference Implementation

We argue that if source code is now going to serve as a reference, we should pick source code that is more attuned to describing algorithms. We claim to have provided such a reference through the use of Haskell 98. Our HaBi reference is clear and succinct: the complete solution is less than 300 lines of source code, while the Unicode Java reference implementation is over 1000 lines [100].

4.7.3 HaBi

Even though HaBi is a great improvement over current imperative implementations, the functional approach has offered only limited success. Our original goal was to discover the true nature of bidirectional display, in hopes of producing a more succinct algorithm. Unfortunately, we have failed in this attempt. The algorithm appears to be ad hoc, hence it is not very algorithmic.

4.8 Limitations of Strategies

We believe, however, that the real problem of bidirectional layout lies not in the implementations, but rather in the goals of the algorithm. The requirements of a bidirectional layout algorithm are difficult to define. There are numerous interactions between higher-order protocols (line breaking, glyph selection, mirroring) and bidirectional layout. Additionally, there is a limited amount of complex bidirectional text from which to gather a strong consensus, making verification of results extremely difficult.

The major issues that all of the current algorithms suffer from are:

• Lack of separation between inferencing (script boundary detection) and reordering.
• Incorrect output: reordering only makes sense in the context of glyphs.
• Inadequate input: there is simply not enough semantic information in plain text to properly determine directional boundaries.

4.8.1 Metadata

We believe that one of the best strategies for overcoming these limitations is to introduce character metadata into a plain text encoding. This enables additional semantic information to be expressed in a plain text stream as well as allowing for a clear separation between inferencing and reordering. This is the topic of discussion in the next chapter.

Chapter 5 Enhancing Plain Text

In the previous chapter we concluded that the current Unicode based bidirectional layout algorithms fail in two ways: they do not separate inferencing from reordering, and they produce the wrong output. The problem is poor design and a lack of clear levels of abstraction. Code points are used for many different purposes. How should text for natural languages be represented? Is a stream of code points adequate? We think the answer is no. To overcome this deficiency, we examine strategies for enhancing plain text. These enhancements involve mechanisms for describing higher-order protocols. We generally refer to these enhancements as “metadata” (data describing data).

Algorithms that manipulate Unicode should be based upon code points and character attributes, if possible, given that Unicode is a character encoding system. The Unicode Bidirectional Algorithm, among others, is a text algorithm that requires additional input and output (higher-order semantics) over and above the actual code points. Unicode wishes to define such algorithms, however it lacks a general mechanism for universally encoding higher-order semantics. Encoding higher-order semantics into Unicode would permit a cleaner division of responsibilities. Algorithms could be recast to take advantage of this division. To prove this is viable we recast the HaBi algorithm to take advantage of this division, separating the responsibility of determining directional boundaries (inferencing) from reordering.

Below we present the historical use of metadata within character encodings. This is followed by an examination of the presently available paradigms for expressing metadata. Particular attention is given to both Unicode's character/control/metadata model and XML. We then present a universal framework for expressing higher-order protocols within Unicode. Finally, the chapter concludes with evidence demonstrating the benefits and adaptability of the new approach.

5.1 Metadata

The need for expressing metadata has existed ever since humans started communicating with each other. Metadata is primarily expressed through our verbal speech. The tone, volume, and speed in which something is spoken often signals its importance or underlying emotion. Often this is more important than the data itself, and more difficult to codify.

Writing and printing systems also have their need for metadata. This metadata has been variously conveyed through the use of color, style, and size of glyphs. Initially metadata was used as a mechanism for circumventing the limitations of early encoding schemes. As our communication mechanisms advanced so did our need for expressing metadata.

5.1.1 Historical Perspective

One of the earliest uses of metadata appears in Baudot's 5-bit code. Baudot divided his character set into two distinct planes, named Letters and Figures. The Letters plane contained all the uppercase Latin letters while the Figures plane contained the digits and punctuation characters. Together, these two planes shared a single set of code values. To distinguish their meaning Baudot introduced two special meta-characters, letter shift “LTRS” and figure shift “FIGS”. Whenever code points were transmitted they were preceded by either a FIGS or LTRS character. This enabled unambiguous interpretation of characters. It is similar to the shift lock mechanism on a typewriter. For example, line 1 of Figure 5-1 spells out “BAUDOT” while line 2 spells out “?-7$95”. [49],[18]

Figure 5-1. Using LTRS and FIGS in Baudot code

0x1F 0x19 0x03 0x07 0x09 0x18 0x10 BAUDOT (1)

0x1B 0x19 0x03 0x07 0x09 0x18 0x10 ?-7$95 (2)

This still left the problem of how to transmit a special signal to a teleprinter operator. Baudot once again set aside a special code point, named bell “BEL”. This code point would not result in anything being printed, but rather it would be recognized by the physical teleprinter as the command to ring a bell. [49]

About 1900, we see code points being used as format effectors (code points that control the positioning of printed characters on a page). A good example of such usage can be seen in Murray's code. Murray's code introduced two additional characters, column “COL” and line page “LINE PAGE”, known in International Telegraphy Alphabet Number 2 (ITA2) as carriage return and line feed. These characters were used to control the positioning of the rotating typehead and to control the advancement of paper. Murray's encoding scheme was used for nearly fifty years with little modification. It also served as the foundation for future encoding techniques. [49]

During the late 1950s and early 1960s telecommunication hardware rapidly became complex. Consequently, hardware manufacturers needed more highly sophisticated protocols and greater amounts of metadata. For this purpose the US Army introduced a 6-bit character code called FIELDATA. FIELDATA was the first encoding to formally introduce the concept of supervisor codes, known today as control codes. These code points were used to signal communications hardware. [49]

The hardware manufacturers were certainly not alone in their need for metadata. The data processing community soon realized that they also had a need for metadata. This unfortunately taxed the existing encoding schemes (5-bit and 6-bit) so much so as to render them unusable. As a result, richer and more flexible encoding schemes were created, the prime example being the American Standard Code for Information Interchange (ASCII). [12]

ASCII, with its 7-bit encoding, served not only as a mechanism for data interchange, but also had many other special features. One feature was a mechanism for metadata. This metadata could be used for communicating higher-order protocols in both hardware and software. The architecture is based upon ASCII's escape character “ESC” at hex value 0x1B. Initially the ESC was used for shifting between character sets. This was of particular importance to ALGOL programmers. For example, it allowed if to be used as both an identifier and a reserved word simultaneously. The presence of the ESC indicated that the if should be treated as a reserved word, rather than as an identifier [28]. As ASCII was adopted internationally the ESC became useful for signaling the swapping in and out of international character sets. This concept was expanded upon in the 1980s in the ISO-2022 standard. [15],[54],[75]

ISO-2022 is an architecture and registration scheme for intermixing multiple 7-bit or 8-bit encodings using a modal encoding system similar to Baudot’s. Escape sequences or special characters are used to switch between different character sets or between multiple versions of the same character set. This scheme operates in two phases. The first phase handles the switching between character sets, while the second handles the actual characters that make up the text.[57]

Non-modal encoding systems by contrast make direct use of the byte values in determining the size of a character. In non-modal encoding systems characters may vary in size within a stream of text; characters typically range from one to four bytes. This occurs in both the UTF-8 and UTF-16 encodings. [57]
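As an illustration of non-modal behaviour, the size of a UTF-8 character can be read directly from its lead byte. This is standard UTF-8, sketched here only for contrast with the modal ISO-2022 scheme.

-- Sketch: in a non-modal encoding such as UTF-8 the size of a character is
-- determined directly from the lead byte, with no shift state to track.
module Utf8Len where

import Data.Word (Word8)

utf8CharLen :: Word8 -> Maybe Int
utf8CharLen b
  | b < 0x80              = Just 1   -- 0xxxxxxx  ASCII
  | b >= 0xC0 && b < 0xE0 = Just 2   -- 110xxxxx
  | b >= 0xE0 && b < 0xF0 = Just 3   -- 1110xxxx
  | b >= 0xF0 && b < 0xF8 = Just 4   -- 11110xxx
  | otherwise             = Nothing  -- continuation or invalid lead byte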

In ISO-2022 up to four different sets of graphical characters may be simultaneously available, labeled G0 through G3. Escape sequences are used to assign and switch between the individual graphical sets. For example, line 1 of Figure 5-2 shows the byte sequence for assigning the ASCII encoding to the G0 alternate graphic character set. Line 2 shows the Latin 1 encoding being assigned to the G1 set. [3],[75]

Figure 5-2. ISO-2022 escape sequences

ESC 0x28 0x42 assign ASCII to G0 (1)

ESC 0x2D 0x41 assign Latin 1 to G1 (2)

Most data processing tools make little if any distinction amongst data types. Data processing tools simply view information as bytes, leaving the meaning of the data entirely open to human interpretation. For example, UNIX grep assumes that data is represented as a linear sequence of stateless, fixed length, independent bytes. Grep is highly flexible when it comes to searching characters or object code. This model has served text processing well under the assumption that one character equals one code point, but encoding systems have advanced and user expectations have risen. [76]

Over the last ten or so years Unicode has become the standard for encoding multilingual text. This has brought a host of new outcomes that few could have imagined even a decade ago. Despite this advance, users want more than just enough information for intelligible communication. Plain text as a least common denominator is simply insufficient.

There have been several discussions and a few attempts to enrich plain text; ISO-2022 is one, and XML can also be viewed in this framework. Both concern meta information yet have different purposes, goals, and audiences. The transition from storing and transmitting text as plain streams of code points is now well underway. [23]

5.2 Unicode Character Model

Figure 5-3 presents what might be called the Unicode character model. It (like the network model) has a transmission layer at the bottom and an application layer at the top. The layers in between are the focus of this chapter. The Character/Control layer is intentionally depicted in a gradient fashion to illustrate the vague boundary separating characters from control. This lack of separation has made it more difficult to write applications that use Unicode.

Figure 5-3. Unicode Character Model

5.2.1 Transmission Layer

The transmission of text across text processes ranges from simple byte exchange to stateful encoding transmission. In chapter 3 we discussed in detail the various ways in which Unicode/ISO-10646 characters can be transmitted.

5.2.2 Code Point Layer

The most frequently occurring method of character identification is by numeric value. This approach has formed the hallmark of character encoding schemes for nearly forty years, the ASCII encoding scheme being one example. Ideally the code point layer maps binary values to abstract character names.

In ASCII, by contrast, the only unambiguous way to identify characters is by their numeric value. The use of abstract character names within ASCII has never been standardized, limiting our ability to unambiguously refer to characters. Consider the code point 0x5F in ASCII. This code point has been referred to as “spacing underscore”, “low line”, and “horizontal bar”. This may not seem problematic as their glyphs are the same. This situation, however, has caused characters to be treated as if they were glyphs rather than as abstract entities, making the distinction between characters and glyphs almost impossible.

Additionally, having the highest level of abstraction limited to just the Code Point Layer does not permit a clean separation between text processes and the characters they manipulate. This problem is quite evident in parsing ASCII files. In particular, consider the numerous ways in which a line end character may be expressed.

Alternatively, characters can be identified by their abstract name or binary sequence. This technique is employed within ISO-10646. ISO-10646 provides a one-to-one mapping from code points to character names. This allows characters to be referred to unambiguously by either their abstract name or binary value. For example, consider once again the code point value 0x5F. In ISO-10646 this code point value has but one name, “LOW LINE”.

This additional layer of abstraction still leaves open the question of what a character's name means. In ISO-10646 characters do not possess any properties other than their name. Unfortunately this places the burden of assigning properties into the hands of text processes, resulting in wide variation.

5.2.3 Character/Control Layer

Unicode defines the term character as “The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape.” This definition is at odds with what is actually present within Unicode. There are several characters defined within Unicode that do not belong to any written language, “controls” (characters that cause special actions to occur, but are not considered part of the data), as well as characters that specifically convey a definitive typographic shape, “glyphs”. See Table 5-1. This intermixing has made it nearly impossible to determine when a character should be treated as data, metadata, or a glyph. Furthermore, the method by which new control codes are introduced is completely ad hoc. There is no standard range within Unicode that contains just controls. [96]

Table 5-1. Problem Characters

Type                 Code Point                         Purpose
OBJ REPLACE          0xFFFC                             placeholder for external objects
ANNOTATION           0xFFF9 - 0xFFFB                    used for formatting ruby characters in Japanese
CHAR REPLACE         0xFFFD                             placeholder for non convertible characters
CONTROL              0x0000 - 0x001F, 0x007F - 0x0096   C0 and C1 control characters, used for legacy

NON BREAKING SPACES  0x00A0, 0xFEFF, 0x202F             non breaking space
NSM                  0x0300 - 0x0362                    non spacing diacritical marks used to form glyphs
SPACES               0x2000 - 0x200A, 0x200B, 0x3000    en, em, thin, hair, zero width, ideographic spaces
SEPARATORS           0x2028 - 0x2029                    line and paragraph separator

Table 5-1. Problem Characters (Continued)

Type                     Code Point                         Purpose
ZWNJ                     0x200C                             zero width non joiner, prevent ligature formation
ZWJ                      0x200D                             zero width joiner, promote ligature formation
LRM                      0x200E                             left-to-right mark, used in bidi processing
RLM                      0x200F                             right-to-left mark, used in bidi processing
BIDI CONTROLS            0x202A - 0x202E                    used for overriding directional properties, and for embedding directional runs
SHAPING                  0x206A - 0x206F                    used to control Arabic shaping
PRESENTATION FORMS       0xFB50 - 0xFDFF, 0xFE70 - 0xFEFE   Arabic ligatures and glyph variants used by rendering engines
HALFWIDTH AND FULLWIDTH  0xFF00 - 0xFFEF                    half-width and full-width forms used in rendering engines

5.2.4 Character Property Layer

Clearly there exists a need for character properties. Text processes want to be able to interchange and interpret text unambiguously. Unicode adds an additional layer of abstraction onto ISO-10646. In Unicode each character may possess properties from three general areas: “normative”, “informative”, and “directional”. For example, see Table 5-2. [32]

Table 5-2. Character Properties

Code Point   Character Name           Normative          Informative   Directional
0x005F       LOW LINE                 punctuation        connector     neutral
0x0061       LATIN SMALL LETTER A     lowercase letter                 left to right
0x0661       ARABIC-INDIC DIGIT ONE   decimal digit                    arabic numeral

5.3 Strategies for Capturing Semantics

There are two general approaches to encoding higher-order semantics within text streams: in-band signaling and out-of-band signaling.

Using in-band signaling, determining whether a character is data or metadata depends on the context in which the character is found. That is, code points are overloaded. This achieves maximal use of the character encoding, as characters are not duplicated. It also does not require encoding modifications as protocols change. All of this progress comes at the expense of parsing. It is no longer possible to conduct a simple parse of a stream looking for just data or metadata. This technique is employed within the HTML and XML protocols. [99]

Using out-of-band signaling for describing Unicode metadata requires the definition and transmission of complex structures similar to document type definitions (DTDs) in XML. This has the ill effect of making the transmission of Unicode more intricate. It would no longer be acceptable to simply transmit the raw Unicode text; without the metadata, the meaning of the raw text may be ambiguous. On the other hand, parsing of data and metadata would be trivial, given that the two are not intermixed. [99]

5.3.1 XML

The extensible markup language (XML) provides a standard way of sharing structured documents and for defining other markup languages. XML uses Unicode as its character encoding for data and markup. Control codes, data characters, and markup characters may appear intermixed in a text stream. Confusion may ensue when control codes, data characters, and markup characters are combined with overlapping higher-order protocols. Additionally, there may be situations in which markup and control should not be interleaved. This issue is quickly being realized both within XML and Unicode. [23]

Whitespace characters in XML are used in both markup and data. The characters used in XML to represent whitespace are limited to “space”, “tab”, “carriage return”, and “line feed”. Unicode, on the other hand, offers several characters for representing whitespace (e.g., the line separator U2028 and the paragraph separator U2029). The use of U2028 and U2029 within XML may lead to confusion because of the conflicting semantics. [33]

In Unicode these characters may be used to indicate hard line breaks and paragraphs within a data stream. These may affect visual rendering as well as acting as separators. When used within XML it is unclear whether the implied semantics can be ignored. Does the presence of one of these control codes indicate that a rendering protocol is being specified in addition to their use as whitespace, or are they simply whitespace? [23]

XML completely excludes certain Unicode characters from names (tags). The characters in the compatibility area (UF900-UFFFE) are not permitted to be used within XML names. Their exclusion is due in part to the characters being already encoded in other places within Unicode. This is by no means the only reason. If characters from the compatibility area were included, the issue of normalization would then need to be addressed. (In this context normalization refers to name equivalence.) [33]

Unicode provides guidelines and an algorithm for determining when two character sequences are equivalent. Unicode attempts to address this issue in Unicode Technical Report #15 Unicode Normalization Forms. In general there are two kinds of normalization, Canonical and Compatibility. [101]

Canonical normalization handles equivalence between decomposed and precomposed characters. This type of normalization is reversible. Compatibility normalization addresses equivalence between characters that do not visually appear the same. This type of normalization is irreversible. [32],[101]
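The distinction can be illustrated with a toy decomposition table; this is only a sketch of the idea, not the UAX #15 algorithm, and the function names are ours.

-- Sketch: a canonical mapping that is information-preserving versus a
-- compatibility mapping that discards the presentation distinction.
module NormSketch where

type Ucs4 = Int

-- Canonical: U+00D6 (O with diaeresis) <-> U+004F U+0308, reversible.
canonicalDecompose :: Ucs4 -> [Ucs4]
canonicalDecompose 0x00D6 = [0x004F, 0x0308]
canonicalDecompose cp     = [cp]

-- Compatibility: U+FB00 (ff ligature) -> "ff"; the ligature intent is
-- lost, so the mapping cannot be reversed.
compatibilityDecompose :: Ucs4 -> [Ucs4]
compatibilityDecompose 0xFB00 = [0x0066, 0x0066]
compatibilityDecompose cp     = canonicalDecompose cp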

Compatibility normalization in particular is problematic within XML. XML is designed to represent raw data free from any particular preferred presentation. Characters that may be compatible do not necessarily share the same semantics [32]. It may be the case that an additional protocol is being specified within the stream. For example, the UFB00 character on line 1 of Figure 5-4 is compatible with the two character sequence “U0066 U0066” on line 2 of Figure 5-4. Line 1 also specifies an additional protocol, in this case ligatures. In such a situation it is unclear whether or not the names were intended to be distinct. It is difficult to tell when the control function (higher-order protocol specification) of a character can be ignored and when it cannot. Additionally, some have argued that Unicode's Normalization Algorithm is difficult to implement, resource intensive, and prone to errors. To avoid such problems XML has chosen not to perform normalization when comparing names. [18],[33]

Figure 5-4. Compatibility Normalization

UFB00 ff ligature (1)

U0066 U0066 ff no ligature (2)

Problems such as these are due to the lack of separation between syntax and semantics within Unicode. The absence of a general mechanism for specifying higher-order protocols (“metadata”) only serves to further confound these issues. [18],[33]

5.3.2 Language Tagging

Over the years there has arisen a need for language “tagging” of multilingual text. In some cases such information is necessary for correct behavior. For example, language information becomes necessary when an intermixed stream of Japanese, Chinese, and Korean is to be rendered. In this case, the unified Han characters need to be displayed using language specific glyph variants. Without higher-order language information it becomes difficult to select the appropriate glyphs.

Recently Unicode has added a mechanism for encoding language information within plain text. This is achieved in Unicode through the introduction of a special set of characters that may be used for language tagging. The current strategy under consideration within Unicode is to add 97 new characters to Unicode. These characters comprise copies of the ASCII graphic characters, a language tag character, and a cancel tag character. These characters would be encoded in Plane 14 “surrogates”, U000E0000 - U000E007F. [99]

The use of the tags in Unicode is very simple. First a tag identifier character is chosen, followed by an arbitrary number of Unicode tag characters. A tag is implicitly terminated when either a non-tag character is found or another tag identifier is encountered. Currently there is only one tag identifier defined, the language tag. See Figure 5-5. Line 1 of Figure 5-5 demonstrates the use of the fixed code point language tag “U000E0001” along with the cancel tag “U000E007F”. The Plane 14 ASCII graphic characters are in bold and are used to identify the language. The language name is formed by concatenating the language id from ISO-639 and the country code from ISO-3166. In the future a generic tag identifier may be added for private tag definitions. [99]

Figure 5-5. Language tag

U000E0001 fr-FR french text U000E0001 U000E007F (1)
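As a sketch of that construction (function names are ours; the tag characters are simply ASCII graphics shifted into the Plane 14 block):

-- Sketch: building a Plane 14 language tag for "fr-FR" as described above.
module LangTag where

import Data.Char (ord)

type Ucs4 = Int

languageTagChar :: Ucs4
languageTagChar = 0xE0001          -- fixed language tag identifier

cancelTagChar :: Ucs4
cancelTagChar = 0xE007F            -- cancel tag

-- Shift an ASCII graphic character into the tag character block.
toTagChar :: Char -> Ucs4
toTagChar c = 0xE0000 + ord c

-- e.g. languageTag "fr-FR" == 0xE0001 : map toTagChar "fr-FR"
languageTag :: String -> [Ucs4]
languageTag name = languageTagChar : map toTagChar name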

Tag values can be cancelled by using the tag cancel character. The cancel character is simply appended onto a tag identifier. This has the effect of cancelling that tag identifier’s value. If the cancel tag is transmitted without a tag identifier the effect is to cancel any and all processed tag values. [99]

The value of a tag continues until either it implicitly goes out of scope or a cancel tag character is found. Tags of the same type may not be nested. The occurrence of two consecutive tag types simply applies the new value to the rest of the unprocessed stream. Tags of differing types may be interlocked. Tags of different types are assumed to ignore each other; there are no dependencies between tags. [99]

Tag characters have no particular visible rendering and have no direct effect on the layout of a stream. Tag aware processes may choose to format streams according to their own interpretation of tags and their associated values. Tag unaware processes should leave tag data alone and continue processing. [99]
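Because the tag characters occupy a fixed code point range, recognizing them takes only a range test. The sketch below (function names are ours) filters them out, for example when measuring displayable text; a tag-unaware process would instead simply pass them through.

-- Sketch: Plane 14 tag characters live at U+E0000..U+E007F, so a single
-- range test identifies them.
module TagFilter where

type Ucs4 = Int

isTagChar :: Ucs4 -> Bool
isTagChar cp = cp >= 0xE0000 && cp <= 0xE007F

-- Drop all tag characters, leaving the plain text untouched.
stripTags :: [Ucs4] -> [Ucs4]
stripTags = filter (not . isTagChar)

-- Keep tags but ignore them, e.g. when counting displayable characters.
displayLength :: [Ucs4] -> Int
displayLength = length . stripTags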

5.3.2.1 Directional Properties of Language Tags

In this section we show that language tags can interact with other facets of Unicode including the Bidirectional Algorithm.

In Unicode, language tag characters are all marked as having a neutral direction. Neutral characters pick up their direction from the surrounding strong characters (left or right). This seems reasonable as we do not wish the tags to accidentally influence the Bidirectional Algorithm. If the direction of the tags were left-to-right or right-to-left, rather than neutral, then the tags would influence the resolution of weak and neutral types due to their juxtaposition. [96]

The example in Figure 5-6 demonstrates this error. In Figure 5-6 Arabic characters are represented in upper case. The character sequence LANGar (Arabic language tag) in bold is a visual representation for the Unicode sequence (UE0001,UE0061,UE0072). In this example assume that the language tag characters are assigned the left-to-right direction. Line 1 is a sequence of characters in logical order, while line 2 is the expected resultant display ordering. The display ordering in line 3 is incorrect, because the tag characters inadvertently participated in bidirectional processing. Marking language tags as neutral makes sense in this framework of protecting the Bidirectional Algorithm.

Ironically, in the process of protecting the Bidirectional Algorithm we inadvertently allowed the Bidirectional Algorithm to influence language tags. For example, line 1 of Figure 5-7 is a sequence of Arabic characters in logical order with an embedded language tag, Urdu (UE0001,UE0075,UE0072). Line 2 is the same sequence of characters, but in display order (output from the Bidirectional Algorithm). When the character sequence is rendered, however, the language tag is displayed backwards and appears as “ru”, which indicates Russian. The reason the language tag is displayed backwards is because neutral characters pick up their direction from the surrounding characters, in this case right-to-left. Clearly this is undesirable and therefore must be prevented.

Figure 5-6. Error in bidirectional processing

CIBARA LANGar, 123 (1)

123 , LANGar ARABIC (2)

LANGar ,123ARABIC (3)

Figure 5-7. Error in language tagging

U0624,UE0001,UE0075,UE0072,U0623 — lang ur (Urdu), logical (1)

U0623,UE0072,UE0075,UE0001,U0624 — lang ru (Russian), display (2)

This problem can be solved by introducing a new bidirectional property, “ignore”. This will enable the Bidirectional Algorithm to continue to function properly while also protecting the semantics of tags. Characters that possess the direction ignore will not have any direction and will not pick up any surrounding direction, preventing these characters from participating in the Bidirectional Algorithm. [96],[100]

5.3.3 General Unicode Metadata

It is still possible to construct a metadata signaling mechanism for the specific purpose of mixing data and metadata and yet allow for simple parsing. This is called “light-weight in-band signalling”. This is the approach that Unicode has adopted for language tagging. [99]

The light-weight approach is useful, but Unicode's application of it creates two problems. First, new tag identifiers always require the introduction of a new Unicode code point. This puts Unicode in a constant state of flux as well as fixing the number of possible tag identifiers. Second, there is no way to specify multiple parameters for a tag. This deficiency forces the creation of additional tag identifiers to circumvent this limitation.

We propose that a more generalized approach be taken. Our design philosophy is to encode a minimal set of stateless metadata characters that enable the definition of higher-order protocols. Our use of the term stateless in this context refers to whether a character's type (data or metadata) can be determined by its code point value alone. In some metadata systems (HTML and XML) a character's type can only be determined by examining the full context in which it appears. Unfortunately, these systems require sophisticated parsing techniques. We argue for the use of stateless characters in our system because they simplify both parsing and understanding.

Naturally, the construction of a metadata encoding system requires a balancing of trade-offs. One such trade-off is whether or not to define a syntax for the definition of tags. We believe that we should specify a minimal and flexible syntax for tags that allows for unambiguous communication, yet does not impose any particular style of protocol such as HTML.

In choosing the set of metadata characters we suggest that we keep the copy of the ASCII graphic characters that are used in Unicode's language tagging model. We should however remove the fixed code point tag identifiers. In their place we introduce two new characters, the tag delimiter U000E0001 and the tag argument separator U000E0002. See Table 5-3. The motivation for these characters comes from the SGML/XML/HTML camps. These characters provide an easy migration path for embedding XML-like protocols within Unicode. The use of these characters is by no means required; applications may choose alternative methods. On the other hand, the use of the tag delimiter and the tag argument separator helps prevent confusion over whether a sequence of tag characters represents a tag name or a tag argument. Additionally, the use of these characters guarantees that tag neutral tools can be created. Such tools can always count on the fact that consecutive tags are delimited and that their arguments are separated by tag argument separators.

Table 5-3. Tag characters

Tag Characters        UCS-4 Representation      Visual                 Purpose
delimiter             U000E0001                 |                      control
argument separator    U000E0002                 @                      control
space                 U000E0020                                        display
graphic characters    U000E0021 - U000E007E     a-z, A-Z, 0-9, etc.    display

The tag delimiter character is used to separate consecutive tags from one another, while the tag argument separator is used to delineate multiple tag arguments. This approach allows the same set of tag graphic characters to be used for both tag names and tag arguments. Additionally, tag names are spelled out rather than being assigned to a fixed single code point.

The overall construction of tags will still remain simple. First, the tag name is spelled out using the tag graphic characters, followed by an optional tag argument separator. Second, there may be an arbitrary number of tag arguments for each tag, each argument being separated by a tag argument separator. A tag name is terminated by either encountering a tag argument separator or an optional tag delimiter. This still allows for relatively simple parsing. The regular expressions for tags, tag names, and tag arguments can be found on lines 4-7 of Figure 5-8. The usage of tags in text streams is also very simple to comprehend. See the regular expression on line 2 of Figure 5-9.

Figure 5-8. Regular expressions for tags

<tag-char> ::= [U000E0020-U000E007E] (1)

<tag-delimiter> ::= U000E0001 (2)

<tag-argument-separator> ::= U000E0002 (3)

<tag-name> ::= <tag-char>+ (4)

<tag-argument> ::= <tag-char>+ (5)

<tag-arguments> ::= (<tag-argument-separator> <tag-argument>)* (6)

<tag> ::= <tag-name> <tag-arguments> <tag-delimiter> (7)

Figure 5-9. Regular expression for text stream

<unicode-char> ::= [U00000000-U0002FA1D] (1)

<text-stream> ::= (<unicode-char> | <tag>)+ (2)

Throughout this chapter tag characters are represented in boldface. Additionally, the “|” is used to depict the tag delimiter and the “@” denotes the tag argument separator. See Table 5-3. For example, line 1 of Figure 5-10 shows a stream with two tags, “XX” and “YY”. Additionally, the tag “XX” has one argument “a”, and the “YY” tag has two arguments “b” and “c”. The example suggests the nesting of “YY” within “XX”. In this sample protocol we terminate the scope of a tag by repeating the tag name preceded by the tag graphic character solidus “/”. This method of terminating scope is not a requirement; protocols may adopt other methods or none at all.

The semantics of such combinations are left to protocol designers rather than the metadata. This affords the greatest flexibility and yet still retains the ability to perform simple parsing. This design allows Unicode to simply be in the business of defining mechanism rather than mechanism and policy.

Figure 5-10. Sample tag

defXX@a|YY@b@c|ghi/YY|jkl/XX| (1)
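To illustrate how little machinery such parsing needs, here is a sketch that splits a stream into data and metadata segments purely by code point value. The names (Segment, segment) are ours, and the range test assumes the proposed tag block from Table 5-3.

-- Sketch: because the proposed metadata characters live in a fixed range,
-- the data/metadata split is a stateless test on each code point.
module MetaSplit where

import Data.List (groupBy)

type Ucs4 = Int

data Segment = Data [Ucs4] | Meta [Ucs4] deriving Show

isMeta :: Ucs4 -> Bool
isMeta cp = cp >= 0xE0001 && cp <= 0xE007E   -- delimiter, separator, graphics

segment :: [Ucs4] -> [Segment]
segment = map wrap . groupBy (\a b -> isMeta a == isMeta b)
  where
    wrap grp | isMeta (head grp) = Meta grp
             | otherwise         = Data grp

-- No look-ahead or escaping is needed: a tag-neutral tool can keep the
-- Data segments and pass the Meta segments through untouched.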

It is foreseeable that Unicode would remain the registrar of tag identifiers, while working in conjunction with other standards bodies. This does not, however, preclude private tags from being defined for those cases in which widespread protocol adoption is not required.

Similarly, the semantics of cancelling or ending the scope of tags will also be left to the protocol designer. It is possible that in some protocols tag cancellation might undo the last tag, while in others it may end the scope of a tag. Additionally, there is no requirement that either of these interpretations be used at all.

The example in Figure 5-11 shows how the language tag would be represented in the new tagging model. Line 1 of Figure 5-11 is copied from Figure 5-5. Line 2 shows the language tag spelled out with the two tag arguments being clearly delineated. We suggest that the spelling out of tag names is a small price to pay for this enhanced functionality.

Figure 5-11. Alternative language tag

U000E0001 fr-FR french text U000E0001 U000E007F (1)

LANG@fr@FR| french text /LANG| (2)

5.4 Encoding XML

In this section we demonstrate through the use of examples how our metadata encoding system can be used to encode XML and XML-like protocols. The primary syntactic elements for which we provide mappings in our examples include tags, entity references, and comments. We believe that these elements are sufficient for demonstrating that the mapping of XML to metadata is both feasible and simple.

The example in Figure 5-12 shows a sample address book modeled in XML. The address book contains a collection of contacts, in this case just one contact. Each contact in turn contains a name and address. In this example we see three general types of tags:

• Self closing tags: the <business/> tag on line 8 of Figure 5-12.
• Start tags: the <addressbook> tag on line 3 of Figure 5-12.
• End tags: the </contact> tag on line 13 of Figure 5-12.

Additionally, each tag may optionally contain arguments. For example, the <contact> tag on line 5 of Figure 5-12 contains a single tag argument. In this tag the string type represents the name of the argument while the string business is its corresponding value. [72]

Figure 5-12. Sample XML code
1  <?xml version="1.0"?>
2  <!ENTITY amp "&#x0026;">
3  <addressbook>
4  <!-- this is an address -->
5  <contact type="business">
6  <name>Steve&apos;s Bar &amp; Grill</name>
7  <nickname>&gt;Steve&lt;</nickname>
8  <business/>
9  <address>150 W. University Blvd</address>
10 <city>Melbourne</city>
11 <state>FL</state>
12 <zip>32901</zip>
13 </contact>
14 </addressbook>

The sample XML file also contains entity references. In XML, entity references assign aliases to pieces of data. An entity reference serves as a unique reference to some XML data. Entity references are made by using an ampersand and a semicolon. For example, line 6 of Figure 5-12 shows a reference to the amp entity (ampersand), while the definition of the reference is shown on line 2. Normally, these characters would be processed differently in XML. However, with entity references an XML parser does not get confused. [72]

In this example we also see the use of a comment. See line 4 of Figure 5-12. Comments are used to provide additional information about an XML document, however they are not actually part of the data in an XML document. Comments begin with a “<!--” and end with a “-->”. The only restriction XML places upon comments is that they do not contain a double hyphen “--”, as it conflicts with XML's comment syntax. [72]

In general, when we map an XML document to combined data/metadata most characters remain unchanged. This is particularly true of characters that are not part of XML tags. Characters that make up an XML tag, however, get mapped to a corresponding metadata graphic tag character. We believe that there is some flexibility in this mapping. In particular, the < and > characters do not need to be mapped at all. These characters are not needed, because an XML parser can now immediately tell whether a character is part of a tag or not simply by examining its code point value. Therefore, it is unnecessary to map these characters.

The text in Figure 5-13 is the same XML document from Figure 5-12, encoded using our metadata tagging system. In certain cases we have changed the syntax of the XML tags slightly to take advantage of our metadata system and to be more consistent with our general tag syntax. In particular, these changes can be seen on lines 1, 2 and 5. Additionally, we have created a tag specifically for the purpose of indicating a comment. See line 4 of Figure 5-13. To indicate whether a tag is a start tag, end tag, or a self closing tag we have adopted the following convention:

• If the tag is an end tag then the tag name is preceded by a solidus “/”.
• If the tag is a self closing tag then the tag name is followed by a solidus.
• All remaining tags are assumed to be start tags.

It is important to note that this convention is not the only way in which the tag type can be expressed. It is quite possible that some other convention could be adopted. One alternative would be to indicate the type of a tag as a tag argument, rather than as part of the name. We have chosen to indicate the type as part of the tag name, as this closely matches XML's syntax.

Figure 5-13. Sample XML code encoded in metadata
1  |?@version="1.0"|
2  |ENTITY@amp@CP@U@0026|
3  |addressbook|
4  |cmt@this is an address|
5  |contact@type="business"|
6  |name|Steve|&apos|s Bar |&| Grill|/name|
7  |nickname||>|Steve|<||/nickname|
8  |business/|
9  |address|150 W. University Blvd|/address|
10 |city|Melbourne|/city|
11 |state|FL|/state|
12 |zip|32901|/zip|
13 |/contact|
14 |/addressbook|

The remaining issue involving the mapping of XML to metadata deals with the representation of characters outside the graphic tag character range. In XML it is legal to use characters outside the ASCII range in tags and in data. This is possible because XML uses Unicode as its native encoding system. It is possible to represent these characters in our metadata system using our code point protocol. See line 2 of Figure 5-13. The raw code point protocol embeds any Unicode code point in metadata. The CP tag identifies the code point protocol, U indicates the Unicode character set, and 0026 is the hexadecimal value of the Unicode code point.

We have chosen not to encode a special metadata escape character for the purpose of embedding characters, because the encoding of such a character violates one of the key goals in our metadata system, stateless encoding. In our metadata system the meaning of each character is unambiguous and independent from any other character. If we were to encode an escape character then the digits following it would need to be overloaded. Sometimes the digits would be tag digits and sometimes they would be escaped digits. This makes the meaning of the digits contextual. This kind of overloading and contextual behavior returns us to XML's problems, which our strategy has been engineered to avoid.

5.5 New XML

Encoding XML in our metadata system offers several advantages. First, the use of the tag delimiter “|” around each tag is actually unnecessary. The tag delimiter is only required for consecutive tags, and even then only one delimiter is necessary. Each tag is immediately detectable by simply examining its code point value. This is not possible in XML, hence XML needs to clearly indicate the bounds of each tag.

Second, in XML there are certain characters that need to be escaped so that they can be used in data. These characters include the less-than “<” and greater-than “>” characters. Normally, these characters are used for indicating the bounds of a tag. When they are meant to be interpreted as data and not as tag indicators, they must be referred to via entity references, so as to avoid confusing the XML parser. For example, consider the “&gt;” and “&lt;” entity references on line 7 of Figure 5-12. In this example the “&gt;” refers to the greater-than character and the “&lt;” refers to the less-than character. In our metadata scheme this is completely unnecessary. There is never any confusion as to whether a character represents data or metadata. Therefore, we do not need this mechanism at all in our form of XML.

5.6 Text Element

Now that a definition of a general metadata mechanism has been established, tags other than language may be constructed. In the next section we see one such example, the text element tag. This tag will enable the embedding of additional linguistic information into plain text streams.

Traditionally text processes manipulated ASCII data with the implicit understanding that every code point equated to a single character and in turn a single text element, which then served as a fundamental unit of manipulation. In most cases this assumption held, especially given that only English text was being processed. [96]

Multilingual information processing, however, breaks the assumption that code points, characters, and text elements are all equal. Text elements are directly tied to a text process, script, and language. Encodings today provide an abstract set of characters directly mapped onto a set of numerals. The abstract characters are then grouped to form text elements. [96]

In some cases a text element may still equate to a single character, while in other situations a text element may comprise several characters. For example, in Spanish the character sequence “ll” is treated as a single text element when sorted, but is treated as two text elements “l” and “l” when printed. [96]

Unicode relies on an abstract notion of characters and text elements. Unfortunately, a general mechanism for indicating text elements is lacking. In some instances a text element is implicitly specified through a sequence of characters. For example, line 1 of Figure 5-14 shows how a base character and a non spacing diacritic combine to form a single text element, while line 2 of Figure 5-14 shows the equivalent precomposed character.

Figure 5-14. Combining characters

U004F U0308 Ö decomposed (1)

U00D6 Ö precomposed (2)

In other cases text elements are explicitly specified by control codes. In particular, Unicode uses control codes (e.g., the zero width joiner U200D and the zero width non joiner U200C) for forming visual text elements. These characters affect ligature formation and cursive connection of glyphs. The intended semantic of the zero width non joiner is to break cursive connections and ligatures. The zero width joiner is designed to form a more highly connected rendering of adjacent characters. [97]

For example, line 1 of Figure 5-15 shows the sequence of code points for constructing a ligature. The characters x and y represent arbitrary characters. Line 2 shows how the zero width non joiner can be used to break a cursive connection. Problems still arise when one wishes to suppress ligatures while still promoting cursive connections. In this situation Unicode recommends combining the zero width non joiner and the zero width joiner. See line 3 of Figure 5-15. [97]

Rather than using control codes with complicated semantics and implicit sequences of characters to form text elements, a simple generalized mechanism can be used. Nonetheless, Unicode has no general way to indicate that sequences of characters should be viewed as a single text element. The current approach relies on a higher-order protocol outside of Unicode, such as XML, which is designed to describe the structure of documents and collections of data, not individual characters and text elements. XML requires data to strictly adhere to a hierarchical organization. This may be appropriate for documents, but can be troublesome for a simple text stream. The model that is really required needs to be organized around characters and text elements.

This can be achieved through metadata tags and simple protocols. For example, the zero width joiner and zero width non joiner characters can be described by a new tag, the text element tag “ELM”. The ELM tag is used to group multiple characters together so that they can be treated as a single unit or text element. For example, line 1 of Figure 5-16 shows a text element “xy” for all purposes.

When characters are grouped together it may be for the purpose of rendering, sorting, or case conversion. The purpose of the grouping does not need to be understood by Unicode. The semantics should only be determined by processes that make direct use of such information. The tag is simply a conduit for signaling higher-order semantics.

For example, line 2 of Figure 5-16 shows a text element “xy” for the purposes of forming ligatures, but not searching or sorting. Line 3 demonstrates the text element “xy” being cursively connected, while suppressing ligature formation.

Additionally, the ELM tag can be used to form other semantic groupings. For example, in Spanish when “c” is followed by “h” the two single characters combine to form the single text element “ch”. See line 4 of Figure 5-16. This grouping does not affect rendering, but has implications in sorting. In German, however, groupings affect case conversion. For example, the character sequence “SS” when converted to lowercase results in the single character “ß”. See line 5 of Figure 5-16. [96]

Figure 5-15. Joiners

x U200D y (1)

x U200C y (2)

x U200D U200C U200D y (3)

Figure 5-16. ELM tag

ELM|xy/ELM| (1)

ELM@LIG|xy/ELM| (2)

ELM@JOIN|xy/ELM| (3)

ELM@COLL|ch/ELM| (4)

ELM@CASE|SS/ELM| (5)
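
To make the grouping mechanism concrete, the following Haskell sketch models a text stream as a mix of plain characters and ELM-grouped text elements. The type and constructor names (Item, Plain, Elem, Purpose) are illustrative assumptions introduced for this sketch only; they are not part of any existing implementation.

module TextElement where

-- Grouping purposes corresponding to the tag arguments of Figure 5-16.
data Purpose = Lig | Join | Coll | Case
  deriving (Show, Eq)

-- A stream is a mix of ordinary data characters and grouped text elements.
data Item
  = Plain Char                  -- an ordinary data character
  | Elem (Maybe Purpose) String -- an ELM-grouped text element
  deriving (Show, Eq)

-- Line 1 of Figure 5-16: "xy" grouped for all purposes.
allPurposes :: [Item]
allPurposes = [Elem Nothing "xy"]

-- Line 4 of Figure 5-16: Spanish "c" and "h" grouped for collation.
spanishCh :: [Item]
spanishCh = [Elem (Just Coll) "ch"]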

Table 5-4 lists a few other types of tags that are also based upon the general text element tag. Each entry in Table 5-4 specifies a specific Unicode semantic construct and its associated metadata tag.

Table 5-4. Other text element tags

Semantic Construct  Metadata Tag

Ligatures ELM@LIG|xy/ELM|

Glyph Variant ELM@FNT|xy/ELM|

Non Breaking ELM@NBR|xy/ELM|

Initial Form ELM@INI|xy/ELM|

Medial Form ELM@MED|xy/ELM|

Final Form ELM@FIN|xy/ELM|

Isolated Form ELM@ISO|xy/ELM|

Circle ELM@CIR|xy/ELM|

Superscript ELM@SUP|xy/ELM|

Subscript ELM@SUB|xy/ELM|

Vertical ELM@VER|xy/ELM|

Wide ELM@WID|xy/ELM|

Narrow ELM@NAR|xy/ELM|

Small ELM@SMA|xy/ELM|

Square ELM@SQU|xy/ELM|

Fraction ELM@FRA|xy/ELM|

5.7 Metadata and Bidirectional Inferencing

Plain text streams that contain characters of varying direction pose a particular problem for determining the correct visual presentation. There are several instances in which it is nearly impossible to render bidirectional text correctly in the absence of any higher-order information. In particular, picking glyphs requires that a rendering engine have knowledge of fonts.

The Unicode Bidirectional Algorithm operates as a stream-to-stream conversion, which is logical given that Unicode is a character encoding mechanism and not a glyph encoding scheme [96]. This output, however, is insufficient by itself to correctly display bidirectional text. If a process is going to present bidirectional text then the output must be glyphs and glyph positions. The Unicode Bidirectional Algorithm cannot produce this output and still remain consistent with Unicode's primary character encoding scheme.

The core of the Unicode Bidirectional Algorithm is centered around three aspects: resolving character types, reordering characters, and analyzing mirrors. The bidirectional algorithm is applied to each paragraph on a line-by-line basis. During resolution, characters that do not have a strong direction are assigned a direction based on the surrounding characters or directional overrides. In the reordering phase, sequences of characters are reversed as necessary to obtain the correct visual ordering. Finally, each mirrored character (parentheses, brackets, braces, etc.) is examined to see if it needs to be replaced with its symmetric mirror. [96]

This algorithm causes an irreversible change to the input stream, which is a significant flaw. The logical ordering is no longer available. This inhibits the construction of an algorithm that takes as input a stream in display order and produces as output its corresponding logical ordering. The example in Figure 5-17 illustrates this problem. In Figure 5-17 Arabic letters are depicted by uppercase Latin letters, while the right square bracket indicates a right-to-left override (U202E). Line 1 is a stream in display order; lines 2 and 3 are streams in logical order. In either case, if the Bidirectional Algorithm is applied to line 2 or line 3, the result is line 1.

Figure 5-17. Mapping from display order to logical order

123 (DCBA) (1)

(ABCD) 123 (2)

]123 (ABCD) (3)

It is also impossible to tell whether a stream has been processed by the Bidirectional Algorithm. The output does not contain any identifying markers to indicate that a stream has been processed. A text process can never be sure whether an input stream has undergone bidirectional processing. To further complicate the situation, the Bidirectional Algorithm must be applied on a line-by-line basis. This is not always easy to accomplish if display and font metrics are not available.

We introduce three tags for bidirectional processing: paragraph "PAR", direction "DIR", and mirror "MIR". The PAR tag signifies the beginning of a paragraph. It takes one argument, the base direction of the paragraph, either right "R" or left "L".

The DIR tag takes one argument as well, the resolved segment's direction, either "L" or "R". The MIR tag does not require any argument. Its presence indicates that the character it is attached to should be replaced by its symmetric mirror. The scope of the DIR tag is terminated by either a PAR tag or the end of the input stream.

For example, in Figure 5-18 line 1 represents a stream of characters in logical order. Line 2 is the output stream after running the Bidirectional Algorithm using tagging. Arabic letters are represented by uppercase Latin letters. Tag characters are indicated in bold. The at sign represents the tag argument separator and the vertical bar represents the tag separator "U000E0001". The output of the algorithm only inserts tags to indicate resolved directional boundaries and mirrors. The data characters still remain in logical order.

Furthermore, the bidirectional embedding controls "LRE", "RLE", "LRO", "RLO", and "PDF" can be eliminated because they are superseded by the DIR tag. These controls act solely as format effectors. They convey no other semantic information and are unnecessary when viewed in light of the DIR tag.

Figure 5-18. Example output stream

(ABCD)123 (1)

PAR@R|MIR|(ABCDMIR|)DIR@L|123/DIR|/PAR| (2)

The Bidirectional Algorithm only requires two changes to accommodate the new tags. In those places where the text is to be reversed, a DIR tag is inserted to indicate the resultant direction rather than actually reversing the stream itself. In those places where a symmetric mirror is required, a MIR tag is inserted to indicate that this character should be replaced with its corresponding mirror. The Haskell functions tagLevel and tagRun replace the functions reverseRun, reverseLevels, and reorder. See Appendix B lines 1-39. The mirror function has been changed to insert a MIR tag rather than directly replacing a character with its symmetric mirror.
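
As an illustration of the second change, the following Haskell fragment sketches a tag-inserting mirror step. It is not the Appendix B code; the Tagged type, the abbreviated list of mirrored characters, and the placement of the MIR tag are assumptions made for this sketch.

-- A mirrored character is left in place and paired with a MIR tag,
-- rather than being replaced by its symmetric mirror.
data Tagged = Dat Char | MirTag
  deriving (Show, Eq)

-- A deliberately tiny stand-in for the full table of mirrored characters.
isMirrored :: Char -> Bool
isMirrored c = c `elem` "()[]{}<>"

-- The MIR tag is placed next to its character here, following Figure 5-18.
mirror :: String -> [Tagged]
mirror = concatMap tagOne
  where
    tagOne c
      | isMirrored c = [MirTag, Dat c]
      | otherwise    = [Dat c]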

The Bidirectional Algorithm could also be extended to directly interpret tags itself. This would be extremely beneficial in cases where the data and the implicit rules do not provide adequate results. For example, in Farsi mathematical expressions are written left-to-right while in Arabic they are written right-to-left.

Under the traditional Bidirectional Algorithm, control codes would need to be inserted into the stream to force correct rendering. See line 1 of Figure 5-19, where the characters "LRE" and "PDF" represent the Unicode control codes Left to Right Embedding and Pop Directional Format, respectively [100].

The extended Bidirectional Algorithm would address this through the addition of two tags, "MATH" and "LANG". These tags would be inserted into the stream to identify the language and that portion of the stream that is a mathematical expression. By using tagging, the output stream still remains in logical order with its direction correctly resolved, without the need for control codes. See lines 2 and 3 of Figure 5-19.

Figure 5-19. Mathematical expression

LRE1+1=2PDF (1)

LANG@fa@IR|MATH| 1+1=2/MATH| (2)

LANG@fa@IR|MATH|DIR@L| 1+1=2/MATH|/DIR| (3)

5.7.0.1 HTML and Bidirectional Tags

The HTML 4.0 specification introduces a bidirectional override tag "BDO" for explicitly controlling the direction in which a tag's contents should be displayed. Lines 1 and 2 of Figure 5-20 illustrate the syntax of this tag. This tag is currently supported in Microsoft's Internet Explorer. [23]

Figure 5-20. BDO tag syntax

<BDO DIR="LTR">body content</BDO> (1)

<BDO DIR="RTL">body content</BDO> (2)

These tags can be used in conjunction with the Unicode bidirectional tags. The Unicode tags can be directly converted into the HTML bidirectional tags [23]. This allows for a clean division of responsibilities for displaying bidirectional data. The Unicode metadata tags simply serve as bidirectional markers. Browsers can then directly render the resultant HTML. This permits the Unicode bidirectional algorithm to be free from the problems of determining font and display metrics.

The UniMeta program takes as input a file encoded in UTF-8 that contains Unicode text in logical order with bidirectional tags. See Appendix C lines 1-99. The UniMeta program then converts the input text into HTML. Each Unicode metadata tag is replaced with a corresponding HTML tag. Currently there is no corresponding tag for mirroring in HTML; when a Unicode MIR tag is found it is simply ignored. The example in Figure 5-21 illustrates the output from the UniMeta Java program. Lines 1 and 2 are copied from Figure 5-18. Line 3 is the resultant HTML with BDO tags.

Figure 5-21. Using HTML bidirectional tags

(ABCD)123 (1)

PAR@R|MIR|(ABCDMIR|)DIR@L|123/DIR|/PAR| (2)

(ABCD) 123 (3)
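
The conversion itself is mechanical. The sketch below, written in Haskell rather than the Java of Appendix C, shows one way the DIR and MIR tags might be mapped onto HTML; the BidiToken type and the function names are assumptions introduced for this illustration.

data BidiToken
  = Text String   -- plain character data
  | DirOpen Char  -- DIR@L or DIR@R
  | DirClose      -- /DIR
  | Mir           -- MIR; HTML has no counterpart, so it is dropped

toHtml :: [BidiToken] -> String
toHtml = concatMap render
  where
    render (Text s)                = s
    render (DirOpen d) | d == 'R'  = "<BDO DIR=\"RTL\">"
                       | otherwise = "<BDO DIR=\"LTR\">"
    render DirClose                = "</BDO>"
    render Mir                     = ""

-- The DIR@L segment of line 2 of Figure 5-21 becomes an LTR BDO element:
--   toHtml [Text "(ABCD)", DirOpen 'L', Text "123", DirClose]
--     == "(ABCD)<BDO DIR=\"LTR\">123</BDO>"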

By using metadata tags to implement the Bidirectional Algorithm, a clear division of responsibilities is achieved. The bidirectional layout process is now divided into two separate and distinct phases: logical run determination ("inferencing") and physical presentation ("reordering"). This enables character data to remain in logical order and still contain the necessary information for it to be correctly displayed. Additionally, any text process receiving such a stream is able to immediately detect that the stream has been bidirectionally processed.

5.8 New Architecture

The introduction of metadata into an encoding allows for a general reorganization of character coding systems. We refer to this reorganization as Metacode. In the next chapter we explore this new architecture in depth.

Chapter 6 Metacode

In this chapter we present both a new coded character set and a text encoding framework that enables a separation of concerns; we call this the Metacode system. In the Metacode information processing system, concepts (policies) are separate and distinct from implementation (mechanism). Metacode at its core is an architecture for describing written natural language data. Metacode permits various ideas, concepts, and policies to coexist, while still remaining efficient. The key advantage this new architecture offers over current models is its ability to unambiguously separate content from control. In Metacode only characters that express raw content are encoded. In particular, Metacode does not encode controls, ligatures, glyph variants, and half-width forms. By only encoding "pure" characters, Metacode places a greater emphasis on content. In this chapter we make recommendations and take the first steps towards implementing an architecture for multilingual information processing.

6.1 Metacode Architecture

Figure 6-1 presents the layers in the new text framework we propose. Metacode is built upon the same general principles used in the Open Systems Interconnection (OSI) network layer model [90]. In particular, the architecture is designed around the following notions:

• A layer is created when a different level of abstraction is required.
• Each layer performs a well-defined function.
• The number of layers is large enough to allow for a clean separation of responsibilities, but not so small as to group unrelated functions together out of necessity.

Figure 6-1. New Text Framework

Application Layer: TeX, HTML, XML, text algorithms
Content Layer: Data equivalence
Tag Definition Layer: ELM, DIR, LANG
Character Property Layer: Directionality, case, numeric
Character Name Layer: LATIN LETTER A, TAG LETTER A, metadata names
Codepoint Layer: 0x20AC, 0x0600, 0x0041
Transmission Layer: ZIP, UTF-8

Unlike other multilingual encoding systems, which switch between various coded character sets, Metacode uses only one character set. In the Metacode system there are no special modes, states, or escape sequences. Metacode is strictly a fixed-width character encoding scheme where each character is represented using the same number of bytes. Like other multilingual text encoding systems, Metacode includes coverage for both Asian and European writing systems.

Characters in Metacode are grouped into two broad categories: data and metadata. We believe that most of Metacode's data characters would be drawn from elements in natural written languages. In Metacode, metadata characters provide a mechanism for describing higher-order information about data characters.

For nearly forty years, the most frequently occurring method of character identification has been by numeric value. This approach has formed the hallmark of character encoding schemes such as ASCII. Metacode continues this tradition.

In Metacode characters are identified by code point value. In Metacode code points are based upon an integer index. This index is used to map Metacode character names to binary sequences. This index should be large enough to accommodate the linguistic elements of both modern and ancient writing systems. Most researchers within the I18N community believe the total number of written characters will not exceed 2^32. It seems logical that 32 bits is an appropriate size for an index.

In Metacode, code points could be transmitted in two general ways. First, some form of binary encoding could be used, for example UTF-8. Second, a non-binary form of transmission could be adopted, for example character name transmission. That said, we anticipate most applications will use some form of binary transmission. Large multilingual encodings by their nature require a transmission layer. The ASCII character encoding scheme never needed a transmission layer, because encoding and transmission were synonymous. Nevertheless, the world transmits data in 8-bit byte chunks. Metacode's code points do not fit within an 8-bit byte, hence some form of transmission is a necessity.

Metacode facilitates the construction of higher-order protocols through the use of metadata. Each higher-order protocol defines its own tags using Metacode's metadata characters. Additionally, each higher-order protocol defines the meaning of its tags as well as the syntax for their use in Metacode streams. Moreover, we envision the creation of a common tag registration organization so that protocols may operate cooperatively.

Metacode provides precise definitions for data, metadata, protocols, and architectural layers. At its core, Metacode provides a mechanism for the unambiguous representation of textual content. Therefore, determining whether Metacode character streams are equivalent is also precise, unambiguous, and simple. The process of determining whether Metacode character streams are the same is the subject of section 6.3.2.

6.2 Metacode Compared to Unicode

In this section we relate Metacode's architecture to Unicode's organization. Specifically, we examine each layer of Metacode in detail, making comparisons to Unicode when needed.

6.2.1 Transmission Layer

Metacode can utilize many of the popular binary character encoding transformation formats. We prefer UTF-32, UTF-16, and UTF-8, which are the same transmission mechanisms that Unicode uses. All of these transmission mechanisms take as input integer-based encodings and produce compressed binary sequences as output.

6.2.2 Code Point Layer

In Metacode code points would be specified by an integer, usually represented in hexadecimal. In the code point layer each code point would map to one and only one character. Additionally, each character would map to one and only one code point. In Metacode each code point would be of a fixed width. Code points in Metacode would never combine to form larger indices. In Metacode code points would generally be organized by script system. This is similar to Unicode. The most significant and novel feature of Metacode is a specific dedicated section of code points for the conveyance of meta information.

We considered two factors in attempting to find a suitable location for encoding Metacode's metadata characters. First, we wanted to select a range in Metacode that would allow for easy migration from Unicode. Second, we wanted to select a range that would allow Metacode's metadata characters to be simulated in other legacy encodings, in particular Unicode. We considered using Unicode's private use area, but ruled it out for two reasons. First, Metacode does not have any notion of a private use or user defined character area, which makes legacy conversion from Unicode more difficult. Second, the private use area within Unicode suffers from abuse due to the vast number of people using it in conflicting ways.

We finally settled on using the surrogate range within Unicode. Unicode text processes already ignore this region, which permits simulation of metadata without disrupting metadata-unaware processes, facilitating easy migration from Unicode to Metacode. Therefore, for purposes of demonstration, the metadata code points are encoded in the following locations: 0xE0001, 0xE0002, and 0xE0020-0xE007F.
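
Under these demonstration locations, deciding whether a code point carries data or metadata is a simple range test. The following Haskell sketch assumes the code points given above.

import Data.Word (Word32)

-- True for the demonstration metadata locations: the two tag protocol
-- characters and the 95 tag name characters.
isMetadataCodePoint :: Word32 -> Bool
isMetadataCodePoint cp =
     cp == 0xE0001
  || cp == 0xE0002
  || (cp >= 0xE0020 && cp <= 0xE007F)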

6.2.3 Character Layer

The character layer in Metacode is the place where the abstract data and metadata entities are defined. Additionally, we list the specific Unicode characters that would be excluded from Metacode as well as those legacy characters that would be redefined for Metacode. In cases where a Unicode character has no direct analog in Metacode, we show how the same information can be expressed using Metacode's metadata characters.

6.2.3.1 Combining Characters

In Metacode each character is treated as an independent unit. Each Metacode character is unaffected by its surrounding neighbors. On the other hand, Unicode permits some neighboring characters to interact with one another to form new character units. Unicode refers to these character sequences as "combining characters". Metacode does not have any notion of combining characters at this layer. This sort of interaction occurs within higher-order protocols. Therefore, we would not include any of Unicode's combining characters in Metacode. The Metacode stream on line 1 on Figure 6-2 illustrates the use of a higher-order protocol for the purpose of indicating combining characters.

Figure 6-2. Combining character protocol

ELM@CMBw¨/ELM — w diaeresis (1)

Table 6-1 lists the Unicode characters that would not be encoded in Metacode. The first column in Table 6-1 contains the excluded Unicode combining characters, the second column specifies the name of each character, while the third column contains the non-combining form for each character listed in the first column. [96]

In Metacode we would still like to be able to use the combining characters as pure data (content) minus their joining protocol, because in many instances these characters represent linguistic elements. To accomplish this, we would redefine the combining characters for Metacode. In our redefinition these characters would be "pure", hence they would possess no special combining property. See Table 6-2 [96]. The first column in Table 6-2 lists the range of code points that would be redefined in Metacode. The second column lists the type of characters that these code points represent.

In cases where a non-combining legacy character, mostly diacritic marks, already exists, we would use it rather than the newly redefined character. See Table 6-1. We take this approach for two reasons. First, if legacy Unicode text contained a diacritic character, then there is a greater likelihood that it used the non-combining form of the character, because few text processes actually support combining characters. Second, we reduce the number of redefined characters, thus easing migration from legacy Unicode to Metacode.

Table 6-1. Excluded Unicode combining characters

Code Point  Character Name  Non-Combining Code Point
U0300  GRAVE ACCENT  U0060
U0301  ACUTE ACCENT  U00B4
U0302  CIRCUMFLEX ACCENT  U005E
U0303  TILDE  U007E
U0304  MACRON  U00AF
U0306  BREVE  U02D8
U0307  DOT ABOVE  U02D9
U0308  DIAERESIS  U00A8
U0309  HOOK ABOVE  U02C0
U030A  RING ABOVE  U02DA
U030B  DOUBLE ACUTE ACCENT  U02DD
U030C  CARON  U02C7
U030D  VERTICAL LINE ABOVE  U02C8
U0310  CANDRABINDU  U0901
U0312  TURNED COMMA ABOVE  U02BB
U0313  COMMA ABOVE  U02BC
U0314  REVERSED COMMA ABOVE  U02BD
U0327  CEDILLA  U00B8
U0328  OGONEK  U02DB

Table 6-2. Redefined Unicode combining characters

Code Point(s)  Character(s)
U0305, U030E-U030F, U0311, U0315-U0326, U0329-U0362  General diacritical marks
U0483-U0489  Cyrillic marks
U0591-U05AF  Hebrew cantillation marks
U05B0-U05C4  Hebrew points and punctuation
U064B-U0652, U0670  Arabic points
U0653-U0655  Arabic maddah and hamza
U06D6-U06E9, U06EA-U06ED  Koranic annotation signs
U0711  SYRIAC SUPERSCRIPT LETTER ALAPH
U0730-U073F  Syriac vowels
U0740-U074A  Syriac marks
U07A6-U07B0  Thaana vowels
U0901-U0903, U093C, U093E, U094D, U0951-U0954  Devanagari signs
U093F-U094C, U0962-U0963  Devanagari vowels
U0981-U0983, U09CD, U09D7  Bengali signs
U09BE-U09CC, U09E2-U09E3  Bengali vowels
U0A02, U0A70-U0A71  Gurmukhi signs
U0A3E-U0A4D  Gurmukhi vowels
U0A81-U0A83, U0ABC, U0ACD  Gujarati signs
U0ABE-U0ACC  Gujarati vowels
U0B01-U0B03, U0B3C, U0B4D, U0B56-U0B57  Oriya signs
U0B3E-U0B4C  Oriya vowels
U0B82-U0B83, U0BCD, U0BD7  Tamil signs
U0BBE-U0BCC  Tamil vowels
U0C01-U0C03, U0C4D, U0C55-U0C56  Telugu signs
U0C3E-U0C4C  Telugu vowels
U0C82-U0C83, U0CCD, U0CD5-U0CD6  Kannada signs
U0CBE-U0CCC  Kannada vowels
U0D02-U0D03, U0D4D, U0D57  Malayalam signs
U0D3E-U0D4C  Malayalam vowels
U0D82-U0D83, U0DCA  Sinhala signs
U0DCF-U0DDF, U0DF2-U0DF3  Sinhala vowels
U0E30-U0E31, U0E34-U0E3A, U0E47  Thai vowels
U0E48-U0E4B  Thai tone marks
U0E4C-U0E4E  Thai signs
U0EB1, U0EB4-U0EBB  Lao vowels
U0EBC, U0ECC-U0ECD  Lao signs
U0EC8-U0ECB  Lao tone marks
U0F35, U0F37, U0F39, U0F3E-U0F3F, U0F82-U0F84, U0F86-U0F87  Tibetan marks and signs
U0F71-U0F7D, U0F80-U0F81  Tibetan vowels
U0F7E-U0F7F  Tibetan vocalic modifiers
U0F90-U0FBC  Tibetan subjoined consonants
U102C-U1032  Myanmar vowels
U1036-U1039  Myanmar signs
U1056-U1059  Myanmar vocalic modifiers
U17B4-U17C5  Khmer vowels
U17C6-U17C8, U17CB-U17D3  Khmer signs
U17C9-U17CA  Khmer consonant shifters
U18A9  MONGOLIAN LETTER ALI GALI DAGALGA
U20D0-U20E3  Diacritical marks for symbols
U302A-U302F  Ideographic diacritics
U3099-U309A  Hiragana voicing marks
UFE20-UFE23  Half marks

6.2.3.2 Glyph Variants

In Metacode there is no notion of glyph-specific characters. This means that Metacode would not encode ligatures or specially shaped versions of characters. Therefore, Metacode would not incorporate any Unicode characters that are glyph composites or glyph variations of existing nominal characters. See Table 6-3 [96]. On the other hand, Metacode would provide a controlled mechanism for describing such information. In Metacode, glyph variations would be specified by using metadata tags. For example, the text stream on line 1 on Figure 6-3 demonstrates how a ligature protocol would be specified in Metacode.

Table 6-3. Unicode glyph composites

Code Point(s)  Description
U0132-U0133  Latin ligatures
U0587, UFB13-UFB17  Armenian ligatures
UFB00-UFB06  Latin ligatures
UFB1D-UFB4F  Hebrew presentation forms
UFB50-UFDFB  Arabic presentation forms (ligatures, initial, medial, final, isolated)
UFE30-UFE44  Glyphs for vertical presentation
UFE50-UFE6B  Small glyphs
UFE70-UFEFC  Additional Arabic presentation forms
UFF01-UFF5E  Full-width Latin glyphs
UFF61-UFFEE  Half-width Chinese, Japanese, and Korean

Figure 6-3. Ligature protocol

ELM@LIGfl/ELM — fl ligature (1)

6.2.3.3 Control Codes

As we pointed out earlier, some character encodings, like ASCII, have control codes. Metacode, however, by its nature does not need control codes. In this context we are referring specifically to ASCII's control codes. In Metacode legacy control codes would be captured by using metadata tags, just like any other higher-order protocol. Metacode does not prohibit other methods for expressing control codes. One such alternative method would be to create special singleton predefined metadata tags that would be directly encoded in Metacode, rather than being specified as an external higher-order protocol. This might ease migration from legacy encodings as well as reduce the number of characters in a stream. Another alternative would be to encode controls as normal Metacode characters. These characters would have no special semantics or required behavior. Text processes seeing these characters would have the freedom to treat them just like any other character or to assign them special properties and behavior. It would be wise to limit these singleton meta characters; otherwise we would just recreate all the problems with Unicode.

Occasionally there are some characters in legacy encodings that do not appear to be control codes, but behave as if they really are. For example, there are several characters encoded in Unicode for the purpose of indicating a break or space. See Table 6-4 [96]. The primary difference between each of these spacing characters is the amount of space to insert. For the most part, these characters were encoded for historical reasons. Nonetheless, in Metacode we would not explicitly encode these characters. We would argue that the ideas expressed in these characters are best described as a higher-order protocol. In Metacode we would only encode a single space character. The single space character could be wrapped in a metadata tag that specifies the exact amount of space to insert, which could be none. For example, the text stream on line 1 on Figure 6-4 demonstrates how a spacing protocol would be specified in Metacode. In order to aid comprehension, the space character is represented by its code point value 0x0020.

Table 6-4. Unicode spacing characters

Code Point  Description
U2000  EN QUAD SPACE
U2001  EM QUAD SPACE
U2002  EN SPACE
U2003  EM SPACE
U2004  THREE PER EM SPACE
U2005  FOUR PER EM SPACE
U2006  SIX PER EM SPACE
U2007  FIGURE SPACE
U2008  PUNCTUATION SPACE
U2009  THIN SPACE
U200A  HAIR SPACE
U200B  ZERO WIDTH SPACE
U202F  NARROW NO BREAK SPACE
UFEFF  ZERO WIDTH NO BREAK SPACE

Figure 6-4. Spacing protocol

ELM@SP@EM0x0020/ELM — EM SPACE (1)

6.2.3.4 Metadata Tag Characters

In the Metacode system there would be 97 metadata characters (95 tag name characters and two tag protocol characters). The 95 tag name characters correspond to the ASCII and Unicode graphic character range 0x20-0x7E, while the tag protocol characters have no analog in either ASCII or Unicode. The two special tag protocol characters would be used for delimiting tags and for separating tag arguments. The base names of the tag characters are taken from Unicode. See Table 6-5. In Metacode, metadata characters would be used to specify higher-order protocols. The details of tag construction and protocol definition were discussed in the previous chapter.

Table 6-5. Metacode tag characters

ASCII Graphic Code Point  Character Name  Metacode Tag Character Name
41  LATIN CAPITAL LETTER A  TAG CAPITAL LETTER A
...  ...  ...
7A  LATIN SMALL LETTER Z  TAG SMALL LETTER Z
(none)  (none)  TAG DELIMITER
(none)  (none)  TAG ARGUMENT SEPARATOR
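
Because the 95 tag name characters mirror the ASCII graphic range, the correspondence can be expressed as a fixed offset. The Haskell sketch below assumes the demonstration locations of section 6.2.2 (an offset of 0xE0000); it is illustrative only.

import Data.Char (ord)
import Data.Word (Word32)

-- Map an ASCII graphic character (0x20-0x7E) to its tag name character.
asciiToTagCodePoint :: Char -> Maybe Word32
asciiToTagCodePoint c
  | ord c >= 0x20 && ord c <= 0x7E = Just (0xE0000 + fromIntegral (ord c))
  | otherwise                      = Nothing

-- For example, 'A' (0x41) maps to 0xE0041, the TAG CAPITAL LETTER A of Table 6-5.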

6.2.4 Character Property Layer

Each character in Metacode would be assigned properties. These properties are summarized in Table 6-6. The values for each character property, except for the Metacode character name property, appear in Table 6-7, Table 6-8, Table 6-9, and Table 6-10. The Metacode character name property is an arbitrary sequence of alphabetic characters that represents the unique name of the character.

Table 6-6. Metacode character properties

Property  Description
Case  For those scripts that have case, indicates the specific case.
Script Direction  Indicates the preferred direction for a character to be written.
Code Point Value  Specifies the numeric index of a character.
Tag Character  Indicates whether a character is data or metadata.
Metacode Character Name  Represents the unique name of a character.

Table 6-7. Metacode case property values

Value  Description
U  Uppercase
L  Lowercase
T  Titlecase

Table 6-8. Metacode script direction property values

Value  Description
LTR  Left-to-right direction
RTL  Right-to-left direction
U  Unassigned; character is used in both left-to-right and right-to-left script systems.
I  Ignore; character is a metadata character and must be processed in logical order.

Table 6-9. Metacode code point value property

Value  Description
0xyyyyyyyy  32-bit hexadecimal index, where y is a hex digit

Table 6-10. Metacode tag property values

Value  Description
T  Character is a metadata character
F  Character is a data character

6.2.5 Tag Definition Layer

The tag definition layer is the layer of abstraction where Metacode higher-order protocols would be defined. The tag definition layer represents the core, most useful, and most common protocols. See Table 6-11. Table 6-11 lists each protocol category along with an informative description.

In Metacode each higher-order protocol would remain distinct and separate from all others. This allows protocols to be interwoven without ill effect. Text processes would then be free to ignore specific protocols, if so desired. In some cases a process might not understand a protocol. In other cases a protocol might not be applicable to a particular process. In both situations a process can safely ignore such protocols. Above all, the Metacode protocol definition system is open ended, allowing for ever more specialized protocols and private use protocols to be added.

The text on line 3 on Figure 6-5 illustrates how two protocols would be interwoven in Metacode. In this example we interleave the protocols appearing on lines 1 and 2 on Figure 6-5. The text on line 1 on Figure 6-5 illustrates how the higher-order collation protocol would be used to group characters together. In the collation protocol, data characters surrounded by a collation tag would be treated as a single unit for purposes of sorting. In general, text processes that perform sorting would make direct use of this information. The text on line 2 on Figure 6-5 illustrates how the direction protocol would be used to communicate layout information to a process performing rendering. In the direction protocol, characters surrounded by a direction tag would be rendered according to the specified direction. Nevertheless, when the two protocols are interwoven, the individual text processes would still be able to function properly. In this case the sorting process would ignore the direction tag and the rendering process would ignore the collation tag.

Figure 6-5. Interwoven protocols

ELM@COLLch/ELMocolate (1)

DIR@Rchocolate/DIR (2)

DIR@R|ELM@COLLch/ELMocolate/DIR (3)
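
The following Haskell sketch shows how a text process might keep only the protocols it understands, as in the interleaving example above. The Token type and the protocol names passed to keepProtocols are assumptions introduced for this example.

data Token
  = DataChar Char
  | Open String [String] -- opening tag: protocol name and arguments, e.g. Open "DIR" ["R"]
  | Close String         -- closing tag, e.g. Close "DIR"
  deriving (Show, Eq)

-- Keep the data characters and the tags of known protocols; drop the rest.
keepProtocols :: [String] -> [Token] -> [Token]
keepProtocols known = filter keep
  where
    keep (DataChar _)  = True
    keep (Open name _) = name `elem` known
    keep (Close name)  = name `elem` known

-- A sorting process would keep the collation grouping and drop DIR:
--   keepProtocols ["ELM"] stream
-- while a rendering process would keep DIR and drop ELM:
--   keepProtocols ["DIR"] stream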

Table 6-11. Major protocols

Category  Description
Language  The language of a stream or sub-stream of text.
Ligatures  A glyph representing two or more characters.
Collation  A text element containing two or more characters that is treated as a single unit for purposes of sorting.
Presentation direction  The display order for a sequence of characters, horizontal and vertical.
Paragraphs  Characters that are used to indicate paragraph boundaries.
New lines  Characters that are used to indicate new line boundaries.
Combining characters  A character with one or more diacritics or vowels.
Glyph variants  Alternate character presentation forms, half-width or full-width.
Symmetric swapping  A character that when rendered uses its corresponding mirrored glyph rather than its normal glyph.
Transliteration  A text element containing two or more characters that is treated as a single unit for purposes of conversion, for example case conversion.
General control codes  The C0 and C1 control codes, for example line feed, tab, and carriage return.
General layout controls  Typographic controls for spacing and line breaking.

6.2.6 Metacode Conversion

Yet another example of the power of the metadata mechanism is the ability to embed Unicode in Metacode. Conceivably the Metacode and Unicode code point sets will not be identical, and some legacy applications will require specific deprecated Unicode code points. Even this restricted case can be easily handled, because Metacode provides an open-ended, universal set of protocols. Because many of Metacode's data characters correspond directly to code points in Unicode, round-trip conversion is easy. In Table 6-12 we provide sample mappings for those Unicode characters that are deprecated in Metacode. In most situations the conversion of deprecated Unicode characters to Metacode protocols is obvious, because the purpose of the Unicode characters is mapped to a specific Metacode protocol. Nevertheless, in some circumstances the intended use of a deprecated character may be impossible to determine from the data stream itself. In the case where the context is unknown, Metacode provides a "raw code point" protocol for preserving round-trip conversions.

The raw code point protocol works for embedding Unicode characters or any other character encoding protocol. For example, consider the Unicode code point U2007 (line 1 of Figure 6-6); this code point is designated as the figure space and is a hint about the presentation of spacing. We have a presentation spacing protocol, and so this character is deprecated in Metacode. Suppose an application is looking for this specific character and has not fully migrated to Metacode. The raw code point protocol preserves this Unicode code point and any others. Line 2 on Figure 6-6 shows the embedding of U2007: CP stands for the raw code point protocol, U indicates the Unicode character set, and 2007 is the hexadecimal value of the Unicode code point. As always, this protocol is preliminary and the details require further study and debate.

Figure 6-6. Metacode code point protocol

U2007 — figure space (1)

CP@U@2007 (2)

Table 6-12. Converting deprecated Unicode code points to Metacode

Category  Description  Unicode  Metacode
Language  Language Urdu  UE0001,UE0075,UE0072  LANG@UR: ME004C,ME0041,ME004E,ME0047,ME0002,ME0075,ME0072
Ligatures  Latin small ligature fi  UFB01  ELM@LIGfi/ELM: ME0045,ME004C,ME004D,ME0002,ME0045,ME0049,ME0047,M0066,M0069,ME002F,ME0045,ME004C,ME004D
Presentation direction  Right-to-left embedding  U202B  DIR@R: ME0044,ME0049,ME0052,ME0002,ME0052
Paragraphs  Paragraph separator  U2029  PAR: ME0050,ME0041,ME0052
New lines  Line separator  U2028  BRK: ME0042,ME0052,ME004B
Combining characters  Latin small letter w with diaeresis (w¨)  U0077,U0308  ELM@CMBw¨/ELM: ME0045,ME004C,ME004D,ME0002,ME0043,ME004D,ME0042,M0077,M00A8,ME002F,ME0045,ME004C,ME004D
Glyph variants  Full-width Latin small letter a  UFF41  ELM@WIDa/ELM: ME0045,ME004C,ME004D,ME0002,ME0057,ME0049,ME0044,M0061,ME002F,ME0045,ME004C,ME004D
Symmetric swapping  Activate mirroring  U206B  MIR: ME004D,ME0049,ME0052
General control codes  Carriage return  U000D  M000D (singleton)
General layout controls  EM space  U2003  ELM@SP@EM /ELM: ME0045,ME004C,ME004D,ME0002,ME0053,ME0050,ME0002,ME0045,ME004D,M0020,ME002F,ME0045,ME004C,ME004D

In Metacode there is no limit to the number of protocols that can be expressed. Metacode can embed not only control and presentation protocols but also character coding standards, such as ISO-2022. In fact, Metacode allows for a more natural expression of ISO-2022 escape sequences. For example, line 1 on Figure 6-7 shows the ISO-2022 byte sequence for announcing a switch to the ASCII character set. Line 2 shows how the announcement would be expressed in Metacode. The metadata sequence on line 2 removes the need to decipher the escape sequence in order to determine the character set. In Metacode the escape sequence is replaced by the actual name of the character set. This approach offers the advantage that lookup tables and deciphering become unnecessary.

Figure 6-7. ISO-2022 escape sequence in Metacode

ESC ( B (1)

ISO@ASCII (2)

In ISO-2022 the number of registered character sets is finite, because there are a fixed number of code points from which to make assignments. In Metacode the number of character sets that can be referenced is unlimited, because Metacode uses strings to encode names. This approach offers the greatest flexibility yet still allows for unambiguous communication. We suggest that the character set names be taken from the IANA (Internet Assigned Numbers Authority). The IANA records the names of character sets in RFC 1521.

6.2.7 Content Layer

The content layer is the highest layer of abstraction in Metacode's architecture. We anticipate that applications will primarily interact with Metacode at this layer of abstraction, as this layer deals with protocols and the raw content of Metacode streams. This is discussed in greater detail in the next section. This does not preclude a process from working with Metacode at some lower level of abstraction.

6.3 Data Equivalence

The concept of equivalent data streams is the subject of this section. In Metacode data streams may sometimes be considered to be equivalent even if they do not contain the same code points or bytes. Unicode also has a notion of equivalence, which it refers to as "normalization". We start this section by first examining in detail Unicode's normalization algorithm. We then describe Metacode's strategy for determining data equivalence.

6.3.1 Unicode Normalization

Normalization is the general process used to determine when two or more sequences of characters are equivalent. In this context the use of the term equivalent is unclear. It is possible to interpret the use of equivalent in multiple ways. For example, it could mean characters are equivalent when their code points are identical, or characters are equivalent when they have indistinguishable visual renderings, or characters are equivalent when they represent the same content.

Figure 6-8. Non-interacting diacritics

c¸^ U0063,U0327,U0302 (1)

c^¸ U0063,U0302,U0327 (2)

ç̂ (3)

Unicode supports two broad types of character equivalence: canonical and compatibility. In canonical equivalence the term equivalent means character sequences that exhibit the same visual rendering. For example, the character sequences on lines 1 and 2 on Figure 6-8 both produce identical renderings, shown on line 3. [101]

In compatibility equivalence the term equivalent is taken to mean characters representing the same content. For example, line 1 on Figure 6-9 shows the single fi ligature while line 2 on Figure 6-9 shows the compatible two-character sequence f and i. In this case both sequences of characters represent the same semantic content. The only difference between the two is whether or not a ligature is used during rendering. [101]

Figure 6-9. Compatibility equivalence

ﬁ UFB01 (1)

fi U0066,U0069 (2)

6.3.1.1 Unicode Normal Forms

Unicode defines four specific forms of normalization based upon the general canonical and compatibility equivalences. These forms are listed in Table 6-13; the title column indicates the name of the normal form, while the category column indicates the equivalence type. [101]

Table 6-13. Normalization forms

Title  Category  Description
Normalization Form D (NFD)  Canonical (visually equivalent)  Canonical Decomposition
Normalization Form C (NFC)  Canonical  Canonical Decomposition followed by Canonical Composition
Normalization Form KD (NFKD)  Compatibility (same content)  Compatibility Decomposition
Normalization Form KC (NFKC)  Compatibility  Compatibility Decomposition followed by Canonical Composition

Normalization form NFD substitutes precomposed characters with their equivalent canonical sequence. Characters that are not precomposed are left as is. Diacritics (combining characters), however, are subject to potential reordering. This reordering only occurs when sequences of diacritics that do not interact typographically are encountered; those that do interact are left alone. [101]

In Unicode each character is assigned to a combining class. Non-combining characters are assigned to the zero combining class, while combining characters are assigned a positive integer value. The reordering of combining characters operates according to the following three rules (a sketch of the reordering follows the list):

• Look up the combining class for each character.
• For each pair of adjacent characters AB, if the combining class of B is not zero and the combining class of A is greater than the combining class of B, swap the characters.
• Repeat step 2 until no more exchanges can be made.
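
The reordering step is in effect a bubble sort keyed on combining class. The Haskell sketch below implements the three rules with a two-entry stand-in for the Unicode combining class table (U0327 and U0302 carry their actual classes, 202 and 230); it is illustrative rather than the normative algorithm.

-- A tiny stand-in for the Unicode combining class data file.
combiningClass :: Char -> Int
combiningClass '\x0327' = 202  -- COMBINING CEDILLA
combiningClass '\x0302' = 230  -- COMBINING CIRCUMFLEX ACCENT
combiningClass _        = 0    -- base characters and everything else

-- One pass of rule 2: swap adjacent pairs that violate the ordering.
onePass :: String -> String
onePass (a:b:rest)
  | combiningClass b /= 0 && combiningClass a > combiningClass b
              = b : onePass (a : rest)
  | otherwise = a : onePass (b : rest)
onePass xs    = xs

-- Rule 3: repeat until no more exchanges can be made.
canonicalOrder :: String -> String
canonicalOrder s
  | next == s = s
  | otherwise = canonicalOrder next
  where next = onePass s

-- canonicalOrder "c\x0302\x0327" == "c\x0327\x0302", so the two orderings
-- of Figure 6-8 reach the same canonical sequence.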

After all of the precomposed characters are replaced by their canonical equivalents and all non-interacting combining characters have been reordered, the sequence is then said to be in NFD. [96]

Normalization form NFC uses precomposed characters where possible, maintaining the distinction between characters that are compatibility equivalents. Most sequences of Unicode characters are already in NFC. To convert a sequence of characters into NFC, the sequence is first placed into NFD. Each character is then examined to see if it should be replaced by a precomposed character according to the following rule.

• If the character can be combined with the last character whose combining class was zero, then replace the sequence with the appropriate precomposed character.

After all of the diacritics that can be combined with base characters are replaced by precomposed characters, the sequence is said to be in NFC. [101]

Normalization form NFKD replaces precomposed characters by sequences of combining characters and also replaces those characters that have compatibility mappings. In this normal form, formatting distinctions may be lost. Additionally, the ability to perform round-trip conversion with legacy character encodings may be impossible, because of the loss of formatting. Normalization form NFKC replaces sequences of combining characters with their precomposed forms while also replacing characters that have compatibility mappings. [101]

There are some characters encoded in Unicode that need to be ignored during normalization, in particular the bidirectional controls and the zero width joiner and non joiner. These characters are used as format effectors. The joiners can be used to promote or inhibit the formation of ligatures. Unicode does not provide definitive guidance as to when these characters can be safely ignored in normalization. Unicode only states that these characters should be filtered out before storing or comparing programming language identifiers. [101]

To assist in the construction of the normal forms, Unicode maintains a data file listing each Unicode character along with any equivalent canonical or compatible mappings. Algorithms that wish to perform normalization must use this data file. By having all normalization algorithms rely on this data, the normal forms are guaranteed to remain stable over time. If this were not the case, it would be necessary to communicate the version of the normalization algorithm along with the resultant normal form. [101]

6.3.1.2 Unicode Normalization Algorithm

The best way to illustrate the use of normal forms is through an example. Consider the general problem of searching for a string. In particular, assume that a text process is searching for the string "flambé". Table 6-14 lists just some of the possible ways in which the string "flambé" could be represented in Unicode.

Table 6-14. The string "flambé"

#  Code Points  Description
1  U0066,U006C,U0061,U006D,U0062,U0065,U0301  decomposed
2  U0066,U006C,U0061,U006D,U0062,U00E9  precomposed
3  UFB02,U0061,U006D,U0062,U00E9  fl ligature, precomposed
4  UFB02,U0061,U006D,U0062,U0065,U0301  fl ligature, decomposed
5  UFF46,UFF4C,UFF41,UFF4D,UFF42,U00E9  full-width, precomposed
6  UFB02,UFF41,UFF4D,UFF42,U00E9  fl ligature, full-width, precomposed
7  U0066,U200C,U006C,U0061,U006D,U0062,U00E9  ligature suppression, precomposed
8  U0066,U200C,U006C,U0061,U006D,U0062,U0065,U0301  ligature suppression, decomposed
9  U0066,U200D,U006C,U0061,U006D,U0062,U00E9  ligature promotion, precomposed
10  U0066,U200D,U006C,U0061,U006D,U0062,U0065,U0301  ligature promotion, decomposed
11  U202A,U0066,U006C,U0061,U006D,U0062,U00E9,U202C  left-to-right segment, precomposed
12  UFF46,U200C,UFF4C,UFF41,UFF4D,UFF42,U00E9  full-width, ligature suppression, precomposed
13  UFF46,U200D,UFF4C,UFF41,UFF4D,UFF42,U00E9  full-width, ligature promotion, precomposed

The character sequences found in Table 6-14 are all equivalent under the transforms NFKC and NFKD. In the case of NFKD, all transformations yield the sequence found on row 1 of Table 6-14, while transformations into NFKC result in the sequence on row 2 of Table 6-14. To demonstrate this, consider the conversion of line 1 of Figure 6-10, copied from row 6 of Table 6-14, into NFKD. First, the sequence is converted to NFD by replacing precomposed characters with their decomposed equivalents. See line 2 of Figure 6-10. Second, all characters that have compatibility mappings are then replaced by their corresponding compatibility characters. See line 3 of Figure 6-10. The final sequence obtained is the same as the one found on row 1 of Table 6-14.

Figure 6-10. Conversion to NFKD

UFB02,UFF41,UFF4D,UFF42,U00E9 (1)

UFB02,UFF41,UFF4D,UFF42,U0065,U0301 (2)

U0066,U006C,U0061,U006D,U0062,U0065,U0301 (3)
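
The two substitution steps of Figure 6-10 can be expressed directly. The Haskell sketch below carries only the handful of decomposition mappings this example needs (the real algorithm consults the full Unicode data file) and omits canonical reordering, since no combining characters interact here.

-- Step 1: canonical decompositions used in this example.
canonicalDecomp :: Char -> String
canonicalDecomp '\x00E9' = "\x0065\x0301" -- e acute -> e + combining acute
canonicalDecomp c        = [c]

-- Step 2: compatibility decompositions used in this example.
compatDecomp :: Char -> String
compatDecomp '\xFB02' = "fl" -- fl ligature -> f, l
compatDecomp '\xFF41' = "a"  -- full-width a
compatDecomp '\xFF4D' = "m"  -- full-width m
compatDecomp '\xFF42' = "b"  -- full-width b
compatDecomp c        = [c]

toNFKD :: String -> String
toNFKD = concatMap compatDecomp . concatMap canonicalDecomp

-- toNFKD "\xFB02\xFF41\xFF4D\xFF42\x00E9" yields the code points
-- U0066,U006C,U0061,U006D,U0062,U0065,U0301 of line 3 of Figure 6-10.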

The fact that all of the sequences found in Table 6-14 are equivalent under one normal form, in this case NFKD, does not necessarily mean that the sequences are equivalent in other normal forms. For example, consider line 1 on Figure 6-11, which is copied from row 3 of Table 6-14. When this sequence is converted to NFD, the result is line 2 on Figure 6-11. This does not match the sequence on row 1 of Table 6-14; therefore these sequences are not equivalent under NFD.

Figure 6-11. Conversion to NFD

UFB02,U0061,U006D,U0062,U00E9 (1)

UFB02,U0061,U006D,U0062,U0065,U0301 (2)

Thus far we have explored the details of data equivalence in Unicode. We now examine some of the problems that are caused by Unicode’s normalization forms. In particular, we consider the interaction between the normalization process and other Unicode algorithms.

6.3.1.3 Problems with Unicode Normalization

The overall complexity of normalization presents serious problems for general searching and matching. Without a single normalization form, it is not possible to determine reliably whether or not two strings are identical. The W3C (World Wide Web Consortium) has advocated adopting Unicode's NFC for use on the web. Additionally, the W3C recommends that normalization be performed early (by the sender) rather than late (by the recipient). Their recommendation is to be conservative in what you send, while being liberal in what you accept. The major arguments for taking this approach are: [107]

• Almost all data on the web is already in NFC.
• Most receiving components assume early normalization.
• Not all components that perform string matching can be expected to do normalization.

There are some problems with this strategy, however. It assumes that the primary purpose of normalization is to determine whether two sequences have identical renderings, which is appropriate for display but inappropriate for information processing. In Unicode's NFC any and all formatting information is retained. This causes problems for those processes that require comparisons to be based only upon raw content, such as web search engines and web-based databases. Additionally, it places undue limitations on the characters that can be used during interchange.

During the process of normalization the properties of the characters are not guaranteed to remain stable. In Unicode the numero sign is a neutral character, while the latin capital letter n and latin small letter o are left-to-right characters. Obviously these character types are not the same. Therefore, it is no surprise that when the run on line 3 on Figure 6-12 is converted to display order it does not match the unnormalized display order. See lines 4 and 2 respectively. This example reveals Unicode's strong ties to presentation rather than content. This might seem unimportant, as the visual display is not vastly different. Nevertheless, it could lead to cases in which incorrect conclusions could be drawn. See Figure 6-13.

Figure 6-12. Protocol interaction

U0627,U2116,U0031,U0032,U0033 (1)

(2) ا№123

U0627,U004E,U006F,U0031,U0032,U0033 (3)

(4) ا No123

Line 1 on Figure 6-13 is a run of characters in logical order with its corresponding display order on line 2 on Figure 6-13. The run on line 3 on Figure 6-13 is the normalized form of line 1 on Figure 6-13, with its display order on line 4 on Figure 6-13. When the two display orderings are compared, the results are radically different visually and semantically. See lines 2 and 4. These examples further illustrate the need for a new data encoding model.

Figure 6-13. Data mangling

U0627,U00BC (1)

(2) ا ¼

U0627,U0031,U2044,U0034 (3)

(4) ا 4/1

The next serious problem with the normalization process is its unexpected interaction with other Unicode protocols, in particular the Unicode Bidirectional Algorithm. The run of characters on line 1 on Figure 6-12 is a sequence of Arabic characters in logical order. The text displayed on line 2 on Figure 6-12 is the same run of characters but in display order. This is the output from the Unicode Bidirectional Algorithm.

When the run of characters on line 1 on Figure 6-12 is placed into NFKC, the result on line 3 on Figure 6-12 is obtained. Placing the run of characters on line 1 on Figure 6-12 into NFKC causes the numero sign (U2116) to be converted to the two-character sequence latin capital letter n (U004E) followed by latin small letter o (U006F). This illustrates the unanticipated and confounding interaction between normalization and bidirectional processing.

In the previous examples we have seen some of the unexpected interactions between normalization and layout. In the next section we explore data equivalence in Metacode. In particular, we will see that in Metacode data equivalence does not interact with other algorithms. This is possible because in Metacode data equivalence is steered away from presentation and redirected towards content.

6.3.2 Data Equivalence in Metacode

In Unicode the definition of equivalence is strongly tied to the visual appearance of characters. In the Metacode system, however, equivalence is aligned towards the meaning of characters, rather than their visual representation. Therefore, in Metacode we would define three types of data equivalence:

• Byte equivalence — If two streams contain the same sequence of bytes then the two streams are said to be byte equivalent. See the Haskell function byteEquivalent in Appendix D.
• Code point equivalence — If two streams contain the same sequence of code points (data and metadata) then the two streams are said to be code point equivalent. See the Haskell function codePointEquivalent in Appendix D.
• Content equivalence — If two streams contain the same sequence of data code points (excluding metadata) then the two streams are said to be content equivalent. See the Haskell function contentEquivalent in Appendix D. (A sketch of these tests follows the list.)
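
The following Haskell sketch gives a flavour of these tests over streams of 32-bit code points; the actual definitions appear in Appendix D, and the metadata range test reuses the demonstration locations of section 6.2.2.

import Data.Word (Word32)

isMetadata :: Word32 -> Bool
isMetadata cp = cp == 0xE0001 || cp == 0xE0002
             || (cp >= 0xE0020 && cp <= 0xE007F)

-- Code point equivalence: the streams match code point for code point.
codePointEquivalent :: [Word32] -> [Word32] -> Bool
codePointEquivalent xs ys = xs == ys

-- Content equivalence: the streams match once metadata is ignored.
contentEquivalent :: [Word32] -> [Word32] -> Bool
contentEquivalent xs ys =
  filter (not . isMetadata) xs == filter (not . isMetadata) ys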

In Metacode each form of data equivalence is linked to a specific architectural layer within Metacode (starting from the lowest layer of abstraction):

• Transmission layer — Byte equivalence
• Code point layer — Code point equivalence
• Content layer — Content equivalence

Equivalence at a lower layer of abstraction guarantees equivalence at higher layers. This is summarized in the following three rules:

• If two streams are byte equivalent then the two streams must also be code point equivalent and content equivalent.
• If two streams are code point equivalent then the two streams must also be content equivalent. The two streams, however, may optionally be byte equivalent.
• If two streams are content equivalent then the two streams may optionally be code point equivalent and/or byte equivalent.

Looking at Unicode's normalization algorithm, we find it to be very complex with ill-defined boundaries. Metacode's content equivalence, however, is simple with well-defined boundaries. Metacode's content equivalence is performed by simply comparing the data characters in a stream, allowing metadata characters to be completely ignored. This is possible because metadata characters do not play any role in determining content. The metadata characters always express higher-order protocols and have no effect on the interpretation of the raw data.

Here is a list of the other properties of Metacode’s equivalence algorithm:

• The algorithm is reversible.
• The algorithm is robust: it still functions as new tags are created.
• The algorithm is applicable to all text processes: it assumes no particular type of text process, whether presentation or content based.
• The algorithm is independent and separate from other algorithms.

Unicode's normalization algorithm does not exhibit these properties. We argue that the definition of Metacode content equivalence is in fact what Unicode normalization should have been. Metacode's definition of content equivalence is more closely aligned with Unicode's goal of separating characters from glyphs than Unicode normalization is. Unicode normalization should not have been concerned with how particular characters are rendered. In Metacode this would be the purview of rendering engines.

In Metacode there is no limitation on which characters can and cannot be used in Metacode's content layer. Unicode, by contrast, has limitations regarding the characters that can and cannot be used in Unicode's normal forms. Metacode's approach allows text components to be liberal in both what they send and what they receive. The presence of metadata in a stream never alters the interpretation of the raw content.

Next we revisit the problem of expressing the string "flambé". See Table 6-14. In this table we concluded that all the entries were equivalent. Table 6-15 captures the same semantics using Metacode's metadata characters as was expressed in Table 6-14. To enhance comprehension of the table we use the printed version of the characters, rather than their code point values. Additionally, in Table 6-15 we do not find any combining characters, because this notion is only applicable to Unicode and is unnecessary in Metacode. We refer to the strings in Table 6-15 as being content equivalent; that is, they all represent the same raw content. Moreover, counting the number of entries in Table 6-15 we find that this is less than half the number of entries in Table 6-14. This is not surprising given the numerous ways in which the same content can be expressed in Unicode. In Metacode, however, there would never be a case where the same raw content could be expressed in more than one way.

Table 6-15. The string "flambé" in Metacode

#  String  Description
1  flambé  no higher-order protocols
2  ELM@LIGfl/ELMambé  fl ligature protocol
3  ELM@WIDflambé/ELM  full-width protocol
4  ELM@WID|ELM@LIGfl/ELMambé/ELM  fl ligature and full-width protocols
5  DIR@Lflambé/DIR  direction protocol

In Metacode, data equivalence is never based on any external tables, thereby eliminating any potential data table versioning problems. For example, consider Unicode character U2048, the question exclamation mark ?!. See line 1 on Figure 6-14. This character was first defined in Unicode 3.0. The purpose of the character is to produce a special glyph variant of a question mark ? combined with an exclamation mark ! for use in vertical writing systems. See line 2 on Figure 6-14. Nonetheless, the meaning of the question exclamation mark is the same as the combined individual characters. This required Unicode to update its normalization tables. Nevertheless, when applications based on earlier versions of Unicode performed normalization, the question exclamation mark would not match the individual question mark and exclamation mark. Therefore, these characters would be incompatible.

Figure 6-14. Question Exclamation Mark

U2048 ?! (1)

U003F,U0021 ?! (2)

Using Metacode and metadata tags, no such dependency on a specific version of Metacode is necessary. In Metacode a new code point definition would not be required at all. This vertical form using metadata is illustrated on line 2 on Figure 6-15. When line 2 is compared to line 1, we find the two are content equivalent; both strings represent the same content. If at some later time we find it necessary to add a wide form of the question exclamation mark to Metacode, we need only surround the ? and ! with a metadata tag. See line 3 on Figure 6-15. Thus, Metacode and its associated metadata tagging mechanism is both open and flexible. The process for determining whether the two streams are content equivalent does not require any changes to accommodate the use of this tag, further illustrating the openness of the Metacode architecture.

Figure 6-15. Metadata Question Exclamation Mark

?! (1)

ELM@VER?!/ELM (2)

ELM@WID?!/ELM (3)

6.3.3 Simulating Unicode in Metacode

Metacode permits the simulation of Unicode and its normalization forms, easing migration to our new architecture. Unicode’s normalization forms would be encoded by using Metacode’s metadata tagging mechanism. The notion of a Unicode combining character would be described as a higher-order protocol, previously illustrated on Figure 6-2. Unicode’s normalization algorithm would be yet another form of data equivalence. For example, the Metacode character stream on line 1 on Figure 6-16 would represent the Unicode characters latin capital letter u and

combining diaeresis. Line 2 on Figure 6-16 is the single Metacode character latin capital letter u diaeresis. In Metacode the streams on lines 1 and 2 would be unequal, because they do not represent the same content. Nevertheless, the two streams would be equivalent under Unicode simulation, because the streams have identical renderings.

Unicode normalization can be thought of as a higher-order form of data equivalence. We call this form of equivalence “display equivalence” and place it in a higher layer over content equivalence. Display equivalence does not violate any of our earlier rules of equivalence.

Figure 6-16. Simulating Unicode normalization

ELM@CMBU¨/ELM (1)

Ü (2)
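A minimal sketch of this layering, assuming hypothetical code point values for the ELM@CMB and /ELM metadata characters: display equivalence first rewrites a tagged combining sequence into its precomposed form (a one-entry table stands in for the Unicode data) and then defers to content equivalence.

import Data.Word (Word32)

type MetaChar = Word32

isMetadata :: MetaChar -> Bool
isMetadata x = x >= 0xe0000 && x <= 0xe007f

contentEquivalent :: [MetaChar] -> [MetaChar] -> Bool
contentEquivalent xs ys = strip xs == strip ys
  where strip = filter (not . isMetadata)

-- Hypothetical metadata code points standing in for ELM@CMB ... /ELM.
cmbStart, cmbEnd :: [MetaChar]
cmbStart = [0xe0020, 0xe0021]
cmbEnd   = [0xe0022]

-- Rewrite a tagged combining sequence into its precomposed form.
-- One-entry table: U followed by combining diaeresis becomes U with diaeresis.
precompose :: [MetaChar] -> [MetaChar]
precompose xs = case xs of
  (a:b:c:d:e:rest)
    | [a,b] == cmbStart && [c,d] == [0x55, 0x308] && [e] == cmbEnd
        -> 0xDC : precompose rest
  (x:rest) -> x : precompose rest
  []       -> []

-- Display equivalence: a higher layer over content equivalence.
displayEquivalent :: [MetaChar] -> [MetaChar] -> Bool
displayEquivalent xs ys = contentEquivalent (precompose xs) (precompose ys)

line1, line2 :: [MetaChar]
line1 = cmbStart ++ [0x55, 0x308] ++ cmbEnd   -- tagged U + combining diaeresis
line2 = [0xDC]                                -- precomposed U with diaeresis

-- contentEquivalent line1 line2 is False (different raw content);
-- displayEquivalent line1 line2 is True (identical renderings).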

6.4 Code Points vs. Metadata

Metacode is capable of easy expansion to accommodate the inevitable and boundless growth in written expression. In Metacode expansion may occur both in the code point layer and in the tag definition layer. Our architecture encourages relatively infrequent expansion at the code point layer, occurring only when a new natural language construct needs to be expressed. Expansion at the tag definition layer would occur only when information describing a natural language construct needed to be expressed. Our architecture greatly reduces the number of instances where the appropriate assignment is ambiguous. For the remaining cases we improve the situation by providing more workable options for capturing the essence of natural language constructs.

6.4.1 Metacode Principles

In many cases the decision as to whether to use a code point or a tag would be obvious. Nevertheless, some heuristics would be established to provide guidance with these decisions. For example, the following heuristics are indicators for encoding a concept as a code point:

• The concept represents a natural language construct.
• The concept represents a fundamental element of some formal system.

The following are indicators for encoding concepts as protocols rather than as code points:

• The concept is a stylistic variation of an already existing code point.
• The concept is used for signaling or control of some higher-order process.
• The concept causes the semantics of code points to change.
• The concept provides meta information about an existing code point.
• The concept is a specialization or generalization of an already existing tag.

6.4.2 Applying Metacode Heuristics

Below we illustrate how these Metacode heuristics would be applied when encoding new objects within Metacode. We examine situations in which the decision is easy to make as well as those in which it is less clear.

6.4.2.1 Natural language text

First, we explore a case where the decision as to whether to use a code point or a tag is unambiguous. Consider the task of encoding Egyptian hieroglyphic symbols in Metacode. The Egyptian hieroglyphic symbols are divided into two classes, phonograms and ideographs. Phonograms are used to write the sounds of the language; the value of the sound was usually obtained from the name of the object being depicted. The hieroglyphic symbol “foot” on Figure 6-17 represents a consonant that is pronounced as the letter “b” in English. Because the “foot” object is both an element of a natural language (Egyptian) and an element of a formal system (hieroglyphics), it would be encoded as a code point. [17], [80]

Figure 6-17. Egyptian hieroglyphic phonogram

In Egyptian, ideographs represent either the actual object being depicted or a closely related idea. For example, consider the hieroglyphic symbol “ra” on Figure 6-18. The symbol ra stands for the sun. Even though this is not a phonogram, the ra symbol is an element of written natural language and would also be encoded as a code point.

Figure 6-18. Egyptian hieroglyphic ideograph

6.4.2.2 Mathematics

There currently exist several systems for representing mathematical documents, such as TeX, Mathematica, and MathML [65]. These systems, however, deal with the representation of mathematics at the document level and not at the character level. Unicode has recently taken steps to fill this gap by encoding a set of characters within the surrogate range specifically designed for mathematics. These mathematical characters include bold, italic, and script Latin letters, bold and italic Greek letters, and bold and italic European numerals [98]. Metacode would not include such characters, because such information is captured by Metacode’s higher-order protocols. Unicode’s mathematical characters are really stylistic variations of already encoded characters. On the other hand, it is true that the semantics of the mathematical characters differ from the basic Latin, Greek, and European numerals. In Metacode we would not prevent such semantics from being expressed; rather, we

would argue that the use of code points as the means for their expression is incorrect because these characters are stylistic variations of already encoded characters.

In the case of Unicode we find that only the European numerals have bold forms; what about all the other kinds of digits? If we believe that using code points is the correct approach, then we must be prepared to provide multiple forms of every kind of digit (Arabic, Chinese, Japanese, etc.). Moreover, the introduction of multiple stylistic forms of letters and digits only serves to make data equivalence more complex.

In Metacode the semantics of mathematics are expressed through a higher-order protocol. For example, rather than encoding bold characters, Metacode would express such information by using a tag. On line 1 on Figure 6-19 we see the latin capital letter a surrounded by a math tag with the single argument bold. Moreover, in Metacode the introduction of mathematical tags does not require any changes to our data equivalence procedures.

Figure 6-19. Mathematical characters

MATH@BA/MATH (1)
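A sketch of how such a tag could be attached to content without touching the code point layer. The MATH tag code points below are hypothetical; the point is only that the bold attribute travels as metadata around an ordinary latin capital letter a.

import Data.Word (Word32)
import Data.Char (ord)

type MetaChar = Word32

-- Hypothetical metadata code points standing in for MATH@B ... /MATH.
mathBoldStart, mathEnd :: [MetaChar]
mathBoldStart = [0xe0030, 0xe0031]
mathEnd       = [0xe0032]

-- Wrap raw content in a higher-order protocol tag.
wrapTag :: [MetaChar] -> [MetaChar] -> [MetaChar] -> [MetaChar]
wrapTag start end content = start ++ content ++ end

-- The bold mathematical A of Figure 6-19, line 1.
boldA :: [MetaChar]
boldA = wrapTag mathBoldStart mathEnd [fromIntegral (ord 'A')]

-- Because the tag characters fall in the metadata range, content equivalence
-- still sees a plain letter A; no equivalence procedure needs to change when
-- the MATH protocol is introduced.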

6.4.2.3 Dance notation

Metacode is not limited to encoding just natural languages and mathematics, although we anticipate that these will be the predominant uses. It is possible to use Metacode to encode other formal systems. For example, consider a system for expressing dance movements: a formal system called “Action Stroke Dance Notation” (ASDN). ASDN is a movement shorthand designed to capture the basic movements of dance as actions and strokes. In ASDN each of the basic actions is represented using a graphic symbol, known as an “action stroke”. Each of these actions indicates a movement of either the leg or arm staffs. Each action stroke is attached to a vertical line called the staffline. Lines written to the left of the staffline

signal a movement of a left limb, while lines written to the right signal a movement of a right limb. [19]

In the illustration on Figure 6-20, #1 is a step with the right leg, #2 is an air gesture with the left leg, #3 is a step with the left leg, and #4 is a touch gesture with the right leg. In Metacode we would encode each of these symbols as individual code points, because they represent the fundamental elements of ASDN and are part of a formal written system. [19]

Figure 6-20. Action Stroke Dance Notation

In ASDN the direction of movement (forward and backward) can also be expressed; these directions are represented by using up and down arrows. In Metacode we would not encode these movement symbols, because these objects are already encoded in Metacode, albeit not as part of ASDN notation. The movement symbols are combined with action strokes to indicate the direction of a stroke. For example, the action stroke on Figure 6-21 indicates a backward movement of the right leg. [19]

Figure 6-21. Action Stroke Dance Notation with movement

In Metacode we could express the combination of an action stroke and a direction in two ways. First, we could encode each combination of an action stroke and direction as an individual code point. Second, we could create a tag for combining a movement direction with an action stroke. We would argue that the second approach is more appropriate, because the combination of an action stroke and a direction does not represent a fundamental unit of the formal system. The combination is a composite object constructed from two fundamental elements. Therefore, we would describe the composite object by using a higher-order protocol tag. For example, in Metacode we would capture the semantics of Figure 6-21 by using the Metacode character sequence on line 1 on Figure 6-22. To simplify comprehension of Figure 6-22, the action stroke and movement direction are specified using their long names (in italics), rather than by their individual code points.

Figure 6-22. Metacode Action Stroke Dance Notation tag

ASDNright-step, down-/ASDN (1)
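The same tagging machinery illustrated for mathematics handles this case; the ASDN tag simply groups two already-encoded fundamental elements. In the sketch below the code points chosen for the ASDN tag and for the right-leg step stroke are hypothetical, while the downwards arrow is ordinary, already-encoded content.

import Data.Word (Word32)

type MetaChar = Word32

-- Hypothetical metadata code points standing in for ASDN ... /ASDN.
asdnStart, asdnEnd :: [MetaChar]
asdnStart = [0xe0040, 0xe0041]
asdnEnd   = [0xe0042]

-- Hypothetical code point for the ASDN right-leg step stroke, and the
-- downwards arrow used here to mark a backward movement.
rightStep, downArrow :: MetaChar
rightStep = 0x10A00
downArrow = 0x2193

-- The composite movement of Figure 6-22: a tagged grouping of two
-- fundamental elements, not a new code point.
backwardRightStep :: [MetaChar]
backwardRightStep = asdnStart ++ [rightStep, downArrow] ++ asdnEnd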

6.5 Benefits of Metacode

We argue that multilingual character coding systems should provide both a set of unambiguous data characters and a mechanism for specifying meta information about those characters. In Metacode characters are identified by their meaning rather than by their shape. Additionally, Metacode’s “data characters” are always distinct and separate from “metadata characters”. Metacode’s open-ended tag mechanism allows for the definition of an unlimited number of possible protocols, yet does not require any future code points. By adopting this framework Metacode is free to deal

entirely with the definition of characters. This approach affords the greatest level of flexibility, while still retaining the ability to process multilingual data efficiently.

Metacode does not dictate how metadata should be used; Metacode deals solely with mechanism. The semantics of protocols are left to higher-order processes. Metacode thereby separates protocol definition from character assignment. The once fuzzy boundary separating characters from protocols is replaced by a well-defined border. This precise separation greatly simplifies the construction of multilingual information processing applications.

Chapter 7 Conclusions

In Metacode information processing is focused on content and away from display. In our architecture roles and responsibilities are clearly indicated. We have sharpened the once indistinct boundary separating code points, characters, and control information. The tasks of assigning code points and defining protocols are now separate and distinct activities. This separation of activities promotes the deprecation of code points that convey control information: in particular, ligatures, control codes, glyph variants, and half-width forms. In Metacode control information is captured by the metadata layer, irrespective of whether the control relates to presentation or content.

7.1 Summary

In Chapter 2, we outlined the overall field of software globalization. We provided definitions for the terms internationalization, localization, and translation. We examined several challenges to creating multilingual software. We described in detail each of the six subfields of software globalization: Translation, International User Interfaces, Cultural/Linguistic Formatting, Keyboard Input, Fonts, and Character Coding Systems.

In Chapter 3, we discussed character sets and character coding systems. We defined the relevant terms related to character coding systems. We examined in detail several character coding schemes, covering both fixed-width stateless and variable-width stateful systems. We made arguments justifying the need for multilingual character coding systems. We described in detail four multilingual coding systems:

Unicode, Multicode, TRON, and EPICIST. We concluded that Unicode’s fixed-width stateless encoding system is a good mechanism for encoding/transmitting code points.

In Chapter 4, we considered problems arising from using multilingual character encodings. Specifically, we examined the problem of processing bidirectional scripts (Arabic, Hebrew, Farsi, and Urdu). Initially, we took the position that the processing of bidirectional text was an algorithmic problem. We explored several algorithms for processing bidirectional text: the Unicode Bidirectional Algorithm, FriBidi, PGBA, ICU, and Java, but found them inadequate. We created a functional bidirectional algorithm (HaBi), because a functional implementation would enable us to discover the true nature of bidirectional text processing. We found our HaBi implementation incomplete, however. We concluded that the bidirectional processing problem was not an algorithmic problem but an architectural problem. The existence of this architectural problem points to fundamental flaws in the underlying character set centric model.

In Chapter 5, we explored several strategies for addressing the shortcomings of the character set centric model. In particular, we looked at using metadata for describing more of the underlying structure of scripts. We examined XML as a metadata model for multilingual information processing, but found it inappropriate. We defined our own general metadata model, presenting evidence of its suitability for multilingual information processing. We introduced several meta tags (text element, direction, mirroring, and language) that showed how complex semantics could be captured in our metadata model. We established that both XML and HTML could be captured using our general metadata model. Applying our metadata model to the bidirectional text processing problem enabled us to discover the true nature of bidirectional text processing (inferencing and reordering). We concluded that our metadata model allowed for a general reorganization of multilingual information processing.

In Chapter 6 we developed our multilingual information processing architecture. Our architecture incorporates character sets, metadata, and core protocols, providing an overall framework for multilingual information processing. We named our architecture Metacode to reflect our focus on higher-order protocols. We established that it is easy to migrate to Metacode because Unicode can be simulated in Metacode. We eliminated the need for complex normalization algorithms by introducing a hierarchy of simple data equivalences (byte, code point, content). We concluded by summarizing the benefits our architecture provides.

7.2 Contributions

In this dissertation we made practical contributions to several issues in multilingual information processing. These contributions emerged from the study of four areas: bidirectional processing, normalization, characters, and higher-order protocols.

In summary the contributions of this dissertation are:

• We developed an architecture that unambiguously separates code points, content, control information, and display.
• We created an architecture that minimizes harmful interactions.
• We created an extendable metadata mechanism for describing higher-order protocols.
• We conducted a detailed analysis of bidirectional reordering algorithms and discovered the essence of bidirectional processing.
• We established that our architecture allows for a separation of bidirectional inferencing from bidirectional reordering.
• The metadata architecture that we developed supports reversible and detectable information processing algorithms.
• We created a hierarchy of fundamental data equivalences that are simple to implement.
• We showed that data equivalence algorithms can operate independently from any particular version of Metacode.

• We demonstrated that it is possible to simulate Unicode in our Metacode system, allowing for easy migration.
• We provided guidance for encoding concepts in Metacode.

In the paragraphs below we discuss each of these contributions.

We presented evidence that supports our argument that there is little separation between content, display, and control information. We showed several instances (e.g., ligatures, half-width forms, and final forms) where it was laborious to separate display information from content. In many places throughout the dissertation we showed the difficulty of separating control information (e.g., bidirectional controls and breaking controls) from content. We developed abstractions and mechanisms for better delineating the boundaries between control information, display, and content.

In our metadata and protocol abstractions we showed that intra-layer change was minimized as we encoded new concepts. In the character set centric view, change cannot be localized (e.g., spacing code points, see section 6.2.3.3). This shift in philosophy is significant because it prevents unanticipated interactions such as those we saw in the case of normalization. We developed a spacing protocol for expressing variable length spaces that did not require any change in the code point layer. As we defined other protocols in the tag definition layer (e.g., combining characters and direction) we showed that the content layer was unaffected.

In the character set centric approach endless code point tinkering is required each time a new linguistic or cultural variation is added (e.g., Arabic presentation forms). This strategy is limiting because there are only a finite number of code points for encoding characters. We demonstrated that in Metacode linguistic and cultural variations are captured by the open-ended metadata definition system. We saw that Metacode’s universal text element protocol captured a wide array of linguistic and cultural information (e.g., glyph variants and final forms, see section 5.6) yet did not require additional code points.

In this dissertation we showed that the character set centric position does not supply the best set of basic building blocks for constructing higher-order protocols. This is important because it leads to the use of overloaded characters and conflicting solutions as we saw in the case of XML. In Metacode we provide a better foundation for higher-order protocols. We showed that our metadata system enabled a more cohesive form of XML (see section 5.4), because we avoid the problems of entity references.

Our analysis of bidirectional algorithms revealed the true nature of bidirectional information processing. This is significant because we have separated the linguistic processing from the information processing. We concluded that bidirectional processing is comprised of two activities: inferencing and reordering (see sections 4.8 and 5.7). Inferencing takes natural language text in its most primitive and basic form and inserts cultural and linguistic assumptions (e.g., language and direction of script) into the stream. Reordering converts attributed natural language text into a form that is suitable for presentation.

We presented evidence that current bidirectional reordering algorithms fail to separate the activities of inferencing and reordering, as we saw in the case of Unicode and FriBidi. This lack of separation causes bidirectional algorithms to generate inappropriate output: code points in display order. We argue that only inferencing is appropriate in the context of character coding systems. Reordering is an activity that should occur in higher-order processes. In our bidirectional algorithm (HaBi) we separate inferencing and reordering, always keeping data in logical order.

We presented evidence that showed that the effects of bidirectional processing were both difficult to detect and reverse. This is significant because in many cases it is necessary to undo bidirectional processing as we saw in the case of domain names. We found that without separating inferencing from reordering the algorithm converting from display order to logical order was not a one-to-one function. The

well-known bidirectional algorithms are not reversible. There are no identifying markers in the processed text, leading to confusion over whether a text stream has been processed. We demonstrated that the effects of our bidirectional algorithm are reversible and detectable.

We demonstrated that there are unexpected and damaging interactions between normalization and bidirectional processing, as we saw in the case of Arabic text (see section 6.3.1.3). We argued that these interactions point to erroneous assumptions about the role of normalization in character coding systems. Normalization of presentation forms cannot be solved at the code point layer. This is crucial because the Metacode approach frees information processing from presentation issues. Current character coding systems assume that the purpose of normalization is to determine if characters look the same. In the Metacode system normalization is never based on the visual appearance of characters, but rather on the underlying abstract meaning of the characters. This allows protocols to coexist without interference, as we saw in the case of bidirectional processing and normalization.

We presented evidence that in the character set centric model normalization algorithms must be rewritten each time a new display variation is added, because the only mechanism for expressing display variation is code points. We showed that Unicode’s normalization tables required updating when a vertical variant of the question exclamation mark character was added (see section 6.3.2). In our Metacode system we have options: we could use code points or metadata protocols. We encoded the vertical question exclamation mark using our universal text element protocol. This approach minimizes the need to rewrite normalization algorithms each time a new display variant is introduced.

We showed that Metacode’s data equivalence algorithms (byte, code point, and content) operate independently from the visual appearance of characters. This is

significant because it allows construction of data equivalence algorithms that do not require change as new protocols and code points are defined.

We presented evidence that Unicode data can be easily converted to Metacode without a loss of semantics, as we saw in the conversion of combining characters and control codes (see section 6.2.6). We showed that Unicode’s normalization algorithm could be simulated in Metacode. Both of these are important because an easy migration path is necessary to encourage the use of Metacode.

Throughout this dissertation we offered guidance and examples for encoding concepts in Metacode. We studied several examples from Hieroglyphics, mathematical typesetting, and Action Stroke Dance Notation to illustrate the correct use of the architecture. We demonstrated that in many cases it was easy to decide whether to use a code point or a protocol. In other cases we found the decision to be more complicated as we saw in Action Stroke Dance Notation.

7.3 Limitations

In our Metacode system we made some trade-offs in order to achieve greater functionality. These trade-offs are summarized as follows:

• Data encoded in Metacode in some cases requires more memory than other multilingual encodings.
• Metacode data in some situations takes longer to transmit than other multilingual encodings, because character streams may use more memory.
• In Metacode the storage unit (character) is no longer equivalent to the logical unit (text element), making manipulation of data more complex.
• Editing of Metacode data is more elaborate.

7.4 Future Work

In this dissertation we provide only a small number of meta tags. Further research and study of higher-order protocols, display hinting, and linguistic elements would be needed in a full implementation of our architecture. In the discussion of our architecture we did not specifically focus on the actual characters that would be encoded in Metacode. In a complete implementation the actual choice of characters would be necessary. This character assignment activity would have to consider issues related to combining characters, Han unification, and control codes. Additional study and debate are needed.

In some cases it might be desirable to keep some number of controls (e.g., new line and bell) as singleton code points to ease migration. In other cases it may be more advantageous to deprecate controls (e.g., right-to-left mark) and recast them as higher-order protocols to avoid the trouble they cause. This issue would require careful consideration and debate.

We anticipate that frequently used higher-order protocols (e.g., ligature) will need to have shorthand or singleton representations to minimize memory utilization. This activity would require identification of the commonly used protocols and discussion over which protocols would benefit the most from using an abbreviated form.

References

[1] Abed, Farough. “Cultural Influences on Visual Scanning Patterns.” Journal of Cross-Cultural Psychology, December 1991, pp 525-535.

[2] Abramson, Dean. “Optimized Implementations of Bidirectional Text Layout and Bidirectional Caret Movement.” 13th International Unicode Conference, September 1998.

[3] Adams, Glen. “Internationalization and Character Set Standards.” Standard View, The ACM Journal on Standardization, Volume 1, 1993.

[4] Alhadif, Mohamed. “International Music Festival.” Alshafha, 10 July 2001, p 3. (in Arabic)

[5] Alvestrand, Harald Tveit. “IETF Policy on Character Sets and Languages.” RFC 1766, March 1995.

[6] Apple Computer. Inside Macintosh Text. Addison-Wesley. 1993.

[7] Apple Computer. “About Apple Advanced Typography Fonts.” February 1998.

[8] Atkin, Steven. “A Dynamic Object-Oriented Approach to Software Internationalization.” Master’s Thesis Florida Institute of Technology, December 1994.

[9] Atkin, Steven and Borgendale, Ken. “IBM Graphical Locale Builder.” 12th International Unicode Conference, April 1998.

[10] Atkin, Steven and Stansifer, Ryan. “Implementations of Bidirectional Reordering Algorithms.” 18th International Unicode Conference, April 2001.

[11] Au, Sunny. “Hello, World! A Guide For Transmitting Multilingual Electronic Mail.” Proceedings of the 23rd ACM SIGUCCS conference on Winning the networking game, October 1995, pp 35-39.

[12] Becker, Joseph. “Arabic Word Processing.” Communications of the ACM, July 1987, Volume 30, Number 7, pp 600-610.

[13] Becker, Joseph. “Unicode 88.” Xerox Corporation, 1988.

[14] Belge, Matt. “The Next Step In Software Internationalization.” Interactions, January 1995, Volume 2, Number 1, pp 21-25.

[15] Bemer, R. W. “The American Standard Code For Information Interchange.” Datamation, 9, No. 8, 32-36, August 1963, and ibid 9, No. 9, 39-44, September 1963.

[16] Bettels, Jürgen and Bishop, Avery F. “Unicode: A Universal Character Code.” Digital Technical Journal, 1993, Number 3, Volume 5, pp 21-31.

[17] Budge, E.A. Wallis. An Egyptian Hieroglyphic Dictionary. Dover Publications. 1978.

[18] Clark, James. “Minority WG Opinion on XML C14N and Unicode C14N.” Available: http://www19.w3.org/Archives/Public/www-xml-canonicalization- comments/2000Jan/0000.. Retrieved: January 21, 2001.

[19] Cooper, Iver P. “Action Stroke Dance Notation.” Available: http:// www.geocites.com/Broadway/Stage/2806/. Retrieved: August 12, 2001.

[20] Davis, Mark. et al. “Creating Global Software: Text Handling and Localization in ’s CommonPoint Application System.” IBM Systems Journal, 1996, Number 2, Volume 35, pp 227-242.

[21] Davis, Mark. et al. “International Text In JDK 1.2.” Available: http:// www.ibm.com/java/education/international-text/. Retrieved: July 17, 2000.

[22] Democratic Peoples Republic of Korea. “DPRK Standard Korean Graphic Character Set for Information Interchange.” KPS 9566-97, 1997.

[23] Dürst, Martin and Freytag, Asmus “Unicode in XML and Other Markup Languages.” Available: http://www.unicode.org/unicode/reports/tr20. Retrieved: January 9, 2001.

[24] Edberg, Peter. “Tutorial: Survey of Character Encodings.” 11th International Unicode Conference, September 1997.

[25] Erickson, Thomas D. “Working With Interface Metaphors.” in The Art of Human Computer Interface Design. edited by Brenda Laurel, Addison- Wesley. 1990.

[26] European Computer Manufacturers Association. “7-Bit Coded Character Set.” ECMA-6, December 1991.

[27] European Computer Manufacturers Association. “Character Code Structure and Extension Techniques.” ECMA-35, December 1994.

[28] Fateman, Richard “Algol 60, a language, a report.” Available: http:// www.cs.berkley.edu/~fateman/264/lec/notes17.pdf. Retrieved: April 17, 2001.

[29] Fernandes, Tony. Global Interface Design. AP Professional. 1995.

[30] Flanagan, David. Java in a Nutshell. O’Reilly and Associates. 1999.

[31] Goundry, Norman “Why Unicode Won’t Work On The Internet: Linguistic, Political, and Technical Limitations.” Available: http:// www.hastingsresearch.com/net/04-unicode-limitations.shtml. Retrieved: June 5, 2001.

[32] Graham, Tony. Unicode A Primer. M&T Books. 2000.

[33] Graham, Tony. “Changes in Unicode that led to changes in XML 1.0 Second Edition.” Available: http://www-106.ibm.com/developerworks/library/u-xml.html. Retrieved: January 20, 2001.

[34] Grobgeld, Dov. “A Free Implementation of the Unicode Bidi Algorithm.” Available: http://imagic.weizmann.ac.il/~dov/freesw/FriBidi/. Retrieved: July 17, 2000.

[35] Hall, William S. “Internationalization in Windows NT, Part:1 Programming with Unicode.” Microsoft Systems Journal, June 1994, pp 57-71.

[36] Holmes, Neville. “Toward Decent Text Encoding.” IEEE Computer, 1998, Number 8, Volume 31, August, pp 108-109.

[37] Holmes, Nigel. “An Introduction to Pictorial Symbols.” in Designing Pictorial Symbols, Watson-Guptill, 1990.

[38] Horton, William. The Icon Book: Visual Symbols for Computer Systems and Documentation. John Wiley & Sons. 1994.

[39] Hughes, John. “Why Functional Programming Matters.” Computer Journal, 1989, Volume 32, Number 2, pp 98-107.

[40] International Business Machines Corporation. National Language Design Guide, NLDG Volume 2. IBM Canada Ltd. 1994.

[41] International Business Machines Corporation. National Language Support Bidi Guide, NLDG Volume 3. IBM Canada Ltd. 1995.

[42] International Business Machines Corporation. MBCS/DBCS Character Set and Code Page System Architecture. IBM Japan Ltd. 1996.

[43] International Business Machines Corporation. MBCS Cross System Guide: Volume1 - Character Set and Code Page. IBM Japan Ltd. 1997.

[44] International Business Machines Corporation. OS/2 Warp Server for e-business Keyboards and Code Pages. IBM. 1999.

[45] International Business Machines Corporation. “IBM Classes for Unicode.” Available: http://www.ibm.com/java/tools/international-classes/index.html. Retrieved: July 17, 2000.

[46] International Organization for Standardization. “ISO 7-bit Coded Character Set for Information Interchange.” International Standard ISO/IEC 646:1991, 1991.

[47] International Organization for Standardization. “8-bit Single-Byte Coded Graphic Character Sets - Part 1: No. 1.” International Standard ISO/IEC 8859-1: 1998, 1998.

[48] Ishida, Richard. “Non-Latin Writing Systems: Characteristics and Impact on Multinational Product Design.” 18th International Unicode Conference, April 2001.

[49] Jennings, Tom. “ASCII: American Standard Code for Information Infiltration.” Available: http://fido.wps.com/texts/codes. Retrieved: January 9, 2001.

[50] Jones, Scott. et. al. Digital Guide to Developing International User Information. Digital Press. 1992.

[51] Jones, Simon P. et al. “Report on the Haskell 98, A Non-strict, Purely Functional Language.” Yale University, Department of Computer Science Tech Report YALEU/DCS/RR-1106, February 1999.

[52] Kano, Nadine. Developing International Software For Windows 95 and Windows NT. Microsoft Press. 1995.

[53] Kataoka, Tomoko I. et al. “Internationalized Text Manipulation Covering Perso-Arabic Enhanced for Mongolian Scripts.” Lecture Notes in Computer Science, 1998, Volume 1375, pp 305-318.

[54] Korpela, Jukka. “A Tutorial on Character Code Issues.” Available: http://www.hut.fi/u/jkorpela/chars.html. Retrieved: January 9, 2001.

[55] Lehtola, Aarno and Honkela, Timo. “A Framework for Global Software.” Proceedings of the 1st ERCIM Workshop on ‘User Interfaces for All’, 1995.

[56] Leisher, Mark. “The UCData Unicode Character Properties and Bidi Algorithm Package.” Available: http://crl.nmsu.edu/~mleisher/ucdata.html. Retrieved: July 17, 2000.

[57] Lunde, Ken. CJKV Information Processing. O’Reilly. 1999.

[58] Lunde, Ken. “A New Standard For Japanese.” Multilingual Computing and Technology, 2000, Number 35, Volume 11, Issue 7, pp 45-46.

[59] Lunde, Ken. “CJKV Character Set and Encoding Developments.” Multilingual Computing and Technology, 2001, Number 39, Volume 12, Issue 3, pp 53-55.

[60] Lunde, Ken “What’s New In Unicode 3.1.” Multilingual Computing and Technology, 2001, Number 42, Volume 12, Issue 6, p 51.

[61] Luong, Tuoc V. et. al. Internationalization Developing Software for Global Markets. John Wiley and Sons Inc. 1995.

[62] MacKay, Pierre. “Typesetting Problem Scripts.” Byte Magazine, 1986, Volume 11, Number 2, pp 201-218.

[63] Madell, Tom. et. al. Developing and Localizing International Software. Prentice Hall. 1994.

[64] Maeda, Akira. “Studies on Multilingual Information Processing.” Doctor’s Thesis Nara Institute of Science and Technology, September 18, 2000.

[65] Math Forum, The. “Math Typesetting for the Internet.” Available: http:// forum.swarthmore.edu/typesetting/. Retrieved: August 13, 2001.

[66] Meyer, Dirk. “New Hong Kong Character Standard.” Multilingual Computing and Technology, 2000, Number 30, Volume 11, Issue 2, pp 30-32.

[67] Meyer, Dirk. “A New Chinese Character Set Standard.” Multilingual Computing and Technology, 2001, Number 37, Volume 12, Issue 1, pp 63-68.

[68] Meyer, Dirk. “Two New Chinese Character Standards: HK SCS & GB 18030- 2000.” 18th International Unicode Conference, April 2001.

[69] Microsoft. “TrueType Open Font Specification.” version 1.0. July 1995.

[70] Miller, Gary. Personal Correspondence. September 21, 2001.

[71] Milo, Thomas. “Creating Solutions for Arabic: A Case Study.” 18th International Unicode Conference, April 2001.

[72] Morrison, Michael. et al. XML Unleashed. Sams Publishing. 1999.

[73] Mount Tahoma High School. “Japanese Tutorial.” Available: http:// www.tacoma.k12.wa.us/schools/hs/mount_tahoma/dept/japanese/. Retrieved: August 24, 2001.

[74] Mudawwar, Muhammad F. “Multicode: A Truly Multilingual Approach to Text Encoding.” IEEE Computer, 1997, Number 4, Volume 30, April, pp 37- 43.

[75] O’Donnell, Sandra M. Programming for the World - A Guide to Internationalization. Prentice Hall. 1994.

[76] Ohta, Masataka. “On Plain Text.” International Symposium on Multilingual Information Processing, Tsukuba Japan, March 25-26, 1996.

[77] Omniglot. “ of Chinese.” Available: http://www.omniglot.com/writing/chinese2.htm. Retrieved: August 30, 2001.

[78] Osawa, Noritaka. “EPICS: An Efficient, Programmable and Interchangeable Code System for WWW.” 6th International World Wide Web Conference, April, 1997.

[79] Osawa, Noritaka. “A Multilingual Information Processing Infrastructure for Global Digital Libraries: EPICIST.” Proceedings of the International Symposium on Research, Development and Practice in Digital Libraries, November, 1997.

[80] Rossini, Stephane. Egyptian Hieroglyphics. Dover Publications. 1989.

[81] Ruesch, Jurgen and Kees, Weldon. “The Language of Identification and Recognition.” in Nonverbal Communication, Berkeley: University of California Press. 1970.

[82] Sakamura, Ken. “The TAD Language Environment and Multilingual Handling.” TRONWARE, 1992, Volume 50, pp 49-57.

[83] Salomon, Gitta. “New Uses for Color.” in The Art of Human Computer Interface Design. edited by Brenda Laurel, Addison-Wesley. 1990.

[84] Scherer, Markus. “GB 18030: A Mega-Codepage.” Available: http://www-106.ibm.com/developerworks/unicode/library/u-china.html. Retrieved: August 31, 2001.

[85] Schmitt, David A. International Programming for Microsoft Windows. Microsoft Press. 2000.

[86] Searfoss, Glenn. JIS-Kanji Character Recognition. Van Nostrand Reinhold. 1994.

[87] Smura, Edwin J. and Provan, Archie D. “Toward a New Beginning: The Development of a Standard for Font and Character Encoding to Control Electronic Document Interchange.” IEEE Transactions on Professional Communication. 1987, Number 4, Volume PC-30, pp 259-264.

[88] Stallman, Richard. GNU Emacs Manual. Free Software Foundation. 1999.

[89] Sun Microsystems. “ Language Support in the Solaris Operating Environment.” Available: http://www.sun.com/software/white-papers/wp-cttlanguage/. Retrieved: July 17, 2000.

[90] Tanenbaum, Andrew S. Computer Networks. Prentice Hall. 1996.

[91] Taylor, David. Global Software: Developing Applications for the International Market. Springer-Verlag. 1992.

[92] Turley, James. “Computing In Vietnamese.” Multilingual Computing and Technology, 1998, Number 20, Volume 9, Issue 4, pp 25-29.

[93] Turley, James. “Computing In Chinese Poses Unique Challenges.” Multilingual Computing and Technology, 1999, Number 27, Volume 10, Issue 5, pp 30-33.

[94] Turley, James. “Computing in Chinese.” Multilingual Computing and Technology, 2000, Number 28, Volume 10, Issue 6, pp 28-30.

[95] Tuthill, Bill. and Smallberg, David. Creating Worldwide Software. Sun Microsystems Press. 1997.

[96] Unicode Consortium, The. The Unicode Standard, Version 3.0. Addison- Wesley. 2000.

[97] Unicode Consortium, The. “Unicode 3.0.1.” Available: http:// www.unicode.org/unicode/standard/versions/Unicode3.0.1.html. Retrieved: January 17, 2001.

[98] Unicode Consortium, The. “Unicode 3.1.” Available: http://www.unicode.org/ unicode/standard/versions/Unicode3.1.html. Retrieved: August 13, 2001.

[99] Unicode Consortium, The. “Plane 14 Characters for Language Tags.” Available: http://www.unicode.org/reports/tr7. Retrieved: January 9, 2001.

[100] Unicode Consortium, The. “Unicode Technical Report #9 - The Bidirectional Algorithm.” Available: http://www.unicode.org/unicode/reports/tr9/tr9-6.html. Retrieved: July 17, 2000.

[101] Unicode Consortium, The. “Unicode Standard Annex #15 - Unicode Normalization Forms.” Available: http://www.unicode.org/unicode/reports/ tr15. Retrieved: January 9, 2001.

[102] Van Camp, David. “Unicode and Software Globalization.” Dr. Dobb’s Journal, 1994, March, pp 46-50.

[103] Vine, Andrea. “An Overview Of The Unicode Standard 2.1.” Multilingual Computing and Technology, 1998, Number 23, Volume 10, Issue 1, pp 50-52.

[104] Vine, Andrea. “Demystifying Character Sets.” Multilingual Computing and Technology, 1999, Number 26, Volume 10, Issue 4, pp 48-52.

[105] Walters, Richard F. “Design of a Bitmapped Multilingual Workstation.” IEEE Computer, 1990, Number 2, Volume 23, February, pp 33-41.

[106] Weider, Chris. et. al. “The Report of the IAB Character Set Workshop.” RFC 2130, April 1997.

[107] World Wide Web Consortium, The. “Character Model for the World Wide Web.” Available: http://www.w3.org/TR/charmod. Retrieved April 10, 2001.

[108] X/Open. X/Open Internationalisation Guide. X/Open Company Ltd. 1992.

[109] Yau, Michael M. T. “Supporting the Chinese, Japanese, and Korean Languages in the OpenVMS Operating System.” Digital Technical Journal, 1993, Number 3, Volume 5, pp 63-79.

[110] Yergeau, F. “UTF-8 A Transformation Format of Unicode and ISO-10646.” RFC 2044, October 1996.

[111] Yevgrashina, Lada. “Azerbaijan Drops Cyrillic for Latin Script.” Reuters, August 3, 2001.

Appendix A

-- Rule P2, P3 determine base level of text from the first strong
-- directional character
p2_3 :: [Attributed] -> Int
p2_3 [] = 0
p2_3 ((_,L):xs) = 0
p2_3 ((_,AL):xs) = 1
p2_3 ((_,R):xs) = 1
p2_3 (_:xs) = p2_3 xs

-- Rules X2 - X9
x2_9 :: [Int] -> [Bidi] -> [Bidi] -> [Attributed] -> [Level]
x2_9 _ _ _ [] = []
x2_9 (l:ls) os es ((x,RLE):xs)
    = x2_9 ((add l R):l:ls) (N:os) (RLE:es) xs
x2_9 (l:ls) os es ((x,LRE):xs)
    = x2_9 ((add l L):l:ls) (N:os) (LRE:es) xs
x2_9 (l:ls) os es ((x,RLO):xs)
    = x2_9 ((add l R):l:ls) (R:os) (RLO:es) xs
x2_9 (l:ls) os es ((x,LRO):xs)
    = x2_9 ((add l L):l:ls) (L:os) (LRO:es) xs
x2_9 ls os (e:es) ((x,PDF):xs)
    | elem e [RLE,LRE,RLO,LRO] = x2_9 (tail ls) (tail os) es xs
x2_9 ls os es ((x,PDF):xs)
    = x2_9 ls os es xs
x2_9 ls os es ((x,y):xs)
    | (head os) == N = ((head ls),x,y) : x2_9 ls os es xs
    | otherwise = ((head ls),x,(head os)) : x2_9 ls os es xs

-- Rule X10 group characters by level
x10 :: (Int, Int) -> [Level] -> Run
x10 (sor,eor) xs
    | even sor && even eor = LL xs
    | even sor && odd eor = LR xs
    | odd sor && even eor = RL xs
    | otherwise = RR xs

-- Process explicit characters X1 - X10
explicit :: Int -> [Attributed] -> [Run]
explicit l xs = zipWith x10 (runList levels l l) groups
    where levels = (map (\x -> level (head x)) groups)
          groups = groupBy levelEql (x2_9 [l] [N] [] xs)

-- Rules W1 - W7
w1_7 :: [Level] -> Bidi -> Bidi -> [Level]
w1_7 [] _ _ = []
w1_7 ((x,y,L):xs) _ _ = (x,y,L):(w1_7 xs L L)
w1_7 ((x,y,R):xs) _ _ = (x,y,R):(w1_7 xs R R)
w1_7 ((x,y,AL):xs) _ _ = (x,y,R):(w1_7 xs AL R)
w1_7 ((x,y,AN):xs) dir _ = (x,y,AN):(w1_7 xs dir AN)
w1_7 ((x,y,EN):xs) AL _ = (x,y,AN):(w1_7 xs AL AN)
w1_7 ((x,y,EN):xs) L _ = (x,y,L):(w1_7 xs L EN)
w1_7 ((x,y,EN):xs) dir _ = (x,y,EN):(w1_7 xs dir EN)
w1_7 ((x,y,NSM):xs) L N = (x,y,L):(w1_7 xs L L)
w1_7 ((x,y,NSM):xs) R N = (x,y,R):(w1_7 xs R R)
w1_7 ((x,y,NSM):xs) dir last = (x,y,last):(w1_7 xs dir last)
w1_7 ((a,b,ES):(x,y,EN):xs) dir EN =
    (a,b,EN):(x,y,EN):(w1_7 xs dir EN)
w1_7 ((a,b,CS):(x,y,EN):xs) dir EN =
    (a,b,EN):(x,y,EN):(w1_7 xs dir EN)
w1_7 ((a,b,CS):(x,y,EN):xs) AL AN =
    (a,b,AN):(x,y,AN):(w1_7 xs AL AN)
w1_7 ((a,b,CS):(x,y,AN):xs) dir AN =
    (a,b,AN):(x,y,AN):(w1_7 xs dir AN)
w1_7 ((x,y,ET):xs) dir EN = (x,y,EN):(w1_7 xs dir EN)
w1_7 ((x,y,z):xs) dir last
    | z==ET && findEnd xs ET == EN && dir /= AL
        = (x,y,EN):(w1_7 xs dir EN)
    | elem z [CS,ES,ET] = (x,y,ON):(w1_7 xs dir ON)
    | otherwise = (x,y,z):(w1_7 xs dir z)

-- Process a run of weak characters W1 - W7
weak :: Run -> Run
weak (LL xs) = LL (w1_7 xs L N)
weak (LR xs) = LR (w1_7 xs L N)
weak (RL xs) = RL (w1_7 xs R N)
weak (RR xs) = RR (w1_7 xs R N)

-- Rules N1 - N2
n1_2 :: [[Level]] -> Bidi -> Bidi -> Bidi -> [Level]
n1_2 [] _ _ base = []
n1_2 (x:xs) sor eor base
    | isLeft x = x ++ (n1_2 xs L eor base)
    | isRight x = x ++ (n1_2 xs R eor base)
    | isNeutral x && sor == R && (dir xs eor) == R
        = (map (newBidi R) x) ++ (n1_2 xs R eor base)
    | isNeutral x && sor == L && (dir xs eor) == L
        = (map (newBidi L) x) ++ (n1_2 xs L eor base)
    | isNeutral x =
        (map (newBidi base) x) ++ (n1_2 xs sor eor base)
    | otherwise = x ++ (n1_2 xs sor eor base)

-- Process a run of neutral characters N1 - N2
neutral :: Run -> Run
neutral (LL xs) = LL (n1_2 (groupBy neutralEql xs) L L L)
neutral (LR xs) = LR (n1_2 (groupBy neutralEql xs) L R L)
neutral (RL xs) = RL (n1_2 (groupBy neutralEql xs) R L R)
neutral (RR xs) = RR (n1_2 (groupBy neutralEql xs) R R R)

-- Rule I1, I2
i1_2 :: [[Level]] -> Bidi -> [Level]
i1_2 [] _ = []
i1_2 ((x:xs):ys) dir
    | attrib x == R && dir == L
        = (map (newLevel 1) (x:xs)) ++ (i1_2 ys L)
    | elem (attrib x) [AN,EN] && dir == L
        = (map (newLevel 2) (x:xs)) ++ (i1_2 ys L)
    | elem (attrib x) [L,AN,EN] && dir == R
        = (map (newLevel 1) (x:xs)) ++ (i1_2 ys R)
i1_2 (x:xs) dir = x ++ (i1_2 xs dir)

-- Process a run of implicit characters I1 - I2
implicit :: Run -> Run
implicit (LL xs) = LL (i1_2 (groupBy bidiEql xs) L)
implicit (LR xs) = LR (i1_2 (groupBy bidiEql xs) L)
implicit (RL xs) = RL (i1_2 (groupBy bidiEql xs) R)
implicit (RR xs) = RR (i1_2 (groupBy bidiEql xs) R)

-- If a run is odd then reverse the characters
reverseRun :: [Level] -> [Level]
reverseRun [] = []
reverseRun (x:xs)
    | even (level x) = x:xs
    | otherwise = reverse (x:xs)

reverseLevels :: [[Level]] -> [[Level]] -> Int -> [[Level]]
reverseLevels w [] _ = w
reverseLevels w (x:xs) a = if (level (head x)) >= a
                           then reverseLevels (x:w) xs a
                           else w ++ [x] ++ (reverseLevels [] xs a)

-- Rule L2 Reorder
reorder :: [Run] -> Bidi -> [[Level]]
reorder xs base = foldl (reverseLevels []) runs levels
    where
        flat = concat (map toLevel xs)
        runs = map reverseRun (groupBy levelEql flat)
        levels = getLevels runs

-- Rule L4 Mirrors
mirror :: [Level] -> [Level]
mirror [] = []
mirror ((x,y,R):xs) = case getMirror y of
                        Nothing -> (x,y,R):(mirror xs)
                        Just a -> (x,a,R):(mirror xs)
mirror (x:xs) = x:(mirror xs)

logicalToDisplay :: [Attributed] -> [Ucs4]
logicalToDisplay attribs
    = let baseLevel = p2_3 attribs in
      let baseDir = (if odd baseLevel then R else L) in
      let x = explicit baseLevel attribs in
      let w = map weak x in
      let n = map neutral w in
      let i = map implicit n in
      map character (mirror (concat (reorder i baseDir)))

Appendix B

-- Unicode metadata tags
dirL = map intToWord32 [0xe0044,0xe0049,0xe0052,0xe0002,0xe004c,0xe0001]
dirR = map intToWord32 [0xe0044,0xe0049,0xe0052,0xe0002,0xe0052,0xe0001]
dirEnd = map intToWord32 [0xe007f,0xe0044,0xe0049,0xe0052,0xe0001]
parL = map intToWord32 [0xe0050,0xe0041,0xe0052,0xe0002,0xe004c,0xe0001]
parR = map intToWord32 [0xe0050,0xe0041,0xe0052,0xe0002,0xe0052,0xe0001]
parEnd = map intToWord32 [0xe007f,0xe0050,0xe0041,0xe0052,0xe0001]

-- Mark the level with the bidi tags
tagLevel :: Int -> [Level] -> [Ucs4]
tagLevel _ [] = []
tagLevel level ((x,y,z):xs)
    | level /= x && even x
        = dirL ++ (map character ((x,y,z):xs)) ++ dirEnd
    | level /= x && odd x
        = dirR ++ (map character ((x,y,z):xs)) ++ dirEnd
    | otherwise
        = map character ((x,y,z):xs)

-- Mark the run with the bidi tags
tagRun :: Int -> Run -> [Ucs4]
tagRun z (LL xs) = parL ++ concat (map (tagLevel z)
                   (groupBy levelEql (mirror xs))) ++ parEnd
tagRun z (LR xs) = parL ++ concat (map (tagLevel z)
                   (groupBy levelEql (mirror xs))) ++ parEnd
tagRun z (RL xs) = parR ++ concat (map (tagLevel z)
                   (groupBy levelEql (mirror xs))) ++ parEnd
tagRun z (RR xs) = parR ++ concat (map (tagLevel z)
                   (groupBy levelEql (mirror xs))) ++ parEnd

-- Insert mirror tags
mirror :: [Level] -> [Level]
mirror [] = []
mirror ((x,y,R):xs)
    | isMirrored y
        = (x,0xe004d,R):(x,0xe0049,R):(x,0xe0052,R):(x,y,R)
          : mirror xs
    | otherwise = (x,y,R) : (mirror xs)
mirror (x:xs) = x : (mirror xs)

Appendix C

import java.util.*;
import java.io.*;

public class UniMeta {
    BufferedReader dataIn;
    String dirL = "\udb40\udc44\udb40\udc49\udb40\udc52\udb40\udc02\udb40\udc4c" +
                  "\udb40\udc01",
           dirR = "\udb40\udc44\udb40\udc49\udb40\udc52\udb40\udc02\udb40\udc52" +
                  "\udb40\udc01",
           dirEnd = "\udb40\udc7f\udb40\udc44\udb40\udc49\udb40\udc52\udb40\udc01",
           parL = "\udb40\udc50\udb40\udc41\udb40\udc52\udb40\udc02\udb40\udc4c" +
                  "\udb40\udc01",
           parR = "\udb40\udc50\udb40\udc41\udb40\udc52\udb40\udc02\udb40\udc52" +
                  "\udb40\udc01",
           parEnd = "\udb40\udc7f\udb40\udc50\udb40\udc41\udb40\udc52\udb40\udc01",
           mirror = "\udb40\udc4d\udb40\udc49\udb40\udc52";

    String lBDO = "<bdo dir=\"ltr\">",
           rBDO = "<bdo dir=\"rtl\">",
           lP = "<p dir=\"ltr\">",
           rP = "<p dir=\"rtl\">",
           endP = "</p>",
           endBDO = "</bdo>";

    // Open the input file
    public UniMeta(String in) {
        try {
            FileInputStream fileIn = new FileInputStream(in);
            InputStreamReader str =
                new InputStreamReader(fileIn, "UTF8");
            dataIn = new BufferedReader(str);
        }
        catch (Exception e) {
            System.out.println("Error opening file " + in);
            return;
        }
    }

    // Replace the unicode meta tags with HTML tags
    private String replace(String in) {
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (i < in.length()) {
            if (in.startsWith(parL, i)) {
                out.append(lP + lBDO);
                i += parL.length();
            }
            else if (in.startsWith(parR, i)) {
                out.append(rP + rBDO);
                i += parR.length();
            }
            else if (in.startsWith(dirL, i)) {
                out.append(lBDO);
                i += dirL.length();
            }
            else if (in.startsWith(dirR, i)) {
                out.append(rBDO);
                i += dirR.length();
            }
            else if (in.startsWith(dirEnd, i)) {
                out.append(endBDO);
                i += dirEnd.length();
            }
            else if (in.startsWith(parEnd, i)) {
                out.append(endBDO + endP);
                i += parEnd.length();
            }
            else if (in.startsWith(mirror, i)) {
                i += mirror.length();
            }
            else {
                out.append(in.charAt(i));
                ++i;
            }
        }
        return (out.toString());
    }

    // Process the input stream, generate output to stdio
    public void parse() {
        String in = null;
        System.out.println("<html>");
        try {
            while ((in = dataIn.readLine()) != null) {
                System.out.println(replace(in));
            }
        }
        catch (Exception e) {
            System.out.println("Error parsing file");
            return;
        }
        System.out.println("</html>");
    }

    public static void main(String[] args) {
        UniMeta input = new UniMeta(args[0]);
        input.parse();
    }
}

Appendix D

module Metacode where

import Unicode
import Word

type MetaChar = Ucs4

listFilter :: Eq a => (a -> Bool) -> [a] -> [a]
listFilter _ [] = []
listFilter f (x:xs) = if f x then listFilter f xs else x:(listFilter f xs)

listEqual :: Eq a => [a] -> [a] -> Bool
listEqual [] [] = True
listEqual [] _ = False
listEqual _ [] = False
listEqual (x:xs) (y:ys) = if x == y then listEqual xs ys else False

isMetadata :: MetaChar -> Bool
isMetadata x
    | x >= 0xe0000 && x <= 0xe007f = True
    | otherwise = False

byteEquivalent :: [Word8] -> [Word8] -> Bool
byteEquivalent xs ys = listEqual xs ys

codePointEquivalent :: [MetaChar] -> [MetaChar] -> Bool
codePointEquivalent xs ys = listEqual xs ys

contentEquivalent :: [MetaChar] -> [MetaChar] -> Bool
contentEquivalent xs ys
    = let fxs = listFilter isMetadata xs
          fys = listFilter isMetadata ys in
      codePointEquivalent fxs fys
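-- Example usage (illustrative values only; assumes the Ucs4 type from the
-- Unicode module is a 32-bit word, as in Appendix B, so the numeric
-- literals below are valid Ucs4 values):
--
--   tagged = [0xe0044, 0xe0001, 0x61, 0x62]   -- metadata followed by "ab"
--   plain  = [0x61, 0x62]
--
--   codePointEquivalent tagged plain  == False
--   contentEquivalent   tagged plain  == True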
