A Framework for Multilingual Information Processing by Steven Edward Atkin Bachelor of Science Physics State University of New Y
Total Page:16
File Type:pdf, Size:1020Kb
A Framework for Multilingual Information Processing by Steven Edward Atkin Bachelor of Science Physics State University of New York, Stony Brook 1989 Master of Science in Computer Science Florida Institute of Technology 1994 A dissertation submitted to the College of Engineering at Florida Institute of Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science Melbourne, Florida December, 2001 We the undersigned committee hereby recommend that the attached document be accepted as fulfilling in part the requirements for the degree of Doctor of Philosophy of Computer Science “A Framework for Multilingual Information Processing,” a dissertation by Steven Edward Atkin __________________________________ Ryan Stansifer, Ph.D. Associate Professor, Computer Science Dissertation Advisor __________________________________ Phil Bernhard, Ph.D. Associate Professor, Computer Science __________________________________ James Whittaker, Ph.D. Associate Professor, Computer Science __________________________________ Gary Howell, Ph.D. Professor, Mathematics __________________________________ William Shoaff, Ph.D. Associate Professor and Head, Computer Science Abstract Title: A Framework for Multilingual Information Processing Author: Steven Edward Atkin Major Advisor: Ryan Stansifer, Ph.D. Recent and (continuing) rapid increases in computing power now enable more of humankind’s written communication to be represented as digital data. The most recent and obvious changes in multilingual information processing have been the introduction of larger character sets encompassing more writing systems. Yet the very richness of larger collections of characters has made the interpretation and pro- cessing of text more difficult. The many competing motivations (satisfying the needs of linguists, computer scientists, and typographers) for standardizing charac- ter sets threaten the purpose of information processing: accurate and facile manipu- lation of data. Existing character sets are constructed without a consistent strategy or architecture. Complex algorithms and reports are necessary now to understand raw streams of characters representing multilingual text. We assert that information processing is an architectural problem and not just a character set problem. We analyze several multilingual information process- ing algorithms (e.g., bidirectional reordering and character normalization) and we conclude that they are more dangerous than beneficial. The countless number of unexpected interactions suggest a lack of a coherent architecture. We introduce abstractions, novel mechanisms, and take the first steps towards organizing them into a new architecture for multilingual information processing. We propose a multi- layered architecture which we call Metacode where character sets appear in lower layers and protocols and algorithms in higher layers. We recast bidirectional reor- dering and character normalization in the Metacode framework. iii Table of Contents List of Figures ................................................. ix List of Tables .................................................. xi Acknowledgement ............................................. xiii Dedication .................................................... xiv Chapter 1 Introduction...........................................1 1.1 ArchitectureOverview....................................................3 1.2 ProblemStatement.......................................................5 1.3 Outline of Dissertation. 5 Chapter 2 Software Globalization ..................................7 2.1 Overview..............................................................7 2.2 Translation.............................................................9 2.3 International User Interfaces . 10 2.3.1 Metaphors..........................................................10 2.3.2 Geometry..........................................................11 2.3.3 Color..............................................................11 2.3.4 Icons..............................................................12 2.3.5 Sound.............................................................12 2.4 CulturalandLinguisticFormatting.........................................13 2.4.1 NumericFormatting..................................................13 2.4.2 Date and Time Formatting . 14 2.4.3 CalendarSystems....................................................14 2.4.4 Measurement.......................................................15 2.4.5 Collating...........................................................15 2.4.6 CharacterClassification...............................................16 2.4.7 Locales............................................................17 2.5 KeyboardInput........................................................17 2.6 Fonts.................................................................18 2.7 CharacterCodingSystems................................................20 Chapter 3 Character Coding Systems..............................22 3.1 Terms................................................................22 3.2 CharacterEncodingSchemes..............................................24 3.3 EuropeanEncodings....................................................24 iv 3.3.1 ISO7-bitCharacterEncodings.........................................27 3.3.2 ISO8-bitCharacterEncodings.........................................29 3.3.3 VendorSpecificCharacterEncodings....................................33 3.4 JapaneseEncodings.....................................................40 3.4.1 PersonalComputerEncodingMethod....................................43 3.4.2 ExtendedUnixCodeEncodingMethod..................................45 3.4.3 HostEncodingMethod...............................................46 3.4.4 InternetExchangeMethod.............................................47 3.4.5 VendorSpecificEncodings............................................49 3.5 ChineseEncodings......................................................50 3.5.1 PeoplesRepublicofChina.............................................51 3.5.2 RepublicofChina...................................................53 3.5.3 HongKongSpecialAdministrativeRegion................................54 3.6 KoreanEncodings......................................................55 3.6.1 SouthKorea........................................................55 3.6.2 NorthKorea........................................................55 3.7 VietnameseEncodings...................................................56 3.8 MultilingualEncodings..................................................57 3.8.1 Why Are Multilingual Encodings Necessary?. 57 3.8.2 UnicodeandISO-10646...............................................59 3.8.2.1 HistoryofUnicode................................................60 3.8.2.2 GoalsofUnicode.................................................60 3.8.2.3 Unicode’sPrinciples..............................................61 3.8.2.4 DifferencesBetweenUnicodeandISO-10646..........................62 3.8.2.5 Unicode’sOrganization............................................63 3.8.2.6 UnicodeTransmissionForms........................................67 3.8.3 CriticismofUnicode .................................................68 3.8.3.1 ProblemsWithCharacter/GlyphSeparation............................69 3.8.3.2 ProblemsWithHanUnification......................................69 3.8.3.3 ISO-8859-1Compatibility..........................................70 3.8.3.4 Efficiency.......................................................71 3.8.4 Mudawwar’s Multicode . 71 3.8.4.1 Character Sets in Multicode . 71 3.8.4.2 Character Set Switching in Multicode . 72 3.8.4.3 FocusonWrittenLanguages........................................73 3.8.4.4 ASCII/UnicodeCompatibility.......................................74 3.8.4.5 Glyph Association in Multicode . 74 3.8.5 TRON.............................................................74 3.8.5.1 TRONSingleByteCharacterCode...................................75 3.8.5.2 TRONDoubleByteCharacterCode..................................76 3.8.5.3 JapaneseTRONCode.............................................76 3.8.6 EPICIST...........................................................77 v 3.8.6.1 EPICISTCodePoints..............................................78 3.8.6.2 EPICISTCharacterCodeSpace......................................78 3.8.6.3 Compatibility With Unicode . 78 3.8.6.4 EpicVirtualMachine..............................................79 3.8.6.5 UsingtheEpicVirtualMachineforAncientSymbols....................79 3.8.7 Current Direction of Multilingual Encodings . 80 Chapter 4 Bidirectional Text .....................................81 4.1 NonLatinScripts.......................................................82 4.1.1 ArabicandHebrewScripts............................................83 4.1.1.1 Cursive.........................................................83 4.1.1.2 Position.........................................................84 4.1.1.3 Ligatures........................................................84 4.1.1.4 Mirroring.......................................................85 4.1.2 MongolianScript....................................................85 4.2 BidirectionalLayout....................................................86 4.2.1 LogicalandDisplayOrder.............................................86 4.2.2 ContextualProblems.................................................87 4.2.3 DomainNames......................................................88