Wide Characters

Total Page:16

File Type:pdf, Size:1020Kb

Wide Characters THE MAGAZINE OF USENIX & SAGE August 2002 volume 27 • number 4 inside: PROGRAMMING McCluskey: Wide Characters & The Advanced Computing Systems Association & The System Administrators Guild wide characters int main() by Glen { McCluskey printf("sizeof(wchar_t) = %u\n", sizeof(wchar_t)); Glen McCluskey is a } consultant with 20 years of experience When I run this program on my Linux system, the result is: and has focused on programming lan- sizeof(wchar_t) = 4 guages since 1988. He specializes in Java wchar_t is a signed or unsigned integral type big enough to and C++ perfor- mance, testing, and hold all the characters in the local extended character set, and is technical documen- a 32-bit long on my system. tation areas. [email protected] wide character constants are specified similarly to normal char- acter constants, with a preceding L before the constant: #include <stdio.h> We’ve been looking at some of the new features in #include <wchar.h> C99, the standards update to C. In this column we’ll int main() consider features added to the language and library in { support of wide characters, as typically used with for- wchar_t wc1 = L'a'; eign languages. printf("%lx\n", wc1); wchar_t wc2 = L'\377'; Character Sets printf("%lx\n", wc2); Several terms are used in the standard to describe C character sets. The first of these is “basic character set” and refers to a set wchar_t wc3 = L'\x12345678'; of single-byte characters that all C99 implementations must printf("%lx\n", wc3); } support. Roughly speaking, this character set is 7-bit ASCII without some of the control characters. It consists of printable When I run this program, the result is: characters like A–Z and 0–9, along with tab, form feed, and so 61 on. ff The basic character set is divided into source and execution 12345678 character sets, and these differ slightly. For example, the basic In the first two cases, the wide character is stored in the least execution character set is required to have a null character (\0), significant byte of the long, while in the last case, all four bytes used as a string terminator. of the long are used to represent a single wide character. The extended character set is a superset of the basic character This example illustrates a confusing point about wide charac- set, and adds additional locale-specific characters. It, too, is ters – the idea of multiple representations. For example, wc3 is divided into source and execution character sets. initialized with a wide character constant, a constant that A wide character is a character of type wchar_t, and is capable requires 13 bytes to express in the source program. The con- of representing any character in the current locale. In other stant itself is stored in four bytes during execution (in a 32-bit words, a wide character may be a character from either the basic long). And a little later on, we’ll see examples of what is called character set, such as the letter A, or a character from the “state-dependent encoding,”a mechanism used to encode wide extended character set. characters as a stream of bytes (1–6 bytes per wide character, on my system). This encoding is used for writing wide characters Wide Character Constants to a file. Let’s look at some actual examples of wide character usage. The The term “multibyte character” is defined to be a sequence of first demo program prints the size of wchar_t on your local sys- one or more bytes that represents a single character in the tem: extended source or execution environment. A character from #include <stdio.h> the extended character set can have several different representa- #include <wchar.h> tions. These representations may appear in source code, in the execution environment, or in data files. 12 Vol. 27, No. 4 ;login: Here’s another example, showing how wide character strings are #include <assert.h> specified: #include <stdio.h> #include <wchar.h> #include <assert.h> #include <wchar.h> int main() ROGRAMMING { P int main() // write a wide character string to a file { wchar_t* wstr1 = L"testing\x12345678"; FILE* fp = fopen("outfile", "w"); wchar_t wstr2[] = L"testing\x12345678"; assert(fp); fwprintf(fp, L"string is\377: %ls\n", L"TESTing\x1234"); assert(*wstr1 == 't'); fclose(fp); assert(*(wstr1 + 7) == 0x12345678); // read the characters of the string back from the assert(wstr2[0] == 't'); // file assert(wstr2[7] == 0x12345678); } fp = fopen("outfile", "r"); assert(fp); String Operations wint_t c; Many familiar operations are supported on wide character wchar_t buf[25]; strings. For example, here’s a demo that implements a function size_t len = 0; while ((c = getwc(fp)) != WEOF) to convert to lowercase: buf[len++] = c; #include <stdio.h> fclose(fp); #include <wchar.h> buf[len] = 0; #include <wctype.h> // check results wchar_t* tolower(wchar_t* str) if (wcscmp(buf, L"string is\377: TESTing\x1234\n") { == 0) wchar_t* start = str; printf("strings are equal\n"); // convert each wide character to lowercase else printf("strings are unequal\n"); for (; *str; str++) { } if (iswupper(*str)) *str = tolower(*str); Much of this code is identical to what you would use when } reading and writing regular strings of bytes. return start; wint_t is a type that is related to wchar_t in a way similar to the } relationship between int and char; it can hold all possible int main() wchar_t values, as well as one distinguished value that is not { part of the extended character set (WEOF). wchar_t* str = L"TESTing"; wchar_t buf[8]; The only other tricky thing in this example is stream orienta- tion, something that’s implicit in the code. A file stream can be wcscpy(buf, str); either byte or wide oriented. The orientation is determined by tolower(buf); the first operation on the stream, or explicitly via the fwide() call. Since the first write operation in the demo is fwprintf(), and printf("%ls\n", buf); the first read operation is getwc(), and these are wide character } functions, the streams are marked as having wide orientation. Note that the definition of an uppercase character may be locale Why does stream orientation matter? The reason is that an specific. encoding may be applied to wide characters written to a file. Wide Characters and I/O Suppose you are programming with wide characters, you need to do wide character I/O, and your wide characters are four Suppose that you have a wide character string and you’d like to bytes long when using the wchar_t representation. One way of write it to a file and then read it back. How can you do this? writing such characters to a file is to actually write four bytes Here’s one approach: for each character. August 2002 ;login: WIDE CHARACTERS 13 But what happens if, most of the time, the values of your wide len = 1 characters are within the range of 7-bit ASCII? In such a case, 61 three zero bytes will be written for each character. And the len = 2 resulting files will not be readable by tools that expect ASCII. c3 bf This problem exists today, for example, with tools that write 16- len = 6 fc 92 8d 85 99 b8 bit Unicode to a file. One solution to this problem is to encode characters such that 7-bit ASCII is represented as itself, that is, a The character a is encoded as itself, while the character \377 is single byte, while other character values are encoded using mul- encoded as two bytes 0xc3 and 0xbf. The constant tiple bytes. L’\x12345678’, internally represented as a four-byte long, is encoded using six bytes. But if an encoding is applied, then it no longer makes sense to mix byte and wide-file operations. This is especially true given Here’s another example of encoding and decoding: that an encoding may be state dependent, and dipping into a #include <stdio.h> byte stream in the middle of a multiple-byte encoding of a wide #include <stdlib.h> character has no meaning. #include <wchar.h> Encodings int main() Let’s look a little deeper into the encoding issue, with another { example. This demo converts wide character values into char buf[MB_CUR_MAX]; wchar_t wc1 = L'\x12345678'; sequences of encoded bytes: wchar_t wc2; #include <stdio.h> int n, nn; #include <stdlib.h> // convert a wide character to a multibyte #include <wchar.h> // character int main() n = wctomb(buf, wc1); { printf("%d\n", mblen(buf, n)); char buf[MB_CUR_MAX]; int n; // reverse the process // convert a single-byte character to a multibyte nn = mbtowc(&wc2, buf, n); // character // check result n = wctomb(buf, L'a'); if (wc1 == wc2 && n == nn) printf("len = %d\n", n); printf("equal\n"); for (int i = 0; i < n; i++) else printf("%hhx ", buf[i]); printf("unequal\n"); printf("\n"); } // convert another single-byte character The wctomb() function encodes a wide character into a stream n = wctomb(buf, L'\377'); of bytes, and mbtowc() reverses the process. printf("len = %d\n", n); for (int i = 0; i < n; i++) Restartable Functions printf("%hhx ", buf[i]); Consider the second part of the last example. The processing of printf("\n"); the first part of the example – a wide character encoded into a // convert a wide character buffer of one or more bytes – was reversed, by taking the buffer n = wctomb(buf, L'\x12345678'); and converting it back into a wide character. printf("len = %d\n", n); In a real-world example, things might not be quite as simple. for (int i = 0; i < n; i++) For instance, you might have an application where bytes are printf("%hhx ", buf[i]); coming in across a network one at a time, and several of the printf("\n"); bytes put together represent a wide character.
Recommended publications
  • C Strings and Pointers
    Software Design Lecture Notes Prof. Stewart Weiss C Strings and Pointers C Strings and Pointers Motivation The C++ string class makes it easy to create and manipulate string data, and is a good thing to learn when rst starting to program in C++ because it allows you to work with string data without understanding much about why it works or what goes on behind the scenes. You can declare and initialize strings, read data into them, append to them, get their size, and do other kinds of useful things with them. However, it is at least as important to know how to work with another type of string, the C string. The C string has its detractors, some of whom have well-founded criticism of it. But much of the negative image of the maligned C string comes from its abuse by lazy programmers and hackers. Because C strings are found in so much legacy code, you cannot call yourself a C++ programmer unless you understand them. Even more important, C++'s input/output library is harder to use when it comes to input validation, whereas the C input/output library, which works entirely with C strings, is easy to use, robust, and powerful. In addition, the C++ main() function has, in addition to the prototype int main() the more important prototype int main ( int argc, char* argv[] ) and this latter form is in fact, a prototype whose second argument is an array of C strings. If you ever want to write a program that obtains the command line arguments entered by its user, you need to know how to use C strings.
    [Show full text]
  • Download the Specification
    Internationalizing and Localizing Applications in Oracle Solaris Part No: E61053 November 2020 Internationalizing and Localizing Applications in Oracle Solaris Part No: E61053 Copyright © 2014, 2020, Oracle and/or its affiliates. License Restrictions Warranty/Consequential Damages Disclaimer This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. Warranty Disclaimer The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. Restricted Rights Notice If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs (including any operating system, integrated software, any programs embedded, installed or activated on delivered hardware, and modifications of such programs) and Oracle computer documentation or other Oracle data delivered to or accessed by U.S. Government end users are "commercial
    [Show full text]
  • Chapter 2 Working with Characters and Strings
    C02624245.fm Page 11 Friday, October 19, 2007 7:46 PM Chapter 2 Working with Characters and Strings In this chapter: Character Encodings. 12 ANSI and Unicode Character and String Data Types. 13 Unicode and ANSI Functions in Windows . 15 Unicode and ANSI Functions in the C Run-Time Library . 17 Secure String Functions in the C Run-Time Library. 18 Why You Should Use Unicode. 26 How We Recommend Working with Characters and Strings. 26 Translating Strings Between Unicode and ANSI. 27 With Microsoft Windows becoming more and more popular around the world, it is increasingly important that we, as developers, target the various international markets. It was once common for U.S. versions of software to ship as much as six months prior to the shipping of international ver- sions. But increasing international support for the operating system is making it easier to produce applications for international markets and therefore is reducing the time lag between distribution of the U.S. and international versions of our software. Windows has always offered support to help developers localize their applications. An application can get country-specific information from various functions and can examine Control Panel set- tings to determine the user’s preferences. Windows even supports different fonts for our applica- tions. Last but not least, in Windows Vista, Unicode 5.0 is now supported. (Read “Extend The Global Reach Of Your Applications With Unicode 5.0” at http://msdn.microsoft.com/msdnmag/ issues/07/01/Unicode/default.aspx for a high-level presentation of Unicode 5.0.) Buffer overrun errors (which are typical when manipulating character strings) have become a vec- tor for security attacks against applications and even against parts of the operating system.
    [Show full text]
  • Unicode and Code Page Support
    Natural for Mainframes Unicode and Code Page Support Version 4.2.6 for Mainframes October 2009 This document applies to Natural Version 4.2.6 for Mainframes and to all subsequent releases. Specifications contained herein are subject to change and these changes will be reported in subsequent release notes or new editions. Copyright © Software AG 1979-2009. All rights reserved. The name Software AG, webMethods and all Software AG product names are either trademarks or registered trademarks of Software AG and/or Software AG USA, Inc. Other company and product names mentioned herein may be trademarks of their respective owners. Table of Contents 1 Unicode and Code Page Support .................................................................................... 1 2 Introduction ..................................................................................................................... 3 About Code Pages and Unicode ................................................................................ 4 About Unicode and Code Page Support in Natural .................................................. 5 ICU on Mainframe Platforms ..................................................................................... 6 3 Unicode and Code Page Support in the Natural Programming Language .................... 7 Natural Data Format U for Unicode-Based Data ....................................................... 8 Statements .................................................................................................................. 9 Logical
    [Show full text]
  • Technical Study Desktop Internationalization
    Technical Study Desktop Internationalization NIC CH A E L T S T U D Y [This page intentionally left blank] X/Open Technical Study Desktop Internationalisation X/Open Company Ltd. December 1995, X/Open Company Limited All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the copyright owners. X/Open Technical Study Desktop Internationalisation X/Open Document Number: E501 Published by X/Open Company Ltd., U.K. Any comments relating to the material contained in this document may be submitted to X/Open at: X/Open Company Limited Apex Plaza Forbury Road Reading Berkshire, RG1 1AX United Kingdom or by Electronic Mail to: [email protected] ii X/Open Technical Study (1995) Contents Chapter 1 Internationalisation.............................................................................. 1 1.1 Introduction ................................................................................................. 1 1.2 Character Sets and Encodings.................................................................. 2 1.3 The C Programming Language................................................................ 5 1.4 Internationalisation Support in POSIX .................................................. 6 1.5 Internationalisation Support in the X/Open CAE............................... 7 1.5.1 XPG4 Facilities.........................................................................................
    [Show full text]
  • AIX Globalization
    AIX Version 7.1 AIX globalization IBM Note Before using this information and the product it supports, read the information in “Notices” on page 233 . This edition applies to AIX Version 7.1 and to all subsequent releases and modifications until otherwise indicated in new editions. © Copyright International Business Machines Corporation 2010, 2018. US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. Contents About this document............................................................................................vii Highlighting.................................................................................................................................................vii Case-sensitivity in AIX................................................................................................................................vii ISO 9000.....................................................................................................................................................vii AIX globalization...................................................................................................1 What's new...................................................................................................................................................1 Separation of messages from programs..................................................................................................... 1 Conversion between code sets.............................................................................................................
    [Show full text]
  • Introduction
    01-P1437 9/8/00 4:54 PM Page 1 ONE Introduction Around twenty years ago a computer revolution started when the IBM PC was released. The IBM PC took computing away from the air-conditioned environ- ment of the mainframe and minicomputer and put it onto the desk of poten- tially everyone. Nowadays most workers have a PC on their desk, and many have a PC at home, too. Laptop computers allow users to have one computer that can be used both at home and at work, as well as on the road. PCs are ge- neric computing devices providing tremendous computing power and flexibil- ity, and all PCs in the world from laptops to desktop PCs and through to servers have fundamentally the same architecture. Living through the PC era has been fun, frustrating, and exciting. However, there is an even bigger revolution on the way with more potential and even more challenges—the move to truly mobile-device-based computing. In the last few years computing devices have been coming onto the mar- ket that provide unparalleled portability and accessibility. Microsoft Windows CE devices, such as the palm-size device and handheld PC, provide cutdown desktop PC capabilities in a really small footprint, and Palm Pilot has been a very successful PDA (Personal Digital Assistant). Microsoft Pocket PC has tremen- dous features for enterprise computing, games, and entertainment. The Win- dows CE operating system has been embedded into many appliances (such as gas pumps and productions systems) for monitoring and control. Unlike the generic PC, these computing devices are not all the same and are designed for specific purposes.
    [Show full text]
  • An Introduction to Ada's Predefined Packages
    John McCormick Fall 2004 An Introduction to Ada's Predefined Packages The Annexes of the Ada Reference Manual (ARM) provides descriptions of the packages that every conforming Ada compiler vendor must provide (Chapters 1-13 and Annex A) and those that they usually provide. Here is a list of the names of those packages along with the location of their specifications in the ARM. Standard -- A.1 Strings (continued) Ada -- A.2 Unbounded -- A.4.5 Asynchronous_Task_Control -- D.11 Wide_Bounded -- A.4.7 Calendar -- 9.6 Wide_Fixed -- A.4.7 Characters -- A.3.1 Wide_Maps -- A.4.7 Handling -- A.3.2 Wide_Constants -- A.4.7 Latin_1 -- A.3.3 Wide_Unbounded -- A.4.7 Command_Line -- A.15 Synchronous_Task_Control -- D.10 Decimal -- F.2 Tags -- 3.9 Direct_IO -- A.8.4 Task_Attributes -- C.7.2 Dynamic_Priorities -- D.5 Task_Identification -- C.7.1 Exceptions -- 11.4.1 Text_IO -- A.10.1 Finalization -- 7.6 Complex_IO -- G.1.3 Interrupts -- C.3.2 Editing -- F.3.3 Names -- C.3.2 Text_Streams -- A.12.2 IO_Exceptions -- A.13 Unchecked_Conversion -- 13.9 Numerics -- A.5 Unchecked_Deallocation -- 13.11.2 Complex_Elementary_Functions -- G.1.2 Wide_Text_IO -- A.11 Complex_Types -- G.1.1 Complex_IO -- G.1.3 Discrete_Random -- A.5.2 Editing -- F.3.4 Elementary_Functions -- A.5.1 Text_Streams -- A.12.3 Float_Random -- A.5.2 Generic_Complex_Elementary_Functions -- G.1.2 Interfaces -- B.2 Generic_Complex_Types -- G.1.1 C -- B.3 Generic_Elementary_Functions -- A.5.1 Pointers -- B.3.2 Real_Time -- D.8 Strings -- B.3.1 Sequential_IO -- A.8.1 COBOL -- B.4 Storage_IO -- A.9 Fortran -- B.5 Streams -- 13.13.1 Stream_IO -- A.12.1 System -- 13.7 Strings -- A.4.1 Address_To_Access_Conversions -- 13.7.2 Bounded -- A.4.4 Machine_Code -- 13.8 Fixed -- A.4.3 RPC -- E.5 Maps -- A.4.2 Storage_Elements -- 13.7.1 Constants -- A.4.6 Storage_Pools -- 13.11 The purpose of this guide is to help you learn how to read the specifications for these packages.
    [Show full text]
  • Data Types in C
    Princeton University Computer Science 217: Introduction to Programming Systems Data Types in C 1 Goals of C Designers wanted C to: But also: Support system programming Support application programming Be low-level Be portable Be easy for people to handle Be easy for computers to handle • Conflicting goals on multiple dimensions! • Result: different design decisions than Java 2 Primitive Data Types • integer data types • floating-point data types • pointer data types • no character data type (use small integer types instead) • no character string data type (use arrays of small ints instead) • no logical or boolean data types (use integers instead) For “under the hood” details, look back at the “number systems” lecture from last week 3 Integer Data Types Integer types of various sizes: signed char, short, int, long • char is 1 byte • Number of bits per byte is unspecified! (but in the 21st century, pretty safe to assume it’s 8) • Sizes of other integer types not fully specified but constrained: • int was intended to be “natural word size” • 2 ≤ sizeof(short) ≤ sizeof(int) ≤ sizeof(long) On ArmLab: • Natural word size: 8 bytes (“64-bit machine”) • char: 1 byte • short: 2 bytes • int: 4 bytes (compatibility with widespread 32-bit code) • long: 8 bytes What decisions did the designers of Java make? 4 Integer Literals • Decimal: 123 • Octal: 0173 = 123 • Hexadecimal: 0x7B = 123 • Use "L" suffix to indicate long literal • No suffix to indicate short literal; instead must use cast Examples • int: 123, 0173, 0x7B • long: 123L, 0173L, 0x7BL • short:
    [Show full text]
  • Wording Improvements for Encodings and Character Sets
    Wording improvements for encodings and character sets Document #: P2297R0 Date: 2021-02-19 Project: Programming Language C++ Audience: SG-16 Reply-to: Corentin Jabot <[email protected]> Abstract Summary of behavior changes Alert & backspace The wording mandated that the executions encoding be able to encode ”alert, backspace, and carriage return”. This requirement is not used in the core wording (Tweaks of [5.13.3.3.1] may be needed), nor in the library wording, and therefore does not seem useful, so it was not added in the new wording. This will not have any impact on existing implementations. Unicode in raw string delimiters Falls out of the wording change. should we? New terminology Basic character set Formerly basic source character set. Represent the set of abstract (non-coded) characters in the graphic subset of the ASCII character set. The term ”source” has been dropped because the source code encoding is not observable nor relevant past phase 1. The basic character set is used: • As a subset of other encodings • To restric accepted characters in grammar elements • To restrict values in library literal character set, literal character encoding, wide literal character set, wide lit- eral character encoding Encodings and associated character sets of narrow and wide character and string literals. implementation-defined, and locale agnostic. 1 execution character set, execution character encoding, wide execution character set, wide execution character encoding Encodings and associated character sets of the encoding used by the library. isomorphic or supersets of their literal counterparts. Separating literal encodings from libraries encoding allows: • To make a distinction that exists in practice and which was not previously admitted by the standard previous.
    [Show full text]
  • Programming Languages Ð C
    Programming languages Ð C ABSTRACT (Cover sheet to be provided by ISO Secretariat.) This International Standard speci®es the form and establishes the interpretation of programs expressed in the programming language C. Its purpose is to promote portability, reliability, maintainability, and ef®cient execution of C language programs on a variety of computing systems. Clauses are included that detail the C language itself and the contents of the C language execution library. Annexes summarize aspects of both of them, and enumerate factors that in¯uence the portability of C programs. Although this International Standard is intended to guide knowledgeable C language programmers as well as implementors of C language translation systems, the document itself is not designed to serve as a tutorial. -- -- Introduction 1 With the introduction of new devices and extended character sets, new features may be added to this International Standard. Subclauses in the language and library clauses warn implementors and programmers of usages which, though valid in themselves, may con¯ict with future additions. 2 Certain features are obsolescent, which means that they may be considered for withdrawal in future revisions of this International Standard. They are retained because of their widespread use, but their use in new implementations (for implementation features) or new programs (for language [6.9] or library features [7.20]) is discouraged. 3 This International Standard is divided into four major subdivisions: Ð the introduction and preliminary elements; Ð the characteristics of environments that translate and execute C programs; Ð the language syntax, constraints, and semantics; Ð the library facilities. 4 Examples are provided to illustrate possible forms of the constructions described.
    [Show full text]
  • Character Encodings
    Character encodings Andreas Fester June 26, 2005 Abstract This article is about the concepts of character encoding. Earlier there were only a few encodings like EBCDIC and ASCII, and everything was quite simple. Today, there are a lot of terms like UNICODE, Code- page, UTF8, UTF16 and the like. The goal of this article is to explain the concepts behind these terms and how these concepts are supported in programming languages, especially Java, C and C++. 1 Introduction Character encodings are necessary because the computer hardware can only handle (and only knows) about numbers. The first computers could handle eight bit numbers (lets take aside the very first micropro- cessors which could only handle 4 bits), which is a range from 0 to 255. Larger numbers must be handled by combining two or more eight bit values. Later, computers were enhanced so that they could handle 16-, 32- and meanwhile 64 bit numbers without having to combine smaller units. But - its still all numbers, i.e. a sequence of 0 and 1 bits. However, it soon became clear that interfacing with a computer system through sequences of 0 and 1 is very cumbersome and only possible for very simple communication. Usually an end user wants to read text like "Name:" and enter words like "Bill", so it was necessary to define some conventions how numbers correspond to letters. These conventions are generally called character encodings. 2 Single byte encoding The traditional way of assigning a code to a character is to represent each character as a byte or a part of a byte (for example six or seven bits).
    [Show full text]