Guide to the Use of Character Set Standards in Europe

CEN TECHNICAL REPORT

Draft 3 for CEN Trnnnn:1999
1999-07-23

Descriptors: Data processing, information interchange, text processing, text communication, graphic characters, character sets, representation of characters, coded character sets, architecture

Information Technology - Guide to the use of character set standards in Europe

This CEN Technical Report has been drawn up by CEN/TC 304.

This CEN Technical Report was established by TC 304 in one official version (English). A version in any other language made by translation under the responsibility of a CEN member into its own language and notified to the Central Secretariat has the same status as the official version.

CEN members are the national bodies of Austria, Belgium, the Czech Republic, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Italy, Luxembourg, Netherlands, Norway, Portugal, Spain, Sweden, Switzerland, and the United Kingdom.

CEN
European Committee for Standardization
Comité Européen de Normalisation
Europäisches Komitee für Normung
Central Secretariat: rue de Stassart 36, B-1050 Brussels

© CEN 1999 Copyright reserved to all CEN members
Ref.No. TR xxxx:1999 E

CEN TR nnnn : Draft 2 Guide to the use of character set standards in Europe

FOREWORD

This report was produced by a CEN/TC 304 Project Team, set up in June, 1998, as one of several to carry out the funded work program of TC 304 (documented in CEN/TC 304 N 666 R2). A first draft was discussed at the TC meeting in Brussels in November, 1998. A revised draft was circulated for comments within the TC and thereafter discussed at the TC plenary meeting in April, 1999. This revised version is based upon comments received during and after that meeting and is circulated for written ballot within the TC. If approved, the report will then be sent to the CEN BT for approval.
TABLE OF CONTENTS

FOREWORD
Guide to the use of character set standards in Europe
1 Introduction
2 Executive summary
3 Scope and field of application
4 Definitions
5 Characters and their coding
5.1 Characters, glyphs and languages
5.2 Coding
5.3 Control functions and control characters
6 The character handling model
6.1 The input function
6.2 The processing function
6.3 The interchange function
6.4 The output function
6.5 Cultural issues
7 Official standards, manufacturer standards, and related standards
7.1 Telecommunication standards
7.2 Manufacturer standards
7.3 Related Standards
8 International character sets
8.1 Framework standards for 7- and 8-bit environments
8.2 7- and 8-bit character set standards
8.3 The universal character set (UCS) standard
8.4 Control functions
9 European character sets
9.1 8-bit character sets
9.2 The multilingual European subsets
9.3 The EURO SIGN
10 Procurement issues
10.1 Repertoires and code structures
10.2 Transformation and fall-back
10.4 Code structure interoperability
11 Procurement clauses
11.1 Structure
11.2 Input character repertoire
11.3 Output character repertoire
11.4 Processing character repertoire
11.5 Interchange character repertoire
11.6 Additional requirements when using the 8-bit code structure for interchange
11.7 Additional requirements when using the multi-byte UCS code structure for interchange
12 CEN and CEN/TC 304
13 References

Guide to the use of character set standards in Europe

1 Introduction

There exist today a large number of standards and related specifications concerning character repertoires and their coding, in the form of official as well as manufacturer standards, intended for a wide range of applications and uses. Furthermore, there are character set standards for data communication and there are standards developed specifically for telecommunications applications. The situation can be very confusing to the non-expert user and to people involved in procurement.

The user of IT systems normally does not have to be concerned with these types of standards. However, there may be situations where the user has to be able to express working needs for certain character repertoires. It may also happen that the user, when working together with other parties using other systems, needs to be able to interpret other people's specifications given in the form of references to standards.

The procurer of IT systems should be able to specify requirements in the form of references to established standards.

A particular purpose of the report is to give guidance for public procurement in Europe. Since there is an EC directive and a council decision for such procurement requiring the use of official European standards above certain procurement amounts, the report concentrates on such standards. There may be future editions, in which case more attention will be given to other types of standards. (See also section 7.)

The main purpose of this report is to give guidance to users and procurers by explaining the purposes and relationships of the official standards in the domain of data communication. Explicit guidance is given in paragraphs marked with !.

The text is presented on two levels. The first level, contained in the body of the report, provides a general coverage of character repertoires, coding and uses. The second level, contained in the two annexes, provides much more detailed, tutorial information. The reader who finds the level of technical detail too deep may be better served by the “Manual: Standards for the electronic interchange of personal data: Part 5 – Character sets” (see References).

Further information on character sets and their standardization can be found in the document “Language automation world-wide: The development of character set standards” and on the Letter Database web site (see References).

2 Executive summary

The main body of this report is aimed primarily at the non-technical person who needs to become familiar with the use of character set standards in Europe for various purposes in an IT environment. This audience will include managers/decision makers and their advisors; administrators (for procurement purposes); technicians (for programming and system development purposes); standardisers; perhaps also journalists.

The concepts of characters and their coding are introduced in section 5, and a conceptual model of the use of coded character sets is provided in section 6. The guide concentrates on official character set standards. However, there is a range of other standards for character sets that are not official, and there are also specifications concerning associated topics such as rules for ordering character strings. Section 7 goes on to place the official standards in the wider context of these other standards. Sections 8 and 9 describe a range of official character set standards with an international and a European scope respectively. Section 10 introduces a number of procurement issues, and section 11 provides sample text that may be used as the basis for inclusion in (public) procurement specifications for IT systems and software.

In addition, the guide has two annexes which contain a much more technical description of official character set standards.

The activities of CEN/TC 304, the committee responsible for the promulgation of character set and related specifications in Europe, are described in section 12, and finally pointers for further reading and research are given in section 13.

3 Scope and field of application

The technical scope of this guide is primarily limited to official character set standards promulgated by ISO/IEC and CEN, as opposed to official telecommunications standards and manufacturer standards. However, an overview of all types of standards is given in section 7. The guide furthermore concentrates on European issues; thus character set standards for non-European languages are not covered.

The guide is mainly intended as an introduction for people who need to familiarise themselves with the concept of character sets and their coding; e.g. managers/decision makers and their advisors; administrators (for procurement purposes); technicians (for programming and system development purposes); standardisers; perhaps also journalists. Particular emphasis is placed on its use by procurers.

4 Definitions

The following terms are used in the body of this report and the official definitions are given [...]

[...] single bit combination.

*Note – A control character is not, strictly speaking, a “character” but is called that way because its coded representation is of the same type as that of a coded graphical character.

coded character set (character set): A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their coded representation.

*code table: A tabular representation of a coded character set, showing also the coded representations.

*code page: Synonym for code table, used in the IBM environment.

*code space: The numeric domain occupied by all bit combinations used for the coding of a coded character set.

transliteration: The process which consists of representing the characters of an alphabetical or syllabic writing system by the characters of a conversion alphabet.

Note – In principle, a transliteration should be a one-to-one conversion.

*fall-back: A non-reversible transformation consisting of the substitution of an output character which cannot be represented on the output device by one or more characters which can.
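
The definitions of coded character set, code space and fall-back can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the seven-character, 3-bit toy code, the fall-back table and the function names are hypothetical and are not taken from any of the standards discussed in this report.

```python
# Illustrative sketch only: the toy 3-bit coded character set and the
# fall-back table are hypothetical, not taken from any standard.

# A coded character set: a one-to-one relationship between characters
# and their coded representations (here, small integers).
TOY_CODED_SET = {"A": 0, "B": 1, "C": 2, "E": 3, "S": 4, "U": 5, "R": 6}

# The code space is the numeric domain of all bit combinations: 3 bits give 0..7.
CODE_SPACE = range(2 ** 3)

# Fall-back: a character the output side cannot represent is replaced
# by one or more characters that it can represent.
FALLBACK = {"É": "E", "ß": "SS", "€": "EUR"}


def is_one_to_one(coded_set):
    """Return True if no two characters share a coded representation."""
    return len(set(coded_set.values())) == len(coded_set)


def encode(text, coded_set=TOY_CODED_SET, fallback=FALLBACK):
    """Encode text, applying the fall-back to unrepresentable characters.

    Raises KeyError for a character that is in neither table.
    """
    codes = []
    for ch in text:
        # Use the character itself if it is in the set, else its substitute(s).
        for out in (ch if ch in coded_set else fallback[ch]):
            codes.append(coded_set[out])
    return codes


def decode(codes, coded_set=TOY_CODED_SET):
    """Map coded representations back to characters."""
    reverse = {code: ch for ch, code in coded_set.items()}
    return "".join(reverse[c] for c in codes)
```

In this sketch, `encode("É")` and `encode("E")` produce the same coded representation, so `decode` can only ever return "E". The original character cannot be recovered, which is what makes a fall-back non-reversible, in contrast to a transliteration, which in principle is one-to-one.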
