Characters, Glyphs and Beyond / Tereza Haralambous, Yannis

Title: Report of the "Typefaces and Typesetting Workshop" (「書体・組版ワークショップ」報告書) (Characters, Glyphs and Beyond / Tereza Haralambous, Yannis Haralambous [pdf])
Citation: (2004)
Issue Date: 2004-02
URL: http://hdl.handle.net/2433/65873
Type: Conference Paper
Textversion: publisher
Kyoto University

Kyoto University 21st Century COE Program

Tereza Haralambous and Yannis Haralambous

Abstract

The distinction between characters and glyphs is a fundamental issue of computing. This talk aims at giving a new definition of these notions. We first review and comment on the definitions given in various standards. Then we give and explain our own definitions. We consider the Unicode character model to be incomplete, and we formulate a proposal for adding supplementary information and thus obtaining "rich Unicode characters." We illustrate our arguments with many examples taken from various writing systems.

Keywords: character, glyph, language, writing system

The distinction between characters and glyphs is currently a very popular issue. Its complexity is, in some sense, related to the fact that computer systems have been built by engineers who were not very proficient in linguistics and were interested only in the English language. Exploring non-Latin writing systems, one realizes what was not clear from the beginning: that modeling written language is not a trivial task, and that it is fundamental to all exchange and processing of textual information.

Let us start the exploration of this universe by giving some definitions of the terms we are using. Let us see how the terms "character" and "glyph" are defined.

According to ISO 9541 [6], released in 1991, a "glyph" is "a recognizable abstract graphic symbol which is independent of any specific design," while a "glyph image" is "an image of a glyph, as obtained from a glyph representation displayed on a presentation surface," where "glyph representation" is "the glyph shape and glyph metrics associated with a specific glyph in a font resource." One may argue whether this distinction between "abstract glyph" and "concrete glyph" is necessary, but this is how ISO 9541 defines these terms.

According to W3C (quoting "A Character Model for the World Wide Web" by Martin Dürst and others [2]), a character is "the smallest component of written language that has semantic values; refers to the abstract meaning and/or shape." We find this definition quite vague, since everything we perceive may or may not have semantic value, depending on our culture, context and even mood. We all know that Unicode is full of inconsistencies because of its requirement to be compatible with legacy encodings. Has this definition been made to accommodate Unicode's weaknesses, and is it therefore deliberately vague?

W3C uses the ISO 9541 definition of glyph, probably to be consistent with the only available standard on "Font information interchange." The definition of "glyph" in Unicode is slightly different: a glyph is "a shape that a character can have when rendered or displayed." Notice two things. First, the definition of glyph is based on that of character, so if the definition of character is vague, the definition of glyph is even vaguer. Second, there is no longer any distinction between "glyph" and "glyph image," as there is in ISO 9541. We are now talking about shapes, and nothing else. There is an illustration in the Unicode book which clearly shows glyphs corresponding to the same character in different fonts. This shows that Unicode's definition of a glyph is rather that of a "glyph image" in ISO 9541.

For whatever it is worth, the PDF Reference 1.4 (2001) [1] defines a character as "an abstract symbol," whereas a glyph is "a specific graphical rendering of a character." Once again we have a vague definition of character and a definition of glyph relying on it. After all, what is an "abstract symbol"? The definition gives us no clue as to why "A" is an "abstract symbol" and "fi" is not.

Now let us give our own definition of character and glyph [4, 5]. First of all, we believe that the best way to define these notions is to go from glyph to character and not the other way around, as W3C and PDF do.

For us, a glyph is "the image of a typographical sign." You may ask why we use the term "typographical" in our definition. Typography was a first model of human writing. Books are based on this model (even if in some cultures books are still written by hand), and books are the carriers of human culture. Computers are based on this model as well. Typographical signs are uniform, at least in the frame of a given book or of a given page. In such a narrow frame, the differences between typographical signs are microscopic, which is not the case for handwriting. Of course, if a given writing system has never had any typographical tradition, then we must amend our definition to something like: "a glyph is part of the image of written text, not too big and not too small, so that the given writing system can be obtained by an optimal sequence of these images, arranged in a regular way." This apparently complex definition is better explained as: "let us first try to model the given writing system as typography would have done, and then let us take as glyphs the ones of our model." But writing systems of this kind are quite exceptional, and they are not the main topic of our talk.

So let us suppose that the writing systems we care about are those that already have a typographical tradition, be it a short one. Typographers are highly intelligent creatures and have subdivided the image of text into small pieces which are not too big (in Latin script that would be words) and not too small (in Latin script that would be pieces of letters), but just optimal in size and quantity (in Latin script that would be letters). We have based our definition of glyph on their work.

What, then, is a character? Let us realize that when we see a glyph, we are interpreting it. If it belongs to a writing system we know, then we have some specific knowledge about it: how it is pronounced, how it combines with other glyphs, its numerical value, etc. If we are not proficient in the given writing system, we can perhaps still recognize the glyph as belonging to that system, but no more. Sometimes we cannot do even that. In that case our interpretation of the glyph focuses on its geometrical properties: is it a triangle, a circle, does it resemble this or that glyph we know?

Interpretation leads to description. How do we describe a glyph? Take the glyph "A." Some may say it is an open triangle with a bar in the middle, others will say it is a "Latin letter A," and others will say it is the mathematical "for all" operator turned upside down. Many descriptions can be given, but only a few are of interest to computing.

Furthermore, a glyph description may fit more than one glyph. In fact, in most cases it will fit an infinity of glyphs, since images sharing a few properties can be infinitely diverse. We can say that a description is an equivalence class of glyphs: two glyphs are equivalent if they fit the same description.

Our definition of a character is: "a character is an equivalence class of glyphs, based on a simple, linguistic or logical description."

So if we say "LATIN CAPITAL LETTER A," then we describe a class of glyphs which can be interpreted as the capital letter A in the Latin writing system. This description is purely linguistic.

When we say "simple," we mean that the description should be optimal in length: not too short, not too long. When we say "linguistic or logical," we refer to the fact that characters can belong to writing systems for languages, but also to notation systems (as for music, industrial design, traffic signs, mathematics, etc.).

If we apply this definition very strictly, then quite a few Unicode characters do not qualify as characters. For example, "SPACE" is hardly a glyph, since it is an empty image. The description "SPACE" is even less a character, since it is neither linguistic nor logical, but graphical. But "SPACE" could also be defined as the "word separation method" in European writing systems, which would qualify it as a character, since that is a purely linguistic description.

What about "THIN SPACE" (which is Unicode character U+2009)? This one is harder to defend. One could say that it is part of a notation system: the repertoire of lead types. In the frame of this notation system it has some logic, so it would make sense to call it a character.

One way to test whether a glyph equivalence class qualifies as a character is to bypass the graphical representation of language and to think of what happens to these glyphs in systems like voice synthesis. "SPACE" is absolutely essential in voice synthesis, since without it text would be impossible to understand. But "THIN SPACE" makes no sense whatsoever in voice synthesis. So there is a legitimate doubt about its nature as a character.

Now let us see how characters and glyphs are used in computing. As we see on the drawing, humans use keyboards to input text into computers. Keyboards refer to characters, but when we press keys, what we see on the screen is already a glyph. We see glyphs on screen, but what we store in a document are characters.
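As a small illustration of the descriptions discussed above, the following Python sketch looks up the Unicode names of the three characters mentioned in the text using the standard `unicodedata` module, and then groups a handful of glyph records into equivalence classes by their description. The glyph records (font names, image file names) and the helper `classes_by_description` are hypothetical and exist only for this example; they are not part of any standard or of the authors' proposal.

```python
import unicodedata
from collections import defaultdict

# Look up the Unicode description (name) and general category of the
# characters mentioned in the text: "A", SPACE and THIN SPACE (U+2009).
for ch in ["A", " ", "\u2009"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Toy model of "a character is an equivalence class of glyphs".
# These glyph records are invented for illustration only.
glyphs = [
    {"font": "SerifFont", "image": "a_serif.png", "description": "LATIN CAPITAL LETTER A"},
    {"font": "SansFont", "image": "a_sans.png", "description": "LATIN CAPITAL LETTER A"},
    {"font": "SerifFont", "image": "b_serif.png", "description": "LATIN CAPITAL LETTER B"},
]


def classes_by_description(glyph_records):
    """Group glyph images that fit the same description into one class."""
    classes = defaultdict(list)
    for glyph in glyph_records:
        classes[glyph["description"]].append(glyph["image"])
    return dict(classes)


classes = classes_by_description(glyphs)
for description, images in classes.items():
    print(description, "->", images)
```

In this sketch two glyphs are equivalent exactly when their description strings are equal, which is the simplest possible reading of the definition. Note that Unicode assigns both SPACE and THIN SPACE the same general category (`Zs`, space separator), so the doubt raised above about THIN SPACE is not visible at the level of these properties.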

