Chinese Character Synthesis Using METAPOST
Candy L. K. Yiu, Wai Wong
Department of Computer Science, Hong Kong Baptist University
[candyyiu,wwong]@comp.hkbu.edu.hk
TUGboat, Volume 24 (2003), No. 1, Proceedings of the 2003 Annual Meeting

Abstract

A serious problem in Chinese information exchange in this rapidly advancing Internet era is the sheer quantity of characters. Commonly used character encoding systems cannot include all characters, and often fonts do not contain all characters either. In professional and scholarly documents, these unencoded characters are quite common. This situation hinders the development of information exchange because special care has to be taken to handle these characters, such as embedding the character as an image. This paper describes our attempt towards solving the problem. Our approach utilizes the intrinsic characteristic of Chinese characters, that is, each character is formed by combining strokes and radicals. We defined a Chinese character description language named HanGlyph to capture the topological relation of the strokes in a character. We are developing a Chinese Character Synthesis System (CCSS) which transforms HanGlyph descriptions into graphical representations. A large part of the CCSS is implemented in METAPOST.

1 Introduction

The rapid advancement of the Internet and the Web provides an effective means of information exchange. However, there is a very serious problem in exchanging Chinese documents: the number of Chinese characters that now exist or have ever existed is unknown. Furthermore, new characters are continually being created. Therefore, no character set can encode all Chinese characters.

Even if a character set could encode all Chinese characters, it is very expensive to create Chinese fonts using typical methods, and a fairly large number of Chinese characters would be so rarely used that the expense would be very difficult to justify.

One possible solution to this problem is to create an unencoded character according to its composition of strokes and radicals. Several experiments along this line were attempted in the past, but none were very successful. The key reason is that the composition of the strokes and radicals is very complex, and the previous attempts did not effectively divide and resolve the complexity. The section on related works gives a brief survey of some previous attempts.

Our approach to Chinese character synthesis resolves the complexity in two ways. First, we defined a high-level Chinese character description language, HanGlyph. It captures the abstract and topological relation of the strokes. Thus, the character description is compact and can be targeted to a variety of rendering styles. The section on the HanGlyph Chinese character description language describes the language in more detail. Secondly, we use METAPOST as our rendering engine to take advantage of its meta-ness and its ability to specify paths and solve linear equations.

The HanGlyph language is defined based on many studies of Chinese characters. The section on the structure of Chinese characters explains the basic structure of Chinese characters for the benefit of readers who are not familiar with them. HanGlyph defines 41 basic strokes, 5 operators and a set of relations. A character is built by combining strokes using the operators recursively. HanGlyph allows the user to define macros to represent a stroke cluster, which can then be re-used in building more complex characters.

The CCSS (Chinese Character Synthesis System) takes HanGlyph expressions and renders the characters. It can be divided into three parts: a front-end to translate HanGlyph expressions into METAPOST programs, a set of primitive strokes, and a library of METAPOST macros to implement the operators and relations. By varying the parameters to these macros, or redefining the basic stroke macros, Chinese characters in different styles can be formed. Thus, it can create a variety of different fonts from the same HanGlyph description.

2 The structure of Chinese characters

Chinese characters, or hanzi, have their roots in a very long history. A large body of literature on the study of the written form of the Chinese language, dating from more than 2000 years ago (during the Han dynasty) up to the present, is available. The written form of Indo-European languages consists of around 30 characters. Words are formed using these characters in a linear fashion. In contrast, the written Chinese language is denoted by tens of thousands of hanzi. The exact number of hanzi that have ever existed can never be known.

Many studies have pointed out that each Chinese character is composed from strokes. The number of strokes in a character varies from one for the simplest, up to around 50 for the most complex. Unlike the linear composition of words from characters in Indo-European languages, the arrangement of the strokes in hanzi is two-dimensional.

According to the convention of writing Chinese characters, a stroke is a continuous movement of the brush over the writing surface without being lifted up. It is commonly agreed that there are five basic strokes: h (橫 héng¹), s (豎 shù), p (撇 piě), n (捺 nà) and d (點 diǎn).

¹ The word héng following the hanzi name of the stroke is in pinyin, a phonetic transcription of Chinese characters. We hope these pinyin transcriptions can help readers who do not know Chinese to pronounce the names of the strokes.

In practice, each of these basic strokes has some variations depending on the position in a character. For example, the stroke p (撇) can have two variations: P (平撇 píngpiě), as the top stroke in 千, and q (豎撇 shùpiě), as the leftmost stroke in [hanzi]. In addition, a number of combinations of these basic movements are considered as strokes because they are connected in a natural way in writing. For example, a h (橫) followed by a p (撇) is a single stroke k called 橫折撇 (héngzhépiě). Modern studies of Chinese characters [1, 9] identified a small set of around 40 strokes as the basic elements of hanzi.

Although the arrangement of strokes to form a hanzi is very complex, there are some rules that guide the formation of characters. Further, some stroke arrangements are relatively stable and appear in many characters. Some of these arrangements are themselves hanzi, for example, 日; some of them are known as radicals, which are used in Chinese dictionaries to index characters, for example, 彳. There are some relatively stable arrangements that are not hanzi themselves, nor radicals, but appear in many characters. We will use the term components to refer to all these kinds of stroke arrangements, while we use a more general term, stroke clusters, to refer to any arrangement of several strokes.

Except for a small number of very simple characters, such as 人, 二, 十, which cannot be divided into component parts, all hanzi can be considered as compositions of certain components. The ways of composing hanzi from components are known as the structure of the character. Many studies, such as [2] and [8], have identified around 10 different types of structures if one considers how to compose a character from only two components. This does not place serious restrictions, because the composition process can be performed recursively. Figure 1 illustrates the commonly used structures.

[Figure 1: The basic structure of hanzi. Panels: Top-bottom, Left-right, Partially-enclosing, Enclosing, Cross.]

3 Related works

Based on the studies of Chinese characters, several attempts have been made to create hanzi using a structural composition approach.

Toshiyuki et al. [10] proposed a way of describing Chinese characters using sub-patterns. In principle, their method is similar to the approach of HanGlyph because the underlying theory of character structure is intrinsic to all Chinese characters.

Dong [11] and Fan [5] reported their work on the development of a Chinese character design system which took a parametric approach to create characters in different styles. Lim and Kim [7] developed a system for designing Oriental character fonts by composing stroke elements.

Inspired by the success of METAFONT [6] in creating Latin character fonts, Hobby and Gu [3] attempted to generate Chinese characters of different styles using METAFONT. A small set of strokes was defined in METAFONT. A small set of radicals was then defined as METAFONT macros by using the strokes. Characters can then be specified as METAFONT programs using these macros as building blocks. By varying some parameters governing the shapes of the strokes, fonts of different styles can be generated. However, the research was not conclusive because they only generated fonts with a very small character set (128 characters).

Another attempt similar to that of Hobby and Gu was made by Hosek [4], who aimed at generating hanzi from a small set of components.

A common theme of the works mentioned is the difficulty of handling the complexity of the structures and the sheer number of characters. Our approach handles the complexity by using an abstract description and a layered CCSS to decompose the complexity into several sub-problems. On the HanGlyph level, we consider strokes as abstract objects. We need to specify only the relative positions between these abstract objects. On a lower level, we can work out the outline of the strokes and fine tune the positions.

4.1 The strokes

After studying a number of Chinese linguistic and graphological works, we selected a set of 41 strokes