
Chinese character synthesis using METAPOST

Candy L. K. Yiu, Wai Wong
Department of Computer Science, Hong Kong Baptist University
[candyyiu,wwong]@comp.hkbu.edu.hk

Abstract

A serious problem in Chinese information exchange in this rapidly advancing Internet time is the sheer quantity of characters. Commonly used character encoding systems cannot include all characters, and often fonts do not contain all characters either. In professional and scholarly documents, these unencoded characters are quite common. This situation hinders the development of information exchange because special care has to be taken to handle these characters, such as embedding the character as an image. This paper describes our attempt towards solving the problem. Our approach utilizes the intrinsic characteristic of Chinese characters, that is, each character is formed by combining strokes and radicals. We defined a Chinese character description language named HanGlyph, to capture the topological relation of the strokes in a character. We are developing a Chinese Character Synthesis System (CCSS) which transforms HanGlyph descriptions into graphical representations. A large part of the CCSS is implemented in METAPOST.

1 Introduction

The rapid advancement of the Internet and the Web provides an effective means of information exchange. However, there is a very serious problem in exchanging Chinese documents: the number of Chinese characters that now exist or have ever existed is unknown. Furthermore, new characters are continually being created. Therefore, no character set can encode all Chinese characters.

Even if a character set could encode all Chinese characters, it is very expensive to create Chinese fonts using typical methods, and a fairly large number of Chinese characters would be so rarely used that the expense would be very difficult to justify.

One possible solution to this problem is to create an unencoded character according to its composition of strokes and radicals. Several experiments along this line were attempted in the past, but none were very successful. The key reason is that the composition of the strokes and radicals is very complex, and the previous attempts did not effectively divide and resolve the complexity. The section on related works gives a brief survey of some previous attempts.

Our approach to Chinese character synthesis resolves the complexity in two ways. First, we defined a high-level Chinese character description language, HanGlyph. It captures the abstract and topological relation of the strokes. Thus, the character description is compact and can be targeted to a variety of rendering styles. The section on the HanGlyph Chinese character description language describes the language in more detail. Secondly, we use METAPOST as our rendering engine to take advantage of its meta-ness and its ability to specify paths and solve linear equations.

The HanGlyph language is defined based on many studies of Chinese characters. The section on the structure of Chinese characters explains the basic structure of Chinese characters for the benefit of readers who are not familiar with them. HanGlyph defines 41 basic strokes, 5 operators and a set of relations. A character is built by combining strokes using the operators recursively. HanGlyph allows the user to define macros to represent a stroke cluster, which can then be re-used in building more complex characters.

The CCSS (Chinese Character Synthesis System) takes HanGlyph expressions and renders the characters. It can be divided into three parts: a front-end to translate HanGlyph expressions into METAPOST programs, a set of primitive strokes, and a library of METAPOST macros to implement the operators and relations. By varying the parameters to these macros, or redefining the basic stroke macros, Chinese characters in different styles can be formed. Thus, it can create a variety of different fonts from the same HanGlyph description.

TUGboat, Volume 24 (2003), No. 1 — Proceedings of the 2003 Annual Meeting 85 Candy L. K. Yiu, Wai Wong

2 The structure of Chinese characters

Chinese characters, or hanzi, have their roots in a very long history. A large body of literature on the study of the written form of the Chinese language, dating from more than 2000 years ago (during the Han dynasty) up to now, is available. The written form of Indo-European languages consists of around 30 characters. Words are formed using these characters in a linear fashion. In contrast, written Chinese is denoted by tens of thousands of hanzi. The exact number of hanzi that have ever existed can never be known.

Many studies have pointed out that each Chinese character is composed from strokes. The number of strokes in a character varies from one for the simplest, up to around 50 for the most complex. Unlike the linear composition of words from characters in Indo-European languages, the arrangement of the strokes in hanzi is two-dimensional.

According to the convention of writing Chinese characters, a stroke is a continuous movement of the brush over the writing surface without being lifted up. It is commonly agreed that there are five basic strokes: h (橫 héng)¹, s (豎 shù), p (撇 piě), n (捺 nà) and d (點 diǎn).

¹ The word héng following the hanzi name of the stroke is in pinyin, a phonetic transcription of Chinese characters. We hope these pinyin transcriptions can help readers who do not know Chinese to pronounce the names of the strokes.

In practice, each of these basic strokes has some variations depending on its position in a character. For example, the stroke p (撇) can have two variations: P (平撇 píngpiě), as the top stroke in 千, and q (豎撇 shùpiě), which occurs as a leftmost stroke. In addition, a number of combinations of these basic movements are considered as strokes because they are connected in a natural way in writing. For example, an h (橫) followed by a p (撇) forms the single stroke k, called 橫折撇 (héngzhépiě). Modern studies of Chinese characters [1, 9] identified a small set of around 40 strokes as the basic elements of hanzi.

Although the arrangement of strokes to form a hanzi is very complex, there are some rules that guide the formation of characters. Further, some stroke arrangements are relatively stable and appear in many characters. Some of these arrangements are themselves hanzi, for example, 日; some of them are known as radicals, which are used in Chinese dictionaries to index characters. There are some relatively stable arrangements that are not hanzi themselves, nor radicals, but appear in many characters. We will use the term components to refer to all these kinds of stroke arrangements, while we use the more general term stroke clusters to refer to any arrangement of several strokes.

Except for a small number of very simple characters, such as 人, 二, 十, which cannot be divided into component parts, all hanzi can be considered as compositions of certain components. The ways of composing hanzi from components are known as the structure of the character. Many studies, such as [2] and [8], have identified around 10 different types of structures if one considers how to compose a character from only two components. This does not place serious restrictions, because the composition process can be performed recursively. Figure 1 illustrates the commonly used structures.

3 Related works

Based on the studies of Chinese characters, several attempts have been carried out to create hanzi from a structural composition approach.

Toshiyuki et al. [10] proposed a way of describing Chinese characters using sub-patterns. In principle, their method is similar to the approach of HanGlyph because the underlying theory of character structure is intrinsic to all Chinese characters.

Dong [11] and Fan [5] reported their work on the development of a Chinese character design system which took a parametric approach to create characters in different styles. Lim and Kim [7] developed a system for designing Oriental character fonts by composing stroke elements.

Inspired by the success of METAFONT [6] in creating Latin character fonts, Hobby and Gu [3] attempted to generate Chinese characters of different styles using METAFONT. A small set of strokes was defined in METAFONT. A small set of radicals was then defined as METAFONT macros by using the strokes. Characters can then be specified as METAFONT programs using these macros as building blocks. By varying some parameters governing the shapes of the strokes, fonts of different styles can be generated. However, the research was not conclusive because they only generated fonts with a very small character set (128 characters).

Another attempt similar to Hobby and Gu was done by Hosek [4], who aimed at generating hanzi from a small set of components.

A common theme of the works mentioned is the difficulty of handling the complexity of the structures and the numerousness of characters. Our approach handles the complexity by using an abstract description and a layered CCSS to decompose the complexity into several sub-problems. On the HanGlyph level, we consider strokes as abstract objects.


[Figure 1 panels: Top-bottom; Left-right; Enclosing; Cross; Partially-enclosing, with eight variants numbered 0 to 7]

Figure 1: The basic structure of hanzi.

We need to specify only the relative positions between these abstract objects. On a lower level, we can work out the outline of the strokes and fine-tune the positions.

Another theme of the works mentioned is that they are mainly aimed at the design and generation of character fonts. Our approach can certainly be applied in font generation. However, a very important application area, namely the exchange of Chinese character information, is made possible with our character description language HanGlyph.

4 The HanGlyph language

Based on the analysis described in previous sections, we defined a Chinese character description language, named HanGlyph. The most crucial characteristic of this language is that it is abstract and it captures only the topological relation of the strokes that form a character.

The essential information needed to distinguish a Chinese character is the arrangement of strokes. The precise location of each stroke can vary to a large extent, up to a certain threshold, and the character can still be recognized correctly. For example, the following two characters, 土 and 士, comprise exactly the same strokes in exactly the same arrangement. The only difference between them is the relative length of the two horizontal strokes. Exactly how much longer a horizontal stroke is in these characters is unimportant for distinguishing between them. To recognize the character 土, the threshold is that the upper horizontal stroke must be shorter than the lower one. Therefore, the HanGlyph language does not describe the precise geometric information of the characters.

4.1 The strokes

After studying a number of Chinese linguistic and graphological works, we selected a set of 41 strokes as the primitives of HanGlyph. Each primitive stroke is assigned a Latin letter as its code so that users can easily write HanGlyph expressions using a standard qwerty keyboard. Table 1 lists these primitive strokes.

4.2 The operators and relations

To form a Chinese character, one combines primitive strokes using operators. Five operators are defined as listed in Table 2. Figure 1 illustrates the composition performed by these operators. Each operator combines two operands to form a stroke cluster. This operation continues recursively until the desired character is formed. For example, to describe the character 土, one may first combine two horizontal strokes using the top-bottom operator, then use the cross operator to add a vertical stroke. The HanGlyph expression for this character (written in ASCII characters) is h h=s+. (Note: the expression is in postfix notation.)

However, with only these operators, some characters, like 土 and 士 mentioned above, cannot be distinguished. To resolve situations like this, we can augment the operator with a number of relation specifiers to describe the operation in more specific terms. For our sample character 土, the proper HanGlyph expression should be h h=< s+_ where the symbol < denotes the relation that the length (i.e., the horizontal dimension) of the upper horizontal stroke must be shorter than that of the lower one, and the symbol _ denotes the relation that the two operands of the cross operator, namely 二 and s, are aligned at the bottom.
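The postfix reading of an expression such as h h=< s+_ can be illustrated with a small stack-based parser. The following is only a minimal sketch in Python, not the authors' translator (which is a C program); the names Operand, Cluster, tokenize and parse are hypothetical, and the tokenizer assumes that operands are letters or identifiers and that relation specifiers directly follow their operator.

```python
from dataclasses import dataclass
from typing import List, Union

OPERATORS = "=|@^+"  # top-bottom, left-right, fully enclosing, half enclosing, cross


@dataclass
class Operand:
    name: str  # a stroke code such as "h", or a macro name such as "ri_4"


@dataclass
class Cluster:
    op: str    # one of the five operator symbols
    rels: str  # raw relation specifiers attached to the operator (may be empty)
    first: "Node"
    second: "Node"


Node = Union[Operand, Cluster]


def tokenize(expr: str) -> List[str]:
    """Split an expression into operand tokens and operator-plus-relations tokens."""
    tokens, i, n = [], 0, len(expr)
    while i < n:
        ch = expr[i]
        if ch.isspace():
            i += 1
        elif ch in OPERATORS:
            j = i + 1
            # Relations run up to whitespace, the next operator, or the next operand.
            while j < n and not expr[j].isspace() and expr[j] not in OPERATORS:
                if expr[j] == ".":
                    # Assumption: a '.' may introduce a direction such as .NE
                    j += 1
                    while j < n and expr[j].isalpha():
                        j += 1
                elif expr[j].isalpha():
                    break
                else:
                    j += 1
            tokens.append(expr[i:j])
            i = j
        elif ch.isalpha():
            j = i + 1
            while j < n and (expr[j].isalnum() or expr[j] == "_"):
                j += 1
            tokens.append(expr[i:j])
            i = j
        else:
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    return tokens


def parse(expr: str) -> Node:
    """Build a tree from a postfix expression: each operator pops two operands."""
    stack: List[Node] = []
    for tok in tokenize(expr):
        if tok[0] in OPERATORS:
            if len(stack) < 2:
                raise ValueError(f"operator {tok!r} needs two operands")
            second = stack.pop()
            first = stack.pop()
            stack.append(Cluster(tok[0], tok[1:], first, second))
        else:
            stack.append(Operand(tok))
    if len(stack) != 1:
        raise ValueError("expression does not reduce to a single cluster")
    return stack[0]


tree = parse("h h=< s+_")
print(tree)
```

Because every operator takes exactly two operands, a single stack is enough and no parentheses or precedence rules are needed, which is presumably part of why the postfix form keeps language processors simple.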


Table 1: HanGlyph primitive strokes

Code  Name        Code  Name
d     點          p     撇
D     左點        P     平撇
f     長點        q     豎撇
g     撇點        r     撇折
h     橫          n     捺
i     橫折        v     平捺
j     橫折鉤      t     提
k     橫折撇      U     點提
l     橫折彎      N     橫折折
m     橫折彎鉤    L     橫折折折
a     橫鉤        J     橫折折鉤
s     豎          E     橫折提
b     豎折        K     橫撇彎鉤
c     豎彎        B     豎折折
w     豎彎鉤      C     豎折折鉤
e     豎提        Q     豎折撇
S     豎鉤        M     橫斜鉤
X     彎鉤        R     橫折折撇
Y     斜鉤        z     撇點
W     臥鉤        F     橫折斜
o     橫豎彎鉤

Table 2: HanGlyph operators

Name              Symbol
top-bottom        =
left-right        |
fully enclosing   @
half enclosing†   ^
cross             +

† A digit ranging from 0 to 7 should suffix the half enclose operator to indicate the direction of the opening.

The following four kinds of relations are defined:

1. Dimension: The relations in this group specify the relative dimension of the operands, i.e., comparing the width and height of their bounding boxes. There are four boolean relations: less than (<), greater than (>), not less than (!<), not greater than (!>).

2. Alignment: This specifies how the operands are aligned. The possible alignments are at top (`), at bottom (_), at left ([), at right (]) and centered (#). More than one alignment can be added to an operation, for example, to align at bottom right (_]).

3. Touching: This specifies whether the operands can touch each other. The possible relations are touching (~) or not touching (!~). When combining two elements to form a new character, the strokes next to the interface of the two elements may or may not touch each other. In general, if the strokes on either side of the interface have the same direction, they will not touch each other. Otherwise, the strokes may touch each other.

4. Scale (/): This is used to adjust the width and height of the resulting character after the operation.

4.3 The HanGlyph macros

It would be very tedious if every character had to be described down to all its primitive strokes. Certain arrangements of strokes, such as 日, are very common. They are used to build up characters. We call them components. HanGlyph allows macros to be defined to stand for a component. For example, the component 日 is a macro with the name ri_4. It is defined in terms of


[Figure 2 panels: (a) control points, numbered 1 to 7; (b) skeletal path; (c) outline]

Figure 2: Basic stroke macros for the stroke J.

another macro sih representing the component 囗. The actual definitions are as below:

    let(sih){s i|/h=/}
    let(ri_4){sih h@}

With a small number of operators and relations, and using a postfix notation, the syntax of the HanGlyph language is very simple. This facilitates the development of simple language processors. Figure 6 in Section 5.4 shows the concrete syntax of HanGlyph.

After defining the HanGlyph language, we have written descriptions of more than 3755 Chinese characters (the first level characters in the GB2312-80 character set). We found that the HanGlyph language is adequate for its purpose, which is to capture the topological relation of the strokes.

5 The CCSS

The purpose of the Chinese Character Synthesis System (CCSS) is to render the HanGlyph expressions into a visual representation. For example, the HanGlyph expression h h = < only specifies that there are two horizontal strokes, one above another, and that the upper stroke should be shorter than the lower one. It does not tell us the exact distance between the two strokes. In addition, it does not tell us exactly how much shorter the upper stroke is than the lower one.

The task of the CCSS is to determine and calculate the precise geometric information for each stroke so that a good rendering of characters can be generated.

The CCSS consists of three modules: basic strokes, composition operations and a HanGlyph-to-METAPOST translator. The first two modules are implemented as METAPOST macros. The translator is a C program.

5.1 Basic strokes

Each of the 41 basic strokes listed in Table 1 is implemented as a set of three METAPOST macros:

• A Control-point macro defines the control points, handle points and their properties. For example, the stroke J (橫折折鉤 héngzhézhégōu) has six control points and a handle point as shown in Figure 2(a). The locations of the points are specified relative to a reference point. The properties of a control point indicate whether it is a beginning, an end, or a turning point. These properties will be used in creating the outline, since its shape differs at different types of points. For example, since the second control point is a turning point, the outline at this point will have a serif shape. The properties will also be used in the composition operations to determine whether certain transformation and positioning operations are required.

• A Skeleton macro specifies a path passing through the control points. This path is very important. Given two points, the path can be straight or curvy; therefore, this macro traces out the exact stroke skeleton. Figure 2(b) shows the skeletal path of the stroke J.

• An Outline macro creates the outline for the stroke. It is defined relative to the control points and the skeletal path. Figure 2(c) shows the outline for the stroke J.

The first reason for organizing the stroke composition into three macros is to avoid distortion. In composition operations, each stroke will be transformed several times before the whole character is formed. If the stroke including the outline were represented in one macro, the transformation would distort the stroke thickness and even the direction in certain slant strokes.

Another reason is to provide meta-ness and flexibility. This organization provides several levels of style changes. The first level is to vary the parameters of the outline macros. For instance, changing the stroke thickness parameter will result in characters of varying stroke thickness. If we change the set of outline macros, we can create completely different font styles, but they may still be recognized as a family because the locations of control points are


[Figure 3 panels: (a) skeletal path; (b) outline with serif; (c) skeleton stroked with an angled pen]

Figure 3: Variations of strokes having the same skeleton.

unchanged. More variations can be achieved if the control point macros and the skeleton macros are also changed. The result will be a completely different font family. Figure 3 shows three variations of the same skeleton. Figure 3(a) is the simple skeletal path of the strokes. Figure 3(b) is the outline with serif. Figure 3(c) is generated by stroking the skeleton with a pen angled at 25 degrees.

5.2 Composition operations

CCSS implements five operations corresponding to the five operators defined in HanGlyph. These operations are implemented as METAPOST macros. Again, the operators in HanGlyph represent abstract operations. For instance, the Top-bottom operator (=) only means 'put an operand on top of another'. It carries no precise geometric information. Given this abstract instruction, the macro implementing this operation has to calculate the exact location and dimension of each operand. The resulting rendering should be a well-balanced and well-positioned arrangement of strokes.

One important task of the composition operation is to estimate the relative sizes and positions for its operands so that the result is visually well-balanced. For example, Figure 4 illustrates two characters having the same radical 木 (mù) on their left side. The width of this radical in the first character 林 (lín) is larger than that in the second character 樹 (shù) because the right-hand side of 樹 has many more strokes. We have found that the ratio of the widths of the two components is proportional to the ratio of the sums of the lengths of the strokes and the number of strokes of the components.

Figure 4: The same radical having different widths.

HanGlyph expressions may include a number of relations to augment the operators. The composition operations need to calculate the exact dimension and transformation to apply to each operand. For example, the character 人 (rén) is composed of two strokes where the right-hand one is shorter and the two are aligned at the bottom. The composition operation will first scale the right-hand stroke down to a default size, and then translate it so that the bottom lines of the two strokes are aligned to the bottom of the character box as shown in Figure 5.

Figure 5: A character composed of two strokes aligned to bottom.

While we are talking about transforming the strokes, in fact, only the control points and the skeletal path are transformed. After all strokes forming a character have been put at the right position, the outline is drawn. This avoids the outline being distorted by the transformations.
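The scale-then-translate step described for 人 can be sketched as follows. This is a minimal illustration in Python rather than the METAPOST macros of the CCSS; the function names, the default shrink factor of 0.8 and the sample coordinates are all assumptions made for the example. In line with the text, only control points are transformed, and drawing an outline is left out entirely.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bottom(points: List[Point]) -> float:
    """Lowest y coordinate among the control points."""
    return min(y for _, y in points)


def height(points: List[Point]) -> float:
    """Vertical extent of the control points."""
    ys = [y for _, y in points]
    return max(ys) - min(ys)


def scale(points: List[Point], factor: float) -> List[Point]:
    """Scale control points uniformly about the origin."""
    return [(x * factor, y * factor) for x, y in points]


def translate(points: List[Point], dx: float, dy: float) -> List[Point]:
    """Shift control points by (dx, dy)."""
    return [(x + dx, y + dy) for x, y in points]


def compose_shorter_right(left: List[Point], right: List[Point],
                          shrink: float = 0.8,   # hypothetical default size factor
                          box_bottom: float = 0.0):
    """Shrink the right-hand stroke, then align both strokes to the box bottom."""
    right = scale(right, shrink)
    left = translate(left, 0.0, box_bottom - bottom(left))
    right = translate(right, 0.0, box_bottom - bottom(right))
    return left, right


# Made-up control points for a left-falling and a right-falling stroke.
left_stroke = [(0.5, 1.0), (0.0, 0.1)]
right_stroke = [(0.5, 0.9), (1.0, 0.2)]
new_left, new_right = compose_shorter_right(left_stroke, right_stroke)
print(new_left, new_right)
```

Keeping the transformation on points only means a later outline step can apply a constant stroke thickness, which is the distortion argument made above.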


5.3 HanGlyph to METAPOST translation

The front-end of the CCSS is the translator that converts HanGlyph expressions into METAPOST programs. The current implementation of the translator puts each HanGlyph expression into a METAPOST figure. Within each figure, the appropriate sequence of composition operation macros is called to render the character. The output of the system is a set of PostScript files.

This implementation provides a simple way to render the HanGlyph expressions and obtain previews of the characters. It facilitates the fine-tuning of the composition operations. Future implementations can streamline the process according to the requirements of the target application. For instance, a back-end processor can be added to convert the PostScript output into a particular format, such as a PostScript Type 3 font.

5.4 The syntax of HanGlyph

Figure 6 shows the concrete syntax of HanGlyph in an augmented BNF notation.

6 Conclusion

This paper describes an attempt to synthesize Chinese characters from an abstract description. A Chinese character description language known as HanGlyph has been defined. A Chinese character synthesis system is being developed. It is implemented in METAPOST and C, and the output is rendered in PostScript. The preliminary results show that the approach is very promising. Some of the characters generated by the CCSS are shown in Figure 7.

Currently, we are in the process of fine-tuning the composition parameters. We hope the system will be able to produce visually pleasing characters. There are many factors that may affect the quality of the output, for example, the thickness of the strokes, the allocation of the space occupied by each component, and so on. Therefore, a considerable amount of experimentation is required to determine a set of parameters for composing characters.

There are many applications of such a system. The most important ones are in exchanging Chinese textual information in an open, heterogeneous environment, and in Chinese font generation.

References

[1] SU Pei Cheng. The 20th century research on modern Chinese characters (in Chinese). Su Hai Press, 2001.
[2] LIU Lian Yuan. Analysis of the topological structure of Chinese characters (in Chinese). Shanghai Education Press, 1993.
[3] John Hobby and Gu Guoan. A Chinese meta-font? TUGboat, 5(2):119–136, 1984.
[4] Don Hosek. Design of Oriental characters with METAFONT. TUGboat, 10(4):499–501, 1989.
[5] Fan Jiangping. Towards intelligent Chinese character design. In Raster Imaging and Digital Typography II (RIDT91), pages 166–176. Cambridge University Press, 1992.
[6] Donald E. Knuth. The METAFONTbook. Addison-Wesley, 1986.
[7] Soon-Bum Lim and Myung-Soo Kim. Oriental character font design by a structured composition of stroke elements. Computer-aided design, 27(3):193–207, 1995.
[8] FU Yong He. Basic research on the structure of Chinese characters and their constituents (in Chinese). Shanghai Education Press, 1993.
[9] FU Yong He. Chinese information processing (in Chinese). Guangdong Education Press, 1999.
[10] Sakai Toshiyuki, Nagao Makoto, and Terai Hidekazu. A description of Chinese characters using subpatterns. Information Processing Society of Japan Magazine, 10:10–14, 1970.
[11] Dong YunMei and Li Kaide. A parametric graphics approach to Chinese font design. In Raster Imaging and Digital Typography II (RIDT91), pages 156–165. Cambridge University Press, 1992.


⟨hanglyph⟩ ::= ⟨expr⟩+                                              (1)
⟨expr⟩ ::= ⟨glyph expr⟩ | ⟨macro⟩ | ⟨char⟩                          (2)
⟨glyph expr⟩ ::= ⟨glyph⟩;                                           (3)
⟨macro⟩ ::= let(⟨id⟩){⟨glyph⟩}                                      (4)
⟨char⟩ ::= char(⟨code⟩){⟨glyph⟩}                                    (5)
⟨glyph⟩ ::= ⟨glyph⟩⟨glyph⟩⟨opn⟩                                     (6)
          | ⟨stroke⟩
          | ⟨id⟩
⟨opn⟩ ::= ⟨parallel operator⟩⟨parallel rels⟩                        (7)
        | @⟨full enc rels⟩
        | ^⟨dir all⟩⟨half enc rels⟩
        | +⟨cross rels⟩
⟨parallel operator⟩ ::= = | |                                       (8)
⟨dir⟩ ::= .(E | S | W | N | e | s | w | n)                          (9)
⟨dir all⟩ ::= ⟨dir⟩ | .(NE | SE | NW | SW | ne | se | nw | sw)      (10)
⟨parallel rels⟩ ::= ⟨dimens⟩?⟨aligns⟩?⟨touch⟩?⟨scale⟩?              (11)
⟨full enc rels⟩ ::= ⟨dimens⟩?⟨touch⟩?⟨scale⟩?                       (12)
⟨half enc rels⟩ ::= ⟨dimens⟩?⟨aligns⟩?⟨touch⟩?⟨scale⟩?              (13)
⟨cross rels⟩ ::= ⟨dimens⟩?⟨align⟩?(⟨align⟩ | ⟨intercept⟩)?⟨scale⟩?  (14)
⟨intercept⟩ ::= *⟨dir⟩(⟨+int⟩(⟨real⟩?⟨int⟩?))?                      (15)
⟨dimens⟩ ::= ⟨comp⟩(⟨comp⟩ | ⟨num⟩)?                                (16)
           | ⟨num⟩⟨comp⟩?
           | ⟨num⟩,⟨num⟩
⟨comp⟩ ::= < | > | !< | !> | -                                      (17)
⟨aligns⟩ ::= ⟨align⟩⟨align⟩?                                        (18)
⟨align⟩ ::= ` | _ | [ | ] | #                                       (19)
⟨touch⟩ ::= ~⟨dir spec⟩*                                            (20)
          | !~(⟨dir spec⟩⟨num⟩?)*
⟨dir spec⟩ ::= (.⟨dir⟩)+                                            (21)
⟨scale⟩ ::= /⟨num⟩?                                                 (22)
⟨num⟩ ::= ⟨int⟩ | ⟨real⟩                                            (23)

Figure 6: The syntax of HanGlyph descriptions.
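Rule (4) of the syntax, together with the ⟨id⟩ alternative of rule (6), implies that a macro name can stand wherever a glyph is expected. The sketch below, in Python, reads let definitions and expands a macro down to primitive strokes by plain textual substitution. It is an illustrative assumption about how such expansion could work, not the CCSS translator; read_macros, expand and the depth limit of 32 are hypothetical, and the identifier pattern assumes that macro names do not collide with direction letters used inside relations.

```python
import re
from typing import Dict

# let(<id>){<glyph>} as in rule (4); the body is taken verbatim up to the closing brace.
LET = re.compile(r"let\((\w+)\)\{([^}]*)\}")
# Assumed identifier shape: a letter followed by letters, digits or underscores.
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def read_macros(source: str) -> Dict[str, str]:
    """Collect macro names and their unexpanded bodies."""
    return {name: body.strip() for name, body in LET.findall(source)}


def expand(expr: str, macros: Dict[str, str], depth: int = 0) -> str:
    """Replace macro names by their bodies until only stroke codes remain."""
    if depth > 32:  # hypothetical guard against cyclic definitions
        raise ValueError("macro definitions appear to be cyclic")

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(0)
        if name in macros:
            return expand(macros[name], macros, depth + 1)
        return name  # a primitive stroke code, left unchanged

    return IDENT.sub(substitute, expr)


definitions = """
let(sih){s i|/h=/}
let(ri_4){sih h@}
"""
macros = read_macros(definitions)
print(expand("ri_4", macros))
```

Substituting text works here because a postfix operand sequence remains a valid operand sequence after substitution; an implementation that builds trees could equally store parsed macro bodies instead.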


Figure 7: Some characters generated by CCSS.
