The Marriage of TeX and Lojban
Hong Feng
Suite 3-3, 200 WuZhong Str.
Wuhan, Hubei Province 430040
China P.R.
[email protected]

TUGboat, Volume 23 (2002), No. 1: Proceedings of the 2002 Annual Meeting

Abstract

Lojban is an old artificial language that is free of ambiguity. It can also be used as a tool to express Chinese text encoding in readable ASCII characters, and can thus be applied in the TeX typesetting system.

Keywords: TeX, Lojban, Chinese.

When Prof. Donald Knuth invented the TeX system, Chinese was not considered as a default language to support, as TeX only accepts the readable 7-bit ASCII characters. Chinese, in either the simplified or the traditional encoding set, has many thousands of characters. For example, GB2312-80 contains 6,763 simplified Chinese characters, so it requires at least two bytes (16 bits) to represent one Chinese character. Encodings in the double-byte format are unreadable for people unless one checks the encoding table character by character, and this makes it difficult for TeX to typeset Chinese documents.

Various scenarios have been presented to support Chinese, and many relevant macros have been developed, such as those in LaTeX and ConTeXt and the CJK package developed by Werner Lemberg, which is distributed with the TeX Live CD. The new Omega system tries to work directly with Unicode, which is a popular standard using 16-bit encoding. Despite the differences in technical implementation details, all of them assume that Chinese characters are implicitly expressed in a fixed-length double-byte encoding, which is not readable by people.

CTUG (Chinese TeX User Group) is trying a completely different approach to this problem. By discarding the man-made implicit assumption of a fixed double-byte character length, CTUG imported Lojban to represent the Chinese encoding in variable length but still within the readable ASCII set. This paper documents the idea in some detail, and points out the future tasks under this scenario.

Background Information about Lojban

Lojban (pronounced LOZH-bahn), which stands for "Logic Language" in Lojban, is nothing new; it was presented as a constructed language in 1955 under the name "Loglan" by the project founder, Dr. James Cooke Brown. It is based on the "Sapir-Whorf" hypothesis, which states that the structure of a language constrains thought in that language, and constrains and influences the culture that uses it. Over more than four decades, Lojban has become a mature artificial language.

Here we highlight the main features of Lojban:

• Lojban is designed to be used by people in communication with each other, and now also possibly with computers.

• Lojban is designed to be culturally neutral. It is based on fully phonetic spelling, so people can learn to read Lojban on the fly.

• The regular grammar of Lojban is based on the principles of logic; it is unambiguous and has been successfully tested with YACC. This removes restrictions on creative and clear thought and communication.

• Lojban is designed as a simple language, with just 1,300 root words. Using these root words, it is possible to combine and form millions of words in a vocabulary with ease.

In essence, Lojban is quite close to Chinese grammar, so a Chinese speaker can quickly become a Lojban user. In the training seminar given by CTUG, practice has shown that some CTUG members could learn and grasp it within a week.

Chinese as Expressed in Lojban

Now, let's check how Chinese words are constructed. Overall, most Chinese linguists agree that Chinese has only four methods to construct a character: XiangXing, ZhiShi, HuiYi and XingSheng. Although it is hard to describe them in formal language, any Chinese character is constructed by one of these four methods, and the first method, "XiangXing", is fundamental to the construction.

XiangXing means drawing a sign for a given meaning; thus Chinese is classified as an ideographic system in the language taxonomy.

According to research into the signs recorded on ancient tortoise bones, the most original and frequently used signs number fewer than 500. Tortoise bones are the back shells of tortoises, on which Chinese people recorded oracles. The signs of the oracles are the oldest Chinese characters discovered so far, and they are the origin of modern Chinese characters. Statistically, Chinese characters constructed by XingSheng (mostly based on XiangXing ideographs) make up more than 90% of the modern Chinese character repertoire.

A XingSheng character consists of two parts: one part indicates the pronunciation of the character, and the other part indicates the meaning of the character.

For example, my name in Chinese, Hong Feng, is expressed in two characters, each of which is constructed as a XingSheng character. Hong has two parts: the three points at the left side are the Xing part, and indicate that the character is related to water; the right part is the Sheng part, pronounced "gong", meaning that the character has "ong" in its pronunciation. The character means large-scale, macro, magnificent, giant.

Feng has two parts too. The left part is also the Xing part, a XiangXing ideograph for mountains, and the right part specifies the pronunciation to be "feng". The character means the top of the mountain.

Thanks to more than five thousand years of simplification over its history, the grammar rules of the language are truly simple today. Chinese texts are very similar to an assembly line in an automobile production workshop: Chinese characters are placed one by one without stop (i.e. without blank space left between them), much as TeX places characters in a box one after another to form a word, words one after another to form a sentence or a line, lines one after another to form a paragraph, and paragraphs one after another to fill a page to the end.

In the TeX system, if you have a new sign which is not defined yet, you can design the glyph of the new sign in METAFONT (or in the PostScript language) in a box, and give the box a name (as a new control sequence) in the METAFONT (or PostScript) program; after doing that, you can use the new sign with TeX as if it were one of the built-in readable ASCII characters.

Such cases have happened many times in TeX history. For example, the Euro currency was put into use on January 1, 2002, but the currency symbol was made available for TeX much earlier by two NTG members, who designed the symbol in METAFONT (see MAPS Number 27), so you can now express the Euro in a TeX document directly by \symbol[euro].

Likewise, Chinese characters can be handled in the same way. Once we give each sign in a box a name in Lojban (which also means that we discard the fixed double-byte Chinese encoding and instead use a variable number of readable ASCII characters to represent the Chinese), TeX can immediately be regarded as a native formatter for Chinese!

If we design carefully, we can build a one-to-one mapping table between the existing Chinese encoding set (GB, Big5, Unicode or whatever) and the Lojban expression set, which makes the conversion easy to handle by scripts in Perl or whatever. As Lojban expressions are in readable ASCII, they can be edited using any editor (such as GNU Emacs), even on a simple text-only terminal.

Marriage of TeX and Lojban

As we have seen above, it is possible to encode Chinese by using Lojban as the meta-language. This is the key step toward making the marriage of TeX and Lojban happen.

It is necessary to review how TeX defines a control sequence. In TeX, π is written as \pi. Likewise, suppose we define the glyphs of the Chinese figures (from zero up to nine) as control sequences in Lojban, respectively, like this:

Chinese    Lojban
=================
zero       \no
one        \pa
two        \re
three      \ci
four       \vo
five       \mu
six        \xa
seven      \ze
eight      \bi
nine       \so

Now, we can typeset the Chinese number "two zero zero two" this way in TeX: \re\no\no\re. The backslash characters won't add too much burden for people to read, and by designing a new macro, it is possible to remove them like this: \chinese{re no no re}.

TUG "two zero zero two" (English and Chinese combined together for TUG2002) can be represented as TUG\chinese{re no no re}.

TeX and Lojban have agreed to marry.

Tradeoffs and Benefits

The tradeoff of the marriage is obvious: it adds one step to convert the current Chinese double-byte encodings into Lojban expressions, though we can let a computer do the job automatically, which can

Lojban is suitable to describe the logic relations because it is designed as a logic language.

• The Lojban specification comes with a dictionary which contains ca. 1,300 root words, so it just requires some time and care to build the mapping relation to finish the task.

• Define free, high quality fonts of Chinese. In practice, at least four fonts are required. Now this is a part of the MNM Project (MNM's Not Millions). There are many non-free Chinese fonts available, so commercial publishers can purchase the non-free fonts, and we can add the fonts by applying this scenario.
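Returning to the \chinese macro mentioned in the section on the marriage of TeX and Lojban: the paper does not give its definition, so the following is only a minimal plain TeX sketch of how such a macro might be written. The helper names \chinesewords and \rest are hypothetical, and the digit control sequences are defined here as Arabic-digit stand-ins because no Chinese glyph font is assumed to be available. Note that plain TeX already defines \mu as the math-mode Greek letter, so the sketch keeps its redefinitions inside a group.

```tex
% Sketch only: stand-in definitions for the ten digit control sequences.
% A real setup would select Chinese digit glyphs from a font; Arabic
% digits are used here purely so that the file runs as-is (assumption).
\begingroup % keeps the redefinition of plain TeX's \mu local
\def\no{0}\def\pa{1}\def\re{2}\def\ci{3}\def\vo{4}%
\def\mu{5}\def\xa{6}\def\ze{7}\def\bi{8}\def\so{9}%

% \chinese{re no no re} behaves like \re\no\no\re.
% \chinesewords (hypothetical helper) peels off one space-delimited
% word at a time and turns it into a control sequence via \csname.
\def\chinese#1{\chinesewords#1 \relax}
\def\chinesewords#1 #2\relax{%
  \csname#1\endcsname
  \def\rest{#2}%
  \ifx\rest\empty \let\next\relax
  \else \def\next{\chinesewords#2\relax}\fi
  \next}

TUG\chinese{re no no re}
\endgroup
\bye
```

With the stand-in definitions above, the last text line should typeset as TUG2002. A word with no matching definition is silently treated as \relax by \csname, so a production version would probably want an explicit check for undefined names.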