Some Experience in Text Processing in the Chinese Language

Brian R Gaines
Monotype (China) Ltd., Hong Kong, and G W Information Transfer Systems Ltd., Cirencester, UK

The Chinese language presents many difficulties in text processing. There are some 7,000 characters in routine use, and conventional approaches to keyboards, displays and printers are unable to cope with the set required. Yet the language is a very important one since it is in daily use by one quarter of the population of the world. This paper describes a complete phototypesetting system recently developed for use with text in the Chinese and English languages and now in use for book printing in Beijing and Shanghai. Recent work on the application of a similar approach to data processing in Chinese is also outlined.

INTRODUCTION

One by one the languages of the world have been conquered by the modern technology of electronic keyboards, text editors and phototypesetters. But there are a few challenges left, one being Chinese, the oldest recorded language in use today. Its commercial importance lies in the quarter of the world's population for whom Chinese is the main language. Its technical difficulty lies in the many thousands of different Chinese characters required and, to some extent, in the complexity of the characters themselves.

In the People's Republic of China itself the drive for Mao Zedong's "Four Modernizations" (of Agriculture, Industry, National Defence and Modern Sciences) has created a demand for modern "information technology". However, so much of this technology has originated from countries using the English language, or at least the Roman alphabet, that it not only requires a knowledge of English for its use but also requires that all information be expressed in Roman characters. For numerical applications of computer systems this does not pose too much of a problem: the specialists, programmers and operators have to have a working knowledge of English.
However, for database and text-processing applications, where the information itself is intrinsically in a non-Roman language, it is extremely problematic to develop any effective systems.

This problem is not peculiar to computers: printing technology has never been well-suited to the Chinese language and there has been a wide variety of attempts to Romanize the script (Seybolt and Chiang 1979). Mao himself is widely quoted for his speech in 1951 when he said, "The written language must be reformed; we must proceed in the direction of phoneticization being taken by all languages of the world", and Zhou Enlai echoed this in 1958: "The immediate tasks in writing reform are simplifying the Chinese characters, spreading the use of the standard vernacular, and determining and spreading the use of phonetic spelling in Chinese". Neither is it a problem peculiar to Chinese: in Pakistan, for example, some newspapers in Urdu are still produced by a staff of some 60 calligraphers trained to a common handwritten style, since there has been until recently no printing technology that can cope with the complexity of the language.

However, in recent years developments in low-cost semiconductor systems have given us new technologies for text and image processing that now make it technically and commercially feasible to develop computer-based systems that operate in any of the languages of the world. Certainly the technical reasons for Romanizing languages such as Chinese have now become far less pressing. Computer technology is a means of making complex tasks simple (although the opposite often seems so!) and offers the possibility of information systems that operate fully in any script. Rather than bend the language to the technology it is now feasible to use the technology to support operations in the language; it does seem reasonable to suppose that a "user friendly" system for use in China should operate in Chinese rather than English.
In September 1978 Monotype decided that the time was ripe to tackle the problems of text capture, editing and phototypesetting in the Chinese language. By June 1979 complete systems had been developed and installed in Beijing and Shanghai. This paper gives the background to this development and the technologies used.

TYPESETTING THE CHINESE LANGUAGE

It is impossible to define precisely the number of characters, or ideograms, in the Chinese language. There are over 60,000 ideograms recorded in use during different periods of Chinese history, while the modern standard dictionaries used in China list some 13,000 characters currently in use. For the printing of books a face of some 7,000 characters is adequate, and for newspapers some 4,500 characters. Chinese typewriters provide about 2,000 characters in the type case under the print head and about another 2,000 available for insertion as required.

Chairman Mao Zedong in all his writing used a vocabulary of only 3,006 characters. There is a major movement in China to simplify the Chinese language by restricting it to only 3,260 characters, but this is a contentious issue. For printing purposes, no matter how many characters are provided there will always be the need for more through a good 'sorts' facility, since specific jobs require access to nonstandard characters, for example in quoting from an ancient Chinese work. Similarly, in database systems it should be noted that the most nonstandard characters are those for personal and place names!

The calligraphy of Chinese characters was greatly simplified in China after the liberation in a move to aid literacy. A standard form of phonetic romanization of the Chinese language, Pin Yin, was also introduced and is widely used in China for shop and street names; however, it has yet to have any major impact on the printing industry.
The direction of setting of Chinese text was also changed to correspond to the Western format of horizontal, left-to-right reading, rather than the original Chinese vertical setting from right to left. The simplified characters and horizontal setting are primarily used in China, Singapore and Malaysia, but the original characters and vertical setting are still used in Hong Kong, Taiwan, and, with extensions, in Japan.

Because of the large number of characters required, Chinese text is still primarily hand-set using hot metal techniques in the printing industry. The operators work within an alcove of type cases containing the characters needed for the text they are preparing. The arrangement of the characters and the number available is a feat of organization that minimizes the effort in hand picking for a particular text. The configuration required for rapid setting of newspapers can be quite monumental, with operators literally skating from case to case to achieve high speeds. On routine textual material skilled operators can achieve continuous throughput of up to 1,000 characters an hour. As a rough guide in translating these figures for comparison with Latin languages, a three to one ratio has been found appropriate in translated material, i.e. one Chinese character requires about three English ones on average. Thus a comparable setting rate for English would be about 3,000 characters an hour, or just under one a second. This compares unfavourably with the setting rates for skilled keyboard operators with Roman texts.

Flat-bed, hand-operated filmsetters made in Shanghai are also in use in China for technical book production. The machine provides 9,555 characters on a five by seven matrix of glass plates, each of which has a 13 by 21 matrix of characters.
A turret lens system allows the size of the characters to be varied optically in the range from 4 points to 60 points (there are 72 points to an inch). Some of the plates contain mathematical and chemical symbols, Latin and Cyrillic alphabets, and so on. They are readily interchanged to provide the particular faces required for specific texts. The operator moves the main bed around whilst viewing the characters through a magnifier. When the one required is found, a lever is moved which engages a ratchet to fix the precise location of the character to be exposed. Skilled operators can achieve throughputs of the order of 1,000 characters an hour, which is comparable with hand setting.

These then are the typesetting technologies with which any new approach has to compete. Both the hot metal and the film systems give access to the very large number of characters required to set Chinese; both give the capability of setting Chinese mixed with other languages and technical material; and both give a very high quality of output. The main disadvantages of the two systems are that they are manually operated at a slow rate, require skilled operators with some three years' training to reach full throughput, and give no facilities for text storage, editing or aids to page layout.

The electronic phototypesetting techniques that have been developed so rapidly in the West and used extensively for book, magazine and newspaper production have so far proved unsuitable for Chinese, primarily because of the number of characters required, but also because of the high status of the calligraphic arts in China, which demands the highest quality in any printed text. It is salutary also, when examining the speed and effectiveness of hand setting in China, to note how competitive it is with modern technology.
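The figures quoted in this section for setting rates, filmsetter capacity and type sizes can be checked with a little arithmetic. The following is only an illustrative sketch in Python: the function and constant names are assumptions introduced here and are not part of the systems described, and the orientation of the plate and character matrices (which dimension runs across and which down) is assumed, although it does not affect the totals.

```python
# Illustrative check of figures quoted in the text.
# All names below are hypothetical and chosen for this sketch only.

CHINESE_CHARS_PER_HOUR = 1_000  # skilled hand-setting or filmsetter throughput
ENGLISH_PER_CHINESE = 3         # rough ratio found in translated material
POINTS_PER_INCH = 72            # as stated in the text
SECONDS_PER_HOUR = 3_600


def english_equivalent_rate(chinese_per_hour, ratio=ENGLISH_PER_CHINESE):
    """Convert a Chinese setting rate into a comparable English rate."""
    return chinese_per_hour * ratio


def per_second(chars_per_hour):
    """Convert characters per hour into characters per second."""
    return chars_per_hour / SECONDS_PER_HOUR


def filmsetter_capacity(plates_a=5, plates_b=7, chars_a=13, chars_b=21):
    """Total characters on a grid of plates, each holding a grid of characters."""
    return plates_a * plates_b * chars_a * chars_b


def points_to_inches(points):
    """Convert a type size in points to inches."""
    return points / POINTS_PER_INCH


if __name__ == "__main__":
    rate = english_equivalent_rate(CHINESE_CHARS_PER_HOUR)
    print(rate, "English characters an hour")
    print(round(per_second(rate), 2), "characters a second")
    print(filmsetter_capacity(), "characters on the filmsetter")
    print(round(points_to_inches(4), 3), "to",
          round(points_to_inches(60), 3), "inches")
```

The plate calculation reproduces the 9,555 characters quoted for the Shanghai filmsetter (35 plates of 273 characters each), and the rate calculation gives roughly 0.83 characters a second, consistent with "just under one a second".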
