Recognition of Printed Chinese Characters
Total Page:16
File Type:pdf, Size:1020Kb
IEEE TRANSACTIONS ON ELECTRONIC COMPUTERS VOL. EC.15, NO. I FEBRUARY, 1966 which shows the compiled program of Fig. 3 together REFERENCES with the new version which requires only four (instead [1] H. Hellerman, "System organization for detection and execu- tion of parallel operations in algebraic statements," IBM Re- of 14) words for intermediate results. search Note NC-164, October 31, 1962. The principles of the minimization algorithm are as [21 D. L. Slotnick, W. C. Borek, R. C. McReynolds, "The SOLO- MON computer," in 1962 Proc. Fall J.C.C., vol. 22. Baltimore, follows: Md.: Spartan Books, 1963, pp. 97-107. [31 H. Hellerman, "Experimental personalized array translator sys- 1) A correspondence table is set up between "old" tem," Comm. ACM, vol. 7, pp. 433-438, July 1964. [41 P. Dreyfus, "Programming design features of the GAMMA 60 starred integers as they appear in the instructions of the computer," in 1958 Proc. E.J.C.C. New York: AIEE, 1959, pp. first stage and the successive integers 1*, 2*, 3*, etc; 174-181. [5] C. L. Hamblin, "Translation to and from Polish notation," Computer Journal, vol. 5, pp. 210-213, October 1962. 2) Starting with the second stage instructions, each [6] K. E. Iverson, A Programming Language. New York: Wiley, 1962. stored operand symbol in the leftmost two address fields [71 M. E. Conway, "A multiprocessor system design," in 1963 Proc. is looked up in the correspondence table and replaced Fall J.C.C., vol. 24. Baltimore, Md.: Spartan Books, 1963, pp. 139-146. by the value found there. The "old" value label is then [81 R. N. Thompson and J. A. Wilkinson, "The D-825 automatic dropped from the table entry, thus placing the entry on operating and scheduling program," in 1963 Proc. Spring J.C.C., vol. 23. Baltimore, Md.: Spartan Books, 1963, pp. 41-49. the "available" list; [9] D. J. Howarth, P. D. Jones, and M. T. Wyld, "Experience with the ATLAS scheduling system," in 1963 Proc. Spring J.C.C., vol. 23. Baltimore, Md.: Spartan Books, 1963, pp. 59-67. 3) The result location is taken as the first entry on [10] J. S. Squire and S. M. Palais, "Programming and design consid- the "available" list; it is labeled with the starred integer erations of a highly parallel computer," in 1963 Proc. Spring J.C.C., vol. 23. Baltimore, Md.: Spartan Books, 1963, pp. from the instruction "result" field (rightmost address 359-400. field) of the instruction under consideration. [11] A. J. Critchlow, "Generalized multiprocessing and multipro- gramming systems," in 1963 Proc. Fall J.C.C., vol. 24. Balti- more, Md.: Spartan Books, 1963, pp. 107-126. [12] R. R. Seeber and A. B. Lindquist, "Associative logic for highly At the conclusion of compilation, all storage for inter- parallel systems," in 1963 Proc. Fall J.C.C., vol. 24, Baltimore, mediate results will appear as "available" with the ex- Md.: Spartan Books, 1963, pp. 489-493. [13] R. W. Allard, K. A. Wolf, and R. A. Zemlin, "Some effects of ception of one word, which is assigned to the final result the 6600 computer on language structures," Comm. ACM, vol. of the computation. 7, pp. 112-119, February 1964. Recognition of Printed Chinese Characters R. CASEY AND G. NAGY Abstract-The problem of recognizing a large alphabet (1000 I. INTRODUCTION different characters) is approached using a two stage process. In the first stage of design, the data is partitioned into groups of A. Need for Chinese Character Recognition similar characters by means of heuristic and iterative algorithms. rri HE RAPID EMERGENCE of China as one of In the second stage, peephole templates are generated for each j the of has character in such a way as to guarantee discrimination against other leading producers publications fairly characters in the same similarity class. swamped United States translators monitoring Recognition is preceded by establishing an order of search Chinese activity. In 1962 the level of Chinese to Eng- through the groups with a relatively small number of "group masks." lish translation was estimated at 3.5 million words per The character is then identified by means of the "individual masks." year, in contrast to the estimated need of 34.4 million through a threshold criterion. the alone The effects on the error and reject rates of varying the several words per year by intelligence comrnunity parameters in the design and test procedure are described on the [7]. This requirement is expected to grow at the rate of basis of computer simulation experiments on a 20 000 character about 25 percent per year. data set. An error rate of 1 percent with 7 percent rejects, is obtained Spurred by such statistics, numerous serious ma- on new data. chine translation groups have devoted considerable attention to Chinese syntax and lexicography. The re- Manuscript received August 13, 1965; revised October 24, 1965. strictions imposed by the peculiar structure of Chinese The authors are with the IBM Watson Research Center, York- town Heights, N. Y. do not appear to be critical; in fact, much of the tech- 91 92 IEEE TRANSACTIONS ON ELECTRONIC COMPUTERS 'E:BRUARY nique and hardware developed for other languages, par- been greatly accelerated in the last ten years; in main- ticularly Russian, is applicable [6]. The translation land China several hundred simplified characters have effort has been severely handicapped, however, by the already replaced their complex counterparts [51. There difficulty of coding large amounts of Chinese material in is also a sustained drive towards latinization, but the in- a form suitable for computer processing. The best cod- troduction of a phonetic alphabet is hampered by the ing device to date, the manually operated sinotype fact that the same symbols are often pronounced in machine [6], processes only about 12 characters per completely different fashion in the various regions. minute, a rather low rate compared to keypunching The size of the "alphabet," and the difficulties encoun- operations in the Western languages. Thus an auto- tered in assigning identities to individual characters, mated system for encoding Chinese text may be even preclude the widespread use of typewriters and coding more appealing-and urgent-than its counterpart in devices (such as the flexowriter) in Chinese. Perhaps English or Russian. this is why character frequency counts, and especially The very real need for machine recognition of printed digram and trigram probabilities, are virtually non- Chinese has formed, however, only part of the motiva- existent. There is very little agreement as to how many tion for the study reported here; the experience derived characters would constitute a useful complement for a from working with a large pattern vocabulary was reading machine. Estimates range from 2000 to 5000 thought to be worthwhile in itself. In addition, it was characters [4], [6]. About all that can be said for cer- hoped that useful techniques would be accumulated for tain is that the more, the better. certain closely related applications, such as searching a Figure 2 shows a character frequency count prepared large file of Chinese documents for particular signifi- for a survey of teaching methods in elementary schools cant groups of ideographs (references to rockets, for [3]. The study from which this count was taken is example) and recognition of Japanese symbols. quite a thorough one, including even frequency counts for radicals; a similar count on scientific matter would be B. Chinese Ideographs most helpful. Chinese ideographs usually correspond to complete words in Western languages, although in many instances 100r a whole group of ideographs may be translated by a single word in English. An example of Chinese print is (n shown in Fig. 1. U z 60 fr4H H 40 Fig. 1. Chinese ideographs. This font style is commonly used in 0 Taiwan, Hong Kong, and the U.S.A. Simplified forms have been H gaining acceptance in mainland China. z 20 U , III ,,, Most common ideographs are derived from a pictorial 10 100 1000 10,000 representation of objects. Complex ideas are described NUMBER OF IDEOGRAPHS by groupings of patterns, usually within an imaginary Fig. 2. Cumulative frequency of Chinese characters. Curve based on rectangular frame. character count comprising 902 678 occurrences by H. C. Chen. Subpatterns common to many ideographs are called radicals. Radicals may convey either a meaning or a The one feature which raises hopes of constructing a sound. Scholars recognize only 214 distinct radicals, but workable Chinese reading machine is the uniformity of many of these have so-called "variant" forms which bear printed ideographs. Standardization of font styles has little resemblance to the "main" forms [9]. The size and progressed furthest in mainland China, but few printers position of a radical varies according to the complexity anywhere would willingly store many styles of a 7000 to and structure of the remainder of the ideograph. 8000 character type case. Characters and radicals may be further analyzed in Printed lines run horizontally in mainland China, and terms of "strokes." The term serves as a reminder that vertically almost everywhere else. The size commonly for millenniums all Chinese was brushed, but aside from used in scientific journals runs eight characters to the this it defies precise definition. inch. The type slugs are perfectly square; the spacing is Radicals contain from one to seventeen strokes, while thus fixed, while the intercharacter separation depends even common characters may run up to 27 strokes. This only on the complexity of the adjacent characters. Mini- renders learning and writing Chinese script a formidable mum separation with clean type is of the order of one task, and imposes severe problems even in printing. fifteenth of a character width.