Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words
Total Page:16
File Type:pdf, Size:1020Kb
Visually and Phonologically Similar Characters in Incorrect Simplified Chinese Words Chao-Lin Liu† Min-Hua Lai‡ Yi-Hsuan ChuangĹ Chia-Ying Leeľ †‡ĹDepartment of Computer Science; †ľCenter for Mind, Brain, and Learning National Chengchi University ľInstitute of Linguistics, Academia Sinica {†chaolin, ‡g9523, Ĺg9804}@cs.nccu.edu.tw, ľ[email protected] the incorrect characters. For instance, one might Abstract use “ӊ” (hwa2) in the place of “㣣”(hwa4) in 㣣ຝ” (ke1 hwa4 xing2 xiang4) partiallyڅ“ -Visually and phonologically similar cha racters are major contributing factors for because of phonological similarity; one might errors in Chinese text. By defining ap- replace “ܟ” (zhuo2) in “Ј㞇Κܟ” (xin1 lao2 propriate similarity measures that consid- li4 zhuo2) with “䶅” (chu4) partially due to visu- er extended Cangjie codes, we can identi- al similarity. (We do not claim that the visual or fy visually similar characters within a phonological similarity alone can explain the fraction of a second. Relying on the pro- observed errors.) nunciation information noted for individ- Similar characters are important for under- ual characters in Chinese lexicons, we standing the errors in both traditional and simpli- can compute a list of characters that are fied Chinese. Liu et al. (2009a-c) applied tech- phonologically similar to a given charac- niques for manipulating correctness of Chinese ter. We collected 621 incorrect Chinese words to computer assisted test-item generation. words reported on the Internet, and ana- Research in psycholinguistics has shown that the lyzed the causes of these errors. 83% of number of neighbor characters influences the these errors were related to phonological timing of activating the mental lexicon during the similarity, and 48% of them were related process of understanding Chinese text (Kuo et al. to visual similarity between the involved 2004; Lee et al. 2006). Having a way to compute characters. Generating the lists of phono- and find similar characters will facilitate the logically and visually similar characters, process of finding neighbor words, so can be in- our programs were able to contain more strumental for related studies in psycholinguistics. than 90% of the incorrect characters in Algorithms for optical character recognition for the reported errors. Chinese and for recognizing written Chinese try to guess the input characters based on sets of 1 Introduction confusing sets (Fan et al. 1995; Liu et al., 2004). The confusing sets happen to be hand-crafted In this paper, we report the experience of our clusters of visually similar characters. studying the errors in simplified Chinese words. It is relatively easy to judge whether two cha- Chinese words consist of individual characters. racters have similar pronunciations based on Some words contain just one character, but most their records in a given Chinese lexicon. We will words comprise two or more characters. For in- 1 discuss more related issues shortly. stance, “䟡” (mai4) has just one character, and To determine whether two characters are vi- yu3 yan2) is formed by two characters. sually similar is not as easy. Image processing) ”ق俟“ Two most common causes for writing or typing techniques may be useful but is not perfectly incorrect Chinese words are due to visual and feasible, given that there are more than fifty phonological similarity between the correct and thousand Chinese characters (HanDict, 2010) and that many of them are similar to each other in special ways. Liu et al. (2008) extend the 1 We show simplified Chinese characters followed by Cangjie codes (Cangjie, 2010; Chu, 2010) to en- their Hanyu pinyin. The digit that follows the symbols for the sound is the tone for the character. code the layouts and details about traditional 739 Coling 2010: Poster Volume, pages 739–747, Beijing, August 2010 Chinese characters for computing visually simi- pronunciations, we consider that C1 and C2 be- lar characters. Evidence observed in psycholin- long to the SS category. This principle applies guistic studies offers a cognition-based support when we consider phonological similarity in oth- for the design of Liu et al.’s approach (Yeh and er categories. Li, 2002). In addition, the proposed method One challenge in defining similarity between proves to be effective in capturing incorrect tra- characters is that the pronunciations of a charac- ditional Chinese words (Liu et al., 2009a-c). ter can depend on its context. The most common In this paper, we work on the errors in simpli- example of tone sandhi in Chinese (Chen, 2000) fied Chinese words by extending the Cangjie is that the first third-tone character in words codes for simplified Chinese. We obtain two lists formed by two adjacent third-tone characters will of incorrect words that were reported on the In- be pronounced in the second tone. At present, we ternet, analyze the major reasons that contribute ignore the influences of context when determin- to the observed errors, and evaluate how the new ing whether two characters are phonologically Cangjie codes help us spot the incorrect charac- similar. ters. Results of our analysis show that phonolog- Although we have confined our definition of ical and visual similarities contribute similar por- phonological similarity to the context of the tions of errors in simplified and traditional Chi- Mandarin Chinese, it is important to note the in- nese. Experimental results also show that, we can fluence of sublanguages within the Chinese lan- catch more than 90% of the reported errors. guage family will affect the perception of phono- We go over some issues about phonological logical similarity. Sublanguages used in different similarity in Section 2, elaborate how we extend areas in China, e.g., Shanghai, Min, and Canton and apply Cangjie codes for simplified Chinese share the same written forms with the Mandarin in Section 3, present details about our experi- Chinese, but have quite different though related ments and observations in Section 4, and discuss pronunciation systems. Hence, people living in some technical issues in Section 5. different areas in China may perceive phonologi- cal similarity in very different ways. The study in 2 Phonologically Similar Characters this direction is beyond the scope of the current study. The pronunciation of a Chinese character in- volves a sound, which consists of the nucleus and 3 Visually Similar Characters an optional onset, and a tone. In Mandarin Chi- nese, there are four tones. (Some researchers in- Figure 1 shows four groups of visually similar clude the fifth tone.) characters. Characters in group 1 and group 2 In our work, we consider four categories of differ subtly at the stroke level. Characters in phonological similarity between two characters: group 3 share the components on their right sides. same sound and same tone (SS), same sound and The shared component of the characters in group different tone (SD), similar sound and same tone 4 appears at different places within the characters. (MS), and similar sound and different tone (MD). Radicals are used in Chinese dictionaries to We rely on the information provided in a lex- organize characters, so are useful for finding vi- icon (Dict, 2010) to determine whether two cha- sually similar characters. The characters in group racters have the same sound or the same tone. 1 and group 2 belong to the radicals “Җ” and “ ”, The judgment of whether two characters have respectively. Notice that, although the radical for similar sound should consider the language expe- group 2 is clear, the radical for group 1 is not rience of an individual. One who live in the obvious because “Җ” is not a standalone compo- southern and one who live in the northern China nent. may have quite different perceptions of “similar” However, the shared components might not be sound. In this work, we resort to the confusion the radicals of characters. The shared compo- sets observed in a psycholinguistic study con- nents in groups 3 and 4 are not the radicals. In ducted at the Academic Sinica. Some Chinese characters are heteronyms. Let C1 and C2 be two characters that have multiple pronunciations. If C1 and C2 share one of their Figure 1. Examples of visually similar characters 740 many cases, radicals are semantic components of ID CC Cangjie ID CC Cangjie Chinese characters. In groups 3 and 4, the shared 1 ҖҖ! 10 偂! ДΓЈЈЉ! components carry information about the pronun- 2 җ ύҖ! 11 㟻! НЈЈЉ! ciations of the characters. Hence, those charac- 3 Ҙ Җύ! 12 ! ЕЈЈЉ! ters are listed under different radicals, though 4 ҙ ύҖύ! 13 䠑! αДΓ! they do look similar in some ways. 5 侪ЉЉζΓΜ!! 14 䡿! ҖααДΓ! Hence, a mechanism other than just relying on 6 侘ЉЉζΜ! 15 䟋! αΓεν! information about characters in typical lexicons 7 侓ЉЉζΜ! 16 䟄! ψεν! is necessary, and we will use the extended Cang- 8 呆ψψВВ 17 勹! ψДΓ! jie codes for finding visually similar characters. 9 厇ψψЈα 18 䶈! ζψΓ! Table 1. Examples of Cangjie codes 3.1 Cangjie Codes for Simplified Chinese jie codes for computing similar characters. The Table 1 shows the Cangjie codes for the 13 prefix “ψ” in c16 and c17 can represent “ ”, characters listed in Figure 1 and five other “吏” (e.g., c ), and “卺” (e.g., c ). Characters characters. The “ID” column shows the 8 9 identification number for the characters, and we whose Cangjie codes include “ψ” may con- th tain any of these three components, but they do will refer to the i character by ci, where i is the ID. The “CC” column shows the Chinese not really look alike. characters, and the “Cangjie” column shows the Therefore, we augment the original Cangjie Cangjie codes. Each symbol in the Cangjie codes codes by using the complete Cangjie codes and corresponds to a key on the keyboard, e.g.