Pinyin to Hanzi Conversion with Hidden Markov Models
Lucas Freitas and Cynthia Meng
Computer Science 187 Final Project
May 21, 2013

1. Introduction to Pinyin

The Chinese language is written entirely in characters, with each character generally representing at least one word. When different characters are combined, different words may also be formed. Before modern technology, characters were written solely with brush strokes. During the 1950s, however, a system called Pinyin was developed to aid the transcription of Chinese characters into the Latin alphabet.

Pinyin proved incredibly useful during the latter half of the twentieth century, as it was used to help teach the language to non-native Mandarin speakers, and it eventually came to be used as the main input method for entering Chinese characters into computers. It is still used today in Chinese language programs around the world, as it is one of the most efficient methods of teaching Chinese to native speakers of Latin-alphabet-based languages, such as English, Spanish, and others.

As a brief overview of Pinyin (a small amount of knowledge of which will help with understanding this project), there are five different tones in Mandarin Chinese, which can be represented by tone marks above the vowels in the Pinyin of a character (Table 1).

Tone                      Pinyin   Meaning
Flat or level (first)     mā       mother
Rising (second)           má       hemp
Falling-rising (third)    mǎ       horse
Falling (fourth)          mà       scold
Neutral                   ma       question particle

Table 1: The Mandarin Chinese tones

2. Inspiration

Although Pinyin is certainly extremely helpful for visualizing the characters, some problems arise when attempting Pinyin to Hanzi conversion, especially given the breadth and scope of the Chinese language.

Both of us have a strong interest in the Chinese language and the relationship between characters and Pinyin, and we were interested in implementing a project that would somehow draw these two integral components of the Chinese language together.

Inspired by the word segmentation problem set from earlier in the course, we decided to attempt an implementation of Pinyin to Hanzi conversion that involves a statistical model.

3. Heterographs

In Chinese, a given syllable can have up to five different tones (including the neutral tone), each of which can stand for a different character. When a piece of text is presented without tone marks, it becomes exceedingly difficult to discern which character is the correct one to use. In this case, context clues are particularly helpful, and our program would need to take them into account. Examples of this phenomenon are in Table 2.

Pinyin (no tone)   Character   Meaning
hen                恨          to hate
hen                很          very
shi                十          ten
shi                是          to be
he                 和          and
he                 喝          to drink

Table 2: Examples of heterophones in Chinese

4. Homographs (多音字)

In addition to the problem of heterophones, there is also the problem that certain characters can have different pronunciations and meanings depending on their contexts (homographs). This phenomenon is called 多音字 (duōyīnzì) and can be seen in Table 3 below.

Pinyin (no tone)   Character   Meaning
wo jian ta le      我见他了    I saw him
wo shou bu liao    我受不了    I can't stand it
yin hang           银行        bank
zi xing che        自行车      bicycle

Table 3: Examples of homographs in Chinese

5. Goals

Given these difficulties with Pinyin, we decided that our goal for this project would be to implement a system that, given a text composed of unsegmented Pinyin, would return the proper Chinese character conversion of that Pinyin. Examples of input and output are shown in Table 4.

Pinyin (no tone)         Characters
wo hen ai ta de shu      我很爱她的书
ta zai kan yi ben shu    她在看一本书

Table 4: Desired outputs for given inputs

As a design decision, we chose to work with Simplified Chinese, although we could easily adapt our program to account for Traditional Chinese by using training data written in Traditional Chinese.

5. Possible Methods

Before beginning our implementation, we brainstormed several different methods that could be used to implement this program. Among the factors we took into account were efficiency (both for the coders and the users), speed of performance, and compatibility with the problem we were trying to solve.

5.1 Noisy Channel Model

Our first idea was to use a noisy channel model for our implementation, which would mainly involve adapting the transducer system implemented for text disabbreviation. With such a method, we could treat Pinyin as the abbreviated text and the Pinyin and characters combined as the disabbreviated corpus. Although that sounded like a good strategy, we then read about Hidden Markov Models, which the literature describes as more precise for this problem.

5.2 Hidden Markov Models

After reading the article "A Segment-based Hidden Markov Model for Real-Setting Pinyin-to-Chinese Conversion", we became convinced that a Hidden Markov Model (HMM) would fit the goal of our project perfectly, and could achieve considerably high character accuracy (82.9% according to the paper).

An HMM is composed of hidden states, which evolve over time according to a transition model and emit a sequence of observations, as can be seen in Figure 1, which shows a very simplified Hidden Markov Model with hidden states (in red), observations (in green), emission probabilities, and a transition model.

That is, in an HMM for the problem we would be able to see a sequence of observations (Pinyin), but we would not know the hidden states (characters) from which the observations were generated. We would know, however, the probabilities with which each possible hidden state generates an observation (emission model) and the probabilities that a given character comes after another (transition model).

[Figure 1: Example of an HMM for the problem. Hidden states: 你, 泥, 好, 号; observations: ni, hao.]

This aligns very well with our goals for the project. Given that a certain Pinyin could come from multiple different characters (generally from four to hundreds of different possibilities), we just need to decipher which character is most likely to be the best fit based on the context of the sentence.

In Figure 1, for instance, given the Pinyin "ni", we would first try to figure out which character is most likely to be the correct one (there are actually far more than two possibilities, but we have only shown two here for simplicity's sake: 你, which means "you", and 泥, which means "dirt"). We would do the same for "hao", with the two characters here being 好, which means "good", and 号, which means "number". That method would take into account both the probability of the character having that Pinyin, and the probability of having a character (hidden state) given the previously guessed states.

In the end, we would of course want the pairing 你好, which is the common way to say "hello" in Chinese. We actually ended up choosing this method to implement, and will explain the details of it in section 6.

5.3 Segmental HMM

Another option we considered, but ultimately rejected, was the Segmented Hidden Markov Model (SHMM). The SHMM is very similar to the HMM, but makes one distinction that the HMM does not: it differentiates between bigrams formed by two words and Chinese words themselves. Thus, when we search the lexicon for a given segment, if the segment is in the lexicon, then we would treat it as the product of bigrams (two-character segments); otherwise, we would give it as a whole some probability according to a different method.

This type of method would be extremely helpful in the case of idioms, where it avoids the problem of overestimating bigrams: if a bigram is part of a longer phrase (for example, an idiom), the count for that bigram as a part of the longer phrase is removed, so it is easier to discern the larger phrase as a whole rather than the two separate smaller phrases. This means that, in general, the SHMM is better at handling word segments that are composed of more than two characters.

Ultimately, however, we decided that the complexity of this type of model did not fit the scope of our project, and we decided to use the HMM instead, at the expense of some accuracy, although the literature reports very similar performance rates for both methods. The literature also reports that SHMMs have much slower execution times, so for this project we decided to eliminate that possibility.

6. Implementation

Our implementation is a Hidden Markov Model that uses the Viterbi algorithm to return the most likely sequence of characters given a sequence of Pinyin. Our project was coded entirely in Python. The Viterbi algorithm is used to find the most likely sequence of hidden states given a sequence of observations and the emission and transition models. That is, we want to find

$$x_1^*, \ldots, x_T^* = \arg\max_{x_1, \ldots, x_T} P^*$$

such that $P^*$ is:

$$P^* = p(X_1 = x_1, \ldots, X_T = x_T \mid O_1 = o_1, \ldots, O_T = o_T)$$

The algorithm requires several components in order to return the sequence of characters:

i) O: the list of all possible Pinyin
ii) S: the state space, or list of all possible characters (Hanzi)
iii) pi: the list of initial probabilities, i.e. [...]

[...] then calculated the rest of the variables. After this, we implemented the Viterbi algorithm and successfully obtained Chinese characters from Pinyin input. As a note, we used add-one (Laplace) smoothing for the case of unseen bigrams.

7. Problems Encountered

Needless to say, we encountered many hardships along the way when implementing the Hidden Markov Model, most of them related to the fact that Chinese characters are not ASCII characters and are thus more