
Discriminative Method for Japanese Kana-Kanji Input Method

Hiroyuki Tokunaga, Daisuke Okanohara
Preferred Infrastructure Inc., 2-40-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
{tkng, hillbig}@preferred.jp

Shinsuke Mori
Kyoto University, Yoshidahonmachi, Sakyo-ku, Kyoto 606-8501, Japan
forest@.kyoto-.ac.jp

Abstract

The most popular type of input method in Japan is kana-kanji conversion, conversion from a string of kana to a mixed kanji-kana string. However, there is no study using discriminative methods such as structured SVMs for kana-kanji conversion. One of the reasons is that learning a discriminative model from a large data set is often intractable. However, due to recent progress in research, large-scale learning of discriminative models has become feasible. In the present paper, we investigate whether discriminative methods such as structured SVMs can improve the accuracy of kana-kanji conversion. To the best of our knowledge, this is the first study comparing a generative model and a discriminative model for kana-kanji conversion. An experiment revealed that a discriminative method can improve the performance by approximately 3%.

1 Introduction

Kana-kanji conversion is conversion from kana characters to kanji characters, and it is the most popular way of inputting Japanese text from keyboards. Generally, one kana string corresponds to many kanji strings, so proposing what the user wants to input is not trivial. We show how input keys are processed in Figure 1.

Two types of problems are encountered in kana-kanji conversion. One is how to reduce conversion errors, and the other is how to enable smooth correction when a conversion failure has occurred. In the present study, we focus on the problem of reducing conversion errors.

Early kana-kanji conversion systems employed heuristic rules, such as maximum-length matching. With the growth of statistical natural language processing, data-driven methods were applied to kana-kanji conversion. Most existing studies on kana-kanji conversion have used probability models, especially generative models. Although generative models have some advantages, a number of studies on natural language tasks have shown that discriminative models tend to outperform generative models with respect to accuracy.

However, there have been no studies using only a discriminative model for kana-kanji conversion. One reason for this is that learning a discriminative model from a large data set is often intractable. However, due to progress in recent research on machine learning, large-scale learning of discriminative models has become feasible.

The present paper describes how to apply a discriminative model to kana-kanji conversion. The remainder of the present paper is organized as follows. In the next section, we present a brief description of Japanese text input and define the kana-kanji conversion problem. In Section 3, we describe related research. Section 4 provides the baseline method of the present study, i.e., kana-kanji conversion based on a probabilistic language model. In Section 5, we describe how to apply the discriminative method to kana-kanji conversion. Section 6 presents the experimental results, and conclusions are presented in Section 7.

2 Japanese Text Input and Kana-Kanji Conversion

Japanese text is composed of several scripts. The primary components are hiragana and katakana (we refer to them as kana) and kanji. The number of kana is less than a hundred. In many cases, we input romanized kana characters from the keyboard. One of the tasks of a Japanese input method is transliterating them to kana.
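Since the romanized keystrokes have to be turned into kana before conversion, a minimal sketch of that transliteration step may help. It assumes a tiny hand-written romaji-to-kana table and greedy longest-match segmentation; the table entries and function names are illustrative only and are not part of the paper.

```python
# Minimal romaji-to-kana transliteration by greedy longest-match.
# The table below is a tiny illustrative subset; a real input method
# ships a complete romanization table.
ROMAJI_TO_KANA = {
    "ka": "か", "ga": "が", "ku": "く", "to": "と",
    "shu": "しゅ", "u": "う", "shi": "し",
}

def transliterate(romaji: str) -> str:
    """Convert a romanized string to kana, trying the longest key first."""
    kana = []
    i = 0
    max_len = max(len(k) for k in ROMAJI_TO_KANA)
    while i < len(romaji):
        for length in range(max_len, 0, -1):
            chunk = romaji[i:i + length]
            if chunk in ROMAJI_TO_KANA:
                kana.append(ROMAJI_TO_KANA[chunk])
                i += length
                break
        else:
            kana.append(romaji[i])   # pass unknown input through unchanged
            i += 1
    return "".join(kana)

print(transliterate("kagakutogakushuu"))  # かがくとがくしゅう
```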

[Figure 1: A graph constructed from the input string かがくとがくしゅう (romanized: ka ga ku to ga ku shu u). The graph runs from BOS to EOS over candidate word nodes such as 科学, 火蛾, 蚊, が, 苦, と, 咎, 学習, and 終. Some nodes are omitted for simplicity.]

The problem is how to input kanji. The number of kanji is more than ten thousand, so ordinary keyboards in Japan do not have a sufficient number of keys to enable the user to input such characters directly. In order to address this problem, a number of methods have been proposed and investigated. Handwriting recognition systems, for example, have been successfully implemented. Another successful method is kana-kanji conversion, which is currently the most commonly used method for inputting Japanese text due to its fast input speed and low initial cost to learn.

2.1 Kana-kanji conversion

Kana-kanji conversion is the problem of converting a given kana string into a mixed kanji-kana string. For simplicity, in the following we refer to a mixed kanji-kana string as a kanji string. What should be noted here is that kana-kanji conversion is the conversion from a kana sentence to a kanji sentence. One of the key points of kana-kanji conversion is that an entire sentence can be converted at once. This is why kana-kanji conversion achieves a high input speed.

Since an input string is a non-segmented sentence, every kana-kanji conversion system must be able to segment a kana sentence into words. This is not a trivial problem, so recent kana-kanji conversion software often employs statistical methods.

Although there is a measure of freedom with respect to the design of a kana-kanji conversion system, in the present paper we discuss kana-kanji conversion systems that comply with the following procedure:

1. Construct a graph from the input string. We must first construct a graph that represents all possible conversions. An example of such a graph is given in Figure 1. We must use a dictionary in order to construct this graph. Since all edges are directed from the start node to the goal node, the graph is a directed acyclic graph (DAG).

2. Select the most likely path in the graph. This task is broken into two parts. The first is setting the cost of each vertex and edge. The cost is often determined by supervised learning methods. The second is finding the most likely path in the graph. The Viterbi algorithm is used for this task.

Formally, the task of kana-kanji conversion can be defined as follows. Let x be an input, an unsegmented sentence written in kana. Let y be a path, i.e., a sequence of words. When we write the jth word in y as y_j, y can be denoted as y = y_1 y_2 ... y_n, where n is the number of words in path y. Let Y be the set of candidate paths in the DAG built from the input sentence x. The goal is to select a correct path y* from all candidate paths in Y.
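To make the two-step procedure above concrete, the following is a minimal Python sketch of the lattice-plus-Viterbi formulation. It assumes the dictionary is a plain mapping from kana readings to candidate words, that an arbitrary score(prev_word, word) function supplies edge costs, and that every substring of the input is tried as a reading; it is an illustration of the formulation, not the authors' implementation.

```python
# Build a conversion lattice over an unsegmented kana string and pick the
# best path with the Viterbi algorithm.  `dictionary` maps a kana reading
# to the list of words (surface forms) that can be read that way.
def convert(x, dictionary, score):
    n = len(x)
    # best[i] = (best score of a path covering x[:i], backpointer, last word)
    best = {0: (0.0, None, "BOS")}
    for i in range(n):
        if i not in best:
            continue                      # position not reachable
        for j in range(i + 1, n + 1):
            reading = x[i:j]
            for word in dictionary.get(reading, []):
                s = best[i][0] + score(best[i][2], word)
                if j not in best or s > best[j][0]:
                    best[j] = (s, i, word)
    if n not in best:
        return None                       # no full segmentation found
    # Follow backpointers from EOS back to BOS to recover the word sequence.
    path, pos = [], n
    while pos != 0:
        _, prev, word = best[pos]
        path.append(word)
        pos = prev
    return list(reversed(path))

toy_dict = {"かがく": ["科学", "化学"], "と": ["と"], "がくしゅう": ["学習"]}
print(convert("かがくとがくしゅう", toy_dict, lambda p, w: 1.0))
```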

2.2 Parameter estimation with a probabilistic language model

If we define kana-kanji conversion as a graph search problem, all edges and vertices must have costs. There are two ways to estimate these parameters. One is to use a language model, which was proposed in the 1990s, and the other is to use a discriminative model, as in the present study.

This subsection describes kana-kanji conversion based on probabilistic language models, which is treated as a baseline in the present paper. In this method, given a kana string, we calculate the probabilities of each conversion candidate and then present the candidate that has the highest probability. We denote the probability of a conversion candidate y given x by P(y|x). Using Bayes' theorem, we can transform this expression as follows:

P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)} \propto P(y) P(x \mid y),

where P(y) is the language model and P(x|y) is the kana-kanji model.

The language model P(y) assigns a high probability to word sequences that are likely to occur. Word n-gram models or class n-gram models are often used in natural language processing. The n-gram model is defined as follows:

P(y) = \prod_j P(y_j \mid y_{j-1}, y_{j-2}, \ldots, y_{j-n+1}),

where n - 1 indicates the length of the history used to estimate the probability. If we adopt a word bigram (n = 2) model, then P(y) is decomposed into the following form:

P(y) = \prod_j P(y_j \mid y_{j-1}),    (1)

where P(y_j | y_{j-1}) is the probability of the appearance of y_j after y_{j-1}. Let c(y) be the number of occurrences of y in the corpus, and let c(y_1, y_2) be the number of occurrences of y_1 after y_2 in the corpus. The probability P(y_j | y_{j-1}) is calculated as follows:

P(y_j \mid y_{j-1}) = \frac{c(y_j, y_{j-1})}{c(y_{j-1})}.

The kana-kanji model P(x|y) assigns a high probability if x corresponds to y several times. In Japanese, most characters have multiple pronunciations. For example, 雨 (rain) could be read as "ame", "same" or "u". In the case of 雨, most Japanese expect the reading to be "ame". Therefore P(ame|雨) should be assigned a higher probability than the probabilities for "same" or "u".

Mori et al. (1999) proposed the following decomposition model for the kana-kanji model:

P(x \mid y) = \prod_j P(x_j \mid y_j),

where each P(x_j | y_j) is the fraction of times the word y_j is read as x_j. Given that d(x_j, y_j) is the number of times word y_j is read as x_j, P(x_j | y_j) can be written as follows:

P(x_j \mid y_j) = \frac{d(x_j, y_j)}{\sum_i d(x_i, y_j)}.

2.3 Smoothing

In Eq. (1), if any P(y_j | y_{j-1}) is zero, P(y) is also estimated to be zero. This means that P(y) is always zero whenever a word that does not appear in the training corpus appears in y. Since we use a language model to determine which sentence is more natural as a Japanese sentence, both probabilities being zero does not help us. This is referred to as the zero frequency problem.

We apply smoothing to prevent this problem. In the present paper, we implemented both additive smoothing and interpolated Kneser-Ney smoothing (Chen and Goodman, 1998; Teh, 2006). Surprisingly, additive smoothing outperformed interpolated Kneser-Ney smoothing. Therefore we adopt additive smoothing in the experiments.
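As an illustration of the baseline probabilities in Sections 2.2 and 2.3, the sketch below computes maximum-likelihood bigram estimates from counts, applies additive smoothing with a small δ, and estimates the reading model P(x_j | y_j) from read counts in the style of Mori et al. (1999). The count containers, the BOS/EOS padding, and the vocabulary-size handling are assumptions made for the example, not the authors' implementation.

```python
from collections import Counter

class BigramLM:
    """Word bigram language model with additive smoothing (Sections 2.2-2.3)."""
    def __init__(self, sentences, delta=1e-5):
        self.delta = delta
        self.unigram = Counter()
        self.bigram = Counter()
        for words in sentences:
            padded = ["BOS"] + list(words) + ["EOS"]
            self.unigram.update(padded)
            self.bigram.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigram)

    def prob(self, prev, word):
        # Additive smoothing: (c(prev, word) + delta) / (c(prev) + delta * |V|)
        numerator = self.bigram[(prev, word)] + self.delta
        denominator = self.unigram[prev] + self.delta * self.vocab_size
        return numerator / denominator

def reading_prob(read_counts, kana, word):
    """Kana-kanji model P(x_j | y_j): fraction of times `word` is read as `kana`."""
    total = sum(c for (x, y), c in read_counts.items() if y == word)
    return read_counts.get((kana, word), 0) / total if total else 0.0

lm = BigramLM([["科学", "と", "学習"], ["化学", "と", "学習"]])
print(lm.prob("と", "学習"))                                              # close to 1
print(reading_prob({("にほん", "日本"): 3, ("にっぽん", "日本"): 1}, "にほん", "日本"))  # 0.75
```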

3 Related Research

Mori et al. (1999) proposed a kana-kanji conversion method based on a probabilistic language model. They used the class bigram as their language model.

Gao et al. (2006) reported the results of applying a discriminative language model to kana-kanji conversion. Their objective was domain adaptation using a discriminative language model for reranking the top-k conversion candidates enumerated by a generative model.

Kana-kanji conversion and morphological analysis are similar in some respects. Most notably, both are the same type of extension of the sequential labeling problem. Therefore, it would be worthwhile to consider studies on morphological analysis.

Nagata (1994) showed that a POS-trigram language model-based morphological analyzer achieved approximately 95% precision and recall, which was a state-of-the-art result at that time. Ten years later, Kudo et al. (2004) applied the conditional random field (CRF) to Japanese morphological analysis. They reported that this major discriminative probabilistic model does not suffer from label bias and length bias and is superior to the hidden Markov model (HMM) and the maximum entropy Markov model (MEMM) with respect to accuracy.

The purpose of the present study is to investigate whether this increase in performance for morphological analysis also applies to kana-kanji conversion.

4 Kana-Kanji Conversion Based on Discriminative Methods

In this section, we present a description of kana-kanji conversion based on discriminative methods. In discriminative methods, we calculate a score for each conversion candidate y_1, y_2, y_3, ... for input x. The candidate that has the highest score is presented to the user.

We herein restrict the score function so that it can be decomposed into a weighted sum of K feature functions, where Ψ is the vector of feature functions Ψ_k. We also restrict the arguments of the feature functions to x, y_{j-1}, y_j in order to use the Viterbi algorithm for fast conversion. A feature function Ψ_k(x, y_{j-1}, y_j) returns 1 if the feature is enabled, and otherwise Ψ_k(x, y_{j-1}, y_j) returns 0. The score of a conversion candidate y over x is calculated as follows:

f(x, y) = \sum_j \sum_k w_k \Psi_k(x, y_{j-1}, y_j),

where w_k is the weight of feature function Ψ_k, and w is the vector of feature weights w_k.

Since the output of kana-kanji conversion is a vector, the problem is a structured output problem, which can be addressed in a number of ways, including the use of CRFs (Lafferty et al., 2001) or SSVMs (Tsochantaridis et al., 2005; Ratliff et al., 2007). The performances of the CRF, the SSVM and other learning models are similar if all of the models use the same feature set (Keerthi and Sundararajan, 2007). We use the SSVM as the learner because it is somewhat easier to implement.

4.1 Structured SVM

The SSVM is a natural extension of the SVM to structured output problems. We denote the ith datum as (x^{(i)}, y^{(i)}). Here, L_i(y) is the loss for the ith datum, and we assume that L_i(y) ≥ 0 for all y ≠ y^{(i)} and that L_i(y^{(i)}) = 0. Note that the value of L_i is zero if and only if y = y^{(i)}. This means that all other y are treated as negative examples. We adopt the following loss function, which is similar to the Hamming loss:

L_i(y) = \sum_j l(y_j),

where l(y_j) is 1 if the word y_j is incorrect, and otherwise l(y_j) is 0.

The objective function of the SSVM is expressed as follows:

\frac{1}{n} \sum_{i=1}^{n} r_i(w) + \lambda \|w\|,    (2)

where r_i(w) is the risk function, which is defined as follows:

r_i(w) = \max_{y \in Y} \left( f(x^{(i)}, y) + L_i(y) \right) - f(x^{(i)}, y^{(i)}).    (3)

Note that the loss and the risk are distinguished in structured output problems, whereas they often are not in binary classification problems.

Here, ||w||, the norm of w, is referred to as the regularization term. The most commonly used norms are the L1-norm and the L2-norm. We used the L1-norm in the experiment because it tends to find sparse solutions. Since input methods are expected to work with modest amounts of RAM, this property is important. The L1-norm is calculated as follows:

\|w\|_1 = \sum_k |w_k|.    (4)

The positive real number λ is a parameter that trades off between the loss term and the regularization term. In order to minimize this objective function, we used FOBOS as the parameter optimization method (Duchi and Singer, 2009).
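As a concrete reading of the score f(x, y) and the risk of Eq. (3), the sketch below evaluates the loss-augmented risk over an explicitly enumerated candidate set, with feature templates loosely resembling those listed later in Table 2. In the paper the max over Y is taken with the Viterbi algorithm over the lattice; enumerating candidates and aligning words by position are simplifications made only for illustration.

```python
# Feature templates loosely resembling Table 2: word unigram, word bigram,
# and word-plus-reading features, each encoded as a hashable tuple.
def features(reading, prev_word, word):
    """Binary feature functions Psi_k(x, y_{j-1}, y_j), named by tuples."""
    return [
        ("unigram", word),
        ("bigram", prev_word, word),
        ("read", word, reading),
    ]

def score(w, readings, words):
    """f(x, y) = sum_j sum_k w_k * Psi_k(x, y_{j-1}, y_j)."""
    total, prev = 0.0, "BOS"
    for reading, word in zip(readings, words):
        total += sum(w.get(feat, 0.0) for feat in features(reading, prev, word))
        prev = word
    return total

def hamming_loss(candidate, gold):
    """L_i(y): 1 for every position where the candidate word is wrong."""
    return sum(1.0 for c, g in zip(candidate, gold) if c != g)

def risk(w, readings, candidates, gold):
    """r_i(w) = max_y (f(x, y) + L_i(y)) - f(x, y^(i)), cf. Eq. (3)."""
    augmented = max(score(w, readings, y) + hamming_loss(y, gold) for y in candidates)
    return augmented - score(w, readings, gold)

w = {("unigram", "科学"): 0.5}
candidates = [["科学", "と", "学習"], ["化学", "と", "学習"]]
readings = ["かがく", "と", "がくしゅう"]
print(risk(w, readings, candidates, candidates[0]))  # structured hinge risk, here 0.5
```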

4.2 Learning of the SSVM using FOBOS

FOBOS is a versatile optimization method for non-smooth convex problems, which can be used both online and batch-wise. We used online FOBOS for parameter optimization. In the case of online FOBOS, FOBOS is viewed as an extension of stochastic gradient descent and the subgradient method.

FOBOS alternates between two phases. The first step performs (sub)gradient descent, and the second step processes the regularization term in a manner similar to projected gradients. The first step of FOBOS is as follows:

w^{(i+\frac{1}{2})} = w^{(i)} - \eta_i g_i^f,    (5)

where \eta_i is the learning rate and g_i^f is a subgradient of the risk function. The second step of FOBOS is defined as the following optimization problem:

w^{(i+1)} = \arg\min_w \left( \frac{1}{2} \|w - w^{(i+\frac{1}{2})}\|^2 + \eta_{i+\frac{1}{2}} \|w\| \right).    (6)

If the regularization term is the L1-norm or the L2-norm, closed-form solutions are easily derived. For the case of the L1-norm, each element of the vector w^{(i+1)} is calculated as follows:

w_k^{(i+1)} = \mathrm{sign}\big(w_k^{(i+\frac{1}{2})}\big) \max\big\{ |w_k^{(i+\frac{1}{2})}| - \lambda \eta_t,\ 0 \big\},    (7)

where sign is a function which returns 1 if the argument is greater than 0 and otherwise returns -1.

We present pseudo code of SSVM learning with FOBOS as Algorithm 1. In general, evaluating the following expression requires an exponential amount of computation:

y^* = \arg\max_{y} f(x^{(i)}, y) + L_i(y).    (8)

However, we restricted the form of our feature functions to Ψ_k(x, y_{j-1}, y_j) so that we can use the Viterbi algorithm to obtain y^*. The time complexity of the Viterbi algorithm is linear, proportional to the length of y.

Here, ∇f denotes a subgradient of the function f. The derived subgradient of f is as follows:

\nabla f(x, y) = \sum_j \sum_k \Psi_k(x, y_{j-1}, y_j).

Therefore, the first parameter update rule of FOBOS can be rewritten as follows:

w^{(i+\frac{1}{2})} = w^{(i)} - \eta_t \Big( \sum_j \sum_k \Psi_k(x, y^*_{j-1}, y^*_j) - \sum_j \sum_k \Psi_k(x, y^{(i)}_{j-1}, y^{(i)}_j) \Big).    (9)

Algorithm 1 Learning of the SSVM with FOBOS
for each training example (x^{(i)}, y^{(i)}) do
    y^* = argmax_y f(x^{(i)}, y) + L_i(y)
    if y^* ≠ y^{(i)} then
        w^{(i+1/2)} = w^{(i)} - η_t ∇( f(x^{(i)}, y^*) - f(x^{(i)}, y^{(i)}) )
        for k ∈ K do
            w_k^{(i+1)} = sign(w_k^{(i+1/2)}) max{ |w_k^{(i+1/2)}| - λη_t, 0 }
        end for
    end if
end for
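The following Python sketch puts Eqs. (7)-(9) together in the shape of Algorithm 1: a subgradient step away from the loss-augmented Viterbi path's features and toward the gold path's features, followed by L1 soft-thresholding. The decode and features callables are assumed interfaces standing in for the lattice and Viterbi machinery, so this is a sketch of the update rule, not the authors' C implementation.

```python
from collections import defaultdict

def path_features(x, y, features):
    """Aggregate counts sum_j Psi(x, y_{j-1}, y_j) over a whole path y."""
    counts = defaultdict(float)
    prev = "BOS"
    for j, word in enumerate(y):
        for feat in features(x, j, prev, word):
            counts[feat] += 1.0
        prev = word
    return counts

def fobos_step(w, x, y_gold, decode, features, eta, lam):
    """One iteration of Algorithm 1 for a single example (x, y_gold).

    `decode(w, x, y_gold)` is assumed to return the loss-augmented argmax of
    Eq. (8) (in the paper this is done with the Viterbi algorithm), and
    `features` is assumed to return the active feature names at one position.
    """
    y_star = decode(w, x, y_gold)
    if y_star == y_gold:
        return w
    # Subgradient step, Eq. (9): subtract y*'s features, add the gold path's.
    grad = path_features(x, y_star, features)
    for feat, c in path_features(x, y_gold, features).items():
        grad[feat] -= c
    for feat, g in grad.items():
        w[feat] = w.get(feat, 0.0) - eta * g
    # Proximal step for the L1-norm, Eq. (7): element-wise soft-thresholding.
    for feat in list(w):
        v = w[feat]
        w[feat] = (1.0 if v > 0 else -1.0) * max(abs(v) - lam * eta, 0.0)
        if w[feat] == 0.0:
            del w[feat]          # truncated weights keep the model sparse
    return w
```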

5 Evaluation

In order to evaluate our method, we compared the generative model explained in Section 2 and our discriminative model explained in Section 4 using a popular data set.

5.1 Settings

As the data set, we used the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa, 2008). The corpus contains several different data sets, and we used the human-annotated data sets in our experiments. The human-annotated part of the BCCWJ consists of four parts, referred to as OC, OW, PN and PB. Each data set is summarized in Table 1. In addition, we constructed a data set (referred to as ALL) that is the concatenation of OC, OW, PN and PB.

Name   # sentences   Data Source
OC     6,476         Yahoo! Chiebukuro (Q&A site in Yahoo! Japan)
OW     5,934         White Book
PN     17,007        News Paper
PB     10,347        Book

Table 1: Details of the data sets.

The baseline method we used herein is a language-model-based generative model. The language model is the linear sum of the logarithms of a class bigram probability, a word bigram probability and a word unigram probability. The smoothing method is additive smoothing, where δ = 10^{-5}. The performance was insensitive to δ when the value was small enough.

As the discriminative model, we used the SSVM. The learning loop of the SSVM was repeated until convergence, i.e., 30 times for each data set. The learning rate η was fixed at 0.1. We used the L1-norm with λ = 10^{-7} as the regularization term. We implemented our SSVM learner in the C language; the calculation time for the ALL data set was approximately 43 minutes and 20 seconds using an Intel Core 2 Duo (3.16 GHz).

All of the experiments were carried out by five-fold cross validation, and each data set was randomly shuffled before being divided into five parts.

5.2 Feature functions

The feature functions used in the experiments are summarized in Table 2. We used the second-level part of speech as classes (in some Japanese dictionaries, the part of speech is designed to have a hierarchical structure).

Type                Template
Word Unigram        ⟨y_i⟩
Word Bigram         ⟨y_{i-1}, y_i⟩
Class Bigram        ⟨POS_{i-1}, POS_i⟩
Word and the Read   ⟨y_i, x_i⟩

Table 2: Feature templates.

5.3 Criteria

We evaluated these methods based on precision, recall, and F-score, as calculated from the given answers and the system outputs. In order to compute the precision and recall, we must define the true positives. In the present paper, we use the longest common subsequence of a given answer sentence and a system output sentence as the true positives. Let N_LCS be the length of the longest common subsequence of a given answer and a system output, let N_DAT be the length of the given answer sentence, and let N_SYS be the length of the system output sentence. The definitions of precision, recall, and F-score are as follows:

precision = \frac{N_{LCS}}{N_{DAT}}, \quad recall = \frac{N_{LCS}}{N_{SYS}}, \quad F\text{-score} = \frac{2 \cdot precision \cdot recall}{precision + recall}.
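A short sketch of the criteria in Section 5.3 follows the paper's definitions (the LCS length is divided by the answer length N_DAT for precision and by the system output length N_SYS for recall). The character-level dynamic-programming LCS is a standard routine, shown here only as an illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def evaluate(answer: str, system: str):
    """Precision, recall and F-score as defined in Section 5.3."""
    n_lcs = lcs_length(answer, system)
    precision = n_lcs / len(answer)   # N_LCS / N_DAT
    recall = n_lcs / len(system)      # N_LCS / N_SYS
    f_score = 2 * precision * recall / (precision + recall) if n_lcs else 0.0
    return precision, recall, f_score

print(evaluate("科学と学習", "化学と学習"))  # (0.8, 0.8, 0.8)
```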

5.4 Difference between the discriminative model and the generative model

We compared the performance of the SSVM and the generative method based on a language model (baseline). The results of the experiment are shown in Table 3, which lists the precision, recall, and F-score for the SSVM and the baseline model.

The experimental results revealed that the SSVM performed better than the baseline model for all data sets. However, the increase in performance was not uniform across data sets. The largest increase in performance was obtained for PN, whose data source is newspaper articles. Since newspaper articles are written by well-schooled newspersons, their sentences are clear and consistent. Compared to newspapers, the other data sets are noisy and inconsistent. The smaller improvements are interpreted as being due to those data sets being noisy, and to the relative difficulty of improving the performance scores as compared to newspapers.

               baseline                                     SSVM
       Precision     Recall       F-score       Precision     Recall       F-score
       avg.   SD     avg.   SD    avg.   SD     avg.   SD     avg.   SD    avg.   SD
OC     87.4   0.31   86.9   0.17  87.2   0.22   88.0   0.41   89.1   0.27  88.5   0.33
OW     93.7   0.09   93.1   0.12  93.4   0.10   96.1   0.09   96.4   0.13  96.2   0.10
PN     87.4   0.11   86.4   0.16  86.9   0.13   91.1   0.24   91.6   0.17  91.4   0.20
PB     87.8   0.13   86.9   0.15  87.3   0.13   89.5   0.24   90.3   0.28  89.9   0.24
ALL    88.6   0.08   87.2   0.12  87.9   0.10   92.2   0.16   92.4   0.21  92.3   0.19

Table 3: Performance comparison for the SSVM and the baseline with the language model. (SD: standard deviation.) Performance is measured by precision, recall and F-score.

5.5 Relationship between data set size and performance

In order to investigate the effect of data set size on performance, we examined the performance of the SSVM and the baseline model for each data set size. The ALL data set was used in this experiment. The results are shown in Figure 2. The horizontal axis denotes the number of sentences used for training, and the vertical axis denotes the F-score. The SSVM consistently outperforms the baseline model when the number of sentences is more than roughly 1,000.

Interestingly, the baseline model performed better than the SSVM where the data set size is relatively small. Generative models are said to be superior to discriminative models if there is only a small amount of data (Ng and Jordan, 2002). The result of the present experiment agrees with their observations.

[Figure 2: F-score vs. data set size for the SSVM and the language model baseline. The horizontal axis is the number of sentences (10 to 10,000); the vertical axis is the F-score (0.5 to 0.95). With the increase of the data set size, the SSVM overcomes the baseline method.]

5.6 Examples of misconversions

In this subsection, we present examples of misconversions, which are categorized into four types. Mori et al. (1999) categorized misconversions into three types, and their categorization is also applicable to the proposed system. In addition, we present a number of misconversions that could not be categorized into those three categories.

5.6.1 Homonym failures

Japanese has numerous homonyms. For correct conversion, syntactically and semantically correct homonyms must be chosen.

Corpus: 多分 衛星 放送でやっているのだと 思い ます 。 (Perhaps that show is broadcast by satellite.)
System: 多分 衛生 放送でやっているのだと 思い ます 。 (Perhaps that show is broadcast by health.)

The system failed to recognize the compound word "satellite broadcast". This type of error will decrease as the data set size increases.

Corpus: 暴力 団 の 抗争 も 激化 。 (Bloody conflicts of gangs are also escalating.)
System: 暴力 団 の 構想 も 激化 。 (Concepts of gangs are also escalating.)

Since 暴力団 (gang) and 抗争 (bloody conflict) are strongly correlated, this problem would be solved if we could use a feature which considers long-distance information.

In principle, some fraction of homonym failures cannot be solved because of the ambiguity of the kana string. A typical example is the name of a person. The number of conversion candidates for "hiroyuki" is over 100.

5.6.2 Unknown word failures

It is difficult to convert a word which is not in the dictionary. This type of misconversion cannot be reduced with the SSVM of the present study. In the following, we present an example of an unknown word failure.

Corpus: 家 で チキン ナゲット 作れ ます か ? (Can you make chicken nuggets at home?)
System: 家 で チキン な ゲット 作れ ます か ? (Can you make chicken ??? at home?)

The underlined part of the system output is broken as Japanese, and it cannot be translated into English. This type of error often causes misdetection of word boundaries. This error is caused by the absence of the word ナゲット from the dictionary.

5.6.3 Orthographic variant failures

Some words have only one meaning and one pronunciation, but have multiple written expressions. Numbers are examples of such words. As a typical case, six could be written as "six", "6" or "VI".

In addition, in Japanese, six can also be written as 六 (roku) or 陸 (roku). Some of these expressions are misconversions. For example, 陸 is seldom used, and in most cases converting "roku" as 陸 would be considered a misconversion. Nevertheless, short of human judgment, we have no way to distinguish misconversions from non-misconversions. Thus, in the present paper, all orthographic variants are treated as failures.

Corpus: 一 歳年下の弟は中学三年になる ところ だった.
System: 1 歳年下の弟は中学三年になる 所 だった.

5.6.4 Other failures

Failures that do not fit into any of the above three categories are salient in these experiments. The following are two examples:

Corpus: 太 すぎ ず 、 細 すぎ ない ジーンズ 。 (Not too thick, not too thin jeans.)
System: 太 すぎ 図 、 細 すぎ ない ジーンズ 。 (Too thick figure, not too thin jeans.)

The reason for this misconversion is that the score for 図 is too high.

Corpus: ようや く 来 た か って 感じ です 。 (I feel that it has finally come.)
System: よう や 茎 たかっ て 感じ です 。

The score for 茎 (kuki/caulome) is too high. In this case, the misconversion is accompanied by serious word boundary detection errors, and most of the system output is difficult to interpret. These errors are caused by poorly estimated parameters.

5.6.5 Discussion of misconversions

Although the discriminative method improves the performance of kana-kanji conversion if there is sufficient data, there are still misconversions. Based on the investigation of the misconversions, several misconversions would vanish if a much larger data set were used. In fact, there are several errors that do not exist in the closed test results. However, there are some types of errors that cannot be eliminated just by using a large amount of data. Some of these errors will vanish if we can use long-distance information in the feature functions.

6 Conclusion

In the present paper, we suggested the possibility of discriminative methods for kana-kanji conversion. We proposed a system using an SSVM with FOBOS for parameter optimization. The experiments of the present study revealed that the discriminative method is 1 to 4% superior with respect to precision and recall.

One of the advantages of discriminative methods is the flexibility of allowing the inclusion of a variety of feature functions. However, we used only the set of a kana, a word surface and a class (part of speech) in the experiments. Using the entire input string is expected to reduce homonym failures, and further exploration of this area would be interesting.

The data set used in the present study was of modest size. The increase in performance due to a larger data set should be investigated in the future. In general, a large annotated data set is difficult to obtain. There are a number of ways to tackle this problem, with two important options: one is applying semi-supervised learning, and the other is the use of a morphological analyzer. We will choose the latter option because building an effective semi-supervised discriminative learning model would be difficult for the case of kana-kanji conversion.

Since the optimization method used in the present study is online learning, it can also be used for personalization from the correction operation log of the user. There have been few studies on this subject, and there has been no report of discriminative models being used. Learning from the correction log of the user is difficult because users often make mistakes. Kana-kanji conversion software users sometimes complain about the degradation of conversion performance as a side effect of personalization. Therefore, an error detection mechanism will be important. In the future, we plan to implement a complete input method that embeds the kana-kanji conversion system developed for the present paper. Moreover, we intend to take into account statistics and investigate input errors.

References

Stanley F. Chen and Joshua T. Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Computer Science Group, Harvard University.

John Duchi and Yoram Singer. 2009. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2899–2934.

Jianfeng Gao, Hisami Suzuki, and Bin Yu. 2006. Approximation lasso methods for language modeling. In Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, July.

Selvaraj Sathiya Keerthi and Sellamanickam Sundararajan. 2007. CRF versus SVM-struct for sequence labeling. Yahoo Research Technical Report.

Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, July.

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning.

Kikuo Maekawa. 2008. Balanced corpus of contemporary written Japanese. In Proceedings of the 6th Workshop on Asian Language Resources, pages 101–102.

Shinsuke Mori, Masatoshi Tsuchiya, Osamu Yamaji, and Makoto Nagao. 1999. Kana-kanji conversion by a stochastic model. Transactions of IPSJ, 40(7):2946–2953. (in Japanese).

Masaaki Nagata. 1994. A stochastic Japanese morphological analyzer using a forward-DP backward-A* N-best search algorithm. pages 201–207, August.

Andrew Y. Ng and Michael I. Jordan. 2002. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Proceedings of the Advances in Neural Information Processing Systems.

Nathan Ratliff, J. Andrew Bagnell, and Martin A. Zinkevich. 2007. (Online) subgradient methods for structured prediction. In International Conference on Artificial Intelligence and Statistics, March.

Yee Whye Teh. 2006. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, NUS School of Computing.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. 2005. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484.
